While migrating our infrastructure to Kubernetes, we had to move multiple S3 buckets from our old infrastructure to the new one.
The goal of this blog post is to walk you through the different experiments we ran before landing on the final solution. If you're interested in how Kubernetes helped us with this operation, you can head directly to the section "Dealing with bigger buckets."
To understand the complexity of the task, I have drawn the architectural schema below:
As the image describes, we have 3 buckets (bucket-A-*) in account A, in virtual private cloud A, hosted in the Ireland region (eu-west-1). On the other side, we have 3 empty buckets (bucket-B-*) in account B, in virtual private cloud B, hosted in the Paris region (eu-west-3).
The goal is to transfer all the data in buckets A to buckets B.
We faced multiple challenges in this migration:
- both buckets are secured with bucket policies
- the A buckets contain a massive number of small files
- the total amount of data to be transferred is 10 TB
- different accounts, different VPCs, and different regions make the operation complex to handle
- the destination bucket is encrypted
To simplify the walkthrough, we will use only one bucket per account: bucket-A and bucket-B. You will find some tips along the way to industrialize the process if you deal with multiple buckets.
We will assume that the destination bucket (bucket-B) is already created and empty.
Ask for permissions to prepare for the operation:
To create the s3-cross-account user and grant it the proper permissions, you first need an admin account.
You can skip this part if you have admin access in accounts A and B.
If you can’t access admin accounts, you can at least request access from the administrators of accounts A and B to:
- full access to bucket-A
- full access to bucket-B
- Permissions to create IAM users in account B
Create the s3-cross-account IAM user:
The first step is to create the user and set up the required permissions to allow the transfer between the buckets. This user will have read-only access to bucket A and read-write access to bucket B.
The user account needs to be created in account B. We assume you know how to create an IAM user account in the AWS console.
Since our buckets are secured by both bucket policies and IAM user policies, granting permissions to s3-cross-account is divided into two parts:
- Giving access to both buckets from the user's perspective guarantees that the account can reach the buckets (arrows in black in the schema below)
- Granting access to both buckets from the bucket policies allows each bucket to authorize the account to make changes (bucket policies in green in the schema below)
Create the s3-cross-account user's policies:
On the s3-cross-account user, create two inline policies:
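As a minimal sketch (the exact statement IDs, account ID, and KMS key ARN are placeholders, not the original snippets), the first inline policy grants read-only access to bucket A:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "readBucketA",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": ["arn:aws:s3:::bucket-A", "arn:aws:s3:::bucket-A/*"]
    }
  ]
}
```

The second grants read-write access to bucket B, plus the KMS permissions needed to write into an encrypted bucket:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "writeBucketB",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
      "Resource": ["arn:aws:s3:::bucket-B", "arn:aws:s3:::bucket-B/*"]
    },
    {
      "Sid": "grantKMSkey",
      "Effect": "Allow",
      "Action": ["kms:Encrypt", "kms:GenerateDataKey"],
      "Resource": "arn:aws:kms:eu-west-3:ACCOUNT-B-ID:key/KEY-ID"
    }
  ]
}
```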
If you're dealing with multiple buckets to transfer, I would recommend granting access to all the B buckets at once with "Resource": "arn:aws:s3:::bucket-B-*". For the bucket-A-* buckets, you need to list them manually one by one.
You can notice that the bucket-B policy grants access to the KMS key, allowing the s3-cross-account user to put encrypted files in the bucket. If you don't use encryption, you can delete the grantKMSkey section.
Note that you can't put unencrypted files in an encrypted S3 bucket: you will get an "Access Denied" error.
Change the bucket policies:
The next step is to grant access to s3-cross-account from the buckets' perspective.
On bucket A, add the following bucket policy (bucket policies are under the Permissions tab):
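The original policy is not reproduced here; a minimal sketch (the account ID is a placeholder) allows the account-B user to read bucket A:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "allowCrossAccountRead",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::ACCOUNT-B-ID:user/s3-cross-account" },
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": ["arn:aws:s3:::bucket-A", "arn:aws:s3:::bucket-A/*"]
    }
  ]
}
```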
On bucket B, add the following bucket policy:
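Again as a sketch (same placeholder account ID), bucket B authorizes the user to list and write:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "allowCrossAccountWrite",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::ACCOUNT-B-ID:user/s3-cross-account" },
      "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
      "Resource": ["arn:aws:s3:::bucket-B", "arn:aws:s3:::bucket-B/*"]
    }
  ]
}
```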
For multiple-bucket modifications, you can use a script to automate the policy change on many buckets without having to edit them manually one by one.
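A possible sketch of such a script (not the author's original; the template file name, placeholder token, and profile name are assumptions). It renders a per-bucket policy from a template containing `__BUCKET__` and prints the `aws s3api put-bucket-policy` commands as a dry run:

```shell
#!/bin/sh
# Apply the same bucket-policy template to a list of buckets.
# policy-template.json must contain __BUCKET__ as a placeholder.
apply_policies() {
  for bucket in "$@"; do
    # Render the per-bucket policy from the template.
    sed "s/__BUCKET__/${bucket}/g" policy-template.json > "/tmp/${bucket}-policy.json"
    # Printed as a dry run; pipe the output to sh to actually apply.
    echo "aws s3api put-bucket-policy --bucket ${bucket} --policy file:///tmp/${bucket}-policy.json --profile admin-account-B"
  done
}
```

Usage: `apply_policies bucket-B-media bucket-B-logs | sh` (after reviewing the printed commands).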
Smoke Test for permissions:
After configuring the access key and secret key of the s3-cross-account user (if you're not familiar with the process, you can take a look here), we can test that the user has the correct permissions:
```shell
# test bucket A access
$ aws s3 ls --profile s3-prod-cross-account s3://bucket-A --region eu-west-1
$ aws s3 cp --profile s3-prod-cross-account s3://bucket-A/some-file . --region eu-west-1
-> some-file downloaded locally

# test bucket B access
$ aws s3 ls --profile s3-prod-cross-account s3://bucket-B --region eu-west-3
$ touch dummyfile
$ aws s3 cp --profile s3-prod-cross-account ./dummyfile s3://bucket-B --region eu-west-3
-> dummyfile uploaded to s3
```
Make sure that the profile name and the regions are correct when you try those commands.
Now that everything is ready, we can begin the transfer.
Transferring the data:
As I mentioned, we were dealing with terabytes of data and thousands of files across our buckets: some buckets held just gigabytes of data, others terabytes.
This section walks you through the different solutions we tested. There is no good or bad solution; it depends on your architecture, the size of the buckets, and the total number of objects.
Dealing with small buckets:
Dealing with small buckets does not mean that the copy will be swift. The number of objects also matters, since it can multiply the copy time by 10.
Here is an example of a bucket that we had to copy:
The generic command to copy a bucket from one region to another is:
```shell
aws s3 cp --profile s3-prod-cross-account s3://bucket-source/ s3://bucket-target/ --source-region source-region --region target-region --recursive
```
If we consider that bucket A is a small bucket, you can copy it with:
```shell
aws s3 cp --profile s3-prod-cross-account s3://bucket-A/ s3://bucket-B/ --source-region eu-west-1 --region eu-west-3 --recursive
```
Some optimizations that you can consider to speed up the copy:
- Use "cp" instead of "sync": sync needs to enumerate all the files in both buckets to decide which ones to copy, while cp simply copies the source bucket's content to the destination bucket. If a file already exists in the destination, it is overwritten.
- Parallel runs: you can run multiple instances of awscli in separate terminals to increase the transfer speed.
- Splitting the run into pieces: you can use "--exclude" and "--include" filters to divide your bucket into multiple parts and copy them at the same time (further notes here).
- Background jobs: I recommend always running the AWS S3 copy as a background job, either with cronjobs or with the "nohup" command.
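The last three points can be combined; as an illustrative sketch (not the author's exact commands), this generates ten parallel nohup'd copies, split on the first character of the object key (assuming keys start with 0-9; adapt the prefixes to your data):

```shell
#!/bin/sh
# Print one backgrounded copy command per key prefix.
# Pipe the output to sh to actually launch the ten parallel runs.
gen_copy_commands() {
  for prefix in 0 1 2 3 4 5 6 7 8 9; do
    echo "nohup aws s3 cp --profile s3-prod-cross-account s3://bucket-A/ s3://bucket-B/ --source-region eu-west-1 --region eu-west-3 --recursive --exclude '*' --include '${prefix}*' > copy-${prefix}.log 2>&1 &"
  done
}
```

Usage: `gen_copy_commands | sh`, then watch the `copy-*.log` files for progress.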
Dealing with bigger buckets:
Dealing with bigger buckets in such a complex architecture was a pain point during the migration (it was the reason that pushed me to write this blog). Here is an example of one of the buckets that we transferred.
The first run used the aws s3 cp command (as in the previous section) from my laptop. I gave up on the idea since, at the announced rate, it would have taken DAYS to finish the copy.
On the second run, I tried to get as close as possible to the AWS infrastructure: since we were running a Kubernetes cluster in account B, I ran a job with an awscli image that launched the cp command for me. You can find below a snippet of the job.
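The original snippet is not reproduced here; a minimal sketch of such a job could look like this (the image, job name, and the secret holding AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY are assumptions):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: s3-copy-bucket-a
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: awscli
          image: amazon/aws-cli
          command: ["aws", "s3", "cp", "s3://bucket-A/", "s3://bucket-B/",
                    "--source-region", "eu-west-1", "--region", "eu-west-3",
                    "--recursive"]
          envFrom:
            # Secret with the s3-cross-account credentials (assumed name).
            - secretRef:
                name: s3-cross-account-credentials
```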
With this approach, I hit the following error:
```
An error occurred (AccessDenied) when calling the CopyObject operation: VPC endpoints do not support cross-region requests
```
When digging into the issue, I figured out that I can't do the copy from inside the VPC, since VPC endpoints for Amazon S3 currently don't support cross-region requests (as I mentioned earlier, vpc-A, which contains bucket A, has a peering with vpc-B, which contains bucket B).
Multiple solutions are possible to fix this:
- Temporarily disable the VPC endpoint by removing it from the route table. This forces awscli to go through the internet instead of using intra-VPC calls.
- Create a new VPC without a VPC endpoint and launch an EC2 there.
- Launch an EC2 machine that is neither in region A nor region B
None of these solutions was possible in our setup; we needed to migrate without changing the architecture or adding resources.
On the third run (that's the good one!), I based my solution on this comment and adjusted it to run on Kubernetes.
To explain the setup, you can find below a schema:
We created a pod (controlled by a job) that contains two containers with a shared folder:
- s3-sync-local copies the files from the source bucket (bucket-A) to a local folder, /data
- s3-sync-remote moves the files in /data to the remote bucket (bucket-B). If the folder is empty, it waits for 5 seconds and reruns the move command. The while loop stops if the folder stays empty for two iterations (10 seconds of sleep).
/data is a shared folder between the two containers, used as a buffer (s3-sync-local writes into it, s3-sync-remote drains it). The folder is backed by a PVC on the Kubernetes side. You can also use an emptyDir, but you may saturate the node (you will get disk-pressure warnings on the kubelet side).
You can find below a snippet of the code that I used:
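The original code is not included here; below is a sketch of such a job following the description above (the PVC name, secret name, and image are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: s3-copy-bucket-a
spec:
  template:
    spec:
      restartPolicy: Never
      volumes:
        # Buffer folder backed by a PVC (assumed claim name).
        - name: data
          persistentVolumeClaim:
            claimName: s3-copy-buffer
      containers:
        # Fills /data from the source bucket.
        - name: s3-sync-local
          image: amazon/aws-cli
          command: ["sh", "-c"]
          args:
            - aws s3 sync s3://bucket-A/ /data --region eu-west-1
          envFrom:
            - secretRef:
                name: s3-cross-account-credentials
          volumeMounts:
            - name: data
              mountPath: /data
        # Drains /data into the destination bucket; stops after the folder
        # stays empty for two 5-second iterations.
        - name: s3-sync-remote
          image: amazon/aws-cli
          command: ["sh", "-c"]
          args:
            - |
              empty=0
              while [ "$empty" -lt 2 ]; do
                if [ -z "$(ls -A /data)" ]; then
                  empty=$((empty + 1))
                  sleep 5
                else
                  empty=0
                  aws s3 mv /data s3://bucket-B/ --recursive --region eu-west-3
                fi
              done
          envFrom:
            - secretRef:
                name: s3-cross-account-credentials
          volumeMounts:
            - name: data
              mountPath: /data
```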
The benefits of this solution are:
- it works around the VPC cross-region limitation
- it uses the power of Kubernetes to automate copies of several buckets at the same time
- it can be scheduled with a cronjob
- it requires only ~5% of the bucket's total size to copy the whole bucket (based on my tests, for a 5.4 TB bucket you will need ~250 GB of storage as a buffer)
- you can use the subPath variable to create multiple parallel runs on the same bucket
Despite all the advantages, you need to take into account that this method copies every file twice (bucket-A -> pod -> bucket-B), which may cause network saturation. You also need to be aware that if the job fails, you have to relaunch it from the beginning.
A real test for one of the biggest buckets:
- Total number of files: 946 803
- Total size of the bucket: 4.391 TB
- The file sizes are heterogeneous: we have files between 1 KB and 200 KB, and between 15 MB and 30 MB.
The bucket had ten main folders containing multiple subfolders and objects. I created ten jobs, each handling one main folder. This split helped me speed up the copy and limit the impact of a job failure.
It took 77 minutes to copy the whole bucket (~4.4 TB)!
```
NAME                COMPLETIONS   DURATION   AGE
s3-copy-bucket-a    1/1           66m        22d
s3-copy-bucket-a1   1/1           77m        22d
s3-copy-bucket-a2   1/1           74m        22d
s3-copy-bucket-a3   1/1           72m        22d
s3-copy-bucket-a4   1/1           65m        22d
s3-copy-bucket-a5   1/1           64m        21d
s3-copy-bucket-a6   1/1           66m        22d
s3-copy-bucket-a7   1/1           67m        22d
s3-copy-bucket-a8   1/1           77m        22d
s3-copy-bucket-a9   1/1           74m        22d
```
As I said before, you need to monitor the network during this operation, since it may get saturated. Here is a snapshot of the traffic while the copy was running:
Alternative solution (AWS-managed): while I was writing this post, I discovered this blog, which you can test as a possible managed solution for cross-account, cross-region copies.
Even though the final solution was network-intensive, the results were very satisfying: we were able to bring the cross-region, cross-VPC, and cross-account transfer down from days to a few hours.
AWS recently announced that Amazon S3 Cross-Region Replication (CRR) now supports copying existing objects. This service could be an alternative solution; make sure to estimate the transfer cost first, since the bill can rise quickly.
Original post: https://itnext.io/transfer-terabytes-of-data-between-aws-s3-buckets-cross-account-cross-region-and-cross-vpc-ccdbec15e53