It Took Two Days and Seven Engineers to Move Data Between Two S3 Buckets

A team of engineers tried to quickly transfer 25 TB of data from one S3 bucket to another [1]. Their requirement was to move a large number of small log files (each in the megabyte range), ideally within the next two hours. It turned out that they needed two full working days and a team of seven engineers to complete the task. It’s easy to just call them extremely inefficient; from a business perspective, the problem seems simple. But the truth is that such a large data transfer is not something that can be easily accomplished within two hours and without preparation.

Let’s learn from their mistakes and look at how we can accomplish this faster and more efficiently.

What Went Wrong?

From the original Reddit post, we don’t get a lot of background information other than that they wanted to quickly transfer 25 TB of data, mostly made up of small log files. We don’t know which AWS regions they had to transfer this data to and from. All we know about their approach is that they briefly did their research, concluded that all the options were too time-consuming, and decided to migrate the data by running parallel uploads using the AWS CLI, similar to the following:
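A sketch of what such a command could have looked like (the bucket names and prefixes are illustrative, not taken from the original post):

# copy only the log files for the first two months of 2020
aws s3 cp s3://source-bucket/logs/ s3://destination-bucket/logs/ --recursive \
    --exclude "*" --include "2020-01*" --include "2020-02*"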

This way, they split the data transfer into multiple parallel operations [2]. The --exclude "*" parameter first excludes all files, and each --include parameter then adds back only the files whose names start with a specific prefix. By running several such commands at the same time, each covering a different prefix, the transfer is spread across multiple concurrent upload processes. In the command shown above, we only include log files that start with a specific year and month prefix.

This command would be run and monitored by one engineer, while another one (or the same engineer, in another terminal session) could run the transfer for other months:
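For instance, a second session could handle the next two months (again with illustrative names and prefixes):

# a second, parallel session for the next two months
aws s3 cp s3://source-bucket/logs/ s3://destination-bucket/logs/ --recursive \
    --exclude "*" --include "2020-03*" --include "2020-04*"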

Even though this is one of the options recommended by AWS [2] to move large amounts of data between S3 buckets, it reminds me of the classic map-reduce example of counting the books in a library: divide the work evenly among several people (workers), let each person count only the books on specific shelves, and have every worker report the result to a coordinating person (master).

This approach works, but it generates a lot of overhead (they needed seven engineers and two days to coordinate this). There must be a better way!

Note: Even though we used aws s3 cp, we could also use aws s3 mv to ensure that the data is not only copied to the destination but also deleted from the source bucket.
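For example (illustrative names again):

# copy to the destination and delete from the source in one command
aws s3 mv s3://source-bucket/logs/ s3://destination-bucket/logs/ --recursive \
    --exclude "*" --include "2020-01*"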

Setting Up Replication From Bucket A to Bucket B

Ideally, we don’t want to migrate single files separately. We’d prefer to just configure things to transfer data from bucket A to bucket B. There is one option that would let us do that: replication.

Replication allows us to configure bucket A to be constantly in sync with bucket B and to automatically make sure that all files are copied over.

Replication is particularly useful if we want to copy data from a production bucket to some development bucket. This way we can ensure our development environment has an exact copy of production data, which allows for a reliable development setup.

AWS allows us to use [3]:

  • CRR (Cross Region Replication)
  • SRR (Same Region Replication)

Those two options allow us to move the data between buckets across regions (CRR) or within the same region (SRR).

Note that replication only works if both S3 buckets have versioning enabled.

To implement this, we go to the management console and, within our source bucket, select Management → Replication → Add rule. Then we follow the three steps shown in the screenshot below to enable versioning and replication from bucket A to bucket B:

Setting up replication — image by author

In the end, we should see a screen confirming that the replication has been established:

Replication successful — image by author
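If we prefer the CLI over the console, a minimal sketch of the same setup could look like this (the bucket names, rule ID, and IAM role ARN are placeholders; the role must allow S3 to read from bucket A and write to bucket B):

# 1. enable versioning on both buckets (a prerequisite for replication)
aws s3api put-bucket-versioning --bucket bucket-a \
    --versioning-configuration Status=Enabled
aws s3api put-bucket-versioning --bucket bucket-b \
    --versioning-configuration Status=Enabled

# 2. attach a replication rule to the source bucket
aws s3api put-bucket-replication --bucket bucket-a \
    --replication-configuration '{
      "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
      "Rules": [{
        "ID": "replicate-everything",
        "Status": "Enabled",
        "Priority": 1,
        "Filter": {},
        "DeleteMarkerReplication": {"Status": "Disabled"},
        "Destination": {"Bucket": "arn:aws:s3:::bucket-b"}
      }]
    }'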

Are we done? Not quite.

Overall, replication sounds great: once we configure it, there is nothing else left for us to do, since AWS automatically replicates the objects from bucket A to bucket B. But there is one caveat: after we’ve set this up, replication only applies to files uploaded in the future; it won’t replicate the objects that already exist in the bucket!

There is a trick, though: it’s sufficient to change the storage class of the existing objects (or, alternatively, their encryption status). This could mean changing the storage class from Standard to Intelligent-Tiering, but the main point is that the change must be from one class to a different one; trying to “change” from Standard to Standard wouldn’t modify the objects. By changing the storage class, we ensure that all the files will be:

  • Moved from bucket A back to bucket A (but with a new storage class).
  • Automatically replicated to bucket B.

We could achieve this by the following command [2]:
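(A sketch with an illustrative bucket name and target class.)

# copy every object onto itself while switching the storage class;
# the rewritten objects are then picked up by replication
aws s3 cp s3://bucket-a s3://bucket-a --recursive \
    --storage-class INTELLIGENT_TIERING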

You can do the same from the console:

Changing a storage class — image by author

Having made those changes, all of the data gets automatically copied over from one bucket to the other.

We could now change back to our previous storage class.
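For example (same illustrative bucket name):

# switch the objects back to the Standard storage class
aws s3 cp s3://bucket-a s3://bucket-a --recursive --storage-class STANDARD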

S3 Batch Operations

The first method that we introduced (AWS CLI) requires a lot of work on our side (our own “map-reduce”) and many API calls, which can drive up costs. The second method (replication) is asynchronous, which means that objects are only eventually replicated. According to AWS:

“Most objects replicate within 15 minutes, but sometimes replication can take a couple hours or more.” [4]

The potential latency may be the reason why those engineers didn’t choose this option — they wanted to accomplish the data transfer within two hours.

In this case, the service “S3 batch operations” seems to be an attractive alternative. It promises to quickly process a large number of S3 objects within a single API request [2].

S3 batch operations in action: the process

The entire process of moving data from bucket A to bucket B entails the following steps:

  1. Setting up an inventory report (it can be delivered to the destination bucket B or to a separate bucket) to generate a list of all objects that need to be copied over from bucket A to bucket B.
  2. Creating an IAM role for S3 batch operations to give the job permission to read and write data in both buckets (or all three, if the inventory report is stored in a third bucket).
  3. Creating an S3 batch operations job, within the AWS Management Console or the AWS CLI, with a PUT copy operation that performs the actual data transfer based on the inventory report.
  4. Running the job and reviewing the completion report to validate that all objects have been transferred successfully.

The entire process should take a couple of hours.

Note that an important prerequisite for creating a job in S3 batch operations is having the inventory report in place (step 1).

S3 batch operations in action: the implementation

We start by creating an S3 inventory report of our bucket A (select your bucket → Management → Inventory), which will (when completed) list all objects in our S3 bucket:

Configuring inventory report — image by author

Now we can customize the report: select the destination bucket, choose whether the report should be generated daily or weekly, and add optional fields for extra metadata such as object size, last modified date, or encryption status. We should select CSV, as this is the only format that can be used for S3 batch operations:

Configuring the inventory report in S3 — image by author

After we click on save, the configuration is finished. However, AWS informs us that it may take up to 48 hours to deliver the first report!

The inventory report should be delivered within 48 hours — image by author
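As a side note, the same inventory configuration can also be created with the CLI. A minimal sketch, assuming illustrative bucket names and a made-up configuration ID:

# configure a daily CSV inventory of bucket A, delivered to bucket B
aws s3api put-bucket-inventory-configuration --bucket bucket-a \
    --id daily-inventory \
    --inventory-configuration '{
      "Id": "daily-inventory",
      "IsEnabled": true,
      "IncludedObjectVersions": "Current",
      "Schedule": {"Frequency": "Daily"},
      "Destination": {
        "S3BucketDestination": {
          "Bucket": "arn:aws:s3:::bucket-b",
          "Format": "CSV"
        }
      }
    }'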

I tried to fake it and generate the inventory report myself by simply listing the files with the AWS CLI, saving the results to a CSV file, and uploading it to my S3 bucket:

# create the file manually
aws s3 ls s3://e-commerce-marketplace --recursive > manifest.csv

# upload to S3
aws s3 cp manifest.csv s3://e-commerce-marketplace/manifest.csv

But it seems that this service requires the inventory report to be in a specific format, which caused my “S3 batch operations” attempt to fail:

Attempt to “fake” the inventory report — image by author

Apparently, no inventory report means no S3 batch operations job!
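For what it’s worth, my understanding is that a CSV manifest for S3 batch operations must contain only the bucket name, the object key, and optionally a version ID per line, without a header, which is not what aws s3 ls produces. A hypothetical hand-made manifest (with made-up keys) would look like this:

e-commerce-marketplace,logs/2020-01-01-example.log
e-commerce-marketplace,logs/2020-01-02-example.log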

If we had this inventory already in place, we could continue as follows to create an S3 batch operations job:

  • Create a new job:
Create job — image by author
  • Specify the path to the CSV inventory report and the S3 destination (bucket B), and then select the type of operation we want to perform (Copy):
Configuring the job — image by author

The final steps are configuring the completion report and IAM role to grant the job permissions to access our S3 resources:

All subsequent steps needed to complete the S3 batch operations job to move big data between S3 buckets — image by author
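For completeness, creating such a job from the AWS CLI could look roughly like the sketch below; the account ID, ARNs, and ETag are placeholders, and the exact parameters should be double-checked against the documentation:

aws s3control create-job \
    --account-id 123456789012 \
    --operation '{"S3PutObjectCopy": {"TargetResource": "arn:aws:s3:::bucket-b"}}' \
    --manifest '{
      "Spec": {"Format": "S3InventoryReport_CSV_20161130"},
      "Location": {
        "ObjectArn": "arn:aws:s3:::bucket-b/inventory/manifest.json",
        "ETag": "replace-with-the-manifest-etag"
      }
    }' \
    --report '{
      "Bucket": "arn:aws:s3:::bucket-b",
      "Prefix": "batch-reports",
      "Format": "Report_CSV_20180820",
      "Enabled": true,
      "ReportScope": "AllTasks"
    }' \
    --priority 10 \
    --role-arn arn:aws:iam::123456789012:role/s3-batch-operations-role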

Alternative Options

In addition to the methods above, AWS offers other ways of transferring large amounts of data between S3 buckets:

  • AWS SDK: This entails writing a custom application (for example, in Java) that performs the copy operation.
  • Amazon EMR: Spinning up a Hadoop cluster and performing an S3DistCp operation to copy the data from S3 to a new destination. This involves running parallel copy tasks that download the data from bucket A to the Hadoop cluster and write the files in parallel to bucket B (a sketch follows below).
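As a rough illustration, once the EMR cluster is running, the copy step could look like this (bucket names are placeholders):

# run S3DistCp on the EMR master node to copy bucket A to bucket B in parallel
s3-dist-cp --src s3://bucket-a/logs/ --dest s3://bucket-b/logs/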

If you ask me, those options look like overengineering, but they can make sense in certain scenarios, especially when you face unexpected requirements such as having to move terabytes of data within a couple of hours.

Conclusion

In this article, we discussed several options to move large amounts of data between S3 buckets: the AWS CLI copy command, replication, and S3 batch operations. We can conclude by saying that without allowing more time for such migrations (for example, a large enough buffer to wait for an inventory report to be generated, or for replication to take care of the process for us), moving large amounts of data is a very involved process. It requires custom, often overengineered solutions such as:

  • Writing custom applications with AWS SDK.
  • Performing S3DistCp on a Hadoop cluster.
  • Even trying to perform your own map-reduce job by splitting the AWS CLI copy process into separate sessions and dividing the work across seven engineers.

We shouldn’t require such large data transfers to be conducted within two hours by a single engineer. Additionally, planning ahead may eliminate the need for such large data transfers in the first place.

Overall, I wish that business owners and managers would recognize that there are many things in engineering that just don’t happen overnight (certainly not within two hours). Everything requires planning, preparation, gathering and discussing requirements with stakeholders, infrastructure setup, and plenty of testing. This is the only way to provide high-quality IT solutions to business problems.

I learned a lot from the experience of those engineers and I’m grateful that they shared their story. I hope it was useful for you, too. Thank you for reading!

References

[1] Reddit post: https://www.reddit.com/r/aws/comments/irkshm/moving_25tb_data_from_one_s3_bucket_to_another/

[2] AWS Knowledge Center: https://aws.amazon.com/premiumsupport/knowledge-center/s3-large-transfer-between-buckets/

[3] Amazon S3 — Replication: https://docs.aws.amazon.com/AmazonS3/latest/dev/replication.html

[4] S3 CRR Replication time: https://aws.amazon.com/premiumsupport/knowledge-center/s3-crr-replication-time/

