Turbocharging your S3 Uploads: A Guide to Blazing-Fast Data Transfer with the AWS CLI
RONIN's Object Storage is the same as Amazon S3, a fantastic service for storing vast amounts of data. But uploading large datasets can sometimes feel like watching paint dry. Fear not, speed demons! This blog post will equip you with the knowledge and tools to transform your uploads into Formula 1-worthy sprints. Buckle up as we explore how to leverage the AWS CLI in Windows, multithreading, compression, and optimized settings to achieve blistering transfer speeds.
Prerequisites
- AWS CLI: Ensure the AWS CLI is installed and configured on your machine. You can download it from the official AWS website and follow the installation instructions.
- 7-Zip or gzip (optional): for compressing files to make uploads smaller, and therefore faster!
Optimizing AWS CLI Settings
Before diving into specific scenarios, let's discuss some AWS CLI S3 configuration settings that help maximise your upload throughput. Note that these are configuration values rather than command-line flags: you set them once with aws configure set (or directly in the config file, as shown after the list below) and the aws s3 commands pick them up automatically:
multipart_threshold
- This setting determines the size threshold (in bytes, or with a suffix such as 100MB) at which the CLI automatically switches to multipart uploads. Multipart uploads significantly improve performance for larger files by splitting them into smaller parts and uploading those parts concurrently; any file larger than the threshold is uploaded this way.
multipart_chunksize
- This defines the size of each part in a multipart upload. Experiment with different values to find the sweet spot for your network and file sizes. A good starting point is 100MB.
max_concurrent_requests
- This setting controls the maximum number of concurrent requests the AWS CLI can make. Increasing this number can improve upload speeds, especially when dealing with many small files. Start with a value like 10 and adjust based on your system's capabilities and network conditions.
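If you prefer to edit the configuration file directly, the same settings live under the s3 key of a profile in the AWS CLI config file (typically %USERPROFILE%\.aws\config on Windows, or ~/.aws/config on Linux and macOS). Here's a minimal sketch of that file using the values from this post:
[default]
s3 =
  multipart_threshold = 100MB
  multipart_chunksize = 100MB
  max_concurrent_requests = 10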
Scenario 1: Uploading Multiple Large Files
When dealing with a directory containing multiple large files, the aws s3 cp command with the --recursive flag and the optimized settings mentioned above will be your weapon of choice. Here's an example of how to use it:
aws configure set default.s3.multipart_threshold 100MB
aws configure set default.s3.multipart_chunksize 100MB
aws s3 cp <file_location/your-folder/> s3://your-bucket-name/your-folder/ --recursive
This command recursively copies all files from the specified local folder to the given S3 bucket and folder. The multipart_threshold and multipart_chunksize settings ensure that large files are uploaded efficiently using multipart uploads.
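To double-check the values the CLI will use, you should be able to read them back with aws configure get (assuming they were written to the default profile as above):
aws configure get default.s3.multipart_threshold
aws configure get default.s3.multipart_chunksize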
Scenario 2: Uploading Many Small Files within Directories
For a directory containing numerous small files scattered across subfolders, the aws s3 sync command is the optimal solution. This command synchronizes your local directory with your S3 bucket, efficiently handling the complexities of numerous small files and directory structures. Coupled with the max_concurrent_requests setting, we can increase the number of files that upload concurrently (in this example, 10):
aws configure set default.s3.multipart_threshold 100MB
aws configure set default.s3.multipart_chunksize 100MB
aws configure set default.s3.max_concurrent_requests 10
aws s3 sync <file_location/your-folder/> s3://your-bucket-name/your-folder/
This command synchronises the specified local folder with the S3 bucket and folder, utilizing multiple concurrent requests to accelerate the upload process.
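If you'd like to preview what will be transferred before committing to a long sync, the --dryrun flag lists the operations the command would perform without actually uploading anything:
aws s3 sync <file_location/your-folder/> s3://your-bucket-name/your-folder/ --dryrun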
Pre-Upload Compression: The Need for Speed
Before uploading your data, compressing it can drastically reduce the amount of data transferred, leading to significantly faster uploads. On Windows, 7-Zip is an excellent compression tool that offers a high compression ratio and supports various archive formats; on Linux and macOS, gzip and tar (used in the examples below) are readily available.
Option 1: Individual File Compression for Large Files
For large files, compressing each file individually strikes a good balance between compression efficiency and the ability to download files independently. This approach minimizes the overhead of decompressing a massive archive when you only need a single file.
For example in Linux:
for f in *; do gzip -9 -k "$f"; done
This command iterates through each file in the current directory and creates a separate gzip archive for each one, keeping the original file thanks to -k. You can adjust the compression level (-9) to fine-tune the balance between compression ratio and speed.
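For the Windows users this post is aimed at, here is a rough cmd.exe equivalent using 7-Zip, assuming 7z.exe is on your PATH (double the % signs if you put this in a batch file):
for %f in (*) do 7z a -mx=9 "%f.7z" "%f"
Here -mx=9 is 7-Zip's maximum compression level, playing the same role as gzip's -9.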
Option 2: Directory-Based Compression for Small Files
When dealing with a directory filled with many small files, compressing them into smaller, directory-based archives can offer a good compromise. This approach allows you to download related files together while avoiding the overhead of a single massive archive.
For example in Linux:
for d in */; do tar -cf - "$d" | gzip -9 > "${d%/}.tar.gz"; done
This command iterates through each subdirectory in the current directory and creates a separate .tar.gz archive for each one. This keeps related files grouped together while allowing for independent downloads of specific directories. You can adjust the compression level (-9) to fine-tune the balance between compression ratio and speed.
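Again, a rough Windows equivalent using 7-Zip (assuming 7z.exe is on your PATH; double the % signs in a batch file):
for /d %d in (*) do 7z a -mx=9 "%d.7z" "%d"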
Important Considerations
- Compression Level: Experiment with different compression levels (-1 through -9) in both options to find the optimal balance between compression ratio and speed. Higher compression levels (-9) offer better compression but take longer, while lower levels (-1) are faster but less efficient.
- File Types: Some file types, like images and videos, may already be compressed. Compressing them further might not yield significant size reductions. Consider excluding such files from the compression process (see the sketch after this list).
- Testing: Always test your upload speeds with different compression settings to identify the most efficient approach for your specific data and network conditions.
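As a sketch of the File Types point above, on Linux you could skip common already-compressed formats when compressing files; the extensions listed here are only examples, so adapt them to your own data:
find . -type f ! -name "*.gz" ! -name "*.zip" ! -name "*.jpg" ! -name "*.png" ! -name "*.mp4" -exec gzip -9 -k {} \;
Note that find works recursively, so this also covers files in subdirectories.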
Additional Tips for Maximum Throughput
- Network Optimization: Ensure you have a stable and high-bandwidth internet connection. Consider using a wired connection instead of Wi-Fi for optimal performance.
- Storage Optimization: If you're dealing with extremely large datasets, consider using an SSD instead of a traditional HDD for faster read speeds during the upload process.
By combining these techniques—optimized AWS CLI settings, multithreading, compression, and network optimization—you can unleash the full potential of your internet connection and achieve blazing-fast S3 uploads. So, ditch the slow lane and experience the thrill of high-speed data transfer!