In this lesson I'll show you how to upload large files using aiobotocore.
Aiobotocore is an async version of the boto3 library. I couldn't find any example of using async Python to upload large files, so I hope this post helps you.
Multipart upload to S3 using asyncio
Running S3-compatible object storage in Docker
First of all, we need to run S3 storage somewhere. You can use hosted AWS S3, but for testing you can use the free minio image from Docker Hub.
Minio is an object storage with an S3-compatible API, so it can be used via AWS S3 client libraries.
To run minio on your local machine using Docker, copy and paste the following command:
```
docker run \
  -p 9000:9000 \
  -p 9001:9001 \
  -e "MINIO_ROOT_USER=AKIAIOSFODNN7EXAMPLE" \
  -e "MINIO_ROOT_PASSWORD=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
  quay.io/minio/minio server /data --console-address ":9001"
```
After that you need to create a bucket and obtain an access key along with a secret access key. To do that, open your browser and navigate to the minio console at 127.0.0.1:9001 to see the login page.
- Use the provided environment variables: `AKIAIOSFODNN7EXAMPLE` as the login and `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY` as the password to sign in.
- Click on the `Buckets` button inside the left menu and create a bucket. Remember its name; we will need it later.
- Then click on the `Users` button and create a user with `FAKE_KEY` as the access key and `FAKE_KEY` as the secret access key. Grant all needed policies (permissions) to the created user; for instance, we will need the `readwrite` policy.
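If you want to double-check the setup from code, here is a quick sketch using aiobotocore that lists the buckets. The endpoint and the `FAKE_KEY` credentials are the values from the steps above; adjust them if yours differ.

```python
import asyncio

from aiobotocore.session import get_session


async def main():
    session = get_session()
    async with session.create_client(
        's3',
        endpoint_url='http://127.0.0.1:9000',  # minio API port, not the console port
        aws_access_key_id='FAKE_KEY',
        aws_secret_access_key='FAKE_KEY',
    ) as client:
        resp = await client.list_buckets()
        # You should see the bucket you just created in the console.
        print([bucket['Name'] for bucket in resp['Buckets']])


asyncio.run(main())
```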
Check out the video below if you want some visuals on how to set up minio.
What is multipart upload
Multipart upload is an S3 API that allows you to upload a file in several parts.
Let's say we have a social network service, and within the service each user has an image gallery.
We want the ability to download an archive with all of a user's photos.
How can we do that?
- Gather all user photos inside the image gallery
- Create an archive
- Upload the archive into S3
We have a couple of problems here:
- According to the S3 documentation, you can't upload files larger than 5 GB to S3 in a single PUT operation.
- A bad network may cause issues that prevent a large archive from uploading successfully.
Solution - Multipart upload
Multipart upload is designed to solve issues with uploading large files, from 5 MB up to 5 TB. The AWS docs recommend considering it when the file size exceeds 100 MB.
So, in our case it's the best solution for uploading an archive of gathered photos, since the archive may well be larger than 100 MB.
Multipart upload steps
In order to use multipart upload, you have to follow several steps:
1. Send a create multipart upload request and receive info about the created upload, including its upload ID.
2. Split the file into several chunks, then upload each chunk providing the part number, upload ID, data, etc. Each uploaded part's info (part number and ETag) must be recorded somewhere; it will be used to complete the multipart upload.
3. Send a list parts request with the upload ID and receive info about the uploaded parts, including their count.
4. Compare the number of parts inside S3 with your local parts info. If the total number of local parts equals the count of parts inside S3, complete the multipart upload; otherwise, abort it.
When you upload chunks of data using multipart upload and never complete or abort the upload, you run into a problem: parts stored before completion are hidden inside S3, yet they still occupy space. If you fail to finish the multipart upload the right way, they are going to cost you extra money. That's why it's worth setting up expiration for incomplete uploads, for example via a bucket lifecycle rule that aborts incomplete multipart uploads after some number of days.
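For illustration, here is a sketch of such a lifecycle rule using the put_bucket_lifecycle_configuration call. The bucket name, endpoint, and credentials are placeholders from our minio setup, and 7 days is an arbitrary choice.

```python
import asyncio

from aiobotocore.session import get_session


async def configure_lifecycle():
    session = get_session()
    async with session.create_client(
        's3',
        endpoint_url='http://127.0.0.1:9000',
        aws_access_key_id='FAKE_KEY',
        aws_secret_access_key='FAKE_KEY',
    ) as client:
        await client.put_bucket_lifecycle_configuration(
            Bucket='test',  # hypothetical bucket name
            LifecycleConfiguration={
                'Rules': [{
                    'ID': 'abort-incomplete-multipart-uploads',
                    'Status': 'Enabled',
                    'Filter': {'Prefix': ''},  # apply to the whole bucket
                    # Hidden parts of unfinished uploads are removed after 7 days.
                    'AbortIncompleteMultipartUpload': {'DaysAfterInitiation': 7},
                }]
            },
        )


asyncio.run(configure_lifecycle())
```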
Splitting a file into chunks
How do we divide a file into chunks? Let's recall what a file object represents.
A file holds bytes. Just imagine an array of bytes: this is the file.
Reading from a file is handled using a pointer to the byte number from which we start reading. The pointer is moved via the seek operation. For example, seek(0) moves the pointer to the zeroth byte, whereas seek(6) moves the pointer to the sixth byte.
Starting from there, we can read the next N bytes using the read operation. For instance, say we want to read and save the first 5 MB of a file. Let's calculate how much this is in bytes: 1024 bytes = 1 KB and 1024 KB = 1 MB, so 5 MB = 5 * 1024 * 1024 = 5242880 bytes to read from the file to retrieve the first chunk. You can do that with the read(5242880) operation.
To save the first chunk of data, we move the file pointer to the beginning of the first chunk and call read to get the first 5 MB; the seek offset here equals 0 * BYTES_PER_CHUNK.
To read the next chunk of data, we move the pointer to the beginning of the second chunk; the seek offset here equals 1 * BYTES_PER_CHUNK. In general, chunk number i starts at offset i * BYTES_PER_CHUNK.
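Here is a minimal sketch of this seek/read dance in plain synchronous Python; large.txt and the 5 MB chunk size are the values used throughout this post.

```python
BYTES_PER_CHUNK = 1024 * 1024 * 5  # 5 MB = 5242880 bytes

with open('large.txt', 'rb') as f:
    f.seek(0 * BYTES_PER_CHUNK)           # beginning of the first chunk
    first_chunk = f.read(BYTES_PER_CHUNK)

    f.seek(1 * BYTES_PER_CHUNK)           # beginning of the second chunk
    second_chunk = f.read(BYTES_PER_CHUNK)
```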
Number of chunks
To calculate the number of chunks, we first need the size of the file we are reading. You can get it with os.stat('large.txt').st_size. Next, we divide this number by the bytes-per-chunk value: number_of_chunks = source_size / BYTES_PER_CHUNK. In Python the result will be a float, and the last, partial chunk still has to be uploaded, so let's round up with math.ceil and convert the result to an integer: chunks_count = int(math.ceil(source_size / BYTES_PER_CHUNK)).
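Put together, the chunk math looks like this (large.txt is the test file we generate below):

```python
import math
import os

BYTES_PER_CHUNK = 1024 * 1024 * 5  # 5 MB

source_size = os.stat('large.txt').st_size  # file size in bytes
# Round up so the last, partial chunk is counted too.
chunks_count = int(math.ceil(source_size / BYTES_PER_CHUNK))
```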
In order to speed up our upload, we are going to utilize the power of asyncio. This means we will split the uploading into several tasks: asyncio coroutines.
A coroutine is a single unit of execution running inside a single thread. Its benefit is that only one thread is used, which means fewer OS resources and less synchronization headache compared to multithreaded programming. Each task runs until it hits a blocking IO operation, which is marked with the await keyword. After sending a request, the coroutine goes back into the waiting list and the event loop picks up another ready-to-go coroutine.
Install the required libraries:

```
pip install aiobotocore aiofiles
```
We are ready to write the async upload. Note that the minimum part size must be 5 MB (only the last part may be smaller)!
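Here is a sketch of the whole flow. The test bucket name, the minio endpoint, and the FAKE_KEY credentials come from the setup above, so adjust them to your environment.

```python
import asyncio
import math
import os

import aiofiles
from aiobotocore.session import get_session

BYTES_PER_CHUNK = 1024 * 1024 * 5  # 5 MB - the minimum allowed part size
BUCKET = 'test'                    # hypothetical bucket name from the setup above
KEY = 'large.txt'
FILE_PATH = 'large.txt'


async def upload_chunk(client, file_path, upload_id, chunk_number, part_info):
    # Move the pointer to the beginning of this chunk and read 5 MB.
    async with aiofiles.open(file_path, 'rb') as f:
        await f.seek(chunk_number * BYTES_PER_CHUNK)
        chunk = await f.read(BYTES_PER_CHUNK)

    part_number = chunk_number + 1  # S3 part numbers start at 1
    resp = await client.upload_part(
        Bucket=BUCKET,
        Key=KEY,
        PartNumber=part_number,
        UploadId=upload_id,
        Body=chunk,
    )
    # Record the part locally - we need it to complete the upload.
    part_info['Parts'].append({'PartNumber': part_number, 'ETag': resp['ETag']})


async def multipart_upload(file_path):
    source_size = os.stat(file_path).st_size
    chunks_count = int(math.ceil(source_size / BYTES_PER_CHUNK))

    session = get_session()
    async with session.create_client(
        's3',
        endpoint_url='http://127.0.0.1:9000',  # the minio container from above
        aws_access_key_id='FAKE_KEY',
        aws_secret_access_key='FAKE_KEY',
    ) as client:
        # 1. Create the multipart upload and remember its id.
        resp = await client.create_multipart_upload(Bucket=BUCKET, Key=KEY)
        upload_id = resp['UploadId']
        part_info = {'Parts': []}

        # 2. Upload all chunks concurrently.
        tasks = [
            upload_chunk(client, file_path, upload_id, chunk_number, part_info)
            for chunk_number in range(chunks_count)
        ]
        await asyncio.gather(*tasks)

        # 3. Ask S3 which parts it actually received
        #    (list_parts returns up to 1000 parts per request).
        list_resp = await client.list_parts(
            Bucket=BUCKET, Key=KEY, UploadId=upload_id
        )

        # 4. Complete if everything arrived, abort otherwise.
        if len(list_resp.get('Parts', [])) == chunks_count:
            part_info['Parts'].sort(key=lambda part: part['PartNumber'])
            await client.complete_multipart_upload(
                Bucket=BUCKET,
                Key=KEY,
                UploadId=upload_id,
                MultipartUpload=part_info,
            )
        else:
            await client.abort_multipart_upload(
                Bucket=BUCKET, Key=KEY, UploadId=upload_id
            )


asyncio.run(multipart_upload(FILE_PATH))
```

Note that opening the file separately in every coroutine keeps the sketch simple, but it is also what can exhaust file descriptors; see the optimization notes below.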
To test it, generate a 5 GB file. On Linux:

```
dd if=/dev/urandom of=large.txt bs=2048 count=2500000
```

Or on macOS:

```
mkfile 5g large.txt
```
Questions and links
Does asyncio give a speed boost?
According to my manual tests, it has a small speed advantage over a single PUT operation. Just a little. Adjusting the chunk size may affect the total time, so it's a good idea to play around with different values. Honestly, I don't know if this is really a good solution to use; report any bugs or optimizations you find.
Possible issues and further optimizations
The provided code is just an example and requires some enhancements from your side for your particular situation. You need to adjust the chunk size as needed. Also be aware that spawning a lot of such coroutines, each opening the file, may cause a "too many open file descriptors" error. In this case you can implement a pool of coroutines via a queue, as sketched below.
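Here is a minimal sketch of such a pool, reusing the upload_chunk coroutine and the names from the example above; MAX_WORKERS is an arbitrary value to tune.

```python
import asyncio

MAX_WORKERS = 8  # assumption: tune to your descriptor/memory limits


async def worker(queue, client, file_path, upload_id, part_info):
    # Take chunk numbers off the queue until the pool is shut down.
    while True:
        chunk_number = await queue.get()
        try:
            await upload_chunk(client, file_path, upload_id, chunk_number, part_info)
        finally:
            queue.task_done()


async def upload_with_pool(client, file_path, upload_id, part_info, chunks_count):
    queue = asyncio.Queue()
    for chunk_number in range(chunks_count):
        queue.put_nowait(chunk_number)

    # Only MAX_WORKERS coroutines run at once, so at most that many
    # file descriptors and in-memory chunks exist at any moment.
    workers = [
        asyncio.create_task(worker(queue, client, file_path, upload_id, part_info))
        for _ in range(MAX_WORKERS)
    ]
    await queue.join()   # wait until every chunk has been uploaded
    for w in workers:
        w.cancel()       # the workers loop forever; stop them explicitly
```

With this in place, you would replace the asyncio.gather(*tasks) call from the earlier example with await upload_with_pool(client, file_path, upload_id, part_info, chunks_count).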
You can also optimize RAM usage, but that is not covered in this article.
Big thanks to the authors who shared solutions related to multipart upload.
PS: Also thanks to Zvika Ferentz, who contacted me, pointed out a bug in the code, and shared his solution.
You can leave your comments here: https://gist.github.com/skonik/b20a21fc39f97e16c979c49267d90e05