
There are plenty of guides on how to upload files to AWS S3, but almost none of them cover handling large files. What happens if the file you are trying to upload is around 1GB in size? The chances are that the upload will fail more often than it succeeds. That is because the standard upload sends the whole file in a single request, and AWS itself recommends switching to a different approach once files grow beyond roughly 100MB. So what do you do when a file is too big to handle this way? I faced exactly this problem last week while working on a NodeJS project. The short answer is to use multipart upload.
Before I go more in-depth on the topic, I will assume that you know how to set up your AWS credentials in NodeJS; if not, this guide might help you get set up.
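(If you just need a quick reminder, a minimal sketch of that setup might look like the snippet below. It assumes the aws-sdk v2 package and credentials kept in environment variables; the region is only a placeholder.)
const AWS = require("aws-sdk");

// configure the SDK; it also picks these values up automatically
// from the standard AWS_* environment variables or ~/.aws/credentials
AWS.config.update({
  region: "us-east-1", // placeholder, use your bucket's region
  accessKeyId: process.env.AWS_ACCESS_KEY_ID,
  secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
});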
Multipart Upload to the Rescue!
In this approach, we first initiate a request to AWS S3 using our credentials, along with the bucket we want to upload to and the key of the object. If the request is successful, the response will contain an upload ID. This upload ID is used to upload the data/object as well as to finish or abort the upload process. And as the name suggests, we have to divide the large object into smaller parts before uploading it, and we have to keep track of those parts too.
Let’s get our hands dirty!
Initiate the upload process.
To start a multipart upload we need to request an upload process, which will return an UploadId. To create the process we call the createMultipartUpload method on the s3 object. This method takes two parameters: the upload params and a callback function. The upload params support most of the properties used in the standard upload method, like Bucket, Key, ContentType and so on.
const s3 = new AWS.S3({ apiVersion: "2006-03-01" });

const startUpload = () => {
  const params = {
    Key: "<Your Key for the file>",
    Bucket: "<Your Bucket_name>",
  };
  return new Promise((resolve, reject) => {
    s3.createMultipartUpload(params, (err, data) => {
      if (err) return reject(err);
      return resolve(data);
      /*
      data = {
        Key: "<Your Key for the file>",
        Bucket: "<Your Bucket_name>",
        UploadId: "ibZBv_75gd9r8lH_gqXatLdxMVpAlj6ZQjEs.Sjng--" // some gibberish!
      }
      */
    });
  });
};
Once the request completes successfully, the response will contain the UploadId.
Upload small parts.
Now we are ready to upload the file parts. For that, we are going to use the uploadPart method of the s3 object. Its params are the same as those of createMultipartUpload; we just need to add the Body (the chunk of data), the UploadId returned from the previous step and the PartNumber of the part. The PartNumber can be any number between 1 and 10,000. On a successful upload, the callback receives an ETag that is needed to complete the upload, so we need to store the ETags of all the parts.
const uploadPart = (buffer, uploadId, partNumber) => {
  const params = {
    Key: "<Your Key for the file>",
    Bucket: "<Your Bucket_name>",
    Body: buffer,
    PartNumber: partNumber, // any number from 1 to 10,000
    UploadId: uploadId, // the UploadId returned from the first method
  };
  return new Promise((resolve, reject) => {
    s3.uploadPart(params, (err, data) => {
      if (err) return reject({ PartNumber: partNumber, error: err });
      return resolve({ PartNumber: partNumber, ETag: data.ETag });
    });
  });
};
But before uploading, we need the most important thing: the file itself, broken into several chunks as Buffer objects. For this, we are going to use the slice method on the buffer returned when reading the file, breaking it into chunks of 10MB each (feel free to use any size you want, but every part except the last must be at least 5MB). We'll create a function named upload to wrap everything up and use promises to resolve the values.
Uploading all the file parts.
const fs = require("fs");

const upload = async (filePath) => {
  const file = fs.readFileSync(filePath); // read the whole file into a Buffer
  const chunkSize = Math.pow(1024, 2) * 10; // chunk size is set to 10MB
  const fileSize = file.byteLength;
  const iterations = Math.ceil(fileSize / chunkSize); // number of chunks the file will be broken into
  const arr = Array.from(Array(iterations).keys()).map((i) => i + 1); // part numbers 1..iterations to loop through
  let uploadId = ""; // we will use this later
  try {
    uploadId = (await startUpload()).UploadId; // this starts the process and returns the UploadId
    const parts = await Promise.all(
      arr.map((item) =>
        uploadPart(
          file.slice((item - 1) * chunkSize, item * chunkSize),
          uploadId,
          item
        )
      )
    );
    console.log(parts);
  } catch (err) {
    console.error(err);
  }
};
The above approach works, but if even one part fails, Promise.all rejects immediately and throws away the results of all the parts that did succeed. Since I want network resiliency, I would rather use Promise.allSettled, which waits for every upload to settle even if some of them fail (you may skip this part). With this approach, we can retry just the failed parts once again (which is awesome!). Let's see it in action!
try {
  uploadId = (await startUpload()).UploadId; // this starts the process and returns the UploadId
  const parts = await Promise.allSettled(
    arr.map((item) =>
      uploadPart(
        file.slice((item - 1) * chunkSize, item * chunkSize),
        uploadId,
        item
      )
    )
  );
  console.log(parts);
  /*
  The response looks like this ->
  [
    { status: "rejected", reason: { PartNumber: 1234, error: {...} } },
    { status: "fulfilled", value: { PartNumber: 1234, ETag: '"d8c2eafd90c266e19ab9dcacc479f8af"' } }
  ]
  Now we can retry uploading the rejected parts!
  */
  const failedParts = parts
    .filter((part) => part.status === "rejected")
    .map((part) => part.reason);
  const succeededParts = parts
    .filter((part) => part.status === "fulfilled")
    .map((part) => part.value);
  let retriedParts = [];
  if (failedParts.length) // if some parts failed, then retry them
    retriedParts = await Promise.all(
      failedParts.map(({ PartNumber }) =>
        uploadPart(
          file.slice((PartNumber - 1) * chunkSize, PartNumber * chunkSize),
          uploadId,
          PartNumber
        )
      )
    );
} catch (err) {
  console.error(err);
}
Finishing the upload.
Now we are missing the most important part of the process: closing it, or aborting it in case of a fatal error. Without completing or aborting the upload, you may incur charges on your AWS account for the storage occupied by the parts that were already uploaded. So let's implement the abortUpload and completeUpload functions.
const abortUpload = async (uploadId) => {
  const params = {
    Key: "<Your Key for the file>",
    Bucket: "<Your Bucket_name>",
    UploadId: uploadId,
  };
  return new Promise((resolve, reject) => {
    s3.abortMultipartUpload(params, (err, data) => {
      if (err) return reject(err);
      return resolve(data);
    });
  });
};

const completeUpload = async (uploadId, parts) => {
  const params = {
    Key: "<Your Key for the file>",
    Bucket: "<Your Bucket_name>",
    UploadId: uploadId,
    MultipartUpload: {
      Parts: parts,
    },
  };
  return new Promise((resolve, reject) => {
    s3.completeMultipartUpload(params, (err, data) => {
      if (err) return reject(err);
      return resolve(data);
    });
  });
};
We are all done now. Let us add the completeUpload and abortUpload inside the try-catch block of the upload function.
try {
  uploadId = (await startUpload()).UploadId; // this starts the process and returns the UploadId
  const parts = await Promise.allSettled(
    // .........
  );
  const failedParts = // .....
  const succeededParts = // .....
  let retriedParts = [];
  if (failedParts.length !== 0) // if some parts failed, then retry them
    // .....
  succeededParts.push(...retriedParts); // add the failed parts that succeeded after the retry
  const data = await completeUpload(
    uploadId,
    succeededParts.sort((a, b) => a.PartNumber - b.PartNumber) // parts need to be sorted by PartNumber
  );
  console.log(data.Location); // the URL to access the object in S3
} catch (err) {
  const done = await abortUpload(uploadId); // in case of failure even after the retry, abort the process
  console.log(err);
}
Now calling the upload function with the path to the large file will start the upload process. And this time around there’s a pretty high chance of success. The complete code can be found here.
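For completeness, kicking the whole thing off is just a matter of calling upload with the path to your file (the path below is only a placeholder):
upload("/path/to/your/large-file.zip"); // errors are caught and logged inside upload itself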
What are the benefits of this approach?
- There’s a very low chance of upload failure even in a spotty network. And even if some parts fail, we can always retry uploading them (here I’ve retried only once but you can create a recursive function and retry as many times as your heart’s content).
- This approach is also a lot faster in practice, because the parts are uploaded in parallel inside Promise.all or Promise.allSettled. The standard upload sends everything in a single stream, which is quite slow (like a sloth for big files!).
- Another great feature is that you can pause the upload of the remaining parts and resume it at some later point in time (maybe 5 days later, why not!). But always remember: if you don't complete or abort the upload, the parts that were already uploaded stay in your S3 bucket and keep incurring storage charges. So no matter what, you must close it eventually.
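As mentioned in the first point above, you can push the retry idea further with a recursive helper. Here is a minimal sketch of what that could look like; retryParts is a hypothetical name, not part of the code above, and it takes the file buffer, chunk size and upload ID as arguments so it can reuse the uploadPart function from earlier:
// hypothetical recursive retry helper, reusing uploadPart from above
const retryParts = async (failedParts, attemptsLeft, file, chunkSize, uploadId) => {
  // nothing left to retry, or no attempts remaining
  if (!failedParts.length || attemptsLeft === 0) {
    return { succeeded: [], failed: failedParts };
  }
  const results = await Promise.allSettled(
    failedParts.map(({ PartNumber }) =>
      uploadPart(
        file.slice((PartNumber - 1) * chunkSize, PartNumber * chunkSize),
        uploadId,
        PartNumber
      )
    )
  );
  const succeeded = results
    .filter((r) => r.status === "fulfilled")
    .map((r) => r.value);
  const stillFailing = results
    .filter((r) => r.status === "rejected")
    .map((r) => r.reason);
  // recurse on whatever is still failing, with one attempt fewer
  const next = await retryParts(stillFailing, attemptsLeft - 1, file, chunkSize, uploadId);
  return { succeeded: [...succeeded, ...next.succeeded], failed: next.failed };
};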
Conclusion.
This approach requires a lot more code, but when we are dealing with huge files and want reliable uploads, it is really the only option. And its benefits easily outshine the standard upload method too.
I wrote this article using version 2 of the AWS SDK for JavaScript; version 3 has recently been released, so this exact code might not work with the new version!
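If you do move to version 3, the same low-level operations are exposed as commands in the @aws-sdk/client-s3 package. As a rough, untested sketch of how the first step would translate (the region is only a placeholder):
const {
  S3Client,
  CreateMultipartUploadCommand,
  UploadPartCommand,
  CompleteMultipartUploadCommand,
  AbortMultipartUploadCommand,
} = require("@aws-sdk/client-s3");

const client = new S3Client({ region: "us-east-1" }); // placeholder region

// v3 commands return promises directly, so no callback wrappers are needed
const startUploadV3 = async () => {
  const { UploadId } = await client.send(
    new CreateMultipartUploadCommand({
      Bucket: "<Your Bucket_name>",
      Key: "<Your Key for the file>",
    })
  );
  return UploadId;
};

// UploadPartCommand, CompleteMultipartUploadCommand and AbortMultipartUploadCommand
// follow the same client.send(...) pattern.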