At work we developed an app to build dynamic SQL queries using SQLAlchemy. The user can build the query they want and get the results in a CSV file. We also make use of Redshift's ability to unload a query's results to S3. The reason behind this is that if a query returns more than X rows, we can just have Redshift run it and store the CSV file in S3 for us. There is a slight problem with this: when a query isn't unloaded, we have the ability to put in column headers and then serve the CSV file. But when a query is unloaded, only the results are in the CSV and the column headers are left out.
Boto3 is the library to use for AWS interactions with Python. The docs are not bad at all and the API is intuitive. At its core, Boto3 is just a nice Python wrapper around the AWS API. Even though Boto3 is Python specific, the underlying API calls can be made from any library in any language.
Since only the larger queries were unloaded to a CSV file, these CSV files were large. Very large. Large enough to throw out-of-memory errors in Python.
The whole process had to look something like this:
Download the file from S3 -> Prepend the column header -> Upload the file back to S3
As I mentioned, Boto3 has a very simple API, especially for Amazon S3. If you're not familiar with S3, think of it as Amazon's unlimited FTP service or Amazon's Dropbox. The folders are called buckets and "filenames" are keys. Let's say you wanted to download a file in S3 to a local file using Boto3; here's a pretty simple approach from the docs using the Object class:
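A minimal sketch of that download, using boto3's real `Object.download_file` method (the bucket and key names are made up, and the helper name is my own; running it requires AWS credentials):

```python
def download_from_s3(bucket, key, dest):
    """Download one S3 object to a local file path."""
    import boto3  # imported here so the sketch stays self-contained

    s3 = boto3.resource("s3")
    s3.Object(bucket, key).download_file(dest)

# e.g. download_from_s3("mybucket", "hello.txt", "/tmp/hello.txt")
```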
Notice there's no authentication information? More on that here.
As I mentioned before, these files are large, so I couldn't just use download_file: the object would hit a MemoryError before the file could finish processing.
After a little bit of searching, I learned that calling .get() on an s3.Object() will retrieve the object's information and metadata from S3. One of the keys in that dict is Body, which is the contents of the file. Calling .get()['Body'] doesn't download the file immediately, though. What you get back is a StreamingBody instance. It behaves like a generator and comes with a .read(num_of_bytes) method, so it's really easy to chunk the downloads and control how many bytes you want at once. I'm going to pretend I have a file in S3: in the bucket 'mybucket', it's named 'hello.txt' and its contents are "hello".
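A sketch of that byte-at-a-time read. Since I can't hit S3 from here, `io.BytesIO` stands in for the StreamingBody — both support the same `.read(n)` interface — and the boto3 call that would produce the real body is shown in a comment:

```python
import io

# In real code the body would come from S3:
#   import boto3
#   body = boto3.resource("s3").Object("mybucket", "hello.txt").get()["Body"]
body = io.BytesIO(b"hello")

# Pull the object down one byte at a time.
while True:
    chunk = body.read(1)
    if not chunk:  # empty bytes means we've hit the end
        break
    print(chunk)  # b'h', b'e', b'l', b'l', b'o'
```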
By chunking the downloads, we can avoid memory errors on the download part completely. In my case, I didn't want to store anything locally; I wanted the download, modifications, and upload to happen all around the same time. You can definitely read more than one byte at a time, by the way! (One byte was just easier to demonstrate.)
I love generators. I love them so much! They're almost always an elegant solution. We know that we can get a generator-like object of the file we want from S3, which immediately told me that I could manipulate what gets yielded. Let's create a higher-level function that will act as our file stream.
This function basically grabs the object from S3 and starts yielding chunks of it one at a time. With a little modification, it could be changed to yield the headers first, then the file. But what if I just checked whether the iteration was the first one, and yielded the column headers plus the first chunk of the file if it was? Let's find out.
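A sketch of that first-iteration check (the function name, headers, and sample rows are all made up, and `io.BytesIO` again stands in for the S3 StreamingBody so the snippet runs without AWS credentials):

```python
import io

def stream_with_headers(body, headers, chunk_size=1024):
    """Yield the column headers glued onto the first chunk, then the rest."""
    first = True
    while True:
        chunk = body.read(chunk_size)
        if not chunk:
            return
        if first:
            first = False
            yield headers + chunk  # prepend headers exactly once
        else:
            yield chunk

# Pretend this is the unloaded CSV sitting in S3:
body = io.BytesIO(b"1,anthony\n2,fox\n")
for chunk in stream_with_headers(body, b"id,name\n", chunk_size=8):
    print(chunk)
# b'id,name\n1,anthon'
# b'y\n2,fox\n'
```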
The output is the column headers followed by the file's contents, exactly as if the headers had been in the CSV all along.
Amazon S3 offers an API for multipart uploads, which essentially lets us upload a single file in multiple parts, and Boto3 supports it as well. That process is a little more complicated, but still not hard!
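A sketch of the multipart flow using boto3's real client calls (`create_multipart_upload`, `upload_part`, `complete_multipart_upload`); the function name is my own, and it needs AWS credentials to actually run. One S3 rule worth knowing: every part except the last must be at least 5 MB.

```python
def multipart_upload(bucket, key, chunks):
    """Upload an iterable of byte chunks to S3 as a single object."""
    import boto3  # imported here so the sketch stays self-contained

    client = boto3.client("s3")
    upload = client.create_multipart_upload(Bucket=bucket, Key=key)
    parts = []
    try:
        # Part numbers start at 1, and S3 uses them to reassemble the file.
        for number, chunk in enumerate(chunks, start=1):
            part = client.upload_part(
                Bucket=bucket,
                Key=key,
                UploadId=upload["UploadId"],
                PartNumber=number,
                Body=chunk,
            )
            parts.append({"ETag": part["ETag"], "PartNumber": number})
        client.complete_multipart_upload(
            Bucket=bucket,
            Key=key,
            UploadId=upload["UploadId"],
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Abandoned multipart uploads still take up (billable) space.
        client.abort_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload["UploadId"]
        )
        raise
```

Feeding it the header-prepending generator from above is the whole trick: chunks stream out of one S3 object, get the headers glued on, and stream right back into another.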
Note: Even though Python is mentioned specifically, that's just the language I used and the language the code examples are in. This could really be done in any language, since the main functionality we're using is built into Amazon S3's API.
I'm also not saying this is a good solution... just the one I came up with, and it seems to work pretty well.
This solution worked great and is still used daily in our app. As I mentioned, I’m certain there are many other ways to solve this problem and I’d love to hear what you think!
Authored by Anthony Fox on 2017-07-21