In this post, we will discuss a solution for extracting data from a warehouse and saving it to a remote location. In a typical corporate IT environment, multiple applications read and manipulate the same data, so transactional applications often need their own copy of it. With this in mind, we will design a serverless batch process to do the following:

  • Extract data from a warehouse on a regular schedule
  • Upload the extracted data to an AWS S3 bucket

Extract data

The first step in developing our batch process is identifying the SQL needed to extract our data. Once we finalize our queries, we have to determine where to house them. Based on the design of our batch process, we can store our SQL queries in one of the following locations:

  1. Within our source code
  2. Remotely in an AWS S3 bucket

If we store our SQL within our source code, we will have to rebuild and deploy our application for every SQL change. On the other hand, if we store our SQL in S3, we won’t have to rebuild our app when a query changes, which lets us quickly change and test our SQL. For this design, we will store our SQL queries in an S3 bucket.
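As a rough illustration, the snippet below loads a query file from S3 at runtime. It assumes boto3 and a generic DB-API style warehouse connection; the bucket and key names are placeholders rather than part of this design.

  import boto3

  # Hypothetical bucket/key names used for illustration only.
  SQL_BUCKET = "my-batch-sql-bucket"
  SQL_KEY = "extracts/daily_orders.sql"

  def load_sql_from_s3(bucket, key):
      """Download a SQL file from S3 so queries can change without a redeploy."""
      s3 = boto3.client("s3")
      response = s3.get_object(Bucket=bucket, Key=key)
      return response["Body"].read().decode("utf-8")

  def run_extract(connection, sql_text):
      """Execute the extract query against the warehouse and return all rows."""
      cursor = connection.cursor()
      try:
          cursor.execute(sql_text)
          return cursor.fetchall()
      finally:
          cursor.close()

Because the query text is fetched on every run, updating a .sql object in the bucket changes the next extract without touching the application image.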

Upload data

Now that we can extract the data, we can start developing the program that will handle and store this data. While we won’t get into the specifics of developing the actual extract program, below you can find the aspects of the program we deemed necessary to include.

  • Storing extracted data in a highly redundant and available service
  • Detailed logging to simplify the debugging process
  • Dynamic execution of SQL files to future-proof our program
  • Modularization of our batch in order to rerun individual extracts instead of the entire batch when failures occur

We will store our extracted data in a versioned S3 bucket where applications can easily access it at all times. When failures occur, we always want to have enough information to diagnose the root cause; without detailed logging, our engineers will spend a lot of time debugging issues. Furthermore, we want to design our batch process so that future improvements can be added without creating technical debt. Therefore, we shouldn’t hard-code the SQL files used within our application. Instead, the SQL files should be configurable, in case other queries are added to the batch process in the future. Finally, we want to modularize our application so that we can rerun individual SQL files on demand. This will save us time and reduce the risk of introducing new errors when rerunning failed extracts.
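The sketch below ties these ideas together under a few assumptions: it reuses the load_sql_from_s3 and run_extract helpers from the earlier snippet, relies on Python’s standard logging, and the extract list, bucket names, and output keys are illustrative placeholders.

  import csv
  import io
  import logging
  import boto3

  logging.basicConfig(level=logging.INFO)
  logger = logging.getLogger("warehouse-extract")

  # Hypothetical configuration: each entry maps a SQL file in S3 to an output key.
  EXTRACTS = [
      {"sql_key": "extracts/daily_orders.sql", "output_key": "output/daily_orders.csv"},
      {"sql_key": "extracts/daily_customers.sql", "output_key": "output/daily_customers.csv"},
  ]
  DATA_BUCKET = "my-extract-data-bucket"  # versioned S3 bucket (placeholder name)

  def upload_rows(rows, bucket, key):
      """Serialize rows to CSV in memory and upload them to the versioned bucket."""
      buffer = io.StringIO()
      csv.writer(buffer).writerows(rows)
      boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buffer.getvalue().encode("utf-8"))

  def run_batch(connection, only=None):
      """Run every configured extract, or a single one when rerunning a failure."""
      for extract in EXTRACTS:
          if only and extract["sql_key"] != only:
              continue
          logger.info("Starting extract %s", extract["sql_key"])
          sql_text = load_sql_from_s3(SQL_BUCKET, extract["sql_key"])
          rows = run_extract(connection, sql_text)
          upload_rows(rows, DATA_BUCKET, extract["output_key"])
          logger.info("Uploaded %d rows to s3://%s/%s", len(rows), DATA_BUCKET, extract["output_key"])

Passing only lets an operator rerun a single failed extract, and new queries can be added to the configuration without rebuilding the application.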

Containerize our application

We need to containerize our application so that we can host it in the cloud and execute it on a schedule. Since typical batch processes are long running, we will leverage AWS Elastic Container Service (ECS) instead of AWS Lambda. Upon successful creation of our Dockerfile and entrypoint script, we can build and push our image to an AWS Elastic Container Registry (ECR). With our Docker image in ECR, we create a task definition that references the image and a CloudWatch Events rule to trigger the task on a schedule.
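For reference, one way to wire up the schedule with boto3 might look like the following. Every ARN, name, subnet, and the cron expression is a placeholder, and the same setup can just as easily be done in the console or with infrastructure-as-code.

  import boto3

  events = boto3.client("events")

  # Hypothetical ARNs and names used for illustration only.
  CLUSTER_ARN = "arn:aws:ecs:us-east-1:123456789012:cluster/batch-cluster"
  TASK_DEFINITION_ARN = "arn:aws:ecs:us-east-1:123456789012:task-definition/warehouse-extract:1"
  EVENTS_ROLE_ARN = "arn:aws:iam::123456789012:role/ecsEventsRole"

  # Rule that fires every day at 02:00 UTC.
  events.put_rule(
      Name="warehouse-extract-schedule",
      ScheduleExpression="cron(0 2 * * ? *)",
      State="ENABLED",
  )

  # Point the rule at the ECS task definition so the batch runs on Fargate.
  events.put_targets(
      Rule="warehouse-extract-schedule",
      Targets=[
          {
              "Id": "warehouse-extract-task",
              "Arn": CLUSTER_ARN,
              "RoleArn": EVENTS_ROLE_ARN,
              "EcsParameters": {
                  "TaskDefinitionArn": TASK_DEFINITION_ARN,
                  "TaskCount": 1,
                  "LaunchType": "FARGATE",
                  "NetworkConfiguration": {
                      "awsvpcConfiguration": {
                          "Subnets": ["subnet-0123456789abcdef0"],
                          "AssignPublicIp": "ENABLED",
                      }
                  },
              },
          }
      ],
  )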

Conclusion

We have officially designed a serverless batch process to extract data from a data warehouse and store it in an AWS S3 bucket. Based on our initial design, we know the following about our serverless batch process:

  • With the help of CloudWatch, our batch will run on a regular schedule
  • We only use batch resources during execution
  • Our extracted data is in a highly available location that provides versioning

The specifics may vary case by case, but we can use this blueprint across various cloud providers and data warehouses. If you have further questions regarding this design, please do not hesitate to contact us.
