In this post, we will discuss a solution for extracting data from a warehouse and saving it to a remote location. In a typical corporate IT environment, multiple applications read and manipulate the same data, so transactional applications often need their own copy of it. With this in mind, we will design a serverless batch process to do the following:

  • Extract data from a warehouse on a regular schedule
  • Upload the extracted data to an AWS S3 bucket

Extract data

The first step in developing our batch process is identifying the SQL needed to extract our data. Once we finalize our queries, we have to determine where to house them. Based on the design of our batch process, we can store our SQL queries in one of the following locations:

  1. Within our source code
  2. Remotely in an AWS S3 bucket

If we store our SQL within our source code, we will have to rebuild and deploy our application for every SQL change. On the other hand, if we store our SQL in S3, we won’t have to rebuild our app when a query changes, which lets us quickly change and test our SQL. For this design, we will store our SQL queries in an S3 bucket.
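As a rough illustration, the snippet below loads a query file from S3 at runtime. It assumes boto3 and a generic DB-API style warehouse connection; the bucket and key names are placeholders rather than part of this design.

  import boto3

  # Hypothetical bucket/key names used for illustration only.
  SQL_BUCKET = "my-batch-sql-bucket"
  SQL_KEY = "extracts/daily_orders.sql"

  def load_sql_from_s3(bucket, key):
      """Download a SQL file from S3 so queries can change without a redeploy."""
      s3 = boto3.client("s3")
      response = s3.get_object(Bucket=bucket, Key=key)
      return response["Body"].read().decode("utf-8")

  def run_extract(connection, sql_text):
      """Execute the extract query against the warehouse and return all rows."""
      cursor = connection.cursor()
      try:
          cursor.execute(sql_text)
          return cursor.fetchall()
      finally:
          cursor.close()

Because the query text is fetched on every run, updating a .sql object in the bucket changes the next extract without touching the application image.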

Upload data

Now that we can extract the data, we can start developing the program that will handle and store this data. While we won’t get into the specifics of developing the actual extract program, below you can find the aspects of the program we deemed necessary to include.

  • Storing extracted data in a highly redundant and available service
  • Detailed logging to simplify the debugging process
  • Dynamic execution of SQL files to future-proof our program
  • Modularization of our batch in order to rerun individual extracts instead of the entire batch when failures occur

We will store our extracted data in a versioned S3 bucket where applications can easily access it at all times. When failures occur, we always want to have enough information to diagnose the root cause; without detailed logging, our engineers will spend a lot of time debugging issues. Furthermore, we want to design our batch process so that future improvements can be added without creating technical debt. Therefore, we shouldn’t hard-code the SQL files used within our application. Instead, the SQL files should be configurable, in case other queries are added to the batch process in the future. Finally, we want to modularize our application so that we can rerun individual SQL files on demand. This will save us time and reduce the risk of introducing new errors when rerunning failed extracts.
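The sketch below ties these ideas together under a few assumptions: it reuses the load_sql_from_s3 and run_extract helpers from the earlier snippet, relies on Python’s standard logging, and the extract list, bucket names, and output keys are illustrative placeholders.

  import csv
  import io
  import logging
  import boto3

  logging.basicConfig(level=logging.INFO)
  logger = logging.getLogger("warehouse-extract")

  # Hypothetical configuration: each entry maps a SQL file in S3 to an output key.
  EXTRACTS = [
      {"sql_key": "extracts/daily_orders.sql", "output_key": "output/daily_orders.csv"},
      {"sql_key": "extracts/daily_customers.sql", "output_key": "output/daily_customers.csv"},
  ]
  DATA_BUCKET = "my-extract-data-bucket"  # versioned S3 bucket (placeholder name)

  def upload_rows(rows, bucket, key):
      """Serialize rows to CSV in memory and upload them to the versioned bucket."""
      buffer = io.StringIO()
      csv.writer(buffer).writerows(rows)
      boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buffer.getvalue().encode("utf-8"))

  def run_batch(connection, only=None):
      """Run every configured extract, or a single one when rerunning a failure."""
      for extract in EXTRACTS:
          if only and extract["sql_key"] != only:
              continue
          logger.info("Starting extract %s", extract["sql_key"])
          sql_text = load_sql_from_s3(SQL_BUCKET, extract["sql_key"])
          rows = run_extract(connection, sql_text)
          upload_rows(rows, DATA_BUCKET, extract["output_key"])
          logger.info("Uploaded %d rows to s3://%s/%s", len(rows), DATA_BUCKET, extract["output_key"])

Passing only lets an operator rerun a single failed extract, and new queries can be added to the configuration without rebuilding the application.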

Containerize our application

We need to containerize our application so that we can host it in the cloud and execute it on a schedule. Since typical batch processes are long running, we will leverage AWS Elastic Container Service (ECS) instead of AWS Lambda. Upon successful creation of our Dockerfile and entrypoint script, we can build and push our image to an AWS Elastic Container Registry (ECR). With our Docker image in ECR, we create a task definition that references the image and a CloudWatch Events rule to trigger the task on a schedule.
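For reference, one way to wire up the schedule with boto3 might look like the following. Every ARN, name, subnet, and the cron expression is a placeholder, and the same setup can just as easily be done in the console or with infrastructure-as-code.

  import boto3

  events = boto3.client("events")

  # Hypothetical ARNs and names used for illustration only.
  CLUSTER_ARN = "arn:aws:ecs:us-east-1:123456789012:cluster/batch-cluster"
  TASK_DEFINITION_ARN = "arn:aws:ecs:us-east-1:123456789012:task-definition/warehouse-extract:1"
  EVENTS_ROLE_ARN = "arn:aws:iam::123456789012:role/ecsEventsRole"

  # Rule that fires every day at 02:00 UTC.
  events.put_rule(
      Name="warehouse-extract-schedule",
      ScheduleExpression="cron(0 2 * * ? *)",
      State="ENABLED",
  )

  # Point the rule at the ECS task definition so the batch runs on Fargate.
  events.put_targets(
      Rule="warehouse-extract-schedule",
      Targets=[
          {
              "Id": "warehouse-extract-task",
              "Arn": CLUSTER_ARN,
              "RoleArn": EVENTS_ROLE_ARN,
              "EcsParameters": {
                  "TaskDefinitionArn": TASK_DEFINITION_ARN,
                  "TaskCount": 1,
                  "LaunchType": "FARGATE",
                  "NetworkConfiguration": {
                      "awsvpcConfiguration": {
                          "Subnets": ["subnet-0123456789abcdef0"],
                          "AssignPublicIp": "ENABLED",
                      }
                  },
              },
          }
      ],
  )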

Conclusion

We have officially designed a serverless batch process to extract data from a data warehouse and store it in an AWS S3 bucket. Based on our initial design, we know the following about our serverless batch process:

  • With the help of CloudWatch, our batch will run on a regular schedule
  • We only use batch resources during execution
  • Our extracted data is in a highly available location that provides versioning

The specifics may vary case by case, but we can use this blueprint across various cloud providers and data warehouses. If you have further questions regarding this design, please do not hesitate to contact us.
