When to Use AWS Lambda over an AWS Glue Job (A Use Case)
We had a task at hand: three transformations (let's call them T1, T2 and T3) for files generated by the execution of a multiomics pipeline, including converting the source files to Parquet. These transformations were written as Glue jobs using a Spark context. Each job had to read input files from a specific S3 location, transform and clean them, and write them back to another S3 location in Parquet format; we then had to query those files using AWS Athena. In short, we had to build an ETL pipeline for a specific set of data.
Input paths for the files were mostly fixed, with a little dynamicity that could be handled in a single line of code; calculating output paths for the transformed files, however, was complex. So we dedicated that task to a fourth job, which we called the Master Job. It would build a parameter.json file containing the input paths and calculated output paths for all the respective jobs, store it at a specific S3 location, and each of the other jobs would then read that file to get its own paths. The resulting parameter.json would look like:
{
  "T1": [
    {
      "input_path": "",
      "output_path": ""
    }
  ],
  "T2": [
    {
      "input_path": "",
      "output_path": ""
    }
  ],
  "T3": [
    {
      "input_path": "",
      "output_path": ""
    }
  ]
}
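As a rough sketch of the Master Job's path calculation, something like the following could build that document and push it to S3. The bucket names, run-id layout and object key here are hypothetical, not the ones from our actual pipeline:

```python
import json

# Hypothetical bucket names -- substitute your own layout.
RAW_BUCKET = "multiomics-raw"
CLEAN_BUCKET = "multiomics-clean"


def build_parameters(run_id):
    """Compute the input/output S3 paths for each transformation job (T1-T3)."""
    params = {}
    for job in ("T1", "T2", "T3"):
        params[job] = [{
            "input_path": f"s3://{RAW_BUCKET}/{run_id}/{job.lower()}/",
            "output_path": f"s3://{CLEAN_BUCKET}/{run_id}/{job.lower()}/parquet/",
        }]
    return params


def write_parameters(run_id, key="config/parameter.json"):
    """Upload parameter.json so the downstream Glue jobs can read it."""
    import boto3  # preinstalled in Glue Python shell jobs

    body = json.dumps(build_parameters(run_id), indent=2)
    boto3.client("s3").put_object(
        Bucket=CLEAN_BUCKET, Key=key, Body=body.encode("utf-8")
    )
```

Keeping the path calculation in a pure function (`build_parameters`) separate from the S3 upload makes it easy to unit-test the tricky part without touching AWS.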
First we implemented the solution in an AWS Glue Workflow; later, per our project requirements, we had to implement the same thing in a Step Function (I won't go into detail about why we had to migrate from Workflows to SF).
While implementing the ETL in both Workflow and SF, we realised that passing parameters from one Glue job to another is anything but easy; in fact, it is next to impossible.
While implementing the Master Job as a Glue job for the Workflow, we were not fully aware of the capabilities of AWS Lambda functions.
So when we started implementing the Step Function for the same ETL process, we learned that a Lambda function's output can be passed as input to another state/step.
So I decided to write a Lambda function instead of a Glue job. In Lambda you simply return the object you want to pass as a parameter to the next function, job or step. While doing so, though, we hit a limitation: the output passed from one step to another in SF must not exceed 256 KB, so we decided to write the parameters to a JSON file instead.
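A minimal sketch of that handler might look like this, falling back to an S3 file when the payload would exceed the Step Functions limit. The bucket name is hypothetical, and the real path-building logic is assumed to arrive in the `event`:

```python
import json

# Step Functions caps the payload passed between states at 256 KB.
MAX_SF_PAYLOAD = 256 * 1024


def lambda_handler(event, context):
    """Return parameters directly when small; otherwise hand back an S3 pointer."""
    params = event.get("params", {})  # in our case, the computed job paths
    body = json.dumps(params)
    if len(body.encode("utf-8")) <= MAX_SF_PAYLOAD:
        # Small enough: whatever we return becomes the input of the next state.
        return params
    # Too large for a state transition: stash it in S3 and pass a pointer instead.
    import boto3  # imported lazily so the size check is testable without AWS

    boto3.client("s3").put_object(
        Bucket="my-etl-config",  # hypothetical bucket
        Key="parameter.json",
        Body=body.encode("utf-8"),
    )
    return {"parameter_file": "s3://my-etl-config/parameter.json"}
```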
Management asked me the reason for opting for Lambda over a Glue job, and I couldn't say much more than "Lambda is faster than a Glue job". I googled the difference, to no avail; ChatGPT didn't help either. So I did a bit of research to gather conclusive evidence for my decision to switch from a Glue job to Lambda.
So in this article I am going to try to shed some light on the difference between AWS Lambda and AWS Glue jobs, as I understand it:
AWS Glue Job:
- Jobs are mainly for ETL workflows, whether they are Spark ETL jobs or Python shell jobs.
- Shell jobs come preloaded with libraries like:
- Boto3
- NumPy
- SciPy
- Pandas
- csv
- gzip
- collections
- And many more.
- 2 to 100 DPUs can be allocated; the default is 10.
- 1 DPU = 4 vCPUs + 16 GB of memory.
- Glue jobs can be triggered with Glue scheduled triggers; Lambda has no scheduler of its own, though it can still be invoked on a schedule through Amazon EventBridge (CloudWatch Events) rules.
- A Lambda function runs for at most 15 minutes; a Glue job can run for 48 hours.
- A Lambda function's startup time is much faster than that of a shell job or Spark job.
- Glue jobs support only two languages: Python and Scala.
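On the Glue side of the design described earlier, each job could fetch the shared parameter.json and iterate over its own path pairs. This is only a sketch under the same hypothetical bucket/key layout as above; the actual Spark transformation is elided:

```python
import json


def job_paths(param_doc, job_name):
    """Pull this job's (input_path, output_path) pairs out of parameter.json."""
    return [(e["input_path"], e["output_path"]) for e in param_doc.get(job_name, [])]


def run_job(job_name, bucket="multiomics-clean", key="config/parameter.json"):
    """Fetch parameter.json from S3, then process each path pair."""
    import boto3  # preloaded in Glue jobs

    obj = boto3.client("s3").get_object(Bucket=bucket, Key=key)
    param_doc = json.loads(obj["Body"].read())
    for input_path, output_path in job_paths(param_doc, job_name):
        # Here the Spark job would read from input_path, transform/clean,
        # and write Parquet to output_path, e.g. roughly:
        #   df = spark.read.csv(input_path, header=True)
        #   df.write.parquet(output_path)
        pass
```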
AWS Lambda:
- One of the distinctive architectural properties of AWS Lambda is that many instances of the same function, or of different functions from the same AWS account, can be executed concurrently.
- Each Lambda function runs in its own container. When a function is created, Lambda packages it into a new container and then executes that container on a multi-tenant cluster of machines managed by AWS. Before the functions start running, each function’s container is allocated its necessary RAM and CPU capacity.
- Lambda supports many languages, including Python, Node.js, Ruby, Java, Go and C#.
- Lambda functions can be triggered by many kinds of events (S3 uploads, API Gateway requests, scheduled rules, and more).
- It charges you only for the compute you use and the network traffic you generate.
- As Lambda doesn't require an entire server and runs only when invoked, it works well for triggered or scheduled cleanup jobs.
- Use cases:
- Individual tasks run for a short time.
- Each task is generally self-contained.
- There is a large difference between the lowest and highest levels of the application's workload.
- Limitations:
- A Lambda function will time out after running for 15 minutes.
- The amount of RAM available to a Lambda function ranges from 128 MB to 3,008 MB, in 64 MB increments (AWS has since raised this limit to 10,240 MB, configurable in 1 MB increments).
- It has limited concurrent executions.
- It is suitable only for short-lived, small computations.
- The zipped Lambda deployment package must not exceed 50 MB, and the unzipped version must not be larger than 250 MB.
- By default, concurrent executions for all Lambda functions within a single AWS account and Region are limited to 1,000.
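To illustrate the event-driven point above, here is a minimal handler for the standard S3 "ObjectCreated" notification event, one of the many event sources Lambda supports (the event shape below follows the documented S3 notification format):

```python
def lambda_handler(event, context):
    """Collect the object URIs from an S3 'ObjectCreated' notification event."""
    uris = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        uris.append(f"s3://{bucket}/{key}")
    # In a real cleanup or ETL trigger, each URI would be processed here.
    return {"processed": uris}
```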
Conclusion:
- So, depending on your application's needs, you have to choose between AWS Glue jobs and AWS Lambda by weighing the points above.
Disclaimer:
- The information above is drawn from the AWS documentation and may change over time as AWS services evolve.
- The points mentioned above are my own and not of my employer/organisation.
- I make no claim of accuracy for the information given, as the underlying documentation may change periodically.
- Please feel free to correct any point or give your valuable suggestions.