Author: Kamlesh Kumar | Published: 05-Dec-2022
AWS Data Pipeline can be used to ease the movement of data between different services and platforms at regular intervals. It is a web service that consists of three basic parts:
1. The Pipeline – defines the data movement
2. The Scheduler – schedules the data flows
3. The Task Runner – a task agent that provisions the specified resources (e.g. an EC2 instance or EMR cluster) to process the activity objects. When AWS Data Pipeline terminates a resource, its logs are saved to an S3 bucket before it shuts down.
AWS Data Pipeline can be used to create a workflow that integrates the AWS services commonly used in a data warehouse/data analytics application, such as S3, Redshift, RDS, EMR, EC2, and SNS.
There are three distinct ways to create a workflow on AWS Data Pipeline (a minimal SDK sketch follows the list):
1. AWS Console
2. AWS CLI (Command Line Interface)
3. AWS SDK
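For the SDK route, the sketch below uses boto3 (Python). It is illustrative only: the pipeline name, uniqueId, S3 log bucket, and the bare-bones Default object are assumptions, not values from this article; DataPipelineDefaultRole and DataPipelineDefaultResourceRole are the standard default IAM roles.

import boto3

client = boto3.client("datapipeline", region_name="us-east-1")

# 1. Create an empty pipeline shell (uniqueId guards against accidental duplicates)
pipeline_id = client.create_pipeline(
    name="dw-demo-pipeline", uniqueId="dw-demo-pipeline-001"
)["pipelineId"]

# 2. Attach a definition; the low-level API expects objects as id/name/fields triples
client.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {
            "id": "Default",
            "name": "Default",
            "fields": [
                {"key": "scheduleType", "stringValue": "ondemand"},
                {"key": "role", "stringValue": "DataPipelineDefaultRole"},
                {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
                {"key": "pipelineLogUri", "stringValue": "s3://my-bucket/dp-logs/"},
            ],
        },
        # ...activity, resource, and data-node objects would be added here...
    ],
)

# 3. Activate the pipeline so the scheduler and Task Runner can start processing
client.activate_pipeline(pipelineId=pipeline_id)

The same three steps map directly onto the CLI (create-pipeline, put-pipeline-definition, activate-pipeline) and onto what the console does behind the scenes.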
A hypothetical example of a classic data warehouse application on AWS: an input file arrives on S3 and is processed by a Pig script running on EMR. The processed output is then loaded into Redshift, which feeds downstream analysis and/or reporting. The pipeline can also be scheduled and can send appropriate notifications.
Implementing the above workflow in AWS Data Pipeline involves a handful of components. The workflow itself sets the order of the load steps, removing the need for orchestration by any other framework; notifications are handled by the integrated SNS; and the data load operations are configured as Data Pipeline activity objects.
This workflow also provides the option to spin EC2 instances and EMR clusters up and down as needed, which helps reduce cost.
Another option is to keep "warm" (long-running) EC2 instances available by installing the Task Runner JAR on them.
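The components above can be expressed as a pipeline definition. The sketch below (Python, writing the JSON-style definition file accepted by the CLI) is an illustration under assumptions: bucket names, the Redshift cluster, table, credentials, instance sizes, the EMR release label, and the SNS topic ARN are all placeholders. Note the terminateAfter fields on the resources, which give the spin-up/spin-down behaviour described above; an activity can instead target a long-running Task Runner instance via a workerGroup, which corresponds to the "warm" option just mentioned.

import json

definition = {
    "objects": [
        # Pipeline-wide defaults: daily schedule, IAM roles, and S3 location for logs
        {"id": "Default", "name": "Default", "scheduleType": "cron",
         "schedule": {"ref": "DailySchedule"},
         "role": "DataPipelineDefaultRole",
         "resourceRole": "DataPipelineDefaultResourceRole",
         "pipelineLogUri": "s3://my-bucket/dp-logs/"},
        {"id": "DailySchedule", "name": "DailySchedule", "type": "Schedule",
         "period": "1 day", "startAt": "FIRST_ACTIVATION_DATE_TIME"},

        # Transient resources: created for the run, terminated afterwards to save cost
        {"id": "EmrResource", "name": "EmrResource", "type": "EmrCluster",
         "releaseLabel": "emr-5.36.0", "terminateAfter": "2 Hours"},
        {"id": "Ec2Resource", "name": "Ec2Resource", "type": "Ec2Resource",
         "instanceType": "t2.micro", "terminateAfter": "1 Hour"},

        # Step 1: Pig script on EMR reads the input file from S3, writes output back to S3
        {"id": "PigStep", "name": "PigStep", "type": "PigActivity",
         "runsOn": {"ref": "EmrResource"},
         "scriptUri": "s3://my-bucket/scripts/transform.pig",
         "input": {"ref": "InputS3"}, "output": {"ref": "OutputS3"},
         "onFail": {"ref": "FailureAlarm"}},
        {"id": "InputS3", "name": "InputS3", "type": "S3DataNode",
         "directoryPath": "s3://my-bucket/input/"},
        {"id": "OutputS3", "name": "OutputS3", "type": "S3DataNode",
         "directoryPath": "s3://my-bucket/output/"},

        # Step 2: load the processed output into Redshift for reporting/analysis
        {"id": "LoadToRedshift", "name": "LoadToRedshift", "type": "RedshiftCopyActivity",
         "runsOn": {"ref": "Ec2Resource"}, "dependsOn": {"ref": "PigStep"},
         "input": {"ref": "OutputS3"}, "output": {"ref": "RedshiftTable"},
         "insertMode": "TRUNCATE",
         "onFail": {"ref": "FailureAlarm"}},
        {"id": "RedshiftTable", "name": "RedshiftTable", "type": "RedshiftDataNode",
         "tableName": "part_color_counts", "database": {"ref": "RedshiftDb"}},
        {"id": "RedshiftDb", "name": "RedshiftDb", "type": "RedshiftDatabase",
         "clusterId": "my-redshift-cluster", "databaseName": "dw",
         "username": "dw_user", "*password": "CHANGE_ME"},

        # Notifications via the integrated SNS
        {"id": "FailureAlarm", "name": "FailureAlarm", "type": "SnsAlarm",
         "topicArn": "arn:aws:sns:us-east-1:111122223333:pipeline-alerts",
         "subject": "Data Pipeline step failed",
         "message": "Check the logs in s3://my-bucket/dp-logs/."},
    ]
}

# Write the definition to a file; it can then be pushed with, for example:
#   aws datapipeline put-pipeline-definition --pipeline-id <id> --pipeline-definition file://definition.json
with open("definition.json", "w") as f:
    json.dump(definition, f, indent=2)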
It is important to remember that the user need not hardcode the input and output paths for any activity in Data Pipeline; these are supplied by the workflow. As an example, the Pig script without hardcoded paths looks like this:
part = LOAD ${input1} USING PigStorage(',') AS (p_partkey,p_name,p_mfgr,p_category,p_brand1,p_color,p_type,p_size,p_container);
grpd = GROUP part BY p_color;
${output1} = FOREACH grpd GENERATE group, COUNT(part);
Here, the ${input1} and ${output1} placeholders are used instead of hardcoded S3 paths.
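One way these placeholders get their values is shown below, a fragment of the definition sketched earlier (a sketch, assuming the PigActivity's staging support; the object IDs and script location are the same illustrative assumptions as above). With staging enabled, the first input and output data nodes are exposed to the script as ${input1} and ${output1}, so no S3 path appears in the Pig code itself.

pig_step = {
    "id": "PigStep", "name": "PigStep", "type": "PigActivity",
    "runsOn": {"ref": "EmrResource"},
    "stage": "true",                                       # enable data staging
    "scriptUri": "s3://my-bucket/scripts/transform.pig",   # the script shown above
    "input": {"ref": "InputS3"},    # first input data node  -> ${input1}
    "output": {"ref": "OutputS3"},  # first output data node -> ${output1}
}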
The same approach also works when the Task Runner JAR is installed on a non-AWS machine that is integrated with the pipeline.
Business logic can be implemented on AWS Data Pipeline by integrating a Pig, Hive, or MapReduce script. Alternatively, the logic can be written as an SQL query, wrapped in an SqlActivity, and run on an EC2 instance.
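As a rough illustration of the SQL route (again a sketch: the table, query, and database/resource references reuse the hypothetical names from the earlier definition), an SqlActivity simply holds the query, points at a database object, and runs on an EC2 resource:

sql_step = {
    "id": "SqlStep", "name": "SqlStep", "type": "SqlActivity",
    "runsOn": {"ref": "Ec2Resource"},    # transient EC2 resource from the sketch above
    "database": {"ref": "RedshiftDb"},   # any supported database object
    # The business logic is plain SQL; it could also live in S3 and be referenced via scriptUri
    "script": "INSERT INTO color_summary SELECT p_color, COUNT(*) FROM part GROUP BY p_color;",
}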
To loop over and process multiple files, the business logic can be split across two pipelines built from Data Pipeline components (a sketch follows the list):
'Outer' pipeline: a Shell Activity script that loops through the input and output S3 paths.
'Inner' pipeline: executed via CLI commands, with those paths passed in as parameters.
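The outer pipeline's shell script would drive this loop through the AWS CLI; the sketch below shows the equivalent loop in Python/boto3 for clarity. The inner pipeline's ID, the bucket and prefix, and the parameter IDs myInputPath/myOutputPath are assumptions here, and those parameters would have to be defined as parameter objects in the inner pipeline's definition.

import boto3

s3 = boto3.client("s3")
dp = boto3.client("datapipeline")

INNER_PIPELINE_ID = "df-INNER1234567890"   # hypothetical inner pipeline ID
BUCKET, PREFIX = "my-bucket", "input/"     # hypothetical input location

# Loop through the input files on S3 and trigger one run of the inner
# pipeline per file, passing the input and output paths as parameters.
# (In practice one might wait for each run to finish before starting the next.)
resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
for obj in resp.get("Contents", []):
    key = obj["Key"]
    dp.activate_pipeline(
        pipelineId=INNER_PIPELINE_ID,
        parameterValues=[
            {"id": "myInputPath", "stringValue": f"s3://{BUCKET}/{key}"},
            {"id": "myOutputPath", "stringValue": f"s3://{BUCKET}/output/{key}"},
        ],
    )

A ShellCommandActivity in the outer pipeline would do the same thing by calling the aws datapipeline activate-pipeline command for each pair of paths.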
Using AWS Data Pipeline is certainly different from working with conventional ETL tools, but it leverages the benefits of the cloud far more effectively.
Looking to simplify your data migration? Let our cloud experts guide you every step of the way. Teleglobal is an AWS Advanced Partner specializing in AWS Data Pipeline services that connect S3, EMR, and Redshift, enabling secure workflows, faster processing, and reliable analytics.