All About AWS Data Pipeline

Author: Kamlesh Kumar

Published: 05-Dec-2022/em>

AWS Data Pipeline can be used to ease the movement of data between different platforms at regular intervals. It is a web service and consists of three basic parts:

1. The Pipeline – to enable data movement
2. A Scheduler – to schedule data flows
3. A Task Runner – the task agent configures specified resources, e.g. EC2 or EMR cluster, to process all activity objects. When the resource is terminated by the AWS data pipeline, the logs are saved to an S3 bucket before shutting down

The AWS Data Pipeline can be used to create a workflow that integrates components— S3, Redshift, RDS, EMR, EC2, and SNS—corresponding to other AWS services commonly used in a Data Warehouse/Data Analytics application.

Creating a Workflow on AWS Data Pipeline
There are three distinct ways to create a workflow on AWS Data Pipeline

1. AWS Console
2. AWS CLI (Command Line Interface)
3. AWS SDK

A hypothetical example: An input file is received on S3, then processed through a ‘Pig’ script on EMR in any classic Data Warehouse Application on AWS. After processing the output file is loaded on Redshift, which pipes it downstream to enable analysis and/or reporting. You can also schedule and send appropriate notifications.

Components

Components for the implementation of the above workflow in the Amazon Data pipeline include:

• Configuration: metadata at a workflow level, i.e.
– Name of the workflow
– Scheduling
– The storage location of logs

• Input Data Format: metadata for the file format provided as input, e.g. file format, such as. CSV; names and the number of rows/columns

• EMR Cluster: defining configuration for the EMR cluster. This cluster also has the Pig script running on it. Details include:
– Number of nodes
– Type of node
– Pricing for EC2 cluster

• S3 Input Data Node: Contains metadata for the Input file, i.e.
– File path
– Manifest file path
– Encryption, etc.

• SNS Alarm: SNS service to set up the alarm when the PIG activity is completed

• Output Data Format: Similar to Input Data Format, the Output Data Format specifies metadata for the file format for the output file after it is processed by the Pig script.

• Pig Activity: includes the S3 Path of the ‘Pig’ script along with other metadata like:
– Name of the cluster
– Where the job will run
– Arguments for the PIG script
– Maximum re-tries

• S3 Data Node: Has the path of the output file, which is processed by the ‘Pig’ script on the EMR cluster

• EC2 Resource: enables the EC2 server to trigger the copy command on the Redshift cluster. It contains data for:
– Type of EC2 instance
– Pricing type used for the instance
– Redshift Database: Includes information on the connection to the Redshift cluster, such as the Connection, URL, Username, and Password

• Redshift Copy Activity: Contains information such as:
The EC2 instance which will trigger the Redshift activity, number of re-tries, copy command options

• Redshift Data Node: has metadata like name & schema of the destination table

The workflow sets the order of the load and terminates the orchestration of any other framework. Notifications are handled by the integrated SNS, and data load operations are configured by components of the Data Pipeline.
This workflow provides an option to spin up/down EC2 instances and EMR clusters as needed and to help reduce the cost.

Option: Warm EC2 Instances

Another option is installing “warm” EC2 instances. These can be installed using the Task runner JAR on these EC2 Instances.
As in option 1, The Task runner continuously pings the Data Pipeline for instructions.
Meanwhile, in the EMR cluster, the process runs using the name-node protocol.
It is important to remember that the user need not hardcode the input and output paths for any activities in Data Pipeline. These are determined by the workflow. As an example, the ‘Pig’ script without hard coding looks like this:

part = LOAD ${input1} USING PigStorage(‘,’) AS (p_partkey,p_name,p_mfgr,p_category,p_brand1,p_color,p_type,p_size,p_container);
grpd = GROUP part BY p_color;
${output1} = FOREACH grpd GENERATE group, COUNT(part);

Here,“$” is used for ${input} and ${output} instead of hardcoded S3 paths. This same code can also be applied for the task runner JAR on a non-AWS machine to integrate it with the AWS pipeline.

How to apply business logic to AWS Data Pipeline

Business logic can be implemented on the AWS data pipeline by integrating a PIG, HIVE, or MapReduce script. The logic can be implemented using an SQL query, integrated into SQLActivity, and run on an EC2 instance.

While looping and processing multiple files, we can use various components of the AWS Data Pipeline to create an inner pipeline to implement this business logic in an “inner” pipeline.

‘Outer’ & ‘Inner’ Pipeline Workflows

‘Outer’ Pipeline: Using the Shell Activity script looping through the input and output S3 paths

‘Inner’ Pipeline: It can be executed using CLI commands with the same above paths as parameters.
Using the AWS Data Pipeline is certainly different from working on regular ETL tools, but it leverages the benefits of the cloud in a significantly better way.

Need a hand with your data migration? Reach out, today. Teleglobal is an AWS Advanced Partner, Azure Cloud Partner, Google Cloud Partner, Cloud Managed Services Provider, and independent Cloud Consultant.