AMAZON EKS BASED GENOMICS DATA ANALYSIS PLATFORM WITH AWS PRIMARY STORAGE SOLUTIONS


Name and Sector of Client:

The client is a major healthcare company in the genomics research field, offering services across India. Their expertise includes:

  • Advanced genetic data analysis capabilities, developed in response to diseases and applied to urgent health challenges.
  • Partnerships with local hospitals across Hyderabad, enabling wide-reaching impact in healthcare research.
  • Large-scale genetic information processing using a sophisticated data analysis platform built on Amazon EKS.
  • Handling of complex genomic workflows, from raw sequencing data to final analysis outputs.
  • Multi-institutional collaboration, analyzing genetic data from multiple healthcare institutions across India.

Data Sources of Client:

Because of the breadth of their research model, they gather a huge volume of data from the sources listed below, among others.

The organization pulls genomics data from external research databases:

  • Public genomic databases (e.g., NCBI, EBI)
  • SRA (Sequence Read Archive) for raw sequencing data
  • Human Genome Diversity Project (HGDP) for population genetics studies.

Partnered Hospitals/Healthcare Institutions: 

  • Hospital information systems
  • Laboratory information management systems (LIMS)
  • Electronic health records (EHRs)

Data type and Volume:

Raw Sequencing Data (FASTQ files):

  • Category: Unprocessed genomic data
  • Amount: Varies widely, but typically around 10-30 GB per human genome
  • Daily ingestion: Depends on sequencing capacity, could be 1-10 full genomes per day

Aligned Sequence Data (BAM files):

  • Category: Processed genomic data
  • Amount: Similar to or slightly larger than the raw data, around 10-50 GB per genome
  • Daily ingestion: Corresponds to raw data processing, 1-10 per day

Clinical Data:

  • Category: Patient information, medical history, test results
  • Amount: Much smaller, typically <5 MB per patient
  • Daily ingestion: Could be hundreds of records, depending on hospital partnerships

These file types represent different stages of the genomic analysis procedure. The process starts with FASTQ files, which are aligned to a reference genome to produce BAM files; various analyses are then performed on the aligned data.

Current Total Monthly Data Ingestion: Approximately 15-20 TB per month
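
As a rough sanity check, the per-genome sizes and daily counts listed above can be multiplied out. This is only a back-of-the-envelope sketch: the 30-day month and decimal units (1 TB = 1000 GB) are assumptions, and clinical data is ignored because it is negligible by comparison.

```python
# Back-of-the-envelope estimate from the ranges listed above.
# Assumptions (not from the source): 30-day month, 1 TB = 1000 GB.
FASTQ_GB = (10, 30)         # raw data per genome (low, high)
BAM_GB = (10, 50)           # aligned data per genome (low, high)
GENOMES_PER_DAY = (1, 10)   # daily genomes (low, high)
DAYS_PER_MONTH = 30


def monthly_tb(genomes_per_day, fastq_gb, bam_gb, days=DAYS_PER_MONTH):
    """Monthly volume in TB for one combination of the listed figures."""
    return genomes_per_day * (fastq_gb + bam_gb) * days / 1000


low = monthly_tb(GENOMES_PER_DAY[0], FASTQ_GB[0], BAM_GB[0])
high = monthly_tb(GENOMES_PER_DAY[1], FASTQ_GB[1], BAM_GB[1])
print(f"{low:.1f} TB to {high:.1f} TB per month")
```

Under these assumptions the reported 15-20 TB per month sits in the upper part of the estimated range, which is consistent with the platform running near the high end of its daily sequencing capacity.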

Data from all of these sources is ingested into Amazon EKS using APIs provided by the third parties.

 

Applications running on Amazon EKS:

  • FastQC: a quality control tool for checking the quality of raw sequencing data (FASTQ files).
  • Trimmomatic: a CLI tool for trimming and filtering low-quality reads in text-based FASTQ files.
  • BWA (Burrows-Wheeler Aligner): a tool for aligning DNA sequences to a reference genome.
  • Bowtie2: a tool for aligning DNA and RNA sequences.
  • Picard: a toolkit for manipulating and analyzing SAM/BAM files.

 

Analysis process:

  1. Raw data (FASTQ files) is read from the database, and processing with the tools begins.
  2. Intermediate results (e.g., aligned BAM files, variant call format files) are stored in Amazon EFS for quick access by the other pods that run subsequent steps in the pipeline.
  3. Final processed data (JSON, YAML, CSV, HTML, BAM) is generated and can be stored back in long-term storage.

Challenges faced by the client:

  • Data had to be accessible across multiple Availability Zones (AZs) in line with ICMR guidelines and Indian healthcare regulations, ensuring continuous availability and disaster recovery capabilities for sensitive genomic information.
  • Storage provisioning and management needed to be streamlined to significantly reduce administrative overhead and let researchers focus exclusively on genomics work.
  • The environment needed cost optimization that aligns storage and compute resource usage with actual needs, eliminating unnecessary expenses.
  • Single-region storage violated industry standards for data redundancy and disaster recovery, risking critical data loss and regulatory non-compliance.
  • Inconsistent backup and retention practices failed to meet regulatory requirements for genomic research data preservation, jeopardizing research integrity and legal compliance.
  • Researchers and other users were given bucket-level permissions, which gave them access to all the data in the bucket. No granular permissions were in place for accessing data in Amazon S3, which raised the risk of users accessing data they were not meant to see.
  • The environment suffered from significant over-provisioning and underutilization of resources, resulting in unnecessary costs and inefficient resource allocation that called for optimization.

Proposed Solution and Architecture:

Amazon EFS is utilized to store various types of genomic data crucial for research and analysis. This includes large volumes of raw sequencing data in FASTQ files generated by sequencing machines, which need to be accessed quickly and concurrently by multiple researchers and pods for quality control and preprocessing.

During sequence alignment, worker nodes and pods in the Amazon EKS cluster process these raw data files using tools like BWA and Bowtie2, producing aligned BAM files.

These BAM files, along with intermediate results and reference genomes, must be immediately available to other nodes and pods for further processing and analysis, ensuring data consistency and eliminating the need for data duplication or transfer.

Type of Data stored in Amazon EFS and Amazon S3:

Data kept in Amazon EFS:

  • Active research data: Currently processed genomic sequences and ongoing analyses.
  • Intermediate results: Temporary files generated during analysis pipelines.
  • Frequently accessed reference data: Common reference genomes or annotation files which are frequently accessed by the Researchers.
  • Shared scripts and tools: Custom analysis scripts and frequently used bioinformatics tools.

Data transferred to Amazon S3 periodically via Amazon DataSync:

  • Completed analysis results: Finalized variant calls, gene expression profiles, or assembled genomes.
  • Raw sequencing data: Original FASTQ files after initial quality control and preprocessing.
  • Large, processed datasets: Aligned BAM files or variant call format (VCF) files from completed analyses.
  • Periodic snapshots: Hourly backups of critical research data for disaster recovery.

This data cannot be kept in Amazon EFS for an extended period because of the high cost and the potential performance degradation as the volume of data grows. It is therefore moved to Amazon S3 via an Amazon DataSync job for more cost-effective, scalable, and durable long-term storage.

 

Dynamic Provisioning Implementation for Streamlining storage provisioning

Why choose Dynamic Provisioning over Static Provisioning?

Dynamic provisioning allocates storage resources precisely when they are needed, based on the application’s requirements. This avoids the inefficiency of pre-allocating fixed storage sizes, as static provisioning does, and allows storage allocation to be adjusted without downtime or manual resizing.

In this scenario, the amount of storage to provision was uncertain, which made dynamic provisioning a natural fit. Dynamic provisioning uses Kubernetes StorageClasses to automatically create PersistentVolumes (PVs) when PersistentVolumeClaims (PVCs) are made.

For example, if an application needs a 100GB volume, a PVC can be created specifying this size, and Kubernetes will automatically provision a new PV of 100GB using the defined StorageClass. This eliminates the manual creation and management of PVs, reducing administrative overhead and ensuring that storage is available when needed.

Teleglobal installed the Amazon EFS CSI driver in the Amazon EKS cluster as an EKS add-on, which automates the installation of the necessary utilities and manages Amazon EFS volumes. The dynamic provisioning solution was implemented using this add-on.
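
To make the StorageClass and PVC relationship concrete, here is a minimal sketch that builds both manifests as Python dictionaries and prints them as JSON, which kubectl accepts alongside YAML. The file system ID, object names, and namespace are hypothetical placeholders, not values from the client environment. The provisioner name and parameters are those used by the Amazon EFS CSI driver for access-point-based dynamic provisioning.

```python
import json

# Hypothetical placeholder; replace with the real Amazon EFS file system ID.
FILE_SYSTEM_ID = "fs-0123456789abcdef0"

storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "efs-sc"},  # assumed name
    "provisioner": "efs.csi.aws.com",
    "parameters": {
        "provisioningMode": "efs-ap",  # one EFS access point per volume
        "fileSystemId": FILE_SYSTEM_ID,
        "directoryPerms": "700",
    },
}

pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "genomics-scratch", "namespace": "default"},  # assumed
    "spec": {
        "accessModes": ["ReadWriteMany"],  # shared by pods on different nodes
        "storageClassName": storage_class["metadata"]["name"],
        "resources": {"requests": {"storage": "100Gi"}},
    },
}

# A "List" object lets both manifests be applied from a single file.
manifest = {"apiVersion": "v1", "kind": "List", "items": [storage_class, pvc]}
print(json.dumps(manifest, indent=2))
```

The output can be saved to a file and applied with `kubectl apply -f`. Because Amazon EFS is elastic, the requested size is required by Kubernetes but is not enforced as a capacity limit by the driver.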

 

Why introduce Amazon EFS?

Amazon S3: Amazon S3 is an object storage service that does not provide POSIX-compliant file system semantics. It is designed for storing and retrieving large objects, but it does not support the fine-grained control and real-time file access needed by computational tasks that rely on standard file operations.

Amazon EBS: Amazon EBS is POSIX compliant, but it cannot be used here because a volume cannot be mounted on multiple Amazon EC2 instances (worker nodes) across the cluster. Amazon EBS volumes are also AZ-specific and cannot share data outside their AZ.

Amazon EFS: Amazon EFS is POSIX compliant, so it supports the standard file system semantics required by bioinformatics tools.

Amazon EFS can be mounted on multiple Amazon EC2 instances across different AZs, providing shared, scalable, and distributed file storage that ensures data redundancy, high availability, and seamless access to shared data across the Amazon EKS cluster.

This makes it well suited to the high-performance, real-time data processing needs of the genomics analysis platform.

 

Amazon S3 enhancements suggested by Teleglobal:

Data coming into Amazon S3 is categorized and then stored in buckets with specific tags.

Users had been given permissions based on their requirements, but only at the bucket level and not at the object level.

This increased the risk of users seeing all the different kinds of data present in a folder.

Teleglobal suggested tag-based access to objects, so that users can reach only the required bucket, the required folder, and the required objects within it. This extends access control down to the object level.
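
One way to express tag-based object access is an IAM policy that uses the `s3:ExistingObjectTag/<key>` condition key. The sketch below builds such a policy; the bucket name, prefix, tag key, and tag value are hypothetical, and the policies actually deployed for the client are not shown in this case study.

```python
import json


def tag_scoped_read_policy(bucket, prefix, tag_key, tag_value):
    """Build an IAM policy that allows GetObject only on objects under
    `prefix` that carry the tag `tag_key=tag_value`."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadTaggedObjectsOnly",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
                "Condition": {
                    "StringEquals": {
                        f"s3:ExistingObjectTag/{tag_key}": tag_value
                    }
                },
            }
        ],
    }


# Hypothetical bucket, prefix, and tag values for illustration only.
policy = tag_scoped_read_policy(
    "genomics-data", "projectA/", "category", "processed_genome"
)
print(json.dumps(policy, indent=2))
```

A researcher with only this policy attached could read tagged processed genomes under `projectA/` but nothing else in the bucket, which is the object-level granularity described above.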

 

The Teleglobal team wrote an AWS Lambda function to tag new objects uploaded to the Amazon S3 buckets through Amazon DataSync.

This AWS Lambda function is triggered by Amazon EventBridge and runs after an Amazon DataSync job completes successfully.

The function, which has permissions such as PutObjectTagging and ListBuckets, extracts the bucket name and the folder prefix under which the data is stored in Amazon S3.

Objects follow a distinct naming convention, for example pg for processed_genome and rs for raw_sequence.

The tagging logic is built on these parameters, namely the naming convention and the prefix of the object’s destination folder, and the objects uploaded to the Amazon S3 bucket are tagged accordingly.
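
The tagging logic described above might look roughly like the following. This is a sketch, not Teleglobal's function: it assumes that file names begin with the category code followed by an underscore (for example `pg_001.bam`) and that the first folder in the key is the project name. Both are illustrative assumptions, as are the tag keys `category` and `project_name`.

```python
from posixpath import basename

# Assumed mapping; the source only mentions "pg" and "rs" as examples.
CATEGORY_BY_CODE = {"pg": "processed_genome", "rs": "raw_sequence"}


def tags_for_key(key):
    """Derive an S3 TagSet from an object key such as
    'projectA/results/pg_001.bam' (hypothetical layout)."""
    parts = key.split("/")
    code = basename(key).split("_", 1)[0].lower()
    tags = {"category": CATEGORY_BY_CODE.get(code, "unclassified")}
    if len(parts) > 1:
        tags["project_name"] = parts[0]
    return [{"Key": k, "Value": v} for k, v in sorted(tags.items())]


def tag_prefix(s3, bucket, prefix):
    """Tag every object under `prefix`; `s3` is a boto3 S3 client."""
    tagged = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            s3.put_object_tagging(
                Bucket=bucket,
                Key=obj["Key"],
                Tagging={"TagSet": tags_for_key(obj["Key"])},
            )
            tagged += 1
    return tagged


print(tags_for_key("projectA/results/pg_001.bam"))
```

Note that `put_object_tagging` replaces the whole tag set of an object, so a production function would need to merge any existing tags it wants to preserve.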

How did Teleglobal ensure that already existing data was tagged?

An AWS Lambda function checks whether the objects in its target bucket are tagged and, if not, tags them based on parameters such as project_name and the naming convention.

How did Teleglobal monitor the tagging mechanism?

Teleglobal wrote an AWS Lambda function that runs daily to check whether the objects in the buckets are tagged.

If any untagged objects are found, an Amazon SNS alert with the list of untagged objects is sent to the team so that immediate action can be taken.
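
A daily check of this kind could be sketched as below. The function and parameter names are hypothetical; `s3` and `sns` are expected to be boto3 clients, passed in so the logic can be exercised without AWS access.

```python
def find_untagged(s3, bucket):
    """Return the keys of all objects in `bucket` that have no tags."""
    untagged = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            tag_set = s3.get_object_tagging(
                Bucket=bucket, Key=obj["Key"]
            )["TagSet"]
            if not tag_set:
                untagged.append(obj["Key"])
    return untagged


def report_untagged(s3, sns, bucket, topic_arn):
    """Publish an SNS alert listing untagged keys, if there are any."""
    keys = find_untagged(s3, bucket)
    if keys:
        sns.publish(
            TopicArn=topic_arn,
            Subject=f"Untagged objects in {bucket}"[:100],  # SNS subject limit
            Message="\n".join(keys),
        )
    return keys
```

Inside a Lambda handler this would be called as `report_untagged(boto3.client("s3"), boto3.client("sns"), bucket, topic_arn)`. For very large buckets the per-object `get_object_tagging` call becomes slow, so a real implementation might restrict the scan to recently modified objects.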

 

Transfer Data from Amazon EFS to Amazon S3 Using DataSync

  • Amazon DataSync efficiently transfers data from Amazon EFS to Amazon S3 at regular intervals, ensuring cost-effective long-term storage of processed genomic data.
  • An AWS Lambda function automates the initiation of the Amazon DataSync job, minimizing operational overhead and reducing the need for manual intervention.
  • Amazon EventBridge triggers the AWS Lambda function every 30 minutes, maintaining a consistent data transfer cadence that aligns with research workflows and data generation patterns.
  • By moving data from Amazon EFS to Amazon S3, the solution optimizes storage costs while maintaining data accessibility for future reference or analysis.
  • This setup handles increasing data volumes, supporting the organization’s growth and evolving research needs without requiring significant infrastructure changes.

Regular data transfers facilitate data lifecycle management, aiding in meeting retention policies and regulatory requirements specific to genomic research.
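
A Lambda function that starts the transfer can be very small. The following is a minimal sketch rather than the script used in the project: the environment variable name `DATASYNC_TASK_ARN` is an assumption, and the optional `datasync` argument exists only so the handler can be tested with a stand-in client.

```python
import os


def lambda_handler(event, context, datasync=None):
    """Start the EFS-to-S3 DataSync task; meant to be invoked by an
    EventBridge schedule such as rate(30 minutes)."""
    if datasync is None:
        import boto3  # provided by the AWS Lambda Python runtime
        datasync = boto3.client("datasync")

    # Hypothetical environment variable holding the DataSync task ARN.
    task_arn = os.environ["DATASYNC_TASK_ARN"]
    response = datasync.start_task_execution(TaskArn=task_arn)
    return {"taskExecutionArn": response["TaskExecutionArn"]}
```

If a previous execution of the same task is still running, DataSync queues the new one, so a 30-minute schedule does not produce overlapping transfers of the same task.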

Script:




Architecture:



Security:

  • Encrypted Amazon EBS volumes using AWS KMS to secure the data stored on Amazon EBS. Teleglobal suggested using multi-Region AWS KMS keys to reduce the overhead of managing multiple keys for a single volume.
  • Encrypted data transfers from Amazon EFS to Amazon S3 via Amazon DataSync using TLS.
  • Configured security groups to allow NFS traffic (port 2049) between worker nodes and Amazon EFS mount targets.
  • Updated the Amazon EFS file system policy to allow the elasticfilesystem:ClientWrite, elasticfilesystem:ClientRead, and elasticfilesystem:ClientMount actions.
  • Encrypted the data at rest in Amazon EFS.
  • Encrypted data in transit through the Amazon EFS file system policy option “Enforce in-transit encryption for all clients”.
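
The file system policy described above can be sketched as a JSON document built in Python. This is an illustration under assumptions, not the client's actual policy: the ARN is a placeholder, the principal is left open, and only the `ClientMount` and `ClientWrite` client actions are included because those are the ones this sketch can vouch for (mounting already grants read access). The `aws:SecureTransport` deny statement is the usual way in-transit encryption is enforced in such a policy.

```python
import json


def efs_file_system_policy(file_system_arn):
    """Allow client mount/write access and deny any non-TLS connection."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowClientAccess",
                "Effect": "Allow",
                "Principal": {"AWS": "*"},  # narrow to specific roles in practice
                "Action": [
                    "elasticfilesystem:ClientMount",
                    "elasticfilesystem:ClientWrite",
                ],
                "Resource": file_system_arn,
            },
            {
                "Sid": "DenyUnencryptedTransport",
                "Effect": "Deny",
                "Principal": {"AWS": "*"},
                "Action": "*",
                "Resource": file_system_arn,
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
        ],
    }


# Placeholder ARN for illustration only.
example_arn = (
    "arn:aws:elasticfilesystem:ap-south-1:111122223333:"
    "file-system/fs-0123456789abcdef0"
)
print(json.dumps(efs_file_system_policy(example_arn), indent=2))
```

With the deny statement in place, pods must mount the file system with TLS enabled, which the Amazon EFS CSI driver does by default.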





Monitoring:

Implemented Container Insights to obtain performance metrics for the Amazon EKS pods.

Important metrics such as pod restarts and node CPU/RAM utilization were monitored through the Amazon CloudWatch automatic dashboards created when Container Insights was deployed on the Amazon EKS cluster.

Used Amazon CloudWatch metrics such as DataReadIOBytes, PercentIOLimit, and PermittedThroughput to gauge the performance of Amazon EFS.

Similarly, Teleglobal used CloudWatch metrics to monitor Amazon S3, such as request metrics and replication metrics.

Important metrics used for DataSync included BytesTransferred, BytesPreparedSource, and BytesWritten.
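
For ad hoc checks outside the dashboards, the same Amazon EFS metrics can be queried through the CloudWatch API. The sketch below only builds the request parameters; the file system ID is a placeholder and the one-hour window, 5-minute period, and `Maximum` statistic are illustrative choices.

```python
from datetime import datetime, timedelta, timezone


def efs_metric_query(file_system_id, metric_name="PercentIOLimit",
                     hours=1, now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EFS",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "FileSystemId", "Value": file_system_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,               # seconds per datapoint
        "Statistics": ["Maximum"],
    }


# Placeholder file system ID for illustration only.
query = efs_metric_query("fs-0123456789abcdef0")
print(query["Namespace"], query["MetricName"])
```

The dictionary can be passed directly as `boto3.client("cloudwatch").get_metric_statistics(**query)`. A sustained `PercentIOLimit` near 100 suggests the file system is close to the I/O limit of General Purpose performance mode.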

 

Logging:

Suggested enabling API server logging for Amazon EKS.

Implemented Container Insights, which sends all pod logs to Amazon CloudWatch Logs.

This centralized the logs of all the pods, nodes, and Amazon EKS components in the environment.

 

Alerting:

Teleglobal suggested using Amazon SNS topics to send customized alerts, based on subscription filters set up for Amazon EKS, to reduce the response time to errors and interruption events.

An alert is sent to the concerned personnel whenever the Amazon DataSync job is triggered by AWS Lambda.

Once the Amazon DataSync job is done, an Amazon SNS notification is again sent to the team.

Alerts are also sent when the AWS Lambda cleaning job starts and when it finishes.
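
The completion alert can be driven by an Amazon EventBridge rule that matches DataSync state-change events. The pattern and event fields below reflect the commonly documented shape of those events, but they are stated here as assumptions and should be checked against the current DataSync documentation before use.

```python
import json

# Assumed EventBridge pattern for successful DataSync task executions.
DATASYNC_SUCCESS_PATTERN = {
    "source": ["aws.datasync"],
    "detail-type": ["DataSync Task Execution State Change"],
    "detail": {"State": ["SUCCESS"]},
}


def completion_message(event):
    """Turn a DataSync state-change event into SNS publish arguments."""
    state = event.get("detail", {}).get("State", "UNKNOWN")
    resources = event.get("resources", [])
    execution_arn = resources[0] if resources else "unknown"
    body = {
        "execution": execution_arn,
        "state": state,
        "time": event.get("time"),
    }
    return {
        "Subject": f"DataSync task execution {state}",
        "Message": json.dumps(body, indent=2),
    }


print(json.dumps(DATASYNC_SUCCESS_PATTERN))
```

A Lambda target would then call `sns.publish(TopicArn=..., **completion_message(event))`. Alternatively, the rule can target the SNS topic directly, with no Lambda function in between.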

Backup and Disaster Ready mechanism:

Backups of Amazon S3 were taken using Amazon S3 Cross-Region Replication.

Once set up, this replication runs continuously, automatically replicating new objects added to the source bucket to the destination bucket in the secondary region.
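
A replication rule of this kind can be expressed as the configuration passed to the S3 `put_bucket_replication` API. The sketch below is illustrative: the rule ID, role ARN, and bucket ARN are hypothetical, and the empty prefix filter means the rule applies to every new object.

```python
def replication_configuration(role_arn, destination_bucket_arn):
    """Build a ReplicationConfiguration that copies all new objects
    to a bucket in the secondary region."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to replicate objects
        "Rules": [
            {
                "ID": "replicate-all-new-objects",  # assumed rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches every key
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": destination_bucket_arn},
            }
        ],
    }


# Hypothetical ARNs for illustration only.
config = replication_configuration(
    "arn:aws:iam::111122223333:role/s3-replication-role",
    "arn:aws:s3:::genomics-data-dr",
)
print(config["Rules"][0]["Destination"]["Bucket"])
```

It would be applied with `boto3.client("s3").put_bucket_replication(Bucket=source_bucket, ReplicationConfiguration=config)`. Versioning must be enabled on both the source and destination buckets, and only objects written after the rule is created are replicated automatically.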

For Amazon EBS volumes, AWS Backup was used to copy snapshots from ap-south-1 (Mumbai) to ap-south-2 (Hyderabad).

The backups were scheduled daily, after business hours.

AWS Backup was configured to automatically delete the snapshots after 1 week, managing the backup lifecycle without additional scripts or manual intervention.
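
The daily schedule, one-week retention, and cross-region copy map onto a single AWS Backup plan rule. The following sketch builds the `BackupPlan` structure accepted by the `create_backup_plan` API; the plan name, vault names, and the 21:00 IST start time are assumptions chosen for illustration.

```python
def ebs_backup_plan(source_vault_name, destination_vault_arn):
    """Daily backup with 7-day retention and a copy to a second region."""
    return {
        "BackupPlanName": "ebs-daily-backup",  # assumed name
        "Rules": [
            {
                "RuleName": "daily-after-business-hours",
                "TargetBackupVaultName": source_vault_name,
                # 15:30 UTC is 21:00 IST; the exact time is an assumption.
                "ScheduleExpression": "cron(30 15 * * ? *)",
                "Lifecycle": {"DeleteAfterDays": 7},
                "CopyActions": [
                    {
                        "DestinationBackupVaultArn": destination_vault_arn,
                        "Lifecycle": {"DeleteAfterDays": 7},
                    }
                ],
            }
        ],
    }


# Hypothetical vault identifiers for illustration only.
plan = ebs_backup_plan(
    "mumbai-vault",
    "arn:aws:backup:ap-south-2:111122223333:backup-vault:hyderabad-vault",
)
print(plan["Rules"][0]["ScheduleExpression"])
```

The plan would be created with `boto3.client("backup").create_backup_plan(BackupPlan=plan)`, followed by a backup selection that assigns the Amazon EBS volumes to it.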





Cost Optimization Implementations and Automations:

Teleglobal provided AWS Lambda scripts for Amazon EBS cost optimization that delete idle, unattached Amazon EBS volumes.

These scripts run monthly, which gives unattached volumes a window of 1 month in case they hold any necessary data.
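
A cleanup of unattached volumes could be sketched as follows. This is not the script delivered to the client. It uses the volume's `CreateTime` as a stand-in for the one-month window, which is an assumption: the EC2 API does not report when a volume was detached, so a real script might rely on tags or CloudTrail instead.

```python
from datetime import datetime, timedelta, timezone


def volumes_to_delete(volumes, now=None, grace_days=30):
    """Select IDs of unattached ('available') volumes older than the
    grace period. `volumes` uses the shape returned by describe_volumes."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=grace_days)
    return [
        v["VolumeId"]
        for v in volumes
        if v["State"] == "available"
        and not v.get("Attachments")
        and v["CreateTime"] <= cutoff
    ]


def cleanup(ec2, dry_run=True):
    """List, and optionally delete, idle volumes; `ec2` is a boto3 client."""
    pages = ec2.get_paginator("describe_volumes").paginate(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )
    volumes = [v for page in pages for v in page["Volumes"]]
    ids = volumes_to_delete(volumes)
    if not dry_run:
        for volume_id in ids:
            ec2.delete_volume(VolumeId=volume_id)
    return ids
```

Running with `dry_run=True` first gives the team a list to review before anything is deleted.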

AWS Lambda scripts, written in Python, were also provided for Amazon EC2 instances and run after testing hours.

The data transfer job from Amazon EFS to Amazon S3 runs throughout the day, so temporary data must be cleaned during non-business hours to maintain Amazon EFS performance.

For this, a data cleaning AWS Lambda function is triggered weekly at a specific time by a cron schedule in Amazon EventBridge.

A notification is sent to the concerned personnel when the AWS Lambda function starts cleaning.

Once the cleaning is done, another Amazon SNS notification informs the team that the event has completed.

This gives temporary data a weekly retention period and frees Amazon EFS of unnecessary temporary and intermediate results.
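
The core of such a cleaning function is a walk over the mounted file system that removes files older than the retention period. The sketch below assumes the Lambda function mounts Amazon EFS through an access point at a local path (for example `/mnt/efs/tmp`, a hypothetical location) and uses file modification time as the age criterion.

```python
import os
import time


def clean_temp_files(root, max_age_days=7, now=None):
    """Delete files under `root` whose modification time is older than
    `max_age_days`; return the list of removed paths."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed.append(path)
    return removed
```

The list of removed paths can be included in the completion notification sent through Amazon SNS. Empty directories are left in place here; removing them would be a separate step.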

 

Factors and Outcomes:

By implementing Amazon EFS with Amazon EKS worker nodes across multiple Availability Zones (AZs), the platform ensured continuous data access and enhanced fault tolerance. This setup provided resilience against potential outages and maintained high availability of data.

Dynamic provisioning played a crucial role in optimizing costs by ensuring that the organization only paid for the Amazon EFS resources it needed at any given time. This approach avoided the cost of over-provisioning while still maintaining the necessary performance levels for intensive computational tasks.

The solution dynamically provisioned Amazon EFS storage as needed, eliminating the need for manual storage management. This automation allowed researchers to focus entirely on genomics analysis without the overhead of managing storage resources.

Security was maintained with all data encrypted both in transit and at rest. Access controls were managed through Amazon EFS policies and security groups, safeguarding sensitive genomic data against unauthorized access.

The architecture was designed to be highly scalable, supporting the easy expansion of AWS storage resources. This scalability is essential for accommodating the ever-growing datasets and increasingly complex analyses typical of genomics research.

The ReadWriteMany access mode of Amazon EFS was particularly beneficial, enabling multiple pods across different nodes to read and write data simultaneously. This capability fostered collaboration among researchers, allowing them to work together efficiently.

AWS Backup was used to ensure reliable data protection for the company’s critical analysis data stored on Amazon EBS. By scheduling daily backups, it provided assurance against data loss, maintaining integrity during intensive genomics analysis.

 

Conclusion:

Teleglobal provided a robust, cost-effective, and secure infrastructure that facilitated advanced genomics research by automating storage provisioning, ensuring high availability and fault tolerance, optimizing costs, and enabling collaborative data access.

Teleglobal helped address the organization’s storage pain points while keeping compliance requirements and security in check.
