Amazon EKS Based Genomics Data Analysis Platform with AWS Primary Storage Solutions

Name and Sector of Client:

The client is a major player in the genomics research field and healthcare sector, with services spanning pan-India. Their expertise includes:

  • Advanced genetic data analysis capabilities, developed in response to diseases and applied to urgent health challenges.
  • Partnerships with local hospitals across Hyderabad, enabling wide-reaching impact in healthcare research.
  • Large-scale genetic information processing using a sophisticated data analysis platform built on Amazon EKS.
  • Handling of complex genomic workflows, from raw sequencing data to final analysis outputs.
  • Multi-institutional collaboration, analyzing genetic data from multiple healthcare institutions across India.

Data Sources of Client:

Given the breadth of their research model, they gather huge volumes of data from the sources listed below and beyond.

The organization pulls genomics data from External Research Databases:

  • Public genomic databases (e.g., NCBI, EBI)
  • SRA (Sequence Read Archive) for raw sequencing data
  • Human Genome Diversity Project (HGDP) for population genetics studies.

Partnered Hospitals/Healthcare Institutions: 

  • Hospital information systems
  • Laboratory information management systems (LIMS)
  • Electronic health records (EHRs)

Data type and Volume:

Raw Sequencing Data (FASTQ files):

  • Category: Unprocessed genomic data
  • Amount: Varies widely, but typically 10-30 GB (approx.) per human genome
  • Daily ingestion: Depends on sequencing capacity; could be 1-10 full genomes per day

Aligned Sequence Data (BAM files):

  • Category: Processed genomic data
  • Amount: Similar to or slightly larger than the raw data, approximately 10-50 GB per genome
  • Daily ingestion: Corresponds to raw data processing, 1-10 genomes per day

Clinical Data:

  • Category: Patient information, medical history, test results
  • Amount: Much smaller, typically <5 MB per patient
  • Daily ingestion: Could be hundreds of records, depending on hospital partnerships

These file types represent different stages of the genomic analysis procedure. The process starts with FASTQ files, which are aligned to a reference genome to produce BAM files; various analyses are then performed on the aligned data.

Current Total Monthly Data Ingestion: Approximately 15-20 TB per month

Data from these various sources is ingested into the Amazon EKS platform using APIs provided by the third parties.

 

Applications running on Amazon EKS:

  • FastQC: a quality control tool for checking the quality of raw sequencing data (FASTQ files).
  • Trimmomatic: a CLI tool for trimming and filtering low-quality reads.
  • BWA (Burrows-Wheeler Aligner): for aligning DNA sequences to a reference genome.
  • Bowtie2: a tool for aligning DNA and RNA sequences.
  • Picard: for manipulating and analyzing SAM/BAM files.

Analysis process:

  1. Raw data (FASTQ files) is read from the data store and processed by the analysis tools.
  2. Intermediate results (e.g., aligned BAM files, variant call formats) are stored in Amazon EFS for quick access by the other pods used in subsequent steps of the pipeline.
  3. Final processed data (JSON, YAML, CSV, HTML, BAM) is generated and potentially stored back in long-term storage. (A simplified sketch of this pipeline follows below.)
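A simplified sketch of this pipeline is shown below. It assumes the listed tools are available on the pod image, that the shared Amazon EFS volume is mounted at /mnt/efs, and that the file and directory names are placeholders, not the client's actual layout.

```python
"""Simplified sketch of the analysis pipeline described above (assumed paths and file names)."""
import subprocess
from pathlib import Path

EFS_ROOT = Path("/mnt/efs")                      # shared EFS mount (assumed path)
RAW = EFS_ROOT / "raw" / "sample_R1.fastq.gz"    # raw FASTQ pulled from the data store
REF = EFS_ROOT / "reference" / "GRCh38.fa"       # shared reference genome
WORK = EFS_ROOT / "intermediate"                 # intermediate results visible to all pods
FINAL = EFS_ROOT / "final"                       # final outputs, later moved to long-term storage


def run(cmd: str) -> None:
    """Run one pipeline step in the shell and fail fast on errors."""
    subprocess.run(cmd, shell=True, check=True)


WORK.mkdir(parents=True, exist_ok=True)
FINAL.mkdir(parents=True, exist_ok=True)

# 1. Quality control on the raw FASTQ file.
run(f"fastqc {RAW} --outdir {WORK}")

# 2. Trim low-quality reads with Trimmomatic (single-end mode for brevity).
trimmed = WORK / "sample_trimmed.fastq.gz"
run(f"trimmomatic SE {RAW} {trimmed} SLIDINGWINDOW:4:20 MINLEN:36")

# 3. Align to the reference genome with BWA and sort into a BAM file.
bam = WORK / "sample_sorted.bam"
run(f"bwa mem {REF} {trimmed} | samtools sort -o {bam} -")

# 4. Collect alignment metrics with Picard as one example of a final output.
metrics = FINAL / "sample_alignment_metrics.txt"
run(f"picard CollectAlignmentSummaryMetrics R={REF} I={bam} O={metrics}")
```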

Challenges faced by the client:

  • Critical requirement for ICMR-compliant handling under Indian regulations, with data accessible across multiple Availability Zones (AZs), ensuring continuous availability and enabling disaster recovery for sensitive genomic information in accordance with healthcare regulations.
  • Necessity to streamline storage provisioning and management, significantly reducing administrative overhead and allowing researchers to focus exclusively on genomics work.
  • Need for cost optimization of the environment, aligning storage and compute resource usage with actual needs and eliminating unnecessary expenses.
  • Single-region storage violates industry standards for data redundancy and disaster recovery, risking critical data loss and regulatory non-compliance.
  • Inconsistent backup and retention practices implemented by the organization fail to meet regulatory requirements for genomic research data preservation, jeopardizing research integrity and legal compliance.
  • Researchers and organization users were granted bucket-level permissions, which gave them access to all the data present in a bucket.

No fine-grained, object-level permissions were in place for accessing data in Amazon S3, which raised the risk of users accessing data that should not be accessible to them.

 

  • The environment suffered from significant over-provisioning and underutilization of resources, resulting in unnecessary costs, inefficient resource allocation, and a clear need for optimization.

Proposed Solution and Architecture:

Amazon EFS is utilized to store various types of genomic data crucial for research and analysis. This includes large volumes of raw sequencing data in FASTQ files generated by sequencing machines, which need to be accessed quickly and concurrently by multiple researchers and pods for quality control and preprocessing.

During sequence alignment, worker nodes and pods in the Amazon EKS cluster process these raw data files using tools like BWA and Bowtie2, producing aligned BAM files.

These BAM files, along with intermediate results and reference genomes, must be immediately available to other nodes and pods for further processing and analysis, ensuring data consistency and eliminating the need for data duplication or transfer.

Type of Data stored in Amazon EFS and Amazon S3:

Data kept in Amazon EFS:

  • Active research data: Currently processed genomic sequences and ongoing analyses.
  • Intermediate results: Temporary files generated during analysis pipelines.
  • Frequently accessed reference data: Common reference genomes and annotation files that researchers use regularly.
  • Shared scripts and tools: Custom analysis scripts and frequently used bioinformatics tools.

Data transferred to Amazon S3 periodically via Amazon DataSync:

  • Completed analysis results: Finalized variant calls, gene expression profiles, or assembled genomes.
  • Raw sequencing data: Original FASTQ files after initial quality control and preprocessing.
  • Large, processed datasets: Aligned BAM files or variant call format (VCF) files from completed analyses.
  • Periodic snapshots: Hourly backups of critical research data for disaster recovery.

Such data cannot be kept in Amazon EFS for an extended period due to the high cost and potential performance degradation as the volume of data grows. It is therefore moved to Amazon S3 via Amazon DataSync jobs for more cost-effective, scalable, and durable long-term storage.
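As an illustration of how such an EFS-to-S3 transfer task could be defined with boto3, a minimal sketch follows; all ARNs, subnet IDs, role names, and bucket names are placeholders, not the client's actual resources.

```python
"""Minimal boto3 sketch of defining the EFS-to-S3 DataSync task (placeholder identifiers)."""
import boto3

datasync = boto3.client("datasync")

# Source: the Amazon EFS file system holding completed results and raw FASTQ data.
efs_location = datasync.create_location_efs(
    EfsFilesystemArn="arn:aws:elasticfilesystem:ap-south-1:111122223333:file-system/fs-EXAMPLE",
    Ec2Config={
        "SubnetArn": "arn:aws:ec2:ap-south-1:111122223333:subnet/subnet-EXAMPLE",
        "SecurityGroupArns": ["arn:aws:ec2:ap-south-1:111122223333:security-group/sg-EXAMPLE"],
    },
    Subdirectory="/final",
)

# Destination: the long-term Amazon S3 bucket.
s3_location = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::genomics-longterm-archive-example",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/datasync-s3-access-example"},
    Subdirectory="/processed",
)

# The task that the scheduled AWS Lambda function will start at regular intervals.
task = datasync.create_task(
    SourceLocationArn=efs_location["LocationArn"],
    DestinationLocationArn=s3_location["LocationArn"],
    Name="efs-to-s3-genomics-transfer",
)
print("DataSync task ARN:", task["TaskArn"])
```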

 

Dynamic Provisioning Implementation for Streamlining Storage Provisioning

Why choose Dynamic Provisioning over Static Provisioning?

Dynamic provisioning allocates storage resources precisely when they are needed, based on the application's requirements. This avoids the inefficiencies of pre-allocating fixed storage sizes with static provisioning. Dynamic provisioning allows seamless adjustment of storage allocation without downtime or manual resizing, optimizing the overall volume provisioning process.

In this scenario, uncertainty about how much storage would need to be provisioned made dynamic provisioning the ideal choice. Dynamic provisioning uses Kubernetes StorageClasses to automatically create Persistent Volumes (PVs) when Persistent Volume Claims (PVCs) are made.

For example, if an application needs a 100GB volume, a PVC can be created specifying this size, and Kubernetes will automatically provision a new PV of 100GB using the defined StorageClass. This eliminates the need for manual creation and management of PVs, reducing administrative overhead and ensuring that storage is always available when needed without manual intervention.
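StorageClasses and PVCs are normally written as Kubernetes YAML manifests; the sketch below creates the same objects with the official kubernetes Python client so it stays consistent with the other snippets here. The file system ID, object names, and namespace are illustrative only.

```python
"""Sketch of dynamic provisioning with the Amazon EFS CSI driver (placeholder names)."""
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside a pod

# StorageClass backed by the Amazon EFS CSI driver (dynamic provisioning via access points).
storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "efs-sc"},
    "provisioner": "efs.csi.aws.com",
    "parameters": {
        "provisioningMode": "efs-ap",
        "fileSystemId": "fs-EXAMPLE11111",   # placeholder EFS file system ID
        "directoryPerms": "700",
    },
}
client.StorageV1Api().create_storage_class(body=storage_class)

# PVC requesting 100Gi; the CSI driver provisions the PV automatically.
# (For EFS the requested size is nominal, since the file system is elastic.)
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "genomics-shared-data"},
    "spec": {
        "accessModes": ["ReadWriteMany"],
        "storageClassName": "efs-sc",
        "resources": {"requests": {"storage": "100Gi"}},
    },
}
client.CoreV1Api().create_namespaced_persistent_volume_claim(namespace="default", body=pvc)
```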

Teleglobal installed the Amazon EFS CSI Driver in the Amazon EKS cluster via the Amazon EFS CSI Driver add-on, which automated the installation of the necessary utilities and manages Amazon EFS volumes; by utilizing this add-on, the dynamic provisioning solution was implemented.
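The add-on itself can be enabled through the console, eksctl, or an API call; a minimal boto3 sketch is shown below, where the cluster name and IAM role ARN are placeholders.

```python
"""Minimal boto3 sketch of enabling the Amazon EFS CSI driver EKS add-on (placeholder names)."""
import boto3

eks = boto3.client("eks")

eks.create_addon(
    clusterName="genomics-eks-cluster",                                          # placeholder cluster name
    addonName="aws-efs-csi-driver",
    serviceAccountRoleArn="arn:aws:iam::111122223333:role/efs-csi-driver-example",  # placeholder IRSA role
)
```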

 

Why introduce Amazon EFS?

Amazon S3: Amazon S3 is an object storage service that does not support POSIX-compliant file system semantics. It is designed for storing and retrieving large objects, but it does not provide the fine-grained control and real-time file access needed for computational tasks that require standard file operations.

 

Amazon EBS: Amazon EBS is POSIX-compliant but cannot be used in this case, as a volume cannot be mounted on multiple Amazon EC2 instances (worker nodes). Also, Amazon EBS volumes are AZ-specific and cannot share data across AZs.

 

Amazon EFS: Amazon EFS, in contrast, is POSIX-compliant, allowing it to support the standard file system semantics required by bioinformatics tools.

Amazon EFS can be mounted on multiple Amazon EC2 instances across different AZs, providing shared, scalable, and distributed file storage that ensures data redundancy, high availability, and seamless access to shared data across the Amazon EKS cluster.

This makes it ideal for the high-performance, real-time data processing needs of the genomics analysis platform.

 

Amazon S3 enhancements suggested by Teleglobal:

Data coming into Amazon S3 is categorized and then stored in the buckets with certain tagging.

 

Users were given certain permissions based on their requirements, but only at the bucket level, not at the object level.

 

This increased the risk of users being able to see all the different kinds of data present in a folder.

 

Teleglobal suggested implementing tag-based access to objects, so that users can access only the required bucket, the required folder, and, within it, only the required objects, thereby establishing fine-grained access control over users all the way down to the object level.
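As an illustration of what such object-level, tag-based access control can look like, the sketch below creates an IAM policy with boto3 that allows reading only objects carrying a matching tag; the bucket name, tag key and value, and policy name are hypothetical.

```python
"""Illustrative sketch of tag-based (object-level) S3 access control via an IAM policy."""
import json
import boto3

iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowReadOnlyTaggedRawSequenceObjects",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:GetObjectTagging"],
            "Resource": "arn:aws:s3:::genomics-research-data-example/*",  # placeholder bucket
            # Access is granted only when the object carries the matching tag.
            "Condition": {"StringEquals": {"s3:ExistingObjectTag/data_type": "raw_sequence"}},
        }
    ],
}

iam.create_policy(
    PolicyName="raw-sequence-readers-example",   # placeholder policy name
    PolicyDocument=json.dumps(policy_document),
)
```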

 

The Teleglobal team wrote an AWS Lambda function to tag new uploads arriving in the Amazon S3 buckets through Amazon DataSync.

 

This AWS Lambda function runs after an Amazon DataSync job completes successfully and is triggered by Amazon EventBridge.

 

This AWS Lambda function, with permissions such as s3:PutObjectTagging and s3:ListBucket, extracts the bucket name and the folder prefix under which the data is stored in Amazon S3.

 

The objects follow a unique naming convention, for example processed_genome (pg), raw_sequence (rs), etc.

Using these unique parameters, such as the naming convention and the prefix of the object's destination folder, the tagging logic was built, and objects are tagged after being uploaded to the Amazon S3 bucket.
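A minimal sketch of such a tagging function follows, assuming a hypothetical destination bucket, folder prefix, and tag key; in practice the bucket and prefix would be derived from the EventBridge event and the tagging logic would be more elaborate.

```python
"""Sketch of the tagging Lambda described above (names, prefixes, and tag values are illustrative)."""
import boto3

s3 = boto3.client("s3")

# Mapping from file-name prefixes (the naming convention) to tag values.
NAME_PREFIX_TO_TAG = {
    "pg": "processed_genome",
    "rs": "raw_sequence",
}

BUCKET = "genomics-longterm-archive-example"   # placeholder destination bucket
FOLDER_PREFIX = "processed/"                   # placeholder destination folder prefix


def lambda_handler(event, context):
    """Tag every object under the destination prefix based on its file-name prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=FOLDER_PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            file_name = key.rsplit("/", 1)[-1]
            name_prefix = file_name.split("_", 1)[0]      # e.g. "pg" or "rs"
            data_type = NAME_PREFIX_TO_TAG.get(name_prefix)
            if data_type is None:
                continue                                   # unknown naming convention, skip
            s3.put_object_tagging(
                Bucket=BUCKET,
                Key=key,
                Tagging={"TagSet": [{"Key": "data_type", "Value": data_type}]},
            )
    return {"status": "tagging complete"}
```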

How did Teleglobal ensure tagging of the already existing data?

This was done using an AWS Lambda function that checks whether tagging is already present on the objects in the bucket it is written for; if not, it tags the objects based on parameters such as project_name, the naming convention, etc.
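A condensed sketch of this backfill function is shown below, again with hypothetical bucket, tag, and naming-convention values.

```python
"""Sketch of the backfill Lambda for pre-existing objects (illustrative names only)."""
import boto3

s3 = boto3.client("s3")
BUCKET = "genomics-longterm-archive-example"   # placeholder bucket


def lambda_handler(event, context):
    """Tag only those objects that do not already carry a data_type tag."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            existing = s3.get_object_tagging(Bucket=BUCKET, Key=key)["TagSet"]
            if any(tag["Key"] == "data_type" for tag in existing):
                continue                        # already tagged, leave untouched
            # Re-use the same naming-convention logic as the ingest-time tagging function.
            data_type = "raw_sequence" if key.rsplit("/", 1)[-1].startswith("rs") else "processed_genome"
            s3.put_object_tagging(
                Bucket=BUCKET,
                Key=key,
                Tagging={"TagSet": [{"Key": "data_type", "Value": data_type}]},
            )
```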

How did Teleglobal monitor the implemented tagging mechanism?

Teleglobal wrote an AWS Lambda function that runs daily to check whether tagging has been applied across the buckets.

If any untagged objects are found, an Amazon SNS alert with the list of untagged objects is sent to the team so that immediate action can be taken.
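A minimal sketch of such a daily audit function, with a placeholder SNS topic ARN and bucket list:

```python
"""Sketch of the daily tagging-audit Lambda (topic ARN and buckets are placeholders)."""
import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")

BUCKETS = ["genomics-longterm-archive-example"]                                   # buckets to audit
TOPIC_ARN = "arn:aws:sns:ap-south-1:111122223333:untagged-objects-alerts-example"  # placeholder topic


def lambda_handler(event, context):
    untagged = []
    for bucket in BUCKETS:
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket):
            for obj in page.get("Contents", []):
                tags = s3.get_object_tagging(Bucket=bucket, Key=obj["Key"])["TagSet"]
                if not tags:
                    untagged.append(f"s3://{bucket}/{obj['Key']}")
    if untagged:
        # Alert the team with the full list of untagged objects.
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject="Untagged S3 objects detected",
            Message="\n".join(untagged),
        )
    return {"untagged_count": len(untagged)}
```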

 

Transfer Data from Amazon EFS to Amazon S3 Using Amazon DataSync

  • Amazon DataSync efficiently transfers data from Amazon EFS to Amazon S3 at regular intervals, ensuring cost-effective long-term storage of processed genomic data.
  • AWS Lambda function automates the Amazon DataSync job initiation, minimizing operational overhead and reducing the need for manual intervention.
  • Amazon EventBridge triggers the AWS Lambda function every ½ hour, maintaining a consistent data transfer cadence that aligns with research workflows and data generation patterns.
  • By moving data from Amazon EFS to Amazon S3, the solution optimizes storage costs while maintaining data accessibility for future reference or analysis.
  • This setup seamlessly handles increasing data volumes, supporting the organization’s growth and evolving research needs without requiring significant infrastructure changes.

Regular data transfers facilitate data lifecycle management, aiding in meeting retention policies and regulatory requirements specific to genomic research.
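A minimal sketch of the scheduled automation described above is shown below; the DataSync task ARN is a placeholder, and an EventBridge rule with a rate(30 minutes) schedule is assumed to invoke the handler.

```python
"""Sketch of the scheduled Lambda that starts the EFS-to-S3 DataSync task (placeholder ARN)."""
import boto3

datasync = boto3.client("datasync")

TASK_ARN = "arn:aws:datasync:ap-south-1:111122223333:task/task-EXAMPLE"  # placeholder task ARN


def lambda_handler(event, context):
    # Kick off one execution of the pre-defined EFS-to-S3 transfer task.
    execution = datasync.start_task_execution(TaskArn=TASK_ARN)
    return {"taskExecutionArn": execution["TaskExecutionArn"]}
```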

Script:




Architecture: