Amazon EKS-Based Genomics Data Analysis Platform with AWS Primary Storage Solutions

Name and Sector of Client 

The client is a healthcare giant providing genomics research and healthcare services across India. Their expertise includes: 

  • Advanced genetic data analysis for urgent health challenges and diseases. 
  • Partnerships with local hospitals in Hyderabad for research support. 
  • Large-scale genetic information processing on Amazon EKS. 
  • Handling genomic workflows from raw sequencing data to final outputs. 
  • Multi-institutional collaboration with hospitals and labs across India. 

Data Sources of Client 

The research model requires large volumes of data from multiple sources. 

External Research Databases: 

  • Public genomic databases (NCBI, EBI). 
  • SRA (Sequence Read Archive) for raw sequencing data. 
  • HGDP (Human Genome Diversity Project) for population studies. 

Partnered Hospitals: 

  • Hospital information systems. 
  • Laboratory Information Management Systems (LIMS). 
  • Electronic Health Records (EHRs). 

Data Types and Volume 

Raw Sequencing Data (FASTQ files): 

  • Category: Unprocessed genomic data. 
  • Size: 10–30 GB per human genome. 
  • Daily ingestion: 1–10 full genomes. 

Aligned Sequence Data (BAM files): 

  • Category: Processed genomic data. 
  • Size: 10–50 GB per genome. 
  • Daily ingestion: Matches raw data volume. 

Clinical Data: 

  • Category: Patient history and test results. 
  • Size: <5 MB per patient. 
  • Daily ingestion: Hundreds of records. 

Monthly total ingestion: 15–20 TB. 
All of this data enters the Amazon EKS cluster through third-party APIs. 

Applications on Amazon EKS 

The genomics platform runs several tools inside Amazon EKS pods: 

  • FastQC for raw sequencing data quality checks. 
  • Trimmomatic for trimming and filtering low-quality reads. 
  • BWA for DNA sequence alignment to reference genomes. 
  • Bowtie2 for aligning DNA and RNA sequences. 
  • Picard for manipulating SAM/BAM files. 

Analysis Process 

  1. Raw data is read from external databases. 
  2. Tools process the FASTQ files (a pipeline sketch follows this list). 
  3. Intermediate BAM and VCF files are stored in Amazon EFS. 
  4. Final results (JSON, YAML, CSV, HTML, BAM) are stored long-term. 
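
A minimal sketch of how one such pipeline stage might be orchestrated from Python, assuming the standard fastqc and bwa command-line tools are installed in the pod image; all file paths and the thread count are illustrative placeholders, not the client's actual layout:

```python
import subprocess
from pathlib import Path

# Hypothetical locations on the shared EFS mount; actual paths will differ.
EFS_ROOT = Path("/mnt/efs")
REFERENCE = EFS_ROOT / "references/GRCh38.fa"
READS_1 = EFS_ROOT / "raw/sample_R1.fastq.gz"
READS_2 = EFS_ROOT / "raw/sample_R2.fastq.gz"
OUT_DIR = EFS_ROOT / "intermediate"

def run_qc_and_align() -> None:
    OUT_DIR.mkdir(parents=True, exist_ok=True)

    # Quality-check the raw reads with FastQC.
    subprocess.run(
        ["fastqc", str(READS_1), str(READS_2), "-o", str(OUT_DIR)],
        check=True,
    )

    # Align paired-end reads to the reference with BWA-MEM,
    # writing the SAM output onto the shared EFS volume.
    sam_path = OUT_DIR / "sample.sam"
    with open(sam_path, "w") as sam_out:
        subprocess.run(
            ["bwa", "mem", "-t", "8", str(REFERENCE), str(READS_1), str(READS_2)],
            stdout=sam_out,
            check=True,
        )

if __name__ == "__main__":
    run_qc_and_align()
```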

Challenges Faced 

  • Need for ICMR-compliant access across multiple Availability Zones. 
  • Complex storage management added overhead for researchers. 
  • High costs due to over-provisioned resources. 
  • Single-region storage risked compliance and disaster recovery. 
  • Inconsistent backup practices threatened research integrity. 
  • Bucket-level permissions in S3 exposed sensitive data. 
  • No fine-grained access control for users. 

Proposed Solution and Amazon EKS Architecture 

Amazon EFS stores active genomic data for ongoing research. 
Worker nodes in the Amazon EKS cluster process raw data with BWA and Bowtie2. 
Intermediate BAM files and reference genomes remain accessible for further steps. 

Data Stored in Amazon EFS 

  • Active research data. 
  • Intermediate results. 
  • Frequently used reference genomes. 
  • Shared scripts and tools. 

Data Stored in Amazon S3 

  • Completed analysis results. 
  • Raw data after preprocessing. 
  • Processed datasets like BAM and VCF files. 
  • Hourly snapshots for recovery. 

Data transfers between EFS and S3 occur through Amazon DataSync to optimize storage cost. 

Dynamic Provisioning 

Dynamic provisioning avoids the problems of statically allocated storage. 
It uses Kubernetes StorageClasses to create Persistent Volumes automatically. 
For example, a 100 GB PersistentVolumeClaim (PVC) request creates a matching PV automatically. 
This eliminated manual setup and reduced storage overhead. 

Teleglobal installed the Amazon EFS CSI Driver in the Amazon EKS cluster, enabling dynamic provisioning. 
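
As an illustration, a PVC like the one below would have the EFS CSI driver provision a matching volume. This is a minimal sketch using the official Kubernetes Python client; the StorageClass name efs-sc and the namespace are assumptions, not the client's actual configuration:

```python
from kubernetes import client, config

def create_efs_pvc(namespace: str = "genomics") -> None:
    # Load kubeconfig (use config.load_incluster_config() inside a pod).
    config.load_kube_config()

    pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name="genomics-workspace"),
        spec=client.V1PersistentVolumeClaimSpec(
            # ReadWriteMany lets many pods mount the same EFS volume.
            access_modes=["ReadWriteMany"],
            # "efs-sc" is an assumed StorageClass backed by the EFS CSI driver.
            storage_class_name="efs-sc",
            # EFS is elastic, but the storage request field is still required.
            resources=client.V1ResourceRequirements(
                requests={"storage": "100Gi"}
            ),
        ),
    )
    client.CoreV1Api().create_namespaced_persistent_volume_claim(
        namespace=namespace, body=pvc
    )

if __name__ == "__main__":
    create_efs_pvc()
```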

Why Amazon EFS? 

  • Amazon S3 lacks POSIX compliance, making it unsuitable for direct compute tasks. 
  • Amazon EBS is POSIX compliant but is limited to a single AZ and cannot be shared across nodes. 
  • Amazon EFS is POSIX compliant, AZ-independent, and mountable across nodes. 

This made EFS ideal for real-time genomics data analysis. 

Amazon S3 Enhancements 

  • Data in S3 is tagged after upload. 
  • Bucket-level permissions were replaced with tag-based object permissions. 
  • Users now access only the specific files they need. 

Teleglobal built AWS Lambda functions to tag new uploads after each DataSync job. 
Daily Lambda jobs also checked old data for missing tags. 
If untagged data was found, Amazon SNS alerts were sent. 
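
A minimal sketch of such a tagging function with boto3; the bucket name, tag keys, and SNS topic ARN are placeholders, not the client's real values:

```python
import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")

# Placeholder values for illustration only.
BUCKET = "genomics-results-bucket"
TOPIC_ARN = "arn:aws:sns:ap-south-1:123456789012:untagged-data-alerts"

def tag_new_objects(keys: list[str]) -> None:
    """Apply access-control tags to objects landed by a DataSync job."""
    for key in keys:
        s3.put_object_tagging(
            Bucket=BUCKET,
            Key=key,
            Tagging={"TagSet": [
                {"Key": "project", "Value": "genomics"},
                {"Key": "data-class", "Value": "processed"},
            ]},
        )

def alert_untagged(keys: list[str]) -> None:
    """Raise an SNS alert when the daily scan finds untagged objects."""
    untagged = [
        k for k in keys
        if not s3.get_object_tagging(Bucket=BUCKET, Key=k)["TagSet"]
    ]
    if untagged:
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject="Untagged genomics data found",
            Message="Untagged objects: " + ", ".join(untagged),
        )
```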

Data Transfer with Amazon DataSync 

  • Data moves from Amazon EFS to Amazon S3 regularly. 
  • AWS Lambda triggers DataSync task executions every 30 minutes (see the sketch after this list). 
  • EventBridge manages automation and scheduling. 
  • This reduced manual overhead and ensured regulatory compliance. 
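
A minimal sketch of the scheduled trigger, assuming an existing DataSync task; the task ARN and rule name are placeholders:

```python
import boto3

# Placeholder ARN for an existing DataSync task (EFS -> S3).
TASK_ARN = "arn:aws:datasync:ap-south-1:123456789012:task/task-0123456789abcdef0"

def handler(event, context):
    """Lambda handler fired by the EventBridge schedule."""
    datasync = boto3.client("datasync")
    execution = datasync.start_task_execution(TaskArn=TASK_ARN)
    return {"taskExecutionArn": execution["TaskExecutionArn"]}

def create_schedule() -> None:
    """One-off setup: an EventBridge rule that fires every 30 minutes.

    Wiring the Lambda as a target and granting invoke permission are
    separate steps, omitted here for brevity.
    """
    events = boto3.client("events")
    events.put_rule(
        Name="datasync-efs-to-s3-every-30-min",
        ScheduleExpression="rate(30 minutes)",
    )
```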

Security Measures 

  • Encryption at rest with AWS KMS (a setup sketch follows this list). 
  • TLS/SSL encryption during data transfer. 
  • Security groups allowed controlled NFS traffic. 
  • Enforced “in-transit encryption” on Amazon EFS. 
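
A minimal sketch of creating a KMS-encrypted file system and opening NFS traffic (TCP 2049) to the cluster's worker nodes; the key alias and security group IDs are placeholders:

```python
import boto3

efs = boto3.client("efs")
ec2 = boto3.client("ec2")

def create_encrypted_file_system() -> str:
    """Create an EFS file system encrypted at rest with a KMS key."""
    fs = efs.create_file_system(
        CreationToken="genomics-efs",       # idempotency token
        Encrypted=True,
        KmsKeyId="alias/genomics-efs-key",  # placeholder KMS key alias
        PerformanceMode="generalPurpose",
    )
    return fs["FileSystemId"]

def allow_nfs_from_cluster(efs_sg: str, cluster_sg: str) -> None:
    """Allow NFS (TCP 2049) only from the EKS worker-node security group."""
    ec2.authorize_security_group_ingress(
        GroupId=efs_sg,
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 2049,
            "ToPort": 2049,
            "UserIdGroupPairs": [{"GroupId": cluster_sg}],
        }],
    )
```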

Monitoring and Logging 

  • Container Insights monitored pod metrics on the Amazon EKS cluster. 
  • CloudWatch tracked EFS metrics like throughput and I/O usage (queried as in the sketch after this list). 
  • S3 and DataSync metrics provided detailed transfer stats. 
  • Centralized pod and cluster logs were sent to CloudWatch. 
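
A minimal sketch of pulling an EFS I/O metric from CloudWatch with boto3; the file system ID is a placeholder:

```python
import boto3
from datetime import datetime, timedelta, timezone

def recent_efs_io_bytes(file_system_id: str = "fs-0123456789abcdef0"):
    """Fetch the last hour of total I/O bytes for an EFS file system."""
    cloudwatch = boto3.client("cloudwatch")
    now = datetime.now(timezone.utc)
    return cloudwatch.get_metric_statistics(
        Namespace="AWS/EFS",
        MetricName="TotalIOBytes",
        Dimensions=[{"Name": "FileSystemId", "Value": file_system_id}],
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=300,          # 5-minute datapoints
        Statistics=["Sum"],
    )["Datapoints"]
```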

Alerting 

  • Amazon SNS alerts for DataSync job start and completion. 
  • Lambda cleanup jobs also triggered alerts after execution. 
  • This reduced downtime and improved response times. 

Backup and Disaster Readiness 

  • Cross-region replication for Amazon S3 buckets. 
  • Amazon EBS snapshots copied daily from Mumbai (ap-south-1) to Hyderabad (ap-south-2); see the sketch after this list. 
  • Snapshots auto-deleted after one week via AWS Backup lifecycle. 
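
A minimal sketch of the cross-region snapshot copy; the snapshot ID is a placeholder:

```python
import boto3

def copy_snapshot_to_hyderabad(snapshot_id: str) -> str:
    """Copy an EBS snapshot from Mumbai (ap-south-1) to Hyderabad (ap-south-2)."""
    # The copy is issued against the *destination* region's EC2 endpoint.
    ec2_dst = boto3.client("ec2", region_name="ap-south-2")
    copy = ec2_dst.copy_snapshot(
        SourceRegion="ap-south-1",
        SourceSnapshotId=snapshot_id,
        Description="Daily DR copy of genomics analysis volume",
    )
    return copy["SnapshotId"]
```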

Cost Optimization 

  • Lambda scripts deleted idle EBS volumes monthly (sketched after this list). 
  • EC2 instances were shut down after testing hours. 
  • Weekly cleanup jobs removed temp data from Amazon EFS. 
  • Notifications kept teams aware of completed cleanups. 
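
A minimal sketch of such a cleanup function, assuming any volume in the "available" (unattached) state is safe to delete; a real deployment would add tag-based safeguards:

```python
import boto3

def delete_idle_volumes(region: str = "ap-south-1") -> list[str]:
    """Delete unattached EBS volumes and report what was removed."""
    ec2 = boto3.client("ec2", region_name=region)
    deleted = []
    paginator = ec2.get_paginator("describe_volumes")
    for page in paginator.paginate(
        Filters=[{"Name": "status", "Values": ["available"]}]
    ):
        for volume in page["Volumes"]:
            ec2.delete_volume(VolumeId=volume["VolumeId"])
            deleted.append(volume["VolumeId"])
    return deleted
```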

Factors and Outcomes 

  • The Amazon EKS architecture improved data availability across AZs. 
  • Dynamic provisioning reduced over-provisioning costs. 
  • Encryption protected sensitive research data. 
  • Scalability allowed the system to handle growing genomics datasets. 
  • ReadWriteMany access in EFS enabled multiple pods to work together. 
  • AWS Backup assured daily protection for analysis data. 

Conclusion 

We at Teleglobal built a secure, cost-effective genomics platform that combined Amazon EKS, AWS storage solutions, and dynamic provisioning. The setup supported compliance, reduced costs, and improved collaboration. By using the right AWS storage services, the client gained scalability and efficiency.
