The client is a giant in the agri-tech and agri-business sector, offering services pan-India. Their services include:
Due to the breadth of their business model, they gather huge volumes of data across the fields listed above.
The total volume of data was about 33 TB, as stated in the client's inventory details; the distribution of this volume is given below:
Their local backup setup is not scalable enough to meet the demands of this constantly evolving and rapidly growing data.
Without proper backups, their business would fall out of compliance with various regulations of the Government of India.
Personal details of farmers must be handled properly to satisfy the data privacy law of the Government of India.
Having collected these kinds of data on their on-premises servers for years, they were facing the following issues:
Single Point of Failure: Relying solely on on-premises servers for data storage increases the risk of losing critical data due to hardware failures, cyberattacks, or natural disasters.
Resource Intensive: Regular backups on on-premises infrastructure require significant IT resources, including ongoing hardware and software maintenance, which can be costly and time-consuming.
Complex and Time-Consuming Recovery: Recovering data from on-premises backups after an incident can be a complex and slow process, potentially leading to prolonged downtime.
Proposed Solution and Architecture:
To ensure efficient and compliant data backup and archival, we will utilize MSP360 Backup (formerly CloudBerry) to securely store data in AWS Glacier’s tiered services. This solution will involve carefully managed roles and policies, allowing minimal access to streamline the backup and archival process while maintaining strict control over sensitive data. By leveraging AWS Glacier’s cost-effective long-term storage and MSP360’s robust management features, the company can achieve a scalable, secure, and compliant data preservation strategy tailored to its unique needs.
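The "minimal access" mentioned above is enforced with an IAM policy scoped to the backup destination buckets only. The following is a minimal boto3 sketch of that idea; the bucket ARNs and policy name are illustrative assumptions, and the exact set of S3 actions the MSP360 agent needs should be taken from the vendor's documentation:

```python
import json
import boto3

iam = boto3.client("iam")

# Hypothetical per-repository backup buckets (names are illustrative).
BACKUP_BUCKET_ARNS = [
    "arn:aws:s3:::client-backup-farmer-data",
    "arn:aws:s3:::client-backup-genome-research",
    # ... one bucket per repository
]

# Minimal permissions the backup agent needs: list the destination
# buckets and read/write objects inside them, nothing else.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListBackupBuckets",
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": BACKUP_BUCKET_ARNS,
        },
        {
            "Sid": "ReadWriteBackupObjects",
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
            "Resource": [arn + "/*" for arn in BACKUP_BUCKET_ARNS],
        },
    ],
}

iam.create_policy(
    PolicyName="msp360-backup-minimal-access",
    PolicyDocument=json.dumps(policy_document),
)
```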
Architecture:
1. MSP360 Backup agent and job setup:
The client procured the required number of licenses to back up their data from the existing on-premises servers to the S3 Glacier tiers. There were 7 repositories, containing farmer data, genome research data, geospatial data, agri-business and agri-market data, farmer training data, IoT data, and intellectual property data in different file formats. We therefore used a license for each repository and created parallel backup jobs into separate S3 Glacier buckets. Only the farmer data repository, which contains PII, was sent to S3 Glacier Instant Retrieval so that we could perform operations on it such as masking and further encryption. After encryption and masking, the data is transferred to the colder Glacier tiers.
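Each repository gets its own destination bucket, and the MSP360 job for that repository points at it (the storage class is chosen in the job settings). Below is a minimal boto3 sketch of provisioning those buckets with default encryption; the bucket names and the ap-south-1 region are illustrative assumptions:

```python
import boto3

s3 = boto3.client("s3", region_name="ap-south-1")

# Hypothetical per-repository destination buckets; names are illustrative.
repositories = [
    "farmer-data", "genome-research", "geospatial",
    "agri-business-market", "farmer-training", "iot", "ip-data",
]

for repo in repositories:
    bucket = f"client-backup-{repo}"
    s3.create_bucket(
        Bucket=bucket,
        CreateBucketConfiguration={"LocationConstraint": "ap-south-1"},
    )
    # Enforce encryption at rest by default on every backup bucket.
    s3.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration={
            "Rules": [
                {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}
            ]
        },
    )
```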
2. Mitigating PII:
We used Amazon Macie to locate the columns and rows in the CSV files where personal details such as full name, phone number, Aadhaar number, insurance number, and Kisan Credit Card number appear.
The Macie findings are stored in another bucket, encrypted with a KMS key. AWS Glue consumes these findings, reads the data from the S3 Glacier Instant Retrieval bucket, and masks the flagged values. The new files with masked values are then written to the S3 Glacier tier buckets.
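The following is a simplified sketch of that masking step, written the way a Python Glue job might express it, assuming the Macie findings have already been reduced to a list of sensitive column names; the bucket, key, and column names are illustrative:

```python
import boto3
import pandas as pd

s3 = boto3.client("s3")

# Columns flagged by Macie; in practice these are parsed out of the
# KMS-encrypted findings bucket. Names are illustrative.
SENSITIVE_COLUMNS = ["full_name", "phone_number", "aadhaar_number",
                     "insurance_number", "kcc_number"]

def mask_value(value) -> str:
    """Keep the last 4 characters, mask the rest."""
    text = str(value)
    return "*" * max(len(text) - 4, 0) + text[-4:]

# Hypothetical source (Instant Retrieval) and destination (archival) locations.
src = s3.get_object(Bucket="client-backup-farmer-data-ir", Key="farmers.csv")
df = pd.read_csv(src["Body"])

for col in SENSITIVE_COLUMNS:
    if col in df.columns:
        df[col] = df[col].map(mask_value)

# Write the masked copy straight into the archival tier.
s3.put_object(
    Bucket="client-archive-farmer-data",
    Key="masked/farmers.csv",
    Body=df.to_csv(index=False).encode("utf-8"),
    StorageClass="GLACIER",
)
```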
The original file with unmasked values still needs to be retrievable, so the whole file is encrypted with the KMS key, the encrypted copy is transferred to S3 Glacier, and the original file is then removed. When the data is required again, it can be reproduced by decrypting this copy from the S3 bucket.
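A minimal sketch of that hand-off is shown below, assuming illustrative bucket names and a hypothetical KMS alias; note that copy_object handles objects up to 5 GB, beyond which a multipart copy is needed:

```python
import boto3

s3 = boto3.client("s3")

SRC_BUCKET = "client-backup-farmer-data-ir"   # illustrative names
DEST_BUCKET = "client-archive-farmer-data"
KMS_KEY_ID = "alias/farmer-data-archive"      # illustrative KMS alias
KEY = "farmers.csv"

# Re-encrypt the original (unmasked) file with the KMS key and land the
# copy directly in the Deep Archive storage class.
s3.copy_object(
    Bucket=DEST_BUCKET,
    Key=f"original-encrypted/{KEY}",
    CopySource={"Bucket": SRC_BUCKET, "Key": KEY},
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId=KMS_KEY_ID,
    StorageClass="DEEP_ARCHIVE",
)

# Once the encrypted archival copy exists, remove the original object.
s3.delete_object(Bucket=SRC_BUCKET, Key=KEY)
```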
To automate the Glue workflow, whenever new data arrives in the Instant Retrieval bucket, an EventBridge rule fires and the Glue workflow runs again.
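A sketch of that automation follows, assuming an event-driven Glue workflow; the bucket, workflow, job, and role names and the account ID are placeholders, and the exact wiring (an EventBridge rule targeting the Glue workflow, plus an EVENT trigger inside the workflow) should be verified against the AWS documentation:

```python
import json
import boto3

s3 = boto3.client("s3")
events = boto3.client("events")
glue = boto3.client("glue")

IR_BUCKET = "client-backup-farmer-data-ir"   # illustrative names
WORKFLOW = "mask-farmer-pii-workflow"
GLUE_JOB = "mask-farmer-pii-job"

# 1. Make the Instant Retrieval bucket emit object-level events to EventBridge.
s3.put_bucket_notification_configuration(
    Bucket=IR_BUCKET,
    NotificationConfiguration={"EventBridgeConfiguration": {}},
)

# 2. An event-driven trigger starts the masking job when the workflow is invoked.
glue.create_trigger(
    Name="start-masking-on-event",
    WorkflowName=WORKFLOW,
    Type="EVENT",
    Actions=[{"JobName": GLUE_JOB}],
)

# 3. A rule that matches new objects in the bucket and targets the Glue workflow.
events.put_rule(
    Name="farmer-data-object-created",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": [IR_BUCKET]}},
    }),
)
events.put_targets(
    Rule="farmer-data-object-created",
    Targets=[{
        "Id": "glue-workflow",
        "Arn": "arn:aws:glue:ap-south-1:111122223333:workflow/" + WORKFLOW,
        "RoleArn": "arn:aws:iam::111122223333:role/eventbridge-start-glue",
    }],
)
```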
3. Retention policy:
To meet the CAP and DPDP compliance standards, retention policies are defined per repository as listed below (a lifecycle configuration sketch follows the list):
Farmer personal data: S3 Glacier Instant Retrieval (mitigation) → S3 Glacier Flexible Retrieval – 1 yr → S3 Glacier Deep Archive – 10 yrs → delete
Genomics data: S3 Glacier Flexible Retrieval – 5 yrs → S3 Glacier Deep Archive
Research data: S3 Glacier Flexible Retrieval – 5 yrs → S3 Glacier Deep Archive
Geospatial data: S3 Glacier Flexible Retrieval – 20 yrs → S3 Glacier Deep Archive
SQL data: S3 Glacier Flexible Retrieval – 5 yrs → S3 Glacier Deep Archive – 20 yrs → delete
Training data: S3 Glacier Flexible Retrieval – 5 yrs → S3 Glacier Deep Archive – 10 yrs → delete
IoT data: S3 Glacier Flexible Retrieval – 1 yr → S3 Glacier Deep Archive – 2 yrs → delete
IP data: S3 Glacier Flexible Retrieval – 10 yrs → S3 Glacier Deep Archive
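These retention tiers map directly onto S3 lifecycle rules. As an example, here is a minimal boto3 sketch of the IoT data rule (1 year in Flexible Retrieval, then 2 years in Deep Archive, then delete); the bucket name is an illustrative assumption:

```python
import boto3

s3 = boto3.client("s3")

# Objects land in Glacier Flexible Retrieval, move to Deep Archive after
# 1 year (365 days), and are deleted 2 years after that (1,095 days total).
s3.put_bucket_lifecycle_configuration(
    Bucket="client-backup-iot-data",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "iot-retention",
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
                "Expiration": {"Days": 365 + 730},
            }
        ]
    },
)
```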
4. Optimised bucket policy for enhanced security:
Apart from IAM-based roles and policies, separate bucket policies were applied to each bucket to close off any vulnerabilities that might arise.
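As an illustration, the sketch below shows the kind of bucket policy applied, denying non-TLS access and unencrypted uploads; the bucket name and account details are assumptions:

```python
import json
import boto3

s3 = boto3.client("s3")

BUCKET = "client-backup-farmer-data-ir"   # illustrative bucket name

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Reject any request that is not made over TLS.
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        },
        {   # Reject uploads that are not encrypted with KMS.
            "Sid": "DenyUnencryptedUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            "Condition": {
                "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
            },
        },
    ],
}

s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(bucket_policy))
```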
Outcomes and Achievements:
Archival needs and SLA satisfaction:
With our plan, tested metrics, and regular drills after deployment, we have met the SLA in terms of retrieval time and mitigation standards.
Retrieval time metrics: With some sample retrievals and statistical interpolation, we determined the actual time required to retrieve data. We found that it takes about 127.6 minutes (about 2 hours) to retrieve 1 TB of data from Instant Retrieval and about 1,690.5 minutes (about 28.2 hours) from Flexible Retrieval.
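The interpolation itself is simple: time a handful of smaller sample retrievals and fit a line through them to extrapolate to 1 TB. A sketch of that calculation with purely illustrative sample timings (the real figures came from the client's retrieval drills) is shown below:

```python
import numpy as np

# Illustrative sample measurements: (GB retrieved, minutes taken).
sample_gb = np.array([50, 100, 250, 500])
sample_minutes = np.array([6.5, 12.8, 31.4, 63.1])

# Fit a straight line (minutes per GB plus fixed overhead) and extrapolate to 1 TB.
slope, intercept = np.polyfit(sample_gb, sample_minutes, 1)
estimate_1tb = slope * 1024 + intercept
print(f"Estimated time to retrieve 1 TB: {estimate_1tb:.1f} minutes")
```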
Optimisation of Cost: Costs were reduced by 68% by choosing S3 Glacier Instant Retrieval over S3 Standard.
Storage Scalability: With proper lifecycle management and use of the appropriate AWS storage classes, the backup and archival infrastructure scales readily with the ever-growing data volume across their operations.
Enhanced Security: Implementing security best practices across the AWS infrastructure, together with AWS' own security services, boosted the security level of all the data being archived.