Name and Sector of Client:
The client is a giant in the agri-tech and agri-business sector, with services spanning all of India. Their services include:
Environment Details:
Owing to this broad business model, they gather a huge volume of data spanning the fields listed above.
The total volume of data was about 33 TB, as stated in the client's inventory details; the distribution of this volume is given below:
Problems faced by the client:
After collecting this kind of data for years on their on-premise servers, they were facing the following issues:
Proposed Solution and Architecture:
To ensure efficient and compliant data backup and archival, we will utilize MSP360 Backup (formerly CloudBerry Backup) to securely store data in Amazon S3 Glacier's tiered storage classes. This solution involves carefully managed roles and policies that grant minimal access, streamlining the backup and archival process while maintaining strict control over sensitive data. By leveraging S3 Glacier's cost-effective long-term storage and MSP360's robust management features, the company can achieve a scalable, secure, and compliant data preservation strategy tailored to its unique needs.
Architecture:
[Architecture diagram]
1. MSP360 Backup agent and job setup:
The client procured the required number of licenses to back up their data from their existing on-premise servers to the S3 Glacier tier.
– There were 7 repositories, containing farmer data, genome research data, geospatial data, agri-business and agri-market data, farmer training data, IoT data, and intellectual property data, in different file formats.
– We used a license for each repository and created parallel backup jobs into separate S3 Glacier buckets.
– Only the farmer data repository, which contains PII, was sent to S3 Glacier Instant Retrieval so that we could perform operations on it such as masking and further encryption.
– After masking and encryption, that data is transferred to the deeper Glacier tiers.
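In practice the uploads are driven entirely by the MSP360 job definitions; the snippet below is only a minimal sketch of the underlying S3 API call those jobs rely on, with hypothetical bucket, key, and file names.

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical repository bucket and object key; MSP360 performs these
    # uploads itself, this just shows the equivalent direct S3 write.
    with open("sequencing-batch-01.tar", "rb") as data:
        s3.put_object(
            Bucket="client-genome-archive",        # one bucket per repository
            Key="genome/2023/sequencing-batch-01.tar",
            Body=data,
            StorageClass="GLACIER",                # S3 Glacier Flexible Retrieval
            ServerSideEncryption="aws:kms",        # encrypt at rest with KMS
        )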
2. Mitigating PII:
– We used Amazon Macie to locate the columns and rows in the CSV files containing personal details such as full names, phone numbers, Aadhaar numbers, insurance details, and Kisan Credit Card numbers. (Sketches of each step in this workflow follow the list below.)
– The Macie findings are stored in another bucket, encrypted with a KMS key. Glue consumes these findings, the data is pulled from Glacier Instant Retrieval, and the flagged values are masked. The new data with masked values is then written to the S3 Glacier tier buckets.
– The original file with unmasked values is still available and can be fetched, so the whole file is encrypted with KMS, the encrypted copy is transferred to S3 Glacier, and the original file is then removed.
– When the data is required again, it can be reproduced by decrypting it from the S3 bucket.
– To automate the Glue workflow, whenever new data arrives in the instant-retrieval bucket an EventBridge trigger fires and Glue runs again.
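The first step is the Macie classification job over the instant-retrieval bucket. In the sketch below, the bucket name, account ID, and the custom data identifier for Kisan Credit Card numbers are hypothetical placeholders, not the client's actual values.

    import boto3

    macie = boto3.client("macie2")

    # One-time scan of the bucket holding farmer PII. Macie's managed data
    # identifiers cover names, phone numbers, Aadhaar numbers, etc.; Kisan
    # Credit Card numbers need a custom (regex-based) identifier.
    macie.create_classification_job(
        jobType="ONE_TIME",
        name="farmer-pii-scan",
        s3JobDefinition={
            "bucketDefinitions": [{
                "accountId": "111122223333",                # placeholder account
                "buckets": ["farmer-data-instant-retrieval"],
            }]
        },
        managedDataIdentifierSelector="ALL",
        customDataIdentifierIds=["kcc-custom-identifier-id"],  # hypothetical ID
    )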
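The Glue job then masks the flagged values. A simplified pandas equivalent of that masking logic is shown below, with hypothetical column and file names standing in for whatever Macie actually flags.

    import pandas as pd

    # Columns flagged by the Macie findings (hypothetical names for illustration).
    PII_COLUMNS = ["full_name", "phone_number", "aadhaar_number",
                   "insurance_id", "kcc_number"]

    def mask_value(value) -> str:
        """Mask everything but the last four characters."""
        text = str(value)
        return "*" * max(len(text) - 4, 0) + text[-4:]

    df = pd.read_csv("farmer_records.csv")
    for column in PII_COLUMNS:
        if column in df.columns:
            df[column] = df[column].map(mask_value)
    df.to_csv("farmer_records_masked.csv", index=False)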
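Protecting the original, unmasked file follows the same pattern: copy it into the Glacier bucket with KMS encryption applied, then delete the source object. Bucket and key names below are again hypothetical.

    import boto3

    s3 = boto3.client("s3")

    # Re-encrypt the original file into the Glacier bucket, then remove it
    # from the instant-retrieval bucket.
    s3.copy_object(
        CopySource={"Bucket": "farmer-data-instant-retrieval",
                    "Key": "farmer_records.csv"},
        Bucket="farmer-data-archive",
        Key="originals/farmer_records.csv",
        StorageClass="GLACIER",
        ServerSideEncryption="aws:kms",
    )
    s3.delete_object(Bucket="farmer-data-instant-retrieval",
                     Key="farmer_records.csv")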
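Finally, the automation: an EventBridge rule on object creation in the instant-retrieval bucket that restarts the Glue workflow. The sketch assumes EventBridge notifications are enabled on the bucket and that the workflow begins with an EVENT trigger; all names and ARNs are placeholders.

    import boto3

    events = boto3.client("events")

    # Fire on every new object in the instant-retrieval bucket (requires
    # EventBridge notifications to be enabled on that bucket).
    events.put_rule(
        Name="farmer-pii-new-object",
        EventPattern=(
            '{"source": ["aws.s3"], "detail-type": ["Object Created"], '
            '"detail": {"bucket": {"name": ["farmer-data-instant-retrieval"]}}}'
        ),
        State="ENABLED",
    )

    # Point the rule at the Glue workflow so the masking flow re-runs.
    events.put_targets(
        Rule="farmer-pii-new-object",
        Targets=[{
            "Id": "glue-masking-workflow",
            "Arn": "arn:aws:glue:ap-south-1:111122223333:workflow/farmer-pii-masking",
            "RoleArn": "arn:aws:iam::111122223333:role/eventbridge-invoke-glue",
        }],
    )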
3. Retention policy:
To meet CAP and DPDP compliance standards, retention policies are defined per repository as listed below:
i. Farmer personal data: S3 Glacier Instant Retrieval (mitigation) → S3 Glacier Flexible Retrieval, 1 yr → S3 Glacier Deep Archive, 10 yrs → delete.
ii. Genomics data: S3 Glacier Flexible Retrieval, 5 yrs → S3 Glacier Deep Archive.
iii. Research data: S3 Glacier Flexible Retrieval, 5 yrs → S3 Glacier Deep Archive.
iv. Geospatial data: S3 Glacier Flexible Retrieval, 20 yrs → S3 Glacier Deep Archive.
v. SQL data: S3 Glacier Flexible Retrieval, 5 yrs → S3 Glacier Deep Archive, 20 yrs → delete.
vi. Training data: S3 Glacier Flexible Retrieval, 5 yrs → S3 Glacier Deep Archive, 10 yrs → delete.
vii. IoT data: S3 Glacier Flexible Retrieval, 1 yr → S3 Glacier Deep Archive, 2 yrs → delete.
viii. IP data: S3 Glacier Flexible Retrieval, 10 yrs → S3 Glacier Deep Archive.
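These retention schedules map directly onto S3 lifecycle rules. As one example, the IoT schedule (1 yr in Flexible Retrieval, 2 yrs in Deep Archive, then delete) could look like the sketch below, assuming a hypothetical bucket name and that objects land in the GLACIER storage class on upload.

    import boto3

    s3 = boto3.client("s3")

    # IoT retention: objects arrive in Glacier Flexible Retrieval, move to
    # Deep Archive after 365 days, and expire 1,095 days (3 years) after
    # creation, i.e. 2 years after the Deep Archive transition.
    s3.put_bucket_lifecycle_configuration(
        Bucket="client-iot-archive",               # hypothetical bucket
        LifecycleConfiguration={
            "Rules": [{
                "ID": "iot-retention",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},          # apply to every object
                "Transitions": [
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
                "Expiration": {"Days": 1095},
            }]
        },
    )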
4. Optimised bucket policy for enhanced security:
Apart from IAM-based roles and policies, separate bucket policies were applied to each bucket to avoid any vulnerabilities that might arise.
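A sketch of what such a bucket policy might look like for the farmer-data bucket: deny everything except the backup and Glue roles, and reject uploads that skip KMS encryption. All bucket names and ARNs here are hypothetical placeholders.

    import boto3, json

    s3 = boto3.client("s3")

    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {   # Lock the bucket down to the two roles that need it.
                "Sid": "DenyAllButArchivalRoles",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    "arn:aws:s3:::farmer-data-archive",
                    "arn:aws:s3:::farmer-data-archive/*",
                ],
                "Condition": {"ArnNotEquals": {"aws:PrincipalArn": [
                    "arn:aws:iam::111122223333:role/msp360-backup-role",
                    "arn:aws:iam::111122223333:role/glue-masking-role",
                ]}},
            },
            {   # Refuse any upload that is not KMS-encrypted.
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": "arn:aws:s3:::farmer-data-archive/*",
                "Condition": {"StringNotEquals": {
                    "s3:x-amz-server-side-encryption": "aws:kms"}},
            },
        ],
    }
    s3.put_bucket_policy(Bucket="farmer-data-archive", Policy=json.dumps(policy))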
Outcomes and Achievements:
– Archival needs and SLA satisfaction: With our plan, tested metrics, and regular drills after deployment, we have met the SLA in terms of retrieval time and mitigation standards.
– Retrieval time metrics: With some test retrievals and statistical interpolation, we determined the actual time required to retrieve data: about 127.6 minutes (roughly 2 hours) per TB from Instant Retrieval and about 1,690.5 minutes (roughly 28.2 hours) per TB from Flexible Retrieval. A sketch of the interpolation follows this list.
– Cost optimisation: Costs were reduced by 68% by using S3 Glacier Instant Retrieval rather than S3 Standard.
– Storage scalability: With proper lifecycle management and the use of AWS storage classes, the backup and archival infrastructure scales readily with the ever-growing data volume across their operations.
– Enhanced security: Implementing AWS security best practices across the infrastructure raised the security level of all archived data.
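As referenced in the retrieval-time bullet above, the per-terabyte figures came from extrapolating a handful of smaller test restores. A minimal sketch of that interpolation, using illustrative sample numbers rather than the client's raw measurements:

    # Extrapolate a per-TB retrieval time from small test restores.
    # Sample sizes and timings below are illustrative, not the client's data.
    samples_gb = [50, 100, 200]          # test restore sizes
    minutes = [6.4, 12.5, 25.1]          # measured restore times

    rate_per_gb = sum(m / g for m, g in zip(minutes, samples_gb)) / len(samples_gb)
    print(f"~{rate_per_gb * 1024:.1f} min per TB")   # ~129 min/TB here, in line
                                                     # with the ~127.6 figure above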