Name and Sector of Client:
The client is a major player in the agri-tech and agri-business sector, with services spanning pan-India.
Environment Details
Their services include:
- Farmer training on their products, knowledge transfer on their high-yielding seeds, environmentally friendly pest management, and crop management.
- Research on high-yielding and natural-calamity-resistant crop breeds, including genetic modification of crops, which is their primary work.
- Conserving and procuring regional crop breeds.
- Providing a marketplace for organic produce.
- IoT devices and products to monitor crop fields.
- Gathering climate data and geospatial data for better crop yield.
Because of this broad business model, they gather huge volumes of data across the following areas:
- Farmer training data: personal information (name, age, education, Aadhaar, etc.), crops they produce, land area, YoY ROI, historical crop data, attendance records, performance assessments, yield improvement records, and customer feedback.
- Research data: experimental data on fertilizers, water levels, and crop varieties; genetic profiles of disaster-resistant breeds; breeding records; trait markers; genomic sequences.
- Soil and climate data: historically collected data on temperature, rainfall, soil condition, soil chemical composition, pH, nutrient levels, and other climate variables.
- Pest and disease monitoring: data on pests, diseases, their impact on crop growth, environmentally friendly approaches to tackle them, etc.
- Agri-business data: market analysis, supply chain data, economic conditions, and regulatory compliance records, among other data they produce and process.
- Geospatial data: remote-sensed data, water level data, field boundary mapping, and geographical features, collected and analysed for better yield and used to train models.
- IoT and sensor data: the IP and patents developed by their team, together with data gathered from devices and sensors on nutrient levels, crop health, and water content.
- Customer and Stakeholder data: CRM, stakeholder feedback.
- Collaboration and partnership data: data on research collaborations, and funding and grants for research and development.
The total volume of data was about 33 TB, as given in the client's inventory details; the distribution of data volume is given below:
- Farmer data (CSV, SQL): 500 GB
- Genomics data containing base pairs (FASTA, GenBank): 20 TB
- Research and analytics (SPSS, SQLite): 2 GB
- Geospatial data (GeoJSON, GML, images): 5 TB
- Agri-market data (SQL): 200 GB
- Farmer training data (images, videos, XLSX): 5 TB
- IoT data (Avro, JSON): 100 GB
- IP data (patent XML, PDF): 150 GB
Problems faced by the client:
After collecting this kind of data for years on their on-premises servers, they were facing the following issues:
- Backup Challenges
- Inefficient local backup:
- Their local backup setup is not scalable enough to keep up with the ever-evolving and massively growing data across the above-mentioned areas.
- Without proper backups, their business would fall out of line with various Indian government compliance requirements.
- Farmers' personal details must be handled properly to meet the data privacy laws of the Government of India.
- Data Loss Risk
- Single Point of Failure: Relying solely on on-premises servers for data storage increases the risk of losing critical data due to hardware failures, cyberattacks, or natural disasters.
- High Maintenance Costs
- Resource Intensive: Regular backups on on-premises infrastructure require significant IT resources, including ongoing hardware and software maintenance, which can be costly and time-consuming.
- Complex Recovery Processes
- Time-Consuming Recovery: Recovering data from on-premises backups after an incident can be a complex and slow process, potentially leading to prolonged downtime.
- Archival Pain Points
- Compliance Challenges
- Regulatory Requirements: Maintaining compliance with data retention and archival regulations, such as the PDPB, the DPDP Act, and CAP, can be complex, requiring meticulous management and secure storage.
- When archiving data, especially in the Agri-Tech sector, it’s important to comply with several regulations:
- Information Technology Act, 2000: This includes provisions for reasonable security practices and procedures to protect sensitive personal data, such as farmers’ personal information.
- Intellectual Property Laws: Ensure proper archiving of IP-related data, ensuring it is secure and accessible only to authorized individuals.
- Data Protection Laws: Safeguard the privacy and integrity of personal data, particularly during long-term storage, adhering to national and international data protection standards.
- Long-Term Data Integrity
- Data Corruption Risks: Over time, data stored on on-premises servers may be at risk of corruption or degradation, which can compromise the integrity of archived information.
- Access and Retrieval Issues
- Difficulty in Accessing Archived Data: Retrieving archived data for audits, legal inquiries, or business analysis can be cumbersome if not well-organized and securely managed.
- Physical Security
- Protection Against Disasters: Ensuring that archived data is protected against physical threats, such as fires, floods, or unauthorized access, is a significant challenge with on-premises solutions.
Proposed Solution and Architecture:
To ensure efficient and compliant data backup and archival, we will use MSP360 Backup (formerly CloudBerry) to securely store data in Amazon S3 Glacier's tiered storage classes. This solution involves carefully managed roles and policies, granting minimal access to streamline the backup and archival process while maintaining strict control over sensitive data. By leveraging S3 Glacier's cost-effective long-term storage and MSP360's robust management features, the company can achieve a scalable, secure, and compliant data preservation strategy tailored to its unique needs.
Architecture:
1. MSP360 Backup agent and job setup:
The client procured the required number of licenses to back up their data from the existing on-premises servers to the S3 Glacier tier.
- There were 7 repositories containing farmer data, genome research data, geospatial data, agri-business and agri-market data, farmer training data, IoT data, and intellectual property data in different file formats.
- We used a license for each repository and created parallel backup jobs into separate S3 Glacier buckets (a sketch of how the destination buckets could be prepared follows this step).
- Only the farmer data repository containing PII was sent to S3 Glacier Instant Retrieval so that operations such as masking and further encryption could be performed on it.
- After encryption and masking, that data is transferred to the Glacier tier.
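The sketch below is a minimal boto3 illustration of how the per-repository destination buckets could be prepared before the MSP360 backup plans are pointed at them; the bucket names, region, and KMS key are hypothetical, and the Glacier storage class itself would typically be chosen in the MSP360 job settings rather than in this code.

```python
import boto3

s3 = boto3.client("s3", region_name="ap-south-1")  # assumed region for this sketch

# Hypothetical per-repository destination buckets for the MSP360 backup jobs.
repositories = [
    "farmer-data", "genomics-data", "geospatial-data", "agri-market-data",
    "farmer-training-data", "iot-data", "ip-data",
]
kms_key_id = "arn:aws:kms:ap-south-1:111122223333:key/EXAMPLE"  # placeholder KMS key

for repo in repositories:
    bucket = f"client-archive-{repo}"  # hypothetical naming convention
    s3.create_bucket(
        Bucket=bucket,
        CreateBucketConfiguration={"LocationConstraint": "ap-south-1"},
    )
    # Enforce SSE-KMS by default so every backed-up object is encrypted at rest.
    s3.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration={
            "Rules": [{
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_id,
                }
            }]
        },
    )
    # Block all public access on the archive buckets.
    s3.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )
```

One bucket per repository keeps the parallel backup jobs, lifecycle rules, and access policies independent of each other.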
2. Mitigating PII:
- We used Amazon Macie to locate the columns and rows in the CSV files that hold personal details such as full name, phone number, Aadhaar number, insurance details, and Kisan Credit Card number.
- The Macie findings are stored in another bucket, encrypted with a KMS key. The findings are consumed by AWS Glue: the data is taken from the Glacier Instant Retrieval bucket, the flagged values are masked, and the new data with masked values is written to the S3 Glacier tier buckets (a sketch of the masking job follows this step).
- The original file with unencrypted values is still available and can be fetched, so the whole file is encrypted with KMS and the encrypted copy is transferred to S3 Glacier; the original file is then removed.
- When the data is required again, it can be reproduced by decrypting it from the S3 bucket.
- To automate the Glue workflow, the arrival of new data in the Instant Retrieval bucket raises an EventBridge event that triggers the Glue job to run again.
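Below is a minimal sketch of the Glue masking step, assuming the job receives the source and target S3 paths and the list of PII columns (derived from the Macie findings) as job arguments; the argument names, paths, and column names are hypothetical, and SHA-256 hashing stands in here for whatever masking rule is actually applied.

```python
import sys
from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession
from pyspark.sql.functions import sha2, col

# Hypothetical job arguments supplied by the workflow:
# --SOURCE_PATH, --TARGET_PATH, and --PII_COLUMNS (e.g. "full_name,phone,aadhaar").
args = getResolvedOptions(sys.argv, ["SOURCE_PATH", "TARGET_PATH", "PII_COLUMNS"])
pii_columns = args["PII_COLUMNS"].split(",")

spark = SparkSession.builder.appName("mask-farmer-pii").getOrCreate()

# Read the farmer CSV data staged in the Glacier Instant Retrieval bucket.
df = spark.read.option("header", "true").csv(args["SOURCE_PATH"])

# Replace each column flagged by Macie with an irreversible SHA-256 hash so the
# archived copy keeps referential value without exposing personal details.
for column in pii_columns:
    if column in df.columns:
        df = df.withColumn(column, sha2(col(column).cast("string"), 256))

# Write the masked output to the archive bucket; lifecycle rules move it onward.
df.write.mode("overwrite").option("header", "true").csv(args["TARGET_PATH"])
```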
3. Retention policy:
To meet the CAP and DPDP compliance standards, retention policies were defined per repository as listed below (an example lifecycle configuration for the farmer-data rule follows the list):
i. Farmer personal data: S3 Glacier Instant Retrieval (mitigation) → S3 Glacier Flexible Retrieval (1 yr) → S3 Glacier Deep Archive (10 yrs) → delete
ii. Genomics data: S3 Glacier Flexible Retrieval (5 yrs) → S3 Glacier Deep Archive
iii. Research data: S3 Glacier Flexible Retrieval (5 yrs) → S3 Glacier Deep Archive
iv. Geospatial data: S3 Glacier Flexible Retrieval (20 yrs) → S3 Glacier Deep Archive
v. SQL data: S3 Glacier Flexible Retrieval (5 yrs) → S3 Glacier Deep Archive (20 yrs) → delete
vi. Training data: S3 Glacier Flexible Retrieval (5 yrs) → S3 Glacier Deep Archive (10 yrs) → delete
vii. IoT data: S3 Glacier Flexible Retrieval (1 yr) → S3 Glacier Deep Archive (2 yrs) → delete
viii. IP data: S3 Glacier Flexible Retrieval (10 yrs) → S3 Glacier Deep Archive
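As an illustration, the farmer-data rule above could be expressed as an S3 lifecycle configuration along the following lines; the bucket name and the exact day offsets (a 30-day window in Instant Retrieval for the PII mitigation before the first transition) are assumptions for the sketch.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket holding the masked farmer data, landing in Glacier Instant Retrieval.
bucket = "client-archive-farmer-data"

s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "farmer-data-retention",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # apply to the whole bucket
            "Transitions": [
                # Assumed 30-day mitigation window before leaving Instant Retrieval.
                {"Days": 30, "StorageClass": "GLACIER"},        # Flexible Retrieval
                # ~1 year in Flexible Retrieval, then Deep Archive.
                {"Days": 395, "StorageClass": "DEEP_ARCHIVE"},
            ],
            # Delete after ~10 years in Deep Archive (30 + 365 + 3650 days).
            "Expiration": {"Days": 4045},
        }]
    },
)
```

The other repositories would each get an analogous rule on their own bucket, with the transition and expiration offsets taken from the table above.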
4. Optimised bucket policies for enhanced security:
Apart from IAM-based roles and policies, separate bucket policies were applied to avoid any vulnerabilities that might arise (a representative policy sketch follows).
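The sketch below shows one plausible shape for such a bucket policy, assuming the hypothetical bucket name used earlier: it denies any request made over plain HTTP and any upload that bypasses SSE-KMS. The actual policies would also scope access to the specific backup and Glue roles.

```python
import json
import boto3

s3 = boto3.client("s3")
bucket = "client-archive-farmer-data"  # hypothetical bucket name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Refuse any request that does not arrive over TLS.
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        },
        {
            # Refuse uploads that are not encrypted with SSE-KMS.
            "Sid": "DenyUnencryptedUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
            "Condition": {
                "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
            },
        },
    ],
}

s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```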
Outcomes and Achievements:
- Archival needs and SLA satisfaction: With our plan, tested metrics, and regular drills after deployment, we met the SLA in terms of retrieval time and mitigation standards.
- Retrieval time metrics: Using sample retrievals and some statistical interpolation, we determined the actual time required to retrieve data: about 127.6 minutes (roughly 2 hours) to retrieve 1 TB from Instant Retrieval and about 1,690.5 minutes (roughly 28.2 hours) from Flexible Retrieval.
- Cost optimisation: Costs were reduced by 68% by going with S3 Glacier Instant Retrieval rather than S3 Standard.
- Storage scalability: With proper lifecycle management and use of AWS storage classes, the backup and archival infrastructure scales readily with the ever-growing data volume across their operations.
- Enhanced security: Implementing AWS security best practices across the infrastructure boosted the security level of all archived data.