Problem Statement: Our client, from the agriculture industry, ran blockchain applications on 100+ servers. They wanted to handle their traffic more gracefully and were therefore migrating these servers to Azure Kubernetes Service (AKS), but they faced challenges setting up a CI/CD pipeline for AKS and configuring a proper alerting mechanism.
Solution: We helped them migrate their application to Kubernetes using Git, Docker, Jenkins, and Argo CD. The fully automated pipeline built the image, pushed it to Azure Container Registry, and scanned it for vulnerabilities. Argo CD then handled continuous deployment to the AKS cluster. To handle traffic spikes, the Horizontal Pod Autoscaler and the cluster autoscaler were set up. The whole Azure infrastructure (networking, roles, encryption keys, cluster, etc.) was provisioned with Terraform, an Infrastructure as Code (IaC) tool. The application used MariaDB and Cosmos DB as its databases. In addition, Grafana-based monitoring was set up to alert on all resources in the AKS cluster.
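To illustrate how the Horizontal Pod Autoscaler reacts to a traffic spike, here is a minimal sketch of the scaling rule described in the Kubernetes documentation: the desired replica count is the current count multiplied by the ratio of the observed metric to the target metric, rounded up. The function name, the default bounds, and the sample numbers are assumptions for illustration, not values from the client's setup, and the real controller also applies a tolerance and stabilization windows that are left out here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Approximate the HPA rule: ceil(current * observed / target), clamped.

    min_replicas and max_replicas are hypothetical defaults; in a real
    HorizontalPodAutoscaler they come from the object's spec.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Keep the result inside the configured bounds.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 pods at 90% average CPU with a 60% target scale out to 6 pods.
    print(desired_replicas(4, 90, 60))
```

If the extra pods do not fit on the existing nodes, they stay unscheduled until the cluster autoscaler adds a node.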
Challenges faced after deployment: The AKS environment worked without issues for around 3 months until we encountered the first problem. The logs showed that one of the application's pods stayed in the Pending state for about 10 minutes because the cluster did not have enough nodes, which slowed the application down.
How our team handled it and proposed a new solution: Because the cluster didn’t have the required number of nodes, we checked the autoscaling logs and concluded that an extra node took around 7-8 minutes to provision. We then implemented node overprovisioning: low-priority placeholder ("fake") applications were run on a standby node, and as soon as a production application needed to be scheduled, it evicted a placeholder and took its place, reducing the time spent in the Pending state to nearly zero.
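A node overprovisioning setup of this kind can be sketched as two Kubernetes objects: a PriorityClass with a priority below that of normal workloads, and a Deployment of "pause" containers that only reserve CPU and memory. The Python sketch below builds both as plain dictionaries and prints them as JSON. All names, the priority value, the image tag, and the resource sizes are assumptions for illustration and are not taken from the client's cluster.

```python
import json


def overprovisioning_manifests(replicas: int = 1,
                               cpu: str = "1",
                               memory: str = "1Gi",
                               priority: int = -1) -> list:
    """Build a PriorityClass and a placeholder Deployment (hypothetical names)."""
    name = "overprovisioning"  # assumed name, shared by both objects
    priority_class = {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        # Must be lower than the priority of real workloads (0 by default),
        # so the scheduler can preempt the placeholders.
        "value": priority,
        "globalDefault": False,
        "description": "Placeholder pods that real workloads may preempt.",
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "namespace": "default"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "priorityClassName": name,
                    "containers": [{
                        "name": "pause",
                        # The pause container does nothing; it only holds
                        # the resources requested below.
                        "image": "registry.k8s.io/pause:3.9",
                        "resources": {
                            "requests": {"cpu": cpu, "memory": memory},
                        },
                    }],
                },
            },
        },
    }
    return [priority_class, deployment]


if __name__ == "__main__":
    bundle = {"apiVersion": "v1", "kind": "List",
              "items": overprovisioning_manifests(replicas=2)}
    print(json.dumps(bundle, indent=2))
```

The resource requests would normally be sized so that the placeholders occupy roughly one spare node. When a production pod preempts a placeholder, the evicted placeholder becomes Pending and causes the autoscaler to add a fresh standby node in the background.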
In addition to the points above, before implementing the production pipeline we faced several other challenges, which our team handled:
⦿ Auto-triggering the Jenkins job when code is pushed to a private repository.
⦿ Continuous deployment to the AKS cluster, with auto-sync between the repository and the cluster, following security best practices.
⦿ Triggering one Jenkins job from another using a Groovy script.
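As an illustration of the auto-sync point in the list above, the sketch below builds an Argo CD Application object with an automated sync policy, so that changes in the Git repository are applied to the cluster without a manual step. The application name, repository URL, path, and namespaces are hypothetical placeholders, and this is only one plausible shape for such a configuration, not the client's actual one.

```python
import json


def argocd_application(name: str,
                       repo_url: str,
                       path: str,
                       dest_namespace: str,
                       target_revision: str = "HEAD") -> dict:
    """Build an Argo CD Application manifest with automated sync enabled."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        # Assumes Argo CD is installed in the "argocd" namespace.
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "targetRevision": target_revision,
                "path": path,
            },
            "destination": {
                # In-cluster API server address used by Argo CD.
                "server": "https://kubernetes.default.svc",
                "namespace": dest_namespace,
            },
            "syncPolicy": {
                "automated": {
                    "prune": True,     # delete resources removed from Git
                    "selfHeal": True,  # revert manual drift in the cluster
                },
            },
        },
    }


if __name__ == "__main__":
    app = argocd_application(
        name="demo-app",                                  # hypothetical
        repo_url="https://example.com/org/manifests.git",  # hypothetical
        path="k8s/overlays/prod",                          # hypothetical
        dest_namespace="demo",                             # hypothetical
    )
    print(json.dumps(app, indent=2))
```

For a private repository, Argo CD additionally needs repository credentials, which are typically stored as a Kubernetes Secret and not in the Application itself.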