Continuous Kubernetes blue-green deployments on Azure using Nginx, AppGateway or TrafficManager — part 2
This is part two of my series on advanced deployment practices. If you have been following part 1, we finished with a working continuous deployment pipeline and a rudimentary automated rollback mechanism using helm. Unfortunately, we were always replacing the existing helm release, which means that if something is wrong with the new version, customers are impacted and may see errors while we roll back to the previous working version. This is obviously not ideal.
This is where blue/green deployments come in. Instead of replacing the previous version (referred to here as blue), we bring up the new version (referred to here as green) next to the existing one, but do not expose it to actual users right away. Once we have validated that the green version works correctly, we promote it to be the public version by changing the routing configuration, without downtime. If something is wrong with the green version, we can revert without users ever noticing an interruption.
Sounds easy, right? Maybe you have asked yourself the following questions:
In this post I want to introduce you to different flavours of implementing blue/green deployments on Azure using the Nginx ingress controller (any other ingress controller like Ambassador, Traefik or Kong will also work), Azure Application Gateway and Azure Traffic Manager. I will compare their capabilities, help you choose between the tools and provide samples to get you started on adopting more advanced deployment practices in your own apps.
Same as before, we are aiming to meet a couple of design considerations that we typically see in enterprise environments:
- The process and all assets should be versioned and stored with the source code to make it easy for additional microservices to implement blue/green deployments without overhead.
- We want to see, measure and compare telemetry and metrics for each deployment and also evaluate the blue and green deployments side by side.
- In case of a failure we want to roll back our application automatically and have the ability to start over with the next release.
Let’s get started
Again we will continue with the phoenix sample application. The easiest way to deploy everything in your own Azure subscription is to open an Azure shell, clone your repo there and follow these instructions to deploy at least one environment, which should bring up an instance of AKS, KeyVault, Application Insights, Azure Container Registry, Application Gateway and Traffic Manager. You can check the last post on how to configure the Azure DevOps project and permissions with the newly created service principals.
Each technology has its own Azure DevOps pipeline, which you can find here, and in the scripts folder you can find matching scripts for each type. All you have to do is configure the AZURE_CONTAINER_REGISTRY_NAME for your deployment and the AZURE_KEYVAULT_NAME for each environment secret store in the pipeline after you have imported it into your Azure DevOps project.
We will start with the Nginx deployment, which uses a single Azure load balancer in front of our Nginx ingress controller in the AKS cluster. The idea is to deploy our application multiple times in different namespaces and use the canary feature of the Nginx ingress controller, which allows us to hook up two ingress objects to the same DNS name and route them to different backend Kubernetes services depending on the existence of a custom HTTP header in the request.
To make the automation work, we also need an extra helm variable called “canary”, which defines which of the two deployments is currently the canary. Before deploying a new version we read back all deployed helm charts from the cluster, determine whether there is already a canary deployment, and perform the new deployment in either the blue or the green slot, whichever is currently unused or running an older version than the other slot.
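The slot selection described above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the script from the repo: the `slot` and `version` keys of each release record are my own assumption about how the data read back from the cluster might be shaped.

```python
from typing import Dict, List, Tuple

SLOTS = ("blue", "green")


def version_key(version: str) -> Tuple[int, ...]:
    """Turn '3.0.378' into (3, 0, 378) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def pick_target_slot(releases: List[Dict[str, str]]) -> str:
    """Return the slot that should receive the next deployment.

    `releases` is a hypothetical list of records such as
    {"slot": "blue", "version": "3.0.378"}, one per deployed release.
    """
    by_slot = {r["slot"]: r for r in releases if r["slot"] in SLOTS}

    # Prefer a slot that is not in use at all.
    for slot in SLOTS:
        if slot not in by_slot:
            return slot

    # Both slots are in use: overwrite the one running the older version.
    return min(SLOTS, key=lambda s: version_key(by_slot[s]["version"]))
```

With only blue 3.0.378 deployed, the function picks the green slot. With both slots in use, it picks whichever runs the older version.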
In the new canary deployment we feed in the canary=true helm variable, which ensures that the annotation for the Nginx canary header is set in the ingress object. At the same time we have defined the weight of the canary to be 0, which ensures that no traffic is routed there unless the header is set.
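To make this more concrete, here is a small Python sketch of the annotations I would expect the ingress object to carry for a given combination of canary and weight. The annotation keys are the canary annotations of the Nginx ingress controller. The function itself and the default header name `canary` are assumptions for illustration; the real mapping lives in the helm chart templates.

```python
from typing import Dict

ANNOTATION_PREFIX = "nginx.ingress.kubernetes.io"


def canary_annotations(canary: bool, weight: int = 0,
                       header: str = "canary") -> Dict[str, str]:
    """Build the ingress annotations for a deployment slot.

    Hypothetical helper: mirrors what a chart template could render
    from the `canary` and `weight` helm values.
    """
    if not canary:
        # A regular production ingress carries no canary annotations.
        return {}
    if not 0 <= weight <= 100:
        raise ValueError("canary weight must be between 0 and 100")
    return {
        f"{ANNOTATION_PREFIX}/canary": "true",
        # Requests with the header '<header>: always' go to the canary.
        f"{ANNOTATION_PREFIX}/canary-by-header": header,
        # Percentage of requests without the header sent to the canary.
        f"{ANNOTATION_PREFIX}/canary-weight": str(weight),
    }
```

Calling `canary_annotations(True)` yields a canary that only receives requests carrying the header, while `canary_annotations(True, weight=100)` describes the promoted state in which all traffic goes to the canary.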
As you can see below in the initial version we only have the blue version 3.0.378 deployed in the blue-calculator namespace.
Now in step 5 of our process we are deploying the green version 3.0.379 into the green-calculator namespace — but with the canary annotation in the ingress object.
By setting the canary deployment variable we have ensured that normal production traffic is still routed to the blue version. Only if we set the canary header “canary: always” in the HTTP request do we end up in the green version.
At this point we are in step 7 of our process diagram, using the routeTraffic step in Azure DevOps and the custom header to validate that the application works as expected. Only if that is the case do we allow the deployment to progress to the on success step. If not, we execute the on failure step to clean up the canary and delete the helm deployment in the green slot.
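A minimal Python sketch of this validation gate could look like the following. It only shows how a request with the canary header can be built and how a list of observed status codes could be turned into a pipeline decision. The function names and the rule "every response must be 2xx" are my own assumptions, not what the pipeline template does.

```python
import urllib.request
from typing import List


def build_canary_request(url: str, header_name: str = "canary",
                         header_value: str = "always") -> urllib.request.Request:
    """Create a request that the ingress should route to the canary."""
    return urllib.request.Request(url, headers={header_name: header_value})


def next_step(status_codes: List[int]) -> str:
    """Decide which lifecycle step to run after validation.

    Assumed rule: at least one probe was sent and all of them
    returned a 2xx status code.
    """
    healthy = bool(status_codes) and all(200 <= c < 300 for c in status_codes)
    return "on_success" if healthy else "on_failure"
```

The request object can then be passed to `urllib.request.urlopen` to probe the real endpoint, and the collected status codes handed to `next_step`.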
Assuming everything worked out we will promote the canary by upgrading the helm deployment with canary=true and weight=100. This will basically override the existing ingress object for the same DNS name in the other namespace and ensure that from now on all traffic will be routed to the green deployment slot.
Now we will simply delete the existing production slot and as a last step do one more upgrade of the existing green slot with canary=false to promote it to be the new production slot and ensure we can start over at the beginning when the next release with a new canary gets deployed.
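The three promotion steps can be written down as the helm commands a script would run, in order. This is a sketch under assumptions: the release name, chart path and namespaces are placeholders, and I assume the chart exposes the values as top-level `canary` and `weight` keys, which may differ from the actual chart.

```python
from typing import List


def promotion_commands(release: str, chart: str,
                       canary_namespace: str,
                       production_namespace: str) -> List[List[str]]:
    """Return the helm invocations for promoting a validated canary."""
    return [
        # 1. Send all traffic to the canary slot.
        ["helm", "upgrade", release, chart,
         "--namespace", canary_namespace, "--reuse-values",
         "--set", "canary=true", "--set", "weight=100"],
        # 2. Remove the old production slot.
        ["helm", "uninstall", release,
         "--namespace", production_namespace],
        # 3. Turn the former canary into the regular production slot.
        ["helm", "upgrade", release, chart,
         "--namespace", canary_namespace, "--reuse-values",
         "--set", "canary=false"],
    ]
```

Each inner list can be passed to `subprocess.run(cmd, check=True)` so that a failing step stops the promotion.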
Since we are using Nginx and Application Insights in our app that means we are getting very detailed metrics, logs and dashboards which will allow us to compare the performance and functionality of our new apps. With a little tuning we can easily set up a Grafana dashboard that allows us to compare both deployments side by side.
That is it for using Nginx!
The next type of deployment uses the Azure Application Gateway as ingress controller to achieve the same result. As of today the AKS managed Application Gateway addon is still in preview and cannot be automatically deployed via terraform. That means you have to manually activate the addon after you have deployed the cluster, as described here.
The process will look like this and is implemented in the appgw azure pipeline template:
In comparison to the Nginx ingress controller the AppGateway Ingress controller is running outside of the AKS cluster as its own dedicated managed service, but just as you would expect it will route traffic directly to the individual containers, which is why our application does not have to change to make it work.
The essential process will be the same and controlled by the helm deployment parameters, which will differ only in the ingress.class.
Since the Application Gateway does not support canary routing by HTTP header, in this case we expose the canary under a different HTTP route named /canary. As before, this allows us to validate the canary deployment after it has been deployed and to switch between the / route and the /canary route once we have confirmed that it works as expected.
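The route switch is simple enough to express as a tiny Python sketch. It only models which slot sits behind which path before and after the promotion. The function is hypothetical and says nothing about how the ingress objects are actually rendered.

```python
from typing import Dict

SLOTS = ("blue", "green")


def routes_for(production_slot: str) -> Dict[str, str]:
    """Map the public route and the canary route to deployment slots."""
    if production_slot not in SLOTS:
        raise ValueError(f"unknown slot: {production_slot}")
    canary_slot = "green" if production_slot == "blue" else "blue"
    return {"/": production_slot, "/canary": canary_slot}


def promote(routes: Dict[str, str]) -> Dict[str, str]:
    """Swap the two routes so the validated canary serves '/'."""
    return {"/": routes["/canary"], "/canary": routes["/"]}
```

Starting from `routes_for("blue")`, a call to `promote` puts green behind / and leaves blue reachable under /canary until it is cleaned up.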
In terms of metrics I have not found a way to compare the traffic monitoring statistics between our two deployment slots, which is a disadvantage, since you have to rely solely on the Application Insights metrics. However, the nice thing about the Application Gateway is that it can also act as a router between multiple clusters, if they are part of the same VNET. Be aware that the managed Application Gateway Ingress Controller addon does not work in shared mode, which means that if you want one Application Gateway for multiple clusters you have to manually deploy and configure the ingress controller via helm. This makes it possible to perform blue/green deployments not only with our own applications running in different namespaces, but also with blue and green versions running in different clusters. If you are interested in fully automating blue/green deployments with different clusters I encourage you to take a look at Bedrock.
As a final variation we want to take a look at blue/green deployments using Azure Traffic Manager, which is again an Azure managed service and works based on DNS resolution. It has the advantage that it also works across multiple clusters, which do not need to be part of the same VNET or even the same Azure region. However, since it is DNS based, it also requires us to bring up dedicated load balancer IPs for each deployment slot with dedicated DNS entries.
The overall process will look like this:
Our instance of Azure Traffic Manager is configured with the weighted traffic-routing method, which defines how a DNS query for our Traffic Manager DNS name gets resolved to one of our AKS load balancer DNS names. This allows us to decide whether the blue or the green version should be preferred by the Traffic Manager profile.
Since there is no higher level integration for Kubernetes to interact with Azure Traffic Manager, we have to rely on azure cli scripts in our deployment pipeline to configure the endpoints and the routing weight distribution. Our existing shell scripts already have the right spots for these steps and during our terraform deployment we have already set up dedicated load balancer ips and dns entries for both blue and green deployment slots.
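To illustrate what such a script step does, here is a Python sketch that computes the expected share of DNS answers per endpoint for a set of weights and builds the matching azure cli call. With weighted routing, an endpoint is returned with a probability of its weight divided by the sum of all available endpoint weights. The resource group, profile and endpoint names are placeholders, and I assume the endpoints are registered with the type azureEndpoints, which may not match your setup.

```python
from typing import Dict, List


def traffic_share(weights: Dict[str, int]) -> Dict[str, float]:
    """Expected fraction of DNS answers per endpoint for given weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one endpoint needs a positive weight")
    return {name: weight / total for name, weight in weights.items()}


def endpoint_update_command(resource_group: str, profile: str,
                            endpoint: str, weight: int) -> List[str]:
    """Build the azure cli call that changes one endpoint's weight."""
    # Traffic Manager accepts endpoint weights from 1 to 1000.
    if not 1 <= weight <= 1000:
        raise ValueError("weight must be between 1 and 1000")
    return [
        "az", "network", "traffic-manager", "endpoint", "update",
        "--resource-group", resource_group,
        "--profile-name", profile,
        "--name", endpoint,
        "--type", "azureEndpoints",  # assumption, see text above
        "--weight", str(weight),
    ]
```

Because a weight cannot be 0, taking a slot out of rotation completely means disabling its endpoint rather than lowering its weight.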
The azure traffic manager will also support more than two public endpoints and allow us to register multiple applications, clusters and instances that can be connected behind the same profile.
In terms of metrics, Azure Traffic Manager brings even fewer metrics than the Application Gateway, which is understandable since it works on DNS. Our metrics have to come from our own application metrics if we want to compare our two deployment slots.
For final comparison I wanted to summarise some of the key aspects of the different implementations and encourage you to try all of them using the samples provided.
I hope you learned something today and I am curious to hear your feedback on how you have implemented your blue/green deployment practice, if my story and samples have helped you to do it and if there is something I have missed.
The end ;)