Azure’s approach to smooth App Service updates worldwide
In today’s insight we’ll take a quick look at how Azure updates their servers without breaching their SLA. Given how widely this PaaS offering is used worldwide, especially in critical domains like healthcare and finance, I thought it would be interesting to follow this process with you and see how Microsoft updates their servers and VMs without having to shut down the millions of applications that run on them.
But first of all, what is an App Service?
App Service is a PaaS offering for hosting web applications, APIs and mobile back ends that lets you focus on your code and nothing else.
With PaaS, you don’t have to think or worry about managing the underlying infrastructure (VMs) in any way, whether it’s hardware updates, OS updates, or security patches. That’s Azure’s job, and you shouldn’t care how or when they do it.
And that’s the true beauty of PaaS platforms like App Service. You just know that your apps are running on a secure and reliable infrastructure all the time without you having to do anything about it. Your only responsibility is to build and deploy your code.
Of course, App Service comes with all the benefits of such a platform, like load balancing, autoscaling, security, logging, etc., but that’s something we’ll take a look at some other time.
So, let’s get back to the update part.
To keep their infrastructure up to date and protected from threats, Azure applies updates to it every month. And since this is a PaaS offering, when a new threat is detected Azure will spare no resources to protect the whole infrastructure from it as soon as possible.
So, before Azure starts to update their VMs globally, they first deploy those updates to a private region used for testing these changes internally. Only after everything is thoroughly tested do they move on to applying the updates globally.
So, now that the Azure team has the green light, meaning the updates went through the testing environments without any issues, it’s time to apply them globally. And this is where it gets interesting.
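To make the "test first, then go global" idea concrete, here is a minimal sketch of such a gate. It is my own illustration, not Azure’s tooling: the function names, the region names and the choice to stop at the first failing region are all assumptions.

```python
from typing import Callable, Iterable


def staged_rollout(
    update: str,
    test_region: str,
    public_regions: Iterable[str],
    apply_update: Callable[[str, str], bool],
) -> list[str]:
    """Apply `update` to the test region first, then to the public regions.

    Returns the regions that received the update, in order.
    """
    updated: list[str] = []

    # Step 1: the internal test region acts as a gate.
    if not apply_update(test_region, update):
        return updated  # validation failed, so nothing goes global
    updated.append(test_region)

    # Step 2: only now roll out to the public regions.
    for region in public_regions:
        if not apply_update(region, update):
            break  # assumption: stop at the first failure
        updated.append(region)
    return updated


def always_ok(region: str, update: str) -> bool:
    """Hypothetical stand-in for 'apply the update and validate it'."""
    return True


def fails_in_test(region: str, update: str) -> bool:
    """Hypothetical stand-in where the private test region rejects the update."""
    return region != "private-test"


print(staged_rollout("monthly-patch", "private-test", ["region-a", "region-b"], always_ok))
print(staged_rollout("monthly-patch", "private-test", ["region-a", "region-b"], fails_in_test))
```

In the second call the update never leaves the test region, which is the whole point of the gate.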
Before we look at how they do that, however, we need to discuss three concepts: availability sets, fault domains and update domains.
Availability set
An availability set is Azure’s way of providing redundancy and fault tolerance for your VMs within a single datacenter. It does that by automatically distributing your VMs across multiple fault domains and update domains. This protects you from localized failures inside the datacenter, such as a fire in one part of the building or a hardware, network or power failure.
Note that if the whole datacenter goes down, your applications go down with it. If you want to protect your apps against datacenter failures, take a look at availability zones, which deploy your apps to separate datacenters. However, we won’t discuss this today.
Fault domain
A fault domain is a fancy name for a rack of servers. You can have up to three fault domains per availability set. For example, if you deploy three identical VMs into an availability set with three fault domains, Azure places them in three separate server racks in the same datacenter. Each fault domain has its own power source and network switch, so if the network or the power fails, only that particular rack is affected while the other two keep running.
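Here is a small sketch of that idea: three VMs spread over three racks, and a check of what survives when one rack fails. The round-robin placement and the VM names are assumptions made for illustration; Azure decides the actual placement for you.

```python
from collections import defaultdict

FAULT_DOMAINS = 3  # the maximum per availability set mentioned above


def place_vms(vm_names, fault_domains=FAULT_DOMAINS):
    """Spread VMs over fault domains round-robin (assumed policy)."""
    racks = defaultdict(list)
    for index, name in enumerate(vm_names):
        racks[index % fault_domains].append(name)
    return dict(racks)


def survivors(racks, failed_rack):
    """Return the VMs still running after one rack loses power or network."""
    return sorted(
        vm
        for rack, vms in racks.items()
        if rack != failed_rack
        for vm in vms
    )


racks = place_vms(["web-0", "web-1", "web-2"])
print(racks)                            # {0: ['web-0'], 1: ['web-1'], 2: ['web-2']}
print(survivors(racks, failed_rack=1))  # ['web-0', 'web-2']
```

Losing rack 1 takes out only the VM that lived there, while the other two keep serving traffic.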
Update domain
While fault domains protect you from unplanned and unexpected events, like downtime due to a power outage or hardware failure, update domains are all about protecting you from having all of your VMs taken down at once by planned server maintenance from Azure.
In a nutshell, an update domain is a logical grouping of VMs and their underlying physical hardware that are rebooted together during planned maintenance. So, by putting your VMs into multiple update domains, you are guaranteed that they won’t all go down at once the next time Azure starts planned maintenance, as only one update domain is ever updated at a given time.
So, now that we have these important concepts cleared up, let’s see how Azure proceeds with the global updates.
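The "one update domain at a time" guarantee is easy to check with a few lines of code. This sketch assumes five update domains and a round-robin assignment; both are assumptions for illustration rather than something stated above.

```python
UPDATE_DOMAINS = 5  # assumed number of update domains for this sketch


def assign_update_domains(vm_count, update_domains=UPDATE_DOMAINS):
    """Map each VM index to an update domain, round-robin (assumed policy)."""
    return {vm: vm % update_domains for vm in range(vm_count)}


def min_running_during_maintenance(assignment):
    """Reboot one update domain at a time and return the lowest number
    of VMs that are still running at any step of the maintenance."""
    lowest = len(assignment)
    for domain in sorted(set(assignment.values())):
        running = sum(1 for ud in assignment.values() if ud != domain)
        lowest = min(lowest, running)
    return lowest


# Seven VMs over five update domains: the busiest domain holds two VMs,
# so at least five VMs stay up at every step.
print(min_running_during_maintenance(assign_update_domains(7)))  # 5

# A single VM gets no protection: it goes down with its update domain.
print(min_running_during_maintenance(assign_update_domains(1)))  # 0
```

The second call is the reason a lone VM cannot benefit from update domains: you need at least two VMs in the set for the guarantee to mean anything.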
They start with update domains that contain only unallocated machines intended for provisioning of new apps, scaling operations, or replacing existing machines in case of failures.
After these unallocated update domains and the underlying physical hardware are successfully updated, they start moving VMs out of update domains that are yet to be updated and onto the already updated ones. Then the emptied-out update domains and their underlying hardware are updated, and the process continues until everything is up to date.
Here’s what that looks like.
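The same sequence can also be sketched as a tiny simulation. This is my simplified model of the steps described above, not Azure’s actual implementation: the domain and app names are made up, capacity limits are ignored, and VMs are spread over the updated domains round-robin.

```python
from dataclasses import dataclass, field


@dataclass
class UpdateDomain:
    name: str
    vms: list = field(default_factory=list)
    updated: bool = False


def rolling_update(domains):
    """Update every domain without patching one that still hosts VMs.

    Returns a log of the steps taken, in order.
    """
    log = []

    # Phase 1: unallocated (empty) domains are patched first.
    for domain in domains:
        if not domain.vms:
            domain.updated = True
            log.append(f"update {domain.name} (spare)")

    if not any(d.updated for d in domains):
        raise ValueError("need at least one unallocated update domain")

    # Phase 2: drain each remaining domain onto updated ones, then patch it.
    for domain in domains:
        if domain.updated:
            continue
        targets = [d for d in domains if d.updated]
        for i, vm in enumerate(list(domain.vms)):
            target = targets[i % len(targets)]
            target.vms.append(vm)
            log.append(f"move {vm}: {domain.name} -> {target.name}")
        domain.vms.clear()
        domain.updated = True  # a freshly patched domain can host VMs again
        log.append(f"update {domain.name} (drained)")
    return log


domains = [
    UpdateDomain("ud-0", ["app-a", "app-b"]),
    UpdateDomain("ud-1", ["app-c"]),
    UpdateDomain("ud-spare"),
]
for step in rolling_update(domains):
    print(step)
```

Notice that every VM only ever moves onto hardware that has already been updated, and that each drained domain becomes a landing spot for the next one.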
If you found this useful, please consider sharing it on social media and subscribing. Insights like this one take a lot of time to produce and your support motivates me to keep going. 🙂