Azure's Resilience Building Blocks: Mastering Regions, Availability Zones, Domains & Pairs

In this post, I’ll walk you through Azure’s resilience architecture in plain language—the way I’d explain it to my own team. We’ll explore what Regions, Availability Zones, Fault Domains, Update Domains, and Region Pairs actually are, how they work together, and most importantly, how you can leverage them to keep your systems running when problems strike.

Azure Regions

What Are Azure Regions?

In the simplest terms, an Azure Region is a geographic area where Microsoft has built a cluster of data centres. Think of UK South, East US, or Southeast Asia. Microsoft operates more than 60 of these regions worldwide, and each one contains one or more data centres connected by a dedicated low-latency network.

The key thing to understand about regions is their independence. Each one operates with its own power grid, cooling systems, and networking infrastructure. I’ve worked with clients who initially struggled to grasp why this matters, but the benefit is clear: when East US has an issue, it doesn’t bring down West Europe. This isolation is your first line of defence against major outages.

How to Use Azure Regions

When creating Azure resources, you’ll always need to select a region. Consider these factors when choosing:

Proximity to users: Selecting regions close to your users reduces latency.
Compliance requirements: Some industries and countries have data residency requirements.
Service availability: Not all Azure services are available in every region.
Pricing: Costs can vary between regions.

Best Practices for Azure Regions

In my experience implementing cloud solutions for enterprises, here’s what works:

Deploy to multiple regions if you can’t afford significant downtime. Yes, it costs more, but I’ve seen this decision pay for itself the first time a region has issues. Remember, even cloud providers experience outages.
Implement Traffic Manager or Front Door to intelligently route users to the closest healthy region. These services are surprisingly easy to set up and provide enormous benefits for user experience.
Actually test regional failover before you need it. I’ve watched too many teams discover their failover process doesn’t work during an actual outage. Schedule regular drills.
Map out your regional dependencies. I keep a simple diagram showing which services are in which regions for every client environment. When Azure sends a regional service notice, you’ll know immediately what might be affected.
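
To make the Traffic Manager suggestion above a little more concrete, here is a minimal sketch using the Azure CLI. It assumes you are already logged in and have an app deployed in each region; the function names, the /health probe path, and all argument values are placeholders of mine rather than anything Azure prescribes.

```shell
# Create a Traffic Manager profile that sends each user to the endpoint
# with the lowest latency (Performance routing) and probes /health over HTTPS.
# Args: resource group, profile name, globally unique DNS prefix.
create_tm_profile() {
  az network traffic-manager profile create \
    --resource-group "$1" \
    --name "$2" \
    --routing-method Performance \
    --unique-dns-name "$3" \
    --protocol HTTPS --port 443 --path "/health"
}

# Register one regional deployment as an endpoint of the profile.
# Args: resource group, profile name, endpoint name, Azure resource ID of the app.
add_tm_endpoint() {
  az network traffic-manager endpoint create \
    --resource-group "$1" \
    --profile-name "$2" \
    --name "$3" \
    --type azureEndpoints \
    --target-resource-id "$4"
}
```

You would call create_tm_profile once, then add_tm_endpoint once per region. If an endpoint's health probe starts failing, Traffic Manager stops handing it out in DNS responses, which is the behaviour you want to confirm in your failover drills.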

Availability Zones

What Are Availability Zones?

If regions are Microsoft’s geographic strategy, Availability Zones (AZs) are their local redundancy strategy. In practical terms, AZs are physically separate locations within a single region, each made up of one or more datacenters and typically separated from the others by several kilometres.

Here’s what makes them special: each zone operates with its own independent power, cooling, and networking infrastructure. This means when a cooling system fails in Zone 1, Zones 2 and 3 keep running. When a construction crew accidentally cuts a fibre line to Zone 2, Zones 1 and 3 remain unaffected.

I need to mention that not all Azure regions offer Availability Zones yet—currently about 30 regions support them. Always check the latest Azure documentation when planning zone-based architectures. I’ve had clients caught off guard when they assumed a region supported zones only to discover it didn’t.
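
Rather than assuming, you can ask the CLI directly. The sketch below is one way to do it, assuming the Azure CLI is installed and logged in; the function name is my own, and the region you pass in is just an example.

```shell
# List the VM sizes in a region that can be deployed to availability zones.
# The Zones column shows which zones each size is offered in; an empty
# result suggests the region has no zonal VM sizes for your subscription.
# Arg: region name, e.g. uksouth
list_zonal_vm_skus() {
  az vm list-skus \
    --location "$1" \
    --resource-type virtualMachines \
    --zone \
    --output table
}
```

For example, list_zonal_vm_skus uksouth prints a table of sizes and zones. Treat it as a quick sanity check alongside the documentation, not a replacement for it.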

How to Use Availability Zones

When creating certain Azure resources, you can specify the Availability Zone:

# Deploy a VM to a specific availability zone
# --generate-ssh-keys creates an SSH key pair if you don't already have one
az vm create \
  --resource-group myResourceGroup \
  --name myVM \
  --image Ubuntu2204 \
  --zone 1 \
  --generate-ssh-keys

Many Azure services offer zonal (pinned to one zone) or zone-redundant (replicated across zones) options:

Virtual Machines can be deployed to specific zones
Managed Disks can be zone-redundant
SQL Database offers zone-redundant configurations
Storage Accounts can be configured for zone-redundancy (ZRS)

Best Practices for Availability Zones

Deploy critical workloads across zones to protect against datacenter failures.
Use zone-redundant services when possible (e.g., zone-redundant storage).
Design applications to handle zone failures without downtime.
Consider load balancing across zones using Azure Load Balancer or Application Gateway.
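
As a rough illustration of the first practice, the sketch below creates a handful of VMs and cycles them through zones 1 to 3. It assumes the region has three zones and that the Ubuntu2204 image alias is available in your CLI version; the function names and VM naming scheme are hypothetical.

```shell
# Map an instance number (1, 2, 3, ...) to a zone (1, 2 or 3), round-robin.
zone_for_instance() {
  echo $(( ($1 - 1) % 3 + 1 ))
}

# Create COUNT VMs named PREFIX-1 .. PREFIX-COUNT, spread across zones 1-3.
# Args: resource group, name prefix, count
create_vms_across_zones() {
  rg="$1"; prefix="$2"; count="$3"
  i=1
  while [ "$i" -le "$count" ]; do
    az vm create \
      --resource-group "$rg" \
      --name "${prefix}-${i}" \
      --image Ubuntu2204 \
      --zone "$(zone_for_instance "$i")" \
      --generate-ssh-keys
    i=$((i + 1))
  done
}
```

Calling create_vms_across_zones myResourceGroup web 3 would place web-1, web-2 and web-3 in zones 1, 2 and 3 respectively. You would still need a zone-aware load balancer in front of them for the spread to translate into availability.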

Fault Domains

What Are Fault Domains?

Let’s get down to the nuts and bolts. A Fault Domain is essentially a server rack with its own power supply and network switch. If you’ve ever walked through a datacenter, picture a tall metal rack filled with servers—that’s roughly equivalent to a fault domain.

Why does this matter? Because when a power distribution unit fails or a top-of-rack switch goes down, everything in that rack can go offline simultaneously. Fault domains give you a way to spread your workloads across different physical infrastructure within a datacenter.

In my experience explaining this concept to teams, I often describe fault domains as the “smallest unit of potential failure” in Azure’s infrastructure. They represent Microsoft’s recognition that even within a single datacenter, hardware failures happen—and they’ve built a way for you to design around that reality.

How to Use Fault Domains

In practical terms, you don’t directly select fault domains for your resources. Instead, you use Availability Sets, which handle the distribution for you. When you create an Availability Set for your VMs, Azure automatically spreads them across different fault domains:

# Create an availability set with 3 fault domains
az vm availability-set create \
  --resource-group myResourceGroup \
  --name myAvailabilitySet \
  --platform-fault-domain-count 3

Most regions support 2-3 fault domains per Availability Set. I’ve found that many teams don’t realize this number varies by region. For critical workloads, I always verify how many fault domains are supported in our target region rather than assuming the maximum.

The beauty of this approach is its simplicity—once you’ve set up your Availability Set, Azure handles the physical distribution automatically. You don’t need to worry about manually balancing workloads across different hardware.
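
To check the fault domain limit for a region before you commit to a number, a query along these lines should work. It is a sketch that assumes a logged-in Azure CLI; the function name is mine, and "Aligned" refers to Availability Sets that use managed disks.

```shell
# Show the maximum fault domain count for managed-disk ("Aligned")
# availability sets in a given region.
# Arg: region name, e.g. westeurope
max_fault_domains() {
  az vm list-skus \
    --resource-type availabilitySets \
    --location "$1" \
    --query '[?name==`Aligned`].{Location:locationInfo[0].location, MaximumFaultDomainCount:capabilities[0].value}' \
    --output table
}
```

If max_fault_domains westeurope reports 2, passing --platform-fault-domain-count 3 when creating the Availability Set in that region will be rejected.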

Best Practices for Fault Domains

Deploy multiple instances of your application tier VMs in an Availability Set.
Design applications to handle the loss of a fault domain without service disruption.
Use managed disks with VMs in Availability Sets for better fault isolation.
Consider Availability Zones for even greater fault isolation when available.

Update Domains

What Are Update Domains?

While fault domains protect against hardware failures, update domains protect against something equally important: planned maintenance.

Here’s the reality of cloud computing that many overlook: Microsoft regularly updates the underlying host servers and infrastructure. Without update domains, Microsoft would potentially need to reboot all your VMs simultaneously during these maintenance events.

Update domains solve this problem by creating logical groupings of VMs that will be updated together. When Microsoft needs to perform platform maintenance, they update one domain at a time, allowing VMs in other domains to keep running.

I often explain update domains as Microsoft’s way of saying, “We promise not to reboot all your servers at once.” It’s a simple concept, but crucial for maintaining availability during routine platform maintenance.

How to Use Update Domains

Like Fault Domains, Update Domains are primarily used with Availability Sets. When you create an Availability Set, you can specify the number of update domains:

# Create an availability set with 5 update domains
az vm availability-set create \
  --resource-group myResourceGroup \
  --name myAvailabilitySet \
  --platform-update-domain-count 5

By default, an Availability Set uses 5 update domains, and you can configure up to 20.
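
It can help to see how VMs end up spread across domains. The small simulation below assumes Azure assigns VMs to fault and update domains round-robin in creation order, which matches how the documentation illustrates it but is not a guarantee of exact placement. It runs locally and makes no Azure calls.

```shell
# Print the fault domain (FD) and update domain (UD) each VM would land in,
# assuming simple round-robin assignment in creation order.
# Args: number of VMs, fault domain count, update domain count
show_domain_layout() {
  vm_count="$1"; fd_count="$2"; ud_count="$3"
  i=0
  while [ "$i" -lt "$vm_count" ]; do
    echo "vm$i FD=$(( i % fd_count )) UD=$(( i % ud_count ))"
    i=$((i + 1))
  done
}

show_domain_layout 6 3 5
# vm0 FD=0 UD=0
# vm1 FD=1 UD=1
# vm2 FD=2 UD=2
# vm3 FD=0 UD=3
# vm4 FD=1 UD=4
# vm5 FD=2 UD=0
```

With six VMs and five update domains, vm0 and vm5 share update domain 0, so a maintenance pass on that domain takes two of your six VMs offline at once. That is worth knowing when you size the tier.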

Best Practices for Update Domains

Deploy critical applications across multiple update domains to ensure availability during planned maintenance.
Consider increasing the default number of update domains for more gradual rollouts.
Design applications to handle the temporary loss of VMs during maintenance.
Test your application’s resilience to planned maintenance scenarios.

Region Pairs

What Are Region Pairs?

Region pairs represent Microsoft’s strategy for large-scale disaster recovery. In simple terms, most Azure regions are paired with another region, typically at least 300 miles away but usually within the same geographic area or regulatory jurisdiction. Some newer regions have no pair, so check before you build a recovery plan around one.

What makes region pairs special isn’t just their distance from each other, but how Microsoft treats them:

Microsoft never updates both regions in a pair simultaneously, minimizing the chance that both regions experience maintenance issues at the same time.
If there’s a widespread outage affecting multiple regions, Microsoft prioritizes restoring at least one region from each pair before moving on to other regions.
Many Azure services automatically leverage region pairs for geo-redundant storage and disaster recovery.

Common region pairs you’ll encounter include:

East US 2 and Central US
North Europe and West Europe
Southeast Asia and East Asia

I’ve found region pairs particularly valuable when explaining disaster recovery to business stakeholders. Being able to say “even if an entire region goes down, we have systems in place to recover” provides significant peace of mind.

How to Use Region Pairs

Region pairs should factor into your disaster recovery and data replication strategies:

Use Azure Site Recovery to replicate VMs between paired regions
Configure geo-redundant storage (GRS) that automatically replicates to the paired region
Deploy Azure SQL Database with geo-replication to the paired region
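
As a starting point for the storage option above, here is a sketch that looks up a region's pair and creates a geo-redundant storage account. It assumes a logged-in Azure CLI; the function names are mine, and the storage account name you pass must be globally unique and use only lowercase letters and digits.

```shell
# Print the name of the region paired with the given region.
# Prints nothing if the region has no pair.
# Arg: region name, e.g. eastus
paired_region() {
  az account list-locations \
    --query "[?name=='$1'].metadata.pairedRegion[].name" \
    --output tsv
}

# Create a storage account whose data is replicated to the paired region (GRS).
# Args: resource group, storage account name, primary region
create_grs_account() {
  az storage account create \
    --resource-group "$1" \
    --name "$2" \
    --location "$3" \
    --sku Standard_GRS \
    --kind StorageV2
}
```

Running paired_region eastus first tells you where a GRS account created in East US will keep its secondary copy, which matters if you have data residency constraints.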

Best Practices for Region Pairs

Deploy disaster recovery resources in the paired region for your primary region.
Use geo-redundant storage for critical data that needs cross-regional replication.
Document and test region failover procedures regularly.
Consider data residency requirements when using region pairs (some pairs cross national boundaries).

Putting It All Together

Let’s visualize how these concepts work together:

Azure Global Infrastructure
│
├── Regions (e.g., East US, West Europe)
│   │
│   ├── Availability Zones (1, 2, 3)
│   │   │
│   │   └── Multiple datacenters with independent power/cooling/networking
│   │
│   └── Availability Sets
│       │
│       ├── Fault Domains (typically 2-3)
│       │   │
│       │   └── Servers that share power source and network switch
│       │
│       └── Update Domains (typically 5)
│           │
│           └── VMs that update together during planned maintenance
│
└── Region Pairs (e.g., East US paired with West US)

Designing for Maximum Resilience

For mission-critical applications, consider using multiple layers of resilience:

Deploy across multiple Availability Zones within a region
Use zone-redundant services when possible
Implement cross-region replication to the paired region
Design applications to be resilient to component failures

Remember that each layer of resilience adds complexity and cost, so balance your availability requirements with your budget and operational capabilities.

Bringing It All Together: A Real-World Perspective

After implementing dozens of Azure environments, I’ve found that resilience isn’t built with a single technology—it’s created through thoughtful layering of these building blocks. The most successful organisations don’t just understand these concepts; they deliberately design with them in mind.

For business-critical applications, I typically recommend starting with a single-region deployment using availability zones, then expanding to a secondary region as the application matures and the budget allows. This pragmatic approach balances initial costs with growing resilience needs.

Remember that each additional layer of resilience brings:

Increased infrastructure costs
More complex deployment and management processes
Greater protection against different failure scenarios

The right approach for your organisation depends on your specific workloads, budget constraints, and availability requirements. A customer-facing payment system might justify a multi-region approach with zone redundancy, while an internal reporting tool might be perfectly served by an availability set.

I hope this explanation helps demystify Azure’s resilience building blocks. These concepts have helped me architect solutions that weather both planned and unplanned outages, and I’m confident they’ll do the same for you.