[Azure Service Health] Critical: Activated: Azure Resource Manager: Global
Incident Report for Deutsche Fiskal
Resolved
This incident has been resolved.
Posted Jan 24, 2024 - 10:48 CET
Investigating

What happened?


Between 01:57 and 08:58 UTC on 21 January 2024, customers attempting to use Azure Resource Manager (ARM) may have experienced issues when performing resource management operations. This impacted ARM calls made via the Azure CLI, Azure PowerShell, and the Azure portal, as well as downstream Azure services that depend on ARM for their internal resource management operations. While the impact was predominantly experienced in East US, South Central US, Central US, West Central US, and West Europe, due to the global nature of ARM, impact may have been experienced to a lesser degree in other regions.
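For illustration only (this example is not part of the original report): any control-plane operation, such as listing resource groups, is routed through ARM, so a call like the following via the Azure SDK for Python would have been subject to the failures described above during the impact window. The subscription ID is a placeholder.

```python
# Illustration of a control-plane (ARM) request; these were the calls affected.
# The subscription ID below is a placeholder, not a value from the report.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

credential = DefaultAzureCredential()
client = ResourceManagementClient(credential, subscription_id="<your-subscription-id>")

# Listing resource groups is a resource management operation served by ARM,
# just like the equivalent Azure CLI, Azure PowerShell, or portal actions.
for rg in client.resource_groups.list():
    print(rg.name, rg.location)
```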




What do we know so far?


In June 2020, ARM deployed a preview feature to support continuous access evaluation (https://learn.microsoft.com/entra/identity/conditional-access/concept-continuous-access-evaluation), which was enabled only for a small set of tenants. Unbeknownst to us, this preview feature contained a latent code defect that caused ARM nodes to fail on startup whenever ARM could not authenticate to an Entra tenant enrolled in the preview. On 21 January 2024, an internal maintenance process made a configuration change to an internal tenant that was enrolled in this preview. This triggered the latent code defect and caused ARM nodes, which are designed to restart periodically, to fail repeatedly upon startup. ARM nodes restart periodically to accommodate transient changes in the underlying platform and to protect against accidental resource exhaustion such as memory leaks. Due to these failed startups, ARM began experiencing a gradual loss of capacity to serve requests. Over time, this impact spread to additional regions, predominantly affecting East US, South Central US, Central US, West Central US, and West Europe. Eventually, the loss of capacity overwhelmed the remaining ARM nodes, creating a self-reinforcing feedback loop that led to a rapid drop in availability.
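To make the failure mode concrete, the sketch below is a hypothetical simplification (not ARM's actual code): node startup treats a single tenant's authentication failure as fatal, so every periodic restart fails and serving capacity steadily drains. All names here are illustrative assumptions.

```python
# Hypothetical sketch of the defect pattern described above -- not ARM source code.
import sys


def acquire_token(tenant_id: str) -> str:
    """Placeholder for authenticating to an Entra tenant enrolled in the preview."""
    # After the 21 January configuration change, this call began failing
    # for one internal tenant.
    raise ConnectionError(f"could not authenticate to tenant {tenant_id}")


def start_node(preview_tenant_ids: list[str]) -> None:
    for tenant_id in preview_tenant_ids:
        try:
            acquire_token(tenant_id)
        except ConnectionError as exc:
            # Latent defect: one tenant-specific failure aborts the entire startup.
            # Because nodes restart periodically by design, every restart now fails,
            # gradually reducing the capacity available to serve requests.
            sys.exit(f"startup aborted: {exc}")
    print("node started")


# The remediation under "What happens next?" amounts to logging the tenant-specific
# failure and allowing the node restart to proceed rather than aborting.
```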




How did we respond?


At 01:59 UTC, our monitoring detected a decrease in availability and we began an immediate investigation. Automated communications to a subset of impacted customers began shortly thereafter and, as impact to additional regions became better understood, we decided to communicate publicly via the Azure Status page. The causes of the issue were understood by 04:25 UTC, and we mitigated the impact by making a configuration change to disable the preview feature. The mitigation began rolling out at 04:51 UTC, and all regions except West Europe were recovered by 05:30 UTC. The recovery of West Europe was slowed by a retry storm from failed calls, which intensified traffic in that region. We increased throttling of certain requests in West Europe, which eventually enabled its recovery by 08:58 UTC, at which point all customer impact was fully mitigated.
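As background on the retry-storm dynamic mentioned above (an illustration, not guidance from the report): when many clients retry failed calls immediately, the extra traffic can keep an already degraded region saturated. The common client-side counterpart to server-side throttling is exponential backoff with jitter, sketched below with assumed names and limits.

```python
# Illustrative exponential backoff with jitter; names and limits are assumptions,
# not values taken from the incident report.
import random
import time


def call_with_backoff(operation, max_attempts: int = 5, base_delay_s: float = 1.0):
    """Retry `operation`, spacing attempts out so clients do not retry in lockstep."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Give up after the final attempt.
            # A randomized, exponentially growing delay avoids synchronized retries
            # that would otherwise intensify traffic against a degraded service.
            time.sleep(random.uniform(0, base_delay_s * (2 ** attempt)))
```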




• 21 January 2024 @ 01:59 UTC – Monitoring detected a decrease in availability for the ARM service, and on-call engineers began an immediate investigation.


• 21 January 2024 @ 02:23 UTC – Automated communications to impacted customers began.


• 21 January 2024 @ 03:04 UTC – Additional ARM impact was detected in East US and West Europe.


• 21 January 2024 @ 03:24 UTC – Due to additional impact identified in other regions, we raised the severity of the incident and engaged additional teams to assist in troubleshooting.


• 21 January 2024 @ 03:30 UTC – Additional ARM impact was detected in South Central US.


• 21 January 2024 @ 03:57 UTC – We posted broad communications via the Azure Status page.


• 21 January 2024 @ 04:25 UTC – The causes of impact were understood, and a mitigation strategy was developed.


• 21 January 2024 @ 04:51 UTC – We began the rollout of this configuration change to disable the preview feature.


• 21 January 2024 @ 05:30 UTC – All regions except West Europe were recovered.


• 21 January 2024 @ 08:58 UTC – West Europe recovered, fully mitigating all customer impact.




What happens next?


• We have already disabled the preview feature through a configuration update. (Completed)


• We are gradually rolling out a change so that node restarts proceed even when a tenant-specific call fails. (Estimated completion: February 2024)


• After our internal retrospective is completed (generally within 14 days), we will publish a "Final" PIR with additional details and learnings.




How can we make our incident communications more useful?


You can rate this PIR and provide any feedback using our quick 3-question survey:

Posted Jan 24, 2024 - 10:46 CET