Microsoft has revealed it took five hours to acknowledge lengthy disruptions affecting European customers in late March because the task of informing customers relied on a US-based incident manager, who was asleep at the time.
The delays affected customers in Europe and the UK for three days beginning around 9am UTC on March 24. However, at the outset, as customers struggled with extra-sluggish Azure services, Microsoft missed its 10-minute target for acknowledging issues by a wide margin.
In a postmortem, Chad Kimes, director of engineering at Azure, admitted Microsoft’s “communication during this incident was also problematic” and apologized for the frustration and confusion this caused to the 6,136 customers affected.
The technical issue itself was caused by virtual-machine capacity constraints due to a surge in demand for Azure compute resources during the COVID-19 coronavirus pandemic. This resulted in 21-minute delays affecting Microsoft’s Pipelines DevOps service for releasing new builds targeting Windows and Linux agents in Azure. The longest delay was nine hours, according to Kimes.
“The problem here is that our live-site processes have a gap for these types of incidents,” Kimes said of the communication issue.
“When incidents involve customer request failures or performance impacts, we have automated tooling that starts an incident and loops in both a DRI (designated responsible individual) and what we call a PIM (primary incident manager). The PIM is typically the person responsible for posting external communications acknowledging the incident,” he added.
“Pipeline delays are detected by different tooling, and the PIM is not currently paged for these types of incidents. As a result, while the DRI was hard at work understanding the technical issues and looking for potential mitigations, the PIM was still asleep. Only when the PIM joined the incident bridge at roughly the beginning of business hours in the Eastern United States was the incident finally acknowledged.”
Microsoft says it is planning to improve its live-site processes to “ensure that initial communication of pipeline delay incidents happens on the same schedule as other incident types”.
The company is also rolling out architectural changes to mitigate bottlenecks in spinning up new agents from its hosted agent pool.
Source: Networking - zdnet.com