2018-05-17 Workflows Outage Postmortem


On May 17 between 16:35 and 23:47 UTC we experienced an outage that caused Workflows to be unavailable or degraded for the majority of customers.

We identified the root cause of the incident as our database service hitting its scale limits, which increased the wait time for queries to run. As a result, our backend could not keep up with the growing backlog, and customers experienced long waits to run Workflows. We addressed the resource limits immediately by increasing database capacity. This process took 2 hours and 20 minutes, and once it was complete we were able to restore normal operations. Our build job and Machine execution fleets were scaled up to handle the higher volume, whereas our macOS fleet remained degraded for longer due to limited capacity.

In the weeks since the incident, we have taken several measures to prevent this failure from happening again:

  • We monitor our database utilization more rigorously.
  • We have split up queries across multiple database instances to distribute the load better.
  • We have been improving the efficiency of our queries and adding the ability to track down long-running and resource-intensive queries.
  • We’re also prioritizing long-term architectural improvements, such as making the User Interface resilient to backend outages, so that each component can remain available independently of the others.
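
As a generic illustration of what tracking down long-running queries can look like, here is a minimal Python sketch. It is not our production tooling: the `timed_query` helper, the `slow_queries` list, and the 0.5-second threshold are all hypothetical names and values chosen for the example.

```python
import time
from contextlib import contextmanager

# Hypothetical threshold: anything at or above this is recorded as slow.
SLOW_QUERY_THRESHOLD_S = 0.5

# In-memory record of (label, elapsed_seconds) pairs, for illustration only.
slow_queries = []


@contextmanager
def timed_query(label, threshold_s=SLOW_QUERY_THRESHOLD_S, clock=time.monotonic):
    """Time the enclosed block and record it if it ran too long."""
    start = clock()
    try:
        yield
    finally:
        # Record the duration even if the query raised an exception.
        elapsed = clock() - start
        if elapsed >= threshold_s:
            slow_queries.append((label, elapsed))
```

A caller would wrap each database call, for example `with timed_query("load_workflows"): ...`, and periodically review `slow_queries`. A real system would more likely send these measurements to a metrics or logging service than keep them in memory.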
