2018-05-17 Workflows Outage Postmortem

drazisil · July 12, 2018, 8:46pm

On May 17 between 16:35 and 23:47 UTC we experienced an outage that caused Workflows to be unavailable or degraded for the majority of customers.

We identified the root cause of the incident as our database service hitting its scale limits, which increased the wait time for queries. As a result, our backend could not keep up with the growing backlog, and customers experienced long waits before their Workflows ran. We addressed the resource limits immediately by increasing the database's capacity. This process took 2 hours and 20 minutes, and once it was complete we were able to restore normal operations. Our build job and Machine execution fleets were scaled up to handle the higher volume, whereas our macOS fleet remained degraded for longer due to limited capacity.

In the weeks since the incident, we have taken several measures to prevent this failure from happening again:

We monitor our database utilization more rigorously.
We have split up queries across multiple database instances to distribute the load better.
We have been working on improving the efficiency of our queries and also adding the ability to track down long-running and resource-intensive queries.
We’re also prioritizing long-term architectural improvements, such as making the User Interface resilient to backend outages, so that these components can remain available independently of one another.
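
For readers wondering what tracking down long-running queries can look like in practice, here is a minimal sketch in Python. It is an illustration only and not the implementation used in production: the `SlowQueryLog` class, its `track` method, the labels, and the threshold are all hypothetical names chosen for this example. The idea is to time each query and record the ones that exceed a threshold.

```python
import time
from contextlib import contextmanager


class SlowQueryLog:
    """Hypothetical helper: records queries that run longer than a threshold."""

    def __init__(self, threshold_seconds, clock=time.monotonic):
        self.threshold_seconds = threshold_seconds
        self.clock = clock  # injectable so the timing can be tested
        self.entries = []   # list of (label, elapsed_seconds)

    @contextmanager
    def track(self, label):
        start = self.clock()
        try:
            yield
        finally:
            # Record the query even if it raised, as long as it was slow.
            elapsed = self.clock() - start
            if elapsed >= self.threshold_seconds:
                self.entries.append((label, elapsed))
```

A caller would wrap each database call, for example `with log.track("load_workflow"): run_query(...)`, where `run_query` stands in for whatever executes the query, and then review `log.entries` or forward them to a monitoring system.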
