Post Mortem: Workflows not running and jobs failing 01-03-2022

Summary

On March 1, 2022, from approximately 14:50 UTC until approximately 16:40 UTC, jobs that accessed contexts were unable to run. To our customers, this looked like jobs appearing to start in the UI and then moving to “Failed” after a few minutes. This affected customers using GitHub on both cloud and server.

This incident was caused by a change in GitHub’s API: an endpoint we used to determine permissions was moved to a different path. To resolve the incident, we updated the path for this request. We apologize for this disruption and are taking steps to prevent a recurrence.

What Happened

On January 21, 2020, GitHub provided notice that a set of API endpoints was due to be migrated. We reviewed the changes and mistakenly concluded that they did not apply to us. On March 1, 2022, at 14:00 UTC, GitHub began a 12-hour brownout of these endpoints. All requests to them responded with a 404 status code and explanatory text in the body of the response.

Our service was unable to handle a response in this format and threw an error. The failing request was part of the logic used to determine whether a user could access a context. Because that information was not available, jobs that accessed contexts were unable to start.
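
The sketch below illustrates the shape of this failure. It is a hypothetical Python example, not our service code; the specific GitHub endpoint, function name, and error handling are assumptions made for illustration.

```python
# A minimal, hypothetical sketch of the failure mode -- not our actual
# service code. The endpoint, helper name, and error handling are assumptions.
import requests

GITHUB_API = "https://api.github.com"

def user_can_access_context(team_id: int, username: str, token: str) -> bool:
    """Decide context access by checking team membership on GitHub."""
    resp = requests.get(
        # A legacy Teams path of this shape is the kind of endpoint that
        # was browned out on March 1.
        f"{GITHUB_API}/teams/{team_id}/memberships/{username}",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github.v3+json",
        },
        timeout=10,
    )
    if resp.status_code == 200:
        # An active membership grants access to the context.
        return resp.json().get("state") == "active"
    if resp.status_code == 404 and "Not Found" in resp.text:
        # GitHub's usual "user is not a member" response.
        return False
    # Anything the code does not recognise -- such as the brownout's 404 with
    # explanation text in the body -- is raised as an error, so no permission
    # decision is reached and the job cannot start.
    raise RuntimeError(
        f"Unexpected response while checking context access: "
        f"{resp.status_code} {resp.text[:200]}"
    )
```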

In the initial stages of investigating the issue, we saw that our jobs queue was growing quickly. We determined that an update to a vital service had been merged at around the same time the symptoms started, which led us to believe that this update was the root cause. At 15:30 UTC we reverted that change and saw the queue clear out. Believing we had addressed the root cause, we moved the incident to Monitoring. The queues then started growing again and manual runs of jobs were failing, so we moved the incident back to Investigating.

When we realized that requests to the upstream provider were failing, we investigated and determined at 16:04 UTC that we were making requests to one of the endpoints that had been browned out. We quickly updated the path and finished deploying the updated service at 16:35 UTC.
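
As a rough illustration of the kind of change involved: GitHub’s Teams migration replaced legacy team-ID paths with organization-scoped paths. Whether this was the exact endpoint we updated is an assumption; the helper below is illustrative only.

```python
# Illustrative only: the exact endpoint involved is an assumption.
OLD_PATH = "/teams/{team_id}/memberships/{username}"                        # browned out
NEW_PATH = "/organizations/{org_id}/team/{team_id}/memberships/{username}"  # replacement

def membership_url(org_id: int, team_id: int, username: str) -> str:
    """Build the organization-scoped URL that replaces the legacy Teams route."""
    return "https://api.github.com" + NEW_PATH.format(
        org_id=org_id, team_id=team_id, username=username
    )
```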

Upon resolution of this incident, we took steps to ensure customers using the server version of our platform were not impacted. We prepared a patch and notified these customers with a plan for mitigation.

Future Prevention and Process Improvement

We realized during the incident that although we had been notified several times about this change, we had not recognized that it would affect us. We misread the announcement as applying only to the endpoints called out in the document, rather than to all endpoints under the ‘teams’ path. This is not something that should have happened at a company whose mission includes managing change.

We have moved breaking-change notifications from our providers into an alerts channel monitored by our senior engineers, so that notices like this one are investigated carefully when they arrive and any mitigations are put in place ahead of the deadline.

We are also adding work to our roadmap to decouple tight dependencies on upstream service providers, so that a similar issue would have a much smaller impact in the future.