At 09:53 UTC on April 29th, a code change was deployed that caused all customer jobs running on Google Cloud Platform (GCP) that used an attach_workspace step to fail. We rolled back the deployment, which resolved customer impact by 12:21 UTC. We thank our customers for their patience and understanding during this outage.
The original status page can be found here.
All timestamps are UTC.
A code change released at 09:53 attempted to restore workspaces from Google Cloud Storage (GCS) for jobs running on GCP, failing back to S3 on any error. Starting at 10:55 we began to receive support tickets from customers whose jobs were failing in the attach_workspace step. At 12:12 we reverted the pull request containing the contributing change and immediately observed attach_workspace errors declining.
We are currently implementing a change in our workspace service to write workspaces to two storage providers. At the time of the incident, the double-write was enabled for only a handful of internal projects. The code change on the 29th attempted to download workspaces from GCS and fall back to S3 if the download failed. A bug in the code caused a failed download attempt to be reported as successful, so the failover was never triggered, even though almost all GCS download calls returned a Not Found error.
Therefore any job running on GCP that included an attach_workspace step would:
- attempt to download the workspace from GCS;
- receive a Not Found error from GCS;
- erroneously report success due to the bug;
- continue processing the step (without the required workspace present);
- fail when something attempted to use the missing workspace.
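The failure mode above can be sketched as follows. This is a minimal illustration with hypothetical function names, not the actual service code: the buggy handler swallows the download error and reports success, so the S3 failover path is never reached.

```python
class NotFoundError(Exception):
    """Raised when the requested object does not exist in the store."""

def download_from_gcs(key):
    # Hypothetical primary store; for most projects the workspace was
    # never double-written, so GCS returns Not Found.
    raise NotFoundError(key)

def download_from_s3(key):
    # Hypothetical fallback store that does hold the workspace.
    return b"workspace-archive-bytes"

def attach_workspace_buggy(key):
    # Bug: the handler treats a failed download as success, so the job
    # continues without the workspace present and fails later.
    try:
        return download_from_gcs(key)
    except NotFoundError:
        return None  # erroneously reported as success; failover skipped

def attach_workspace_fixed(key):
    # Fix: a failed GCS download actually triggers the S3 fallback.
    try:
        return download_from_gcs(key)
    except NotFoundError:
        return download_from_s3(key)
```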
The incident revealed several issues in our detection and response processes. We had monitoring for attach_workspace failures but no alerting; we have since added those alerts. We have also updated our automated tests to validate the expected fallback behavior for this change, and we have fixed a glitch in our rollback script that prevented us from reverting the change faster.
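A regression test for this class of bug could look like the following sketch (names are illustrative; the report does not show the actual test suite). It asserts that a primary-store failure actually invokes the fallback rather than being reported as success.

```python
def download_workspace(primary, fallback, key):
    # Try the primary store; any error fails over to the fallback.
    try:
        return primary(key)
    except Exception:
        return fallback(key)

def test_failover_triggers_on_primary_error():
    calls = []

    def failing_primary(key):
        calls.append("primary")
        raise RuntimeError("Not Found")

    def working_fallback(key):
        calls.append("fallback")
        return b"workspace"

    # The caller must receive the fallback's data, and both stores
    # must have been attempted in order.
    assert download_workspace(failing_primary, working_fallback, "k") == b"workspace"
    assert calls == ["primary", "fallback"]
```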
We once again thank our customers for their patience as we worked to resolve this issue.