Public IR Problems Restoring Workspaces for Some Executors 2022-04-29

Summary:

At 09:53 UTC on April 29th, a code change was deployed that caused all customer jobs running on Google Cloud Platform (GCP) that used an attach_workspace step to fail. We rolled back the deployment, which resolved customer impact by 12:21 UTC. We thank our customers for their patience and understanding during this outage.

The original status page can be found here.

What Happened

All timestamps are UTC.

At 09:53 we released a code change that attempted to restore workspaces from Google Cloud Storage (GCS) for jobs running on GCP and to fall back to S3 in the case of any errors. Starting at 10:55, we began to receive support tickets from customers whose jobs were failing in the attach_workspace step. At 12:12 we reverted the pull request for the contributing code change and immediately observed attach_workspace errors declining.

We are currently implementing a change in our workspace service to write workspaces to two providers. At the time of the incident, the double-write was only activated for a handful of internal projects, so most workspaces did not yet exist in GCS. The code change on the 29th attempted to download workspaces from GCS and fall back to an alternative provider if the download failed. Almost all calls to download a workspace returned a Not Found error, but a bug in the code caused a failed download attempt to be reported as successful, so the fallback was never triggered.

Therefore, any job running on GCP that included an attach_workspace step would:

  • attempt to download the workspace from GCS;
  • receive a Not Found error;
  • erroneously report success due to the bug;
  • continue processing the step (without the required workspace present);
  • fail when something attempted to use the missing workspace.
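
The sequence above can be illustrated with a minimal Python sketch. It is not the actual workspace service code: the InMemoryStore class, the NotFoundError exception, and both restore functions are hypothetical names introduced only to show how a swallowed error skips the fallback, and what the intended behavior looks like.

```python
class NotFoundError(Exception):
    """Raised by a store when the requested workspace does not exist."""


class InMemoryStore:
    """Hypothetical stand-in for a storage provider such as GCS or S3."""

    def __init__(self, name, objects=None):
        self.name = name
        self.objects = dict(objects or {})

    def download(self, key):
        try:
            return self.objects[key]
        except KeyError:
            raise NotFoundError(f"{key} not found in {self.name}") from None


def restore_buggy(primary, fallback, key):
    """Failure mode: the download error is swallowed and success is reported."""
    try:
        return True, primary.download(key)
    except NotFoundError:
        # Bug: reporting success here means the fallback is never tried.
        return True, None


def restore_with_fallback(primary, fallback, key):
    """Intended behavior: any primary error triggers the fallback store."""
    try:
        return primary.download(key)
    except Exception:
        # Not Found (or any other primary failure) falls through to the
        # secondary provider; its own errors propagate to the caller.
        return fallback.download(key)


# Assumed setup: the workspace exists only in the fallback store.
gcs = InMemoryStore("gcs")
s3 = InMemoryStore("s3", {"ws-1": b"workspace bytes"})

ok, data = restore_buggy(gcs, s3, "ws-1")            # reports success with no data
restored = restore_with_fallback(gcs, s3, "ws-1")    # served from the fallback
```

In the buggy variant the caller sees a success flag with no workspace data, which matches the pattern of a step that continues and only fails later when the workspace is used.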

Future Prevention and Process Improvement:

The incident revealed several issues in our detection and response processes. We had monitoring for attach_workspace failures but no alerting; we have since added those alerts. We have updated our automated testing to validate the expected behavior for this specific change. We have also fixed a glitch in our rollback script that prevented us from reverting the change faster.
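
As a rough illustration of the difference between monitoring and alerting, the sketch below turns a stream of step outcomes into an alert decision based on a sliding-window failure rate. The FailureRateAlert class, the window size, the threshold, and the minimum sample count are all assumptions chosen for the example; they do not describe the alerts we actually added.

```python
from collections import deque


class FailureRateAlert:
    """Fires when the failure ratio over the most recent results exceeds a threshold."""

    def __init__(self, window=100, threshold=0.2, min_samples=20):
        # Only the most recent `window` outcomes are kept.
        self.results = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, succeeded):
        """Record one step outcome; return True if the alert should fire."""
        self.results.append(bool(succeeded))
        if len(self.results) < self.min_samples:
            # Too few samples to judge the failure rate.
            return False
        failures = self.results.count(False)
        return failures / len(self.results) > self.threshold
```

A caller would invoke record() once per completed attach_workspace step and page an on-call engineer when it returns True.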

We once again thank our customers for their patience as we worked to resolve this issue.