Public IR Test Results Processing Failure 2022-04-29

Summary:

From 19:15 UTC to 22:50 UTC on April 29th, test result processing failed. Test results were not available for customer jobs, and test splitting did not function.

The original status page can be found here.

What Happened:

All timestamps are UTC.

We are currently migrating away from a legacy test results processing service. As part of this migration, we needed to understand the true sizes of test result blobs for customer workloads. We added instrumentation to gather this data in a change that deployed at 19:15 on April 29th. This change contained a bug that resulted in a null value being returned instead of a file.

Because this null appeared to be a valid response, no errors were raised and no alerts were triggered. The messaging system that reports test result processing also appeared to work normally, since these retrievals registered as successes. We discovered the failure at 22:50, when a team member doing a routine check of dashboards saw that no test results were being processed. We immediately rolled back the change.
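To illustrate this class of failure, here is a minimal sketch in Python. It is not our service code: the function names, the `sizes` list standing in for a metrics sink, and the exact shape of the bug are all assumptions made for illustration. It shows how an instrumentation wrapper can record a blob size yet hand a null back to its caller, and how an explicit check turns that silent null into a visible error.

```python
from typing import Callable, List, Optional


class MissingResultError(Exception):
    """Raised when a retrieval yields no data where a file was expected."""


def fetch_with_size_metric_buggy(
    fetch: Callable[[str], Optional[bytes]], key: str, sizes: List[int]
) -> Optional[bytes]:
    # Hypothetical buggy wrapper: it records the size but never returns
    # the blob, so every caller receives None without any error.
    blob = fetch(key)
    if blob is not None:
        sizes.append(len(blob))
    return None


def fetch_with_size_metric(
    fetch: Callable[[str], Optional[bytes]], key: str, sizes: List[int]
) -> bytes:
    # Hypothetical guarded wrapper: a missing blob is treated as a failure
    # rather than a successful retrieval, so error monitoring can see it.
    blob = fetch(key)
    if blob is None:
        raise MissingResultError(f"no test result blob for key {key!r}")
    sizes.append(len(blob))
    return blob
```

In this sketch the guarded version makes "nothing came back" indistinguishable from any other retrieval error, which is what allows existing failure monitoring to notice it.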

While investigating the failure the following Monday, the team discovered the full impact of the bug, and we declared a retroactive incident.

Future Prevention and Process Improvement:

While we have monitoring for test result processing failures, we did not alert based on that monitoring. We are adding those alerts and reviewing our post-deploy validation strategy. Additionally, as we continue the migration away from the legacy service, we are ensuring the new service has tests that cover this scenario.
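As a rough sketch of the kind of alert condition described here, the Python function below flags the situation where jobs keep running but no test results are processed. The function name, the per-interval counters, and the default window of three intervals are assumptions for illustration only, not our actual alerting configuration.

```python
from typing import Sequence


def should_alert(
    jobs_run: Sequence[int], results_processed: Sequence[int], window: int = 3
) -> bool:
    """Return True when each of the last `window` intervals saw jobs run
    but zero test results processed."""
    if len(jobs_run) != len(results_processed):
        raise ValueError("both series must cover the same intervals")
    if window < 1 or len(jobs_run) < window:
        # Not enough data to make a decision yet.
        return False
    recent = zip(jobs_run[-window:], results_processed[-window:])
    return all(jobs > 0 and processed == 0 for jobs, processed in recent)
```

Comparing processed results against job volume, rather than counting errors, is the design choice that matters in this sketch: a condition like this would fire even when every retrieval reports success.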