Post Incident Report: Elevated Errors in the CircleCI UI

Summary

On June 21, 2024, code changes were deployed to two services at 16:19 UTC, causing CircleCI customers with actively running workflows to experience errors in the UI when trying to view projects or workflows. This was isolated to the UI and had no impact on builds. Both deployments were reverted by 16:27 UTC, but customers who had started workflows between 16:19 and 16:27 continued to see errors until those workflows completed or until an additional change was deployed at 18:27 UTC. We thank our customers for their patience and understanding as we worked to resolve this incident.

What Happened

All timestamps are UTC.

Code changes were deployed to two related services at 16:19. These changes were intended to allow information about actively running workflows to be processed and served to the UI by the same service that provides information about completed workflows, as part of an ongoing effort to improve reliability and performance. The changes involved deployments to a service that processes workflow events and to the API service that serves that information to the UI, both of which had been tested via unit and integration tests. The change to process events for actively running workflows was thought to have been disabled via a feature flag. At 16:20, the API service deployment failed and was rolled back automatically, and we began to see data type errors in that service as well as in two services related to the UI. We rolled back the deployment of the service that processes the workflow events at 16:27, but the errors continued.
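For illustration only, a feature-flag gate along the following lines (all names here are hypothetical, not our actual services, flags, or code) shows how a misconfigured flag can leave new processing active even though it is believed to be off:

# A minimal sketch of the intended feature-flag gate; every name below is
# hypothetical and does not reflect CircleCI's actual implementation.
ACTIVE_WORKFLOW_FLAG = "process-active-workflow-events"

def handle_workflow_event(event: dict, enabled_flags: set) -> None:
    """Process a workflow event, skipping active-workflow events unless flagged on."""
    if event["status"] == "running" and ACTIVE_WORKFLOW_FLAG not in enabled_flags:
        # Intended behavior: the new processing stays off until the flag is enabled.
        return
    store_for_ui(event)  # hypothetical downstream write consumed by the API service

def store_for_ui(event: dict) -> None:
    print(f"storing event for workflow {event['id']}")

# With the flag misconfigured (effectively enabled), events for running
# workflows were processed during the 16:19-16:27 window instead of skipped.
handle_workflow_event({"id": "wf-1", "status": "running"}, enabled_flags={ACTIVE_WORKFLOW_FLAG})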

We first confirmed that the rollbacks had completed correctly and that the downstream services related to the UI were not serving cached data. At 17:09, we discovered that the feature flag meant to disable the new event processing had been misconfigured, so events for actively running workflows were processed between the deployment at 16:19 and the rollback at 16:27. At 17:14, we identified that the data models had not been fully updated to handle the active workflow data in a backwards-compatible way: the system was attempting to serve data for those actively running workflows that did not conform to the API or database spec. While tests had been written for the initial code change, this particular change in the data had been missed, and due to the size of the change it was also overlooked during code review.
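As a simplified sketch (the field names and types below are hypothetical, not our actual schema), a read model built around completed workflows can fail when an actively running workflow has not yet populated its completion fields:

# Simplified illustration; field names and types are hypothetical and do not
# reflect CircleCI's actual data models.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class WorkflowView:
    id: str
    status: str
    stopped_at: datetime  # assumed always present for completed workflows

def to_view(row: dict) -> WorkflowView:
    # An actively running workflow has no stop time yet, so parsing the missing
    # value raises a type error instead of producing a partial view, surfacing
    # as errors in the UI rather than in the build itself.
    return WorkflowView(
        id=row["id"],
        status=row["status"],
        stopped_at=datetime.fromisoformat(row["stopped_at"]),
    )

# Completed workflows convert cleanly; actively running ones do not.
to_view({"id": "wf-1", "status": "success", "stopped_at": "2024-06-21T16:25:00"})
to_view({"id": "wf-2", "status": "running", "stopped_at": None})  # raises TypeError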

A PR to fix the issue in the API service was created at 17:18, but several failing tests needed to be addressed before it could be merged, which took longer than expected due to test complexity. The change was deployed once all tests passed at 18:27, resolving the issue. We continued to monitor until 19:40 and then declared the incident resolved.

Future Prevention and Process Improvement

We have updated the service and its tests to properly account for the differences in data between actively running workflows and completed workflows. It was identified during the incident that the tests for these services were overly complex, which added to the time it took to fully resolve the incident, so we are also prioritizing improvements to those tests. While the API service now handles the events appropriately, the team responsible for providing the workflow events will also be making changes so that events for actively running workflows include the same fields that completed workflows have.

We recognize that deploying the read and write changes simultaneously created more complexity and added to the size of the pull request under review. We are updating our process to deploy read changes separately from write changes, and to ensure that feature flag functionality is fully documented prior to deployment.

We also identified that continuing to focus on fixing the tests during the incident added significant delay to mitigating the issue. We intend to share these learnings across the organization to encourage incident commanders and responding engineers to focus on mitigation, and to time-box solutions when other options exist.
