Post Incident Report: January 22, 2025 - Delays starting and canceling workflows and jobs, and viewing them in the UI

Summary

From January 21, 2025 at 23:50 UTC to January 22, 2025 at 00:56 UTC, CircleCI customers experienced increased latency when starting and canceling workflows and jobs, as well as delays and difficulty viewing workflows in the UI. We appreciate your patience and understanding as we worked to resolve this incident.

The original status page can be found here.

What Happened

(all times UTC)

At approximately 23:00 on January 21, an automated alert indicated that a database instance responsible for holding archived data was almost out of free storage space. At 23:09, the team halted a blue/green deployment on the database to free a logical replication slot, suspecting the slot might be the cause of the storage exhaustion, but this did not help the database recover.
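The report does not name the database engine or tooling, though logical replication slots and blue/green deployments are most commonly associated with PostgreSQL. Purely as a hedged illustration of the kind of check involved, the sketch below queries replication slot state and retained write-ahead log on an assumed PostgreSQL instance; the connection details are hypothetical.

```python
# Hypothetical sketch: the report does not name the database engine or tooling.
# Assuming a PostgreSQL instance, this is the kind of query an operator might run
# to see whether an inactive logical replication slot is retaining WAL and
# consuming disk space.
import psycopg2

conn = psycopg2.connect("dbname=archive host=db.internal user=ops")  # hypothetical DSN
with conn, conn.cursor() as cur:
    # Inactive slots keep WAL segments on disk until they are consumed or dropped.
    cur.execute("""
        SELECT slot_name,
               active,
               pg_size_pretty(
                   pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
               ) AS retained_wal
        FROM pg_replication_slots
        ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
    """)
    for slot_name, active, retained_wal in cur.fetchall():
        print(f"slot={slot_name} active={active} retained_wal={retained_wal}")
```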

The archival service is called synchronously by the service responsible for orchestrating workflows. When the archival service’s database reached capacity, these requests started timing out, which degraded the overall performance of the workflows service. At 23:26, the workflows queue began to grow, leading to increased latency starting workflows and jobs, canceling jobs, and viewing workflows in the UI. This was not immediately attributed to the archival database issues, in part because a separate alert related to request volume fired at approximately the same time. When the queue continued to grow after that issue resolved, a separate team began investigating workflows further and, at 23:44, scaled up the event consumer responsible for processing the queue.
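To make the failure mode concrete, here is a minimal, hypothetical sketch of a synchronous archival call inside an event consumer. The endpoint, timeout, and function names are illustrative assumptions, not CircleCI's actual code; the point is that a blocked downstream call holds the consumer for the full timeout, so the upstream queue grows.

```python
# Minimal sketch of the synchronous coupling described above. Service names,
# URLs, and timeouts are hypothetical, not CircleCI's actual implementation.
import requests

ARCHIVAL_URL = "http://archival-service.internal/archive"  # hypothetical endpoint
REQUEST_TIMEOUT_SECONDS = 30  # hypothetical default timeout

def process_workflow_event(event: dict) -> None:
    """Handle one event from the workflows queue."""
    # ... orchestrate the workflow ...

    # Because archival is called synchronously, the consumer blocks here.
    # If the archival database is unhealthy, each call holds the consumer for
    # the full timeout before failing, and the queue backs up while it waits.
    try:
        requests.post(ARCHIVAL_URL, json=event, timeout=REQUEST_TIMEOUT_SECONDS)
    except requests.Timeout:
        handle_archival_failure(event)  # hypothetical fallback path

def handle_archival_failure(event: dict) -> None:
    # Placeholder for retry or alerting logic.
    pass
```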

The team investigating the unhealthy database instance promoted a read replica to a standalone primary at 23:55. By 00:03, the workflows queue depth returned to normal, which resolved the workflow latency and UI impacts. However, at around the same time, Linux machine jobs began to queue downstream due to errors when provisioning instances with our cloud provider, which was actively investigating increased API error rates to the provisioning endpoint in our region. Requests began to be fulfilled around 00:32, but due to the volume of requests being processed, we also experienced rate limiting that extended the duration of the impact. Our queues returned to normal levels at 00:56, and the incident was resolved at 01:26 after we confirmed there was no further impact.
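The report does not describe how provisioning requests were retried. As a general pattern only, and not a description of CircleCI's implementation, exponential backoff with jitter is the usual way to drain a backlog of requests against a rate-limited provider API without repeatedly tripping the limit:

```python
# General pattern, not CircleCI's implementation: retry provisioning calls with
# exponential backoff and jitter so a backlog of requests does not hit a
# rate-limited provider API all at once.
import random
import time

def provision_with_backoff(create_instance, max_attempts: int = 6):
    """create_instance is any callable that raises on API errors or rate limits."""
    for attempt in range(max_attempts):
        try:
            return create_instance()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep somewhere in [0, 2**attempt) seconds, capped,
            # so retries from many workers spread out over time.
            time.sleep(random.uniform(0, min(2 ** attempt, 60)))
```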

Post-incident, the team continued to investigate. The root cause was determined to be a code change made to a function in the impacted database on January 16, which unintentionally generated an excessive volume of log messages and consumed the instance’s remaining storage. The function has been updated to fix this behavior.
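The offending function is not included in the report, and it lived in the database rather than in application code. Purely as an illustration of the class of bug, the sketch below contrasts per-item logging on a hot path, which floods log storage at high volume, with a single summary line per batch; all names are hypothetical.

```python
# Illustrative only: the actual change was to a database function. This generic
# example shows the pattern (per-item logging on a hot path) rather than the
# function in question.
import logging

logger = logging.getLogger("archiver")

def archive_batch_noisy(rows):
    for row in rows:
        # One log line per row: at high volume this floods log storage.
        logger.info("archiving row %s", row["id"])
        store(row)

def archive_batch_quiet(rows):
    for row in rows:
        store(row)
    # One summary line per batch instead.
    logger.info("archived %d rows", len(rows))

def store(row):
    # Placeholder for the actual write to archival storage.
    pass
```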

Future Prevention and Process Improvement

We have added a maximum duration to the retry policy the workflows service uses when archiving workflows, allowing archival requests to fail faster than the default timeout and limiting the potential impact on the workflows service should there be a future issue with the archival service. Longer term, we intend to shift the workflow archival process to an event-based model to decouple the two services.
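The actual retry library and durations are not described in the report. As a hedged sketch of the general shape of the change, a retry loop bounded by an overall deadline gives up early instead of letting retries compound up to the default timeout:

```python
# Hedged sketch: the actual retry policy, durations, and archival client are not
# described in the report. This shows a retry loop bounded by a maximum overall
# duration so the caller fails fast instead of blocking behind a slow dependency.
import time

MAX_RETRY_DURATION_SECONDS = 5.0   # hypothetical cap on total time spent retrying
RETRY_INTERVAL_SECONDS = 0.5       # hypothetical pause between attempts

def archive_with_deadline(archive_once):
    """archive_once is any callable that raises on failure."""
    deadline = time.monotonic() + MAX_RETRY_DURATION_SECONDS
    while True:
        try:
            return archive_once()
        except Exception:
            if time.monotonic() + RETRY_INTERVAL_SECONDS >= deadline:
                # Give up early and let the workflows service carry on, rather
                # than holding it behind an unhealthy archival dependency.
                raise
            time.sleep(RETRY_INTERVAL_SECONDS)
```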

While alerting did indicate an issue with the archival database, the team had little time to address the problem before it caused customer impact because the database was filling significantly faster than previously forecast. We will be implementing forecast and anomaly monitoring for our databases to alert us to unusual activity before it reaches critical levels.
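The monitoring system to be used is not named in the report. As one hedged example of what forecast-based alerting can look like, a simple linear extrapolation over recent free-space samples estimates time-to-full and can page well before a fixed threshold would:

```python
# Hedged example of forecast-based alerting; the report does not say which
# monitoring system will be used. Given recent (timestamp, free_bytes) samples,
# fit a line and alert if the instance is projected to run out of space soon.
import numpy as np

def hours_until_full(samples: list[tuple[float, float]]) -> float:
    """samples: (unix_timestamp_seconds, free_bytes); returns inf if not shrinking."""
    times = np.array([t for t, _ in samples])
    free = np.array([f for _, f in samples])
    slope, _intercept = np.polyfit(times, free, 1)  # free-space trend in bytes/second
    if slope >= 0:
        return float("inf")                         # free space is not decreasing
    seconds_to_zero = -free[-1] / slope             # time until free space hits zero
    return seconds_to_zero / 3600.0

def should_alert(samples, threshold_hours: float = 12.0) -> bool:
    # Hypothetical rule: page if the instance is projected to fill within 12 hours.
    return hours_until_full(samples) < threshold_hours
```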