Summary:
On October 22, 2024, from 14:45 to 15:52 and again from 17:41 to 18:22 UTC, CircleCI customers experienced failures on new job submissions as well as failures on jobs that were in progress. A sudden increase in the number of tasks completing simultaneously, combined with a surge of requests to upload artifacts from jobs, overloaded the service responsible for managing job output.
On October 28, 2024, from 13:27 to 14:13 and again from 14:58 to 15:50 UTC, CircleCI customers experienced a recurrence of these failures due to a similar cause.
During these incidents, customers would have seen newly submitted jobs fail to start with an infrastructure failure, and jobs that were already in progress fail in the same way. We want to thank our customers for their patience and understanding as we worked to resolve these incidents.
The original status pages for the incidents on October 22 can be found here and here. The status pages for incidents on October 28 can be found here and here.
What Happened:
(All times UTC)
On October 22, 2024, at 14:45 there was a sudden increase of customer tasks completing at the same time within CircleCI. In order to record each of these task end events, including the amount of storage the task used, the system that manages task state (distributor) made calls to our internal API gateway, which subsequently queried the system responsible for storing job output (output service). At this point, output service became overwhelmed with requests; although some requests were handled successfully, the vast majority were delayed before finally receiving a 499 Client Closed Request error response.
[Figure: Distributor task end calls to the internal API gateway]
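For readers less familiar with 499 responses: they indicate that the caller gave up and closed its connection before the gateway could return a response from the overloaded upstream. The following is a minimal Go sketch of that calling pattern, assuming a client with a request deadline; the URL, timeout, and function names are illustrative assumptions, not CircleCI's actual code.

```go
// Minimal sketch: a task-end call that gives up after a deadline.
// When the caller cancels like this while the gateway is still waiting
// on an overloaded upstream, nginx-style gateways record the abandoned
// request as 499 Client Closed Request. Names and values are illustrative.
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func recordTaskEnd(ctx context.Context, client *http.Client, taskID string) error {
	// Bound how long we are willing to wait for the gateway to respond.
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		"https://internal-gateway.example/tasks/"+taskID+"/end", nil)
	if err != nil {
		return err
	}

	resp, err := client.Do(req)
	if err != nil {
		// Deadline exceeded: the client closes the connection and the
		// gateway logs a 499 for the request it never got to answer.
		return fmt.Errorf("task end for %s not recorded: %w", taskID, err)
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	_ = recordTaskEnd(context.Background(), http.DefaultClient, "task-123")
}
```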
Additionally, at 14:50, output service received an influx of artifact upload requests, further straining resources in the service.
An incident was officially declared at 14:57. Output service was scaled horizontally at 15:16 to handle the additional load it was receiving. Internal health checks began to recover at 15:25, and we continued to monitor output service until incoming requests returned to normal levels. The incident was resolved at 15:52 and we kept output service horizontally scaled.
At 17:41, output service received another sharp increase in requests to upload artifacts and was unable to keep up with the additional load, causing jobs to fail again. An incident was declared at 17:57. Because output service was still horizontally scaled from the initial incident, it automatically recovered by 18:00. As a proactive measure, we further scaled output service horizontally at 18:02. We continued to monitor our systems until the incident was resolved at 18:22.
Following incident resolution, we continued our investigation and discovered on October 25 that our internal API gateway was configured with low limits on the maximum number of connections allowed to each of the services that experienced increased load on October 22. We immediately raised these limits so that the gateway could handle an increased volume of task end events going forward.
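As a rough illustration of the kind of limit involved, here is a minimal Go sketch using net/http's per-host connection cap. The gateway's actual implementation and values are not described in this report, so the numbers and names below are assumptions for illustration only.

```go
// Minimal sketch of the failure mode, assuming a Go net/http client.
// The point is the effect of a low per-host connection cap, not the
// gateway's real configuration.
package main

import (
	"net/http"
	"time"
)

func newLowCapClient() *http.Client {
	transport := &http.Transport{
		// With a cap this low, a burst of simultaneous task end events
		// queues behind a handful of connections, so callers see long
		// delays and eventually time out.
		MaxConnsPerHost:     10,
		MaxIdleConnsPerHost: 10,
		IdleConnTimeout:     90 * time.Second,
	}
	return &http.Client{Transport: transport, Timeout: 10 * time.Second}
}

func newRemediatedClient() *http.Client {
	transport := &http.Transport{
		// Raising the cap lets a burst of calls proceed in parallel
		// instead of serializing. The value is illustrative only.
		MaxConnsPerHost:     200,
		MaxIdleConnsPerHost: 200,
		IdleConnTimeout:     90 * time.Second,
	}
	return &http.Client{Transport: transport, Timeout: 10 * time.Second}
}

func main() {
	_ = newLowCapClient()
	_ = newRemediatedClient()
}
```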
Despite these improvements, on October 28, 2024, at 13:27, customer jobs began to fail in the same way they had on October 22. An incident was officially declared at 13:38. By 13:48, the system had recovered automatically without any intervention, and the incident was resolved at 14:13.
We continued to investigate the root cause of the delays and failures, but at 14:45 customer jobs began to fail again in the same way. We declared another incident at 14:50. To reduce the load on output service, we removed the retry logic used when requesting the storage consumed by each task from output service. This allowed tasks to complete even if storage usage could not be retrieved (to the customer's benefit). Additionally, we scaled distributor horizontally at 15:19 to handle the increased load. At 15:21, our systems began to recover. We continued to monitor and resolved the incident at 15:51.
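The change to the storage-used lookup can be pictured as a best-effort call with no retries. The sketch below assumes a hypothetical outputServiceClient interface; CircleCI's internal interfaces are not public, so the names and types here are illustrative.

```go
// Minimal sketch of a best-effort storage lookup: fetch once, and if
// output service cannot answer, complete the task anyway.
package main

import (
	"context"
	"errors"
	"log"
)

// outputServiceClient is a hypothetical stand-in for the real client.
type outputServiceClient interface {
	StorageUsed(ctx context.Context, taskID string) (int64, error)
}

// completeTask records task completion. Storage usage is requested once,
// with no retries: if output service is overloaded, we log and move on
// rather than blocking or failing the task.
func completeTask(ctx context.Context, out outputServiceClient, taskID string) {
	bytes, err := out.StorageUsed(ctx, taskID)
	if err != nil {
		log.Printf("task %s: storage used unavailable, completing anyway: %v", taskID, err)
		bytes = 0 // unknown for now; can be reconciled later
	}
	log.Printf("task %s completed, storage used: %d bytes", taskID, bytes)
}

// unavailableOutput simulates an overloaded output service.
type unavailableOutput struct{}

func (unavailableOutput) StorageUsed(ctx context.Context, taskID string) (int64, error) {
	return 0, errors.New("output service overloaded")
}

func main() {
	completeTask(context.Background(), unavailableOutput{}, "task-123")
}
```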
We returned to our investigation into the root cause of this recurring behavior and discovered an additional HTTP client in distributor that was configured with a low limit on the maximum number of connections to our internal API gateway. We raised this limit at 17:33.
Future Prevention and Process Improvement:
Following the remediation on October 28, we audited all of the HTTP clients in the execution environment and proactively raised the connection limits on any that were configured similarly to the clients in the internal API gateway and distributor.
Additionally, we identified a gap in observability for these HTTP clients that prevented us from identifying the root cause of these incidents sooner. We immediately added instrumentation to all of the clients to enable better alerting if connection pools become exhausted again in the future.
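One way to surface connection pool exhaustion in a Go client is to measure how long each request waits to obtain a connection, for example with net/http/httptrace. The sketch below is an assumption about the shape of such instrumentation, not CircleCI's actual code; in practice the wait time would feed a metrics system and an alert rather than a log line.

```go
// Minimal sketch: record how long a request waits to obtain a
// connection from the client's pool. Long waits indicate the pool
// (e.g. MaxConnsPerHost) is exhausted.
package main

import (
	"context"
	"log"
	"net/http"
	"net/http/httptrace"
	"time"
)

func tracedRequest(ctx context.Context, client *http.Client, url string) error {
	var getConnStart time.Time
	trace := &httptrace.ClientTrace{
		GetConn: func(hostPort string) { getConnStart = time.Now() },
		GotConn: func(info httptrace.GotConnInfo) {
			wait := time.Since(getConnStart)
			// Hypothetical threshold: long connection waits are the
			// signal that the pool is saturated.
			if wait > 100*time.Millisecond {
				log.Printf("connection pool wait %v (reused=%v)", wait, info.Reused)
			}
		},
	}

	req, err := http.NewRequestWithContext(
		httptrace.WithClientTrace(ctx, trace), http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	_ = tracedRequest(context.Background(), http.DefaultClient, "https://example.com/")
}
```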