On April 20th, 2022, from 21:50 UTC to 23:27 UTC, CircleCI customers saw increased timeouts and errors when loading caches, artifacts, and workspaces, and users could not access the site through app.circleci.com or app.circleci.com/dashboards. The cause was a change to our distributed tracing infrastructure that unexpectedly impacted our API Gateway.
Thank you for your understanding and patience as we worked to resolve this issue.
All timestamps are in UTC.
We were migrating our distributed tracing infrastructure from OpenCensus (which has been deprecated) to OpenTelemetry. At 21:50 the phased rollout was completed. At 21:52 our monitoring alerted us to an increase in API response latency.
API Gateway availability
We found that this was caused by a misconfiguration of OpenTelemetry which prevented our API Gateway from being able to report tracing data. A limitation of the gateway component that’s responsible for tracing means that it uses a different protocol and routing than our other services that report tracing data. This protocol and routing hadn’t been set up in OpenTelemetry.
The tracing component reports data asynchronously from request handling, but it used insufficient timeouts when doing so. As a result, too many resources were tied up waiting for tracing submissions to fail, which eventually impacted request handling.
Latency (by internal service)
Some clients, particularly ones that we own and have short timeouts, gave up and reported errors.
Client timeouts (by internal service)
The exhaustion of resources also meant that other gateway components such as authentication were affected and reported errors to clients.
5xx backend errors (by status code and internal service)
At 22:29 the old routing was deleted and new routing was created for OpenTelemetry. We manually modified the gateway component to send traces to OpenTelemetry to alleviate pressure. While we did see a decrease in timeouts, a piece of configuration had been omitted, so we were still seeing connection errors and were not yet sending traces.
At 22:55, this configuration was added to enable the appropriate receiver on OpenTelemetry, which resolved connection errors. We monitored and began seeing positive results from these changes. Due to portions of the gateway services being oversaturated for so long, their recovery was slower than desired.
At 23:04, we created a PR to make our manual change permanent, and forced a deployment of the API Gateway to remove the saturated instances and instantiate new ones. By 23:21 new instances of the gateway began to serve traffic and systems were looking healthier.
At 23:27 we moved the incident to monitoring, with services operational. We stayed in monitoring for 20 minutes to confirm all was well.
At 23:48, this incident was marked as resolved.
We have set timeouts on all gateway components that use asynchronous network calls in order to prevent this from happening in the future. Whilst we can’t change the tracing protocol, we will be changing the routing so that it works similarly to the other services that report tracing data. In addition, we are improving tests and observability that would have been valuable during and after this incident.
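To illustrate the kind of timeout described above, here is a minimal sketch in Python. It is not the gateway's code: the function names (submit_trace, report_with_timeout), the backend_delay parameter that simulates an unreachable tracing backend, and the timeout values are all hypothetical. The idea it shows is that an asynchronous submission is abandoned after a short, fixed wait, so a failing backend costs a bounded amount of time per span.

```python
import asyncio


async def submit_trace(span: dict, backend_delay: float) -> None:
    """Hypothetical stand-in for a network call to a tracing backend.

    backend_delay simulates how long the backend takes to respond; a large
    value models a backend that is unreachable or misrouted.
    """
    await asyncio.sleep(backend_delay)


async def report_with_timeout(span: dict, backend_delay: float, timeout: float) -> bool:
    """Try to submit one span, giving up after `timeout` seconds.

    Returns True if the span was submitted, False if it was dropped.
    Dropping a span loses tracing data but keeps the wait bounded.
    """
    try:
        await asyncio.wait_for(submit_trace(span, backend_delay), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        # The pending submission is cancelled by wait_for; nothing keeps waiting.
        return False


if __name__ == "__main__":
    span = {"name": "GET /example"}  # illustrative payload only
    # Backend that never answers in time: the call returns after about 50 ms.
    print(asyncio.run(report_with_timeout(span, backend_delay=5.0, timeout=0.05)))
    # Healthy backend: the span is submitted.
    print(asyncio.run(report_with_timeout(span, backend_delay=0.0, timeout=0.05)))
```

The trade-off in this sketch is deliberate: tracing data is treated as less important than request handling, so a span that cannot be delivered quickly is discarded rather than waited on.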