Postmortem: Incidents of October 22nd–29th

Between October 22nd and October 29th, 2019, CircleCI experienced six incidents that affected the availability and reliability of our platform, causing delays and errors for customers using the UI and attempting to run builds. The issues affected the following areas of our application:

  • Onboarding and pipelines experiences
  • Workflows
  • Machine and remote Docker jobs
  • CircleCI UI

The underlying issues for each incident are largely independent, and we will detail each below. Each of the six incidents has been fully resolved, and no action is needed from any user at this time.

We know that consistency and reliability are incredibly important to our users, and we are deeply sorry for any disruption these outages may have caused to your work. We have performed an internal postmortem for each incident and have taken actions to make our system more resilient against the underlying causes. We have also worked to decrease our time to recovery in the event of an incident and have taken extensive measures to reduce the likelihood of these incidents occurring in the future.

October 22nd - New onboarding & pipelines experiences inaccessible for some users

What happened?

On October 22nd, 2019, at 20:14 UTC we received synthetic testing alerts that our user onboarding and user settings pages were experiencing loading failures. Upon investigation, we found that users of our new user interface were receiving either white screens or pages that were not fully loading.

We identified the cause of this issue to be errors retrieving page assets, such as JavaScript and CSS, from our CDN provider. While diagnosing the issue, some of our staff were able to retrieve the underlying assets directly while others could not, depending on their DNS configuration. This led us to believe that the problem was an intermittent external DNS resolution issue affecting our CDN’s ability to retrieve our static assets from S3. At 20:57 UTC, we pushed a patch that allowed our web servers to serve these page assets directly. After the patch was deployed, we confirmed that by 21:14 UTC the issue had been remediated and users were no longer experiencing problems with the website.

On October 23rd, 2019, AWS announced that there had been a DDoS (Distributed Denial of Service) attack on their Route 53 DNS service. The attack began on October 22nd, 2019, at 18:30 UTC and was resolved by October 23rd, 2019, at 02:30 UTC. The attack prevented our CDN provider from properly resolving the locations of our assets, causing the white screens and partial page loads.

Who was affected?

Users of the new CircleCI user interface or the onboarding experience who accessed the site between 20:14 UTC and 21:16 UTC may have been affected.

What did we do to resolve the issue?

Once we identified that our CDN provider was no longer able to serve our page assets, we pushed a change that allowed our web servers to serve assets directly. Once we pushed this change, our internal testing and synthetic testing confirmed that pages were loading properly again.

What are we doing to make sure this doesn’t happen again?

We will continue to monitor the health of our cloud provider services with both internal and synthetic testing to reduce our time to recovery in the event that a similar outage occurs.

October 22nd - Drop in number of running workflows

What happened?

On October 18th, 2019, at 00:05 UTC, we received a notification from AWS that a MongoDB replica set member was running on an EC2 instance suffering from hardware degradation and scheduled to be decommissioned. At 18:54 UTC the same day, we began receiving exceptions for connection timeouts to that MongoDB replica. Our site reliability engineering (SRE) team hid the replica and shut down the EC2 instance; the exceptions stopped and service returned to normal. The instance was left shut down so the SRE team could schedule an appropriate time to migrate it to new hardware.

On October 22nd, 2019, at 09:09 UTC, the SRE team began migrating the EC2 instance to new hardware and adding the replica back to the MongoDB cluster. This process finished by 09:31 UTC, with the node successfully unhidden and handling connections. At 22:46 UTC, we received an alert that there had been a sudden drop in the number of scheduled running workflows. Upon investigation, we noticed high CPU load on the EC2 instance our SRE team had returned to service earlier in the day. We quickly rehid the MongoDB replica on this instance. While most of our services responded quickly to this change and automatically disconnected from the hidden replica, a service providing our core API was slow to respond and remained connected. We restarted this service manually, and workflow performance returned to normal by 23:25 UTC. The degraded EC2 instance was kept out of service.


[Chart: Workflow processing rate]

AWS sent a notification on October 23rd, 2019, at 11:56 UTC stating that the hardware we had migrated to on the 22nd was also suffering from degradation. The EC2 instance and its EBS volumes were subsequently rebuilt from scratch, and the replica was unhidden and brought back into service on October 23rd at 13:23 UTC.

Who was affected?

Some customers may have experienced problems scheduling or running workflows between 22:46 UTC and 23:25 UTC on October 22nd.

What did we do to resolve the issue?

Once the issue was discovered, the CircleCI SRE team hid the problematic MongoDB replica from the replica set. Because some connections to the hidden replica remained open, the responsible service was restarted manually.

What are we doing to make sure this doesn’t happen again?

The CircleCI engineering team is adjusting client timeout values for MongoDB connections in order to eliminate the need to restart services manually when replica set members are hidden or otherwise fail.
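
As an illustration of this kind of change, the sketch below configures the official MongoDB Java driver with tighter server selection and socket timeouts. The hostnames and the specific values are assumptions made for the example and do not reflect CircleCI's actual configuration; the intent is simply that a client configured this way gives up on a hidden or unresponsive replica set member quickly rather than holding connections open.

```java
import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;

import java.util.concurrent.TimeUnit;

public final class MongoClientFactory {
    public static MongoClient build() {
        MongoClientSettings settings = MongoClientSettings.builder()
                // Hypothetical replica set members for the example.
                .applyConnectionString(new ConnectionString(
                        "mongodb://mongo-1,mongo-2,mongo-3/?replicaSet=rs0"))
                // Give up on server selection after 5s instead of the 30s default,
                // so requests fail fast when a member is hidden or unreachable.
                .applyToClusterSettings(b ->
                        b.serverSelectionTimeout(5, TimeUnit.SECONDS))
                // Bound how long a single connect or read may hang on a degraded member.
                .applyToSocketSettings(b -> b
                        .connectTimeout(5, TimeUnit.SECONDS)
                        .readTimeout(10, TimeUnit.SECONDS))
                .build();
        return MongoClients.create(settings);
    }
}
```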

October 23rd - Machine and remote Docker provisioning delays

What happened?

On October 23rd at 20:18 UTC, we began receiving error responses from GCP indicating that the zones in which we run machine executors and remote Docker instances were out of capacity. As a result, we were unable to create new VMs for customer jobs.

We were delayed in correctly diagnosing this root cause due to a few tangential, but ultimately unrelated, issues:

  • The system responsible for tracking VM creation requests had a bug that caused it to mishandle creation requests rejected by GCP due to insufficient zone capacity. As a result, these VMs were left in a “starting” state within CircleCI systems rather than being transitioned to an “error” state (a simplified sketch of the corrected handling follows this list).
  • The system responsible for automatically scaling our pool of ready VMs makes its decisions based on the number of jobs waiting to execute, and the number of existing VMs in the pool. The large number of VMs mistakenly left in a “starting” state prevented this autoscaler from meeting the underlying demand.
  • During the incident, we began hitting rate limits while calling GCP APIs. Due to a recent change in our error handling, these rate limit errors caused a worker responsible for processing “destroy-vm” requests to be killed.
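
For context on the first issue above, here is a heavily simplified, hypothetical sketch of the corrected state handling: a creation request rejected by the cloud provider moves the tracking record out of the “starting” state, so the autoscaler no longer counts it as capacity that is about to arrive. The types and method names are illustrative only and do not reflect CircleCI's actual code.

```java
/** Hypothetical tracking record for a requested VM and its lifecycle state. */
class VmRecord {
    enum State { STARTING, RUNNING, ERROR }
    State state = State.STARTING;
    String lastError;
}

class VmProvisioner {
    /**
     * Sketch of the corrected behavior: any rejected creation request
     * (e.g. a zone-capacity error) transitions the record out of STARTING,
     * so the autoscaler does not count it as a VM that is about to appear.
     */
    void handleCreateResponse(VmRecord record, int httpStatus, String errorReason) {
        if (httpStatus >= 200 && httpStatus < 300) {
            record.state = VmRecord.State.RUNNING;
        } else {
            record.state = VmRecord.State.ERROR;
            record.lastError = errorReason;
        }
    }
}
```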

At 21:03 UTC we initiated a failover to zones in our alternate GCP region and delays began recovering.


[Chart: VM creation errors]

Who was affected?

Over the course of this incident, customers using machine executors and remote Docker experienced delays. A subset of these tasks (approximately 10%) was marked as failed and would have required a manual re-run.

What did we do to resolve the issue?

We resolved the issue by failing over to our alternate GCP region.

What are we doing to make sure this doesn’t happen again?

A number of fixes have been deployed to address the tangential issues identified during this incident. Bug fixes were made to correctly handle zone capacity errors and fix the error handling in our “destroy-vm” worker. We have also adjusted our retry backoff times to respond more conservatively to rate limiting by GCP.
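
As an illustration of the backoff adjustment, the sketch below shows a generic retry loop with capped exponential backoff and full jitter. This is not CircleCI's actual code; the RateLimitedException type and the base and cap values are assumptions chosen for the example.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

/** Illustrative retry helper: capped exponential backoff with full jitter. */
public final class Backoff {
    private static final Duration BASE = Duration.ofSeconds(1);
    private static final Duration CAP = Duration.ofSeconds(60);

    /** Hypothetical exception representing a rate-limit (HTTP 429) response. */
    public static class RateLimitedException extends RuntimeException {
        public RateLimitedException(String message) { super(message); }
    }

    public static <T> T callWithRetries(Supplier<T> call, int maxAttempts)
            throws InterruptedException {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.get();
            } catch (RateLimitedException e) {
                if (attempt + 1 >= maxAttempts) {
                    throw e; // retries exhausted; surface the error to the caller
                }
                // Delay grows 1s, 2s, 4s, ... up to the cap; full jitter spreads
                // retries out so a fleet of workers does not retry in lockstep.
                long cappedMillis = Math.min(CAP.toMillis(),
                        BASE.toMillis() << Math.min(attempt, 20));
                Thread.sleep(ThreadLocalRandom.current().nextLong(cappedMillis + 1));
            }
        }
    }
}
```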

Additionally, we’ve introduced additional instrumentation to the affected services to increase observability, particularly around error responses from GCP.

Lastly, we are investigating the possibility of changing our default zones for machine executors and remote Docker instances to span regions rather than being colocated in a single region. We believe doing this will decrease our chances of hitting capacity in multiple zones simultaneously.

October 24th - Machine and remote Docker provisioning delays

What happened?

On October 24th, at 18:33 UTC we began receiving error responses from GCP indicating that the zones in which we run machine executors and remote Docker instances were out of capacity. We also began receiving rate limit error responses from GCP at this time.

Focusing first on the rate limit issues, we reduced the number of worker processes making calls to the GCP API. This successfully addressed the rate limit errors, but we continued to receive errors related to zone capacity.

At 19:10 UTC we initiated a failover to zones in our alternate GCP region and delays began recovering. By 19:16 UTC delays had returned to normal levels and the incident was resolved.

Who was affected?

Customers using machine executors and remote Docker experienced delays. A subset of these tasks (approximately 30%) was marked as failed and would have required a manual re-run.

What did we do to resolve the issue?

We resolved the issue by failing over to our alternate GCP region.

What are we doing to make sure this doesn’t happen again?

In addition to the actions taken in response to the related incident on Oct 23rd, we have automated the failover process to our alternate zones.

October 28th - UI errors and prolonged build delays

What happened?

On October 28th, 2019, at 13:36 UTC, Datadog Synthetics monitors alerted us to issues loading UI pages. The tests showed intermittent failures, and subsequent manual and automated testing showed a small but relatively consistent error rate. A set of Nginx instances responsible for proxying this traffic were found to be failing their Kubernetes readiness checks. The Nginx proxy pods were restarted and the issue was resolved at 14:30 UTC.

At 17:30 UTC, a deploy of a key service triggered a substantial increase in API errors, both internally and externally, causing a brief period of elevated external API errors from approximately 17:33 to 18:01 UTC. We again discovered that a number of Nginx proxy instances were unhealthy. Subsequent investigation revealed that the underlying cause was the eviction of kube-proxy on a small number of Kubernetes workers. Without kube-proxy running, routing rules on the affected hosts were no longer updated in response to service deployments, leaving the Nginx proxy instances on those hosts unable to communicate with the backing service after the deploy. We moved the affected Nginx proxy instances to healthy hosts and the API recovered at 18:00 UTC.


[Chart: API errors]

Internally, the request errors led to a backlog in a system called the run queue. Even after the API errors were resolved, however, the backlog in this queue persisted. At 18:19 UTC we identified the cause to be excess load on the database instance storing this queue, which had reached 100% CPU utilization. A number of attempts were made to reduce load on this database by scaling down services and limiting the rate at which work was added to the run queue, but these were unsuccessful.


[Chart: Run queue growing]

By 20:00 UTC we had identified the underlying cause to be a database query whose performance degraded progressively as the run queue grew, and we began work to develop and deploy a more performant query. At 20:53 UTC an updated query was applied, but no change in performance was observed. At 21:18 UTC a new database index was proposed and added, but it also failed to improve performance. Finally, at 21:40 UTC a new, more performant query was applied and throughput improved immediately.


[Chart: Run queue clearing]

At this point a number of other systems were scaled up to help drain the backlog of delayed jobs. By 22:20 UTC the run queue had been cleared, but additional delays remained in downstream systems as the backlog was processed. By 23:09 UTC, delays for non-macOS jobs had returned to normal levels. By 23:33 UTC, delays for macOS jobs had also returned to normal levels and the incident was declared resolved.

Who was affected?

The first UI issue affected users accessing our UI, particularly pages that include data on user settings or onboarding. The second UI issue affected all customers using CircleCI during this time period. Complex workflows were more likely to be dramatically affected because every serial job was delayed. macOS users experienced delays even after the run queue issue was fixed.

What did we do to resolve the issue?

We altered a poorly performing query and temporarily increased concurrency to flush delayed work through the system.

What are we doing to make sure this doesn’t happen again?

Following the first UI incident, continued investigation determined that some kube-proxy pods were being evicted from Kubernetes workers. We have updated the configuration of these pods such that they are much less likely to be evicted, and if evicted, will recover automatically.

In response to the run queue delays we have load-tested and optimized the affected query and improved our internal guidance for how to test and roll out changes to hot-path queries.

October 29th - 504 gateway timeout errors

What happened?

On October 29th, 2019, at 15:10 UTC, CircleCI engineers detected slow response times and an elevated error rate for an internal GraphQL service. At 15:20 UTC, instances of the GraphQL service began failing Kubernetes liveness checks, which led Kubernetes to restart those instances. The restarts caused in-flight requests to be aborted, leading to a spike in the error rate for the service; this, in turn, led to errors and timeouts loading parts of the CircleCI UI. The restarts also reduced the number of available instances of the GraphQL service, putting increased load on the remaining instances and compounding the underlying problem.

Initially, we suspected that the restarts were caused by misconfigured memory settings for the GraphQL service. These settings were corrected and the service was redeployed at 16:05 UTC, but the restarts continued and the error rate for the service remained elevated.

At approximately 16:35 UTC, load on the GraphQL service began decreasing and liveness checks began passing again. With liveness checks passing, the instance restarts ceased and error rates returned to normal levels. The 504 Gateway Timeout errors in the UI subsided shortly thereafter, but slow response times from the GraphQL service persisted.

Further investigation via thread dumps found that a majority of the threads in the GraphQL service were blocked in calls to resolve the address of another internal CircleCI service. Of these threads, only one was performing an address lookup with the system resolver. The remaining threads were instead waiting for a lock within Java’s DNS caching code. We subsequently discovered that our practice of disabling this cache (by setting the cache’s TTL to 0 seconds) was inadvertently causing these address lookups to be executed sequentially rather than in parallel. This locking code is designed to prevent redundant lookup requests from being made while populating the cache, but in our case, it instead became a bottleneck on our request processing.

At 20:05 UTC a change was deployed to the GraphQL service to re-enable Java DNS caching with a TTL of 5 seconds. Following this change, we observed an increase in throughput and a reduction in response times.
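
The JDK exposes this cache through the “networkaddress.cache.ttl” security property, which can also be set in the JVM's java.security file. As an illustration of the kind of change described above (the exact mechanism used is not detailed in this post), the sketch below sets a 5-second TTL programmatically before any lookups occur; the hostname is hypothetical.

```java
import java.net.InetAddress;
import java.security.Security;

public class DnsCacheTtlExample {
    public static void main(String[] args) throws Exception {
        // With a TTL of 0, positive caching is disabled and concurrent lookups for
        // the same host serialize on the lock inside the JDK's InetAddress cache.
        // A small positive TTL lets concurrent callers share a cached entry.
        // This must run before the JVM performs its first name lookup.
        Security.setProperty("networkaddress.cache.ttl", "5");

        // Lookups for the same name within the 5-second window hit the cache.
        InetAddress address = InetAddress.getByName("graphql.internal.example"); // hypothetical host
        System.out.println(address.getHostAddress());
    }
}
```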


[Chart: Response time improvement (log scale; Oct 29th in grey, Oct 30th in blue)]

Who was affected?

Customers using the CircleCI UI between 15:20 and 16:35 UTC may have experienced errors. In particular, customers viewing workflows and workflow maps, re-running a workflow from the beginning, approving jobs, or managing contexts and orbs may have experienced these errors. Customers using the UI in the hours before and after this time period may have experienced slowness and occasionally timeouts.

What did we do to resolve the issue?

The most severe symptoms of this incident resolved without intervention as load on the service decreased. Later, a Java DNS caching configuration change was made, improving performance for the internal GraphQL service and, by extension, the CircleCI UI.

What are we doing to make sure this doesn’t happen again?

The Java DNS caching configuration change described above is being applied to all of CircleCI’s Java services. The memory configuration changes made to the GraphQL service during the incident have been made permanent and will improve the service’s reliability. Additional caching was also introduced in the GraphQL service to further improve performance.
