Incident Report: October 27, 2021 - VMs failing to be created

Summary

Customers experienced lengthy delays in job starts and stuck builds from 17:00 UTC on October 27 to 01:16 UTC on October 28. All jobs were affected, starting with machine jobs and spreading to Docker executor jobs as the incident progressed.

We thank customers for your patience and understanding as we worked to resolve this incident. We know how critical CircleCI is to our customers, and we want to share what happened along with our efforts to ensure this kind of incident doesn’t happen again. We will follow up this report with a more detailed one later in the week, as our investigation is ongoing.

The original status page can be found here.

What happened

Three issues, each necessary but only jointly sufficient, resulted in an outage that affected all workloads and lasted more than eight hours.

  • Errors from AWS increased unexpectedly, then subsided quickly;
  • Our VM scaling algorithm scaled more rapidly than expected in response to changes in the system; and
  • An independent failover mechanism, which quickly creates VMs in an alternate provider, behaved unexpectedly when returning work to the primary provider.

Roughly 80% of our compute workload runs in AWS, with the remaining ~20% in GCP. For some of our AWS workload, we can fail over to GCP in case of emergency. While GCP is the “secondary” provider for this flexible workload, it remains the primary provider for other VM workloads.
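To illustrate how that kind of failover can work, here is a minimal sketch in Go. It assumes a simple error-count threshold; the threshold, the one-minute window, and every identifier are hypothetical stand-ins rather than a description of our actual mechanism.

```go
package main

import (
	"fmt"
	"time"
)

// Provider identifies a cloud provider used to create VMs.
type Provider string

const (
	AWS Provider = "aws" // primary provider for the flexible workload
	GCP Provider = "gcp" // secondary for that workload, primary for others
)

// failoverState tracks recent "out of capacity" errors from the primary
// provider and decides where new VMs should be created.
// All fields and values here are hypothetical, for illustration only.
type failoverState struct {
	capacityErrors int           // capacity errors seen in the current window
	threshold      int           // hypothetical trip threshold
	window         time.Duration // hypothetical observation window
}

// target returns the provider that new VM requests should be sent to.
// When capacity errors from the primary exceed the threshold, traffic
// fails over to the secondary; once errors subside, it returns to AWS.
func (s *failoverState) target() Provider {
	if s.capacityErrors >= s.threshold {
		return GCP
	}
	return AWS
}

func main() {
	s := &failoverState{threshold: 5, window: time.Minute}

	s.capacityErrors = 7 // a brief spike of "out of capacity" errors from AWS
	fmt.Println("create new VMs in:", s.target()) // -> gcp

	s.capacityErrors = 0 // the spike subsides
	fmt.Println("create new VMs in:", s.target()) // -> aws
}
```

The relevant property is that the decision is driven entirely by recent capacity errors from the primary provider, so even a brief spike can trip the failover.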

As traffic increased on October 27, we experienced a brief spike in “out of capacity” errors from AWS. This triggered the failover mechanism, which began moving workload to the secondary provider, GCP. VM creation in the secondary provider proceeded more rapidly than expected because the automated scaler did not receive the messages indicating that VMs had been successfully created, so it kept requesting more. As the scaler continued to create VMs, we hit our CPU quota in GCP, which prevented VMs for other, unrelated workloads from being created.
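To make the overshoot concrete, consider a scaler that sizes each request from the capacity that has been confirmed back to it. The Go sketch below is hypothetical (the names and numbers are illustrative, not our scaler), but it shows why lost success messages turn a bounded request into a runaway one.

```go
package main

import "fmt"

// scaler sizes VM creation requests from the capacity it believes exists.
// It only learns about new capacity from "VM created" success messages.
type scaler struct {
	desired      int // VMs the workload currently needs
	acknowledged int // VMs confirmed created via success messages
}

// deficit is how many more VMs the scaler thinks it must request.
// If success messages are lost, acknowledged never grows, so the
// deficit never shrinks and every cycle requests the full amount again.
func (s *scaler) deficit() int {
	if d := s.desired - s.acknowledged; d > 0 {
		return d
	}
	return 0
}

func main() {
	s := &scaler{desired: 100}
	created := 0

	for cycle := 1; cycle <= 3; cycle++ {
		req := s.deficit()
		created += req // the provider really does create these VMs...
		// ...but the success messages never reach the scaler,
		// so s.acknowledged stays at 0 and the loop repeats.
		fmt.Printf("cycle %d: requested %d, total VMs in provider %d\n",
			cycle, req, created)
	}
	// After three cycles the provider holds 300 VMs for a need of 100,
	// which is how a CPU quota gets exhausted.
}
```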

Once the capacity errors from AWS subsided, the failover mechanism routed work back to the primary provider. The VMs already created in GCP remained in place, however, continuing to block our normal GCP workload. Our VM orchestrator responded to the blocked workload by trying to create still more instances, which resulted in rate limiting from both cloud providers. At this point, no work was able to flow through the system.

To unstick the system, we disabled VM creation and manually purged the job queues. We estimate that about 40,000 jobs were dropped during this restart; these appeared as canceled jobs, as if the customer had canceled them. We then manually deleted the excess VMs in GCP and re-enabled VM creation, restoring normal service for all jobs.

We apologize for this outage. We know it was a prolonged disruption for our customers, who rely on CircleCI for critical workflows, and we are taking concrete steps to prevent issues of this type in the future.

Future prevention

Immediately after the incident, our engineering teams made changes to mitigate runaway API calls resulting from the three contributing factors. When traffic increased the next day, we observed that the system remained stable, which gives us confidence that this particular failure mode will not recur.

We’re also implementing ways to recover faster when problems do occur. Some of these changes are already complete, and others are in progress.

Additionally, we’ve implemented an easier way to shut off traffic at an earlier point in the system. A common failure mode in this system involves hitting limits, such as API rate limits and CPU quotas. Having an easy way to turn off the flow of work in critical situations will help us recover faster once we reach that state. We anticipate that customers would experience this as a much shorter-lived service interruption affecting a small number of jobs.
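One way to picture that shutoff is as a single gate at the front of the VM-creation pipeline. The Go sketch below is only illustrative, assuming a hypothetical flag that every provider call checks; it is not a description of the real mechanism or its wiring.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// killSwitch gates VM creation at the front of the pipeline. Flipping it
// stops new provider API calls immediately, without a deploy.
// The name and wiring here are hypothetical.
type killSwitch struct {
	disabled atomic.Bool
}

var errCreationDisabled = errors.New("vm creation is temporarily disabled")

// createVM is the single entry point for provider API calls; every path
// to AWS or GCP goes through this check before anything is requested.
func (k *killSwitch) createVM(provider string) error {
	if k.disabled.Load() {
		return errCreationDisabled
	}
	fmt.Println("calling", provider, "to create a VM")
	return nil
}

func main() {
	var k killSwitch
	_ = k.createVM("aws") // normal operation

	// During an incident, an operator flips the switch...
	k.disabled.Store(true)
	if err := k.createVM("gcp"); err != nil {
		fmt.Println("blocked:", err) // ...and no further API calls are made.
	}
}
```

The value of a gate like this is that it stops new provider API calls immediately, before more rate limits or quotas are consumed, while queued work drains and operators clean up.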

Longer term, we are making larger investments in the systems that handle these workloads. We’ve started small, with a new Mac resource class currently in closed preview, and will continue to build from there.