Post Incident Report: June 23, 2026 - Delays starting Docker jobs

anthonycastro · June 27, 2026, 12:15am

Summary

On Tuesday, June 23, 2026, our engineering team identified that a single Docker job cluster was operating below expected performance levels. Our infrastructure is organized into routing groups, each responsible for handling a specific segment of traffic, and this cluster was the sole member of its group. To address the performance issue, they initiated a standard two-step procedure to safely take it out of rotation: first, update the routing configuration to stop directing traffic to the cluster, then take it offline.

When the team moved to take the cluster offline, the first step of the procedure was skipped. The cluster was removed from the available pool, but the routing configuration was not updated, so traffic continued to be directed to the routing group even though it had no capacity to process work. With no other clusters in the routing group to absorb the traffic, affected jobs had nowhere to go and began queuing rather than starting. This is a routine and safe operation when the two-step procedure is followed correctly, but this incident exposed a gap in our tooling that allowed the procedure to be completed out of order.

The issue began at approximately 17:25 UTC and was fully resolved by 19:46 UTC, for a total impact window of roughly 2 hours and 20 minutes. Only Docker jobs were affected; other job types, including machine and macOS jobs, continued to operate normally.

Beyond the missed step, several factors extended the impact window. An automated alert fired within a minute of the change deploying, but was dismissed as expected behavior. This delayed detection by nearly an hour. When engineers attempted to accelerate recovery by restoring a prior known-good version, they encountered unexpected issues with the rollback process, adding further delay before the fix could be deployed.

The original status page for this incident can be found here.

How Job Routing Works

CircleCI uses Nomad as its job scheduling system for Docker-based jobs. When a job is triggered, it is routed to one of several Nomad clusters. These clusters are organized into routing groups, each responsible for handling a specific segment of traffic. This multi-cluster architecture is intentional: clusters are isolated from one another so that an issue in one does not impact the others, limiting blast radius and improving overall resiliency. Under normal circumstances, if a cluster becomes unavailable, traffic is automatically rerouted to the remaining operational clusters in its routing group, allowing jobs to continue without interruption.
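The routing behavior described above can be illustrated with a minimal sketch. This is not CircleCI's actual routing code: the `RoutingGroup` class, the `pick_cluster` function, and all cluster and group names are hypothetical, chosen only to show why a routing group with a single member has nowhere to send work once that member leaves the available pool.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class RoutingGroup:
    """Hypothetical model of a routing group: a name plus its member clusters."""
    name: str
    clusters: List[str] = field(default_factory=list)


def pick_cluster(group: RoutingGroup, available: Set[str]) -> Optional[str]:
    """Return the first member of the group that is in the available pool.

    Returns None when no member is available, which corresponds to jobs
    queuing instead of starting.
    """
    for cluster in group.clusters:
        if cluster in available:
            return cluster
    return None


# Illustrative names only.
multi = RoutingGroup("group-a", ["cluster-1", "cluster-2"])
single = RoutingGroup("group-b", ["cluster-3"])

# cluster-1 and cluster-3 have been removed from the available pool.
available = {"cluster-2"}

print(pick_cluster(multi, available))   # cluster-2
print(pick_cluster(single, available))  # None
```

In the multi-member group, traffic falls through to the remaining cluster. In the single-member group, routing still targets the group but no cluster can accept the job.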

What Happened

(All times UTC)

Prior to the incident, our engineering team identified that a single Docker job cluster was operating below expected performance levels. Our infrastructure is organized into routing groups, each responsible for handling a specific segment of traffic, and this cluster was the sole member of its group. To address the performance issue, they initiated a standard two-step procedure to safely take it out of rotation.

At 16:58, our engineering team applied a configuration change to remove the affected cluster from service. The first step of the two-step procedure was missed. The cluster was removed from the available pool, but the routing configuration was not updated, leaving traffic still being directed to a routing group with no capacity to process it.

When the configuration change was fully deployed at 17:25, the affected jobs began to queue.

At 17:26, our monitoring alerted us to reduced job throughput on the affected cluster. Because throughput on that cluster was expected to drop as part of the cluster rotation procedure, the alert was interpreted as expected behavior and no action was taken.

By 18:19, a series of customer support tickets made clear that something beyond expected behavior was occurring. A formal incident was declared at 18:28 and engineers began investigating.

By 18:33, the team confirmed the impact was isolated to a subset of Docker jobs and identified a large backlog of queued jobs. At 18:41, a fix was merged to restore the affected cluster to the available pool. Deployment of that fix was delayed because the CI smoke tests assume functioning clusters, which required reworking the pipeline itself. In parallel, at 18:54 the team attempted to accelerate recovery by restoring a prior known-good configuration directly, but encountered a conflict that left multiple active versions running simultaneously, preventing a clean rollback. The fix was fully deployed at 19:20 and Docker jobs began processing again.

By 19:28, the team confirmed the backlog was actively clearing and newly submitted jobs were starting at normal times.

Customers who had accumulated queued jobs during the incident period may have experienced a surge in concurrent job starts upon recovery, potentially hitting plan concurrency limits, which would have resulted in jobs still queuing while we processed the backlog.

In total, the period during which Docker jobs were unable to start lasted from approximately 17:25 to 19:20, roughly 1 hour and 55 minutes. The additional time from 19:20 to 19:46 was spent processing the accumulated backlog, during which some previously queued jobs completed with elevated wait times.
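The durations quoted in this report can be checked against the timeline with a short calculation. The helper name `minutes_between` is illustrative; the timestamps are the UTC times given above.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


change_deployed = "17:25"   # jobs begin to queue
alert_fired = "17:26"       # throughput alert, dismissed as expected
incident_declared = "18:28"
fix_deployed = "19:20"      # Docker jobs start processing again
resolved = "19:46"          # backlog cleared

print(minutes_between(change_deployed, fix_deployed))   # 115 (1 h 55 min)
print(minutes_between(fix_deployed, resolved))          # 26
print(minutes_between(change_deployed, resolved))       # 141 (about 2 h 20 min)
print(minutes_between(alert_fired, incident_declared))  # 62
```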

Future Prevention and Process Improvement

We are taking the following steps to prevent a recurrence and improve our response time:

We are strengthening our failover architecture. Our multi-cluster design is built for resiliency, and while automatic failover mechanisms are already in place, this incident exposed a gap in our safeguards. We are closing that gap to ensure traffic is rerouted to healthy clusters across routing groups regardless of whether a cluster goes down unexpectedly or is intentionally taken offline.

We are enhancing our cluster rotation procedure. This incident was possible because our tooling allowed an invalid state to exist: a cluster removed from the available pool while the routing configuration still directed traffic to it. We are updating this tooling to enforce that routing configuration and cluster availability are always updated together, and that the procedure cannot be completed out of order.

We are improving our monitoring and alerting. The automated alert that fired during this incident was dismissed because it appeared consistent with the operation being performed. We are improving our monitoring so that unexpected side effects are surfaced clearly and are not confused with expected behavior.

We are improving our rollback capabilities. During the incident, our team encountered delays when attempting to deploy the fix. We are auditing our deployment configurations to ensure that rollbacks can be executed more quickly and reliably during incidents.
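As an illustration of the tooling safeguard described in the cluster rotation item above, the following is a minimal sketch of a pre-flight check that refuses to remove a cluster while routing would still send traffic to a group left with no capacity. It is an assumption-laden example, not our actual tooling: `stranded_groups`, `remove_cluster`, `InvalidRotationError`, and the dictionary-based configuration shape are all hypothetical.

```python
from typing import Dict, List, Set


class InvalidRotationError(Exception):
    """Raised when a removal would leave a routed group with no capacity."""


def stranded_groups(routing: Dict[str, List[str]], available: Set[str]) -> List[str]:
    """Return routing groups that receive traffic but have no available cluster."""
    return sorted(
        group
        for group, clusters in routing.items()
        if not any(cluster in available for cluster in clusters)
    )


def remove_cluster(
    routing: Dict[str, List[str]], available: Set[str], cluster: str
) -> Set[str]:
    """Return the new available pool, or raise if routing was not updated first."""
    new_available = available - {cluster}
    stranded = stranded_groups(routing, new_available)
    if stranded:
        raise InvalidRotationError(
            f"removing {cluster} would strand routing groups: {stranded}"
        )
    return new_available


# Hypothetical state: group-b routes only to cluster-3.
routing = {"group-a": ["cluster-1", "cluster-2"], "group-b": ["cluster-3"]}
pool = {"cluster-1", "cluster-2", "cluster-3"}

try:
    remove_cluster(routing, pool, "cluster-3")  # step 1 skipped
except InvalidRotationError as err:
    print(err)

# Step 1 done first: stop routing traffic to group-b, then remove the cluster.
routing_updated = {"group-a": ["cluster-1", "cluster-2"]}
print(sorted(remove_cluster(routing_updated, pool, "cluster-3")))
```

The design choice here is to validate the resulting state rather than the order of operator actions, so any path that would leave a routed group without capacity is rejected.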

Reliability is our top priority, and we are committed to ensuring that infrastructure upgrades like this one are executed with the appropriate safeguards in place.
