Public Incident Report: Slow spin-up times for Docker jobs using a custom Docker image hosted on GCR (2022-04-26)

Summary:

Between 14:15 UTC on April 25th and 03:00 UTC on April 28th, customers using container images hosted in Google Container Registry (GCR) experienced build delays and failures due to timeouts. During this time, about 2% of all Docker jobs on the platform experienced problems, and a subset of jobs using GCR-hosted images, about 0.9%, failed outright. The root cause was a GCR service degradation that increased image pull latency during the affected interval. We thank our customers for their patience during this period.

The original status page can be found here. It includes guidance and workarounds in the event of a similar outage.

What Happened

All times are in UTC.

At 14:00 on April 25th, GCR pull latency started to spike, as shown in the graph below. Several customers reported build failures over the following hours, and at 22:48 we began investigating potential internal causes.


Figure: GCR pull latency starts to spike around 14:00 UTC on April 25 and resolves by 22:30 UTC on April 28.

By 01:07 on April 26th, we had confirmed that the slowness was not caused by internal code changes and posted a status page update. We narrowed the issue to private (customer-specified) registries and recommended that customers retry failed builds.

From there, we engaged our support partners and continued to update customers as we learned more. We recommended alternative container registries as a workaround (sketched below) and monitored pull latency from GCR until the problem was resolved. At 20:47 on April 27th, GCR’s status page reflected the outage, and by 22:30 on April 28th, service had returned to normal.
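For reference, that workaround amounts to re-hosting the affected image in another registry and pointing the job at the new location. Below is a minimal sketch of the mirroring step in Python, assuming the Docker CLI is installed and you are already authenticated to both registries; the image names are hypothetical.

    import subprocess

    def mirror_image(source: str, target: str) -> None:
        """Pull an image from the degraded registry, retag it, and
        push it to an alternative registry. Requires the Docker CLI
        and a prior `docker login` to both registries."""
        subprocess.run(["docker", "pull", source], check=True)
        subprocess.run(["docker", "tag", source, target], check=True)
        subprocess.run(["docker", "push", target], check=True)

    # Hypothetical image names; substitute your own.
    mirror_image(
        "gcr.io/my-project/my-image:1.2.3",  # image hosted on GCR
        "docker.io/my-org/my-image:1.2.3",   # mirror on Docker Hub
    )

Once the push completes, the job configuration can be updated to reference the mirror until the original registry recovers.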

Future Prevention and Process Improvement:

While we cannot prevent this type of upstream issue ourselves, we identified improvements we can make to our systems and processes.

We will add registry-specific monitoring for container pull latency. This will allow us to inform customers proactively when this kind of degradation occurs and to offer workaround advice sooner.
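To illustrate what such monitoring could look like (an illustrative sketch, not our production implementation), the Python snippet below times a fresh pull of a probe image from each registry and flags any pull that exceeds a threshold. The registry list, probe image names, and threshold are all hypothetical, and the sketch assumes a host with the Docker CLI available.

    import subprocess
    import time

    # Hypothetical probe images, one per registry we want to watch.
    PROBE_IMAGES = {
        "gcr.io": "gcr.io/example-project/probe:latest",
        "docker.io": "docker.io/example-org/probe:latest",
    }
    LATENCY_THRESHOLD_SECONDS = 30.0

    def probe_pull_latency(image: str) -> float:
        """Return the wall-clock duration of a fresh `docker pull`."""
        # Remove any cached copy so the pull actually hits the registry.
        subprocess.run(["docker", "rmi", "-f", image], capture_output=True)
        start = time.monotonic()
        subprocess.run(["docker", "pull", image], check=True,
                       capture_output=True)
        return time.monotonic() - start

    for registry, image in PROBE_IMAGES.items():
        latency = probe_pull_latency(image)
        if latency > LATENCY_THRESHOLD_SECONDS:
            # In production this would feed an alerting system
            # rather than print.
            print(f"ALERT: {registry} pull took {latency:.1f}s, over "
                  f"the {LATENCY_THRESHOLD_SECONDS:.0f}s threshold")
        else:
            print(f"OK: {registry} pull took {latency:.1f}s")

Run on a schedule, a per-registry probe along these lines could surface degradation like this before customer reports do.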

Additionally, we discovered a bug in a rollback script, which has since been fixed. While the bug did not affect the outcome of this incident, catching it now prevents future problems.