Post Mortem for machine provisioning issues 25th - 27th Jan 2022

Summary

From 10:00 UTC on Tuesday, January 25th until 23:00 UTC Thursday, January 27th, customers using machine and remote Docker executors experienced intermittent delays in job starts, with the most severe delays observed on Wednesday. Additionally, Docker Layer Caching (DLC) was not available for most of Thursday.

There were a number of contributing factors that led to the severity and duration of this incident.

From the beginning of the incident, our teams faced many potential avenues for mitigation and for investigating contributing factors. We picked mitigations based on our experience operating the systems involved, and conducted parallel investigations prioritized by what we believed at the time to be the most likely possibilities. Our historical bias toward the likeliest sources of the issue, coupled with the large number of errors and alerts happening simultaneously, meant that we knocked down a long list of possible trouble spots without full resolution, including: provisioning issues, network timeouts, SSH errors, authentication issues with GCE’s backend, API rate limits, CPU rate limits, IP space allocation, and more.

Our investigations were further complicated because we were essentially battling the clock. The issues only appeared during peak load, so we could not confirm fixes outside of that window. As we passed peak usage each day, we lost the ability to reproduce the conditions leading to the issues, creating periods where we could neither test our hypotheses about the source of the issue nor test avenues toward resolution.

We ultimately determined that this incident was caused by a small change we were testing to our Docker Layer Caching (DLC) functionality, intended to improve efficiency and performance of this feature. The change introduced to DLC was constrained to metrics gathering on two internal projects, and rolled out using feature flags. This update was low on our list of likely suspects, as we believed the change to be sufficiently constrained. In this instance, however, our standard tools behaved in an unusual manner. We discovered that the change to the DLC setup process produced an unexpectedly large amount of string output (90MB+) that was then buffered into memory by a core internal library, leading to a failure cascade in our VM management service.

We apologize for this outage. Even when most workloads are running, the choice to disable a major feature is disruptive. As an engineering organization, we are continually striving to deliver improvements in our service to customers with both speed and safety. On average, we deploy approximately 800 updates to our services every week, and are constantly in search of ways to improve our systems, and our process for finding and recovering from issues.

What happened

We’ve been working to improve the efficiency of Docker Layer Caching (DLC), a feature that lets our customers speed up machine and remote Docker build times. As an initial rollout of the new version, we started by gathering metrics alongside the current version to collect information and identify any inconsistencies. We tested this with synthetic workloads, then moved to the next phase of collecting metrics for real jobs. We chose two internal projects, which would provide data on real workloads while limiting potential trouble to our own builds. We rolled this out using feature flags, as we do for any feature in testing or preview.
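As a rough illustration of this rollout pattern (and not our actual service code), the Go sketch below runs the existing DLC setup for every project while a feature flag gates the new metrics-gathering path to an allowlist of internal projects. Names such as flagClient, collectDLCMetrics, and dlc-metrics-preview are hypothetical.

```go
package main

import (
	"fmt"
	"log"
)

// flagClient stands in for a feature-flag SDK lookup keyed by project ID.
type flagClient struct {
	enabledProjects map[string]bool
}

func (f *flagClient) IsEnabled(flag, projectID string) bool {
	// A real SDK would evaluate the flag remotely; here we check a local allowlist.
	return f.enabledProjects[projectID]
}

// collectDLCMetrics is a placeholder for the new metrics-gathering path that
// runs alongside the existing DLC setup without changing its behavior.
func collectDLCMetrics(projectID string) error {
	log.Printf("collecting DLC metrics for project %s", projectID)
	return nil
}

func setupDLC(projectID string, flags *flagClient) {
	// The existing DLC setup path runs for every project.
	fmt.Printf("running standard DLC setup for %s\n", projectID)

	// The new metrics path runs only for flagged internal projects.
	if flags.IsEnabled("dlc-metrics-preview", projectID) {
		if err := collectDLCMetrics(projectID); err != nil {
			log.Printf("metrics collection failed (non-fatal): %v", err)
		}
	}
}

func main() {
	flags := &flagClient{enabledProjects: map[string]bool{"internal-project-a": true}}
	setupDLC("internal-project-a", flags) // metrics path enabled
	setupDLC("customer-project", flags)   // standard path only
}
```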

Day 1: Tuesday, January 25th

Around 13:30 UTC, the team noticed slowness in VM provisioning and believed it was caused by a slowdown in GCE provisioning VMs. We often see this slowdown in GCP, and it usually clears on its own. By 22:31 UTC, the slowness had not resolved, so we chose to divert machine jobs to the new provisioner system in AWS. The issues subsided and work began to flow as normal, which seemed to support our hypothesis that intermittent, unreported issues upstream were the ultimate source.

Day 2: Wednesday, January 26th

As we worked on post-incident changes the following day, our team was more alert to potential problems in machine jobs. We made several changes to the VM management service to mitigate the effect of the apparent slowness, including changing timeouts and optimizing calls to the GCE API. At the same time, we identified variability in VM boot times, which caused some queues to back up and a small number of customer jobs to take longer than usual to spin up.

The changes we made to the VM management service did not mitigate the issue, and we opened the second incident just before 15:00 UTC when we started to observe customer impact. Slow VM provisioning caused a backup in queues in the VM management service, which led to upstream API contention. As deletion uses the same APIs, this contention caused a buildup of unused VMs, leading to capacity errors. To restore some service for jobs using machine executors, we needed to stabilize the management service manually. We also engaged our upstream provider to increase our IP quota and temporarily ease the pressure. Once we achieved this, we started parallel investigations into the source of the slowness. One investigation path led us to reduce the number of retries for a failing VM, and as this change rolled out, we observed service returning to regular operation. However, this change also happened to coincide with our daily decrease in traffic. We closed the incident at 01:14 UTC Thursday, January 27th as service returned to normal, but were on alert for continuing issues as traffic ramped up the next day.

Day 3: Thursday, January 27th

We opened the third incident at 12:10 UTC when slowness in provisioning again caused machine job delays. We restarted the VM provisioning service, which temporarily mitigated the delays and allowed us to investigate potential causes. DLC volume attachment errors surfaced early, and we identified DLC as the location of the problem, though we weren’t yet sure what exactly that problem was. At 14:26 UTC we disabled DLC to allow provisioning to proceed and customer work to flow. This could slow customer jobs that relied on DLC, but it enabled all jobs to run while we continued investigating.

As on the previous day, we initiated several parallel investigations. By 22:41 UTC we had confirmed the cause was in the metrics-gathering mechanism for the DLC update. We disabled this data collection and turned DLC back on for all customers. Customer workloads that had been flowing normally after we disabled DLC continued to do so, which confirmed we had identified the culprit. After a short period of monitoring, we closed the incident at 01:00 UTC on Friday, January 28th.

The DLC metrics collection captures errors during setup using a standard internal library that converts output into a string and buffers it in memory. When we enabled metric collection on a real project, we received large error outputs (greater than 90MB) that led to out-of-memory errors on the pods managing those projects. As these pods failed and restarted, the jobs associated with them were requeued and provisioning queues grew. As provisioning slowed, the VM management system attempted to scale up in response, which caused us to hit quotas with our provider. From there, repeated retries to provision VMs caused us to hit our API rate limits, which affected our ability both to service workloads and to recover from the failure cascade. Once we identified the issue in DLC, we were able to recover rapidly.
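For readers unfamiliar with this failure mode, the simplified Go sketch below (not the actual library involved) shows why capturing subprocess output as a single in-memory string is risky: memory use grows with the size of the output, so an unexpectedly verbose setup step, like the 90MB+ of error output in this incident, can push a pod past its memory limit.

```go
package main

import (
	"fmt"
	"os/exec"
)

// captureSetupOutput buffers the command's entire stdout and stderr in memory
// before returning. This is convenient for small outputs, but the buffer is
// unbounded by design: a command that emits tens of megabytes holds all of it
// in memory at once.
func captureSetupOutput() (string, error) {
	out, err := exec.Command("sh", "-c", "echo 'dlc setup output'").CombinedOutput()
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	out, err := captureSetupOutput()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Printf("captured %d bytes of output\n", len(out))
}
```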

Future Prevention and Process Improvement

We have addressed the issue with DLC that led to the failure cascade in the VM management service by limiting the amount of data collected at each touchpoint. As with many bugs, finding the problem was the hard part and fixing it was straightforward. Additionally, we are looking into ways of detecting this particular issue earlier in the feature lifecycle.
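As a general illustration of this approach (the exact mechanism and limits in our services differ, and the names here are hypothetical), the Go sketch below caps how much of a command's output is kept, discarding the remainder so memory use stays bounded no matter how much the command prints.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os/exec"
)

// maxCapturedBytes is an illustrative cap on how much output we retain.
const maxCapturedBytes = 64 * 1024

// captureBounded keeps at most maxCapturedBytes of the command's stdout and
// discards the rest, so memory use is bounded regardless of output size.
func captureBounded(cmd *exec.Cmd) (string, error) {
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return "", err
	}
	if err := cmd.Start(); err != nil {
		return "", err
	}

	var buf bytes.Buffer
	if _, err := io.CopyN(&buf, stdout, maxCapturedBytes); err != nil && err != io.EOF {
		return "", err
	}
	// Drain whatever remains so the child process isn't blocked writing.
	if _, err := io.Copy(io.Discard, stdout); err != nil {
		return "", err
	}

	err = cmd.Wait()
	return buf.String(), err
}

func main() {
	// Produce ~1MB of output, of which only the first 64KB is retained.
	out, err := captureBounded(exec.Command("sh", "-c", "yes | head -c 1000000"))
	fmt.Printf("kept %d bytes, err=%v\n", len(out), err)
}
```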

We are also taking the lessons of this incident to improve our incident process overall. Midway through the incident, we adjusted our process to bring in additional teams, which increased the number of investigations we could run in parallel. We will be shifting our future approach toward this method with the aim of faster resolution times.

Additionally, we recognize that this long investigation period meant that status updates, while frequent, sometimes lacked substantive information. We are looking for ways to improve this so that our updates can be both frequent and useful for our customers.