Convenience Images Update: 2018.11.27 Node variants incident

During the week of Sunday, November 25th, an upstream change affected a subset of CircleCI convenience images, causing Node variants of some images to fail in CircleCI jobs.

What happened

We create Node variants of our images by pulling an upstream Dockerfile from the Docker Library’s official Node images, removing its FROM statement, and appending the remainder to our existing Dockerfiles. In this case, we were pulling the Dockerfile behind the carbon tag (the Node.js 8 LTS release line).
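The process described above can be sketched roughly as follows. This is a minimal illustration, not our actual build tooling: the function names, the `example/base:1.0` image, and the Node version shown are all hypothetical placeholders.

```python
import re

# Matches any FROM instruction, regardless of which image it names.
# Dockerfile instructions are case-insensitive, hence IGNORECASE.
FROM_LINE = re.compile(r"^\s*FROM\s+\S+.*$", re.IGNORECASE)


def strip_from(dockerfile: str) -> str:
    """Return the Dockerfile text with every FROM instruction removed."""
    kept = [line for line in dockerfile.splitlines() if not FROM_LINE.match(line)]
    return "\n".join(kept)


def build_node_variant(base: str, upstream_node: str) -> str:
    """Append the upstream Node instructions (minus FROM) to a base Dockerfile."""
    return base.rstrip("\n") + "\n\n" + strip_from(upstream_node).strip("\n") + "\n"


# Hypothetical inputs, for illustration only.
base = "FROM example/base:1.0\nRUN useradd circleci\n"
upstream = "FROM buildpack-deps:stretch\nENV NODE_VERSION 8.0.0\n"

variant = build_node_variant(base, upstream)
print(variant)
```

The resulting text keeps the single FROM from the base Dockerfile and picks up only the Node installation steps from upstream.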

On Sunday, November 25th, Docker moved its official Node images’ carbon tag from a group of tags pointing at a Debian Jessie-based image to a group of tags pointing at a Debian Stretch-based image (see diff on GitHub).

The pattern-matching logic that automated the creation of Node variants of CircleCI convenience images did not account for this change. It no longer removed the upstream Node Dockerfile’s original FROM statement, so some CircleCI Node variants were built with an extra FROM statement.
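To illustrate how this kind of failure can happen, here is a sketch that contrasts an overly specific pattern with a more tolerant one. The `BRITTLE` pattern is an assumption made up for this example, not the pattern our tooling used; the Dockerfile snippets and version number are placeholders as well.

```python
import re

# Hypothetical brittle rule: only strips a FROM line naming the Jessie base.
BRITTLE = re.compile(r"^FROM buildpack-deps:jessie$", re.MULTILINE)

# More tolerant rule: strips any FROM instruction, whatever image it names.
TOLERANT = re.compile(r"^[ \t]*FROM[ \t]+\S.*$", re.MULTILINE | re.IGNORECASE)

# Placeholder upstream Dockerfiles before and after the tag moved.
jessie = "FROM buildpack-deps:jessie\nENV NODE_VERSION 8.0.0\n"
stretch = "FROM buildpack-deps:stretch\nENV NODE_VERSION 8.0.0\n"


def count_from(text: str) -> int:
    """Count the FROM instructions remaining in a Dockerfile."""
    return len(TOLERANT.findall(text))


brittle_jessie = BRITTLE.sub("", jessie)      # FROM removed as intended
brittle_stretch = BRITTLE.sub("", stretch)    # FROM silently left in place
tolerant_stretch = TOLERANT.sub("", stretch)  # FROM removed either way

print(count_from(brittle_jessie), count_from(brittle_stretch), count_from(tolerant_stretch))
```

The important detail is that the brittle substitution does not raise an error when it matches nothing; it simply returns the input unchanged.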

(In the past, Docker did not allow multiple FROM statements, and thus this error would have prevented these Node variants from building and pushing. However, as of Docker version 17.05, multiple FROM statements are allowed as a feature of multi-stage builds.)

This extra FROM statement had unexpected consequences: affected images were missing a circleci user, among other issues, and many failed to start.
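The missing user follows from how multi-stage builds work: by default, the final image is produced from the last stage only, and anything set up in an earlier stage is dropped unless it is copied across explicitly. The sketch below splits a Dockerfile into stages to show this. It is a simplified parser (it ignores line continuations and any instructions before the first FROM), and the Dockerfile contents are hypothetical.

```python
from typing import List


def split_stages(dockerfile: str) -> List[List[str]]:
    """Group Dockerfile instructions into build stages, one per FROM."""
    stages: List[List[str]] = []
    for raw in dockerfile.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.upper().startswith("FROM "):
            stages.append([line])  # each FROM starts a new stage
        elif stages:
            stages[-1].append(line)
    return stages


# Hypothetical variant Dockerfile with the stray upstream FROM left in.
broken = (
    "FROM example/base:1.0\n"
    "RUN useradd circleci\n"
    "FROM buildpack-deps:stretch\n"
    "ENV NODE_VERSION 8.0.0\n"
)

stages = split_stages(broken)
final = stages[-1]  # the stage the shipped image is built from
print(final)
```

Here the `useradd` step lives in the first stage, so it never reaches the final image.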

Resolution

We patched the issue in the majority of affected image variants on Wednesday, November 28th. We discovered some minor, adjacent issues preventing a small number of affected variants from rebuilding with the patch; we patched those issues on Thursday, November 29th.

Next steps

  • We have fortified our image-building logic so it is not as sensitive to upstream changes
  • We are adding more pre-deployment testing to our images, to prevent bugs such as these from reaching production
  • We are creating new monitoring/reporting infrastructure, so we can more easily notify users when convenience image issues do arise
  • We will be modifying our Node variants to immutably install Node, rather than relying on upstream machinery
  • We are exploring the possibility of building our convenience images from scratch, rather than extending community images that can change without warning
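As one example of what a pre-deployment check could look like, the sketch below fails fast if a generated Dockerfile does not contain exactly one FROM instruction. This is an illustration only, not a description of the tests we are adding; the function name and sample Dockerfiles are hypothetical.

```python
import re

# Captures the image named by each FROM instruction (case-insensitive).
FROM_RE = re.compile(r"^[ \t]*FROM[ \t]+(\S+)", re.MULTILINE | re.IGNORECASE)


def assert_single_from(dockerfile: str) -> str:
    """Return the base image, or raise if there is not exactly one FROM."""
    bases = FROM_RE.findall(dockerfile)
    if len(bases) != 1:
        raise ValueError(
            f"expected exactly one FROM, found {len(bases)}: {bases}"
        )
    return bases[0]


# Hypothetical generated Dockerfiles.
good = "FROM example/base:1.0\nRUN useradd circleci\nENV NODE_VERSION 8.0.0\n"
bad = good + "FROM buildpack-deps:stretch\n"

print(assert_single_from(good))
try:
    assert_single_from(bad)
except ValueError as err:
    print(f"rejected: {err}")
```

A static check like this is cheap to run before a build; it would complement, not replace, tests that start the built image and verify the expected user exists.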

Timeline

  • 2018.11.25: upstream commit that triggered this incident
  • 2018.11.26: upstream changes are first picked up in our images
  • 2018.11.27 (12:49 UTC): first customer report of the issue
  • 2018.11.27 (23:16 UTC): after much investigation, an incident is declared
  • 2018.11.28 (02:49 UTC): the majority of affected images are patched
  • 2018.11.28 (19:22 UTC): first of two patches is shipped to fix a small remaining subset of affected images
  • 2018.11.29 (15:44 UTC): second patch ships; 100% of affected images have rebuilt