During the week of Sunday, November 25th, an upstream change affected a subset of CircleCI convenience images, causing Node variants of some images to fail in CircleCI jobs.
We create Node variants of our images by pulling an upstream Dockerfile from the Docker Library's official Node images, removing its FROM statement, and concatenating the remainder onto our existing Dockerfiles. In this case, we were pulling the carbon tag (the latest LTS Node.js release).
On Sunday, November 25th, Docker moved its official Node images' carbon tag from a cluster of tags attached to a Debian Jessie-based image to a cluster of tags attached to a Debian Stretch-based image (see diff on GitHub).
Because of the specific pattern-matching logic we used to automate the creation of Node variants of CircleCI convenience images, this change meant that the upstream Node Dockerfile's original FROM statement was no longer removed, causing some CircleCI Node variants to build with an extra FROM statement. (In the past, Docker did not allow multiple FROM statements, so this error would have prevented these Node variants from building and pushing. However, as of Docker version 17.05, multiple FROM statements are allowed as a feature of multi-stage builds.) The extra FROM statement had unexpected consequences: affected images were missing a circleci user, among other issues, and many failed to start.
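To make the failure mode concrete, here is a minimal sketch in Python. It is not our actual image-building code: the Dockerfile contents, the image names, and every function name below are hypothetical, and the "brittle" matcher is only a guess at the kind of pattern that breaks when an upstream base image changes. The sketch contrasts that brittle approach with one that removes any FROM instruction and then checks that the combined Dockerfile contains exactly one.

```python
import re

# Matches a Dockerfile FROM instruction regardless of image name or tag.
# Instructions are case-insensitive and may be indented.
FROM_LINE = re.compile(r"^\s*FROM\s+\S+", re.IGNORECASE)

# Hypothetical Dockerfile contents, for illustration only.
BASE = "FROM example/base:1.0\nRUN useradd circleci\n"
UPSTREAM_JESSIE = "FROM buildpack-deps:jessie\nRUN echo install-node\n"
UPSTREAM_STRETCH = "FROM buildpack-deps:stretch\nRUN echo install-node\n"


def strip_from_brittle(upstream_text: str) -> str:
    """Assumed failure mode: only removes one exact, hard-coded FROM line."""
    return upstream_text.replace("FROM buildpack-deps:jessie\n", "")


def strip_from(upstream_text: str) -> str:
    """Remove every line that starts with a FROM instruction."""
    kept = [ln for ln in upstream_text.splitlines() if not FROM_LINE.match(ln)]
    return "\n".join(kept) + "\n"


def count_from(dockerfile_text: str) -> int:
    """Count the lines that start with a FROM instruction."""
    return sum(1 for ln in dockerfile_text.splitlines() if FROM_LINE.match(ln))


def build_variant(base_text: str, upstream_text: str) -> str:
    """Concatenate the upstream body onto the base and fail loudly if the
    result does not contain exactly one FROM instruction."""
    combined = base_text.rstrip("\n") + "\n" + strip_from(upstream_text)
    found = count_from(combined)
    if found != 1:
        raise ValueError(f"expected exactly one FROM instruction, found {found}")
    return combined


if __name__ == "__main__":
    # The brittle matcher silently leaves the new FROM line in place.
    print(count_from(BASE + strip_from_brittle(UPSTREAM_STRETCH)))
    # The tolerant matcher handles both upstream base images.
    print(count_from(build_variant(BASE, UPSTREAM_STRETCH)))
```

The line-based regular expression is a simplification: it does not understand line continuations, so a continued RUN line that happens to begin with the word "from" would also be removed. A production version would likely need a real Dockerfile parser, but the final count check is what turns a silent extra FROM into a build-time failure.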
We patched the issue in the majority of affected image variants on Wednesday, November 28th. We discovered some minor, adjacent issues preventing a small number of affected variants from rebuilding with the patch; we patched those issues on Thursday, November 29th.
- We have fortified our image-building logic so it is less sensitive to upstream changes
- We are adding more pre-deployment testing to our images, to prevent bugs like this one from reaching production
- We are creating new monitoring/reporting infrastructure, so we can more easily notify users when convenience image issues do arise
- We will be modifying our Node variants to immutably install Node, rather than relying on upstream machinery
- We are exploring the possibility of building our convenience images from scratch, rather than extending community images that can change without warning
- 2018.11.25: upstream commit that triggered this incident
- 2018.11.26: upstream changes are first picked up in our images
- 2018.11.27 (12:49 UTC): first customer report of the issue
- 2018.11.27 (23:16 UTC): after much investigation, an incident is declared
- 2018.11.28 (02:49 UTC): the majority of affected images are patched
- 2018.11.28 (19:22 UTC): first of two patches is shipped to fix a small remaining subset of affected images
- 2018.11.29 (15:44 UTC): second patch ships; 100% of affected images have rebuilt