During the week of Sunday, November 25th, an upstream change affected a subset of CircleCI convenience images, causing Node variants of some images to fail in CircleCI jobs.
What happened
We create Node variants of our images by pulling an upstream Dockerfile from the Docker Library’s official Node images, removing its FROM statement, and concatenating the remainder onto our existing Dockerfiles. In this case, we were pulling the carbon tag (the latest LTS Node.js release).
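The strip-and-concatenate step can be sketched roughly as follows. This is a minimal illustration, not our actual build tooling: the function name build_node_variant, the sample Dockerfile contents, and the tag-agnostic regular expression are all assumptions, and the sketch assumes each FROM instruction sits on a single line.

```python
import re

# Assumption: match any single-line FROM instruction, whatever image or tag it names.
FROM_LINE = re.compile(r"^[ \t]*FROM[ \t]+\S+.*$", re.IGNORECASE | re.MULTILINE)


def build_node_variant(base_dockerfile: str, upstream_node_dockerfile: str) -> str:
    """Append an upstream Dockerfile, minus its FROM line, to a base Dockerfile."""
    remainder = FROM_LINE.sub("", upstream_node_dockerfile).strip()
    return base_dockerfile.rstrip() + "\n\n" + remainder + "\n"


# Hypothetical inputs, for illustration only.
base = (
    "FROM example/base:latest\n"
    "RUN useradd --create-home circleci\n"
    "USER circleci\n"
)
upstream = (
    "FROM buildpack-deps:stretch\n"
    "ENV NODE_VERSION 8.0.0\n"
    "RUN echo install node\n"
)

variant = build_node_variant(base, upstream)
print(variant)
```

The resulting Dockerfile keeps the single FROM line from the base image and inherits the upstream Node installation steps beneath it.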
On Sunday, November 25th, Docker moved its official Node images’ carbon tag from a cluster of tags attached to a Debian Jessie-based image to a cluster of tags attached to a Debian Stretch-based image (see diff on GitHub).
Because of the specific pattern-matching logic we used to automate the creation of Node variants of CircleCI convenience images, this change meant that the upstream Node Dockerfile’s original FROM statement was no longer removed, and some CircleCI Node variants were built with an extra FROM statement.
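To illustrate this class of failure, here is a small sketch. The brittle pattern below is hypothetical and is not the actual logic that was in place; the image names are illustrative. It shows how a removal pattern that hard-codes the base image it expects silently stops matching once upstream changes that base image, while a tag-agnostic pattern keeps working.

```python
import re

# Hypothetical brittle pattern: only matches a FROM line naming a Jessie base image.
BRITTLE_FROM = re.compile(r"^FROM buildpack-deps:jessie$", re.MULTILINE)
# Tag-agnostic alternative: matches any single-line FROM instruction.
ROBUST_FROM = re.compile(r"^[ \t]*FROM[ \t]+\S+.*$", re.IGNORECASE | re.MULTILINE)


def strip_from(pattern: "re.Pattern[str]", dockerfile: str) -> str:
    """Remove every line matched by `pattern` from the Dockerfile text."""
    return pattern.sub("", dockerfile).strip()


# Illustrative upstream Dockerfiles before and after the base image change.
jessie_upstream = "FROM buildpack-deps:jessie\nRUN echo install node\n"
stretch_upstream = "FROM buildpack-deps:stretch\nRUN echo install node\n"

brittle_on_jessie = strip_from(BRITTLE_FROM, jessie_upstream)
brittle_on_stretch = strip_from(BRITTLE_FROM, stretch_upstream)
robust_on_stretch = strip_from(ROBUST_FROM, stretch_upstream)

print("FROM" in brittle_on_jessie)   # False: the FROM line was removed
print("FROM" in brittle_on_stretch)  # True: the pattern no longer matches
print("FROM" in robust_on_stretch)   # False: removed regardless of base image
```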
(In the past, Docker did not allow multiple FROM statements, so this error would have prevented these Node variants from being built and pushed. However, as of Docker version 17.05, multiple FROM statements are allowed as a feature of multi-stage builds.)
This extra FROM statement had unexpected consequences: affected images were missing a circleci user, among other issues, and many failed to start.
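In a multi-stage build, each FROM starts a new stage, and by default only the final stage becomes the resulting image, so setup performed before a stray FROM (such as creating a user) does not carry over. The following rough sketch shows a check for this. The function name split_stages and the sample Dockerfile are hypothetical, and the line-based parsing is a simplification that does not handle line continuations.

```python
from typing import List


def split_stages(dockerfile: str) -> List[List[str]]:
    """Group Dockerfile instructions into build stages, one per FROM line."""
    stages: List[List[str]] = []
    for raw in dockerfile.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.upper().startswith("FROM "):
            stages.append([])  # every FROM opens a new stage
        if stages:
            stages[-1].append(line)
    return stages


# Hypothetical Dockerfile with an unintended second FROM.
broken = (
    "FROM example/base:latest\n"
    "RUN useradd --create-home circleci\n"
    "FROM buildpack-deps:stretch\n"
    "RUN echo install node\n"
)

stages = split_stages(broken)
final_stage = stages[-1]
user_created_in_final = any("useradd" in line for line in final_stage)

print(len(stages))            # 2
print(user_created_in_final)  # False: the user was created in a discarded stage
```

A check of this kind, asserting that a generated Dockerfile contains exactly one stage, could run before an image is built and pushed.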
Resolution
We patched the issue in the majority of affected image variants on Wednesday, November 28th. We discovered some minor, adjacent issues preventing a small number of affected variants from rebuilding with the patch; we patched those issues on Thursday, November 29th.
Next steps
- We have fortified our image-building logic so it is not as sensitive to upstream changes
- We are adding more pre-deployment testing to our images, to prevent bugs such as this one from reaching production
- We are creating new monitoring/reporting infrastructure, so we can more easily notify users when convenience image issues do arise
- We will be modifying our Node variants to immutably install Node, rather than relying on upstream machinery
- We are exploring the possibility of building our convenience images from scratch, rather than extending community images that can change without warning
Timeline
- 2018.11.25: upstream commit that triggered this incident
- 2018.11.26: upstream changes are first picked up in our images
- 2018.11.27 (12:49 UTC): first customer report of the issue
- 2018.11.27 (23:16 UTC): after much investigation, an incident is declared
- 2018.11.28 (02:49 UTC): the majority of affected images are patched
- 2018.11.28 (19:22 UTC): first of two patches is shipped to fix a small remaining subset of affected images
- 2018.11.29 (15:44 UTC): second patch ships; 100% of affected images have rebuilt