Docker Executor Infrastructure Upgrade

What is happening?

As part of our commitment to keeping our systems secure and performant, we will be gradually rolling out an update to the infrastructure that runs your Docker executor jobs.

This update is intended to be transparent to users. While we have put significant engineering effort into ensuring that you do not notice any differences in behaviour, this is a considerable change to the underlying platform, and as such some users may experience unexpected issues. In that unlikely event, we ask that you let us know on this post or by opening a support ticket.

You will be able to tell that your job has run on the new infrastructure if the “Spin up environment” step includes the phrase “Using the v2 container runtime”.

Why are we doing this?

New versions of the host OS that Docker jobs run on change features that the executor relies on. In order to upgrade the OS, we’ve had to make changes to how jobs are run. This will allow us to move to newer releases and continue to receive security and bug fixes.

When are we doing this?

We will start the rollout today and customers will be gradually included over the coming weeks. We don’t have a fixed schedule at this time, as we want to be flexible to address any issues as they come up.

Who is affected?

Anyone running jobs on the Docker executor will eventually be included. Remote Docker usage (Docker jobs with a setup_remote_docker step) is unaffected.
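
For illustration, here is a minimal config sketch of the distinction; the job names and image tag are made up for this example:

  version: 2.1
  jobs:
    docker-executor-job:          # runs on the Docker executor: included in the rollout
      docker:
        - image: cimg/base:2024.01
      steps:
        - checkout
        - run: echo "runs directly on the upgraded infrastructure"
    remote-docker-job:            # uses setup_remote_docker: unaffected
      docker:
        - image: cimg/base:2024.01
      steps:
        - checkout
        - setup_remote_docker     # Docker commands run in a separate remote VM
        - run: docker build -t example/app .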

Hello, our builds just started using the upgrade and now they are all failing. They seem to hang on the postgres step, with no errors. The last build that passed was yesterday morning, and I confirmed that those builds were using v1 (Build-agent version 1.0.256087-2b0fd3f9 (2024-11-05T16:22:22+0000)). Any suggestions on how to proceed? We have already submitted a support ticket.

@vcan Thank you for your support ticket, and our apologies for any inconvenience caused. We have disabled the changes while we investigate the issue. You should be able to build normally.

@MarkFowler Thank you for the quick reply. Our builds are working as expected now! I appreciate your help.

Hi @vcan. We’ve deployed a fix for the issue you encountered and re-enabled the rollout. Thank you again for your understanding and for letting us know.

Hi Dominic,

I’m seeing an error about CCI not being able to find the cimg/python:3.12.2 image on a particular project, but others use the same executor without trouble:

Build-agent version 1.0.256087-2b0fd3f9 (2024-11-05T16:22:22+0000).
System information:
 Server Version: 24.0.9
 Storage Driver: overlay2
  Backing Filesystem: xfs
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Kernel Version: 5.15.0-1057-aws
 Operating System: Ubuntu 20.04.6 LTS
 OSType: linux
 Architecture: x86_64

Starting container cimg/python:3.12.2 
  image cache not found on this host, downloading cimg/python:3.12.2 

invalid reference format

Hi @carlosjgp,

From the output you’ve provided, it doesn’t look like this job was included in the rollout, so this upgrade is unlikely to be the cause.

Please could you create a new topic with any extra details you have, such as the config and when you started to see this behaviour? Thanks!

Hello,

We’re seeing failures related to this upgrade in our open-source repository.

Our GitHub repo is etalab/transport-site, and this happened when re-running commits on the master branch. You will find an example of a success and a failure for the same commit. The second run starts with

Using the v2 container runtime

This happens on multiple commits/branches.

Hi @AntoineAugusti!

Sorry about that! I’ve opted your project out whilst we investigate. It can take around 10 mins to take effect.

Thanks for letting us know!

Hi,

Some of our repositories have been migrated to the new system, and since then, lint jobs using pylint have been failing with the following error:

concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Here are two of our repositories that have been migrated:

  • github/openfun/joanie/8448/workflows/d03fa070-f1a8-4835-96c0-b6d2a7fa66b2/jobs/107949
  • github/openfun/marsha/16065/workflows/ef473a85-31e0-4c79-b3b6-f455ab4dcecd/jobs/477725

And here is a repo that has not yet been migrated, where everything is fine:

  • pipelines/github/openfun/richie/11122/workflows/c4f94451-477c-48bc-8fc6-ba1b545e878a/jobs/251077

It seems to be a parallelization issue. Without an explicit argument, pylint tries to autodetect the number of available CPU cores. If we set the number of cores manually, it works well, but I don’t know how to get the number of vCPU cores actually available to the job; with the only methods I know, I get the total number of CPU cores on the host (36). Furthermore, if I manually set the number of cores for pylint to 36, I get exactly the same error.
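
For illustration, the usual ways of counting cores all report the host total when run inside the job container:

  nproc                                    # 36: the host's total core count
  grep -c ^processor /proc/cpuinfo         # 36 as well
  python3 -c "import multiprocessing as m; print(m.cpu_count())"   # 36 again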

Hi @jbpenrath.

Sorry for the trouble you’re having there! I’ve disabled the rollout for the openfun org. It can take about 10 minutes to apply.

Thank you for the extra info you have provided as well. That gives us a great lead to follow.

Hi @AntoineAugusti & @jbpenrath,

I’ve replicated both of these failures in our testing setup. They appear to be linked to the upgrade from cgroups v1 to cgroups v2 that comes with the update to the host OS.

The tools in use are unable to identify the number of CPU cores available to the jobs’ containers when using v2.
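
For the curious, here is a minimal sketch of how a tool can derive the effective CPU count from the cgroup limits themselves. These are the standard kernel interfaces; the paths assume the default cgroup mount at /sys/fs/cgroup:

  # cgroups v2 exposes "quota period" in a single cpu.max file;
  # cgroups v1 splits them across cpu.cfs_quota_us and cpu.cfs_period_us.
  if [ -f /sys/fs/cgroup/cpu.max ]; then
    read -r quota period < /sys/fs/cgroup/cpu.max
    if [ "$quota" != "max" ]; then
      echo $(( (quota + period - 1) / period ))   # round the quota up to whole CPUs
    else
      nproc                                       # no quota set: fall back to the host count
    fi
  else
    quota=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
    period=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
    if [ "$quota" -gt 0 ]; then
      echo $(( (quota + period - 1) / period ))
    else
      nproc                                       # a quota of -1 means unlimited
    fi
  fi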

I’ve implemented some changes that make more information available inside the container for this. For etalab/transport-site, these changes appear to be enough to fix the issue. I’ll opt this project back in tomorrow and keep an eye on its success rate.

For openfun, pylint is still unable to calculate the correct value. @jbpenrath, /proc/cpuinfo now reflects the number of CPUs allocated to the job. Please try the following fix when your projects are opted back in: add

-j $(cat /proc/cpuinfo | grep processor | wc -l)

to your pylint command, e.g.

command: ~/.local/bin/pylint -j $(cat /proc/cpuinfo | grep processor | wc -l) joanie
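
As an aside, an equivalent, slightly shorter form of the same count, if you prefer:

  grep -c ^processor /proc/cpuinfo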

I’ve tested this internally for the joanie project on the lint-back job and it started passing.
