As part of our commitment to keeping our systems secure and performant, we will be gradually rolling out an update to the infrastructure that runs your Docker executor jobs.
This update is intended to be transparent to users. While we have put significant engineering effort into ensuring that you do not notice any differences in behaviour, this is a considerable change to the underlying platform, and as such some users may experience unexpected issues. In this unlikely eventuality, we ask that you let us know on this post or by opening a support ticket.
You can tell whether your job has run on the new infrastructure by checking whether the “Spin up environment” step includes the phrase “Using the v2 container runtime”.
Why are we doing this?
Newer versions of the host OS that Docker jobs run on change features that the executor relies on. In order to upgrade the OS, we’ve had to make changes to how jobs are run. This will allow us to upgrade to newer releases and continue to receive security and bug fixes.
When are we doing this?
We will start the rollout today and customers will be gradually included over the coming weeks. We don’t have a fixed schedule at this time, as we want to be flexible to address any issues as they come up.
Who is affected?
Anyone running jobs on the Docker executor will eventually be included. Remote Docker jobs (Docker jobs with a setup_remote_docker step) are unaffected.
Hello, our builds just started using the upgrade and now they are all failing. They seem to hang on the postgres step, with no errors. The last build that passed was yesterday morning, and I confirmed that those builds were using v1 (Build-agent version 1.0.256087-2b0fd3f9 (2024-11-05T16:22:22+0000)). Any suggestions on how to proceed? We have already submitted a support ticket.
@vcan Thank you for your support ticket, and our apologies for any inconvenience caused. We have disabled the changes while we investigate the issue. You should be able to build normally.
Hi @vcan. We’ve deployed a fix for the issue you encountered and re-enabled the rollout. Thank you again for your understanding and for letting us know.
We’re seeing failures related to this upgrade in our open-source repository.
Our GitHub repo is etalab/transport-site, and this happened when re-running commits on the master branch. You will find an example of a success and a failure for the same commit. The second run starts with
Some of our repositories have been migrated to the new system, and since then lint jobs using pylint have been failing with the following error:
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
Here are two of our repositories that have been migrated:
It seems to be a parallelization issue. Without an explicit argument, pylint tries to autodetect the number of available CPU cores. If we set the number of cores manually, it works, but I don’t know how to get the number of vCPU cores actually available to the job. With the only methods I know, I only get the total number of CPU cores on the host (36). Furthermore, if I manually set the number of cores for pylint to 36, I get exactly the same error.
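As an aside, this matches what the usual Python calls report from inside a container: they generally see the whole host rather than the slice the job is allowed to use. A minimal sketch of those checks, plus pinning pylint’s worker count explicitly via its --jobs flag (JOB_CPUS and mypackage below are placeholders, not anything from this thread):

```python
import multiprocessing
import os
import subprocess

# Common ways of asking "how many CPUs do I have?" from inside a container.
# On a shared CI host these typically report the machine's total core count
# (e.g. 36), not the portion the job is actually allowed to use.
print("multiprocessing.cpu_count():", multiprocessing.cpu_count())
print("os.cpu_count():", os.cpu_count())
print("sched_getaffinity:", len(os.sched_getaffinity(0)))  # Linux only

# Working around the autodetection by pinning pylint's worker count explicitly.
# JOB_CPUS is a hypothetical environment variable you would set yourself, and
# mypackage is a placeholder for the code being linted.
jobs = int(os.environ.get("JOB_CPUS", "4"))
subprocess.run(["pylint", f"--jobs={jobs}", "mypackage"], check=False)
```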
I’ve replicated both of these failures in our testing setup. They appear to be linked to the upgrade from cgroups v1 to cgroups v2 that comes with the update to the host OS.
The tools in use are unable to identify the number of CPU cores available to the job’s containers when using cgroups v2.
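For anyone curious why detection breaks across this change: under cgroups v1 the CPU quota is exposed in files such as /sys/fs/cgroup/cpu/cpu.cfs_quota_us, while cgroups v2 replaces them with a single /sys/fs/cgroup/cpu.max file in a different format, so code written against the v1 paths finds nothing to read. A rough sketch of a routine that handles both layouts (cgroup_cpu_limit is a hypothetical helper, not part of our tooling):

```python
import os

def cgroup_cpu_limit():
    """Best-effort CPU limit detection handling both cgroup v1 and v2 layouts.

    Returns the number of CPUs the container may use (possibly fractional),
    or the host's core count if no quota is set.
    """
    # cgroup v2 (unified hierarchy): a single file containing "<quota> <period>",
    # where the quota is the literal string "max" when unlimited.
    v2_path = "/sys/fs/cgroup/cpu.max"
    if os.path.exists(v2_path):
        with open(v2_path) as f:
            quota, period = f.read().split()
        if quota != "max":
            return int(quota) / int(period)

    # cgroup v1: quota and period live in separate files under the cpu controller,
    # and the quota is -1 when unlimited.
    v1_quota = "/sys/fs/cgroup/cpu/cpu.cfs_quota_us"
    v1_period = "/sys/fs/cgroup/cpu/cpu.cfs_period_us"
    if os.path.exists(v1_quota):
        with open(v1_quota) as f:
            quota = int(f.read())
        if quota > 0:
            with open(v1_period) as f:
                return quota / int(f.read())

    # No quota found (or an unknown layout): fall back to the host's core count.
    return os.cpu_count()

print(cgroup_cpu_limit())
```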
I’ve implemented some changes that make more information available inside the container for this. For etalab/transport-site, these changes appear to be enough to fix the issue. I’ll opt this project back in tomorrow and keep an eye on its success rate.
For openfun, pylint is still unable to calculate the correct value. @jbpenrath, /proc/cpuinfo now reflects the number of CPUs allocated to the job. Please try this fix when your projects are opted back in:
Add
Just confirming, openfun and etalab/transport-site are both opted back in now. I’ll keep an eye on the joanie and transport-site projects for the next couple of days, but please don’t hesitate to let us know if you spot anything else.
Is it possible to opt in our repository DoctrineLegal/nodejs-doctrine? We have an issue where Next.js detects the Docker environment through cgroup, treats it as an ephemeral environment, and therefore does not persist its cache.
I was trying to reproduce the issue with an open source repository and realized the issue was not present there, since JeremieDoctrine/cache-issue-nextjs already uses the v2 container runtime.
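For context on the Next.js behaviour mentioned above: a common pattern for “am I running in Docker?” checks (a generic sketch, not Next.js’s actual code) is to look for /.dockerenv or for the string “docker” in /proc/self/cgroup. Under cgroups v2 that file typically contains only a line like 0::/, so cgroup-based string matching behaves differently between the two runtimes, which may be part of what changed here:

```python
import os

def looks_like_docker() -> bool:
    """Illustrative container check of the kind many tools perform."""
    # Docker normally creates this marker file inside every container.
    if os.path.exists("/.dockerenv"):
        return True
    # Under cgroup v1, /proc/self/cgroup lists entries such as
    # "12:cpu,cpuacct:/docker/<container-id>"; under cgroup v2 it is usually
    # just "0::/", so this string match no longer fires.
    try:
        with open("/proc/self/cgroup") as f:
            return any("docker" in line or "containerd" in line for line in f)
    except OSError:
        return False

print(looks_like_docker())
```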
@DominicLavery, for your information, I’ve opened an issue on the Next.js GitHub repository (issues/73679) if you want more information. I cannot link it from here.