Docker Executor Infrastructure Upgrade

What is happening?

As part of our commitment to keeping our systems secure and performant, we will be gradually rolling out an update to the infrastructure that runs your Docker executor jobs.

This update is intended to be transparent to users. While we have put significant engineering effort into ensuring that you do not notice any differences in behaviour, this is a considerable change to the underlying platform, and as such some users may experience unexpected issues. In that unlikely event, we ask that you let us know on this post or by opening a support ticket.

You will be able to tell that your job has run on the new infrastructure if the “Spin up environment” step includes the phrase “Using the v2 container runtime”.

Why are we doing this?

New versions of the host OS that Docker jobs run on change features that the executor relies on. In order to upgrade the OS, we’ve had to make changes to how jobs are run. This will allow us to move to newer releases and continue to receive security and bug fixes.

When are we doing this?

We will start the rollout today and customers will be gradually included over the coming weeks. We don’t have a fixed schedule at this time, as we want to be flexible to address any issues as they come up.

Who is affected?

Anyone running jobs on the Docker executor will eventually be included. Remote Docker usage (Docker jobs with a setup_remote_docker step) is unaffected.
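
For illustration, here is a minimal config sketch of the distinction; the job names and image tag are made up for this example:

  version: 2.1
  jobs:
    docker-executor-job:          # runs on the Docker executor: included in the rollout
      docker:
        - image: cimg/base:2024.01
      steps:
        - checkout
        - run: echo "runs directly on the upgraded infrastructure"
    remote-docker-job:            # uses setup_remote_docker: unaffected
      docker:
        - image: cimg/base:2024.01
      steps:
        - checkout
        - setup_remote_docker     # Docker commands run in a separate remote VM
        - run: docker build -t example/app .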

Hello, our builds just started using the upgrade and now they are all failing. They seem to hang on the postgres step, with no errors. The last build that passed was yesterday morning, and I confirmed that those builds were using v1 (Build-agent version 1.0.256087-2b0fd3f9 (2024-11-05T16:22:22+0000)). Any suggestions on how to proceed? We have already submitted a support ticket.

@vcan Thank you for your support ticket, and our apologies for any inconvenience caused. We have disabled the changes while we investigate the issue. You should be able to build normally.

@MarkFowler Thank you for the quick reply. Our builds are working as expected now! I appreciate your help.

Hi @vcan. We’ve deployed a fix for the issue you encountered and re-enabled the rollout. Thank you again for your understanding and for letting us know.

Hi Dominic,

I’m seeing an error about CCI not being able to find the cimg/python:3.12.2 image on a particular project, but others use the same executor without trouble:

Build-agent version 1.0.256087-2b0fd3f9 (2024-11-05T16:22:22+0000).
System information:
 Server Version: 24.0.9
 Storage Driver: overlay2
  Backing Filesystem: xfs
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Kernel Version: 5.15.0-1057-aws
 Operating System: Ubuntu 20.04.6 LTS
 OSType: linux
 Architecture: x86_64

Starting container cimg/python:3.12.2 
  image cache not found on this host, downloading cimg/python:3.12.2 

invalid reference format

Hi @carlosjgp,

From the output you’ve provided, it doesn’t look like this job was included in the rollout, so this upgrade is unlikely to be the cause.

Please could you create a new topic with any extra details you have, such as the config and when you started to see this behaviour? Thanks!

Hello,

We’re seeing failures related to this upgrade in our open-source repository.

Our GitHub repo is etalab/transport-site, and this happened when re-running commits on the master branch. You will find an example of a success and a failure for the same commit. The second run starts with

Using the v2 container runtime

This happens on multiple commits/branches.

Hi @AntoineAugusti!

Sorry about that! I’ve opted your project out whilst we investigate. It can take around 10 mins to take effect.

Thanks for letting us know!

Hi,

Some of our repositories have been migrated to the new system, and since then, lint jobs using pylint have been failing with the following error:

concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Here are two of our repositories that have been migrated:

  • github/openfun/joanie/8448/workflows/d03fa070-f1a8-4835-96c0-b6d2a7fa66b2/jobs/107949
  • github/openfun/marsha/16065/workflows/ef473a85-31e0-4c79-b3b6-f455ab4dcecd/jobs/477725

And here is a repo that has not yet been migrated, where everything is fine:

  • pipelines/github/openfun/richie/11122/workflows/c4f94451-477c-48bc-8fc6-ba1b545e878a/jobs/251077

It seems to be a parallelization issue. Without an explicit argument, pylint tries to autodetect the number of available CPU cores. If we set the number of cores manually, it works well, but I don’t know how to get the number of vCPU cores actually available to the job; with the only methods I know, I get the total number of CPU cores on the host (36). Furthermore, if I manually set the number of cores for pylint to 36, I get exactly the same error.
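
For illustration, the usual ways of counting cores all report the host total when run inside the job container:

  nproc                                    # 36: the host's total core count
  grep -c ^processor /proc/cpuinfo         # 36 as well
  python3 -c "import multiprocessing as m; print(m.cpu_count())"   # 36 again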

Hi @jbpenrath.

Sorry for the trouble you’re having there! I’ve disabled the rollout for the openfun org. It can take about 10 minutes to apply.

Thank you for the extra info you have provided as well. That gives us a great lead to follow.

Hi @AntoineAugusti & @jbpenrath,

I’ve replicated both of these failures in our testing setup. They appear to be linked to the upgrade from cgroups v1 to cgroups v2 that comes with the update to the host OS.

The tools in use are unable to identify the number of CPU cores available to the jobs’ containers when using v2.
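
For the curious, here is a minimal sketch of how a tool can derive the effective CPU count from the cgroup limits themselves. These are the standard kernel interfaces; the paths assume the default cgroup mount at /sys/fs/cgroup:

  # cgroups v2 exposes "quota period" in a single cpu.max file;
  # cgroups v1 splits them across cpu.cfs_quota_us and cpu.cfs_period_us.
  if [ -f /sys/fs/cgroup/cpu.max ]; then
    read -r quota period < /sys/fs/cgroup/cpu.max
    if [ "$quota" != "max" ]; then
      echo $(( (quota + period - 1) / period ))   # round the quota up to whole CPUs
    else
      nproc                                       # no quota set: fall back to the host count
    fi
  else
    quota=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
    period=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
    if [ "$quota" -gt 0 ]; then
      echo $(( (quota + period - 1) / period ))
    else
      nproc                                       # a quota of -1 means unlimited
    fi
  fi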

I’ve implemented some changes that make more information available inside the container for this. For etalab/transport-site, these changes appear to be enough to fix the issue. I’ll opt this project back in tomorrow and keep an eye on its success rate.

For openfun, pylint is still unable to calculate the correct value. @jbpenrath, /proc/cpuinfo now reflects the number of CPUs allocated to the job. Please try the following fix when your projects are opted back in: add

-j $(cat /proc/cpuinfo | grep processor | wc -l)

to your pylint command, e.g.

command: ~/.local/bin/pylint -j $(cat /proc/cpuinfo | grep processor | wc -l) joanie
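
As an aside, an equivalent, slightly shorter form of the same count, if you prefer:

  grep -c ^processor /proc/cpuinfo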

I’ve tested this internally for the joanie project on the lint-back job and it started passing.
