Docker Executor Infrastructure Upgrade

We are also facing issues with the new v2 container runtime on certain projects. We’re frequently encountering “Received ‘killed’ signal” errors across multiple jobs. Is there a way to opt out until this is resolved?

Hi @lenoirzamboni

Sorry to hear that!

I’ve sent you a DM. Please reply to it (or here) with a link to an affected build and we will look into it.

Dom

As we approach the last increase of this rollout, I wanted to give some additional details about the most common issue encountered: increases in “signal: killed” errors and out-of-memory failures in memory-managed languages.

As we stated in the email about this rollout sent in mid-February, and in this post, one of the changes in this upgrade is the host OS moving from cgroupv1 to cgroupv2. Tooling and scripting may need to be upgraded to support this. In particular, memory-managed languages like Java often use cgroups to detect how much memory they can safely allocate. Without this detection, they run the risk of being killed by the OS for using too much memory.

Older versions of these languages, runtimes and associated tools may not have v2 support. If you encounter such issues, please consider trying the following:

  1. Upgrade your runtime, languages and tooling
  2. Set the maximum memory manually, e.g. Java’s -Xmx1024m -Xms1024m options or Maven’s memory allocation settings (see the sketch below this list)
  3. Increase your resource class size
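
For option 2, a minimal sketch of one way to do this (the 1024m values are placeholders; size them to fit within your resource class) is to export the options via environment variables that the JVM and Maven pick up:

# Sketch only: cap the JVM heap explicitly instead of relying on cgroup detection
export JAVA_TOOL_OPTIONS="-Xms1024m -Xmx1024m"
# Maven reads additional JVM options from MAVEN_OPTS
export MAVEN_OPTS="-Xms1024m -Xmx1024m"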

Hi Dom, we’re seeing a lot of issues with the v2 container runtime on http://app.circleci.com/pipelines/github/DataDog/dd-trace-java - is it possible to opt out of v2 for now?

Hi @mcculls

Sorry about that!

I took a quick look at your builds. The issue I saw looks similar to the ones discussed above.

I’ve applied the temporary opt-out. It takes about 10 minutes to take effect.

Dom

Hi @DominicLavery and @DomParfitt ,

is it possible to get an opt-out for Swoop / joinswoop? It has blocked our merge pipeline, as a large number of our Jest tests are timing out:

https://app.circleci.com/pipelines/github/joinswoop/swoop/667527/workflows/485bca97-a447-4d24-abcf-963caeb9a223/jobs/8902730

Hi @nicholaspshaw, I’ve opted out your project and it will take up to 10 minutes for the opt-out to take effect.

Sorry about any problems this has caused!


Thank you very much for doing this @zaki , I’ll report back if this doesn’t appear to resolve in the next 10-20mins

@nicholaspshaw Please do! I did just notice that this job is using Node 20.11. We’ve found that using Node 23+ or specifying the max workers using --maxWorkers=1 (as an example) will result in a more reliable pipeline with the new runtime.
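
For example, a minimal sketch assuming your tests are invoked through the Jest CLI (adjust to your actual test command):

# Cap Jest's worker pool explicitly rather than letting it autodetect the CPU count
npx jest --maxWorkers=1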

I’m happy to opt this project back in when you’d like to test those changes :slight_smile:


Thank you @zaki , I don’t think Jest is compatible with Node 23+ yet? So this isn’t really an option for us right now?

@DominicLavery No problem, we’ll reach out soon for a re-opt-in when we have our Java version ready that supports cgroupv2 natively.


Hi @nicholaspshaw

Sorry to hear your project faced issues here. It looks like your build is already attempting to set Jest’s max workers, which is great. However, your script appears to be using cgroupv1 files to do this and so does not get the correct value on v2.

You can get the max CPUs of your resource class using something like this:

read -r QUOTA PERIOD < /sys/fs/cgroup/cpu.max
MAX_WORKERS=$(($QUOTA/$PERIOD))

Depending on your plan type, you may need to additionally divide this by 2

$(($MAX_WORKERS/2))

to get the right value.

You can support v1 & v2 at the same time during this transition by testing for the existence of the cgroup files. Putting it all together, I suggest you do something like this to calculate MAX_WORKERS:

declare MAX_WORKERS
if [ -f /sys/fs/cgroup/cpu.max ]; then
  # cgroup v2: cpu.max contains "<quota> <period>" in microseconds
  read -r QUOTA PERIOD < /sys/fs/cgroup/cpu.max
  MAX_WORKERS=$(($QUOTA/$PERIOD))
  # optionally divide by 2 depending on plan
  MAX_WORKERS=$(($MAX_WORKERS/2))
else
  # cgroup v1: derive the CPU count from cpu.shares (1024 shares per CPU)
  MAX_WORKERS=$(($(cat /sys/fs/cgroup/cpu/cpu.shares) / 1024))
fi
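
You could then pass the calculated value straight through to Jest, for example (just a sketch; adjust to match however your test step actually invokes Jest):

npx jest --maxWorkers="$MAX_WORKERS"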

Thank you @DominicLavery !

That’s a bunch of really helpful information! When you say that, depending on plan type, we may need to divide by 2, what do you mean? Can you see our plan type and advise whether we need to do this?

Additionally, if I put your MAX_WORKERS solution in place, is there anyone available over the weekend, or at least during UK morning time, when we could switch back to v2 to test whether things have been resolved?

Anecdotally, on other pipelines, the switch looked extremely beneficial in terms of resource usage.

Hi @nicholaspshaw

It looks like you will need to divide by 2 :slight_smile: So make sure to keep this line: MAX_WORKERS=$(($MAX_WORKERS/2))

We don’t have any availability this weekend to arrange this. We can definitely be available UK weekday mornings. Alternatively, you can set up a new project with a clone of your job that we can opt in to test against.

Anecdotally, on other pipelines, the switch looked extremely beneficial in terms of resource usage.

That is great to hear! Thank you for the feedback

Dom


Hi @DominicLavery thank you for the help, I’ll get this merged today in preparation for Monday.

What is the best way to organise this, so we get opted back in to v2 and then have someone on hand to opt us out if we hit a new blocker? Just via this forum? I’m normally online from 9.30am UK time.

Thanks again

No problem @nicholaspshaw!

Yep, I monitor this post closely and start at about the same time. Let me know when you’re good to go and I can flag your project in/out accordingly. I’ll try to keep an eye on your affected jobs as well.

Dom


@DominicLavery we’ve been seeing consistent failures in a Mocha test suite since the change to v2. Could you take a look?

Hi @michaeldgoodrum,

Sorry to hear that. I can absolutely take a look. I just need a link to one of the affected builds. I’ll send you a DM to reply to now in case you aren’t able to post links here.

Dom

As some other comments allude to, something with DNS has changed. We see Terraform (which makes a significant number of DNS lookups) now taking 20 minutes instead of 5.

Hi @nijave

Sorry to hear about that.

Whilst the DNS setup has changed a little, I’ve not seen any reports of DNS running substantially slower like that. Does the job modify /etc/resolv.conf at all?
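
If you’re able to SSH into a rerun of the job, a quick check would be something like this (just a sketch; substitute a host your Terraform run actually resolves):

# Inspect the resolver configuration the job is using
cat /etc/resolv.conf
# Time a lookup to see whether resolution itself is slow
time getent hosts registry.terraform.io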

Dom