Docker Executor Infrastructure Upgrade

We are also facing issues with the new v2 container runtime on certain projects. We’re frequently encountering “Received ‘killed’ signal” errors across multiple jobs. Is there a way to opt out until this is resolved?

Hi @lenoirzamboni

Sorry to hear that!

I’ve sent you a DM. Please reply to it (or here) with a link to an affected build and we will look into it.

Dom

As we approach the last increase of this rollout, I wanted to give some additional details about the most common issue encountered: increases in “signal: killed” errors and out-of-memory failures in memory-managed languages.

As we stated in the email about this rollout sent in mid-February, and in this post, one of the changes in this upgrade is the host OS moving from cgroupv1 to cgroupv2. Tooling and scripting may need to be upgraded to support this. In particular, memory-managed languages like Java often use cgroups to detect how much memory they can safely allocate. Without this detection, they run the risk of being killed by the OS for using too much memory.

Older versions of these languages, runtimes and associated tools may not have v2 support. If you encounter such issues, please consider trying the following:

  1. Upgrade your runtime, languages and tooling
  2. Set the maximum memory manually, e.g. Java’s -Xmx1024m -Xms1024m options or Maven’s memory allocation settings (see the sketch below this list)
  3. Increase your resource class size
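
For option 2, a minimal sketch of one way to do this (the 1024m values are placeholders; size them to fit within your resource class) is to export the options via environment variables that the JVM and Maven pick up:

# Sketch only: cap the JVM heap explicitly instead of relying on cgroup detection
export JAVA_TOOL_OPTIONS="-Xms1024m -Xmx1024m"
# Maven reads additional JVM options from MAVEN_OPTS
export MAVEN_OPTS="-Xms1024m -Xmx1024m"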

Hi Dom, we’re seeing a lot of issues with the v2 container runtime on http://app.circleci.com/pipelines/github/DataDog/dd-trace-java - is it possible to opt out of v2 for now?

Hi @mcculls

Sorry about that!

I took a quick look at your builds. The issue I saw looks similar to the ones discussed above.

I’ve applied the temporary opt-out. It takes about 10 minutes to take effect.

Dom

Hi @DominicLavery and @DomParfitt ,

is it possible to get an opt-out for Swoop / joinswoop? It has blocked our merge pipeline, as a large number of our Jest tests are timing out:

https://app.circleci.com/pipelines/github/joinswoop/swoop/667527/workflows/485bca97-a447-4d24-abcf-963caeb9a223/jobs/8902730

Hi @nicholaspshaw, I’ve opted out your project and it will take up to 10 minutes for the opt-out to take effect.

Sorry about any problems this has caused!


Thank you very much for doing this @zaki , I’ll report back if this doesn’t appear to resolve in the next 10-20mins

@nicholaspshaw Please do! I did just notice that this job is using Node 20.11. We’ve found that using Node 23+ or specifying the max workers using --maxWorkers=1 (as an example) will result in a more reliable pipeline with the new runtime.
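
For example, a minimal sketch assuming your tests are invoked through the Jest CLI (adjust to your actual test command):

# Cap Jest's worker pool explicitly rather than letting it autodetect the CPU count
npx jest --maxWorkers=1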

I’m happy to opt this project back in when you’d like to test those changes :slight_smile:


Thank you @zaki , I don’t think Jest is compatible with Node 23+ yet? So this isn’t really an option for us right now?

@DominicLavery No problem, we’ll reach out soon for a re-opt-in when we have our Java version ready that supports cgroupv2 natively.


Hi @nicholaspshaw

Sorry to hear your project faced issues here. It looks like your build is already attempting to set Jest’s max workers, which is great. However, your script appears to be using cgroupv1 files to do this and so does not get the correct value on v2.

You can get the max CPUs of your resource class using something like this:

read -r QUOTA PERIOD < /sys/fs/cgroup/cpu.max
MAX_WORKERS=$(($QUOTA/$PERIOD))

Depending on your plan type, you may need to additionally divide this by 2

$(($MAX_WORKERS/2))

to get the right value.

You can support v1 & v2 at the same time during this transition by testing for the existence of the cgroup files. Putting it all together, I suggest you do something like this to calculate MAX_WORKERS:

declare MAX_WORKERS
if [ -f /sys/fs/cgroup/cpu.max ]; then
  # cgroup v2: cpu.max contains "<quota> <period>" in microseconds
  read -r QUOTA PERIOD < /sys/fs/cgroup/cpu.max
  MAX_WORKERS=$(($QUOTA/$PERIOD))
  # optionally divide by 2 depending on plan
  MAX_WORKERS=$(($MAX_WORKERS/2))
else
  # cgroup v1: derive the CPU count from cpu.shares (1024 shares per CPU)
  MAX_WORKERS=$(($(cat /sys/fs/cgroup/cpu/cpu.shares) / 1024))
fi
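
You could then pass the calculated value straight through to Jest, for example (just a sketch; adjust to match however your test step actually invokes Jest):

npx jest --maxWorkers="$MAX_WORKERS"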

Thank you @DominicLavery !

That’s a bunch of really helpful information! When you say that, depending on plan type, we may need to divide by 2, what do you mean? Can you see our plan type and advise whether we need to do this?

Additionally, if I put your MAX_WORKERS solution in place, is there anyone available over the weekend, or at least during UK morning time, when we could switch back to v2 to test whether things have been resolved?

Anecdotally, on other pipelines, the switch looked extremely beneficial in terms of resource usage.

Hi @nicholaspshaw

It looks like you will need to divide by 2 :slight_smile: So make sure to keep this line: MAX_WORKERS=$(($MAX_WORKERS/2))

We don’t have any availability this weekend to arrange this. We can definitely be available UK weekday mornings. Alternatively, you can set up a new project with a clone of your job that we can opt in to test against.

Anecdotally, on other pipelines, the switch looked extremely beneficial in terms of resource usage.

That is great to hear! Thank you for the feedback

Dom


Hi @DominicLavery thank you for the help, I’ll get this merged today in preparation for Monday.

What is the best way to organise this, so we get opted back in to v2 and then have someone on hand to opt us out if we hit a new blocker? Just via this forum? I’m normally online from 9.30am UK time.

Thanks again

No problem @nicholaspshaw!

Yep, I monitor this post closely and start at about the same time. Let me know when you’re good to go and I can flag your project in/out accordingly. I’ll try to keep an eye on your affected jobs as well.

Dom


@DominicLavery we’ve been seeing consistent failures in a Mocha test suite since the change to v2. Could you take a look?

Hi @michaeldgoodrum,

Sorry to hear that. I can absolutely take a look. I just need a link to one of the affected builds. I’ll send you a DM to reply to now in case you aren’t able to post links here.

Dom

As some other comments allude to, something with DNS has changed. We see Terraform (which makes a significant number of DNS lookups) now taking 20 minutes instead of 5.

Hi @nijave

Sorry to hear about that.

Whilst the DNS setup has changed a little, I’ve not seen any reports of DNS running substantially slower like that. Does the job modify /etc/resolv.conf at all?
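
If you’re able to SSH into a rerun of the job, a quick check would be something like this (just a sketch; substitute a host your Terraform run actually resolves):

# Inspect the resolver configuration the job is using
cat /etc/resolv.conf
# Time a lookup to see whether resolution itself is slow
time getent hosts registry.terraform.io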

Dom