Docker Executor Infrastructure Upgrade

This is likely not common; however, after this upgrade my job has started failing.

I was installing and configuring dnsmasq so that I could do wildcard DNS resolution. After the upgrade I’m getting an error that something is already running on port 53.

Here are a few of the log entries from the failure.

address=/.test/127.0.0.1
 * Starting DNS forwarder and DHCP server dnsmasq

dnsmasq: failed to create listening socket for port 53: Address already in use
   ...fail!

Exited with code exit status 1
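
For context, the failing step is essentially this (a simplified sketch; the package manager commands and conf file name are assumptions on our side, not our exact job config):

    # Simplified sketch of the step that now fails: install dnsmasq and add a
    # wildcard rule so that *.test resolves to 127.0.0.1 (conf file name is arbitrary).
    sudo apt-get update && sudo apt-get install -y dnsmasq
    echo 'address=/.test/127.0.0.1' | sudo tee /etc/dnsmasq.d/wildcard-test.conf
    sudo service dnsmasq restart   # this is where "port 53: Address already in use" now appears
    # To see what is already bound to port 53 after the upgrade:
    sudo lsof -i :53 || sudo netstat -tulpn | grep ':53'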

We’re having an issue where our builds are unable to reliably detect the core count in the Docker executor where they’re running. We have depended on the core count to run Python tests in parallel using pytest-xdist.

We used to use /sys/fs/cgroup/cpu/cpu.shares; we switched to /sys/fs/cgroup/cpu.max according to your solution here.

We run the same command in multiple jobs in different executor sizes, and sometimes cpu.max indicates 16 or 32 cores on an xlarge, where we would expect 8. We spawn too many workers, and then they crash because they run out of memory.
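
To illustrate, the detection we rely on is roughly equivalent to this sketch (simplified; not our exact helper):

    # Derive a pytest-xdist worker count from the cgroup CPU limits.
    if [ -f /sys/fs/cgroup/cpu.max ]; then
        # cgroup v2: the file holds "<quota> <period>", e.g. "800000 100000" = 8 CPUs.
        read -r quota period < /sys/fs/cgroup/cpu.max
        if [ "$quota" = "max" ]; then
            cores=$(nproc)   # no quota set; fall back to the host CPU count
        else
            cores=$(( quota / period ))
        fi
    else
        # cgroup v1: cpu.shares, which has matched the advertised core count on these executors.
        cores=$(cat /sys/fs/cgroup/cpu/cpu.shares)
    fi
    echo "would run: pytest -n $cores"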

Is there a better way to determine core count? If not, can we please go back to the older runtime for now?

Thanks

I am having this same “Received ‘killed’ signal” issue ever since the V2 upgrade. @DominicLavery @DomParfitt How do we opt out until this is fixed? Here is one example CI of ours: https://app.circleci.com/pipelines/github/hunters-ai/hunting-detectors/14195/workflows/78a0aebc-d3ec-4364-a416-dd5474059353/jobs/123493

Hi @mark579,

Sorry you’re having issues.

Would you be able to DM me a job link so that we can investigate?

Many thanks,

Dom

Hi @stephanstach,

Sorry that you’re having issues.

Would you be able to DM me a link to one of the jobs you’ve been seeing issues with? That way we can opt you out temporarily whilst we investigate.

Many thanks,

Dom

Hi @jnovick,

I’ve just applied a temporary opt-out for your org. It may take 10-15 minutes to apply, but after that you should see your Jobs running on V1. Please let me know if you still see things running on V2 after that.

Many thanks,

Dom

I’m not sure how to send you a DM. Here is a build link though.
https://app.circleci.com/pipelines/github/Beterra/debrief/3439/workflows/c1b11835-89a6-4fe1-840c-c0d428dcde59/jobs/30681

Hi @mark579,

I’ve opted that project out, so you should see that take effect in around 10-15 minutes. I’ve only opted out that specific project at the moment, so if you need your entire org opting out then please let me know.

Thanks,

Dom

@DomParfitt I can’t seem to find how to DM you but maybe this component of the URL would be enough since it contains a UUID.

This is running on docker xlarge with cgroups v1 and cat /sys/fs/cgroup/cpu/cpu.shares shows 8. This has always been consistent for us:

484189/workflows/01443a87-2206-4c49-94c1-03032616e3f3/jobs/8638393

Here’s a new job on cgroups v2, where cat /sys/fs/cgroup/cpu.max shows 1600000, but we expect half of that.

485414/workflows/531ffbc9-7444-4104-85b0-5a58299a22b5/jobs/8659541

In each job, you can see the output of cat /sys/fs/cgroup/cpu/cpu.shares or cat /sys/fs/cgroup/cpu.max if you look at the output from the longest step, run...tests.

Maybe this is a factor:

This says “Due to the way we pack workloads on to the underlying compute that runs Docker jobs, very occasionally surplus CPUs exist. Instead of letting them go unused, we will grant them to workloads on a random basis.”

It seems like either we didn’t get extra memory to go along with the extra CPU, or scaling xdist up by 2x has memory overhead that we can’t absorb. We didn’t measure memory usage in the older jobs.

Hi @stephanstach,

Thanks for that info, I’ve been able to opt your org out from that and we will investigate the issue.

Many thanks,

Dom

We noticed a difference in the switch back and forth.

As part of our cgroups v2 migration, we added a line to one of our builds:

source $CIRCLE_WORKING_DIRECTORY/.circleci/helpers/pytest_xdist_params.sh

This worked on cgroups v2, but when we downgraded to cgroups v1, we saw this:

+ source '~/project/.circleci/helpers/pytest_xdist_params.sh'
/bin/bash: line 13: ~/project/.circleci/helpers/pytest_xdist_params.sh: No such file or directory

This was a bit unexpected, and it’s making us a little worried about switching back and forth.
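
If the cause is the tilde in $CIRCLE_WORKING_DIRECTORY not being expanded (which is what the trace above suggests), one workaround we’re considering is expanding it ourselves; a sketch:

    # Expand a leading "~" in CIRCLE_WORKING_DIRECTORY to $HOME before sourcing,
    # since tilde expansion is not applied to a value coming out of a variable.
    workdir="${CIRCLE_WORKING_DIRECTORY/#\~/$HOME}"
    source "$workdir/.circleci/helpers/pytest_xdist_params.sh"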

When we switch back, is there any way we can test out the change on an isolated build before we roll it out everywhere? If that’s not possible, can we please coordinate a time? We want to make sure users on our end aren’t impacted for too long by any potential problems.

@DominicLavery We are also experiencing build failures when using the v2 runtime. Would it be possible to get our organization opted out of the upgrade?

Also, one thing that would be helpful would be a way to run your build against the v2 runtime before deciding to migrate the project. That way, teams could make sure their builds pass ahead of time.

Our jobs read out memory usage because our machine type constrains us more by memory than by CPU. This broke today when we got the rollout, and we suddenly had to update to support cgroup v2:

            usage=$(cat /sys/fs/cgroup/memory.peak /sys/fs/cgroup/memory/memory.max_usage_in_bytes || true)
            limit=$(cat /sys/fs/cgroup/memory.max  /sys/fs/cgroup/memory/memory.limit_in_bytes || true)
            usage_swap=$(cat /sys/fs/cgroup/memory.swap.peak /sys/fs/cgroup/memory/memory.memsw.max_usage_in_bytes || true)
            limit_swap=$(cat /sys/fs/cgroup/memory.swap.max  /sys/fs/cgroup/memory/memory.memsw.limit_in_bytes || true)
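
Written out as an explicit check, the same fallback looks roughly like this sketch (the concatenated cat above just reads whichever path exists, with || true swallowing the error for the missing one):

    # Read peak usage and the limit from cgroup v2 if its files exist, otherwise from v1.
    if [ -f /sys/fs/cgroup/memory.peak ]; then
        usage=$(cat /sys/fs/cgroup/memory.peak)
        limit=$(cat /sys/fs/cgroup/memory.max)
    else
        usage=$(cat /sys/fs/cgroup/memory/memory.max_usage_in_bytes)
        limit=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)
    fi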

We are also experiencing build failures. Can you please opt us out as well?
Org: BlockGrain
build: https://app.circleci.com/pipelines/github/BlockGrain/api/15740/workflows/85f42317-ddc6-4d97-a4bd-0be5b32565a3/jobs/42588

Hi @snyaggarwal,

Sorry you’re having issues.

I’ve opted your org out. It may take 10-15 minutes to apply, but after that you should see your Jobs running on the V1 runtime.

From your link I can see that you’re using an older version of Pylint that doesn’t correctly calculate the number of available CPUs when using the auto-detection for parallel workers, which is likely why you are seeing issues. Are you able to upgrade to use a more recent version of Pylint?
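
If upgrading isn’t immediately possible, explicitly pinning the worker count should sidestep the auto-detection. A rough sketch (the target path here is just a placeholder):

    # --jobs 0 asks Pylint to auto-detect the CPU count (the part that misbehaves on
    # older versions); any positive value is used as-is.
    pylint --jobs 4 your_package/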

Many thanks,

Dom

We noticed that the “spin up environment” step sometimes takes between 30 seconds and 2 minutes instead of a few seconds. And it costs us real money :frowning:

Here is an example where one of the workers spends 2 minutes spinning up the environment: https://app.circleci.com/pipelines/github/1024pix/pix/113139/workflows/ca6dc931-855e-477c-b2be-a439960e1ee9/jobs/1263122/timing

Hi @eric-flarehealth,

Sorry to hear that you’re having issues.

I’ve just DM’d you to ask if you’d be able to provide some links to Jobs that are failing, so we can opt you out and investigate.

Many thanks,

Dom

Hi @stephanstach,

I can make sure that we coordinate with you when we’re ready to opt you back in.

If you have a project that you can safely use as a testing ground for this rollout then we can explicitly opt just that project in whilst leaving the rest of your org opted out. I’ll DM you shortly so you can send me the project you’d like to use for this.

Dom

Hi @bpetetot,

Sorry you’re seeing this but thank you for reporting it to us. Do you have any other examples you could share? That would help us to identify potential causes.

Many thanks,

Dom

Hi @DomParfitt ,

Thanks for your answer. Here are two other long “spin up environment” timings:

2 minutes long:

1 minute long:

Unfortunately, I can only include 2 links per message.