Docker Executor Infrastructure Upgrade

Hi @maciej,

I’m very sorry to hear about the issues this has caused you and thank you for the feedback.

Port 80 already being bound is not an expected change nor something we have seen before. I’ll use the information you have provided to attempt to recreate the problem and resolve it so that others don’t experience the same issue.

I see that you have raised a support request about this, and that you have found a workaround for the time being (binding to another port). Based on this, I will leave your project opted in for now unless you have any objections. If you see any further issues, please don’t hesitate to let us know and we can opt the project out pending investigation.

Thanks
Dom

Hi again @maciej,

I’ve investigated the issue you reported. It turned out that the container runtime was not applying the correct kernel capabilities to the additional/background containers, which prevented them from binding to privileged ports.

We’ve applied a fix, so you should be able to revert your workaround now if you’d like. Apologies again for the troubles the bug caused you.

Based on your feedback about communication, we’ve added a banner to the job page when the new runtime is in use to make it easier to find help.

Thanks again for the feedback

Dom

Thank you for looking into that.

The banner is definitely helpful, but we can’t test it as the experiment/rollout has been disabled for all our repos for the time being.

I’m going to continue discussion about next steps on the support ticket.

Pylint users getting this error:

concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Can now upgrade to pylint 3.3.4 to receive a fix we contributed for this issue: Calculate linter.config.jobs in cgroupsv2 environments by DominicLavery · Pull Request #10089 · pylint-dev/pylint · GitHub

If for some reason you cannot upgrade, you can still use the workaround of adding

-j $(grep -c ^processor /proc/cpuinfo)

to your pylint invocation.
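Note that counting lines in /proc/cpuinfo reports every CPU on the host, which can over-provision jobs when the container has a CPU quota. A cgroup v2-aware variant (a sketch, assuming the unified hierarchy is mounted at /sys/fs/cgroup as on the v2 runtime) would be:

```shell
# Compute a job count from the cgroup v2 CPU quota, falling back to the
# host CPU count when no quota is set or the file is absent.
jobs=$(nproc)
if [ -r /sys/fs/cgroup/cpu.max ]; then
  # cpu.max contains "<quota> <period>" or "max <period>" when unlimited
  read -r quota period < /sys/fs/cgroup/cpu.max
  if [ "$quota" != "max" ]; then
    jobs=$(( quota / period ))
    [ "$jobs" -ge 1 ] || jobs=1   # round fractional quotas up to one job
  fi
fi
echo "$jobs"
```

You could then invoke pylint with `-j "$jobs"` so the worker count matches the container’s actual CPU allowance.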

Our CI has suddenly started failing. It seems limited to the tests that use YugabyteDB inside a Docker container. The only difference from older working runs is that I see the banner saying “This job used the new v2 container runtime.”

The error doesn’t tell me much. Any chance this is related to the cgroup changes?

Starting yugabyted...
| Starting the YugabyteDB Processes...Failed to setup master. Exception: Traceback (most recent call last):
  File "bin/yugabyted", line 5695, in setup_master
    master_uuids = retry_op_with_argument(self.get_master_uuids, master_addrs)
  File "bin/yugabyted", line 10480, in retry_op_with_argument
    return func(argument)
  File "bin/yugabyted", line 4179, in get_master_uuids
    raise RuntimeError("process died unexpectedly.")
RuntimeError: process died unexpectedly.

For more information, check the logs in /root/var/logs

exited with error (exited with 137)
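(Aside for anyone hitting the same failure: exit code 137 follows the shell convention of 128 + the signal number, i.e. signal 9, SIGKILL. Inside a memory-limited container that usually means the kernel OOM killer ended the process.)

```shell
# Decode a container exit code: values above 128 mean "killed by signal".
exit_code=137
sig=$(( exit_code - 128 ))   # 137 - 128 = 9
kill -l "$sig"               # prints the signal name: KILL
```

On cgroup v2 hosts you can confirm an OOM kill by checking the oom_kill counter in the container cgroup’s memory.events file.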

Hi @LexLuthr,

Sorry to hear that! It is possible.

Could you please send the last 4 segments of a CIRCLE_BUILD_URL (found in the “Preparing Environment variables” step), e.g. gh/my-org/my-project/1234? I’ll take a closer look.

Thanks
Dom

github/filecoin-project/curio/1803/workflows/c2d1a955-9156-4c67-810d-f714b860bcd8/jobs/16970 as requested.

github/filecoin-project/curio/1767/workflows/1b7cbe6c-9032-40c2-980d-4bf3e9fd00e6/jobs/16466 is the one that works

Thank you.

There isn’t much to suggest a cause, but the timing certainly points to a link with this rollout.

I’ve applied a temporary opt out so you aren’t blocked whilst we investigate. It can take up to 10 minutes to apply.

Dom

I’ve just had https://app.circleci.com/pipelines/github/i2group/circleci-orb-node/461/workflows/49285575-198f-4658-9dea-a61610d71a3d fail because, when it asked for Docker containers of various sizes, it got 36 CPUs instead of the expected sizes.
Now, I recall an earlier conversation where I was told “sometimes CircleCI has an excess of CPUs so it generously doles out more than you expect”, which is why my code expects “between 8 and 16” rather than “exactly 8”, but I wasn’t expecting to get more than double the amount I asked for.
…but this build just failed, and it declared that it’s running on the new Docker infrastructure, so I figure that’s why it failed.

FYI, my “calc cpu count” command (the code under test) was written entirely to work around the old CircleCI Docker infrastructure defect whereby it would lie to auto-sizing tools about the available resources: tools were told “you’ve got half a terabyte of RAM and over 100 CPUs to use”, and the build would then be killed the moment it went over 8 GB of RAM.
It may be that this “new” infrastructure is less predictable but no longer lies, which would make my detection code unnecessary (once everything is on the v2 Docker).

I think this “v2 docker” means that you need to revisit the workaround provided in Have `nproc` accurately report number of CPUs available to container | Feature Requests | CircleCI Ideas, because that solution isn’t working anymore - it’s reporting 36 on every size of Docker container.

Hi @Peter-Darton-i2,

Thanks for the heads up about that suggested work around.

For your script there, it looks like there are 2 things going wrong.

Firstly, weAreOnDocker is looking for cgroup v1-specific files (/sys/fs/cgroup/cpu/cpu.cfs_quota_us), which will no longer be present.

Secondly, calcCpusForDocker is also assuming cgroup v1 for its calculations.
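Purely as a sketch of the shape such a fix could take (the paths below are the standard kernel locations for each hierarchy, not anything CircleCI-specific), a function that handles both cgroup versions might look like:

```shell
# Report the CPU quota for this container, handling cgroup v1 and v2.
# The v1 branch can be deleted once every environment is on the v2 runtime.
container_cpus() {
  if [ -r /sys/fs/cgroup/cpu.max ]; then                      # cgroup v2
    read -r quota period < /sys/fs/cgroup/cpu.max
    if [ "$quota" != "max" ]; then
      cpus=$(( quota / period ))
      [ "$cpus" -ge 1 ] || cpus=1
      echo "$cpus"
      return
    fi
  elif [ -r /sys/fs/cgroup/cpu/cpu.cfs_quota_us ]; then       # cgroup v1
    quota=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
    period=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
    if [ "$quota" -gt 0 ]; then                                # -1 = no limit
      cpus=$(( quota / period ))
      [ "$cpus" -ge 1 ] || cpus=1
      echo "$cpus"
      return
    fi
  fi
  nproc    # no quota found in either hierarchy: fall back to the host count
}
container_cpus
```

Checking which file exists doubles as the v1/v2 detection, so the two branches stay clearly separated for later cleanup.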

Am I right in thinking that this is an orb used across many of your projects? If so, I’d suggest one of two potential paths forward:

  1. We can get this orb to support v1 and v2 at the same time
  2. We can opt your entire org in once this is ready

Please let me know what you think and how I can help :slight_smile:

Dom

It’s used across >1 project, yes.
A solution that handles v1 and v2 at the same time would be good
… especially if the code makes it clear what’s v1-only and what’s v2-only so that, once CircleCI is 100% v2 everywhere, we can drop support for v1 from the code to clean it up.
But, yes, right now I want something that will “just work” regardless of what CircleCI environment it’s running on.

@DominicLavery Things are now working. So, we can be sure that it is the v2 changes. Thank you for helping with this.


Thanks @Peter-Darton-i2,

I’ve DM’d you a fix that I hope will get you building green again.

Thank you @DominicLavery; code received, and it seems to work. Can you update that (long-running) bug-report/RFE thread accordingly?

Note: In an ideal world, of course, none of that code would be necessary because auto-sizing stuff would “just work” without it.
… and/or there would be CircleCI-provided env vars telling folk the official size of the CircleCI agent they were on.

@DominicLavery looks like our e2e step started failing. The only thing we can think of at this point is the update to the v2 container runtime. What is the process to go back to the pre-v2 container runtime for my project, just as a way to rule it out as the cause of the issue?

@DominicLavery looks like we just had to update the Playwright docker image to the latest version. Thanks anyway!

Hi @DominicLavery, we are having issues on builds that have started using v2 related to composer installation commands that previously had no issue. Are we able to opt out our project to rule this out?

Hi @JarmBlueOak,

If you could provide a build URL (the part after the domain is fine), I’ll take a look.

Thanks
Dom


Thanks @iwakoscott.

Could I ask a couple of questions so that I can help other users debug similar issues?

Which version did you start on, and which did you upgrade to?
What sort of error did you see?

Dom