Docker Executor Infrastructure Upgrade

Hi @victorlee134

We’ve recreated the issue loom was seeing and have released a fix. We’ll re-enable the org shortly.

Please don’t hesitate to reach out again if you spot anything else.

Thanks again

Dom

Hi @DominicLavery

We’ve seen consistent failures in the past 2 hours. The first failure appeared when the v2 container runtime started being used. Could you opt our org out again?

Examples:
github/loomhq/loom/180705/workflows/dae953e6-f14d-465a-8ede-781fed76e6db/jobs/2623678
github/loomhq/loom/180705/workflows/dae953e6-f14d-465a-8ede-781fed76e6db/jobs/2623671
github/loomhq/loom/180705/workflows/dae953e6-f14d-465a-8ede-781fed76e6db/jobs/2623671

Sorry about that @victorlee134
I’ve run the opt-out; it can take around 10 minutes to apply.

Just a heads up: the first linked issue did start occurring on the old infrastructure, so you may still see that one. See: loomhq/loom/180697/workflows/93f2900c-0d84-4700-9828-d45e991d8cf9/jobs/2623493

Thanks @DominicLavery!

And yeah my mistake, pasted the wrong pipeline failure link

Hi @masterginger

We believe the issue attentive-mobile saw was due to a new PID limit that was being automatically and implicitly applied in the new environment.

We’ve increased the limit substantially and will opt the org back in. I’ll keep an eye on the builds to make sure the issue is fixed and reapply the opt-out if it recurs.
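
If it helps anyone following along, here is a rough sketch (Python, assuming the standard kernel cgroup files are visible inside the job; nothing CircleCI-specific) for printing the PID limit a container actually sees. When the limit is too low, it tends to surface as errors like pthread_create failing with EAGAIN.

```python
# Minimal sketch: print the PID limit visible inside a Docker executor job.
# Which file exists depends on whether the host uses cgroup v1 or v2;
# a value of "max" means no limit is applied.
from pathlib import Path

CANDIDATES = [
    Path("/sys/fs/cgroup/pids.max"),       # cgroup v2 (unified hierarchy)
    Path("/sys/fs/cgroup/pids/pids.max"),  # cgroup v1
]

for path in CANDIDATES:
    if path.exists():
        print(f"{path}: {path.read_text().strip()}")
        break
else:
    print("No pids.max file found; no PID limit is visible here.")
```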

Thanks

Thank you @DominicLavery for the update! We will watch the builds too and let you know if we see anything unusual :slight_smile:

Hey @DominicLavery,

It’s happening again: https://app.circleci.com/pipelines/github/attentive-mobile/code/563236/workflows/9d8218d9-3e4e-4282-ad26-05f1deebdf32/jobs/7257268/parallel-runs/0/steps/0-122?invite=true#step-122-639983_99

It’s using v2 container runtime.

Could you opt us out again?

Thanks!
Bian

Another question: is there a way we could enable it on an individual job so that we can test it out before it’s rolled out to our org?

Hey @masterginger

Sorry to hear that! I’ve reapplied the opt-out.

We don’t currently have the ability to enable it at a per-job or per-branch level. We can do it at the project level if that would help?

Thanks
Dom

Hey @DominicLavery,

Thanks for the quick response! We’ve created a test project https://app.circleci.com/pipelines/github/attentive-mobile/v2-container-runtime-test , could you enable v2 container runtime for that project only?

We will put the reproducible code and steps in that project and hopefully that will provide good signals for CircleCI to debug the issue :slight_smile:

Thanks,
Bian

Thanks @masterginger! That is really helpful.

I’ve opted the new project in and have the next fix lined up. That should hopefully go live on Monday.

Morning @masterginger,

The new fix has gone live. Please give your test project a build when you get a chance :crossed_fingers:

Thanks
Dom

Hey @DominicLavery,

Thanks for following up on this!

We examined the recent builds in the test project and didn’t see the “pthread_create failed (EAGAIN)” issue anymore.

We will keep watching the builds after you roll the v2 container runtime out to our org and let you know if we come across anything.

Thanks,
Bian

Thank you for checking @masterginger! That’s a big help.

I’ve rolled the org back in

Dom

Hey @DominicLavery,

We at Melio are seeing significantly reduced performance in CPU-intensive builds following the rollout from Agent v1 to v2.

Any idea what might be the reason behind this?

Hi @eyalroth,

One possibility is that the build may be relying on cgroup v1 to detect how many CPUs it has access to and to configure things like how many worker threads to start based on that. This upgrade includes a move to cgroup v2, so the tools and frameworks used may need to be upgraded too.
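
To illustrate, here is a rough sketch (Python, reading the standard kernel interface files; not anything CircleCI-specific) of the difference: code that only checks the cgroup v1 files will see no quota on a cgroup v2 host and fall back to the full host CPU count.

```python
# Minimal sketch: read the CPU quota a container has been given, checking
# cgroup v2 first, then cgroup v1, and falling back to the host CPU count.
import os
from pathlib import Path


def effective_cpus() -> float:
    # cgroup v2: /sys/fs/cgroup/cpu.max holds "<quota> <period>" or "max <period>"
    v2 = Path("/sys/fs/cgroup/cpu.max")
    if v2.exists():
        quota, period = v2.read_text().split()
        if quota != "max":
            return int(quota) / int(period)

    # cgroup v1: quota and period live in separate files; a quota of -1 means unlimited
    v1_quota = Path("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")
    v1_period = Path("/sys/fs/cgroup/cpu/cpu.cfs_period_us")
    if v1_quota.exists() and v1_period.exists():
        quota = int(v1_quota.read_text())
        if quota > 0:
            return quota / int(v1_period.read_text())

    # No cgroup limit visible: fall back to the host CPU count
    return float(os.cpu_count() or 1)


if __name__ == "__main__":
    cpus = effective_cpus()
    print(f"effective CPUs: {cpus:.2f}, suggested worker threads: {max(1, int(cpus))}")
```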

Could you provide a couple of example builds from before and after, and I can take a look?

Thanks
Dom

Indeed, this affected a project with auto-detection of the number of CPUs, which failed on v2 and was replaced with a hardcoded CPU count:

Both builds show 100% CPU usage in the resources view.

But it also appears in other projects with no such functionality:

The v1 build shows more CPU usage in the resources view. There are differences in the code between the builds, but apparently they shouldn’t affect the time much.

Thanks @eyalroth

I’ve made a configuration change to the cluster that I believe will fix the high CPU case.

The other case seems different; I need to investigate it a little further to try and narrow down the cause(s).

The first project indeed seems to be showing the same performance as before.

Thank you for the quick response!

We ran into test failures all of a sudden yesterday, and after hours of wasted time we spotted the link to this discussion in the “Spin up environment” step.

Even after reading through everything here, I still can’t see any helpful information on how to resolve the issue we ran into or how to get back to the pre-v2 container runtime.

The problem we ran into was that, all of a sudden, one of the services we start up for testing could no longer run on port 80 because the port was blocked / already in use. At first we assumed it was an issue with something that changed in that service or on our end in general.

To resolve the port clash, we even tried naming containers as suggested on the “Port conflicts with service containers on Docker executor” page (I tried to link it, but somehow I’m not allowed to add links to my post).

But none of it helped, because it appears that the port 80 clash is not between different containers but between the container and something in the underlying infrastructure introduced by the v2 changes. Can that be right?
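
In case it helps narrow things down, here is a rough diagnostic sketch (Python; just an illustration of the check, not something provided by CircleCI) that could run as an early job step to confirm that port 80 is already bound before any of our own service containers start.

```python
# Minimal diagnostic sketch: try to bind port 80 at the start of the job.
# If the bind fails with EADDRINUSE, something in the environment is already
# listening on that port before any of our own service containers start.
import errno
import socket


def port_in_use(port: int, host: str = "0.0.0.0") -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
            return False
        except PermissionError:
            # Ports below 1024 need root (or CAP_NET_BIND_SERVICE);
            # a permission error is not the same thing as a conflict.
            raise
        except OSError as exc:
            return exc.errno == errno.EADDRINUSE


if __name__ == "__main__":
    print("port 80 already in use:", port_in_use(80))
```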

Overall, this has been a very frustrating experience and generally a bad week with another two CircleCI incidents happening in the last few days.

Please consider how you can communicate changes like this better, especially when they could break existing functionality.
