Docker Executor Infrastructure Upgrade

Hi @victorlee134

We’ve recreated the issue loom was seeing and have released a fix. We’ll re-enable the org shortly.

Please don’t hesitate to reach out again if you spot anything else.

Thanks again

Dom

Hi @DominicLavery

We’ve seen consistent failures in the past 2 hours. The first failure appeared when the v2 container runtime started being used. Could you opt our org out again?

Examples:
github/loomhq/loom/180705/workflows/dae953e6-f14d-465a-8ede-781fed76e6db/jobs/2623678
github/loomhq/loom/180705/workflows/dae953e6-f14d-465a-8ede-781fed76e6db/jobs/2623671
github/loomhq/loom/180705/workflows/dae953e6-f14d-465a-8ede-781fed76e6db/jobs/2623671

Sorry about that @victorlee134
I’ve run the opt-out; it can take around 10 minutes to apply.

Just a heads up: the first linked issue did start occurring on the old infrastructure, so you may still see that one. See: loomhq/loom/180697/workflows/93f2900c-0d84-4700-9828-d45e991d8cf9/jobs/2623493

Thanks @DominicLavery!

And yeah my mistake, pasted the wrong pipeline failure link

Hi @masterginger

We believe the issue attentive-mobile saw was due to a new PID limit that was being automatically and implicitly applied in the new environment.

We’ve increased the limit substantially and will opt the org back in. I’ll keep an eye on the builds to make sure the issue is fixed and reapply the opt-out if it recurs.
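
If it helps anyone following along, here is a rough sketch (Python, assuming the standard kernel cgroup files are visible inside the job; nothing CircleCI-specific) for printing the PID limit a container actually sees. When the limit is too low, it tends to surface as errors like pthread_create failing with EAGAIN.

```python
# Minimal sketch: print the PID limit visible inside a Docker executor job.
# Which file exists depends on whether the host uses cgroup v1 or v2;
# a value of "max" means no limit is applied.
from pathlib import Path

CANDIDATES = [
    Path("/sys/fs/cgroup/pids.max"),       # cgroup v2 (unified hierarchy)
    Path("/sys/fs/cgroup/pids/pids.max"),  # cgroup v1
]

for path in CANDIDATES:
    if path.exists():
        print(f"{path}: {path.read_text().strip()}")
        break
else:
    print("No pids.max file found; no PID limit is visible here.")
```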

Thanks

Thank you @DominicLavery for the update! We will watch the builds too and let you know if we see anything unusual :slight_smile:

Hey @DominicLavery,

It’s happening again: https://app.circleci.com/pipelines/github/attentive-mobile/code/563236/workflows/9d8218d9-3e4e-4282-ad26-05f1deebdf32/jobs/7257268/parallel-runs/0/steps/0-122?invite=true#step-122-639983_99

It’s using v2 container runtime.

Could you opt us out again?

Thanks!
Bian

Another question: is there a way we could enable it on an individual job so that we can test it out before it’s rolled out to our org?

Hey @masterginger

Sorry to hear that! I’ve reapplied the opt-out.

We don’t currently have the ability to enable it at a per-job or per-branch level. We can do it at the project level if that would help?

Thanks
Dom

Hey @DominicLavery,

Thanks for the quick response! We’ve created a test project https://app.circleci.com/pipelines/github/attentive-mobile/v2-container-runtime-test , could you enable v2 container runtime for that project only?

We will put the reproducible code and steps in that project and hopefully that will provide good signals for CircleCI to debug the issue :slight_smile:

Thanks,
Bian

Thanks @masterginger! That is really helpful.

I’ve opted the new project in and have the next fix lined up. That should hopefully go live on Monday.

Morning @masterginger,

The new fix has gone live. Please give your test project a build when you get a chance :crossed_fingers:

Thanks
Dom

Hey @DominicLavery,

Thanks for following up on this!

We examined the recent builds in the test project and didn’t see the “pthread_create failed (EAGAIN)” issue anymore.

We will keep watching the builds after you roll the v2 container runtime out to our org and let you know if we come across anything.

Thanks,
Bian

Thank you for checking @masterginger! That’s a big help.

I’ve rolled the org back in

Dom

Hey @DominicLavery,

We at Melio are seeing significantly reduced performance in CPU-intensive builds following the rollout from Agent v1 to v2.

Any idea what might be the reason behind this?

Hi @eyalroth,

One possibility is that the build may be relying on cgroup v1 to detect how many CPUs it has access to and to configure things like how many worker threads to start based on that. This upgrade includes a move to cgroup v2, so the tools and frameworks used may need to be upgraded too.
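
To illustrate, here is a rough sketch (Python, reading the standard kernel interface files; not anything CircleCI-specific) of the difference: code that only checks the cgroup v1 files will see no quota on a cgroup v2 host and fall back to the full host CPU count.

```python
# Minimal sketch: read the CPU quota a container has been given, checking
# cgroup v2 first, then cgroup v1, and falling back to the host CPU count.
import os
from pathlib import Path


def effective_cpus() -> float:
    # cgroup v2: /sys/fs/cgroup/cpu.max holds "<quota> <period>" or "max <period>"
    v2 = Path("/sys/fs/cgroup/cpu.max")
    if v2.exists():
        quota, period = v2.read_text().split()
        if quota != "max":
            return int(quota) / int(period)

    # cgroup v1: quota and period live in separate files; a quota of -1 means unlimited
    v1_quota = Path("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")
    v1_period = Path("/sys/fs/cgroup/cpu/cpu.cfs_period_us")
    if v1_quota.exists() and v1_period.exists():
        quota = int(v1_quota.read_text())
        if quota > 0:
            return quota / int(v1_period.read_text())

    # No cgroup limit visible: fall back to the host CPU count
    return float(os.cpu_count() or 1)


if __name__ == "__main__":
    cpus = effective_cpus()
    print(f"effective CPUs: {cpus:.2f}, suggested worker threads: {max(1, int(cpus))}")
```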

Could you provide a couple of example builds from before and after, and I can take a look?

Thanks
Dom

Indeed, this affected a project with auto-detection of the number of CPUs, which failed on v2 and was replaced with a hardcoded CPU count:

Both builds show 100% CPU usage in the resources view.

But it also appears in other projects with no such functionality:

The v1 build shows more CPU usage in the resources view. There are differences in the code between the builds, but apparently they shouldn’t affect the time much.

Thanks @eyalroth

I’ve made a configuration change to the cluster that I believe will fix the high CPU case.

The other case seems different; I need to investigate it a little further to try and narrow down the cause(s).

The first project indeed seems to be showing the same performance as before.

Thank you for the quick response!

We ran into test failures all of a sudden yesterday, and after hours of wasted time we spotted the link to this discussion in the “Spin up environment” step.

Even after reading through everything here, I still can’t see any helpful information on how to resolve the issue we ran into or how to get back to the pre-v2 container runtime.

The problem we ran into was that, all of a sudden, one of the services we start up for testing could no longer run on port 80 because the port was blocked / already in use. At first we assumed it was an issue with something that changed in that service or on our end in general.

To resolve the port clash, we even tried naming containers as suggested on the “Port conflicts with service containers on Docker executor” page (I tried to link it, but somehow I’m not allowed to add links to my post).

But none of it helped, because it appears that the port 80 clash is not between different containers but between the container and something in the underlying infrastructure introduced by the v2 changes. Can that be right?
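
In case it helps narrow things down, here is a rough diagnostic sketch (Python; just an illustration of the check, not something provided by CircleCI) that could run as an early job step to confirm that port 80 is already bound before any of our own service containers start.

```python
# Minimal diagnostic sketch: try to bind port 80 at the start of the job.
# If the bind fails with EADDRINUSE, something in the environment is already
# listening on that port before any of our own service containers start.
import errno
import socket


def port_in_use(port: int, host: str = "0.0.0.0") -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
            return False
        except PermissionError:
            # Ports below 1024 need root (or CAP_NET_BIND_SERVICE);
            # a permission error is not the same thing as a conflict.
            raise
        except OSError as exc:
            return exc.errno == errno.EADDRINUSE


if __name__ == "__main__":
    print("port 80 already in use:", port_in_use(80))
```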

Overall, this has been a very frustrating experience and generally a bad week with another two CircleCI incidents happening in the last few days.

Please consider how you can communicate changes like this better, especially when they could break existing functionality.
