Docker Executor Infrastructure Upgrade

Hi,

We’re seeing failures since one of our jobs was moved to cgroup v2 (here, for instance).

This happens with pytest while it’s collecting tests to run. I couldn’t figure out the cause.

Could you either help us solve it, or opt us out (if that’s still possible), as this is blocking a release?

Thanks,
Cyril.

Hi @patatepartie,

Sorry to hear that!

We’ve not seen that particular issue before. It’s hard to say whether it’s directly related to v2 from looking at the history of the project’s jobs, so I’ve applied a temporary opt-out to help narrow that down.

In this case, the failing job is an arm one, which has been handled slightly differently. The opt-out I’ve applied will only affect arm jobs.

Dom

Several of our pipelines are also failing after being migrated to v2. We’d like to request a rollback. We created support tickets for the issue but have had radio silence for over a week. This is severely impacting our deployments.

Hi @dasph,

I’m sorry to hear this. If you can provide a link to your builds or a ticket number, I can look into this.

Dom

Thanks for the reply. Our ticket number is 161360. I’ve provided some updates since opening the ticket.

Thanks @dasph,

I’ve taken a look at your ticket and the affected jobs. There are some known issues with cgroup v2 in the JVM version the jobs are using. Java 21 is the first version that fully supports it, but it does look like 11.0.16 also got a fix.

To support v2, could you please look at upgrading to one of these versions? I’ve applied a 7-day opt-out to get your jobs running again in the meantime. It can take 10 minutes to take effect.
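
If it helps to confirm which cgroup version a job’s host is actually on once the opt-out takes effect, here’s a minimal sketch (assuming a Linux host and the standard `/sys/fs/cgroup` mount point; the helper name is illustrative, not a CircleCI API): cgroup v2 exposes a `cgroup.controllers` file at the root of the unified hierarchy, while v1 does not.

```python
# Illustrative helper: infer the cgroup version from the hierarchy root.
# On cgroup v2, the unified hierarchy exposes a cgroup.controllers file
# directly at the mount point; cgroup v1 does not.
from pathlib import Path

def cgroup_version(root: str = "/sys/fs/cgroup") -> str:
    hierarchy = Path(root)
    if (hierarchy / "cgroup.controllers").is_file():
        return "v2"
    if hierarchy.is_dir():
        return "v1"
    return "unknown"

# Run on the job's host, e.g. in a build step: print(cgroup_version())
```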

Dom

Thank you, Dominic.
I can confirm that the following jobs worked without issues, so the cause was the move to v2.

One of my colleagues tracked it down to the loading of the spacy library (during pytest collection).
Since it’s likely this will happen again once we’re permanently moved to v2, is there a way for us to test the build on v2 without all of our arm jobs being moved to it? Maybe just for one branch?
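
For reference, this kind of import-time failure can be isolated without running a single test: pytest’s collection phase already imports the test modules (and therefore spacy). A minimal sketch of the same check in plain Python (the module names below are only illustrative):

```python
# Sketch: verify whether a suspect library imports cleanly, mirroring
# what pytest does during collection when it imports test modules.
# Running `pytest --collect-only -q` exercises the real collection phase.
import importlib

def import_ok(module_name: str) -> bool:
    """Return True if the module imports without raising."""
    try:
        importlib.import_module(module_name)
        return True
    except Exception:
        return False

# e.g. import_ok("spacy") on a v2 host narrows the failure to import time.
```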

Cheers,
Cyril.

Hey Dominic, thanks for the assistance. The jobs are now working.
We’re going to upgrade to 11.0.16. Could you move one of our pipelines to v2 so that we can test it?
Thanks!

Hi @patatepartie & @dasph,

Thank you both for getting back to me and looking into your respective projects.

We have two options for making testing v2 less disruptive at the moment.

  1. We can arrange a date and time to opt your project in and ensure someone is available to provide support.
  2. You can create a new project with a replica of your job which we can ensure is opted in. You can then test before your main project is put back on to v2.

Please let me know which works best for you.

Dom

Thanks, Dom.

I went with option 2 and the new project automatically used V2 (and reproduced the failure).
We’ll be able to investigate with this.

Will let you know when we’ve fixed it.

Cheers,
Cyril.


In the last couple of days we have been experiencing browser shutdown errors.

We run a medium-sized Rails app, and the integration tests run with Capybara + Cuprite + Ferrum (with Chrome under the hood). At the testing level we haven’t changed anything in our setup.

We use large instances, and the resources seem to be fine; there’s plenty of RAM free.

The errors are random. We have the test execution split across two workers, and it fails randomly in zero, one, or both.

Do you have any suggestions?

Hi @ferblape,

We’ve not made any changes to the v2 environment in the last couple of days that could account for this and the rollout has been complete since the 3rd of April.

If you have additional info as to why you think this relates to the upgrade, please let me know. Otherwise, if you believe this is related to a change at CircleCI, please open a support request.

Thanks

Dom

My builds don’t run at all. For one job, all the steps succeeded, but CircleCI marked the job as “not running” and didn’t run the rest of the pipeline.

So I cancelled it (it still showed as “running” in the Workflow view) and selected “Rerun workflow from failed”. Now the job is stuck in “Queued” even though no other jobs are running.

https://app.circleci.com/pipelines/github/pictretail/pict/19412/workflows/300b2f7a-276c-4868-a445-d0e5f3e95f07

The exact same thing is happening to me.

I ran my deploy, it shows up as “running,” but when I go to build, it shows up as “queued” and I have a significant upload.

I deployed earlier today without any problems.

Any suggestions?

See the related incident on the CircleCI status page: “Final results of some jobs may not be reported in the UI”.

This rollout is complete, so I’ll shortly be closing this topic.

For users who have previously requested help with v2 here, you’ll receive a DM once your case has been reviewed.

For users with any new issues, please open a support ticket or start a new topic.

Thank you