Within the same job, we started getting different arch values for parallel runs

We use arch as part of the key to cache Ruby gems between jobs/steps.

Within the same job, some of the parallel runs have started getting arch = arch1-linux-amd64-6_106

Checking back a couple of days ago, we were consistently getting: arch1-linux-amd64-6_85

Is there a reason these would vary between parallel runs in the same job or workflow, or is this a bug?

It happened to me when I reran the job with SSH. Was that your case?

:wave: Hello @gonzaloserrano,

It doesn’t appear to be related/limited to SSH reruns.

I’m seeing another user reporting this behaviour (Revision inconsistent during workflow).

We’re looking into it.

@gonzaloserrano, @sakif-imtiaz,

Please find below the explanation for the discrepancy you’ve observed.

As mentioned in our documentation, the {{ arch }} template captures:

  • architecture
  • family
  • model

The underlying machines on which Docker jobs run are usually EC2 instances with chipsets “Family 6, Model 85”. In some cases, where there isn’t sufficient capacity of these instances, Docker jobs might instead run on EC2 instances with chipsets “Family 6, Model 106”.
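To make the composition concrete, here is a minimal Python sketch that splits the two arch values quoted in this thread into their parts and derives a coarser key without family and model. The layout `<prefix>-<os>-<cpu arch>-<family>_<model>` is an assumption inferred from those two values only, and `parse_arch` and `coarse_key` are hypothetical helper names, not part of any CircleCI tooling.

```python
from typing import NamedTuple


class ArchValue(NamedTuple):
    prefix: str
    os: str
    cpu_arch: str
    family: int
    model: int


def parse_arch(value: str) -> ArchValue:
    # Assumed layout (inferred from the values in this thread):
    # <prefix>-<os>-<cpu arch>-<family>_<model>
    prefix, os_name, cpu_arch, chipset = value.split("-")
    family, model = chipset.split("_")
    return ArchValue(prefix, os_name, cpu_arch, int(family), int(model))


def coarse_key(value: str) -> str:
    # Keep only the parts that matter for multi-arch artifacts.
    parsed = parse_arch(value)
    return f"{parsed.os}-{parsed.cpu_arch}"


old = "arch1-linux-amd64-6_85"
new = "arch1-linux-amd64-6_106"

print(parse_arch(old))
print(parse_arch(new))
# Both values collapse to the same coarse key once family/model are dropped.
print(coarse_key(old), coarse_key(new))
```

The two values differ only in the model component, which is why a cache key containing the full arch value misses when a job lands on the other instance type.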

Our approach is that we prefer to run on instances with a slightly different CPU than to delay the job, or potentially not run it at all.

We’re currently assessing whether or not we can/should simplify the {{ arch }} template’s granularity to prevent occurrences of the behaviour you’ve observed across your Docker jobs.

For now, I suggest relying on other templates when using the cache feature in Docker jobs.

Let me know if you have further questions.

It would be great if you could remove the family and model, as the arch variable is useful for building multi-arch artifacts.

Thanks a lot for reporting this. We have these in our cache keys and were therefore having issues with cache restores. The arch value should definitely be stripped of the CPU family; it’s about distinguishing build artefacts for the different CPU architectures!

This issue has recently hit us with regard to test splitting. We split our tests across 20 machines. We’ve noticed that sometimes a few machines will take longer than the others, something like 20-25% longer, to run the tests. We’ve tracked it down to this issue: the 106 machines are always fast, and the 85 machines are always slower.

This has a significant impact on test splitting. The whole idea behind the split is that you are doing it across equivalent machines based on timing data from a previous run on the same machines. We aim to get our tests done in <=20 minutes, but, sometimes, just getting dealt a single 85 machine takes us to 25 mins which is super frustrating.
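The effect described above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: the split is perfectly balanced at 20 minutes per machine on the faster hardware, and a slower machine takes 25% longer for the same share. `wall_clock_minutes` is a hypothetical helper, and the numbers come from this post, not from any measurement.

```python
def wall_clock_minutes(balanced_minutes: float, slowdowns: list[float]) -> float:
    # Each machine gets a share sized to take `balanced_minutes` on baseline
    # hardware; the job finishes only when the slowest machine finishes.
    return max(balanced_minutes * factor for factor in slowdowns)


machines = 20
all_fast = [1.0] * machines
one_slow = [1.0] * (machines - 1) + [1.25]  # a single machine 25% slower

print(wall_clock_minutes(20.0, all_fast))  # 20.0
print(wall_clock_minutes(20.0, one_slow))  # 25.0
```

Because the job duration is the maximum over all machines, a single slower machine is enough to move the whole run from 20 to 25 minutes.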

There is also the financial side - we’re being charged the same amount for both machines per minute, but actually, as they take longer, we get charged more for the slower machine.
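As a rough illustration of the billing point, the sketch below totals machine-minutes for a 20-way split. It assumes each parallel run is billed for its own runtime at the same per-minute rate, which is how the post describes it; `billed_machine_minutes` is a hypothetical helper and the durations are the illustrative figures from this thread.

```python
def billed_machine_minutes(per_machine_minutes: list[float]) -> float:
    # Assumption: every machine is billed per minute at the same rate,
    # so total cost is proportional to the sum of the runtimes.
    return sum(per_machine_minutes)


all_fast = [20.0] * 20
one_slow = [20.0] * 19 + [25.0]  # one machine takes 25 minutes instead of 20

extra = billed_machine_minutes(one_slow) - billed_machine_minutes(all_fast)
print(extra)  # 5.0 extra machine-minutes for the same work
```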

Can we please get a configuration option added to say we want to run on identical machines?