Non-deterministic cache-restore behaviour

I’m relying on the prefix-matching and “most recent” behaviour described here to obtain the cache from the latest build for the current branch.

However, Circle sometimes seems to be picking up random older versions. For example, for my latest two builds on the develop branch:

Build #2227

Found a cache from build 2226 at platform-v3-develop-gradle
[...]

Seems ok.

Build #2229

Found a cache from build 2166 at platform-v3-develop-gradle
[...]

Why is it using the cache from ages ago?

(Note that build #2228 was a different branch.)

The relevant parts of my config are as follows:

- restore_cache:
    keys:
      - platform-v3-{{ .Branch }}-gradle

...

- run:
    command: find ~/.gradle -path ~/.gradle/daemon -prune -o -name '*' > ~/gradle.txt

- save_cache:
    key: platform-v3-{{ .Branch }}-gradle-{{ checksum "~/gradle.txt" }}
    paths:
      - ~/.gradle

Am I doing something wrong, or is there a bug/weirdness in Circle here?

I’m seeing this as well. Started happening a few days ago:

Six days ago, build 45083 pulled its cache from build 45062, the previous build on the same branch.
The next build on the same branch (45129, from two days ago) pulled its cache from build 44730, a build from 15 days ago.

[Edit] Never mind, I spoke too soon. Some builds are restoring the proper cache, but others are still (and consistently) restoring from 44730.

Yup, I’m still seeing this too.

I just commented on Spurious build success as well, but it seems like the Restore Cache step has been pretty wonky lately. Happy to test config changes to help debug the issue if anyone has a suggestion.

Why has there been no response from CircleCI yet? Today this resulted in the deployment of a month-old version of our code without any warning, not to mention all the extra time spent building because the cache isn’t working properly. This is urgent for us; otherwise we will need to find another service for builds and deployment.

We’ve had a couple of support tickets come in about this, and our engineers are currently investigating. It took our support team some time to rule out restore_cache key ordering and substring matches as the cause.

In one case, a user had a race condition in his config: one job in his workflow restored caches that were saved by an earlier job, but there was no job dependency between the two, so the restoring job sometimes ran first. It’s worth checking whether this applies to you.
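
In a 2.0 workflows config, that ordering is declared with requires. A minimal sketch, with illustrative job names, would look something like this:

workflows:
  version: 2
  build_and_deploy:
    jobs:
      - build          # saves the cache
      - deploy:        # restores the cache
          requires:
            - build    # guarantees the save runs before the restore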

For other users, there is unexpected behavior; that’s what our engineers are investigating. We’ll post back here when we have more information.

Thanks for the response, Eric.

Do you have any recommendation for a mechanism to move a built Docker image from one job to another? As I said, we have one job that builds the image and a second that deploys it, tied together in one workflow. We use the cache so we don’t have to rebuild the image in the deploy job, roughly as in the sketch below. I don’t consider this the best possible solution; what we really need is a persistent remote_docker instance across the whole workflow, but that doesn’t seem to be possible, or is it?
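
A minimal sketch of that approach, with a placeholder image name and cache key:

# build job
- run:
    command: |
      docker build -t myapp:$CIRCLE_SHA1 .
      mkdir -p ~/docker-cache
      docker save -o ~/docker-cache/myapp.tar myapp:$CIRCLE_SHA1
- save_cache:
    key: docker-image-{{ .Branch }}-{{ .Revision }}
    paths:
      - ~/docker-cache

# deploy job
- restore_cache:
    keys:
      - docker-image-{{ .Branch }}-{{ .Revision }}
- run:
    command: docker load -i ~/docker-cache/myapp.tar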

Our engineers have pushed out a fix for this. Can you all run some more jobs on the problematic projects/workflows and post back here if you’re still seeing old caches being used?


My answer to this is “it depends.” If your images aren’t too large, then caching can be a reasonable option. Jobs that rely on caching to get a dependency from past jobs should be robust against a cache-miss, though. For example, in your deploy job, perhaps it can check the git SHA in the image against the checked out code before pushing. If there’s a mismatch, the deploy job should fail or rebuild the docker image you want. When it comes to software, correct/accurate and slower tends to be preferable to incorrect/inaccurate and faster.
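
As a rough sketch of that check, assuming the build job added a git_sha label and tagged the image myapp:latest (both placeholder names), the deploy job could run something like:

- run:
    name: Verify cached image matches checked-out code
    command: |
      # git_sha is a label added at build time, e.g.
      #   docker build --label git_sha=$CIRCLE_SHA1 -t myapp:latest .
      image_sha=$(docker inspect --format '{{ index .Config.Labels "git_sha" }}' myapp:latest || true)
      if [ "$image_sha" != "$CIRCLE_SHA1" ]; then
        echo "Stale or missing image (got '$image_sha', want $CIRCLE_SHA1); rebuilding"
        docker build --label git_sha=$CIRCLE_SHA1 -t myapp:latest .
      fi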

If your images are large, then it may be more practical to push and pull from a registry. In practice, I’ve found that the bandwidth for pushing to and pulling from Dockerhub is greater than the bandwidth of saving and restoring a cache. We currently use AWS EC2 instances for the machines that run your jobs, and S3 for caching. I believe Dockerhub runs on EC2 as well, which could explain the higher transfer rates.
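
A minimal sketch of the registry route, with a placeholder image name and with DOCKERHUB_USER / DOCKERHUB_PASS assumed to be set as project environment variables:

# in the build job
- run:
    command: |
      docker build -t mycompany/myapp:$CIRCLE_SHA1 .
      docker login -u "$DOCKERHUB_USER" -p "$DOCKERHUB_PASS"
      docker push mycompany/myapp:$CIRCLE_SHA1

# in the deploy job
- run:
    command: |
      docker pull mycompany/myapp:$CIRCLE_SHA1
      # ...deploy the pulled image...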

It was strange to me to find out EC2 <=> S3 transfer rates differ so much from EC2 <=> EC2 when they’re all in the same region, but I suppose that’s just the current reality.

In any case, I think it’s worth trying out both approaches to see which one works for you, and periodically re-evaluating that decision as your deployment process changes (for instance, when your deployment images grow substantially). Using an external registry like Dockerhub does introduce additional failure modes, though. Last week, for instance, some major DNS servers had issues and Dockerhub wasn’t resolving properly.

A persistent remote_docker instance across a workflow isn’t currently possible, at least not with any guarantees. Reusable Docker engines come close, but depending on your workflow configuration and how many workflows are running concurrently, your reusable engines might be handed out to other workflows between jobs. For something as critical as deployment, I wouldn’t assume the remote engine still has an image from a previous job.
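
If you still want to experiment with reusable engines despite that caveat, the relevant setting looks roughly like this (treat it as a sketch and check the docs for the current syntax):

- setup_remote_docker:
    reusable: true    # request a reusable engine; reuse is best-effort, not guaranteed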

Hi @Eric, thanks for the update/fix! Anecdotally my last seven builds have all done the correct thing in terms of caching, so it looks promising. I’ll keep an eye on things for the next day or so.

It looks like after the fix was pushed, my restore_cache steps failed with:

Skipping cache - error checking storage: RequestError: send request failed
caused by: Get https://circle-production-customer-artifacts.s3.amazonaws.com/...snip...: x509: failed to load system roots and no roots provided

I simply had to install the ca-certificates package via a run command before the restore_cache step to get it to work.
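
For example, with a Debian/Ubuntu-based build image it’s something along these lines (use apk on Alpine, prefix with sudo if your image doesn’t run as root; the cache key is just a placeholder):

- run:
    name: Install CA certificates so the cache download over HTTPS works
    command: apt-get update && apt-get install -y ca-certificates
- restore_cache:
    keys:
      - v1-deps-{{ .Branch }}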

Thank you for sharing that error message and your fix, Jared. I’ve shared this with our engineers so they can consider the UX of users having to install CA certs in their Docker images to get caching working.
