Non-deterministic cache-restore behaviour

I’m relying on the prefix-matching and “most recent” behaviour described here to obtain the cache from the latest build for the current branch.

However, Circle sometimes seems to be picking up random older versions. For example, for my latest two builds on the develop branch:

Build #2227

Found a cache from build 2226 at platform-v3-develop-gradle
[...]

Seems ok.

Build #2229

Found a cache from build 2166 at platform-v3-develop-gradle
[...]

Why is it using the cache from ages ago?

(Note that build #2228 was a different branch.)

The relevant parts of my config are as follows:

- restore_cache:
    keys:
      - platform-v3-{{ .Branch }}-gradle

...

- run:
    command: find ~/.gradle -path ~/.gradle/daemon -prune -o -name '*' > ~/gradle.txt

- save_cache:
    key: platform-v3-{{ .Branch }}-gradle-{{ checksum "~/gradle.txt" }}
    paths:
      - ~/.gradle

Am I doing something wrong, or is there a bug/weirdness in Circle here?

I’m seeing this as well. Started happening a few days ago:

Six days ago, build 45083 pulled its cache from build 45062, the previous build on the same branch.
The next build on the same branch (45129, from two days ago) pulled its cache from build 44730, a build from 15 days ago.

[Edit] Never mind, I spoke too soon. Some builds are restoring the proper cache, but others are still (and consistently) restoring from 44730.

Yup, I’m still seeing this too.

I just commented on Spurious build success as well, but it seems like the Restore Cache step has been pretty wonky lately. Happy to test config changes to help debug the issue if anyone has a suggestion.

Why has there been no response from CircleCI yet? Today this resulted in the deployment of a month-old version of our code without any warning, not to mention all the extra time spent building because the cache isn’t working properly. This is urgent for us; otherwise we will need to find another service for builds and deployment.

We’ve had a couple of support tickets come in about this, and our engineers are currently investigating. It took our support team some time to rule out restore_cache key ordering and substring matches as the cause.

In one case, a user had a race condition in his config: one job in his workflow restored caches that were saved by an earlier job, but there was no job dependency between the two, so the restoring job sometimes ran first. It’s worth checking whether this applies to you.
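
In a 2.0 workflows config, that ordering is declared with requires. A minimal sketch, with illustrative job names, would look something like this:

workflows:
  version: 2
  build_and_deploy:
    jobs:
      - build          # saves the cache
      - deploy:        # restores the cache
          requires:
            - build    # guarantees the save runs before the restore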

For other users, there is unexpected behavior; that’s what our engineers are investigating. We’ll post back here when we have more information.

Thanks for the response, Eric.

Do you have any recommendation for a mechanism to move a built Docker image from one job to another? As I said, we have one job that builds the image and a second that deploys it, tied together in one workflow. We use the cache so we don’t have to rebuild the image in the deploy job, roughly as in the sketch below. I don’t consider this the best possible solution; what we really need is a persistent remote_docker instance across the whole workflow, but that doesn’t seem to be possible, or is it?
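
A minimal sketch of that approach, with a placeholder image name and cache key:

# build job
- run:
    command: |
      docker build -t myapp:$CIRCLE_SHA1 .
      mkdir -p ~/docker-cache
      docker save -o ~/docker-cache/myapp.tar myapp:$CIRCLE_SHA1
- save_cache:
    key: docker-image-{{ .Branch }}-{{ .Revision }}
    paths:
      - ~/docker-cache

# deploy job
- restore_cache:
    keys:
      - docker-image-{{ .Branch }}-{{ .Revision }}
- run:
    command: docker load -i ~/docker-cache/myapp.tar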

Our engineers have pushed out a fix for this. Can you all run some more jobs on the problematic projects/workflows and post back here if you’re still seeing old caches being used?


My answer to this is “it depends.” If your images aren’t too large, then caching can be a reasonable option. Jobs that rely on caching to get a dependency from past jobs should be robust against a cache-miss, though. For example, in your deploy job, perhaps it can check the git SHA in the image against the checked out code before pushing. If there’s a mismatch, the deploy job should fail or rebuild the docker image you want. When it comes to software, correct/accurate and slower tends to be preferable to incorrect/inaccurate and faster.
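
As a rough sketch of that check, assuming the build job added a git_sha label and tagged the image myapp:latest (both placeholder names), the deploy job could run something like:

- run:
    name: Verify cached image matches checked-out code
    command: |
      # git_sha is a label added at build time, e.g.
      #   docker build --label git_sha=$CIRCLE_SHA1 -t myapp:latest .
      image_sha=$(docker inspect --format '{{ index .Config.Labels "git_sha" }}' myapp:latest || true)
      if [ "$image_sha" != "$CIRCLE_SHA1" ]; then
        echo "Stale or missing image (got '$image_sha', want $CIRCLE_SHA1); rebuilding"
        docker build --label git_sha=$CIRCLE_SHA1 -t myapp:latest .
      fi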

If your images are large, then it may be more practical to push and pull from a registry. In practice, I’ve found that the bandwidth for pushing to and pulling from Dockerhub is greater than the bandwidth of saving and restoring a cache. We currently use AWS EC2 instances for the machines that run your jobs, and S3 for caching. I believe Dockerhub runs on EC2 as well, which could explain the higher transfer rates.
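
A minimal sketch of the registry route, with a placeholder image name and with DOCKERHUB_USER / DOCKERHUB_PASS assumed to be set as project environment variables:

# in the build job
- run:
    command: |
      docker build -t mycompany/myapp:$CIRCLE_SHA1 .
      docker login -u "$DOCKERHUB_USER" -p "$DOCKERHUB_PASS"
      docker push mycompany/myapp:$CIRCLE_SHA1

# in the deploy job
- run:
    command: |
      docker pull mycompany/myapp:$CIRCLE_SHA1
      # ...deploy the pulled image...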

It was strange to me to find out EC2 <=> S3 transfer rates differ so much from EC2 <=> EC2 when they’re all in the same region, but I suppose that’s just the current reality.

In any case, I think it’s worth trying out both approaches to see which one works for you, and periodically re-evaluating that decision as your deployment process changes (for instance, when your deployment images grow substantially). Using an external registry like Dockerhub does introduce additional failure modes, though. Last week, for instance, some major DNS servers had issues and Dockerhub wasn’t resolving properly.

A persistent remote_docker instance across a workflow isn’t currently possible, at least not with any guarantees. Reusable Docker engines come close, but depending on your workflow configuration and how many workflows are running concurrently, your reusable engines might be handed out to other workflows between jobs. For something as critical as deployment, I wouldn’t assume the remote engine still has an image from a previous job.
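
If you still want to experiment with reusable engines despite that caveat, the relevant setting looks roughly like this (treat it as a sketch and check the docs for the current syntax):

- setup_remote_docker:
    reusable: true    # request a reusable engine; reuse is best-effort, not guaranteed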

Hi @Eric, thanks for the update/fix! Anecdotally my last seven builds have all done the correct thing in terms of caching, so it looks promising. I’ll keep an eye on things for the next day or so.

It looks like after the fix was pushed, my restore_cache steps failed with:

Skipping cache - error checking storage: RequestError: send request failed
caused by: Get https://circle-production-customer-artifacts.s3.amazonaws.com/...snip...: x509: failed to load system roots and no roots provided

I simply had to install the ca-certificates package via a run command before the restore_cache step to get it to work.
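
For example, with a Debian/Ubuntu-based build image it’s something along these lines (use apk on Alpine, prefix with sudo if your image doesn’t run as root; the cache key is just a placeholder):

- run:
    name: Install CA certificates so the cache download over HTTPS works
    command: apt-get update && apt-get install -y ca-certificates
- restore_cache:
    keys:
      - v1-deps-{{ .Branch }}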

Thank you for sharing that error message and your fix, Jared. I’ve shared this with our engineers so they can consider the UX of users having to install CA certs in their Docker images to get caching working.
