I’m seeing this as well. Started happening a few days ago:
Six days ago, build 45083 was pulling a cache from build 45062, the previous build on the same branch.
The next build on the same branch (45129, from two days ago) is now pulling a cache from build 44730, a build from 15 days ago.
Why has there been no response from CircleCI yet? Today this resulted in deploying a month-old version of our code without any notice, not to mention all the additional time spent building because the cache isn’t working properly. This is urgent for us; otherwise we will need to find another service for builds and deployment.
We’ve had a couple of support tickets come in about this, and our engineers are currently investigating. It took our support team some time to rule out restore-cache key ordering and substring matches as the cause.
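For anyone unfamiliar with why key ordering and substring matching matter here: `restore_cache` tries its keys in order, and a key matches any saved cache whose key starts with that string, so a broad prefix listed first can shadow a more specific key. A minimal sketch (the key names are hypothetical):

```yaml
- restore_cache:
    keys:
      # Keys are tried in order; the first prefix match wins.
      # Listing the broad fallback first would always match and
      # mask the exact key below it.
      - v1-deps-{{ checksum "package-lock.json" }}
      - v1-deps-
```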
In one case, a user had a config race condition: their workflow restored, in one job, caches that had been saved by an earlier job in the workflow, but there was no job dependency between the two, so sometimes the restoring job ran first. It’s worth checking whether this applies to you.
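If you’re in a similar situation, a `requires` entry in the workflow definition forces the ordering. A minimal sketch, assuming jobs named `build` and `deploy` (hypothetical names):

```yaml
workflows:
  build-and-deploy:
    jobs:
      - build
      - deploy:
          requires:
            # Without this, deploy may start before build has
            # finished saving its cache.
            - build
```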
For other users, there is genuinely unexpected behavior; that’s what our engineers are investigating. We’ll post back here when we have more information.
Do you have any recommendation for a mechanism to move a built Docker image from one job to another? As I said, we have one job that builds the image and a second that deploys it, tied into one workflow. We use the cache so we don’t have to rebuild the image in the deploy job. I don’t consider this the best possible solution; what we would really need is a persistent remote_docker instance across the workflow, but that doesn’t seem to be something that’s possible, or is it?
Our engineers have pushed out a fix for this. Can you all run some more jobs on the problematic projects/workflows and post back here if you’re still seeing old caches being used?
My answer to this is “it depends.” If your images aren’t too large, then caching can be a reasonable option. Jobs that rely on caching to get a dependency from past jobs should be robust against a cache miss, though. For example, in your deploy job, perhaps it can check the git SHA in the image against the checked-out code before pushing. If there’s a mismatch, the deploy job should fail or rebuild the Docker image you want. When it comes to software, correct/accurate and slower tends to be preferable to incorrect/inaccurate and faster.
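As a sketch of that check, assuming the build job stamped the image with a label carrying the commit SHA (the image name and label key here are hypothetical; `CIRCLE_SHA1` is the built-in CircleCI variable for the current commit), the deploy job could run something like:

```yaml
- run:
    name: Verify cached image matches checked-out code
    command: |
      # Read the SHA label the (hypothetical) build job attached to the image.
      image_sha=$(docker inspect \
        --format '{{ index .Config.Labels "org.example.git-sha" }}' \
        myapp:latest)
      if [ "$image_sha" != "$CIRCLE_SHA1" ]; then
        echo "Image SHA ($image_sha) != checkout ($CIRCLE_SHA1); rebuilding."
        docker build --label "org.example.git-sha=$CIRCLE_SHA1" -t myapp:latest .
      fi
```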
If your images are large, then it may be more practical to push and pull from a registry. In practice, I’ve found that the bandwidth for pushing to and pulling from Dockerhub is greater than the bandwidth of saving and restoring a cache. We currently use AWS EC2 instances for the machines that run your jobs, and S3 for caching. I believe Dockerhub runs on EC2 as well, which could explain the higher transfer rates.
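A sketch of the registry approach, assuming a Dockerhub repository `myorg/myapp` and credentials stored in `DOCKERHUB_USER`/`DOCKERHUB_PASS` environment variables (all hypothetical names):

```yaml
jobs:
  build:
    steps:
      - checkout
      - setup_remote_docker
      - run:
          name: Build and push image tagged with the commit SHA
          command: |
            docker build -t myorg/myapp:$CIRCLE_SHA1 .
            echo "$DOCKERHUB_PASS" | docker login -u "$DOCKERHUB_USER" --password-stdin
            docker push myorg/myapp:$CIRCLE_SHA1
  deploy:
    steps:
      - setup_remote_docker
      - run:
          name: Pull the exact image built for this commit
          command: docker pull myorg/myapp:$CIRCLE_SHA1
```

Tagging with the commit SHA rather than `latest` sidesteps the staleness problem entirely: the deploy job either gets the image built for this exact commit or fails loudly.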
It was strange to me to find out EC2 <=> S3 transfer rates differ so much from EC2 <=> EC2 when they’re all in the same region, but I suppose that’s just the current reality.
In any case, I think it’s worth trying out both approaches to see which one works for you, and periodically re-evaluating this decision as your deployment process changes (for instance, when your deployment images grow substantially). Using an external registry like Dockerhub also introduces additional failure modes. Last week, for instance, some major DNS servers had issues and Dockerhub wasn’t resolving properly.
A persistent remote_docker instance across the workflow isn’t currently possible, at least not with guarantees. Reusable docker engines make this possible, but depending on your workflow configuration and how many workflows are running concurrently, your reusable engines might be distributed to other workflows between jobs. For something as critical as deployment, I wouldn’t assume the remote engine has an image from a previous job.
It looks like after the fix was pushed, my restore_cache steps failed with:
Skipping cache - error checking storage: RequestError: send request failed
caused by: Get https://circle-production-customer-artifacts.s3.amazonaws.com/...snip...: x509: failed to load system roots and no roots provided
I simply had to install the ca-certificates package via a run command before the restore_cache step to get it to work.
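For anyone hitting the same error, a sketch of that workaround, assuming a Debian-based image (Alpine images would use `apk add ca-certificates` instead; the cache key is hypothetical):

```yaml
steps:
  - run:
      name: Install CA certificates so the S3 cache endpoint can be verified
      command: apt-get update && apt-get install -y ca-certificates
  - restore_cache:
      keys:
        - v1-deps-
```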