Non-deterministic cache-restore behaviour

Our engineers have pushed out a fix for this. Can you all run some more jobs on the problematic projects/workflows and post back here if you’re still seeing old caches being used?


My answer to this is “it depends.” If your images aren’t too large, then caching can be a reasonable option. Any job that relies on the cache to carry an artifact over from a previous job should be robust against a cache miss, though. For example, your deploy job could check the git SHA baked into the image against the checked-out code before pushing; if there’s a mismatch, it should fail or rebuild the Docker image you want. When it comes to software, correct/accurate but slower tends to be preferable to incorrect/inaccurate but faster.
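
For the SHA check, something along these lines could go in the deploy job. This is only a sketch: the image name, the `git-sha` label, and the assumption that your build job applied that label are all placeholders you’d adapt to your own setup.

```
# Compare the SHA baked into the (possibly cached) image with the checked-out code.
# Assumes the build job ran: docker build --label "git-sha=$(git rev-parse HEAD)" ...
EXPECTED_SHA=$(git rev-parse HEAD)
IMAGE_SHA=$(docker image inspect myapp:latest \
  --format '{{ index .Config.Labels "git-sha" }}' 2>/dev/null || echo "")

if [ "$IMAGE_SHA" != "$EXPECTED_SHA" ]; then
  echo "Stale or missing image (have '${IMAGE_SHA}', want '${EXPECTED_SHA}'); rebuilding."
  docker build --label "git-sha=${EXPECTED_SHA}" -t myapp:latest .
fi
```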

If your images are large, then it may be more practical to push to and pull from a registry. In practice, I’ve found that the bandwidth for pushing to and pulling from Docker Hub is greater than the bandwidth for saving and restoring a cache. We currently use AWS EC2 instances for the machines that run your jobs, and S3 for caching. I believe Docker Hub runs on EC2 as well, which could explain the higher transfer rates.
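
If you go the registry route, the usual pattern is to tag the image with the commit SHA in the build job and pull that exact tag in the deploy job. Again a sketch: `myorg/myapp` and the Docker Hub credential variables are placeholders, while `CIRCLE_SHA1` is the built-in commit SHA.

```
# Build job: push a uniquely tagged image to the registry.
docker build -t myorg/myapp:"${CIRCLE_SHA1}" .
echo "$DOCKERHUB_PASSWORD" | docker login -u "$DOCKERHUB_USERNAME" --password-stdin
docker push myorg/myapp:"${CIRCLE_SHA1}"

# Deploy job: pull that exact tag instead of restoring a cache.
docker pull myorg/myapp:"${CIRCLE_SHA1}"
```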

It surprised me to find that EC2 <=> S3 transfer rates differ so much from EC2 <=> EC2 when everything is in the same region, but I suppose that’s just the current reality.

In any case, I think it’s worth trying out both approaches to see which one works for you, and periodically re-evaluating that decision as your deployment process changes (for instance, when your deployment images grow substantially). Keep in mind that an external registry like Docker Hub introduces additional failure modes. Last week, for instance, some major DNS servers had issues and Docker Hub wasn’t resolving properly.

A remote_docker instance that persists across the whole workflow isn’t currently possible, at least not with guarantees. Reusable Docker engines get you close, but depending on your workflow configuration and how many workflows are running concurrently, your reusable engines might be handed out to other workflows between jobs. For something as critical as deployment, I wouldn’t assume the remote engine still has an image from a previous job.
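
If you do rely on a reusable engine, a defensive check in the deploy job costs very little. Rough sketch, with a placeholder image name:

```
# Don't assume the remote engine still has the image from an earlier job;
# pull it (or rebuild as a last resort) if it's missing.
IMAGE="myorg/myapp:${CIRCLE_SHA1}"
if ! docker image inspect "$IMAGE" > /dev/null 2>&1; then
  docker pull "$IMAGE" || docker build -t "$IMAGE" .
fi
```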