Cache immutability, and using the cache from the last build

I’m looking into using CircleCI as our company-wide CI solution but, due to the immutable nature of the cache, I have not yet run a build. I’m in the process of creating my config.yaml, and something is really bugging me.

I think I’ve actually figured this out on my own, but have left this wall of text for context, as I feel it’s important, and that the docs should be updated to be clear. You can skip to the end for my conclusion.

This is somewhat of a necro of the thread “Circle 2.0 caching is too limited to be very useful”.

Everything talks about the cache being immutable, which is ok. The comment linked to above explains how the cache restore looks up cache keys w/ partial string matches, instead of exact string matches.

That too seems alright, and makes sense. However, this example is then given:

- restore_cache:
    keys:
      # Find a cache corresponding to this specific package.json checksum;
      # when this file is changed, this key will fail
      - projectname-npm-deps-{{ .Branch }}-{{ checksum "package.json" }}
      # Find a cache corresponding to any build in this branch, regardless of package.json
      # checksum.  The most recent one will be used.
      - projectname-npm-deps-{{ .Branch }}
      # Find the most recent cache used from any branch
      - projectname-npm-deps-

The problem I have w/ this is that, in my mind, the use of “the most recent” is contradictory, and implies mutability.

My understanding from the post & the docs is that since the cache is immutable, once it’s been written once, that’s it. It’ll never change. You can make as many caches as you like, but once a key has a cache, that’s it.

In my mind, “the most recent cache” implies that the cache changes - but we’ve been told it doesn’t. So “the most recent cache” is the first write to that cache, which is pointless.

Now, we have some control over the cache, since we can do things like - projectname-npm-deps-{{ .Branch }}-{{ checksum "package.json" }}, but to me that’s very limiting.

In the linked to comment, Eric gives the above example, and also says:

- restore_cache:
    keys:
      - projectname-npm-deps-{{ .Branch }}-{{ checksum "package.json" }}
      - projectname-npm-deps-{{ .Branch }}
      - projectname-npm-deps-

It’s because projectname-npm-deps-{{ .Branch }} will match projectname-npm-deps-feature1-123, projectname-npm-deps-feature1-456, and projectname-npm-deps-feature1-789.

Which I’m sure is true, but to me doesn’t solve the original problem.

As an example, say I’m doing package updates. I create a new branch update-packages.

So I update a package or two, making a commit per package updated - now every time I update a package, package.json & package-lock.json are changed - so already any caches with those as keys won’t be matched, and so we can just totally ignore projectname-npm-deps-{{ .Branch }}-{{ checksum "package.json" }}.

Now, I push to my branch. Thus, a new cache is created, w/ the key projectname-npm-deps-update-packages, and that is immutable.

So, now I update some more packages - as before, I do a commit per package, which means projectname-npm-deps-{{ .Branch }}-{{ checksum "package.json" }} style caches will never be hit, and thus when CI runs, projectname-npm-deps-update-packages is used.

This means that every commit I make on a branch will be using the cache from the first commit I made to that branch, which is inefficient.

If this is all correct, what I need is the ability to cache based off the last commit. There was a comment on that thread further down that spoke about using git log --pretty=format:'%H' -n 1 -- app/assets > assets_checksum, and that’s what I’ll be trying, but I think that it’s not too much to ask for a variable that holds that already - literally just {{ .PreviousBranchCommit }}.
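For reference, the approach from that comment might look roughly like the sketch below. It is untested: the git log command and the assets_checksum file name come from the comment, while the projectname-assets- key prefix and the public/assets path are made-up placeholders.

```yaml
steps:
  - checkout
  - run:
      name: Record the last commit that touched app/assets
      # Writes that commit's hash to a file, so the key changes only
      # when app/assets changes.
      command: |
        git log --pretty=format:'%H' -n 1 -- app/assets > assets_checksum
  - restore_cache:
      keys:
        # Exact match on the recorded commit hash
        - projectname-assets-{{ checksum "assets_checksum" }}
        # Fallback: most recent cache whose key starts with this prefix
        - projectname-assets-
  # ... build steps that produce the assets would go here ...
  - save_cache:
      key: projectname-assets-{{ checksum "assets_checksum" }}
      paths:
        - public/assets  # hypothetical output directory
```

The checksum template hashes the contents of assets_checksum, so the key only changes when the recorded commit hash does.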

To me, this shouldn’t cause any major problems, b/c the save_cache happens after a task, and you can’t say “do only if no cache”. Hence the cache is only written if the previous step passes, preventing a bad cache from being written.

For example, if I cache npm ci, then so long as npm ci is successful, the result can be cached, even if my lint, tsc, jest steps afterwards fail.
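As a sketch of that ordering: the save_cache step sits directly after npm ci and before the checks, so it runs once the install has passed, whatever the later steps do. The npm script names (lint, test) and the choice of ~/.npm as the cached path are assumptions for illustration.

```yaml
steps:
  - checkout
  - restore_cache:
      keys:
        - projectname-npm-deps-{{ .Branch }}-{{ checksum "package.json" }}
        - projectname-npm-deps-{{ .Branch }}-
        - projectname-npm-deps-
  - run: npm ci
  # Reached only if npm ci succeeded; a failing lint or test step below
  # cannot prevent or undo this write.
  - save_cache:
      key: projectname-npm-deps-{{ .Branch }}-{{ checksum "package.json" }}
      paths:
        - ~/.npm
  - run: npm run lint  # hypothetical script name
  - run: npm test      # hypothetical script name
```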

Overall, this has me treading very lightly, as I feel I can’t play around w/ the cache to test this stuff due to its immutable nature.

One possible way caching could make sense in this situation is if the partial matching happened against the whole key, which I hope is the case, but not what I got from Eric’s phrasing:

It’s because projectname-npm-deps-{{ .Branch }} will match projectname-npm-deps-feature1-123, projectname-npm-deps-feature1-456, and projectname-npm-deps-feature1-789.

If he actually meant projectname-npm-deps-{{ .Branch }}-{{ checksum "package.json" }}, that would make a lot more sense to me, as would “the most recent”…

After all of this, I think I’ve worked out how caching works

The cache is immutable, but a single “cache” is actually the whole “restore_cache” block, NOT each key in “keys”.

If that’s true, then the whole thing would make a lot more sense, as that’s how you can have “the most recent” cache:

- restore_cache:
    keys:
      - projectname-npm-deps-{{ .Branch }}-{{ checksum "package.json" }}
      - projectname-npm-deps-{{ .Branch }}
      - projectname-npm-deps-

That represents a single immutable cache, i.e., if any of the “keys” are matched, the same cache is returned. Originally, my thinking was that each “key” in “keys” was naming one cache, and so the matched key would return “its” cache.

This also means that when you’re saving the cache, you should provide the most specific key possible.
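A rough walkthrough of the update-packages scenario under that reading is sketched below. The checksum values aaa and bbb are invented for illustration, and the comments describe expected behavior as I understand it from the documentation, not an observed run.

```yaml
- restore_cache:
    keys:
      - projectname-npm-deps-{{ .Branch }}-{{ checksum "package.json" }}
      # Trailing dash so that a branch named update-packages-2 is not matched
      - projectname-npm-deps-{{ .Branch }}-
      - projectname-npm-deps-
- run: npm ci
- save_cache:
    # The most specific key: each new package.json gets its own cache
    key: projectname-npm-deps-{{ .Branch }}-{{ checksum "package.json" }}
    paths:
      - ~/.npm

# Build 1 (checksum aaa): first two keys miss; the third restores the newest
#   cache from any branch, if one exists.
#   Saves projectname-npm-deps-update-packages-aaa.
# Build 2 (checksum bbb): first key misses; the second restores ...-aaa,
#   the newest cache on this branch.
#   Saves projectname-npm-deps-update-packages-bbb.
# Build 3 (checksum still bbb): first key hits exactly; the save is skipped
#   because that key already exists.
```

Each saved cache stays immutable, but “the most recent” keeps moving forward because every new checksum produces a new key.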

This might seem like just a massive ramble, and maybe it’s b/c I’m just “special”, but it really confused me and took a lot of reading up before this clicked (if I’ve got it right - otherwise, my original question stands).

The more I think about it, the more and more this all makes sense w/ the above logic, but it was a long road to get here - and it’s something that I feel could be explained very easily by just giving a basic example of the step-by-step process CircleCI uses internally, or a nice diagram.

I would still love it if someone from CircleCI could confirm if my thinking is correct, cause it’s doing my head in XD

Hi @G-Rath,

I admit that I only jumped to your conclusion and only read that :slight_smile:

Your conclusion is on the mark. The key takeaways (ha) are:

  • Each restore_cache restores from one and only one key, with prefix matching given the keys array you’ve provided.
  • You want to save caches with enough specificity. More is usually better, but timestamps turned out to be of questionable use as a cache key, in my experience.
  • Different stages of restore_cache with decreasing specificity help you to get cache hits when you want, that are close enough to what you need.

I’d add that you may see a hard-coded version in some of our examples, like - projectname-npm-deps-v1-{{ .Branch }}-{{ checksum "package.json" }}. This is because dependency managers can get weird in a bunch of situations. Maybe you installed your dependencies with node 10, and changed to node 12 but the cache is still from node 10, so you get some weird errors. In times like that, it’s easier to try bumping the v1 to v2 to see if that fixes things.
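A minimal sketch of such a bump, assuming the cached path is ~/.npm: the version segment appears in every key, fallbacks included, so caches saved under v1 stop matching once it is changed to v2.

```yaml
- restore_cache:
    keys:
      # After the bump, none of these prefixes match keys saved under v1
      - projectname-npm-deps-v2-{{ .Branch }}-{{ checksum "package.json" }}
      - projectname-npm-deps-v2-{{ .Branch }}-
      - projectname-npm-deps-v2-
- run: npm ci
- save_cache:
    key: projectname-npm-deps-v2-{{ .Branch }}-{{ checksum "package.json" }}
    paths:
      - ~/.npm
```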

I admit that I only jumped to your conclusion and only read that :slight_smile:

Totally understandable - I only left the rest b/c, for me at least, this was a bit of a journey, and once it clicked it all just suddenly made sense. But yeah, I wanted to show how much work it seemed to take to get it to click, which I feel something like a simple diagram or step-by-step example could make up for.

A picture’s worth a 1000 words and all that.

Your conclusion is on the mark.

Thanks for confirming that :slight_smile:

Totally agree. Thank you for your write up; it has been far more clarifying than the actual docs.