Circle 2.0 caching is too limited to be very useful


#1

CircleCI 2.0 offers more control to users over how caching works. Great! But unfortunately, I’m having a tough time getting it to be useful.

Let’s say I’m running CI for a simple NodeJS server, using npm to manage dependencies. I decide to cache node_modules. I have a single master branch. I don’t even use other branches; I just push all my changes to master.

The behavior I would want is to use node_modules from the latest successful master build. I trust npm to generally figure out what to install, even given an arbitrarily stale node_modules directory.

Unfortunately, I can’t see a way to do this with the current caching operations. I can’t cache on just a fixed value or the branch key; the cache would never update. On the other hand, caching on the checksum of package.json would get me an up-to-date cache on most commits where I didn’t change my dependencies, but I’d get no benefit whatsoever, not even a partial cache, if my dependencies did change.

It’s possible I’m missing something essential. And of course, I know Circle 2 is still in beta, and I like many things about it :smile: But I think the caching operations need some additional attention, so they can handle cases like this.


#2

You should be able to achieve what you want with something like:

- restore_cache:
    keys:
      - projectname-npm-deps-{{ .Branch }}-{{ checksum "package.json" }}
      - projectname-npm-deps-{{ .Branch }}
      - projectname-npm-deps-

See: https://circleci.com/docs/2.0/caching/ for more details.

After the restore_cache step above you run npm install and npm will ‘fill in the gaps’. So let’s say we were only able to restore the last (least specific) cache above - it’s likely that most of your dependencies will be similar and npm install will figure out what else needs installing.
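
For completeness, here is a minimal sketch of how the full sequence might look inside a job's steps, with the matching save_cache step added. The projectname prefix and the node_modules path are assumptions carried over from the example above, so adjust them to your project:

```yaml
steps:
  - checkout
  - restore_cache:
      keys:
        # Tried in order; the first key that matches an existing cache wins
        - projectname-npm-deps-{{ .Branch }}-{{ checksum "package.json" }}
        - projectname-npm-deps-{{ .Branch }}
        - projectname-npm-deps-
  # npm installs whatever the restored node_modules is missing
  - run: npm install
  - save_cache:
      # Save under the most specific key only; a new cache is written
      # whenever the branch or the package.json checksum changes
      key: projectname-npm-deps-{{ .Branch }}-{{ checksum "package.json" }}
      paths:
        - node_modules
```

The less specific restore keys never need a save_cache of their own, since they match the saved key as a prefix.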


#3

But the fact that cache keys are immutable means that the “projectname-npm-deps-” will never be updated after the first run. So a cache miss on branch or checksum will likely fall through a long way.

I.e., in my case, if I were to change build.gradle (which my checksum depends on), it would fall back to the first-ever build, which has quite a different tree of dependencies.

We need a way to update the cache key in some reasonable way, without hacking something like a year/month/day tree.

Also, it seems reasonable that you would want to save a cache under multiple keys, i.e. each master build would update the “master” cache of deps, as well as the one for this exact version of build.gradle.


#4

Yarn has a yarn.lock file that might be useful for you. You can’t store one cache under multiple names, but you can list multiple keys from which to restore.


#5

For our actual project, we’ve gotten reasonable cache performance by keying our cache off yarn.lock, as you suggest. But lots of JS developers don’t use Yarn (or npm shrinkwrap). Plus, it’s unfortunate that we don’t get partial cache hits: As @corydolphin mentions above, falling back to more general cache names doesn’t work with an immutable cache, since the more general cache names will only be written once, and then will become more stale over time.

I think the caching feature would be significantly improved by a documentation page showing how to cache common situations. This would both make it easier for users and, I suspect, inspire additional features.


#6

Still a work in progress :slight_smile:

For now, a nice workaround is to run ls -laR > node_checksum and reference it with {{ checksum "node_checksum" }} in the cache steps. That does mean you can’t get an exact hit on the restore cache, but it can mitigate reinstalling everything.


#7

Is it documented that cache keys are immutable? I had no idea that they were immutable until I read this thread. I went back over the caching docs and still couldn’t find anything. I see now in the build steps that attempting to cache an existing key will print a warning; I’d suggest that this step get marked in yellow to warn the user about this. Having an immutable cache doesn’t seem particularly useful to me. It would be much better if I could update the cache several times. If I key it off my build.boot file, then any change to that file requires a full rebuild. I’ll also need to figure out caching logic for my boot uberjars too.


#8

Oh, that’s a nice trick. I was disappointed that you can only checksum a single file in a cache key, but of course you can put whatever content you want into that file and checksum the result.
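
A rough sketch of that idea, assuming the dependency state lives in package.json and npm-shrinkwrap.json (both file names, the deps_checksum file, and the projectname prefix are just placeholders for illustration):

```yaml
- run:
    name: Combine dependency files into one checksum input
    # Concatenate every file that should invalidate the cache
    command: |
      cat package.json npm-shrinkwrap.json > deps_checksum
- restore_cache:
    keys:
      # Exact match on the combined content
      - projectname-npm-deps-{{ checksum "deps_checksum" }}
      # Otherwise fall back to the most recent cache with this prefix
      - projectname-npm-deps-
```

The run step has to come before restore_cache so the file exists when the key template is evaluated, and the later save_cache step would use the same checksum key.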

It totally makes sense that caching is a work in progress; I’m excited to see where it goes!


#9

I agree that it would be nice if there was a way to specify that an existing cache can be overwritten. Having them be immutable has no benefit to us and means that we just have to tack {{ epoch }} onto all of them to make it work.
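
For anyone wondering what that looks like in practice, here is a minimal sketch of the {{ epoch }} pattern. The projectname prefix and the node_modules path are assumptions for the sake of the example:

```yaml
- restore_cache:
    keys:
      # Prefix match: restores the most recently saved cache
      # whose key starts with this string
      - projectname-npm-deps-{{ .Branch }}-
- run: npm install
- save_cache:
    # {{ epoch }} makes each key unique, so the save never collides
    # with an existing (immutable) key
    key: projectname-npm-deps-{{ .Branch }}-{{ epoch }}
    paths:
      - node_modules
```

The trade-off is that every build writes a new cache, even when nothing changed.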

Also, it would be nice if the immutability was documented instead of people having to stumble on the fact after being confused and having to debug why the cache isn’t working.


#10

It’s worth emphasizing that cache restore looks up cache keys by prefix match, not exact string match. When there are multiple keys, each key is tried in order until one provides at least one match. When one key matches multiple caches, the most recent cache is applied.

Tom suggested using this:

- restore_cache:
    keys:
      - projectname-npm-deps-{{ .Branch }}-{{ checksum "package.json" }}
      - projectname-npm-deps-{{ .Branch }}
      - projectname-npm-deps-

That works because projectname-npm-deps-{{ .Branch }} will match projectname-npm-deps-feature1-123, projectname-npm-deps-feature1-456, and projectname-npm-deps-feature1-789.

You can achieve partial caching by providing multiple restore_cache keys, starting with more specific ones, and progressing to less specific ones as the others miss. To annotate the example above, here’s what each key provides:

- restore_cache:
    keys:
      # Find a cache corresponding to this specific package.json checksum
      # when this file is changed, this key will fail
      - projectname-npm-deps-{{ .Branch }}-{{ checksum "package.json" }}
      # Find a cache corresponding to any build in this branch, regardless of package.json
      # checksum.  The most recent one will be used.
      - projectname-npm-deps-{{ .Branch }}
      # Find the most recent cache used from any branch
      - projectname-npm-deps-

#11

It’s worth emphasizing that the current documentation doesn’t describe this and actually states "be careful about overusing {{ epoch }}". :wink:


#12

https://circleci.com/docs/2.0/configuration-reference/#restore_cache

A key is searched against existing keys as a prefix.
NOTE: When there are multiple matches, the most recent match will be used, even if there is a more precise match.

https://circleci.com/docs/2.0/caching/#use-epoch-wisely

The second sentence of what you are quoting is more useful:

When defining a unique identifier for the cache, be careful about overusing {{ epoch }}. If you limit yourself to {{ .Branch }} or {{ checksum "filename" }}, you’ll increase the odds of a job hitting the cache.

#13

My point was that you have to read multiple pieces of documentation to understand how to use the cache. Ideally it would all be in the one article about caching and it wouldn’t contain misleading statements about epoch.

Really, there are multiple use cases for the cache, and in some using epoch is good and in others it’s bad. I’d recommend that the documentation on caching explicitly state that fact instead of leaving it to the user to piece together.


#14

Thanks for the feedback. We just updated the 2.0 caching docs: https://circleci.com/docs/2.0/caching/

There’s still more to add, but it’s more useful than the previous version.


#15

+1 on this, the caching system is really great, but it’s hard to do without checksumming on multiple files.
A workaround is probably just to make the key longer with more checksum statements though.
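
Something along these lines might work, assuming the two files that matter are package.json and yarn.lock (the file names and the projectname prefix are only examples):

```yaml
- restore_cache:
    keys:
      # Exact match only when both files are unchanged
      - projectname-deps-{{ checksum "package.json" }}-{{ checksum "yarn.lock" }}
      # Partial match: same package.json, any yarn.lock
      - projectname-deps-{{ checksum "package.json" }}
      # Last resort: the most recent cache with this prefix
      - projectname-deps-
```

Since keys match by prefix, putting the checksum that changes less often first keeps the intermediate fallback useful.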


#18

Thanks for the tip but ls -laR did not work for me as between builds the timestamps of the files are different. I was however able to apply the same trick using git to fetch the most recent sha for my app/assets directory like so:

git log --pretty=format:'%H' -n 1 -- app/assets > assets_checksum
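
In config form this might look roughly like the sketch below. The projectname-assets prefix and the public/assets path are hypothetical, so substitute whatever your build actually produces:

```yaml
- run:
    name: Record the last commit that touched app/assets
    command: |
      git log --pretty=format:'%H' -n 1 -- app/assets > assets_checksum
- restore_cache:
    keys:
      # Hit only if app/assets is unchanged since the cache was saved
      - projectname-assets-{{ checksum "assets_checksum" }}
# ... build the assets here if the cache missed ...
- save_cache:
    key: projectname-assets-{{ checksum "assets_checksum" }}
    paths:
      # Hypothetical output directory
      - public/assets
```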

#19

Just to add to this, one thing I’ve been trying successfully is to fall back to my default branch’s cache before the “find any cache” key, since most of the time our feature branches are based off of it.

- restore_cache:
    keys:
      - projectname-npm-deps-{{ .Branch }}-{{ checksum "package.json" }}
      - projectname-npm-deps-{{ .Branch }}
      - projectname-npm-deps-master
      - projectname-npm-deps-

This way, if we have multiple feature branches running (so the most recent cache is likely from whichever one was built last), the cache on the default branch is more likely to be a closer match.


#20