Caching pg_dump files by migrations


I have three repos that I run migrations on before doing downstream integration tests. We use Postgres as our database. The migrated repos use Django and Flask. In one job, we run the migrations in each repo, call pg_dump to dump the data to files on an attached ‘workspace’, then call persist_to_workspace to save the files. This process takes around 10 minutes. In parallel downstream jobs, we call attach_workspace to access those files and restore the database before running the integration tests.

I’d like to come up with a strategy to cache those ‘pg_dump’ files based on the last migration in each repo: if no new migrations have been introduced, use the cached files; otherwise, run all of the migrations and cache the new ‘pg_dump’ files.

Has anyone done something similar?
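For context, here is a sketch of the setup described above as CircleCI config. Job names, database name, and the workspace path are all hypothetical stand-ins, not the actual config:

```yaml
jobs:
  migrate_and_dump:          # hypothetical job name
    steps:
      # ... run each repo's migrations here ...
      - run: pg_dump mydb > /tmp/workspace/mydb.sql   # hypothetical db/path
      - persist_to_workspace:
          root: /tmp/workspace
          paths:
            - mydb.sql
  integration_tests:         # one of the parallel downstream jobs
    steps:
      - attach_workspace:
          at: /tmp/workspace
      - run: psql mydb < /tmp/workspace/mydb.sql      # restore before tests
```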

Caching based on the checksum of a directory's contents

You could use the caching feature, using the hash of the migration file as a cache key. If the migration file changes, the old cache becomes invalid and you have to re-run the migrations.


Unfortunately there isn’t a single migration file. Each ‘app’ in the repo has a ‘migrations’ directory where migrations for that app live.


That’s probably OK - just write something that creates an overall hash of all of them and use that as your cache key.
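One way to build that overall hash: checksum every file under the migrations directories, sort for a stable order, then checksum the combined list. A sketch with a hypothetical two-app layout standing in for the real repos:

```shell
# Hypothetical demo layout; substitute your repos' actual app directories.
mkdir -p demo/app_one/migrations demo/app_two/migrations
echo "0001_initial"   > demo/app_one/migrations/0001_initial.py
echo "0002_add_field" > demo/app_two/migrations/0002_add_field.py

# Hash every migration file, sort so the order is deterministic,
# then hash the combined list down to a single cache-key value.
find demo -path '*/migrations/*' -type f | sort | xargs md5sum | md5sum | cut -d' ' -f1 > CACHE_KEY
cat CACHE_KEY
```

The inner `sort` matters: `find` does not guarantee traversal order, so without it the same files could produce different keys on different runs.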


This one set my brain turning. I haven’t tried this, but I think you can use md5sum to checksum the directory, save that as an exported environment variable, and then use {{ .Environment.variableName }} in the cache key.

Please let me know if this works

ETA: When I said “exported”, I meant exporting it to $BASH_ENV, as in here
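A sketch of that $BASH_ENV idea; MIGRATIONS_HASH is a hypothetical variable name. On CircleCI, $BASH_ENV points at a file that gets sourced before every subsequent step in the same job (the fallback path below is only so this runs outside CI):

```shell
# $BASH_ENV is set by CircleCI; fall back to a temp file for local runs.
BASH_ENV="${BASH_ENV:-/tmp/bash.env}"

# Hypothetical variable: combined checksum of all migrations directories.
MIGRATIONS_HASH=$(find . -path '*/migrations/*' -type f -exec md5sum {} \; | sort | md5sum | cut -d' ' -f1)

# Later steps in the same job will see $MIGRATIONS_HASH after sourcing $BASH_ENV.
echo "export MIGRATIONS_HASH=$MIGRATIONS_HASH" >> "$BASH_ENV"
```

One caveat, as far as I can tell: {{ .Environment.variableName }} in cache keys only resolves variables defined at the project or context level, not ones exported at runtime, which may be why the follow-up below switched to checksumming a file instead.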


This didn’t end up being quite what I wanted, so here’s the working code I have, hope it helps:

      - run: |
          echo $(find ./migrations -type f -exec md5sum {} \; | md5sum | cut -d' ' -f1) >> CACHE_KEY1
      - run: cat CACHE_KEY1
      - save_cache:
          key: test-cache-v1-{{ checksum "CACHE_KEY1" }}
          paths:
            - migrations
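For completeness, the downstream jobs need the matching restore step. A sketch, assuming the same CACHE_KEY1 file is regenerated first so the {{ checksum }} value matches the one used at save time:

```yaml
      - run: |
          echo $(find ./migrations -type f -exec md5sum {} \; | md5sum | cut -d' ' -f1) >> CACHE_KEY1
      - restore_cache:
          keys:
            - test-cache-v1-{{ checksum "CACHE_KEY1" }}
```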