Caching pg_dump files by migrations

I have three repos that I run migrations on before doing downstream integration tests. We use Postgres as our database, and the migrated repos are Django and Flask projects. In one job, we run the migrations in each repo, call pg_dump to dump the data to files on an attached workspace, then call persist_to_workspace to save the files. This process takes around 10 minutes. In parallel downstream jobs, we call attach_workspace to access those files and restore the database before running the integration tests.
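
For context, the shape of that workflow in config terms is roughly this (job names, scripts, connection strings, and the database name are placeholders, not our exact config):

migrate_and_dump:
    steps:
      - checkout
      - run: ./run_all_migrations.sh  # run the migrations in each of the three repos (~10 minutes)
      - run:
          name: Dump database to the workspace
          command: |
            mkdir -p /tmp/workspace
            pg_dump 'postgresql://postgres@localhost/mydb' -Fc > /tmp/workspace/mydb.dump
      - persist_to_workspace:
          root: /tmp/workspace
          paths:
            - mydb.dump

integration_tests:
    steps:
      - attach_workspace:
          at: /tmp/workspace
      - run:
          name: Restore database and run the tests
          command: |
            pg_restore -d 'postgresql://postgres@localhost/mydb' --clean --if-exists /tmp/workspace/mydb.dump
            make integration-tests  # placeholder for our actual test command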

I’d like to come up with a strategy to cache those pg_dump files based on the latest migration in each repo: if no new migrations have been introduced, use the cached files; otherwise, run all of the migrations and cache the new pg_dump files.

Has anyone done something similar?

You could use the caching feature, using the hash of the migration file as a cache key. If the migration file changes, the old cache key no longer matches and you have to re-run the migration.
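
For illustration, assuming a single hypothetical migrations file, the key could use the built-in checksum template directly:

      - save_cache:
          key: pg-dump-v1-{{ checksum "app/migrations.py" }}  # "app/migrations.py" is a placeholder path
          paths:
            - /tmp/cache/db.schema.dump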

Unfortunately there isn’t a single migration file. Each ‘app’ in the repo has a ‘migrations’ directory where migrations for that app live.

That’s probably OK - just write something that creates an overall hash of all of them and use that as your cache key.
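
Something like this (untested sketch; adjust the find path to your repo layout) would roll every migrations directory into a single checksum file that {{ checksum }} can key off:

      - run:
          name: Compute combined migrations checksum
          command: |
            # Hash every file under any 'migrations' directory; sort first so the
            # combined hash is deterministic across runs.
            find . -path '*/migrations/*' -type f -print0 \
              | sort -z \
              | xargs -0 md5sum \
              | md5sum | cut -d' ' -f1 > ALL_MIGRATIONS_MD5
      - restore_cache:
          keys:
            - pg-dumps-v1-{{ checksum "ALL_MIGRATIONS_MD5" }}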

This one set my brain turning. I haven’t tried this, but I think you can use MD5 to checksum the directory, save that as an exported ENV and then use {{ .Environment.variableName }} in the cache key.

Please let me know if this works

https://circleci.com/docs/2.0/caching/#using-keys-and-templates

ETA: When I said exported ENV, I meant exporting it to $BASH_ENV, as shown here: https://circleci.com/docs/2.0/env-vars/#example-configuration-of-environment-variables


This didn’t end up being quite what I wanted, so here’s the working code I have; hope it helps:

      - run: |
          echo $(find ./migrations -type f -exec md5sum {} \; | md5sum | cut -d' ' -f1)  >> CACHE_KEY1
      - run: cat CACHE_KEY1
      - save_cache:
          key: test-cache-v1-{{ checksum "CACHE_KEY1" }}
          paths:
            - migrations

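The restore side mirrors it: recompute the key file, then restore against the same key (sketch only, not lifted verbatim from our config):

      - run: |
          echo $(find ./migrations -type f -exec md5sum {} \; | md5sum | cut -d' ' -f1) >> CACHE_KEY1
      - restore_cache:
          keys:
            - test-cache-v1-{{ checksum "CACHE_KEY1" }}
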
I’ve been given some time to revisit our CI workflows and I’m about to do something very similar. I’ll post my solution once I’ve got it running.

Here’s what I did in my integration test setup job

integration_test_setup:
    <<: *defaults

    resource_class: small

    steps:
      - checkout
      - run: *checkout_services  # this clones all dependent repos locally
      - run:
          name: Create cache directory
          command: |
            mkdir /tmp/cache
      - run:
          name: Create migration checksum files
          command: |
            # The scripts below contain one or more 'ls' and 'find' commands that return a list of all
            # migration files for the repo. To make 'find' commands deterministic, pipe the
            # output to 'sort'.
            shasum -a 256 `../repo1/scripts/ci/integration-test-list-migration-files` > flask_migrations.txt
            shasum -a 256 `../repo2/scripts/ci/integration-test-list-migration-files` > django_migrations.txt
      - run:
          name: Create fixtures checksum files
          command: |
            # The scripts below contain one or more 'ls' and 'find' commands that return a list of all
            # fixture files for the repo. To make 'find' commands deterministic, pipe the
            # output to 'sort'.
            shasum -a 256 `../repo1/scripts/ci/integration-test-list-fixture-files` > flask_fixtures.txt
            shasum -a 256 `../repo2/scripts/ci/integration-test-list-fixture-files` > django_fixtures.txt
      - restore_cache:
          keys:
            - test-v1-migrations-flask_{{ checksum "flask_migrations.txt" }}-django_{{ checksum "django_migrations.txt" }}
      - restore_cache:
          keys:
            - test-v1-fixtures-flask_{{ checksum "flask_fixtures.txt" }}-django_{{ checksum "django_fixtures.txt" }}
      - setup_remote_docker:
          docker_layer_caching: true
      - build_or_pull_images:  # pull docker image from image repo if it exists, otherwise build & push to image repo
          repos: "repo1 repo2"
      - run:
          name: Stop and remove containers
          command: |
            docker-compose down --remove-orphans
      - run: *initialize_db  # start the database container and wait until it accepts connections
      - run:
          name: Load migrations
          command: |
            if [ -f /tmp/cache/db1.schema.dump ]; then
              echo "Restoring schema from cache"
              # copy the dumped file into the database container
              docker cp /tmp/cache/db1.schema.dump db_container:/tmp/db1.dump
              # restore the schema; connect via a different db (db2) so that db1
              # can be dropped and recreated by --create/--clean
              docker exec db_container pg_restore -d 'postgresql://postgres:somepass@db_container/db2' -Fc --create --clean --if-exists /tmp/db1.dump
            else
              echo "Cache not found. Running migrations"
              services_to_migrate='repo1 repo2'
              for service in $services_to_migrate; do
                echo "Running migrations for $service..."
                docker-compose run -e CIRCLECI=$CIRCLECI --rm $service make migrations
              done
              # dump schema to cache directory
              docker exec db_container pg_dump 'postgresql://postgres:somepass@db_container/db1' -Fc > /tmp/cache/db1.schema.dump
            fi
      - save_cache:
          key: test-v1-migrations-flask_{{ checksum "flask_migrations.txt" }}-django_{{ checksum "django_migrations.txt" }}
          paths:
            - /tmp/cache/db1.schema.dump

      - run:
          name: Load fixtures
          command: |
            if [ -f /tmp/cache/db1.fixtures.dump ]; then
              echo "Restoring fixtures from cache"
              # copy the dumped file into the database container
              docker cp /tmp/cache/db1.fixtures.dump db_container:/tmp/db1.dump
              # restore the fixtures; connect via db2 so db1 can be recreated
              docker exec db_container pg_restore -d 'postgresql://postgres:somepass@db_container/db2' -Fc --create --clean --if-exists /tmp/db1.dump
            else
              echo "Cache not found. Loading fixtures"
              services='repo1 repo2'
              for service in $services; do
                echo "Loading fixtures for $service..."
                docker-compose run -e CIRCLECI=$CIRCLECI --rm $service make fixtures
              done
              # dump fixtures to cache directory
              docker exec db_container pg_dump 'postgresql://postgres:somepass@db_container/db1' -Fc > /tmp/cache/db1.fixtures.dump
            fi
      - save_cache:
          key: test-v1-fixtures-flask_{{ checksum "flask_fixtures.txt" }}-django_{{ checksum "django_fixtures.txt" }}
          paths:
            - /tmp/cache/db1.fixtures.dump
      - run:
          name: Dump database state
          command: |
            # make sure the workspace directory exists before dumping into it
            mkdir -p /tmp/workspace
            docker exec db_container pg_dump 'postgresql://postgres:somepass@db_container/db1' -Fc > /tmp/workspace/db1.dump
            docker-compose down
      - persist_to_workspace:
          root: /tmp/workspace
          paths:
            - db1.dump

In a subsequent job, I restore the database from the dump file saved in the workspace before running the integration tests.
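
Roughly, that downstream job looks like this (simplified sketch; container and database names match the setup job above, and `make integration-tests` stands in for our actual test target):

integration_tests:
    <<: *defaults
    steps:
      - checkout
      - run: *checkout_services
      - attach_workspace:
          at: /tmp/workspace
      - setup_remote_docker:
          docker_layer_caching: true
      - run: *initialize_db  # start the database container and wait until it accepts connections
      - run:
          name: Restore database from workspace dump
          command: |
            docker cp /tmp/workspace/db1.dump db_container:/tmp/db1.dump
            docker exec db_container pg_restore -d 'postgresql://postgres:somepass@db_container/db2' -Fc --create --clean --if-exists /tmp/db1.dump
      - run:
          name: Run integration tests
          command: |
            docker-compose run --rm repo1 make integration-tests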