`persist_to_workspace` vs `save_cache` performance

I’m trying to optimize our build. Up until recently, we had

  1. A dependencies task that checked out from git, then ran bundle install, yarn install, and yarn compile, with caching around each of the four steps
  2. A dependencies job that just ran that task
  3. Three jobs that ran in parallel after the first job. Each ran the dependencies task before running tests. Because of the workflow graph (sketched below), they would (almost) always get cache hits.
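
For context, the old wiring was roughly this shape (a minimal sketch, not our actual config: the image, job names, and test command are placeholders, and the restore_cache/save_cache pairs inside install_dependencies are elided):

version: 2.1

commands:
  install_dependencies:
    # checkout + bundle install + yarn install + yarn compile, each of the
    # four steps wrapped in a restore_cache/save_cache pair keyed on the
    # relevant lockfile (the cache steps are elided here for brevity)
    steps:
      - checkout
      - run: bundle install && yarn install && yarn compile

jobs:
  dependencies:
    docker:
      - image: cimg/ruby:3.2-node   # placeholder image
    steps:
      - install_dependencies
  test_a: &test_job
    docker:
      - image: cimg/ruby:3.2-node
    steps:
      - install_dependencies        # (almost) always a cache hit here
      - run: bundle exec rspec      # placeholder test command
  test_b: *test_job
  test_c: *test_job

workflows:
  build:
    jobs:
      - dependencies
      - test_a:
          requires: [dependencies]
      - test_b:
          requires: [dependencies]
      - test_c:
          requires: [dependencies]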

Based on this article, we recently changed that to use the workspace for persistence:

  1. A dependencies job that pulls from git, runs bundle install, yarn install, and yarn compile, then pushes the whole working directory into the workspace. This job still uses caches for all of the dependency steps.
  2. Three jobs that run in parallel after the first job. Each pulls the working directory from the workspace (sketched below).
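
The new wiring looks roughly like this (again a sketch with placeholder names; only the dependencies job and one of the three test jobs are shown):

jobs:
  dependencies:
    docker:
      - image: cimg/ruby:3.2-node   # placeholder image
    steps:
      - checkout
      # restore_cache / save_cache around the install steps, as before
      - run: bundle install && yarn install && yarn compile
      - persist_to_workspace:
          root: .
          paths:
            - .                     # push the whole working directory

  test_a:
    docker:
      - image: cimg/ruby:3.2-node
    steps:
      - attach_workspace:
          at: .                     # pull the working directory back down
      - run: bundle exec rspec      # placeholder test command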

It turns out that’s slower. Unfortunately, the persist_to_workspace step doesn’t offer any transparency into where the time goes, but it definitely takes longer to persist to the workspace than to write to the cache.

Additionally, this post suggests that persist_to_workspace can accept glob patterns, but I don’t see that information in the docs.

So my questions:

  1. Why is persist_to_workspace slower than save_cache?
  2. Should we use caches or workspaces for the git checkout? For Ruby gems? For node modules? For compiled assets?
  3. Does persist_to_workspace accept a glob? Does it accept a negative glob so I can ignore ./node_modules/ and ./.git/ but persist everything else?
  4. Can you update the docs with general guidelines about when to use each tool?

Things I use caches for:

  • git directories (for the speed issues you’ve identified)
  • node_modules folders (the canonical use case: a cache of node_modules keyed on a hash of yarn.lock is the official recommendation; see the sketch after this list)

Things I use workspaces for:

  • handing folders off to orbs that expect them (check out wealthforge/cypress@1.0.0; it’s just like the regular cypress orb but with more features)
  • storing mutable variables within a build
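
For reference, the yarn.lock-keyed node_modules cache mentioned above is just something like this (the key prefix is arbitrary):

  - restore_cache:
      keys:
        - yarn-v1-{{ checksum "yarn.lock" }}
  - run: yarn install --frozen-lockfile
  - save_cache:
      key: yarn-v1-{{ checksum "yarn.lock" }}
      paths:
        - node_modules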

Your dependencies job sounds an awful lot like our code_switch one:

code_switch:
    executor: default
    working_directory: /home/circleci/build-dir
    steps:
      - checkout:
          path: /home/circleci/build-dir
      # populate-deploy-branch and checkout-diffs are commands for handling
      # building from a mono-repo
      - populate-deploy-branch
      - checkout-diffs
      # get rid of that big excess .git folder before caching
      - run: rm -rf .git
      - save_cache:
          key: git-sha-{{ .Revision }}
          paths:
            - /home/circleci/build-dir

And on the receiving side, one of the mono-repo projects will do something like:

  ui_ruby:
    executor: default
    # you can work from any arbitrary file path, and caches will populate appropriately
    working_directory: /home/circleci/build-dir/clients/ui-ruby
    steps:
      # bail-if-current gracefully halts the job if there are no changes
      # under detect-path
      - bail-if-current:
          detect-path: "clients/ui-ruby"
      - restore_cache:
          key: git-sha-{{ .Revision }}
      # another command that will build and push a docker image.
      - docker-build-deploy:
          repository-image: "ui-ruby"
          docker-layer-caching: true
          # docker-layer-caching (DLC) is a life saver on ruby gem installs

hopefully this is helpful

Here is persist_to_workspace accepting two files. It’s not a literal glob, but it gets the same thing done:

  - persist_to_workspace:
      root: /tmp/dir
      paths:
        - FILE_TO_STORE
        - ALT_FILE
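
For what it’s worth, paths does seem to accept Go filepath.Match-style glob patterns (that’s how I read the current workspace docs), but as far as I can tell there is no negative/exclusion pattern, so you still can’t say “everything except node_modules and .git”. A hypothetical glob version would look like:

  - persist_to_workspace:
      root: /tmp/dir
      paths:
        - "*.txt"            # filepath.Match-style pattern (hypothetical)
        - "reports/*"        # everything directly under reports/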

I just stumbled upon this thread after going down the same path @jamesarosen did.
Is there any documentation or other information about persist_to_workspace performance?

I’m also interested in this performance issue.
I would like to use caches and workspaces as described in the article the OP mentioned.
If I were to follow the article, I would persist node_modules to the workspace for the downstream jobs, while still potentially hitting the cache for my yarn install from previous runs.
In reality, saving node_modules to the cache and restoring it in every job is much faster.

I would think that since workspaces have fewer features than caches (a workspace only lasts for one run, and there is only one at a time), they should at least be faster.
So either I’m not using them correctly (though I’m using them just like the users above), they are not functioning as expected, or they are not very useful.
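
Concretely, the faster pattern just means each downstream job opens with a restore_cache instead of an attach_workspace (same key as in the install job):

  # instead of:
  # - attach_workspace:
  #     at: .
  - restore_cache:
      keys:
        - yarn-v1-{{ checksum "yarn.lock" }}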

I’m finding the cache restore step to be painfully slow

Downloading a 500 MiB file should take ~5 seconds on a gigabit connection (500 MiB × 8 ≈ 4.2 Gbit, or roughly 4 s at 1 Gbit/s), so what accounts for the remaining ~20 seconds?

The docs indicate that Maven repositories are used as the storage location, so there is a lot more going on than just the raw copying of the files you want cached. The store is also likely to be shared, so performance may be affected by other customers.

Without knowing what you are trying to do I can’t add much, beyond the link to CircleCI’s docs on usage strategies in case you have not yet come across them.

Beyond the persist and cache options, and depending on what equipment you have access to, another option could be a self-hosted runner, as it can hold complete copies of the file sets needed between builds.

I just ripped out my workspace implementation and replaced it with a zstd compression + upload-to-S3 solution, and the workspace-persistence step is now 2-2.5x faster. So far it doesn’t seem like we need any of the workspace features we’re losing with this switch.
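
The rough shape is something like this (not our exact steps; assumes the image has zstd, a reasonably recent GNU tar, and the aws CLI, and that the bucket and credentials are already configured):

  # in the job that used to call persist_to_workspace
  - run:
      name: Persist working directory to S3
      command: |
        tar --zstd -cf /tmp/workdir.tar.zst .
        aws s3 cp /tmp/workdir.tar.zst \
          "s3://YOUR_BUCKET/workspaces/${CIRCLE_WORKFLOW_ID}/workdir.tar.zst"

  # in the jobs that used to call attach_workspace
  - run:
      name: Restore working directory from S3
      command: |
        aws s3 cp \
          "s3://YOUR_BUCKET/workspaces/${CIRCLE_WORKFLOW_ID}/workdir.tar.zst" \
          /tmp/workdir.tar.zst
        tar --zstd -xf /tmp/workdir.tar.zst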

@maschwenk is your solution open source? Can you share your final workflow?
