Caching for test runs via Squid


#1

I’ve been looking for a way to speed up the docker build step on CircleCI. Specifically, I thought it would be a good idea to cache the composer and apt-get packages between test runs, since a lot of non-testing time seems to be spent installing project dependencies. I’ve thought through a lot of potential solutions and come up with one that is pretty decent, at least in concept, and should be generically useful to other projects, so I thought I would share it with you.

The basic concept is to use a Squid proxy cache in a docker container to capture requests out to the internet, and to cache the Squid cache directory between test runs. The following describes how I got it working. Please note: ultimately I got little or no benefit from this because I don’t know how to configure Squid! But I hope it helps someone else and perhaps provokes some interesting discussion. At some point I’ll come back to this project and spend some time learning how Squid works.

My circle.yml file looks like this:

version: 2
executorType: machine
containerInfo:
  - image: ubuntu:14.04
stages:
  build:
    workDir: ~/projectname
    steps:
      - type: checkout

      - type: shell
        name: Setup paths
        command: |
          mkdir -p /tmp/artifacts
          mkdir -p /tmp/squid

      - type: cache-restore
        key: projectname-{{ .Branch }}

      - type: shell
        name: Run Squid
        command: |
          docker run -d --name=squid --publish 3128:3128 -v /tmp/squid:/var/spool/squid3 -v /tmp/artifacts:/var/log/squid3 sameersbn/squid:3.3.8-23
          
      - type: shell
        name: Build Image
        command: |
          # get the IP address of the squid container and pass it
          # to the build as the proxy address
          IP=$(docker inspect --format "{{ .NetworkSettings.IPAddress }}" squid)
          http_proxy=http://${IP}:3128 https_proxy=http://${IP}:3128 ftp_proxy=http://${IP}:3128 bin/build.sh

      - type: shell
        name: Test Image
        command: |
          docker run --link squid:squid -v /tmp/artifacts:/var/log projectname /circle-entrypoint.sh

      - type: shell
        name: Shut down Squid
        command: |
          # SIGTERM tells Squid to shut down cleanly and write its cache
          # index to disk; give it time to finish before it gets killed
          docker stop -t 60 squid
          sudo chown -R $(whoami) /tmp/squid
          
      - type: deploy
        name: Push image
        command: |
          bin/push.sh

      - type: cache-save
        key: projectname-{{ .Branch }}
        paths:
          - /tmp/squid

      - type: artifacts-store
        path: /tmp/artifacts
        destination: build

      - type: test-results-store
        path: /tmp/artifacts

What this does is create some host directories for the artifacts and the squid cache, and then run the squid container. Unfortunately, you can’t link to a running container as part of a docker build, so in the build step we grab the actual IP address of the squid container and assign it as the http_proxy address within the scope of running the build step.
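As a sanity check (not part of my config above, just a sketch), you can confirm the proxy is reachable from the host before kicking off the build:

IP=$(docker inspect --format "{{ .NetworkSettings.IPAddress }}" squid)
# a 200 here means Squid is up and forwarding requests
curl -s -o /dev/null -w "%{http_code}\n" -x http://${IP}:3128 http://archive.ubuntu.com/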

The build.sh script looks like this:

#!/bin/bash
# log into AWS ECR
aws configure set default.region us-east-1
eval $(aws ecr get-login)

# forward the proxy settings from the environment into the build as
# ARGs; exported variables alone are not passed through to docker build
docker build --no-cache --build-arg http_proxy --build-arg https_proxy --build-arg ftp_proxy -t projectname .

Inside the Dockerfile, we allow for “temporary” environment variables via ARG:

ARG http_proxy
ARG https_proxy
ARG ftp_proxy

Any RUN steps in the Dockerfile that access files ‘on the internet’, such as apt-get update, will go through the proxy and get cached. The advantage of using ARG over ENV is that the value is not retained in the resulting image, so the proxy only affects the building of the container.
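If you want to double-check that, you can inspect the built image; the output should not mention http_proxy (a quick verification I’d suggest, not something from my config):

docker inspect --format '{{ .Config.Env }}' projectname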

After the image is built, we can run tests inside it. In my case, the container is ‘production ready’ and the test script installs some additional stuff to support testing, like xvfb, java, selenium and firefox (for integration tests), so I prepend http_proxy=http://squid:3128 https_proxy=http://squid:3128 ftp_proxy=http://squid:3128 to the install script that does that for me. It looks like this:

http_proxy=http://squid:3128 https_proxy=http://squid:3128 ftp_proxy=http://squid:3128 tests/test-setup.sh

So now those dependencies can be cached too. Then the tests run and, if they pass, we shut down Squid cleanly (important, so the cache is persisted to disk), change ownership (so Circle will actually upload and restore the cache correctly) and push the image. Finally, the squid directory gets cached and the artifact directory gets uploaded.
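If you want to confirm the cache actually persisted before cache-save runs, something like this should work (assuming Squid’s standard ufs cache layout, where swap.state is the cache index):

# swap.state missing or empty means nothing useful will be restored
ls -l /tmp/squid/swap.state
du -sh /tmp/squid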

Ok, so far, so good. The only problem with this so far is that it doesn’t actually shorten build time noticeably. By checking the Squid access logs on subsequent runs, I can confirm that requests are hitting Squid, but I’m getting a lot of TCP_MISS entries, so now it’s a matter of learning how to configure Squid. And unfortunately, that’s as far as I got with this project; I’m now out of time and have to move on to some other priorities.
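In case anyone wants to pick this up: my guess is the fix is in squid.conf. Below is an untested sketch; the cache size, the refresh_pattern values and the config path inside the sameersbn image are all assumptions, and note that HTTPS traffic (e.g. composer pulling from GitHub) can’t be cached by Squid at all without SSL bumping, which probably explains a lot of the misses.

# count hits vs. misses from the last run
awk '{print $4}' /tmp/artifacts/access.log | sort | uniq -c

# a minimal squid.conf tuned for package caching (untested)
cat > /tmp/squid.conf <<'EOF'
http_port 3128
# this is an ephemeral CI proxy, so allow everything
http_access allow all
cache_dir ufs /var/spool/squid3 10000 16 256
# packages can be large; the default object size limit is small
maximum_object_size 512 MB
# .deb files are immutable once published, so cache them hard (30 days)
refresh_pattern -i \.deb$ 43200 100% 43200 override-expire
refresh_pattern -i \.(tar\.gz|tgz|zip)$ 43200 100% 43200
EOF

# run Squid with the custom config mounted over the image default
docker run -d --name=squid --publish 3128:3128 \
  -v /tmp/squid.conf:/etc/squid3/squid.conf \
  -v /tmp/squid:/var/spool/squid3 \
  -v /tmp/artifacts:/var/log/squid3 \
  sameersbn/squid:3.3.8-23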


#2

I don’t know how to cache the packages through Squid, but we did publish a guide on caching apt packages on 1.0. Does this suit your needs?


#3

Not super helpful in the context of docker build, unfortunately. The issue is more that you can’t mount a volume during the build phase, so you need some other mechanism to get cacheable stuff in and out. Squid (if I could figure out how to configure it correctly) is transparent to docker build, which makes it a (potentially) great solution.


#4

You mentioned in your first post that you want to use this for composer and apt-get packages. Have you considered making a Docker image for your project with these? Since you’re using the machine executor, it could even be a private image.


#5

Maybe I’m missing something … how does a docker image with these help during the build process?


#6

You can use it as a base image and extend it. It would essentially be a cache of everything you need for each of the containers; you then run the dynamic bits on top of it. You just need FROM your/project:1.0 (with the right image name/tag) in your Dockerfile.
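Something like this, with hypothetical image names, just to sketch the idea:

# build and push the shared base image once
cat > Dockerfile.base <<'EOF'
FROM ubuntu:14.04
# the slow-to-install dependencies live in the base layer
RUN apt-get update && apt-get install -y curl git php5-cli
EOF
docker build -f Dockerfile.base -t your/project:1.0 .
docker push your/project:1.0

# each project's own Dockerfile then starts from the pre-baked base:
#   FROM your/project:1.0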

Alternatively, maybe layer caching on the docker executorType would solve your issues. If you do not have access, you will need to contact your CSM.


#7

This would probably work, but the dependencies and the installed versions of them would be in the image, which is not really ideal for image size. We have several front end projects built on a common base ‘web’ image, so the dependencies for all of them would need to be baked into the base image, and all projects would need to be rebuilt if the dependencies changed for any one of them; again, not ideal.

Really, I think this solution would work quite elegantly if I knew how to configure Squid to cache the things I want it to cache. The other solution I’ve seen discussed is to have a web service that provides dependencies via http; in this case I would set it up as a separate container, and the build script would pull in dependencies when available.


#8

This would probably work but the dependencies and the installed versions of them would be in the image, which is not really ideal for image size.

True. Do you care more about Docker image size or your build time? I’ve found that larger Docker images add minimal download time, especially when compared to package download + install times.

My personal experience with a dozen or so 2.0 conversions is that baking such dependencies into a Docker image adds maybe 5 seconds of extra download time for every 30 seconds of apt-get install time saved. Your experience may vary, but I suggest setting this up to compare the times for yourself.

We have several front end projects built on a common base ‘web’ image so the dependencies for all of them would need to be baked into the base image and all projects would need to be rebuilt if any of the dependencies changed for any of them, again not ideal.

I’m not familiar with the details of your setup, but is it possible to build off of a base “web” image, and have each project use its own image like frontend-1, frontend-2, and so on?


#9

Thanks for the feedback, Eric. You make a good point about the tradeoff between image size and build time, and I will take that under consideration. I suppose the larger image will also impact image upload after the build, deploy time, and repository storage costs (we use ECR), so we’ll have to think about the overall impact.

