Intermittent failures connecting to docker daemon

We are experiencing intermittent but disruptive failures in our workflow, all with the following error message:

docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

The failing jobs all use the machine executor type, and roughly 2 of the ~50 jobs in each workflow fail. This appears to have begun a few days ago and, anecdotally, has been getting slightly worse over time.

Can you offer any assistance as to what we can do to fix this? I can supply links to failing workflows and jobs upon request.

A current workaround we are using is to select the "Rerun workflow from failed" option, but that is obviously not ideal.

Is that a daemon you have installed yourself? If so, that might be out of CircleCI’s control (they provide a VPS and you do as you will with it). Please add the current version of your config here in the question, so readers can take a look.

Thank you for the response, halfer.

These jobs are running on the machine executor, with the circleci/classic:latest image.

Our config is rather large and contains proprietary information, so I am unable to post it in full. We are using version 2.0 of the config, and the following machine definition applies to all of our jobs that are experiencing issues:

machine-defaults: &machine-defaults
  working_directory: ~/<our project>
  machine:
    docker_layer_caching: true
    image: circleci/classic:latest

The docker daemon is what is installed by default on these VMs, and we have been using versions of this workflow for the past 8 months or so. This is the first time we have ever experienced any issues with the executor’s docker daemon.

That may be an ongoing business risk, since it makes it harder for you to obtain help from volunteers on the web. My estimation is that most claims about IP are overblown, and can be trivially overcome by some minor redactions. Be as helpful as you can when asking for help! :slightly_smiling_face:

My guess at the moment is that Docker is not spun up fully at the point when you are trying to use it (and you have a race condition, which is why it works some of the time). Maybe you could write a Bash script to loop, wait a second, perform a trivial Docker operation, and then keep looping until you get a successful reply. If this results in a successful operation, you can then do your real Docker operation.

Retries with a backoff have been in place for a while now. The script tries a total of 5 times, with a 10-second sleep between attempts. If the daemon isn't up after a minute, I don't think it's coming up.
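
For reference, the retry step is roughly equivalent to the sketch below (simplified; this is not the exact script we run, and docker info as the health check is just a placeholder for whatever trivial Docker operation you prefer):

#!/usr/bin/env python
"""Simplified sketch of the retry logic described above: poll the Docker
daemon up to 5 times, sleeping 10 seconds between attempts."""
import subprocess
import sys
import time

ATTEMPTS = 5
SLEEP_SECONDS = 10

for attempt in range(1, ATTEMPTS + 1):
    # "docker info" only succeeds if the daemon is reachable.
    result = subprocess.run(
        ["docker", "info"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode == 0:
        print("Docker daemon is up (attempt {} of {})".format(attempt, ATTEMPTS))
        sys.exit(0)
    print("Docker daemon not reachable (attempt {} of {})".format(attempt, ATTEMPTS))
    if attempt < ATTEMPTS:
        time.sleep(SLEEP_SECONDS)

sys.exit("Docker daemon never came up after {} attempts".format(ATTEMPTS))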

Given that this has been reliable for months and only became intermittently problematic over the last couple of days, I am inclined to believe the problem is not on my end.

I had a hunch that one of the VM volumes from the volume pool was bad, so I put together a script to check the last 3000 builds and correlate volumes with failures.
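
In outline, it works like this (a simplified sketch, not the exact code behind the results below; the project slug, token, and the log pattern used to spot the volume ID are placeholders):

#!/usr/bin/env python
"""Simplified sketch: page through recent builds via the CircleCI v1.1 API,
pull whatever step output each build exposes, and tally which DLC volume
each docker-daemon failure landed on."""
import collections
import re

import requests

API = "https://circleci.com/api/v1.1/project/github/<org>/<repo>"  # placeholder slug
PARAMS = {"circle-token": "<redacted>"}                            # placeholder token
BUILDS_TO_CHECK = 3000
PAGE_SIZE = 100  # the v1.1 API returns at most 100 builds per request

stats = collections.defaultdict(lambda: {"success": 0, "failed": 0, "failed_daemon": 0})


def step_output(build):
    """Concatenate the step output a build exposes via each action's output_url."""
    chunks = []
    for step in build.get("steps", []):
        for action in step.get("actions", []):
            url = action.get("output_url")
            if url:
                resp = requests.get(url)
                if resp.ok:
                    chunks.append(resp.text)
    return "\n".join(chunks)


for offset in range(0, BUILDS_TO_CHECK, PAGE_SIZE):
    page = requests.get(API, params=dict(PARAMS, limit=PAGE_SIZE, offset=offset,
                                         filter="completed")).json()
    for summary in page:
        build = requests.get("{}/{}".format(API, summary["build_num"]),
                             params=PARAMS).json()
        output = step_output(build)

        # Assumption: the machine executor mentions the DLC volume somewhere in
        # the step output, e.g. "volume 171808"; adjust the pattern as needed.
        match = re.search(r"volume[ -]?(\d+)", output, re.IGNORECASE)
        if not match:
            continue
        volume = match.group(1)

        if summary.get("outcome") == "success":
            stats[volume]["success"] += 1
        else:
            stats[volume]["failed"] += 1
            if "Cannot connect to the Docker daemon" in output:
                stats[volume]["failed_daemon"] += 1

for volume, counts in sorted(stats.items()):
    print(volume, counts)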

Sure enough, only one volume gets mounted whenever the docker daemon fails to start.
Full results: https://gist.github.com/levi-rs/347077813cf37a8b4c1185a6606e9ce3

Last output:

Found one for volume 171808 on build 132752
Current Stats for 171808:
Succes: 0
Failed: 127
Failed daemon: 125
Total builds evaluated: 2597
List of volumes with daemon failure: ['171808']
Elapsed time: 0:28:55.960237

So across the last ~3000 builds (2,597 of which were evaluated), there were 125 failures with the docker daemon issue, and every one of them is associated with volume 171808.

I’m hoping there is a way to invalidate a specific VM volume.

You could try opening a support ticket, if you are on a paid tier.

The only other thing I can suggest, from the perspective of getting volunteer help here or on Stack Overflow, is to create an open repo on GitHub that replicates the issue without containing any of your employer’s IP.

A ticket was opened for this and the issue should now be resolved. I'll mark this as solved here.


I'm having the same issue; do I need to open a support ticket as well?

@wprater I believe so, yes. The support person was able to invalidate the bad volume, and I don't believe there is any other way for a user to do that themselves.
