Intermittent failures connecting to docker daemon

We are experiencing intermittent but disruptive failures in our workflow, all with the following error message:

docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

The failing jobs all use the machine executor type, and roughly 2 of the ~50 jobs in each workflow fail. This appears to have begun a few days ago and, anecdotally, has been getting slightly worse over time.

Can you offer any assistance as to what we can do to fix this? I can supply links to failing workflows and jobs upon request.

A current workaround we are using is to select the "Rerun workflow from failed" option, but that is obviously not ideal.

Is that a daemon you have installed yourself? If so, that might be out of CircleCI’s control (they provide a VPS and you do as you will with it). Please add the current version of your config here in the question, so readers can take a look.

Thank you for the response, halfer.

These jobs are running on the machine executor, with the circleci/classic:latest image.

Our config is rather large and contains proprietary information, so I am unable to post it in full. We are using version 2.0 of the config, and the following machine definition applies to all of our jobs that are experiencing issues:

machine-defaults: &machine-defaults
  working_directory: ~/<our project>
  machine:
    docker_layer_caching: true
    image: circleci/classic:latest

The docker daemon is what is installed by default on these VMs, and we have been using versions of this workflow for the past 8 months or so. This is the first time we have ever experienced any issues with the executor’s docker daemon.

That may be an ongoing business risk, since it makes it harder for you to obtain help from volunteers on the web. My estimation is that most claims about IP are overblown, and can be trivially overcome by some minor redactions. Be as helpful as you can when asking for help! :slightly_smiling_face:

My guess at the moment is that Docker is not spun up fully at the point when you are trying to use it (and you have a race condition, which is why it works some of the time). Maybe you could write a Bash script to loop, wait a second, perform a trivial Docker operation, and then keep looping until you get a successful reply. If this results in a successful operation, you can then do your real Docker operation.

Retries with a backoff have been in place for a while now. The script tries a total of 5 times, with a 10-second sleep between attempts. If the daemon isn't up after a minute, I don't think it's coming up.
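
For reference, the retry step is roughly equivalent to the sketch below (simplified; this is not the exact script we run, and docker info as the health check is just a placeholder for whatever trivial Docker operation you prefer):

#!/usr/bin/env python
"""Simplified sketch of the retry logic described above: poll the Docker
daemon up to 5 times, sleeping 10 seconds between attempts."""
import subprocess
import sys
import time

ATTEMPTS = 5
SLEEP_SECONDS = 10

for attempt in range(1, ATTEMPTS + 1):
    # "docker info" only succeeds if the daemon is reachable.
    result = subprocess.run(
        ["docker", "info"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode == 0:
        print("Docker daemon is up (attempt {} of {})".format(attempt, ATTEMPTS))
        sys.exit(0)
    print("Docker daemon not reachable (attempt {} of {})".format(attempt, ATTEMPTS))
    if attempt < ATTEMPTS:
        time.sleep(SLEEP_SECONDS)

sys.exit("Docker daemon never came up after {} attempts".format(ATTEMPTS))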

Given that this has been reliable for months and only became intermittently problematic over the last couple of days, I am inclined to believe the problem is not on my end.

I had a hunch that one of the VM volumes from the volume pool was bad, so I put together a script to check the last 3000 builds and correlate volumes with failures.
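
In outline, it works like this (a simplified sketch, not the exact code behind the results below; the project slug, token, and the log pattern used to spot the volume ID are placeholders):

#!/usr/bin/env python
"""Simplified sketch: page through recent builds via the CircleCI v1.1 API,
pull whatever step output each build exposes, and tally which DLC volume
each docker-daemon failure landed on."""
import collections
import re

import requests

API = "https://circleci.com/api/v1.1/project/github/<org>/<repo>"  # placeholder slug
PARAMS = {"circle-token": "<redacted>"}                            # placeholder token
BUILDS_TO_CHECK = 3000
PAGE_SIZE = 100  # the v1.1 API returns at most 100 builds per request

stats = collections.defaultdict(lambda: {"success": 0, "failed": 0, "failed_daemon": 0})


def step_output(build):
    """Concatenate the step output a build exposes via each action's output_url."""
    chunks = []
    for step in build.get("steps", []):
        for action in step.get("actions", []):
            url = action.get("output_url")
            if url:
                resp = requests.get(url)
                if resp.ok:
                    chunks.append(resp.text)
    return "\n".join(chunks)


for offset in range(0, BUILDS_TO_CHECK, PAGE_SIZE):
    page = requests.get(API, params=dict(PARAMS, limit=PAGE_SIZE, offset=offset,
                                         filter="completed")).json()
    for summary in page:
        build = requests.get("{}/{}".format(API, summary["build_num"]),
                             params=PARAMS).json()
        output = step_output(build)

        # Assumption: the machine executor mentions the DLC volume somewhere in
        # the step output, e.g. "volume 171808"; adjust the pattern as needed.
        match = re.search(r"volume[ -]?(\d+)", output, re.IGNORECASE)
        if not match:
            continue
        volume = match.group(1)

        if summary.get("outcome") == "success":
            stats[volume]["success"] += 1
        else:
            stats[volume]["failed"] += 1
            if "Cannot connect to the Docker daemon" in output:
                stats[volume]["failed_daemon"] += 1

for volume, counts in sorted(stats.items()):
    print(volume, counts)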

Sure enough, only one volume gets mounted whenever the docker daemon fails to start.
Full results: https://gist.github.com/levi-rs/347077813cf37a8b4c1185a6606e9ce3

Last output:

Found one for volume 171808 on build 132752
Current Stats for 171808:
Succes: 0
Failed: 127
Failed daemon: 125
Total builds evaluated: 2597
List of volumes with daemon failure: ['171808']
Elapsed time: 0:28:55.960237

So across the last ~3000 builds (2,597 of which were evaluated), there were 125 failures with the docker daemon issue, and every one of them is associated with volume 171808.

I’m hoping there is a way to invalidate a specific VM volume.

You could try opening a support ticket, if you are on a paid tier.

The only other thing I can suggest, from the perspective of getting volunteer help here or on Stack Overflow, is to create an open repo on GitHub that replicates the issue without containing any of your employer’s IP.

A ticket was opened for this and the issue should now be resolved. I'll mark this as solved here.


I'm having the same issue; do I need to open a support ticket as well?

@wprater I believe so, yes. The support person was able to invalidate the bad volume, and I don't believe there is any other way for a user to do that themselves.
