Hi there,
We run docker-compose on a machine executor. Over the past month or so we have observed an increased rate of failed jobs caused by errors when docker-compose pulls an image.
Apart from regular CircleCI downtime, timeouts when pulling images are probably the biggest cause of failed builds in our system, ahead even of test failures. I switched from self-hosted Jenkins to CircleCI to get more stable infrastructure, but that seems to have been a false assumption.
Hi folks, we are actively working on reducing the number of timeouts experienced when using the machine executor on our platform. I appreciate your patience, and thank you for following the guidance around including retries where possible.
Could you please share a link to your jobs that experience timeouts, either in this thread or via private message to me? I want to make sure we have as much data as possible as we test both of our theories about why we’re seeing degraded network performance, along with any possible solutions.
Additionally, to everyone on the thread: please make sure you are using the latest machine images. In February 2022 we patched an issue with Docker that was causing network instability. The patch is present in images ubuntu-2004:202201-02 and later.
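As a rough illustration of the retry guidance mentioned above, here is a minimal sketch of a wrapper that re-runs a command with exponential backoff when it exits non-zero. The function name run_with_retries, its parameters, and the default delays are assumptions made for this example; none of them come from CircleCI or Docker.

```python
import subprocess
import time


def run_with_retries(command, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Run `command`, retrying with exponential backoff on a non-zero exit code.

    Hypothetical helper: returns the CompletedProcess of the last attempt made.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        result = subprocess.run(command, capture_output=True, text=True)
        # Stop on success, or when there are no attempts left.
        if result.returncode == 0 or attempt == attempts:
            return result
        # Wait base_delay, then 2x, then 4x, ... before the next attempt.
        sleep(base_delay * 2 ** (attempt - 1))
```

A job step could then call something like run_with_retries(["docker-compose", "pull"]) and fail the build only if the final attempt still returns a non-zero code. Retrying hides transient failures but does not address the underlying network issue.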
We use BuildKit and the Arm execution environment.
We’ve been seeing the following error when docker build handles the # syntax=docker/dockerfile:1.2 directive of our Dockerfile (see https://docs.docker.com/engine/reference/builder/#syntax):
#3 resolve image config for docker.io/docker/dockerfile:1.2
#3 sha256:294a0880f468e89a28a1e5e4d13ea717bf9a656005ccb802a8dcbf12ae5d5949
#3 ERROR: failed to copy: httpReadSeeker: failed open: failed to do request: Get "https://docker-images-prod.s3.dualstack.us-east-1.amazonaws.com/registry-v2/docker/registry/v2/blobs/sha256/d9/d9b...(truncated)": net/http: TLS handshake timeout
------
> resolve image config for docker.io/docker/dockerfile:1.2:
------
failed to solve with frontend dockerfile.v0: failed to solve with frontend gateway.v0: failed to copy: httpReadSeeker: failed open: failed to do request: Get "https://docker-images-prod.s3.dualstack.us-east-1.amazonaws.com/registry-v2/docker/registry/v2/blobs/sha256/d9/d9b7...(truncated)": net/http: TLS handshake timeout
Exited with code exit status 1
CircleCI received exit code 1
We’ve also seen this after updating the image to ubuntu-2004:2022.04.2.
failed to solve: rpc error: code = Unknown desc = failed to solve with frontend dockerfile.v0: failed to create LLB definition: failed to do request: Head "https://registry-1.docker.io/v2/library/node/manifests/18.6.0-alpine": EOF
We see errors of this kind, which I believe are also caused by a network error under the hood. We are using the ubuntu-2004:2022.07.1 machine image.
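To make the "network error under the hood" guess a little more concrete, here is a small sketch that checks captured build output for the two failure signatures quoted in this thread (a TLS handshake timeout and a request ending in EOF). The function name is hypothetical and the two patterns are assumptions drawn only from the logs above; other transient failures will look different.

```python
def is_transient_network_error(output: str) -> bool:
    """Return True if any line of the output looks like a transient network failure.

    Assumption: only the two signatures seen in this thread are checked.
    """
    for line in output.splitlines():
        text = line.strip()
        # Signature 1: the TLS handshake to the registry or blob store timed out.
        if "TLS handshake timeout" in text:
            return True
        # Signature 2: the connection was closed mid-request (line ends in ": EOF").
        if text.endswith(": EOF"):
            return True
    return False
```

A check like this could be used to decide whether re-running a failed step is worthwhile, as opposed to a genuine build error that would fail again.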
@sebastian-lerner I’ve updated our config to always use the latest machine image, ubuntu-2204:current, but I am now seeing the same error consistently in our builds. Rebuilding makes no difference.
...load build definition from Dockerfile
#1 transferring dockerfile: 30B
failed to receive status: rpc error: code = Unavailable desc = error reading from server: EOF
Exited with code exit status 17
CircleCI received exit code 17