Increased rate of errors when pulling Docker images on the machine executor

Hi there,
We run docker-compose on a machine executor. Over the past month or so we have observed an increased rate of failed jobs due to errors when docker-compose pulls an image.

Here is one example:

Pulling mysql (circleci/mysql:5.7.27-ram)...
5.7.27-ram: Pulling from circleci/mysql

359f1fff: Pulling fs layer
1c853362: Pulling fs layer
54c0af6f: Pulling fs layer
c1a77330: Pulling fs layer
8a88eabc: Pulling fs layer
8658f4dd: Pulling fs layer
f6bff01b: Pulling fs layer
1efb6f83: Pulling fs layer
dbd83183: Pulling fs layer
57d8f022: Pulling fs layer
9495c6e7: Pulling fs layer
96f3c914: Pulling fs layer
545dcb9e: Pull complete
ERROR: error pulling image configuration: Get https://docker-images-prod.s3.dualstack.us-east-1.amazonaws.com/registry-v2/docker/registry/v2/blobs/sha256/03/03a... ...c7cfeecd0824: dial tcp 52.216.161.126:443: i/o timeout

Exited with code exit status 1

It happens with various images, and the job is usually fine after a rebuild. What is causing the issue, and how can we prevent it?

3 Likes

Seeing the same issue with an i/o timeout error, and rebuilds usually fix it - but it’s happening for every other build.

3.7.6-slim-buster: Pulling from library/python

d04f60ab: Pulling fs layer 
16f83cca: Pulling fs layer 
30ef4680: Pulling fs layer 
c4206257: Pulling fs layer 
error pulling image configuration: Get https://docker-images-prod.s3.dualstack.us-east-1.amazonaws.com/registry-v2/docker/registry/v2/blobs/sha256/84/84de2ffd919d8fc218a243d5eb6fc0f17c7e9ebc196ad10dab4edaa9367ead4f/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA2KUBRXV6HSE42Q7X%2F20211202%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20211202T022810Z&X-Amz-Expires=1200&X-Amz-Security-Token=FwoGZXIvYXdzEFsaDOKqEUdT9jOCINEdViKFBKAmy4t6Gzta9O2sCktpYCF0rhvazivemB3bSiJwXFZn1XxETZzHlbN4SIZ4ack%2FHgI99ghWdRkO96M2kj0EYE%2BeHW%2BxRi9kS2cpaZ8mjWy0PGFvgDYeSeEScJ6k0DLNgwM7SQGMyelxipBnGJzeYzWVMvvjl6MuaNiJSdADEa6069dBjVP0t7yGYs5mi9loKy4LbA8sdTzOgMk5SQ%2Bhj5gQhQMZZLXGhJJEBDK2iir7ZXBko9k3%2FfQeQZX3BIzkPTUAPKi%2BMRrExtBxsjncSd9uuGKR5etS8QJmm4mA%2F9Z4jk%2BrWit3KNtNml5bVECaS8N9Xlk88rXUBc3OibOssLioz2aYbT4SA3dFEQbG2pf4rkvFDy3u8RaudStQrNGskFMd7%2BCLc1R6BFnKgBtfttTPkurZpL0VrpYBBv4eF0gS7I7AodE7A2G%2B%2BOPRvCJ2Ng%2BKrb4gnW9FyO82BT8eRO0IvfvVp9gT6kln%2FCbKJ2C4Ns%2F%2B5UEpEg%2Fj%2B4rOCaVA4eYQr5Plk0aAb3K1%2BqFl3e22p960AAiGV%2F72Zb6NKDxg6s5uMBp0P85vRFt0f%2Bn7Y%2FxhnT6%2F2RbsoVnU7SID41pUbjCbhnBc0ly2J9YGPGtwvQkKw52C7DA7n95cq5w7WIC5C4J4qYeGK5s%2BDTTsS4JmrJWObUkXxq%2FQzeACH9erHvtWZZoovsagjQYyKofXUxQLF5S57ekk4cB2%2FxvCeJ%2F5s4PHElhcKf%2FhUEEWUtqUQFKd2PIh0A%3D%3D&X-Amz-SignedHeaders=host&X-Amz-Signature=e70d46b7f5620a9c3c35d81072b8644f9b366795266641909f496c65f3f08c4b: dial tcp 52.216.80.238:443: i/o timeout

Exited with code exit status 1
1 Like

Currently, we work around this by retrying the image pull:

- run:
    name: "Pull test docker images"
    command: i='0'; while ! docker-compose -f tests/docker/docker-compose.yml pull && ((i < 3)); do sleep 3 && i=$((i+1)); done

It is not ideal, but it helps, and we don’t have to manually rerun every other build.
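In case it helps anyone else, a slightly more defensive variant is sketched below (the compose file path and retry count are just placeholders from our setup); it fails the step explicitly when every attempt fails, instead of letting a later step hit the missing images:

- run:
    name: "Pull test docker images (with retries)"
    command: |
      # Try the pull up to 3 times; fail the step if every attempt fails
      for i in 1 2 3; do
        docker-compose -f tests/docker/docker-compose.yml pull && exit 0
        echo "Pull attempt $i failed, retrying in 3 seconds..."
        sleep 3
      done
      echo "All pull attempts failed"
      exit 1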

We see this issue quite frequently in our builds too.

Same, we’ve been seeing this a lot today as well.

This is happening A LOT to us lately. I’ll add some retries but it’s a PITA.

1 Like

Aside from regular CircleCI downtime, timeouts when pulling images are perhaps the biggest cause of failing builds in our system, even more so than test failures. I switched from self-hosted Jenkins to CircleCI to get more stable infrastructure, but that seems to have been a false assumption.

Hi folks, we are actively working on reducing the number of timeouts experienced when using the machine executor on our platform. I appreciate your patience, and thank you for following the guidance around including retries when possible.

Could you please share a link to your jobs that experience timeouts, either in this thread or via private message with me? I want to make sure we have as much data as possible as we test theories about why we’re seeing degraded network performance, as well as possible solutions.

Additionally, to everyone on the thread, please make sure you are using the latest machine images: in February 2022 we patched an issue with Docker that was causing network instability. The patch is present in images ubuntu-2004:202201-02 and later.
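For reference, here is a minimal sketch of pinning the machine image in .circleci/config.yml (the job name and steps are placeholders for your own config):

version: 2.1

jobs:
  build:
    machine:
      image: ubuntu-2004:202201-02   # or any later tag
    steps:
      - checkout
      - run: docker-compose pull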

We use BuildKit and the Arm execution environment.
We’ve been seeing it when BuildKit resolves the # syntax=docker/dockerfile:1.2 directive in our Dockerfile (see https://docs.docker.com/engine/reference/builder/#syntax) during docker build:

#3 resolve image config for docker.io/docker/dockerfile:1.2
#3 sha256:294a0880f468e89a28a1e5e4d13ea717bf9a656005ccb802a8dcbf12ae5d5949
#3 ERROR: failed to copy: httpReadSeeker: failed open: failed to do request: Get "https://docker-images-prod.s3.dualstack.us-east-1.amazonaws.com/registry-v2/docker/registry/v2/blobs/sha256/d9/d9b...(truncated)": net/http: TLS handshake timeout
------
 > resolve image config for docker.io/docker/dockerfile:1.2:
------
failed to solve with frontend dockerfile.v0: failed to solve with frontend gateway.v0: failed to copy: httpReadSeeker: failed open: failed to do request: Get "https://docker-images-prod.s3.dualstack.us-east-1.amazonaws.com/registry-v2/docker/registry/v2/blobs/sha256/d9/d9b7...(truncated)": net/http: TLS handshake timeout

Exited with code exit status 1
CircleCI received exit code 1

We’ve also seen this after updating the image to ubuntu-2004:2022.04.2.

We’ve started seeing an increased number of errors like

failed to solve: rpc error: code = Unknown desc = failed to solve with frontend dockerfile.v0: failed to create LLB definition: failed to do request: Head "https://registry-1.docker.io/v2/library/node/manifests/18.6.0-alpine": EOF

which I believe are also caused by some network issue under the hood. We are using the ubuntu-2004:2022.07.1 machine image.

@sebastian-lerner I’ve updated our config to always use the latest machine image, ubuntu-2204:current, but I am still seeing the same error consistently in our builds. A rebuild makes no difference.

...load build definition from Dockerfile
#1 transferring dockerfile: 30B
failed to receive status: rpc error: code = Unavailable desc = error reading from server: EOF

Exited with code exit status 17

CircleCI received exit code 17

Does anyone else have any stable workarounds?

@davidwinter Can you please file a support ticket so we can take a closer look? support.circleci.com

Thanks, I did so earlier today.