Increased rate of errors when pulling Docker images on the machine executor

Hi there,
We run docker-compose on a machine executor. Over the past month or so we have observed an increased rate of failed jobs due to errors when docker-compose pulls an image.

Here is one example:

Pulling mysql (circleci/mysql:5.7.27-ram)...
5.7.27-ram: Pulling from circleci/mysql

359f1fff: Pulling fs layer
1c853362: Pulling fs layer
54c0af6f: Pulling fs layer
c1a77330: Pulling fs layer
8a88eabc: Pulling fs layer
8658f4dd: Pulling fs layer
f6bff01b: Pulling fs layer
1efb6f83: Pulling fs layer
dbd83183: Pulling fs layer
57d8f022: Pulling fs layer
9495c6e7: Pulling fs layer
96f3c914: Pulling fs layer
545dcb9e: Pull complete
ERROR: error pulling image configuration: Get https://docker-images-prod.s3.dualstack.us-east-1.amazonaws.com/registry-v2/docker/registry/v2/blobs/sha256/03/03a... ...c7cfeecd0824: dial tcp 52.216.161.126:443: i/o timeout

Exited with code exit status 1

It happens with various images, and the job is usually fine after a rebuild. What is causing the issue, and how can we prevent it?

3 Likes

Seeing the same issue with an i/o timeout error, and rebuilds usually fix it - but it’s happening for every other build.

3.7.6-slim-buster: Pulling from library/python

d04f60ab: Pulling fs layer 
16f83cca: Pulling fs layer 
30ef4680: Pulling fs layer 
c4206257: Pulling fs layer 
error pulling image configuration: Get https://docker-images-prod.s3.dualstack.us-east-1.amazonaws.com/registry-v2/docker/registry/v2/blobs/sha256/84/84de2ffd919d8fc218a243d5eb6fc0f17c7e9ebc196ad10dab4edaa9367ead4f/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA2KUBRXV6HSE42Q7X%2F20211202%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20211202T022810Z&X-Amz-Expires=1200&X-Amz-Security-Token=FwoGZXIvYXdzEFsaDOKqEUdT9jOCINEdViKFBKAmy4t6Gzta9O2sCktpYCF0rhvazivemB3bSiJwXFZn1XxETZzHlbN4SIZ4ack%2FHgI99ghWdRkO96M2kj0EYE%2BeHW%2BxRi9kS2cpaZ8mjWy0PGFvgDYeSeEScJ6k0DLNgwM7SQGMyelxipBnGJzeYzWVMvvjl6MuaNiJSdADEa6069dBjVP0t7yGYs5mi9loKy4LbA8sdTzOgMk5SQ%2Bhj5gQhQMZZLXGhJJEBDK2iir7ZXBko9k3%2FfQeQZX3BIzkPTUAPKi%2BMRrExtBxsjncSd9uuGKR5etS8QJmm4mA%2F9Z4jk%2BrWit3KNtNml5bVECaS8N9Xlk88rXUBc3OibOssLioz2aYbT4SA3dFEQbG2pf4rkvFDy3u8RaudStQrNGskFMd7%2BCLc1R6BFnKgBtfttTPkurZpL0VrpYBBv4eF0gS7I7AodE7A2G%2B%2BOPRvCJ2Ng%2BKrb4gnW9FyO82BT8eRO0IvfvVp9gT6kln%2FCbKJ2C4Ns%2F%2B5UEpEg%2Fj%2B4rOCaVA4eYQr5Plk0aAb3K1%2BqFl3e22p960AAiGV%2F72Zb6NKDxg6s5uMBp0P85vRFt0f%2Bn7Y%2FxhnT6%2F2RbsoVnU7SID41pUbjCbhnBc0ly2J9YGPGtwvQkKw52C7DA7n95cq5w7WIC5C4J4qYeGK5s%2BDTTsS4JmrJWObUkXxq%2FQzeACH9erHvtWZZoovsagjQYyKofXUxQLF5S57ekk4cB2%2FxvCeJ%2F5s4PHElhcKf%2FhUEEWUtqUQFKd2PIh0A%3D%3D&X-Amz-SignedHeaders=host&X-Amz-Signature=e70d46b7f5620a9c3c35d81072b8644f9b366795266641909f496c65f3f08c4b: dial tcp 52.216.80.238:443: i/o timeout

Exited with code exit status 1
1 Like

Currently, we work around this by retrying the image pull:

- run:
    name: "Pull test docker images"
    command: i='0'; while ! docker-compose -f tests/docker/docker-compose.yml pull && ((i < 3)); do sleep 3 && i=$((i+1)); done

It is not ideal, but it helps, and we don’t have to manually rerun every other build.
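In case it helps anyone else, a slightly more defensive variant is sketched below (the compose file path and retry count are just placeholders from our setup); it fails the step explicitly when every attempt fails, instead of letting a later step hit the missing images:

- run:
    name: "Pull test docker images (with retries)"
    command: |
      # Try the pull up to 3 times; fail the step if every attempt fails
      for i in 1 2 3; do
        docker-compose -f tests/docker/docker-compose.yml pull && exit 0
        echo "Pull attempt $i failed, retrying in 3 seconds..."
        sleep 3
      done
      echo "All pull attempts failed"
      exit 1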

We see this issue quite frequently in our builds too.

Same, we’ve been seeing this a lot today as well.

This is happening A LOT to us lately. I’ll add some retries but it’s a PITA.

1 Like

Aside from regular CircleCI downtime, timeouts when pulling images are perhaps the biggest cause of failing builds in our system, even more so than test failures. I switched from self-hosted Jenkins to CircleCI to get more stable infrastructure, but that seems to have been a false assumption.

Hi folks, we are actively working on reducing the number of timeouts experienced when using the machine executor on our platform. I appreciate your patience, and thank you for following the guidance around including retries when possible.

Could you please share a link to your jobs that experience timeouts, either in this thread or via private message with me? I want to make sure we have as much data as possible as we test theories about why we’re seeing degraded network performance, as well as possible solutions.

Additionally, to everyone on the thread, please make sure you are using the latest machine images: in February 2022 we patched an issue with Docker that was causing network instability. The patch is present in images ubuntu-2004:202201-02 and later.
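For reference, here is a minimal sketch of pinning the machine image in .circleci/config.yml (the job name and steps are placeholders for your own config):

version: 2.1

jobs:
  build:
    machine:
      image: ubuntu-2004:202201-02   # or any later tag
    steps:
      - checkout
      - run: docker-compose pull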

We use BuildKit and the Arm execution environment.
We’ve been seeing it when BuildKit resolves the # syntax=docker/dockerfile:1.2 directive in our Dockerfile (see https://docs.docker.com/engine/reference/builder/#syntax) during docker build:

#3 resolve image config for docker.io/docker/dockerfile:1.2
#3 sha256:294a0880f468e89a28a1e5e4d13ea717bf9a656005ccb802a8dcbf12ae5d5949
#3 ERROR: failed to copy: httpReadSeeker: failed open: failed to do request: Get "https://docker-images-prod.s3.dualstack.us-east-1.amazonaws.com/registry-v2/docker/registry/v2/blobs/sha256/d9/d9b...(truncated)": net/http: TLS handshake timeout
------
 > resolve image config for docker.io/docker/dockerfile:1.2:
------
failed to solve with frontend dockerfile.v0: failed to solve with frontend gateway.v0: failed to copy: httpReadSeeker: failed open: failed to do request: Get "https://docker-images-prod.s3.dualstack.us-east-1.amazonaws.com/registry-v2/docker/registry/v2/blobs/sha256/d9/d9b7...(truncated)": net/http: TLS handshake timeout

Exited with code exit status 1
CircleCI received exit code 1

We’ve also seen this after updating the image to ubuntu-2004:2022.04.2.

We’ve started seeing an increased number of errors like

failed to solve: rpc error: code = Unknown desc = failed to solve with frontend dockerfile.v0: failed to create LLB definition: failed to do request: Head "https://registry-1.docker.io/v2/library/node/manifests/18.6.0-alpine": EOF

which I believe are also caused by some network issue under the hood. We are using the ubuntu-2004:2022.07.1 machine image.

@sebastian-lerner I’ve updated our config to always use the latest machine image, ubuntu-2204:current, but I am still seeing the same error consistently in our builds. A rebuild makes no difference.

...load build definition from Dockerfile
#1 transferring dockerfile: 30B
failed to receive status: rpc error: code = Unavailable desc = error reading from server: EOF

Exited with code exit status 17

CircleCI received exit code 17

Does anyone else have any stable workarounds?

@davidwinter Can you please file a support ticket so we can take a closer look? support.circleci.com

Thanks, I did so earlier today.