Docker builds getting stuck (possible networking issues)

I have set up a number of scheduled builds for a client, and occasionally they fail. The failing step is always a Docker build, and always a networking operation, so I am inclined to think there is a highly intermittent networking issue. A manual rebuild always fixes it.

Example 1:

RUN yum install -y curl git
...
Complete!
Too long with no output (exceeded 10m0s)

Example 2:

RUN composer install --prefer-source
...
Generating autoload files
Too long with no output (exceeded 10m0s)

Example 3:

RUN yum update -y
...
---> Running in 6e325d7e6098
Too long with no output (exceeded 10m0s)

In each case, an operation that should take 1-3 minutes is still stuck after 10 minutes. I logged a ticket about this, and Scott suggested that I increase the no-output timeout. I've replied to say that is not an ideal fix: if a step has already overrun its normal duration by 250% or more, it is probably stuck permanently and just needs to be restarted. (It's ticket 54295 if any employees want to read it.)

I am pondering whether I could fix this by adding a retry script in the Dockerfile for every network operation. This seems a bit hacky to me, since I’ve never had this issue locally - and I imagine the network links to the build servers have a fair bit of redundancy built-in. Any suggestions how I can tackle this?

One solution is to wrap every network operation in a retry script, though it is a bit chunky. I would probably add a timeout as well, so that no single operation is permitted to stay stuck for more than three minutes.
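For the record, here is a minimal sketch of the kind of wrapper I mean (the `retry` name and argument order are my own invention; it assumes coreutils' `timeout` is available in the image):

```shell
#!/bin/sh
# retry.sh -- illustrative helper: run a command, kill any single
# attempt that exceeds a per-attempt timeout, and retry a few times
# before giving up. Relies on coreutils' `timeout`.
retry() {
  attempts=$1; shift   # maximum number of attempts
  limit=$1; shift      # per-attempt timeout, e.g. 180 or 3m
  n=1
  while ! timeout "$limit" "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "retry: giving up after $attempts attempts: $*" >&2
      return 1
    fi
    echo "retry: attempt $n failed, retrying..." >&2
    n=$((n + 1))
    sleep 2
  done
}
```

In the Dockerfile that might look like `COPY retry.sh /usr/local/bin/retry` followed by `RUN retry 3 3m yum install -y curl git` - which is exactly the chunkiness I was complaining about, since every network-touching RUN line needs the wrapper.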

A bit of an update, in case anyone else bumps into this. Takayuki kindly replied to my ticket to say that the CircleCI Docker Layer Caching feature would help. Sadly I don't have budget sign-off for the necessary plan, and in any case I suspect caching would just mask an intermittent problem rather than fix it.

That said, it does make me wonder whether layer caching would make builds more reliable in practice, even if that is "cheating" on the problem. If I get some spare engineering time, I might revisit my old sticky-tape solution instead.

In the short term, I am happy to just re-run failed builds.


Another update. I added a timeout wrapper to a step that had been running for 10 minutes, limiting it to 3m. I am fairly sure the step still ran for the full 10m before the no-output watchdog killed it, which suggests the timeout never fired and the step really is stuck. My current theory is that it relates to how stdout is handled (and perhaps buffered) in Docker/Compose builds.
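If the stdout theory is right, one workaround would be a heartbeat wrapper rather than a timeout: keep printing something so the no-output watchdog never fires, even if the real command's output is being buffered or swallowed. This is just my own sketch (the `with_heartbeat` name is made up, and it is not a CircleCI feature):

```shell
#!/bin/sh
# Illustrative heartbeat wrapper: run the real command in the
# background and print a line every N seconds, so a CI "no output"
# watchdog sees activity even if the command itself goes quiet.
with_heartbeat() {
  interval=$1; shift
  "$@" &                      # run the real command in the background
  pid=$!
  while kill -0 "$pid" 2>/dev/null; do
    echo "heartbeat: still waiting on: $*"
    sleep "$interval"
  done
  wait "$pid"                 # propagate the command's exit status
}
```

Usage would be something like `with_heartbeat 60 composer install --prefer-source`. Of course, if the underlying step really is hung rather than just quiet, this only trades a 10m failure for an even longer one, so it would need combining with a per-step timeout.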

It is unfortunately an intermittent issue, so rather hard to replicate or test!