Self-hosted runners are down

All my runners have been reporting the following error for over 7 hours. This is a repeat of an issue from 2 weeks ago.

Agent download unsuccessful.
error: allocation="YPK62R09" download failed: error downloading task agent version="1.0.127855-4422beb0" os="linux" arch="amd64": could not write file "/tmp/circleci-launch-agent141090225/circleci-agent/1.0.127855-4422beb0/linux/amd64/circleci-agent.tmp": context deadline exceeded (Client.Timeout or context cancellation while reading body)

Very little information is provided to help debug this error, such as which runner in a pool reported it. However, as I’ve shut down all members of my pool and then restarted just one, I know I have a ‘clean’ system. The lack of detail is clear from the fact that exactly the same error is given when no members of a runner pool are running; the only difference is how quickly things time out.
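
For anyone triaging the same message, here is a minimal Python sketch that pulls the key="value" fields out of the error line, to show what the message does and does not carry. The function name `parse_agent_error` is hypothetical, and the pattern is inferred only from the single message quoted above. It accepts straight or curly quotes, since forum software may alter them.

```python
import re

# Matches key="value" pairs; accepts straight or curly double quotes.
FIELD_RE = re.compile(r'(\w+)=["\u201c\u201d]([^"\u201c\u201d]*)["\u201c\u201d]')


def parse_agent_error(line: str) -> dict:
    """Return the key/value fields found in a launch-agent error line.

    Hypothetical helper: the field names are whatever the message contains,
    not a documented schema.
    """
    return dict(FIELD_RE.findall(line))


if __name__ == "__main__":
    message = (
        'error: allocation="YPK62R09" download failed: error downloading '
        'task agent version="1.0.127855-4422beb0" os="linux" arch="amd64"'
    )
    for key, value in parse_agent_error(message).items():
        print(f"{key}: {value}")
```

On the message above this yields only allocation, version, os and arch. Nothing in the line identifies which runner reported it.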

Let me check with some of the internal teams on this!

Thanks for that.

Some more feedback. Firstly, being dependent on a CI system that can fail in this way is not great for any company, and from what I can see from my end the failure is just down to poor programming within the build system. What I now know:

  • The current agent being published is 1.0.39461-4c647fa. I have validated this by doing a hand install of a new runner using the scripts provided.

  • The CI job is able to communicate with my self-hosted runner as the directory structure shown in the error is created.

So my guess is that for some reason the CI system has decided to upgrade the agent using incorrect version data, and the upgrade then fails because the file it is trying to download does not exist. I can validate this in part: I modified the download script to try to pull down 1.0.127855-4422beb0, and no file is found.
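
The same check can be sketched in Python without downloading anything. This is only an illustration under stated assumptions: `base_url` is a placeholder you must supply (the real download host is not given in this thread), the `<version>/<os>/<arch>/circleci-agent` layout is assumed to mirror the temp path shown in the error, and both function names are hypothetical.

```python
import urllib.error
import urllib.request


def artifact_url(base_url: str, version: str,
                 os_name: str = "linux", arch: str = "amd64") -> str:
    """Build a candidate URL for an agent binary.

    Assumption: the remote layout mirrors the local temp path in the error,
    i.e. <version>/<os>/<arch>/circleci-agent under some base URL.
    """
    return f"{base_url.rstrip('/')}/{version}/{os_name}/{arch}/circleci-agent"


def artifact_exists(url: str, timeout: float = 10.0) -> bool:
    """Send a HEAD request; True on a 2xx response, False on 404."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # other HTTP errors are not evidence either way


# Example (base URL is a placeholder, not a real endpoint):
#   url = artifact_url("https://example.invalid/agent", "1.0.127855-4422beb0")
#   print(url, artifact_exists(url))
```

Comparing the result for 1.0.39461-4c647fa against 1.0.127855-4422beb0 would show whether the requested version is simply missing or whether the download itself is timing out.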

At the moment I can work around this issue, as much of our build/test process is manual or exists only as future plans, but losing the use of our self-hosted runners for whole working days is not viable long term.

Things are now working again.

Thanks for the updates @rit1010! We just released a new version of launch-agent that should have more robust retry logic.

I know you have a support ticket open, too, but I think it’s really helpful to document it here!
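
For readers wondering what "more robust retry logic" around a download can look like in general, here is a minimal Python sketch of retrying with exponential backoff. It is not the launch-agent implementation, which is not shown in this thread; the function name `retry_with_backoff` and all parameter defaults are assumptions chosen for illustration.

```python
import time


def retry_with_backoff(operation, attempts=5, base_delay=1.0,
                       max_delay=30.0, sleep=time.sleep):
    """Call operation() until it succeeds or attempts run out.

    Retries on OSError, which covers TimeoutError, ConnectionError and
    urllib.error.URLError. The delay doubles after each failure and is
    capped at max_delay. The last failure is re-raised to the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError:
            if attempt == attempts:
                raise  # out of retries: surface the real error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay)


# Example: wrap any download callable, e.g.
#   data = retry_with_backoff(lambda: fetch_agent(url))
# where fetch_agent is a hypothetical function that raises on timeout.
```

Note that retries only help with transient failures such as timeouts. If the requested file genuinely does not exist, every attempt fails the same way.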

A little more detail:

The issue relates to the Build-agent, which runs before the Launch-agent and seems to have a different life cycle in terms of where it is downloaded from. I’ve provided as much info as I can via my ongoing exchange with the support team.

The Launch-agent is something that in the worst case can be updated by hand, but the Build-agent is 100% under the control of the pipeline runner and so cannot be resolved locally.
