Self-hosted runners are down

rit1010 · June 10, 2022, 5:43pm

All my runners have been reporting the following error for over 7 hours. This is a repeat of an issue from 2 weeks ago.

Agent download unsuccessful.
error: allocation="YPK62R09" download failed: error downloading task agent version="1.0.127855-4422beb0" os="linux" arch="amd64": could not write file "/tmp/circleci-launch-agent141090225/circleci-agent/1.0.127855-4422beb0/linux/amd64/circleci-agent.tmp": context deadline exceeded (Client.Timeout or context cancellation while reading body)

Very little information is provided to help debug this error, such as which runner in a pool reported it. However, as I’ve shut down all members of my pool and then restarted just one member, I know I have a ‘clean’ system. The lack of detail is evident from the fact that exactly the same error is given when no members of a runner pool are running; the only difference is how quickly things time out.

thekatertot · June 10, 2022, 6:16pm

Let me check with some of the internal teams on this!

rit1010 · June 10, 2022, 6:28pm

Thanks for that.

rit1010 · June 10, 2022, 10:14pm

Well, some more feedback. Firstly, being dependent on a CI system that can fail in this way is not great for any company, and from what I can see from my end it is just down to poor programming within the build system. What I now know:

The current agent being published is 1.0.39461-4c647fa. I have validated this by doing a hand install of a new runner using the scripts provided.
The CI job is able to communicate with my self-hosted runner as the directory structure shown in the error is created.

So my guess is that for some reason the CI system has decided to upgrade the agent using incorrect version data - the upgrade then fails as the file it is trying to download does not exist. I can validate this in part as I have modified the download script to try and pull down 1.0.127855-4422beb0, but no file is found.
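A minimal sketch of that kind of existence check, for anyone wanting to repeat it. The base URL is a placeholder and the path layout is an assumption modelled on the temp path in the error message, not a documented download location; `artifact_url` and `artifact_exists` are hypothetical helper names.

```python
import urllib.error
import urllib.request


def artifact_url(base_url, version, os_name="linux", arch="amd64"):
    # Assumed layout: <base>/<version>/<os>/<arch>/circleci-agent,
    # copied from the directory structure shown in the error message.
    return f"{base_url.rstrip('/')}/{version}/{os_name}/{arch}/circleci-agent"


def artifact_exists(url, opener=urllib.request.urlopen, timeout=10):
    """Return True if a HEAD request succeeds, False on HTTP 404."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False  # the requested version is not published here
        raise  # any other HTTP error is a different problem
```

Calling `artifact_exists(artifact_url(base, "1.0.127855-4422beb0"))` with the real base URL from the install script would then show whether the version the CI system asked for is actually published.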

At the moment I can work around this issue, as much of our build/test process is manual or exists only as future plans, but losing the ability to use our self-hosted runners for whole working days is not viable long term.

rit1010 · June 11, 2022, 8:30am

Things are now working again.

thekatertot · June 15, 2022, 3:06pm

Thanks for the updates @rit1010! We just released a new version of launch-agent that should have more robust retry logic.

I know you have a support ticket open, too, but I think it’s really helpful to document it here!
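For readers unfamiliar with the term, retry logic of this general kind can be sketched as below. This is not the launch-agent implementation, only an illustration of retrying a timed-out download with exponential backoff; `download_with_retry`, `fetch`, and the default values are all hypothetical.

```python
import time


def download_with_retry(fetch, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each timeout."""
    for attempt in range(attempts):
        try:
            return fetch()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last timeout
            # Wait 1x, 2x, 4x, ... the base delay before trying again.
            sleep(base_delay * 2 ** attempt)
```

Here `fetch` stands for whatever callable performs the download and raises `TimeoutError` when the deadline is exceeded.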

rit1010 · June 15, 2022, 3:21pm

For a little more detail.

The issue relates to the Build-agent, which runs before the Launch-agent and seems to have a different life cycle in terms of where it is downloaded from. I’ve provided as much info as I can via my ongoing exchange with the support team.

The Launch-agent is something that in the worst case can be updated by hand, but the Build-agent is 100% under the control of the pipeline runner, so the problem cannot be resolved locally.
