Self-hosted runner setup on GCP with spot instances

Hi there :wave:

In February this year, we set up self-hosted machine runners on GCP to reduce the network-related costs of our private Artifact Registry, where we push and pull Docker images.

Also for cost reasons, we set up Managed Instance Groups with spot/preemptible VMs.

However, due to the nature of spot VMs, Google sometimes stops the VMs, and the jobs running on them never “stop” properly: they end with the new “infrastructure fail” status :sweat_smile:

:bulb: Some CircleCI support team members told us to set up a “system” where systemd would send 2 SIGTERM signals to the task-agent to make it “drain” and report the job as canceled to the CircleCI control plane.
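
For illustration, the “two SIGTERMs” part of that suggestion can be sketched in Python. This is only a minimal sketch under assumptions: the function name `send_sigterm_twice` and the delay between signals are hypothetical, and nothing here confirms that the task-agent actually drains when it receives a second SIGTERM.

```python
import os
import signal
import time


def send_sigterm_twice(pid: int, delay_seconds: float = 1.0) -> int:
    """Send SIGTERM to `pid` twice, pausing between the two signals.

    Returns the number of signals that were actually sent (0, 1, or 2).
    The delay is an assumed value, not something taken from any documentation.
    """
    sent = 0
    for attempt in range(2):
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            # The process no longer exists, so there is nothing left to signal.
            break
        sent += 1
        if attempt == 0:
            # Give the process time to react to the first signal.
            time.sleep(delay_seconds)
    return sent
```

How this would be wired into systemd (for example from a unit's `ExecStop=` command) is left open here, since the thread does not describe the exact setup.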

We tried to do so without success :grimacing:

Has anyone already done such a setup, or is anyone facing the same situation?
If so, could you share your setup with us? :innocent:

Many thanks to you :pray:

If any CircleCI employee finds this post, I think it could be a good example to add to the self-hosted machine runners documentation :wink:

No suggestions at all :disappointed_relieved:

I think you will have to talk to the support team again, as the recommended solution of sending 2 SIGTERM signals to the task-agent is at odds with both how SIGTERMs are normally handled and CircleCI’s own docs.

A SIGTERM is a ‘polite’ kill: it is a signal that the process can handle, so it is possible that the handler could be coded to accept 2 SIGTERMs as a special case. However, that is not documented anywhere that I can find, and only the internal teams have access to the current code base.
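
To show what “accepting 2 SIGTERMs as a special case” could look like in principle, here is a minimal Python sketch. The class name `SigtermCounter` and the states `draining` and `cancelling` are made up for illustration; this is not how the task-agent is known to behave.

```python
import signal


class SigtermCounter:
    """Counts received SIGTERMs and maps the count to a hypothetical state."""

    def __init__(self) -> None:
        self.count = 0
        self.state = "running"

    def handle(self, signum, frame) -> None:
        self.count += 1
        # Hypothetical policy: the first SIGTERM means "finish up and accept
        # no new work", any further SIGTERM means "cancel the current task".
        self.state = "draining" if self.count == 1 else "cancelling"


def install(counter: SigtermCounter) -> None:
    # Replace the default SIGTERM action (terminate) with the counting handler.
    signal.signal(signal.SIGTERM, counter.handle)
```

The point is only that a process is free to interpret repeated SIGTERMs however its handler is written, which is why the behavior would need to be documented by whoever owns the code.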

The only docs I can find are here

But these cover a different shutdown cycle, which first tries a SIGTERM and then a KILL, as the focus is on a clean shutdown over time.
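
As a generic illustration of that “SIGTERM first, then KILL” cycle, here is a minimal Python sketch. The function name `stop_gracefully` and the 10-second grace period are assumptions for the example, not values taken from CircleCI’s docs.

```python
import subprocess


def stop_gracefully(proc: subprocess.Popen, grace_seconds: float = 10.0) -> int:
    """Ask `proc` to exit with SIGTERM, then force it with SIGKILL.

    Returns the process return code (negative signal number on POSIX
    when the process was terminated by a signal).
    """
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        # Give the process a chance to shut down cleanly.
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX; this cannot be handled or ignored
        return proc.wait()
```

This cycle gives the process exactly one chance to clean up, which is different from a scheme where a second SIGTERM carries its own meaning.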

Hi @rit1010,

Thanks for your message :pray:

I will reach out to CircleCI’s support team directly then, as you suggested.

I will post the outcome here if I get anything useful out of it :innocent: