The “job status” badges in the CircleCI UI now include a badge for “Infrastructure Failure”. An infrastructure failure is the result of CircleCI running into issues with the underlying infrastructure that the job runs on.
Previously, it was difficult to discern whether a job failed due to an infrastructure failure or due to legitimate failures such as a test failure as both cases would have resulted in a “Failure” badge.
Many thanks for this new addition, it is very timely
We have just switched our workflows on CircleCI self-hosted runners running on GCP preemptible VMs.
Due to the nature of these VMs, the CircleCI agent gets killed and the CircleCI “control plane” does not receive updates from it anymore. As a consequence, some of our workflows jobs get into this new infrastructure fail state.
Are there some metrics available we can monitor to measure the number and rate of infrastructure fail jobs?
This would be very handy for us to measure and prioritize troubleshooting for these kinds of issues.
Unfortunately there are no metrics of that kind. I’ll pass along this feedback to the relevant team at CircleCI to see if this might be something they would consider enabling.