[Product Update] "Infrastructure Failure" Badge in CircleCI UI

The “job status” badges in the CircleCI UI now include a badge for “Infrastructure Failure”. An infrastructure failure is the result of CircleCI running into issues with the underlying infrastructure that the job runs on.

Previously, it was difficult to tell whether a job failed because of an infrastructure failure or because of a legitimate failure such as a failing test, since both cases resulted in a “Failure” badge.

[Image: infra_fail_badge]


Hi there :wave:!

Many thanks for this new addition, it is very timely :slight_smile:

We have just switched our workflows to CircleCI self-hosted runners running on GCP preemptible VMs.
Due to the nature of these VMs, the CircleCI agent gets killed and the CircleCI “control plane” no longer receives updates from it. As a consequence, some of our workflow jobs end up in this new infrastructure failure state.

:question: Are there some metrics available we can monitor to measure the number and rate of infrastructure fail jobs?

This would be very handy for us to measure and prioritize troubleshooting for these kinds of issues.

Unfortunately there are no metrics of that kind. I’ll pass along this feedback to the relevant team at CircleCI to see if this might be something they would consider enabling.
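
In the meantime, a count and rate could be computed by hand from job records you have already collected. This is only a minimal sketch, not an official feature: it assumes each job is a dict with a `status` field and that infrastructure failures carry the status string `infrastructure_fail`. The function name and the sample data are hypothetical, so check them against your own data.

```python
from collections import Counter

# Assumption: the status string that marks an infrastructure failure.
INFRA_FAIL_STATUS = "infrastructure_fail"


def infra_fail_rate(jobs):
    """Return (count, rate) of infrastructure-failed jobs.

    `jobs` is any iterable of dicts with a "status" key.
    The rate is 0.0 when there are no jobs at all.
    """
    counts = Counter(job.get("status") for job in jobs)
    total = sum(counts.values())
    infra = counts[INFRA_FAIL_STATUS]
    rate = infra / total if total else 0.0
    return infra, rate


# Hypothetical sample data, for illustration only.
sample_jobs = [
    {"name": "build", "status": "success"},
    {"name": "test", "status": "failed"},
    {"name": "deploy", "status": "infrastructure_fail"},
    {"name": "lint", "status": "success"},
]

count, rate = infra_fail_rate(sample_jobs)
print(f"{count} infrastructure failures ({rate:.0%} of jobs)")
```

Run periodically over a fixed time window, a number like this could feed whatever dashboard or alerting you already use.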


Hi @sebastian-lerner ,

Any news on this topic?

We would like to monitor such things and provide some SLA to our dev teams.

Thank you

Unfortunately there are no updates on our end. It is still something the relevant team is considering, but there are no timelines to share.

Curious,

Do the credits consumed in infrastructure failure builds/workflows get counted against billing for commercial plans?

We have been seeing these infrastructure failures quite often after switching to M1 machines as part of CircleCI’s Intel deprecation plan.

Reference build

@vijaytholpadi I don’t believe infrastructure failure builds consume credits. Let me know if you are seeing different behavior.


We’d also like to see metrics on builds that have the infrastructure failure tag, or perhaps a more detailed response than the following:

This job appears to have stopped responding, try re-running it.

We’re having a really hard time when we run into these issues as they’re not reproducible on our end unless we disable our build cache.