How job status is gathered (jobs stuck in queued state on 2.18)

(Gone through support, heard nothing, trying here)

We’ve been fighting a problem for several days and finally made good (but not 100%) progress today. Skip to the end for the question, or read on for the tale of woe!

The problem is that jobs were stuck in queued state (2.18.3). Tracking them through dispatcher/picard-dispatcher/picard-scheduler showed that they were “posted to nomad” and then we heard no more.

We then noticed that nomad.hcl had job_gc_threshold lowered from 4h to 5s (!). This was specifically mentioned in the 2.18 release notes. So we thought “ah, so if circleci is polling every 30 secs, it could just miss a job entirely as it’s being collected too fast”. So we bumped that… and although “nomad status” now showed the jobs (woot!) it didn’t fix the problem.
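To put rough numbers on that hypothesis, here is a minimal sketch of the race we had in mind. It assumes a fixed 30-second poll (our guess, not something we confirmed) and that a finished job's record vanishes the instant job_gc_threshold expires. That second assumption is a worst case, since Nomad treats the threshold as a minimum age and collects on its own schedule. The function name is made up for illustration.

```python
import math


def poll_sees_terminal_state(finished_at, gc_threshold, poll_interval, first_poll=0.0):
    """Return True if a poll lands while the finished job's record still exists.

    All times are in seconds. Polls happen at first_poll + k * poll_interval.
    Assumption: the record is visible in [finished_at, finished_at + gc_threshold).
    """
    # Index of the first poll at or after the moment the job finishes.
    k = max(0, math.ceil((finished_at - first_poll) / poll_interval))
    next_poll = first_poll + k * poll_interval
    return next_poll < finished_at + gc_threshold


# A job that finishes just after a poll is gone before the next one (5s threshold)...
print(poll_sees_terminal_state(finished_at=31, gc_threshold=5, poll_interval=30))
# ...but is still there with the old 4h threshold.
print(poll_sees_terminal_state(finished_at=31, gc_threshold=4 * 3600, poll_interval=30))

# Count how many whole-second finish times in one poll window would be observed.
seen = sum(poll_sees_terminal_state(t, 5, 30) for t in range(30))
print(f"{seen} of 30 finish times observed with a 5s threshold")
```

Under those assumptions only about one finish time in six would ever be observed by polling, which is why the 5s setting looked so suspicious to us before we found the picard containers.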

We then looked at the nomad job status and noticed that, of our four nomad nodes, circleci could see the job status from only one. The only difference we saw was that the working node had a “picard” container that was 2 months old. picard-output-processor then seemed to be sent gRPC messages about the status, so job_gc_threshold was a red herring, and our hypothesis is that this picard container is responsible for sending circleci the job status.

“Aha” we thought - that container is missing from the other nomad nodes - but what starts it? And then we noticed that actually, when a job runs, there are several other picard containers starting (for the lifetime of those jobs).

So… THE QUESTION… how does job status get back to circleci from nomad? Assuming it’s these picard containers - how are they started? Any idea why they aren’t starting on the other boxes? We couldn’t see errors in the logs.

There’s also clearly another bug here - circleci thinks these are queued jobs and yet they’ve been run by nomad. That very low job_gc_threshold makes troubleshooting hard (any reason not to bump it to say 5m?) That would also give you (another) way to get the job status - and certainly to determine that the job isn’t queued - but that it’s “lost” for want of a better word. The network traffic is permitted (port 8585) and is identically configured for all nomad nodes.
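For anyone wanting to double-check the network side from each nomad node, this is a minimal sketch of a TCP reachability test. The only detail taken from our setup is port 8585; the helper name and whichever host you pass in are placeholders, and a successful connect only shows the port is open, not that status messages are actually flowing.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Run on each nomad node as something like can_connect("your-services-host", 8585), where the hostname is whatever your nomad clients are configured to talk to. If the result differs between nodes, the network config is not as identical as it looks.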

Thanks for any light anyone can shed.

PS: was good, it’d be even better if it had a full example (e.g. including picard-output-processor) (and add the outcome to the discussion above!)


PS: says:
“Old CircleCI, use for new images”

However, the link 404s (which, IIRC, is also what github unhelpfully returns for “unauthorised”)