A more scalable, container-friendly self-hosted runner: Container Agent - now in Open Preview

We’re excited to announce that Container Agent (final name TBD), a more scalable and container-friendly self-hosted runner, is now in Open Preview: Container runner open preview - CircleCI

With container agent, self-hosted runner users will have:

  • The ability to easily define, publish, and use custom Docker images during job execution
  • The ability to easily manage dependencies or libraries through custom Docker images by using the Docker executor in config.yml
  • Seamless orchestration of ephemeral Kubernetes pods for every Docker job on self-hosted compute
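As a minimal sketch of what this looks like in practice (the image and the `namespace/resource-class` pair below are placeholders; substitute your own), a job targets container agent simply by combining the Docker executor with a self-hosted resource class in `.circleci/config.yml`:

```yaml
version: 2.1

jobs:
  build:
    docker:
      - image: cimg/base:2022.09   # any convenience or custom Docker image
    # Placeholder: use the namespace/name of your own container runner resource class
    resource_class: my-org/my-container-runner
    steps:
      - checkout
      - run: echo "This step runs in an ephemeral Kubernetes pod"

workflows:
  main:
    jobs:
      - build
```

Each job run with this configuration is scheduled as its own ephemeral pod in your cluster and torn down when the job completes.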

If you need to run CI/CD jobs on your own infrastructure with Kubernetes, or are using the existing self-hosted runner installation on Kubernetes, visit the container agent docs today to get started.

Container Agent does not replace the existing self-hosted runner; it complements it. The existing self-hosted runner is meant for customers who need the Machine executor, while Container Agent is the equivalent of the Docker executor for self-hosted runners.


New additions in the past week, mainly improvements for scenarios that deviate from the happy path:

  • Previously, when a job using container agent failed, the workflow did not always fail gracefully as well. This has now been fixed
  • When the underlying node for a task pod is removed from the cluster (whether by kubectl delete node, an unexpected shutdown, or a variety of other reasons), the container-agent garbage-collection loop now detects that the node is no longer available and cleans up the pod
  • Because container agent allows you to configure task pods with the full range of Kubernetes settings, pods can be configured in a way that cannot be scheduled due to their constraints. We’ve added a constraint checker which periodically validates each resource class configuration against the current state of the cluster to ensure its pods can be scheduled. This prevents container agent from claiming jobs it cannot schedule, which would then fail
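To make the constraint-checker scenario concrete, here is an illustrative resource-class entry in the chart's `values.yaml` whose task pods could never be scheduled (the names and exact schema here are assumptions for illustration; consult the container agent docs for the current format):

```yaml
agent:
  resourceClasses:
    my-org/gpu-runner:
      token: "<runner resource class token>"
      spec:
        # If no node in the cluster carries this label, task pods for this
        # resource class can never be scheduled; the constraint checker
        # validates this against the live cluster state before claiming jobs.
        nodeSelector:
          node.kubernetes.io/instance-type: gpu-xlarge
        containers:
          - resources:
              requests:
                cpu: "64"   # requesting more CPU than any node offers is also unschedulable
```

Rather than claiming jobs for this resource class and letting them fail, container agent now surfaces the mismatch up front.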

I like the new Container runner!

I am having issues installing the helm chart into multiple namespaces with different resource class names: the ClusterRole and ClusterRoleBinding resources conflict.

Do you have any suggestions?

Hi @yuft, thank you for the feedback! At the moment, one of the limitations is that each container-agent can only be deployed to a single namespace.

We’re looking at how we can change that in the future and will update when we have more!

Will the console interface gain any features to allow more control over a runner? Currently, it is possible to create a resource class and runner in the GUI, but there is no way to delete them.

There is also a lack of reporting in terms of runner usage, but that is a longer-term issue for when more people are using runners.

@rit1010 Yup, management of resource classes & resource class tokens via the UI is something on the near-term roadmap. We hope to have something out in the next ~3 months.

Showing Runner usage is also on the roadmap, but further down the line.

Is there a way to allow us to intercept the ephemeral task pod creation process?
In my case, I’d like to append a label to the ephemeral task pod so that it can claim a Managed Service Identity (MSI) during deployment to Azure.

@yuft Right now the only customization to task pods is through the resource class configuration process. I’m not as familiar with how one goes about appending that label; is that something that can be added to the pod spec?
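If the resource class configuration accepts pod metadata, something like the following might work. This is a sketch under two assumptions: that the chart's task-pod spec supports a `metadata.labels` field (unverified here), and that `aadpodidbinding` is the label AAD Pod Identity uses to match pods to a managed identity:

```yaml
agent:
  resourceClasses:
    my-org/azure-runner:
      token: "<runner resource class token>"
      spec:
        metadata:
          labels:
            # AAD Pod Identity selects pods by this label to grant them the MSI
            aadpodidbinding: my-managed-identity
```

If the spec does not expose pod metadata, a cluster-side admission webhook that injects the label is another option.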

Hey there,

We are testing this out and are getting the following panic regularly after task runs:

15:02:05 20bb0 6453.340ms service-work mode=agent result=success service.name=container-agent service_name=container-agent
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x1712d00]

goroutine 66 [running]:
github.com/circleci/container-agent/driver/k8s.(*k8sTask).Cleanup(0xc0007b6480, {0x1f28430, 0xc00089a210})
        /home/circleci/project/driver/k8s/task.go:409 +0x400
github.com/circleci/container-agent/service.cleanup({0x1f28430?, 0xc00089a210?}, {0x1f28628?, 0xc0007b6480?})
        /home/circleci/project/service/task_worker.go:257 +0x3b
github.com/circleci/container-agent/service.(*taskWorker).runTask(0xc000819790, {0x1f28430, 0xc00089a210}, 0x7f77a94f8548?, {0xc0006f92a8, 0x8}, 0xc0003763f0?, {0x1f28628?, 0xc0007b6480})
        /home/circleci/project/service/task_worker.go:147 +0x374
github.com/circleci/container-agent/service.(*taskWorker).serviceWork(0xc000819790, {0x1f28388?, 0xc00004b380?}, {0xc0003fa740?, {0xc0003763f0?, 0x1?}})
        /home/circleci/project/service/task_worker.go:125 +0x6df
github.com/circleci/container-agent/service.(*taskWorker).work(0xc000819790, {0x1f28388, 0xc00004b380})
        /home/circleci/project/service/task_worker.go:74 +0x5e
github.com/circleci/container-agent/service.Add.func2({0x1f28388?, 0xc00004b380?})
        /home/circleci/project/service/task_worker.go:62 +0x30
        /home/circleci/go/pkg/mod/github.com/circleci/ex@v1.0.3650-a1109cf/system/system.go:63 +0x25
        /home/circleci/go/pkg/mod/golang.org/x/sync@v0.0.0-20220819030929-7fc1605a5dde/errgroup/errgroup.go:75 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
        /home/circleci/go/pkg/mod/golang.org/x/sync@v0.0.0-20220819030929-7fc1605a5dde/errgroup/errgroup.go:72 +0xa5

The task is still cleaned up but the container agent restarts after this, has anyone else reported this issue? Apologies if this is the wrong spot to toss this.

We are running the container-agent on GKE.

This is the right spot! Taking a look with the internal engineering team, I’ll report back.

@uplight-james Can you share the version of container agent you’re using? It should be visible in the “task lifecycle” step on the Job Details page for a job that was run. Or, if you go to your inventory screen (“Self-hosted Runners” in the left-hand nav of your UI), it should show the version as well.

@sebastian-lerner thanks for the reply! We are using circleci/container-agent:1.0.8569-ccd6594. Let me know what else I can provide.

We see this directly after a task finishes, with garbage collection on or off. The container spun up for the task DOES get removed from the cluster properly but this error still occurs.

It results in the container-agent exiting with code 2 (according to kubectl describe pod) and restarting; the container-agent does come back up and start working after that.

Thanks, can you try doing a helm update & upgrade to get the latest chart version and let me know if you’re still seeing this issue? CircleCI’s self-hosted runner FAQs - CircleCI
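For reference, upgrading the chart typically looks something like this (the release name, namespace, and repo alias below are placeholders; adjust them to match your install):

```shell
# Refresh the local chart repository index, then upgrade the release in place,
# keeping the values supplied at install time.
helm repo update
helm upgrade container-agent container-agent/container-agent \
  --namespace circleci \
  --reuse-values
```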

Folks, a couple of updates to share in the recent helm chart upgrades:

  • container-agent can now be run on ARM pods, both for the pod that installs container-agent and for the “task pods”. No need to specify this in values.yaml; there’s logic built in to pick up the right architecture and work accordingly
  • We now fall back to a generic shell if bash is not included in the image provided. @jpi I think this should fix the issue you were seeing in this thread.

If you upgrade to the latest helm chart these should be available.

Also coming very soon, some logging improvements to the errors that we output to be more actionable.


We just pushed a fix in the latest version of the helm chart that resolves issues some users were seeing in this thread. It was preventing some images that worked just fine on CCI-hosted compute from being used with container-agent. This limitation should no longer exist. Reach out to me if there are still issues you’re seeing.
