Container Agent fails to start Task Agent - exit code 139 segmentation fault

sam-kemp · August 30, 2022, 10:30am

I’m trying to use the new Container Agent setup on a microk8s cluster (k8s v1.22). I appreciate that microk8s itself might be the cause of my issue, as the agent likely hasn’t been tested on it. I can deploy everything successfully with Helm, and the container-agent connects to CircleCI and polls for jobs. However, when I attempt to run any job, the following error appears in the CircleCI UI in a step called ‘Instance Failure’:

could not run task: launch circleci-agent on "container-0" failed: command terminated with exit code 139

So it seems there’s a segmentation fault occurring.

The same error is reflected in the container-agent pod logs:
service-work error=1 error occurred:mode=agent result=error service.name=container-agent service_name=container-agent.

Other notes:

If I try to set any resource requests or limits for the task containers (as in the example in the docs), no task-agent containers are ever created; the jobs just sit there unprocessed and time out after 10 minutes.
Container agent version is: 1.0.7556-dfb352b
I can see warnings in the container-agent pod logs: httpclient: container-agent /api/v2/runner/claim app.loop_name=claim ................ warning=no content but I’m unsure what this means.
If task-agent pods do get spawned they get into a running state in k8s but just sit there doing nothing with no log output whatsoever.
I also get the WARN: Missing API Key message on the first line of the logs even though the agent seems to be communicating with CircleCI successfully.

Thanks

sam-kemp · August 30, 2022, 11:55am

So it seems this segmentation fault happens only when I use certain Docker images, such as any alpine-based image or the ‘hashicorp/terraform’ images. No problems occur when I use the CircleCI convenience images or something like ‘python:latest’. I’ve used the alpine-based images without problems on the CircleCI cloud-based runners, so I’m not sure why they’re failing here; the alpine images also work when I deploy them manually to the cluster. I appreciate this is in beta, so I’m fine using the CircleCI convenience images for now.

The issue of not being able to declare the resource requests and limits is still present though. Thanks

sebastian-lerner · September 12, 2022, 1:47pm

Sorry I missed this post. Exit code 139 is often seen when an image uses an invalid entrypoint. See the fifth bullet here: https://circleci.com/docs/container-runner#limitations. You can work around this by explicitly adding an entrypoint in your CircleCI config when using that image.

For the task pod configuration, are you including the top-level “agent” key in your values.yml file? This was an issue with our docs that has now been fixed. Can you share a build link and also the values.yml snippet where you’re setting up that resource class specification? You can share via support ticket or via direct message if you prefer not to post it here.
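For readers following along, here is a minimal sketch of what a values.yml with the top-level agent key and per-resource-class requests and limits might look like. The resource class name, the token placeholder, and the CPU/memory figures are all made up for illustration, and the key layout reflects my understanding of the container-agent Helm chart, so please check it against the chart's own values file before relying on it.

```yaml
# values.yml (sketch only; names and numbers are placeholders, not taken from this thread)
agent:                                   # top-level "agent" key discussed above
  resourceClasses:
    my-namespace/my-resource-class:      # hypothetical resource class name
      token: "REPLACE_WITH_TOKEN"        # placeholder; never post a real token
      spec:                              # assumed to be applied to the task pod
        containers:
          - resources:
              requests:                  # what the scheduler reserves for the task container
                cpu: 250m
                memory: 256Mi
              limits:                    # hard ceiling for the task container
                cpu: 500m
                memory: 512Mi
```

If the resourceClasses block sits at the root of the file instead of under agent, the symptom described earlier in the thread (jobs never being picked up) would be a plausible result, though that is an inference rather than something confirmed here.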

jensrotne · October 14, 2022, 9:07pm

@sebastian-lerner looks like the documentation has changed. Can you point us in a direction on getting the hashicorp/terraform image to work with the container runners?

I too get the command terminated with exit code 139 and segmentation fault in the logs.

sebastian-lerner · October 14, 2022, 9:27pm

@jensrotne I think that image is falling into the same problem that this image ran into: Setup_remote_docker on container runner - #9 by sebastian-lerner

There might not be an explicit entrypoint set; if so, it might be possible to set it via the CircleCI config file. More details on this page: Troubleshoot self-hosted runner - CircleCI.

sebastian-lerner · October 14, 2022, 9:29pm

Actually when I look at that image I think the entry point is set…

What version of container-agent are you using?

sebastian-lerner · October 14, 2022, 9:32pm

And can you send a job link please? We saw this with another user a while ago for that image, and I thought we had changed how we handle that case.

kelvintaywl · October 15, 2022, 6:23am

hi @jensrotne

I was able to run a CircleCI job using the hashicorp/terraform image on a Container Runner instance by setting the entrypoint explicitly:

jobs:
  tf-containter-runner:
    # this is my Container Runner; replace accordingly
    resource_class: kelvintaywl/cntr-agent-orange
    docker:
    - image: hashicorp/terraform
      # NOTE: please set the entrypoint as seen below
      entrypoint: /bin/sh
    steps:
    - checkout
    - run:
        command: |
          terraform version

Here is the build running successfully:
https://app.circleci.com/pipelines/github/kelvintaywl/tf/10/workflows/10c193e1-2d21-4c45-895c-f96cc7b3917d/jobs/10

sebastian-lerner · October 15, 2022, 2:27pm

@kelvintaywl Thank you

jensrotne · October 15, 2022, 6:57pm

@kelvintaywl thanks, but unfortunately that did not work.

@sebastian-lerner I use the newest chart version 100.0.0.

I have tested a couple of different images and compute configurations (EKS with EC2 and Fargate), and it looks like it is related to the terraform image specifically. It works with cimg/base.

Job link

sebastian-lerner · October 17, 2022, 1:44pm

@jensrotne we tried copying and pasting your config with the entrypoint override, and it seems to run just fine on our internal dogfooding container agent, which is strange. Could you share your values.yml file, with the resource class token(s) redacted, via direct message or via support.circleci.com?

jensrotne · October 17, 2022, 6:54pm

@sebastian-lerner I have created a ticket. Thanks for your help!

sebastian-lerner · October 18, 2022, 4:17pm

Thanks for submitting that ticket @jensrotne . Expect a response from the support team in there.

For others on this thread, we think we’ve isolated this to an issue with alpine-based images when used with microk8s. We’re trying to figure out a path forward, but for now the recommended workaround is to build your own image on top of CircleCI’s base image, adding the parts of the alpine-based image that you need. If you build that image and store it, you should be able to use it in a job by calling it with the image: key.
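As a rough sketch of that last step, a job could reference such a custom image as shown below. The job name, resource class, and image name are hypothetical placeholders; the image is assumed to have been built on cimg/base with Terraform added and pushed to a registry the cluster can pull from.

```yaml
jobs:
  terraform-on-custom-image:                            # hypothetical job name
    resource_class: my-namespace/my-resource-class      # placeholder; use your own Container Runner
    docker:
      # hypothetical image: cimg/base plus the Terraform binary, built and pushed by you
      - image: my-registry/terraform-on-cimg-base:1.0
    steps:
      - checkout
      - run:
          command: terraform version
```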

This is in the process of being added to our known limitations for container runner until we find a long-term solution.

ismirnov · October 29, 2022, 1:08am

I can confirm we are facing the same issue. Running alpine images on k3s v1.25.3.

Our docker image has an entrypoint set to /bin/bash, based on golang:1.18.0-alpine.
