Intermittent Configuration Issues with gcloud/Kubernetes Cluster


#1

I am having issues configuring my container to point to my Kubernetes cluster with the command gcloud container clusters get-credentials. When I enable SSH and run it repeatedly, it eventually runs successfully; however, most of the time I get this error:

ERROR: gcloud crashed (CannotConnectToMetadataServerException): <urlopen error [Errno -2] Name does not resolve>

If you would like to report this issue, please run the following command:
  gcloud feedback

To check gcloud for common problems, please run the following command:
  gcloud info --run-diagnostics

I am not seeing a pattern, so I am wondering if there is something in the resolv.conf on the machines that is preventing it from succeeding. I am a bit stumped at the moment, so I am posting this in case there is something internal to CircleCI that might explain it.

When I run this container locally, everything works fine, so I am not entirely sure what is going on. I have isolated that the failure is intermittent only in CircleCI, which suggests it is more of a networking issue on the CircleCI side.

Appreciate any help you guys might provide on this matter. Thanks!!


#2

I don’t know Kube, but it may be worth ruling out a basic networking problem, rather than a metadata fetch issue. Could you put a ping in a prior run step, so you/we can see if you are getting any packet loss before a crash? Something like this will send five pings and report if you’re getting any drops:

ping -c 5 server.example.com

Then, the next time you get the crash, you can see if there was a problem with server reachability. If there was not, you’ll have to see if gcloud has a verbose/debug switch.
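
A minimal sketch of such a pre-flight step, assuming a POSIX shell in the build container. The function name preflight_ping is made up for illustration, and server.example.com is still the placeholder host from above. The helper only reports the result and never fails the build, so the later gcloud step still runs and both outcomes end up in the job log.

```shell
# Hypothetical helper: ping a host a few times and report the outcome
# without aborting the build, so the result is visible in the job log.
preflight_ping() {
  host="$1"
  count="${2:-5}"   # number of pings, defaults to 5
  if ping -c "$count" "$host"; then
    echo "preflight: $host reachable"
  else
    # In the else branch, $? still holds the exit status of ping.
    echo "preflight: $host NOT reachable (ping exit code $?)"
  fi
  return 0
}

# Example: run this in a step before the gcloud step.
# preflight_ping server.example.com 5
```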


#3

You could also put the suggested diagnostics command in a prior run step. Unless it takes a long time to run, could you keep it in there permanently?


#4

I am not exactly sure which host gcloud connects to when I run that configuration step, so I am not sure I can do that yet. I have run the diagnostics, and they don’t produce anything useful other than reporting that there are no network issues. The command still fails afterwards, though.
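
One way to find out which hosts gcloud contacts during that step might be to raise its log level. The sketch below assumes the global gcloud flags --verbosity=debug and --log-http are available in the installed SDK version; the wrapper name get_credentials_debug and the default log path are made up for illustration.

```shell
# Hypothetical wrapper: run get-credentials with debug logging and keep a copy
# of the output so it can be inspected after a failed build.
get_credentials_debug() {
  cluster="$1"; project="$2"; zone="$3"
  log_file="${4:-/tmp/gcloud-debug.log}"   # assumed location, adjust as needed
  gcloud container clusters get-credentials "$cluster" \
    --project "$project" --zone "$zone" \
    --verbosity=debug --log-http >"$log_file" 2>&1
  status=$?
  cat "$log_file"        # echo the captured output into the job log
  return "$status"       # preserve gcloud's exit code
}

# Example:
# get_credentials_debug <cluster> <project> <zone>
```

The debug output can be long and may contain project details, so it is worth reading through before pasting it into a public thread.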


#5

Have you run it on the server itself, e.g. in a post-fail SSH session?


#6

Yep, that is what I have been doing to test this fully. I get into the SSH session, which is in the Docker container, run the diagnostics, and then keep running the command. It will fail something like 30 times, and then eventually it will pass. I am thinking it is network related, but can’t say for sure. When I run this container locally I can run this just fine, which leads me to think network, but again, it could certainly be on my side. Thanks for the help!


#7

Sorry, to clarify - what is failing? The diagnostics or the command?

In either case, would you put the console output of the diagnostics here?


#8

Sorry, I wasn’t clear there: the command is failing and the diagnostics are passing. Here are the results:

bash-4.4# gcloud container clusters get-credentials ci --project icx-ci --zone us-central1-b
ERROR: gcloud crashed (CannotConnectToMetadataServerException): <urlopen error [Errno -2] Name does not resolve>

If you would like to report this issue, please run the following command:
  gcloud feedback

To check gcloud for common problems, please run the following command:
  gcloud info --run-diagnostics
bash-4.4# gcloud info --run-diagnostics
Network diagnostic detects and fixes local network connection issues.
Checking network connection...done.
Reachability Check passed.
Network diagnostic (1/1 checks) passed.

bash-4.4# gcloud container clusters get-credentials ci --project icx-ci --zone us-central1-b
ERROR: gcloud crashed (CannotConnectToMetadataServerException): <urlopen error [Errno -2] Name does not resolve>

If you would like to report this issue, please run the following command:
  gcloud feedback

To check gcloud for common problems, please run the following command:
  gcloud info --run-diagnostics

I basically keep running the get-credentials command over and over again and it fails. Then at some point it will pass, just running the same command. For example, here is a small sample of me re-running the command:

To check gcloud for common problems, please run the following command:
  gcloud info --run-diagnostics
bash-4.4# gcloud container clusters get-credentials ci --project icx-ci --zone us-central1-b
ERROR: gcloud crashed (CannotConnectToMetadataServerException): <urlopen error [Errno -2] Name does not resolve>

If you would like to report this issue, please run the following command:
  gcloud feedback

To check gcloud for common problems, please run the following command:
  gcloud info --run-diagnostics
bash-4.4# gcloud container clusters get-credentials ci --project icx-ci --zone us-central1-b
ERROR: gcloud crashed (CannotConnectToMetadataServerException): <urlopen error [Errno -2] Name does not resolve>

If you would like to report this issue, please run the following command:
  gcloud feedback

To check gcloud for common problems, please run the following command:
  gcloud info --run-diagnostics
bash-4.4# gcloud container clusters get-credentials ci --project icx-ci --zone us-central1-b
Fetching cluster endpoint and auth data.
kubeconfig entry generated for ci.

Something I noticed, which is even weirder, is that once it passes in the SSH build, I can re-run the build normally and the command just starts working. It is as if once it passes for that specific image, it is good going forward. However, if I rebuild the image it fails again, so maybe something is going on when the container is built in CircleCI as well. I am a bit stumped, so I am just grasping at things at this point.
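
As a stopgap while the root cause is unknown, the manual re-running shown above could be scripted. This is only a sketch of a generic retry loop, not a fix: the helper name retry and the attempt and delay values are assumptions, and it presumes a POSIX shell in the container.

```shell
# Hypothetical helper: retry a command up to max_attempts times,
# sleeping delay_seconds between failed attempts.
retry() {
  max_attempts="$1"; delay_seconds="$2"; shift 2
  attempt=1
  while true; do
    if "$@"; then
      echo "succeeded on attempt $attempt" >&2
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay_seconds"
  done
}

# Example, using the command from this thread:
# retry 30 2 gcloud container clusters get-credentials ci --project icx-ci --zone us-central1-b
```

The attempt count printed to stderr also gives a rough measure of how flaky each build was, which may be useful when comparing images.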


#9

So I changed the way I build and push the image (outside of CircleCI) and it works. Essentially, I build the image in my CI kube cluster rather than using the local CircleCI resource, and now it works. I am thinking there is something going on with the resolv.conf in the image. At this point I can move on, so I think I am good here, but it is definitely an oddity in CircleCI.


#10

It is weird: I am still seeing intermittent issues periodically. I just rebuild and it works. Super weird. I have a ticket in with CircleCI now to see if I can figure this out.


#11

I have yet to figure this out. I thought I had a bead on it, but I still can’t work it out. CircleCI support hasn’t been super helpful yet, so I am posting some details here in the hope that the community can help me out!

I have two private GCR registries, each in a separate project, that I pull a container from. One is my production registry, and this project runs all of my other applications.

For example, all of my projects use this image as the container in which they run their tests, and this spawns other things.

gcr.io/icxmedia-servers/belushi-integration:latest

This image always configures properly. However, if I am working in belushi and want to create a new version, I build a new image and then use it in the next step of the workflow to test that belushi is working properly. So, for example, I end up with the following:

gcr.io/icx-ci/belushi-integration:1234567

When I use that image in the next step, whether it configures with gcloud properly becomes intermittent.

I am wondering if there is something that Circle is doing with domains or something. I have messed with the resolv.conf in the container, but that doesn’t appear to do anything to help the problem.
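
Since the error text is "Name does not resolve", logging the resolver configuration and a few lookups right before the gcloud step might help compare the working and failing images. A rough sketch, assuming getent is available in the image; the helper name dns_report is made up, and the hostnames in the example are guesses at what gcloud may contact (metadata.google.internal is the conventional name of the Google Cloud metadata server and would normally not resolve outside Google Cloud).

```shell
# Hypothetical helper: print the resolver configuration, then report whether
# each given hostname resolves. It never fails, so it is safe as a
# diagnostic-only step.
dns_report() {
  printf '%s\n' "--- /etc/resolv.conf ---"
  cat /etc/resolv.conf 2>/dev/null || echo "(unreadable)"
  for host in "$@"; do
    if getent hosts "$host" >/dev/null 2>&1; then
      echo "resolves: $host"
    else
      echo "DOES NOT RESOLVE: $host"
    fi
  done
  return 0
}

# Example (hostnames are assumptions, not taken from gcloud's output):
# dns_report metadata.google.internal container.googleapis.com
```

Running the same report in both the gcr.io/icxmedia-servers image and the freshly built gcr.io/icx-ci image would show whether the two actually differ in resolver behaviour.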

Thanks in advance!

