Building with cache is failing, without cache passes

I have noticed that the builds of these branches failed on CircleCI:

But this one passes with caching disabled:

The whole purpose of the build with caching enabled is to get faster builds. So it’s counter-productive if I have to disable caching to get a green build.

This always worked until recently; you can see the logs of these jobs.

So, you’ll need to dig into why it is failing. The failure appears to be related to this:

Compiling with javac-daemon(JDK 1.8) failed
Compiling org.graalvm.compiler.hotspot.amd64 with javac-daemon(JDK 1.8) failed
Compiling org.graalvm.compiler.hotspot.test with javac-daemon(JDK 1.8) failed
Compiling org.graalvm.compiler.truffle.compiler.hotspot.libgraal with javac-daemon(JDK 1.8) failed
Compiling org.graalvm.compiler.lir.jtt with javac-daemon(JDK 1.8) failed
5 build tasks failed

I don’t do Java, so cannot advise directly, but you’re probably the best person to dig into this. Could you find some logs or a debug switch to find out what “failed” means in this case?

I don’t doubt you’re right that the cache is causing this, but from a debugging perspective that is not sufficient: you need to find out why. Perhaps you could calculate a hash for every item in the cache to see if any are stale, or perhaps there are file permission issues?
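Taking the hashing suggestion literally, here is a minimal sketch, assuming a Linux build image with GNU coreutils; `CACHE_DIR` is a placeholder for whatever directory your cache steps save and restore:

```shell
# Fingerprint every file under the cache directory, then reduce to a single
# hash. Compare the manifest (or the final hash) between a passing and a
# failing build to spot stale or changed entries.
CACHE_DIR="${CACHE_DIR:-$HOME/.cache/build}"   # placeholder path
find "$CACHE_DIR" -type f -print0 2>/dev/null \
  | sort -z \
  | xargs -0 -r sha256sum \
  | tee /tmp/cache-manifest.txt \
  | sha256sum | cut -d' ' -f1
```

Diffing `/tmp/cache-manifest.txt` from two runs would show exactly which files differ.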


Thanks for your reply. I will get back to you soon with details; I’m not sure if there are any flags, but hopefully digging into examples or docs will reveal more.

As you may have seen from my past posts, I have been reporting caching issues now and then; I agree it’s one of the hard problems in CS. But let’s work on it and see how we can fix it.

Hi Mani. One thing you can try is narrowing your cache keys. If you delete

            - os-deps
            - make-382

You will only get a cache from the right SHA or branch, instead of possibly getting a cache from another branch that could be completely different from what you are using.
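For reference, a sketch of what the narrowed `restore_cache` step might look like. The key names here are illustrative, not taken from your actual config:

```yaml
- restore_cache:
    keys:
      # exact-match key only: branch plus a checksum of a build input
      - v1-deps-{{ .Branch }}-{{ checksum "suite.py" }}
      # broad fallback keys (e.g. "- os-deps", "- make-382") removed, so a
      # cache saved by an unrelated branch is never restored
```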

Thanks, I will take a look at that; it could help. I’ll report back after giving it a whirl.

I tried building the different branches by removing the above definitions from the configuration file and the builds still failed.

In fact I had green builds on 2 out of 3 branches before the above changes, so it looks hit-or-miss.

Something else must be causing this; I’ll rebuild the runs without cache and then build again with it afterwards.

From past experience, rebuilding without cache creates a fresh cache and thereafter the builds go fine - is this expected?

Just finished the builds, and none of the rebuilds passed. I don’t get these failures on my local machine when building in a clean environment. I’ll try a verbose build and see if that helps.

I have added verbose logging to the build tasks now; in the job logs, please look for the lines around [Compiler daemon process appears to have crashed] CalledProcessError: ....

You will see that a number of these daemons are crashing - 5 or 6 of them, one after another - unless the OS is terminating those processes or doing something to the instance’s I/O device, as the PIDs are not responding to a read (a readline call returns nothing).
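For what it’s worth, a quick way to distinguish “daemon crashed” from “daemon alive but pipe blocked” is to probe the PID directly. A sketch, assuming you can capture the daemon PIDs from the build output:

```shell
# kill -0 sends no signal; it only checks whether the PID exists and is
# signalable. EOF on readline plus a dead PID means the daemon exited;
# a live PID that produces no output points at blocked I/O instead.
is_alive() { kill -0 "$1" 2>/dev/null; }

check_daemon() {
  if is_alive "$1"; then
    echo "daemon $1: still running (suspect blocked I/O)"
  else
    echo "daemon $1: gone (crashed or killed - check 'dmesg | grep -i oom')"
  fi
}
```

Usage: `check_daemon 12345` for each daemon PID seen in the verbose log.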

It would be great to be able to trap these errors and find out how and why they are happening - any tips on tracking down what could be causing them to fail? In the meantime I will rebuild using SSH and try some debug stunts!

The SSH session of the rebuild failed, see:

I couldn’t SSH into the box; here’s the response:

$ ssh -vvv -p 64535

OpenSSH_7.8p1, LibreSSL 2.6.2
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 48: Applying options for *
debug1: /etc/ssh/ssh_config line 52: Applying options for *
debug2: resolve_canonicalize: hostname is address
debug2: ssh_connect_direct
debug1: Connecting to [] port 64535.
debug1: connect to address port 64535: Connection refused
ssh: connect to host port 64535: Connection refused

Does the rest of the verbose output indicate that your private key authentication to the box has been accepted?

That’s all the output I have from SSH.

From the CircleCI job it appears the SSH session was terminated even though I hadn’t logged in, and it had been waiting for me to SSH in for under 5 minutes.

I have always been able to SSH to CircleCI jobs.

It’s possible that job terminated, since I have been able to log into a new one:

➜  linux_macos git:(master) ssh -p 64537
The authenticity of host '[]:64537 ([]:64537)' can't be established.
RSA key fingerprint is [snipped]
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '[]:64537' (RSA) to the list of known hosts.

Perhaps you have some sort of problem connecting to this specific port (64535).

I have added verbose logging to the build tasks now; in the job logs, please look for the lines around [Compiler daemon process appears to have crashed] CalledProcessError: ....

Do you know how I could get to the bottom of this one? I don’t get this when I run the same Docker container on my local machine, or even when I do the same build in a native environment.

I’m thinking the OS is terminating the running processes or doing something to the I/O device of the instance.

There could be a conflict between doing some low-level stuff and Docker. Containers have a few limitations and it may be that you are bumping into one of them. That said, you may be bursting through your RAM limits too - by default you only have 4G for the whole container.

I would start off by changing to a Machine executor - this is a traditional VM and is much more like bare-metal, and for low-level tasks it may work more reliably. It has been free for 18 months, and may well be free for the long term (though there is a note to say this could change). It comes with 8G of RAM as standard, which is twice as much as you have now. You could try getting it working there, and then bring it back to Docker as a medium-term goal.
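If memory is the suspect, it may be worth printing what the container actually sees before switching executors. A sketch, assuming a Linux image with cgroup v1 mounted (which is an assumption about the executor, not something confirmed in this thread):

```shell
# Print the memory ceiling the container sees, and any OOM-killer traces.
# The path below is cgroup v1; on cgroup v2 the file is /sys/fs/cgroup/memory.max.
LIMIT=/sys/fs/cgroup/memory/memory.limit_in_bytes
if [ -r "$LIMIT" ]; then
  echo "container memory limit: $(cat "$LIMIT") bytes"
else
  echo "no cgroup v1 memory limit file found"
fi
dmesg 2>/dev/null | grep -iE 'oom|killed process' \
  || echo "no OOM messages visible (dmesg may be restricted in the container)"
```

A daemon killed by the OOM killer dies silently from the build tool’s point of view, which would match the “failed” with no further detail.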

Moving to bare metal defeats the purpose of the current setup, with or without cache; it would help if there were more debug tools available to diagnose such issues.

Are there plans to add system metrics, like Heroku provides (or from Docker, but in the current dashboard), so we can see the rise and fall of resources, and whether a job is running out of file handles or other resources? These resource limits can be a reason for programs terminating silently as well.

I also have no issues when I run the build scripts inside the same Docker container (image on Docker Hub) on my local machine (a MacBook), as opposed to running on CircleCI’s instances - it could be memory constraints. We will know once we move to a bigger resource on CircleCI.

I took the opportunity to start a different branch using a machine executor, and it seems to have worked fine, so the theory that the Docker container on a smaller instance is hitting memory limits could be true.

Do you suggest I now try running the build with Docker inside this instance and see the outcome? Is this machine spec good enough for a Docker build to finish?

To get to the bottom of the issue of failing builds inside the docker container, I did the following:

  • installed strace
  • invoked the build scripts with strace

strace does not get executed; in fact I don’t see it in the build logs at all, even though I echo every line of the script that I execute. Is there any restriction on running strace in the container? Do you have a filter that removes such a command before it is executed?

See for logs (you can search for strace and won’t find it in the execution logs at all); see (it contains strace, but it isn’t called).
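In case it helps once strace does run: a sketch of how the build could be wrapped to catch daemon deaths. The build command here is a placeholder, not your actual entry point:

```shell
# Follow forked children (-f) and log only process-lifecycle syscalls, so a
# daemon being killed shows up as an explicit "killed by SIGKILL" line.
run_traced() {
  # $1 = output file, $2 = command string (placeholder for the real build command)
  strace -f -e trace=process -o "$1" -- sh -c "$2"
}
# usage: run_traced /tmp/build.strace "<your build command>"
#        grep -E 'killed by|SIGKILL|exit_group' /tmp/build.strace
```

Note that strace needs ptrace permission inside the container; if the executor blocks ptrace, this will fail with a permission error rather than silently.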

I re-ran the build in SSH mode and ran the build script manually in the console; it worked perfectly fine, without crashing.

Something is not right about the Docker builds.

Any explanation for the above?

OK, so it works without cache, but not with, and works in SSH mode. I think this is a bit beyond my skills - not a Java person, unfortunately :blush:

One for @drazisil or another Circle person perhaps?

Thanks for the reply; I’m not sure it’s specific to Java. It could be container-related, and caching is an industry-wide issue, although much improved by recent developments.

The Docker containers are not recording failed-job logs well enough. I was wondering whether it’s worth installing lightweight strace- or sysdig-like tools that can monitor failing processes and log them to /var/log - at the moment I’m not sure which logs in the container are useful or where to find them.

I did manage to run strace in the end, but the other issues remain. I have paddled my way through some of the issues, so I can now get green builds, but I would like to know why certain things are happening so I can avoid future debugging effort.