The whole purpose of enabling caching in the build is to get faster builds. So it’s counter-productive if I have to disable caching to get a green build.
This always worked until recently; you can see that in the logs of these jobs.
So, you’ll need to dig into why it is failing. The failure appears to be related to this:
Compiling com.oracle.truffle.api.test with javac-daemon(JDK 1.8) failed
Compiling org.graalvm.compiler.hotspot.amd64 with javac-daemon(JDK 1.8) failed
Compiling org.graalvm.compiler.hotspot.test with javac-daemon(JDK 1.8) failed
Compiling org.graalvm.compiler.truffle.compiler.hotspot.libgraal with javac-daemon(JDK 1.8) failed
Compiling org.graalvm.compiler.lir.jtt with javac-daemon(JDK 1.8) failed
5 build tasks failed
I don’t do Java, so cannot advise directly, but you’re probably the best person to dig into this. Could you find some logs or a debug switch to find out what “failed” means in this case?
I don’t doubt you’re right that the cache is causing this, but that is not sufficient from a debugging perspective - you need to find out why. Perhaps you could calculate a hash for all items in the cache to see if they are stale, or perhaps there are some file permission issues?
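For instance, a throwaway step like this, added right after the cache is restored, would make staleness and ownership problems visible. The cached path here is just a placeholder for whichever directories you actually cache:

- run:
    name: Fingerprint restored cache (debugging only)
    command: |
      # Hash every file in the cached directory so two builds can be diffed,
      # and record ownership/permissions in case the cache was saved as another user.
      # /home/circleci/cache is a placeholder - use the paths you actually cache.
      mkdir -p /tmp/cache-debug
      find /home/circleci/cache -type f -exec sha256sum {} + | sort > /tmp/cache-debug/fingerprint.txt
      ls -laR /home/circleci/cache > /tmp/cache-debug/permissions.txt
- store_artifacts:
    path: /tmp/cache-debug

Comparing the fingerprint from a failing cached build with one from a cache-disabled build should show straight away whether anything stale or differently-owned is being restored.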
Thanks for your reply, I will get back to you soon with details. I’m not sure if there are such flags, but hopefully digging into examples or docs will reveal more.
As you may have seen from my past posts, I have been reporting caching issues now and then; I agree it’s one of the hard problems in CS. But let’s work on it and see how we can fix it.
Hi Mani. One thing you can try is narrowing your cache keys. If you delete
- os-deps
and
- make-382
you will only get a cache from the right SHA or branch, instead of possibly picking up a cache from another branch that could be completely different from what you are using.
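If the restore_cache step currently lists broader fallback keys, the trimmed-down version would look roughly like this (the key template is only an illustration; keep whatever naming scheme your specific key already uses):

- restore_cache:
    keys:
      # Only the most specific key is left, so a restored cache always comes
      # from this branch/revision; broad fallbacks like os-deps are gone.
      - deps-{{ .Branch }}-{{ .Revision }}

The trade-off is more cache misses, and therefore slower cold builds, but every hit is then guaranteed to match the code being built.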
Just finished the builds and none of the rebuilds passed. I don’t get these results on my local machine when I rebuild in a clean environment. I’ll try to do a verbose build and see if that helps.
I have added verbose logging to the build tasks now; see https://circleci.com/gh/neomatrix369/awesome-graal/413 and look for the log lines around [Compiler daemon process appears to have crashed] … CalledProcessError:....
You will see that a number of these daemons are crashing, 5 to 6 of them, one after another, unless the OS is terminating those processes or doing something to the I/O device, since the PIDs are not responding to a read (a readline call returns nothing).
It would be great to be able to trap these errors and find out how or why they are happening. Any tips on how to find out what could be causing them to fail? In the meanwhile I will Rebuild using SSH and try some debug stunts!
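For instance, I could add a throwaway step like this (debugging only) to see whether the kernel’s OOM killer is reaping the daemons; dmesg may well be restricted inside the container, in which case it will simply print nothing:

- run:
    name: Check for OOM kills (debugging only)
    when: always
    command: |
      # If the kernel OOM killer is terminating the javac daemons, it normally
      # leaves a trace in the kernel log; '|| true' keeps the step green when
      # nothing matches or dmesg is not readable inside the container.
      dmesg | grep -iE 'killed process|out of memory' || true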
From the CircleCI job it appears the SSH session was terminated even though I hadn’t logged in, and it had been waiting under 5 minutes for me to SSH into it.
I have always been able to SSH to CircleCI jobs.
It’s possible that job terminated, since I have been able to log into a new one:
➜ linux_macos git:(master) ssh -p 64537 54.91.50.175
The authenticity of host '[54.91.50.175]:64537 ([54.91.50.175]:64537)' can't be established.
RSA key fingerprint is [snipped]
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '[54.91.50.175]:64537' (RSA) to the list of known hosts.
I have added verbose logging to the build tasks now; see https://circleci.com/gh/neomatrix369/awesome-graal/413 and look for the log lines around [Compiler daemon process appears to have crashed] … CalledProcessError:....
Do you know how I could get to the bottom of this one? I don’t get this when I run the same docker container on my local machine, or even when I do the same build in a native environment.
I’m thinking the OS is terminating the running processes or doing something to the I/O device of the instance.
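To test that theory I plan to dump the container’s own memory accounting during the build, along these lines (these are cgroup v1 paths, which is what I expect on this image; if they are not mounted, the step will just print errors and carry on):

- run:
    name: Dump cgroup memory stats (debugging only)
    when: always
    command: |
      # On cgroup v1 these show the memory limit imposed on the container,
      # the peak usage the build reached, and how often the limit was hit.
      cat /sys/fs/cgroup/memory/memory.limit_in_bytes || true
      cat /sys/fs/cgroup/memory/memory.max_usage_in_bytes || true
      cat /sys/fs/cgroup/memory/memory.failcnt || true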
There could be a conflict between doing some low-level stuff and Docker. Containers have a few limitations and it may be that you are bumping into one of them. That said, you may be bursting through your RAM limits too - by default you only have 4G for the whole container.
I would start off by changing to a Machine executor - this is a traditional VM and is much more like bare-metal, and for low-level tasks it may work more reliably. It has been free for 18 months, and may well be free for the long term (though there is a note to say this could change). It comes with 8G of RAM as standard, which is twice as much as you have now. You could try getting it working there, and then bring it back to Docker as a medium-term goal.
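The change itself is small; assuming a 2.0 config, the job would swap its docker: section for the machine executor, roughly like this (the build command is a placeholder for your existing steps):

jobs:
  build:
    machine: true        # full VM instead of a container; 8G of RAM as noted above
    steps:
      - checkout
      - run: ./build.sh  # placeholder for the existing build steps

Your existing restore_cache/save_cache steps should carry over unchanged.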
Moving to bare metal defeats the purpose of using the current setup with and without cache; it would help if there were more debug tools available to diagnose such technical issues.
I also have no issues when I run the build scripts inside the same docker container (image on Docker Hub) on my local machine (a MacBook), as opposed to when running on CircleCI’s instances, so it could be memory constraints. We will know once we move to a bigger resource on CircleCI.
I took the opportunity to start a different branch using a machine executor, and it seems to have worked fine (https://circleci.com/gh/neomatrix369/awesome-graal/500), so the theory that the docker container running on a smaller instance is hitting memory limits could be true.
Do you suggest I now try using docker inside this instance to run the build and see what the outcome is? Is this machine spec good enough for a docker build to finish successfully?
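If that is worth trying, I imagine the step would look roughly like this (the image name and script path are placeholders for what I actually use), with an explicit memory cap so I can tell whether available RAM is the real difference:

- run:
    name: Run the build inside the Docker Hub image with a memory cap
    command: |
      # -m puts a hard memory limit on the container; raising or lowering it
      # should show whether the in-container build fails purely on RAM.
      docker run --rm -m 6g \
        -v "$(pwd)":/workspace -w /workspace \
        my-dockerhub-user/graal-build ./build.sh   # placeholder image and script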
To get to the bottom of the issue of failing builds inside the docker container, I did the following:
installed strace
invoked the build scripts with strace
strace does not get executed; in fact I don’t see it in the build logs at all, even though I echo every line of the script that I execute. Is there any restriction on running strace in the container? Do you have any filter that removes such a command before it is executed?
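For reference, this is roughly the step I’m adding (the script path is illustrative). My understanding is that strace relies on ptrace, which a container runtime’s default seccomp/capability settings can block, but in that case I would expect a loud permission error rather than the command silently vanishing:

- run:
    name: Run the build under strace (debugging only)
    command: |
      # -f follows the forked javac daemons, -tt timestamps every call, and
      # -e trace=process limits the log to process-management calls
      # (fork/exec/wait/exit); a daemon being killed should still show up
      # as '+++ killed by SIGKILL +++'. '|| true' keeps the step green so
      # the trace is uploaded even when the build fails.
      strace -f -tt -e trace=process -o /tmp/build-strace.log ./build.sh || true
- store_artifacts:
    path: /tmp/build-strace.log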
Thanks for the reply; I’m not sure if it’s specific to Java. It could be container related, and caching is an industry-wide issue, although resolved with recent developments.
The docker containers are not recording failed job logs well enough. I was wondering whether it is worth installing lightweight tools like strace or sysdig that can be asked to monitor failed processes and log them into /var/log. At the moment I’m not sure which logs in the container are useful or where to find them.
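As a stop-gap I’m considering a step that only runs when the build fails and ships whatever the container does have (the paths are guesses at what will turn out to be useful):

- run:
    name: Collect logs on failure
    when: on_fail
    command: |
      mkdir -p /tmp/failure-logs
      # Grab whatever the container has written, plus kernel messages if readable.
      cp -r /var/log /tmp/failure-logs/var-log 2>/dev/null || true
      dmesg > /tmp/failure-logs/dmesg.txt 2>/dev/null || true
- store_artifacts:
    path: /tmp/failure-logs

I’m assuming the artifact upload still happens after a failed step; if it doesn’t, the same directory can be pulled down over an SSH rebuild.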
I did manage to run strace in the end, but the other issues remain. I have paddled my way through some of the issues, so I can now get green builds, but I would like to know why certain things are happening so I can avoid future debugging efforts.