Building with cache is failing, without cache passes

I have noticed that the builds of these branches failed on CircleCI:

But this one passes with caching disabled:

The whole purpose of the build with caching enabled is to get faster builds. So it’s counter-productive if I have to disable caching to get a green build.

This always worked until recently; you can see the logs of these jobs.

So, you’ll need to dig into why it is failing. The failure appears to be related to this:

Compiling with javac-daemon(JDK 1.8) failed
Compiling org.graalvm.compiler.hotspot.amd64 with javac-daemon(JDK 1.8) failed
Compiling org.graalvm.compiler.hotspot.test with javac-daemon(JDK 1.8) failed
Compiling org.graalvm.compiler.truffle.compiler.hotspot.libgraal with javac-daemon(JDK 1.8) failed
Compiling org.graalvm.compiler.lir.jtt with javac-daemon(JDK 1.8) failed
5 build tasks failed

I don’t do Java, so cannot advise directly, but you’re probably the best person to dig into this. Could you find some logs or a debug switch to find out what “failed” means in this case?

I don’t doubt you’re right that the cache is causing this, but from a debugging perspective that is not sufficient: you need to find out why. Perhaps you could calculate a hash for every item in the cache to see if any are stale, or perhaps there are file permission issues?
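Taking the hashing suggestion literally, here is a minimal sketch, assuming a Linux build image with GNU coreutils; `CACHE_DIR` is a placeholder for whatever directory your cache steps save and restore:

```shell
# Fingerprint every file under the cache directory, then reduce to a single
# hash. Compare the manifest (or the final hash) between a passing and a
# failing build to spot stale or changed entries.
CACHE_DIR="${CACHE_DIR:-$HOME/.cache/build}"   # placeholder path
find "$CACHE_DIR" -type f -print0 2>/dev/null \
  | sort -z \
  | xargs -0 -r sha256sum \
  | tee /tmp/cache-manifest.txt \
  | sha256sum | cut -d' ' -f1
```

Diffing `/tmp/cache-manifest.txt` from two runs would show exactly which files differ.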


Thanks for your reply. I will get back to you soon with details; I’m not sure if there are any flags, but hopefully digging into examples or docs will reveal more.

As you may have seen from my past posts, I have been reporting caching issues now and then; I agree it’s one of the hard problems in CS. But let’s work on it and see how we can fix it.

Hi Mani. One thing you can try is narrowing your cache keys. If you delete

            - os-deps
            - make-382

You will only get a cache from the right SHA or branch, instead of possibly getting a cache from another branch that could be completely different from what you are using.
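For reference, a sketch of what the narrowed `restore_cache` step might look like. The key names here are illustrative, not taken from your actual config:

```yaml
- restore_cache:
    keys:
      # exact-match key only: branch plus a checksum of a build input
      - v1-deps-{{ .Branch }}-{{ checksum "suite.py" }}
      # broad fallback keys (e.g. "- os-deps", "- make-382") removed, so a
      # cache saved by an unrelated branch is never restored
```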

Thanks, I will take a look at that; it could help. I’ll report back after giving it a whirl.

I tried building the different branches by removing the above definitions from the configuration file and the builds still failed.

In fact I had green builds on 2 out of 3 branches before the above changes, so it looks hit-or-miss.

Something else must be causing this; I’ll rebuild the runs without cache and then build again with it afterwards.

From past experience, rebuilding without cache creates a fresh cache and thereafter the builds go fine - is this expected?

Just finished the builds, and none of the rebuilds passed. I don’t get these failures on my local machine when building in a clean environment. I’ll try a verbose build and see if that helps.

I have added verbose logging to the build tasks now; in the job logs, please look for the lines around [Compiler daemon process appears to have crashed] CalledProcessError: ....

You will see that a number of these daemons are crashing - 5 or 6 of them, one after another - unless the OS is terminating those processes or doing something to the instance’s I/O device, as the PIDs are not responding to a read (a readline call returns nothing).
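For what it’s worth, a quick way to distinguish “daemon crashed” from “daemon alive but pipe blocked” is to probe the PID directly. A sketch, assuming you can capture the daemon PIDs from the build output:

```shell
# kill -0 sends no signal; it only checks whether the PID exists and is
# signalable. EOF on readline plus a dead PID means the daemon exited;
# a live PID that produces no output points at blocked I/O instead.
is_alive() { kill -0 "$1" 2>/dev/null; }

check_daemon() {
  if is_alive "$1"; then
    echo "daemon $1: still running (suspect blocked I/O)"
  else
    echo "daemon $1: gone (crashed or killed - check 'dmesg | grep -i oom')"
  fi
}
```

Usage: `check_daemon 12345` for each daemon PID seen in the verbose log.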

It would be great to be able to trap these errors and find out how and why they are happening - any tips on tracking down what could be causing them to fail? In the meantime I will rebuild using SSH and try some debug stunts!

The SSH session of the rebuild failed, see:

I couldn’t SSH into the box; here’s the response:

$ ssh -vvv -p 64535

OpenSSH_7.8p1, LibreSSL 2.6.2
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 48: Applying options for *
debug1: /etc/ssh/ssh_config line 52: Applying options for *
debug2: resolve_canonicalize: hostname is address
debug2: ssh_connect_direct
debug1: Connecting to [] port 64535.
debug1: connect to address port 64535: Connection refused
ssh: connect to host port 64535: Connection refused

Does the rest of the verbose output indicate that your private key authentication to the box has been accepted?

That’s all the output I have from SSH.

From the CircleCI job it appears the SSH session was terminated even though I hadn’t logged in, and it had been waiting for me to SSH in for under 5 minutes.

I have always been able to SSH to CircleCI jobs.

It’s possible that job terminated, since I have been able to log into a new one:

➜  linux_macos git:(master) ssh -p 64537
The authenticity of host '[]:64537 ([]:64537)' can't be established.
RSA key fingerprint is [snipped]
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '[]:64537' (RSA) to the list of known hosts.

Perhaps you have some sort of problem connecting to this specific port (64535).

I have added verbose logging to the build tasks now; in the job logs, please look for the lines around [Compiler daemon process appears to have crashed] CalledProcessError: ....

Do you know how I could get to the bottom of this one? I don’t get this when I run the same Docker container on my local machine, or even when I do the same build in a native environment.

I’m thinking the OS is terminating the running processes or doing something to the I/O device of the instance.

There could be a conflict between doing some low-level stuff and Docker. Containers have a few limitations and it may be that you are bumping into one of them. That said, you may be bursting through your RAM limits too - by default you only have 4G for the whole container.

I would start off by changing to a Machine executor - this is a traditional VM and is much more like bare-metal, and for low-level tasks it may work more reliably. It has been free for 18 months, and may well be free for the long term (though there is a note to say this could change). It comes with 8G of RAM as standard, which is twice as much as you have now. You could try getting it working there, and then bring it back to Docker as a medium-term goal.
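If memory is the suspect, it may be worth printing what the container actually sees before switching executors. A sketch, assuming a Linux image with cgroup v1 mounted (which is an assumption about the executor, not something confirmed in this thread):

```shell
# Print the memory ceiling the container sees, and any OOM-killer traces.
# The path below is cgroup v1; on cgroup v2 the file is /sys/fs/cgroup/memory.max.
LIMIT=/sys/fs/cgroup/memory/memory.limit_in_bytes
if [ -r "$LIMIT" ]; then
  echo "container memory limit: $(cat "$LIMIT") bytes"
else
  echo "no cgroup v1 memory limit file found"
fi
dmesg 2>/dev/null | grep -iE 'oom|killed process' \
  || echo "no OOM messages visible (dmesg may be restricted in the container)"
```

A daemon killed by the OOM killer dies silently from the build tool’s point of view, which would match the “failed” with no further detail.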

Moving to bare metal defeats the purpose of the current setup, with or without cache; it would help if there were more debug tools available to diagnose such issues.

Are there plans to add system metrics, like Heroku provides (or from Docker, but in the current dashboard), so we can see the rise and fall of resources, and whether a job is running out of file handles or other resources? These resource limits can be a reason for programs terminating silently as well.

I also have no issues when I run the build scripts inside the same Docker container (image on Docker Hub) on my local machine (a MacBook), as opposed to running on CircleCI’s instances - it could be memory constraints. We will know once we move to a bigger resource on CircleCI.

I took the opportunity to start a different branch using a machine executor, and it seems to have worked fine, so the theory that the Docker container on a smaller instance is hitting memory limits could be true.

Do you suggest I now try running the build with Docker inside this instance and see the outcome? Is this machine spec good enough for a Docker build to finish?

To get to the bottom of the issue of failing builds inside the docker container, I did the following:

  • installed strace
  • invoked the build scripts with strace

strace does not get executed; in fact I don’t see it in the build logs at all, even though I echo every line of the script that I execute. Is there any restriction on running strace in the container? Do you have a filter that removes such a command before it is executed?

See for logs (you can search for strace and won’t find it in the execution logs at all); see (it contains strace, but it isn’t called).
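In case it helps once strace does run: a sketch of how the build could be wrapped to catch daemon deaths. The build command here is a placeholder, not your actual entry point:

```shell
# Follow forked children (-f) and log only process-lifecycle syscalls, so a
# daemon being killed shows up as an explicit "killed by SIGKILL" line.
run_traced() {
  # $1 = output file, $2 = command string (placeholder for the real build command)
  strace -f -e trace=process -o "$1" -- sh -c "$2"
}
# usage: run_traced /tmp/build.strace "<your build command>"
#        grep -E 'killed by|SIGKILL|exit_group' /tmp/build.strace
```

Note that strace needs ptrace permission inside the container; if the executor blocks ptrace, this will fail with a permission error rather than silently.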

I re-ran the build in SSH mode and ran the build script manually in the console; it worked perfectly fine, without crashing.

Something is not right about the Docker builds.

Any explanation for the above?

OK, so it works without cache, but not with, and works in SSH mode. I think this is a bit beyond my skills - not a Java person, unfortunately :blush:

One for @drazisil or another Circle person perhaps?

Thanks for the reply; I’m not sure it’s specific to Java. It could be container-related, and caching is an industry-wide issue, although much improved by recent developments.

The Docker containers are not recording failed-job logs well enough. I was wondering whether it’s worth installing lightweight strace- or sysdig-like tools that can monitor failing processes and log them to /var/log - at the moment I’m not sure which logs in the container are useful or where to find them.

I did manage to run strace in the end, but the other issues remain. I have paddled my way through some of the issues, so I can now get green builds, but I would like to know why certain things are happening so I can avoid future debugging effort.