Neo4j suddenly disappears in the middle of the test suite running

I have a Rails app and am using CircleCI for my test suite. I have started to implement neo4j in the app, but have been running into a problem getting it to work in the CI environment. After some amount of time, my neo4j container will just… go away. I have been re-running my test suite with SSH enabled, and while SSH’d into the container I run a basic curl command to see if neo4j is up. After some amount of time, it just stops responding.

I’m using the bitnami neo4j docker image, and provide my docker key/setting below for the circle config. I haven’t been able to correlate the server disappearance to any set amount of time or any specific test or action of the test suite. I also can’t get access to the neo4j container to do any sort of hardcore debugging/logging. If I could get some sort of verbose logging maybe that could help.

I’d really, really appreciate any ideas or suggestions on how to fix or troubleshoot this. If I could somehow get verbose logging or an ssh into the neo4j container that would help me out a lot.

It does seem to be somehow time related, as in, if I run only the neo4j test, and run it first, it always passes. But I’ve also seen the neo4j container go away after anywhere between 5 and 12 minutes. It seems to be all over the place.

(the below output has been altered a bit)

circleci@b6db4173f47b:~$ curl localhost:7575
{
   "data" : "http:// localhost:7575/db/data/",
   "management" : "http:// localhost:7575/db/manage/",
   "bolt" : "bolt:// localhost:7687"
}
# Sometime Later....
circleci@b6db4173f47b:~$ curl localhost:7575
curl: (7) Failed to connect to localhost port 7575: Connection refused

Here’s the snippet of my circle/config.yml that I’m using to get neo4j running.

docker:
 - image: circleci/ruby:2.4.5-node-browsers
 - image: circleci/mariadb:10.2.15
 - image: docker.elastic.co/elasticsearch/elasticsearch:6.6.2
 - image: circleci/redis
 - image: bitnami/neo4j:3.5.7
   environment:
      NEO4J_PASSWORD: somepassword
      NEO4J_HTTP_PORT_NUMBER: 7575
      NEO4J_HOST: localhost

Here’s the example failure.

Neo4jNodes::Plan Connections connects and doesn't blow up!
  Failure/Error: n_plan = Neo4jNodes::Plan.create(title: "Plan")
  
  Neo4j::Core::CypherSession::ConnectionFailedError:
    Faraday::ConnectionFailed: Couldn't connect to server
  # ./spec/neo4j_nodes/plan_spec.rb:6:in `block (3 levels) in <top (required)>'
  # ------------------
  # --- Caused by: ---
  # Faraday::ConnectionFailed:
  #   Couldn't connect to server
  #   ./spec/neo4j_nodes/plan_spec.rb:6:in `block (3 levels) in <top (required)>'

I sometimes see “Job was canceled” in the neo4j output when I go back and look at it, but that’s the only output I get when it goes and fails.

Try getting a post-failure SSH session, and see if Neo4j is alive at that point. If it is, I wonder if you need a dockerize or waitforit to wait for the Neo4j listener to start up.
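
As a rough sketch of the waiting idea, assuming bash and curl are available in the primary container: the helper below retries a command once per second until it succeeds or a timeout passes. The name `wait_for` and the timeout values are made up for illustration; dockerize and wait-for-it do the same job with more polish.

```shell
#!/usr/bin/env bash
# wait_for TIMEOUT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: retries COMMAND once per second until it succeeds
# or TIMEOUT_SECONDS have elapsed. Returns 0 on success, 1 on timeout.
wait_for() {
  local timeout="$1"
  shift
  local waited=0
  until "$@" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

In a run step before the specs, that could be called as `wait_for 60 curl --silent --fail localhost:7575`, using the port from the config in the opening post.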

I think I’ve been doing that, but there may be a better way I’m unaware of. Currently I ssh in and see that it’s up for a while and then down, which I verify by curling to the port and getting a response for a while, and then getting a “failed to connect”. The console output I get is in my opening post. Is there a better way that I should be doing this?

If you can get it to reliably fail in SSH, then check the Neo4j logs. If there is not much detail in them, try turning up the log level, assuming this system allows you to do that.

My money would be that you’re running out of memory, and the kernel is kicking the Neo4j process out. There are some settings to tweak memory allocation, try adjusting those.

You could also ask for a RAM upgrade from Circle (at a cost) or try moving to the Machine executor (has 8G of RAM instead of the default 4G on the Docker executor, currently free).

Interesting, the kernel killing the neo4j process makes a lot of sense. It’s got that telltale sign of just disappearing, which normally happens when a process is killed.

I’m not sure how to see the neo4j logs after I ssh into the circle ci container. https://docs.bitnami.com/aws/infrastructure/neo4j/get-started/understand-default-config/ This says that the logs are contained at /opt/bitnami/neo4j/logs/debug.log but I don’t really know how to access them once I am SSH’d into the CircleCI container. Do I have to switch my circle.yml file to use machines instead of jobs?

It’s worth checking /var/log/kern.log - if it contains cheerful messages about “child sacrificed” (really) then you know that’s the issue.
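
A minimal sketch of that check, assuming the kernel messages are readable somewhere (a log file, or dmesg output piped in). The function name `find_oom_lines` is hypothetical, and the patterns only cover the usual OOM-killer wording, so an empty result does not rule out an OOM.

```shell
#!/usr/bin/env bash
# find_oom_lines [FILE...]
# Hypothetical helper: prints lines that look like kernel OOM-killer messages.
# With no FILE arguments it reads standard input.
find_oom_lines() {
  grep -h -i -E 'out of memory|oom-killer|killed process' "$@" 2>/dev/null
}
```

For example, `find_oom_lines /var/log/kern.log /var/log/syslog`, or `dmesg | find_oom_lines` where reading the kernel buffer is permitted.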

Since you are using a secondary Docker executor, I don’t think you can at present. However you could remove that and install it in the build host instead, where you have full access to logs.

You still run jobs - you run Machine instead of Docker. It’s often a pretty simple change - see the docs on the YAML syntax. But I would recommend tweaking the Neo4j memory settings if you can - Docker is generally more flexible than Machine, and throwing memory at the problem is dodging rather than fixing.
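
For the memory tweak, here is a sketch that assumes Neo4j 3.5 is installed somewhere you can edit its neo4j.conf (for example in the build container), and that these keys are not already set in that file. The helper name and the sizes in the usage line are illustrative only; the three setting names are the standard Neo4j 3.5 heap and page cache options.

```shell
#!/usr/bin/env bash
# set_neo4j_memory CONF_FILE HEAP PAGECACHE
# Hypothetical helper: appends explicit heap and page cache limits
# to the given neo4j.conf so the JVM does not size itself from host RAM.
set_neo4j_memory() {
  local conf="$1"
  local heap="$2"
  local pagecache="$3"
  {
    echo "dbms.memory.heap.initial_size=${heap}"
    echo "dbms.memory.heap.max_size=${heap}"
    echo "dbms.memory.pagecache.size=${pagecache}"
  } >> "$conf"
}
```

Usage would be something like `set_neo4j_memory /path/to/neo4j.conf 512m 256m`, run before Neo4j starts. The right path depends on how Neo4j was installed.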

I don’t see that file when I ssh into my CI container. A google search indicates that it may be an enterprise feature?

Ah, it could be some other file - maybe syslog? These are ordinary Linux logs, it’d hardly be worth making them into an enterprise feature.

Unfortunately I’m not seeing a lot of exciting stuff in my /var/log folder. I looked at both dpkg.log and alternatives.log (the logs with the most recent timestamps) and didn’t see anything super relevant (no appearance of neo4j in them).

circleci@71acf28b487c:/var/log$ ls -lh
total 1.5M
-rw-r--r-- 1 root        root  35K Jul 24 12:36 alternatives.log
drwxr-xr-x 2 root        root 4.0K Jul 24 12:36 apt
-rw-rw---- 1 root        utmp    0 Mar 26 12:00 btmp
-rw-r--r-- 1 root        root 338K Jul 24 12:36 dpkg.log
drwxr-s--- 2 Debian-exim adm  4.0K Apr  3 00:10 exim4
-rw-r--r-- 1 root        root 108K Apr  3 00:11 faillog
-rw-r--r-- 1 root        root 1.3K Apr  3 00:16 fontconfig.log
-rw-rw-r-- 1 root        utmp 980K Apr  3 00:11 lastlog
drwxr-xr-x 2 root        root 4.0K May 25  2017 sysstat
-rw-rw-r-- 1 root        utmp    0 Mar 26 12:00 wtmp
circleci@71acf28b487c:/var/log$

I also tried to use journalctl and it’s not installed so no help there.

circleci@71acf28b487c:~$ sudo journalctl -fu docker.service
sudo: journalctl: command not found
circleci@71acf28b487c:~$ journalctl -fu docker.service
bash: journalctl: command not found

I’ve enabled the “- setup_remote_docker” step so that I can run docker commands from the container, but docker ps, docker ps -a and docker container ls -la show no running containers, so it looks like it’s segmented a bit and I can’t peek upwards too much into the host machine?

I also tried dmesg, but it doesn’t look like I’m allowed?

dmesg: read kernel buffer failed: Operation not permitted
circleci@71acf28b487c:/var/log$ sudo dmesg
dmesg: read kernel buffer failed: Operation not permitted
circleci@71acf28b487c:/var/log$ sudo sysctl kernel.dmesg_restrict=0
sysctl: setting key "kernel.dmesg_restrict": Read-only file system

Does dmesg work without sudo?

If you get no joy there, try the Neo4j logs. Of course, at this stage the suggestion that you have an OOM is just a guess, and it needs to be confirmed first.

Nah it does not.

Unfortunately I still haven’t figured out how to get to those logs, since they seem to be buried somewhere else in a different container that I can’t get to.

I’ve decided to switch my strategy a bit: I wanted to run the neo4j container myself in a run command, run tests against it, then shut it down when it’s done. It works for the most part, except that I can’t connect to the port that the docker run command publishes. Can I run docker containers inside the main executor and get to their TCP ports?

Here’s the run command, which sets up neo4j on port 7575.

jobs:
  experiment: 
    steps:
      - run: 
          name: Try running neo4j in our own container
          command: |
          
            echo "Startup neo4j docker container"
            neo4j_container_id=$(docker run -d \
                                  -e NEO4J_PASSWORD=somepassword -e NEO4J_HTTP_PORT_NUMBER=7575 -e NEO4J_HOST=localhost \
                                  -p 7575:7575 \
                                  bitnami/neo4j:3.5.7)   

I then curl to that port, 7575, and get a connection refused. Why is that? And can I get around it?

circleci@8f7fd6a17953:~$ curl localhost:7575
curl: (7) Failed to connect to localhost port 7575: Connection refused

Yep, try installing it in the build container directly, as I suggested before.

Regarding Docker:

The short answer is that you cannot publish ports when running Docker-in-Docker. If you were able to do that, by definition you would have control over the Docker networking stack that would affect other customers, which is a security risk. It is therefore disallowed.

You can do this by using Docker Compose, with Neo4j in one container and your client in another - this approach does not require port publishing, since you can use the virtual network instead. Or you can use Docker with custom virtual networks, but that is effectively what you would get for free with Compose.
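
A sketch of the custom-network variant, under a few assumptions: the network name, the container name `neo4j`, and the `curlimages/curl` client image are my own choices for illustration, while the environment variables are the ones from the config earlier in the thread. Nothing is published with -p; the client reaches Neo4j by container name over the shared network.

```shell
#!/usr/bin/env bash
# DOCKER can be overridden (e.g. DOCKER=echo) to preview the commands.
DOCKER="${DOCKER:-docker}"
# Hypothetical network name, chosen for this example only.
NEO4J_NETWORK="${NEO4J_NETWORK:-neo4j-test-net}"

# Creates a user-defined network and starts Neo4j attached to it.
start_neo4j_on_network() {
  "$DOCKER" network create "$NEO4J_NETWORK"
  "$DOCKER" run -d --name neo4j --network "$NEO4J_NETWORK" \
    -e NEO4J_PASSWORD=somepassword \
    -e NEO4J_HTTP_PORT_NUMBER=7575 \
    bitnami/neo4j:3.5.7
}

# Runs a throwaway client container on the same network and curls Neo4j
# by its container name instead of localhost.
probe_neo4j() {
  "$DOCKER" run --rm --network "$NEO4J_NETWORK" \
    curlimages/curl -sS http://neo4j:7575
}
```

Call `start_neo4j_on_network` once, give Neo4j time to boot, then `probe_neo4j`. The test suite would need to run in a container on that network in the same way, which is the "transfer your client to another container" part.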

In my opinion, your shortest route to getting this working would be to forget about Docker for now and install Neo4j directly. However, if you must use Docker or that specific image, then yes, it is possible.

Okay, I see now. I wasn’t sure which way to install it, so I thought booting up a docker container with the running image would be the shortest path. I’ll just apt install it next and see how far that gets me. Thanks!

I prefer that approach, but you need to transfer your client to another container, and you also need to be ready to deal with Docker-related issues (such as the port publishing gotcha). You can always give it a go if you are confident with Docker :)