Kernel Update & Increase in Out-of-Memory Job Failures

Update 5 Jun 2020 19:00 UTC: We are rolling back our production fleet to an image based on the 4.15 kernel. The rollback should complete in the next few hours and should reduce the chance of OOM errors occurring. We apologize for the disruption the kernel version upgrade has caused for you and your teams.

On Sunday May 31st we performed a regular rollout of an updated base image for the shared hosts which run our Docker executor. This is a process we do fairly often to keep the image and dependencies secure and up-to-date, usually without any impact.

In this case, one of the changes was an upgrade of the Linux kernel from version 4.15 to version 5.3, in accordance with Ubuntu’s new AWS Rolling Kernel Policy. Among various other changes, this included changes to the behaviour of the OOMKiller, in particular a change to the heuristics around child processes.

After rolling out this release we’ve seen an increase in the number of jobs that are failing because a process was killed for using too much memory. This is predominantly affecting Android projects and NodeJS projects using webpack (both of which make use of child processes to fan out work).

In every case we’ve investigated, the job was using more memory than had been allocated to it, and had been doing so before the update. The difference is that the new OOMKiller behaviour selects the parent process rather than the child process, which leads to a failed job rather than a recovery.

We have published a support article outlining some ways of checking how much memory a job used. There are two main approaches:

Interrogating the cgroup to see the peak usage by adding the following at the end of a job:

  - run:
      command: cat /sys/fs/cgroup/memory/memory.max_usage_in_bytes
      when: always

Or seeing the per-process usage over time by adding the following to the start of a job (the RSS column shows the RAM usage in KB):

  - run:
      command: |
        # Print the process tree every 5 seconds while the job runs
        while true; do
          sleep 5
          ps auxwwf
          echo "======"
        done
      background: true
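The peak value reported by the first approach is a raw byte count, which is awkward to compare against a resource class size. A minimal sketch of a more readable version follows. The helper name bytes_to_mib is hypothetical, and the path assumes the same cgroup v1 layout as the step above.

```shell
# Convert a byte count to whole MiB (integer division, rounds down).
bytes_to_mib() {
  echo $(( $1 / 1024 / 1024 ))
}

# Same file as the step above; assumed to exist inside the job container.
peak_file=/sys/fs/cgroup/memory/memory.max_usage_in_bytes
if [ -r "$peak_file" ]; then
  echo "Peak memory: $(bytes_to_mib "$(cat "$peak_file")") MiB"
fi
```

This could replace the cat command in the final step of the job, keeping when: always so that it also runs after a failure.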

To get your jobs passing again there are two options available:

  • Increase the resource class size to accommodate the additional usage
  • Make changes to the job configuration to make it use less memory

The output from ps above should help you establish which process or processes within the job are consuming the memory. Depending on the build tool, there will likely be a way to apply either a memory limit or a limit on the number of parallel child processes to run at a time.

If you are using Gradle, we have seen some customers have success with:

  • Disabling the Gradle daemon using -Dorg.gradle.daemon=false in GRADLE_OPTS
  • Adjusting the -Xmx setting in org.gradle.jvmargs so that when combined with --max-workers the total usage will be lower than the job’s limit
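To illustrate the arithmetic behind the second suggestion, here is a rough sketch that splits a job's memory limit between the main build JVM and its workers. The helper heap_per_jvm_mb and the 25% reserve for non-heap memory are assumptions for illustration, not recommended values. How each worker's heap is actually set depends on the plugin or task, so treat the equal split as a simplification.

```shell
# Estimate a per-JVM heap cap in MB.
#   $1 = job memory limit in MB
#   $2 = number of workers (the value passed to --max-workers)
#   $3 = percentage of the limit reserved for non-heap memory (assumed)
heap_per_jvm_mb() {
  usable=$(( $1 * (100 - $3) / 100 ))
  # One main build JVM plus one JVM per worker.
  echo $(( usable / ($2 + 1) ))
}

# Example: a 4096 MB limit, 2 workers, 25% reserved.
echo "org.gradle.jvmargs=-Xmx$(heap_per_jvm_mb 4096 2 25)m"
```

Comparing the result against the peak usage measured earlier is a reasonable way to check whether the chosen values leave enough headroom.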

There is more discussion about the behaviour of Gradle with worker processes and memory usage in this forum post -

One of the reasons that tools use more memory than they have been allocated is that they interrogate the host machine to determine their limits, and the host reports 36 cores and 72GB of RAM. To help avoid this, we have also begun rollout of a change to the Docker executor which will make it easier for tools to determine how much memory they have been allocated. This will allow tools which are aware of cgroups (like the JVM) to adjust their usage accordingly.
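The gap between the two views of memory can be seen from inside a job with a sketch like the one below. The helper names are hypothetical. The cgroup path assumes the same cgroup v1 layout used earlier in this post and may differ in your environment.

```shell
# Read the memory limit applied to this container's cgroup.
# The default path assumes a cgroup v1 layout; pass another path to override.
cgroup_limit_bytes() {
  limit_file=${1:-/sys/fs/cgroup/memory/memory.limit_in_bytes}
  if [ -r "$limit_file" ]; then
    cat "$limit_file"
  else
    echo "unknown"
  fi
}

# Read the total RAM the kernel reports, which inside a container is
# typically the host's total rather than the job's allocation.
host_total_bytes() {
  meminfo=${1:-/proc/meminfo}
  if [ -r "$meminfo" ]; then
    kb=$(awk '/^MemTotal:/ {print $2}' "$meminfo")
    echo $(( kb * 1024 ))
  else
    echo "unknown"
  fi
}

echo "cgroup limit: $(cgroup_limit_bytes)"
echo "host total:   $(host_total_bytes)"
```

A tool that sizes itself from the second number will plan for far more memory than the first number allows.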