
Slurm oom-kill due to memory is ignored. #5332

Open
@oliverdrechsel

Description


Bug report

Expected behavior and actual behavior

Slurm jobs that run out of memory get OOM-killed, and in nearly all cases Nextflow handles this correctly. In an awk process I run, however, excessive RAM usage triggers an OOM kill that is only logged in .command.log and is otherwise ignored by the Nextflow process. The awk process therefore ends prematurely, leaving corrupted output, while the task is still treated as successful.
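
For context, the usual mitigation would be to retry with more memory, but that only helps when Nextflow actually sees the task fail. A minimal nextflow.config sketch for the process below (the label matches the process, the values are illustrative and not from my pipeline):

process {
    withLabel: 'count_reads' {
        memory        = { 5.GB * task.attempt }   // grow the request on each attempt
        errorStrategy = 'retry'                   // resubmit on failure, e.g. a non-zero exit status after an OOM kill
        maxRetries    = 3
    }
}

Because the OOM kill never surfaces as a task failure here, such a retry strategy never triggers.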

Steps to reproduce the problem

The following code produces issues with fastq.gz files with 20 million reads or more.

process count_reads {

    label "count_reads"

    publishDir path: "${params.analysesdir}/${stage}/${sample}/", pattern: "*.csv", mode: "copy"

    // SLURM cluster options
    cpus 1
    memory "5 GB"
    time "1h"

    tag "readcount_${sample}"

    input:
        tuple val(sample), path(reads)
        val(stage)
        
    output:
        tuple val(sample), path("${sample}_read_count.csv"), emit: read_count
        
    script:
        """
        zless ${reads[0]} | awk 'END {printf "%s", "${sample},"; printf "%.0f", NR/4; print ""}' > ${sample}_read_count.csv
        """

    stub:
        """
        mkdir -p ${sample}
        touch ${sample}_read_count.csv
        """
}
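
For reproduction, the process is wired up roughly as follows (channel and parameter names are hypothetical, not the exact ones from my pipeline):

workflow {
    // Hypothetical input wiring: params.reads points at the fastq.gz pairs,
    // e.g. --reads 'data/*_R{1,2}.fastq.gz'; params.analysesdir is the output root used by publishDir.
    samples_ch = Channel.fromFilePairs(params.reads)
    count_reads(samples_ch, "count_reads")
}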

Program output

In .nextflow.log the jobs appear to have completed successfully. The OOM kill only shows up in the task's .command.log:

$ cat .command.log
slurmstepd-hpc-...: error: Detected 1 oom_kill event in StepId=71xxxxx.batch. Some of the step tasks have been OOM Killed.

Environment

  • Nextflow version: 24.04.2
  • Java version: openjdk 21-internal 2023-09-19
  • Operating system: Linux
  • Bash version: GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)

