Bug report
Expected behavior and actual behavior
Slurm jobs that run out of memory are OOM-killed. In nearly all cases this is handled correctly. For one awk process I run, however, the excessive RAM usage and the resulting OOM kill are only recorded in .command.log and are ignored by Nextflow, which reports the task as successful. The awk process therefore ends prematurely and produces corrupted output.
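For reference, this is roughly how the mismatch can be observed in a task's work directory (the run name and work directory path below are placeholders; .exitcode and .command.log are the files Nextflow writes per task):

$ nextflow log <run_name> -f name,status,exit,workdir   # locate the task's work directory
$ cd work/xx/xxxxxxxxxxxx                                # placeholder task directory
$ cat .exitcode                                          # exit status Nextflow recorded
$ grep -i oom .command.log                               # the OOM kill only shows up here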
Steps to reproduce the problem
The following process reproduces the issue with fastq.gz files containing 20 million reads or more (a sketch for generating such an input follows the process definition).
process count_reads {
    label "count_reads"
    publishDir path: "${params.analysesdir}/${stage}/${sample}/", pattern: "*.csv", mode: "copy"

    // SLURM cluster options
    cpus 1
    memory "5 GB"
    time "1h"

    tag "readcount_${sample}"

    input:
    tuple val(sample), path(reads)
    val(stage)

    output:
    tuple val(sample), path("${sample}_read_count.csv"), emit: read_count

    script:
    """
    zless ${reads[0]} | awk 'END {printf "%s", "${sample},"; printf "%.0f", NR/4; print ""}' > ${sample}_read_count.csv
    """

    stub:
    """
    mkdir -p ${sample}
    touch ${sample}_read_count.csv
    """
}
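A synthetic input of that size can be generated with a short shell sketch (the output file name and read content are arbitrary placeholders; constant reads are sufficient for a pure read count):

$ awk 'BEGIN { for (i = 1; i <= 20000000; i++) print "@read_" i "\nACGTACGTACGTACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII" }' | gzip > synthetic_R1.fastq.gz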
Program output
In the .nextflow.log the jobs look as if they completed successfully, while the task's .command.log shows the OOM kill:
$ cat .command.log
slurmstepd-hpc-...: error: Detected 1 oom_kill event in StepId=71xxxxx.batch. Some of the step tasks have been OOM Killed.
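To cross-check what Slurm itself recorded for the step, the accounting database can be queried (the job ID is the StepId from .command.log, truncated above; the format fields are standard sacct columns). Depending on the Slurm version, the State column typically reports the killed step as OUT_OF_MEMORY even though Nextflow treats the task as successful:

$ sacct -j <JOBID> --format=JobID,JobName,State,ExitCode,MaxRSS,ReqMem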
Environment
- Nextflow version: 24.04.2
- Java version: openjdk 21-internal 2023-09-19
- Operating system: Linux
- Bash version: GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)