Nextflow failing to detect task exit/failure due to slurm failed node #5276

@sereeena

Bug report

Expected behavior and actual behavior

I use sbatch to launch a Nextflow workflow, which in turn launches other Slurm jobs. If a Slurm node fails, Nextflow does not detect that the task has exited, so the workflow hangs with the next process waiting forever:

executor >  slurm (46)
[34/c77efe] process > batch_pipeline:BATCHCLEAN (... [100%] 1 of 1 ✔
[07/78ef5d] process > batch_pipeline:PLOTPCA (all... [100%] 1 of 1 ✔
[89/b34062] process > batch_pipeline:IMPUTATION_P... [100%] 22 of 22 ✔
[5c/b1bb0d] process > batch_pipeline:PRS_PER_CHR ... [ 95%] 21 of 22
[-        ] process > batch_pipeline:PRS_SUMMARY     -
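For context, the top-level launch looks roughly like this (script name, paths, and options are illustrative, not my exact setup):

#!/bin/bash
#SBATCH --job-name=nf-launcher
#SBATCH --partition=compute
#SBATCH --cpus-per-task=1

# illustrative launcher: the head job runs Nextflow, which then
# submits the per-process jobs (nf-batch_* above) via the slurm executor
nextflow run main.nf -profile slurm -resume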

Apologies, I have lost the .nextflow.log file, but Nextflow was repeatedly polling for the job status and kept showing the job as running; it never detected the job as having exited with an error, despite the job having failed on Slurm:

JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
3826         nf-batch_+    compute                     1  NODE_FAIL      1:0 
3826.batch        batch                                1  CANCELLED     0:15
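For what it's worth, something like this is roughly what I would expect the poller to see (an assumption on my part about how the executor checks status, not Nextflow's actual command):

# assumption: the executor derives task state from squeue output,
# roughly like this; all flags below are standard squeue options
squeue --noheader --format='%i %t' --states=all --user="$USER"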

In my nextflow.config I have errorStrategy = 'retry', which usually resubmits a job when its process fails, but here Nextflow never detects the process as failed, even though Slurm reports a non-zero exit code.
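For reference, the relevant part of my config is along these lines (the maxRetries value is illustrative):

// nextflow.config
process {
    executor      = 'slurm'
    errorStrategy = 'retry'
    maxRetries    = 3   // illustrative; retry normally kicks in on a non-zero exit
}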

I saw a similar issue (https://github.com/nextflow-io/nextflow/issues/3422#issuecomment-1323855649) suggesting this might be because Nextflow launches jobs with Slurm's --no-requeue option. If that is the cause, could there be a way to tell Nextflow whether to use --no-requeue via the executor config section? Defaulting to requeueing jobs after node failures seems far more useful than disabling that functionality.
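Something like the following is what I have in mind; this is a purely hypothetical option, no such setting exists in 23.10.1 as far as I know:

// nextflow.config -- sketch of the proposed setting, not a real option
executor {
    name = 'slurm'
    // hypothetical: let users keep Slurm's default requeue-on-node-failure
    // behaviour instead of the hardcoded --no-requeue
    requeue = true
}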

Steps to reproduce the problem

This is difficult to reproduce, as node failures are intermittent. I occasionally get these kinds of node failures in slurmctld.log:

[2024-09-02T16:28:29.026] sched: Allocate JobId=3826 NodeList=seonixhpc-compute-ghpc-0 #CPUs=1 Partition=compute
[2024-09-02T16:29:28.127] Batch JobId=3826 missing from batch node seonixhpc-compute-ghpc-0 (not found BatchStartTime after startup)
[2024-09-02T16:29:28.128] _job_complete: JobId=3826 WTERMSIG 126
[2024-09-02T16:29:28.128] _job_complete: JobId=3826 cancelled by node failure
[2024-09-02T16:29:28.129] _job_complete: JobId=3826 done

Environment

  • Nextflow version: 23.10.1 build 5891
  • Java version: ?
  • Operating system: CentOS
  • Bash version: GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)
