Description
Bug report
Expected behavior and actual behavior
I use sbatch to launch a Nextflow workflow, which in turn submits further Slurm jobs. If a Slurm node fails, Nextflow does not detect that the task has terminated, so the workflow hangs with the next process waiting forever.
executor > slurm (46)
[34/c77efe] process > batch_pipeline:BATCHCLEAN (... [100%] 1 of 1 ✔
[07/78ef5d] process > batch_pipeline:PLOTPCA (all... [100%] 1 of 1 ✔
[89/b34062] process > batch_pipeline:IMPUTATION_P... [100%] 22 of 22 ✔
[5c/b1bb0d] process > batch_pipeline:PRS_PER_CHR ... [ 95%] 21 of 22
[- ] process > batch_pipeline:PRS_SUMMARY -
Apologies, I have lost the .nextflow.log file. It showed Nextflow repeatedly polling for job status and reporting the job as running; it never detected that the job had exited with an error, even though the job had failed on Slurm:
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3826         nf-batch_+    compute                     1  NODE_FAIL      1:0
3826.batch        batch                                1  CANCELLED     0:15
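For anyone who wants to check the same thing outside Nextflow, here is a minimal Python sketch that asks sacct for a job's state and flags terminal failures such as NODE_FAIL. It is not how Nextflow polls job status. The function names and the FAILED_STATES set are my own assumptions for illustration, and it requires Slurm accounting to be enabled.

```python
import subprocess

# Slurm states treated as "finished and not successful".
# This set is an assumption for illustration, not Nextflow's own list.
FAILED_STATES = {"NODE_FAIL", "FAILED", "CANCELLED", "TIMEOUT",
                 "OUT_OF_MEMORY", "BOOT_FAIL"}


def parse_sacct(text):
    """Parse output of `sacct --parsable2 --noheader --format=JobID,State,ExitCode`."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        job_id, state, exit_code = line.split("|")[:3]
        # sacct may append detail to the state, e.g. "CANCELLED by 1000".
        rows.append((job_id, state.split(" ", 1)[0], exit_code))
    return rows


def failed_jobs(text):
    """Return the (job_id, state, exit_code) rows in a terminal failure state."""
    return [row for row in parse_sacct(text) if row[1] in FAILED_STATES]


def query_sacct(job_id):
    """Run sacct for a single job and return its raw pipe-delimited output."""
    cmd = ["sacct", "-j", str(job_id), "--parsable2", "--noheader",
           "--format=JobID,State,ExitCode"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

Calling failed_jobs(query_sacct(3826)) on the cluster would list both the job and its batch step for the failure shown above, which is the information the workflow appears not to act on.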
In my nextflow.config I have errorStrategy = 'retry', which usually resubmits the job when a process fails. Here, however, Nextflow does not detect the process as failed, even though Slurm reports a non-zero exit code.
I saw a similar issue, https://github.com/nextflow-io/nextflow/issues/3422#issuecomment-1323855649, suggesting this might be due to Nextflow launching jobs with the --no-requeue Slurm option. If that is related, could there be a way to tell Nextflow, in the executor config section, whether to use --no-requeue? (Requeuing jobs after node failures seems a much more useful default than disabling this functionality.)
Steps to reproduce the problem
This is difficult to reproduce because node failures are intermittent. I occasionally see node failures like this in slurmctld.log:
[2024-09-02T16:28:29.026] sched: Allocate JobId=3826 NodeList=seonixhpc-compute-ghpc-0 #CPUs=1 Partition=compute
[2024-09-02T16:29:28.127] Batch JobId=3826 missing from batch node seonixhpc-compute-ghpc-0 (not found BatchStartTime after startup)
[2024-09-02T16:29:28.128] _job_complete: JobId=3826 WTERMSIG 126
[2024-09-02T16:29:28.128] _job_complete: JobId=3826 cancelled by node failure
[2024-09-02T16:29:28.129] _job_complete: JobId=3826 done
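Since the failures are intermittent, one way to find affected jobs after the fact is to scan slurmctld.log for the "cancelled by node failure" message shown above. This is a rough Python sketch; the regular expression is derived only from the log lines in this report, and the function name is hypothetical, so other Slurm versions may format the message differently.

```python
import re

# Matches lines such as:
# [2024-09-02T16:29:28.128] _job_complete: JobId=3826 cancelled by node failure
# The pattern is an assumption based on the log excerpt in this report.
NODE_FAIL_RE = re.compile(
    r"^\[(?P<time>[^\]]+)\].*\bJobId=(?P<job>\d+) cancelled by node failure"
)


def node_failure_jobs(log_text):
    """Return (timestamp, job_id) pairs for jobs cancelled by a node failure."""
    hits = []
    for line in log_text.splitlines():
        match = NODE_FAIL_RE.search(line)
        if match:
            hits.append((match.group("time"), int(match.group("job"))))
    return hits
```

Passing the contents of slurmctld.log to node_failure_jobs gives the job IDs to cross-check against the tasks Nextflow still reports as running. The location of slurmctld.log depends on the cluster configuration.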
Environment
- Nextflow version: [23.10.1 build 5891]
- Java version: [?]
- Operating system: [CentOS]
- Bash version: GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)