Nextflow failing to detect task exit/failure due to slurm failed node #5276

@sereeena

Bug report

Expected behavior and actual behavior

I use sbatch to launch a Nextflow workflow, which in turn launches other Slurm jobs. If a Slurm node fails, Nextflow does not detect that the task has exited, so the workflow hangs with the next process waiting forever:

executor >  slurm (46)
[34/c77efe] process > batch_pipeline:BATCHCLEAN (... [100%] 1 of 1 ✔
[07/78ef5d] process > batch_pipeline:PLOTPCA (all... [100%] 1 of 1 ✔
[89/b34062] process > batch_pipeline:IMPUTATION_P... [100%] 22 of 22 ✔
[5c/b1bb0d] process > batch_pipeline:PRS_PER_CHR ... [ 95%] 21 of 22
[-        ] process > batch_pipeline:PRS_SUMMARY     -
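For context, the top-level launch looks roughly like this (script name, paths, and options are illustrative, not my exact setup):

#!/bin/bash
#SBATCH --job-name=nf-launcher
#SBATCH --partition=compute
#SBATCH --cpus-per-task=1

# illustrative launcher: the head job runs Nextflow, which then
# submits the per-process jobs (nf-batch_* above) via the slurm executor
nextflow run main.nf -profile slurm -resume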

Apologies, I have lost the .nextflow.log file, but Nextflow was repeatedly polling for the job status and kept showing the job as running; it never detected the job as having exited with an error, despite the job having failed on Slurm:

JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
3826         nf-batch_+    compute                     1  NODE_FAIL      1:0 
3826.batch        batch                                1  CANCELLED     0:15
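For what it's worth, something like this is roughly what I would expect the poller to see (an assumption on my part about how the executor checks status, not Nextflow's actual command):

# assumption: the executor derives task state from squeue output,
# roughly like this; all flags below are standard squeue options
squeue --noheader --format='%i %t' --states=all --user="$USER"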

In my nextflow.config I have errorStrategy = 'retry', which usually resubmits a job when its process fails, but here Nextflow never detects the process as failed, even though Slurm reports a non-zero exit code.
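For reference, the relevant part of my config is along these lines (the maxRetries value is illustrative):

// nextflow.config
process {
    executor      = 'slurm'
    errorStrategy = 'retry'
    maxRetries    = 3   // illustrative; retry normally kicks in on a non-zero exit
}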

I saw a similar issue (https://github.com/nextflow-io/nextflow/issues/3422#issuecomment-1323855649) suggesting this might be because Nextflow launches jobs with Slurm's --no-requeue option. If that is the cause, could there be a way to tell Nextflow whether to use --no-requeue via the executor config section? Defaulting to requeueing jobs after node failures seems far more useful than disabling that functionality.
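Something like the following is what I have in mind; this is a purely hypothetical option, no such setting exists in 23.10.1 as far as I know:

// nextflow.config -- sketch of the proposed setting, not a real option
executor {
    name = 'slurm'
    // hypothetical: let users keep Slurm's default requeue-on-node-failure
    // behaviour instead of the hardcoded --no-requeue
    requeue = true
}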

Steps to reproduce the problem

This is difficult to reproduce, as node failures are intermittent. I occasionally get these kinds of node failures in slurmctld.log:

[2024-09-02T16:28:29.026] sched: Allocate JobId=3826 NodeList=seonixhpc-compute-ghpc-0 #CPUs=1 Partition=compute
[2024-09-02T16:29:28.127] Batch JobId=3826 missing from batch node seonixhpc-compute-ghpc-0 (not found BatchStartTime after startup)
[2024-09-02T16:29:28.128] _job_complete: JobId=3826 WTERMSIG 126
[2024-09-02T16:29:28.128] _job_complete: JobId=3826 cancelled by node failure
[2024-09-02T16:29:28.129] _job_complete: JobId=3826 done

Environment

  • Nextflow version: 23.10.1 build 5891
  • Java version: ?
  • Operating system: CentOS
  • Bash version: GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)
