Description
Bug report
I am noticing a problem with the Nextflow-SLURM interface when processing a large number of jobs. For example, I am running the nf-core sarek workflow and have spawned ~10,000 tasks (50 samples * 150 intervals/sample, plus downstream tasks). However, my remaining tasks are hanging: nothing new is being submitted to SLURM even though the jobs that did run have completed.
Expected behavior and actual behavior
Nextflow generates a task. Nextflow submits the task to SLURM. SLURM runs the job. SLURM finishes the job (creating .exitcode with value 0). Nextflow recognizes this and submits the next task.
This last step of recognizing a finished job seems to be failing.
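For reference, that last step can be verified by hand in a task's work directory. A minimal sketch in bash, using the truncated work-dir hash mentioned further below (adjust the path to a real task directory):

# inspect a finished task's work directory
ls -a work/76/69b9b4*/            # should list .command.run, .command.sh, .command.log, .exitcode, ...
cat work/76/69b9b4*/.exitcode     # 0 here means the wrapped job itself finished successfully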
Steps to reproduce the problem
More tangibly, I have used queueSize to limit my workflow to 100 jobs at a time. Some of those jobs do finish (.exitcode exists with the value 0 in the local work dir, e.g. work/76/69b9b4...), BUT these jobs are still considered RUNNING when I look at the .nextflow.log file (see the last screenshot).
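As a rough cross-check of that mismatch (a sketch; the RUNNING pattern is an assumption about how task entries appear in the log and may need adjusting, and the same task can show up in several poll dumps):

# queueSize was set in nextflow.config via the executor scope, e.g. executor.queueSize = 100
# count tasks whose work dir already contains a successful exit code
find work -name .exitcode -exec grep -l '^0' {} + | wc -l
# compare with how many entries the log still reports as RUNNING (rough count)
grep -c 'status: RUNNING' .nextflow.log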
Looking through the log file, I see that SLURM job ID 636541 is submitted at 15:17 on March 2. It finishes and creates the .exitcode file (with a successful 0 exit) at 18:10, so SLURM considers the job completed at that point. However, as of yesterday (March 3) Nextflow still shows no completed status for this job, and as a result none of the remaining ~7,000 tasks in the submission queue are being submitted to SLURM:
Mar-03 05:26:48.603 [Task submitter] DEBUG n.processor.TaskPollingMonitor - %% executor slurm > tasks in the submission queue: 6714 -- tasks to be submitted are shown below
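For completeness, SLURM's own accounting record for that job can be checked independently of Nextflow (sacct is standard SLURM tooling; the format columns here are just one possible selection):

# ask SLURM accounting what it thinks happened to job 636541
sacct -j 636541 --format=JobID,JobName,State,ExitCode,Start,End
# while a job is still queued or running, squeue -j 636541 shows its live state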
Any help would be appreciated. Is there a reason the completed jobs aren't being recognized as such, or is there something I can do to force them to be re-recognized? Thanks in advance.
Environment
nextflow info
Version: 21.10.6 build 5660
Created: 21-12-2021 16:55 UTC (17:55 CEST)
System: Linux 4.18.0-348.12.2.el8_5.x86_64
Runtime: Groovy 3.0.9 on OpenJDK 64-Bit Server VM 10.0.2+13
Encoding: UTF-8 (UTF-8)
shell: bash