
Nextflow - SLURM job interface disconnect #2695

Open
@nickhsmith

Bug report

I am seeing a problem with the Nextflow-SLURM interface when processing a large number of jobs. For example, I am running the nf-core sarek workflow, which has spawned ~10,000 tasks (50 samples × 150 intervals per sample, plus downstream tasks). My tasks are hanging: even though jobs are completing, nothing new is being submitted to SLURM.

Expected behavior and actual behavior

Expected: Nextflow generates a task and submits it to SLURM. SLURM runs the job, and when the job finishes a .exitcode file containing 0 is written to the task's work directory. Nextflow recognizes the completion and submits the next task.

Actual: this last step, Nextflow recognizing that a job has finished, appears to be failing.
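The expected cycle, and why a missed completion blocks everything behind it, can be pictured with a toy model. This is a hypothetical sketch, not Nextflow's implementation: `simulate_queue` and its `detected` callback are invented for illustration, and it assumes every submitted job finishes before the next poll, so the only variable is whether the monitor notices the completion.

```python
from collections import deque
from typing import Callable, Iterable, List, Tuple


def simulate_queue(
    task_ids: Iterable[int],
    queue_size: int,
    detected: Callable[[int], bool],
    max_polls: int = 1000,
) -> Tuple[List[int], List[int], List[int]]:
    """Toy model of a bounded submit/poll loop (hypothetical, not Nextflow code).

    Every submitted job is assumed to finish before the next poll; `detected`
    says whether the monitor notices that completion.
    Returns (completed, still_running, never_submitted).
    """
    pending = deque(task_ids)
    running: List[int] = []
    completed: List[int] = []
    for _ in range(max_polls):
        # Fill free slots, never exceeding queue_size running jobs.
        while pending and len(running) < queue_size:
            running.append(pending.popleft())
        # Poll: only completions that are detected free a slot.
        done = [t for t in running if detected(t)]
        if not done:
            # No detected completions: either all work is done, or every slot
            # is held by a job whose completion was never noticed (the stall).
            break
        for t in done:
            running.remove(t)
            completed.append(t)
    return completed, running, list(pending)
```

With `queue_size=3` and a `detected` function that misses tasks 0 to 2, nothing completes and tasks 3 to 9 are never submitted, which mirrors all 100 queueSize slots being held by jobs that finished but were never recognized.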

Steps to reproduce the problem

Concretely, I use queueSize to limit my workflow to 100 concurrent jobs. Some of those jobs finish (a .exitcode file with the value 0 exists in the task's local work dir, e.g. work/76/69b9b4...), BUT they are still listed as RUNNING in the .nextflow.log file (see the last screenshot).
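To check which tasks have actually finished on disk, the work directory can be scanned for .exitcode files. A minimal sketch, assuming the default two-level layout work/<xx>/<hash>/ and that .exitcode holds a single integer; `find_exit_codes` is a hypothetical helper, not part of Nextflow.

```python
from pathlib import Path
from typing import Dict, Optional


def find_exit_codes(work_dir: str = "work") -> Dict[str, Optional[int]]:
    """Map each task directory under work_dir to the status in its .exitcode.

    Assumes the layout work/<xx>/<hash>/.exitcode; an empty .exitcode file
    is reported as None rather than as a status.
    """
    results: Dict[str, Optional[int]] = {}
    for exitcode_file in sorted(Path(work_dir).glob("*/*/.exitcode")):
        text = exitcode_file.read_text().strip()
        results[str(exitcode_file.parent)] = int(text) if text else None
    return results
```

For example, `[d for d, s in find_exit_codes().items() if s == 0]` lists task directories that finished successfully on disk; comparing that list with the tasks .nextflow.log still reports as RUNNING would show which completions were missed.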

Looking through the log file, I can see that SLURM job 636541 was submitted at 15:17 on March 2. It finished and created the .exitcode file (exit status 0) at 18:10, and SLURM considers the job completed from that point on.
[Screenshot: SLURM reports the job as finished]

However, as of yesterday (March 3), Nextflow still has not marked this job as completed. As a result, none of the remaining ~7,000 tasks in the submission queue are being submitted to SLURM:

Mar-03 05:26:48.603 [Task submitter] DEBUG n.processor.TaskPollingMonitor - %% executor slurm > tasks in the submission queue: 6714 -- tasks to be submitted are shown below
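To see whether this backlog ever shrinks, the queue size can be pulled out of every such line in .nextflow.log. A minimal sketch, assuming the line format matches the excerpt above (other Nextflow versions may word it differently); `submission_queue_sizes` is a hypothetical helper.

```python
import re
from typing import Iterable, Iterator, Tuple

# Matches TaskPollingMonitor debug lines like the excerpt above, e.g.
# "Mar-03 05:26:48.603 ... %% executor slurm > tasks in the submission queue: 6714 -- ..."
QUEUE_LINE = re.compile(
    r"^(?P<ts>\w{3}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*"
    r"%% executor (?P<executor>\S+) > tasks in the submission queue: (?P<count>\d+)"
)


def submission_queue_sizes(lines: Iterable[str]) -> Iterator[Tuple[str, str, int]]:
    """Yield (timestamp, executor, queued_task_count) for each matching log line."""
    for line in lines:
        m = QUEUE_LINE.search(line)
        if m:
            yield m["ts"], m["executor"], int(m["count"])
```

Iterating over an open .nextflow.log file handle yields one tuple per matching line; a count that stays flat for hours is consistent with the stall described here.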

[Screenshot: task work directory]
[Screenshot: nextflow log]

Any help would be appreciated. Is there a reason the completed jobs aren't being recognized as finished, and is there anything I can do to force Nextflow to re-check them? Thanks in advance.

Environment

nextflow info
Version: 21.10.6 build 5660
Created: 21-12-2021 16:55 UTC (17:55 CEST)
System: Linux 4.18.0-348.12.2.el8_5.x86_64
Runtime: Groovy 3.0.9 on OpenJDK 64-Bit Server VM 10.0.2+13
Encoding: UTF-8 (UTF-8)

shell: bash
