Solve issue with losing track of SLURM jobs using cephfs #5976

Open · riederd wants to merge 8 commits into master

Conversation


@riederd riederd commented Apr 15, 2025

The changes solve an issue with losing track of SLURM jobs on systems
using cephfs when the number of jobs is high.

Test case: running a pipeline that creates thousands of jobs on a
SLURM cluster with a `queueSize` of 600 and cephfs (Reef 18.2.2) as the
shared filesystem for the Nextflow work directories.

There are existing open issues and mentions of this problem on the
GitHub issue tracker as well, e.g. #2695, #5630, #5650.

As @burcarjo already tried in #5650, declaring `ceph` as a shared
filesystem does not solve the issue: at least in our tests, the relevant
code that should trigger a metadata refresh is never reached.

The proposed fix performs a metadata refresh in multiple locations of the
code. In our tests we were able to run a pipeline with > 21k jobs without
any hang; without the proposed changes there was no way to get this
pipeline to completion in a single run.

Since the proposed changes refresh the metadata nearly once per job in
`GridTaskHandler.groovy`, we also tried running without the refresh in
this part; however, that resulted in stalling again.
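
The refresh described above is, conceptually, a forced re-read of directory metadata before Nextflow probes a task's exit file (`.exitcode` in the task work directory). The minimal Groovy/NIO sketch below only illustrates the idea; the helper name and paths are made up and this is not the code added by the PR.

```groovy
// Illustrative sketch only -- not the code added by this PR.
// Listing a directory forces the cephfs client to pull fresh directory
// entries instead of relying on possibly stale cached metadata.
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.Paths

static void refreshDirListing(Path dir) {
    // opening and iterating the directory stream re-reads the entries
    Files.newDirectoryStream(dir).withCloseable { stream ->
        stream.each { /* no-op: visiting the entries is enough */ }
    }
}

// hypothetical usage before checking whether a job has finished
def workDir = Paths.get('/cephfs/work/ab/0123456789abcdef')   // placeholder path
refreshDirListing(workDir)
def exitFile = workDir.resolve('.exitcode')
if( Files.exists(exitFile) )
    println "exit status: ${Files.readString(exitFile).trim()}"   // Java 11+
```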


jorgee and others added 6 commits April 16, 2025 15:28
@bentsherman bentsherman removed the request for review from a team April 16, 2025 13:48
Member

bentsherman commented Apr 16, 2025

So you did an ablation analysis and found that all of these refreshes were required to prevent the jobs from hanging?

Author

riederd commented Apr 16, 2025

Yeah, basically yes. I ran an ablation analysis and looked at the `-trace` output to identify possible places where we could try to refresh the metadata.

@pditommaso
Member

I'm very hesitant regarding this PR: with a large number of jobs, continuously syncing the file system can lead to network congestion.

Author

riederd commented Apr 17, 2025

I understand. In my tests the number of syncs/directory listings roughly corresponded to the number of submitted jobs (21k jobs ~ 20k refreshes), and we did not see any network issues.

I can try to omit the sync call, though, and only do the directory listing.
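
As a purely illustrative aside (assumed helper names and sync command; the PR's actual sync mechanism is not shown here), the two variants discussed above could be sketched in Groovy roughly as follows: variant A shells out to coreutils' `sync -f` before listing the directory, while variant B does the directory listing only.

```groovy
import java.nio.file.Files
import java.nio.file.Path

// Variant A: explicit filesystem sync (assumed here to use coreutils `sync -f`)
// followed by a directory listing -- heavier on the cluster and network.
static void refreshWithSync(Path dir) {
    ['sync', '-f', dir.toString()].execute().waitFor()
    Files.newDirectoryStream(dir).withCloseable { it.each { } }
}

// Variant B: directory listing only -- often enough to invalidate the
// cephfs client's stale directory cache, and cheaper to run per job.
static void refreshListingOnly(Path dir) {
    Files.newDirectoryStream(dir).withCloseable { it.each { } }
}
```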
