Solve issue with losing track of SLURM jobs using cephfs #5976

Open · riederd wants to merge 8 commits into master

Conversation


@riederd riederd commented Apr 15, 2025

The changes solve an issue with losing track of SLURM jobs on systems
using cephfs when the number of jobs is high.

Test case: running a pipeline that creates thousands of jobs on a
SLURM cluster with a `queueSize` of 600 and cephfs (Reef 18.2.2) as the
shared filesystem for the Nextflow work directories.

There are existing open issues and mentions of this problem on the
GitHub issue tracker as well, e.g. #2695, #5630, #5650.

As @burcarjo already tried in #5650, declaring `ceph` as a shared
filesystem does not solve the issue: at least in our tests, the relevant
code that should trigger a metadata refresh is never reached.

The proposed fix performs a metadata refresh in multiple locations of the
code. In our tests we were able to run a pipeline with > 21k jobs without
any hang; without the proposed changes there was no way to get this
pipeline to completion in a single run.

Since the proposed changes refresh the metadata nearly once per job in
`GridTaskHandler.groovy`, we also tried running without the refresh in
this part; however, that resulted in stalling again.
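
The refresh described above is, conceptually, a forced re-read of directory metadata before Nextflow probes a task's exit file (`.exitcode` in the task work directory). The minimal Groovy/NIO sketch below only illustrates the idea; the helper name and paths are made up and this is not the code added by the PR.

```groovy
// Illustrative sketch only -- not the code added by this PR.
// Listing a directory forces the cephfs client to pull fresh directory
// entries instead of relying on possibly stale cached metadata.
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.Paths

static void refreshDirListing(Path dir) {
    // opening and iterating the directory stream re-reads the entries
    Files.newDirectoryStream(dir).withCloseable { stream ->
        stream.each { /* no-op: visiting the entries is enough */ }
    }
}

// hypothetical usage before checking whether a job has finished
def workDir = Paths.get('/cephfs/work/ab/0123456789abcdef')   // placeholder path
refreshDirListing(workDir)
def exitFile = workDir.resolve('.exitcode')
if( Files.exists(exitFile) )
    println "exit status: ${Files.readString(exitFile).trim()}"   // Java 11+
```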


jorgee and others added 6 commits April 16, 2025 15:28
@bentsherman bentsherman removed the request for review from a team April 16, 2025 13:48
Member

bentsherman commented Apr 16, 2025

So you did an ablation analysis and found that all of these refreshes were required to prevent the jobs from hanging?

Author

riederd commented Apr 16, 2025

Yeah, basically yes. I ran an ablation analysis and looked at the `-trace` output to identify possible places where we could try to refresh the metadata.

@pditommaso
Member

I'm very hesitant regarding this PR: with a large number of jobs, continuously syncing the file system can lead to network congestion.

Author

riederd commented Apr 17, 2025

I understand. In my tests the number of syncs/directory listings roughly corresponded to the number of submitted jobs (21k jobs ~ 20k refreshes), and we did not see any network issues.

I can try to omit the sync call, though, and only do the directory listing.
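
As a purely illustrative aside (assumed helper names and sync command; the PR's actual sync mechanism is not shown here), the two variants discussed above could be sketched in Groovy roughly as follows: variant A shells out to coreutils' `sync -f` before listing the directory, while variant B does the directory listing only.

```groovy
import java.nio.file.Files
import java.nio.file.Path

// Variant A: explicit filesystem sync (assumed here to use coreutils `sync -f`)
// followed by a directory listing -- heavier on the cluster and network.
static void refreshWithSync(Path dir) {
    ['sync', '-f', dir.toString()].execute().waitFor()
    Files.newDirectoryStream(dir).withCloseable { it.each { } }
}

// Variant B: directory listing only -- often enough to invalidate the
// cephfs client's stale directory cache, and cheaper to run per job.
static void refreshListingOnly(Path dir) {
    Files.newDirectoryStream(dir).withCloseable { it.each { } }
}
```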
