Solve issue with losing track of SLURM jobs using cephfs #5976
base: master
Conversation
The changes solve an issue with losing track of SLURM jobs on systems using cephfs when the number of jobs is high.

Test case: running a pipeline that creates thousands of jobs on a SLURM cluster with a queue size of 600 and cephfs (Reef 18.2.2) as the shared filesystem for the Nextflow work directories.

There are existing open issues and mentions of this problem on the GitHub issue tracker as well, e.g. nextflow-io#2695, nextflow-io#5630, nextflow-io#5650. As @burcarjo already tried in nextflow-io#5650, adding `ceph` as a shared filesystem does not solve the issue because, at least in our tests, the relevant code that should trigger a metadata refresh is never reached.

The proposed fix performs a metadata refresh in multiple locations of the code, and in our tests we were able to run a pipeline with more than 21k jobs without any hang. Without the proposed changes there was no way to get this pipeline to completion in a single run. As the proposed changes refresh the metadata nearly once per job in `GridTaskHandler.groovy`, we also tried without the refresh in that part, but it resulted in stalling again.
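For context, the sketch below (a minimal, hypothetical Groovy snippet, not taken from the actual patch) illustrates the kind of metadata refresh the description refers to: enumerating the work directory so the cephfs client revalidates its cached metadata before checking whether a task's output file exists. The helper name `checkWithMetadataRefresh` is invented for illustration.

```groovy
import java.nio.file.Files
import java.nio.file.Path

/*
 * Illustrative sketch only -- not the actual patch. On cephfs (and other
 * network file systems) the client may serve stale metadata, so a file such
 * as the task exit file can exist on the server while the client still
 * reports it as missing. Enumerating the parent directory forces the client
 * to revalidate its metadata cache before the existence check.
 */
static boolean checkWithMetadataRefresh(Path file) {
    final dir = file.parent
    if( dir != null && Files.isDirectory(dir) ) {
        Files.newDirectoryStream(dir).withCloseable { stream ->
            // touching each entry triggers a metadata refresh on most
            // network file system clients
            stream.each { }
        }
    }
    return Files.exists(file)
}
```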
✅ Deploy Preview for nextflow-docs-staging canceled.
So you did an ablation analysis and found that all of these refreshes were required to prevent the jobs from hanging?
Yeah, basically yes, I did run an ablation analysis and looked at the
I'm very hesitant regarding this PR. With a large number of jobs, continuously syncing the file system can lead to network congestion.
I understand. In my tests the number of syncs/directory listings roughly corresponded to the number of submitted jobs (21k jobs ~ 20k refreshes), and we did not see any network issues. I can try to omit the
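As an illustration of the trade-off being discussed, one way to reduce the number of forced directory listings would be to gate the refresh behind a plain existence check, so it only runs when the client's view appears stale. The sketch below is hypothetical (helper name invented) and does not reflect the actual change under review.

```groovy
import java.nio.file.Files
import java.nio.file.Path

// Hypothetical sketch: refresh only when the plain existence check misses,
// so jobs whose exit file is already visible cause no extra metadata traffic.
static boolean existsWithFallbackRefresh(Path file) {
    if( Files.exists(file) )
        return true                         // fast path, no forced refresh
    final dir = file.parent
    if( dir != null && Files.isDirectory(dir) ) {
        Files.newDirectoryStream(dir).withCloseable { stream ->
            stream.each { }                 // enumerate entries to revalidate cached metadata
        }
    }
    return Files.exists(file)               // re-check after the refresh
}
```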