Replies: 1 comment
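Hi, Slurm threw the error "slurm_load_jobs error: Unexpected message received". You can validate it by manually running the status command (squeue -o "%.18i %.2t" -j 533025), and I suggest contacting your HPC administrator, since this error message does not seem to be expected.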
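If it helps, here is a minimal sketch for reproducing the check outside dpgen. It runs the same squeue status command that appears in the traceback and prints the return code and stderr, so you can see whether Slurm itself is failing. The helper names `build_status_command` and `check_job` are made up for this example and are not part of dpdispatcher; the job ID is the one from the log.

```python
import shlex
import subprocess


def build_status_command(job_id: str) -> list[str]:
    """Return the squeue invocation from the traceback as an argument list."""
    return shlex.split(f'squeue -o "%.18i %.2t" -j {job_id}')


def check_job(job_id: str, run=subprocess.run) -> tuple[int, str, str]:
    """Run the status command once and return (return code, stdout, stderr).

    `run` is injectable only so the helper can be exercised without Slurm.
    """
    try:
        result = run(
            build_status_command(job_id),
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        # squeue is not installed or not on PATH on this machine.
        return 127, "", "squeue not found on PATH"
    return result.returncode, result.stdout.strip(), result.stderr.strip()


if __name__ == "__main__":
    # Job ID taken from the error message; replace it with a current one.
    code, out, err = check_job("533025")
    print(f"return code: {code}")
    print(f"stdout: {out}")
    print(f"stderr: {err}")
```

If this also returns a non-zero code with the same slurm_load_jobs message when run directly on the login node, the problem is most likely on the Slurm side rather than in dpgen, which is worth mentioning to the administrator.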
When I execute the command "nohup dpgen init_bulk param.json machine.json > 111.out &", after a while, dpgen stops by itself and reports an error.
The file 111.out shows the following:
INFO:dpgen:# working dir CuZr_FCC_32atom6.POSCAR.01x01x01
INFO:dpgen:Elements are Cu Zr
INFO:dpgen:Current stage is 1, relax
INFO:dpgen:[26, 6]
DeepModeling
Version: 0.13.0
Path: /home/export/base/ycsc_hzw/hd100168/online1/anaconda3/lib/python3.12/site-packages/dpgen
Dependency
pymatgen unknown version or path
monty 2025.1.9 /home/export/base/ycsc_hzw/hd100168/online1/anaconda3/lib/python3.12/site-packages/monty
ase 3.24.0 /home/export/base/ycsc_hzw/hd100168/online1/anaconda3/lib/python3.12/site-packages/ase
paramiko 3.5.1 /home/export/base/ycsc_hzw/hd100168/online1/anaconda3/lib/python3.12/site-packages/paramiko
custodian 2025.1.24 /home/export/base/ycsc_hzw/hd100168/online1/anaconda3/lib/python3.12/site-packages/custodian
Reference
Please cite:
Yuzhi Zhang, Haidi Wang, Weijie Chen, Jinzhe Zeng, Linfeng Zhang, Han Wang, and Weinan E,
DP-GEN: A concurrent learning platform for the generation of reliable deep learning
based potential energy models, Computer Physics Communications, 2020, 107206.
Description
2025-02-11 16:25:13,608 - INFO : info:check_all_finished: False
2025-02-11 16:25:13,637 - INFO : job: 399dc08c80293ea4336f90bce90c8164594479ea submit; job_id is 533023
2025-02-11 16:29:44,870 - INFO : job: 399dc08c80293ea4336f90bce90c8164594479ea 533023 finished
INFO:dpgen:Current stage is 2, perturb and scale
INFO:dpgen:Current stage is 3, run a short md
2025-02-11 16:30:02,625 - INFO : info:check_all_finished: False
2025-02-11 16:30:02,744 - INFO : job: e084075d4a1d20c27f0bd0671b052d38bceb17d6 submit; job_id is 533025
2025-02-11 16:30:02,775 - INFO : job: 31b081e3e2395fb0dbd3608b07b4d2771cce70c9 submit; job_id is 533026
2025-02-11 16:30:02,799 - INFO : job: 1bc8920fa799de78a5a9836bf4aeaeb4b98aff03 submit; job_id is 533027
2025-02-11 16:30:02,821 - INFO : job: fd0c8a64a8e4a86a3b81319b90b36a0fee956688 submit; job_id is 533028
2025-02-11 16:30:02,845 - INFO : job: ac9ea12e9e26bc398a48215edcbc207a71315294 submit; job_id is 533029
2025-02-11 16:30:02,868 - INFO : job: 885276d4a3494c67350e5d0bb5007448fe7e72c9 submit; job_id is 533030
Traceback (most recent call last):
File "/home/export/base/ycsc_hzw/hd100168/online1/anaconda3/bin/dpgen", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/export/base/ycsc_hzw/hd100168/online1/anaconda3/lib/python3.12/site-packages/dpgen/main.py", line 255, in main
args.func(args)
File "/home/export/base/ycsc_hzw/hd100168/online1/anaconda3/lib/python3.12/site-packages/dpgen/data/gen.py", line 1554, in gen_init_bulk
run_vasp_md(jdata, mdata)
File "/home/export/base/ycsc_hzw/hd100168/online1/anaconda3/lib/python3.12/site-packages/dpgen/data/gen.py", line 1375, in run_vasp_md
submission.run_submission()
File "/home/export/base/ycsc_hzw/hd100168/online1/anaconda3/lib/python3.12/site-packages/dpdispatcher/submission.py", line 259, in run_submission
self.update_submission_state()
File "/home/export/base/ycsc_hzw/hd100168/online1/anaconda3/lib/python3.12/site-packages/dpdispatcher/submission.py", line 343, in update_submission_state
job.get_job_state()
File "/home/export/base/ycsc_hzw/hd100168/online1/anaconda3/lib/python3.12/site-packages/dpdispatcher/submission.py", line 824, in get_job_state
job_state = self.machine.check_status(self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/export/base/ycsc_hzw/hd100168/online1/anaconda3/lib/python3.12/site-packages/dpdispatcher/utils/utils.py", line 183, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/export/base/ycsc_hzw/hd100168/online1/anaconda3/lib/python3.12/site-packages/dpdispatcher/machines/slurm.py", line 148, in check_status
raise RuntimeError(
RuntimeError: status command squeue -o "%.18i %.2t" -j 533025 fails to execute.job_id:533025
error message:slurm_load_jobs error: Unexpected message received
return code 1
Currently, the SLURM_VERSION I am using is 23.11.8.
I ran it several times before without any issues, but since the HPC cluster was recently shut down for maintenance, it has been reporting this error every time.