Description
Currently, the default way we build `OpenMPI`, `MPICH`, etc. works well for `InfiniBand` systems, but shows very poor performance on `Omni-Path` (we saw between 2x and 10x worse bandwidth and latency in benchmarks on our system).
It would be good to figure out a way to improve `Omni-Path` support in EasyBuild (perhaps through a configuration option?); at a minimum, we should improve documentation.
relevant PRs to date:
- [#20501] `PSM2` dependency added to recent `libfabric` easyconfigs
- [#20585] `PSM2` dependency made conditional on having `x86_64`
- [#20794] previous changes effectively undone, by commenting `PSM2` dependency back out due to `CUDA` build dependency
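
For illustration, the conditional dependency from [#20585] can be sketched roughly as follows. This is a minimal sketch, not the actual easyconfig contents: the helper name `libfabric_dependencies`, the `numactl` entry, and the `<version>` placeholders are assumptions made for the example.

```python
import platform


def libfabric_dependencies(arch=None):
    """Return a dependency list that only includes PSM2 on x86_64.

    Hypothetical helper: names and '<version>' placeholders are
    illustrative, not taken from a real easyconfig.
    """
    # Fall back to the architecture of the build host if none is given.
    arch = platform.machine() if arch is None else arch

    dependencies = [("numactl", "<version>")]

    # Only add PSM2 on x86_64, as the conditional in #20585 does.
    if arch == "x86_64":
        dependencies.append(("PSM2", "<version>"))

    return dependencies
```

On an `x86_64` host the list includes `PSM2`; on any other architecture it is left out.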
further info/ideas:
- `Omni-Path` systems should use either `PSM2` or `opx`
- `PSM2` can be either stand-alone, or via `libfabric`
- `opx` is a `libfabric` provider; drop-in replacement for `PSM2`
- Cornelis' plan is to move away from `PSM2` (the upcoming 400G adapters will only support `opx`)
- no benefit (only additional overhead) from using `UCX` with `Omni-Path`
- Cornelis' documentation currently recommends using `PSM2`:
  > For best performance, Cornelis recommends that you use the PSM2, the high performance interface to the OPX Fabric. This is accomplished using the Open Fabrics Interface (OFI) MPI fabric setting -genv I_MPI_FABRICS=ofi and ensure that FI_PROVIDER=psm2.
  - source: Cornelis_OPX_Performance_Tuning_UG_H93143_v25_0.pdf (March 2024)
- [#20794 comment] suggestion by @bartoldeman to patch `PSM2` in order to drop the `CUDA` build dependency
  - matches current approach for `OpenMPI`, by including some minimal CUDA prototypes (since `PSM2` will `dlopen('libcuda.so.1')` at runtime)
- [#20794 comment] suggestion by @boegel to perhaps implement "opt-in hooks that are part of EasyBuild framework, which can be enabled selectively?"
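
To make the opt-in hooks idea more concrete, here is a minimal sketch of what such a hook could look like as a hooks module passed to `eb --hooks`. It is not an existing part of the EasyBuild framework. The environment variable `EB_HOOKS_OMNIPATH` and the helper `omnipath_enabled` are hypothetical names, and the `--enable-psm2` option is assumed from libfabric's `--enable-<provider>` configure pattern, so it should be checked against the libfabric version being built.

```python
import os

# Hypothetical opt-in switch; EasyBuild itself does not define this variable.
OPT_IN_ENV_VAR = "EB_HOOKS_OMNIPATH"


def omnipath_enabled(environ=None):
    """Return True if the site has opted in to the Omni-Path tweaks."""
    environ = os.environ if environ is None else environ
    return environ.get(OPT_IN_ENV_VAR, "").lower() in ("1", "true", "yes")


def pre_configure_hook(self, *args, **kwargs):
    """Append the PSM2 provider option when configuring libfabric (opt-in only)."""
    if self.name == "libfabric" and omnipath_enabled():
        # Assumed configure option; verify it for the libfabric version in use.
        self.cfg.update("configopts", "--enable-psm2")
```

With this sketch, sites without Omni-Path hardware see no change, while an Omni-Path site would set `EB_HOOKS_OMNIPATH=1` before running `eb --hooks=<path to this file> ...`. The `PSM2` library would still need to be available at build time, which this example does not handle.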