### Outline & Motivation
Hi all! First off, thanks for your work on the library; it's great!
#### Existing code

The `_SubprocessScriptLauncher` results in a process hierarchy as follows, for N GPUs:

```
LocalRank=0
├── LocalRank=1
├── LocalRank=2
├── LocalRank=3
├── ...
└── LocalRank=N-1
```

where `LocalRank=0` launches each of `LocalRank=1` to `LocalRank=N-1` via `subprocess.Popen`.
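
For concreteness, the launch logic is roughly equivalent to the following. This is a simplified sketch, not the actual Lightning implementation; `launch_children` and the re-exec of `sys.argv` are illustrative assumptions:

```python
import os
import subprocess
import sys

def launch_children(num_processes: int) -> list[subprocess.Popen]:
    """Re-launch this script N-1 times from LocalRank=0 (simplified sketch)."""
    children = []
    for local_rank in range(1, num_processes):
        env = os.environ.copy()
        env["LOCAL_RANK"] = str(local_rank)  # note: rank 0 itself never gets LOCAL_RANK
        children.append(subprocess.Popen([sys.executable] + sys.argv, env=env))
    return children  # exit codes are only readable from this parent process
```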
#### Issues

This can make a few things challenging, namely:

- Defunct/zombie processes: if any of `LocalRank=1` to `N-1` fail, the failed process enters a zombie state, as `LocalRank=0` (its parent) never reads its exit code (a minimal reproduction is sketched after this list). This means that a container running multi-GPU training will not exit on the failure of a non-zero rank until DDP times out. For users where termination frees up expensive compute resources, this is problematic.
- Exit code checking: in Unix you can only read the exit code of a process from its parent. As a result, the user cannot read the exit codes of `LocalRank=1..N-1`.
- Local rank environment variable: a user can only identify `LocalRank=0` by the absence of an environment variable. This makes any code relying on this less robust, especially if it also runs in CPU-only environments where the environment variable is legitimately missing.
- Process re-parenting: if `LocalRank=0` fails, then `LocalRank=1,..,N-1` are re-parented to `pid=1`. This makes cleaning up the processes a bit more challenging.
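
To make the zombie behaviour concrete, here is a minimal reproduction sketch (Unix-only, independent of Lightning):

```python
import os
import subprocess
import sys
import time

# Spawn a child that fails immediately.
child = subprocess.Popen([sys.executable, "-c", "raise SystemExit(1)"])
time.sleep(1)

# Until the parent reaps it, the child lingers as <defunct> (state Z):
os.system(f"ps -o pid,stat,comm -p {child.pid}")

# A sibling/unrelated process calling os.waitpid(child.pid, 0) would get
# ChildProcessError; only the parent can reap the child and read its code.
print("exit code:", child.wait())  # -> 1
```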
#### Proposal

There are two main approaches to rectify this:

- Add a thread in `LocalRank=0` that checks the status of the remaining processes and tidies them all up on a failure. I'm not sure I like this, however, since the thread isn't strictly guaranteed to run promptly due to the GIL.
- Refactor the process structure to be something like:
  ```
  Watcher
  ├── LocalRank=0
  ├── LocalRank=1
  ├── LocalRank=2
  ├── LocalRank=3
  ├── ...
  └── LocalRank=N-1
  ```
  where the role of the `Watcher` process is to:

  a) launch the processes for each of the local ranks,
  b) monitor the status of those processes, and
  c) terminate all the local rank processes upon termination of any single one of them (and exit with the same exit code).

This would address all of the issues above, and we could also explicitly set `LOCAL_RANK=0` in the zeroth process. A rough sketch of what such a watcher could look like is below.
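
For illustration only, a minimal sketch of the `Watcher` idea (all names here are hypothetical; the real implementation would live in the launcher and would also need a graceful-kill escalation and signal forwarding):

```python
import os
import subprocess
import sys
import time

def watch(command: list[str], num_processes: int) -> int:
    """Launch all local ranks as direct children of this process and babysit them."""
    procs = []
    for local_rank in range(num_processes):
        env = os.environ.copy()
        env["LOCAL_RANK"] = str(local_rank)  # set explicitly, even for rank 0
        procs.append(subprocess.Popen(command, env=env))

    while True:
        for proc in procs:
            code = proc.poll()
            if code is None:
                continue
            # One rank exited: tear down the rest and propagate its exit code.
            for other in procs:
                if other.poll() is None:
                    other.terminate()  # real code would escalate to kill() after a grace period
            for other in procs:
                other.wait()  # reap everything, so no zombies survive
            return code
        time.sleep(1)

if __name__ == "__main__":
    # e.g. `python watcher.py train.py --epochs 10` with 4 GPUs
    sys.exit(watch([sys.executable] + sys.argv[1:], num_processes=4))
```

Since every rank is a direct child of the watcher, exit codes are always readable, nothing is left as a zombie, and a crash of rank 0 no longer re-parents the other ranks to `pid=1`.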
### Pitch

I feel that the second proposal would add some robustness to multi-GPU training, especially in the cases described above.

I would be happy to submit a PR that does this, either as a new `_Launcher` implementation or as a refactor of `_SubprocessScriptLauncher`, if this sounds reasonable. Thanks!
### Additional context
No response