### Outline & Motivation
Hi all! First off, thanks for your work on the library; it's great!
#### Existing code

The `_SubprocessScriptLauncher` results in a process hierarchy as follows, for N GPUs:

```
LocalRank=0
├── LocalRank=1
├── LocalRank=2
├── LocalRank=3
├── ...
└── LocalRank=N-1
```

where `LocalRank=0` launches each of `LocalRank=1` to `LocalRank=N-1` via `subprocess.Popen`.
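
For concreteness, the launch logic is roughly equivalent to the following. This is a simplified sketch, not the actual Lightning implementation; `launch_children` and the re-exec of `sys.argv` are illustrative assumptions:

```python
import os
import subprocess
import sys

def launch_children(num_processes: int) -> list[subprocess.Popen]:
    """Re-launch this script N-1 times from LocalRank=0 (simplified sketch)."""
    children = []
    for local_rank in range(1, num_processes):
        env = os.environ.copy()
        env["LOCAL_RANK"] = str(local_rank)  # note: rank 0 itself never gets LOCAL_RANK
        children.append(subprocess.Popen([sys.executable] + sys.argv, env=env))
    return children  # exit codes are only readable from this parent process
```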
#### Issues

This can make a few things challenging, namely:

- Defunct/zombie processes: if any of `LocalRank=1` to `N-1` fail, the failed process enters a zombie state, as `LocalRank=0` (its parent) never reads its exit code (a minimal reproduction is sketched after this list). This means that a container running multi-GPU training will not exit on the failure of a non-zero rank until DDP times out. For users where termination frees up expensive compute resources, this is problematic.
- Exit code checking: in Unix you can only read the exit code of a process from its parent. As a result, the user cannot read the exit codes of `LocalRank=1..N-1`.
- Local rank environment variable: a user can only identify `LocalRank=0` by the absence of an environment variable. This makes any code relying on this less robust, especially if it also runs in CPU-only environments where the environment variable is legitimately missing.
- Process re-parenting: if `LocalRank=0` fails, then `LocalRank=1,..,N-1` are re-parented to `pid=1`. This makes cleaning up the processes a bit more challenging.
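
To make the zombie behaviour concrete, here is a minimal reproduction sketch (Unix-only, independent of Lightning):

```python
import os
import subprocess
import sys
import time

# Spawn a child that fails immediately.
child = subprocess.Popen([sys.executable, "-c", "raise SystemExit(1)"])
time.sleep(1)

# Until the parent reaps it, the child lingers as <defunct> (state Z):
os.system(f"ps -o pid,stat,comm -p {child.pid}")

# A sibling/unrelated process calling os.waitpid(child.pid, 0) would get
# ChildProcessError; only the parent can reap the child and read its code.
print("exit code:", child.wait())  # -> 1
```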
#### Proposal

There are two main approaches to rectify this:

- Add a thread in `LocalRank=0` that checks the status of the remaining processes and tidies them all up on a failure. I'm not sure I like this, however, since the thread isn't strictly guaranteed to run promptly due to the GIL.
- Refactor the process structure to be something like:
  ```
  Watcher
  ├── LocalRank=0
  ├── LocalRank=1
  ├── LocalRank=2
  ├── LocalRank=3
  ├── ...
  └── LocalRank=N-1
  ```
  where the role of the `Watcher` process is to:

  a) launch the processes for each of the local ranks,
  b) monitor the status of those processes, and
  c) terminate all the local rank processes upon termination of any single one of them (and exit with the same exit code).

This would address all of the issues above, and we could also explicitly set `LOCAL_RANK=0` in the zeroth process. A rough sketch of what such a watcher could look like is below.
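
For illustration only, a minimal sketch of the `Watcher` idea (all names here are hypothetical; the real implementation would live in the launcher and would also need a graceful-kill escalation and signal forwarding):

```python
import os
import subprocess
import sys
import time

def watch(command: list[str], num_processes: int) -> int:
    """Launch all local ranks as direct children of this process and babysit them."""
    procs = []
    for local_rank in range(num_processes):
        env = os.environ.copy()
        env["LOCAL_RANK"] = str(local_rank)  # set explicitly, even for rank 0
        procs.append(subprocess.Popen(command, env=env))

    while True:
        for proc in procs:
            code = proc.poll()
            if code is None:
                continue
            # One rank exited: tear down the rest and propagate its exit code.
            for other in procs:
                if other.poll() is None:
                    other.terminate()  # real code would escalate to kill() after a grace period
            for other in procs:
                other.wait()  # reap everything, so no zombies survive
            return code
        time.sleep(1)

if __name__ == "__main__":
    # e.g. `python watcher.py train.py --epochs 10` with 4 GPUs
    sys.exit(watch([sys.executable] + sys.argv[1:], num_processes=4))
```

Since every rank is a direct child of the watcher, exit codes are always readable, nothing is left as a zombie, and a crash of rank 0 no longer re-parents the other ranks to `pid=1`.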
### Pitch

I feel that the second proposal would add some robustness to multi-GPU training, especially in the cases described above.

I would be happy to submit a PR that does this, either as a new `_Launcher` implementation or as a refactor of `_SubprocessScriptLauncher`, if this sounds reasonable. Thanks!
### Additional context
No response