
Refactor _SubprocessScriptLauncher process launching strategy #17248

Open
@ymohamedahmed

Description

Outline & Motivation

Hi all, firstly thanks for your work on the library, it's great!

Existing code

The _SubprocessScriptLauncher produces the following process hierarchy for N GPUs:

LocalRank=0
├── LocalRank=1
├── LocalRank=2
├── LocalRank=3
├── ...
└── LocalRank=N-1

where LocalRank=0 launches each of LocalRank=1 to LocalRank=N-1 via subprocess.Popen.
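For concreteness, the launch step amounts to roughly the following (a simplified sketch only; the names and details below do not match the actual Lightning implementation):

```python
# Simplified sketch of the current strategy (illustrative, not Lightning's code):
# rank 0 re-launches the same training command once per remaining local rank.
import os
import subprocess
import sys


def launch_children(num_processes: int) -> list:
    procs = []
    for local_rank in range(1, num_processes):
        env = os.environ.copy()
        env["LOCAL_RANK"] = str(local_rank)  # rank 0 itself gets no LOCAL_RANK
        procs.append(subprocess.Popen([sys.executable] + sys.argv, env=env))
    return procs  # note: nothing here waits on or reaps these children
```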

Issues

This can make a few things challenging, namely:

  • Defunct/zombie processes: if any of LocalRank=1 to N-1 fails, it becomes a zombie, because LocalRank=0 (its parent) never reads its exit code. This means a container running multi-GPU training will not exit on the failure of a non-zero-rank process until DDP times out, which is problematic for users who rely on termination to free expensive compute resources. (A toy reproduction follows this list.)
  • Exit code checking: on Unix, a process's exit code can only be read by its parent. As a result, the user cannot read the exit codes of LocalRank=1..N-1.
  • Local rank environment variable: a user can only identify LocalRank=0 by the absence of the LOCAL_RANK environment variable. This makes any code relying on it less robust, especially if that code also runs in CPU-only environments where the variable is legitimately missing.
  • Process re-parenting: if LocalRank=0 fails, then LocalRank=1,..,N-1 are re-parented to pid=1, which makes cleaning up the remaining processes more challenging.
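As a toy reproduction of the zombie behaviour (assuming a Unix system; nothing Lightning-specific):

```python
import subprocess
import sys
import time

# Parent spawns a child that exits immediately with a non-zero code ...
child = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(1)"])
time.sleep(1)
# ... but never calls child.wait()/child.poll(), so `ps -ef` in another shell
# now shows the child as <defunct> until the parent itself exits.
# Calling child.wait() here would reap it and expose the exit code (1).
time.sleep(60)
```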

Proposal

There are two main approaches to rectify this:

  1. Add a thread in LocalRank=0 that checks the status of the remaining processes and tidies up all of them on a failure. I'm not sure I like this, however, since the thread isn't strictly guaranteed to run in a timely manner due to the GIL.
  2. Refactor the process structure to be something like:
Watcher
├── LocalRank=0
├── LocalRank=1
├── LocalRank=2
├── LocalRank=3
├── ...
└── LocalRank=N-1

where the role of the Watcher process is to:
a) launch the processes for each of the local ranks,
b) monitor the status of those processes, and
c) terminate all of the local rank processes upon termination of any one of them (and exit with the same exit code).

This would address all of the issues above, and we could also explicitly set LOCAL_RANK=0 in the zeroth process.
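As a rough sketch of what the Watcher could look like (hypothetical code, not an existing Lightning API; it ignores timeouts, signal forwarding and logging):

```python
import os
import signal
import subprocess
import sys
import time


def watch(num_processes: int) -> None:
    # (a) launch one process per local rank, re-running the same command
    procs = []
    for local_rank in range(num_processes):
        env = os.environ.copy()
        env["LOCAL_RANK"] = str(local_rank)  # set explicitly even for rank 0
        procs.append(subprocess.Popen([sys.executable] + sys.argv, env=env))

    # (b) monitor until any one rank exits
    exit_code = 0
    try:
        while True:
            finished = [p for p in procs if p.poll() is not None]
            if finished:
                exit_code = finished[0].returncode
                break
            time.sleep(1)
    finally:
        # (c) terminate the remaining ranks and reap everything (no zombies)
        for p in procs:
            if p.poll() is None:
                p.send_signal(signal.SIGTERM)
        for p in procs:
            p.wait()
    # exit with the same code as the rank that finished first
    sys.exit(exit_code)
```

A real implementation would also need a guard so that the relaunched children don't themselves act as the Watcher, but that detail is omitted here.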

Pitch

I feel that proposal 2 would add some robustness to multi-GPU training, especially in the case described above.

I would be happy to submit a PR that does this, either as a new _Launcher implementation or as a refactor of _SubprocessScriptLauncher, if this sounds reasonable. Thanks!

Additional context

No response

cc @awaelchli @justusschock @tchaton @Borda @carmocca

Labels: design, distributed, fabric, pl, refactor
