
self.all_gather does not work on on_train_epoch_end #20683

Open

@yoniaflalo

Bug description

I am trying to compute aggregated metrics. There is a known torchmetrics issue, already tracked in #18803 (and not solved yet), so I am trying to aggregate my metric manually.

At each training and validation step I am logging my metric as follows:

self.log("train_loss", loss, on_step=True, on_epoch=True, prog_bar=True, logger=True, sync_dist=True)

I have a tensor that I want to sync across all GPUs, and inside on_validation_epoch_end and on_train_epoch_end I am trying to use:

  1. Native pytorch distributed all_gather
  2. Lightning self.all_gather

Neither works: the process hangs, as if all GPUs are waiting for rank 0 to respond.
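A minimal sketch of the pattern (the model, data shapes, and the step_losses buffer are hypothetical placeholders; only the self.log call and the epoch-end gather matter):

import torch
import torch.nn.functional as F
import lightning.pytorch as pl


class ReproModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        self.step_losses = []  # per-step values to aggregate manually at epoch end

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self.layer(x), y)
        # Logging with on_epoch=True and sync_dist=True: this is the
        # combination that makes the epoch-end gather below hang.
        self.log("train_loss", loss, on_step=True, on_epoch=True,
                 prog_bar=True, logger=True, sync_dist=True)
        self.step_losses.append(loss.detach())
        return loss

    def on_train_epoch_end(self):
        local = torch.stack(self.step_losses).mean()
        # (1) Lightning's helper -- hangs here:
        gathered = self.all_gather(local)
        # (2) Native PyTorch equivalent -- hangs the same way:
        # bufs = [torch.zeros_like(local) for _ in range(torch.distributed.get_world_size())]
        # torch.distributed.all_gather(bufs, local)
        self.step_losses.clear()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)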

EDIT:

I see that if I log with on_epoch=False, the problem does not occur:

self.log("train_loss", loss, on_step=True, on_epoch=False, prog_bar=True, logger=True, sync_dist=True)

Thanks

What version are you seeing the problem on?

v2.5

How to reproduce the bug

Error messages and logs

No response

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):

More info

No response

Metadata

Assignees

No one assigned

    Labels

    bug: Something isn't working
    needs triage: Waiting to be triaged by maintainers
    ver: 2.5.x
    won't fix: This will not be worked on
