Bug description
I am trying to compute metrics aggregated across processes. There is a torchmetrics issue already reported in #18803 (and not solved yet), so I am trying to aggregate my metric manually.
At each training and validation step I log my metric as follows:
self.log("train_loss", loss, on_step=True, on_epoch=True, prog_bar=True, logger=True, sync_dist=True)
I also have a tensor that I want to synchronize across all GPUs, and inside on_validation_epoch_end and on_train_epoch_end I have tried:
- native PyTorch distributed all_gather
- Lightning's self.all_gather
Neither works: the process hangs, as if all GPUs were waiting for rank 0 to answer (see the sketch below).
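Roughly, the epoch-end hook looks like this (a minimal sketch rather than my exact code; my_metric is a placeholder for the tensor I want to gather):

```python
import torch
import torch.distributed as dist


def on_train_epoch_end(self):
    # Placeholder: a per-rank tensor that already lives on self.device.
    my_metric = torch.tensor(1.0, device=self.device)

    # Attempt 1: native torch.distributed all_gather
    gathered = [torch.zeros_like(my_metric) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, my_metric)

    # Attempt 2: Lightning's built-in helper
    gathered = self.all_gather(my_metric)

    # With on_epoch=True and sync_dist=True in self.log, both attempts hang here.
```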
EDIT:
I noticed that if I log with on_epoch=False, the problem does not occur:
self.log("train_loss", loss, on_step=True, on_epoch=False, prog_bar=True, logger=True, sync_dist=True)
Thanks.
What version are you seeing the problem on?
v2.5
How to reproduce the bug
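Something along these lines should reproduce it (a minimal sketch, not my real code; the model, dataset, and gathered tensor are placeholders):

```python
import torch
from torch.utils.data import DataLoader, Dataset
import lightning.pytorch as pl


class RandomDataset(Dataset):
    def __init__(self, size=64, length=256):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


class ReproModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(64, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        # on_epoch=True together with sync_dist=True is what triggers the hang
        self.log("train_loss", loss, on_step=True, on_epoch=True,
                 prog_bar=True, logger=True, sync_dist=True)
        return loss

    def on_train_epoch_end(self):
        # Every rank reaches this point, then hangs as if waiting for rank 0.
        metric = torch.tensor(1.0, device=self.device)
        self.all_gather(metric)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    trainer = pl.Trainer(max_epochs=2, accelerator="gpu", devices=2, strategy="ddp")
    trainer.fit(ReproModel(), DataLoader(RandomDataset(), batch_size=8))
```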
Error messages and logs
# Error messages and logs here please
Environment
Current environment
#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
More info
No response