Bug description
This is either a bug or a misunderstanding on my part of how logging and monitoring work when using the training step. Basically, I set the early-stopping criterion (in this case on `train_loss`), and training correctly stops. However, the value that ends up logged does not match the value it stopped at.
While it is hard to get a minimal example to reproduce, my training step looks like:

```python
def training_step(self, batch, batch_idx):
    x, y = batch
    y = y.unsqueeze(1)  # to enable broadcasting of self(x)
    loss = F.mse_loss(self(x), y, reduction="mean")
    self.log("train_loss", loss, prog_bar=True)
    return loss
```
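As an aside (and this may be where my misunderstanding lies): my understanding is that `self.log` called from `training_step` defaults to `on_step=True, on_epoch=False`, so the call above should be equivalent to the explicit form below (equivalence assumed, not verified):

```python
# explicit defaults of self.log inside training_step, as I understand them
self.log("train_loss", loss, prog_bar=True, on_step=True, on_epoch=False)
```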
and the YAML config for the early stopping is

```yaml
- class_path: pytorch_lightning.callbacks.EarlyStopping
  init_args:
    monitor: train_loss  # val_loss or val_rel_error is normally a better choice, but looking at fitting/overfitting here
    min_delta: 0.0
    patience: 1000000
    mode: min
    check_finite: true
    divergence_threshold: 1e5
    stopping_threshold: 1.0e-6
    check_on_train_epoch_end: true  # but it doesn't seem to do anything there
    verbose: true
```
and I am using the W&B logger.
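For anyone who wants to reproduce this without the CLI, the equivalent callback construction in plain Python should be roughly the following (the variable name and `Trainer` wiring here are mine, not from my actual config):

```python
from pytorch_lightning.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    monitor="train_loss",
    min_delta=0.0,
    patience=1_000_000,
    mode="min",
    check_finite=True,
    divergence_threshold=1e5,
    stopping_threshold=1e-6,
    check_on_train_epoch_end=True,
    verbose=True,
)
# then passed to the trainer, e.g. Trainer(callbacks=[early_stopping], ...)
```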
With verbose on, the output correctly shows that early stopping happens, and the last few lines of output are (with a few newlines added by me):

```
Epoch 241: 100%|█████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 21.33it/s, v_num=2bn1, train_loss=1e-6, val_loss=1.28e-6, val_rel_error=0.00061, val_abs_error=0.000951]

Metric train_loss improved by 0.000 >= min_delta = 0.0. New best score: 0.000

Epoch 242: 100%|█████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 32.00it/s, v_num=2bn1, train_loss=9.95e-7, val_loss=1.26e-6, val_rel_error=0.000607, val_abs_error=0.000945]

Stopping threshold reached: train_loss = 9.949283139576437e-07 < 1e-06. Signaling Trainer to stop.

Epoch 242: 100%|█████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 32.00it/s, v_num=2bn1, train_loss=9.95e-7, val_loss=1.26e-6, val_rel_error=0.000607, val_abs_error=0.000945]
Testing DataLoader 0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 149.33it/s]
```
However... when I check the logged value (in W&B) after this executes:
```python
cli.trainer.fit(cli.model)
cli.trainer.logger.experiment.summary["train_loss"]
```
It returns `2.0675681753346e-06`. Am I misunderstanding something about logging here?
I tried it with the TensorBoard logger as well. It is harder to check programmatically, but the value `2.0675681753346e-06` is the same.
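In case it helps, here is a rough sketch of how the TensorBoard value can be read back programmatically (the log directory and tag name below are placeholders specific to a given setup):

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# "lightning_logs/version_0" is a placeholder; point this at the actual run directory
acc = EventAccumulator("lightning_logs/version_0")
acc.Reload()  # parse the event files on disk
events = acc.Scalars("train_loss")  # ScalarEvent(wall_time, step, value) entries for the tag
print(events[-1].value)  # last value written under the tag
```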
One thing of note is that this runs with the full batch (one step per epoch), so the distinction between the step and epoch values of `train_loss` should not matter much here.
If I change my training step to be:
```python
def training_step(self, batch, batch_idx):
    x, y = batch
    y = y.unsqueeze(1)  # to enable broadcasting of self(x)
    loss = F.mse_loss(self(x), y, reduction="mean")
    self.log("train_loss", loss, prog_bar=True, on_epoch=True)
    return loss
```
then the last step before early stopping occurs says:

```
Epoch 242: 100%|██████████████████████████████████████| 1/1 [00:00<00:00, 16.00it/s, v_num=rm4z, train_loss_step=9.95e-7, val_loss=1.26e-6, val_rel_error=0.000607, val_abs_error=0.000945, train_loss_epoch=9.95e-7]
Testing DataLoader 0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 100.73it/s]
```
Note that `train_loss_step` and `train_loss_epoch` are identical.
However, if I check the logger, it says:

```python
cli.trainer.fit(cli.model)
cli.trainer.logger.experiment.summary["train_loss_step"]
# 2.0675681753346e-06
cli.trainer.logger.experiment.summary["train_loss_epoch"]
# 9.949283139576437e-07
```
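To narrow down which value `EarlyStopping` actually compared against the threshold, it may also be worth printing `trainer.callback_metrics` (the dict the callback monitors) right after `fit` and comparing it with the logger summary; a sketch:

```python
# trainer.callback_metrics holds the values EarlyStopping monitors,
# so any divergence from the logger summary points at the logging path
for key in ("train_loss", "train_loss_step", "train_loss_epoch"):
    if key in cli.trainer.callback_metrics:
        print(key, cli.trainer.callback_metrics[key].item())
```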
What version are you seeing the problem on?
v2.0
How to reproduce the bug
No response
Error messages and logs
No response
Environment
No response
More info
No response