
Logged monitor value for train_loss after early stopping does not match the stopping criteria #17644

Open
@jlperla

Description


Bug description

This is either a bug or a misunderstanding on my part of how logging and monitoring work when the monitored value comes from the training step. I set the early-stopping criterion (in this case on train_loss), and training correctly stops. However, the value that ends up logged does not match the value it stopped at.

While it is hard to produce a minimal example to reproduce this, my training step looks like

    def training_step(self, batch, batch_idx):
        x, y = batch
        y = y.unsqueeze(1)  # to enable broadcasting of self(x)
        loss = F.mse_loss(self(x), y, reduction="mean")
        self.log("train_loss", loss, prog_bar=True)
        return loss
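
For reference (if I understand the defaults correctly), in training_step, self.log defaults to on_step=True, on_epoch=False, so the call above sends only the per-step value to the logger. Here is the same method with those defaults written out explicitly:

    def training_step(self, batch, batch_idx):
        x, y = batch
        y = y.unsqueeze(1)  # to enable broadcasting of self(x)
        loss = F.mse_loss(self(x), y, reduction="mean")
        # explicit defaults for training_step: per-step logging only, no epoch aggregation
        self.log("train_loss", loss, prog_bar=True, on_step=True, on_epoch=False)
        return loss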

and the YAML config for the early stopping is

    - class_path: pytorch_lightning.callbacks.EarlyStopping
      init_args:
        monitor: train_loss # val_loss or val_rel_error is normally a better choice, but looking at fitting/overfitting here.
        min_delta: 0.0
        patience: 1000000
        mode: min
        check_finite: true
        divergence_threshold: 1e5 
        stopping_threshold: 1.0e-6  
        check_on_train_epoch_end: true  # but it doesn't seem to do anything there.
        verbose: true
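
For reference, here is the equivalent callback constructed directly in Python (my translation of the YAML above, not taken verbatim from the run):

    from pytorch_lightning.callbacks import EarlyStopping

    early_stopping = EarlyStopping(
        monitor="train_loss",  # val_loss or val_rel_error is normally a better choice
        min_delta=0.0,
        patience=1000000,
        mode="min",
        check_finite=True,
        divergence_threshold=1e5,
        stopping_threshold=1.0e-6,
        check_on_train_epoch_end=True,
        verbose=True,
    )
    # passed to the trainer via Trainer(callbacks=[early_stopping], ...)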

and I am using the W&B logger.

With verbose on, the output correctly shows that early stopping is triggered; the last few lines of output are (with a few newlines added by me)

Epoch 241: 100%|█████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 21.33it/s, v_num=2bn1, train_loss=1e-6, val_loss=1.28e-6, val_rel_error=0.00061, val_abs_error=0.000951]
Metric train_loss improved by 0.000 >= min_delta = 0.0. New best score: 0.000                                                                                                                                          
Epoch 242: 100%|█████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 32.00it/s, v_num=2bn1, train_loss=9.95e-7, val_loss=1.26e-6, val_rel_error=0.000607, val_abs_error=0.000945]
Stopping threshold reached: train_loss = 9.949283139576437e-07 < 1e-06. Signaling Trainer to stop.
Epoch 242: 100%|█████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 32.00it/s, v_num=2bn1, train_loss=9.95e-7, val_loss=1.26e-6, val_rel_error=0.000607, val_abs_error=0.000945] 
Testing DataLoader 0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 149.33it/s]

However... when I check the logged value (in W&B) after this executes:

    cli.trainer.fit(cli.model)
    cli.trainer.logger.experiment.summary["train_loss"]

It returns 2.0675681753346e-06. Am I misunderstanding something about logging here?

I tried it with the TensorBoard logger as well. It is harder to check programmatically, but the value is the same: 2.0675681753346e-06.

One thing of note is that this runs with a single full batch per epoch, so the step and epoch values of train_loss should essentially coincide.

If I change my training step to be:

    def training_step(self, batch, batch_idx):
        x, y = batch
        y = y.unsqueeze(1)  # to enable broadcasting of self(x)
        loss = F.mse_loss(self(x), y, reduction="mean")
        self.log("train_loss", loss, prog_bar=True, on_epoch=True)
        return loss
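
If I understand the logging behavior correctly, with on_epoch=True added (and on_step still defaulting to True in training_step), the logger receives two suffixed keys, which is where train_loss_step and train_loss_epoch below come from:

    # equivalent to the call above with the step flag spelled out; the logger now receives
    # "train_loss_step" (per batch) and "train_loss_epoch" (reduced over the epoch)
    self.log("train_loss", loss, prog_bar=True, on_step=True, on_epoch=True)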

Then the last step before early stopping occurs says

Epoch 242: 100%|██████████████████████████████████████| 1/1 [00:00<00:00, 16.00it/s, v_num=rm4z, train_loss_step=9.95e-7, val_loss=1.26e-6, val_rel_error=0.000607, val_abs_error=0.000945, train_loss_epoch=9.95e-7] 
Testing DataLoader 0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 100.73it/s] 

Note that train_loss_step and train_loss_epoch are identical.

However, if I check the logger it says

    cli.trainer.fit(cli.model)
    cli.trainer.logger.experiment.summary["train_loss_step"]
    # 2.0675681753346e-06
    cli.trainer.logger.experiment.summary["train_loss_epoch"]
    # 9.949283139576437e-07
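
One check that might help narrow this down (assuming I am reading the internals correctly: EarlyStopping takes its monitored value from trainer.callback_metrics, while the W&B run summary keeps the last value the logger received) is to compare the two right after fit() returns:

    # after cli.trainer.fit(cli.model) has returned:
    # values visible to callbacks such as EarlyStopping
    print({k: float(v) for k, v in cli.trainer.callback_metrics.items() if "train_loss" in k})
    # values that ended up in the W&B run summary
    print(cli.trainer.logger.experiment.summary["train_loss_step"])
    print(cli.trainer.logger.experiment.summary["train_loss_epoch"])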

What version are you seeing the problem on?

v2.0

How to reproduce the bug

No response

Error messages and logs


Environment


More info

No response

    Labels

    bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.0.x, won't fix (This will not be worked on)
