Description
It is common knowledge that, during evaluation, the model is not trained on the dev/test dataset.
However, I noticed a strange difference between the results of these two setups:
(1) train for 10 epochs, then run a single final evaluation on the test data
(2) train for 10 epochs, running an evaluation on the test data after each training epoch
Prior knowledge:
Even if you set the seed for everything:
# set all seeds
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
if use_cuda:
    torch.cuda.manual_seed_all(args.seed)  # if you have a GPU, also seed it
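For reference, the seeding above can be collected into one helper (a sketch; the function name and structure are my own, not taken from the MNIST example):

```python
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    """Seed every RNG the training loop may touch."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)  # also seed all GPUs
```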
When you run examples/mnist/main.py, it still gives different results on the GPU across runs.
run 1
-------------
Test set: Average loss: 0.1018, Accuracy: 9660/10000 (97%)
Test set: Average loss: 0.0611, Accuracy: 9825/10000 (98%)
Test set: Average loss: 0.0555, Accuracy: 9813/10000 (98%)
Test set: Average loss: 0.0409, Accuracy: 9862/10000 (99%)
Test set: Average loss: 0.0381, Accuracy: 9870/10000 (99%)
Test set: Average loss: 0.0339, Accuracy: 9891/10000 (99%)
Test set: Average loss: 0.0340, Accuracy: 9877/10000 (99%)
Test set: Average loss: 0.0399, Accuracy: 9872/10000 (99%)
Test set: Average loss: 0.0291, Accuracy: 9908/10000 (99%)
Test set: Average loss: 0.0315, Accuracy: 9896/10000 (99%)
run 2
--------------
Test set: Average loss: 0.1016, Accuracy: 9666/10000 (97%)
Test set: Average loss: 0.0608, Accuracy: 9828/10000 (98%)
Test set: Average loss: 0.0567, Accuracy: 9810/10000 (98%)
Test set: Average loss: 0.0408, Accuracy: 9864/10000 (99%)
Test set: Average loss: 0.0382, Accuracy: 9868/10000 (99%)
Test set: Average loss: 0.0339, Accuracy: 9894/10000 (99%)
Test set: Average loss: 0.0349, Accuracy: 9871/10000 (99%)
Test set: Average loss: 0.0396, Accuracy: 9876/10000 (99%)
Test set: Average loss: 0.0294, Accuracy: 9911/10000 (99%)
Test set: Average loss: 0.0304, Accuracy: 9895/10000 (99%)
As long as you set torch.backends.cudnn.deterministic = True, you get consistent results:
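Concretely, the flags can be set once at startup, before any CUDA work is done (the placement here is an assumption, not taken from the example):

```python
import torch

# Force cuDNN to select deterministic kernels
# (usually at some cost in speed).
torch.backends.cudnn.deterministic = True
# Disable the cuDNN autotuner, which benchmarks candidate kernels
# at runtime and can make kernel selection input-dependent.
torch.backends.cudnn.benchmark = False
```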
====== parameters ========
batch_size: 64
do_eval: True
do_eval_each_epoch: True
epochs: 10
log_interval: 10
lr: 0.01
momentum: 0.5
no_cuda: False
save_model: False
seed: 42
test_batch_size: 1000
==========================
Test set: Average loss: 0.1034, Accuracy: 9679/10000 (97%)
Test set: Average loss: 0.0615, Accuracy: 9804/10000 (98%)
Test set: Average loss: 0.0484, Accuracy: 9847/10000 (98%)
Test set: Average loss: 0.0361, Accuracy: 9888/10000 (99%)
Test set: Average loss: 0.0341, Accuracy: 9887/10000 (99%)
Test set: Average loss: 0.0380, Accuracy: 9877/10000 (99%)
Test set: Average loss: 0.0302, Accuracy: 9899/10000 (99%)
Test set: Average loss: 0.0315, Accuracy: 9884/10000 (99%)
Test set: Average loss: 0.0283, Accuracy: 9909/10000 (99%)
Test set: Average loss: 0.0266, Accuracy: 9907/10000 (99%) -> epoch 10
====== parameters ========
batch_size: 64
do_eval: True
do_eval_each_epoch: True
epochs: 20
log_interval: 10
lr: 0.01
momentum: 0.5
no_cuda: False
save_model: False
seed: 42
test_batch_size: 1000
==========================
Test set: Average loss: 0.1034, Accuracy: 9679/10000 (97%)
Test set: Average loss: 0.0615, Accuracy: 9804/10000 (98%)
Test set: Average loss: 0.0484, Accuracy: 9847/10000 (98%)
Test set: Average loss: 0.0361, Accuracy: 9888/10000 (99%)
Test set: Average loss: 0.0341, Accuracy: 9887/10000 (99%)
Test set: Average loss: 0.0380, Accuracy: 9877/10000 (99%)
Test set: Average loss: 0.0302, Accuracy: 9899/10000 (99%)
Test set: Average loss: 0.0315, Accuracy: 9884/10000 (99%)
Test set: Average loss: 0.0283, Accuracy: 9909/10000 (99%)
Test set: Average loss: 0.0266, Accuracy: 9907/10000 (99%) -> epoch 10
Test set: Average loss: 0.0373, Accuracy: 9870/10000 (99%)
Test set: Average loss: 0.0286, Accuracy: 9909/10000 (99%)
Test set: Average loss: 0.0309, Accuracy: 9908/10000 (99%)
Test set: Average loss: 0.0302, Accuracy: 9899/10000 (99%)
Test set: Average loss: 0.0261, Accuracy: 9907/10000 (99%)
Test set: Average loss: 0.0258, Accuracy: 9913/10000 (99%)
Test set: Average loss: 0.0288, Accuracy: 9917/10000 (99%)
Test set: Average loss: 0.0280, Accuracy: 9904/10000 (99%)
Test set: Average loss: 0.0294, Accuracy: 9902/10000 (99%)
Test set: Average loss: 0.0257, Accuracy: 9914/10000 (99%) -> epoch 20
However, when you change the script to run only a single final evaluation after epoch 10, the result becomes:
====== parameters ========
batch_size: 64
do_eval: True
do_eval_each_epoch: False
epochs: 10
log_interval: 10
lr: 0.01
momentum: 0.5
no_cuda: False
save_model: False
seed: 42
test_batch_size: 1000
==========================
Test set: Average loss: 0.0361, Accuracy: 9885/10000 (99%) -> epoch 10
I also tried adding torch.backends.cudnn.benchmark = False, and it gives the same result.
Repeatability and consistent results are crucial in machine learning. Does anyone know the reason for this strange behavior?
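One plausible explanation (an assumption on my part, not verified against the attached code): the per-epoch evaluation itself consumes global RNG state, e.g. when its DataLoader is iterated, so every subsequent training epoch sees a shifted random stream. A minimal CPU-only sketch of the effect, using Python's random module as a stand-in for the global RNG:

```python
import random


def training_draws(eval_each_epoch: bool) -> list:
    """Simulate the random numbers 'training' consumes per epoch."""
    random.seed(42)  # same seed for both configurations
    draws = []
    for _ in range(3):
        # Stand-in for the randomness one training epoch uses
        # (data shuffling, dropout, ...).
        draws.append(random.random())
        if eval_each_epoch:
            # Stand-in for an evaluation pass that also draws
            # from the same global RNG.
            random.random()
    return draws


final_only = training_draws(eval_each_epoch=False)
per_epoch = training_draws(eval_each_epoch=True)
```

If this is the cause, the first epoch matches in both configurations and later epochs diverge; snapshotting and restoring the RNG state around each evaluation should make the two setups agree again.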
Attached code for your convenience:
PyTorch_mnist_example.zip