Description
I have been attempting to run the MNIST example on a Mac Mini with the M1 chip. The model appears to train correctly, but at test time it returns chance-level accuracy (~10%). After some debugging, I found that line 58 in the file is what causes this issue. With this line commented out, the model trains and evaluates correctly, returning ~97% accuracy after one epoch. My machine specifications are below:
Device Information:
# M1 Mac Specifications
Model Name: Mac mini
Model Identifier: Macmini9,1
Chip: Apple M1
Total Number of Cores: 8 (4 performance and 4 efficiency)
Memory: 16 GB
System Firmware Version: 6723.140.2
OS Loader Version: 6723.140.2
# Intel x86 MacBook Pro
Model Name: MacBook Pro
Model Identifier: MacBookPro14,1
Processor Name: Dual-Core Intel Core i5
Processor Speed: 2.3 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 4 MB
Hyper-Threading Technology: Enabled
Memory: 16 GB
System Firmware Version: 429.140.8.0.0
I have also independently run this file on an older x86 MacBook Pro, where this issue does not occur and the same line does not need to be commented out. On both architectures, I set up Python, Conda, and PyTorch from scratch following this link. In a nutshell, these were the commands on a fresh install (running Python 3.9.6):
brew install miniforge
conda init zsh
conda create --name pytorch_env
conda activate pytorch_env
conda install -c pytorch pytorch torchvision
python main.py
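Since the behaviour differs between the two machines, it may help to record what each environment actually resolved to after the install steps above. The following is a minimal sketch using only the standard library; the helper name `describe_environment` is hypothetical, and it assumes the packages were installed under the distribution names `torch` and `torchvision` (a conda install may register them differently, in which case they are reported as not installed).

```python
import platform
from importlib import metadata


def describe_environment(packages=("torch", "torchvision")):
    """Collect interpreter, architecture, and package-version details."""
    info = {
        "python": platform.python_version(),
        # Architecture the interpreter runs as, e.g. "arm64" or "x86_64".
        "machine": platform.machine(),
        "system": platform.system(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package metadata is not visible to this interpreter.
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running this inside `pytorch_env` on both machines gives a side-by-side record of the Python version, the architecture the interpreter reports, and the installed package versions.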
These are the training loss and the test loss and accuracy values after one epoch:
# With line 58 commented out on M1
Train Epoch: 1 [0/60000 (0%)] Loss: 2.329474
Train Epoch: 1 [640/60000 (1%)] Loss: 1.419983
...
Train Epoch: 1 [58880/60000 (98%)] Loss: 0.044800
Train Epoch: 1 [59520/60000 (99%)] Loss: 0.018873
Test set: Average loss: 0.0456, Accuracy: 9838/10000 (98%)
# With line 58 left in (uncommented) on M1
Train Epoch: 1 [0/60000 (0%)] Loss: 2.329474
Train Epoch: 1 [640/60000 (1%)] Loss: 1.419983
...
Train Epoch: 1 [58880/60000 (98%)] Loss: 0.044800
Train Epoch: 1 [59520/60000 (99%)] Loss: 0.018873
Test set: Average loss: 2.3097, Accuracy: 974/10000 (10%)
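To make the comparison between the two runs concrete, here is a small sketch that parses the `Test set:` summary lines shown above and checks whether the accuracy is close to chance. The function names and the 2% tolerance are illustrative assumptions; chance is taken as 1/10 because MNIST has ten classes.

```python
import re

# Matches the "Accuracy: <correct>/<total>" part of the summary line.
ACCURACY_PATTERN = re.compile(r"Accuracy:\s*(\d+)/(\d+)")


def accuracy_from_log(line):
    """Return the accuracy as a fraction parsed from a 'Test set:' line."""
    match = ACCURACY_PATTERN.search(line)
    if match is None:
        raise ValueError(f"no accuracy found in: {line!r}")
    correct, total = map(int, match.groups())
    return correct / total


def near_chance(accuracy, num_classes=10, tolerance=0.02):
    """True if accuracy is within `tolerance` of uniform random guessing."""
    return abs(accuracy - 1.0 / num_classes) <= tolerance


good = "Test set: Average loss: 0.0456, Accuracy: 9838/10000 (98%)"
bad = "Test set: Average loss: 2.3097, Accuracy: 974/10000 (10%)"

for line in (good, bad):
    acc = accuracy_from_log(line)
    print(f"{acc:.4f} near chance: {near_chance(acc)}")
```

The training losses in the two runs are identical, so the difference shows up only in the evaluation step.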
This issue was also reported by a colleague of mine, @shreyaspadhy, in a separate thread; please see here.