Sometimes the training process simply gets stuck at the testing stage:
Epoch: [0][5000/5005] Time 0.100 (0.335) Data 0.000 (0.244) Loss 5.9800 (6.5614) Prec@1 1.953 (0.735) Prec@5 7.812 (2.896)
Test: [0/196] Time 7.905 (7.905) Loss 4.1344 (4.1344) Prec@1 16.016 (16.016) Prec@5 51.562 (51.562)
Or, more frequently, the line `Test: [0/196]` never appears and the whole process gets stuck at the line `Epoch: [0][5000/5005]`.
It has stayed like this for several hours, and according to `top`, no process is using any CPU.
I started training with the following command:
`CUDA_VISIBLE_DEVICES=1 PYTHONUNBUFFERED=1 python main.py -a alexnet --print-freq 20 --lr 0.01 --workers 20 --batch-size 256 /ssd/cv_datasets/ILSVRC2015/Data/CLS-LOC 2>&1 | tee alexnet_train.log`
This happens on both a CentOS 6 machine and an Ubuntu 14.04 machine.
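
Since the hung process shows no CPU usage, a Python-level stack dump would help show where it is blocked. The following is only a minimal sketch of one way to get that using the standard library's `faulthandler` module. It is not part of `main.py`, and the helper name `install_hang_diagnostics` and its parameters are hypothetical. It assumes a POSIX system, because `faulthandler.register` is not available on Windows.

```python
import faulthandler
import signal
import sys


def install_hang_diagnostics(timeout_s=None, stream=sys.stderr):
    """Hypothetical helper (not part of main.py).

    Registers a SIGUSR1 handler that writes the Python stack of every
    thread in this process to `stream`. If `timeout_s` is given, the same
    dump is also written every `timeout_s` seconds until cancelled with
    faulthandler.cancel_dump_traceback_later().

    `stream` must be a real file with a file descriptor (fileno()).
    """
    faulthandler.register(signal.SIGUSR1, file=stream, all_threads=True)
    if timeout_s is not None:
        faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)
```

Assuming this is called once near the start of the training script, running `kill -USR1 <pid>` from another shell against the stuck process should print the stack traces to stderr, which in the command above ends up in `alexnet_train.log`.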