Training randomly freezes when training AlexNet from scratch.

Sometimes the training process simply gets stuck during testing:

~~~
Epoch: [0][5000/5005]   Time 0.100 (0.335)      Data 0.000 (0.244)      Loss 5.9800 (6.5614)    Prec@1 1.953 (0.735)    Prec@5 7.812 (2.896)
Test: [0/196]   Time 7.905 (7.905)      Loss 4.1344 (4.1344)    Prec@1 16.016 (16.016)  Prec@5 51.562 (51.562)
~~~

Or, more frequently, the `Test: [0/196]` line never appears and the whole process gets stuck at the `Epoch: [0][5000/5005]` line.

It has been stuck like this for several hours, and `top` shows that no process is using any CPU.
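A minimal sketch of how the stuck process could be inspected, assuming `main.py` can be edited and the machine is Unix-like. `install_hang_diagnostics` is a hypothetical helper name, not part of the example script; it relies only on Python's standard `faulthandler` module. It reports Python stacks of the process it is installed in, so data-loading worker processes would need separate inspection.

```python
import faulthandler
import signal
import sys


def install_hang_diagnostics(timeout_s=600, log_file=None):
    """Hypothetical helper: make a stuck process report where it is blocked."""
    out = log_file if log_file is not None else sys.stderr
    # Dump the Python stack of every thread when the process receives SIGUSR1,
    # e.g. by running `kill -USR1 <pid>` from another shell.
    faulthandler.register(signal.SIGUSR1, file=out, all_threads=True)
    # Also dump all stacks if `timeout_s` seconds pass without this function
    # being called again. Each call replaces the previous timer, so calling it
    # once per iteration turns it into a watchdog that stays silent while the
    # loop keeps making progress.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=out)
```

Calling this at the top of each training and validation iteration should print a traceback once an iteration takes longer than `timeout_s`, which may help tell whether the main process is waiting on the data loader or on the GPU.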

I called `CUDA_VISIBLE_DEVICES=1 PYTHONUNBUFFERED=1 python main.py -a alexnet --print-freq 20 --lr 0.01 --workers 20 --batch-size 256 /ssd/cv_datasets/ILSVRC2015/Data/CLS-LOC 2>&1 | tee alexnet_train.log` to train the network.

This happens on both a CentOS 6 machine and an Ubuntu 14.04 machine.
