Training randomly freezes when training AlexNet from scratch. #148

Open
@zym1010

Sometimes the training process simply gets stuck at testing, right after the first test batch is printed:

Epoch: [0][5000/5005]   Time 0.100 (0.335)      Data 0.000 (0.244)      Loss 5.9800 (6.5614)    Prec@1 1.953 (0.735)    Prec@5 7.812 (2.896)
Test: [0/196]   Time 7.905 (7.905)      Loss 4.1344 (4.1344)    Prec@1 16.016 (16.016)  Prec@5 51.562 (51.562)

Or, more frequently, the line Test: [0/196] never appears and the whole process hangs at the line Epoch: [0][5000/5005].

It has been stuck like this for several hours, and top shows that no process is using any CPU.

I started training with the following command:

CUDA_VISIBLE_DEVICES=1 PYTHONUNBUFFERED=1 python main.py -a alexnet --print-freq 20 --lr 0.01 --workers 20 --batch-size 256 /ssd/cv_datasets/ILSVRC2015/Data/CLS-LOC 2>&1 | tee alexnet_train.log

This happens on both a CentOS 6 machine and an Ubuntu 14.04 machine.
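Since the process sits idle with no CPU usage, one way to narrow down where it is stuck might be to have Python dump its thread stacks while it hangs. The sketch below is not part of the original report or of main.py. It uses the standard-library faulthandler module (Python 3.3+), and the helper names install_hang_dumper and cancel_hang_dumper, as well as the timeout_s and log_file parameters, are hypothetical. It covers only the process that calls it, so data-loading worker processes would need their own setup.

```python
import faulthandler
import signal
import sys


def install_hang_dumper(timeout_s=600.0, log_file=sys.stderr):
    """Dump the stacks of all threads every timeout_s seconds.

    Hypothetical helper for this sketch. log_file must stay open and
    have a real file descriptor, because faulthandler writes to it directly.
    """
    # Repeats every timeout_s seconds whether or not the process is hung;
    # during a hang, consecutive dumps will show the same stuck frame.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)

    # On POSIX systems, also allow an on-demand dump via `kill -USR1 <pid>`.
    if hasattr(faulthandler, "register") and hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=log_file, all_threads=True)


def cancel_hang_dumper():
    """Stop the periodic dumps and remove the signal handler, if installed."""
    faulthandler.cancel_dump_traceback_later()
    if hasattr(faulthandler, "register") and hasattr(signal, "SIGUSR1"):
        faulthandler.unregister(signal.SIGUSR1)
```

Calling install_hang_dumper() once near the start of the training script would be enough. With the tee pipeline above, the dumps written to stderr would end up in alexnet_train.log, and the frame that keeps reappearing would indicate whether the hang is in data loading, in the test loop, or elsewhere.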
