Training with an SSD implementation in Keras stops after a few iterations without output or error

After a few iterations of the first epoch, the training process stops without any output or error message. The SSD implementation in Keras from https://github.com/rykov8/ssd_keras was used.

import keras

# model, gen, MultiboxLoss and NUM_CLASSES are defined earlier in the training script.
base_lr = 3e-4
# optim = keras.optimizers.Adam(lr=base_lr)
optim = keras.optimizers.RMSprop(lr=base_lr)
# optim = keras.optimizers.SGD(lr=base_lr, momentum=0.9, decay=decay, nesterov=True)
model.compile(optimizer=optim,
              loss=MultiboxLoss(NUM_CLASSES + 1, neg_pos_ratio=2.0).compute_loss)

# Keras 1.x argument names (nb_epoch, nb_val_samples, nb_worker).
nb_epoch = 10
history = model.fit_generator(gen.generate(True), gen.train_batches,
                              nb_epoch, verbose=1,
                              callbacks=None,
                              validation_data=gen.generate(False),
                              nb_val_samples=gen.val_batches,
                              nb_worker=1)

The output of the program is as follows:

Epoch 1/10 
/home/deepesh/Documents/ssd_traffic/ssd_utils.py:119: RuntimeWarning: divide by zero encountered in log 
    assigned_priors_wh) 
2017-10-15 18:00:53.763886: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.54GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available. 
2017-10-15 18:01:02.602807: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available. 
2017-10-15 18:01:03.831092: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.17GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available. 
2017-10-15 18:01:03.831138: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.10GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available. 
2017-10-15 18:01:04.774444: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.26GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available. 
2017-10-15 18:01:05.897872: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.46GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available. 
2017-10-15 18:01:05.897923: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.94GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available. 
2017-10-15 18:01:09.133494: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available. 
2017-10-15 18:01:09.133541: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.15GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available. 
2017-10-15 18:01:11.266114: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available. 
13/14 [==========================>...] - ETA: 9s - loss: 2.9617

There is no further output or error message after that.


2017-10-15 Deepesh Lekhak

You do not have enough memory. To solve the problem, you can:

reduce the batch size
reduce the size of the training data
train your models in the cloud (AWS, Google Cloud, etc.)
use a different GPU card with more memory
or try the CPU
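
Two of these options can be sketched in a few lines. This is only a minimal illustration, not part of ssd_keras: `halving_batch_sizes` and `hide_gpus` are hypothetical helper names, and the sketch assumes the batch size is set wherever the generator behind `gen` is constructed. Because the run in the question hangs instead of raising an error, each candidate batch size would have to be tried by hand.

```python
import os


def halving_batch_sizes(start, minimum=1):
    """Yield start, start // 2, ... down to `minimum` (inclusive).

    Hypothetical helper: lists the batch sizes worth trying, largest first.
    """
    if minimum < 1:
        raise ValueError("minimum must be at least 1")
    size = start
    while size >= minimum:
        yield size
        size //= 2


def hide_gpus():
    """Hide all CUDA devices so that TensorFlow falls back to the CPU.

    Assumption: this is called before TensorFlow or Keras is imported
    for the first time in the process; later calls have no effect on
    an already initialised GPU context.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"


if __name__ == "__main__":
    # Candidate batch sizes to try, starting from an assumed value of 16.
    print(list(halving_batch_sizes(16)))
```

Training on the CPU is much slower, so hiding the GPUs is mainly useful for checking whether the stall is related to GPU memory at all.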


2017-10-15 14:35:22 Paddy

I trained the model on an AWS g2.8xlarge instance, but that did not solve the problem. Reducing the batch size to just 2 did solve it.

Good to hear :) – Paddy
