TensorFlow memory leak in every iteration

I'm trying to figure out why the code below leaks a large amount of memory from one iteration to the next. Here is the whole code.

import gc
import tensorflow as tf

# g (a tf.Graph), convolutional_neural_network, read_as_batch and
# get_batch_piece are defined elsewhere and are not shown here.

def train_network(file_folder, file_list, hm_epochs, batch_size):
    num_files = len(file_list)

    with g.as_default():

        input_image = tf.placeholder(tf.float32, shape=[1, 40, 200, 300, 3])
        y1 = tf.placeholder(tf.int32)
        y2 = tf.placeholder(tf.float32)

        class_logit, highlight_logit = convolutional_neural_network(input_image)

        class_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=class_logit, labels=y1))
        highlight_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=highlight_logit, labels=y2))

        # centered is an argument of the RMSPropOptimizer constructor, not of minimize()
        optimizer1 = tf.train.RMSPropOptimizer(learning_rate=1e-6, centered=True).minimize(class_loss)
        optimizer2 = tf.train.RMSPropOptimizer(learning_rate=1e-7, centered=True).minimize(highlight_loss)

        #### Saving Network ####
        with tf.Session(graph=g) as sess:
            saver = tf.train.Saver(max_to_keep=3)
            sess.run(tf.global_variables_initializer())
            for epoch in xrange(hm_epochs):
                epoch_loss = 0

                for idx in xrange(num_files):
                    _file = file_folder + '/' + file_list[idx]
                    X_total, Y1_class, Y2_score = read_as_batch(_file)
                    n_batch = int(X_total.shape[0]/batch_size)
                    for i in xrange(n_batch):

                        batch_X = get_batch_piece(X_total, batch_size, i)
                        batch_Y1 = get_batch_piece(Y1_class, batch_size, i)
                        batch_Y2 = get_batch_piece(Y2_score, batch_size, i)

                        _, _, a, b, c, d = sess.run(
                            [optimizer1, optimizer2, class_loss, highlight_loss,
                             tf.gather(class_logit, 0), tf.gather(highlight_logit, 0)],
                            feed_dict={input_image: batch_X, y1: batch_Y1, y2: batch_Y2})
                        result = float(a) + float(b)
                        del a, b, batch_X, batch_Y1, batch_Y2

                        epoch_loss += result

                        del c, d
                        gc.collect()
                ckpt_path = saver.save(sess, "saved/train", epoch)

The following is the memory profiler output. Through several experiments I found that the functions read_as_batch and get_batch_piece are not the cause of the memory leak.

Line #     Mem usage     Increment   Line Contents

    35   215.758 MiB     0.000 MiB   @profile
    36                               def train_network(file_folder, file_list, hm_epochs, batch_size):
    37
    38   215.758 MiB     0.000 MiB       num_files = len(file_list)
    44   215.758 MiB     0.000 MiB       with g.as_default():
    45
    46   216.477 MiB     0.719 MiB           input_image = tf.placeholder(tf.float32, shape=[1, 40, 200, 300, 3])
    47   216.477 MiB     0.000 MiB           y1 = tf.placeholder(tf.int32)
    48   216.477 MiB     0.000 MiB           y2 = tf.placeholder(tf.float32)
    49
    50   220.199 MiB     3.723 MiB           class_logit, highlight_logit = convolutional_neural_network(input_image)
    51
    52   220.711 MiB     0.512 MiB           class_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=class_logit, labels=y1))
    54   220.953 MiB     0.242 MiB           highlight_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=highlight_logit, labels=y2))
    55
    56   227.562 MiB     6.609 MiB           optimizer1 = tf.train.RMSPropOptimizer(learning_rate=1e-6).minimize(class_loss)
    57   234.062 MiB     6.500 MiB           optimizer2 = tf.train.RMSPropOptimizer(learning_rate=1e-7).minimize(highlight_loss)
    58
    59                                       #### Saving Network ####
    60   660.691 MiB   426.629 MiB           with tf.Session(graph=g) as sess:
    62   666.848 MiB     6.156 MiB               saver = tf.train.Saver(max_to_keep = 3)
    63  1183.676 MiB   516.828 MiB               sess.run(tf.global_variables_initializer())
    67  1642.145 MiB   458.469 MiB               for epoch in xrange(hm_epochs):
    68  1642.145 MiB     0.000 MiB                   epoch_loss = 0
    69  1642.145 MiB     0.000 MiB                   file_list_ = iter(file_list)
    71                                               #for idx in xrange(num_files):
    74  1642.145 MiB     0.000 MiB                   _file = file_folder + '/' + file_list_.next()
    77  1779.477 MiB   137.332 MiB                   data = np.load(_file)
    78                                               # Batch Data Generation
    79  1916.629 MiB   137.152 MiB                   X_total = np.array([data[0][0][0], data[0][0][1], ...])
    81                                               # Class, Score Data Fetching
    82  1916.629 MiB     0.000 MiB                   Y1_class = data[0][1][0]
    83  1916.629 MiB     0.000 MiB                   Y2_score = data[0][2][0]
    85  1916.629 MiB     0.000 MiB                   batch_X = get_batch_piece(X_total, 1, 1)
    86  1916.629 MiB     0.000 MiB                   batch_Y1 = get_batch_piece(Y1_class, 1, 1)
    87  1916.629 MiB     0.000 MiB                   batch_Y2 = get_batch_piece(Y2_score, 1, 1)
    88  1916.805 MiB     0.176 MiB                   _ = sess.run([optimizer1], feed_dict={input_image: batch_X, y1: batch_Y1, y2: batch_Y2})
    89
    90  1642.145 MiB  -274.660 MiB                   del data, X_total, Y1_class, Y2_score, batch_X, batch_Y1, batch_Y2, optimizer1

To improve readability, I shortened the code. Although the memory profiler output therefore differs slightly from the original code, the logic is the same and the same problem (the memory leak) occurs. If I remove the sess.run([optimizer1], ...) call, the code does not leak memory even when the number of epochs exceeds 100. If I do run the session, however, memory usage keeps growing, so that I cannot train for even 5 epochs.

I need your help. Thank you very much.

2017-10-03 조수호

The reason is that you are creating new TensorFlow operations on every sess.run call.

Move these two out of the for loop: tf.gather(class_logit, 0) and tf.gather(highlight_logit, 0). Build them once before the loop and pass the resulting tensors to sess.run, and the problem should go away.
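
A minimal sketch of that fix, assuming the TF1-style graph API is available as tensorflow.compat.v1. The tiny matmul "network" and the name first_logit are hypothetical stand-ins for the real model, not the asker's code. Calling g.finalize() is an optional extra: it makes any later attempt to add an op raise an error, which should turn a silent leak like this one into an immediate failure.

```python
import numpy as np
import tensorflow.compat.v1 as tf  # assumption: TF1-style graph API is available

tf.disable_eager_execution()

g = tf.Graph()
with g.as_default():
    # Hypothetical stand-in for the logits produced by the real network.
    x = tf.placeholder(tf.float32, shape=[None, 4])
    w = tf.Variable(tf.ones([4, 3]))
    class_logit = tf.matmul(x, w)

    # Build the gather op ONCE, outside the training loop.
    first_logit = tf.gather(class_logit, 0)
    init = tf.global_variables_initializer()

# Freeze the graph so that accidentally adding ops later raises an error.
g.finalize()

ops_before = len(g.get_operations())
with tf.Session(graph=g) as sess:
    sess.run(init)
    for _ in range(100):
        batch = np.ones((2, 4), np.float32)
        # Only fetch the pre-built tensor here; no new ops are created.
        value = sess.run(first_logit, feed_dict={x: batch})
ops_after = len(g.get_operations())

print(ops_before == ops_after)  # the graph did not grow during the loop
```

If a tf.gather(...) call were left inside the loop in this sketch, the finalized graph would reject it on the first iteration instead of growing quietly.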

2017-10-03 09:38:20

It works!! Thank you very much. I have one more question: is there another way to check how the logits change during training?

Yes, you could either print the values or use TensorBoard.

Yeeeeah. I'll try it. Thanks again :)
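
Both options can be sketched as follows, again assuming the TF1-style API via tensorflow.compat.v1. The small matmul model, the summary name "class_logit", and the temporary log directory are hypothetical choices for illustration. Summary ops are graph ops as well, so they are built once before the loop rather than inside it.

```python
import tempfile
import numpy as np
import tensorflow.compat.v1 as tf  # assumption: TF1-style graph API is available

tf.disable_eager_execution()

g = tf.Graph()
with g.as_default():
    # Hypothetical stand-in for the real network's logits.
    x = tf.placeholder(tf.float32, shape=[None, 4])
    w = tf.Variable(tf.ones([4, 3]))
    class_logit = tf.matmul(x, w)

    # Build the summary ops once, before the training loop.
    tf.summary.histogram("class_logit", class_logit)
    merged = tf.summary.merge_all()
    init = tf.global_variables_initializer()
g.finalize()

logdir = tempfile.mkdtemp()  # hypothetical log directory
writer = tf.summary.FileWriter(logdir, graph=g)
with tf.Session(graph=g) as sess:
    sess.run(init)
    for step in range(3):
        batch = np.ones((2, 4), np.float32)
        logits, summary = sess.run([class_logit, merged], feed_dict={x: batch})
        print(step, logits[0])             # option 1: print the fetched values
        writer.add_summary(summary, step)  # option 2: log them for TensorBoard
writer.close()
```

Pointing TensorBoard at the log directory (tensorboard --logdir <that directory>) should then show the logit histogram per step.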
