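I've been trying to train a CNN written with TensorFlow's implementation of Keras. Training appears to stall on reaching the first epoch, even though nvidia-smi shows my GPU still has memory in use. No error message or traceback is printed to the terminal either, which makes debugging a bit tricky for me. I also wrote this code using TF Estimators and Datasets, and when I left it running overnight the network had not trained. So I don't think it's just a matter of letting the code run longer. It may be something I've done, but it could also be due to a (supposedly fixed) bug (see the second link below).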
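Since nothing is printed when the process stalls, one way to see where it is stuck might be Python's standard `faulthandler` module, which can dump every thread's stack on a timer. This is only a minimal sketch: the helper names `enable_hang_dumps` and `disable_hang_dumps` and the 300-second default are my own choices, not part of the script below.

```python
import faulthandler
import sys


def enable_hang_dumps(timeout_s=300, stream=sys.stderr):
    """Dump the Python stack of every thread to `stream` each `timeout_s` seconds.

    `stream` must be a real file object (it needs a file descriptor) and has
    to stay open until disable_hang_dumps() is called.
    """
    # Also report hard crashes (segfaults, aborts) that never reach Python.
    faulthandler.enable(file=stream)
    # repeat=True keeps dumping, so a long hang shows the same frame repeatedly.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)


def disable_hang_dumps():
    """Stop the periodic dumps and the crash handler."""
    faulthandler.cancel_dump_traceback_later()
    faulthandler.disable()
```

Calling `enable_hang_dumps()` just before `model.fit(...)` and `disable_hang_dumps()` right after it would show whether the process is waiting inside a TensorFlow call, in data loading, or somewhere else. It only reports Python frames, so a stall deep inside native code would show up as the last Python call that entered it.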
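So far I've also tried using the "verbose" argument of model.fit() to trace the training process and see whether anything is happening, but nothing shows up in the terminal. Other people who have hit this problem still seem to get a progress bar.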
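A finer-grained check than "verbose" could be a per-batch heartbeat that flushes its output immediately, so buffered terminal output can be ruled out. The sketch below is an assumption on my part: `make_heartbeat` is a hypothetical helper, and it relies on `keras.callbacks.LambdaCallback` accepting an `on_batch_end(batch, logs)` function.

```python
import sys
import time


def make_heartbeat(every=10, stream=sys.stderr):
    """Return an on_batch_end-style function that logs every `every` batches."""
    start = time.time()

    def on_batch_end(batch, logs=None):
        if batch % every != 0:
            return
        loss = (logs or {}).get("loss")
        stream.write("batch %d  loss=%s  t=%.1fs\n"
                     % (batch, loss, time.time() - start))
        stream.flush()  # flush so the line appears even if output is buffered

    return on_batch_end


# Assumed usage with the callback list from the script:
# callbacks.append(
#     keras.callbacks.LambdaCallback(on_batch_end=make_heartbeat(every=1)))
```

If not even "batch 0" is printed, the stall happens before the first training step completes (for example during graph construction or GPU initialisation) rather than at the end of the epoch.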
I've also tried logging with TensorBoard and saving model checkpoints. No checkpoints are saved, and as for TensorBoard, no graph appears to be saved either.
Any ideas on what might be causing this?
Can't get past first epoch, just hangs [Keras Transfer Learning Inception] https://stackoverflow.com/questions/47382952/cant-get-past-first-epoch-just-hangs-keras-transfer-learning-inception
Keras fit freezes at the end of the first epoch https://stackoverflow.com/questions/48748413/keras-fit-freezes-at-the-end-of-the-first-epoch
import os
import tensorflow as tf
from tensorflow import keras
import cv2
import numpy as np
from tensorflow.python.framework.graph_util import convert_variables_to_constants
from tensorflow.python.keras import backend as K
cwd = os.getcwd()
log_dir = cwd + "/Keras_Model/"
callbacks = [keras.callbacks.ModelCheckpoint(filepath="./Checkpoints/weights.{epoch:02d}-{val_loss:.2f}.hdf5"),
             keras.callbacks.TensorBoard(log_dir="./logs")]
def freeze_session(session, keep_var_names=None, output_names=None, clear_devices=True):
    """
    TAKEN FROM HERE: https://stackoverflow.com/questions/45466020/how-to-export-keras-h5-to-tensorflow-pb
    Freezes the state of a session into a pruned computation graph. Used later to save model as TF pb file.
    Creates a new computation graph where variable nodes are replaced by
    constants taking their current value in the session. The new graph will be
    pruned so subgraphs that are not necessary to compute the requested
    outputs are removed.
    @param session The TensorFlow session to be frozen.
    @param keep_var_names A list of variable names that should not be frozen,
                          or None to freeze all the variables in the graph.
    @param output_names Names of the relevant graph outputs.
    @param clear_devices Remove the device directives from the graph for better portability.
    @return The frozen graph definition.
    """
    graph = session.graph
    with graph.as_default():
        freeze_var_names = list(set(v.op.name for v in tf.global_variables()).difference(keep_var_names or []))
        output_names = output_names or []
        output_names += [v.op.name for v in tf.global_variables()]
        input_graph_def = graph.as_graph_def()
        if clear_devices:
            for node in input_graph_def.node:
                node.device = ""
        frozen_graph = convert_variables_to_constants(session, input_graph_def,
                                                      output_names, freeze_var_names)
        return frozen_graph
### IMPORT TRAINING IMAGES AS NUMPY ARRAY ###
t_dir = cwd + "/data-1/training/"
e_dir = cwd + "/data-1/evaluation"
xtrain = []
ytrain = []
print(" - Collating training data and labels... - ")
for subdir, dirs, files in os.walk(t_dir):
    for f in files:
        img = os.path.join(subdir, f)
        x = cv2.imread(img) # --> Produces 8-bit tensor from image file.
        y = int(img.split("/")[-2]) - 1 # --> Get label from file path.
        xtrain.append(x)
        ytrain.append(y)
data = np.asarray(xtrain)
print(" - Training data collated. - ")
labels = np.asarray(ytrain)
print(" - Training labels collated. - ")
### IMPORT EVALUATION IMAGES AS NUMPY ARRAY ###
xeval = []
yeval = []
print(" - Collating validation data and labels... - ")
for subdir, dirs, files in os.walk(e_dir):
    for f in files:
        img = os.path.join(subdir, f)
        x = cv2.imread(img) # --> Produces 8-bit tensor from image file.
        y = int(img.split("/")[-2]) - 1 # --> Get label from file path.
        xeval.append(x)
        yeval.append(y)
val_data = np.asarray(xeval)
print(" - Validation data collated. - ")
val_labels = np.asarray(yeval)
print(" - Validation labels collated. - ")
### CREATE MODEL ###
model = keras.Sequential()
model.add(keras.layers.Conv2D(filters=32, kernel_size=5, strides=1, padding="same", data_format = "channels_last", activation="relu", input_shape= (480,640,3)))
model.add(keras.layers.GlobalMaxPool2D(data_format = "channels_last"))
model.add(keras.layers.Dense(64, activation="relu"))
model.add(keras.layers.Dropout(0.4)) # --> Change dropout rate here.
model.add(keras.layers.Dense(8, activation="softmax"))
model.compile(optimizer=tf.train.AdamOptimizer(0.001), # --> Choose learning rate here.
              loss=keras.losses.sparse_categorical_crossentropy,
              metrics=[keras.metrics.sparse_categorical_accuracy]) # --> Sparse metric to match the integer labels.
print(" - Model created... - ")
print(" - Model Summary - ")
model.summary() # --> Print model summary.
### TRAIN AND EVALUATE MODEL ###
print(" - Training model... - ")
model.fit(data, labels, epochs = 5, batch_size=32, callbacks=callbacks, validation_data=(val_data, val_labels), verbose = 2)
print(" - Model trained! - ")
### SAVE MODEL AS H5 AND PB FILES ###
model.save("./Keras_Model/model.h5", save_format="h5")
print(" - Saved model as h5. - ")
frozen_graph = freeze_session(K.get_session(), output_names=[out.op.name for out in model.outputs])
tf.train.write_graph(frozen_graph, "./Tensorflow_Model/", "model.pb", as_text=False)
print(" - Saved model as pb. - ")
print(" - Clearing session. - ")
K.clear_session() # --> clear_session lives in the Keras backend module, not in keras itself.
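If it helps, I can also provide the version that uses TF Datasets and Estimators, or anything else. Apologies if I've missed anything obvious, I've only just started using SO.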
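UPDATE: I went home last night and ran this script on my own computer, and it seemed to work. So this is clearly not a usage problem; it is more likely a problem with TF itself or with how it is configured on our server. That's a bit odd, since TF was working on the server at some point before, but what can you do? Cheers, everyone.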
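For anyone comparing a working machine against a failing one, a small report of interpreter, package versions, and GPU-related environment variables might narrow down the configuration difference. This is a rough sketch under my own assumptions: the package list is a guess at what matters here, the helper names are hypothetical, and `importlib.metadata` needs Python 3.8 or newer.

```python
import os
import platform
from importlib import metadata  # standard library from Python 3.8


def environment_report(packages=("tensorflow", "tensorflow-gpu", "numpy",
                                 "opencv-python", "h5py")):
    """Collect interpreter, OS, package versions and CUDA_VISIBLE_DEVICES."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # package is not installed in this environment
    return report


def diff_reports(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

Running `environment_report()` on both machines and passing the two dictionaries to `diff_reports` lists only the entries that differ. It does not cover the CUDA or cuDNN versions or the GPU driver, which would need to be checked separately.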