在 Tensorflow 中训练简单模型 GPU 比 CPU 慢

2024-03-27

我在 Tensorflow 中设置了一个简单的线性回归问题，并在 1.13.1 中使用 Tensorflow CPU 和 GPU 创建了简单的 conda 环境（在 NVIDIA Quadro P600 的后端使用 CUDA 10.0）。

然而，GPU 环境似乎总是比 CPU 环境花费更长的时间。我正在运行的代码如下。

import time
import warnings
import numpy as np
import scipy

import tensorflow as tf
import tensorflow_probability as tfp

from tensorflow_probability import edward2 as ed
from tensorflow.python.ops import control_flow_ops
from tensorflow_probability import distributions as tfd



# Handy snippet to reset the global graph and global session.
def reset_g():
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        tf.reset_default_graph()
        try:
            sess.close()
        except:
            pass


N = 35000
inttest = np.ones(N).reshape(N, 1)
stddev_raw = 0.09

true_int = 1.
true_b1 = 0.15
true_b2 = 0.7

np.random.seed(69)

X1 = (np.atleast_2d(np.linspace(
    0., 2., num=N)).T).astype(np.float64)
X2 = (np.atleast_2d(np.linspace(
    2., 1., num=N)).T).astype(np.float64)
Ytest = true_int + (true_b1*X1) + (true_b2*X2) + \
    np.random.normal(size=N, scale=stddev_raw).reshape(N, 1)

Ytest = Ytest.reshape(N, )
X1 = X1.reshape(N, )
X2 = X2.reshape(N, )

reset_g()

# Create data and param
model_X1 = tf.placeholder(dtype=tf.float64, shape=[N, ])
model_X2 = tf.placeholder(dtype=tf.float64, shape=[N, ])
model_Y = tf.placeholder(dtype=tf.float64, shape=[N, ])

alpha = tf.get_variable(shape=[1], name='alpha', dtype=tf.float64)
# these two params need shape of one if using trainable distro
beta1 = tf.get_variable(shape=[1], name='beta1', dtype=tf.float64)
beta2 = tf.get_variable(shape=[1], name='beta2', dtype=tf.float64)

# Yhat
tf_pred = (tf.multiply(model_X1, beta1) + tf.multiply(model_X2, beta2) + alpha)


# # Make difference of squares
# resid = tf.square(model_Y - tf_pred)
# loss = tf.reduce_sum(resid)

# # Make a Likelihood function based on simple stuff
stddev = tf.square(tf.get_variable(shape=[1],
                                    name='stddev', dtype=tf.float64))
covar = tfd.Normal(loc=model_Y, scale=stddev)
loss = -1.0*tf.reduce_sum(covar.log_prob(tf_pred))



# Trainer
lr=0.005
N_ITER = 20000

opt = tf.train.AdamOptimizer(lr, beta1=0.95, beta2=0.95)
train = opt.minimize(loss)


with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    start = time.time()
    for step in range(N_ITER):
        out_l, out_b1, out_b2, out_a, laws = sess.run([train, beta1, beta2, alpha, loss],
                                                  feed_dict={model_X1: X1,
                                                             model_X2: X2,
                                                             model_Y: Ytest})

        if step % 500 == 0:
            print('Step: {s}, loss = {l}, alpha = {a:.3f}, beta1 = {b1:.3f}, beta2 = {b2:.3f}'.format(
                s=step, l=laws, a=out_a[0], b1=out_b1[0], b2=out_b2[0]))
    print(f"True: alpha = {true_int}, beta1 = {true_b1}, beta2 = {true_b2}")
    end = time.time()
    print(end-start)

以下是一些打印的输出（如果它们可以指示正在发生的情况）：

对于CPU运行：

Colocations handled automatically by placer.
2019-04-18 09:00:56.329669: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-04-18 09:00:56.351151: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2904000000 Hz
2019-04-18 09:00:56.351672: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x558fefe604c0 executing computations on platform Host. Devices:
2019-04-18 09:00:56.351698: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>

对于 GPU 运行：

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W0418 09:03:21.674947 139956864096064 deprecation.py:506] From /home/sadatnfs/.conda/envs/tf_gpu/lib/python3.6/site-packages/tensorflow/python/training/slot_creator.py:187: calling Zeros.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
2019-04-18 09:03:21.712913: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-04-18 09:03:21.717598: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-04-18 09:03:21.951277: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1009] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-18 09:03:21.952212: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55e583bc4480 executing computations on platform CUDA. Devices:
2019-04-18 09:03:21.952225: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Quadro P600, Compute Capability 6.1
2019-04-18 09:03:21.971218: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2904000000 Hz
2019-04-18 09:03:21.971816: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55e58577f290 executing computations on platform Host. Devices:
2019-04-18 09:03:21.971842: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-04-18 09:03:21.972102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1551] Found device 0 with properties:
name: Quadro P600 major: 6 minor: 1 memoryClockRate(GHz): 1.5565
pciBusID: 0000:01:00.0
totalMemory: 1.95GiB freeMemory: 1.91GiB
2019-04-18 09:03:21.972147: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1674] Adding visible gpu devices: 0
2019-04-18 09:03:21.972248: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-04-18 09:03:21.973094: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1082] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-18 09:03:21.973105: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1088]      0
2019-04-18 09:03:21.973110: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1101] 0:   N
2019-04-18 09:03:21.973279: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1222] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1735 MB memory) -> physical GPU (device: 0, name: Quadro P600, pci bus id: 0000:01:00.0, compute capability: 6.1)

我即将发布另一个关于在 R 中实现 CUBLAS 的问题，因为与 Intel MKL 相比，这给了我较慢的速度时间，但我希望也许有一个明确的原因，为什么即使是像 TF 一样构建良好的东西（与hacky R 和 CUBLAS 补丁）在 GPU 上运行缓慢。

EDIT:下列的弗拉德的建议 https://stackoverflow.com/questions/55749899/training-a-simple-model-in-tensorflow-gpu-slower-than-cpu#comment98177427_55749899，我编写了以下脚本来尝试抛出一些大型对象并对其进行训练，但我认为我可能没有正确设置它，因为在这种情况下，即使矩阵的大小在增加，CPU 也是如此。也许有什么建议吗？

import time
import warnings
import numpy as np
import scipy

import tensorflow as tf
import tensorflow_probability as tfp

from tensorflow_probability import edward2 as ed
from tensorflow.python.ops import control_flow_ops
from tensorflow_probability import distributions as tfd

np.random.seed(69)

# Handy snippet to reset the global graph and global session.
def reset_g():
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        tf.reset_default_graph()
        try:
            sess.close()
        except:
            pass


# Loop over the different number of feature columns
for x_feat in [30, 50, 100, 1000, 10000]:

    y_feat=10;
    # Simulate data
    N = 5000
    inttest = np.ones(N).reshape(N, 1)
    stddev_raw = np.random.uniform(0.01, 0.25, size=y_feat)

    true_int = np.linspace(0.1 ,1., num=y_feat)
    xcols = x_feat
    true_bw = np.random.randn(xcols, y_feat)
    true_X = np.random.randn(N, xcols)
    true_errorcov = np.eye(y_feat)
    np.fill_diagonal(true_errorcov, stddev_raw)

    true_Y = true_int + np.matmul(true_X, true_bw) + \
        np.random.multivariate_normal(mean=np.array([0 for i in range(y_feat)]),
                                      cov=true_errorcov,
                                      size=N)


    ## Our model is:
    ## Y = a + b*X + error where, for N=5000 observations:
    ## Y : 10 outputs;
    ## X : 30,50,100,1000,10000 features
    ## a, b = bias and weights
    ## error: just... error

    # Number of iterations
    N_ITER = 1001

    # Training rate
    lr=0.005

    with tf.device('gpu'):

        # Create data and weights
        model_X = tf.placeholder(dtype=tf.float64, shape=[N, xcols])
        model_Y = tf.placeholder(dtype=tf.float64, shape=[N, y_feat])

        alpha = tf.get_variable(shape=[y_feat], name='alpha', dtype=tf.float64)
        # these two params need shape of one if using trainable distro
        betas = tf.get_variable(shape=[xcols, y_feat], name='beta1', dtype=tf.float64)


        # Yhat
        tf_pred = alpha + tf.matmul(model_X, betas)

        # Make difference of squares (loss fn) [CONVERGES TO TRUTH]
        resid = tf.square(model_Y - tf_pred)
        loss = tf.reduce_sum(resid)

        # Trainer
        opt = tf.train.AdamOptimizer(lr, beta1=0.95, beta2=0.95)
        train = opt.minimize(loss)


    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    start = time.time()
    for step in range(N_ITER):
        out_l, laws = sess.run([train, loss], feed_dict={model_X: true_X, model_Y: true_Y})

        if step % 500 == 0:
            print('Step: {s}, loss = {l}'.format(
                s=step, l=laws))
    end = time.time()
    print("y_feat: {n}, x_feat: {x2}, Time elapsed: {te}".format(n = y_feat, x2 = x_feat, te = end-start))

    reset_g()

正如我在评论中所说，调用 GPU 内核以及将数据复制到 GPU 或从 GPU 复制数据的开销非常高。对于参数很少的模型的操作，不值得使用 GPU，因为 CPU 核心的频率要高得多。如果比较矩阵乘法（这是深度学习主要做的事情），您会发现对于大型矩阵，GPU 的性能显着优于 CPU。

看看这个情节。 X 轴是两个方阵的大小，y 轴是在 GPU 和 CPU 上将这些矩阵相乘所需的时间。正如您在一开始所看到的，对于小矩阵，蓝线更高，这意味着 CPU 速度更快。但随着我们增加矩阵的大小，使用 GPU 的好处就会显着增加。

要重现的代码：

import tensorflow as tf
import time
cpu_times = []
sizes = [1, 10, 100, 500, 1000, 2000, 3000, 4000, 5000, 8000, 10000]
for size in sizes:
    tf.reset_default_graph()
    start = time.time()
    with tf.device('cpu:0'):
        v1 = tf.Variable(tf.random_normal((size, size)))
        v2 = tf.Variable(tf.random_normal((size, size)))
        op = tf.matmul(v1, v2)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(op)
    cpu_times.append(time.time() - start)
    print('cpu time took: {0:.4f}'.format(time.time() - start))

import tensorflow as tf
import time

gpu_times = []
for size in sizes:
    tf.reset_default_graph()
    start = time.time()
    with tf.device('gpu:0'):
        v1 = tf.Variable(tf.random_normal((size, size)))
        v2 = tf.Variable(tf.random_normal((size, size)))
        op = tf.matmul(v1, v2)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(op)
    gpu_times.append(time.time() - start)
    print('gpu time took: {0:.4f}'.format(time.time() - start))

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(sizes, gpu_times, label='GPU')
ax.plot(sizes, cpu_times, label='CPU')
plt.xlabel('MATRIX SIZE')
plt.ylabel('TIME (sec)')
plt.legend()
plt.show()

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系:hwhale#tublm.com(使用前将#替换为@)

在 Tensorflow 中训练简单模型 GPU 比 CPU 慢的相关文章

将 saxon 与 python 结合使用

我需要使用 python 处理 XSLT 目前我正在使用仅支持 XSLT 1 的 lxml 现在我需要处理 XSLT 2 有没有办法将 saxon XSLT 处理器与 python 一起使用有两种可能的方法设置一个 HTTP 服务接受
使 django 服务器可以在 LAN 中访问

我已经安装了Django服务器可以如下访问 http localhost 8000 get sms http 127 0 0 1 8000 get sms 假设我的IP是x x x x 当我这样做时从同一网络下的另一台电脑 my ip
从字符串中删除识别的日期

作为输入我有几个包含不同格式日期的字符串例如彼得在16 45 我的生日是1990年7月8日 On 7 月 11 日星期六我会回家 I use dateutil parser parse识别字符串中的日期在下一步中我想从字符串中删除
使用 on_bad_lines 将 pandas.read_csv 中的无效行写入文件

我有一个 CSV 文件我正在使用 Python 来解析该文件我发现文件中的某些行具有不同的列数 001 Snow Jon 19801201 002 Crom Jake 19920103 003 Wise Frank 19880303 l
是否可以忽略一行的pyright检查？

我需要忽略一行的pyright 检查有什么特别的评论吗 def create slog group SLogGroup data Optional dict None SLog insert one SLog group group da
测试 python Counter 是否包含在另一个 Counter 中

如何测试是否是pythonCounter https docs python org 2 library collections html collections Counter is 包含在另一个中使用以下定义柜台a包含在计数器中b当且
Spark KMeans 无法处理大数据吗？

KMeans 有几个参数training http spark apache org docs latest api python pyspark mllib html highlight kmeans pyspark mllib clus
如何加速Python中的N维区间树？

考虑以下问题给定一组n间隔和一组m浮点数对于每个浮点数确定包含该浮点数的区间子集这个问题已经通过构建一个解决区间树 https en wikipedia org wiki Interval tree 或称为范围树或线段树已经针对一
AWS EMR Spark Python 日志记录

我正在 AWS EMR 上运行一个非常简单的 Spark 作业但似乎无法从我的脚本中获取任何日志输出我尝试过打印到 stderr from pyspark import SparkContext import sys if name m
Pygame：有没有简单的方法可以找到按下的任何字母数字的字母/数字？

我目前正在开发的游戏需要让人们以自己的名义在高分板上计时我对如何处理按键有点熟悉但我只处理过寻找特定的按键有没有一种简单的方法可以按下任意键的字母而不必执行以下操作 for event in pygame event get if
IO 密集型任务中的 Python 多线程

建议仅在 IO 密集型任务中使用 Python 多线程因为 Python 有一个全局解释器锁 GIL 只允许一个线程持有 Python 解释器的控制权然而多线程对于 IO 密集型操作有意义吗 https stackoverflow c
无法在 Python 3 中导入 cProfile

我试图将 cProfile 模块导入 Python 3 3 0 但出现以下错误 Traceback most recent call last File
Jupyter Notebook 内核一直很忙

我已经安装了 anaconda 并且 python 在 Spyder IPython 等中工作正常但是我无法运行 python 笔记本内核被创建它也连接但它始终显示黑圈忙碌符号防火墙或防病毒软件没有问题我尝试过禁用两者我也无法
向 Altair 图表添加背景实心填充

I like Altair a lot for making graphs in Python As a tribute I wanted to regenerate the Economist graph s in Mistakes we
如何在seaborn displot中使用hist_kws

我想在同一图中用不同的颜色绘制直方图和 kde 线我想为直方图设置绿色为 kde 线设置蓝色我设法弄清楚使用 line kws 来更改 kde 线条颜色但 hist kws 不适用于显示我尝试过使用 histplot 但我无法为
使用 Python 绘制 2D 核密度估计

I would like to plot a 2D kernel density estimation I find the seaborn package very useful here However after searching
Python：如何将列表列表的元素转换为无向图？

我有一个程序可以检索 PubMed 出版物列表并希望构建一个共同作者图这意味着对于每篇文章我想将每个作者如果尚未存在添加为顶点并添加无向边或增加每个合著者之间的权重我设法编写了第一个程序该程序检索每个出版物的作者列表并
Scrapy：如何使用元在方法之间传递项目

我是 scrapy 和 python 的新手我试图将 parse quotes 中的项目 item author 传递给下一个解析方法 parse bio 我尝试了 request meta 和 response meta 方法如 sc
如何解释tf.map_fn的结果？

看代码 import tensorflow as tf import numpy as np elems tf ones 1 2 3 dtype tf int64 alternates tf map fn lambda x x x x el
如何将输入读取为数字？

这个问题的答案是社区努力 help privileges edit community wiki 编辑现有答案以改进这篇文章目前不接受新的答案或互动 Why are x and y下面的代码中使用字符串而不是整数注意在Python 2

随机推荐

Pygame 缩放精灵

如何将精灵的图像放大或缩小我可以更改矩形和所有内容但不能更改图像代码虽然我不确定为什么你需要它 class test pygame sprite Sprite def init self pygame sprite Sprite i
.net - C# 2.0 应用程序中的玻璃效果

如何在 net 2 0 中的 Windows 窗体应用程序上提供 Vista 或 Mac OS X 风格的玻璃效果这是通过使用 Vista DWM 桌面窗口管理器 API 的互操作来完成的例如导入这些函数 DllImport dwma
@BeanProperty 具有 PropertyChangeListener 支持吗？

BeanProperty生成简单的get set方法有没有办法自动生成此类方法并支持触发属性更改事件例如我想将其与 JFace 数据绑定一起使用我也有同样的问题并一直在密切关注可能的答案我想我刚刚偶然发现了一个尽管我还没有尝试
类路径中的 Flutter 运行时 JAR 文件应具有相同的版本

Building without sound null safety For more information see https dart dev null safety unsound null safety w Runtime JAR
小阴谋家 - 从哪里开始？

我刚刚打开小阴谋家我觉得我错过了一些东西第一个问题问这是一个原子吗但我没有看到原子是什么的任何定义我想我可以通过问题的答案推导出什么是原子但随后它继续问 l 的 car 是什么 l 的 cdr 是什么我不知道在问什么这本书
AngularJS：工厂 $http 服务

我试图理解 Angular 中工厂和服务的概念我在控制器下有以下代码 init function init http post services type getSource ID TP001 success function data
Java：具有重复键的 Json 可以使用 Jackson 进行映射

我有一个具有相同键但不同值的 json 文件如下所示 domains A name a type a1 B name r type g1 A name b type b1 这是来自外部系统如何转换json 到 java 映射对象并访问不
JQuery 如何 .find() 不区分大小写？
Fabric.loadSVGFromString 导致结果扭曲

我用 inkscape 编辑了 SVG
如何在xtable表格中放置颜色间距？

如何在xtable表格中放置颜色间距我使用以下说明生成表格 test table lt xtable summary test caption test floating FALSE align test table lt c l pri
DataGridView 中明显的内存泄漏

如何强制 DataGridView 释放其对绑定 DataSet 的引用我们有一个相当大的数据集显示在 DataGridView 中并注意到 DataGridView 关闭后资源没有被释放如果用户重复查看此报告他们最终会收到内存不足
我在 intellij 中的 jar 资源文件是只读的，我需要编辑它们

我已经尝试了几个小时来编辑我用作库的 jar 中的 java 文件但没有成功我已将资源标记为内容根和源根但我仍然无法编辑 jar 中的代码该项目编译并运行正确但我需要对资源文件进行调整但不能我尝试了所有我能想到的项目结构难道
kotlin如何通过delegate使用this来实例化viewmodel

我正在阅读 google android 架构示例并遇到了这个有人可以向我解释这个代表是如何工作的吗 private val viewModel by viewModels
如何在C++中“返回一个对象”？

我知道这个标题听起来很熟悉因为有很多类似的问题但我要求问题的不同方面我知道将东西放在堆栈上和将它们放在堆上之间的区别在Java中我总是可以返回对本地对象的引用 public Thing calculateThing Thing
Mono 可以在 rdlc 中创建/运行报告吗？

我从未使用过 mono 很好奇 mono 是否可以创建运行 rdlc 报告我正在寻找实现的是一个单声道 asp net mvc 应用程序用于使用 rdlc 创建报告并导出为 pdf 单声道可以吗有一些开源项目尝试在 NET 中实现
twitter 没有重定向到 android 应用程序中的回调 url

我的目标是允许使用 twitter4j 登录 Twitter 我用这个作为参考 https github com Sheikh Aman Android Samples blob master 1 20Sign inWithTwitterT
Jsoup：忽略 SSL 错误

我正在尝试下载https www deviantart com https www deviantart com使用 Jsoup v1 10 3 以及validateTLSCertificates false Java 8 已安装 Unli
使用 OpenTok 暂停视频通话

我一直在研究 webRTC 平台发现 OpenTok 似乎提供了最可定制的功能在深入研究之前我想确保它可以满足一项关键要求在两个用户 A 和 B 之间的 1 1 视频通话期间我希望其中一个用户让我们与用户 A 一起能够接收来自
Django - 无法获取 highchart 来显示数据

我尝试按照以下解决方案在 Highchart 的帮助下显示图表通过 JSON 将 Django 数据库查询集传递到 Highcharts https stackoverflow com questions 27810087 passing
在 Tensorflow 中训练简单模型 GPU 比 CPU 慢

我在 Tensorflow 中设置了一个简单的线性回归问题并在 1 13 1 中使用 Tensorflow CPU 和 GPU 创建了简单的 conda 环境在 NVIDIA Quadro P600 的后端使用 CUDA 10 0 然而

在 Tensorflow 中训练简单模型 GPU 比 CPU 慢

在 Tensorflow 中训练简单模型 GPU 比 CPU 慢 的相关文章

随机推荐

热门标签

在 Tensorflow 中训练简单模型 GPU 比 CPU 慢的相关文章