如何使用 numba 在 GPU 上推广快速矩阵乘法

2024-02-22

最近，我一直在尝试使用 Numba 库在 Python 中进行 GPU 编程。我一直在他们的网站上使用那里的教程阅读它，目前我陷入了他们的示例，可以在这里找到：https://numba.pydata.org/numba-doc/latest/cuda/examples.html https://numba.pydata.org/numba-doc/latest/cuda/examples.html。我试图稍微概括一下快速矩阵乘法的示例（其形式为 A*B=C）。在测试时，我注意到尺寸不能完全被每块线程数 (TPB) 整除的矩阵不会产生正确的答案。

我从示例中复制了下面的代码https://numba.pydata.org/numba-doc/latest/cuda/examples.html https://numba.pydata.org/numba-doc/latest/cuda/examples.html并创建了一个非常小的测试用例，包含 4 x 4 矩阵。如果我选择 TPB=2 一切都很好，但是当我设置 TPB=3 时就会出错。我知道代码超出了矩阵的范围，但我无法阻止这种情况的发生（我尝试了一些 if 语句ty + i * TPB and tx + i * TPB，但这些都不起作用。

from numba import cuda, float32
import numpy as np
import math

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    if x >= C.shape[0] and y >= C.shape[1]:
        # Quit if (x, y) is outside of valid C boundary
        return

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    C[x, y] = tmp



#%%

x_h = np.arange(16).reshape([4,4])
y_h = np.ones([4,4])
z_h = np.zeros([4,4])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

TPB = 3
threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)

我想编写一些不依赖于尺寸可被 TPB 整除的矩阵 A、B 和 C 的代码，因为这些有时超出了我的控制范围。我知道 GPU 仅在对非常大的矩阵进行矩阵乘法时速度更快，但我想使用小示例来检查答案是否正确，然后再将其应用于实际数据。

可以说至少有两个错误发布的代码 https://numba.pydata.org/numba-doc/latest/cuda/examples.html:

这不可能是正确的范围检查：
```
if x >= C.shape[0] and y >= C.shape[1]:
```
为了让我们决定网格中的特定线程不执行任何加载活动，我们需要either that x超出范围或y超出范围。这and应该是一个or.
It is illegal https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#synchronization-functions to use cuda.syncthreads()在条件代码中，如果块中的所有线程都不能参与该语句。以前的return上述第 1 项中的陈述（即使从and to or）几乎保证了问题大小不能被线程块大小整除的这种非法行为。

因此，要解决这些问题，我们不能仅仅使用简单的方法return越界线程的语句。相反，在加载点，我们必须只允许线程从全局加载到共享内存，如果计算出的全局加载索引（例如A or B) 是界内的（根据定义，共享索引是界内的）。此外，在写入结果时，我们必须只写入在范围内的计算结果C.

以下代码修复了这些项目。对于您给定的测试用例，它似乎可以正常工作：

$ cat t49.py
from numba import cuda, float32
import numpy as np
import math

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = float32(0.)
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = 0
        sB[tx, ty] = 0
        if x < A.shape[0] and (ty+i*TPB) < A.shape[1]:
          sA[tx, ty] = A[x, ty + i * TPB]
        if y < B.shape[1] and (tx+i*TPB) < B.shape[0]:
          sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()
    if x < C.shape[0] and y < C.shape[1]:
        C[x, y] = tmp



#%%

x_h = np.arange(16).reshape([4,4])
y_h = np.ones([4,4])
z_h = np.zeros([4,4])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

TPB = 3
threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h@y_h)
$ cuda-memcheck python t49.py
========= CUDA-MEMCHECK
[[ 6.  6.  6.  6.]
 [22. 22. 22. 22.]
 [38. 38. 38. 38.]
 [54. 54. 54. 54.]]
[[ 6.  6.  6.  6.]
 [22. 22. 22. 22.]
 [38. 38. 38. 38.]
 [54. 54. 54. 54.]]
========= ERROR SUMMARY: 0 errors
$

（注意使用and这里的边界测试是正确的。与测试一组索引是否出界相比，测试一组索引是否入界在布尔意义上是不同的。在界内测试中，我们要求两人都在界内。在越界测试中，任何一个索引越界都是取消资格）。

我并不是说上面的代码没有缺陷或适合任何特定目的。它旨在演示我发现的问题的可能修复方法。正如您所发现的，让共享内存平铺矩阵乘法在每种可以想象的配置中工作并非易事，并且除了此处显示的内容之外，我还没有对它进行测试。（例如，如果您决定使 TPB 大于 32，您会遇到其他问题。此外，原始发布的代码仅用于方阵乘法，这在一般非方阵情况下不起作用。）

如上所述，发布的代码和上面带有“修复”的代码将无法正确处理一般的非方形情况。我相信一些简单的修改将使我们能够处理非方形的情况。简而言之，我们必须将网格设置得足够大，以处理两个输入矩阵的维度，同时仍然只写入输出矩阵的界内值的结果。这是一个经过简单测试的示例：

$ cat t49.py
from numba import cuda, float32
import numpy as np
import math

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = float32(0.)
    for i in range(bpg):
        # Preload data into shared memory
        sA[ty, tx] = 0
        sB[ty, tx] = 0
        if y < A.shape[0] and (tx+i*TPB) < A.shape[1]:
          sA[ty, tx] = A[y, tx + i * TPB]
        if x < B.shape[1] and (ty+i*TPB) < B.shape[0]:
          sB[ty, tx] = B[ty + i * TPB, x]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[ty, j] * sB[j, tx]

        # Wait until all threads finish computing
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp



#%%

x_h = np.arange(115).reshape([5,23])
y_h = np.ones([23,7])
z_h = np.zeros([5,7])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

#TPB must be an integer between 1 and 32
TPB = 32
threadsperblock = (TPB, TPB)
grid_y_max = max(x_h.shape[0],y_h.shape[0])
grid_x_max = max(x_h.shape[1],y_h.shape[1])
blockspergrid_x = math.ceil(grid_x_max / threadsperblock[0])
blockspergrid_y = math.ceil(grid_y_max / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h@y_h)
$ cuda-memcheck python t49.py
========= CUDA-MEMCHECK
[[ 253.  253.  253.  253.  253.  253.  253.]
 [ 782.  782.  782.  782.  782.  782.  782.]
 [1311. 1311. 1311. 1311. 1311. 1311. 1311.]
 [1840. 1840. 1840. 1840. 1840. 1840. 1840.]
 [2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
[[ 253.  253.  253.  253.  253.  253.  253.]
 [ 782.  782.  782.  782.  782.  782.  782.]
 [1311. 1311. 1311. 1311. 1311. 1311. 1311.]
 [1840. 1840. 1840. 1840. 1840. 1840. 1840.]
 [2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
========= ERROR SUMMARY: 0 errors
$

我也重新排序了感觉x and y（以及使用tx and ty) 修复上述代码中的性能问题。原始发布的文档代码中也存在相同的性能问题。

再次强调，不声称无缺陷。此外，我确信可以得出“更优化”的代码。然而，优化矩阵乘法是一项应该很快就会导致使用库实现的练习。使用cupy https://docs.cupy.dev/en/stable/reference/generated/cupy.matmul.html这里的 GPU 方法应该是一种相当简单的方法，可以在 GPU 上利用高质量的矩阵乘法例程。

编辑：正如所讨论的here https://stackoverflow.com/questions/68248548/understanding-shared-memory-use-for-improvement-in-numbaOP的代码（而且，看起来，文档示例 https://numba.pydata.org/numba-doc/latest/cuda/examples.html）在设置方面也存在性能问题tmp多变的。将其更改为适当的 32 位浮点变量会带来重要的性能差异。

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系:hwhale#tublm.com(使用前将#替换为@)

如何使用 numba 在 GPU 上推广快速矩阵乘法的相关文章

Django 代理模型的继承和多态性

我正在开发一个我没有启动的 Django 项目我面临着一个问题遗产我有一个大模型在示例中简化称为MyModel这应该代表不同种类的物品的所有实例对象MyModel应该具有相同的字段但方法的行为根据项目类型的不同而有很大差异到目
Python 的键盘中断不会中止 Rust 函数 (PyO3)

我有一个使用 PyO3 用 Rust 编写的 Python 库它涉及一些昂贵的计算单个函数调用最多需要 10 分钟从 Python 调用时如何中止执行 Ctrl C 好像只有执行结束后才会处理所以本质上没什么用最小可重现示例 Ca
使 django 服务器可以在 LAN 中访问

我已经安装了Django服务器可以如下访问 http localhost 8000 get sms http 127 0 0 1 8000 get sms 假设我的IP是x x x x 当我这样做时从同一网络下的另一台电脑 my ip
使用 matplotlib 绘制时间序列数据并仅在年初显示年份

rcParams date autoformatter month b n Y 我正在使用 matpltolib 来绘制时间序列如果我按上述方式设置 rcParams 则生成的图会在每个刻度处标记月份名称和年份我怎样才能将其设置为仅在每
python 相当于 R 中的 get() （= 使用字符串检索符号的值）

在 R 中 get s 函数检索名称存储在字符变量向量中的符号的值s e g X lt 10 r lt XVI s lt substr r 1 1 X get s 10 取罗马数字的第一个符号r并将其转换为其等效整数尽管花了一些时间翻
根据列值突出显示数据框中的行？

假设我有这样的数据框 col1 col2 col3 col4 0 A A 1 pass 2 1 A A 2 pass 4 2 A A 1 fail 4 3 A A 1 fail 5 4 A A 1 pass 3 5 A A 2 fail 2
如何使用 OpencV 从 Firebase 读取图像？

有没有使用 OpenCV 从 Firebase 读取图像的想法或者我必须先下载图片然后从本地文件夹执行 cv imread 功能有什么办法我可以使用cv imread link of picture from firebase 您可以
Flask如何获取请求的HTTP_ORIGIN

我想用我自己设置的 Access Control Allow Origin 标头做出响应而弄清楚请求中的 HTTP ORIGIN 参数在哪里似乎很混乱我在用着烧瓶 0 10 1 以及HTTP ORIGIN似乎是这个的特点之一object
GLKit的GLKMatrix“列专业”如何？

前提A 当谈论线性存储器中的列主矩阵时列被一个接一个地指定使得存储器中的前 4 个条目对应于矩阵中的第一列另一方面行主矩阵被理解为依次指定行以便内存中的前 4 个条目指定矩阵的第一行 A GLKMatrix4看起来像这样 u
python获取上传/下载速度

我想在我的计算机上监控上传和下载速度一个名为 conky 的程序已经在 conky conf 中执行了以下操作 Connection quality alignr wireless link qual perc wlan0 downspe
Jupyter Notebook 内核一直很忙

我已经安装了 anaconda 并且 python 在 Spyder IPython 等中工作正常但是我无法运行 python 笔记本内核被创建它也连接但它始终显示黑圈忙碌符号防火墙或防病毒软件没有问题我尝试过禁用两者我也无法
如何在Python中对类别进行加权随机抽样

给定一个元组列表其中每个元组都包含一个概率和一个项目我想根据其概率对项目进行采样例如给出列表 3 a 4 b 3 c 我想在 40 的时间内对 b 进行采样在 python 中执行此操作的规范方法是什么我查看了 random 模
Fabric env.roledefs 未按预期运行

On the 面料网站 http docs fabfile org en 1 10 usage execution html 给出这个例子 from fabric api import env env roledefs web hosts
向 Altair 图表添加背景实心填充

I like Altair a lot for making graphs in Python As a tribute I wanted to regenerate the Economist graph s in Mistakes we
如何在seaborn displot中使用hist_kws

我想在同一图中用不同的颜色绘制直方图和 kde 线我想为直方图设置绿色为 kde 线设置蓝色我设法弄清楚使用 line kws 来更改 kde 线条颜色但 hist kws 不适用于显示我尝试过使用 histplot 但我无法为
Python：如何将列表列表的元素转换为无向图？

我有一个程序可以检索 PubMed 出版物列表并希望构建一个共同作者图这意味着对于每篇文章我想将每个作者如果尚未存在添加为顶点并添加无向边或增加每个合著者之间的权重我设法编写了第一个程序该程序检索每个出版物的作者列表并
如何计算 pandas 数据帧上的连续有序值

我试图从给定的数据帧中获取连续 0 值的最大计数其中包含来自 pandas 数据帧的 id date value 列如下所示 id date value 354 2019 03 01 0 354 2019 03 02 0 354 201
使用 Python 的 matplotlib 选择在屏幕上显示哪些图形以及将哪些图形保存到文件中

我想用Python创建不同的图形matplotlib pyplot 然后我想将其中一些保存到文件中而另一些则应使用show 命令然而 show 显示all创建的数字我可以通过调用来避免这种情况close 创建我不想在屏幕上显示的绘图
从列表指向字典变量

假设你有一个清单 a 3 4 1 我想用这些信息来指向字典 b 3 4 1 现在我需要的是一个常规看到该值后在 b 的位置内读写一个值我不喜欢复制变量我想直接改变变量b的内容假设b是一个嵌套字典你可以这样做 reduce di
Statsmodels.formula.api OLS不显示截距的统计值

我正在运行以下源代码 import statsmodels formula api as sm Add one column of ones for the intercept term X np append arr np ones 50

随机推荐

USB 存储设备的无代码 kext

在一个项目中我最近尝试通过 osx 10 9 5 上的 libusb 访问 USB 存储设备但显然一些默认驱动程序声称该设备我无法访问它当我尝试通过 libusb 访问它时它返回我有权限不足当我在 Ubuntu 中执行 jav
如何解释这个C++类型？

今天我正在帮助解决以下错误的代码片段 func被宣布为int参数但是int 作为第二个参数传递给std thread构造函数 include
在单个语句中为多个变量分配相同的值

在 C 中是否有任何方法只是出于好奇因为我今天遇到了对多个变量的多个相同值赋值可以在单个语句中一次将一个值分配给多个变量沿着这些思路伪代码 int num1 1 int num2 1 num1 num2 5 可能不会但我认为值得
不支持的主要版本.次要版本 51.0 OpenShift

部署时我遇到了几个不支持的主要版本和次要版本的错误即 INFO Initializing Spring root WebApplicationContext 2014 08 02 05 46 09 115 localhost start
我在 solr 中创建了一个新的核心，一旦重新启动 Tomcat 服务器，该核心就会被删除。为什么 solr 中的核心会被删除？

我正在使用 apache solr 并且创建了另一个核心并且它工作正常但是一旦我关闭服务器并重新启动它新核心就会被删除但该文件夹似乎位于 solr 目录中任何人都可以告诉我为什么它会从我的 apache solr 中删除吗提前致谢
无法编译的源代码 - 错误的符号类型：java.io.File.getSampleV

import java io File import java io FileInputStream import javax swing JFrame public class filterpanitha public void grap
将 gem 解压/冻结到非 Rails ruby 应用程序中

我正在编写一个非 Rails ruby 应用程序哎呀并且希望能够在供应商子目录中包含该应用程序所需的所有 gem 依赖项这将类似于如何http gemsonrails rubyforge org http gemsonrails ru
通过管道将 emacs shell 输出传输到新缓冲区

例如我希望能够输入如下内容 git diff tempbuffer 并在新的未保存的缓冲区中打开差异如果你使用eshell您可以将输出重定向到缓冲区例如 print foo gt
加密和序列化 stl 字符串和其他容器

我有 stl 容器向量中的数据向量中的每个节点都是一个结构体其中也包含 stl 字符串 struct record string name string location int salary vector lt record gt
在发送restful POST/PUT API之前，如何使用rails3.2.x获取CSRF令牌，而不需要来自服务器端的任何表单/UI？

我知道rails默认从rails3开始提供CSRF保护但我的Web应用程序是单页应用程序所有通信都依赖于ajax 那么如何在每次 ajax 调用之前从服务器获取 CSRF 令牌或者我能做的就是取消 CSRF 保护对吗注意我们不
可以从 .editorconfig 中忽略/排除文件/文件夹吗？

是否可以从 editorconfig 中忽略排除文件文件夹原因我有一个 vendor包含第三方文件的文件夹我不希望该文件夹继承我的任何 editorconfig配置我找到了EditorConfig 属性 https github
C++ 模板：隐式转换，没有匹配的函数来调用 ctor

template
JSF 托管 bean 中列表的初始化

我有一个关于 POJO 中 List 初始化的问题因为它遵循以下代码 public class Person other fields private List
无法将“System.Data.DataRowView”类型的对象转换为“System.IConvertible”类型

我在 C 代码中使用 int i Convert ToInt32 ddlDivisionId SelectedValue 当时就出现了这种类型的错误 Unable to cast object of type System Data Dat
如何使用 Jquery 获取文件的真实路径[重复]

这个问题在这里已经有答案了如何使用 input type file 获取文件的真实路径即 HTML 代码
Python 套接字/端口转发

我用 Python 编写了服务器和客户端程序服务器 py import socket sock socket socket socket AF INET socket SOCK STREAM host socket gethostname
MySQL 限制变量

我的语法有错误 SET start 0 SELECT ROUND count item 2 FROM car INTO until SELECT FROM car LIMIT until OFFSET start 您的 SQL 语法有错误
为什么 Enum.Parse() 返回对象？

这里有很多关于将字符串转换为枚举值的问题一般来说答案看起来类似于这个问题 https stackoverflow com q 16100 298754 StatusEnum MyStatus StatusEnum Enum Parse
Xcode UI 测试手册快照

我想使用 Xcode 7 中引入的 UI 测试手动从我的应用程序中截取一些快照默认情况下 Xcode 会截取每个步骤的屏幕截图结果我得到了在视图控制器之间的转换中间截取的屏幕截图我想在特定时刻手动拍摄快照有什么想法如何做到这一点我
如何使用 numba 在 GPU 上推广快速矩阵乘法

最近我一直在尝试使用 Numba 库在 Python 中进行 GPU 编程我一直在他们的网站上使用那里的教程阅读它目前我陷入了他们的示例可以在这里找到 https numba pydata org numba doc latest

如何使用 numba 在 GPU 上推广快速矩阵乘法

如何使用 numba 在 GPU 上推广快速矩阵乘法 的相关文章

随机推荐

热门标签

如何使用 numba 在 GPU 上推广快速矩阵乘法的相关文章