yolov5的TensorRT部署--warpaffine_cuda核函数

2023-11-08

从0到1实现基于tensorrt的yolo部署教程 http://t.csdn.cn/HUn4T,请点击该链接，即可看到全文

本文对于上面的案例，将预处理使用cuda核函数进行加速

一、cuda核函数的基本概念

1.1 CUDA C基础

核函数是cuda编程的关键，实现程序利用显卡进行高效的并行计算。对于普通在CPU上运行的代码，用c文件或者cpp文件保存；而对于GPU上运行的代码，则用主要cu文件保存，该cu文件使用nvcc的编译器。相对于传统的c/c++，cu文件主要以下几个不同点：

函数类型限定符（如__global__：核函数，由host调用, device：设备函数，由device调用，host：host函数，由host调用）
执行配置运算符：function<<<gridDim,blockDim,sharememorysize,stream>>>(…);—gridDim:块个数，blockDim:一个块中的线程个数，sharememorysize：共享内存的大小，stream：流
只有__global__修饰的函数才可以用<<<>>>的方式调用
调用核函数是传值的，不能传2引用，可以传递类、结构体等
核函数的执行，是异步的，也就是立即返回，所以在使用核函数的时候，记得加入同步等待的代码
核函数内访问线程索引主要用到threadidx、blockidx、blockdim、griddim这些内置变量,其中，总线程数量为以下公式：

总线程数量=blockdim.x * blockdim.y * blockdim.z * griddim.x * griddim.y * griddim.z

在使用的时候，核函数会根据需要被分配给N个不同的线程（thread），并行地执行N次，以达到并行计算加速的作用。

1.2 如何确定运行参数和线程索引？

运行参数的确定
cuda核函数添加了<<<>>>尖括号配置信息，尖括号内的配置信息并不是传递给核函数的，而是传递给CUDA运行时系统，告诉运行时系统如何启动核函数。确定块个数和线程个数的一般步骤为：
1）先根据GPU设备的硬件资源确定一个块内的线程个数；
2）再根据数据大小和每个线程处理数据个数确定块个数。
参考代码如下：

//每个块内有256个线程
unsigned int threads = 256;
//每个线程处理4个数据，注意这4个数不是相邻的
unsigned int unroll = 4;
//根据数据量计算出块的个数
//为了保证线程数足够，在数据量的基础上加了threads-1，相当于向上取整
unsigned int blocks = (dataNum + threads -1)/threads/unroll;
cudaKernel<<<blocks, threads>>>(***);

注释：
1）blocks的最大为(21亿，65536，65536）
2）threads的最大为(1024,64,64),其中threads.x、threads.y和threads.z的乘积<=1024.

线程索引确定
首先从简单的一维结构来确认线程索引
在这里插入图片描述
grid在x,y,z方向上都有block, block在x,y,z三个方向都有thread。因此对于一维结构来说，我们可以通过blockIdx索引到线程块，通过threadIdx索引到某个块内的线程，x方向上thread索引为：

int idx = threadIdx.x + blockIdx.x * blockDim.x；

1.3 简单案例

将一个数组中的每个元素在GPU上进行并行求解其sigmoid的数值
创建一个cpp文件

#include <cuda_runtime.h>
#include <stdio.h>


void test_print(const float* pdata, int ndata);

int main(){
    float* parray_host = nullptr;
    float* parray_device = nullptr;
    int narray = 10;
    int array_bytes = sizeof(float) * narray;
    // pageable memory
    parray_host = new float[narray];
    // global memory
    cudaMalloc(&parray_device, array_bytes);

    for(int i = 0; i < narray; ++i)
        parray_host[i] = i;
    
    cudaMemcpy(parray_device, parray_host, array_bytes, cudaMemcpyHostToDevice);
    // 调用核函数
    test_print(parray_device, narray);
    // 核函数的调用都是异步执行的，需要加入cudaDeviceSynchronize()；
    cudaDeviceSynchronize();

    cudaFree(parray_device);
    delete[] parray_host;
    return 0;
}

再创建一个cu文件

#include <stdio.h>
#include <cuda_runtime.h>
#include <math.h>
__device__ float sigmoid(float x)
{
    return 1/(1+exp(-x));
}

__global__ void test_print_kernel(const float* pdata, int ndata){

    // idx用于分辨线程中的id号
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    /*    dims                 indexs
        gridDim.z            blockIdx.z
        gridDim.y            blockIdx.y
        gridDim.x            blockIdx.x
        blockDim.z           threadIdx.z
        blockDim.y           threadIdx.y
        blockDim.x           threadIdx.x

        Pseudo code:
        position = 0
        for i in 6:
            position *= dims[i]
            position += indexs[i]
    */
    float a = sigmoid(pdata[idx]);
    printf("sigmoid(%f) = %f\n", pdata[idx],a);
}

void test_print(const float* pdata, int ndata){
    // 调用核函数
    test_print_kernel<<<2, ndata/2, 0, nullptr>>>(pdata, ndata);
}

二、Warpaffine仿射变换

Warpaffine仿射变换主要为了解决图像的缩放和平移来处理目标检测中常见的预处理行为：
在这里插入图片描述
Warpaffine仿射变换的特点如下：

warpaffine是对图像做平移缩放旋转进行综合统一描述的方法（可通过一个矩阵变换来实现），同时也是一个很容易实现cuda并行加速的算法。
在深度学习领域通常需要做预处理，比如CopyMakeBorder，RGB->BGR，减去均值除以标准差，BGRBGRBGR -> BBBGGGRRR。
如果使用cuda进行并行加速实现，那么整个预处理都进行统一，并且性能也很好。
由于warpaffine是标准的矩阵映射坐标，并且可逆，所以逆变换就是其变换矩阵的逆矩阵。

对于如何获得缩放和平移的矩阵，可参考该链接。

双线性插值
由于一般而言，resize图中的某个像素点映射到原图上可能是非整数的像素点。假设我们要求这个紫色点的像素值：
在这里插入图片描述
要求这个紫色点的颜色我们就需要知道它周围四个点颜色的加权和，而每个点的权重，则是其对面矩形区域的面积，占总面积的比例。

其算法原理为，我们先得到我们周围的四个点，使用保留最近的最大整数值并且小于我们当前的坐标x，求得x1，x2则使用+1的形式进行求解，那么y1、y2也是同理。这样我们就可以得出这个周围框框的范围，之后我们再使用四个点的像素值，去乘以它们对角的面积，也就是上面那张图那样的求法，就可以得出相应的值。

三、基于 warpaffine_cuda核函数的预处理

在基于yolov5的TensorRT部署的main.cpp文件中，将部分代码按照以下的内容替换

    // 设置原图和resize图的指针和占用空间大小
    uint8_t* psrc_device = nullptr;
    float* pdst_device = nullptr;
    size_t src_size = image.cols * image.rows * 3;      // 行rows:Y (height) 列cols:X (width)
    size_t dst_size = output.cols * output.rows * 3*sizeof(float);
    // 开辟显存
    checkRuntime(cudaMalloc(&psrc_device, src_size)); // 在GPU上开辟两块空间   global memory
    checkRuntime(cudaMalloc(&pdst_device, dst_size));
    while(1)
    {
        clock_t startTime, endTime;
        startTime = clock();
        // 获取图像
        cap >> image;

        // image.data搬运数据到GPU上
        checkRuntime(cudaMemcpy(psrc_device, image.data, src_size, cudaMemcpyHostToDevice));
        // 在CPU上的仿射变换,除了变换图片的尺寸（通过仿射变换），还有减均值除方差、bgrbgrbgr->bbbgggrrr
        warp_affine_bilinear(
                    psrc_device, image.cols * 3, image.cols, image.rows,
                    pdst_device, output.rows * 3, output.rows, output.rows,
                    114
                    );
        // 由于cuda核函数的执行无论stream是否为nullptr，都将会是异步执行（即调用核函数，立马返回），这就需要加个同步等待
        cudaDeviceSynchronize();
        std::cout << "预处理时间: " << (double)(clock() - startTime) / CLOCKS_PER_SEC << "s" << std::endl;

        auto startTime1 = clock();
        // 设置模型推理的输入输出
        float *bindings[] = {pdst_device, output_data_device};
        .....
   }
    checkRuntime(cudaStreamDestroy(stream));
    checkRuntime(cudaFree(psrc_device));
    checkRuntime(cudaFree(pdst_device));
    checkRuntime(cudaFreeHost(output_data_host));
    checkRuntime(cudaFree(output_data_device));

再创建一个cu文件

#include <cuda_runtime.h>

#define min(a, b)  ((a) < (b) ? (a) : (b))
#define num_threads   512

typedef unsigned char uint8_t;

struct Size{
    int width = 0, height = 0;

    Size() = default;
    Size(int w, int h)
    :width(w), height(h){}
};

// 计算仿射变换矩阵
// 计算的矩阵是居中缩放
struct AffineMatrix{
    float i2d[6];       // image to dst(network), 2x3 matrix
    float d2i[6];       // dst to image, 2x3 matrix

    // 这里其实是求解imat的逆矩阵，由于这个3x3矩阵的第三行是确定的0, 0, 1，因此可以简写如下
    void invertAffineTransform(float imat[6], float omat[6]){
        float i00 = imat[0];  float i01 = imat[1];  float i02 = imat[2];
        float i10 = imat[3];  float i11 = imat[4];  float i12 = imat[5];

        // 计算行列式
        float D = i00 * i11 - i01 * i10;
        D = D != 0 ? 1.0 / D : 0;

        // 计算剩余的伴随矩阵除以行列式
        float A11 = i11 * D;
        float A22 = i00 * D;
        float A12 = -i01 * D;
        float A21 = -i10 * D;
        float b1 = -A11 * i02 - A12 * i12;
        float b2 = -A21 * i02 - A22 * i12;
        omat[0] = A11;  omat[1] = A12;  omat[2] = b1;
        omat[3] = A21;  omat[4] = A22;  omat[5] = b2;
    }
    void compute(const Size& from, const Size& to){
        float scale_x = to.width / (float)from.width;
        float scale_y = to.height / (float)from.height;
        float scale = min(scale_x, scale_y);
        i2d[0] = scale;  i2d[1] = 0;  i2d[2] =
            -scale * from.width  * 0.5  + to.width * 0.5 + scale * 0.5 - 0.5;

        i2d[3] = 0;  i2d[4] = scale;  i2d[5] =
            -scale * from.height * 0.5 + to.height * 0.5 + scale * 0.5 - 0.5;
        invertAffineTransform(i2d, d2i);
    }
};
__device__ void affine_project(float* matrix, int x, int y, float* proj_x, float* proj_y){

    // matrix
    // m0, m1, m2
    // m3, m4, m5
    *proj_x = matrix[0] * x + matrix[1] * y + matrix[2];
    *proj_y = matrix[3] * x + matrix[4] * y + matrix[5];
}
__global__ void warp_affine_bilinear_kernel(
    uint8_t* src, int src_line_size, int src_width, int src_height,
    float* dst, int dst_line_size, int dst_width, int dst_height,
    uint8_t fill_value, AffineMatrix matrix
){

    // 一个thread负责一个像素（3个通道）
    // cuda核函数是并行运行的，计算idx是为了指定某个线程，让某个线程执行以下代码
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // 通过线程的idx来判断出执行的线程应该执行哪个图像中的像素点
    const int dx = idx % dst_width;
    const int dy = idx / dst_width;
    // 只有dx和dx在dst_width和dst_height的范围内，才需要往下执行，否者直接return
    if (dx >= dst_width || dy >= dst_height)
        return;
    // 将像素点的数值默认设置都为fill_value
    float c0 = fill_value, c1 = fill_value, c2 = fill_value;
    float src_x = 0; float src_y = 0;

    // 通过仿射变换矩阵的逆变换，可以知道dx和dy在原图中哪里取值
    affine_project(matrix.d2i, dx, dy, &src_x, &src_y);
    // 已知src_x和src_y,怎么考虑变换后的像素值----双线性差值
    // 仿射变换的逆变换的src_x和src_y超过范围了
    if(src_x < -1 || src_x >= src_width || src_y < -1 || src_y >= src_height){
        // out of range
        // 超出范围,像素点的数值设置都为fill_value
        c0 = fill_value;
        c1 = fill_value;
        c2 = fill_value;
    }else{
        // 由于resize图中的像素点映射到原图上的，由于映射到原图，得到的像素点可能为非整数，如果求解这个在原图非整数对应的resize图上像素点的数值呢？通过原图非整数像素值周围的四个像素点来确定
        // 因此需要定义y_low、x_low、y_high、x_high
        int y_low = floorf(src_y);   //  floorf：求最大的整数，但是不大于原数值
        int x_low = floorf(src_x);
        int y_high = y_low + 1;
        int x_high = x_low + 1;

        // const_values[]常量数值，为啥是3个呢？因为一个像素点为3通道
        uint8_t const_values[] = {fill_value, fill_value, fill_value};
        float ly    = src_y - y_low;
        float lx    = src_x - x_low;
        float hy    = 1 - ly;
        float hx    = 1 - lx;
        // 对于原图上的四个点，如何计算中间非整数的点的像素值呢？---通过双线性插值
        float w1    = hy * hx, w2 = hy * lx, w3 = ly * hx, w4 = ly * lx;  // 仿射变换的双线性插值的权重求解
        uint8_t* v1 = const_values;
        uint8_t* v2 = const_values;
        uint8_t* v3 = const_values;
        uint8_t* v4 = const_values;
        if(y_low >= 0){
            if (x_low >= 0)
                v1 = src + y_low * src_line_size + x_low * 3;

            if (x_high < src_width)
                v2 = src + y_low * src_line_size + x_high * 3;
        }

        if(y_high < src_height){
            if (x_low >= 0)
                v3 = src + y_high * src_line_size + x_low * 3;

            if (x_high < src_width)
                v4 = src + y_high * src_line_size + x_high * 3;
        }
        // 为啥要加0.5f，为了四舍五入
        c0 = floorf(w1 * v1[0] + w2 * v2[0] + w3 * v3[0] + w4 * v4[0] + 0.5f);
        c1 = floorf(w1 * v1[1] + w2 * v2[1] + w3 * v3[1] + w4 * v4[1] + 0.5f);
        c2 = floorf(w1 * v1[2] + w2 * v2[2] + w3 * v3[2] + w4 * v4[2] + 0.5f);
    }
    // mean={0，0，0}，std={255.0，255.0，255.0}
    c0 = (c0-0)/255.0;
    c1 = (c1-0)/255.0;
    c2 = (c2-0)/255.0;
    // bgrbgrbgr->bbbgggrrr
    int stride = dst_width*dst_height;
    dst[dy*dst_width + dx] = c0;
    dst[stride + dy*dst_width + dx] = c1;
    dst[stride*2 + dy*dst_width + dx] = c2;

}

void warp_affine_bilinear(
    uint8_t* src, int src_line_size, int src_width, int src_height,
    float* dst, int dst_line_size, int dst_width, int dst_height,
    uint8_t fill_value
){
    // 需要多少threads，启动dst_width*dst_height个线程，是为了让一个线程处理一个像素点
    const int n = dst_width*dst_height;
    // 设置一个块启动的线程个数
    int block_size = 1024;
    // 设置块的个数
    // 为啥要加上block_size-1，这是因为n/block_size有出现有小数的情况，为了向上取整，所以加上了block_size-1
    const int grid_size = (n + block_size - 1) / block_size;

    AffineMatrix affine;
    // 求解仿射变换矩阵---是为了得到原图和resize图的转换矩阵，通过该矩阵可以很方便根据原图来求出reize图中像素点的数值
    affine.compute(Size(src_width, src_height), Size(dst_width, dst_height));
    // 下面的函数就是核函数,核函数的格式必须包含<<<...>>>
    // 在<<<...>>>中，第一个参数是指定块个数，第二个参数指定一个块中的线程个数，第三个参数是共享内存，第四个参数是stream
    warp_affine_bilinear_kernel<<<grid_size, block_size, 0, nullptr>>>(
        src, src_line_size, src_width, src_height,
        dst, dst_line_size, dst_width, dst_height,
        fill_value, affine
    );
}

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系:hwhale#tublm.com(使用前将#替换为@)