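I am processing 2208x1242 images from a video inside a while loop, using C++ and OpenCV.
To speed things up, I want to run the operations on the GPU of an Nvidia Jetson Nano.
For the color conversion from BGR to HSV, using cv::cuda::cvtColor instead of cv::cvtColor gave me a 5x speedup.
Unfortunately, the morphological operations are much slower on the GPU: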
int num_frame = 10;
int frame = 0;
cv::Mat img;
cv::cuda::GpuMat img_gpu;
cv::Mat open_kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(11, 11));
cv::Mat close_kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(21, 21));
while (frame < num_frame){
    // load image to img
    // ...
    img_gpu.upload(img);
    cv::Ptr<cv::cuda::Filter> morph_filter_open = cv::cuda::createMorphologyFilter(cv::MORPH_OPEN, img_gpu.type(), open_kernel);
    cv::Ptr<cv::cuda::Filter> morph_filter_close = cv::cuda::createMorphologyFilter(cv::MORPH_CLOSE, img_gpu.type(), close_kernel);
    morph_filter_open->apply(img_gpu, img_gpu);
    morph_filter_close->apply(img_gpu, img_gpu);
    frame++;
}
Measuring only the apply() calls, the GPU version is about 20 times slower than cv::morphologyEx on the Jetson Nano's CPU (0.07 s vs. 1.5 s for a single frame).
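A minimal sketch of how such a per-call measurement can be set up, assuming a plain std::chrono wall clock. The helper name time_ms is hypothetical and not part of OpenCV; the callable passed to it stands in for the two apply() calls. If the measured call is asynchronous, the callable has to block until the device has finished, otherwise only the launch overhead is counted.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper (not an OpenCV API): runs `work` `runs` times and
// returns the mean wall-clock time per run in milliseconds.
// Returns 0.0 if `runs` is not positive.
double time_ms(const std::function<void()>& work, int runs) {
    if (runs <= 0) {
        return 0.0;
    }
    using clock = std::chrono::steady_clock;  // monotonic, safe for intervals
    const auto start = clock::now();
    for (int i = 0; i < runs; ++i) {
        work();  // must block until the measured operation is complete
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::milli> total = stop - start;
    return total.count() / runs;
}
```

In the loop above, the callable would wrap only the two apply() lines, so that upload and filter creation are excluded from the number. Averaging over several frames also keeps one-time costs of the first call from dominating the result.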
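nvprof shows that most of the time is spent in cudaDeviceSynchronize (this is for the whole program, which does more than the code sample above, but the long-running operations are probably related to the morphology):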
API calls:   71.05%  17.2756s   665  25.978ms  25.730us  1.44814s  cudaDeviceSynchronize
              8.36%  2.03194s  1826  1.1128ms  34.844us  847.66ms  cudaLaunchKernel
              5.16%  1.25490s     1  1.25490s  1.25490s  1.25490s  cuCtxDestroy
              4.80%  1.16684s   544  2.1449ms  17.865us  10.378ms  cudaMallocPitch
              1.89%  460.14ms   616  746.98us  20.469us  346.82ms  cudaFree
              1.65%  401.38ms    76  5.2813ms  44.533us  19.211ms  cudaMemcpy2D
              1.45%  352.97ms    51  6.9209ms  18.803us  242.14ms  cudaMalloc
              1.42%  345.25ms     1  345.25ms  345.25ms  345.25ms  cudaFuncGetAttributes
              1.23%  299.95ms     1  299.95ms  299.95ms  299.95ms  cuCtxCreate
              1.03%  251.43ms    20  12.572ms  162.61us  103.74ms  cudaMallocManaged
              0.92%  224.67ms    13  17.283ms  32.553us  65.173ms  cudaMemcpy
              0.56%  135.48ms     1  135.48ms  135.48ms  135.48ms  cudaDeviceReset
...
I hope someone can help me figure out what is going wrong!