Rcpp Parallel or OpenMP for matrix-vector products

2024-01-03

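I am trying to program a naive parallel version of conjugate gradient, so I started with the simple Wikipedia algorithm, and I want to replace the dot products and matrix-vector products with appropriate parallel versions. The RcppParallel documentation has the following code for the dot product using a parallel reduce, and I think I will use that version in my code. I am also trying to write the matrix-vector multiplication, but so far I have not obtained good results compared to base R (no parallelism).

Some versions of the parallel matrix multiplication: using OpenMP, using RcppParallel, a serial version, a serial version using Armadillo, and the benchmark.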

// [[Rcpp::depends(RcppParallel, RcppArmadillo)]]
// [[Rcpp::plugins(openmp)]]
#include <RcppArmadillo.h>   // replaces Rcpp.h; needed for the arma:: types below
#include <RcppParallel.h>
#include <numeric>           // std::inner_product
using namespace RcppParallel;
using namespace Rcpp;

struct InnerProduct : public Worker
{   
   // source vectors
   const RVector<double> x;
   const RVector<double> y;

   // product that I have accumulated
   double product;

   // constructors
   InnerProduct(const NumericVector x, const NumericVector y) 
      : x(x), y(y), product(0) {}
   InnerProduct(const InnerProduct& innerProduct, Split) 
      : x(innerProduct.x), y(innerProduct.y), product(0) {}

   // process just the elements of the range I've been asked to
   void operator()(std::size_t begin, std::size_t end) {
      product += std::inner_product(x.begin() + begin, 
                                    x.begin() + end, 
                                    y.begin() + begin, 
                                    0.0);
   }

   // join my value with that of another InnerProduct
   void join(const InnerProduct& rhs) { 
     product += rhs.product; 
   }
};

struct MatrixMultiplication : public Worker
{
   // source matrix
   const RMatrix<double> A;

    //source vector
   const RVector<double> x;

   // destination matrix
   RMatrix<double> out;

   // initialize with source and destination
   MatrixMultiplication(const NumericMatrix A, const NumericVector x, NumericMatrix out) 
     : A(A), x(x), out(out) {}

   // compute out(i, 0) = <row i of A, x> for each row in the requested range
   void operator()(std::size_t begin, std::size_t end) {
      for (std::size_t i = begin; i < end; i++) {
            // row we will operate on
            RMatrix<double>::Row rowi = A.row(i);

            // `out` has a single column and indices are zero-based, so the
            // column index must be 0; out(i, 1) writes past the end of the buffer
            out(i, 0) = std::inner_product(rowi.begin(), rowi.end(), x.begin(), 0.0);
      }
    }  
};

// [[Rcpp::export]]
double parallelInnerProduct(NumericVector x, NumericVector y) {

   // declare the InnerProduct instance that takes a pointer to the vector data
   InnerProduct innerProduct(x, y);

   // call parallelReduce to start the work
   parallelReduce(0, x.length(), innerProduct);

   // return the computed product
   return innerProduct.product;
}
// the benchmarks below use library(rbenchmark)

// [[Rcpp::export]]
NumericVector matrixXvectorRcppParallel(NumericMatrix A, NumericVector x) {

   // // declare the InnerProduct instance that takes a pointer to the vector data
   // InnerProduct innerProduct(x, y);
   int nrows = A.nrow();
   NumericVector out(nrows);
   for(int i = 0; i< nrows;i++ )
   {
      out(i) = parallelInnerProduct(A(i,_),x);
   }
   // return the computed product
   return out;
}

// [[Rcpp::export]]
arma::rowvec matrixXvectorParallel(const arma::mat& A, const arma::colvec& x){
    // one output element per row of A
    const int filas = A.n_rows;
    arma::rowvec y(filas, arma::fill::zeros);
    #pragma omp parallel for
    for(int j=0;j<filas;j++)
    {
        // dot product of row j with x
        y(j) = arma::dot(A.row(j),x);
    }
    return y;
} 

// [[Rcpp::export]]
arma::mat matrixXvector2(const arma::mat& A, const arma::mat& x){
  // plain Armadillo product, no explicit parallel code
  return A*x;
}

// [[Rcpp::export]]
arma::rowvec matrixXvectorParallel2(const arma::mat& A, const arma::colvec& x){
    const int filas = A.n_rows;
    const int columnas = A.n_cols;
    arma::rowvec y(filas, arma::fill::zeros);

    #pragma omp parallel for
    for(int j = 0; j < filas ; j++){
        // accumulate (row j of A) * x in a thread-local variable
        double result = 0;
        for(int i = 0; i < columnas; i++){
                result += x(i)*A(j,i);   
        }
        y(j) = result;
    }
    return y;
}

Benchmark

                             test replications elapsed relative user.self sys.self user.child sys.child
1                         M %*% a           20   0.026    1.000     0.140    0.060          0         0
2 matrixXvector2(M, as.matrix(a))           20   0.040    1.538     0.101    0.217          0         0
4    matrixXvectorParallel2(M, a)           20   0.063    2.423     0.481    0.000          0         0
3     matrixXvectorParallel(M, a)           20   0.146    5.615     0.745    0.398          0         0
5 matrixXvectorRcppParallel(M, a)           20   0.335   12.885     2.305    0.079          0         0

My latest attempt uses parallelFor from RcppParallel, but I get a memory error and I don't know where the problem is.

// [[Rcpp::export]]
NumericVector matrixXvectorRcppParallel2(NumericMatrix A, NumericVector x) {

   // allocate the output as an nrows x 1 matrix
   int nrows = A.nrow();
   NumericMatrix out(nrows,1);
   // create the worker
   MatrixMultiplication matrixMultiplication(A, x, out);


   parallelFor(0,A.nrow(),matrixMultiplication);

   // return the computed product
   return out;
}
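
With a memory error like this, one thing worth checking is the indexing of the output. Below is a minimal standalone sketch in plain C++ rather than RcppParallel, assuming the column-major layout that R uses for matrices. The names `colMajorIndex` and `matVecByRows` are hypothetical and introduced only for illustration; the OpenMP pragma is simply ignored if the code is compiled without `-fopenmp`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offset of element (i, j) in a column-major buffer with `nrow` rows.
// Indices are zero-based. (Hypothetical helper, for illustration only.)
inline std::size_t colMajorIndex(std::size_t i, std::size_t j, std::size_t nrow) {
    return j * nrow + i;
}

// y = A * x with one output element per row. Rows are independent, so the
// outer loop can be split across threads without any synchronization.
std::vector<double> matVecByRows(const std::vector<double>& A,
                                 std::size_t nrow, std::size_t ncol,
                                 const std::vector<double>& x) {
    assert(A.size() == nrow * ncol && x.size() == ncol);
    std::vector<double> y(nrow, 0.0);
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(nrow); ++i) {
        const std::size_t row = static_cast<std::size_t>(i);
        double acc = 0.0;  // private to the thread handling this row
        for (std::size_t j = 0; j < ncol; ++j) {
            acc += A[colMajorIndex(row, j, nrow)] * x[j];
        }
        y[row] = acc;      // single output column: offset `row`, i.e. column 0
    }
    return y;
}
```

For a 2 x 1 output, `colMajorIndex(0, 1, 2)` is 2, which is already past the last valid offset (1). A column index of 1 on a single-column output therefore touches memory outside the buffer.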

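While checking processor usage with htop, I noticed that when I run the ordinary matrix-vector multiplication in base R, all processors are in use. So is matrix multiplication executed in parallel by default? In theory, if it were a serial version, only one processor should be working.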

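If anyone knows which path is better (OpenMP, RcppParallel, or something else) for getting better performance than the apparently serial base R version, I would appreciate the advice.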

Current serial conjugate gradient code

// [[Rcpp::export]]
arma::colvec ConjugateGradient(const arma::mat& A, const arma::colvec& xini, const arma::colvec& b, int num_iteraciones){
    arma::colvec x = xini;               // initial guess
    arma::colvec rkold = b - A*xini;     // initial residual
    arma::colvec rknew(b.n_elem, arma::fill::zeros);
    arma::colvec pk = rkold;             // initial search direction
    double alpha_k = 0;
    double betak = 0;
    double normak = 0.0;

    for(int k=0; k<num_iteraciones;k++){
        Rcout << "iteration number " << k << std::endl;
        arma::colvec Apk = A*pk;         // single matrix-vector product per iteration
        alpha_k = arma::dot(rkold, rkold) / arma::dot(pk, Apk);
        x = x + alpha_k * pk;
        rknew = rkold - alpha_k*Apk;
        normak = arma::dot(rknew, rknew); // squared norm of the new residual
        if( normak < 0.000001){
            break;
        }
        betak = normak / arma::dot(rkold, rkold);

        // update values for the next iteration
        pk = rknew + betak*pk;
        rkold = rknew;
    }

    return x;
}
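
The same iteration can be sketched in plain C++ without Armadillo, which makes it easier to see where a parallel dot product or matrix-vector product would plug in. This is only an illustrative sketch under stated assumptions: `dot`, `matVec`, `conjugateGradient`, and the `tol` parameter are hypothetical names, the matrix is assumed to be symmetric positive definite and stored as a flat n x n array, and the product A*p is computed once per iteration and reused.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Serial dot product; a parallel reduction could replace this function.
double dot(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// y = A * x for a dense n x n matrix stored in a flat vector.
// A is assumed symmetric, so row- and column-major layouts coincide.
Vec matVec(const Vec& A, const Vec& x) {
    const std::size_t n = x.size();
    assert(A.size() == n * n);
    Vec y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            y[i] += A[i * n + j] * x[j];
    return y;
}

// Conjugate gradient for A x = b, starting from the initial guess x.
// `tol` is compared against the squared residual norm.
Vec conjugateGradient(const Vec& A, const Vec& b, Vec x,
                      int maxIter, double tol = 1e-6) {
    const std::size_t n = b.size();
    Vec r = b;
    const Vec Ax = matVec(A, x);
    for (std::size_t i = 0; i < n; ++i) r[i] -= Ax[i];  // r = b - A x
    Vec p = r;
    double rsOld = dot(r, r);

    for (int k = 0; k < maxIter && rsOld >= tol; ++k) {
        const Vec Ap = matVec(A, p);  // only matrix-vector product per iteration
        const double alpha = rsOld / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        const double rsNew = dot(r, r);
        if (rsNew < tol) break;
        const double beta = rsNew / rsOld;
        for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rsOld = rsNew;
    }
    return x;
}
```

Each iteration needs one matrix-vector product and two dot products, so those are the only kernels where parallelism could pay off. The vector updates are cheap by comparison.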

I did not know that R uses BLAS; thanks to Hong Ooi and tim18. Here are new benchmarks using options(matprod = "internal") and options(matprod = "blas").

options(matprod = "internal")
res<-benchmark(M%*%a,matrixXvector2(M,as.matrix(a)),matrixXvectorParallel(M,a),matrixXvectorParallel2(M,a),matrixXvectorRcppParallel(M,a),order="relative",replications = 20)
res

                             test replications elapsed relative user.self sys.self user.child sys.child
2 matrixXvector2(M, as.matrix(a))           20   0.043    1.000     0.107    0.228          0         0
4    matrixXvectorParallel2(M, a)           20   0.069    1.605     0.530    0.000          0         0
1                         M %*% a           20   0.072    1.674     0.071    0.000          0         0
3     matrixXvectorParallel(M, a)           20   0.140    3.256     0.746    0.346          0         0
5 matrixXvectorRcppParallel(M, a)           20   0.343    7.977     2.272    0.175          0         0

With options(matprod = "blas"):

options(matprod = "blas")

res<-benchmark(M%*%a,matrixXvector2(M,as.matrix(a)),matrixXvectorParallel(M,a),matrixXvectorParallel2(M,a),matrixXvectorRcppParallel(M,a),order="relative",replications = 20)
res
                             test replications elapsed relative user.self sys.self user.child sys.child
1                         M %*% a           20   0.021    1.000     0.093    0.054          0         0
2 matrixXvector2(M, as.matrix(a))           20   0.092    4.381     0.177    0.464          0         0
5 matrixXvectorRcppParallel(M, a)           20   0.328   15.619     2.143    0.109          0         0
4    matrixXvectorParallel2(M, a)           20   0.438   20.857     3.036    0.000          0         0
3     matrixXvectorParallel(M, a)           20   0.546   26.000     3.667    0.127          0         0

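As you have already found out, base R matrix multiplication can be multi-threaded if a multi-threaded BLAS implementation is used. This is the case for the rocker/* Docker images, which typically use OpenBLAS.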

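In addition, (Rcpp)Armadillo already uses the BLAS library that R uses (in this case multi-threaded OpenBLAS), as well as OpenMP. So your "serial" version is actually multi-threaded. You can verify this in htop with a sufficiently large matrix as input.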
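
One factor that may contribute to the gap between hand-written loops and the library call is memory layout: R and Armadillo store matrices in column-major order, so a row-by-row dot product reads elements that are `nrow` positions apart. The sketch below is an illustration in plain C++, not what BLAS actually does internally. It traverses the matrix column by column so that the inner loop reads consecutive memory; `matVecByColumns` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y = A * x for a column-major nrow x ncol matrix, traversed column by
// column so that the inner loop reads consecutive memory locations.
std::vector<double> matVecByColumns(const std::vector<double>& A,
                                    std::size_t nrow, std::size_t ncol,
                                    const std::vector<double>& x) {
    assert(A.size() == nrow * ncol && x.size() == ncol);
    std::vector<double> y(nrow, 0.0);
    for (std::size_t j = 0; j < ncol; ++j) {
        const double* col = A.data() + j * nrow;  // start of column j
        const double xj = x[j];
        for (std::size_t i = 0; i < nrow; ++i) {
            y[i] += col[i] * xj;                  // stride-1 access
        }
    }
    return y;
}
```

This version is left serial on purpose: every column updates all of `y`, so splitting the outer loop across threads would need per-thread accumulators or a reduction.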

By the way, what you are trying to do looks like premature optimization (http://wiki.c2.com/?PrematureOptimization) to me.
