Fixing "Read -1, expected xxx, errno = 1" errors when running mpi4py

2023-05-16

Contents

Problem Description
Code 1 (serial)
Code 2 (parallel)
Command Used to Run Code 2
Error Message
Solutions
Solution 1
Solution 2

Problem Description

I was learning mpi4py today, and while comparing the runtimes of the two programs below, the parallel one kept throwing errors:

Code 1 (serial)

import numpy as np
import time

np.random.seed(2)
size = 1000000

x1 = np.random.random(size)
x2 = np.random.random(size)
result = np.zeros(size, dtype=float)

since = time.time()
for i in range(size):
    result[i] = x1[i] + x2[i]
end = time.time()

print(end - since)

Code 2 (parallel)

from mpi4py import MPI
import numpy as np
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nprocs = comm.Get_size()

size = 1000000
x1 = np.random.random(size)
x2 = np.random.random(size)

if rank == 0:
    # Split the work as evenly as possible: the first `res` ranks get one extra element.
    ave, res = divmod(size, nprocs)

    count = [ave + 1 if p < res else ave for p in range(nprocs)]
    count = np.array(count, dtype=np.int64)

    displ = [sum(count[:p]) for p in range(nprocs)]
    displ = np.array(displ, dtype=np.int64)
else:
    # Non-root ranks receive `count` via Bcast below; `displ` is only needed on the root.
    count = np.zeros(nprocs, dtype=np.int64)  # np.int was removed from NumPy; use an explicit dtype
    displ = None

t0 = time.time()
comm.Bcast(count, root=0)

recvbuf1 = np.zeros(count[rank])
recvbuf2 = np.zeros(count[rank])

comm.Scatterv([x1, count, displ, MPI.DOUBLE], recvbuf1, root=0)
comm.Scatterv([x2, count, displ, MPI.DOUBLE], recvbuf2, root=0)

print('After Scatterv, process {} has data:'.format(rank), recvbuf1)
print('After Scatterv, process {} has data:'.format(rank), recvbuf2)

for i in range(recvbuf1.shape[0]):
    recvbuf1[i] += recvbuf2[i]

sendbuf2 = recvbuf1
recvbuf2 = np.zeros(sum(count))
comm.Gatherv(sendbuf2, [recvbuf2, count, displ, MPI.DOUBLE], root=0)

if comm.Get_rank() == 0:
    print('vector addition computed in {:.3f} sec'.format(time.time() - t0))
    print('After Gatherv, process 0 has data:', recvbuf2)
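
To make the Scatterv bookkeeping concrete, here is a small worked example of the count/displ computation (my own illustration, not part of the original test): with size = 10 and nprocs = 4, divmod(10, 4) gives ave = 2 and res = 2, so ranks 0 and 1 each get one extra element.

ave, res = divmod(10, 4)                                  # ave = 2, res = 2
count = [ave + 1 if p < res else ave for p in range(4)]   # [3, 3, 2, 2]
displ = [sum(count[:p]) for p in range(4)]                 # [0, 3, 6, 8]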

Command Used to Run Code 2

# mpi_test.py is the file containing the code above; it is executed as root, hence the --allow-run-as-root flag
mpirun -np 4 --allow-run-as-root python mpi_test.py

Error Message

This was my third attempt at fixing this error, and I finally found a solution. It was not easy.
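
For reference, the failing lines in the mpirun output have the form quoted in this post's title:

Read -1, expected xxx, errno = 1

where xxx is the number of bytes Open MPI expected to read. errno = 1 is EPERM ("Operation not permitted"), which is consistent with the explanation below: the cross-memory-attach system calls are being blocked inside the container.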

Solutions

Reference:

python - Possible buffer size limit in mpi4py Reduce() - Stack Overflow

The answer there explains that the root cause of the error is:

The issue comes from the Cross-Memory Attach (CMA) system calls process_vm_readv() and process_vm_writev() that the shared-memory BTLs (Byte Transfer Layers, a.k.a. the things that move bytes between ranks) of Open MPI use to accelerate shared-memory communication between ranks that run on the same node by avoiding copying the data twice to and from a shared-memory buffer. This mechanism involves some setup overhead and is therefore only used for larger messages, which is why the problem only starts occurring after the messages size crosses the eager threshold.
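
In short, small messages go through the ordinary copy-in/copy-out shared-memory path, and only messages above the eager threshold use CMA, which is why small test sizes run fine. If you want to see the threshold on your own installation, ompi_info can print the relevant BTL parameter (this assumes Open MPI 1.8+, where the shared-memory BTL is called vader):

ompi_info --param btl vader --level 9 | grep eager_limit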

There are two ways to fix it:

Solution 1

Add the following flag when starting the container with docker run:

--cap-add=SYS_PTRACE
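
For example (the image name is a placeholder; only the --cap-add flag matters here):

docker run --cap-add=SYS_PTRACE -it your_image /bin/bash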

However, the container I was given had already been created for me, and I do not have permission to run docker run, so I had to go with Solution 2.

Solution 2

Disable CMA.

For Open MPI versions before 1.8, add this flag when running mpirun:

mpirun --mca btl_sm_use_cma 0 ...

For Open MPI 1.8 and later, add this flag instead:

mpirun --mca btl_vader_single_copy_mechanism none
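
If you are unsure which version you have, mpirun --version will tell you. The same MCA parameter can also be set through an environment variable instead of the command line (Open MPI reads MCA settings from OMPI_MCA_-prefixed variables), which is convenient when mpirun is launched by another script:

export OMPI_MCA_btl_vader_single_copy_mechanism=none
mpirun -np 4 --allow-run-as-root python mpi_test.py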

