Segmentation fault when running a parallel program with Open MPI

2024-02-14

In my previous post, I needed to distribute the data of a PGM file across 10 computers. With the help of Jonathan Dursi and Shawn Chin, I put the code together. I can compile my program, but it crashes with a segmentation fault and nothing is produced. I ran

mpirun -np 10 ./exmpi_2 balloons.pgm output.pgm

and the result was

[ubuntu:04803] *** Process received signal ***
[ubuntu:04803] Signal: Segmentation fault (11)
[ubuntu:04803] Signal code: Address not mapped (1)
[ubuntu:04803] Failing at address: 0x7548d0c
[ubuntu:04803] [ 0] [0x86b410]
[ubuntu:04803] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0x186b00]
[ubuntu:04803] [ 2] ./exmpi_2(main+0x78e) [0x80492c2]
[ubuntu:04803] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x141bd6]
[ubuntu:04803] [ 4] ./exmpi_2() [0x8048aa1]
[ubuntu:04803] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 4803 on node ubuntu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Then I tried running it under valgrind to debug the program and generate output.pgm:

valgrind mpirun -np 10 ./exmpi_2 balloons.pgm output.pgm

The result was

==4632== Memcheck, a memory error detector
==4632== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
==4632== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for copyright info
==4632== Command: mpirun -np 10 ./exmpi_2 2.pgm 10.pgm
==4632==
==4632== Syscall param sched_setaffinity(mask) points to unaddressable byte(s)
==4632==    at 0x4215D37: syscall (syscall.S:31)
==4632==    by 0x402B335: opal_paffinity_linux_plpa_api_probe_init (plpa_api_probe.c:56)
==4632==    by 0x402B7CC: opal_paffinity_linux_plpa_init (plpa_runtime.c:37)
==4632==    by 0x402B93C: opal_paffinity_linux_plpa_have_topology_information (plpa_map.c:494)
==4632==    by 0x402B180: linux_module_init (paffinity_linux_module.c:119)
==4632==    by 0x40BE2C3: opal_paffinity_base_select (paffinity_base_select.c:64)
==4632==    by 0x40927AC: opal_init (opal_init.c:295)
==4632==    by 0x4046767: orte_init (orte_init.c:76)
==4632==    by 0x804A82E: orterun (orterun.c:540)
==4632==    by 0x804A3EE: main (main.c:13)
==4632==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==4632==
[ubuntu:04638] *** Process received signal ***
[ubuntu:04639] *** Process received signal ***
[ubuntu:04639] Signal: Segmentation fault (11)
[ubuntu:04639] Signal code: Address not mapped (1)
[ubuntu:04639] Failing at address: 0x7548d0c
[ubuntu:04639] [ 0] [0xc50410]  
[ubuntu:04639] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0xde4b00]
[ubuntu:04639] [ 2] ./exmpi_2(main+0x78e) [0x80492c2]
[ubuntu:04639] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0xd9fbd6]
[ubuntu:04639] [ 4] ./exmpi_2() [0x8048aa1]
[ubuntu:04639] *** End of error message ***
[ubuntu:04640] *** Process received signal ***
[ubuntu:04640] Signal: Segmentation fault (11)
[ubuntu:04640] Signal code: Address not mapped (1)
[ubuntu:04640] Failing at address: 0x7548d0c
[ubuntu:04640] [ 0] [0xdad410]
[ubuntu:04640] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0xe76b00]
[ubuntu:04640] [ 2] ./exmpi_2(main+0x78e) [0x80492c2]
[ubuntu:04640] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0xe31bd6]
[ubuntu:04640] [ 4] ./exmpi_2() [0x8048aa1]
[ubuntu:04640] *** End of error message ***
[ubuntu:04641] *** Process received signal ***
[ubuntu:04641] Signal: Segmentation fault (11)
[ubuntu:04641] Signal code: Address not mapped (1)
[ubuntu:04641] Failing at address: 0x7548d0c
[ubuntu:04641] [ 0] [0xe97410]
[ubuntu:04641] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0x1e8b00]
[ubuntu:04641] [ 2] ./exmpi_2(main+0x78e) [0x80492c2]
[ubuntu:04641] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x1a3bd6]
[ubuntu:04641] [ 4] ./exmpi_2() [0x8048aa1]
[ubuntu:04641] *** End of error message ***
[ubuntu:04642] *** Process received signal ***
[ubuntu:04642] Signal: Segmentation fault (11)
[ubuntu:04642] Signal code: Address not mapped (1)
[ubuntu:04642] Failing at address: 0x7548d0c
[ubuntu:04642] [ 0] [0x92d410]
[ubuntu:04642] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0x216b00]
[ubuntu:04642] [ 2] ./exmpi_2(main+0x78e) [0x80492c2]
[ubuntu:04642] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x1d1bd6]
[ubuntu:04642] [ 4] ./exmpi_2() [0x8048aa1]
[ubuntu:04642] *** End of error message ***
[ubuntu:04643] *** Process received signal ***
[ubuntu:04643] Signal: Segmentation fault (11)
[ubuntu:04643] Signal code: Address not mapped (1)
[ubuntu:04643] Failing at address: 0x7548d0c
[ubuntu:04643] [ 0] [0x8f4410]
[ubuntu:04643] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0x16bb00]
[ubuntu:04643] [ 2] ./exmpi_2(main+0x78e) [0x80492c2]
[ubuntu:04643] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x126bd6]
[ubuntu:04643] [ 4] ./exmpi_2() [0x8048aa1]
[ubuntu:04643] *** End of error message ***
[ubuntu:04638] Signal: Segmentation fault (11)
[ubuntu:04638] Signal code: Address not mapped (1)
[ubuntu:04638] Failing at address: 0x7548d0c
[ubuntu:04638] [ 0] [0x4f6410]
[ubuntu:04638] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0x222b00]
[ubuntu:04638] [ 2] ./exmpi_2(main+0x78e) [0x80492c2]
[ubuntu:04638] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x1ddbd6]
[ubuntu:04638] [ 4] ./exmpi_2() [0x8048aa1]
[ubuntu:04638] *** End of error message ***
[ubuntu:04644] *** Process received signal ***
[ubuntu:04644] Signal: Segmentation fault (11)
[ubuntu:04644] Signal code: Address not mapped (1)
[ubuntu:04644] Failing at address: 0x7548d0c
[ubuntu:04644] [ 0] [0x61f410]
[ubuntu:04644] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0x1a3b00]
[ubuntu:04644] [ 2] ./exmpi_2(main+0x78e) [0x80492c2]
[ubuntu:04644] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x15ebd6]
[ubuntu:04644] [ 4] ./exmpi_2() [0x8048aa1]
[ubuntu:04644] *** End of error message ***
[ubuntu:04645] *** Process received signal ***
[ubuntu:04645] Signal: Segmentation fault (11)
[ubuntu:04645] Signal code: Address not mapped (1)
[ubuntu:04645] Failing at address: 0x7548d0c
[ubuntu:04645] [ 0] [0x7a3410]
[ubuntu:04645] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0x1d5b00]
[ubuntu:04645] [ 2] ./exmpi_2(main+0x78e) [0x80492c2]
[ubuntu:04645] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x190bd6]
[ubuntu:04645] [ 4] ./exmpi_2() [0x8048aa1]
[ubuntu:04645] *** End of error message ***
[ubuntu:04647] *** Process received signal ***
[ubuntu:04647] Signal: Segmentation fault (11)
[ubuntu:04647] Signal code: Address not mapped (1)
[ubuntu:04647] Failing at address: 0x7548d0c
[ubuntu:04647] [ 0] [0xf54410]
[ubuntu:04647] [ 1] /lib/tls/i686/cmov/libc.so.6(fclose+0x1a0) [0x2bab00]
[ubuntu:04647] [ 2] ./exmpi_2(main+0x78e) [0x80492c2]
[ubuntu:04647] [ 3] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6) [0x275bd6]
[ubuntu:04647] [ 4] ./exmpi_2() [0x8048aa1]
[ubuntu:04647] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 4639 on node ubuntu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
6 total processes killed (some possibly by mpirun during cleanup)
==4632==
==4632== HEAP SUMMARY:
==4632==     in use at exit: 158,751 bytes in 1,635 blocks
==4632==   total heap usage: 10,443 allocs, 8,808 frees, 15,854,537 bytes allocated
==4632==
==4632== LEAK SUMMARY:
==4632==    definitely lost: 81,655 bytes in 112 blocks
==4632==    indirectly lost: 5,108 bytes in 91 blocks
==4632==      possibly lost: 1,043 bytes in 17 blocks
==4632==    still reachable: 70,945 bytes in 1,415 blocks 
==4632==         suppressed: 0 bytes in 0 blocks
==4632== Rerun with --leak-check=full to see details of leaked memory
==4632==
==4632== For counts of detected and suppressed errors, rerun with: -v
==4632== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 96 from 9)

Can someone help me with this? Here is my source code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "mpi.h"
#include <syscall.h>

#define SIZE_X 640
#define SIZE_Y 480

int main(int argc, char **argv)
{
    FILE *FR, *FW;
    int ierr;
    int rank, size;
    int ncells;
    int greys[SIZE_X][SIZE_Y];
    int rows, cols, maxval;

    int mystart, myend, myncells;
    const int IONODE = 0;
    int *disps, *counts, *mydata;
    int *data;
    int i, j, temp1;
    char dummy[50] = "";

    ierr = MPI_Init(&argc, &argv);
    if (argc != 3) {
        fprintf(stderr, "Usage: %s infile outfile\n", argv[0]);
        fprintf(stderr, "outputs the negative of the input file.\n");
        return -1;
    }

    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    ierr = MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (ierr) {
        fprintf(stderr, "Catastrophic MPI problem; exiting\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    if (rank == IONODE) {
        //if (read_pgm(argv[1], &greys, &rows, &cols, &maxval)) {
        //    fprintf(stderr,"Could not read file; exiting\n");
        //    MPI_Abort(MPI_COMM_WORLD,2);

        rows = SIZE_X;
        cols = SIZE_Y;
        maxval = 255;
        FR = fopen(argv[1], "r+");

        fgets(dummy, 50, FR);
        do { fgets(dummy, 50, FR); } while (dummy[0] == '#');
        fgets(dummy, 50, FR);

        for (j = 0; j < cols; j++) {
            for (i = 0; i < rows; i++) {
                fscanf(FR, "%d", &temp1);
                greys[i][j] = temp1;
            }
        }
    }

    ncells = rows * cols;
    disps  = (int *)malloc(size * sizeof(int));
    counts = (int *)malloc(size * sizeof(int));
    data = &(greys[0][0]); /* we know all the data is contiguous */

    /* everyone calculate their number of cells */
    ierr = MPI_Bcast(&ncells, 1, MPI_INT, IONODE, MPI_COMM_WORLD);
    myncells = ncells / size;
    mystart = rank * myncells;
    myend   = mystart + myncells - 1;
    if (rank == size - 1) myend = ncells - 1;
    myncells = (myend - mystart) + 1;
    mydata = (int *)malloc(myncells * sizeof(int));

    /* assemble the list of counts.  Might not be equal if don't divide evenly. */
    ierr = MPI_Gather(&myncells, 1, MPI_INT, counts, 1, MPI_INT, IONODE, MPI_COMM_WORLD);
    if (rank == IONODE) {
        disps[0] = 0;
        for (i = 1; i < size; i++) {
            disps[i] = disps[i-1] + counts[i-1];
        }
    }

    /* scatter the data */
    ierr = MPI_Scatterv(data, counts, disps, MPI_INT, mydata, myncells, MPI_INT, IONODE, MPI_COMM_WORLD);

    /* everyone has to know maxval */
    ierr = MPI_Bcast(&maxval, 1, MPI_INT, IONODE, MPI_COMM_WORLD);

    for (i = 0; i < myncells; i++)
        mydata[i] = maxval - mydata[i];

    /* Gather the data */
    ierr = MPI_Gatherv(mydata, myncells, MPI_INT, data, counts, disps, MPI_INT, IONODE, MPI_COMM_WORLD);

    if (rank == IONODE) {
        //  write_pgm(argv[2], greys, rows, cols, maxval);
        FW = fopen(argv[2], "w");
        fprintf(FW, "P2\n%d %d\n255\n", rows, cols);
        for (j = 0; j < cols; j++)
            for (i = 0; i < rows; i++)
                fprintf(FW, "%d ", greys[i][j]);
    }

    free(mydata);
    if (rank == IONODE) {
        free(counts);
        free(disps);
        //free(&(greys[0][0]));
        //free(greys);
    }
    fclose(FR);
    fclose(FW);
    MPI_Finalize();
    return 0;
}

Here is the input image: http://orion.math.iastate.edu/burkardt/data/pgm/balloons.pgm


Congratulations; the code runs almost perfectly, and it dies almost at the very last lines of the code.

Valgrind would make the problem clearer, but running valgrind with MPI, or with anything that involves a program launcher, is a bit trickier. Instead of:

valgrind mpirun -np 10 ./exmpi_2 balloons.pgm output.pgm

which runs valgrind on mpirun itself, which isn't really what you care about, you want to do

mpirun -np 10 valgrind ./exmpi_2 balloons.pgm output.pgm

-- that is, you want to launch 10 valgrinds, each running one process of exmpi_2. If you do that (and you compiled with -g), you'll find valgrind output near the end like this:

==6303==  Access not within mapped region at address 0x1
==6303==    at 0x387FA60C17: fclose@@GLIBC_2.2.5 (in /lib64/libc-2.5.so)
==6303==    by 0x401222: main (pgm.c:124)

.. and that's the whole thing; you have all the processes executing the fclose()s, when only one process (the one that fopen()ed the files in the first place) has a handle. Simply replacing

fclose(FR);
fclose(FW);

with

if (rank == IONODE) {
    fclose(FR);
    fclose(FW);
}

seems to work for me.
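To make the rule explicit: every fclose() should sit behind the same rank test as its matching fopen(). Below is a minimal, self-contained sketch of that pattern; it is my own illustration rather than the original program, the file name is hypothetical, and the NULL initialization plus the fopen() error check are added precautions the question's code doesn't have.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    const int IONODE = 0;
    int rank;
    FILE *fw = NULL;   /* stays NULL on every rank that never opens the file */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == IONODE) {
        fw = fopen("output.txt", "w");   /* hypothetical file name */
        if (fw == NULL) {                /* check fopen before using the handle */
            fprintf(stderr, "Could not open output file\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        fprintf(fw, "written by rank %d only\n", rank);
    }

    /* ... parallel work on all ranks would go here ... */

    if (rank == IONODE)
        fclose(fw);   /* guarded by the same rank test as the fopen above */

    MPI_Finalize();
    return 0;
}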
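One further aside on the corrected valgrind invocation above: with 10 processes, the reports from the individual valgrind instances interleave on the terminal. As far as I know, valgrind's --log-file option accepts a %p placeholder that expands to each process's PID, so something like the following should produce one report per rank (treat the exact flag spelling as an assumption to verify against your valgrind version):

mpirun -np 10 valgrind --log-file=valgrind.%p.log ./exmpi_2 balloons.pgm output.pgm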

