I hope I'm not too late with this reply, and that I'm not repeating a solution others have already given you.
I had a similar error in some code a few months ago. If I recall, I was using MPI_Pack/MPI_Unpack to send data between nodes, but I was allocating the buffer using the wrong variable, so there was a buffer size mismatch between the sending and receiving nodes. When the program ran as a single instance, those buffers weren't really used, so the problem never showed up. Trickier still, the problem only occurred in parallel once the payload exceeded a certain size (number of elements in the array, or data in the packed buffer). I used valgrind, which didn't shed much light on the problem. I finally found my error by tracking down that data-size dependency (a minimal sketch of this kind of mismatch is appended at the end of this message).

I hope that helps.

Prentice

Jeff Squyres wrote:
> Ouch. These are the worst kinds of bugs to find. :-(
>
> If you attach a debugger to these processes and step through the final death
> throes of the process, does it provide any additional insight? I have not
> infrequently done stuff like this:
>
>    {
>        int i = 0;
>        printf("Process %d ready to attach\n", getpid());
>        while (i == 0) sleep(5);
>    }
>
> Then you get a message indicating which pid to attach to. When you attach,
> set the variable i to nonzero and you can continue stepping through the
> process.
>
>
> On May 14, 2010, at 10:44 AM, Paul-Michael Agapow wrote:
>
>> Apologies for the vague details of the problem I'm about to describe,
>> but then I only understand it vaguely. Any pointers about the best
>> directions for further investigation would be appreciated. Lengthy
>> details follow:
>>
>> So I'm "MPI-izing" a pre-existing C++ program (not mine) and have run
>> into some weird behaviour. When run under mpiexec, a segmentation
>> fault is thrown:
>>
>> % mpiexec -n 2 ./omegamip
>> [...]
>> main.cpp:52: Finished.
>> Completed 20 of 20 in 0.0695 minutes
>> [queen:23560] *** Process received signal ***
>> [queen:23560] Signal: Segmentation fault (11)
>> [queen:23560] Signal code: (128)
>> [queen:23560] Failing at address: (nil)
>> [queen:23560] [ 0] /lib64/libpthread.so.0 [0x3d6a00de80]
>> [queen:23560] [ 1] /opt/openmpi/lib/libopen-pal.so.0(_int_free+0x40) [0x2afb1fa43460]
>> [queen:23560] [ 2] /opt/openmpi/lib/libopen-pal.so.0(free+0xbd) [0x2afb1fa439ad]
>> [queen:23560] [ 3] ./omegamip(_ZN12omegaMapBaseD2Ev+0x5b) [0x433c2b]
>> [queen:23560] [ 4] ./omegamip(main+0x18c) [0x415ccc]
>> [queen:23560] [ 5] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3d6941d8b4]
>> [queen:23560] [ 6] ./omegamip(__gxx_personality_v0+0x1e9) [0x40ee59]
>> [queen:23560] *** End of error message ***
>> mpiexec noticed that job rank 1 with PID 23560 on node
>> queen.bioinformatics exited on signal 11 (Segmentation fault).
>>
>> Right, so I've got a memory overrun or something. Except that when the
>> program is run in standalone mode, it works fine:
>>
>> % ./omegamip
>> [...]
>> main.cpp:52: Finished.
>> Completed 20 of 20 in 0.05970 minutes
>>
>> Right, so there's a difference between my standalone and MPI modes.
>> Except that the difference between my standalone and MPI versions is
>> currently nothing but the calls to MPI_Init, MPI_Finalize and some
>> exploratory calls to MPI_Comm_size and MPI_Comm_rank. (I haven't
>> gotten as far as coding the problem division.) Also, calling mpiexec
>> with 1 process always works:
>>
>> % mpiexec -n 1 ./omegamip
>> [...]
>> main.cpp:52: Finished.
>> Completed 20 of 20 in 0.05801 minutes
>>
>> So there's still this segmentation fault.
>> Running valgrind across the program doesn't show any obvious problems:
>> there was some quirky pointer arithmetic and some huge blocks of
>> dangling memory, but these were only leaked at the end of the program
>> (i.e. the original programmer didn't bother cleaning up at program
>> termination). I've caught most of those. But the segmentation fault
>> still occurs only when run under mpiexec with 2 or more processes. And
>> by use of diagnostic printfs and logging, I can see that it only occurs
>> at the very end of the program, the very end of main, possibly when
>> destructors are being automatically called. But again this cleanup
>> doesn't cause any problems with the standalone or 1-process modes.
>>
>> So, any ideas for where to start looking?
>>
>> Technical details: gcc v4.1.2, C++, mpiexec (OpenRTE) 1.2.7, x86_64,
>> Red Hat 4.1.2-42
>>
>> ----
>> Paul-Michael Agapow (paul-michael.agapow (at) hpa.org.uk)
>> Bioinformatics, Centre for Infections, Health Protection Agency
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>

--
Prentice Bisbal
Linux Software Support Specialist/System Administrator
School of Natural Sciences
Institute for Advanced Study
Princeton, NJ
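
P.S. To make the Pack/Unpack point above concrete, here is a minimal sketch of
the kind of size mismatch I mean. It is only an illustration: the function, the
variable names (n_elems, n_other) and the wrong-variable bug are made up, not
the actual code I was debugging.

    /* Sketch: the unpack destination is allocated from the wrong count
     * variable, so MPI_Unpack silently writes past the end of it.  The heap
     * corruption goes unnoticed until free()/destructor time -- and only once
     * the payload is large enough -- which is why it can look like a crash
     * "at the very end of the program". */
    #include <mpi.h>
    #include <stdlib.h>

    void receive_block(int n_elems, int n_other)
    {
        int packed_size = 0, position = 0;
        MPI_Status status;

        /* The sender packed n_elems doubles; ask how big the message is. */
        MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_PACKED, &packed_size);

        char *packed = (char *) malloc(packed_size);
        MPI_Recv(packed, packed_size, MPI_PACKED, 0, 0, MPI_COMM_WORLD,
                 &status);

        /* BUG (illustrative): the destination is sized from the wrong
         * variable (n_other instead of n_elems), but the unpack still asks
         * for n_elems elements, so the write runs past the end of the
         * allocation whenever n_elems > n_other. */
        double *values = (double *) malloc(n_other * sizeof(double));
        MPI_Unpack(packed, packed_size, &position, values, n_elems,
                   MPI_DOUBLE, MPI_COMM_WORLD);

        /* ... use values ... */
        free(values);   /* glibc may abort or segfault here, or much later */
        free(packed);
    }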
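
And for anyone who wants to try the attach trick Jeff describes above, a
session looks roughly like this (23560 is just the PID from the trace above;
the exact frame number where i is in scope will vary):

    % gdb -p 23560               # attach to the pid the process printed
    (gdb) frame 2                # select the frame where i is in scope
    (gdb) set var i = 1          # break out of the sleep loop
    (gdb) continue               # let it run; gdb stops at the segfault
    (gdb) backtrace              # see exactly where it died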