I hope I'm not too late in my reply, and I hope I'm not repeating the
same solution others have given you.

I had a similar error in my code a few months ago. The error was this: I
was doing an MPI_Pack/MPI_Unpack to send data between nodes, and the
problem was that I was allocating space for a buffer using the wrong
variable, so there was a buffer size mismatch between the sending and
receiving nodes.

When the program was run as a single instance, these buffers weren't really
being used, so the problem never presented itself. Trickier still, in
parallel the problem only occurred when the payload exceeded a certain size
(number of elements in the array, or amount of data in the packed buffer).
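
Schematically, the mistake was something like the sketch below (hypothetical
names and sizes, not my actual code): the sender packs n_send elements, but
the receiver allocates its buffer from an unrelated variable, so a large
enough message silently overruns the heap and the damage only shows up much
later, e.g. in free() or the destructors at exit:

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, true_size = 0, alloc_size = 0, position = 0;
        int n_send  = 1000;  /* elements actually packed by the sender */
        int n_alloc = 100;   /* BUG: receiver sizes its buffer from this unrelated variable */
        double data[1000] = {0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Sender: size, pack, and send n_send doubles. */
            MPI_Pack_size(n_send, MPI_DOUBLE, MPI_COMM_WORLD, &true_size);
            char *sendbuf = (char *) malloc(true_size);
            MPI_Pack(data, n_send, MPI_DOUBLE, sendbuf, true_size, &position, MPI_COMM_WORLD);
            MPI_Send(sendbuf, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
            free(sendbuf);
        } else if (rank == 1) {
            /* Receiver: the receive count is correct, but the buffer was
             * allocated from the wrong variable, so the incoming message
             * overruns the heap.  Nothing crashes here; the corruption only
             * surfaces later, when free()/destructors walk the heap at exit. */
            MPI_Pack_size(n_send,  MPI_DOUBLE, MPI_COMM_WORLD, &true_size);
            MPI_Pack_size(n_alloc, MPI_DOUBLE, MPI_COMM_WORLD, &alloc_size);
            char *recvbuf = (char *) malloc(alloc_size);   /* too small */
            MPI_Recv(recvbuf, true_size, MPI_PACKED, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Unpack(recvbuf, true_size, &position, data, n_send, MPI_DOUBLE, MPI_COMM_WORLD);
            free(recvbuf);
        }

        MPI_Finalize();
        return 0;
    }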

I used valgrind, which didn't shed much light on the problem. I finally
found my error when I tracked down the data-size dependency.

I hope that helps.

Prentice


Jeff Squyres wrote:
> Ouch.  These are the worst kinds of bugs to find.  :-(
> 
> If you attach a debugger to these processes and step through the final death 
> throes of the process, does it provide any additional insight?  I have not 
> infrequently done stuff like this:
> 
>   {
>      int i = 0;
>      printf("Process %d ready to attach\n", getpid());
>      while (i == 0) sleep(5);
>   }
> 
> Then you get a message indicating which pid to attach to.  When you attach, 
> set the variable i to nonzero and you can continue stepping through the 
> process.
> 
> 
> 
> On May 14, 2010, at 10:44 AM, Paul-Michael Agapow wrote:
> 
>> Apologies for the vague details of the problem I'm about to describe,
>> but then I only understand it vaguely. Any pointers about the best
>> directions for further investigation would be appreciated. Lengthy
>> details follow:
>>
>> So I'm "MPI-izing" a pre-existing C++ program (not mine) and have run
>> into some weird behaviour. When run under mpiexec, a segmentation
>> fault is thrown:
>>
>> % mpiexec -n 2 ./omegamip
>> [...]
>> main.cpp:52: Finished.
>> Completed 20 of 20 in 0.0695 minutes
>> [queen:23560] *** Process received signal ***
>> [queen:23560] Signal: Segmentation fault (11)
>> [queen:23560] Signal code:  (128)
>> [queen:23560] Failing at address: (nil)
>> [queen:23560] [ 0] /lib64/libpthread.so.0 [0x3d6a00de80]
>> [queen:23560] [ 1] /opt/openmpi/lib/libopen-pal.so.0(_int_free+0x40)
>> [0x2afb1fa43460]
>> [queen:23560] [ 2] /opt/openmpi/lib/libopen-pal.so.0(free+0xbd) 
>> [0x2afb1fa439ad]
>> [queen:23560] [ 3] ./omegamip(_ZN12omegaMapBaseD2Ev+0x5b) [0x433c2b]
>> [queen:23560] [ 4] ./omegamip(main+0x18c) [0x415ccc]
>> [queen:23560] [ 5] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3d6941d8b4]
>> [queen:23560] [ 6] ./omegamip(__gxx_personality_v0+0x1e9) [0x40ee59]
>> [queen:23560] *** End of error message ***
>> mpiexec noticed that job rank 1 with PID 23560 on node
>> queen.bioinformatics exited on signal 11 (Segmentation fault).
>>
>> Right, so I've got a memory overrun or something. Except that when the
>> program is run in standalone mode, it works fine:
>>
>> % ./omegamip
>> [...]
>> main.cpp:52: Finished.
>> Completed 20 of 20 in 0.05970 minutes
>>
>> Right, so there's a difference between my standalone and MPI modes.
>> Except that the difference between my standalone and MPI versions is
>> currently nothing but the calls to MPI_Init, MPI_Finalize and some
>> exploratory calls to MPI_Comm_size and MPI_Comm_rank. (I haven't
>> gotten as far as coding the problem division.) Also, calling mpiexec
>> with 1 process always works:
>>
>> % mpiexec -n 1 ./omegamip
>> [...]
>> main.cpp:52: Finished.
>> Completed 20 of 20 in 0.05801 minutes
>>
>> So there's still this segmentation fault. Running valgrind across the
>> program doesn't show any obvious problems: there was some quirky
>> pointer arithmetic and some huge blocks of dangling memory, but these
>> were only leaked at the end of the program (i.e. the original
>> programmer didn't bother cleaning up at program termination). I've
>> caught most of those. But the segmentation fault still occurs only
>> when run under mpiexec with 2 or more processes. And by use of
>> diagnostic printfs and logging, I can see that it only occurs at the
>> very end of the program, the very end of main, possibly when
>> destructors are being automatically called. But again this cleanup
>> doesn't cause any problems with the standalone or 1 process modes.
>>
>> So, any ideas for where to start looking?
>>
>> technical details: gcc v4.1.2, C++, mpiexec (OpenRTE) 1.2.7, x86_64,
>> Red Hat 4.1.2-42
>>
>> ----
>> Paul-Michael Agapow (paul-michael.agapow (at) hpa.org.uk)
>> Bioinformatics, Centre for Infections, Health Protection Agency
>>
> 
> 

-- 
Prentice Bisbal
Linux Software Support Specialist/System Administrator
School of Natural Sciences
Institute for Advanced Study
Princeton, NJ
