On Mar 2, 2010, at 9:17 AM, Fernando Lemos wrote:

> On Sun, Feb 28, 2010 at 11:11 PM, Fernando Lemos <fernando...@gmail.com> 
> wrote:
>> Hello,
>> 
>> 
>> I'm trying to come up with a fault tolerant OpenMPI setup for research
>> purposes. I'm doing some tests now, but I'm stuck with a segfault when
>> I try to restart my test program from a checkpoint.
>> 
>> My test program is the "ring" program, where messages are sent to the
>> next node in the ring N times. It's pretty simple; I can supply the
>> source code if needed (a minimal sketch follows the output below).
>> I'm running it like this:
>> 
>> # mpirun -np 4 -am ft-enable-cr ring
>> ...
>>>>> Process 1 sending 703 to 2
>>>>> Process 3 received 704
>>>>> Process 3 sending 704 to 0
>>>>> Process 3 received 703
>>>>> Process 3 sending 703 to 0
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 18358 on node debian1
>> exited on signal 0 (Unknown signal 0).
>> --------------------------------------------------------------------------
>> 4 total processes killed (some possibly by mpirun during cleanup)
>> 
>> That's the output when I ompi-checkpoint the mpirun PID from another 
>> terminal.
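>>
>> For reference, the ring program is essentially the following (a minimal
>> sketch from memory, not my exact source; the iteration count and the
>> print format are approximate):
>>
>> #include <stdio.h>
>> #include <mpi.h>
>>
>> int main(int argc, char **argv)
>> {
>>     int rank, size, i, token;
>>     const int N = 1000;   /* number of messages to pass around the ring */
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>
>>     for (i = 0; i < N; i++) {
>>         if (rank == 0) {
>>             /* Rank 0 starts the message and waits for it to come back. */
>>             token = i;
>>             MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
>>             MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
>>                      MPI_STATUS_IGNORE);
>>             printf("Process %d received %d\n", rank, token);
>>         } else {
>>             /* Everyone else forwards the message to the next rank. */
>>             MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
>>                      MPI_STATUS_IGNORE);
>>             printf("Process %d received %d\n", rank, token);
>>             printf("Process %d sending %d to %d\n",
>>                    rank, token, (rank + 1) % size);
>>             MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
>>         }
>>     }
>>
>>     MPI_Finalize();
>>     return 0;
>> }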
>> 
>> The checkpoint is taken just fine in maybe 1.5 seconds. I can see the
>> checkpoint directory has been created in $HOME.
>> 
>> This is what I get when I try to run ompi-restart:
>> 
>> root@debian1:~# ps ax | grep mpirun
>> 18357 pts/0    R+     0:01 mpirun -np 4 -am ft-enable-cr ring
>> 18378 pts/5    S+     0:00 grep mpirun
>> root@debian1:~# ompi-checkpoint 18357
>> Snapshot Ref.:   0 ompi_global_snapshot_18357.ckpt
>> root@debian1:~# ompi-checkpoint --term 18357
>> Snapshot Ref.:   1 ompi_global_snapshot_18357.ckpt
>> root@debian1:~# ompi-restart ompi_global_snapshot_18357.ckpt
>> --------------------------------------------------------------------------
>> Error: Unable to obtain the proper restart command to restart from the
>>       checkpoint file (opal_snapshot_2.ckpt). Returned -1.
>> 
>> --------------------------------------------------------------------------
>> [debian1:18384] *** Process received signal ***
>> [debian1:18384] Signal: Segmentation fault (11)
>> [debian1:18384] Signal code: Address not mapped (1)
>> [debian1:18384] Failing at address: 0x725f725f
>> [debian1:18384] [ 0] [0xb775f40c]
>> [debian1:18384] [ 1]
>> /usr/local/lib/libopen-pal.so.0(opal_argv_free+0x33) [0xb771ea63]
>> [debian1:18384] [ 2]
>> /usr/local/lib/libopen-pal.so.0(opal_event_fini+0x30) [0xb77150a0]
>> [debian1:18384] [ 3]
>> /usr/local/lib/libopen-pal.so.0(opal_finalize+0x35) [0xb7708fa5]
>> [debian1:18384] [ 4] opal-restart [0x804908e]
>> [debian1:18384] [ 5] /lib/i686/cmov/libc.so.6(__libc_start_main+0xe5)
>> [0xb7568b55]
>> [debian1:18384] [ 6] opal-restart [0x8048fc1]
>> [debian1:18384] *** End of error message ***
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 2 with PID 18384 on node debian1
>> exited on signal 11 (Segmentat
>> --------------------------------------------------------------------------
>> 
>> I used a clean install of Debian Squeeze (testing) to make sure my
>> environment was OK. These are the steps I took:
>> 
>> - Installed Debian Squeeze, only base packages
>> - Installed build-essential, libcr0, libcr-dev, blcr-dkms (build
>> tools, BLCR dev and run-time environment)
>> - Compiled openmpi-1.4.1
>> 
>> Note that I compiled openmpi-1.4.1 myself because the Debian package
>> (openmpi-checkpoint) doesn't seem to be usable at the moment. There
>> are no leftovers from any previous install of Debian packages
>> supplying OpenMPI: this is a fresh install, and no openmpi package
>> had been installed before.
>> 
>> I used the following configure options:
>> 
>> # ./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads
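>>
>> For reference, I understand configure can also be pointed at a specific
>> BLCR installation if it isn't in a default prefix (the path below is
>> just a placeholder, not something I actually used):
>>
>> # ./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads \
>>       --with-blcr=/path/to/blcr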
>> 
>> I also tried adding the option --with-memory-manager=none, because I
>> saw an e-mail on the mailing list describing it as a possible
>> solution to an apparently unrelated problem, but the problem remained
>> the same.
>> 
>> I don't have config.log (I rm'ed the build dir), but if you think it's
>> necessary I can recompile OpenMPI and provide it.
>> 
>> Some information about the system (VirtualBox virtual machine, single
>> processor, btw):
>> 
>> Kernel version 2.6.32-trunk-686
>> 
>> root@debian1:~# lsmod | grep blcr
>> blcr                   79084  0
>> blcr_imports            2077  1 blcr
>> 
>> libcr (BLCR) is version 0.8.2-9.
>> 
>> gcc is version 4.4.3.
>> 
>> 
>> Please let me know of any other information you might need.
>> 
>> 
>> Thanks in advance,
>> 
> 
> Hello,
> 
> I figured it out. The problem is that the Debian package blcr-utils,
> which contains the BLCR binaries (cr_restart, cr_checkpoint, etc.),
> wasn't installed. I believe OpenMPI could perhaps show a more
> descriptive message instead of segfaulting, though? Also, you might
> want to add that information to the FAQ.
> 
> Anyway, I'm filing another Debian bug report.
> 
> For the sake of completeness, here's some more information:
> 
> - I forgot to mention that I installed OpenMPI to /usr/local, so I'm
> setting LD_LIBRARY_PATH to /usr/lib:/usr/local/lib in .bashrc, and
> thus I can run any OpenMPI command without problems.
> 
> - I tested BLCR with cr_checkpoint and cr_restart with a simple app,
> and it worked great too.
> 
> - I've purged /usr/local and rebuilt OpenMPI with the mentioned flags
> to obtain the attached config.log (gzipped).
> 
> - With blcr-utils installed, I can ompi-restart just fine. Without it
> installed, I get the segfault mentioned in my previous message.

Yes, ompi-restart should be printing a helpful message and exiting normally. 
Thanks for the bug report. I believe that I have seen and fixed this on a 
development branch making its way to the trunk. I'll make sure to move the fix 
to the 1.4 series once it has been applied to the trunk.
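
For the curious, the idea behind the fix is simply to verify that the BLCR
restart command can actually be found before handing off to it, and to print
a readable error otherwise. Purely as an illustration (this is not the actual
patch, and the helper below is made up), something along these lines:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Return 1 if 'cmd' is found as an executable in $PATH, 0 otherwise. */
static int command_in_path(const char *cmd)
{
    const char *path = getenv("PATH");
    char *copy, *dir;
    int found = 0;

    if (NULL == path) return 0;
    copy = strdup(path);
    for (dir = strtok(copy, ":"); NULL != dir; dir = strtok(NULL, ":")) {
        char full[4096];
        snprintf(full, sizeof(full), "%s/%s", dir, cmd);
        if (0 == access(full, X_OK)) { found = 1; break; }
    }
    free(copy);
    return found;
}

int main(void)
{
    if (!command_in_path("cr_restart")) {
        fprintf(stderr, "Error: cr_restart was not found in your PATH.\n"
                        "Please install the BLCR utilities and try again.\n");
        return 1;   /* exit cleanly instead of crashing later */
    }
    /* ... hand off to the real restart logic here ... */
    return 0;
}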

I filed a ticket on this in case you want to track the issue:
  https://svn.open-mpi.org/trac/ompi/ticket/2329

Thanks again,
Josh

> 
> 
> 
> Best regards,
> <config.log.gz>

