On Mar 2, 2010, at 9:17 AM, Fernando Lemos wrote:

> On Sun, Feb 28, 2010 at 11:11 PM, Fernando Lemos <fernando...@gmail.com> wrote:
>> Hello,
>>
>> I'm trying to come up with a fault-tolerant OpenMPI setup for research
>> purposes. I'm doing some tests now, but I'm stuck with a segfault when
>> I try to restart my test program from a checkpoint.
>>
>> My test program is the "ring" program, where messages are sent to the
>> next node in the ring N times. It's pretty simple; I can supply the
>> source code if needed. I'm running it like this:
>>
>> # mpirun -np 4 -am ft-enable-cr ring
>> ...
>>>>> Process 1 sending 703 to 2
>>>>> Process 3 received 704
>>>>> Process 3 sending 704 to 0
>>>>> Process 3 received 703
>>>>> Process 3 sending 703 to 0
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 18358 on node debian1
>> exited on signal 0 (Unknown signal 0).
>> --------------------------------------------------------------------------
>> 4 total processes killed (some possibly by mpirun during cleanup)
>>
>> That's the output when I ompi-checkpoint the mpirun PID from another
>> terminal.
>>
>> The checkpoint is taken just fine in maybe 1.5 seconds. I can see that
>> the checkpoint directory has been created in $HOME.
>>
>> This is what I get when I try to run ompi-restart:
>>
>> root@debian1:~# ps ax | grep mpirun
>> 18357 pts/0    R+     0:01 mpirun -np 4 -am ft-enable-cr ring
>> 18378 pts/5    S+     0:00 grep mpirun
>> root@debian1:~# ompi-checkpoint 18357
>> Snapshot Ref.: 0 ompi_global_snapshot_18357.ckpt
>> root@debian1:~# ompi-checkpoint --term 18357
>> Snapshot Ref.: 1 ompi_global_snapshot_18357.ckpt
>> root@debian1:~# ompi-restart ompi_global_snapshot_18357.ckpt
>> --------------------------------------------------------------------------
>> Error: Unable to obtain the proper restart command to restart from the
>> checkpoint file (opal_snapshot_2.ckpt). Returned -1.
>>
>> --------------------------------------------------------------------------
>> [debian1:18384] *** Process received signal ***
>> [debian1:18384] Signal: Segmentation fault (11)
>> [debian1:18384] Signal code: Address not mapped (1)
>> [debian1:18384] Failing at address: 0x725f725f
>> [debian1:18384] [ 0] [0xb775f40c]
>> [debian1:18384] [ 1] /usr/local/lib/libopen-pal.so.0(opal_argv_free+0x33) [0xb771ea63]
>> [debian1:18384] [ 2] /usr/local/lib/libopen-pal.so.0(opal_event_fini+0x30) [0xb77150a0]
>> [debian1:18384] [ 3] /usr/local/lib/libopen-pal.so.0(opal_finalize+0x35) [0xb7708fa5]
>> [debian1:18384] [ 4] opal-restart [0x804908e]
>> [debian1:18384] [ 5] /lib/i686/cmov/libc.so.6(__libc_start_main+0xe5) [0xb7568b55]
>> [debian1:18384] [ 6] opal-restart [0x8048fc1]
>> [debian1:18384] *** End of error message ***
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 2 with PID 18384 on node debian1
>> exited on signal 11 (Segmentat
>> --------------------------------------------------------------------------
>>
>> I used a clean install of Debian Squeeze (testing) to make sure my
>> environment was ok. Those are the steps I took:
>>
>> - Installed Debian Squeeze, only base packages
>> - Installed build-essential, libcr0, libcr-dev, blcr-dkms (build
>>   tools, BLCR dev and run-time environment)
>> - Compiled openmpi-1.4.1
>>
>> Note that I did compile openmpi-1.4.1 because the Debian package
>> (openmpi-checkpoint) doesn't seem to be usable at the moment. There
>> are no leftovers from any previous install of Debian packages
>> supplying OpenMPI, because this is a fresh install; no openmpi package
>> had been installed before.
>>
>> I used the following configure options:
>>
>> # ./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads
>>
>> I also tried adding the option --with-memory-manager=none because I
>> saw an e-mail on the mailing list that described this as a possible
>> solution to an (apparently) unrelated problem, but the problem
>> remains the same.
>>
>> I don't have config.log (I rm'ed the build dir), but if you think it's
>> necessary, I can recompile OpenMPI and provide it.
>>
>> Some information about the system (a single-processor VirtualBox
>> virtual machine, btw):
>>
>> Kernel version 2.6.32-trunk-686
>>
>> root@debian1:~# lsmod | grep blcr
>> blcr                   79084  0
>> blcr_imports            2077  1 blcr
>>
>> libcr (BLCR) is version 0.8.2-9.
>>
>> gcc is version 4.4.3.
>>
>> Please let me know of any other information you might need.
>>
>> Thanks in advance,
>
> Hello,
>
> I figured it out. The problem was that the Debian package blcr-utils,
> which contains the BLCR binaries (cr_restart, cr_checkpoint, etc.),
> wasn't installed. I believe OpenMPI could perhaps show a more
> descriptive message instead of segfaulting, though? Also, you might
> want to add that information to the FAQ.
>
> Anyway, I'm filing another Debian bug report.
>
> For the sake of completeness, here's some more information:
>
> - I forgot to mention that I installed OpenMPI to /usr/local, so I'm
>   setting LD_LIBRARY_PATH to /usr/lib:/usr/local/lib in .bashrc, and
>   thus I can run any OpenMPI command without problems.
>
> - I tested BLCR with cr_checkpoint and cr_restart on a simple app,
>   and it worked great too.
>
> - I've purged /usr/local and rebuilt OpenMPI with the mentioned flags
>   to obtain the attached config.log (gzipped).
>
> - With blcr-utils installed, I can ompi-restart just fine. Without it
>   installed, I get the segfault mentioned in my previous message.
Yes, ompi-restart should be printing a helpful message and exiting normally. Thanks for the bug report.

I believe I have seen and fixed this on a development branch making its way to the trunk. I'll make sure to move the fix to the 1.4 series once it has been applied to the trunk. I filed a ticket on this if you want to track the issue:

https://svn.open-mpi.org/trac/ompi/ticket/2329

Thanks again,
Josh

>
> Best regards,
> <config.log.gz>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users