Dear Sergio,

Thank you for your reply. I have inserted the modules into the kernel and everything worked fine, but there is still a weird issue. I use the command "mpirun -n 2 -am ft-enable-cr -H comp001 checkpoint-restart-test" to start an MPI job and then "ompi-checkpoint PID" to checkpoint it, but ompi-checkpoint does not respond. The exact sequence I am running is sketched just below, and mpirun's output follows after that.
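This is only a rough sketch of the sequence: the mpirun and ompi-checkpoint lines are exactly what I type, while the snapshot handle given to ompi-restart is assumed to be the name ompi-checkpoint normally reports (ompi_global_snapshot_<PID>.ckpt):

    # start the job with the checkpoint/restart framework enabled
    mpirun -n 2 -am ft-enable-cr -H comp001 checkpoint-restart-test

    # from another terminal, checkpoint the running job (PID of the mpirun process)
    ompi-checkpoint <PID>

    # restart from the snapshot that ompi-checkpoint reports;
    # the handle name below is assumed (ompi_global_snapshot_<PID>.ckpt is the usual default)
    ompi-restart ompi_global_snapshot_<PID>.ckpt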
This is the output mpirun produces while ompi-checkpoint hangs:

--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          comp001.local (PID 23514)
  MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
[login01.local:21425] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:warn-fork
[login01.local:21425] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Note that this happens whenever the -n option is greater than 1. With "-n 1", ompi-checkpoint succeeds and mpirun still prints the same warning, but ompi-restart then fails with:

[login01:21417] *** Process received signal ***
[login01:21417] Signal: Segmentation fault (11)
[login01:21417] Signal code: Address not mapped (1)
[login01:21417] Failing at address: (nil)
[login01:21417] [ 0] /lib64/libpthread.so.0 [0x32df20de70]
[login01:21417] [ 1] /home/mab/openmpi-1.3.3/lib/openmpi/mca_crs_blcr.so [0x2b093509dfee]
[login01:21417] [ 2] /home/mab/openmpi-1.3.3/lib/openmpi/mca_crs_blcr.so(opal_crs_blcr_restart+0xd9) [0x2b093509d251]
[login01:21417] [ 3] opal-restart [0x401c3e]
[login01:21417] [ 4] /lib64/libc.so.6(__libc_start_main+0xf4) [0x32dea1d8b4]
[login01:21417] [ 5] opal-restart [0x401399]
[login01:21417] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 21417 on node login01.local
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Any help with this will be appreciated.

Thanks in advance,
Mohamed Adel
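P.S. The fork() warning above says it can be silenced by setting the mpi_warn_on_fork MCA parameter to 0. I assume that would look like the line below (using mpirun's usual --mca syntax); I would not expect it to change the checkpoint behaviour, only hide the warning:

    mpirun --mca mpi_warn_on_fork 0 -n 2 -am ft-enable-cr -H comp001 checkpoint-restart-test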
________________________________________
From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] On Behalf Of Sergio Díaz [sd...@cesga.es]
Sent: Thursday, November 05, 2009 11:38 AM
To: Open MPI Users
Subject: Re: [OMPI users] Question about checkpoint/restart protocol

Hi,

Did you load the BLCR modules before compiling OpenMPI?

Regards,
Sergio

Mohamed Adel wrote:
> Dear OMPI users,
>
> I'm a new OpenMPI user. I've configured openmpi-1.3.3 with the options
> "./configure --prefix=/home/mab/openmpi-1.3.3 --with-sge --enable-ft-thread
> --with-ft=cr --enable-mpi-threads --enable-static --disable-shared
> --with-blcr=/home/mab/blcr-0.8.2/" and then compiled and installed it
> successfully.
> Now I'm trying to use the checkpoint/restart protocol. I run a program with
> the options "mpirun -n 2 -am ft-enable-cr -H localhost
> prime/checkpoint-restart-test" but I receive the following error:
>
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [madel:28896] Abort before MPI_INIT completed successfully; not able to
> guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> It looks like opal_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during opal_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   opal_cr_init() failed failed
>   --> Returned value -1 instead of OPAL_SUCCESS
> --------------------------------------------------------------------------
> [madel:28896] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file
> runtime/orte_init.c at line 77
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_mpi_init: orte_init failed
>   --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
>
> I can't find the files mentioned in this post
> "http://www.open-mpi.org/community/lists/users/2009/09/10641.php"
> (mca_crs_blcr.so, mca_crs_blcr.la). Could you please help me with that error?
>
> Thanks in advance,
> Mohamed Adel
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Sergio Díaz Montes
Centro de Supercomputacion de Galicia
Avda. de Vigo, s/n (Campus Sur)
15706 Santiago de Compostela (Spain)
Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
email: sd...@cesga.es ; http://www.cesga.es/
------------------------------------------------

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users