On Nov 5, 2009, at 4:46 AM, Mohamed Adel wrote:
Dear Sergio,
Thank you for your reply. I've inserted the modules into the kernel and it all worked fine. But there is still a weird issue. I use the command "mpirun -n 2 -am ft-enable-cr -H comp001 checkpoint-restart-test" to start an MPI job. I then use "ompi-checkpoint PID" to checkpoint the job, but ompi-checkpoint doesn't respond, and mpirun produces the following:
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.
The process that invoked fork was:
Local host: comp001.local (PID 23514)
MPI_COMM_WORLD rank: 0
If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
[login01.local:21425] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:warn-fork
[login01.local:21425] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
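In case it is relevant, I understand both MCA parameters mentioned above can be set on the mpirun command line; this is just my same command with the two parameters added:

  mpirun -n 2 -am ft-enable-cr --mca mpi_warn_on_fork 0 \
         --mca orte_base_help_aggregate 0 -H comp001 checkpoint-restart-test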
Note: this error occurs whenever the -n option has a value greater than 1. If -n is 1, ompi-checkpoint succeeds and mpirun produces the same message, but ompi-restart fails with the following message:
[login01:21417] *** Process received signal ***
[login01:21417] Signal: Segmentation fault (11)
[login01:21417] Signal code: Address not mapped (1)
[login01:21417] Failing at address: (nil)
[login01:21417] [ 0] /lib64/libpthread.so.0 [0x32df20de70]
[login01:21417] [ 1] /home/mab/openmpi-1.3.3/lib/openmpi/mca_crs_blcr.so [0x2b093509dfee]
[login01:21417] [ 2] /home/mab/openmpi-1.3.3/lib/openmpi/mca_crs_blcr.so(opal_crs_blcr_restart+0xd9) [0x2b093509d251]
[login01:21417] [ 3] opal-restart [0x401c3e]
[login01:21417] [ 4] /lib64/libc.so.6(__libc_start_main+0xf4) [0x32dea1d8b4]
[login01:21417] [ 5] opal-restart [0x401399]
[login01:21417] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 21417 on node
login01.local exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
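For completeness, the full sequence I am following is the one below; the snapshot name is what I understand the default to be (ompi_global_snapshot_<PID>.ckpt):

  mpirun -n 2 -am ft-enable-cr -H comp001 checkpoint-restart-test &
  ompi-checkpoint <PID of mpirun>
  ompi-restart ompi_global_snapshot_<PID of mpirun>.ckpt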
Any help would be appreciated.
I have not seen this behavior before. The first error is Open MPI warning you that one of your MPI processes is trying to use fork(), so you may want to make sure that your application is not using any system() or fork() calls. Open MPI itself should not be using any of these functions from within the MPI library linked to the application.
When you reloaded the BLCR module, did you rebuild Open MPI and
install it in a clean directory (not over the top of the old directory)?
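If not, a clean rebuild would look roughly like this (the new prefix name here is just an example; reuse your original configure flags):

  cd openmpi-1.3.3
  make distclean
  ./configure --prefix=/home/mab/openmpi-1.3.3-clean [your other configure flags]
  make all install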
Have you tried to checkpoint/restart a non-MPI process with BLCR on your system? This will help rule out installation problems with BLCR.
I suspect that Open MPI is not building correctly, or something in your build environment is confusing/corrupting the build. Can you send me your config.log? It may help me pinpoint the problem if it is build-related.
-- Josh
Thanks in advance,
Mohamed Adel
________________________________________
From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] On
Behalf Of Sergio Díaz [sd...@cesga.es]
Sent: Thursday, November 05, 2009 11:38 AM
To: Open MPI Users
Subject: Re: [OMPI users] Question about checkpoint/restart protocol
Hi,
Did you load the BLCR modules before compiling OpenMPI?
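You can check whether they are loaded with:

  lsmod | grep blcr

and, if not, load them with something like the following (the exact paths depend on where BLCR installed its kernel modules):

  /sbin/insmod <blcr-module-dir>/blcr_imports.ko
  /sbin/insmod <blcr-module-dir>/blcr.ko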
Regards,
Sergio
Mohamed Adel wrote:
Dear OMPI users,
I'm a new OpenMPI user. I've configured openmpi-1.3.3 with the options "./configure --prefix=/home/mab/openmpi-1.3.3 --with-sge --enable-ft-thread --with-ft=cr --enable-mpi-threads --enable-static --disable-shared --with-blcr=/home/mab/blcr-0.8.2/" and then compiled and installed it successfully.
Now I'm trying to use the checkpoint/restart protocol. I run a program with the command "mpirun -n 2 -am ft-enable-cr -H localhost prime/checkpoint-restart-test", but I receive the following error:
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[madel:28896] Abort before MPI_INIT completed successfully; not
able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
opal_cr_init() failed failed
--> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
[madel:28896] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 77
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
ompi_mpi_init: orte_init failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
I can't find the files mentioned in this post: http://www.open-mpi.org/community/lists/users/2009/09/10641.php (mca_crs_blcr.so, mca_crs_blcr.la). Could you please help me with this error?
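To check whether the blcr component was built at all, I believe one can look in the install tree and at ompi_info:

  ls /home/mab/openmpi-1.3.3/lib/openmpi/mca_crs_blcr*
  ompi_info | grep crs

(If I understand the build options correctly, with --enable-static --disable-shared the components are linked into the main library instead of being built as .so files, so ompi_info may be the more reliable check.)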
Thanks in advance
Mohamed Adel
--
Sergio Díaz Montes
Centro de Supercomputacion de Galicia
Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
email: sd...@cesga.es ; http://www.cesga.es/
------------------------------------------------
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users