On Sep 25, 2009, at 7:10 AM, Mallikarjuna Shastry wrote:
dear sir
i am sending the details as follows
1. i am using openmpi-1.3.3 and blcr 0.8.2
2. i have installed blcr 0.8.2 first under /root/MS
3. then i installed openmpi 1.3.3 under /root/MS
4. i have configured and installed open mpi as follows
# ./configure --with-ft=cr --enable-mpi-threads \
    --with-blcr=/usr/local/bin \
    --with-blcr-libdir=/usr/local/lib
If you want to enable the C/R thread then you need to specify it. Try
adding '--enable-ft-thread' to your Open MPI configure in addition to
'--enable-mpi-threads'. The C/R thread should help with your problem below.
Also, it looks like you are specifying the wrong BLCR path. Above you
said that it was installed in '/root/MS', but you are passing
'/usr/local/bin' and '/usr/local/lib'.
Have you confirmed that you can successfully checkpoint/restart a
non-MPI program on this system with BLCR?
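For the non-MPI test, something along these lines would be enough. This is only a minimal sketch in standard C plus POSIX sleep(); the helper name run_counter is made up for illustration and is not part of BLCR or Open MPI.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: print a counter, pausing delay_seconds between
 * iterations, so the process lives long enough to be checkpointed.
 * Returns the number of iterations completed. */
int run_counter(int iterations, unsigned int delay_seconds)
{
    int i;

    for (i = 0; i < iterations; i++) {
        printf("Counting %d\n", i);
        fflush(stdout);        /* show progress even when output is redirected */
        sleep(delay_seconds);  /* sleep(0) returns immediately */
    }
    return i;
}
```

Calling run_counter(100, 1) from main gives a process that runs for about 100 seconds. Started under BLCR's cr_run, it can be checkpointed with cr_checkpoint and brought back with cr_restart; if the counter does not resume after the restart, the problem is in the BLCR setup rather than in Open MPI.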
# make
# make install
then i added the following to the .bash_profile under home
directory (i went to home directory by doing cd ~)
/sbin/insmod /usr/local/lib/blcr/2.6.23.1-42.fc8/blcr_imports.ko
/sbin/insmod /usr/local/lib/blcr/2.6.23.1-42.fc8/blcr.ko
Instead of putting this in your .bash_profile, the /sbin/insmod commands
should probably be set up to load the modules automatically at boot time.
BLCR's Admin Guide discusses how you can set this up (see section 2.5):
https://upc-bugs.lbl.gov//blcr/doc/html/BLCR_Admin_Guide.html
PATH=$PATH:/usr/local/bin
MANPATH=$MANPATH:/usr/local/man
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
Again, if you installed Open MPI and BLCR in /root/MS, then you need to
add that installation path to your environment (e.g., PATH,
LD_LIBRARY_PATH, MANPATH).
then i compiled and ran the file arr_add.c as follows
[root@localhost examples]# mpicc -o res arr_add.c
[root@localhost examples]# mpirun -np 2 -am ft-enable-cr ./res
You really should never run Open MPI as root. Neither Open MPI nor
BLCR requires that you be root to use it.
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
--------------------------------------------------------------------------
Error: The process with PID 5790 is not checkpointable.
       This could be due to one of the following:
        - An application with this PID doesn't currently exist
        - The application with this PID isn't checkpointable
        - The application with this PID isn't an OPAL application.
       We were looking for the named files:
         /tmp/opal_cr_prog_write.5790
         /tmp/opal_cr_prog_read.5790
--------------------------------------------------------------------------
[localhost.localdomain:05788] local) Error: Unable to initiate the handshake with peer [[7788,1],1]. -1
[localhost.localdomain:05788] [[7788,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 567
[localhost.localdomain:05788] [[7788,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 1054
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
I suspect that this is related to your application. Have you tried to
checkpoint/restart a simple example program, something that has a core
loop like the one below? (Note that the MPI_Barrier is necessary if you
are not using the C/R thread, since we need to call into the Open MPI
library to check for a checkpoint.)
---------
for (i = 0; i < 100; i++) {
    MPI_Barrier(MPI_COMM_WORLD);
    printf("Counting %d\n", i);
    sleep(1);
}
----------
Per my other message to you on the list:
http://www.open-mpi.org/community/lists/users/2009/09/10741.php
--------------------
Is your application using SIGUSR1?
This error message indicates that Open MPI's daemons could not
communicate with the application processes. The daemons send SIGUSR1
to the process to initiate the handshake (you can change this signal
with -mca opal_cr_signal). If your application does not respond to the
daemon within a time bound (default 20 sec, though you can change it
with -mca snapc_full_max_wait_time) then this error is printed, and
the checkpoint is aborted.
--------------------
-- Josh
NOTE: the PID of mpirun is 5788
i gave the following command for taking the checkpoint
[root@localhost examples]# ompi-checkpoint -s 5788
i got the following output, but it was hanging like this
[localhost.localdomain:05796] Requested - Global Snapshot Reference: (null)
[localhost.localdomain:05796] Pending - Global Snapshot Reference: (null)
[localhost.localdomain:05796] Running - Global Snapshot Reference: (null)
kindly rectify it.
with regards
mallikarjuna shastry