On Sep 25, 2009, at 7:10 AM, Mallikarjuna Shastry wrote:

dear sir

i am sending the details as follows


1. i am using openmpi-1.3.3 and blcr 0.8.2
2. i have installed blcr 0.8.2 first under /root/MS
3. then i installed openmpi 1.3.3 under /root/MS
4. i have configured and installed open mpi as follows

# ./configure --with-ft=cr --enable-mpi-threads \
    --with-blcr=/usr/local/bin \
    --with-blcr-libdir=/usr/local/lib

If you want to enable the C/R thread, then you need to specify it explicitly. Try adding '--enable-ft-thread' to your Open MPI configure line in addition to '--enable-mpi-threads'. The C/R thread should help with your problem below.

Also, it looks like you are specifying the wrong BLCR path. Above you said that BLCR was installed in '/root/MS', but you are passing '/usr/local/bin' and '/usr/local/lib' to configure.
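As a rough sketch of the two points above, and assuming BLCR really was installed with '/root/MS' as its prefix and that its libraries landed in a 'lib' subdirectory (both are assumptions; check where the files actually are), the configure line might look like this:

```shell
# Assumption: BLCR's install prefix is /root/MS, as stated in the message.
BLCR_DIR=/root/MS

# Same options as the original command, plus --enable-ft-thread,
# with the BLCR paths pointing at the assumed install location.
./configure --with-ft=cr \
    --enable-mpi-threads \
    --enable-ft-thread \
    --with-blcr="$BLCR_DIR" \
    --with-blcr-libdir="$BLCR_DIR/lib"
```

If BLCR actually went into '/usr/local', then BLCR_DIR should be '/usr/local' rather than '/usr/local/bin'.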

Have you confirmed that you can successfully checkpoint/restart a non-MPI program on this system with BLCR?
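One way to check this might look like the sketch below. It assumes the BLCR kernel modules are loaded, that the BLCR tools (cr_run, cr_checkpoint, cr_restart) are on your PATH, and that './counter' is some small dynamically linked non-MPI program of your own; that name is only a placeholder.

```shell
# Start the (hypothetical) test program under BLCR control.
cr_run ./counter &
pid=$!

# Let it run for a moment, then checkpoint it.
# By default the context file is expected to be named context.<pid>
# in the current directory.
sleep 5
cr_checkpoint "$pid"

# Stop the original process and restart it from the context file.
kill "$pid"
cr_restart "context.$pid"
```

If this does not work for a plain program, the problem is in the BLCR installation rather than in Open MPI.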


# make
# make install

then i added the following to the .bash_profile under the home directory (i went to the home directory by doing cd ~)

 /sbin/insmod /usr/local/lib/blcr/2.6.23.1-42.fc8/blcr_imports.ko
 /sbin/insmod /usr/local/lib/blcr/2.6.23.1-42.fc8/blcr.ko

Instead of putting these /sbin/insmod commands in your .bash_profile, the modules should probably be set up to load automatically at boot time. BLCR's Admin Guide discusses how you can set this up (see section 2.5):
  https://upc-bugs.lbl.gov//blcr/doc/html/BLCR_Admin_Guide.html

 PATH=$PATH:/usr/local/bin
 MANPATH=$MANPATH:/usr/local/man
 LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib

Again, if you installed Open MPI and BLCR in /root/MS, then you need to add that installation path to your environment (e.g., PATH, LD_LIBRARY_PATH, MANPATH).
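A minimal sketch of that environment setup, assuming both packages share the prefix '/root/MS' and use the usual 'bin' and 'lib' subdirectories (the man page directory in particular is a guess, so adjust it to wherever the pages were installed):

```shell
# Assumption: Open MPI and BLCR were both installed under this prefix.
PREFIX=/root/MS

# Put the install first so it is found before any copy in /usr/local.
export PATH="$PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Assumed man page location; a trailing colon keeps the system defaults.
export MANPATH="$PREFIX/share/man:${MANPATH:-}"
```

Afterwards, 'which mpirun' and 'which cr_run' should both report paths under the prefix you expect.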


then i compiled and run the file arr_add.c as follows

[root@localhost examples]# mpicc -o res arr_add.c
[root@localhost examples]# mpirun -np 2 -am ft-enable-cr ./res

You really should never run Open MPI as root. Neither Open MPI nor BLCR requires root privileges to use.


2  2  2  2 2 2 2 2 2 2
2  2  2  2 2 2 2 2 2 2
2  2  2  2 2 2 2 2 2 2
--------------------------------------------------------------------------
Error: The process with PID 5790 is not checkpointable.
       This could be due to one of the following:
        - An application with this PID doesn't currently exist
        - The application with this PID isn't checkpointable
        - The application with this PID isn't an OPAL application.
       We were looking for the named files:
         /tmp/opal_cr_prog_write.5790
         /tmp/opal_cr_prog_read.5790
--------------------------------------------------------------------------
[localhost.localdomain:05788] local) Error: Unable to initiate the handshake with peer [[7788,1],1]. -1
[localhost.localdomain:05788] [[7788,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 567
[localhost.localdomain:05788] [[7788,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 1054


2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2


I suspect that this is related to your application. Have you tried to checkpoint/restart a simple example program, something with a core loop like the one below? (Note that the MPI_Barrier is necessary if you are not using the C/R thread, since we need to call into the Open MPI library to check for a checkpoint.)
---------
for(i = 0; i < 100; i++) {
  MPI_Barrier(MPI_COMM_WORLD);
  printf("Counting %d\n", i);
  sleep(1);
}
----------

Per my other message to you on the list:
  http://www.open-mpi.org/community/lists/users/2009/09/10741.php

--------------------
Is your application using SIGUSR1?

This error message indicates that Open MPI's daemons could not communicate with the application processes. The daemons send SIGUSR1 to each process to initiate the handshake (you can change this signal with -mca opal_cr_signal). If your application does not respond to the daemon within a time bound (20 seconds by default, though you can change it with -mca snapc_full_max_wait_time), then this error is printed and the checkpoint is aborted.
--------------------
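For illustration, raising the handshake time bound on the original run command might look like the sketch below. The value 60 is an arbitrary example, not a recommendation. The signal can be changed in the same way with -mca opal_cr_signal, but the value format it expects is not shown in this thread, so it is left out here.

```shell
# Same run command as in the original message, with a longer handshake
# time bound. 60 (seconds) is only an illustrative value.
mpirun -np 2 -am ft-enable-cr \
    -mca snapc_full_max_wait_time 60 \
    ./res
```

A longer time bound only helps if the application eventually enters the Open MPI library, so it does not replace the MPI_Barrier or C/R thread suggestion above.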

-- Josh




NOTE: the PID of mpirun is 5788

i gave the following command for taking the checkpoint

[root@localhost examples]# ompi-checkpoint -s 5788

i got the following output , but it was hanging like this

[localhost.localdomain:05796] Requested - Global Snapshot Reference: (null)
[localhost.localdomain:05796]   Pending - Global Snapshot Reference: (null)
[localhost.localdomain:05796]   Running - Global Snapshot Reference: (null)



kindly rectify it.

with regards

mallikarjuna shastry







_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
