Hi, thank you for your reply.
  
I have some more questions:
(1)
Currently, all nodes on my platform have the same PATH and LD_LIBRARY_PATH.
 I set them in .bashrc:
/--------------------------------------------------------------------------------/
#BLCR
export PATH=$PATH:/usr/local/BLCR/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/BLCR/lib
#openMPI
export PATH=$PATH:/root/kidd_openMPI/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/kidd_openMPI/lib

/-------------------------------------------------------------------------------------------/
But when I run mpirun, I have to add "-x LD_LIBRARY_PATH", or it will not run.
 Example:  mpirun -hostfile hosts -np 2 ./TEST .
 Error message ==>
./TEST: error while loading shared libraries: libcr.so.0: cannot open shared object file: No such file or directory
 (2)  Does BLCR require the same Linux kernel on all nodes?
       I have now reset all nodes (they use Ubuntu 10.04).

 (3)
      My program now uses DLLs. I implemented some DLLs, and the MPI program
      calls them.
      Can Open MPI checkpoint/restart a program that contains DLLs?
________________________________
From: Josh Hursey <jjhur...@open-mpi.org>
To: Open MPI Users <us...@open-mpi.org>
Date: 2012/4/23 (Mon) 10:51 PM
Subject: Re: [OMPI users] Ompi-restart failed and process migration
 
I wonder if the LD_LIBRARY_PATH is not being set properly upon
restart. In your mpirun you pass the '-x LD_LIBRARY_PATH'.
ompi-restart will not pass that variable along for you, so if you are
using that to set the BLCR path this might be your problem.

A couple of solutions:
- have the PATH and LD_LIBRARY_PATH set the same on all nodes
- have ompi-restart pass the -x parameter to the underlying mpirun by
using the -mpirun_opts command line switch:
   ompi-restart --mpirun_opts "-x LD_LIBRARY_PATH" ...
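
To check the first option, here is a minimal sketch that tests whether the BLCR library directory (/usr/local/BLCR/lib, from your configure line) is in LD_LIBRARY_PATH. The helper name path_contains is hypothetical; it is not part of Open MPI or BLCR.

```shell
#!/bin/sh
# path_contains is a hypothetical helper, not part of Open MPI or BLCR.
# It succeeds if directory $2 is an entry in the colon-separated list $1.
path_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Check the current shell for the BLCR library directory.
if path_contains "${LD_LIBRARY_PATH:-}" /usr/local/BLCR/lib; then
    echo "BLCR lib dir found in LD_LIBRARY_PATH"
else
    echo "BLCR lib dir missing from LD_LIBRARY_PATH"
fi
```

To see what a non-interactive remote shell gets, you could compare the output of something like ssh N07 'echo $LD_LIBRARY_PATH' with the same echo run locally. On Ubuntu the default ~/.bashrc typically returns early for non-interactive shells, so exports placed below that check may not reach processes started over ssh; that is one possible reason the variable has to be passed with -x.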

Yes. ompi-restart will let you checkpoint a process on one node and
restart it on another. You will have to restart the whole application
since the ompi-migration operation is not available in the 1.5 series.
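
As a rough sketch of restarting on a different set of nodes, the following writes a new hostfile and prints the restart command instead of running it. The host names node1 and node4 and the file name NewHosts are placeholders modelled on the example in your question; the snapshot name is the one from your error output. How the restarted processes are placed on the listed hosts is an assumption you should verify on your system.

```shell
#!/bin/sh
# Placeholder hostfile: one process on node1 and one on node4.
# Host names and the file name NewHosts are illustrative assumptions.
hostfile=NewHosts
cat > "$hostfile" <<EOF
node1 slots=1
node4 slots=1
EOF

# Build and print the restart command so it can be reviewed before use.
restart_cmd="ompi-restart -hostfile $hostfile ompi_global_snapshot_2892.ckpt"
echo "$restart_cmd"
```

If LD_LIBRARY_PATH still has to be forwarded, the --mpirun_opts switch shown above can be added to the printed command.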

-- Josh

On Sat, Apr 21, 2012 at 4:11 AM, kidd <q19860...@yahoo.com.tw> wrote:
> Hi all,
> I have some problems. I want to checkpoint/restart multiple processes on 2 nodes.
>
>  My environment:
>  BLCR = 0.8.4, Open MPI = 1.5.5, OS = Ubuntu 11.04
> I have 2 nodes:
>  N05 (master; it has the NFS shared file system) and N07 (slave;
>  it mounts the master node's share).
>
>  My configure command: ./configure --prefix=/root/kidd_openMPI
>  --with-ft=cr --enable-ft-thread  --with-blcr=/usr/local/BLCR
>  --with-blcr-libdir=/usr/local/BLCR/lib --enable-mpirun-prefix-by-default
>  --enable-static --enable-shared --enable-opal-multi-threads
>
>   I also set the following in ~/.openmpi/mca-params.conf:
>     crs_base_snapshot_dir=/root/kidd_openMPI/Tmp
>     snapc_base_global_snapshot_dir=/root/kidd_openMPI/checkpoints
>
> The directory kidd_openMPI is my NFS shared directory.
>
>  My commands:
>   1. mpicc -o TEST -DDEFSIZE=3000 -DDEF_PROC=2 -fopenmp MPIMatrix.c
>
>   2. mpirun -hostfile Hosts -am ft-enable-cr -x LD_LIBRARY_PATH
>      -np 2 ./TEST .
>
>   I can restart process 0 on the master, but restarting process 1 on N07 failed.
>
>   I checked my nodes; prelink is not installed,
>   so the restart failure must have another cause.
>
>   Error message -->
>  --------------------------------------------------------------------------
>   root@cuda05:~/kidd_openMPI/checkpoints#
>   ompi-restart -hostfile Hosts ompi_global_snapshot_2892.ckpt/
>  --------------------------------------------------------------------------
>     Error: BLCR was not able to restart the process because exec failed.
>      Check the installation of BLCR on all of the machines in your
>      system. The following information may be of help:
>   Return Code : -1
>   BLCR Restart Command : cr_restart
>   Restart Command Line : cr_restart
>  /root/kidd_openMPI/checkpoints/ompi_global_snapshot_2892.ckpt/0/
>  opal_snapshot_1.ckpt/ompi_blcr_context.2704
>  --------------------------------------------------------------------------
>  --------------------------------------------------------------------------
>  Error: Unable to obtain the proper restart command to restart from the
>         checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>         Check the installation of the blcr checkpoint/restart service
>         on all of the machines in your system.
>  ###########################################################################
>  Problem 2: I want MPI processes to be able to migrate to another node.
>          If ompi-restart across multiple nodes succeeds,
>          can I restart on a new node rather than the original node?
>          Example:
>          checkpoint (node1, node2, node3), then restart (node1, node3, node4),
>          or just restart (node1, node3 (2 processes)).
>
>    Please help me, thanks.
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
