Hi,

Thank you for your reply. I have some problems:

(1) On my platform, all nodes now have the same PATH and LD_LIBRARY_PATH. I set them in .bashrc:
/--------------------------------------------------------------------------------/
#BLCR
export PATH=$PATH:/usr/local/BLCR/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/BLCR/lib
#openMPI
export PATH=$PATH:/root/kidd_openMPI/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/kidd_openMPI/lib
/-------------------------------------------------------------------------------------------/
However, when I run mpirun I still have to add "-x LD_LIBRARY_PATH", or the program will not run.
Example: mpirun -hostfile hosts -np 2 ./TEST
Error message ==>
./TEST: error while loading shared libraries: libcr.so.0: cannot open shared object file: No such file or directory

(2) Does BLCR require the same Linux kernel on all nodes? I have now set up all nodes again (using Ubuntu 10.04).

(3) My program now uses DLLs. I implemented some DLLs, and the MPI program calls them. Can Open MPI checkpoint/restart a program that uses DLLs?

________________________________
From: Josh Hursey <jjhur...@open-mpi.org>
To: Open MPI Users <us...@open-mpi.org>
Sent: 2012/4/23 (Mon) 10:51 PM
Subject: Re: [OMPI users] Ompi-restart failed and process migration

I wonder if the LD_LIBRARY_PATH is not being set properly upon restart. In your mpirun you pass the '-x LD_LIBRARY_PATH'. ompi-restart will not pass that variable along for you, so if you are using that to set the BLCR path this might be your problem.

A couple solutions:
 - have the PATH and LD_LIBRARY_PATH set the same on all nodes
 - have ompi-restart pass the -x parameter to the underlying mpirun by using the -mpirun_opts command line switch:
   ompi-restart --mpirun_opts "-x LD_LIBRARY_PATH" ...

Yes. ompi-restart will let you checkpoint a process on one node and restart it on another. You will have to restart the whole application since the ompi-migration operation is not available in the 1.5 series.

-- Josh

On Sat, Apr 21, 2012 at 4:11 AM, kidd <q19860...@yahoo.com.tw> wrote:
> Hi all,
> I have some problems. I want to checkpoint/restart multiple processes on 2 nodes.
>
> My environment:
> BLCR = 0.8.4, openMPI = 1.5.5, OS = ubuntu 11.04
> I have 2 nodes:
> N05 (master, it has the NFS shared file system), N07 (slave,
> mounts the master node).
>
> My configure command:
> ./configure --prefix=/root/kidd_openMPI
> --with-ft=cr --enable-ft-thread --with-blcr=/usr/local/BLCR
> --with-blcr-libdir=/usr/local/BLCR/lib --enable-mpirun-prefix-by-default
> --enable-static --enable-shared --enable-opal-multi-threads
>
> I also set the following in ~/.openmpi/mca-params.conf:
> crs_base_snapshot_dir=/root/kidd_openMPI/Tmp
> snapc_base_global_snapshot_dir=/root/kidd_openMPI/checkpoints
>
> The directory kidd_openMPI is my NFS shared directory.
>
> My commands:
> 1. mpicc -o TEST -DDEFSIZE=3000 -DDEF_PROC=2 -fopenmp MPIMatrix.c
>
> 2. mpirun -hostfile Hosts -am ft-enable-cr -x LD_LIBRARY_PATH
> -np 2 ./TEST
>
> I can restart process 0 on the master, but restarting process 1 on N07 fails.
>
> I checked my nodes; prelink is not installed,
> so the restart failure must be caused by something else.
>
> Error message -->
> --------------------------------------------------------------------------
> root@cuda05:~/kidd_openMPI/checkpoints#
> ompi-restart -hostfile Hosts ompi_global_snapshot_2892.ckpt/
> --------------------------------------------------------------------------
> Error: BLCR was not able to restart the process because exec failed.
> Check the installation of BLCR on all of the machines in your
> system. The following information may be of help:
> Return Code : -1
> BLCR Restart Command : cr_restart
> Restart Command Line : cr_restart
> /root/kidd_openMPI/checkpoints/ompi_global_snapshot_2892.ckpt/0/
> opal_snapshot_1.ckpt/ompi_blcr_context.2704
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Error: Unable to obtain the proper restart command to restart from the
> checkpoint file (opal_snapshot_1.ckpt). Returned -1.
> Check the installation of the blcr checkpoint/restart service
> on all of the machines in your system.
> ###########################################################################
> Problem 2: I want MPI processes to be able to migrate to another node.
> If ompi-restart across multiple nodes succeeds,
> can I restart on a new node rather than the original node?
> Example:
> checkpoint (node1, node2, node3), then restart (node1, node3, node4),
> or just restart (node1, node3 (2 processes)).
>
> Please help me, thanks.
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
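
One way to check the "libcr.so.0: cannot open shared object file" symptom from this thread is to test, on each node, whether any directory in LD_LIBRARY_PATH actually contains the library. The following is only a rough sketch under stated assumptions: find_lib is a made-up helper name (it is not an Open MPI or BLCR tool), and it only looks for a file by name rather than asking the dynamic loader.

```shell
#!/bin/sh
# find_lib NAME PATHLIST
# Print the first directory in the colon-separated PATHLIST that contains
# a file called NAME; return non-zero if none does.
# NOTE: find_lib is a hypothetical helper for illustration only.
find_lib() {
    lib_name=$1
    search_path=$2
    old_ifs=$IFS
    IFS=:                      # split PATHLIST on colons
    for dir in $search_path; do
        IFS=$old_ifs           # list is already split; restore IFS
        if [ -n "$dir" ] && [ -e "$dir/$lib_name" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Check the current shell's environment for the library named in the
# error message quoted in the thread.
if found_dir=$(find_lib libcr.so.0 "${LD_LIBRARY_PATH:-}"); then
    echo "libcr.so.0 found in $found_dir"
else
    echo "libcr.so.0 not found via LD_LIBRARY_PATH"
fi
```

Running this in an interactive login on a node only shows what that login shell sees. Assuming the cluster launches remote processes over ssh, printing the variable through a non-interactive command (for example `ssh N07 'echo $LD_LIBRARY_PATH'`) may be closer to what mpirun and ompi-restart get when "-x LD_LIBRARY_PATH" is not passed.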