Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-29 Thread Josh Hursey
On Mar 29, 2010, at 11:53 AM, fengguang tian wrote: Hi, I have used the --term option, but mpirun is still hanging; it is the same whether I include the ' / ' or not. I am installing v1.4 to see whether the problems are still there. I tried, but some problems are still there. What

Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-29 Thread fengguang tian
Hi, I have used the --term option, but mpirun is still hanging; it is the same whether I include the ' / ' or not. I am installing v1.4 to see whether the problems are still there. I tried, but some problems are still there. BTW, my MPI program will have some input file, and will generate som

Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-29 Thread Josh Hursey
On Mar 23, 2010, at 1:00 PM, Fernando Lemos wrote: On Tue, Mar 23, 2010 at 12:55 PM, fengguang tian wrote: I use mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir --hostfile .mpihostfile to store the global checkpoint snapshot into the shared directory: /mirror, but

Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-23 Thread Fernando Lemos
On Tue, Mar 23, 2010 at 12:55 PM, fengguang tian wrote: > I use mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir --hostfile .mpihostfile to store the global checkpoint snapshot into the shared directory: /mirror, but the problems are still there, when ompi-checkpoin

Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-23 Thread fengguang tian
I use mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir --hostfile .mpihostfile to store the global checkpoint snapshot into the shared directory: /mirror, but the problems are still there. When running ompi-checkpoint, mpirun is still not killed; it hangs there. When doing ompi

Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-23 Thread fengguang tian
OK, thank you. I will try to move the checkpoint file into the shared directory. Regards, fengguang On Tue, Mar 23, 2010 at 10:34 AM, Fernando Lemos wrote: > On Tue, Mar 23, 2010 at 12:27 PM, fengguang tian wrote: > > I have created the shared file system, but I created a /mirror at root dir

Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-23 Thread Fernando Lemos
On Tue, Mar 23, 2010 at 12:27 PM, fengguang tian wrote: > I have created the shared file system, but I created a /mirror at root directory, not at the $HOME directory. Is that the problem? Thank you. Others might be able to give you a more accurate explanation. The way I understood it, in OpenMP

Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-23 Thread fengguang tian
I have created the shared file system, but I created a /mirror at root directory, not at the $HOME directory. Is that the problem? Thank you. Cheers, fengguang On Tue, Mar 23, 2010 at 10:23 AM, Fernando Lemos wrote: > On Mon, Mar 22, 2010 at 8:20 PM, fengguang tian wrote: > > I set up a cluster o

Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-23 Thread Fernando Lemos
On Mon, Mar 22, 2010 at 8:20 PM, fengguang tian wrote: > I set up a cluster of 18 nodes using Open MPI and the BLCR library, and the MPI program runs well on the cluster, but how do I checkpoint the MPI program on this cluster? For example, here is what I do for a test: mpiu@nimbus: /mirror$

[OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-22 Thread fengguang tian
I set up a cluster of 18 nodes using Open MPI and the BLCR library, and the MPI program runs well on the cluster, but how do I checkpoint the MPI program on this cluster? For example, here is what I do for a test: mpiu@nimbus: /mirror$ mpirun -np 50 --hostfile .mpihostfile -am ft-enable-cr hellompi th