Re: [OMPI users] question about checkpoint on cluster, mpirun doesn't work on cluster

2010-03-29 Thread fengguang tian
MPI from all > machines on your cluster, before installing the new version? Sometimes > problems like this come up because of mismatches in Open MPI versions on a > machine. > > -- Josh > > > On Mar 23, 2010, at 5:42 PM, fengguang tian wrote: > > I met the same probl

Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-29 Thread fengguang tian
ue, Mar 23, 2010 at 12:55 PM, fengguang tian >> wrote: >> >>> >>> I use mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir >>> --hostfile .mpihostfile >>> to store the global checkpoint snapshot into the shared >>> di

[OMPI users] question about checkpoint on cluster, mpirun doesn't work on cluster

2010-03-23 Thread fengguang tian
I hit the same problem described at this link: http://www.open-mpi.org/community/lists/users/2009/12/11374.php The solution given there is to use Open MPI v1.4 instead of v1.3. However, I am using Open MPI v1.7a1r22794 and hit the same problem. Here is what I have done: my cluster composed o

Re: [OMPI users] ompi-checkpoint hangs when using in multiple clusters

2010-03-23 Thread fengguang tian
.ckpt/0/opal_snapshot_4.ckpt), mkdir failed [1] [nimbus1:12630] Error: No metadata filename specified! why is that? cheers fengguang On Tue, Mar 23, 2010 at 10:37 AM, Fernando Lemos wrote: > On Tue, Mar 23, 2010 at 12:24 PM, fengguang tian > wrote: > > Hi > > > > I am usin

Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-23 Thread fengguang tian
invalid filename. Please see --help for usage. -- cheers fengguang On Tue, Mar 23, 2010 at 10:34 AM, Fernando Lemos wrote: > On Tue, Mar 23, 2010 at 12:27 PM, fengguang tian > wrote: > > I have created the

Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-23 Thread fengguang tian
OK, thank you. I will try to move the checkpoint file into the shared directory. Regards, fengguang On Tue, Mar 23, 2010 at 10:34 AM, Fernando Lemos wrote: > On Tue, Mar 23, 2010 at 12:27 PM, fengguang tian > wrote: > > I have created the shared file system. but I created a /mi

Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-23 Thread fengguang tian
I have created the shared file system, but I created /mirror in the root directory, not in the $HOME directory. Is that the problem? Thank you. Cheers, fengguang On Tue, Mar 23, 2010 at 10:23 AM, Fernando Lemos wrote: > On Mon, Mar 22, 2010 at 8:20 PM, fengguang tian > wrote: > > I set

[OMPI users] ompi-checkpoint hangs when using in multiple clusters

2010-03-23 Thread fengguang tian
Hi, I am using Open MPI and BLCR in a cluster of 3 machines. Checkpoint and restart work fine on a single machine, but when checkpointing in the cluster environment, ompi-checkpoint hangs. For example, my cluster is composed of 3 machines and, using NFS, has a shared directory. In master node

[OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-22 Thread fengguang tian
I set up a cluster of 18 nodes using Open MPI and the BLCR library. The MPI program runs well on the cluster, but how do I checkpoint the MPI program on this cluster? For example, here is what I do for a test: mpiu@nimbus: /mirror$ mpirun -np 50 --hostfile .mpihostfile -am ft-enable-cr hellompi th