Re: [OMPI users] ompi-checkpoint hangs when using in multiple clusters

2010-03-23 Thread Fernando Lemos
On Tue, Mar 23, 2010 at 1:25 PM, fengguang tian wrote: > now, I set $HOME as shared directory, but when doing ompi-checkpoint, it > shows:(nimbus1 is the remote machine in > my cluster) > > [nimbus1:12630] opal_os_dirpath_create: Error: Unable to create the > sub-directory (/home/mpiu/ompi_global_

[OMPI users] question about checkpoint on cluster, mpirun doesn't work on cluster

2010-03-23 Thread fengguang tian
I ran into the same problem described at this link: http://www.open-mpi.org/community/lists/users/2009/12/11374.php There, the suggested solution is to use Open MPI v1.4 instead of v1.3. However, I am using Open MPI v1.7a1r22794 and still see the same problem. Here is what I have done: my cluster composed o

Re: [OMPI users] problem with opal_net_private_ipv4

2010-03-23 Thread Rolf vandeVaart
Nicolas Niclausse wrote: Fernando Lemos wrote on 23/03/2010 16:28: I'm trying to run openmpi (1.4.1) on two clusters; on each cluster, several interfaces are private; on cluster1, nodes have 3 interfaces, and only 192.168.159.0/24 is visible from cluster2. chicon-3 eth0 inet addr:192

Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-23 Thread Fernando Lemos
On Tue, Mar 23, 2010 at 12:55 PM, fengguang tian wrote: > > I use mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir > --hostfile .mpihostfile > to store the global checkpoint snapshot into the shared > directory:/mirror,but the problems are still there, > when ompi-checkpoin

Re: [OMPI users] problem with opal_net_private_ipv4

2010-03-23 Thread Nicolas Niclausse
Fernando Lemos wrote on 23/03/2010 16:28: >> I'm trying to run openmpi (1.4.1) on two clusters; on each cluster, several >> interfaces are private; >> >> on cluster1, nodes have 3 interfaces, and only 192.168.159.0/24 is visible >> from cluster2. >> >> chicon-3 >> eth0 inet addr:192.168.160.

Re: [OMPI users] ompi-checkpoint hangs when using in multiple clusters

2010-03-23 Thread fengguang tian
Now I have set $HOME as the shared directory, but when I run ompi-checkpoint, it shows (nimbus1 is the remote machine in my cluster): [nimbus1:12630] opal_os_dirpath_create: Error: Unable to create the sub-directory (/home/mpiu/ompi_global_snapshot_1662.ckpt/0) of (/home/mpiu/ompi_global_snapshot_1662.ckpt

[OMPI users] error depends on the number of processors

2010-03-23 Thread Junwei Huang
Hello, I am still using LAM/MPI on an old cluster and wonder if I can get some help from this mailing list. Here is the problem. I am using an 18-node cluster; each node has 2 CPUs and each CPU supports up to 2 threads. So I assume I can use 18*4 processors. When running the following code, an e

Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-23 Thread fengguang tian
I use mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir --hostfile .mpihostfile to store the global checkpoint snapshot in the shared directory /mirror, but the problems are still there: when I run ompi-checkpoint, mpirun is still not killed; it just hangs. When doing ompi

Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-23 Thread fengguang tian
OK, thank you. I will try to move the checkpoint file into the shared directory. Regards, fengguang On Tue, Mar 23, 2010 at 10:34 AM, Fernando Lemos wrote: > On Tue, Mar 23, 2010 at 12:27 PM, fengguang tian > wrote: > > I have created the shared file system. but I created a /mirror at root > > dir

Re: [OMPI users] ompi-checkpoint hangs when using in multiple clusters

2010-03-23 Thread Fernando Lemos
On Tue, Mar 23, 2010 at 12:24 PM, fengguang tian wrote: > Hi > > I am using open-mpi and blcr in a cluster of 3 machines, and the checkpoint > and restart work fine in single machine,but when doing checkpoint in > clusters environment, the ompi-checkpoint hangs Besides what has been said in anoth

Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-23 Thread Fernando Lemos
On Tue, Mar 23, 2010 at 12:27 PM, fengguang tian wrote: > I have created the shared file system. but I created a /mirror at root > directory,not at the $HOME directory,is that the > problem? thank you Others might be able to give you a more accurate explanation. The way I understood it, in OpenMP

Re: [OMPI users] problem with opal_net_private_ipv4

2010-03-23 Thread Fernando Lemos
On Tue, Mar 23, 2010 at 10:25 AM, Nicolas Niclausse wrote: > Hello, > > > I'm trying to run openmpi (1.4.1) on two clusters; on each cluster, several > interfaces are private; > > on cluster1, nodes have 3 interfaces, and only 192.168.159.0/24 is visible > from cluster2. > > chicon-3 > eth0     in

Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-23 Thread fengguang tian
I have created the shared file system, but I created /mirror in the root directory, not in the $HOME directory. Is that the problem? Thank you. Cheers, fengguang On Tue, Mar 23, 2010 at 10:23 AM, Fernando Lemos wrote: > On Mon, Mar 22, 2010 at 8:20 PM, fengguang tian > wrote: > > I set up a cluster o

[OMPI users] ompi-checkpoint hangs when using in multiple clusters

2010-03-23 Thread fengguang tian
Hi, I am using Open MPI and BLCR in a cluster of 3 machines. Checkpoint and restart work fine on a single machine, but when checkpointing in a cluster environment, ompi-checkpoint hangs. For example, my cluster is composed of 3 machines and has a shared directory over NFS. In master node

Re: [OMPI users] questions about checkpoint/restart on multiple clusters of MPI

2010-03-23 Thread Fernando Lemos
On Mon, Mar 22, 2010 at 8:20 PM, fengguang tian wrote: > I set up a cluster of 18 nodes using Open MPI and BLCR library, and the MPI > program runs well on the clusters, > but how to checkpoint the MPI program on this clusters? > for example: > here is what I do for a test: > mpiu@nimbus: /mirror$

[OMPI users] problem with opal_net_private_ipv4

2010-03-23 Thread Nicolas Niclausse
Hello, I'm trying to run openmpi (1.4.1) on two clusters; on each cluster, several interfaces are private; on cluster1, nodes have 3 interfaces, and only 192.168.159.0/24 is visible from cluster2. chicon-3 eth0 inet addr:192.168.160.76 Bcast:192.168.160.255 Mask:255.255.255.0 eth1 ine

Re: [OMPI users] error when using mpiexec to launch

2010-03-23 Thread Shiqing Fan
Hi Gilles, It has been fixed; could you update your source and do a clean build? Actually, mpiexec is the same as mpirun, and mpic++ is the wrapper for C++ applications. You can find more details in the OMPI documentation here: http://www.open-mpi.org/doc/v1.4/ Regards, Shiqing Bloom

[OMPI users] Author Open MPI books-Packt Publishing.

2010-03-23 Thread Kshipra Singh
Hi All, I am writing to you on behalf of Packt Publishing, the publishers of computer-related books. We are planning to extend our catalogue of books based on Scientific Computing Tools and are currently inviting authors interested in writing for Packt. This doesn't need any previous writing experien