Re: [OMPI users] tcsh: orted: Not Found

2006-03-02 Thread Brian Barrett

On Mar 1, 2006, at 5:26 PM, Xiaoning (David) Yang wrote:

I installed Open MPI 1.0.1 on two Mac G5s (one with two cpus and the other
with 4 cpus). I set up ssh on both machines according to the FAQ. My mpi
jobs work fine if I run the jobs on only one computer. But when I ran a job
across the two Macs from the first Mac, mac1, I got:

mac1: mpirun -np 6 --hostfiles /Users/me/my_hosts hello_world
tcsh: orted: Command not found.
[mac1:01019] ERROR: A daemon on node mac2 failed to start as expected.
[mac1:01019] ERROR: There may be more information available from
[mac1:01019] ERROR: the remote shell (see above).
[mac1:01019] ERROR: The daemon exited unexpectedly with status 1.
2 processes killed (possibly by Open MPI)

File my_hosts looks like

mac1 slots=2
mac2 slots=4

The orted is definitely on my path on both machines. Any idea? Help is
greatly appreciated!


I'm guessing that the issue is with your shell configuration.  mpirun
starts the orted on the remote node through rsh/ssh, which will start
a non-login shell on the remote node.  Unfortunately, the set of
dotfiles evaluated when starting a non-login shell is different from
the set evaluated when starting a login shell.  The easiest way to tell
if this is the issue is to check whether orted is in your path when
started in a non-login shell, with a command like:


  ssh remote_host which orted

More information on how to configure your particular shell for use  
with Open MPI can be found in our FAQ at:


  http://www.open-mpi.org/faq/?category=running


Hope this helps,

Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] OpenMPI 1.0.x and PGI pgf90

2006-03-02 Thread Jeff Squyres

On Mar 1, 2006, at 1:55 PM, Bjoern Nachtwey wrote:

I tried to compile OpenMPI using the Portland Group compiler suite,
but the configure script tells me my Fortran compiler cannot
compile .f or .f90 files. I'm sure it can ;-)

[snipped]

PS: Full Script and Logfiles can be found at
http://www-public.tu-bs.de:8080/~nachtwey/OpenMPI/


Can you also put the file config.log out there?  That's the one that  
will have the details about what went wrong.


--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/




Re: [OMPI users] OpenMPI 1.0.x and PGI pgf90

2006-03-02 Thread Jeff Squyres

On Mar 1, 2006, at 5:14 PM, Troy Telford wrote:

That being said, I have been unable to get OpenMPI to compile with
PGI 6.1 (but it does finish ./configure; it breaks during 'make').


Troy --

Can you provide some details on what is going wrong?  We currently  
only have PGI 5.2 and 6.0 to test with.


--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/




[OMPI users] Building OpenMPI with Lahey Fortran 95

2006-03-02 Thread Adams Samuel D Contr AFRL/HEDR
I am trying to build OpenMPI using Lahey Fortran 95 6.2 on a Fedora Core 3
box.  I run the configure script ok, but the problem occurs when I run make.
It appears that it is bombing out when it is building the Fortran libraries.
It seems to me that OpenMPI is naming its modules with .ompi_module
instead of .mod, which my compiler expects.  Included below is the output
from what I was doing with building the code.  Do you know how to tell the
configure script to make only .mod modules, or is there something else that
I need to do?
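One way to narrow this down is a quick suffix probe (a hypothetical check, not from the original thread): compile a trivial module by hand and see what file the compiler actually produces. The compiler name lf95 is taken from the report above; the script skips the compile step if it is not installed.

```shell
# Probe which module-file suffix the Fortran compiler emits (sketch).
# Write a one-line module, compile it if the compiler is available,
# then list the generated files so the suffix can be inspected.
cat > conftest.f90 <<'EOF'
module conftest_mod
  integer :: i
end module conftest_mod
EOF

FC="${FC:-lf95}"   # compiler under test; lf95 comes from the report above
if command -v "$FC" >/dev/null 2>&1; then
  "$FC" -c conftest.f90
  ls conftest*     # look for conftest_mod.mod, or whatever suffix appears
else
  echo "compiler $FC not found; skipping compile step"
fi
```

Comparing that suffix with the .ompi_module name in the make output below shows the mismatch directly.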

Output:
I think this is the relevant part---
creating libmpi_f77.la
(cd .libs && rm -f libmpi_f77.la && ln -s ../libmpi_f77.la libmpi_f77.la)
make[4]: Leaving directory `/root/openmpi-1.0.1/ompi/mpi/f77'
make[3]: Leaving directory `/root/openmpi-1.0.1/ompi/mpi/f77'
Making all in f90
make[3]: Entering directory `/root/openmpi-1.0.1/ompi/mpi/f90'
lf95 -I../../../include -I../../../include -I.  -c -o mpi_kinds.ompi_module
mpi_kinds.f90
f95: fatal: "mpi_kinds.ompi_module": Invalid file suffix.
make[3]: *** [mpi_kinds.ompi_module] Error 1
make[3]: Leaving directory `/root/openmpi-1.0.1/ompi/mpi/f90'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/root/openmpi-1.0.1/ompi/mpi'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/openmpi-1.0.1/ompi'
make: *** [all-recursive] Error 1


---attached is the rest of the output

Sam Adams
General Dynamics - Network Systems

Script started on Thu 02 Mar 2006 09:37:24 AM CST
[root@devmn openmpi-1.0.1]# ulimit -s unlimited
[root@devmn openmpi-1.0.1]# FC=lf95 F77=lf95 ./configure --with-rsh=ssh && make clean && make || exit
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for gawk... gawk
checking whether make sets $(MAKE)... yes


== Configuring Open MPI


*** Checking versions
checking Open MPI version... 1.0.1
checking Open MPI Subversion repository version... r8453
checking Open Run-Time Environment (ORTE) version... 1.0.1
checking ORTE Subversion repository version... r8453
checking Open Portable Access Layer (OPAL) version... 1.0.1
checking OPAL Subversion repository version... r8453

*** Initialization, setup
configure: builddir: /root/openmpi-1.0.1
configure: srcdir: /root/openmpi-1.0.1
checking build system type... i686-pc-linux-gnu
checking host system type... i686-pc-linux-gnu
checking for prefix by checking for ompi_clean... no
installing to directory "/usr/local"

*** Configuration options
checking Whether to run code coverage... no
checking whether to debug memory usage... no
checking whether to profile memory usage... no
checking if want developer-level compiler pickyness... no
checking if want developer-level debugging code... no
checking if want Fortran 77 bindings... yes
checking if want Fortran 90 bindings... yes
checking whether to enable PMPI... yes
checking if want C++ bindings... yes
checking if want to enable weak symbol support... yes
checking if want run-time MPI parameter checking... runtime
checking if want to install OMPI header files... no
checking if want pretty-print stacktrace... yes
checking if want deprecated executable names... no
checking if want MPI-2 one-sided empty shell functions... no
checking max supported array dimension in F90 MPI bindings... 4
checking if pty support should be enabled... yes
checking if user wants dlopen support... yes
checking if heterogeneous support should be enabled... yes
checking if want trace file debugging... no


== Compiler and preprocessor tests


*** C compiler and preprocessor
checking for style of include used by make... GNU
checking for gcc... gcc
checking for C compiler default output file name... a.out
checking whether the C compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables... 
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ANSI C... none needed
checking dependency style of gcc... gcc3
checking whether gcc and cc understand -c and -o together... yes
checking if compiler impersonates gcc... no
checking if gcc supports -finline-functions... yes
checking if gcc supports -fno-strict-aliasing... yes
configure: WARNING:  -fno-strict-aliasing has been added to CFLAGS
checking for C optimization flags... -O3 -DNDEBUG -fno-strict-aliasing
checking how to run the C preprocessor... gcc -E
checking for egrep... grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for 

[OMPI users] Spawn and distribution of slaves

2006-03-02 Thread Jean Latour

Hello,

Testing the MPI_Comm_spawn function of Open MPI version 1.0.1, I have an
example that works OK, except that it shows that the spawned processes do
not follow the "machinefile" setting of processors. In this example a
master process spawns first 2 processes, then disconnects from them and
spawns 2 more processes. Running on a quad Opteron node, all processes are
running on the same node, although the machinefile specifies that the
slaves should run on different nodes.

With the current version of OpenMPI, is it possible to direct the spawned
processes to a specific node? (The node distribution could be given in the
"machinefile" file, as with LAM MPI.)


The code (Fortran 90) of this example and its makefile are attached as a
tar file.
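One experiment worth trying (a sketch, not a confirmed Open MPI 1.0.1 feature): the MPI-2 standard reserves the "host" info key for MPI_Comm_spawn, which in principle lets the caller request a placement directly instead of relying on the machinefile. The executable name "slave" and host name "node2" below are placeholders, and error handling is omitted.

```c
/* Sketch: ask for spawned processes to land on a specific host via the
 * reserved MPI-2 "host" info key.  Whether Open MPI 1.0.1 honors this
 * key is exactly the open question in this thread. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    MPI_Info_set(info, "host", "node2");   /* "node2" is a placeholder */

    /* Spawn both slaves in one call, requesting node2 for them. */
    MPI_Comm_spawn("slave", MPI_ARGV_NULL, 2, info, 0,
                   MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}
```

Running this requires an MPI installation and a separate "slave" executable, so it is untested here.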


Thank you very much

Jean Latour




spawn+connect.tar.gz
Description: Binary data

Re: [OMPI users] tcsh: orted: Not Found

2006-03-02 Thread Xiaoning (David) Yang
Brian,

Thank you for the help. I did include the path to orted in my .tcshrc file
on mac2, but I put the path at the end of the file. It is interesting that
when I logged into mac2 with ssh, the path was included and orted was in my
path. But when I ran "ssh mac2 which orted", orted was not found. It finds
orted only after I moved the path from the end of .tcshrc to the beginning
of the file. Strange. Again, thanks, and at least now I can make MPI work.

David

* Correspondence *



> From: Brian Barrett 
> Reply-To: Open MPI Users 
> Date: Thu, 2 Mar 2006 00:24:27 -0500
> To: Open MPI Users 
> Subject: Re: [OMPI users] tcsh: orted: Not Found
> 
> On Mar 1, 2006, at 5:26 PM, Xiaoning (David) Yang wrote:
> 
>> I installed Open MPI 1.0.1 on two Mac G5s (one with two cpus and
>> the other
>> with 4 cpus.). I set up ssh on both machines according to the FAQ.
>> My mpi
>> jobs work fine if I run the jobs on only one computer. But when I
>> ran a job
>> across the two Macs from the first Mac mac1, I got:
>> 
>> mac1: mpirun -np 6 --hostfiles /Users/me/my_hosts hello_world
>> tcsh: orted: Command not found.
>> [mac1:01019] ERROR: A daemon on node mac2 failed to start as expected.
>> [mac1:01019] ERROR: There may be more information available from
>> [mac1:01019] ERROR: the remote shell (see above).
>> [mac1:01019] ERROR: The daemon exited unexpectedly with status 1.
>> 2 processes killed (possibly by Open MPI)
>> 
>> File my_hosts looks like
>> 
>> mac1 slots=2
>> mac2 slots=4
>> 
>> The orted is definitely on my path on both machines. Any idea? Help is
>> greatly appreciated!
> 
> I'm guessing that the issue is with your shell configuration.  mpirun
> starts the orted on the remote node through rsh/ssh, which will start
> a non-login shell on the remote node.  Unfortunately, the set of
> dotfiles evaluated when a non-login shell is different than when
> starting a login shell.  The easiest way to tell if this is the issue
> is to check whether orted is in your path when started in a non-login
> shell with a command like:
> 
>ssh remote_host which orted
> 
> More information on how to configure your particular shell for use
> with Open MPI can be found in our FAQ at:
> 
>http://www.open-mpi.org/faq/?category=running
> 
> 
> Hope this helps,
> 
> Brian
> 
> -- 
>Brian Barrett
>Open MPI developer
>http://www.open-mpi.org/
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Spawn and distribution of slaves

2006-03-02 Thread Edgar Gabriel
As far as I know, Open MPI should follow the machinefile for spawn
operations; however, for every spawn it starts again at the beginning of
the machinefile. An info object such as 'lam_sched_round_robin' is
currently not available/implemented. Let me look into this...


Jean Latour wrote:


Hello,

Testing the MPI_Comm_Spawn function of Open MPI version 1.0.1, I have an 
example that works OK,
except that it shows that the spawned processes do not follow the 
"machinefile" setting of processors.
In this example a master process spawns first 2 processes, then 
disconnects from them and spawn 2 more
processes. Running on a Quad Opteron node, all processes are running on 
the same node, although the

machinefile specifies that the slaves should run on different nodes.

With the actual version of OpenMPI is it possible to direct the spawned 
processes on
a specific node ? (the node distribution could be given in the 
"machinefile" file, as with LAM MPI)


The code (Fortran 90) of this example and makefile is attached as a tar 
file.


Thank you very much

Jean Latour






Re: [OMPI users] tcsh: orted: Not Found

2006-03-02 Thread Brian Barrett

On Mar 2, 2006, at 11:34 AM, Xiaoning (David) Yang wrote:

Thank you for the help. I did include path to orted in my .tcshrc file on
mac2, but I put the path at the end of the file. It is interesting that
when I logged into mac with ssh, the path was included and orted was in my
path. But when I ran "ssh mac2 which orted", orted was not found. It finds
orted only after I move the path from the end of .tcshrc to the beginning
of the file. Strange. Again, thanks and at least I may make MPI work.


Do you have a test like if ( $?prompt ) exit towards the end of
your .tcshrc?  Most .tcshrc files do, and the end is only evaluated
for interactive shells (which the one to start the orted is not).
This is probably why moving it to the top helped.
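The usual shape of the fix, sketched for tcsh (the install path below is illustrative): PATH changes that non-interactive shells need must appear before the interactive-only early exit in ~/.tcshrc.

```tcsh
# ~/.tcshrc -- sketch; /opt/openmpi/bin is an illustrative install path.
# Settings needed by non-interactive shells (e.g. "ssh host orted")
# must come before the early exit.
set path = (/opt/openmpi/bin $path)

# Non-interactive shells stop here, so everything below is never seen
# by the shell that ssh starts for mpirun.
if ( ! $?prompt ) exit

# Interactive-only settings (prompt, aliases, ...) go here.
```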


Anyway, glad to hear things are working for you.

Brian




From: Brian Barrett 
Reply-To: Open MPI Users 
Date: Thu, 2 Mar 2006 00:24:27 -0500
To: Open MPI Users 
Subject: Re: [OMPI users] tcsh: orted: Not Found

On Mar 1, 2006, at 5:26 PM, Xiaoning (David) Yang wrote:


I installed Open MPI 1.0.1 on two Mac G5s (one with two cpus and
the other
with 4 cpus.). I set up ssh on both machines according to the FAQ.
My mpi
jobs work fine if I run the jobs on only one computer. But when I
ran a job
across the two Macs from the first Mac mac1, I got:

mac1: mpirun -np 6 --hostfiles /Users/me/my_hosts hello_world
tcsh: orted: Command not found.
[mac1:01019] ERROR: A daemon on node mac2 failed to start as  
expected.

[mac1:01019] ERROR: There may be more information available from
[mac1:01019] ERROR: the remote shell (see above).
[mac1:01019] ERROR: The daemon exited unexpectedly with status 1.
2 processes killed (possibly by Open MPI)

File my_hosts looks like

mac1 slots=2
mac2 slots=4

The orted is definitely on my path on both machines. Any idea?  
Help is

greatly appreciated!


I'm guessing that the issue is with your shell configuration.  mpirun
starts the orted on the remote node through rsh/ssh, which will start
a non-login shell on the remote node.  Unfortunately, the set of
dotfiles evaluated when a non-login shell is different than when
starting a login shell.  The easiest way to tell if this is the issue
is to check whether orted is in your path when started in a non-login
shell with a command like:

   ssh remote_host which orted

More information on how to configure your particular shell for use
with Open MPI can be found in our FAQ at:

   http://www.open-mpi.org/faq/?category=running


Hope this helps,

Brian

--
   Brian Barrett
   Open MPI developer
   http://www.open-mpi.org/






Re: [OMPI users] Spawn and Disconnect

2006-03-02 Thread Edgar Gabriel
Open MPI currently does not fully support a proper disconnection of
parent and child processes. Thus, if a child dies/aborts, the parents
will abort as well, despite calling MPI_Comm_disconnect. (The new RTE
will have better support for these operations; Ralph/Jeff can probably
give a better estimate of when this will be available.)

However, what should not happen is that the parent goes down when a child
calls MPI_Finalize (a proper shutdown rather than a violent death). Let me
check that as well...


Brignone, Sergio wrote:


Hi everybody,

I am trying to run a master/slave set. Because of the nature of the
problem I need to start and stop (kill) some slaves. The problem is that
as soon as one of the slaves dies, the master dies also.

This is what I am doing:

MASTER:

MPI_Init(...)
MPI_Comm_spawn(slave1,...,nslave1,...,intercomm1);
MPI_Barrier(intercomm1);
MPI_Comm_disconnect(&intercomm1);
MPI_Comm_spawn(slave2,...,nslave2,...,intercomm2);
MPI_Barrier(intercomm2);
MPI_Comm_disconnect(&intercomm2);
MPI_Finalize();

SLAVE:

MPI_Init(...)
MPI_Comm_get_parent(&intercomm);
(does something)
MPI_Barrier(intercomm);
MPI_Comm_disconnect(&intercomm);
MPI_Finalize();

The issue is that as soon as the first set of slaves calls MPI_Finalize,
the master dies also (it dies right after
MPI_Comm_disconnect(&intercomm1)).

What am I doing wrong?

Thanks

Sergio










Re: [OMPI users] cannot mak a simple ping-pong

2006-03-02 Thread Jose Pedro Garcia Mahedero
Finally it was a network problem. I had to make Open MPI ignore one
network interface on the master node of the cluster by setting
btl_tcp_if_include = eth1 in the file /usr/local/etc/openmpi-mca-params.conf.
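For anyone hitting the same symptom, the one-line change lives in Open MPI's system-wide MCA parameter file; the path and interface name below come from Jose's setup and will differ on other machines.

```conf
# /usr/local/etc/openmpi-mca-params.conf
# Restrict the TCP BTL to a single interface so Open MPI does not pick
# the wrong one on a multi-homed node.
btl_tcp_if_include = eth1
```

The same parameter can also be set per run with mpirun's -mca flag instead of the file.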

thank you all for your help.

Jose Pedro
On 3/1/06, Jose Pedro Garcia Mahedero  wrote:
>
> OK, it ALMOST works!!
>
> Now I've installed MPI on a non-clustered machine and it works, but
> surprisingly, it works fine from machine OUT1 as master to machine CLUSTER1
> as slave, but (here was my surprise) it doesn't work in the other sense! If
> I run the same program with CLUSTER1 as master, it only sends one message
> from master to slave and blocks while sending the second message. Maybe it
> is a firewall/iptables problem.
>
> Does anybody know which ports MPI uses to send requests/responses, or how
> to trace it? What I really don't understand is why it happens at the second
> message and not the first one :-( I know my slave never finishes, but it is
> not intended to right now; it will in a next version. I think it is not the
> main problem :-S
>
> I send an attachment with the (so simple) code and a tarball with my
> config.log
>
> thanks
>
>
> On 3/1/06, Jose Pedro Garcia Mahedero < jpgmahed...@gmail.com> wrote:
> >
> > You're right, I'll try to use NetPIPE first and then the application.
> > If it doesn't work I'll send configs and more detailed information.
> >
> > Thank you!
> >
> > On 3/1/06, Brian Barrett  wrote:
> > >
> > > Jose -
> > >
> > > I noticed that your output doesn't appear to match what the source
> > > code is capable of generating.  It's possible that you're running
> > > into problems with the code that we can't see because you didn't send
> > > a complete version of the source code.
> > >
> > > You might want to start by running some 3rd party codes that are
> > > known to be good, just to make sure that your MPI installation checks
> > > out.  A good start is NetPIPE, which runs between two peers and gives
> > > latency / bandwidth information.  If that runs, then it's time to
> > > look at your application.  If that doesn't run, then it's time to
> > > look at the MPI installation in more detail.  In this case, it would
> > > be useful to see all of the information requested here:
> > >
> > >http://www.open-mpi.org/community/help/
> > >
> > > as well as from running the mpirun command used to start NetPIPE with
> > > the -d option, so something like:
> > >
> > >mpirun -np 2 -hostfile foo -d ./NPMpi
> > >
> > > Brian
> > >
> > > On Feb 28, 2006, at 9:29 AM, Jose Pedro Garcia Mahedero wrote:
> > >
> > > > Hello everybody.
> > > >
> > > > I'm new to MPI and I'm having some problems while running a simple
> > > > ping-pong program on more than one node.
> > > >
> > > > 1.- I followed all the instructions and installed Open MPI without
> > > > problems on a Beowulf cluster.
> > > > 2.- The cluster is working OK and ssh keys are set for no
> > > > password prompting
> > > > 3.- mpiexec seems to run OK.
> > > > 4.- Now I'm using just 2 nodes: I've tried a simple ping-pong
> > > > application but my master only sends one request!!
> > > > 5.- I reduced the problem by trying to send just two messages to the
> > > > same node:
> > > >
> > > > int main(int argc, char **argv){
> > > >   int myrank;
> > > >
> > > >   /* Initialize MPI */
> > > >
> > > >   MPI_Init(&argc, &argv);
> > > >
> > > >   /* Find out my identity in the default communicator */
> > > >
> > > >   MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
> > > >   if (myrank == 0) {
> > > > int work = 100;
> > > > int count=0;
> > > > for (int i =0; i < 10; i++){
> > > >   cout << "MASTER IS SLEEPING..." << endl;
> > > >   sleep(3);
> > > >   cout << "MASTER AWAKE WILL SEND["<< count++ << "]:" << work
> > > > << endl;
> > > >MPI_Send(&work, 1, MPI_INT, 1, WORKTAG,   MPI_COMM_WORLD);
> > > > }
> > > >   } else {
> > > >   int count =0;
> > > >   int work;
> > > >   MPI_Status status;
> > > >   while (true){
> > > >   MPI_Recv(&work, 1, MPI_INT, 0, MPI_ANY_TAG,
> > > > MPI_COMM_WORLD, &status);
> > > >  cout << "SLAVE[" << myrank << "] RECEIVED[" << count++ <<
> > > > "]:" << work << endl;
> > > > if (status.MPI_TAG == DIETAG) {
> > > >   break;
> > > > }
> > > > }// while
> > > >   }
> > > >   MPI_Finalize();
> > > >
> > > >
> > > >
> > > > 6a.- RESULTS (if I put more than one machine in my mpihostsfile),
> > > > my master sends the first message and my slave receives it
> > > > perfectly. But my master doesn't send its second message:
> > > >
> > > >
> > > >
> > > > Here's my output
> > > >
> > > > MASTER IS SLEEPING...
> > > > MASTER AWAKE WILL SEND[0]:100
> > > > MASTER IS SLEEPING...
> > > > SLAVE[1] RECEIVED[0]:100MPI_STATUS.MPI_ERROR:0
> > > > MASTER AWAKE WILL SEND[1]:100
> > > >
> > > > 6b.- RESULTS (if I put ONLY  1 machine in my mpihostsfile),
> > > > everything is OK until iteration 9!!!
> > > > MASTER IS SLEEPING...
> > > > MASTER 

Re: [OMPI users] Spawn and Disconnect

2006-03-02 Thread Ralph Castain




We expect to have much better support for the entire comm_spawn process
in the next incarnation of the RTE. I don't expect that to be included
in a release, however, until 1.1 (Jeff may be able to give you an
estimate for when that will happen).

Jeff et al may be able to give you access to an early non-release
version sooner, if better comm_spawn support is a critical issue and
you don't mind being patient with the inevitable bugs in such versions.

Ralph


Edgar Gabriel wrote:

Open MPI currently does not fully support a proper disconnection of
parent and child processes. Thus, if a child dies/aborts, the parents
will abort as well, despite of calling MPI_Comm_disconnect. (The new RTE
will have better support for these operations, Ralph/Jeff can probably
give a better estimate when this will be available.)

However, what should not happen is, that if the child calls MPI_Finalize
(so not a violent death but a proper shutdown), the parent goes down at
the same time. Let me check that as well...

Brignone, Sergio wrote:

Hi everybody,

I am trying to run a master/slave set. Because of the nature of the
problem I need to start and stop (kill) some slaves. The problem is that
as soon as one of the slaves dies, the master dies also.

This is what I am doing:

MASTER:

MPI_Init(...)
MPI_Comm_spawn(slave1,...,nslave1,...,intercomm1);
MPI_Barrier(intercomm1);
MPI_Comm_disconnect(&intercomm1);
MPI_Comm_spawn(slave2,...,nslave2,...,intercomm2);
MPI_Barrier(intercomm2);
MPI_Comm_disconnect(&intercomm2);
MPI_Finalize();

SLAVE:

MPI_Init(...)
MPI_Comm_get_parent(&intercomm);
(does something)
MPI_Barrier(intercomm);
MPI_Comm_disconnect(&intercomm);
MPI_Finalize();

The issue is that as soon as the first set of slaves calls MPI_Finalize,
the master dies also (it dies right after
MPI_Comm_disconnect(&intercomm1)).

What am I doing wrong?

Thanks

Sergio





Re: [OMPI users] tcsh: orted: Not Found

2006-03-02 Thread Xiaoning (David) Yang
Yes, that's it! I do have an if statement for interactive shells. Now I
know. Thanks.

David

* Correspondence *



> From: Brian Barrett 
> Reply-To: Open MPI Users 
> Date: Thu, 2 Mar 2006 12:09:18 -0500
> To: Open MPI Users 
> Subject: Re: [OMPI users] tcsh: orted: Not Found
> 
> On Mar 2, 2006, at 11:34 AM, Xiaoning (David) Yang wrote:
> 
>> Thank you for the help. I did include path to orted in my .tcshrc
>> file on
>> mac2, but I put the path at the end of the file. It is interesting
>> that when
>> I logged into mac with ssh, the path was included and orted was in
>> my path.
>> But when I ran "ssh mac2 which orted", orted was not found. It
>> finds orted
>> only after I move the path from the end of .tcshrc to the beginning
>> of the
>> file. Strange. Again, thanks and at least I may make MPI work.
> 
> Do you have a test like if ( $?prompt ) exit towards the end of
> your .tcshrc?  Most .tcshrc files do, and the end is only evaluated
> for interactive shells (which the one to start the orted is not).
> This is probably why moving it to the top helped.
> 
> Anyway, glad to hear things are working for you.
> 
> Brian
> 
> 
> 
>>> From: Brian Barrett 
>>> Reply-To: Open MPI Users 
>>> Date: Thu, 2 Mar 2006 00:24:27 -0500
>>> To: Open MPI Users 
>>> Subject: Re: [OMPI users] tcsh: orted: Not Found
>>> 
>>> On Mar 1, 2006, at 5:26 PM, Xiaoning (David) Yang wrote:
>>> 
 I installed Open MPI 1.0.1 on two Mac G5s (one with two cpus and
 the other
 with 4 cpus.). I set up ssh on both machines according to the FAQ.
 My mpi
 jobs work fine if I run the jobs on only one computer. But when I
 ran a job
 across the two Macs from the first Mac mac1, I got:
 
 mac1: mpirun -np 6 --hostfiles /Users/me/my_hosts hello_world
 tcsh: orted: Command not found.
 [mac1:01019] ERROR: A daemon on node mac2 failed to start as
 expected.
 [mac1:01019] ERROR: There may be more information available from
 [mac1:01019] ERROR: the remote shell (see above).
 [mac1:01019] ERROR: The daemon exited unexpectedly with status 1.
 2 processes killed (possibly by Open MPI)
 
 File my_hosts looks like
 
 mac1 slots=2
 mac2 slots=4
 
 The orted is definitely on my path on both machines. Any idea?
 Help is
 greatly appreciated!
>>> 
>>> I'm guessing that the issue is with your shell configuration.  mpirun
>>> starts the orted on the remote node through rsh/ssh, which will start
>>> a non-login shell on the remote node.  Unfortunately, the set of
>>> dotfiles evaluated when a non-login shell is different than when
>>> starting a login shell.  The easiest way to tell if this is the issue
>>> is to check whether orted is in your path when started in a non-login
>>> shell with a command like:
>>> 
>>>ssh remote_host which orted
>>> 
>>> More information on how to configure your particular shell for use
>>> with Open MPI can be found in our FAQ at:
>>> 
>>>http://www.open-mpi.org/faq/?category=running
>>> 
>>> 
>>> Hope this helps,
>>> 
>>> Brian
>>> 
>>> -- 
>>>Brian Barrett
>>>Open MPI developer
>>>http://www.open-mpi.org/
>>> 
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Spawn and distribution of slaves

2006-03-02 Thread Edgar Gabriel

So for my tests, Open MPI did follow the machinefile (see the output
further below); however, for each spawn operation it starts from the very
beginning of the machinefile...

The following example spawns 5 child processes (with a single
MPI_Comm_spawn), and each child prints its rank and the hostname.

gabriel@linux12 ~/dyncomm $ mpirun -hostfile machinefile  -np 3
./dyncomm_spawn_father
 Checking for MPI_Comm_spawn.working
Hello world from child 0 on host linux12
Hello world from child 1 on host linux13
Hello world from child 3 on host linux15
Hello world from child 4 on host linux16
 Testing Send/Recv on the intercomm..working
Hello world from child 2 on host linux14


with the machinefile being:
gabriel@linux12 ~/dyncomm $ cat machinefile
linux12
linux13
linux14
linux15
linux16

In your code, you always spawn 1 process at a time, and that's why
they are all located on the same node.
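The difference Edgar describes can be made concrete (an illustrative sketch; "slave" is a placeholder executable and error handling is omitted): a loop of single spawns re-reads the machinefile from the top each time, while one call with maxprocs=4 walks down it.

```c
/* Sketch of the placement behavior described above. */
#include <mpi.h>

void spawn_one_at_a_time(void)
{
    /* Each MPI_Comm_spawn restarts at the top of the machinefile,
     * so all four children land on the first listed host. */
    MPI_Comm inter[4];
    for (int i = 0; i < 4; i++)
        MPI_Comm_spawn("slave", MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &inter[i], MPI_ERRCODES_IGNORE);
}

void spawn_all_at_once(void)
{
    /* One call with maxprocs = 4 distributes the children across
     * the machinefile entries, as in the output above. */
    MPI_Comm inter;
    MPI_Comm_spawn("slave", MPI_ARGV_NULL, 4, MPI_INFO_NULL, 0,
                   MPI_COMM_WORLD, &inter, MPI_ERRCODES_IGNORE);
}
```

These fragments need an MPI installation and a "slave" binary to run, so they are untested sketches.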


Hope this helps...
Edgar


Edgar Gabriel wrote:

as far as I know, Open MPI should follow the machinefile for spawn 
operations, starting however for every spawn at the beginning of the 
machinefile again. An info object such as 'lam_sched_round_robin' is 
currently not available/implemented. Let me look into this...


Jean Latour wrote:



Hello,

Testing the MPI_Comm_Spawn function of Open MPI version 1.0.1, I have an 
example that works OK,
except that it shows that the spawned processes do not follow the 
"machinefile" setting of processors.
In this example a master process spawns first 2 processes, then 
disconnects from them and spawn 2 more
processes. Running on a Quad Opteron node, all processes are running on 
the same node, although the

machinefile specifies that the slaves should run on different nodes.

With the actual version of OpenMPI is it possible to direct the spawned 
processes on
a specific node ? (the node distribution could be given in the 
"machinefile" file, as with LAM MPI)


The code (Fortran 90) of this example and makefile is attached as a tar 
file.


Thank you very much

Jean Latour




--
Edgar Gabriel
Assistant Professor
Department of Computer Science  email:gabr...@cs.uh.edu
University of Houston   http://www.cs.uh.edu/~gabriel
Philip G. Hoffman Hall, Room 524Tel: +1 (713) 743-3857
Houston, TX-77204, USA  Fax: +1 (713) 743-3335




Re: [OMPI users] cannot mak a simple ping-pong

2006-03-02 Thread Jeff Squyres

Jose --

This sounds like a problem that we just recently fixed in the 1.0.x
branch -- there were some situations where the "wrong" ethernet device
could have been picked by Open MPI (e.g., if you have a cluster with all
private IP addresses, and you run an MPI job that spans the head node and
the compute nodes, but the head node has multiple IP addresses).  Can you
try the latest 1.0.2 release candidate tarball and let us know if this
fixes the problem?


http://www.open-mpi.org/software/ompi/v1.0/

Specifically, you should no longer need to specify the
btl_tcp_if_include parameter -- Open MPI should be able to "figure it
all out" for you.


Let us know if this works for you.



On Mar 2, 2006, at 1:28 PM, Jose Pedro Garcia Mahedero wrote:

Finally it was a network problem. I had to disable  one network  
interface in the master node of the cluster by setting
btl_tcp_if_include = eth1 on file /usr/local/etc/openmpi-mca- 
params.conf


thank you all for your help.

Jose Pedro
On 3/1/06, Jose Pedro Garcia Mahedero < jpgmahed...@gmail.com> wrote:
OK, it ALMOST works!!

Now I've install MPI on a non clustered machine and it works, but  
surprisingly, it works fine from machine OUT1 as master to machine  
CLUSTER1 as slave, but (here was my surprise) it doesn't work on  
the other sense! If I run the same program with CLUSTER1 as master  
it only sends one message from master to slave and blocks while  
sending the second message. Maybe it is a firewall/iptable  problem.


Does anybody know which ports does MPI use to send requests/ 
responses ot how to trace it? What I really don't understand is why  
it happens at the second message and not the first one :-( I know  
my slave never finishes, but It is not intended to right now, it  
will in a next version, but I think it is not the main problem :-S


I send an attachment with the (very simple) code and a tarball with  
my config.log


thanks


On 3/1/06, Jose Pedro Garcia Mahedero < jpgmahed...@gmail.com> wrote:
You're right, I'll try to use NetPIPE first and then the  
application.  If it doesn't work I'll send configs and more  
detailed information


Thank you!


On 3/1/06, Brian Barrett  wrote: Jose -

I noticed that your output doesn't appear to match what the source
code is capable of generating.  It's possible that you're running
into problems with the code that we can't see because you didn't send
a complete version of the source code.

You might want to start by running some 3rd party codes that are
known to be good, just to make sure that your MPI installation checks
out.  A good start is NetPIPE, which runs between two peers and gives
latency / bandwidth information.  If that runs, then it's time to
look at your application.  If that doesn't run, then it's time to
look at the MPI installation in more detail.  In this case, it would
be useful to see all of the information requested here:

   http://www.open-mpi.org/community/help/

as well as from running the mpirun command used to start NetPIPE with
the -d option, so something like:

   mpirun -np 2 -hostfile foo -d ./NPMpi

Brian

On Feb 28, 2006, at 9:29 AM, Jose Pedro Garcia Mahedero wrote:

> Hello everybody.
>
> I'm new to MPI and I'm having some problems while running a simple
> ping-pong program on more than one node.
>
> 1.- I followed all the instructions and installed Open MPI without
> problems on a Beowulf cluster.
> 2.- The cluster is working OK and ssh keys are set for passwordless
> login.
> 3.- mpiexec seems to run OK.
> 4.- Now I'm using just 2 nodes: I've tried a simple ping-pong
> application but my master only sends one request!!
> 5.- I reduced the problem by trying to send just two messages to the
> same node:
>
> int main(int argc, char **argv){
>   int myrank;
>
>   /* Initialize MPI */
>
>   MPI_Init(&argc, &argv);
>
>   /* Find out my identity in the default communicator */
>
>   MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
>   if (myrank == 0) {
> int work = 100;
> int count=0;
> for (int i =0; i < 10; i++){
>   cout << "MASTER IS SLEEPING..." << endl;
>   sleep(3);
>   cout << "MASTER AWAKE WILL SEND["<< count++ << "]:" << work
> << endl;
>MPI_Send(&work, 1, MPI_INT, 1, WORKTAG,   MPI_COMM_WORLD);
> }
>   } else {
>   int count =0;
>   int work;
>   MPI_Status status;
>   while (true){
>   MPI_Recv(&work, 1, MPI_INT, 0, MPI_ANY_TAG,
> MPI_COMM_WORLD, &status);
>  cout << "SLAVE[" << myrank << "] RECEIVED[" << count++ <<
> "]:" << work << endl;
> if (status.MPI_TAG == DIETAG) {
>   break;
> }
> }// while
>   }
>   MPI_Finalize();
>   return 0;
> }
>
>
> 6a.- RESULTS (if I put more than one machine in my mpihostsfile),
> my master sends the first message and my slave receives it
> perfectly. But my master doesn't send its second
> message:
>
>
>
> Here's my output
>
> MASTER IS SLEEPING...
> MASTER AWAKE WILL SEND[0]:100
> MASTER IS SLEEPING...
> SLAVE[1] RECEIVED[0]:100MPI_ST

Re: [OMPI users] OpenMPI 1.0.x and PGI pgf90 ==> Problem solved

2006-03-02 Thread Bjoern Nachtwey
Dear Folks,

I had to add the "--with-gnu-ld" flag  and call my variables F77 and FC (not 
FC and F90).

now it works :-)

Thanks!

Bjørn

you wrote:
> I've used
>
> ./configure --with-gnu-ld F77=pgf77 FFLAGS=-fastsse FC=pgf90
> FCFLAGS=-fastsse
>
> and that worked for me.  Email direct if you have problems.
>
> - Brent
>



[OMPI users] Problem running open mpi across nodes.

2006-03-02 Thread Xiaoning (David) Yang
I installed Open MPI on two Mac G5s, one with 2 cpus and the other with 4
cpus. I can run jobs on either of the machines fine. But when I ran a job on
machine one across the two nodes, all the processes I requested would start,
but they then seemed to hang and I got the error message:

[0,1,1][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect]
connect() failed with
errno=60[0,1,0][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect
] connect() failed with errno=60

When I ran the job on machine two across the nodes, only processes on this
machine would start and then hung. No processes would start on machine one
and I didn't get any messages. In both cases, I have to Ctrl+C to kill the
jobs. Any idea what was wrong? Thanks a lot!

David

* Correspondence *





Re: [OMPI users] Problem running open mpi across nodes.

2006-03-02 Thread Brian Barrett

On Mar 2, 2006, at 3:56 PM, Xiaoning (David) Yang wrote:

I installed Open MPI on two Mac G5s, one with 2 cpus and the other  
with 4
cpus. I can run jobs on either of the machines fine. But when I ran  
a job on
machine one across the two nodes, all the processes I requested  
would start,

but they then seemed to hang and I got the error message:

[0,1,1][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect]
connect() failed with
errno=60[0,1,0][btl_tcp_endpoint.c: 
559:mca_btl_tcp_endpoint_complete_connect

] connect() failed with errno=60

When I ran the job on machine two across the nodes, only processes  
on this
machine would start and then hung. No processes would start on  
machine one
and I didn't get any messages. In both cases, I have to Ctrl+C to  
kill the

jobs. Any idea what was wrong? Thanks a lot!


errno 60 is ETIMEDOUT, which means that the connect() timed out  
before the remote side answered.  The other way was probably a  
similar problem - there's something strange going on with the routing  
on the two nodes that's causing OMPI to get confused.  Do your G5  
machines have ethernet adapters other than the primary GigE cards  
(wireless, a second GigE card, a Firewire TCP stack) by any chance?   
There's an issue with situations where there are multiple ethernet  
cards that causes the TCP btl to behave badly like this.  We think we  
have it fixed in the latest 1.0.2 pre-release tarball of Open MPI, so  
it might help to upgrade to that version:


  http://www.open-mpi.org/software/ompi/v1.0/

Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




[OMPI users] C++ bool type reduction failing

2006-03-02 Thread Andy Selle
I am trying to do a reduction using a bool type using the C++ bindings.  I am
using this sample program to test:

-
#include <mpi.h>
#include <iostream>

int main(int argc,char *argv[])
{
MPI::Init();
int rank=MPI::COMM_WORLD.Get_rank();

{bool test=true;
bool result;
MPI::COMM_WORLD.Allreduce(&test,&result,1,MPI::BOOL,MPI::LOR);
std::cout<<"rank "<<rank<<" "<<result<<std::endl;}

Re: [OMPI users] Problem running open mpi across nodes.

2006-03-02 Thread Xiaoning (David) Yang
Brian,

My G5s only have one ethernet card each and are connected to the network
through those cards. I upgraded to Open MPI 1.0.2. The problem remains the
same.

A somewhat detailed description of the problem is like this. When I run jobs
from the 4-cpu machine, specifying 6 processes, orted, orterun and 4
processes will start on this machine. orted and 2 processes will start on
the 2-cpu machine. The processes hang for a while and then I get the error
message. After that, the processes still hang. If I Ctrl+c, all processes
on both machines including both orteds and the orterun will quit. If I run
jobs from the 2-cpu machine, specifying 6 processes, orted, orterun and 2
processes will start on this machine. Only orted will start on the 4-cpu
machine and no processes will start. The job then hangs and I don't get any
response. If I Ctrl+c, orted, orterun and the 2 processes on the 2-cpu
machine will quit. But orted on the 4-cpu machine will not quit.

Does this have anything to do with the IP addresses? The IP address
xxx.xxx.aaa.bbb for one machine is different from the IP address
xxx.xxx.cc.dd for the other machine in that not only bbb is not dd, but also
aaa is not cc.

David

* Correspondence *



> From: Brian Barrett 
> Reply-To: Open MPI Users 
> Date: Thu, 2 Mar 2006 18:50:35 -0500
> To: Open MPI Users 
> Subject: Re: [OMPI users] Problem running open mpi across nodes.
> 
> On Mar 2, 2006, at 3:56 PM, Xiaoning (David) Yang wrote:
> 
>> I installed Open MPI on two Mac G5s, one with 2 cpus and the other
>> with 4
>> cpus. I can run jobs on either of the machines fine. But when I ran
>> a job on
>> machine one across the two nodes, all the processes I requested
>> would start,
>> but they then seemed to hang and I got the error message:
>> 
>> [0,1,1][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect]
>> connect() failed with
>> errno=60[0,1,0][btl_tcp_endpoint.c:
>> 559:mca_btl_tcp_endpoint_complete_connect
>> ] connect() failed with errno=60
>> 
>> When I ran the job on machine two across the nodes, only processes
>> on this
>> machine would start and then hung. No processes would start on
>> machine one
>> and I didn't get any messages. In both cases, I have to Ctrl+C to
>> kill the
>> jobs. Any idea what was wrong? Thanks a lot!
> 
> errno 60 is ETIMEDOUT, which means that the connect() timed out
> before the remote side answered.  The other way was probably a
> similar problem - there's something strange going on with the routing
> on the two nodes that's causing OMPI to get confused.  Do your G5
> machines have ethernet adapters other than the primary GigE cards
> (wireless, a second GigE card, a Firewire TCP stack) by any chance?
> There's an issue with situations where there are multiple ethernet
> cards that causes the TCP btl to behave badly like this.  We think we
> have it fixed in the latest 1.0.2 pre-release tarball of Open MPI, so
> it might help to upgrade to that version:
> 
>http://www.open-mpi.org/software/ompi/v1.0/
> 
> Brian
> 
> -- 
>Brian Barrett
>Open MPI developer
>http://www.open-mpi.org/
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Problem running open mpi across nodes.

2006-03-02 Thread Brian Barrett

On Mar 2, 2006, at 8:19 PM, Xiaoning (David) Yang wrote:

My G5s only have one ethernet card each and are connected to the  
network
through those cards. I upgraded to Open MPI 1.0.2. The problem  
remains the

same.

A somewhat detailed description of the problem is like this. When I  
run jobs

from the 4-cpu machine, specifying 6 processes, orted, orterun and 4
processes will start on this machine. orted and 2 processes will  
start on
the 2-cpu machine. The processes hang for a while and then I get  
the error
message. After that, the processes still hang. If I Ctrl+c, all  
processes
on both machines including both orteds and the orterun will quit.  
If I run
jobs from the 2-cpu machine, specifying 6 processes, orted, orterun  
and 2
processes will start on this machine. Only orted will start on the  
4-cpu
machine and no processes will start. The job then hangs and I don't  
get any

response. If I Ctrl+c, orted, orterun and the 2 processes on the 2-cpu
machine will quit. But orted on the 4-cpu machine will not quit.

Does this have anything to do with the IP addresses? The IP address
xxx.xxx.aaa.bbb for one machine is different from the IP address
xxx.xxx.cc.dd for the other machine in that not only bbb is not dd,  
but also

aaa is not cc.


Well, you can't guess right all the time :).  But I think you gave  
enough information for the next thing to try.  It sounds like there  
might be a firewall running on the 2 process machine.  When you  
orterun on the 4 cpu machine, the remote orted can clearly connect  
back to orterun because it is getting the process startup and  
shutdown messages.  Things only fail when the MPI processes on the 4  
cpu machine try to connect to the other processes.  On the other  
hand, when you start on the 2 cpu machine, the orted on the 4 cpu  
machine starts but can't even connect back to orterun to find out  
what processes to start, nor can it get the shutdown request.  So you  
get a hang.


If you go into System Preferences -> Sharing, make sure that the  
firewall is turned off  in the "firewall" tab.  Hopefully, that will  
do the trick.


Brian







___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users