Re: [OMPI users] tcsh: orted: Not Found
On Mar 1, 2006, at 5:26 PM, Xiaoning (David) Yang wrote: I installed Open MPI 1.0.1 on two Mac G5s (one with two cpus and the other with four cpus). I set up ssh on both machines according to the FAQ. My MPI jobs work fine if I run the jobs on only one computer. But when I ran a job across the two Macs from the first Mac mac1, I got: mac1: mpirun -np 6 --hostfiles /Users/me/my_hosts hello_world tcsh: orted: Command not found. [mac1:01019] ERROR: A daemon on node mac2 failed to start as expected. [mac1:01019] ERROR: There may be more information available from [mac1:01019] ERROR: the remote shell (see above). [mac1:01019] ERROR: The daemon exited unexpectedly with status 1. 2 processes killed (possibly by Open MPI) File my_hosts looks like mac1 slots=2 mac2 slots=4 The orted is definitely on my path on both machines. Any idea? Help is greatly appreciated! I'm guessing that the issue is with your shell configuration. mpirun starts the orted on the remote node through rsh/ssh, which will start a non-login shell on the remote node. Unfortunately, the set of dotfiles evaluated when starting a non-login shell is different from the set evaluated when starting a login shell. The easiest way to tell if this is the issue is to check whether orted is in your path when started in a non-login shell, with a command like: ssh remote_host which orted More information on how to configure your particular shell for use with Open MPI can be found in our FAQ at: http://www.open-mpi.org/faq/?category=running Hope this helps, Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/
Re: [OMPI users] OpenMPI 1.0.x and PGI pgf90
On Mar 1, 2006, at 1:55 PM, Bjoern Nachtwey wrote: I tried to compile OpenMPI using the Portland Group compiler suite, but the configure script tells me my Fortran compiler cannot compile .f or .f90 files. I'm sure it can ;-) [snipped] PS: Full script and logfiles can be found at http://www-public.tu-bs.de:8080/~nachtwey/OpenMPI/ Can you also put the file config.log out there? That's the one that will have the details about what went wrong. -- {+} Jeff Squyres {+} The Open MPI Project {+} http://www.open-mpi.org/
Re: [OMPI users] OpenMPI 1.0.x and PGI pgf90
On Mar 1, 2006, at 5:14 PM, Troy Telford wrote: That being said, I have been unable to get OpenMPI to compile with PGI 6.1 (but it does finish ./configure; it breaks during 'make'). Troy -- Can you provide some details on what is going wrong? We currently only have PGI 5.2 and 6.0 to test with. -- {+} Jeff Squyres {+} The Open MPI Project {+} http://www.open-mpi.org/
[OMPI users] Building OpenMPI with Lahey Fortran 95
I am trying to build OpenMPI using Lahey Fortran 95 6.2 on a Fedora Core 3 box. I run the configure script OK, but the problem occurs when I run make. It appears that it is bombing out when it is building the Fortran libraries. It seems to me that OpenMPI is naming its modules with .ompi_module instead of .mod, which my compiler expects. Included below is the output from what I was doing when building the code. Do you know how to tell the configure script to only make .mod modules, or is there something else that I need to do? Output: I think this is the relevant part--- creating libmpi_f77.la (cd .libs && rm -f libmpi_f77.la && ln -s ../libmpi_f77.la libmpi_f77.la) make[4]: Leaving directory `/root/openmpi-1.0.1/ompi/mpi/f77' make[3]: Leaving directory `/root/openmpi-1.0.1/ompi/mpi/f77' Making all in f90 make[3]: Entering directory `/root/openmpi-1.0.1/ompi/mpi/f90' lf95 -I../../../include -I../../../include -I. -c -o mpi_kinds.ompi_module mpi_kinds.f90 f95: fatal: "mpi_kinds.ompi_module": Invalid file suffix. make[3]: *** [mpi_kinds.ompi_module] Error 1 make[3]: Leaving directory `/root/openmpi-1.0.1/ompi/mpi/f90' make[2]: *** [all-recursive] Error 1 make[2]: Leaving directory `/root/openmpi-1.0.1/ompi/mpi' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory `/root/openmpi-1.0.1/ompi' make: *** [all-recursive] Error 1 ---attached is the rest of the output Sam Adams General Dynamics - Network Systems Script started on Thu 02 Mar 2006 09:37:24 AM CST [root@devmn openmpi-1.0.1]# ulimit -s unlimited [root@devmn openmpi-1.0.1]# FC=lf95 F77=lf95 ./configure --with-rsh=ssh && make clean && make || exit checking for a BSD-compatible install... /usr/bin/install -c checking whether build environment is sane... yes checking for gawk... gawk checking whether make sets $(MAKE)... yes == Configuring Open MPI *** Checking versions checking Open MPI version... 1.0.1 checking Open MPI Subversion repository version... r8453 checking Open Run-Time Environment (ORTE) version... 1.0.1 checking ORTE Subversion repository version... r8453 checking Open Portable Access Layer (OPAL) version... 1.0.1 checking OPAL Subversion repository version... r8453 *** Initialization, setup configure: builddir: /root/openmpi-1.0.1 configure: srcdir: /root/openmpi-1.0.1 checking build system type... i686-pc-linux-gnu checking host system type... i686-pc-linux-gnu checking for prefix by checking for ompi_clean... no installing to directory "/usr/local" *** Configuration options checking Whether to run code coverage... no checking whether to debug memory usage... no checking whether to profile memory usage... no checking if want developer-level compiler pickyness... no checking if want developer-level debugging code... no checking if want Fortran 77 bindings... yes checking if want Fortran 90 bindings... yes checking whether to enable PMPI... yes checking if want C++ bindings... yes checking if want to enable weak symbol support... yes checking if want run-time MPI parameter checking... runtime checking if want to install OMPI header files... no checking if want pretty-print stacktrace... yes checking if want deprecated executable names... no checking if want MPI-2 one-sided empty shell functions... no checking max supported array dimension in F90 MPI bindings... 4 checking if pty support should be enabled... yes checking if user wants dlopen support... yes checking if heterogeneous support should be enabled... yes checking if want trace file debugging... 
no == Compiler and preprocessor tests *** C compiler and preprocessor checking for style of include used by make... GNU checking for gcc... gcc checking for C compiler default output file name... a.out checking whether the C compiler works... yes checking whether we are cross compiling... no checking for suffix of executables... checking for suffix of object files... o checking whether we are using the GNU C compiler... yes checking whether gcc accepts -g... yes checking for gcc option to accept ANSI C... none needed checking dependency style of gcc... gcc3 checking whether gcc and cc understand -c and -o together... yes checking if compiler impersonates gcc... no checking if gcc supports -finline-functions... yes checking if gcc supports -fno-strict-aliasing... yes configure: WARNING: -fno-strict-aliasing has been added to CFLAGS checking for C optimization flags... -O3 -DNDEBUG -fno-strict-aliasing checking how to run the C preprocessor... gcc -E checking for egrep... grep -E checking for ANSI C header files... yes checking for sys/types.h... yes checking for
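One quick way to narrow down the module-suffix question above (a suggestion only, not from the thread; the file name is made up) is to compile a trivial module by hand and see which suffix lf95 itself produces, since that extension is what the build's suffix detection is trying to discover:

  ! contents of modtest.f90
  module modtest
    integer :: i
  end module modtest

  lf95 -c modtest.f90
  ls modtest.*

If this produces modtest.mod, the compiler side is fine and the .ompi_module name is coming from the build system's suffix detection rather than from lf95.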
[OMPI users] Spawn and distribution of slaves
Hello, Testing the MPI_Comm_spawn function of Open MPI version 1.0.1, I have an example that works OK, except that it shows that the spawned processes do not follow the "machinefile" setting of processors. In this example a master process first spawns 2 processes, then disconnects from them and spawns 2 more processes. Running on a Quad Opteron node, all processes are running on the same node, although the machinefile specifies that the slaves should run on different nodes. With the current version of Open MPI, is it possible to direct the spawned processes to a specific node? (The node distribution could be given in the "machinefile" file, as with LAM MPI.) The code (Fortran 90) of this example and its makefile are attached as a tar file. Thank you very much Jean Latour (attachment: spawn+connect.tar.gz)
Re: [OMPI users] tcsh: orted: Not Found
Brian, Thank you for the help. I did include path to orted in my .tcshrc file on mac2, but I put the path at the end of the file. It is interesting that when I logged into mac with ssh, the path was included and orted was in my path. But when I ran "ssh mac2 which orted", orted was not found. It finds orted only after I move the path from the end of .tcshrc to the beginning of the file. Strange. Again, thanks and at least I may make MPI work. David * Correspondence * > From: Brian Barrett > Reply-To: Open MPI Users > Date: Thu, 2 Mar 2006 00:24:27 -0500 > To: Open MPI Users > Subject: Re: [OMPI users] tcsh: orted: Not Found > > On Mar 1, 2006, at 5:26 PM, Xiaoning (David) Yang wrote: > >> I installed Open MPI 1.0.1 on two Mac G5s (one with two cpus and >> the other >> with 4 cpus.). I set up ssh on both machines according to the FAQ. >> My mpi >> jobs work fine if I run the jobs on only one computer. But when I >> ran a job >> across the two Macs from the first Mac mac1, I got: >> >> mac1: mpirun -np 6 --hostfiles /Users/me/my_hosts hello_world >> tcsh: orted: Command not found. >> [mac1:01019] ERROR: A daemon on node mac2 failed to start as expected. >> [mac1:01019] ERROR: There may be more information available from >> [mac1:01019] ERROR: the remote shell (see above). >> [mac1:01019] ERROR: The daemon exited unexpectedly with status 1. >> 2 processes killed (possibly by Open MPI) >> >> File my_hosts looks like >> >> mac1 slots=2 >> mac2 slots=4 >> >> The orted is definitely on my path on both machines. Any idea? Help is >> greatly appreciated! > > I'm guessing that the issue is with your shell configuration. mpirun > starts the orted on the remote node through rsh/ssh, which will start > a non-login shell on the remote node. Unfortunately, the set of > dotfiles evaluated when a non-login shell is different than when > starting a login shell. The easiest way to tell if this is the issue > is to check whether orted is in your path when started in a non-login > shell with a command like: > >ssh remote_host which orted > > More information on how to configure your particular shell for use > with Open MPI can be found in our FAQ at: > >http://www.open-mpi.org/faq/?category=running > > > Hope this helps, > > Brian > > -- >Brian Barrett >Open MPI developer >http://www.open-mpi.org/ > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Spawn and distribution of slaves
As far as I know, Open MPI should follow the machinefile for spawn operations; however, every spawn starts again at the beginning of the machinefile. An info object such as 'lam_sched_round_robin' is currently not available/implemented. Let me look into this... Jean Latour wrote: Hello, Testing the MPI_Comm_spawn function of Open MPI version 1.0.1, I have an example that works OK, except that it shows that the spawned processes do not follow the "machinefile" setting of processors. In this example a master process first spawns 2 processes, then disconnects from them and spawns 2 more processes. Running on a Quad Opteron node, all processes are running on the same node, although the machinefile specifies that the slaves should run on different nodes. With the current version of Open MPI, is it possible to direct the spawned processes to a specific node? (The node distribution could be given in the "machinefile" file, as with LAM MPI.) The code (Fortran 90) of this example and its makefile are attached as a tar file. Thank you very much Jean Latour ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] tcsh: orted: Not Found
On Mar 2, 2006, at 11:34 AM, Xiaoning (David) Yang wrote: Thank you for the help. I did include path to orted in my .tcshrc file on mac2, but I put the path at the end of the file. It is interesting that when I logged into mac with ssh, the path was included and orted was in my path. But when I ran "ssh mac2 which orted", orted was not found. It finds orted only after I move the path from the end of .tcshrc to the beginning of the file. Strange. Again, thanks and at least I may make MPI work. Do you have a test like if ( ! $?prompt ) exit (i.e., exit if the shell is not interactive) towards the end of your .tcshrc? Most .tcshrc files do, and anything after that test is only evaluated for interactive shells (which the shell used to start the orted is not). This is probably why moving the path to the top helped. Anyway, glad to hear things are working for you. Brian From: Brian Barrett Reply-To: Open MPI Users Date: Thu, 2 Mar 2006 00:24:27 -0500 To: Open MPI Users Subject: Re: [OMPI users] tcsh: orted: Not Found On Mar 1, 2006, at 5:26 PM, Xiaoning (David) Yang wrote: I installed Open MPI 1.0.1 on two Mac G5s (one with two cpus and the other with four cpus). I set up ssh on both machines according to the FAQ. My MPI jobs work fine if I run the jobs on only one computer. But when I ran a job across the two Macs from the first Mac mac1, I got: mac1: mpirun -np 6 --hostfiles /Users/me/my_hosts hello_world tcsh: orted: Command not found. [mac1:01019] ERROR: A daemon on node mac2 failed to start as expected. [mac1:01019] ERROR: There may be more information available from [mac1:01019] ERROR: the remote shell (see above). [mac1:01019] ERROR: The daemon exited unexpectedly with status 1. 2 processes killed (possibly by Open MPI) File my_hosts looks like mac1 slots=2 mac2 slots=4 The orted is definitely on my path on both machines. Any idea? Help is greatly appreciated! I'm guessing that the issue is with your shell configuration. mpirun starts the orted on the remote node through rsh/ssh, which will start a non-login shell on the remote node. Unfortunately, the set of dotfiles evaluated when starting a non-login shell is different from the set evaluated when starting a login shell. The easiest way to tell if this is the issue is to check whether orted is in your path when started in a non-login shell, with a command like: ssh remote_host which orted More information on how to configure your particular shell for use with Open MPI can be found in our FAQ at: http://www.open-mpi.org/faq/?category=running Hope this helps, Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/ ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
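For tcsh users hitting this, here is a minimal sketch of how such a .tcshrc can be laid out (the Open MPI install prefix is only a placeholder):

  # Put Open MPI on the path before any interactive-only section, so that
  # non-interactive shells (like the one ssh uses to start orted) see it.
  set path = (/opt/openmpi/bin $path)

  # Anything below this test is skipped by non-interactive shells.
  if (! $?prompt) exit

  # interactive-only settings (prompt, aliases, ...) go here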
Re: [OMPI users] Spawn and Disconnect
Open MPI currently does not fully support a proper disconnection of parent and child processes. Thus, if a child dies/aborts, the parent will abort as well, despite calling MPI_Comm_disconnect. (The new RTE will have better support for these operations; Ralph/Jeff can probably give a better estimate of when this will be available.) However, what should not happen is that the parent goes down when the child merely calls MPI_Finalize (a proper shutdown rather than a violent death). Let me check that as well... Brignone, Sergio wrote: Hi everybody, I am trying to run a master/slave set. Because of the nature of the problem I need to start and stop (kill) some slaves. The problem is that as soon as one of the slaves dies, the master dies also. This is what I am doing: MASTER: MPI_Init(...) MPI_Comm_spawn(slave1,...,nslave1,...,intercomm1); MPI_Barrier(intercomm1); MPI_Comm_disconnect(&intercomm1); MPI_Comm_spawn(slave2,...,nslave2,...,intercomm2); MPI_Barrier(intercomm2); MPI_Comm_disconnect(&intercomm2); MPI_Finalize(); SLAVE: MPI_Init(...) MPI_Comm_get_parent(&intercomm); (does something) MPI_Barrier(intercomm); MPI_Comm_disconnect(&intercomm); MPI_Finalize(); The issue is that as soon as the first set of slaves calls MPI_Finalize, the master dies also (it dies right after MPI_Comm_disconnect(&intercomm1)). What am I doing wrong? Thanks Sergio ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
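For reference, here is a minimal C rendering of the master pseudocode above (a sketch only: the slave program names, the counts, and the use of MPI_COMM_SELF as the spawning communicator are placeholders/assumptions, and error handling is omitted):

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Comm inter1, inter2;
      int nslave1 = 2, nslave2 = 2;  /* placeholder counts */

      MPI_Init(&argc, &argv);

      /* first batch of slaves: spawn, sync, then disconnect */
      MPI_Comm_spawn("slave1", MPI_ARGV_NULL, nslave1, MPI_INFO_NULL,
                     0, MPI_COMM_SELF, &inter1, MPI_ERRCODES_IGNORE);
      MPI_Barrier(inter1);
      MPI_Comm_disconnect(&inter1);

      /* second batch, started after the first has been disconnected */
      MPI_Comm_spawn("slave2", MPI_ARGV_NULL, nslave2, MPI_INFO_NULL,
                     0, MPI_COMM_SELF, &inter2, MPI_ERRCODES_IGNORE);
      MPI_Barrier(inter2);
      MPI_Comm_disconnect(&inter2);

      MPI_Finalize();
      return 0;
  }

As Edgar notes above, current Open MPI may still take the master down when a disconnected child aborts, even with this calling sequence.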
Re: [OMPI users] cannot make a simple ping-pong
Finally it was a network problem. I had to disable one network interface in the master node of the cluster by setting btl_tcp_if_include = eth1 on file /usr/local/etc/openmpi-mca-params.conf thank you all for your help. Jose Pedro On 3/1/06, Jose Pedro Garcia Mahedero wrote: > > OK, it ALMOST works!! > > Now I've install MPI on a non clustered machine and it works, but > surprisingly, it works fine from machine OUT1 as master to machine CLUSTER1 > as slave, but (here was my surprise) it doesn't work on the other sense! If > I run the same program with CLUSTER1 as master it only sends one message > from master to slave and blocks while sending the second message. Maybe it > is a firewall/iptable problem. > > Does anybody know which ports does MPI use to send requests/responses ot > how to trace it? What I really don't understand is why it happens at the > second message and not the first one :-( I know my slave never finishes, but > It is not intended to right now, it will in a next version, but I think it > is not the main problem :-S > > I send an attachemtn with the (so simple) code and a tarball with my > config.log > > thaks > > > On 3/1/06, Jose Pedro Garcia Mahedero < jpgmahed...@gmail.com> wrote: > > > > You're right, I'll try to use netpipes first and then the application. > > If it doesn't workt I'll send configs and more detailed informations > > > > Thank you! > > > > On 3/1/06, Brian Barrett wrote: > > > > > > Jose - > > > > > > I noticed that your output doesn't appear to match what the source > > > code is capable of generating. It's possible that you're running > > > into problems with the code that we can't see because you didn't send > > > a complete version of the source code. > > > > > > You might want to start by running some 3rd party codes that are > > > known to be good, just to make sure that your MPI installation checks > > > out. A good start is NetPIPE, which runs between two peers and gives > > > latency / bandwidth information. If that runs, then it's time to > > > look at your application. If that doesn't run, then it's time to > > > look at the MPI installation in more detail. In this case, it would > > > be useful to see all of the information requested here: > > > > > >http://www.open-mpi.org/community/help/ > > > > > > as well as from running the mpirun command used to start NetPIPE with > > > the -d option, so something like: > > > > > >mpirun -np 2 -hostfile foo -d ./NPMpi > > > > > > Brian > > > > > > On Feb 28, 2006, at 9:29 AM, Jose Pedro Garcia Mahedero wrote: > > > > > > > Hello everybody. > > > > > > > > I'm new to MPI and I'm having some problems while runnig a simple > > > > pingpong program in more than one node. > > > > > > > > 1.- I followed all the instructions and installed open MPI without > > > > problems in a Beowulf cluster. > > > > 2.- Ths cluster is working OK and ssh keys are set for not > > > > password prompting > > > > 3.- miexec seems to run OK. > > > > 4.- Now I'm using just 2 nodes: I've tried a simple ping-pong > > > > application but my master only sends one request!! 
> > > > 5.- I reduced the problem by trying to send just two messages to the > > > > same node: > > > > > > > > int main(int argc, char **argv){ > > > > int myrank; > > > > > > > > /* Initialize MPI */ > > > > MPI_Init(&argc, &argv); > > > > > > > > /* Find out my identity in the default communicator */ > > > > MPI_Comm_rank(MPI_COMM_WORLD, &myrank); > > > > if (myrank == 0) { > > > > int work = 100; > > > > int count = 0; > > > > for (int i = 0; i < 10; i++){ > > > > cout << "MASTER IS SLEEPING..." << endl; > > > > sleep(3); > > > > cout << "MASTER AWAKE WILL SEND[" << count++ << "]:" << work > > > > << endl; > > > > MPI_Send(&work, 1, MPI_INT, 1, WORKTAG, MPI_COMM_WORLD); > > > > } > > > > } else { > > > > int count = 0; > > > > int work; > > > > MPI_Status status; > > > > while (true){ > > > > MPI_Recv(&work, 1, MPI_INT, 0, MPI_ANY_TAG, > > > > MPI_COMM_WORLD, &status); > > > > cout << "SLAVE[" << myrank << "] RECEIVED[" << count++ << > > > > "]:" << work << "MPI_STATUS.MPI_ERROR:" << status.MPI_ERROR << endl; > > > > if (status.MPI_TAG == DIETAG) { > > > > break; > > > > } > > > > } // while > > > > } > > > > MPI_Finalize(); > > > > > > > > 6a.- RESULTS (if I put more than one machine in my mpihostsfile), > > > > my master sends the first message and my slave receives it > > > > perfectly. But my master doesn't send its second > > > > message: > > > > > > > > Here's my output > > > > > > > > MASTER IS SLEEPING... > > > > MASTER AWAKE WILL SEND[0]:100 > > > > MASTER IS SLEEPING... > > > > SLAVE[1] RECEIVED[0]:100MPI_STATUS.MPI_ERROR:0 > > > > MASTER AWAKE WILL SEND[1]:100 > > > > > > > > 6b.- RESULTS (if I put ONLY 1 machine in my mpihostsfile), > > > > everything is OK until iteration 9!!! > > > > MASTER IS SLEEPING... > > > > MASTER
Re: [OMPI users] Spawn and Disconnect
We expect to have much better support for the entire comm_spawn process in the next incarnation of the RTE. I don't expect that to be included in a release, however, until 1.1 (Jeff may be able to give you an estimate for when that will happen). Jeff et al may be able to give you access to an early non-release version sooner, if better comm_spawn support is a critical issue and you don't mind being patient with the inevitable bugs in such versions. Ralph Edgar Gabriel wrote: Open MPI currently does not fully support a proper disconnection of parent and child processes. Thus, if a child dies/aborts, the parents will abort as well, despite of calling MPI_Comm_disconnect. (The new RTE will have better support for these operations, Ralph/Jeff can probably give a better estimate when this will be available.) However, what should not happen is, that if the child calls MPI_Finalize (so not a violent death but a proper shutdown), the parent goes down at the same time. Let me check that as well... Brignone, Sergio wrote: Hi everybody, I am trying to run a master/slave set. Because of the nature of the problem I need to start and stop (kill) some slaves. The problem is that as soon as one of the slave dies, the master dies also. This is what I am doing: MASTER: MPI_Init(...) MPI_Comm_spawn(slave1,...,nslave1,...,intercomm1); MPI_Barrier(intercomm1); MPI_Comm_disconnect(&intercomm1); MPI_Comm_spawn(slave2,...,nslave2,...,intercomm2); MPI_Barrier(intercomm2); MPI_Comm_disconnect(&intercomm2); MPI_Finalize(); SLAVE: MPI_Init(...) MPI_Comm_get_parent(&intercomm); (does something) MPI_Barrier(intercomm); MPI_Comm_disconnect(&intercomm); MPI_Finalize(); The issue is that as soon as the first set of slaves calls MPI_Finalize, the master dies also (it dies right after MPI_Comm_disconnect(&intercomm1) ) What am I doing wrong? Thanks Sergio ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] tcsh: orted: Not Found
Yes, that's it! I do have an if statement for interactive shells. Now I know. Thanks. David * Correspondence * > From: Brian Barrett > Reply-To: Open MPI Users > Date: Thu, 2 Mar 2006 12:09:18 -0500 > To: Open MPI Users > Subject: Re: [OMPI users] tcsh: orted: Not Found > > On Mar 2, 2006, at 11:34 AM, Xiaoning (David) Yang wrote: > >> Thank you for the help. I did include path to orted in my .tcshrc >> file on >> mac2, but I put the path at the end of the file. It is interesting >> that when >> I logged into mac with ssh, the path was included and orted was in >> my path. >> But when I ran "ssh mac2 which orted", orted was not found. It >> finds orted >> only after I move the path from the end of .tcshrc to the beginning >> of the >> file. Strange. Again, thanks and at least I may make MPI work. > > Do you have a test like if ( $?prompt ) exit towards the end of > your .tcshrc? Most .tcshrc files do, and the end is only evaluated > for interactive shells (which the one to start the orted is not). > This is probably why moving it to the top helped. > > Anyway, glad to hear things are working for you. > > Brian > > > >>> From: Brian Barrett >>> Reply-To: Open MPI Users >>> Date: Thu, 2 Mar 2006 00:24:27 -0500 >>> To: Open MPI Users >>> Subject: Re: [OMPI users] tcsh: orted: Not Found >>> >>> On Mar 1, 2006, at 5:26 PM, Xiaoning (David) Yang wrote: >>> I installed Open MPI 1.0.1 on two Mac G5s (one with two cpus and the other with 4 cpus.). I set up ssh on both machines according to the FAQ. My mpi jobs work fine if I run the jobs on only one computer. But when I ran a job across the two Macs from the first Mac mac1, I got: mac1: mpirun -np 6 --hostfiles /Users/me/my_hosts hello_world tcsh: orted: Command not found. [mac1:01019] ERROR: A daemon on node mac2 failed to start as expected. [mac1:01019] ERROR: There may be more information available from [mac1:01019] ERROR: the remote shell (see above). [mac1:01019] ERROR: The daemon exited unexpectedly with status 1. 2 processes killed (possibly by Open MPI) File my_hosts looks like mac1 slots=2 mac2 slots=4 The orted is definitely on my path on both machines. Any idea? Help is greatly appreciated! >>> >>> I'm guessing that the issue is with your shell configuration. mpirun >>> starts the orted on the remote node through rsh/ssh, which will start >>> a non-login shell on the remote node. Unfortunately, the set of >>> dotfiles evaluated when a non-login shell is different than when >>> starting a login shell. The easiest way to tell if this is the issue >>> is to check whether orted is in your path when started in a non-login >>> shell with a command like: >>> >>>ssh remote_host which orted >>> >>> More information on how to configure your particular shell for use >>> with Open MPI can be found in our FAQ at: >>> >>>http://www.open-mpi.org/faq/?category=running >>> >>> >>> Hope this helps, >>> >>> Brian >>> >>> -- >>>Brian Barrett >>>Open MPI developer >>>http://www.open-mpi.org/ >>> >>> >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Spawn and distribution of slaves
In my tests, Open MPI did follow the machinefile (see the output further below); however, each spawn operation starts again from the very beginning of the machinefile... The following example spawns 5 child processes (with a single MPI_Comm_spawn), and each child prints its rank and the hostname. gabriel@linux12 ~/dyncomm $ mpirun -hostfile machinefile -np 3 ./dyncomm_spawn_father Checking for MPI_Comm_spawn.working Hello world from child 0 on host linux12 Hello world from child 1 on host linux13 Hello world from child 3 on host linux15 Hello world from child 4 on host linux16 Testing Send/Recv on the intercomm..working Hello world from child 2 on host linux14 with the machinefile being: gabriel@linux12 ~/dyncomm $ cat machinefile linux12 linux13 linux14 linux15 linux16 In your code, you always spawn 1 process at a time, and that's why they are all located on the same node. Hope this helps... Edgar Edgar Gabriel wrote: As far as I know, Open MPI should follow the machinefile for spawn operations; however, every spawn starts again at the beginning of the machinefile. An info object such as 'lam_sched_round_robin' is currently not available/implemented. Let me look into this... Jean Latour wrote: Hello, Testing the MPI_Comm_spawn function of Open MPI version 1.0.1, I have an example that works OK, except that it shows that the spawned processes do not follow the "machinefile" setting of processors. In this example a master process first spawns 2 processes, then disconnects from them and spawns 2 more processes. Running on a Quad Opteron node, all processes are running on the same node, although the machinefile specifies that the slaves should run on different nodes. With the current version of Open MPI, is it possible to direct the spawned processes to a specific node? (The node distribution could be given in the "machinefile" file, as with LAM MPI.) The code (Fortran 90) of this example and its makefile are attached as a tar file. Thank you very much Jean Latour ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Edgar Gabriel Assistant Professor Department of Computer Science email: gabr...@cs.uh.edu University of Houston http://www.cs.uh.edu/~gabriel Philip G. Hoffman Hall, Room 524 Tel: +1 (713) 743-3857 Houston, TX-77204, USA Fax: +1 (713) 743-3335
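To illustrate Edgar's point in code, here is a small C sketch of requesting all the children in a single MPI_Comm_spawn call rather than one call per child (the child program name and the count are made up):

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Comm children;
      int nchildren = 5;  /* ask for all children at once */

      MPI_Init(&argc, &argv);

      /* One spawn of N processes lets the run-time walk down the machinefile;
         N separate 1-process spawns each restart from its first entry. */
      MPI_Comm_spawn("./dyncomm_spawn_child", MPI_ARGV_NULL, nchildren,
                     MPI_INFO_NULL, 0, MPI_COMM_WORLD, &children,
                     MPI_ERRCODES_IGNORE);

      MPI_Comm_disconnect(&children);
      MPI_Finalize();
      return 0;
  }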
Re: [OMPI users] cannot make a simple ping-pong
Jose -- This sounds like a problem that we just recently fixed in the 1.0.x branch -- there were some situations where the "wrong" ethernet device could have been picked by Open MPI (e.g., if you have a cluster with all private IP addresses, and you run an MPI job that spans the head node and the compute nodes, but the head node has multiple IP addresses). Can you try the latest 1.0.2 release candidate tarball and let us know if this fixes the problem? http://www.open-mpi.org/software/ompi/v1.0/ Specifically, you should no longer need to specify that btl_tcp_if_include parameter -- Open MPI should be able to "figure it all out" for you. Let us know if this works for you. On Mar 2, 2006, at 1:28 PM, Jose Pedro Garcia Mahedero wrote: Finally it was a network problem. I had to disable one network interface in the master node of the cluster by setting btl_tcp_if_include = eth1 on file /usr/local/etc/openmpi-mca- params.conf thank you all for your help. Jose Pedro On 3/1/06, Jose Pedro Garcia Mahedero < jpgmahed...@gmail.com> wrote: OK, it ALMOST works!! Now I've install MPI on a non clustered machine and it works, but surprisingly, it works fine from machine OUT1 as master to machine CLUSTER1 as slave, but (here was my surprise) it doesn't work on the other sense! If I run the same program with CLUSTER1 as master it only sends one message from master to slave and blocks while sending the second message. Maybe it is a firewall/iptable problem. Does anybody know which ports does MPI use to send requests/ responses ot how to trace it? What I really don't understand is why it happens at the second message and not the first one :-( I know my slave never finishes, but It is not intended to right now, it will in a next version, but I think it is not the main problem :-S I send an attachemtn with the (so simple) code and a tarball with my config.log thaks On 3/1/06, Jose Pedro Garcia Mahedero < jpgmahed...@gmail.com> wrote:You're right, I'll try to use netpipes first and then the application. If it doesn't workt I'll send configs and more detailed informations Thank you! On 3/1/06, Brian Barrett wrote: Jose - I noticed that your output doesn't appear to match what the source code is capable of generating. It's possible that you're running into problems with the code that we can't see because you didn't send a complete version of the source code. You might want to start by running some 3rd party codes that are known to be good, just to make sure that your MPI installation checks out. A good start is NetPIPE, which runs between two peers and gives latency / bandwidth information. If that runs, then it's time to look at your application. If that doesn't run, then it's time to look at the MPI installation in more detail. In this case, it would be useful to see all of the information requested here: http://www.open-mpi.org/community/help/ as well as from running the mpirun command used to start NetPIPE with the -d option, so something like: mpirun -np 2 -hostfile foo -d ./NPMpi Brian On Feb 28, 2006, at 9:29 AM, Jose Pedro Garcia Mahedero wrote: > Hello everybody. > > I'm new to MPI and I'm having some problems while runnig a simple > pingpong program in more than one node. > > 1.- I followed all the instructions and installed open MPI without > problems in a Beowulf cluster. > 2.- Ths cluster is working OK and ssh keys are set for not > password prompting > 3.- miexec seems to run OK. > 4.- Now I'm using just 2 nodes: I've tried a simple ping-pong > application but my master only sends one request!! 
> 5.- I reduced the problem by trying to send just two mesages to the > same node: > > int main(int argc, char **argv){ > int myrank; > > /* Initialize MPI */ > > MPI_Init(&argc, &argv); > > /* Find out my identity in the default communicator */ > > MPI_Comm_rank(MPI_COMM_WORLD, &myrank); > if (myrank == 0) { > int work = 100; > int count=0; > for (int i =0; i < 10; i++){ > cout << "MASTER IS SLEEPING..." << endl; > sleep(3); > cout << "MASTER AWAKE WILL SEND["<< count++ << "]:" << work > << endl; >MPI_Send(&work, 1, MPI_INT, 1, WORKTAG, MPI_COMM_WORLD); > } > } else { > int count =0; > int work; > MPI_Status status; > while (true){ > MPI_Recv(&work, 1, MPI_INT, 0, MPI_ANY_TAG, > MPI_COMM_WORLD, &status); > cout << "SLAVE[" << myrank << "] RECEIVED[" << count++ << > "]:" << work < if (status.MPI_TAG == DIETAG) { > break; > } > }// while > } > MPI_Finalize(); > > > > 6a.- RESULTS (if I put more than one machine in my mpihostsfile), > my master sends the first message and my slave receives it > perfectly. But my master doesnt send its second . > message: > > > > Here's my output > > MASTER IS SLEEPING... > MASTER AWAKE WILL SEND[0]:100 > MASTER IS SLEEPING... > SLAVE[1] RECEIVED[0]:100MPI_ST
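As an aside, the same interface selection can also be passed on the mpirun command line for a quick test, instead of editing the params file; something along these lines (hostfile and program names here are placeholders): mpirun --mca btl_tcp_if_include eth1 -np 2 -hostfile my_hosts ./NPMpi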
Re: [OMPI users] OpenMPI 1.0.x and PGI pgf90 ==> Problem solved
Dear Folks, I had to add the "--with-gnu-ld" flag and name my variables F77 and FC (not FC and F90). Now it works :-) Thanks! Bjørn you wrote: > I've used > > ./configure --with-gnu-ld F77=pgf77 FFLAGS=-fastsse FC=pgf90 > FCFLAGS=-fastsse > > and that worked for me. Email direct if you have problems. > > - Brent >
[OMPI users] Problem running open mpi across nodes.
I installed Open MPI on two Mac G5s, one with 2 cpus and the other with 4 cpus. I can run jobs on either of the machines fine. But when I ran a job on machine one across the two nodes, all the processes I requested would start, but they then seemed to hang and I got the error message: [0,1,1][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=60[0,1,0][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect ] connect() failed with errno=60 When I ran the job on machine two across the nodes, only processes on this machine would start and then hang. No processes would start on machine one and I didn't get any messages. In both cases, I had to Ctrl+C to kill the jobs. Any idea what was wrong? Thanks a lot! David * Correspondence *
Re: [OMPI users] Problem running open mpi across nodes.
On Mar 2, 2006, at 3:56 PM, Xiaoning (David) Yang wrote: I installed Open MPI on two Mac G5s, one with 2 cpus and the other with 4 cpus. I can run jobs on either of the machines fine. But when I ran a job on machine one across the two nodes, the all processes I requested would start, but they then seemed to hang and I got the error message: [0,1,1][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=60[0,1,0][btl_tcp_endpoint.c: 559:mca_btl_tcp_endpoint_complete_connect ] connect() failed with errno=60 When I ran the job on machine two across the nodes, only processes on this machine would start and then hung. No processes would start on machine one and I didn't get any messages. In both cases, I have to Ctrl+C to kill the jobs. Any idea what was wrong? Thanks a lot! errno 60 is ETIMEDOUT, which means that the connect() timed out before the remote side answered. The other way was probably a similar problem - there's something strange going on with the routing on the two nodes that's causing OMPI to get confused. Do your G5 machines have ethernet adapters other than the primary GigE cards (wireless, a second GigE card, a Firewire TCP stack) by any chance? There's an issue with situations where there are multiple ethernet cards that causes the TCP btl to behave badly like this. We think we have it fixed in the latest 1.0.2 pre-release tarball of Open MPI, so it might help to upgrade to that version: http://www.open-mpi.org/software/ompi/v1.0/ Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/
[OMPI users] C++ bool type reduction failing
I am trying to do a reduction on a bool type using the C++ bindings. I am using this sample program to test: --- #include <mpi.h> #include <iostream> int main(int argc, char *argv[]) { MPI::Init(); int rank = MPI::COMM_WORLD.Get_rank(); { bool test = true; bool result; MPI::COMM_WORLD.Allreduce(&test, &result, 1, MPI::BOOL, MPI::LOR); std::cout << "rank " << rank << " result " << result << std::endl; } MPI::Finalize(); return 0; }
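In case anyone wants to reproduce this, the sample should build and run with the Open MPI wrapper compiler along these lines (the source file name is arbitrary): mpic++ boolreduce.cc -o boolreduce && mpirun -np 2 ./boolreduce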
Re: [OMPI users] Problem running open mpi across nodes.
Brian, My G5s only have one ethernet card each and are connected to the network through those cards. I upgraded to Open MPI 1.0.2. The problem remains the same. A somewhat detailed description of the problem is like this. When I run jobs from the 4-cpu machine, specifying 6 processes, orted, orterun and 4 processes will start on this machine. orted and 2 processes will start on the 2-cpu machine. The processes hang for a while and then I get the error message. After that, the processes still hang. If I Ctrl+c, all processes on both machines including both orteds and the orterun will quit. If I run jobs from the 2-cpu machine, specifying 6 processes, orted, orterun and 2 processes will start on this machine. Only orted will start on the 4-cpu machine and no processes will start. The job then hangs and I don't get any response. If I Ctrl+c, orted, orterun and the 2 processes on the 2-cpu machine will quit. But orted on the 4-cpu machine will not quit. Does this have anything to do with the IP addresses? The IP address xxx.xxx.aaa.bbb for one machine is different from the IP address xxx.xxx.cc.dd for the other machine in that not only bbb is not dd, but also aaa is not cc. David * Correspondence * > From: Brian Barrett > Reply-To: Open MPI Users > Date: Thu, 2 Mar 2006 18:50:35 -0500 > To: Open MPI Users > Subject: Re: [OMPI users] Problem running open mpi across nodes. > > On Mar 2, 2006, at 3:56 PM, Xiaoning (David) Yang wrote: > >> I installed Open MPI on two Mac G5s, one with 2 cpus and the other >> with 4 >> cpus. I can run jobs on either of the machines fine. But when I ran >> a job on >> machine one across the two nodes, all the processes I requested >> would start, >> but they then seemed to hang and I got the error message: >> >> [0,1,1][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] >> connect() failed with >> errno=60[0,1,0][btl_tcp_endpoint.c: >> 559:mca_btl_tcp_endpoint_complete_connect >> ] connect() failed with errno=60 >> >> When I ran the job on machine two across the nodes, only processes >> on this >> machine would start and then hang. No processes would start on >> machine one >> and I didn't get any messages. In both cases, I had to Ctrl+C to >> kill the >> jobs. Any idea what was wrong? Thanks a lot! > > errno 60 is ETIMEDOUT, which means that the connect() timed out > before the remote side answered. The other way was probably a > similar problem - there's something strange going on with the routing > on the two nodes that's causing OMPI to get confused. Do your G5 > machines have ethernet adapters other than the primary GigE cards > (wireless, a second GigE card, a Firewire TCP stack) by any chance? > There's an issue with situations where there are multiple ethernet > cards that causes the TCP btl to behave badly like this. We think we > have it fixed in the latest 1.0.2 pre-release tarball of Open MPI, so > it might help to upgrade to that version: > > http://www.open-mpi.org/software/ompi/v1.0/ > > Brian > > -- > Brian Barrett > Open MPI developer > http://www.open-mpi.org/ > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Problem running open mpi across nodes.
On Mar 2, 2006, at 8:19 PM, Xiaoning (David) Yang wrote: My G5s only have one ethernet card each and are connected to the network through those cards. I upgraded to Open MPI 1.0.2. The problem remains the same. A somewhat detailed description of the problem is like this. When I run jobs from the 4-cpu machine, specifying 6 processes, orted, orterun and 4 processes will start on this machine. orted and 2 processes will start on the 2-cpu machine. The processes hang for a while and then I get the error message . After that, the processes still hang. If I Ctrl+c, all processes on both machines including both orteds and the orterun will quit. If I run jobs from the 2-cpu machin, specfying 6 processes, orted, orterun and 2 processes will start on this machine. Only orted will start on the 4-cpu machine and no processes will start. The job then hang and I don't get any response. If I Ctrl+c, orted, orterun and the 2 processes on the 2-cpu machine will quit. But orted on the 4-cpu machine will not quit. Does this have anything to do with the IP addresses? The IP address xxx.xxx.aaa.bbb for one machine is different from the IP address xxx.xxx.cc.dd for the other machine in that not only bbb is not dd, but also aaa is not cc. Well, you can't guess right all the time :). But I think you gave enough information for the next thing to try. It sounds like there might be a firewall running on the 2 process machine. When you orterun on the 4 cpu machine, the remote orted can clearly connect back to orterun because it is getting the process startup and shutdown messages. Things only fail when the MPI process on the 4 cpu machine try to connect to the other processes. On the other hand, when you start on the 2 cpu machine, the orted on the 4 cpu machine starts but can't even connect back to orterun to find out what processes to start, nor can it get the shutdown request. So you get a hang. If you go into System Preferences -> Sharing, make sure that the firewall is turned off in the "firewall" tab. Hopefully, that will do the trick. Brian From: Brian Barrett Reply-To: Open MPI Users Date: Thu, 2 Mar 2006 18:50:35 -0500 To: Open MPI Users Subject: Re: [OMPI users] Problem running open mpi across nodes. On Mar 2, 2006, at 3:56 PM, Xiaoning (David) Yang wrote: I installed Open MPI on two Mac G5s, one with 2 cpus and the other with 4 cpus. I can run jobs on either of the machines fine. But when I ran a job on machine one across the two nodes, the all processes I requested would start, but they then seemed to hang and I got the error message: [0,1,1][btl_tcp_endpoint.c: 559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=60[0,1,0][btl_tcp_endpoint.c: 559:mca_btl_tcp_endpoint_complete_connect ] connect() failed with errno=60 When I ran the job on machine two across the nodes, only processes on this machine would start and then hung. No processes would start on machine one and I didn't get any messages. In both cases, I have to Ctrl+C to kill the jobs. Any idea what was wrong? Thanks a lot! errno 60 is ETIMEDOUT, which means that the connect() timed out before the remote side answered. The other way was probably a similar problem - there's something strange going on with the routing on the two nodes that's causing OMPI to get confused. Do your G5 machines have ethernet adapters other than the primary GigE cards (wireless, a second GigE card, a Firewire TCP stack) by any chance? 
There's an issue with situations where there are multiple ethernet cards that causes the TCP btl to behave badly like this. We think we have it fixed in the latest 1.0.2 pre-release tarball of Open MPI, so it might help to upgrade to that version: http://www.open-mpi.org/software/ompi/v1.0/ Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/ ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users