[OMPI users] [openMPI-infiniband] openMPI in IB network when openSM with LASH is running
Has anyone on the list tested Open MPI on an InfiniBand network in which OpenSM is running with the LASH routing algorithm enabled? I haven't tested this case myself, but I can foresee a problem: the LASH routing algorithm in OpenSM uses virtual lanes (VLs), which are mapped directly to service levels (SLs), and it assigns different VLs (SLs) to different paths in the network. This SL <-> path association is available only through the subnet manager (OpenSM) during connection establishment. But AFAIK, Open MPI doesn't use the services of the subnet manager for connection establishment between nodes. So I would like to know whether anyone has thought about this and is working on it.

regards,
Mahesh
[OMPI users] version 1.3
Hello guys,

When is version 1.3 scheduled to be released? Since it will contain checkpointing, a library for non-blocking communication, and ConnectX support for QPs, it would be great to have it ASAP. I am evaluating MVAPICH against Open MPI, and I found that MVAPICH still has the upper hand in terms of checkpointing. But I am pretty sure that once v1.3 arrives, it will help the HPC community a lot. I can find the development trunk version, but I am more interested in a production release.

-Neeraj
[OMPI users] ./configure error on windows while installing openmpi-1.2.4(latest)
Hi,

Subject: "Need exact command line for ./configure {optionslist}" to build Open MPI 1.2.4 on Windows.

While the configure script is checking the Fortran 77 compiler, I get the following error, so the Open MPI build fails on Windows:

checking for correct handling of FORTRAN logical arrays... no
configure: error: Error determining if arrays of logical values work properly.

I want to build openmpi-1.2.4 (which is downloaded from MINGW) on a Windows 2000 machine. Can somebody give me the proper build command, i.e. ./configure ... (options list)? Please tell me the exact options to pass in the option list. I am using Cygwin to build Open MPI on Windows.

PS: I am attaching the output files:
config.log -> actual log file
config.out -> output of ./configure
make.out -> fails because configure was unsuccessful on Windows
make.install -> fails because configure was unsuccessful on Windows

PS: I am using g77, g++, and gcc from the MINGW package. I have also downloaded and added g95, but that does not solve my problem.

Thanks,
Geetha
Re: [OMPI users] SegFault with MPI_THREAD_MULTIPLE in 1.2.4
This is to be expected. OMPI's support for THREAD_MULTIPLE is incomplete and most likely doesn't work.

On Nov 25, 2007, at 6:45 PM, Emilio J. Padron wrote:

Hi, it's my first message here, so greetings to everyone (and sorry about my poor English) :-)

I'm coding a parallel algorithm and I decided to upgrade the Open MPI version used in our cluster (1.2.3) this week. After that, problems arose :-/

There seems to be a problem with multithreading support in Open MPI 1.2.4, at least in my installation. The problem appears when more than one process per node is spawned. A simple *hello world* program (with no sends/receives) works OK in MPI_THREAD_SINGLE mode, but when I tried MPI_THREAD_MULTIPLE this error arose:

/opt/openmpi/bin/mpirun -np 2 -machinefile /home/users/emilioj/machinefileOpenMPI --debug-daemons justhi
Daemon [0,0,1] checking in as pid 5446 on host c0-0
[pvfs2-compute-0-0.local:05446] [0,0,1] orted: received launch callback
[pvfs2-compute-0-0:05447] *** Process received signal ***
[pvfs2-compute-0-0:05447] Signal: Segmentation fault (11)
[pvfs2-compute-0-0:05447] Signal code: Address not mapped (1)
[pvfs2-compute-0-0:05447] Failing at address: (nil)
[pvfs2-compute-0-0:05448] *** Process received signal ***
[pvfs2-compute-0-0:05448] Signal: Segmentation fault (11)
[pvfs2-compute-0-0:05448] Signal code: Address not mapped (1)
[pvfs2-compute-0-0:05448] Failing at address: (nil)
[pvfs2-compute-0-0:05448] [ 0] /lib/tls/libpthread.so.0 [0xbb2890]
[pvfs2-compute-0-0:05448] [ 1] /opt/openmpi/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x39) [0x4b1d99]
[pvfs2-compute-0-0:05448] [ 2] /opt/openmpi/lib/libopen-pal.so.0(opal_progress+0x65) [0x592265]
[pvfs2-compute-0-0:05448] [ 3] /opt/openmpi/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_wait+0x29) [0x20a731]
[pvfs2-compute-0-0:05448] [ 4] /opt/openmpi/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_recv+0x365) [0x20f301]
[pvfs2-compute-0-0:05448] [ 5] /opt/openmpi/lib/libopen-rte.so.0(mca_oob_recv_packed+0x38) [0x13c6a0]
[pvfs2-compute-0-0:05448] [ 6] /opt/openmpi/lib/libopen-rte.so.0(mca_oob_xcast+0xa0e) [0x13d36a]
[pvfs2-compute-0-0:05448] [ 7] /opt/openmpi/lib/libmpi.so.0(ompi_mpi_init+0x566) [0xda9f22]
[pvfs2-compute-0-0:05447] [ 0] /lib/tls/libpthread.so.0 [0xbb2890]
[pvfs2-compute-0-0:05447] [ 1] /opt/openmpi/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x39) [0x305d99]
[pvfs2-compute-0-0:05447] [ 2] /opt/openmpi/lib/libopen-pal.so.0(opal_progress+0x65) [0x9fb265]
[pvfs2-compute-0-0:05447] [ 3] /opt/openmpi/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_wait+0x29) [0x2ed731]
[pvfs2-compute-0-0:05447] [ 4] /opt/openmpi/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_recv+0x365) [0x2f2301]
[pvfs2-compute-0-0:05447] [ 5] /opt/openmpi/lib/libopen-rte.so.0(mca_oob_recv_packed+0x38) [0x53c6a0]
[pvfs2-compute-0-0:05447] [ 6] /opt/openmpi/lib/openmpi/mca_gpr_proxy.so(orte_gpr_proxy_put+0x1b0) [0x2c4fc8]
[pvfs2-compute-0-0:05447] [ 7] /opt/openmpi/lib/libopen-rte.so.0(orte_smr_base_set_proc_state+0x244) [0x551420]
[pvfs2-compute-0-0:05447] [ 8] /opt/openmpi/lib/libmpi.so.0(ompi_mpi_init+0x52e) [0x13ceea]
[pvfs2-compute-0-0:05447] [ 9] /opt/openmpi/lib/libmpi.so.0(PMPI_Init_thread+0x5c) [0x15e844]
[pvfs2-compute-0-0:05447] [10] justhi(main+0x36) [0x8048782]
[pvfs2-compute-0-0:05448] [ 8] /opt/openmpi/lib/libmpi.so.0(PMPI_Init_thread+0x5c) [0xdcb844]
[pvfs2-compute-0-0:05448] [ 9] justhi(main+0x36) [0x8048782]
[pvfs2-compute-0-0:05448] [10] /lib/tls/libc.so.6(__libc_start_main+0xd3) [0x970de3]
[pvfs2-compute-0-0:05448] [11] justhi [0x80486c5]
[pvfs2-compute-0-0:05448] *** End of error message ***
[pvfs2-compute-0-0:05447] [11] /lib/tls/libc.so.6(__libc_start_main+0xd3) [0x1a0de3]
[pvfs2-compute-0-0:05447] [12] justhi [0x80486c5]
[pvfs2-compute-0-0:05447] *** End of error message ***
[pvfs2-compute-0-0.local:05446] [0,0,1] orted_recv_pls: received message from [0,0,0]
[pvfs2-compute-0-0.local:05446] [0,0,1] orted_recv_pls: received kill_local_procs
[Ctrl+Z and kill -9 is needed to finish the execution]

The machinefile contains:

c0-0 slots=4
c0-1 slots=4
c0-2 slots=4
c0-3 slots=4
...

If processes are forced to be spawned on different nodes (c0-0 slots=1, c0-1 slots=1, c0-2 slots=1, c0-3 slots=1...) then there is no error :-?

With version 1.2.3 (same *configure* options) everything runs perfectly.

The ompi_info for my Open MPI 1.2.4 installation:

Open MPI: 1.2.4
Open MPI SVN revision: r16187
Open RTE: 1.2.4
Open RTE SVN revision: r16187
OPAL: 1.2.4
OPAL SVN revision: r16187
Prefix: /opt/openmpi
Configured architecture: i686-pc-linux-gnu
Configured by: root
Configured on: Sun Nov 25 20:13:42 CET 2007
Configure host: pvfs2-compute-0-0.local
Built by: root
Built on: Sun Nov 25 20:19:55 CET 2007
Built host: pvfs2-compute-0-0.local
C bindings: yes
Re: [OMPI users] Newbie: Using hostfile
Well, that's odd. What happens if you try to mpirun "hostname" (i.e., a non-MPI application)? Does it run, or does it hang?

On Nov 23, 2007, at 6:00 AM, Madireddy Samuel Vijaykumar wrote:

I have been using clusters for some tests. My localhost is "lynx", and "puma" and "tiger" make up the rest of the cluster. All have passwordless ssh enabled. If I have the following in my hostfile (one per line, in this order):

lynx
puma
tiger

my tests (run from lynx) run over the cluster without any issues. But if I move or remove lynx, so that the hostfile is either

puma
lynx
tiger

or

puma
tiger

my test (run from lynx) just does not get anywhere. It hangs and does not proceed at all. Is this an issue with the way my script handles the cluster nodes, or is there a particular way the hostfile must be written? Thanks.

--
Sam aka Vijju :)~
Linux: Open, True and Cool

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
Cisco Systems
Re: [OMPI users] [openMPI-infiniband] openMPI in IB network when openSM with LASH is running
There is work starting literally right about now to allow Open MPI to use the RDMA CM and/or the IBCM for creating OpenFabrics connections (IB or iWARP).

On Nov 28, 2007, at 4:37 AM, Keshetti Mahesh wrote:

Has anyone on the list tested Open MPI on an InfiniBand network in which OpenSM is running with the LASH routing algorithm enabled? I haven't tested this case myself, but I can foresee a problem: the LASH routing algorithm in OpenSM uses virtual lanes (VLs), which are mapped directly to service levels (SLs), and it assigns different VLs (SLs) to different paths in the network. This SL <-> path association is available only through the subnet manager (OpenSM) during connection establishment. But AFAIK, Open MPI doesn't use the services of the subnet manager for connection establishment between nodes. So I would like to know whether anyone has thought about this and is working on it.

regards,
Mahesh

--
Jeff Squyres
Cisco Systems
Re: [OMPI users] OpenIB problems
Roland thought that the default value of 10 might be a bit too low and that tuning it to be higher, particularly in apps that pound on a single port, would probably be acceptable.

Tuning up to 20 is probably a bit overkill.

On Nov 27, 2007, at 3:54 PM, Jeff Squyres wrote:

BTW, Andrew is correct about the unit for btl_openib_ib_timeout and that the value is simply passed down to the verbs library when making an IB connection. Open MPI does nothing else with that value; it's an IBTA-defined value. The help message was wrong on the 1.2 branch for a while; I think it's been corrected in more recent versions of OMPI (i.e., >1.2 -- I don't recall which version specifically).

On Nov 27, 2007, at 3:19 PM, Andrew Friedley wrote:

Brock Palen wrote:

What would be a place to look? Should this just be default then for OMPI? ompi_info shows the default as 10 seconds? Is that right, 'seconds'?

The other IB guys can probably answer better than I can -- I'm not an expert in this part of IB (or really any part I guess :). Not sure why a larger value isn't the default.

No, it's not seconds -- check the description of the MCA parameter:

4.096 microseconds * (2^btl_openib_ib_timeout)

You sure?

ompi_info --param btl openib

MCA btl: parameter "btl_openib_ib_timeout" (current value: "10")
InfiniBand transmit timeout, in seconds (must be >= 1)

Yeah:

MCA btl: parameter "btl_openib_ib_timeout" (current value: "10")
InfiniBand transmit timeout, plugged into formula: 4.096 microseconds * (2^btl_openib_ib_timeout) (must be >= 0 and <= 31)

Reading earlier in the thread you said OMPI v1.2.0; I got this from a trunk checkout that's around 3 weeks old. A quick check shows this description was changed between 1.2.0 and 1.2.1. However, the use of this parameter hasn't changed -- it's simply passed along to IB verbs when creating a queue pair (aka a connection).

Andrew

--
Jeff Squyres
Cisco Systems
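To make the unit discussion above concrete, here is a minimal sketch that evaluates the quoted formula, 4.096 microseconds * (2^btl_openib_ib_timeout), for the values mentioned in this thread. The function name `ib_timeout_seconds` is made up for illustration; it is not part of Open MPI or the verbs library, and it only does the arithmetic from the help text.

```python
def ib_timeout_seconds(exponent: int) -> float:
    """Evaluate 4.096 microseconds * 2**exponent and return the result in seconds.

    `exponent` plays the role of btl_openib_ib_timeout; the range check mirrors
    the help text quoted in the thread (must be >= 0 and <= 31).
    """
    if not 0 <= exponent <= 31:
        raise ValueError("btl_openib_ib_timeout must be >= 0 and <= 31")
    return 4.096e-6 * (2 ** exponent)


if __name__ == "__main__":
    # Values discussed in the thread: the default (10), 14, 16, and 20.
    for value in (10, 14, 16, 20):
        print(f"btl_openib_ib_timeout={value}: {ib_timeout_seconds(value):.6f} s")
```

By this formula the default of 10 corresponds to roughly 4.2 milliseconds, 14 to about 67 milliseconds, 16 to about 268 milliseconds, and 20 to about 4.3 seconds, which is why "10 seconds" was a misreading of the old help message.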
Re: [OMPI users] OpenIB problems
What value do you suggest then? I know I've seen the problem persist at values of 14 and 16, and would rather be certain that this isn't going to kill the job that just sat in the queue for a week.

Andrew

Jeff Squyres wrote:

Roland thought that the default value of 10 might be a bit too low and that tuning it to be higher, particularly in apps that pound on a single port, would probably be acceptable.

Tuning up to 20 is probably a bit overkill.
Re: [OMPI users] SegFault with MPI_THREAD_MULTIPLE in 1.2.4
Hi Jeff, thank you for your answer...

[... regarding SegFault with MPI_THREAD_MULTIPLE in OMPI 1.2.4 ...]

On Wed, Nov 28, 2007 at 11:27:51AM -0500, Jeff Squyres wrote:
> This is to be expected. OMPI's support for THREAD_MULTIPLE is
> incomplete and most likely doesn't work.

OK, I knew it was not *too* tested, but I had been using previous versions with *more or less* success, with a relatively small number of communications in a multithreaded (POSIX) environment. This latest 1.2.4 version breaks even with no explicit communication in the program. It just seemed quite weird to me in a (supposedly) minor-changes revision :-?

Anyway, thanks to the OMPI team for all the hard work. I hope complete thread-safety support is available in future revisions :-)

Cheers,
E.
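Since THREAD_MULTIPLE support may be incomplete, one defensive habit is to compare the level a program asked for with the level the library reports back: MPI_Init_thread takes a required level and returns a provided level, and the MPI thread levels are ordered from SINGLE up to MULTIPLE. The sketch below models only that comparison in plain Python, with no MPI library involved; `ThreadLevel`, `thread_support_ok`, and `describe` are hypothetical names, and the numeric values are illustrative (only the ordering matters). Note that the segfault reported in this thread happens inside MPI_Init_thread itself, so a check like this would not have avoided it.

```python
from enum import IntEnum


class ThreadLevel(IntEnum):
    """MPI thread-support levels, weakest to strongest (numeric values illustrative)."""
    SINGLE = 0
    FUNNELED = 1
    SERIALIZED = 2
    MULTIPLE = 3


def thread_support_ok(required: ThreadLevel, provided: ThreadLevel) -> bool:
    """Return True when the provided level is at least the required one."""
    return provided >= required


def describe(required: ThreadLevel, provided: ThreadLevel) -> str:
    """Summarize whether the program can proceed with its threading plan."""
    if thread_support_ok(required, provided):
        return f"OK: asked for {required.name}, got {provided.name}"
    return f"Fall back: asked for {required.name}, library only provides {provided.name}"


if __name__ == "__main__":
    print(describe(ThreadLevel.MULTIPLE, ThreadLevel.SINGLE))
    print(describe(ThreadLevel.FUNNELED, ThreadLevel.MULTIPLE))
```

In a real MPI program the two levels would come from the `required` argument and the `provided` result of MPI_Init_thread rather than from hard-coded values.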
Re: [OMPI users] version 1.3
v1.3's schedule is being developed right now -- it was a little hard to hear on the teleconference yesterday, but I think I heard Brad Benton from IBM (one of the two Release Managers for the v1.3 series) say that he'd have a plan for review by the group next week.

So far, I've been [wildly] estimating 1HCY2008. I don't think it would be a good idea to try to be any more precise before we get the RM's input. :-)

On Nov 28, 2007, at 5:39 AM, Neeraj Chourasia wrote:

When is version 1.3 scheduled to be released? Since it will contain checkpointing, a library for non-blocking communication, and ConnectX support for QPs, it would be great to have it ASAP. I am evaluating MVAPICH against Open MPI, and I found that MVAPICH still has the upper hand in terms of checkpointing. But I am pretty sure that once v1.3 arrives, it will help the HPC community a lot. I can find the development trunk version, but I am more interested in a production release.

--
Jeff Squyres
Cisco Systems
Re: [OMPI users] OpenIB problems
For what it's worth, Andrew, the RETRY_EXCEEDED_ERRORS can be caused by flaky hardware as well. The timeout value is probably best tuned relative to the size of your IB fabric. But if reliability is the biggest criterion, crank the timeout value up to 20. That's the best you can do. If it continues to happen, it is more than likely you have a flaky HCA, IB link, switch side sw, or node.

We actually have way too much IB hardware for any sane person, and my experience is that the RETRY_EXCEEDED_ERRORS can sometimes be really tricky to track down. One of my favorites is the spontaneously rebooting node: we see nodes under heavy MPI application load sometimes randomly reboot. This causes the RETRY_EXCEEDED_ERROR as well. I would second the recommendation to watch the IB counters across the entire IB fabric from the subnet manager.

Good luck!

> -----Original Message-----
> From: users-boun...@open-mpi.org
> [mailto:users-boun...@open-mpi.org] On Behalf Of Andrew Friedley
> Sent: Wednesday, November 28, 2007 9:36 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenIB problems
>
> What value do you suggest then? I know I've seen the problem
> persist at values of 14 and 16, and would rather be certain that
> this isn't going to kill the job that just sat in the queue for a week.
>
> Andrew
>
> Jeff Squyres wrote:
> > Roland thought that the default value of 10 might be a bit too low and
> > that tuning it to be higher, particularly in apps that pound on a
> > single port, would probably be acceptable.
> >
> > Tuning up to 20 is probably a bit overkill.
Re: [OMPI users] ./configure error on windows while installing openmpi-1.2.4(latest)
If your F77 compiler does not support arrays of LOGICAL variables (which seems to be the case if you look in the config.log file), then you're left with only one option: remove the F77 support from the compilation. This means adding the --disable-mpi-f77 option to ./configure.

Thanks,
george.

On Nov 28, 2007, at 9:24 AM, geetha r wrote:

Hi,

Subject: "Need exact command line for ./configure {optionslist}" to build Open MPI 1.2.4 on Windows.

While the configure script is checking the Fortran 77 compiler, I get the following error, so the Open MPI build fails on Windows:

checking for correct handling of FORTRAN logical arrays... no
configure: error: Error determining if arrays of logical values work properly.

I want to build openmpi-1.2.4 (which is downloaded from MINGW) on a Windows 2000 machine. Can somebody give me the proper build command, i.e. ./configure ... (options list)? Please tell me the exact options to pass in the option list. I am using Cygwin to build Open MPI on Windows.

PS: I am attaching the output files:
config.log -> actual log file
config.out -> output of ./configure
make.out -> fails because configure was unsuccessful on Windows
make.install -> fails because configure was unsuccessful on Windows

PS: I am using g77, g++, and gcc from the MINGW package. I have also downloaded and added g95, but that does not solve my problem.

Thanks,
Geetha
Re: [OMPI users] ./configure error on windows while installing openmpi-1.2.4(latest)
On Wed, 2007-11-28 at 13:20 -0500, George Bosilca wrote:
> If your F77 compiler do not support array of LOGICAL variables (which
> seems to be the case if you look in the config.log file), then you're
> left with only one option. Remove the F77 support from the
> compilation. This means adding the --disable-mpi-f77 option to the
> ./configure.

It's a lot weirder than that.

configure: WARNING: *** Fortran 77 REAL*8 does not have expected size!
configure: WARNING: *** Expected 8, got 8
configure: WARNING: *** Disabling MPI support for Fortran 77 REAL*8

Somehow, 8/=8 :-\
[OMPI users] mca_oob_tcp_peer_try_connect problem
I am new to Open MPI and have a problem that I cannot seem to solve. I am trying to run the hello_c example and I can't get it to work. I compiled Open MPI with:

./configure --prefix=/usr/local/software/openmpi-1.2.4 --disable-ipv6 --with-openib

The hostfile contains the local host and one other node. When I run it I get:

[soliday@max14 mpi-ex]$ /usr/local/software/openmpi-1.2.4/bin/mpirun --debug-daemons -mca oob_tcp_debug 1000 -machinefile hostfile -np 2 hello_c
[max14:31465] [0,0,0] accepting connections via event library
[max14:31465] [0,0,0] mca_oob_tcp_init: calling orte_gpr.subscribe
[max14:31466] [0,0,1] accepting connections via event library
[max14:31466] [0,0,1] mca_oob_tcp_init: calling orte_gpr.subscribe
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect: connecting port 55152 to: 192.168.2.14:38852
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect: sending ack, 0
[max14:31465] [0,0,0] mca_oob_tcp_accept: 192.168.2.14:37255
[max14:31465] [0,0,0]-[0,0,1] accepted: 192.168.2.14 - 192.168.2.14 nodelay 1 sndbuf 262142 rcvbuf 262142 flags 0802
[max14:31466] [0,0,1]-[0,0,0] connected: 192.168.2.14 - 192.168.2.14 nodelay 1 sndbuf 262142 rcvbuf 262142 flags 0802
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
Daemon [0,0,1] checking in as pid 31466 on host max14
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
[max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed: Software caused connection abort (103)
[max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed: Software caused connection abort (103)
[max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed, connecting over all interfaces failed!
[max15:28222] OOB: Connection to HNP lost
[max14:31466] [0,0,1] orted_recv_pls: received message from [0,0,0]
[max14:31466] [0,0,1] orted_recv_pls: received kill_local_procs
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 15
[max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
[max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[max14:31465] ERROR: A daemon on node max15 failed to start as expected.
[max14:31465] ERROR: There may be more information available from
[max14:31465] ERROR: the remote shell (see above).
[max14:31465] ERROR: The daemon exited unexpectedly with status 1.
[max14:31466] [0,0,1] orted_recv_pls: received message from [0,0,0]
[max14:31466] [0,0,1] orted_recv_pls: received exit
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 15
[max14:31465] [0,0,0]-[0,0,1] mca_oob_tcp_msg_recv: peer closed connection
[max14:31465] [0,0,0]-[0,0,1] mca_oob_tcp_peer_close(0x523100) sd 6 state 4
[max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1198
--
mpirun was unable to cleanly terminate the daemons for this job. Returned value Timeout instead of ORTE_SUCCESS.
--

I can see that the orted daemon program is starting on both computers, but it looks to me like they can't talk to each other.

Here is the output from ifconfig on one of the nodes; the other node is similar.

[root@max14 ~]# /sbin/ifconfig
eth0      Link encap:Ethernet  HWaddr 00:17:31:9C:93:A1
          inet addr:192.168.2.14  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::217:31ff:fe9c:93a1/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1353 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9572 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:188125 (183.7 KiB)  TX bytes:1500567 (1.4 MiB)
          Interrupt:17

eth1      Link encap:Ethernet  HWaddr 00:17:31:9C:93:A2
          inet addr:192.168.1.14  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::217:31ff:fe9c:93a2/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:49652796 errors:0 dropped:0 overruns:0 frame:0
          TX packets:49368158 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:21844618928 (20.3 GiB)  TX bytes:16122676331 (15.0 GiB)
          Interrupt:19

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
Re: [OMPI users] OpenIB problems
Jeff, thanks for all the replies.

I hate to admit it, but at the moment we can't log onto the switch. But the ibcheckerrors command returns nothing out of bounds, and I think that command also checks the switch ports.

Thanks, we will do some tests.

Brock Palen
Center for Advanced Computing
bro...@umich.edu
(734)936-1985

On Nov 27, 2007, at 4:50 PM, Jeff Squyres wrote:

Sorry for jumping in late; the holiday and other travel prevented me from getting to all my mail recently... :-\

Have you checked the counters on the subnet manager to see if any other errors are occurring? It might be good to clear all the counters, run the job, and see if the counters are increasing faster than they should (i.e., any particular counter should advance very very slowly -- perhaps 1 per day or so).

I'll ask around the kernel-level guys (i.e., Roland) to see what else could cause this kind of error.

On Nov 27, 2007, at 3:35 PM, Brock Palen wrote:

Ok i will open a case with cisco,

Brock Palen
Center for Advanced Computing
bro...@umich.edu
(734)936-1985

--
Jeff Squyres
Cisco Systems
[OMPI users] Run a process double
Hi everybody out there. This is my first post to the mailing list.

I have installed Open MPI 1.2.4 on a dual-processor x86_64 AMD machine running SuSE Linux. In principle, the installation was successful, with ifort 10.X. But when I run any code (mpirun -np 2 a.out), instead of sharing the calculations between the two processors, the system duplicates the executable and sends one copy to each processor. I don't know what the h$%& is going on.

regards,
Henry

--
Henry Adolfo Lambis Miranda, Chem.Eng.
Molecular Simulation Group I & II
Rovira i Virgili University
http://www.etseq.urv.es/ms
Av. Països Catalans, 26
C.P. 43007, Tarragona, Catalunya, Espanya
Re: [OMPI users] Run a process double
That's what's supposed to happen; it's how MPI works. By convention, process 0 is the head or boss process and the others are slaves, and they execute partially different code even though they're in the same executable. MPI is multi-process, not multi-thread.

Damien

Henry Adolfo Lambis Miranda wrote:

Hi everybody out there. This is my first post to the mailing list.

I have installed Open MPI 1.2.4 on a dual-processor x86_64 AMD machine running SuSE Linux. In principle, the installation was successful, with ifort 10.X. But when I run any code (mpirun -np 2 a.out), instead of sharing the calculations between the two processors, the system duplicates the executable and sends one copy to each processor. I don't know what the h$%& is going on.

regards,
Henry
Re: [OMPI users] Run a process double
Henry,

Apologies ahead of time for any unintended insults, but... your "a.out" sounds like it is not truly a parallel code. If you submit a hello_world program using Open MPI's mpirun, you will simply get two copies of "Hello World" printed to the screen. If you want the work shared, you must change your serial program so that it executes different pieces of code, or operates on different portions of your data, based on something like the "rank" of the process. (Rank is the numerical ID assigned by MPI to each process running from a single invocation of mpirun.)

All that MPI, or specifically Open MPI, provides you is a vehicle to launch multiple copies of a program or programs and then to facilitate the communication of those separate processes with one another.

Perhaps a primer on parallel processing would be in order. Or, since you have started with message passing, perhaps the old standard "Using MPI: Portable Parallel Programming with the Message-Passing Interface" (MIT Press), by Gropp, Lusk, and Skjellum, would give you the familiarization needed. Other books in that series by some of the same authors are also good starting points for MPI. I'm sure other readers can chime in with a host of better references.

Good luck.

regards,

Henry Adolfo Lambis Miranda wrote:

Hi everybody out there. This is my first post to the mailing list.

I have installed Open MPI 1.2.4 on a dual-processor x86_64 AMD machine running SuSE Linux. In principle, the installation was successful, with ifort 10.X. But when I run any code (mpirun -np 2 a.out), instead of sharing the calculations between the two processors, the system duplicates the executable and sends one copy to each processor. I don't know what the h$%& is going on.

regards,
Henry

--
Mark J. Potts, PhD
HPC Applications Inc.
phone: 410-992-8360 Bus
       410-313-9318 Home
       443-418-4375 Cell
email: po...@hpcapplications.com
       po...@excray.com
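The idea of operating "on different portions of your data, based on the rank" can be sketched without any MPI library at all. The example below is an illustration only: `block_range` and `partial_sum` are hypothetical helper names, and the ranks are simulated with an ordinary loop. In a real MPI program each process would obtain its own rank and the process count from MPI_Comm_rank and MPI_Comm_size, compute only its own block, and combine the partial results with a call such as MPI_Reduce.

```python
def block_range(rank: int, size: int, n_items: int) -> range:
    """Return the contiguous block of indices 0..n_items-1 owned by `rank`.

    Items are split as evenly as possible; the first `n_items % size`
    ranks each get one extra item.
    """
    base, extra = divmod(n_items, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


def partial_sum(rank: int, size: int, data: list) -> int:
    """Sum only the part of `data` that belongs to this rank."""
    return sum(data[i] for i in block_range(rank, size, len(data)))


if __name__ == "__main__":
    data = list(range(1, 11))   # the numbers 1..10
    size = 2                    # corresponds to "mpirun -np 2"
    # Simulate what each of the two processes would compute on its own.
    partials = [partial_sum(rank, size, data) for rank in range(size)]
    print("partial sums per rank:", partials)
    print("combined result:", sum(partials))
```

Every process still runs the same executable, as described above; what differs is the block of data each one selects from its rank.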