Re: [OMPI users] problems with OpenMPI-1.0.1 on SunOS 5.9; problems on heterogeneous cluster
Hi Brian,

I have installed OpenMPI-1.1a1r9260 on my SunOS machines. It has solved the problems. However, there is one more issue that I found in my testing and that I failed to report. This concerns Linux machines too.

My host file is hosts.txt:

csultra06
csultra02
csultra05
csultra08

My app file is mpiinit_appfile:

-np 1 /home/cs/manredd/OpenMPI/openmpi-1.1a1r9260/MPITESTS/mpiinit
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.1a1r9260/MPITESTS/mpiinit
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.1a1r9260/MPITESTS/mpiinit
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.1a1r9260/MPITESTS/mpiinit
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.1a1r9260/MPITESTS/mpiinit
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.1a1r9260/MPITESTS/mpiinit
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.1a1r9260/MPITESTS/mpiinit
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.1a1r9260/MPITESTS/mpiinit

My application program is mpiinit.c:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int rc, me;
    char pname[MPI_MAX_PROCESSOR_NAME];
    int plen;

    MPI_Init( &argc, &argv );
    rc = MPI_Comm_rank( MPI_COMM_WORLD, &me );
    if (rc != MPI_SUCCESS) {
        return rc;
    }
    MPI_Get_processor_name( pname, &plen );
    printf("%s:Hello world from %d\n", pname, me);
    MPI_Finalize();
    return 0;
}

Compilation is successful:

csultra06$ mpicc -o mpiinit mpiinit.c

However, mpirun prints just 6 statements instead of 8:

csultra06$ mpirun --hostfile hosts.txt --app mpiinit_appfile
csultra02:Hello world from 5
csultra06:Hello world from 0
csultra06:Hello world from 4
csultra02:Hello world from 1
csultra08:Hello world from 3
csultra05:Hello world from 2

The following two statements are not printed:

csultra05:Hello world from 6
csultra08:Hello world from 7

I observed this behavior on my Linux cluster too. I have attached the log for the "-d" option for your debugging purposes.

Regards,
Ravi.

----- Original Message -----
From: Brian Barrett
Date: Monday, March 13, 2006 7:56 pm
Subject: Re: [OMPI users] problems with OpenMPI-1.0.1 on SunOS 5.9; problems on heterogeneous cluster
To: Open MPI Users

> Hi Ravi -
>
> With the help of another Open MPI user, I spent the weekend finding a
> couple of issues with Open MPI on Solaris. I believe you are running
> into the same problems. We're in the process of certifying the
> changes for release as part of 1.0.2, but it's Monday morning and the
> release manager hasn't gotten them into the release branch just yet.
> Could you give the nightly tarball from our development trunk a try
> and let us know if it solves your problems on Solaris? You probably
> want last night's 1.1a1r9260 release.
>
> http://www.open-mpi.org/nightly/trunk/
>
> Thanks,
>
> Brian
>
> On Mar 12, 2006, at 11:23 PM, Ravi Manumachu wrote:
>
> > Hi Brian,
> >
> > Thank you for your help. I have attached all the files you have asked
> > for in a tar file.
> >
> > Please find attached the 'config.log' and 'libmpi.la' for my Solaris
> > installation.
> >
> > The output from 'mpicc -showme' is
> >
> > sunos$ mpicc -showme
> > gcc -I/home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/include
> > -I/home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/include/openmpi/ompi
> > -L/home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/lib -lmpi
> > -lorte -lopal -lnsl -lsocket -lthread -laio -lm -lnsl -lsocket -lthread -ldl
> >
> > There are serious issues when running on just solaris machines.
> >
> > I am using the host file and app file shown below. Both the machines are
> > SunOS and are similar.
> >
> > hosts.txt
> > ---------
> > csultra01 slots=1
> > csultra02 slots=1
> >
> > mpiinit_appfile
> > ---------------
> > -np 1 /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos
> > -np 1 /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos
> >
> > Running mpirun without -d option hangs.
> >
> > csultra01$ mpirun --hostfile hosts.txt --app mpiinit_appfile
> > hangs
> >
> > Running mpirun with -d option dumps core with output in the file
> > "mpirun_output_d_option.txt", which is attached. The core is also attached.
> > Running just on only one host is also not working. The output from
> > mpirun using "-d" option for this scenario is attached in file
> > "mpirun_output_d_option_one_host.txt".
> >
> > I have also attached the list of packages installed on my solaris
> > machine in "pkginfo.txt"
> >
> > I hope these will help you to resolve the issue.
> >
> > Regards,
> > Ravi.
> >
> >> ----- Original Message -----
> >> From: Brian Barrett
> >> Date: Friday, March 10, 2006 7:09 pm
> >> Subject: Re: [OMPI users] problems with OpenMPI-1.0.1 on SunOS 5.9;
> >> problems on heterogeneous cluster
> >> To: Open MPI Users
> >>
> >>> On Mar 10, 2006, at 12:09 AM, Ravi Manu
Re: [OMPI users] Memory allocation issue with OpenIB
Emanuel,

Thanks for the tip on this issue, we will be adding it to the FAQ shortly.

- Galen

On Mar 15, 2006, at 4:29 PM, Emanuel Ziegler wrote:

Hi Davide!

You are using the -prefix option. I guess this is because you cannot set the paths appropriately. Most likely you are using rsh for starting remote processes. This causes some trouble, since the environment offered by rsh lacks many things that a usual login environment offers (e.g. the path is hardcoded and cannot be changed).

Checking with

mpirun -np 2 -prefix /usr/local /bin/bash -c "ulimit -l"

may result in reporting plenty of memory (according to your settings), but this is not reliable, since the new bash instance sets the limits differently. Unfortunately

mpirun -np 2 -prefix /usr/local ulimit -l

does not work, since mpirun expects an executable. So the only way to check is to run rsh directly, like

rsh remotenode ulimit -l

(where remotenode has to be replaced by the name of the remote host). This may give a different result (e.g. 32, which is way too small).

In my case this problem was solved by adding

session required pam_limits.so

at the end of the file "/etc/pam.d/rsh". In the case of ssh, check the file "/etc/pam.d/ssh" for a line similar to the one above and add it if it does not yet exist.

Hope that helps,
Emanuel
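For reference, a minimal sketch (not part of the original exchange) of a test program that reports the locked-memory limit each MPI rank actually inherits; the file name checklimits.c and the output format are made up here:

/* checklimits.c: print the RLIMIT_MEMLOCK (locked memory) limit seen by
 * every MPI rank, so you can see what the rsh/ssh-started processes
 * really inherit rather than what an interactive shell reports. */
#include <stdio.h>
#include <sys/resource.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    struct rlimit rl;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0) {
        if (rl.rlim_cur == RLIM_INFINITY)
            printf("%s: rank %d memlock limit: unlimited\n", host, rank);
        else
            printf("%s: rank %d memlock limit: %llu bytes\n",
                   host, rank, (unsigned long long) rl.rlim_cur);
    }

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with mpirun across the nodes in question, it shows whether the pam_limits change described above actually took effect for remotely started processes.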
[OMPI users] Performance of ping-pong using OpenMPI over Infiniband
Hello,

Testing the performance of OpenMPI over Infiniband I have the following results:

1) Hardware is: SilverStorm interface

2) Open MPI version is (from ompi_info):

Open MPI: 1.0.2a9r9159
Open MPI SVN revision: r9159
Open RTE: 1.0.2a9r9159
Open RTE SVN revision: r9159
OPAL: 1.0.2a9r9159
OPAL SVN revision: r9159

3) Cluster with bi-processor Opteron 248 2.2 GHz nodes. Configure has been run with the option --with-mvapi=path-to-mvapi

4) A C-coded ping-pong gives the following values:

LOOPS: 1000  BYTES: 4096     SECONDS: 0.085557   MBytes/sec: 95.749051
LOOPS: 1000  BYTES: 8192     SECONDS: 0.050657   MBytes/sec: 323.429912
LOOPS: 1000  BYTES: 16384    SECONDS: 0.084038   MBytes/sec: 389.918757
LOOPS: 1000  BYTES: 32768    SECONDS: 0.163161   MBytes/sec: 401.665104
LOOPS: 1000  BYTES: 65536    SECONDS: 0.306694   MBytes/sec: 427.370561
LOOPS: 1000  BYTES: 131072   SECONDS: 0.529589   MBytes/sec: 494.995011
LOOPS: 1000  BYTES: 262144   SECONDS: 0.952616   MBytes/sec: 550.366583
LOOPS: 1000  BYTES: 524288   SECONDS: 1.927987   MBytes/sec: 543.870859
LOOPS: 1000  BYTES: 1048576  SECONDS: 3.673732   MBytes/sec: 570.850562
LOOPS: 1000  BYTES: 2097152  SECONDS: 9.993185   MBytes/sec: 419.716435
LOOPS: 1000  BYTES: 4194304  SECONDS: 18.211958  MBytes/sec: 460.609893
LOOPS: 1000  BYTES: 8388608  SECONDS: 35.421490  MBytes/sec: 473.645124

My questions are:

a) Is OpenMPI doing in this case TCP/IP over IB? (I guess so)
b) Is it possible to improve these values significantly by changing the defaults? I have used several mca btl parameters but without improving the maximum bandwidth. For example: --mca btl mvapi --mca btl_mvapi_max_send_size 8388608
c) Is it possible that other IB hardware implementations have better performance with OpenMPI?
d) Is it possible to use specific IB drivers for optimal performance? (should reach almost 800 MB/sec)

Thank you very much for your help.

Best Regards,
Jean Latour
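Jean's benchmark source is not included in the post; for context, a ping-pong loop of this kind is usually structured roughly like the sketch below. This is illustrative only, not the code used above; the loop count and message size are taken from the table, and the reported rate counts bytes moved in both directions, which matches the numbers shown.

/* pingpong.c: illustrative MPI ping-pong bandwidth loop (not the original code). */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    const int loops = 1000;
    const int nbytes = 4096;   /* one of the message sizes from the table */
    char* buf = malloc(nbytes);
    int rank, i;
    double t0, t1, secs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < loops; i++) {
        if (rank == 0) {
            /* rank 0 sends and waits for the echo */
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            /* rank 1 echoes the message back */
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0) {
        secs = t1 - t0;
        printf("LOOPS: %d BYTES: %d SECONDS: %f MBytes/sec: %f\n",
               loops, nbytes, secs, 2.0 * nbytes * loops / secs / 1.0e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

Run with mpirun -np 2 across the two nodes; whether such a loop exercises mvapi or TCP depends on which BTL Open MPI selected, which is exactly question (a) above.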
Re: [OMPI users] Performance of ping-pong using OpenMPI over Infiniband
Hi Jean,

Take a look here: http://www.open-mpi.org/faq/?category=infiniband#ib-leave-pinned

This should improve performance for micro-benchmarks and some applications. Please let me know if this doesn't solve the issue.

Thanks,
Galen

On Mar 16, 2006, at 10:34 AM, Jean Latour wrote:

Hello,

Testing performance of OpenMPI over Infiniband I have the following result:

1) Hardware is: SilversStorm interface

2) Openmpi version is (from ompi_info):

Open MPI: 1.0.2a9r9159
Open MPI SVN revision: r9159
Open RTE: 1.0.2a9r9159
Open RTE SVN revision: r9159
OPAL: 1.0.2a9r9159
OPAL SVN revision: r9159

3) Cluster with Bi-processors Opteron 248 2.2 GHz. Configure has been run with option --with-mvapi=path-to-mvapi

4) a C coded pinpong gives the following values:

LOOPS: 1000  BYTES: 4096     SECONDS: 0.085557   MBytes/sec: 95.749051
LOOPS: 1000  BYTES: 8192     SECONDS: 0.050657   MBytes/sec: 323.429912
LOOPS: 1000  BYTES: 16384    SECONDS: 0.084038   MBytes/sec: 389.918757
LOOPS: 1000  BYTES: 32768    SECONDS: 0.163161   MBytes/sec: 401.665104
LOOPS: 1000  BYTES: 65536    SECONDS: 0.306694   MBytes/sec: 427.370561
LOOPS: 1000  BYTES: 131072   SECONDS: 0.529589   MBytes/sec: 494.995011
LOOPS: 1000  BYTES: 262144   SECONDS: 0.952616   MBytes/sec: 550.366583
LOOPS: 1000  BYTES: 524288   SECONDS: 1.927987   MBytes/sec: 543.870859
LOOPS: 1000  BYTES: 1048576  SECONDS: 3.673732   MBytes/sec: 570.850562
LOOPS: 1000  BYTES: 2097152  SECONDS: 9.993185   MBytes/sec: 419.716435
LOOPS: 1000  BYTES: 4194304  SECONDS: 18.211958  MBytes/sec: 460.609893
LOOPS: 1000  BYTES: 8388608  SECONDS: 35.421490  MBytes/sec: 473.645124

My questions are:

a) Is OpenMPI doing in this case TCP/IP over IB? (I guess so)
b) Is it possible to improve significantly these values by changing the defaults? I have used several mca btl parameters but without improving the maximum bandwith. For example: --mca btl mvapi --mca btl_mvapi_max_send_size 8388608
c) Is it possible that other IB hardware implementations have better performances with OpenMPI?
d) Is it possible to use specific IB drivers for optimal performance? (should reach almost 800 MB/sec)

Thank you very much for your help

Best Regards,
Jean Latour
Re: [OMPI users] Performance of ping-pong using OpenMPI over Infiniband
On Thu, 16 Mar 2006, Jean Latour wrote:

> My questions are :
> a) Is OpenMPI doing in this case TCP/IP over IB ? (I guess so)

If the path to the mvapi library is correct, then Open MPI will use mvapi, not TCP over IB. There is a simple way to check: "ompi_info --param btl mvapi" will print all the parameters attached to the mvapi driver. If there is no mvapi in the output, then mvapi was not correctly detected. But I don't think that is the case, because if I remember well we have a protection at configure time: if you specify one of the drivers and we are not able to correctly use the libraries, we stop the configure.

> b) Is it possible to improve significantly these values by changing the
> defaults ?

By default we use a very conservative approach. We never leave the memory pinned down, and that decreases the performance of a ping-pong. There are pros and cons for that, too long to explain here, but in general we see better performance for real-life applications with our default approach, and that is our main goal. Now, if you want better performance for the ping-pong test, please read the FAQ at http://www.open-mpi.org/faq/?category=infiniband. These are the 3 flags that affect the mvapi performance for the ping-pong case (add them to $HOME/.openmpi/mca-params.conf):

btl_mvapi_flags=6
mpi_leave_pinned=1
pml_ob1_leave_pinned_pipeline=1

> I have used several mca btl parameters but without improving the maximum
> bandwith.
> For example : --mca btl mvapi --mca btl_mvapi_max_send_size 8388608

It is difficult to improve the maximum bandwidth without leave_pinned activated, but you can improve the bandwidth for medium-size messages. Play with btl_mvapi_eager_limit to set the limit between the short and rendezvous protocols. "ompi_info --param btl mvapi" will give you a full list of parameters as well as their descriptions.

> c) Is it possible that other IB hardware implementations have better
> performances with OpenMPI ?

The maximum bandwidth depends on several factors. One of the most important is the maximum bandwidth of your node's bus. To reach 800 MB/s and more you definitely need a PCI-X 16 ...

> d) Is it possible to use specific IB drivers for optimal performance ?
> (should reach almost 800 MB/sec)

Once the 3 options are set, you should see an improvement in the bandwidth. Let me know if it does not solve your problems.

  george.

"We must accept finite disappointment, but we must never lose infinite hope."
  Martin Luther King
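For convenience, the per-user file George refers to would look like the sketch below; the path and parameter values are exactly those given above, and only the comments are added here:

# $HOME/.openmpi/mca-params.conf -- per-user MCA parameter defaults
# Settings suggested above for mvapi ping-pong performance:
btl_mvapi_flags=6
mpi_leave_pinned=1
pml_ob1_leave_pinned_pipeline=1

The same parameters can also be passed per run on the command line (e.g. mpirun --mca mpi_leave_pinned 1 ...) if changing the per-user defaults is not desirable.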
Re: [OMPI users] Using Multiple Gigabit Ethernet Interface
Thanks Brian, Thanks Michael.

I wanted to benchmark the communication throughput and latency using multiple gigabit Ethernet controllers. So here are the results, which I want to share with you all.

I used:

OpenMPI version 1.0.2a10r9275
Hpcbench
Two Dell Precision 650 workstations

The Dell Precision 650 workstation has three separate PCI bus segments:

Segment 1 -> PCI Slot 1,2 -> 32 bit, 33 MHz, shared with integrated 1394
Segment 2 -> PCI Slot 3,4 -> 64 bit, 100 MHz, shared with the Gb Ethernet connection
Segment 3 -> PCI Slot 5   -> shared with integrated Ultra 320 controller

The workstation has integrated PCI-X 64-bit Intel 10/100/1000 Gigabit Ethernet. I added three D-Link DGE-530T 1000 Mbps Ethernet cards in Slot 2, Slot 4 and Slot 5 respectively. As I expected, the card in Slot 5 performed better than the cards in the other slots. Here are the results.

(Using Slot2)
# MPI communication latency (roundtrip time) test -- Wed Mar 15 09:19:10 2006
# Hosts: DELL <> DELL2
# Blocking Communication (MPI_Send/MPI_Recv)
# Message size (Bytes): 40960
# Iteration: 7
# Test time (Seconds): 0.20
#      RTT-time (Microseconds)
1      25953.565
2      25569.439
3      22392.000
4      20876.578
5      21327.121
6      19597.156
7      21264.008
8      24109.568
9      23877.859
10     24064.575
# MPI RTT min/avg/max = 19597.156/22903.187/25953.565 usec

# MPI communication test -- Wed Mar 15 10:16:22 2006
# Test mode: Fixed-size stream (unidirectional) test
# Hosts: DELL <> DELL2
# Blocking communication (MPI_Send/MPI_Recv)
# Total data size of each test (Bytes): 524288000
# Message size (Bytes): 104857600
# Iteration: 5
# Test time: 5.00
# Test repetition: 10
#
#    Overall     Master-node   M-process  M-process  Slave-node    S-process  S-process
#    Throughput  Elapsed-time  User-mode  Sys-mode   Elapsed-time  User-mode  Sys-mode
#    Mbps        Seconds       Seconds    Seconds    Seconds       Seconds    Seconds
1    521.9423    8.04          1.42       6.62       8.04          0.93       7.10
2    551.5377    7.60          1.20       6.41       7.60          0.77       6.87
3    552.5600    7.59          1.27       6.32       7.59          0.82       6.81
4    552.6328    7.59          1.28       6.31       7.59          0.80       6.83
5    552.6334    7.59          1.24       6.35       7.59          0.86       6.77
6    552.7048    7.59          1.26       6.33       7.59          0.77       6.86
7    563.6736    7.44          1.22       6.22       7.44          0.78       6.70
8    552.2710    7.59          1.22       6.37       7.59          0.83       6.80
9    520.9938    8.05          1.37       6.68       8.05          0.93       7.16
10   535.0131    7.84          1.36       6.48       7.84          0.84       7.04

==========================================================

(Using Slot3)
# MPI communication latency (roundtrip time) test -- Thu Mar 16 10:15:58 2006
# Hosts: DELL <> DELL2
# Blocking Communication (MPI_Send/MPI_Recv)
# Message size (Bytes): 40960
# Iteration: 10
# Test time (Seconds): 0.20
#      RTT-time (Microseconds)
1      20094.204
2      14773.512
3      14846.015
4      17756.820
5      18419.290
6      23394.799
7      21840.596
8      17727.494
9      21822.095
10     17659.688
# MPI RTT min/avg/max = 14773.512/18833.451/23394.799 usec

# MPI communication test -- Wed Mar 15 09:17:54 2006
# Test mode: Fixed-size stream (unidirectional) test
# Hosts: DELL <> DELL2
# Blocking communication (MPI_Send/MPI_Recv)
# Total data size of each test (Bytes): 524288000
# Message size (Bytes): 104857600
# Iteration: 5
# Test time: 5.00
# Test repetition: 10
#
#    Overall     Master-node   M-process  M-process  Slave-node    S-process  S-process
#    Throughput  Elapsed-time  User-mode  Sys-mode   Elapsed-time  User-mode  Sys-mode
#    Mbps        Seconds       Seconds    Seconds    Seconds       Seconds    Seconds
1    794.9650    5.28          1.04       4.24       5.28          0.47       4.81
2    838.1621    5.00          0.91       4.09       5.00          0.39       4.65
3    898.3811    4.67          0.84       3.82       4.67          0.34       4.37
4    798.9575    5.25          1.03       4.22       5.25          0.40       4.89
5    829.7181    5.06          0.94       4.11       5.05          0.40       4.69
6    881.5526    4.76          0.86       3.90       4.76          0.28       4.52
7    827.9215    5.07          0.96       4.11       5.07          0.41       4.70
8    845.6428    4.96          0.87       4.09       4.96          0.38       4.62
9    845.6903    4.96          0.90       4.06       4.96
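As an aside, if the goal is to have Open MPI's TCP transport use the added cards rather than the integrated port, the interface list can be restricted with the btl_tcp_if_include parameter discussed later in this digest. A hypothetical invocation, assuming the D-Link cards appear as eth1 and eth2 and that the benchmark binary is called mpi_benchmark (both names are assumptions, not taken from the post), might be:

mpirun --mca btl_tcp_if_include eth1,eth2 -np 2 ./mpi_benchmark

The parameter takes a comma-separated list of interfaces, so it can also be used to compare the slots one at a time by listing a single interface per run.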
[OMPI users] mca_oob_tcp_peer_complete_connect: connection failed
Hello,

I've just compiled Open MPI and tried to run my code, which just measures bandwidth from one node to another. (The code compiles fine and runs under other MPI implementations.)

When I did, I got this:

uahrcw@c275-6:~/mpi-benchmarks> cat openmpitcp.o15380
c317-6
c317-5
[c317-5:24979] [0,0,2]-[0,0,0] mca_oob_tcp_peer_complete_connect: connection failed (errno=110) - retrying (pid=24979)
[c317-5:24979] mca_oob_tcp_peer_timer_handler
[c317-5:24997] [0,1,1]-[0,0,0] mca_oob_tcp_peer_complete_connect: connection failed (errno=110) - retrying (pid=24997)
[c317-5:24997] mca_oob_tcp_peer_timer_handler
[0,1,1][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=110

I compiled Open MPI with PBS Pro 5.4-4 and I'm guessing that has something to do with it.

I've attached my config.log.

Any help with this would be appreciated.

uahrcw@c275-6:~/mpi-benchmarks> ompi_info
Open MPI: 1.0.1r8453
Open MPI SVN revision: r8453
Open RTE: 1.0.1r8453
Open RTE SVN revision: r8453
OPAL: 1.0.1r8453
OPAL SVN revision: r8453
Prefix: /opt/asn/apps/openmpi-1.0.1
Configured architecture: x86_64-unknown-linux-gnu
Configured by: asnrcw
Configured on: Fri Feb 24 15:19:37 CST 2006
Configure host: c275-6
Built by: asnrcw
Built on: Fri Feb 24 15:40:09 CST 2006
Built host: c275-6
C bindings: yes
C++ bindings: yes
Fortran77 bindings: yes (all)
Fortran90 bindings: no
C compiler: gcc
C compiler absolute: /usr/bin/gcc
C++ compiler: g++
C++ compiler absolute: /usr/bin/g++
Fortran77 compiler: g77
Fortran77 compiler abs: /usr/bin/g77
Fortran90 compiler: ifort
Fortran90 compiler abs: /opt/asn/intel/fce/9.0/bin/ifort
C profiling: yes
C++ profiling: yes
Fortran77 profiling: yes
Fortran90 profiling: no
C++ exceptions: no
Thread support: posix (mpi: no, progress: no)
Internal debug support: no
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
libltdl support: 1
MCA memory: malloc_hooks (MCA v1.0, API v1.0, Component v1.0.1)
MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.0.1)
MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.0.1)
MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.0.1)
MCA timer: linux (MCA v1.0, API v1.0, Component v1.0.1)
MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
MCA coll: basic (MCA v1.0, API v1.0, Component v1.0.1)
MCA coll: self (MCA v1.0, API v1.0, Component v1.0.1)
MCA coll: sm (MCA v1.0, API v1.0, Component v1.0.1)
MCA io: romio (MCA v1.0, API v1.0, Component v1.0.1)
MCA mpool: sm (MCA v1.0, API v1.0, Component v1.0.1)
MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.0.1)
MCA pml: teg (MCA v1.0, API v1.0, Component v1.0.1)
MCA ptl: self (MCA v1.0, API v1.0, Component v1.0.1)
MCA ptl: sm (MCA v1.0, API v1.0, Component v1.0.1)
MCA ptl: tcp (MCA v1.0, API v1.0, Component v1.0.1)
MCA btl: self (MCA v1.0, API v1.0, Component v1.0.1)
MCA btl: sm (MCA v1.0, API v1.0, Component v1.0.1)
MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
MCA topo: unity (MCA v1.0, API v1.0, Component v1.0.1)
MCA gpr: null (MCA v1.0, API v1.0, Component v1.0.1)
MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.0.1)
MCA gpr: replica (MCA v1.0, API v1.0, Component v1.0.1)
MCA iof: proxy (MCA v1.0, API v1.0, Component v1.0.1)
MCA iof: svc (MCA v1.0, API v1.0, Component v1.0.1)
MCA ns: proxy (MCA v1.0, API v1.0, Component v1.0.1)
MCA ns: replica (MCA v1.0, API v1.0, Component v1.0.1)
MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.0.1)
MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.0.1)
MCA ras: localhost (MCA v1.0, API v1.0, Component v1.0.1)
MCA ras: slurm (MCA v1.0, API v1.0, Component v1.0.1)
MCA ras: tm (MCA v1.0, API v1.0, Component v1.0.1)
MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.0.1)
MCA rds: resfile (MCA v1.0, API v1.0, Component v1.0.1)
MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.0.1)
MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.0.1)
MCA rmgr: urm (MCA v1.0, API v1.0, Comp
Re: [OMPI users] mca_oob_tcp_peer_complete_connect: connection failed
I see only 2 possibilities:
1. you're trying to run Open MPI on nodes having multiple IP addresses.
2. your nodes are behind firewalls and Open MPI is unable to pass through.

Please check the FAQ on http://www.open-mpi.org/faq/ to find out the full answer to your question.

  Thanks,
    george.

On Thu, 16 Mar 2006, Charles Wright wrote:

> Hello,
> I'm just compiled open-mpi and tried to run my code which just
> measures bandwidth from one node to another. (Code compile fine and
> runs under other mpi implementations)
>
> When I did I got this.
>
> uahrcw@c275-6:~/mpi-benchmarks> cat openmpitcp.o15380
> c317-6
> c317-5
> [c317-5:24979] [0,0,2]-[0,0,0] mca_oob_tcp_peer_complete_connect: connection failed (errno=110) - retrying (pid=24979)
> [c317-5:24979] mca_oob_tcp_peer_timer_handler
> [c317-5:24997] [0,1,1]-[0,0,0] mca_oob_tcp_peer_complete_connect: connection failed (errno=110) - retrying (pid=24997)
> [c317-5:24997] mca_oob_tcp_peer_timer_handler
>
> [0,1,1][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=110
>
> I compiled open-mpi with Pbspro 5.4-4 and I'm guessing that has
> something to do with it.
>
> I've attached my config.log
>
> Any help with this would be appreciated.
>
> uahrcw@c275-6:~/mpi-benchmarks> ompi_info
> Open MPI: 1.0.1r8453
> Open MPI SVN revision: r8453
> Open RTE: 1.0.1r8453
> Open RTE SVN revision: r8453
> OPAL: 1.0.1r8453
> OPAL SVN revision: r8453
> Prefix: /opt/asn/apps/openmpi-1.0.1
> Configured architecture: x86_64-unknown-linux-gnu
> Configured by: asnrcw
> Configured on: Fri Feb 24 15:19:37 CST 2006
> Configure host: c275-6
> Built by: asnrcw
> Built on: Fri Feb 24 15:40:09 CST 2006
> Built host: c275-6
> C bindings: yes
> C++ bindings: yes
> Fortran77 bindings: yes (all)
> Fortran90 bindings: no
> C compiler: gcc
> C compiler absolute: /usr/bin/gcc
> C++ compiler: g++
> C++ compiler absolute: /usr/bin/g++
> Fortran77 compiler: g77
> Fortran77 compiler abs: /usr/bin/g77
> Fortran90 compiler: ifort
> Fortran90 compiler abs: /opt/asn/intel/fce/9.0/bin/ifort
> C profiling: yes
> C++ profiling: yes
> Fortran77 profiling: yes
> Fortran90 profiling: no
> C++ exceptions: no
> Thread support: posix (mpi: no, progress: no)
> Internal debug support: no
> MPI parameter check: runtime
> Memory profiling support: no
> Memory debugging support: no
> libltdl support: 1
> MCA memory: malloc_hooks (MCA v1.0, API v1.0, Component v1.0.1)
> MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.0.1)
> MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.0.1)
> MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.0.1)
> MCA timer: linux (MCA v1.0, API v1.0, Component v1.0.1)
> MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
> MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
> MCA coll: basic (MCA v1.0, API v1.0, Component v1.0.1)
> MCA coll: self (MCA v1.0, API v1.0, Component v1.0.1)
> MCA coll: sm (MCA v1.0, API v1.0, Component v1.0.1)
> MCA io: romio (MCA v1.0, API v1.0, Component v1.0.1)
> MCA mpool: sm (MCA v1.0, API v1.0, Component v1.0.1)
> MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.0.1)
> MCA pml: teg (MCA v1.0, API v1.0, Component v1.0.1)
> MCA ptl: self (MCA v1.0, API v1.0, Component v1.0.1)
> MCA ptl: sm (MCA v1.0, API v1.0, Component v1.0.1)
> MCA ptl: tcp (MCA v1.0, API v1.0, Component v1.0.1)
> MCA btl: self (MCA v1.0, API v1.0, Component v1.0.1)
> MCA btl: sm (MCA v1.0, API v1.0, Component v1.0.1)
> MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
> MCA topo: unity (MCA v1.0, API v1.0, Component
v1.0.1) > MCA gpr: null (MCA v1.0, API v1.0, Component v1.0.1) > MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.0.1) > MCA gpr: replica (MCA v1.0, API v1.0, Component v1.0.1) > MCA iof: proxy (MCA v1.0, API v1.0, Component v1.0.1) > MCA iof: svc (MCA v1.0, API v1.0, Component v1.0.1) > MCA ns: proxy (MCA v1.0, API v1.0, Component v1.0.1) > MCA ns: replica (MCA v1.0, API v1.0, Component v1.0.1) > MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0) > MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.0.1) > MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.0.1) > MCA ras: localhost (MCA v1.0, API v1.0, Component v1.0.1) > MCA ra
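As a quick aside, the interface-related parameters referred to in the follow-ups can be listed, with descriptions, via ompi_info (the same --param usage shown elsewhere in this digest), for example:

ompi_info --param btl tcp
ompi_info --param oob tcp

The first covers the TCP transport used for MPI traffic, the second the out-of-band TCP channel used to set up the job; the relevant include/exclude options appear in the replies below.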
Re: [OMPI users] mca_oob_tcp_peer_complete_connect: connection failed
Thanks for the tip.

I see that both number 1 and 2 are true. Open MPI is insisting on using my eth0 (I know this by watching the firewall log on the node it is trying to go to).

This is despite the fact that I have the first DNS entry go to eth1; normally that is all PBS would need to do the right thing and use the network I prefer.

OK, so I see there are some options to include/exclude interfaces.

However, mpiexec is ignoring my requests. I tried it two ways. Neither worked. The firewall rejects traffic coming into the 1.0.x.x network in both cases.

/opt/asn/apps/openmpi-1.0.1/bin/mpiexec --gmca btl_tcp_if_include eth1 -n 2 $XD1LAUNCHER ./mpimeasure
/opt/asn/apps/openmpi-1.0.1/bin/mpiexec --gmca btl_tcp_if_exclude eth0 -n 2 $XD1LAUNCHER ./mpimeasure

(See, DNS works... not over eth0.)

uahrcw@c344-6:~/mpi-benchmarks> /sbin/ifconfig
eth0      Link encap:Ethernet  HWaddr 00:0E:AB:01:58:60
          inet addr:1.0.21.134  Bcast:1.127.255.255  Mask:255.128.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:6596091 errors:0 dropped:0 overruns:0 frame:0
          TX packets:316165 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:560395541 (534.4 Mb)  TX bytes:34367848 (32.7 Mb)
          Interrupt:16

eth1      Link encap:Ethernet  HWaddr 00:0E:AB:01:58:61
          inet addr:1.128.21.134  Mask:255.128.0.0
          UP RUNNING NOARP  MTU:1500  Metric:1
          RX packets:5600487 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4863441 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:6203028277 (5915.6 Mb)  TX bytes:566471561 (540.2 Mb)
          Interrupt:25

eth2      Link encap:Ethernet  HWaddr 00:0E:AB:01:58:62
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:829064 errors:0 dropped:0 overruns:0 frame:0
          TX packets:181572 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:61216408 (58.3 Mb)  TX bytes:19079579 (18.1 Mb)
          Base address:0x2000 Memory:fea8-feaa

eth2:2    Link encap:Ethernet  HWaddr 00:0E:AB:01:58:62
          inet addr:129.66.9.146  Bcast:129.66.9.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Base address:0x2000 Memory:fea8-feaa

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:14259 errors:0 dropped:0 overruns:0 frame:0
          TX packets:14259 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:879631 (859.0 Kb)  TX bytes:879631 (859.0 Kb)

uahrcw@c344-6:~/mpi-benchmarks> ping c344-5
PING c344-5.x.asc.edu (1.128.21.133) 56(84) bytes of data.
64 bytes from c344-5.x.asc.edu (1.128.21.133): icmp_seq=1 ttl=64 time=0.067 ms
64 bytes from c344-5.x.asc.edu (1.128.21.133): icmp_seq=2 ttl=64 time=0.037 ms
64 bytes from c344-5.x.asc.edu (1.128.21.133): icmp_seq=3 ttl=64 time=0.022 ms

--- c344-5.x.asc.edu ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.022/0.042/0.067/0.018 ms

George Bosilca wrote:
>I see only 2 possibilities:
>1. your trying to run Open MPI on nodes having multiple IP addresses.
>2. your nodes are behind firewalls and Open MPI is unable to pass through.
>
>Please check the FAQ on http://www.open-mpi.org/faq/ to find out the full
>answer to your question.
>
> Thanks,
>   george.
>
>On Thu, 16 Mar 2006, Charles Wright wrote:
>
>>Hello,
>> I'm just compiled open-mpi and tried to run my code which just
>>measures bandwidth from one node to another. (Code compile fine and
>>runs under other mpi implementations)
>>
>>When I did I got this.
>> >>uahrcw@c275-6:~/mpi-benchmarks> cat openmpitcp.o15380 >>c317-6 >>c317-5 >>[c317-5:24979] [0,0,2]-[0,0,0] mca_oob_tcp_peer_complete_connect: >>connection failed (errno=110) - retrying (pid=24979) >>[c317-5:24979] mca_oob_tcp_peer_timer_handler >>[c317-5:24997] [0,1,1]-[0,0,0] mca_oob_tcp_peer_complete_connect: >>connection failed (errno=110) - retrying (pid=24997) >>[c317-5:24997] mca_oob_tcp_peer_timer_handler >> >>[0,1,1][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] >>connect() failed with errno=110 >> >> >>I compiled open-mpi with Pbspro 5.4-4 and I'm guessing that has >>something to do with it. >> >>I've attached my config.log >> >>Any help with this would be appreciated. >> >>uahrcw@c275-6:~/mpi-benchmarks> ompi_info >> Open MPI: 1.0.1r8453 >> Open MPI SVN revision: r8453 >> Open RTE: 1.0.1r8453 >> Open RTE SVN revision: r8453 >> OPAL: 1.0.1r8453 >> OPAL SVN revision: r8453 >> Prefix: /opt/asn/apps/openmpi-1.0.1 >>Configured architecture: x86_64-unknown-linux-gnu >> Configured by: asnrcw >> Configured on: Fri Feb 24 15:19:37 CST 20
Re: [OMPI users] mca_oob_tcp_peer_complete_connect: connection failed
Sorry I wasn't clear enough in my previous post. The error messages that you get are coming from the OOB, which is the framework we're using to set up the MPI run. The options that you use (btl_tcp_if_include) are only used for MPI communications. Please add "--mca oob_tcp_include eth1" to force the OOB framework to use eth1. So that you don't have to type all these options every time, you can add them to the $HOME/.openmpi/mca-params.conf file. A file containing:

oob_tcp_include=eth1
btl_tcp_if_include=eth1

should solve your problems, if the firewall is open on eth1 between these nodes.

  Thanks,
    george.

On Thu, 16 Mar 2006, Charles Wright wrote:

> Thanks for the tip.
>
> I see that both number 1 and 2 are true.
> Openmpi is insisting on using my eth0 (I know this by watching the
> firewall log on the node it is trying to go to)
>
> This is despite the fact that I have the first dns entry go to eth1,
> normally that is all pbs would need to do the right thing and use the
> network I prefer.
>
> Ok so I see there are some options to in/exclude interfaces.
>
> however mpiexec is igorning my requests.
> I tried it two ways. Neither worked. Firewall rejects traffic coming
> into 1.0.x.x. network in both cases.
>
> /opt/asn/apps/openmpi-1.0.1/bin/mpiexec --gmca btl_tcp_if_include eth1 -n 2 $XD1LAUNCHER ./mpimeasure
> /opt/asn/apps/openmpi-1.0.1/bin/mpiexec --gmca btl_tcp_if_exclude eth0 -n 2 $XD1LAUNCHER ./mpimeasure
>
> (see dns works... not over eth0)
>
> uahrcw@c344-6:~/mpi-benchmarks> /sbin/ifconfig
> eth0      Link encap:Ethernet  HWaddr 00:0E:AB:01:58:60
>           inet addr:1.0.21.134  Bcast:1.127.255.255  Mask:255.128.0.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:6596091 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:316165 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:560395541 (534.4 Mb)  TX bytes:34367848 (32.7 Mb)
>           Interrupt:16
>
> eth1      Link encap:Ethernet  HWaddr 00:0E:AB:01:58:61
>           inet addr:1.128.21.134  Mask:255.128.0.0
>           UP RUNNING NOARP  MTU:1500  Metric:1
>           RX packets:5600487 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:4863441 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:6203028277 (5915.6 Mb)  TX bytes:566471561 (540.2 Mb)
>           Interrupt:25
>
> eth2      Link encap:Ethernet  HWaddr 00:0E:AB:01:58:62
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:829064 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:181572 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:61216408 (58.3 Mb)  TX bytes:19079579 (18.1 Mb)
>           Base address:0x2000 Memory:fea8-feaa
>
> eth2:2    Link encap:Ethernet  HWaddr 00:0E:AB:01:58:62
>           inet addr:129.66.9.146  Bcast:129.66.9.255  Mask:255.255.255.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           Base address:0x2000 Memory:fea8-feaa
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:14259 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:14259 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:879631 (859.0 Kb)  TX bytes:879631 (859.0 Kb)
>
> uahrcw@c344-6:~/mpi-benchmarks> ping c344-5
> PING c344-5.x.asc.edu (1.128.21.133) 56(84) bytes of data.
> 64 bytes from c344-5.x.asc.edu (1.128.21.133): icmp_seq=1 ttl=64 > time=0.067 ms > 64 bytes from c344-5.x.asc.edu (1.128.21.133): icmp_seq=2 ttl=64 > time=0.037 ms > 64 bytes from c344-5.x.asc.edu (1.128.21.133): icmp_seq=3 ttl=64 > time=0.022 ms > > --- c344-5.x.asc.edu ping statistics --- > 3 packets transmitted, 3 received, 0% packet loss, time 1999ms > rtt min/avg/max/mdev = 0.022/0.042/0.067/0.018 ms > > > > George Bosilca wrote: >> I see only 2 possibilities: >> 1. your trying to run Open MPI on nodes having multiple IP >> addresses. >> 2. your nodes are behind firewalls and Open MPI is unable to pass through. >> >> Please check the FAQ on http://www.open-mpi.org/faq/ to find out the full >> answer to your question. >> >> Thanks, >> george. >> >> On Thu, 16 Mar 2006, Charles Wright wrote: >> >> >>> Hello, >>> I'm just compiled open-mpi and tried to run my code which just >>> measures bandwidth from one node to another. (Code compile fine and >>> runs under other mpi implementations) >>> >>> When I did I got this. >>> >>> uahrcw@c275-6:~/mpi-benchmarks> cat openmpitcp.o15380 >>> c317-6 >>> c317-5 >>> [c317-5:24979] [0,0,2]-[0,0,0] mca_oob_tcp_peer_complete_connect: >>> connection failed (errno=110) - retrying (pid=24979) >>> [c317-5:24979] mca_oob_tcp_peer_timer_handler >>> [c317-5:24997] [0,1,1]-[0,0,0] mca_oob_tcp_peer_complete_connect: >>> connectio
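Written out as a file, the fix George describes is simply the following sketch (the path follows his note about the per-user configuration file; the comments are added here):

# $HOME/.openmpi/mca-params.conf
# Keep both the OOB (job startup) channel and the TCP BTL (MPI traffic)
# on eth1, the interface that is reachable through the firewall:
oob_tcp_include=eth1
btl_tcp_if_include=eth1

Whether a given parameter exists and what it defaults to can be checked with ompi_info --param oob tcp and ompi_info --param btl tcp, as used elsewhere in this digest.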
Re: [OMPI users] mca_oob_tcp_peer_complete_connect: connection failed
That works!! Thanks!! George Bosilca wrote: >Sorry I wasn't clear enough on my previous post. The error messages that >you get are comming from the OOB which is the framework we're using to >setup the MPI run. The options that you use (btl_tcp_if_include) are only >used for MPI communications. Please add "--mca oob_tcp_include eth0" to >force the OOB framework to use eth0. In order to don't have to type all >these options all the time you can add them in the >$(HOME).openmpi/mca-params.conf file. A file containing: > >oob_tcp_include=eth1 >btl_tcp_if_include=eth1 > >should solve your problems, if the firewall is opened on eth1 between >these nodes. > > Thanks, > george. > >On Thu, 16 Mar 2006, Charles Wright wrote: > > >>Thanks for the tip. >> >>I see that both number 1 and 2 are true. >>Openmpi is insisting on using my eth0 (I know this by watching the >>firewall log on the node it is trying to go to) >> >>This is despite the fact that I have the first dns entry go to eth1, >>normally that is all pbs would need to do the right thing and use the >>network I prefer. >> >>Ok so I see there are some options to in/exclude interfaces. >> >>however mpiexec is igorning my requests. >>I tried it two ways. Neither worked. Firewall rejects traffic coming >>into 1.0.x.x. network in both cases. >> >>/opt/asn/apps/openmpi-1.0.1/bin/mpiexec --gmca btl_tcp_if_include eth1 >>-n 2 $XD1LAUNCHER ./mpimeasure >>/opt/asn/apps/openmpi-1.0.1/bin/mpiexec --gmca btl_tcp_if_exclude eth0 >>-n 2 $XD1LAUNCHER ./mpimeasure >> >>(see dns works... not over eth0) >>uahrcw@c344-6:~/mpi-benchmarks> /sbin/ifconfig >>eth0 Link encap:Ethernet HWaddr 00:0E:AB:01:58:60 >> inet addr:1.0.21.134 Bcast:1.127.255.255 Mask:255.128.0.0 >> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 >> RX packets:6596091 errors:0 dropped:0 overruns:0 frame:0 >> TX packets:316165 errors:0 dropped:0 overruns:0 carrier:0 >> collisions:0 txqueuelen:1000 >> RX bytes:560395541 (534.4 Mb) TX bytes:34367848 (32.7 Mb) >> Interrupt:16 >> >>eth1 Link encap:Ethernet HWaddr 00:0E:AB:01:58:61 >> inet addr:1.128.21.134 Mask:255.128.0.0 >> UP RUNNING NOARP MTU:1500 Metric:1 >> RX packets:5600487 errors:0 dropped:0 overruns:0 frame:0 >> TX packets:4863441 errors:0 dropped:0 overruns:0 carrier:0 >> collisions:0 txqueuelen:1000 >> RX bytes:6203028277 (5915.6 Mb) TX bytes:566471561 (540.2 Mb) >> Interrupt:25 >> >>eth2 Link encap:Ethernet HWaddr 00:0E:AB:01:58:62 >> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 >> RX packets:829064 errors:0 dropped:0 overruns:0 frame:0 >> TX packets:181572 errors:0 dropped:0 overruns:0 carrier:0 >> collisions:0 txqueuelen:1000 >> RX bytes:61216408 (58.3 Mb) TX bytes:19079579 (18.1 Mb) >> Base address:0x2000 Memory:fea8-feaa >> >>eth2:2Link encap:Ethernet HWaddr 00:0E:AB:01:58:62 >> inet addr:129.66.9.146 Bcast:129.66.9.255 Mask:255.255.255.0 >> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 >> Base address:0x2000 Memory:fea8-feaa >> >>loLink encap:Local Loopback >> inet addr:127.0.0.1 Mask:255.0.0.0 >> UP LOOPBACK RUNNING MTU:16436 Metric:1 >> RX packets:14259 errors:0 dropped:0 overruns:0 frame:0 >> TX packets:14259 errors:0 dropped:0 overruns:0 carrier:0 >> collisions:0 txqueuelen:0 >> RX bytes:879631 (859.0 Kb) TX bytes:879631 (859.0 Kb) >> >>uahrcw@c344-6:~/mpi-benchmarks> ping c344-5 >>PING c344-5.x.asc.edu (1.128.21.133) 56(84) bytes of data. 
>>64 bytes from c344-5.x.asc.edu (1.128.21.133): icmp_seq=1 ttl=64 >>time=0.067 ms >>64 bytes from c344-5.x.asc.edu (1.128.21.133): icmp_seq=2 ttl=64 >>time=0.037 ms >>64 bytes from c344-5.x.asc.edu (1.128.21.133): icmp_seq=3 ttl=64 >>time=0.022 ms >> >>--- c344-5.x.asc.edu ping statistics --- >>3 packets transmitted, 3 received, 0% packet loss, time 1999ms >>rtt min/avg/max/mdev = 0.022/0.042/0.067/0.018 ms >> >> >> >>George Bosilca wrote: >> >>>I see only 2 possibilities: >>>1. your trying to run Open MPI on nodes having multiple IP >>>addresses. >>>2. your nodes are behind firewalls and Open MPI is unable to pass through. >>> >>>Please check the FAQ on http://www.open-mpi.org/faq/ to find out the full >>>answer to your question. >>> >>> Thanks, >>>george. >>> >>>On Thu, 16 Mar 2006, Charles Wright wrote: >>> >>> >>> Hello, I'm just compiled open-mpi and tried to run my code which just measures bandwidth from one node to another. (Code compile fine and runs under other mpi implementations) When I did I got this. uahrcw@c275-6:~/mpi-benchmarks> cat openmpitcp.o15380 c317-6 c317-5 [c317-5:24979] [0,0,2]-[0,0,0] mca_oob_tcp_peer_complete_connect: connection failed (errno=110) - retrying (pid=24979) [c317-5:24979] mca_oob