Hello,

Thanks, I figured out the exact problem in my case.
I am now using the following command line, which makes the MPI TCP communication ports start from 10000:
mpiexec -n 2 --host karp,wirth --mca btl ^openib --mca btl_tcp_if_include br0 --mca btl_tcp_port_min_v4 10000 ./a.out

and everything works again.

Thanks.

Best regards.

On Tue, Mar 25, 2014 at 10:23 AM, Hamid Saeed <e.hamidsa...@gmail.com> wrote:

> Hello,
> I am not sure what approach the MPI communication follows, but when I use
> --mca btl_base_verbose 30
>
> I observe the mentioned port.
>
> [karp:23756] btl: tcp: attempting to connect() to address 134.106.3.252 on port 4
> [karp][[4612,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to 134.106.3.252 failed: Connection refused (111)
>
> The information at
> http://www.open-mpi.org/community/lists/users/2011/11/17732.php
> is not enough; could you kindly explain?
>
> How can I restrict MPI communication to use ports starting from 1025,
> or to use a port somewhat like 59822?
>
> Regards.
>
> On Tue, Mar 25, 2014 at 9:15 AM, Reuti <re...@staff.uni-marburg.de> wrote:
>
>> Hi,
>>
>> On 25.03.2014 at 08:34, Hamid Saeed wrote:
>>
>> > Is it possible to change the port number for the MPI communication?
>> >
>> > I can see that my program uses port 4 for the MPI communication.
>> >
>> > [karp:23756] btl: tcp: attempting to connect() to address 134.106.3.252 on port 4
>> > [karp][[4612,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to 134.106.3.252 failed: Connection refused (111)
>> >
>> > In my case the ports from 1 to 1024 are reserved.
>> > MPI tries to use one of the reserved ports and reports the connection refused error.
>> >
>> > I would be very glad for any suggestions.
>>
>> There are certain parameters to set the range of used ports, but using
>> any up to 1024 should not be the default:
>>
>> http://www.open-mpi.org/community/lists/users/2011/11/17732.php
>>
>> Are any of these set by accident beforehand by your environment?
>>
>> -- Reuti
>>
>> > Regards.
>> >
>> > On Mon, Mar 24, 2014 at 5:32 PM, Hamid Saeed <e.hamidsa...@gmail.com> wrote:
>> > Hello Jeff,
>> >
>> > Thanks for your cooperation.
>> >
>> > --mca btl_tcp_if_include br0
>> >
>> > worked out of the box.
>> >
>> > The problem was from the network administrator. The machines on the network side were halting the mpi...
>> >
>> > so cleaning and killing everything worked.
>> >
>> > :)
>> >
>> > regards.
>> >
>> > On Mon, Mar 24, 2014 at 4:34 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>> > There is no "self" IP interface in the Linux kernel.
>> >
>> > Try using btl_tcp_if_include and list just the interface(s) that you want to use. From your prior email, I'm *guessing* it's just br2 (i.e., the 10.x address inside your cluster).
>> >
>> > Also, it looks like you didn't set up your SSH keys properly for logging in to remote nodes automatically.
>> >
>> > On Mar 24, 2014, at 10:56 AM, Hamid Saeed <e.hamidsa...@gmail.com> wrote:
>> >
>> > > Hello,
>> > >
>> > > I added the "self", e.g.
>> > >
>> > > hsaeed@karp:~/Task4_mpi/scatterv$ mpirun -np 8 --mca btl ^openib --mca btl_tcp_if_exclude sm,self,lo,br0,br1,ib0,br2 --host karp,wirth ./scatterv
>> > >
>> > > Enter passphrase for key '/home/hsaeed/.ssh/id_rsa':
>> > > --------------------------------------------------------------------------
>> > >
>> > > ERROR::
>> > >
>> > > At least one pair of MPI processes are unable to reach each other for
>> > > MPI communications. This means that no Open MPI device has indicated
>> > > that it can be used to communicate between these processes. This is
>> > > an error; Open MPI requires that all MPI processes be able to reach
>> > > each other. This error can sometimes be the result of forgetting to
>> > > specify the "self" BTL.
>> > >
>> > > Process 1 ([[15751,1],7]) is on host: wirth
>> > > Process 2 ([[15751,1],0]) is on host: karp
>> > > BTLs attempted: self sm
>> > >
>> > > Your MPI job is now going to abort; sorry.
>> > > --------------------------------------------------------------------------
>> > > --------------------------------------------------------------------------
>> > > MPI_INIT has failed because at least one MPI process is unreachable
>> > > from another. This *usually* means that an underlying communication
>> > > plugin -- such as a BTL or an MTL -- has either not loaded or not
>> > > allowed itself to be used. Your MPI job will now abort.
>> > >
>> > > You may wish to try to narrow down the problem;
>> > >
>> > > * Check the output of ompi_info to see which BTL/MTL plugins are
>> > > available.
>> > > * Run your application with MPI_THREAD_SINGLE.
>> > > * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
>> > > if using MTL-based communications) to see exactly which
>> > > communication plugins were considered and/or discarded.
>> > > --------------------------------------------------------------------------
>> > > [wirth:40329] *** An error occurred in MPI_Init
>> > > [wirth:40329] *** on a NULL communicator
>> > > [wirth:40329] *** Unknown error
>> > > [wirth:40329] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
>> > > --------------------------------------------------------------------------
>> > > An MPI process is aborting at a time when it cannot guarantee that all
>> > > of its peer processes in the job will be killed properly. You should
>> > > double check that everything has shut down cleanly.
>> > >
>> > > Reason: Before MPI_INIT completed
>> > > Local host: wirth
>> > > PID: 40329
>> > > --------------------------------------------------------------------------
>> > > --------------------------------------------------------------------------
>> > > mpirun has exited due to process rank 7 with PID 40329 on
>> > > node wirth exiting improperly. There are two reasons this could occur:
>> > >
>> > > 1. this process did not call "init" before exiting, but others in
>> > > the job did. This can cause a job to hang indefinitely while it waits
>> > > for all processes to call "init". By rule, if one process calls "init",
>> > > then ALL processes must call "init" prior to termination.
>> > >
>> > > 2. this process called "init", but exited without calling "finalize".
>> > > By rule, all processes that call "init" MUST call "finalize" prior to
>> > > exiting or it will be considered an "abnormal termination"
>> > >
>> > > This may have caused other processes in the application to be
>> > > terminated by signals sent by mpirun (as reported here).
>> > > --------------------------------------------------------------------------
>> > > [karp:29513] 1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
>> > > [karp:29513] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>> > > [karp:29513] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
>> > > [karp:29513] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
>> > > [karp:29513] 1 more process has sent help message help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
>> > >
>> > > I tried every combination for btl_tcp_if_include or exclude...
>> > >
>> > > I cant figure out what is wrong.
>> > > I can easily talk with the remote pc using netcat.
>> > > I am sure i am very near to the solution but..
>> > >
>> > > regards.
>> > >
>> > > On Mon, Mar 24, 2014 at 3:25 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>> > > If you use btl_tcp_if_exclude, you also need to exclude the loopback interface. Loopback is excluded by the default value of btl_tcp_if_exclude, but if you overwrite that value, then you need to *also* include the loopback interface in the new value.
>> > >
>> > > On Mar 24, 2014, at 4:57 AM, Hamid Saeed <e.hamidsa...@gmail.com> wrote:
>> > >
>> > > > Hello,
>> > > > I am still facing problems.
>> > > > I checked that there is no firewall acting as a barrier for the MPI communication.
>> > > >
>> > > > I even used a command line like
>> > > > hsaeed@karp:~/Task4_mpi/scatterv$ mpiexec -n 2 --mca btl_tcp_if_exclude br2 -host wirth,karp ./a.out
>> > > >
>> > > > Now the program hangs without displaying any error.
>> > > >
>> > > > I used "..exclude br2" because the failed connection was from br2, as you can see in the "ifconfig" output mentioned above.
>> > > >
>> > > > I know there is something wrong but I am unable to figure it out.
>> > > >
>> > > > I need some more kind suggestions.
>> > > >
>> > > > regards.
>> > > >
>> > > > On Fri, Mar 21, 2014 at 6:05 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>> > > > Do you have any firewalling enabled on these machines? If so, you'll want to either disable it, or allow random TCP connections between any of the cluster nodes.
>> > > >
>> > > > On Mar 21, 2014, at 10:24 AM, Hamid Saeed <e.hamidsa...@gmail.com> wrote:
>> > > >
>> > > > > /sbin/ifconfig
>> > > > >
>> > > > > hsaeed@karp:~$ /sbin/ifconfig
>> > > > > br0       Link encap:Ethernet  HWaddr 00:25:90:59:c9:ba
>> > > > >           inet addr:134.106.3.231  Bcast:134.106.3.255  Mask:255.255.255.0
>> > > > >           inet6 addr: fe80::225:90ff:fe59:c9ba/64 Scope:Link
>> > > > >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>> > > > >           RX packets:49080961 errors:0 dropped:50263 overruns:0 frame:0
>> > > > >           TX packets:43279252 errors:0 dropped:0 overruns:0 carrier:0
>> > > > >           collisions:0 txqueuelen:0
>> > > > >           RX bytes:41348407558 (38.5 GiB)  TX bytes:80505842745 (74.9 GiB)
>> > > > >
>> > > > > br1       Link encap:Ethernet  HWaddr 00:25:90:59:c9:bb
>> > > > >           inet addr:134.106.53.231  Bcast:134.106.53.255  Mask:255.255.255.0
>> > > > >           inet6 addr: fe80::225:90ff:fe59:c9bb/64 Scope:Link
>> > > > >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>> > > > >           RX packets:41573060 errors:0 dropped:50261 overruns:0 frame:0
>> > > > >           TX packets:1693509 errors:0 dropped:0 overruns:0 carrier:0
>> > > > >           collisions:0 txqueuelen:0
>> > > > >           RX bytes:6177072160 (5.7 GiB)  TX bytes:230617435 (219.9 MiB)
>> > > > >
>> > > > > br2       Link encap:Ethernet  HWaddr 00:c0:0a:ec:02:e7
>> > > > >           inet addr:10.231.2.231  Bcast:10.231.2.255  Mask:255.255.255.0
>> > > > >           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>> > > > >           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>> > > > >           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>> > > > >           collisions:0 txqueuelen:0
>> > > > >           RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
>> > > > >
>> > > > > eth0      Link encap:Ethernet  HWaddr 00:25:90:59:c9:ba
>> > > > >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>> > > > >           RX packets:69108377 errors:0 dropped:0 overruns:0 frame:0
>> > > > >           TX packets:86459066 errors:0 dropped:0 overruns:0 carrier:0
>> > > > >           collisions:0 txqueuelen:1000
>> > > > >           RX bytes:43533091399 (40.5 GiB)  TX bytes:83359370885 (77.6 GiB)
>> > > > >           Memory:dfe60000-dfe80000
>> > > > >
>> > > > > eth1      Link encap:Ethernet  HWaddr 00:25:90:59:c9:bb
>> > > > >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>> > > > >           RX packets:43531546 errors:0 dropped:0 overruns:0 frame:0
>> > > > >           TX packets:1716151 errors:0 dropped:0 overruns:0 carrier:0
>> > > > >           collisions:0 txqueuelen:1000
>> > > > >           RX bytes:7201915977 (6.7 GiB)  TX bytes:232026383 (221.2 MiB)
>> > > > >           Memory:dfee0000-dff00000
>> > > > >
>> > > > > lo        Link encap:Local Loopback
>> > > > >           inet addr:127.0.0.1  Mask:255.0.0.0
>> > > > >           inet6 addr: ::1/128 Scope:Host
>> > > > >           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>> > > > >           RX packets:10890707 errors:0 dropped:0 overruns:0 frame:0
>> > > > >           TX packets:10890707 errors:0 dropped:0 overruns:0 carrier:0
>> > > > >           collisions:0 txqueuelen:0
>> > > > >           RX bytes:36194379576 (33.7 GiB)  TX bytes:36194379576 (33.7 GiB)
>> > > > >
>> > > > > tap0      Link encap:Ethernet  HWaddr 00:c0:0a:ec:02:e7
>> > > > >           UP BROADCAST MULTICAST  MTU:1500  Metric:1
>> > > > >           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>> > > > >           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>> > > > >           collisions:0 txqueuelen:500
>> > > > >           RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
>> > > > >
>> > > > > When i execute the following line
>> > > > >
>> > > > > hsaeed@karp:~/Task4_mpi/scatterv$ mpiexec -n 2 -host wirth,karp ./a.out
>> > > > >
>> > > > > i receive Error
>> > > > >
>> > > > > [wirth][[59430,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to 10.231.2.231 failed: Connection refused (111)
>> > > > >
>> > > > > NOTE: Karp and wirth are two machines on ssh cluster.
>> > > > >
>> > > > > On Fri, Mar 21, 2014 at 3:13 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>> > > > > On Mar 21, 2014, at 10:09 AM, Hamid Saeed <e.hamidsa...@gmail.com> wrote:
>> > > > >
>> > > > > > > I think I have a TCP connection. As far as I know my cluster is not configured for InfiniBand (IB).
>> > > > >
>> > > > > Ok.
>> > > > >
>> > > > > > > but even for TCP connections,
>> > > > > > >
>> > > > > > > mpirun -n 2 -host master,node001 --mca btl tcp,sm,self ./helloworldmpi
>> > > > > > > mpirun -n 2 -host master,node001 ./helloworldmpi
>> > > > > > >
>> > > > > > > these lines are not working; they output
>> > > > > > > an error like
>> > > > > > > [btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to xx.xxx.x.xxx failed: Connection refused (111)
>> > > > >
>> > > > > What are the IP addresses reported by connect()? (i.e., the address you X'ed out)
>> > > > >
>> > > > > Send the output from ifconfig on each of your servers. Note that some Linux distributions do not put ifconfig in the default PATH of normal users; look for it in /sbin/ifconfig or /usr/sbin/ifconfig.
>> > > > >
>> > > > > --
>> > > > > Jeff Squyres
>> > > > > jsquy...@cisco.com
>> > > > > For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>> > > > >
>> > > > > _______________________________________________
>> > > > > users mailing list
>> > > > > us...@open-mpi.org
>> > > > > http://www.open-mpi.org/mailman/listinfo.cgi/users
>> > > > >
>> > > > > --
>> > > > > Hamid Saeed
>> > > > > CoSynth GmbH & Co. KG
>> > > > > Escherweg 2 - 26121 Oldenburg - Germany
>> > > > > Tel +49 441 9722 738 | Fax -278
>> > > > > http://www.cosynth.com

--
Hamid Saeed
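
For reference, the MCA settings from the working command line at the top of this message can probably also be supplied through the environment, since Open MPI reads MCA parameters from variables named OMPI_MCA_<parameter>. What follows is only a sketch under that assumption: the value 100 for btl_tcp_port_range_v4 is made up for illustration and was not used anywhere in this thread, and the launch line is printed rather than executed.

```shell
#!/bin/sh
# Same settings as "--mca btl ^openib --mca btl_tcp_if_include br0
# --mca btl_tcp_port_min_v4 10000", given as environment variables.
export OMPI_MCA_btl='^openib'                # skip the openib BTL
export OMPI_MCA_btl_tcp_if_include='br0'     # TCP BTL uses only br0
export OMPI_MCA_btl_tcp_port_min_v4='10000'  # first port the TCP BTL may use
# Assumption: illustrative value only; limits how many ports above
# port_min the TCP BTL will try.
export OMPI_MCA_btl_tcp_port_range_v4='100'

# Show what will be handed to Open MPI.
env | grep '^OMPI_MCA_' | sort

# Launch line, printed instead of executed so the sketch has no side effects.
echo 'mpiexec -n 2 --host karp,wirth ./a.out'
```

To check which TCP BTL parameters a given installation actually supports, `ompi_info --param btl tcp` lists them; newer releases may additionally need `--level 9` to show all of them.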