Thanks for the reply. No: in my case the problem was a misunderstanding on our network administrator's side. Our network was only supposed to have ports up to 1023 locked, but someone filed a ticket that got port 1024 locked as well. That is why I wasn't able to communicate with the other computers.
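For the archives, here is a rough sketch of how to make that workaround stick without typing it on every command line (assuming a stock Open MPI installation; port 10000 and the host name wirth are just the example values from this thread). Open MPI reads MCA parameters from the environment and from a per-user parameter file, so either of the following should have the same effect as passing --mca btl_tcp_port_min_v4 10000 to mpiexec:

    export OMPI_MCA_btl_tcp_port_min_v4=10000

or, in $HOME/.openmpi/mca-params.conf:

    btl_tcp_port_min_v4 = 10000

And before blaming MPI, a quick netcat check from one node can confirm whether a given port on the other node is actually reachable:

    nc -zv wirth 10000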
On Mon, Apr 7, 2014 at 9:52 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> I was out on vacation / fully disconnected last week, and am just getting
> to all the backlog now...
>
> Are you saying that port 1024 was locked as well -- i.e., that we should
> set the minimum to 1025?
>
>
> On Mar 31, 2014, at 4:32 AM, Hamid Saeed <e.hamidsa...@gmail.com> wrote:
>
> > Yes Jeff,
> > You were right. The default value for btl_tcp_port_min_v4 is 1024.
> >
> > I was facing a problem running my algorithm on multiple processors
> > (using ssh).
> >
> > Answer:
> > The network administrator locked that port.
> > :(
> >
> > I changed the communication port by forcing MPI to use another one:
> >
> > mpiexec -n 2 --host karp,wirth --mca btl ^openib --mca btl_tcp_if_include br0 --mca btl_tcp_port_min_v4 10000 ./a.out
> >
> > Thanks again for the nice and effective suggestions.
> >
> > Regards.
> >
> >
> > On Tue, Mar 25, 2014 at 1:27 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> > This is very odd -- the default value for btl_tcp_port_min_v4 is 1024.
> > So unless you have overridden this value, you should not be getting a port
> > less than 1024. You can run this to see:
> >
> > ompi_info --level 9 --param btl tcp --parsable | grep port_min_v4
> >
> > Mine says this in a default 1.7.5 installation:
> >
> > mca:btl:tcp:param:btl_tcp_port_min_v4:value:1024
> > mca:btl:tcp:param:btl_tcp_port_min_v4:source:default
> > mca:btl:tcp:param:btl_tcp_port_min_v4:status:writeable
> > mca:btl:tcp:param:btl_tcp_port_min_v4:level:2
> > mca:btl:tcp:param:btl_tcp_port_min_v4:help:The minimum port where the TCP BTL will try to bind (default 1024)
> > mca:btl:tcp:param:btl_tcp_port_min_v4:deprecated:no
> > mca:btl:tcp:param:btl_tcp_port_min_v4:type:int
> > mca:btl:tcp:param:btl_tcp_port_min_v4:disabled:false
> >
> >
> > On Mar 25, 2014, at 5:36 AM, Hamid Saeed <e.hamidsa...@gmail.com> wrote:
> >
> > > Hello,
> > > Thanks, I figured out what the exact problem was in my case.
> > > I am now using the following execution line; it directs the MPI
> > > communication ports to start from 10000:
> > >
> > > mpiexec -n 2 --host karp,wirth --mca btl ^openib --mca btl_tcp_if_include br0 --mca btl_tcp_port_min_v4 10000 ./a.out
> > >
> > > and everything works again.
> > >
> > > Thanks.
> > >
> > > Best regards.
> > >
> > >
> > > On Tue, Mar 25, 2014 at 10:23 AM, Hamid Saeed <e.hamidsa...@gmail.com> wrote:
> > > Hello,
> > > I am not sure what approach the MPI communication follows, but when I
> > > use
> > > --mca btl_base_verbose 30
> > >
> > > I observe the mentioned port:
> > >
> > > [karp:23756] btl: tcp: attempting to connect() to address 134.106.3.252 on port 4
> > > [karp][[4612,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to 134.106.3.252 failed: Connection refused (111)
> > >
> > > The information on
> > > http://www.open-mpi.org/community/lists/users/2011/11/17732.php
> > > is not enough; could you kindly explain:
> > >
> > > How can I restrict the MPI communication to use ports starting from 1025,
> > > or to use a port somewhat like
> > > 59822?
> > >
> > > Regards.
> > >
> > >
> > > On Tue, Mar 25, 2014 at 9:15 AM, Reuti <re...@staff.uni-marburg.de> wrote:
> > > Hi,
> > >
> > > On 25.03.2014 at 08:34, Hamid Saeed wrote:
> > >
> > > > Is it possible to change the port number for the MPI communication?
> > > >
> > > > I can see that my program uses port 4 for the MPI communication.
> > > >
> > > > [karp:23756] btl: tcp: attempting to connect() to address 134.106.3.252 on port 4
> > > > [karp][[4612,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to 134.106.3.252 failed: Connection refused (111)
> > > >
> > > > In my case the ports from 1 to 1024 are reserved.
> > > > MPI tries to use one of the reserved ports and prompts the connection refused error.
> > > >
> > > > I will be very glad for your kind suggestions.
> > >
> > > There are certain parameters to set the range of used ports, but using any up to 1024 should not be the default:
> > >
> > > http://www.open-mpi.org/community/lists/users/2011/11/17732.php
> > >
> > > Are any of these set by accident beforehand by your environment?
> > >
> > > -- Reuti
> > >
> > > > Regards.
> > > >
> > > >
> > > > On Mon, Mar 24, 2014 at 5:32 PM, Hamid Saeed <e.hamidsa...@gmail.com> wrote:
> > > > Hello Jeff,
> > > >
> > > > Thanks for your cooperation.
> > > >
> > > > --mca btl_tcp_if_include br0
> > > >
> > > > worked out of the box.
> > > >
> > > > The problem was from the network administrator. The machines on the network side were halting the MPI...
> > > >
> > > > so cleaning and killing everything worked.
> > > >
> > > > :)
> > > >
> > > > regards.
> > > >
> > > >
> > > > On Mon, Mar 24, 2014 at 4:34 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> > > > There is no "self" IP interface in the Linux kernel.
> > > >
> > > > Try using btl_tcp_if_include and list just the interface(s) that you want to use. From your prior email, I'm *guessing* it's just br2 (i.e., the 10.x address inside your cluster).
> > > >
> > > > Also, it looks like you didn't set up your SSH keys properly for logging in to remote nodes automatically.
> > > >
> > > >
> > > > On Mar 24, 2014, at 10:56 AM, Hamid Saeed <e.hamidsa...@gmail.com> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I added the "self", e.g.
> > > > >
> > > > > hsaeed@karp:~/Task4_mpi/scatterv$ mpirun -np 8 --mca btl ^openib --mca btl_tcp_if_exclude sm,self,lo,br0,br1,ib0,br2 --host karp,wirth ./scatterv
> > > > >
> > > > > Enter passphrase for key '/home/hsaeed/.ssh/id_rsa':
> > > > > --------------------------------------------------------------------------
> > > > >
> > > > > ERROR::
> > > > >
> > > > > At least one pair of MPI processes are unable to reach each other for
> > > > > MPI communications. This means that no Open MPI device has indicated
> > > > > that it can be used to communicate between these processes. This is
> > > > > an error; Open MPI requires that all MPI processes be able to reach
> > > > > each other. This error can sometimes be the result of forgetting to
> > > > > specify the "self" BTL.
> > > > >
> > > > >   Process 1 ([[15751,1],7]) is on host: wirth
> > > > >   Process 2 ([[15751,1],0]) is on host: karp
> > > > >   BTLs attempted: self sm
> > > > >
> > > > > Your MPI job is now going to abort; sorry.
> > > > > --------------------------------------------------------------------------
> > > > > --------------------------------------------------------------------------
> > > > > MPI_INIT has failed because at least one MPI process is unreachable
> > > > > from another. This *usually* means that an underlying communication
> > > > > plugin -- such as a BTL or an MTL -- has either not loaded or not
> > > > > allowed itself to be used. Your MPI job will now abort.
> > > > >
> > > > > You may wish to try to narrow down the problem:
> > > > >
> > > > > * Check the output of ompi_info to see which BTL/MTL plugins are
> > > > >   available.
> > > > > * Run your application with MPI_THREAD_SINGLE.
> > > > > * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
> > > > >   if using MTL-based communications) to see exactly which
> > > > >   communication plugins were considered and/or discarded.
> > > > > --------------------------------------------------------------------------
> > > > > [wirth:40329] *** An error occurred in MPI_Init
> > > > > [wirth:40329] *** on a NULL communicator
> > > > > [wirth:40329] *** Unknown error
> > > > > [wirth:40329] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> > > > > --------------------------------------------------------------------------
> > > > > An MPI process is aborting at a time when it cannot guarantee that all
> > > > > of its peer processes in the job will be killed properly. You should
> > > > > double check that everything has shut down cleanly.
> > > > >
> > > > >   Reason:     Before MPI_INIT completed
> > > > >   Local host: wirth
> > > > >   PID:        40329
> > > > > --------------------------------------------------------------------------
> > > > > --------------------------------------------------------------------------
> > > > > mpirun has exited due to process rank 7 with PID 40329 on
> > > > > node wirth exiting improperly. There are two reasons this could occur:
> > > > >
> > > > > 1. this process did not call "init" before exiting, but others in
> > > > > the job did. This can cause a job to hang indefinitely while it waits
> > > > > for all processes to call "init". By rule, if one process calls "init",
> > > > > then ALL processes must call "init" prior to termination.
> > > > >
> > > > > 2. this process called "init", but exited without calling "finalize".
> > > > > By rule, all processes that call "init" MUST call "finalize" prior to
> > > > > exiting or it will be considered an "abnormal termination"
> > > > >
> > > > > This may have caused other processes in the application to be
> > > > > terminated by signals sent by mpirun (as reported here).
> > > > > --------------------------------------------------------------------------
> > > > > [karp:29513] 1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
> > > > > [karp:29513] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> > > > > [karp:29513] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
> > > > > [karp:29513] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
> > > > > [karp:29513] 1 more process has sent help message help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
> > > > >
> > > > > I tried every combination for btl_tcp_if_include or exclude...
> > > > >
> > > > > I can't figure out what is wrong.
> > > > > I can easily talk with the remote PC using netcat.
> > > > > I am sure I am very near to the solution, but...
> > > > >
> > > > > regards.
> > > > >
> > > > >
> > > > > On Mon, Mar 24, 2014 at 3:25 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> > > > > If you use btl_tcp_if_exclude, you also need to exclude the loopback interface.
> > > > > Loopback is excluded by the default value of btl_tcp_if_exclude, but if you overwrite that value, then you need to *also* include the loopback interface in the new value.
> > > > >
> > > > >
> > > > > On Mar 24, 2014, at 4:57 AM, Hamid Saeed <e.hamidsa...@gmail.com> wrote:
> > > > >
> > > > > > Hello,
> > > > > > I am still facing problems.
> > > > > > I checked; there is no firewall acting as a barrier for the MPI communication.
> > > > > >
> > > > > > I even used an execution line like
> > > > > > hsaeed@karp:~/Task4_mpi/scatterv$ mpiexec -n 2 --mca btl_tcp_if_exclude br2 -host wirth,karp ./a.out
> > > > > >
> > > > > > Now the output hangs without displaying any error.
> > > > > >
> > > > > > I used "--mca btl_tcp_if_exclude br2" because the failed connection was from br2, as you can see in the "ifconfig" output mentioned above.
> > > > > >
> > > > > > I know there is something wrong, but I am almost unable to figure it out.
> > > > > >
> > > > > > I need some more kind suggestions.
> > > > > >
> > > > > > regards.
> > > > > >
> > > > > >
> > > > > > On Fri, Mar 21, 2014 at 6:05 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> > > > > > Do you have any firewalling enabled on these machines? If so, you'll want to either disable it, or allow random TCP connections between any of the cluster nodes.
> > > > > >
> > > > > >
> > > > > > On Mar 21, 2014, at 10:24 AM, Hamid Saeed <e.hamidsa...@gmail.com> wrote:
> > > > > >
> > > > > > > /sbin/ifconfig
> > > > > > >
> > > > > > > hsaeed@karp:~$ /sbin/ifconfig
> > > > > > > br0       Link encap:Ethernet  HWaddr 00:25:90:59:c9:ba
> > > > > > >           inet addr:134.106.3.231  Bcast:134.106.3.255  Mask:255.255.255.0
> > > > > > >           inet6 addr: fe80::225:90ff:fe59:c9ba/64 Scope:Link
> > > > > > >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> > > > > > >           RX packets:49080961 errors:0 dropped:50263 overruns:0 frame:0
> > > > > > >           TX packets:43279252 errors:0 dropped:0 overruns:0 carrier:0
> > > > > > >           collisions:0 txqueuelen:0
> > > > > > >           RX bytes:41348407558 (38.5 GiB)  TX bytes:80505842745 (74.9 GiB)
> > > > > > >
> > > > > > > br1       Link encap:Ethernet  HWaddr 00:25:90:59:c9:bb
> > > > > > >           inet addr:134.106.53.231  Bcast:134.106.53.255  Mask:255.255.255.0
> > > > > > >           inet6 addr: fe80::225:90ff:fe59:c9bb/64 Scope:Link
> > > > > > >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> > > > > > >           RX packets:41573060 errors:0 dropped:50261 overruns:0 frame:0
> > > > > > >           TX packets:1693509 errors:0 dropped:0 overruns:0 carrier:0
> > > > > > >           collisions:0 txqueuelen:0
> > > > > > >           RX bytes:6177072160 (5.7 GiB)  TX bytes:230617435 (219.9 MiB)
> > > > > > >
> > > > > > > br2       Link encap:Ethernet  HWaddr 00:c0:0a:ec:02:e7
> > > > > > >           inet addr:10.231.2.231  Bcast:10.231.2.255  Mask:255.255.255.0
> > > > > > >           UP BROADCAST MULTICAST  MTU:1500  Metric:1
> > > > > > >           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> > > > > > >           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> > > > > > >           collisions:0 txqueuelen:0
> > > > > > >           RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
> > > > > > >
> > > > > > > eth0      Link encap:Ethernet  HWaddr 00:25:90:59:c9:ba
> > > > > > >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> > > > > > >           RX packets:69108377 errors:0 dropped:0 overruns:0 frame:0
> > > > > > >           TX packets:86459066 errors:0 dropped:0 overruns:0 carrier:0
> > > > > > >           collisions:0 txqueuelen:1000
> > > > > > >           RX bytes:43533091399 (40.5 GiB)  TX bytes:83359370885 (77.6 GiB)
> > > > > > >           Memory:dfe60000-dfe80000
> > > > > > >
> > > > > > > eth1      Link encap:Ethernet  HWaddr 00:25:90:59:c9:bb
> > > > > > >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> > > > > > >           RX packets:43531546 errors:0 dropped:0 overruns:0 frame:0
> > > > > > >           TX packets:1716151 errors:0 dropped:0 overruns:0 carrier:0
> > > > > > >           collisions:0 txqueuelen:1000
> > > > > > >           RX bytes:7201915977 (6.7 GiB)  TX bytes:232026383 (221.2 MiB)
> > > > > > >           Memory:dfee0000-dff00000
> > > > > > >
> > > > > > > lo        Link encap:Local Loopback
> > > > > > >           inet addr:127.0.0.1  Mask:255.0.0.0
> > > > > > >           inet6 addr: ::1/128 Scope:Host
> > > > > > >           UP LOOPBACK RUNNING  MTU:16436  Metric:1
> > > > > > >           RX packets:10890707 errors:0 dropped:0 overruns:0 frame:0
> > > > > > >           TX packets:10890707 errors:0 dropped:0 overruns:0 carrier:0
> > > > > > >           collisions:0 txqueuelen:0
> > > > > > >           RX bytes:36194379576 (33.7 GiB)  TX bytes:36194379576 (33.7 GiB)
> > > > > > >
> > > > > > > tap0      Link encap:Ethernet  HWaddr 00:c0:0a:ec:02:e7
> > > > > > >           UP BROADCAST MULTICAST  MTU:1500  Metric:1
> > > > > > >           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> > > > > > >           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> > > > > > >           collisions:0 txqueuelen:500
> > > > > > >           RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
> > > > > > >
> > > > > > > When I execute the following line
> > > > > > >
> > > > > > > hsaeed@karp:~/Task4_mpi/scatterv$ mpiexec -n 2 -host wirth,karp ./a.out
> > > > > > >
> > > > > > > I receive this error:
> > > > > > >
> > > > > > > [wirth][[59430,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to 10.231.2.231 failed: Connection refused (111)
> > > > > > >
> > > > > > > NOTE: karp and wirth are two machines on the ssh cluster.
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Mar 21, 2014 at 3:13 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> > > > > > > On Mar 21, 2014, at 10:09 AM, Hamid Saeed <e.hamidsa...@gmail.com> wrote:
> > > > > > >
> > > > > > > > I think I have a TCP connection. As far as I know, my cluster is not configured for InfiniBand (IB).
> > > > > > >
> > > > > > > Ok.
> > > > > > >
> > > > > > > > but even for tcp connections,
> > > > > > > >
> > > > > > > > mpirun -n 2 -host master,node001 --mca btl tcp,sm,self ./helloworldmpi
> > > > > > > > mpirun -n 2 -host master,node001 ./helloworldmpi
> > > > > > > >
> > > > > > > > these lines are not working; they output an error like
> > > > > > > > [btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect] connect() to xx.xxx.x.xxx failed: Connection refused (111)
> > > > > > >
> > > > > > > What are the IP addresses reported by connect()? (i.e., the address you X'ed out)
> > > > > > >
> > > > > > > Send the output from ifconfig on each of your servers. Note that some Linux distributions do not put ifconfig in the default PATH of normal users; look for it in /sbin/ifconfig or /usr/sbin/ifconfig.
--
_______________________________________________
Hamid Saeed
CoSynth GmbH & Co. KG
Escherweg 2 - 26121 Oldenburg - Germany
Tel +49 441 9722 738 | Fax -278
http://www.cosynth.com
_______________________________________________