There is no "self" IP interface in the Linux kernel, so listing "self" in btl_tcp_if_exclude has no effect. Try using btl_tcp_if_include instead, and list just the interface(s) that you want to use. From your prior email, I'm *guessing* it's just br2 (i.e., the 10.x address inside your cluster).

Also, it looks like you didn't set up your SSH keys properly for logging in to the remote nodes automatically (mpirun is prompting you for your key's passphrase).
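For example, something along these lines (just a sketch; I'm assuming br2 really is the interface that routes between karp and wirth, and that ./scatterv is still your test binary):

    mpirun -np 8 --mca btl tcp,sm,self --mca btl_tcp_if_include br2 \
        --host karp,wirth ./scatterv

For the SSH part, either load your existing key into an agent or install a passphrase-less key on each node. Assuming the ~/.ssh/id_rsa key from your output and the node name wirth, something like:

    eval $(ssh-agent)
    ssh-add ~/.ssh/id_rsa        # enter the passphrase once per login session
    ssh wirth hostname           # should now print the hostname without prompting

(or use ssh-copy-id to push a key to the remote node). Once "ssh wirth" works without any prompt, mpirun should be able to launch processes there.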

On Mar 24, 2014, at 10:56 AM, Hamid Saeed <e.hamidsa...@gmail.com> wrote:

> Hello,
>
> I added the "self" e.g
>
> hsaeed@karp:~/Task4_mpi/scatterv$ mpirun -np 8 --mca btl ^openib --mca btl_tcp_if_exclude sm,self,lo,br0,br1,ib0,br2 --host karp,wirth ./scatterv
>
> Enter passphrase for key '/home/hsaeed/.ssh/id_rsa':
> --------------------------------------------------------------------------
>
> ERROR::
>
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[15751,1],7]) is on host: wirth
>   Process 2 ([[15751,1],0]) is on host: karp
>   BTLs attempted: self sm
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> MPI_INIT has failed because at least one MPI process is unreachable
> from another.  This *usually* means that an underlying communication
> plugin -- such as a BTL or an MTL -- has either not loaded or not
> allowed itself to be used.  Your MPI job will now abort.
>
> You may wish to try to narrow down the problem;
>
>  * Check the output of ompi_info to see which BTL/MTL plugins are
>    available.
>  * Run your application with MPI_THREAD_SINGLE.
>  * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
>    if using MTL-based communications) to see exactly which
>    communication plugins were considered and/or discarded.
> --------------------------------------------------------------------------
> [wirth:40329] *** An error occurred in MPI_Init
> [wirth:40329] *** on a NULL communicator
> [wirth:40329] *** Unknown error
> [wirth:40329] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly.  You should
> double check that everything has shut down cleanly.
>
>   Reason:     Before MPI_INIT completed
>   Local host: wirth
>   PID:        40329
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 7 with PID 40329 on
> node wirth exiting improperly. There are two reasons this could occur:
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> [karp:29513] 1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
> [karp:29513] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> [karp:29513] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
> [karp:29513] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
> [karp:29513] 1 more process has sent help message help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
>
> I tried every combination for btl_tcp_if_include or exclude...
>
> I cant figure out what is wrong.
> I can easily talk with the remote pc using netcat.
> I am sure i am very near to the solution but..
>
> regards.
>
>
> On Mon, Mar 24, 2014 at 3:25 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> If you use btl_tcp_if_exclude, you also need to exclude the loopback
> interface.  Loopback is excluded by the default value of btl_tcp_if_exclude,
> but if you overwrite that value, then you need to *also* include the loopback
> interface in the new value.
>
>
> On Mar 24, 2014, at 4:57 AM, Hamid Saeed <e.hamidsa...@gmail.com> wrote:
>
> > Hello,
> > Still i am facing problems.
> > I checked there is no firewall which is acting as a barrier for the mpi
> > communication.
> >
> > even i used the execution line like
> > hsaeed@karp:~/Task4_mpi/scatterv$ mpiexec -n 2 --mca btl_tcp_if_exclude br2 -host wirth,karp ./a.out
> >
> > Now the output hangup without displaying any error.
> >
> > Used "..exclude br2" because the failed connection was from br2 as you can
> > see in the "ifconfig" output mentioned above.
> >
> > I know there is something wrong but i am almost unable to figure it out.
> >
> > I need some more kind suggestions.
> >
> > regards.
> >
> >
> > On Fri, Mar 21, 2014 at 6:05 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> > Do you have any firewalling enabled on these machines?  If so, you'll want
> > to either disable it, or allow random TCP connections between any of the
> > cluster nodes.
> >
> > On Mar 21, 2014, at 10:24 AM, Hamid Saeed <e.hamidsa...@gmail.com> wrote:
> >
> > > /sbin/ifconfig
> > >
> > > hsaeed@karp:~$ /sbin/ifconfig
> > > br0       Link encap:Ethernet  HWaddr 00:25:90:59:c9:ba
> > >           inet addr:134.106.3.231  Bcast:134.106.3.255  Mask:255.255.255.0
> > >           inet6 addr: fe80::225:90ff:fe59:c9ba/64 Scope:Link
> > >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> > >           RX packets:49080961 errors:0 dropped:50263 overruns:0 frame:0
> > >           TX packets:43279252 errors:0 dropped:0 overruns:0 carrier:0
> > >           collisions:0 txqueuelen:0
> > >           RX bytes:41348407558 (38.5 GiB)  TX bytes:80505842745 (74.9 GiB)
> > >
> > > br1       Link encap:Ethernet  HWaddr 00:25:90:59:c9:bb
> > >           inet addr:134.106.53.231  Bcast:134.106.53.255  Mask:255.255.255.0
> > >           inet6 addr: fe80::225:90ff:fe59:c9bb/64 Scope:Link
> > >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> > >           RX packets:41573060 errors:0 dropped:50261 overruns:0 frame:0
> > >           TX packets:1693509 errors:0 dropped:0 overruns:0 carrier:0
> > >           collisions:0 txqueuelen:0
> > >           RX bytes:6177072160 (5.7 GiB)  TX bytes:230617435 (219.9 MiB)
> > >
> > > br2       Link encap:Ethernet  HWaddr 00:c0:0a:ec:02:e7
> > >           inet addr:10.231.2.231  Bcast:10.231.2.255  Mask:255.255.255.0
> > >           UP BROADCAST MULTICAST  MTU:1500  Metric:1
> > >           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> > >           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> > >           collisions:0 txqueuelen:0
> > >           RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
> > >
> > > eth0      Link encap:Ethernet  HWaddr 00:25:90:59:c9:ba
> > >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> > >           RX packets:69108377 errors:0 dropped:0 overruns:0 frame:0
> > >           TX packets:86459066 errors:0 dropped:0 overruns:0 carrier:0
> > >           collisions:0 txqueuelen:1000
> > >           RX bytes:43533091399 (40.5 GiB)  TX bytes:83359370885 (77.6 GiB)
> > >           Memory:dfe60000-dfe80000
> > >
> > > eth1      Link encap:Ethernet  HWaddr 00:25:90:59:c9:bb
> > >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> > >           RX packets:43531546 errors:0 dropped:0 overruns:0 frame:0
> > >           TX packets:1716151 errors:0 dropped:0 overruns:0 carrier:0
> > >           collisions:0 txqueuelen:1000
> > >           RX bytes:7201915977 (6.7 GiB)  TX bytes:232026383 (221.2 MiB)
> > >           Memory:dfee0000-dff00000
> > >
> > > lo        Link encap:Local Loopback
> > >           inet addr:127.0.0.1  Mask:255.0.0.0
> > >           inet6 addr: ::1/128 Scope:Host
> > >           UP LOOPBACK RUNNING  MTU:16436  Metric:1
> > >           RX packets:10890707 errors:0 dropped:0 overruns:0 frame:0
> > >           TX packets:10890707 errors:0 dropped:0 overruns:0 carrier:0
> > >           collisions:0 txqueuelen:0
> > >           RX bytes:36194379576 (33.7 GiB)  TX bytes:36194379576 (33.7 GiB)
> > >
> > > tap0      Link encap:Ethernet  HWaddr 00:c0:0a:ec:02:e7
> > >           UP BROADCAST MULTICAST  MTU:1500  Metric:1
> > >           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
> > >           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> > >           collisions:0 txqueuelen:500
> > >           RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
> > >
> > > When i execute the following line
> > >
> > > hsaeed@karp:~/Task4_mpi/scatterv$ mpiexec -n 2 -host wirth,karp ./a.out
> > >
> > > i receive Error
> > >
> > > [wirth][[59430,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
> > > connect() to 10.231.2.231 failed: Connection refused (111)
> > >
> > > NOTE: Karp and wirth are two machines on ssh cluster.
> > >
> > > On Fri, Mar 21, 2014 at 3:13 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> > > On Mar 21, 2014, at 10:09 AM, Hamid Saeed <e.hamidsa...@gmail.com> wrote:
> > >
> > > > > I think i have a tcp connection. As for as i know my cluster is not
> > > > > configured for Infiniband (IB).
> > >
> > > Ok.
> > >
> > > > but even for tcp connections.
> > > >
> > > > > mpirun -n 2 -host master,node001 --mca btl tcp,sm,self ./helloworldmpi
> > > > > mpirun -n 2 -host master,node001 ./helloworldmpi
> > > > >
> > > > > These line are not working they output
> > > > > Error like
> > > > > [btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
> > > > > connect() to xx.xxx.x.xxx failed: Connection refused (111)
> > >
> > > What are the IP addresses reported by connect()?  (i.e., the address you
> > > X'ed out)
> > >
> > > Send the output from ifconfig on each of your servers.  Note that some
> > > Linux distributions do not put ifconfig in the default PATH of normal
> > > users; look for it in /sbin/ifconfig or /usr/sbin/ifconfig.
> > >
> > > --
> > > Jeff Squyres
> > > jsquy...@cisco.com
> > > For corporate legal information go to:
> > > http://www.cisco.com/web/about/doing_business/legal/cri/
>
> --
> _______________________________________________
> Hamid Saeed
> CoSynth GmbH & Co. KG
> Escherweg 2 - 26121 Oldenburg - Germany
> Tel +49 441 9722 738 | Fax -278
> http://www.cosynth.com
> _______________________________________________

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/