Yes, I did. Because it was the same NIC with two ports, each capable of delivering 5 Gb/s, I never thought they would need to be on different subnets. But once I changed the subnet for one of the ports on both nodes, it seemed to work.
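For reference, the dual-port card is now set up something like this (the 10.3.2.x addresses and /24 masks below are just illustrative; the point is that the two ports of each card now sit in different subnets):

    denver:   eth23 10.3.1.1/24    eth24 10.3.2.2/24
    chicago:  eth29 10.3.1.3/24    eth30 10.3.2.4/24

with the corresponding /etc/hosts entries updated to match.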
Also, I am looking for a good way to start understanding the implementation-level details of Open MPI. Can you point me to some good sources? (PS: To start with, I have already read the FAQ section. A sketch of the test program is at the bottom of this mail.)

thanks a lot for the help
-- Joba

On Fri, Feb 17, 2012 at 8:30 AM, Richard Bardwell <rich...@sharc.co.uk> wrote:
> Yes, they were on the same subnet. I guess that is the problem.
>
> ----- Original Message -----
> From: "Jeff Squyres" <jsquy...@cisco.com>
> To: "Open MPI Users" <us...@open-mpi.org>
> Sent: Friday, February 17, 2012 4:20 PM
> Subject: Re: [OMPI users] Problem running an mpi application on nodes
> with more than one interface
>
>> Did you have both of the ethernet ports on the same subnet, or were they
>> on different subnets?
>>
>> On Feb 17, 2012, at 5:36 AM, Richard Bardwell wrote:
>>
>>> I had exactly the same problem.
>>> Trying to run MPI between 2 separate machines, with each machine having
>>> 2 ethernet ports, causes really weird behaviour on the most basic code.
>>> I had to disable one of the ethernet ports on each of the machines,
>>> and it worked just fine after that. No idea why, though!
>>>
>>> ----- Original Message -----
>>> From: Jingcha Joba
>>> To: us...@open-mpi.org
>>> Sent: Thursday, February 16, 2012 8:43 PM
>>> Subject: [OMPI users] Problem running an mpi application on nodes with
>>> more than one interface
>>>
>>> Hello Everyone,
>>> This is my first post on the open-mpi forum.
>>> I am trying to run a simple program which does a Sendrecv between two
>>> nodes, with 2 interface cards on each node.
>>> Both nodes run RHEL 6 with Open MPI 1.4.4 on an 8-core Xeon processor.
>>> What I noticed was that when using two or more interfaces on both
>>> nodes, the MPI job "hangs" while attempting to connect.
>>> These details might help:
>>> Node 1 - Denver has a single-port "A" card (eth21 - 25.192.xx.xx - which
>>> I use to ssh to that machine), and a dual-port "B" card (eth23 - 10.3.1.1
>>> & eth24 - 10.3.1.2).
>>> Node 2 - Chicago has the same single-port A card (eth19 - 25.192.xx.xx
>>> - again used for ssh) and a dual-port B card (eth29 - 10.3.1.3 & eth30 -
>>> 10.3.1.4).
>>> My /etc/hosts looks like:
>>> 25.192.xx.xx denver.xxx.com denver
>>> 10.3.1.1 denver.xxx.com denver
>>> 10.3.1.2 denver.xxx.com denver
>>> 25.192.xx.xx chicago.xxx.com chicago
>>> 10.3.1.3 chicago.xxx.com chicago
>>> 10.3.1.4 chicago.xxx.com chicago
>>> ...
>>> ...
>>> ...
>>> This is how I run it:
>>> mpirun --hostfile host1 --mca btl tcp,sm,self --mca btl_tcp_if_exclude
>>> eth21,eth19,lo,virbr0 --mca btl_base_verbose 30 -np 4 ./Sendrecv
>>> I get a bunch of output from both chicago and denver saying it has
>>> found components like tcp, sm, and self, and then it hangs at:
>>> [denver.xxx.com:21682] btl: tcp: attempting to connect() to address
>>> 10.3.1.3 on port 4
>>> [denver.xxx.com:21682] btl: tcp: attempting to connect() to address
>>> 10.3.1.4 on port 4
>>> However, if I run the same program while excluding eth29 or eth30, it
>>> works fine. Something like this:
>>> mpirun --hostfile host1 --mca btl tcp,sm,self --mca btl_tcp_if_exclude
>>> eth21,eth19,eth29,lo,virbr0 --mca btl_base_verbose 30 -np 4 ./Sendrecv
>>> My hostfile looks like this:
>>> [sshuser@denver Sendrecv]$ cat host1
>>> denver slots=2
>>> chicago slots=2
>>> I am not sure if I have to provide something else. If I do, please feel
>>> free to ask.
>>> thanks,
>>> --
>>> Joba
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
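PS: In case it helps anyone reproduce this, the test is essentially a minimal MPI_Sendrecv ring exchange along these lines (a sketch from memory, not the exact source):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, sendbuf, recvbuf;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank sends to the next rank and receives from the previous
       one; with "-np 4" across the two hosts this forces off-node TCP
       connections, which is where the hang showed up. */
    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;
    sendbuf = rank;

    MPI_Sendrecv(&sendbuf, 1, MPI_INT, next, 0,
                 &recvbuf, 1, MPI_INT, prev, 0,
                 MPI_COMM_WORLD, &status);

    printf("rank %d received %d from rank %d\n", rank, recvbuf, prev);

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with the mpirun lines quoted above.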