I notice that in the worker, you have:

eth2      Link encap:Ethernet  HWaddr 00:1b:21:77:c5:d4  
          inet addr:192.168.1.155  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe77:c5d4/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:9225846 errors:0 dropped:75175 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:1336628768 (1.3 GB)  TX bytes:552 (552.0 B)

eth3      Link encap:Ethernet  HWaddr 00:1b:21:77:c5:d5  
          inet addr:192.168.1.156  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::21b:21ff:fe77:c5d5/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:26481809 errors:0 dropped:75059 overruns:0 frame:0
          TX packets:18030236 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:70061260271 (70.0 GB)  TX bytes:11844181778 (11.8 GB)

Two different NICs are on the same subnet -- that doesn't seem like a good 
idea.  I think this topic has come up on the users list before, and, IIRC, 
the general consensus is "don't do that," because it's not clear which NIC 
Linux will actually use for outgoing traffic bound for the 192.168.1.x 
subnet.
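
One thing you could try (untested, and assuming the interface names from 
your ifconfig output) is forcing Open MPI's TCP traffic onto a single NIC 
with the btl_tcp_if_include / oob_tcp_if_include MCA parameters, e.g.:

  $ mpirun --mca btl_tcp_if_include eth3 --mca oob_tcp_if_include eth3 \
        -np 2 -host localhost,remotehost ./mpi-test

That at least removes the ambiguity about which interface gets used; the 
cleaner fix is still to put the two NICs on different subnets (or take one 
of them down).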



On Aug 4, 2011, at 1:59 PM, Keith Manville wrote:

> I am having trouble running my MPI program on multiple nodes. I can
> run multiple processes on a single node, and I can spawn processes on
> remote nodes, but when I call Send from a remote node, the call never
> returns, even though there is an appropriate Recv waiting. I'm pretty
> sure this is an issue with my configuration, not my code. I've tried
> some other sample programs I found and had the same problem of hanging
> on a send from one host to another.
> 
> Here's an in-depth description:
> 
> I wrote a quick test program where each process with rank > 0 sends an
> int to the master (rank 0), and the master receives until it gets
> something from every other process.
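> 
> In outline, it does something like the following (a minimal sketch of
> the same pattern, not the exact attached mpi-test source):
> 
> #include <mpi.h>
> #include <stdio.h>
> 
> int main(int argc, char **argv)
> {
>     int rank, size;
> 
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
> 
>     if (rank == 0) {
>         /* Master: collect one int from every other rank. */
>         int i, value;
>         MPI_Status status;
>         for (i = 1; i < size; ++i) {
>             MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, 0,
>                      MPI_COMM_WORLD, &status);
>             printf("0 received %d from %d\n", value, status.MPI_SOURCE);
>         }
>         printf("all workers checked in!\n");
>     } else {
>         /* Worker: send 10 + rank to the master. */
>         int value = 10 + rank;
>         MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
>     }
> 
>     MPI_Finalize();
>     return 0;
> }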
> 
> My test program works fine when I run multiple processes on a single machine.
> 
> either the local node:
> 
> $ ./mpirun -n 4 ./mpi-test
> Hi I'm localhost:2
> Hi I'm localhost:1
> localhost:1 sending 11...
> localhost:2 sending 12...
> localhost:2 sent 12
> localhost:1 sent 11
> Hi I'm localhost:0
> localhost:0 received 11 from 1
> localhost:0 received 12 from 2
> Hi I'm localhost:3
> localhost:3 sending 13...
> localhost:3 sent 13
> localhost:0 received 13 from 3
> all workers checked in!
> 
> or a remote one:
> 
> $ ./mpirun -np 2 -host remotehost ./mpi-test
> Hi I'm remotehost:0
> remotehost:0 received 11 from 1
> all workers checked in!
> Hi I'm remotehost:1
> remotehost:1 sending 11...
> remotehost:1 sent 11
> 
> But when I try to run the master locally and the worker(s) remotely
> (this is the way I am actually interested in running it), Send never
> returns and it hangs indefinitely.
> 
> $ ./mpirun -np 2 -host localhost,remotehost ./mpi-test
> Hi I'm localhost:0
> Hi I'm remotehost:1
> remotehost:1 sending 11...
> 
> Just to see if it would work, I tried spawning the master on the
> remotehost and the worker on the localhost.
> 
> $ ./mpirun -np 2 -host remotehost,localhost ./mpi-test
> Hi I'm localhost:1
> localhost:1 sending 11...
> localhost:1 sent 11
> Hi I'm remotehost:0
> remotehost:0 received 0 from 1
> all workers checked in!
> 
> It doesn't hang on Send, but the wrong value is received.
> 
> Any idea what's going on? I've attached my code, my config.log,
> ifconfig output, and ompi_info output.
> 
> Thanks,
> Keith
> <mpi.tgz>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

