On Jan 5, 2009, at 5:19 PM, Jeff Squyres wrote:
> On Jan 5, 2009, at 5:01 PM, Maciej Kazulak wrote:
>> Interesting, though. I thought that in such a simple scenario shared
>> memory (or whatever is fastest) would be used for IPC. But nope: even
>> with a single process it still wants to use TCP/IP to communicate
>> between mpirun and orted.
> Correct -- we only have TCP enabled for MPI process <--> orted
> communication. There are several reasons why; the simplest is that
> this is our "out of band" channel and it is only used to set up and
> tear down the job. As such, we don't care that it's a little slower
> than other possible channels (such as sm). MPI traffic will use
> shmem, OpenFabrics-based networks, Myrinet, etc. -- but not MPI
> process <--> orted communication.
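> For example, you can pick which transports the MPI traffic itself
> uses via the btl MCA parameter (exact parameter names can vary a bit
> between releases, and ./my_mpi_app below is just a stand-in for your
> executable); the mpirun/orted out-of-band channel stays on TCP either
> way:
>
>   # keep MPI point-to-point traffic on shared memory (plus self)
>   mpirun --mca btl self,sm -np 2 ./my_mpi_app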
>> What's even more surprising to me is that it won't use loopback for
>> that. Hence my (maybe a little over-restrictive) iptables rules were
>> the problem: I allowed traffic from 127.0.0.1 to 127.0.0.1 on lo, but
>> not from <eth0_addr> to <eth0_addr> on eth0, and both processes
>> ended up waiting for I/O.
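>> In hindsight, connections a Linux host makes to its own eth0 address
>> are normally delivered over lo rather than eth0, so presumably a
>> rule accepting everything on the loopback interface (not just
>> 127.0.0.1 to 127.0.0.1) would have been enough, something like:
>>
>>   iptables -A INPUT -i lo -j ACCEPT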
>> Can I somehow configure it to use something other than TCP/IP here?
>> Or at least switch it to loopback?
> I don't remember offhand how it works in the v1.2 series; I think
> it's different in the v1.3 series (where all MPI processes *only*
> talk to the local orted, vs. MPI processes making direct TCP
> connections back to mpirun and to any other MPI process with which
> they need to bootstrap other communication channels). I'm *guessing*
> that the MPI process <--> orted communication uses either a named
> Unix socket or TCP loopback. Ralph -- can you explain the details?
In the 1.2 series, mpirun spawns a local orted to handle all local
procs. The code that discovers local interfaces specifically ignores
any interface that is not up or is just a local loopback; if no
interface other than loopback is up, no usable TCP interface is found
and mpirun fails. My guess is that whoever wrote that code long, long
ago assumed its sole purpose was to talk to remote nodes, not to loop
back onto yourself.
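The check is roughly of this form (a minimal, standalone sketch using
getifaddrs(); the real Open MPI code is structured differently, but
the filter criteria are the same):

#include <stdio.h>
#include <ifaddrs.h>      /* getifaddrs(), freeifaddrs() */
#include <net/if.h>       /* IFF_UP, IFF_LOOPBACK */
#include <netinet/in.h>
#include <arpa/inet.h>    /* inet_ntop() */

int main(void)
{
    struct ifaddrs *ifaces, *ifa;
    char addr[INET_ADDRSTRLEN];

    if (getifaddrs(&ifaces) != 0) {
        perror("getifaddrs");
        return 1;
    }
    for (ifa = ifaces; ifa != NULL; ifa = ifa->ifa_next) {
        /* Skip anything that is not an IPv4 interface. */
        if (NULL == ifa->ifa_addr || AF_INET != ifa->ifa_addr->sa_family)
            continue;
        /* Skip interfaces that are not up, or that are loopbacks --
           the same criteria described above. */
        if (!(ifa->ifa_flags & IFF_UP) || (ifa->ifa_flags & IFF_LOOPBACK))
            continue;
        inet_ntop(AF_INET,
                  &((struct sockaddr_in *) ifa->ifa_addr)->sin_addr,
                  addr, sizeof(addr));
        printf("usable interface: %s (%s)\n", ifa->ifa_name, addr);
    }
    freeifaddrs(ifaces);
    return 0;
}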
I imagine it could be changed to include loopback, but I would first
need to work with the other developers to ensure there are no
unexpected consequences in doing so.
In the 1.3 series, mpirun handles the local procs itself. Thus, this
issue does not appear and things run just fine.
Ralph