On Mon, 2007-12-17 at 20:58 -0500, Brian Dobbins wrote:
> Hi Marco and Jeff,
> 
>   My own knowledge of OpenMPI's internals is limited, but I thought
> I'd add my less-than-two-cents...
> 
>         > I've found only a way in order to have tcp connections binded
>         > only to the eth1 interface, using both the following MCA
>         > directives in the command line:
>         >
>         > mpirun .... --mca oob_tcp_include eth1 --mca oob_tcp_include
>         > lo,eth0,ib0,ib1 .....
>         >
>         > This sounds me as bug.
>         
>         
>         Yes, it does.  Specifying the same MCA param twice on the
>         command line results in undefined behavior -- it will only
>         take one of them, and I assume it'll take the first (but I'd
>         have to check the code to be sure).
> 
>   I think that Marco intended to write:
>   mpirun .... --mca oob_tcp_include eth1 --mca oob_tcp_exclude
> lo,eth0,ib0,ib1 ... 

No, I intended to write exactly what I wrote. --mca mpi_show_mca_params
reports the double statement exactly as if I had written a single
statement, as follows:

--mca oob_tcp_include eth1,lo,eth0,ib0,ib1

> 
>   Is this correct?  So you're not specifying include twice, you're
> specifying include and exclude, so each interface is explicitly stated
> in one list or the other.  I remember encountering this behaviour as
> well, in a slightly different format, but I can't seem to reproduce it
> now either. 

Note that the two lists never intersect.

>  That said, with these options, won't the MPI traffic (as opposed to
> the OOB traffic) still use the eth1,ib0 and ib1 interfaces?  You'd
> need to add '-mca btl_tcp_include eth1' in order to say it should only
> go over that NIC, I think. 

Yes, I know. In fact, -mca btl_tcp_[if]_exclude lo,eth0,ib0,ib1
seems to work fine. I have been using this MCA parameter since Open MPI
1.2.1, and the trouble with oob_tcp_[if]_[in|ex]clude seemed quite
strange to me; after all, the code used for the parser should be more or
less the same ...

> 
>   As for the 'connection errors', two bizarre things to check are,
> first, that all of your nodes using eth1 actually have
> correct /etc/hosts mappings to the other nodes.  One system I ran on
> had this problem when some nodes had an IP address for node002 as one
> thing, and another node had node002's IP address as something
> different.   This should be easy enough by trying to run on one node
> first, then two nodes that you're sure have the correct addresses. 

Yes, I've already verified that. 
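
For anyone who wants to repeat that check, here is a minimal sketch in
Python of the comparison Brian describes. It assumes you have already
collected the text of each node's /etc/hosts by some means (not shown);
the function names parse_hosts and find_conflicts are made up for this
example and are not part of Open MPI or any other tool.

```python
from collections import defaultdict


def parse_hosts(text):
    """Return {hostname: ip} from the text of an /etc/hosts-style file."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            # Keep the first address listed for a name.
            mapping.setdefault(name, ip)
    return mapping


def find_conflicts(hosts_by_node):
    """Return {hostname: {ip: [nodes]}} for names that map to more than one IP.

    hosts_by_node maps a node label to the text of that node's hosts file.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for node, text in hosts_by_node.items():
        for name, ip in parse_hosts(text).items():
            seen[name][ip].append(node)
    return {
        name: {ip: sorted(nodes) for ip, nodes in ips.items()}
        for name, ips in seen.items()
        if len(ips) > 1
    }
```

An empty result from find_conflicts means every hostname maps to the
same address on all the nodes supplied; any entry returned names the
nodes that disagree and the addresses they hold.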

> 
>   .. The second situation is if you're launching an MPMD program.
> Here, you need to use '-gmca <whatever>' instead of '-mca <whatever>'.
> 

No, currently I'm using only SPMD ones, and I hope to use them for the
rest of the century :-)

>   Hope some of that is at least a tad useful.  :) 
> 

Thank you very much, Brian,

Marco 

>   Cheers,
>   - Brian
> 
-- 
-----------------------------------------------------------------
 Marco Sbrighi  m.sbri...@cineca.it

 HPC Group
 CINECA Interuniversity Computing Centre
 via Magnanelli, 6/3
 40033 Casalecchio di Reno (Bo) ITALY
 tel. 051 6171516
