On Mon, 2007-12-17 at 20:58 -0500, Brian Dobbins wrote:
> Hi Marco and Jeff,
>
> My own knowledge of OpenMPI's internals is limited, but I thought
> I'd add my less-than-two-cents...
>
> > > I've found only a way in order to have tcp connections binded only to
> > > the eth1 interface, using both the following MCA directives in the
> > > command line:
> > >
> > > mpirun .... --mca oob_tcp_include eth1 --mca oob_tcp_include
> > > lo,eth0,ib0,ib1 .....
> > >
> > > This sounds me as bug.
> >
> > Yes, it does. Specifying the MCA same param twice on the command line
> > results in undefined behavior -- it will only take one of them, and I
> > assume it'll take the first (but I'd have to check the code to be sure).
>
> I think that Marco intended to write:
> mpirun .... --mca oob_tcp_include eth1 --mca oob_tcp_exclude
> lo,eth0,ib0,ib1 ...

No, I intended to write exactly what I wrote. With the double statement,
--mca mpi_show_mca_params reports the parameter exactly as if I had
written a single statement, as follows:

--mca oob_tcp_include eth1,lo,eth0,ib0,ib1

>
> Is this correct?  So you're not specifying include twice, you're
> specifying include and exclude, so each interface is explicitly stated
> in one list or the other.  I remember encountering this behaviour as
> well, in a slightly different format, but I can't seem to reproduce it
> now either.

Note that the two lists never intersect.

> That said, with these options, won't the MPI traffic (as opposed to
> the OOB traffic) still use the eth1,ib0 and ib1 interfaces?  You'd
> need to add '-mca btl_tcp_include eth1' in order to say it should only
> go over that NIC, I think.

Yes, I know; in fact -mca btl_tcp_[if]_exclude lo,eth0,ib0,ib1 seems to
work fine. I have been using this MCA parameter since Open MPI 1.2.1, so
the trouble with oob_tcp_[if]_[in|ex]clude seemed quite strange to me:
after all, the code used for the parser should be more or less the same.

>
> As for the 'connection errors', two bizarre things to check are,
> first, that all of your nodes using eth1 actually have
> correct /etc/hosts mappings to the other nodes.  One system I ran on
> had this problem when some nodes had an IP address for node002 as one
> thing, and another node had node002's IP address as something
> different.  This should be easy enough by trying to run on one node
> first, then two nodes that you're sure have the correct addresses.

Yes, I've already verified that.

>
> .. The second situation is if you're launching an MPMD program.
> Here, you need to use '-gmca <whatever>' instead of '-mca <whatever>'.
>

No, currently I'm using only SPMD programs, and I hope to keep using
them for the rest of the century :-)

> Hope some of that is at least a tad useful.
> :)
>

Thank you very much, Brian.

Marco

> Cheers,
> - Brian
>
--
-----------------------------------------------------------------
Marco Sbrighi  m.sbri...@cineca.it
HPC Group
CINECA Interuniversity Computing Centre
via Magnanelli, 6/3
40033 Casalecchio di Reno (Bo) ITALY
tel. 051 6171516
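
For the /etc/hosts check mentioned above (nodes disagreeing about the
address of, say, node002), here is a minimal Python sketch. It assumes
the contents of each node's /etc/hosts have already been collected into
a dictionary keyed by node name, for example over ssh. The function
names and the sample addresses are hypothetical illustrations, not part
of Open MPI.

```python
from collections import defaultdict


def parse_hosts(text):
    """Parse /etc/hosts-style text into {hostname: ip} (first mapping wins)."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue  # skip blank or comment-only lines
        ip, *names = line.split()
        for name in names:
            mapping.setdefault(name, ip)
    return mapping


def find_conflicts(hosts_by_node):
    """Return {hostname: {ip: [nodes]}} for names that map to more than one IP."""
    seen = defaultdict(lambda: defaultdict(list))
    for node, text in hosts_by_node.items():
        for name, ip in parse_hosts(text).items():
            seen[name][ip].append(node)
    return {
        name: {ip: sorted(nodes) for ip, nodes in ips.items()}
        for name, ips in seen.items()
        if len(ips) > 1
    }


if __name__ == "__main__":
    # Hypothetical sample data: node names and addresses are made up.
    sample = {
        "node001": "127.0.0.1 localhost\n10.0.1.2 node002\n",
        "node003": "127.0.0.1 localhost\n10.0.9.2 node002  # stale entry\n",
    }
    for name, ips in find_conflicts(sample).items():
        print(name, ips)
```

An empty result means every collected file agrees on every hostname. Any
entry returned lists which nodes hold which address for that name.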