On Fri, Jul 08, 2011 at 07:03:13PM -0400, Jeff Squyres wrote:
> Sorry -- I got distracted all afternoon...

No problem. We all have obligations that we prioritize.

> In addition to what Ralph said (i.e., I'm not sure if the
> CIDR notation stuff made it over to the v1.5 branch or not,
> but it is available from the nightly SVN trunk tarballs:
> http://www.open-mpi.org/nightly/trunk/), here's a few points
> from other mails in this thread...

I'll try this out sometime next week.

> 1. Gus is correct that OMPI is complaining that bge1 doesn't
> exist on all nodes. The MCA parameters that you pass on the
> command line get shipped to *all* MPI processes, and therefore
> generally need to work on all of them. If you have per-host
> MCA parameter values, you can set them a few different ways:
>
> - have a per-host MCA param file, usually in
>   $prefix/etc/openmpi-mca-params.conf
> - have your shell startup files intelligently determine which
>   host you're on and set the corresponding MCA environment variable
>   as appropriate (e.g., on the head node, set the env variable
>   OMPI_MCA_btl_tcp_if_include to bge1, and set it to bge0 on the others)
>
> Those are a little klunky, but having a heterogeneous setup like this
> is not common, so we haven't really optimized the ability to set
> different MCA params on different servers.

There is no compelling reason for me to keep bge0 on the 10.208
subnet and bge1 on the 192.168 subnet on node10. If I switch the
two, so that all bge0 NICs are on 192.168, then I suppose that
--mca btl_tcp_if_include bge0 should work. I'll try this next week,
if I can kick everyone off the cluster for a few minutes.

> 2. I am curious to figure out why the automatic reachability
> computations isn't working for you. Unfortunately, the code
> to compute the reachability is pretty gnarly. :-\ The code
> to find the IP interfaces on your machines is in opal/util/if.c.
> That *should* be working -- there's *BSD-specific code in there
> that has been verified by others in the past... but who knows?
> Perhaps it has bit-rotted...?

I'm running a Feb 2011 version of the bleeding edge FreeBSD, which
will become FreeBSD 9.0 in a few months. Perhaps something has
changed in FreeBSD's networking code. I'll see if I can understand
opal/util/if.c sufficiently to see what's happening.

> The code to take these IP interfaces
> and figure out if a given peer is reachable is in
> ompi/mca/btl/tcp/btl_tcp_proc.c:mca_btl_tcp_proc_insert().
> This requires a little explanation...

(snip to keep this short)

> This was a long explanation -- I hope it helps...
> Is there any chance you could dig into this to see what's going on?

Thanks, I'll see what I can ferret out of the system.

> We unfortunately don't have access to any BSD machines to test this
> on, ourselves. It works on other OS's, so I'm curious as to why it
> doesn't seem to work for you. :-(

I can arrange access on the cluster in question. ;-)

--
Steve
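
For reference, the per-host idea from point 1 (pick the interface by host, then export OMPI_MCA_btl_tcp_if_include) could be sketched roughly as below. This is only an illustration in Python rather than in a shell startup file. It assumes that node10 is the head node and that every other node should use bge0. The helper names tcp_if_for_host and mca_environment are made up for this sketch and are not part of Open MPI.

```python
import os
import socket

# Assumption: node10 is the head node, where the cluster-facing NIC is bge1.
# Every other host is assumed to use bge0, as suggested in the thread.
HEAD_NODE = "node10"


def tcp_if_for_host(hostname, head_node=HEAD_NODE):
    """Return the interface name to use for the TCP BTL on this host."""
    # Compare short host names so "node10.example.org" matches "node10".
    short_name = hostname.split(".", 1)[0]
    return "bge1" if short_name == head_node else "bge0"


def mca_environment(hostname=None, base_env=None):
    """Return a copy of the environment with the per-host MCA value set."""
    env = dict(os.environ if base_env is None else base_env)
    host = socket.gethostname() if hostname is None else hostname
    # Open MPI reads MCA parameters from OMPI_MCA_<name> environment variables.
    env["OMPI_MCA_btl_tcp_if_include"] = tcp_if_for_host(host)
    return env


if __name__ == "__main__":
    # Print the value that would be exported on the current host.
    print(mca_environment()["OMPI_MCA_btl_tcp_if_include"])
```

The same logic would normally live in the shell startup files on each node, so that every MPI process sees the value appropriate for its own host.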