On Jan 27, 2009, at 10:19 AM, Peter Kjellstrom wrote:
It is worth clarifying a point in this discussion that I neglected to mention in my initial post: although Open MPI may not work *by default* with heterogeneous HCAs/RNICs, it is quite possible/likely that if you manually configure Open MPI to use the same verbs/hardware settings across all your HCAs/RNICs (assuming that you use a set of values that is compatible with all your hardware), MPI jobs spanning multiple different kinds of HCAs or RNICs will work fine. See this post on the devel list for a few more details:

http://www.open-mpi.org/community/lists/devel/2009/01/5314.php
> So is it correct that each rank will check its HCA model and then pick suitable settings for that HCA?
Correct. We have an INI-style file that is installed in $pkgdir/mca-btl-openib-device-params.ini (typically expands to $prefix/share/openmpi/mca-btl-openib-device-params.ini). This file contains a bunch of device-specific parameters, but it also has a "general" section that can be applied to any device if no specific match is found.
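For illustration, an entry in such a file might look roughly like the sketch below; the section names and values here are invented for the example, not copied from the shipped file:

```ini
; Hypothetical sketch of the INI format -- names and values are
; illustrative only, not the actual shipped defaults.
[default]
use_eager_rdma = 0
mtu = 1024

[Some Vendor Some HCA]
vendor_id = 0x1234
vendor_part_id = 5678
use_eager_rdma = 1
mtu = 2048
```

The idea is that a rank matches its device's vendor/part IDs against the device-specific sections and falls back to the general section when no match is found.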
> If so, maybe Open MPI could fall back to very conservative settings if more than one HCA model was detected among the ranks. Or would this require communication at a stage where that would be complicated and/or ugly?
Today we don't do this kind of check; we just assume that every other MPI process is using the same hardware and/or that the settings pulled from the INI file will be compatible. AFAIK, most (all?) other MPIs do the same thing.
We *could* do that kind of check, but:

a) there hasn't been enough customer demand for it / no one has submitted a patch to do so

b) it might be a bit complicated because the startup sequence in the openib BTL is a little complex

c) we are definitely moving to a scenario (at scale) where there is little/no communication at startup to coordinate information among all of the MPI peer processes; this strategy might be problematic in those scenarios (i.e., the coordination/determination of "conservative" settings would have to be done by a human and likely pre-posted to a file on each node -- still hand-waving a bit because that design isn't finalized/implemented yet)

d) programmatically finding "conservative" settings that are workable across a wide variety of devices may be problematic because individual device capabilities can vary wildly (does it have SRQ? can it support more than one BSRQ? what's a good MTU? ...?)
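To make point d) concrete, here is a rough sketch (plain Python, not actual Open MPI code) of what "compute conservative settings across ranks" might look like once every rank's device capabilities had been gathered; all field names are invented for illustration:

```python
# Hypothetical sketch: given each rank's reported device capabilities,
# derive settings that are safe for *all* of them.  Field names are
# invented for illustration; this is not Open MPI code.

def conservative_settings(per_rank_caps):
    """per_rank_caps: list of capability dicts, one per MPI rank."""
    return {
        # A feature like SRQ can only be used if every device has it.
        "use_srq": all(c["has_srq"] for c in per_rank_caps),
        # The usable MTU is the smallest MTU any device supports.
        "mtu": min(c["max_mtu"] for c in per_rank_caps),
        # The BSRQ count is limited by the least-capable device.
        "num_bsrq": min(c["max_bsrq"] for c in per_rank_caps),
    }

caps = [
    {"has_srq": True,  "max_mtu": 2048, "max_bsrq": 4},  # e.g. a newer HCA
    {"has_srq": False, "max_mtu": 1024, "max_bsrq": 1},  # e.g. an old RNIC
]
print(conservative_settings(caps))
# -> {'use_srq': False, 'mtu': 1024, 'num_bsrq': 1}
```

Even this toy version shows the problem: one old device drags every setting down to its level, which is exactly the performance trap described in the scenario below.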
I think d) is a big sticking point; we *could* pick extremely conservative settings that should probably work everywhere. But I can see at least one potentially problematic scenario:
- cluster has N nodes
- a year later, an HCA in 1 node dies
- get a new HCA, perhaps even from a different vendor
- capabilities of the new HCA and old HCAs are different
- so OMPI falls back to "extreme conservative" settings
- jobs that run on that one node suffer in performance
- jobs that do not run on that node see "normal" performance
- users are confused
I suppose that we could print a Big Hairy Warning(tm) if we fall back
to extreme conservative settings, but it still seems to create the
potential to violate the Law of Least Astonishment.
--
Jeff Squyres
Cisco Systems