On Jan 27, 2009, at 10:19 AM, Peter Kjellstrom wrote:
It is worth clarifying a point in this discussion that I neglected to mention in my initial post: although Open MPI may not work *by default* with heterogeneous HCAs/RNICs, it is quite possible/likely that if you manually configure Open MPI to use the same verbs/hardware settings across all your HCAs/RNICs (assuming that you use a set of values that is compatible with all your hardware), MPI jobs spanning multiple different kinds of HCAs or RNICs will work fine. See this post on the devel list for a few more details:

http://www.open-mpi.org/community/lists/devel/2009/01/5314.php
> So is it correct that each rank will check its HCA model and then pick suitable settings for that HCA?
Correct. We have an INI-style file that is installed in $pkgdir/mca-btl-openib-device-params.ini (typically expands to $prefix/share/openmpi/mca-btl-openib-device-params.ini). This file contains a bunch of device-specific parameters, but it also has a "general" section that can be applied to any device if no specific match is found.
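For illustration, an entry in such a file might look roughly like the sketch below; the section names and values here are invented for the example, not copied from the shipped file:

```ini
; Hypothetical sketch of the INI format -- names and values are
; illustrative only, not the actual shipped defaults.
[default]
use_eager_rdma = 0
mtu = 1024

[Some Vendor Some HCA]
vendor_id = 0x1234
vendor_part_id = 5678
use_eager_rdma = 1
mtu = 2048
```

The idea is that a rank matches its device's vendor/part IDs against the device-specific sections and falls back to the general section when no match is found.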
> If so, maybe Open MPI could fall back to very conservative settings if more than one HCA model was detected among the ranks. Or would this require communication at a stage where that would be complicated and/or ugly?
Today we don't do this kind of check; we just assume that every other MPI process is using the same hardware and/or that the settings pulled from the INI file will be compatible. AFAIK, most (all?) other MPIs do the same thing.
We *could* do that kind of check, but:

a) there hasn't been enough customer demand for it / no one has submitted a patch to do so

b) it might be a bit complicated because the startup sequence in the openib BTL is a little complex

c) we are definitely moving to a scenario (at scale) where there is little/no communication at startup to coordinate information among all of the MPI peer processes; this strategy might be problematic in those scenarios (i.e., the coordination/determination of "conservative" settings would have to be done by a human and likely pre-posted to a file on each node -- still hand-waving a bit because that design isn't finalized/implemented yet)

d) programmatically finding "conservative" settings that are workable across a wide variety of devices may be problematic because individual device capabilities can vary wildly (does it have SRQ? can it support more than one BSRQ? what's a good MTU? ...?)
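To make point d) concrete, here is a rough sketch (plain Python, not actual Open MPI code) of what "compute conservative settings across ranks" might look like once every rank's device capabilities had been gathered; all field names are invented for illustration:

```python
# Hypothetical sketch: given each rank's reported device capabilities,
# derive settings that are safe for *all* of them.  Field names are
# invented for illustration; this is not Open MPI code.

def conservative_settings(per_rank_caps):
    """per_rank_caps: list of capability dicts, one per MPI rank."""
    return {
        # A feature like SRQ can only be used if every device has it.
        "use_srq": all(c["has_srq"] for c in per_rank_caps),
        # The usable MTU is the smallest MTU any device supports.
        "mtu": min(c["max_mtu"] for c in per_rank_caps),
        # The BSRQ count is limited by the least-capable device.
        "num_bsrq": min(c["max_bsrq"] for c in per_rank_caps),
    }

caps = [
    {"has_srq": True,  "max_mtu": 2048, "max_bsrq": 4},  # e.g. a newer HCA
    {"has_srq": False, "max_mtu": 1024, "max_bsrq": 1},  # e.g. an old RNIC
]
print(conservative_settings(caps))
# -> {'use_srq': False, 'mtu': 1024, 'num_bsrq': 1}
```

Even this toy version shows the problem: one old device drags every setting down to its level, which is exactly the performance trap described in the scenario below.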
I think d) is a big sticking point; we *could* pick extremely conservative settings that should probably work everywhere. But I can see at least one potentially problematic scenario:
- cluster has N nodes
- a year later, an HCA in 1 node dies
- get a new HCA, perhaps even from a different vendor
- capabilities of the new HCA and old HCAs are different
- so OMPI falls back to "extreme conservative" settings
- jobs that run on that one node suffer in performance
- jobs that do not run on that node see "normal" performance
- users are confused
I suppose that we could print a Big Hairy Warning(tm) if we fall back
to extreme conservative settings, but it still seems to create the
potential to violate the Law of Least Astonishment.
--
Jeff Squyres
Cisco Systems