Jeff Squyres wrote:
Greetings. I loosely watched the MPI ABI discussions on the Beowulf
list but refrained from commenting (I stopped checking -- is it still
going on?). Now that the discussion has come to my project's list, I
guess I should speak up. :)
Since I've been "saving up" for a while, this post is a bit lengthy. I
apologize.
Thanks for the effort that you've put into this reply.
First, let me ask a question: what does an MPI ABI *really* get for you?
The obvious answer is that you don't have to recompile. Your app runs
anywhere with any MPI on any system. Well, that is, unless you want to
run on a different architecture (32/64 bit, different CPU, different
platform, etc.). Or if you want to use a different compiler on the same
system (let's not forget C++ and F90 name mangling issues). Or if you
want to use different system or compiler flags (e.g., threading / no
threading, largefile support on Linux, optimization and debugging
support, etc.).
So -- hmm. You can run your MPI app on any MPI implementation that is
on exactly the same platform, architecture, uses the same compilers, and
uses the same system and compiler flags that you want. So an MPI ABI
does not enable the "compile once, run anywhere" scheme -- it really is
much narrower than the casual observer might expect.
Then how do you explain the effort that went into the C++ ABI?
What about the ISV?
Again, on the surface, this looks great -- an ISV can ship *one*
executable and have it work "anywhere". Er, well, anywhere "similar"
(so let's not forget that the ISV will still end up shipping a lot of
executables -- they may be shipping *fewer* executables than before, but
there will still be [far?] more than one).
But does an ISV really want that? Suddenly their app can [potentially]
run in a lot of scenarios that they have not verified through their QA
process. How do you know that you'll get the right answers? How do you
know it won't crap out in the middle of the run because of a missing
symbol (not involving MPI)? The fact is that the app can now run in a
lot of unsupported places, whereas today, the possibilities of this
happening are *much* more limited. ISVs generally choose which MPI
implementations to support, but then their apps *only* run on those
implementations (there are exceptions to this rule, I know).
This all depends on how much detail one uses to specify a platform, and
very few ISVs specify every little detail.
For instance, we specify which glibc we support on the Linux platform.
We support one specific version _and_ all higher versions. We do this
because we rely on the backward compatibility of glibc. The drawback is
that if backward compatibility in glibc is broken, it will probably
show up in our code and our clients will contact us about it.
Additionally, we will have to spend time tracking down the error, only
to find out eventually that it is a backward compatibility problem in
glibc. However, we cannot afford to tell our customers to use one
specific version of glibc, nor can we test a whole range of glibc
versions, so we have to take our chances and rely on the software our
software relies on.
Big companies, however, have the power to specify their platforms in
every little detail (this includes not only the OS and its version but
also the version of every module in the OS), but 95% of ISVs do not
have this power ;-(
This is quite an important point, and is something that several others
have brought up in other mails: all MPI implementations are not created
equal. Take any two production-quality MPI implementations and they'll
have their own quirks and differences. They'll behave and perform
differently. So even though your application is source code portable,
it may not be portable in performance or behavior. This has been a
well-known fact for years (as someone said -- it's an artifact of using
a standard with multiple implementations). This is why ISVs QA-test
their applications with different MPI implementations, and only certify
specific ones. More specifically, if your application works on one MPI
implementation, you can't guarantee that it will work on another. It
*probably* will, but customers don't pay for "probably" (e.g., you can't
know if you're accidentally relying on a quirk of one [or more!]
implementation[s] without testing on exactly the ones that you plan to
support).
If our client has a cluster with InfiniBand and we do not, we will try
to make an executable for him. We cannot test this executable
ourselves, but because our app runs on 5 different platforms we _suppose_
that if the MPI implementation is correct, our app will run correctly on
that switch too. If the customer nevertheless has a problem, we log in
remotely or go on-site to evaluate the problem. However, we cannot
afford to say 'buy another switch', because this would mean that our
customer goes somewhere else and thus we lose the customer.
Again for comparison, we specify neither which BIOSes we support nor
which brands of Ethernet cards, because we suppose they all work as
expected. Without any such assumptions, you would just have to ship
hardware together with your software.
toon