Greetings. I loosely watched the MPI ABI discussions on the Beowulf
list but refrained from commenting (I stopped checking -- is it still
going on?). Now that the discussion has come to my project's list, I
guess I should speak up. :)
Since I've been "saving up" for a while, this post is a bit lengthy. I
apologize.
-----
On the surface, an ABI looks great. Greg's slides show a bunch of
reasons why an ABI could be a Good Thing(tm) and how a bunch of groups
of people could "win" -- the scenarios outlined are certainly
attractive.
But I think that there are deeper problems. In short, I believe that
MPI is only one part of a complex web of issues. Even if we wave our
hands and assume that we have an MPI ABI today, there are many other
factors that prevent the scenarios that are described in Greg's slides.
I'll discuss these below.
Because e-mail is such an imprecise medium, and because I don't
personally know many (most?) of you, I want to stress right up front
that I am *not* attempting to start a holy war, flame fest, or any
other such nonsense. This is all simply the opinions of an MPI
implementor. Although I've been around the block a few times, I'm
certainly not going to claim to be a definitive expert on all systems
everywhere. This mail simply summarizes my views.
-----
First, let me ask a question: what does an MPI ABI *really* get for you?
The obvious answer is that you don't have to recompile. Your app runs
anywhere with any MPI on any system. Well, that is, unless you want to run
on a different architecture (32/64 bit, different CPU, different
platform, etc.). Or if you want to use a different compiler on the
same system (let's not forget C++ and F90 name mangling issues). Or if
you want to use different system or compiler flags (e.g., threading /
no threading, largefile support on Linux, optimization and debugging
support, etc.).
So -- hmm. You can run your MPI app on any MPI implementation that is
on exactly the same platform, architecture, uses the same compilers,
and uses the same system and compiler flags that you want. So an MPI
ABI does not enable the "compile once, run anywhere" scheme -- it
really is much narrower than the casual observer might expect.
But let's say that even that would be a big "win" for users -- you have
"similar" systems with "similar" MPI implementations, and you can run
your MPI app on any combination of them. So how does the end user
choose which MPI implementation to use?
The old way was to set your $PATH -- ensure that you have the "right"
mpicc, mpirun, etc. Sure, you can have 20+ MPI implementations loaded
on your cluster (and many clusters do), and they can all
peacefully co-exist (within reason). Switching between MPI
implementations is [usually] a matter of the user changing their $PATH
(sometimes this has to be in their shell startup files).
As an MPI implementor and support provider, let me tell you that
getting users to do this correctly is a nightmare. Users generally
understand PATH, but getting them to set it right [consistently] is
quite difficult. I'm not making any judgements on whether setting
$PATH to switch MPI implementations is a good system or not -- I'm just
saying that that's [usually] what it is. And it's difficult enough for
the user who doesn't care why/how MPI works.
The new way is based on MPI shared libraries -- users will change their
LD_LIBRARY_PATH instead of their PATH. Some will claim that this is
equivalent; if a user can change their PATH, they can certainly change
their LD_LIBRARY_PATH. I'm guessing, however, that it will be much
harder. Users generally understand $PATH; how many of them have ever
heard of LD_LIBRARY_PATH and/or will understand (or care) what it is
for? I have visions of "set ld library path = /opt/xyz-mpi".
But I digress -- whether the users will get it right or not is
speculative. My point here is that you have traded one "switch"
mechanism for another that, from a procedural standpoint, is equivalent
(setting LD_LIBRARY_PATH vs. setting PATH).
Ok, so let's wave our hands again and assume that this is all working
fine and good, and users can switch between MPI implementations on
their similar systems with ease. What have we solved? My cluster
still has 20+ MPI implementations on it, and I (the user) still have to
choose which to use. I don't have to recompile my app, but now I've
got a somewhat-intangible way to know which MPI I'm using (look at
$LD_LIBRARY_PATH). Users are now quite accustomed to "myapp-lam",
"myapp-ftmpi", "myapp-lampi", "myapp-mpich-gm", etc., where the
difference is quite obvious. Now it's much less obvious.
Is this a good or a bad thing? I don't know -- I just raise the point.
When you only have one binary, it becomes harder (or, better put,
takes more effort) to ensure that you're running with the MPI
implementation that you intend to. Mistakes (by end users and/or
pre-bundled MPI software) will become easier to make.
A final thought here: the -rpath (and equivalent) linker flags are
extremely convenient for users. You compile against shared libraries
and they are magically "found" at run time, regardless of your
LD_LIBRARY_PATH. This is particularly helpful for packages that are
installed in non-system-default locations (like the 20 MPI
implementations you have installed on your cluster). Having an MPI ABI
will pretty much stop this practice -- you don't want to link an MPI
application with -rpath because you don't want to (or can't) assume
which MPI a user will want to use. So the user *has* to set their
LD_LIBRARY_PATH -- you no longer have an MPI implementation that "just
works"; users must do one [more] thing before an MPI application will
run.
Summary:
- With an MPI ABI, you can only run on "similar" systems
- Users now set their LD_LIBRARY_PATH instead of PATH
- It's less obvious which MPI the user is actually using
- -rpath linker flag can/should not be used; users *have* to set
LD_LIBRARY_PATH
-----
What about the ISV?
Again, on the surface, this looks great -- an ISV can ship *one*
executable and have it work "anywhere". Er, well, anywhere "similar"
(so let's not forget that the ISV will still end up shipping a lot of
executables -- they may be shipping *fewer* executables than before,
but there will still be [far?] more than one).
But does an ISV really want that? Suddenly their app can [potentially]
run in a lot of scenarios that they have not verified through their QA
process. How do you know that you'll get the right answers? How do
you know it won't crap out in the middle of the run because of a
missing symbol (not involving MPI)? The fact is that the app can now
run in a lot of unsupported places, whereas today, the possibilities of
this happening are *much* more limited. ISVs generally choose which
MPI implementations to support, and their apps then *only* run on those
implementations (there are exceptions to this rule, I know).
This is quite an important point, and is something that several others
have brought up in other mails: all MPI implementations are not created
equal. Take any two production-quality MPI implementations and they'll
have their own quirks and differences. They'll behave and perform
differently. So even though your application is source code portable,
it may not be portable in terms of performance or behavior. This has been a
well-known fact for years (as someone said -- it's an artifact of using
a standard with multiple implementations). This is why ISVs QA-test
their applications with different MPI implementations, and only certify
specific ones. More specifically, if your application works on one MPI
implementation, you can't guarantee that it will work on another. It
*probably* will, but customers don't pay for "probably" (e.g., you
can't know if you're accidentally relying on a quirk of one [or more!]
implementation[s] without testing on exactly the ones that you plan to
support).
I'm not an ISV, so I won't pretend to speak for them, but several with
whom I've had conversations actually *prefer* having tight control over where
their apps run (regardless of the mechanism) -- not just in terms of
QA, but also in terms of support. Granted, today's system of
enforcing that is rather klunky (you won't get any disagreement from me
there), but it gives ISVs what they want (at least, the ones that I
have talked to).
Let's again wave our hands and assume that we have an MPI ABI, and
imagine a support call for an ISV's MPI application:
Tech: "Hello, welcome to ABC support."
User: "I'm having a problem with your XYZ product."
Tech: "Ah yes, this product uses MPI. Which MPI are you using?"
User: "I'm using JKL MPI."
Tech: "I'm sorry, we don't support JKL MPI."
That's a bit fanciful and simplified, but my point here is that ISVs
are still going to choose which MPIs they want to / can support. If
you (the user) use something outside of that set, you're unsupported.
This may be confusing for users because the application *runs* (or
seems like it is *supposed* to run) -- they have an MPI, right? So why
doesn't the ISV support that MPI?
More to the point: it is better for an application to not run at all
than to run poorly (or, even worse, silently/unknowingly generate
incorrect results). Having a clear-cut distinction here is a Good
Thing(tm).
Also, let's not forget that some ISVs have chosen to avoid today's
klunky mechanisms and simply statically compile a libmpi.a into their
application. They include a stripped down MPI implementation
(potentially not their own) inside their own app, and provide varying
degrees of hiding the MPI from the user. Hence, the ISV has delivered
a solution that will always work.
Granted, this isn't [yet] possible for all scenarios. But it works
quite well in a wide variety of environments (let's not discount the
number of clusters that are being bought outside of "traditional" HPC
environments -- bio, chemical, etc., where TCP-based networks are used
heavily).
While we're on the topic: as cited in Greg's slides,
non-traditional-HPC parallel applications (bio, chemical, etc.) are not
going to tolerate recompiling. They expect to get a binary that "just
works". This is certainly a valid point. However, these types of
users will also be buying a complete solution, from hardware all the
way to application (as much as possible). Specifically, these users
don't care (and sometimes don't even know) which MPI they are using.
They don't care about running with 20 different MPI implementations.
They'll use one -- whichever one their application is bundled with --
and will never use another (on that system, at least). So an MPI ABI
may not be very important to them.
Another solution that some ISVs use today is to have a thin message
passing abstraction layer. So their main code base consists of 98%
application-related stuff; 2% message passing stuff. Engineered
properly, the 98% makes calls into a separate library (i.e., the 2%)
that funnels all access to MPI. Hence, the ISVs really only need to
recompile the small library that interfaces to MPI -- not their entire
application -- to switch between MPI implementations.
Don't get me wrong -- I'm not saying that this is a perfect solution.
All I'm saying is that with proper planning and engineering, it's not a
*bad* one. Indeed, with slightly more effort, an ISV application could
have a dynamic module that opens different libraries to talk to
different MPI implementations (e.g., dlopen("mpi_interface_lam.so"), or
dlopen("mpi_interface_mpich.so"), etc.).
Hence, it is possible for ISVs to ship MPI-independent applications
*today*. More specifically, this would solve many of the same issues
that an MPI ABI would solve *without requiring anything additional from
MPI implementations* (and all the baggage that goes along with that) --
but you can still only run on "similar" systems.
Summary:
- ISVs are still only going to support some MPI implementations
- ISVs lose control over which MPI implementations their apps are used
with
- Potential user confusion because it's less obvious which MPI they're
using
- ISVs can ship static executables *today*
- ISVs can write binary MPI-independent applications *today*
-----
Much of what is being discussed first centers around standardizing
mpi.h. I think we all agree that it was not the Forum's goal to
standardize mpi.h -- they deliberately left it unstandardized (to
allow implementors to do whatever they want/need).
The main differences between mpi.h's can be summarized as:
1. values of constants
2. size of MPI_Status
3. size and types of MPI handles (crassly: pointer vs. integer)
#1 is probably fairly easy to solve, but it's dependent upon #3. #2
may present some arguments between implementors. #3 may introduce some
fist fights. ;-) Canonical example: MPICH* uses integers; Open MPI
uses pointers. I don't think that either side is willing to give them
up -- a simple reason (but definitely not the only reason -- this mail
is not intended to open that debate) is that the amount of code that
will change as a result of converting from int->pointer or pointer->int
is quite large. Admittedly, each change is fairly small, but it's
still a *lot* of small changes.
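To make the integer-vs-pointer point concrete, the two styles look
roughly like this (paraphrased for illustration -- these are not actual
excerpts from either implementation's mpi.h):

  /* Integer-style handles (MPICH-like, paraphrased): */
  typedef int MPI_Comm;
  typedef int MPI_Datatype;
  #define MPI_COMM_WORLD ((MPI_Comm) 0x44000000)

  /* Pointer-style handles (Open MPI-like, paraphrased): */
  typedef struct ompi_communicator_t *MPI_Comm;
  typedef struct ompi_datatype_t *MPI_Datatype;
  extern struct ompi_communicator_t ompi_mpi_comm_world;
  #define MPI_COMM_WORLD (&ompi_mpi_comm_world)

A binary compiled against one of these cannot possibly work against the
other: the handle sizes differ (particularly on 64 bit platforms), the
predefined constants differ, and every prototype that takes or returns
a handle differs. Picking either style for an ABI means the other camp
rewrites a large amount of code.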
Let's also not forget MPI for smaller niche environments -- do we
really want to force an embedded MPI to use 32 or 64 bit handles?
That's a silly example, of course, but the point I'm trying to make is
that MPI spans a wide range of platforms -- what is suitable for one is
not necessarily suitable for another (even at the mpi.h level).
A user recently asked me, "So why should I suffer because of religious
differences between MPI implementors?"
My reply to that is "How exactly are you suffering?" Is recompiling
really that difficult? The fact remains that you're still going to
have many different MPI implementations out there -- an ABI will not
change this. Does slightly changing the mechanism by which you switch
your application between them really, fundamentally, make life better?
I'll return to this question later.
Every MPI implementation has different goals (research, production,
latency, bandwidth, portability / specificity, etc.). These goals
strongly influence the design of that implementation and have tangible
impacts on mpi.h.
Summary:
- Standardizing the size/type of MPI handles is problematic
- What is appropriate on one platform is not necessarily appropriate on
another
- Every MPI implementation has different goals, which even affects mpi.h
-----
As mentioned multiple times in the slides, having a common mpi.h is
only half the story. You'll still need a common mpirun to really make
things transparent to the user (you may even be able to hide some of
the LD_LIBRARY_PATH issues if you have a good uber-mpirun). The slides
argue that you can't support multiple batch queue systems in most
current MPI implementations.
I strongly disagree with this. LAM/MPI has been doing it for years.
LAM currently supports -- out of the box -- the run-time decision of
whether to use rsh/ssh, PBS, SLURM, BProc (both LANL and Scyld
variants), and limited scenarios for Globus. Open MPI will support
even more than this.
*** Sidenote: this same argument holds for support of different network
interconnects. LAM/MPI has been supporting the run-time decision of
which interconnect to use for years. Open MPI will continue this
capability. But let's get back to the RTE discussion...
I firmly believe -- and the software to back up my belief is freely
available -- that this is purely a quality of implementation issue. If
MPI implementations want to support multiple back-end run-time
environments (RTEs), they can (this all goes back to the goals of an
MPI implementation). All batch systems have some kind of interface
(API or command line) to launch processes; although there is varying
support for monitoring and killing, it's simply a question of the MPI
implementation using the interface to launch its MPI processes. It's
not difficult; support for all the RTE systems listed above is
approximately 2% of LAM/MPI 7.1.2's code base (in terms of lines of
code).
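For a sense of how small this kind of support can be, here is a rough
sketch of a run-time-selectable launcher interface. This is
illustrative only -- it is not LAM/MPI's or Open MPI's actual internal
API, and the environment-variable checks are just one plausible way to
detect an RTE:

  #include <stdlib.h>
  #include <stddef.h>

  /* One entry per supported back-end RTE. */
  typedef struct {
      const char *name;        /* "tm" (PBS), "slurm", "rsh", ... */
      int (*available)(void);  /* is this RTE present right now? */
  } launcher_t;

  /* Each module decides whether its RTE is present; checking an
     environment variable that the batch system sets is typical. */
  static int tm_available(void)    { return getenv("PBS_ENVIRONMENT") != NULL; }
  static int slurm_available(void) { return getenv("SLURM_JOBID") != NULL; }
  static int rsh_available(void)   { return 1; /* fallback */ }

  static launcher_t launchers[] = {
      { "tm",    tm_available },
      { "slurm", slurm_available },
      { "rsh",   rsh_available },
  };

  /* Pick the first launcher whose RTE is actually available. */
  static launcher_t *select_launcher(void)
  {
      for (size_t i = 0; i < sizeof(launchers) / sizeof(launchers[0]); ++i) {
          if (launchers[i].available()) {
              return &launchers[i];
          }
      }
      return NULL;
  }

Each module then wraps its RTE's own launch interface (the TM API,
srun, rsh/ssh, etc.); the rest of the MPI implementation never has to
know which one was chosen.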
I'm wary of standardizing on an uber-mpirun. There's more to MPI_INIT
than just discovering your peers and your identity (Greg mentioned a
few issues in his slides: IO forwarding, process monitoring, etc.). In
some cases, there is no out-of-band channel for newly-started MPI
processes to contact mpirun; MPI_INIT has to figure out its peers and
identity based on what the back-end RTE gave it (e.g., Quadrics,
Portals, etc.). Hence, you can't hide everything in an uber-mpirun --
the MPI sometimes *needs* knowledge of the back-end RTE.
You're also going to be standardizing many of the MPI-2 dynamic
functions, MPI_FINALIZE, and MPI_ABORT. That's a *lot* of ground to
cover (and to get implementors with different opinions and goals to
agree upon). Indeed, in MPI-2, the Forum went so far as to say
(paraphrasing, obviously) "We didn't specify the exact behavior of
MPI_FINALIZE on purpose."
An MPI implementation's internal RTE is the soul of the machine; everyone has done
theirs entirely differently. Given that mpi.h is halfway specified by
the MPI standard and we *still* can't agree on the specifics, it is
difficult for me to imagine standardizing critical elements of the
back-end of all MPI implementations where there is currently no
uniformity at all. Consider: as I mentioned above, standardizing mpi.h
means touching potentially a lot of code in an MPI implementation.
Standardizing the internal run-time environment will touch a lot *more*
code in an MPI implementation. That's a hard sell.
Let's also not forget that some MPI implementations distinguish
themselves by their run-time environments. Some have really good RTEs.
Some don't. But consider: if performance is roughly equivalent among
multiple MPI implementations, users will choose by feature sets. I
speak from experience -- long before I became an MPI implementor, I
chose to use a specific MPI implementation because it had a fast mpirun
and when I hit ctrl-C, all my applications were guaranteed to be
killed. If you eliminate these differences, you're asking some MPI
implementations to standardize themselves out of existence. That, too,
is a pretty hard sell.
Finally, this uber-mpirun will have a consistent syntax across all
platforms and RTEs, but what about mpiexec? The MPI Forum explicitly
specified mpiexec to fulfill this requirement. Has it failed? Are all
the mpiexec implementations out there so radically different as to be
useless in terms of uniform syntax? (this is an honest question)
Summary:
- Run-time decision of back-end RTE launcher support is easy and
available today
- An uber-mpirun cannot hide all job control details (MPI_INIT must be
involved)
- An uber-mpirun would effectively standardize MPI_INIT, MPI_FINALIZE,
MPI_ABORT, and the MPI-2 dynamic functions
- Standardizing the internal RTE in MPI implementations is a *LOT* of
work
- What about mpiexec?
-----
I have a few random notes on Greg's slides:
- As I mentioned above, any MPI implementation can support multiple
batch-queue systems (or, more specifically, any back-end launching
system). It's purely a quality-of-implementation issue. An MPI ABI is
not required to make MPI implementations support multiple different
run-time environments.
- "Ever wonder why MPI applications don't come with a 'make check'
target?" This is an oversimplification -- you're implying that lack of
consistent mpirun syntax makes MPI applications non-portable, and
therefore impossible to have a consistent launching mechanism. This is
simply not true; it ignores at least two significant issues:
1. There are many other external factors required to run an MPI
application (e.g., SSH keys, a batch-queue system, permission and time
allocation on a cluster/parallel hardware, local setup decisions and
policies [pre-staging executables or using a global filesystem], etc.).
Indeed, the simple matter of choosing how many CPUs to use and which
ones to launch across is different in every run-time environment. This
is not the fault of MPI; this is the "fault" of heterogeneity of
run-time environments that exist today.
2. mpiexec seems to be able to handle at least some of these issues; it
already has a more-or-less standardized command line syntax. The
slides did not address mpiexec at all -- are there issues with mpiexec?
Regardless, doesn't "mpirun -np 4 my_app" pretty much work on
many/most implementations?
- On the "Recompilation considered harmful" slide: what about different
compilers? Even if we have an MPI ABI, compilers will be (or already
are?) the next battleground. Whatever happened to the C++ ABI effort?
Is there, or will there be an F90 ABI effort? Specifically: MPI is
only one piece of the puzzle. There are a lot of other factors that
determine whether recompilation is required or not. ABIs between
compilers (not libraries) would be a good first step.
- On the "Winners: End Users" slide: this is also an
oversimplification. "Any MPI app works on your system" / "Your app
works on your collaborator's system". As discussed above, this only
works for "similar" systems -- as long as your MPI app was compiled
with for the same OS, hardware, same system and compiler flags -- then
sure, your app will run in multiple places. Indeed, we have this today
-- if you compile any non-trivial app (MPI or not), you can [only] run
it on any similar system. But if it's not a "similar" system, you can
(and will) run into DLL Hell or downright incompatibility. Therefore,
this is not MPI's fault. This is the "fault" of the heterogeneity of
systems out there.
- On the "Winners: MPI implementation researchers" slide: although
there are some (a very small number), most implementation researchers
do not write their own MPI from scratch. Most take an existing open
source MPI and modify it. Having an MPI ABI gains nothing for MPI
implementation researchers except that they don't have to recompile
applications for their new implementation. This is exactly the same as
it is for everyone else (per restrictions discussed above); singling
out MPI implementation researchers is misleading.
- On the "Winners: Interconnect implementors" slide: Why will
interconnect implementors only reach systems that recompile? Quadrics
distributes binaries, for example. Are you saying that all
interconnect vendors must write their own MPI implementations? I can
assure you that most of them do *not* want to do this.
- On the "Winners: Commercial software vendors" slide: I talked about
this above. An ABI does *not* make testing easier -- the ISV still has
to test with all the target MPIs that they are going to support. Not
having to recompile will not significantly reduce ISVs' testing
logistics. I don't see how automated testing becomes
easier with an ABI. Are you referring to a standardized mpirun? In
several of your e-mails, you have indicated that the standardized
mpirun would be a separate effort, not part of the ABI. So I'm a bit
confused by this comment.
- On the "Winners: Open Source Software Projects" slide: you say
"Tomorrow, MPI is just like everything else..." Are you saying that
MPI will be DLL Hell just like all other packages out there? That's
not a snide remark -- today, you have to find an RPM for your specific
distro, version, and architecture. Anything else is a total
crapshoot as to whether it will work (e.g., DLL Hell). Do you really
want MPI implementations to fall into this category? Although there
are obvious drawbacks, using the source can be quite liberating in
terms of portability and freedom from DLL Hell.
- On the "Issues: Startup and queue systems" slide: it sounds like you
are now talking about standardizing queue systems, which is a much, much
larger effort than just the MPI (or even the HPC) community.
It's quite possible that I'm missing the talking points (and therefore
the intent) of some of these slides; I did not see a presentation -- I
only read the PDF. So if I missed the point of some of these slides, I
apologize -- but please expand on your text and explain (the PDF is all
that those of us who were not at the IB meeting where the slides were
presented have to go on). Thanks!
-----
In conclusion (thanks for staying with me so long!), I guess I really
don't see a clear "win" for an MPI ABI and/or an uber-mpirun -- I don't
see a compelling "yes, this will make my life better" rationale (where
{"my" E (end-user, MPI implementor, ISV, ...etc.)}. Avoiding
recompiling certainly makes some people's lives better in incremental
ways. But it seems like we have far more important problems to solve
(extreme scalability, better performance, new platforms, etc.). Do we
know if users really want this? (i.e., a large percentage of users --
not a vocal few) Will users really find it easier? Can you really
sell this concept to all MPI implementors? Will ISVs really want the
additional support burden / user confusion? ...and so on.
I believe that the *problem* is not MPI, nor any particular
implementation. The *problem* is that there are a lot of different
types of systems out there. You *can't* distribute a binary (even a
serial binary) and expect it to work everywhere. Binaries have to be
tailored to specific systems. This is why, for example, in the Linux
world, you can't just grab any RPM that has the application you want --
you have to find the RPM for your distro, version, and hardware. If
nothing else, you prevent DLL Hell kinds of issues this way. Or, you
statically link the whole application and leave nothing to chance
(which obviates the need for an MPI ABI).
Indeed, even on a given system, there are many different variations (which
compiler to use, which compiler and system flags to use, etc.). MPI
can neither be blamed for all of these variations nor can an MPI ABI be
expected to somehow provide uniformity across all of them (e.g., if the
application is compiled with -D_REENTRANT and the MPI library is not).
MPI is only one piece of this DLL Hell (etc.) puzzle. An MPI ABI isn't
nearly as useful as one would think unless all the other issues are
solved (e.g., compiler ABIs). Indeed, the set of "similar" systems out
there is pretty small: every cluster is different. Every one. There
are very, very few cookie-cutter clusters out there that can truly be
called "identical" to other clusters. As such, even expecting serial
binaries to be portable is quite a stretch.
To be blunt: an MPI ABI and/or an uber-mpirun will not solve any of
these other issues.
My $0.02: source code portability is enough. This was actually quite
wise of the MPI Forum; specifying mpi.h and/or making an ABI was never
part of the plan. Any valid MPI application can be recompiled for
other target systems. Indeed, properly engineered parallel
applications may only need to recompile small portions of their code
base to use a different MPI implementation. And with a little effort,
apps can be made MPI-independent (which is a lot less work than
getting all MPI implementations to agree to an ABI / uber-mpirun).
Sure, it would be great to not have to recompile apps, but given the
current state of technology, the sheer number of MPI implementations
that would have to agree to make an MPI ABI useful, and the fundamental
differences in goals between the different MPI implementations, it's
hard to justify all the work that would be required for this effort --
just to avoid a simple thing like recompiling.
Thanks for your time in reading this.
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/