Greetings. I loosely watched the MPI ABI discussions on the Beowulf list but refrained from commenting (I stopped checking -- is it still going on?). Now that the discussion has come to my project's list, I guess I should speak up. :)

Since I've been "saving up" for a while, this post is a bit lengthy. I apologize.

-----

On the surface, an ABI looks great. Greg's slides show a bunch of reasons why an ABI could be a Good Thing(tm) and how a bunch of groups of people could "win" -- the scenarios outlined are certainly attractive.

But I think that there are deeper problems. In short, I believe that MPI is only one part of a complex web of issues. Even if we wave our hands and assume that we have an MPI ABI today, there are many other factors that prevent the scenarios that are described in Greg's slides. I'll discuss these below.

Because e-mail is such an imprecise medium, and because I don't personally know many (most?) of you, I want to stress right up front that I am *not* attempting to start a holy war, flame fest, or any other such nonsense. These are all simply the opinions of an MPI implementor. Although I've been around the block a few times, I'm certainly not going to claim to be a definitive expert on all systems everywhere. This mail simply summarizes my views.

-----

First, let me ask a question: what does an MPI ABI *really* get you?

The obvious answer is that you don't have to recompile. Your app runs anywhere with any MPI on any system. Well, that is, unless you want to run on a different architecture (32/64 bit, different CPU, different platform, etc.). Or if you want to use a different compiler on the same system (let's not forget C++ and F90 name mangling issues). Or if you want to use different system or compiler flags (e.g., threading / no threading, largefile support on Linux, optimization and debugging support, etc.).

So -- hmm. You can run your MPI app on any MPI implementation that is on exactly the same platform and architecture, built with the same compilers, and with the same system and compiler flags that you want. So an MPI ABI does not enable the "compile once, run anywhere" scheme -- it really is much narrower than the casual observer might expect.

But let's say that even that would be a big "win" for users -- you have "similar" systems with "similar" MPI implementations, and you can run your MPI app on any combination of them. So how does the end user choose which MPI implementation to use?

The old way was to set your $PATH -- ensure that you have the "right" mpicc, mpirun, etc. Sure, you can have 20+ MPI implementations loaded on your cluster (and many clusters do), and they can all peacefully co-exist (within reason). Switching between MPI implementations is [usually] a matter of the user changing their $PATH (sometimes this has to be in their shell startup files).

As an MPI implementor and support provider, let me tell you that getting users to do this correctly is a nightmare. Users generally understand PATH, but getting them to set it right [consistently] is quite difficult. I'm not making any judgements on whether setting $PATH to switch MPI implementations is a good system or not -- I'm just saying that that's [usually] what it is. And it's difficult enough for the user who doesn't care why/how MPI works.

The new way is based on MPI shared libraries -- users will change their LD_LIBRARY_PATH instead of their PATH. Some will claim that this is equivalent; if a user can change their PATH, they can certainly change their LD_LIBRARY_PATH. I'm guessing, however, that it will be much harder. Users generally understand $PATH; how many of them have ever heard of LD_LIBRARY_PATH and/or will understand (or care) what it is for? I have visions of "set ld library path = /opt/xyz-mpi".

But I digress -- whether the users will get it right or not is speculative. My point here is that you have traded one "switch" mechanism for another that, from a procedural standpoint, is equivalent (setting LD_LIBRARY_PATH vs. setting PATH).

Ok, so let's wave our hands again and assume that this is all working fine and good, and users can switch between MPI implementations on their similar systems with ease. What have we solved? My cluster still has 20+ MPI implementations on it, and I (the user) still have to choose which to use. I don't have to recompile my app, but now I've got a somewhat-intangible way to know which MPI I'm using (look at $LD_LIBRARY_PATH). Users are now quite accustomed to "myapp-lam", "myapp-ftmpi", "myapp-lampi", "myapp-mpich-gm", etc., where the difference is quite obvious. Now it's much less obvious.

Is this a good or a bad thing? I don't know -- I just raise the point. When you only have one binary, it becomes harder (or, better put, takes more effort) to ensure that you're running with the MPI implementation that you intend to. Mistakes (by end users and/or pre-bundled MPI software) will become easier to make.

A final thought here: the -rpath (and equivalent) linker flags are extremely convenient for users. You compile against shared libraries and they are magically "found" at run time, regardless of your LD_LIBRARY_PATH. This is particularly helpful for packages that are installed in non-system-default locations (like the 20 MPI implementations you have installed on your cluster). Having an MPI ABI will pretty much stop this practice -- you don't want to link an MPI application with -rpath because you don't want to (or can't) assume which MPI a user will want to use. So the user *has* to set their LD_LIBRARY_PATH -- you no longer have an MPI implementation that "just works"; users must do one [more] thing before an MPI application will run.

Summary:

- With an MPI ABI, you can only run on "similar" systems
- Users now set their LD_LIBRARY_PATH instead of PATH
- It's less obvious which MPI the user is actually using
- -rpath linker flag can/should not be used; users *have* to set LD_LIBRARY_PATH

-----

What about the ISV?

Again, on the surface, this looks great -- an ISV can ship *one* executable and have it work "anywhere". Er, well, anywhere "similar" (so let's not forget that the ISV will still end up shipping a lot of executables -- they may be shipping *fewer* executables than before, but there will still be [far?] more than one).

But does an ISV really want that? Suddenly their app can [potentially] run in a lot of scenarios that they have not verified through their QA process. How do you know that you'll get the right answers? How do you know it won't crap out in the middle of the run because of a missing symbol (not involving MPI)? The fact is that the app can now run in a lot of unsupported places, whereas today, the possibilities of this happening are *much* more limited. ISVs generally choose which MPI implementations to support, but then their apps *only* run on those implementations (there are exceptions to this rule, I know).

This is quite an important point, and is something that several others have brought up in other mails: all MPI implementations are not created equal. Take any two production-quality MPI implementations and they'll have their own quirks and differences. They'll behave and perform differently. So even though your application is source code portable, it may not be portable in performance or behavior. This has been a well-known fact for years (as someone said -- it's an artifact of using a standard with multiple implementations). This is why ISVs QA-test their applications with different MPI implementations, and only certify specific ones. More specifically, if your application works on one MPI implementation, you can't guarantee that it will work on another. It *probably* will, but customers don't pay for "probably" (e.g., you can't know if you're accidentally relying on a quirk of one [or more!] implementation[s] without testing on exactly the ones that you plan to support).
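
To make the kind of quirk I mean concrete, here is the textbook example (a generic sketch, not taken from any particular ISV's code): both ranks send before they receive. The standard calls this "unsafe" -- whether it completes depends entirely on whether the implementation buffers the outgoing message, and where its internal buffering limits happen to be.

    #include <mpi.h>

    #define COUNT 100000   /* whether this "works" depends on the
                              implementation's internal buffering limits */

    /* Run with exactly 2 processes. */
    int main(int argc, char *argv[])
    {
        static int sendbuf[COUNT], recvbuf[COUNT];
        int rank, peer;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        peer = (rank == 0) ? 1 : 0;

        /* Both ranks send first, then receive: "unsafe" per the MPI
           standard.  It completes only if the implementation buffers the
           outgoing message instead of blocking in MPI_Send. */
        MPI_Send(sendbuf, COUNT, MPI_INT, peer, 0, MPI_COMM_WORLD);
        MPI_Recv(recvbuf, COUNT, MPI_INT, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

        MPI_Finalize();
        return 0;
    }

An application with a latent pattern like this can pass every test on one implementation and hang on another -- no recompiling required -- which is exactly why ISVs certify specific implementations.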

I'm not an ISV, so I won't pretend to speak for them, but several with whom I've had conversations actually *prefer* having tight control of where their apps run (regardless of the mechanism) -- not just in terms of QA, but also in terms of support. Granted, today's system of enforcing that is rather clunky (you won't get any disagreement from me there), but it gives ISVs what they want (at least, the ones that I have talked to).

Let's again wave our hands and assume that we have an MPI ABI, and imagine a support call for an ISV's MPI application:

Tech: "Hello, welcome to ABC support."
User: "I'm having a problem with your XYZ product."
Tech: "Ah yes, this product uses MPI.  Which MPI are you using?"
User: "I'm using JKL MPI."
Tech: "I'm sorry, we don't support JKL MPI."

That's a bit fanciful and simplified, but my point here is that ISVs are still going to choose which MPIs they want to / can support. If you (the user) use something outside of that set, you're unsupported. This may be confusing for users because the application *runs* (or seems like it is *supposed* to run) -- they have an MPI, right? So why doesn't the ISV support that MPI?

More specifically: it is better for an application not to run at all than to run poorly (or, even worse, silently/unknowingly generate incorrect results). Having a clear-cut distinction here is a Good Thing(tm).

Also, let's not forget that some ISVs have chosen to avoid today's clunky mechanisms and simply statically link a libmpi.a into their application. They include a stripped-down MPI implementation (potentially not their own) inside their own app, and provide varying degrees of hiding the MPI from the user. Hence, the ISV has delivered a solution that will always work.

Granted, this isn't [yet] possible for all scenarios. But it works quite well in a wide variety of environments (let's not discount the number of clusters that are being bought outside of "traditional" HPC environments -- bio, chemical, etc., where TCP-based networks are used heavily).

While we're on the topic: as cited in Greg's slides, non-traditional-HPC parallel applications (bio, chemical, etc.) are not going to tolerate recompiling. They expect to get a binary that "just works". This is certainly a valid point. However, these types of users will also be buying a complete solution, from hardware all the way to application (as much as possible). Specifically, these users don't care (and sometimes don't even know) which MPI they are using. They don't care about running with 20 different MPI implementations. They'll use one -- whichever one their application is bundled with -- and will never use another (on that system, at least). So an MPI ABI may not be very important to them.

Another solution that some ISVs use today is to have a thin message passing abstraction layer. So their main code base consists of 98% application-related stuff and 2% message passing stuff. Engineered properly, the 98% makes calls into a separate library (i.e., the 2%) that funnels all access to MPI. Hence, the ISVs really only need to recompile the small library that interfaces to MPI -- not their entire application -- to switch between MPI implementations.

Don't get me wrong -- I'm not saying that this is a perfect solution. All I'm saying is that with proper planning and engineering, it's not a *bad* one. Indeed, with slightly more effort, an ISV application could have a dynamic module that opens different libraries to talk to different MPI implementations (e.g., dlopen("mpi_interface_lam.so"), or dlopen("mpi_interface_mpich.so"), etc.).
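
For what it's worth, here is a minimal sketch of what that dynamic approach might look like. The shim library names come from the example above; the msg_init() entry point and the rest of the shim's interface are invented purely for illustration:

    #include <dlfcn.h>
    #include <stdio.h>

    /* The application-side loader.  Each shim (mpi_interface_lam.so,
       mpi_interface_mpich.so, ...) is a small library, compiled against
       one particular mpi.h/libmpi, that exports the same entry points. */
    typedef int (*msg_init_fn)(int *argc, char ***argv);

    int load_msg_layer(const char *shim, int *argc, char ***argv)
    {
        void *handle = dlopen(shim, RTLD_NOW | RTLD_GLOBAL);
        if (handle == NULL) {
            fprintf(stderr, "cannot load %s: %s\n", shim, dlerror());
            return -1;
        }
        msg_init_fn init = (msg_init_fn) dlsym(handle, "msg_init");
        if (init == NULL) {
            fprintf(stderr, "no msg_init in %s: %s\n", shim, dlerror());
            return -1;
        }
        return init(argc, argv);   /* the shim calls MPI_Init() internally */
    }

    /* Usage: load_msg_layer("mpi_interface_lam.so", &argc, &argv); */

The shim is the only piece that ever sees a particular mpi.h; the 98% of the application above it never touches an MPI symbol directly.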

Hence, it is possible for ISVs to ship MPI-independent applications *today*. More specifically, this would solve many of the same issues that an MPI ABI would solve *without requiring anything additional from MPI implementations* (and all the baggage that goes along with that) -- but you can still only run on "similar" systems.

Summary:

- ISVs are still only going to support some MPI implementations
- ISVs lose control over which MPI implementations their apps are used with
- Potential user confusion because it's less obvious which MPI they're using
- ISVs can ship static executables *today*
- ISVs can write binary MPI-independent applications *today*

-----

Much of what is being discussed first centers around standardizing mpi.h. I think we all agree that it was not the Forum's goal to standardize mpi.h -- they actually didn't standardize it on purpose (to allow implementors to do whatever they want/need).

The main differences between mpi.h's can be summarized as:

1. values of constants
2. size of MPI_Status
3. size and types of MPI handles (crassly: pointer vs. integer)

#1 is probably fairly easy to solve, but it's dependent upon #3. #2 may present some arguments between implementors. #3 may introduce some fist fights. ;-) Canonical example: MPICH* uses integers; Open MPI uses pointers. I don't think that either side is willing to give them up -- a simple reason (but definitely not the only reason -- this mail is not intended to open that debate) is that the amount of code that will change as a result of converting from int->pointer or pointer->int is quite large. Admittedly, each change is fairly small, but it's still a *lot* of small changes.
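
For the curious, here is (very roughly, and simplified -- the exact declarations and values vary by version) what the two styles of handle look like in the respective mpi.h files. Note that these are excerpts from two alternative headers, not one file:

    /* Integer handles, MPICH-style (simplified / illustrative): */
    typedef int MPI_Comm;
    #define MPI_COMM_WORLD ((MPI_Comm) 0x44000000)

    /* Pointer handles, Open MPI-style (simplified / illustrative): */
    typedef struct ompi_communicator_t *MPI_Comm;
    extern struct ompi_communicator_t ompi_mpi_comm_world;
    #define MPI_COMM_WORLD (&ompi_mpi_comm_world)

An application binary compiled against one set of declarations passes values of a completely different size and meaning than the other library expects -- which is why #1 (the values of constants) can't really be settled until #3 is.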

Let's also not forget MPI for smaller niche environments -- do we really want to force an embedded MPI to use 32 or 64 bit handles? That's a silly example, of course, but the point I'm trying to make is that MPI spans a wide range of platforms -- what is suitable for one is not necessarily suitable for another (even at the mpi.h level).

A user recently asked me, "So why should I suffer because of religious differences between MPI implementors?"

My reply to that is "How exactly are you suffering?" Is recompiling really that difficult? The fact remains that you're still going to have many different MPI implementations out there -- an ABI will not change this. Slightly changing the mechanism for switching your application between them -- does that really, fundamentally, make life better somehow? I'll return to this question later.

Every MPI implementation has different goals (research, production, latency, bandwidth, portability / specificity, etc.). These goals strongly influence the design of that implementation and have tangible impacts on mpi.h.

Summary:

- Standardizing the size/type of MPI handles is problematic
- What is appropriate on one platform is not necessarily appropriate on another
- Every MPI implementation has different goals, which even affects mpi.h

-----

As mentioned multiple times in the slides, having a common mpi.h is only half the story. You'll still need a common mpirun to really make things transparent to the user (you may even be able to hide some of the LD_LIBRARY_PATH issues if you have a good uber-mpirun). The slides argue that you can't support multiple batch queue systems in most current MPI implementations.

I strongly disagree with this. LAM/MPI has been doing it for years. LAM currently supports -- out of the box -- the run-time decision of whether to use rsh/ssh, PBS, SLURM, BProc (both LANL and Scyld variants), and limited scenarios for Globus. Open MPI will support even more than this.

*** Sidenote: this same argument holds for support of different network interconnects. LAM/MPI has been supporting the run-time decision of which interconnect to use for years. Open MPI will continue this capability. But let's get back to the RTE discussion...

I firmly believe -- and the software to back up my belief is freely available -- that this is purely a quality of implementation issue. If MPI implementations want to support multiple back-end run-time environments (RTEs), they can (this all goes back to the goals of an MPI implementation). All batch systems have some kind of interface (API or command line) to launch processes; although there is varying support for monitoring and killing, it's simply a question of the MPI implementation using the interface to launch its MPI processes. It's not difficult; support for all the RTE systems listed above is approximately 2% of LAM/MPI 7.1.2's code base (in terms of lines of code).

I'm wary of standardizing on an uber-mpirun. There's more to MPI_INIT than just discovering your peers and your identity (Greg mentioned a few issues in his slides: IO forwarding, process monitoring, etc.). In some cases, there is no out-of-band channel for newly-started MPI processes to contact mpirun; MPI_INIT has to figure out its peers and identity based on what the back-end RTE gave it (e.g., Quadrics, Portals, etc.). Hence, you can't hide everything in an uber-mpirun -- the MPI sometimes *needs* knowledge of the back-end RTE.

You're also going to be standardizing many of the MPI-2 dynamic functions, MPI_FINALIZE, and MPI_ABORT. That's a *lot* of ground to cover (and to get implementors with different opinions and goals to agree upon). Indeed, in MPI-2, the Forum went so far as to say (paraphrasing, obviously) "We didn't specify the exact behavior of MPI_FINALIZE on purpose."

The MPI's internal RTE is the soul of the machine; everyone has done theirs entirely differently. Given that mpi.h is halfway specified by the MPI standard and we *still* can't agree on the specifics, it is difficult for me to imagine standardizing critical elements of the back-end of all MPI implementations where there is currently no uniformity at all. Consider: as I mentioned above, standardizing mpi.h means touching potentially a lot of code in an MPI implementation. Standardizing the internal run-time environment will touch a lot *more* code in an MPI implementation. That's a hard sell.

Let's also not forget that some MPI implementations distinguish themselves by their run-time environments. Some have really good RTEs. Some don't. But consider: if performance is roughly equivalent among multiple MPI implementations, users will choose by feature sets. I speak from experience -- long before I became an MPI implementor, I chose to use a specific MPI implementation because it had a fast mpirun and when I hit ctrl-C, all my applications were guaranteed to be killed. If you eliminate these differences, you're asking some MPI implementations to standardize themselves out of existence. That, too, is a pretty hard sell.

Finally, this uber-mpirun will have a consistent syntax across all platforms and RTEs, but what about mpiexec? The MPI Forum explicitly specified mpiexec to fulfill this requirement. Has it failed? Are all the mpiexec implementations out there so radically different as to be useless in terms of uniform syntax? (this is an honest question)

Summary:

- Run-time decision of back-end RTE launcher support is easy and available today
- An uber-mpirun cannot hide all job control details (MPI_INIT must be involved)
- An uber-mpirun would effectively standardize MPI_INIT, MPI_FINALIZE, MPI_ABORT, and the MPI-2 dynamic functions
- Standardizing the internal RTE in MPI implementations is a *LOT* of work
- What about mpiexec?

-----

I have a few random notes on Greg's slides:

- As I mentioned above, any MPI implementation can support multiple batch-queue systems (or, more specifically, any back-end launching system). It's purely a quality-of-implementation issue. An MPI ABI is not required to make MPI implementations support multiple different run-time environments.

- "Ever wonder why MPI applications don't come with a 'make check' target?" This is an oversimplification -- you're implying that lack of consistent mpirun syntax makes MPI applications non-portable, and therefore impossible to have a consistent launching mechanism. This is simply not true; it ignores at least two significant issues:

1. There are many other external factors required to run an MPI application (e.g., SSH keys, a batch-queue system, permission and time allocation on a cluster/parallel hardware, local setup decisions and policies [pre-staging executables or using a global filesystem], etc.). Indeed, the simple matter of choosing how many CPUs to use and which ones to launch across is different in every run-time environment. This is not the fault of MPI; this is the "fault" of heterogeneity of run-time environments that exist today.

2. mpiexec seems to be able to handle at least some of these issues; it already has a more-or-less standardized command line syntax. The slides did not address mpiexec at all -- are there issues with mpiexec? Regardless, doesn't "mpirun -np 4 my_app" pretty much work on many/most implementations?

- On the "Recompilation considered harmful" slide: what about different compilers? Even if we have an MPI ABI, compilers will be (or already are?) the next battleground. Whatever happened to the C++ ABI effort? Is there, or will there be an F90 ABI effort? Specifically: MPI is only one piece of the puzzle. There are a lot of other factors that determine whether recompilation is required or not. ABIs between compilers (not libraries) would be a good first step.

- On the "Winners: End Users" slide: this is also an oversimplification. "Any MPI app works on your system" / "Your app works on your collaborator's system". As discussed above, this only works for "similar" systems -- as long as your MPI app was compiled with for the same OS, hardware, same system and compiler flags -- then sure, your app will run in multiple places. Indeed, we have this today -- if you compile any non-trivial app (MPI or not), you can [only] run it on any similar system. But if it's not a "similar" system, you can (and will) run into DLL Hell or downright incompatibility. Therefore, this is not MPI's fault. This is the "fault" of the heterogeneity of systems out there.

- On the "Winners: MPI implementation researchers" slide: although there are some (a very small number), most implementation researchers do not write their own MPI from scratch. Most take an existing open source MPI and modify it. Having an MPI ABI gains nothing for MPI implementation researchers except that they don't have to recompile applications for their new implementation. This is exactly the same as it is for everyone else (per restrictions discussed above); singling out MPI implementation researchers is misleading.

- On the "Winners: Interconnect implementors" slide: Why will interconnect implementors only reach systems that recompile? Quadrics distributes binaries, for example. Are you saying that all interconnects must write their own MPI implementations? I can assure that most interconnect vendors do *not* want to do this.

- On the "Winners: Commercial software vendors" slide: I talked about this above. An ABI does *not* make testing easier -- the ISV still have to test with all the target MPIs that they are going to support. Just because they don't have to recompile will not significantly reduce the logistics of all ISV's. I don't see how automated testing becomes easier with an ABI. Are you referring to a standardized mpirun? In several of your e-mails, you have indicated that the standardized mpirun would be a separate effort, not part of the ABI. So I'm a bit confused by this comment.

- On the "Winners: Open Source Software Projects" slide: you say "Tomorrow, MPI is just like everything else..." Are you saying that MPI will be DLL Hell just like all other packages out there? That's not a snide remark -- today, you have to find an RPM for your specific distro, version, and architecture. Anything is else is a total crapshoot as to whether it will work (e.g., DLL Hell). Do you really want MPI implementations to fall into this category? Although there are obvious drawbacks, using the source can be quite liberating in terms of portability and freedom from DLL Hell.

- On the "Issues: Startup and queue systems" slide: it sounds like you are now talking about standardizing queue systems which is a much, much larger effort than just the MPI (or even the HPC) community.

It's quite possible that I'm missing the talking points (and therefore the intent) of some of these slides; I did not see the presentation -- I only read the PDF. So if I missed the point of some of these slides, I apologize -- but please expand on your text and explain (the PDF is all that those of us who were not at the IB meeting where the slides were presented have to go on). Thanks!

-----

In conclusion (thanks for staying with me so long!), I guess I really don't see a clear "win" for an MPI ABI and/or an uber-mpirun -- I don't see a compelling "yes, this will make my life better" rationale (where "my" is in {end-user, MPI implementor, ISV, ...etc.}). Avoiding recompiling certainly makes some people's lives better in incremental ways. But it seems like we have far more important problems to solve (extreme scalability, better performance, new platforms, etc.). Do we know if users really want this? (i.e., a large percentage of users -- not a vocal few) Will users really find it easier? Can you really sell this concept to all MPI implementors? Will ISVs really want the additional support burden / user confusion? ...and so on.

I believe that the *problem* is not MPI, nor any particular implementation. The *problem* is that there are a lot of different types of systems out there. You *can't* distribute a binary (even a serial binary) and expect it to work everywhere. Binaries have to be tailored to specific systems. This is why, for example, in the Linux world, you can't just grab any RPM that has the application you want -- you have to find the RPM for your distro, version, and hardware. If nothing else, you prevent DLL Hell kinds of issues this way. Or, you statically link the whole application and leave nothing to chance (which obviates the need for an MPI ABI).

Indeed, even on a given system, there are many different variations (which compiler to use, which compiler and system flags to use, etc.). MPI can neither be blamed for all of these variations nor can an MPI ABI be expected to somehow provide uniformity across all of them (e.g., if the application is compiled with -D_REENTRANT and the MPI library is not).

MPI is only one piece of this DLL Hell (etc.) puzzle. An MPI ABI isn't nearly as useful as one would think unless all the other issues are solved (e.g., compiler ABIs). Indeed, the set of "similar" systems out there is pretty small: every cluster is different. Every one. There are very, very few cookie-cutter clusters out there that can truly be called "identical" to other clusters. As such, even expecting serial binaries to be portable is quite a stretch.

To be blunt: an MPI ABI and/or an uber-mpirun will not solve any of these other issues.

My $0.02: source code portability is enough. This was actually quite wise of the MPI Forum; specifying mpi.h and/or making an ABI was never part of the plan. Any valid MPI application can be recompiled for other target systems. Indeed, properly engineered parallel applications may only need to recompile small portions of their code base to use a different MPI implementation. And with a little effort, apps can be made to be MPI-independent (which is a lot less work than getting all MPI implementations to agree to an ABI / uber-mpirun).

Sure, it would be great to not have to recompile apps, but given the current state of technology, the sheer number of MPI implementations that would have to agree to make an MPI ABI useful, and the fundamental differences in goals between the different MPI implementations, it's hard to justify all the work that would be required for this effort -- just to avoid a simple thing like recompiling.

Thanks for your time in reading this.

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
