Greetings. I loosely watched the MPI ABI discussions on the Beowulf
list but refrained from commenting (I stopped checking -- is it still
going on?). Now that the discussion has come to my project's list, I
guess I should speak up. :)
Since I've been "saving up" for a while, this post is a bit lengthy. I
apologize.
-----
On the surface, an ABI looks great. Greg's slides show a bunch of
reasons why an ABI could be a Good Thing(tm) and how a bunch of groups
of people could "win" -- the scenarios outlined are certainly
attractive.
But I think that there are deeper problems. In short, I believe that
MPI is only one part of a complex web of issues. Even if we wave our
hands and assume that we have an MPI ABI today, there are many other
factors that prevent the scenarios that are described in Greg's slides.
I'll discuss these below.
Because e-mail is such an imprecise medium, and because I don't
personally know many (most?) of you, I want to stress right up front
that I am *not* attempting to start a holy war, flame fest, or any
other such nonsense. This is all simply the opinions of an MPI
implementor. Although I've been around the block a few times, I'm
certainly not going to claim to be a definitive expert on all systems
everywhere. This mail simply summarizes my views.
-----
First, let me ask a question: what does an MPI ABI *really* get for you?
The obvious answer is that you don't have to recompile. Your app runs
anywhere with any MPI on any system. Well, that is, unless you want to run
on a different architecture (32/64 bit, different CPU, different
platform, etc.). Or if you want to use a different compiler on the
same system (let's not forget C++ and F90 name mangling issues). Or if
you want to use different system or compiler flags (e.g., threading /
no threading, largefile support on Linux, optimization and debugging
support, etc.).
So -- hmm. You can run your MPI app on any MPI implementation that is
on exactly the same platform, architecture, uses the same compilers,
and uses the same system and compiler flags that you want. So an MPI
ABI does not enable the "compile once, run anywhere" scheme -- it
really is much narrower than the casual observer might expect.
But let's say that even that would be a big "win" for users -- you have
"similar" systems with "similar" MPI implementations, and you can run
your MPI app on any combination of them. So how does the end user
choose which MPI implementation to use?
The old way was to set your $PATH -- ensure that you have the "right"
mpicc, mpirun, etc. Sure, you can have 20+ MPI implementations loaded
on your cluster (and many clusters do), and they can all
peacefully co-exist (within reason). Switching between MPI
implementations is [usually] a matter of the user changing their $PATH
(sometimes this has to be in their shell startup files).
As an MPI implementor and support provider, let me tell you that
getting users to do this correctly is a nightmare. Users generally
understand PATH, but getting them to set it right [consistently] is
quite difficult. I'm not making any judgements on whether setting
$PATH to switch MPI implementations is a good system or not -- I'm just
saying that that's [usually] what it is. And it's difficult enough for
the user who doesn't care why/how MPI works.
The new way is based on MPI shared libraries -- users will change their
LD_LIBRARY_PATH instead of their PATH. Some will claim that this is
equivalent; if a user can change their PATH, they can certainly change
their LD_LIBRARY_PATH. I'm guessing, however, that it will be much
harder. Users generally understand $PATH; how many of them have ever
heard of LD_LIBRARY_PATH and/or will understand (or care) what it is
for? I have visions of "set ld library path = /opt/xyz-mpi".
But I digress -- whether the users will get it right or not is
speculative. My point here is that you have traded one "switch"
mechanism for another that, from a procedural standpoint, is equivalent
(setting LD_LIBRARY_PATH vs. setting PATH).
Ok, so let's wave our hands again and assume that this is all working
fine and good, and users can switch between MPI implementations on
their similar systems with ease. What have we solved? My cluster
still has 20+ MPI implementations on it, and I (the user) still have to
choose which to use. I don't have to recompile my app, but now I've
got a somewhat-intangible way to know which MPI I'm using (look at
$LD_LIBRARY_PATH). Users are now quite accustomed to "myapp-lam",
"myapp-ftmpi", "myapp-lampi", "myapp-mpich-gm", etc., where the
difference is quite obvious. Now it's much less obvious.
Is this a good or a bad thing? I don't know -- I just raise the point.
When you only have one binary, it becomes harder (or, better put,
takes more effort) to ensure that you're running with the MPI
implementation that you intend to. Mistakes (by end users and/or
pre-bundled MPI software) will become easier to make.
A final thought here: the -rpath (and equivalent) linker flags are
extremely convenient for users. You compile against shared libraries
and they are magically "found" at run time, regardless of your
LD_LIBRARY_PATH. This is particularly helpful for packages that are
installed in non-system-default locations (like the 20 MPI
implementations you have installed on your cluster). Having an MPI ABI
will pretty much stop this practice -- you don't want to link an MPI
application with -rpath because you don't want to (or can't) assume
which MPI a user will want to use. So the user *has* to set their
LD_LIBRARY_PATH -- you no longer have an MPI implementation that "just
works"; users must do one [more] thing before an MPI application will
run.
Summary:
- With an MPI ABI, you can only run on "similar" systems
- Users now set their LD_LIBRARY_PATH instead of PATH
- It's less obvious which MPI the user is actually using
- -rpath linker flag can/should not be used; users *have* to set
LD_LIBRARY_PATH
-----
What about the ISV?
Again, on the surface, this looks great -- an ISV can ship *one*
executable and have it work "anywhere". Er, well, anywhere "similar"
(so let's not forget that the ISV will still end up shipping a lot of
executables -- they may be shipping *fewer* executables than before,
but there will still be [far?] more than one).
But does an ISV really want that? Suddenly their app can [potentially]
run in a lot of scenarios that they have not verified through their QA
process. How do you know that you'll get the right answers? How do
you know it won't crap out in the middle of the run because of a
missing symbol (not involving MPI)? The fact is that the app can now
run in a lot of unsupported places, whereas today, the possibilities of
this happening are *much* more limited. ISVs generally choose which
MPI implementations to support, and their apps then *only* run on those
implementations (there are exceptions to this rule, I know).
This is quite an important point, and is something that several others
have brought up in other mails: all MPI implementations are not created
equal. Take any two production-quality MPI implementations and they'll
have their own quirks and differences. They'll behave and perform
differently. So even though your application is source code portable,
it may not be portable in terms of performance or behavior. This has been a
well-known fact for years (as someone said -- it's an artifact of using
a standard with multiple implementations). This is why ISVs QA-test
their applications with different MPI implementations, and only certify
specific ones. More specifically, if your application works on one MPI
implementation, you can't guarantee that it will work on another. It
*probably* will, but customers don't pay for "probably" (e.g., you
can't know if you're accidentally relying on a quirk of one [or more!]
implementation[s] without testing on exactly the ones that you plan to
support).
I'm not an ISV, so I won't pretend to speak for them, but several with
whom I've had conversations actually *prefer* having tight control over where
their apps run (regardless of the mechanism) -- not just in terms of
QA, but also in terms of support. Granted, today's system of
enforcing that is rather klunky (you won't get any disagreement from me
there), but it gives ISVs what they want (at least, the ones that I
have talked to).
Let's again wave our hands and assume that we have an MPI ABI, and
imagine a support call for an ISV's MPI application:
Tech: "Hello, welcome to ABC support."
User: "I'm having a problem with your XYZ product."
Tech: "Ah yes, this product uses MPI. Which MPI are you using?"
User: "I'm using JKL MPI."
Tech: "I'm sorry, we don't support JKL MPI."
That's a bit fanciful and simplified, but my point here is that ISVs
are still going to choose which MPIs they want to / can support. If
you (the user) use something outside of that set, you're unsupported.
This may be confusing for users because the application *runs* (or
seems like it is *supposed* to run) -- they have an MPI, right? So why
doesn't the ISV support that MPI?
More to the point: it is better for an application to not run at all
than to run poorly (or, even worse, silently/unknowingly generate
incorrect results). Having a clear-cut distinction here is a Good
Thing(tm).
Also, let's not forget that some ISVs have chosen to avoid today's
klunky mechanisms and simply statically compile a libmpi.a into their
application. They include a stripped down MPI implementation
(potentially not their own) inside their own app, and provide varying
degrees of hiding the MPI from the user. Hence, the ISV has delivered
a solution that will always work.
Granted, this isn't [yet] possible for all scenarios. But it works
quite well in a wide variety of environments (let's not discount the
number of clusters that are being bought outside of "traditional" HPC
environments -- bio, chemical, etc., where TCP-based networks are used
heavily).
While we're on the topic: as cited in Greg's slides,
non-traditional-HPC parallel applications (bio, chemical, etc.) are not
going to tolerate recompiling. They expect to get a binary that "just
works". This is certainly a valid point. However, these types of
users will also be buying a complete solution, from hardware all the
way to application (as much as possible). Specifically, these users
don't care (and sometimes don't even know) which MPI they are using.
They don't care about running with 20 different MPI implementations.
They'll use one -- whichever one their application is bundled with --
and will never use another (on that system, at least). So an MPI ABI
may not be very important to them.
Another solution that some ISVs use today is to have a thin message
passing abstraction layer. So their main code base consists of 98%
application-related stuff; 2% message passing stuff. Engineered
properly, the 98% makes calls into a separate library (i.e., the 2%)
that funnels all access to MPI. Hence, the ISVs really only need to
recompile the small library that interfaces to MPI -- not their entire
application -- to switch between MPI implementations.
Don't get me wrong -- I'm not saying that this is a perfect solution.
All I'm saying is that with proper planning and engineering, it's not a
*bad* one. Indeed, with slightly more effort, an ISV application could
have a dynamic module that opens different libraries to talk to
different MPI implementations (e.g., dlopen("mpi_interface_lam.so"), or
dlopen("mpi_interface_mpich.so"), etc.).
Hence, it is possible for ISVs to ship MPI-independent applications
*today*. More specifically, this would solve many of the same issues
that an MPI ABI would solve *without requiring anything additional from
MPI implementations* (and all the baggage that goes along with that) --
but you can still only run on "similar" systems.
Summary:
- ISVs are still only going to support some MPI implementations
- ISVs lose control over which MPI implementations their apps are used
with
- Potential user confusion because it's less obvious which MPI they're
using
- ISVs can ship static executables *today*
- ISVs can write binary MPI-independent applications *today*
-----
Much of what is being discussed first centers around standardizing
mpi.h. I think we all agree that it was not the Forum's goal to
standardize mpi.h -- they deliberately left it unstandardized (to
allow implementors to do whatever they want/need).
The main differences between mpi.h's can be summarized as:
1. values of constants
2. size of MPI_Status
3. size and types of MPI handles (crassly: pointer vs. integer)
#1 is probably fairly easy to solve, but it's dependent upon #3. #2
may present some arguments between implementors. #3 may introduce some
fist fights. ;-) Canonical example: MPICH* uses integers; Open MPI
uses pointers. I don't think that either side is willing to give them
up -- a simple reason (but definitely not the only reason -- this mail
is not intended to open that debate) is that the amount of code that
will change as a result of converting from int->pointer or pointer->int
is quite large. Admittedly, each change is fairly small, but it's
still a *lot* of small changes.
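To make the integer-vs-pointer point concrete, the two styles look
roughly like this (paraphrased for illustration -- these are not actual
excerpts from either implementation's mpi.h):

  /* Integer-style handles (MPICH-like, paraphrased): */
  typedef int MPI_Comm;
  typedef int MPI_Datatype;
  #define MPI_COMM_WORLD ((MPI_Comm) 0x44000000)

  /* Pointer-style handles (Open MPI-like, paraphrased): */
  typedef struct ompi_communicator_t *MPI_Comm;
  typedef struct ompi_datatype_t *MPI_Datatype;
  extern struct ompi_communicator_t ompi_mpi_comm_world;
  #define MPI_COMM_WORLD (&ompi_mpi_comm_world)

A binary compiled against one of these cannot possibly work against the
other: the handle sizes differ (particularly on 64 bit platforms), the
predefined constants differ, and every prototype that takes or returns
a handle differs. Picking either style for an ABI means the other camp
rewrites a large amount of code.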
Let's also not forget MPI for smaller niche environments -- do we
really want to force an embedded MPI to use 32 or 64 bit handles?
That's a silly example, of course, but the point I'm trying to make is
that MPI spans a wide range of platforms -- what is suitable for one is
not necessarily suitable for another (even at the mpi.h level).
A user recently asked me, "So why should I suffer because of religious
differences between MPI implementors?"
My reply to that is "How exactly are you suffering?" Is recompiling
really that difficult? The fact remains that you're still going to
have many different MPI implementations out there -- an ABI will not
change this. Does slightly changing the mechanism by which you switch
your application between them really, fundamentally, make life better?
I'll return to this question later.
Every MPI implementation has different goals (research, production,
latency, bandwidth, portability / specificity, etc.). These goals
strongly influence the design of that implementation and have tangible
impacts on mpi.h.
Summary:
- Standardizing the size/type of MPI handles is problematic
- What is appropriate on one platform is not necessarily appropriate on
another
- Every MPI implementation has different goals, which even affects mpi.h
-----
As mentioned multiple times in the slides, having a common mpi.h is
only half the story. You'll still need a common mpirun to really make
things transparent to the user (you may even be able to hide some of
the LD_LIBRARY_PATH issues if you have a good uber-mpirun). The slides
argue that you can't support multiple batch queue systems in most
current MPI implementations.
I strongly disagree with this. LAM/MPI has been doing it for years.
LAM currently supports -- out of the box -- the run-time decision of
whether to use rsh/ssh, PBS, SLURM, BProc (both LANL and Scyld
variants), and limited scenarios for Globus. Open MPI will support
even more than this.
*** Sidenote: this same argument holds for support of different network
interconnects. LAM/MPI has been supporting the run-time decision of
which interconnect to use for years. Open MPI will continue this
capability. But let's get back to the RTE discussion...
I firmly believe -- and the software to back up my belief is freely
available -- that this is purely a quality of implementation issue. If
MPI implementations want to support multiple back-end run-time
environments (RTEs), they can (this all goes back to the goals of an
MPI implementation). All batch systems have some kind of interface
(API or command line) to launch processes; although there is varying
support for monitoring and killing, it's simply a question of the MPI
implementation using the interface to launch its MPI processes. It's
not difficult; support for all the RTE systems listed above is
approximately 2% of LAM/MPI 7.1.2's code base (in terms of lines of
code).
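For a sense of how small this kind of support can be, here is a rough
sketch of a run-time-selectable launcher interface. This is
illustrative only -- it is not LAM/MPI's or Open MPI's actual internal
API, and the environment-variable checks are just one plausible way to
detect an RTE:

  #include <stdlib.h>
  #include <stddef.h>

  /* One entry per supported back-end RTE. */
  typedef struct {
      const char *name;        /* "tm" (PBS), "slurm", "rsh", ... */
      int (*available)(void);  /* is this RTE present right now? */
  } launcher_t;

  /* Each module decides whether its RTE is present; checking an
     environment variable that the batch system sets is typical. */
  static int tm_available(void)    { return getenv("PBS_ENVIRONMENT") != NULL; }
  static int slurm_available(void) { return getenv("SLURM_JOBID") != NULL; }
  static int rsh_available(void)   { return 1; /* fallback */ }

  static launcher_t launchers[] = {
      { "tm",    tm_available },
      { "slurm", slurm_available },
      { "rsh",   rsh_available },
  };

  /* Pick the first launcher whose RTE is actually available. */
  static launcher_t *select_launcher(void)
  {
      for (size_t i = 0; i < sizeof(launchers) / sizeof(launchers[0]); ++i) {
          if (launchers[i].available()) {
              return &launchers[i];
          }
      }
      return NULL;
  }

Each module then wraps its RTE's own launch interface (the TM API,
srun, rsh/ssh, etc.); the rest of the MPI implementation never has to
know which one was chosen.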
I'm wary of standardizing on an uber-mpirun. There's more to MPI_INIT
than just discovering your peers and your identity (Greg mentioned a
few issues in his slides: IO forwarding, process monitoring, etc.). In
some cases, there is no out-of-band channel for newly-started MPI
processes to contact mpirun; MPI_INIT has to figure out its peers and
identity based on what the back-end RTE gave it (e.g., Quadrics,
Portals, etc.). Hence, you can't hide everything in an uber-mpirun --
the MPI sometimes *needs* knowledge of the back-end RTE.
You're also going to be standardizing many of the MPI-2 dynamic
functions, MPI_FINALIZE, and MPI_ABORT. That's a *lot* of ground to
cover (and to get implementors with different opinions and goals to
agree upon). Indeed, in MPI-2, the Forum went so far as to say
(paraphrasing, obviously) "We didn't specify the exact behavior of
MPI_FINALIZE on purpose."
An MPI implementation's internal RTE is the soul of the machine; everyone has done
theirs entirely differently. Given that mpi.h is halfway specified by
the MPI standard and we *still* can't agree on the specifics, it is
difficult for me to imagine standardizing critical elements of the
back-end of all MPI implementations where there is currently no
uniformity at all. Consider: as I mentioned above, standardizing mpi.h
means touching potentially a lot of code in an MPI implementation.
Standardizing the internal run-time environment will touch a lot *more*
code in an MPI implementation. That's a hard sell.
Let's also not forget that some MPI implementations distinguish
themselves by their run-time environments. Some have really good RTEs.
Some don't. But consider: if performance is roughly equivalent among
multiple MPI implementations, users will choose by feature sets. I
speak from experience -- long before I became an MPI implementor, I
chose to use a specific MPI implementation because it had a fast mpirun
and when I hit ctrl-C, all my applications were guaranteed to be
killed. If you eliminate these differences, you're asking some MPI
implementations to standardize themselves out of existence. That, too,
is a pretty hard sell.
Finally, this uber-mpirun will have a consistent syntax across all
platforms and RTEs, but what about mpiexec? The MPI Forum explicitly
specified mpiexec to fulfill this requirement. Has it failed? Are all
the mpiexec implementations out there so radically different as to be
useless in terms of uniform syntax? (this is an honest question)
Summary:
- Run-time decision of back-end RTE launcher support is easy and
available today
- An uber-mpirun cannot hide all job control details (MPI_INIT must be
involved)
- An uber-mpirun would effectively standardize MPI_INIT, MPI_FINALIZE,
MPI_ABORT, and the MPI-2 dynamic functions
- Standardizing the internal RTE in MPI implementations is a *LOT* of
work
- What about mpiexec?
-----
I have a few random notes on Greg's slides:
- As I mentioned above, any MPI implementation can support multiple
batch-queue systems (or, more specifically, any back-end launching
system). It's purely a quality-of-implementation issue. An MPI ABI is
not required to make MPI implementations support multiple different
run-time environments.
- "Ever wonder why MPI applications don't come with a 'make check'
target?" This is an oversimplification -- you're implying that lack of
consistent mpirun syntax makes MPI applications non-portable, and
therefore impossible to have a consistent launching mechanism. This is
simply not true; it ignores at least two significant issues:
1. There are many other external factors required to run an MPI
application (e.g., SSH keys, a batch-queue system, permission and time
allocation on a cluster/parallel hardware, local setup decisions and
policies [pre-staging executables or using a global filesystem], etc.).
Indeed, the simple matter of choosing how many CPUs to use and which
ones to launch across is different in every run-time environment. This
is not the fault of MPI; this is the "fault" of heterogeneity of
run-time environments that exist today.
2. mpiexec seems to be able to handle at least some of these issues; it
already has a more-or-less standardized command line syntax. The
slides did not address mpiexec at all -- are there issues with mpiexec?
Regardless, doesn't "mpirun -np 4 my_app" pretty much work on
many/most implementations?
- On the "Recompilation considered harmful" slide: what about different
compilers? Even if we have an MPI ABI, compilers will be (or already
are?) the next battleground. Whatever happened to the C++ ABI effort?
Is there, or will there be an F90 ABI effort? Specifically: MPI is
only one piece of the puzzle. There are a lot of other factors that
determine whether recompilation is required or not. ABIs between
compilers (not libraries) would be a good first step.
- On the "Winners: End Users" slide: this is also an
oversimplification. "Any MPI app works on your system" / "Your app
works on your collaborator's system". As discussed above, this only
works for "similar" systems -- as long as your MPI app was compiled
with for the same OS, hardware, same system and compiler flags -- then
sure, your app will run in multiple places. Indeed, we have this today
-- if you compile any non-trivial app (MPI or not), you can [only] run
it on any similar system. But if it's not a "similar" system, you can
(and will) run into DLL Hell or downright incompatibility. Therefore,
this is not MPI's fault. This is the "fault" of the heterogeneity of
systems out there.
- On the "Winners: MPI implementation researchers" slide: although
there are some (a very small number), most implementation researchers
do not write their own MPI from scratch. Most take an existing open
source MPI and modify it. Having an MPI ABI gains nothing for MPI
implementation researchers except that they don't have to recompile
applications for their new implementation. This is exactly the same as
it is for everyone else (per restrictions discussed above); singling
out MPI implementation researchers is misleading.
- On the "Winners: Interconnect implementors" slide: Why will
interconnect implementors only reach systems that recompile? Quadrics
distributes binaries, for example. Are you saying that all
interconnect vendors must write their own MPI implementations? I can
assure you that most of them do *not* want to do this.
- On the "Winners: Commercial software vendors" slide: I talked about
this above. An ABI does *not* make testing easier -- the ISV still has
to test with all the target MPIs that they are going to support. Not
having to recompile will not significantly reduce ISVs' testing
logistics. I don't see how automated testing becomes
easier with an ABI. Are you referring to a standardized mpirun? In
several of your e-mails, you have indicated that the standardized
mpirun would be a separate effort, not part of the ABI. So I'm a bit
confused by this comment.
- On the "Winners: Open Source Software Projects" slide: you say
"Tomorrow, MPI is just like everything else..." Are you saying that
MPI will be DLL Hell just like all other packages out there? That's
not a snide remark -- today, you have to find an RPM for your specific
distro, version, and architecture. Anything else is a total
crapshoot as to whether it will work (e.g., DLL Hell). Do you really
want MPI implementations to fall into this category? Although there
are obvious drawbacks, using the source can be quite liberating in
terms of portability and freedom from DLL Hell.
- On the "Issues: Startup and queue systems" slide: it sounds like you
are now talking about standardizing queue systems, which is a much, much
larger effort than just the MPI (or even the HPC) community.
It's quite possible that I'm missing the talking points (and therefore
the intent) of some of these slides; I did not see a presentation -- I
only read the PDF. So if I missed the point of some of these slides, I
apologize -- but please expand on your text and explain (the PDF is all
that those of us who were not at the IB meeting where the slides were
presented have to go on). Thanks!
-----
In conclusion (thanks for staying with me so long!), I guess I really
don't see a clear "win" for an MPI ABI and/or an uber-mpirun -- I don't
see a compelling "yes, this will make my life better" rationale (where
{"my" E (end-user, MPI implementor, ISV, ...etc.)}. Avoiding
recompiling certainly makes some people's lives better in incremental
ways. But it seems like we have far more important problems to solve
(extreme scalability, better performance, new platforms, etc.). Do we
know if users really want this? (i.e., a large percentage of users --
not a vocal few) Will users really find it easier? Can you really
sell this concept to all MPI implementors? Will ISVs really want the
additional support burden / user confusion? ...and so on.
I believe that the *problem* is not MPI, nor any particular
implementation. The *problem* is that there are a lot of different
types of systems out there. You *can't* distribute a binary (even a
serial binary) and expect it to work everywhere. Binaries have to be
tailored to specific systems. This is why, for example, in the Linux
world, you can't just grab any RPM that has the application you want --
you have to find the RPM for your distro, version, and hardware. If
nothing else, you prevent DLL Hell kinds of issues this way. Or, you
statically link the whole application and leave nothing to chance
(which obviates the need for an MPI ABI).
Indeed, even on a given system, there are many different variations (which
compiler to use, which compiler and system flags to use, etc.). MPI
can neither be blamed for all of these variations nor can an MPI ABI be
expected to somehow provide uniformity across all of them (e.g., if the
application is compiled with -D_REENTRANT and the MPI library is not).
MPI is only one piece of this DLL Hell (etc.) puzzle. An MPI ABI isn't
nearly as useful as one would think unless all the other issues are
solved (e.g., compiler ABIs). Indeed, the set of "similar" systems out
there is pretty small: every cluster is different. Every one. There
are very, very few cookie-cutter clusters out there that can truly be
called "identical" to other clusters. As such, even expecting serial
binaries to be portable is quite a stretch.
To be blunt: an MPI ABI and/or an uber-mpirun will not solve any of
these other issues.
My $0.02: source code portability is enough. This was actually quite
wise of the MPI Forum; specifying mpi.h and/or making an ABI was never
part of the plan. Any valid MPI application can be recompiled for
other target systems. Indeed, properly engineered parallel
applications may only need to recompile small portions of their code
base to use a different MPI implementation. And with a little effort,
apps can be made MPI-independent (which is a lot less work than
getting all MPI implementations to agree to an ABI / uber-mpirun).
Sure, it would be great to not have to recompile apps, but given the
current state of technology, the sheer number of MPI implementations
that would have to agree to make an MPI ABI useful, and the fundamental
differences in goals between the different MPI implementations, it's
hard to justify all the work that would be required for this effort --
just to avoid a simple thing like recompiling.
Thanks for your time in reading this.
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/