Since I wrote and/or have supported most of the OMPI DRM interface code at one time or another, I guess I'll add my $0.02 here. :-)
There is no simple or obvious "winning" answer here. There really aren't all that many DRMs out there once you filter the list by the number of places that actually use them; only a very few see enough usage to merit a lot of support. We chose to support a broader set of DRMs solely because (a) it wasn't all that hard to do, and (b) we wanted to make OMPI available to as wide an audience as we could.

Launching MPI jobs directly from the scheduler is not only possible, but available today with most (if not all) MPI implementations. Not all DRMs support it, but some do. To understand why some choose not to support that mode, you have to understand that startup of an MPI job consists of two very distinct phases:

1. mapping of the processes to the allocated nodes (defining which ranks go where), and the subsequent spawning of those procs; and

2. wireup of the MPI interconnects across the processes.

All of the DRMs/schedulers can do step 1. Doing step 2 in a fast, scalable way is non-trivial. Some vendors provide interconnects so tightly coupled to the DRM that step 2 can be done without exchanging messages to pass contact info - but that constrains the portability of the DRM itself, requires developing specialized interconnects for a limited market, etc. Other DRMs provide software support for step 2 - with the attendant investment in development and maintenance.

You are correct that it raises a question of return on investment, but I don't find that many DRM "vendors" are motivated by such things. Instead, they appear to be motivated primarily by ego ("we can build it better than anyone else") and competition (in many cases, the DRM is developed under a funding grant that continues only so long as the developing organization can keep winning grants). There is, therefore, little motivation to standardize DRM interfaces or support, so I very much doubt you'll see a consolidation of DRM interfaces any time soon.

Of course, the various DRMs do provide differing levels of support (e.g., fault tolerance). We at OMPI made the decision to expend the effort to provide an even user-level experience by filling in any differences in DRM capability from within OMPI itself. So there is a -lot- of code within OMPI's RTE dedicated to providing capabilities found in one environment that might be missing in another. We do that within our modular architecture, though: where a capability is available via the DRM, we exploit it - and where it isn't, we implement it ourselves. Some DRM providers wonder at times why we do that - after all, if we only used what the DRM provided, our lives would be easier. But we believe the user would not benefit from that approach, and so we continue to make the effort. Our "reward" is that users can run an OMPI program on nearly every system we know about and have it behave exactly the same way (some setting of MCA params may be required).

Long-winded answer - hope it provides some insight into the decisions we make.

Ralph
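P.S. To make the two startup phases concrete, here is a toy, self-contained illustration of the step 2 wireup exchange - the publish/fence/lookup pattern a launcher-backed key-value store provides. The kvs_put()/kvs_get() helpers and the endpoint addresses below are made-up stand-ins, not OMPI's actual internals, and the whole exchange is simulated inside a single process:

/* Toy, single-process illustration of the "step 2" wireup exchange.
 * kvs_put()/kvs_get() are hypothetical stand-ins for the key-value
 * store a launcher can provide over its out-of-band channel; this is
 * NOT OMPI's actual internal API. */
#include <stdio.h>
#include <string.h>

#define NPROCS 4

/* In-memory stand-in for the launcher's key-value store. */
static char keys[NPROCS][32];
static char vals[NPROCS][32];
static int  nkeys = 0;

static void kvs_put(const char *key, const char *val)
{
    snprintf(keys[nkeys], sizeof(keys[0]), "%s", key);
    snprintf(vals[nkeys], sizeof(vals[0]), "%s", val);
    nkeys++;
}

static const char *kvs_get(const char *key)
{
    for (int i = 0; i < nkeys; i++)
        if (strcmp(keys[i], key) == 0)
            return vals[i];
    return "(unknown)";
}

int main(void)
{
    char key[32], endpoint[32];

    /* Step 1 already happened: the DRM mapped ranks to nodes and
     * spawned them. Step 2 starts with every rank publishing the
     * contact info for its interconnect (addresses are made up). */
    for (int rank = 0; rank < NPROCS; rank++) {
        snprintf(key, sizeof(key), "endpoint-%d", rank);
        snprintf(endpoint, sizeof(endpoint), "10.0.0.%d:7000", rank + 1);
        kvs_put(key, endpoint);
    }

    /* In a real launcher, a collective fence/barrier sits here so no
     * rank looks up a peer before that peer has published. */

    /* Then each rank looks up the peers it needs to connect to. */
    for (int rank = 0; rank < NPROCS; rank++) {
        snprintf(key, sizeof(key), "endpoint-%d", rank);
        printf("rank %d reachable at %s\n", rank, kvs_get(key));
    }
    return 0;
}

The fence in the middle is the expensive part: at scale, every rank must wait until every other rank's contact info is visible before it can connect, which is exactly why doing step 2 quickly is non-trivial.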
On Mar 10, 2010, at 7:03 PM, Brian Smith wrote:

> Hi, All,
>
> This may seem like an odd query (or not; perhaps it has been brought
> up before). My work recently involves HPC usability, i.e. making
> things easier for new users by abstracting away the scheduler. I've
> been working with DRMAA for interfacing with DRMs and it occurred to
> me: what would be the advantage to letting the scheduler itself
> handle farming out MPI processes as individual tasks rather than
> having a wrapper like mpirun handle this task via ssh/rsh/etc.?
>
> I thought about MPI2's ability to do dynamic process management and
> how scheduling environments tend to allocate static pools of
> resources for parallel tasks. A DRMAA-driven MPI would be able to
> request that the scheduler launch these tasks as resources become
> available, enabling scheduled MPI jobs to dynamically add and remove
> processors during execution. Several applications that I have worked
> with come to mind, where pre-processing and other tasks are
> non-parallel whereas the various solvers are. Being able to
> dynamically spawn processes based on where you are in this work-flow
> could be very useful here.
>
> It also occurred to me that commercial application vendors tend to
> roll their own when it comes to integrating their applications with
> an MPI library. I've seen applications use HP-MPI, MPICH, MPICH2,
> Intel-MPI, (and thankfully, recently) OpenMPI and then proceed to
> butcher the execution mechanisms to such an extent that it makes
> integration with common DRM systems quite a task. With the exception
> of OpenMPI, none of these libraries provides turn-key compatibility
> with most of the major DRMs, and each requires some degree of manual
> integration and testing for use in a multi-user production
> environment. I would think that vendors would be falling over
> themselves to integrate OpenMPI with their applications for this
> very reason alone. Instead, some opt to develop their own scheduling
> environments! Don't they have bean counters that sit around and
> gripe about duplicated work?
>
> Then it occurred to me: with the exception of being able to easily
> launch an MPI job with OpenMPI, the ability to monitor it from
> within the application is still dependent on the vendor integrating
> with various DRMs! This is another area where a DRMAA RAS can come
> in handy. There are nice bindings for monitoring tasks and getting
> an idea of where you are in execution without having to resort to
> kludgey shell-script wrappers tailing output files.
>
> Anyway, it's been a frustrating couple of weeks dealing with several
> commercial vendors and integrating their applications with our DRM,
> and my mind has been trying to think of a solution that could save
> all of us a lot of work (though, at the same time, raise job-security
> concerns in such turbulent times ;-/ ). What say you, MPI experts?
>
> Many thanks for your thoughts!
> -Brian
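Two sketches to anchor Brian's points. First, the dynamic process management he mentions: in MPI-2 terms, a serial front end calls MPI_Comm_spawn when the workflow reaches its parallel phase. A minimal sketch - "./solver" is a hypothetical worker executable that would call MPI_Init, MPI_Comm_get_parent, and the matching collectives - and whether the DRM will actually grow the allocation on demand is exactly the open question:

/* Serial front end that spawns parallel solver workers only when the
 * workflow reaches the parallel phase. "./solver" is hypothetical. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm solvers;      /* intercommunicator to the spawned workers */
    int errcodes[4];
    int work = 42, result = 0;

    MPI_Init(&argc, &argv);

    /* ... serial pre-processing happens here ... */

    /* Ask the runtime - and, through it, the DRM - for 4 new procs. */
    MPI_Comm_spawn("./solver", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0 /* root */, MPI_COMM_SELF, &solvers, errcodes);

    /* Hand the workers their input and collect a result; the workers
     * make the matching calls on the communicator returned by
     * MPI_Comm_get_parent(). */
    MPI_Bcast(&work, 1, MPI_INT, MPI_ROOT, solvers);
    MPI_Reduce(NULL, &result, 1, MPI_INT, MPI_SUM, MPI_ROOT, solvers);
    printf("solvers returned %d\n", result);

    MPI_Comm_disconnect(&solvers);
    MPI_Finalize();
    return 0;
}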
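Second, the monitoring he describes: with the DRMAA 1.0 C binding, job state becomes a simple poll against the DRM rather than a shell wrapper tailing output files. Another minimal sketch - "./run_solver.sh" is a made-up script name, and error-code checking is omitted for brevity:

/* Submit a job through DRMAA and poll its state. Sketch against the
 * DRMAA 1.0 C binding; "./run_solver.sh" is hypothetical and the
 * return codes of the drmaa_* calls should be checked in real code. */
#include <stdio.h>
#include <unistd.h>
#include "drmaa.h"

int main(void)
{
    char err[DRMAA_ERROR_STRING_BUFFER];
    char jobid[DRMAA_JOBNAME_BUFFER];
    drmaa_job_template_t *jt = NULL;
    int state = DRMAA_PS_UNDETERMINED;

    drmaa_init(NULL, err, sizeof(err));          /* default session */

    drmaa_allocate_job_template(&jt, err, sizeof(err));
    drmaa_set_attribute(jt, DRMAA_REMOTE_COMMAND, "./run_solver.sh",
                        err, sizeof(err));

    drmaa_run_job(jobid, sizeof(jobid), jt, err, sizeof(err));
    drmaa_delete_job_template(jt, err, sizeof(err));

    /* Poll the DRM for job state - no output-file tailing needed. */
    do {
        drmaa_job_ps(jobid, &state, err, sizeof(err));
        printf("job %s state: 0x%x\n", jobid, state);
        sleep(5);
    } while (state != DRMAA_PS_DONE && state != DRMAA_PS_FAILED);

    drmaa_exit(err, sizeof(err));
    return 0;
}

The mpirun (or DRM-native) launch of the parallel solver would then live inside that remote command - which is precisely the integration point the commercial vendors keep reinventing.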