On Feb 5, 2010, at 6:40 PM, Gene Cooperman wrote:

> You're correct that we take a virtualized approach by intercepting network
> calls, etc.  However, we purposely never intercept any frequently
> called system calls.  So, for example, we never intercept a call
> to read() or to write() in TCP/IP, as part of our core design principles.
> Instead, we use things like the proc filesystem and system calls
> to find the offset in an open file descriptor.

Coolio.

> We would love the opportunity to work with you on a demonstration for
> the high-performance networks that you mention.  Can you suggest an
> MPI code and the appropriate hardware testbed on which we could get an
> account and run?

Any MPI code should do -- even something as simple as a 
pass-the-message-around-in-a-ring app.  If you can checkpoint and restart it, 
that's a good start.
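
If you want a concrete ring to start from, a minimal one in C looks roughly 
like the sketch below (typed from memory, so treat it only as a sketch; 
compile with mpicc and run with at least 2 processes, e.g. "mpirun -np 4 
./ring"):

    /* ring.c: pass a token around the ranks once -- a trivial test app
     * for checkpoint/restart experiments.
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, token;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            token = 42;  /* rank 0 starts the token and gets it back last */
            MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 0 got the token back: %d\n", token);
        } else {
            MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

In practice you'd wrap the send/receive in a loop so the job runs long enough 
to checkpoint mid-run (e.g. with the dmtcp_* commands quoted later in this 
mail), and then verify the token still makes it around after the restart.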

As for high-speed networks, any iWARP, IB, or Myrinet-based network should do.
iWARP and IB use the OpenFabrics verbs API; Myrinet networks use the MX API.
AFAIK, neither of them exports counters through /sys or /proc.

> We are aware of your plugin facilities and that in addition to BLCR,
> other checkpointers can also integrate with it.  And of course, we have the
> highest respect for BLCR.  We think that at this time, it is best to
> continue exploring both approaches.

One clarification -- our plugin interfaces were not designed specifically to 
support BLCR.  They were designed to support generic checkpointing facilities.  
Of course, we only had a few in mind when they were designed, so it's possible 
that they might need to be extended if yours differs from the general model 
that we envisioned.  But all things are do-able.

I just mention this in case you wish to pursue the inside-Open-MPI plugin 
approach.  Of course, staying outside of Open MPI is advantageous from a 
portability point of view.

> Although we haven't looked so closely at the plugin facility, we had
> assumed that it always interoperates with the OpenMPI checkpoint-restart
> service developed by Joshua Hursey (for which we also have very
> high respect).  

Ya, he's a smart guy.  But don't say it too loud or he'll get a big ego!  ;-)

> Our understanding was that the OpenMPI checkpoint-restart
> service arranges to halt all MPI messages, and then it calls BLCR
> for checkpointing on the local host.

The short answer is "yes".  The longer answer is that Josh designed a few 
different types of plugins -- some for quiescing a job, some for back-end 
checkpointer support, etc.  Hence, one is not directly dependent on the other.
I believe that he has written some papers about this... ah, here's one of them 
(you may have seen this already?):

    http://www.open-mpi.org/papers/hpdc-2009/

> DMTCP tries to do the job of both Josh's checkpoint-restart service
> and also BLCR, and it does it all transparently by working at the
> TCP/IP socket level.  So, we simply run:
>   dmtcp_checkpoint mpirun ./hello_mpi
>   dmtcp_command --checkpoint
>   ./dmtcp_restart_script.sh
> (The file QUICK-START in DMTCP has a few more details.)
> Hence, we don't use the OpenMPI checkpoint-restart service or its plugin
> interface, since we're already able to do the distributed checkpointing
> directly.  If it were important, we could modify DMTCP to be called
> by the plugin, and to do checkpointing only on the local host.

I guess that's what I was asking about -- whether you thought it would be 
worthwhile to do that: have your checkpoint service be called by Open MPI.  In 
this way, you'd use Open MPI's infrastructure to invoke your single-process 
checkpointer underneath.
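
For comparison with the dmtcp_* commands quoted above, the Open MPI-driven 
flow looks roughly like this (I'm recalling the 1.3/1.4-era syntax from 
memory, so double-check the docs; "./ring" is just a stand-in app name):

    mpirun -am ft-enable-cr -np 4 ./ring    # run with C/R support enabled
    ompi-checkpoint <pid_of_mpirun>         # trigger a global checkpoint
    ompi-restart <global_snapshot_handle>   # restart from that snapshot

The difference is mainly who orchestrates: here mpirun drives the quiesce and 
then invokes the back-end checkpointer (BLCR today, potentially DMTCP) on each 
node.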

I'm guessing there are advantages and disadvantages to both.

> Also, as a side comment, DMTCP was already working with OpenMPI 1.2, but then
> later versions of OpenMPI started using more sophisticated system calls.
> By then, we were already working through different tasks, and it has
> taken us this long to come back to OpenMPI and properly support it again
> through our virtualized approach (properly handling the multiple
> ptys of OpenMPI, etc.).

Gotcha.  FWIW, we don't checkpoint the run-time system in Open MPI -- we only 
checkpoint the MPI processes.  Upon restart, we rebuild the run-time system and 
then launch the "restart" phase in the MPI processes.  In this way, we avoided 
a lot of special case code and took advantage of much of the infrastructure 
that we already had.

This could probably be construed as an advantage of working in the plugin 
system of Open MPI -- you'd pretty much be insulated from those more complex 
system calls, etc.  Indeed, that was one of Josh's primary design goals: 
separate the "quiesce" phase from the "checkpoint" phase, because they really 
are two different things and they don't necessarily have to be related.
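
To make that split concrete, here's a purely illustrative sketch in C -- these 
are not Open MPI's actual plugin interfaces or names, just the shape of the 
separation:

    /* Hypothetical interface, NOT Open MPI's real API: the point is that
     * "quiesce the MPI traffic" and "checkpoint one local process" are
     * independent callbacks, so a back-end checkpointer (BLCR, DMTCP, ...)
     * only has to fill in the process-level half.
     */
    #include <sys/types.h>

    typedef struct ckpt_plugin {
        int (*quiesce)(void);               /* drain in-flight MPI messages  */
        int (*checkpoint)(pid_t pid,        /* save one local process image  */
                          const char *dir);
        int (*restart)(const char *image);  /* recreate a process from image */
        int (*resume)(void);                /* re-enable MPI traffic         */
    } ckpt_plugin_t;

The quiesce/resume half stays Open MPI-specific; the checkpoint/restart half 
is where something like DMTCP (restricted to the local host) would plug in.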

That being said, a disadvantage of this approach is that that work (i.e., the 
plugin -- not the actual back-end checkpointer) then becomes specifically tied 
to Open MPI.  What we did with BLCR was to write a thin plugin that simply 
links to BLCR, where the majority of the work is contained.  Hence, the plugin 
was pretty small -- it just interfaces to external functionality (many of Open 
MPI's plugins do that: the OMPI/MPI-specific logic is in the plugin, but we 
link against external libraries for additional functionality).  Also, when 
working in our plugin system, you're using our model and infrastructure -- not 
your own.

Josh had a lot of freedom to design our model and is finishing his PhD because 
of it :-), but he definitely had the "first implementor" advantage.  While we 
certainly encourage (and want!) new and novel work, the onus is now on new 
proposers to show why their system would be better than the one we have, etc.

> So, in conclusion, DMTCP will work fine with OpenMPI out of the box
> for small and medium jobs.  For questions of scalable computation,
> measuring overhead, and so on, we would need a partner to address those
> issues.  We would do most of the work, but we would need someone more
> intimately familiar with good testbeds for OpenMPI, prioritized goals
> for OpenMPI, and so on, in order to help lead us through the challenge
> of scalability.  If that partner recommends that we would integrate
> best through the OpenMPI plugin, we can certainly do that.  In fact, we are
> working right now with the Condor group to have them validate DMTCP
> as a checkpointer (initially for their vanilla universe) by operating
> through the Condor checkpoint interface.

Nifty.  If we want to have more detailed conversations, a phone call is likely 
best.  Ping me off-list and we can set up a time.

Little known fact: one of the primary communication tools between Open MPI 
developers is the telephone (!).  We all email and IM each other frequently, 
but you can save a week's worth of exhausting emails with a 30- or 60-minute 
phone conversation.  :-)

-- 
Jeff Squyres
jsquy...@cisco.com

For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

