On Fri, 2010-10-22 at 07:36 -0600, Ralph Castain wrote:
> MPI won't do this - if a node dies, the entire MPI job is terminated.
> 
> 
> Take a look at OpenRCM, a subproject of Open MPI:
> 
> 
> http://www.open-mpi.org/projects/orcm/
> 
> 
> This is designed to do what you describe as we have a similar (open
> source) project underway at Cisco. If I were writing your system, I
> would:
> 
> 
> (a) add my sensors to the orte/mca/sensor framework. You'll find that
> we already monitor memory usage, for example. Use the orte/mca/db
> framework to store your data in a database. Several different
> databases are already supported, though it is easy to add another if
> you want (e.g., sqlite support).
> 
> 
> (b) add my desired error response to the src/orte/mca/errmgr/orcm
> module. The ability to migrate processes is already implemented, but
> you may need to do something additional to migrate a VM. If you
> prefer, you can create your own module in that area and use one of the
> other components as an example.
> 
> 
> Then let orcm start its daemons across your nodes. Orcm daemons will
> do the monitoring and reporting for you, and will start and monitor
> the virtual machines. If you set the max local restarts to 0, and max
> global restarts to some number, the system will automatically migrate
> any failures to other nodes.
> 
> 
> See the June 2010 presentation under "Publications" on the web page
> above for an overview of how it all works. If you decide to go this
> route, I'll be happy to provide advice and further explanation. And of
> course, you are welcome to participate in ORCM if you choose.
> 


Thank you very much. I think this will be very useful for me. Can you
send me a link to the presentation? (I can't find it under
http://www.open-mpi.org/papers/)

Could you also send me a very simple example of how to use ORCM? (Maybe I
can get some useful information by reading
http://svn.open-mpi.org/svn/orcm/trunk/test...)

Does ORCM have man pages for its functions, as Open MPI does?

-- 
Vasiliy G Tolstov <v.tols...@selfip.ru>
Selfip.Ru
