On Fri, 2010-10-22 at 07:36 -0600, Ralph Castain wrote:
> MPI won't do this - if a node dies, the entire MPI job is terminated.
>
> Take a look at OpenRCM, a subproject of Open MPI:
>
> http://www.open-mpi.org/projects/orcm/
>
> This is designed to do what you describe as we have a similar (open
> source) project underway at Cisco. If I were writing your system, I
> would:
>
> (a) add my sensors to the orte/mca/sensor framework. You'll find that
> we already monitor memory usage, for example. Use the orte/mca/db
> framework to store your data in a database. Several different
> databases are already supported, though it is easy to add another if
> you want (e.g., sqlite support).
>
> (b) add my desired error response to the src/orte/mca/errmgr/orcm
> module. The ability to migrate processes is already implemented, but
> you may need to do something additional to migrate a VM. If you
> prefer, you can create your own module in that area and use one of the
> other components as an example.
>
> Then let orcm start its daemons across your nodes. Orcm daemons will
> do the monitoring and reporting for you, and will start and monitor
> the virtual machines. If you set the max local restarts to 0, and max
> global restarts to some number, the system will automatically migrate
> any failures to other nodes.
>
> See the June 2010 presentation under "Publications" on the web page
> above for an overview of how it all works. If you decide to go this
> route, I'll be happy to provide advice and further explanation. And of
> course, you are welcome to participate in ORCM if you choose.
>

Thank you very much. I think this will be very useful for me.

Could you provide a link to the presentation? (I can't find it under
http://www.open-mpi.org/papers/)

Could you also send me a very simple example of how to use ORCM?
(Maybe I can get some useful information by reading
http://svn.open-mpi.org/svn/orcm/trunk/test...)

Does ORCM have man pages for its functions, as Open MPI does?

--
Vasiliy G Tolstov <v.tols...@selfip.ru>
Selfip.Ru