Re: [OMPI users] Fwd: [OMPI USERS] Fault Tolerance and migration

2017-02-27 Thread George Bosilca
Alberto, In the master there is no such support (we had support for migration a while back, but we have stripped it out). However, at UTK we developed a fork of Open MPI, called ULFM, which provides fault management capabilities. This fork provides support to detect failures, and support for hand

[OMPI users] Fwd: [OMPI USERS] Fault Tolerance and migration

2017-02-27 Thread Alberto Ortiz
Hi, I am interested in using OpenMPI to manage the distribution on a MicroZed cluster. These MicroZed boards come with a Zynq device, which has a dual-core ARM Cortex-A9. One of the objectives of the project I am working on is resilience, so I am truly interested in the fault tolerance provided by

Re: [OMPI users] fault tolerance in open mpi

2009-12-24 Thread vipin kumar
Dear all, May I help in this context? I can't promise to do big things or high availability in this regard, because I may get busier in my work. Also, I am not sure whether my company will allow this. Anyway, I may do this in my spare time. Thanks & Regards, On 12/23/09, Ralph Castai

Re: [OMPI users] fault tolerance in open mpi

2009-12-23 Thread Ralph Castain
That's just OMPI's default behavior - as Josh said, we are working towards allowing other behaviors, but for now, this is what we have. On Dec 23, 2009, at 5:40 AM, vipin kumar wrote: > Thank you Ralph, > > I did as you said. Programs are running fine, But still killing one process > leads to

Re: [OMPI users] fault tolerance in open mpi

2009-12-23 Thread vipin kumar
Thank you Ralph, I did as you said. The programs are running fine, but killing one process still terminates all processes. Am I missing something? Does anything else need to be called with MPI::Comm::Disconnect()? Thanks & Regards, On Mon, Dec 21, 2009 at 8:00 PM, Ralph Castain wrote: > Disconnect

Re: [OMPI users] fault tolerance in open mpi

2009-12-21 Thread Ralph Castain
Disconnect is a -collective- operation. Both parent and child have to call it. Your child process is "hanging" while it waits for the parent. On Dec 21, 2009, at 1:37 AM, vipin kumar wrote: > Hello folks, > > As I explained my problem earlier, I am looking for Fault Tolerance in MPI > Programs
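The hang described above follows from what "collective" means: the call cannot finish on one side until the other side has entered it too. As a rough analogy only (this is not MPI code), the sketch below models the two sides with Python's `threading.Barrier`. The function name `simulate_disconnect` and the 0.5 second timeout are made up for illustration; a real MPI collective has no timeout and would simply keep waiting.

```python
import threading


def simulate_disconnect(parent_calls, timeout=0.5):
    """Model a two-party collective call.

    The child always enters the call; the parent enters it only when
    parent_calls is True. Returns what each side observed.
    """
    barrier = threading.Barrier(2)  # two participants: parent and child
    outcome = {}

    def side(name, participates):
        if not participates:
            outcome[name] = "never called"
            return
        try:
            barrier.wait(timeout=timeout)
            outcome[name] = "completed"
        except threading.BrokenBarrierError:
            # The timeout exists only so the demo terminates.
            # A real collective would block here indefinitely.
            outcome[name] = "stuck waiting"

    threads = [
        threading.Thread(target=side, args=("child", True)),
        threading.Thread(target=side, args=("parent", parent_calls)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome


print(simulate_disconnect(parent_calls=True))
print(simulate_disconnect(parent_calls=False))
```

When both sides make the call, both complete. When the parent skips it, the child is left waiting, which matches the "hanging" child in the message above.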

Re: [OMPI users] fault tolerance in open mpi

2009-12-21 Thread vipin kumar
Hello folks, As I explained my problem earlier, I am looking for Fault Tolerance in MPI Programs. I read in the MPI 2.1 standard document that two DISCONNECTED processes do not affect each other, i.e. they can die or be killed without affecting other processes. So, I was trying th

[OMPI users] fault tolerance support via apt-get

2009-10-06 Thread Hui Jin
Hi, I was trying to install Open MPI with fault tolerance support (BLCR) on my cluster. The OS is Ubuntu 9.04 server version (64-bit). I was able to install Open MPI with apt-get: apt-get install libopenmpi-dev libopenmpi1 openmpi-bin openmpi-common openmpi-doc However, it seems that the checkpo

Re: [OMPI users] fault tolerance in open mpi

2009-09-23 Thread Josh Hursey
Unfortunately I cannot provide a precise time frame for availability at this point, but we are targeting the v1.5 release series. There is a handful of core developers working on this issue at the moment. Pieces of this work have already made it into the Open MPI development trunk. If you

Re: [OMPI users] fault tolerance in open mpi

2009-09-18 Thread vipin kumar
Hi Josh, It is good to hear from you that work is in progress towards resiliency of Open-MPI. I was and I am waiting for this capability in Open-MPI. I have almost finished my development work and waiting for this to happen so that I can test my programs. It will be good if you can tell how long i

Re: [OMPI users] fault tolerance in open mpi

2009-08-03 Thread Josh Hursey
Task-farm or manager/worker recovery models typically depend on intercommunicators (i.e., from MPI_Comm_spawn) and a resilient MPI implementation. William Gropp and Ewing Lusk have a paper entitled "Fault Tolerance in MPI Programs" that outlines how an application might take advantage of th

Re: [OMPI users] fault tolerance in open mpi

2009-08-03 Thread Durga Choudhury
Is that kind of approach possible within an MPI framework? Perhaps a grid approach would be better. More experienced people, speak up, please? (The reason I say that is that I too am interested in the solution of that kind of problem, where an individual blade of a blade server fails and correcting

Re: [OMPI users] fault tolerance in open mpi

2009-08-03 Thread jody
Hi, I guess "task-farming" could give you a certain amount of the kind of fault-tolerance you want. (i.e. a master process distributes tasks to idle slave processors - however, this will only work if the slave processes don't need to communicate with each other) Jody On Mon, Aug 3, 2009 at 1:24
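The task-farming idea above can be sketched without any MPI at all. The following is a plain Python simulation, not an Open MPI program: the names `run_task_farm` and `failing_workers` are hypothetical, the "work" is just squaring a number, and a worker failure is modelled as the worker dying on the first task it receives. It only illustrates the scheduling logic, which works because the tasks are independent of each other.

```python
from collections import deque


def run_task_farm(tasks, workers, failing_workers=()):
    """Simulate a master handing independent tasks to idle workers.

    A worker named in failing_workers dies on the first task it is given.
    The master puts that task back in the queue and stops using the worker.
    Returns a dict mapping each task to (worker, result).
    """
    pending = deque(tasks)
    idle = deque(workers)
    results = {}
    while pending:
        if not idle:
            raise RuntimeError("all workers failed")
        worker = idle.popleft()
        task = pending.popleft()
        if worker in failing_workers:
            pending.append(task)  # the lost task is re-queued
            continue              # the dead worker never rejoins the pool
        results[task] = (worker, task * task)
        idle.append(worker)       # the worker is idle again
    return results


done = run_task_farm([1, 2, 3, 4], ["w0", "w1", "w2"], failing_workers={"w1"})
for task in sorted(done):
    print(task, done[task])
```

Every task still finishes even though `w1` dies, because the master only needs to notice the failure and hand the task to someone else. With workers that must exchange data among themselves, this simple re-queueing would not be enough.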

Re: [OMPI users] fault tolerance in open mpi

2009-08-03 Thread vipin kumar
Hi all, Thanks Durga for your reply. Jeff, you once wrote code for the Mandelbrot set to demonstrate fault tolerance in LAM-MPI, i.e. killing any slave process doesn't affect the others. That is exactly the behaviour I am looking for in Open MPI. I attempted it, but no luck. Can you please tell how to write such program

Re: [OMPI users] fault tolerance in open mpi

2009-07-09 Thread Durga Choudhury
Although I have perhaps the least experience on the topic in this list, I will take a shot; more experienced people, please correct me: The MPI standards specify communication mechanisms, not fault tolerance at any level. You may achieve network tolerance at the IP level by implementing 'equal cost mul

[OMPI users] fault tolerance in open mpi

2009-07-09 Thread vipin kumar
Hi all, I want to know whether Open MPI supports network and process fault tolerance. If there is an example demonstrating these features, that would be best. Regards, -- Vipin K. Research Engineer, C-DOTB, India

Re: [OMPI users] Fault Tolerance

2007-03-22 Thread Josh Hursey
LAM/MPI was able to checkpoint/restart an entire MPI job as you mention. Open MPI is now able to checkpoint/restart as well. In the past week I added to the Open MPI trunk a LAM/MPI-like checkpoint/restart implementation. In Open MPI we revisited many of the design decisions from the LAM/MP

Re: [OMPI users] Fault Tolerance

2007-03-21 Thread George Bosilca
What you're looking for is called PVM. Moreover, your requirements are a mixed bag of FT features that come from completely different worlds. 1) Recover any software/hardware crashes? What kind of recovery are you looking for? What is your definition of recovering? If what you want is

Re: [OMPI users] Fault Tolerance

2007-03-21 Thread Thomas Spraggins
To migrate processes, you need to be able to checkpoint them. I believe that LAM-MPI is the only MPI implementation that allows this, although I have never used LAM-MPI. Good luck. Tom Spraggins t...@virginia.edu On Mar 21, 2007, at 1:09 PM, Mohammad Huwaidi wrote: Hello folks, I am try

[OMPI users] Fault Tolerance

2007-03-21 Thread Mohammad Huwaidi
Hello folks, I am trying to write some fault-tolerant systems with the following criteria: 1) Recover any software/hardware crashes 2) Dynamically Shrink and grow. 3) Migrate processes among machines. Does anyone have examples of code? What MPI platform is recommended to accomplish such requi

Re: [OMPI users] Fault Tolerance

2007-03-16 Thread Jeff Squyres
On Mar 16, 2007, at 5:44 PM, Mohammad Huwaidi wrote: The following code is my trial to write a fault-tolerant application on OpenMPI; however, it still does not trap exceptions: I'm not sure what your question is. It does not seem to trap exceptions because, at least at first glance, your

[OMPI users] Fault Tolerance

2007-03-16 Thread Mohammad Huwaidi
The following code is my trial to write a fault-tolerant application on OpenMPI; however, it still does not trap exceptions: #include //#include #include #include #include #include #include #define BUFSIZE 100 using namespace std; using namespace MPI; static int nerr = 0; static int i

Re: [OMPI users] Fault Tolerance & Behavior

2006-10-31 Thread Troy Telford
On Tue, 31 Oct 2006 08:43:10 -0700, Galen M. Shipman wrote: Okay, so these are percentage not modulus, the formula makes some sense now.. so the timeout is between 4.9 and 10.3 ms, you had better plug the cable in/out very quickly The Flash could do it. -- Troy Telford

Re: [OMPI users] Fault Tolerance & Behavior

2006-10-31 Thread Galen M. Shipman
Galen M. Shipman wrote: Gleb Natapov wrote: On Mon, Oct 30, 2006 at 11:45:53AM -0700, Troy Telford wrote: On Sun, 29 Oct 2006 01:34:06 -0700, Gleb Natapov wrote: If you use OB1 PML (default one) it will never recover from link down error no matter how many other tran

Re: [OMPI users] Fault Tolerance & Behavior

2006-10-31 Thread Galen M. Shipman
Gleb Natapov wrote: On Mon, Oct 30, 2006 at 11:45:53AM -0700, Troy Telford wrote: On Sun, 29 Oct 2006 01:34:06 -0700, Gleb Natapov wrote: If you use OB1 PML (default one) it will never recover from link down error no matter how many other transports you have. The reason is that OB1

Re: [OMPI users] Fault Tolerance & Behavior

2006-10-31 Thread Gleb Natapov
On Mon, Oct 30, 2006 at 11:45:53AM -0700, Troy Telford wrote: > On Sun, 29 Oct 2006 01:34:06 -0700, Gleb Natapov > wrote: > > > If you use OB1 PML (default one) it will never recover from link down > > error no matter how many other transports you have. The reason is that > > OB1 never tracks w

Re: [OMPI users] Fault Tolerance & Behavior

2006-10-30 Thread Troy Telford
On Sun, 29 Oct 2006 01:34:06 -0700, Gleb Natapov wrote: If you use OB1 PML (default one) it will never recover from link down error no matter how many other transports you have. The reason is that OB1 never tracks what happens with buffers submitted to BTL. So if BTL can't, for any reason, tr

Re: [OMPI users] Fault Tolerance & Behavior

2006-10-29 Thread Gleb Natapov
On Thu, Oct 26, 2006 at 05:39:13PM -0600, Troy Telford wrote: > I'm also confident that both TCP & Myrinet would throw an error when they > time out; it's just that I haven't felt the need to verify it. (And with > some-odd 20 minutes for Myrinet, it takes a bit of attention span. The > las

Re: [OMPI users] Fault Tolerance & Behavior

2006-10-26 Thread Troy Telford
On Thu, 26 Oct 2006 15:11:46 -0600, George Bosilca wrote: The Open MPI behavior is the same independently of the network used for the job. At least the behavior dictated by our internal message passing layer. Which is one of the things I like about Open MPI. There is nothing (that has a r

Re: [OMPI users] Fault Tolerance & Behavior

2006-10-26 Thread George Bosilca
Moreover ... you have to have admin rights in order to modify these parameters. If that's the case, there is a trick for MX too. One can recompile it, with a different timeout (recompilation is required as far as I remember). Grep for timeout in the MX sources and you will find out how to

Re: [OMPI users] Fault Tolerance & Behavior

2006-10-26 Thread Durga Choudhury
As an alternate suggestion (although George's is better, since this will affect your entire network connectivity), you could override the default TCP timeout values with the "sysctl -w" command. The following three OIDs affect TCP timeout behavior under Linux: net.ipv4.tcp_keepalive_intvl = 75 <-
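To see why these values matter for failure detection, it helps to add them up. The message above is cut off after the first setting, so the sketch below assumes the other two are the standard Linux keepalive knobs, `net.ipv4.tcp_keepalive_time` and `net.ipv4.tcp_keepalive_probes`, with their usual defaults of 7200 seconds and 9 probes. The function name is made up for illustration, and the result is only an approximation that applies to idle connections with keepalive enabled.

```python
def keepalive_detection_seconds(idle_time=7200, interval=75, probes=9):
    """Approximate time before an idle TCP peer is declared dead.

    idle_time: seconds of silence before the first keepalive probe
               (assumed to correspond to net.ipv4.tcp_keepalive_time)
    interval:  seconds between probes (net.ipv4.tcp_keepalive_intvl)
    probes:    unanswered probes before giving up
               (assumed to correspond to net.ipv4.tcp_keepalive_probes)
    """
    return idle_time + interval * probes


print(keepalive_detection_seconds())                                     # 7875
print(keepalive_detection_seconds(idle_time=60, interval=10, probes=3))  # 90
```

With the assumed defaults, a dead link on an idle connection goes unnoticed for a little over two hours, which is why lowering the values with "sysctl -w" shortens the wait. As noted in the message, the change is system-wide and affects every TCP connection on the host.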

Re: [OMPI users] Fault Tolerance & Behavior

2006-10-26 Thread George Bosilca
The Open MPI behavior is the same independently of the network used for the job. At least the behavior dictated by our internal message passing layer. But for this to happen, we need a warning from the network that something is wrong (such as a timeout). In the case of TCP (and Myrinet)

[OMPI users] Fault Tolerance & Behavior

2006-10-26 Thread Troy Telford
I've recently had the chance to see how Open MPI (as well as other MPIs) behaves in the case of network failure. I've looked at what happens when a node has its network connection disconnected in the middle of a job, with Ethernet, Myrinet (GM), and InfiniBand (OpenIB). With Ethernet and M