Alberto,
In the master there is no such support (we had support for migration a
while back, but we have stripped it out). However, at UTK we developed a
fork of Open MPI, called ULFM, which provides fault management
capabilities. This fork provides support to detect failures, and support
for hand
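For reference, a minimal sketch of what failure handling with the ULFM
extensions might look like. The MPIX_* names below (MPIX_ERR_PROC_FAILED,
MPIX_Comm_revoke, MPIX_Comm_shrink) are the ULFM extension API, not part of
stock Open MPI, and the header location is an assumption:

// Hedged sketch of ULFM-style failure handling (C++ using the MPI C API).
#include <mpi.h>
#include <mpi-ext.h>   // assumed location of the ULFM MPIX_* extensions
#include <cstdio>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm world = MPI_COMM_WORLD;
    // Errors must be returned instead of aborting so the application can react.
    MPI_Comm_set_errhandler(world, MPI_ERRORS_RETURN);

    int rank;
    MPI_Comm_rank(world, &rank);

    int token = rank;
    int rc = MPI_Allreduce(MPI_IN_PLACE, &token, 1, MPI_INT, MPI_SUM, world);

    if (rc == MPIX_ERR_PROC_FAILED || rc == MPIX_ERR_REVOKED) {
        // Make sure every rank learns about the failure, then rebuild the
        // communicator without the failed processes and continue.
        MPIX_Comm_revoke(world);
        MPI_Comm repaired;
        MPIX_Comm_shrink(world, &repaired);
        world = repaired;
        std::printf("rank %d: communicator repaired after a failure\n", rank);
    }

    MPI_Finalize();
    return 0;
}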
Hi,
I am interested in using OpenMPI to manage the distribution on a MicroZed
cluster. These MicroZed boards come with a Zynq device, which has a
dual-core ARM Cortex-A9. One of the objectives of the project I am working
on is resilience, so I am truly interested in the fault tolerance provided
by
Dear all,
May I help in this context? I can't promise to do big things or offer high
availability in this regard, because I may get busier with my work.
Also, I am not sure whether my
company will allow this. Anyway, I may do this in my spare time.
Thanks & Regards,
On 12/23/09, Ralph Castai
That's just OMPI's default behavior - as Josh said, we are working towards
allowing other behaviors, but for now, this is what we have.
On Dec 23, 2009, at 5:40 AM, vipin kumar wrote:
> Thank you Ralph,
>
> I did as you said. Programs are running fine, but killing one process
> still leads to
Thank you Ralph,
I did as you said. Programs are running fine, but killing one process
still leads to all processes being terminated. Am I missing something? Is there
anything else to be called along with MPI::Comm::Disconnect()?
Thanks & Regards,
On Mon, Dec 21, 2009 at 8:00 PM, Ralph Castain wrote:
> Disconnect
Disconnect is a -collective- operation. Both parent and child have to call it.
Your child process is "hanging" while it waits for the parent.
On Dec 21, 2009, at 1:37 AM, vipin kumar wrote:
> Hello folks,
>
> As I explained my problem earlier, I am looking for Fault Tolerance in MPI
> Programs
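To illustrate the collective nature of the call, here is a minimal sketch
(C bindings, hypothetical file and program names) in which both the parent
and the spawned child call MPI_Comm_disconnect on the intercommunicator:

// parent.cpp -- spawns one child, then disconnects from it.
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm child;
    // "child_exe" is a placeholder for the spawned program's binary.
    MPI_Comm_spawn("child_exe", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);

    // ... exchange messages with the child over the intercommunicator ...

    // Disconnect is collective over the intercommunicator: the child must
    // make the matching call below, otherwise one side blocks here.
    MPI_Comm_disconnect(&child);

    MPI_Finalize();
    return 0;
}

// child.cpp -- the matching side of the collective disconnect.
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);

    if (parent != MPI_COMM_NULL)
        MPI_Comm_disconnect(&parent);   // matches the parent's call

    MPI_Finalize();
    return 0;
}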
Hello folks,
As I explained my problem earlier, I am looking for Fault Tolerance in MPI
Programs. I read in the MPI 2.1 standard document that two DISCONNECTED
processes do not affect each other, i.e. they can die or be killed
without affecting other processes.
So, I was trying th
Hi,
I was trying to install Open MPI with fault tolerance support (BLCR) on
my cluster.
The OS is Ubuntu 9.04 server version (64-bit).
I was able to install Open MPI via apt-get:
apt-get install libopenmpi-dev libopenmpi1 openmpi-bin openmpi-common
openmpi-doc
However, it seems that the checkpo
Unfortunately I cannot provide a precise time frame for availability
at this point, but we are targeting the v1.5 release series. There is
a handful of core developers working on this issue at the moment.
Pieces of this work have already made it into the Open MPI
development trunk. If you
Hi Josh,
It is good to hear from you that work is in progress towards resiliency in
Open MPI. I was, and still am, waiting for this capability in Open MPI. I have
almost finished my development work and am waiting for this to happen so that I
can test my programs. It would be good if you could tell how long i
Task-farm or manager/worker recovery models typically depend on
intercommunicators (i.e., from MPI_Comm_spawn) and a resilient MPI
implementation. William Gropp and Ewing Lusk have a paper entitled
"Fault Tolerance in MPI Programs" that outlines how an application
might take advantage of th
Is that kind of approach possible within an MPI framework? Perhaps a
grid approach would be better. More experienced people, speak up,
please?
(The reason I say that is that I too am interested in the solution of
that kind of problem, where an individual blade of a blade server
fails and correcting
Hi
I guess "task-farming" could give you a certain amount of the kind of
fault-tolerance you want.
(i.e. a master process distributes tasks to idle slave processors -
however, this will only work
if the slave processes don't need to communicate with each other)
Jody
On Mon, Aug 3, 2009 at 1:24
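As a rough illustration of that pattern, here is a minimal task-farm
skeleton. All names, tags and the placeholder "work" are made up for the
example, and it needs at least two processes:

// Rank 0 hands out work items; the other ranks process them independently
// and never talk to each other.
#include <mpi.h>

static const int TAG_WORK = 1;
static const int TAG_STOP = 2;

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                       // master
        const int ntasks = 100;
        int next = 0, result;
        MPI_Status st;

        // Prime every slave with either a first task or an immediate stop.
        for (int r = 1; r < size; ++r) {
            if (next < ntasks) {
                MPI_Send(&next, 1, MPI_INT, r, TAG_WORK, MPI_COMM_WORLD);
                ++next;
            } else {
                MPI_Send(&next, 0, MPI_INT, r, TAG_STOP, MPI_COMM_WORLD);
            }
        }

        // Hand out the remaining tasks to whichever slave answers first.
        for (int done = 0; done < ntasks; ++done) {
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (next < ntasks) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                ++next;
            } else {
                MPI_Send(&next, 0, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
            }
        }
    } else {                               // slave
        int task;
        MPI_Status st;
        for (;;) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP)
                break;
            int result = task * task;      // stand-in for real work
            MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}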
Hi all,
Thanks, Durga, for your reply.
Jeff, you once wrote code for the Mandelbrot set to demonstrate fault tolerance
in LAM/MPI, i.e. killing any slave process doesn't
affect the others. That is exactly the behaviour I am looking for in Open MPI.
I attempted it, but no luck. Can you please tell me how to write such a program
Although I have perhaps the least experience on the topic in this
list, I will take a shot; more experienced people, please correct me:
The MPI standards specify communication mechanisms, not fault tolerance at
any level. You may achieve network fault tolerance at the IP level by
implementing 'equal cost mul
Hi all,
I want to know whether Open MPI supports network and process fault tolerance.
If there is any example demonstrating these features, that would be
best.
Regards,
--
Vipin K.
Research Engineer,
C-DOTB, India
LAM/MPI was able to checkpoint/restart an entire MPI job as you
mention. Open MPI is now able to checkpoint/restart as well. In the
past week I added to the Open MPI trunk a LAM/MPI-like checkpoint/
restart implementation. In Open MPI we revisited many of the design
decisions from the LAM/MP
What you're looking for is called PVM. Moreover, your requirements
are a mixed bag of FT features that come from completely different
worlds.
1) Recover any software/hardware crashes? What kind of recovery are
you looking for? What is your definition of recovering? If what
you want is
To migrate processes, you need to be able to checkpoint them. I
believe that LAM-MPI is the only MPI implementation that allows this,
although I have never used LAM-MPI.
Good luck.
Tom Spraggins
t...@virginia.edu
On Mar 21, 2007, at 1:09 PM, Mohammad Huwaidi wrote:
Hello folks,
I am try
Hello folks,
I am trying to write a fault-tolerant system with the following
criteria:
1) Recover from any software/hardware crashes.
2) Dynamically shrink and grow.
3) Migrate processes among machines.
Does anyone have examples of code? What MPI platform is recommended to
accomplish such requi
On Mar 16, 2007, at 5:44 PM, Mohammad Huwaidi wrote:
The following code is my attempt to write a fault-tolerant
application with Open MPI; however, it still does not trap exceptions:
I'm not sure what your question is.
It does not seem to trap exceptions because, at least at first
glance, your
The following code is my attempt to write a fault-tolerant application with
Open MPI; however, it still does not trap exceptions:
#include
//#include
#include
#include
#include
#include
#include
#define BUFSIZE 100
using namespace std;
using namespace MPI;
static int nerr = 0;
static int i
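For what it's worth, MPI calls only raise C++ exceptions after the default
error handler (ERRORS_ARE_FATAL, which simply aborts the job) has been
replaced. Below is a small stand-alone sketch, separate from the code above;
whether a failed peer actually surfaces as a catchable error still depends
on the implementation:

// Switch MPI_COMM_WORLD to throwing C++ exceptions, then catch
// MPI::Exception around the communication calls.
#include <mpi.h>
#include <iostream>

int main(int argc, char *argv[])
{
    MPI::Init(argc, argv);

    // Without this, the default MPI::ERRORS_ARE_FATAL aborts the whole job
    // before any exception can be caught.
    MPI::COMM_WORLD.Set_errhandler(MPI::ERRORS_THROW_EXCEPTIONS);

    int rank = MPI::COMM_WORLD.Get_rank();

    try {
        int value = rank;
        MPI::COMM_WORLD.Bcast(&value, 1, MPI::INT, 0);
    } catch (MPI::Exception &e) {
        std::cerr << "rank " << rank << ": MPI error "
                  << e.Get_error_code() << ": "
                  << e.Get_error_string() << std::endl;
    }

    MPI::Finalize();
    return 0;
}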
On Tue, 31 Oct 2006 08:43:10 -0700, Galen M. Shipman
wrote:
Okay, so these are percentages, not moduli; the formula makes some sense
now.
So the timeout is between 4.9 and 10.3 ms; you had better plug the cable
in/out very quickly.
The Flash could do it.
--
Troy Telford
Galen M. Shipman wrote:
Gleb Natapov wrote:
On Mon, Oct 30, 2006 at 11:45:53AM -0700, Troy Telford wrote:
On Sun, 29 Oct 2006 01:34:06 -0700, Gleb Natapov
wrote:
If you use OB1 PML (default one) it will never recover from link down
error no matter how many other tran
Gleb Natapov wrote:
On Mon, Oct 30, 2006 at 11:45:53AM -0700, Troy Telford wrote:
On Sun, 29 Oct 2006 01:34:06 -0700, Gleb Natapov
wrote:
If you use OB1 PML (default one) it will never recover from link down
error no matter how many other transports you have. The reason is that
OB1
On Mon, Oct 30, 2006 at 11:45:53AM -0700, Troy Telford wrote:
> On Sun, 29 Oct 2006 01:34:06 -0700, Gleb Natapov
> wrote:
>
> > If you use OB1 PML (default one) it will never recover from link down
> > error no matter how many other transports you have. The reason is that
> > OB1 never tracks w
On Sun, 29 Oct 2006 01:34:06 -0700, Gleb Natapov
wrote:
If you use OB1 PML (default one) it will never recover from link down
error no matter how many other transports you have. The reason is that
OB1 never tracks what happens with buffers submitted to BTL. So if BTL
can't, for any reason, tr
On Thu, Oct 26, 2006 at 05:39:13PM -0600, Troy Telford wrote:
> I'm also confident that both TCP & Myrinet would throw an error when they
> time out; it's just that I haven't felt the need to verify it. (And with
> 20-some-odd minutes for Myrinet, it takes a bit of attention span. The
> las
On Thu, 26 Oct 2006 15:11:46 -0600, George Bosilca
wrote:
The Open MPI behavior is the same independent of the network used
for the job. At least the behavior dictated by our internal message
passing layer.
Which is one of the things I like about Open MPI.
There is nothing (that has a r
Moreover ... you have to have admin rights in order to modify
these parameters. If that's the case, there is a trick for MX too. One
can recompile it with a different timeout (recompilation is required,
as far as I remember). Grep for timeout in the MX sources and you
will find out how to
As an alternate suggestion (although George's is better, since this will
affect your entire network connectivity), you could override the default TCP
timeout values with the "sysctl -w" command.
The following three OIDs affect TCP timeout behavior under Linux:
net.ipv4.tcp_keepalive_intvl = 75 <-
The Open MPI behavior is the same independent of the network used
for the job. At least the behavior dictated by our internal message
passing layer. But for this to happen, we should get a warning from
the network that something is wrong (such as a timeout). In the case of
TCP (and Myrinet)
I've recently had the chance to see how Open MPI (as well as other MPIs)
behaves in the case of network failure.
I've looked at what happens when a node has its network connection
disconnected in the middle of a job, with Ethernet, Myrinet (GM), and
InfiniBand (OpenIB).
With Ethernet and M