On Thu, Feb 28, 2008 at 04:53:11PM -0500, George Bosilca wrote:
> In this particular case, I don't think the solution is that obvious. If
> you look at the stack in the original email, you will notice how we get
> into this. The problem here is that the FREE_LIST_WAIT is used to get a
> fragment
George Bosilca wrote:
[.]
I don't think the root crashed. I guess that one of the other nodes
crashed, the root got a bad socket (which is what the first error
message seems to indicate), and got terminated. As the output is not
synchronized between the nodes, one cannot rely on its order
On Feb 28, 2008, at 2:45 PM, John Markus Bjørndalen wrote:
Hi, and thanks for the feedback everyone.
George Bosilca wrote:
Brian is completely right. Here is a more detailed description of this
problem.
[]
On the other side, I hope that not many users write such applications.
This is
In this particular case, I don't think the solution is that obvious.
If you look at the stack in the original email, you will notice how we
get into this. The problem here is that the FREE_LIST_WAIT is used to
get a fragment to store an unexpected message. If this macro returns
NULL (in other
On Thu, 28 Feb 2008, Gleb Natapov wrote:
> The trick is to call progress only from functions that are called
> directly by a user process. Never call progress from a callback function.
> The main offenders of this rule are calls to OMPI_FREE_LIST_WAIT(). They
> should be changed to OMPI_FREE_LIST_GET()
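A minimal sketch of the change proposed in that quote, using hypothetical,
simplified types and helper names rather than the real OMPI_FREE_LIST_*
macros: inside a callback, try a non-blocking get and defer the work when
it returns NULL, instead of blocking and re-entering progress.

/* Sketch only: hypothetical simplified stand-ins, not the real Open MPI
 * free-list API or data structures. */
#include <stddef.h>
#include <stdbool.h>

typedef struct frag { struct frag *next; } frag_t;
typedef struct { frag_t *head; } free_list_t;
typedef struct { frag_t *head; } pending_list_t;

/* Non-blocking get: returns NULL when the free list is empty and never
 * calls progress, so it is safe to use inside a receive callback. */
static frag_t *free_list_get(free_list_t *fl)
{
    frag_t *f = fl->head;
    if (f != NULL) {
        fl->head = f->next;
    }
    return f;
}

/* Callback driven by the progress engine when an unexpected message
 * arrives.  The old pattern blocks here in a FREE_LIST_WAIT-style
 * helper, which itself calls progress, which can invoke this callback
 * again. */
static bool on_unexpected_message(free_list_t *fl, pending_list_t *deferred,
                                  frag_t *incoming)
{
    frag_t *copy = free_list_get(fl);
    if (copy == NULL) {
        /* No fragment available right now: remember the work and return;
         * a later, user-driven call into the library can retry it. */
        incoming->next = deferred->head;
        deferred->head = incoming;
        return false;
    }
    /* ... copy the unexpected message into 'copy' and queue it for
     * later matching ... */
    (void)copy;
    return true;
}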
Hi, and thanks for the feedback everyone.
George Bosilca wrote:
Brian is completely right. Here is a more detailed description of this
problem.
[]
On the other side, I hope that not many users write such applications.
This is the best way to completely kill the performance of any MPI
implementation
On Wed, Feb 27, 2008 at 10:01:06AM -0600, Brian W. Barrett wrote:
> The only solution to this problem is to suck it up and audit all the code
> to eliminate calls to opal_progress() in situations where infinite
> recursion can result. It's going to be long and painful, but there's no
> quick
Brian is completely right. Here is a more detailed description of this
problem.
Upon receiving a fragment from the BTL (lower layer) we try to match
it with an MPI request. If the match fails, then we get a fragment
from the free_list (via the blocking call to FREE_LIST_WAIT) and copy
the unexpected message into it.
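A toy model of the cycle described here, with made-up names rather than
Open MPI code: the receive path needs a fragment, the blocking wait drives
progress, progress delivers yet another unexpected message, and the stack
keeps growing.

#include <stdio.h>

static int depth = 0;

static void progress(void);            /* forward declaration */

/* Stand-in for a FREE_LIST_WAIT-style call: the free list is empty, so
 * "waiting" means driving progress until a fragment shows up. */
static void blocking_fragment_wait(void)
{
    progress();
}

/* Stand-in for the receive callback: the incoming fragment matched no
 * posted MPI request, so a fragment is needed to buffer it. */
static void recv_callback(void)
{
    blocking_fragment_wait();
}

static void progress(void)
{
    if (++depth > 10) {
        /* Stop the toy model early; the real code only stops when the
         * stack is exhausted. */
        printf("re-entered progress %d times\n", depth);
        return;
    }
    /* Another unexpected fragment arrives while the previous one is
     * still being handled, re-entering the same callback. */
    recv_callback();
}

int main(void)
{
    progress();
    return 0;
}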
Bummer; ok.
On Feb 27, 2008, at 11:01 AM, Brian W. Barrett wrote:
I played with this to fix some things in ORTE at one point, and it's
a very dangerous slope -- you're essentially guaranteeing you have a
deadlock case. Now instead of running off the stack, you'll
deadlock. The issue is that
I played with this to fix some things in ORTE at one point, and it's a
very dangerous slope -- you're essentially guaranteeing you have a
deadlock case. Now instead of running off the stack, you'll deadlock.
The issue is that we call opal_progress to wait for something to happen
deep in the bo
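For context, a sketch (again with made-up names) of the waiting pattern
described here: somewhere deep in the library, completion is awaited by
spinning on progress, so any change that turns progress into a no-op
leaves that loop spinning forever.

#include <stdbool.h>

static bool request_complete = false;

/* Stand-in for opal_progress(): polling the network and event queues is
 * the only thing that can ever set request_complete. */
static void progress_stub(void)
{
    /* ... poll BTLs and the event library ... */
}

/* If progress_stub() stops doing work (for example because of a
 * recursion cap), this loop never exits: the crash becomes a hang. */
static void wait_for_request(void)
{
    while (!request_complete) {
        progress_stub();
    }
}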
Gleb / George --
Is there an easy way for us to put a cap on max recursion down in
opal_progress? Just put in a counter in opal_progress() such that if
it exceeds some max value, return success without doing anything (if
opal_progress_event_flag indicates that nothing *needs* to be done)?
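A sketch of the counter being asked about here, hypothetical and heavily
simplified rather than the real opal_progress: refuse to do any work past
a fixed re-entry depth. It keeps the stack bounded, but as Brian points
out, anything that is spinning on progress to complete a request then
waits forever.

#define MAX_PROGRESS_DEPTH 32          /* illustrative value */

static int progress_depth = 0;

/* Stand-in for the event/network polling the real function performs. */
static void poll_events(void)
{
    /* ... */
}

static int capped_progress(void)
{
    if (progress_depth >= MAX_PROGRESS_DEPTH) {
        /* "Return success without doing anything": safe for the stack,
         * but whoever is waiting on this call sees no progress at all. */
        return 0;
    }
    ++progress_depth;
    poll_events();                     /* may re-enter capped_progress() */
    --progress_depth;
    return 0;
}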
Hi,
I ran into a bug when running a few microbenchmarks for OpenMPI. I had
thrown in Reduce and Gather for sanity checking, but OpenMPI crashed
when running those operations. Usually, this would happen when I reached
around 12-16 nodes.
My current crash-test code looks like this (I've removed
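The poster's test code is cut off above; the following is not that code,
just a minimal Reduce/Gather loop of the kind described (iteration count
is arbitrary), for anyone who wants to try reproducing the crash.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int iters = 1000;            /* arbitrary iteration count */
    int value = rank, sum = 0;
    int *gathered = malloc((size_t)size * sizeof(int));
    if (gathered == NULL) {
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    for (int i = 0; i < iters; i++) {
        MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        MPI_Gather(&value, 1, MPI_INT, gathered, 1, MPI_INT, 0,
                   MPI_COMM_WORLD);
    }

    if (rank == 0) {
        printf("done: sum over %d ranks = %d\n", size, sum);
    }

    free(gathered);
    MPI_Finalize();
    return 0;
}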