Re: [OMPI users] Program hangs in mpi_bcast

2011-12-09 Thread Alex A. Granovsky
> something else? Yes, this is with regard to the collective hang issue. All the best, Alex - Original Message - From: "Jeff Squyres" To: "Alex A. Granovsky" ; Sent: Saturday, December 03, 2011 3:36 PM Subject: Re: [OMPI users] Program hangs in mpi_bcast

Re: [OMPI users] Program hangs in mpi_bcast

2011-12-03 Thread Jeff Squyres
On Dec 2, 2011, at 8:50 AM, Alex A. Granovsky wrote: > I would like to start a discussion on the implementation of collective > operations within Open MPI. The reason for this is at least twofold. > In recent months, there has been a constantly growing number of messages on > the list sent by people facing p

Re: [OMPI users] Program hangs in mpi_bcast

2011-12-02 Thread Alex A. Granovsky
ose to the hardware limits does not make us happy at all. Kind regards, Alex Granovsky - Original Message - From: "Jeff Squyres" To: "Open MPI Users" Sent: Wednesday, November 30, 2011 11:45 PM Subject: Re: [OMPI users] Program hangs in mpi_bcast > Fair enough. T

Re: [OMPI users] Program hangs in mpi_bcast

2011-11-30 Thread Jeff Squyres
Fair enough. Thanks anyway! On Nov 30, 2011, at 3:39 PM, Tom Rosmond wrote: > Jeff, > > I'm afraid trying to produce a reproducer of this problem wouldn't be > worth the effort. It is a legacy code that I wasn't involved in > developing and will soon be discarded, so I can't justify spending t

Re: [OMPI users] Program hangs in mpi_bcast

2011-11-30 Thread Tom Rosmond
Jeff, I'm afraid trying to produce a reproducer of this problem wouldn't be worth the effort. It is a legacy code that I wasn't involved in developing and will soon be discarded, so I can't justify spending time trying to understand its behavior better. The bottom line is that it works correctly

Re: [OMPI users] Program hangs in mpi_bcast

2011-11-30 Thread Jeff Squyres
Yes, but I'd like to see a reproducer that requires setting the sync_barrier_before=5. Your reproducers allowed much higher values, IIRC. I'm curious to know what makes that code require such a low value (i.e., 5)... On Nov 30, 2011, at 1:50 PM, Ralph Castain wrote: > FWIW: we already have a

Re: [OMPI users] Program hangs in mpi_bcast

2011-11-30 Thread Ralph Castain
Oh - and another one at orte/test/mpi/reduce-hang.c On Nov 30, 2011, at 11:50 AM, Ralph Castain wrote: > FWIW: we already have a reproducer from prior work I did chasing this down a > couple of years ago. See orte/test/mpi/bcast_loop.c > > > On Nov 29, 2011, at 9:35 AM, Jeff Squyres wrote: >

Re: [OMPI users] Program hangs in mpi_bcast

2011-11-30 Thread Ralph Castain
FWIW: we already have a reproducer from prior work I did chasing this down a couple of years ago. See orte/test/mpi/bcast_loop.c On Nov 29, 2011, at 9:35 AM, Jeff Squyres wrote: > That's quite weird/surprising that you would need to set it down to *5* -- > that's really low. > > Can you share

Re: [OMPI users] Program hangs in mpi_bcast

2011-11-29 Thread Jeff Squyres
That's quite weird/surprising that you would need to set it down to *5* -- that's really low. Can you share a simple reproducer code, perchance? On Nov 15, 2011, at 11:49 AM, Tom Rosmond wrote: > Ralph, > > Thanks for the advice. I have to set 'coll_sync_barrier_before=5' to do > the job. T

Re: [OMPI users] Program hangs in mpi_bcast

2011-11-15 Thread Tom Rosmond
Ralph, Thanks for the advice. I have to set 'coll_sync_barrier_before=5' to do the job. This is a big change from the default value (1000), so our application seems to be a pretty extreme case. T. Rosmond On Mon, 2011-11-14 at 16:17 -0700, Ralph Castain wrote: > Yes, this is well documented -

Re: [OMPI users] Program hangs in mpi_bcast

2011-11-14 Thread Ralph Castain
Yes, this is well documented; it may be in the FAQ, and it has certainly come up on the user list multiple times. The problem is that one process falls behind, which causes it to begin accumulating "unexpected messages" in its queue. This causes the matching logic to run a little slower, thus making th
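
The mechanism described in this thread (one rank falls behind, unexpected messages pile up in its queue, and a periodic barrier such as the one enabled by 'coll_sync_barrier_before' lets it catch up) can be illustrated with a small toy model. The sketch below is not Open MPI code and does not call MPI at all; the function name max_queue_depth and its parameters are hypothetical, and the "one message per tick" timing is an assumption made purely for illustration. It counts how deep the slow rank's queue gets when the sender is forced to wait every barrier_every messages.

```c
/*
 * Toy model (hypothetical, not Open MPI internals):
 *  - the root sends one broadcast message per tick;
 *  - a slow rank consumes one queued message every recv_period ticks;
 *  - every barrier_every messages the root waits until the slow rank
 *    has drained its queue (barrier_every == 0 means "never wait").
 * Returns the largest number of unconsumed messages seen at any time.
 */
static int max_queue_depth(int total_msgs, int barrier_every, int recv_period)
{
    int queued = 0;     /* messages sent but not yet consumed */
    int max_depth = 0;  /* worst backlog observed so far */
    int tick = 0;       /* simulated clock */

    for (int sent = 1; sent <= total_msgs; ++sent) {
        /* The root sends one message during this tick. */
        ++queued;
        ++tick;
        if (queued > max_depth) {
            max_depth = queued;
        }

        /* The slow rank only gets to consume on every recv_period-th tick. */
        if (tick % recv_period == 0 && queued > 0) {
            --queued;
        }

        /* Simulated barrier: the root stalls until the backlog is gone. */
        if (barrier_every > 0 && sent % barrier_every == 0) {
            while (queued > 0) {
                ++tick;
                if (tick % recv_period == 0) {
                    --queued;
                }
            }
        }
    }
    return max_depth;
}
```

Calling max_queue_depth(1000, 0, 4) models a run with no synchronization, where the backlog keeps growing for as long as messages are sent. Calling max_queue_depth(1000, 5, 4) models a barrier every 5 messages, where the backlog can never exceed 5 because the queue is empty after each barrier. In this model a smaller interval trades throughput for a tighter bound on the queue, which is one way to read why a very low value can help an application with a badly lagging rank.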