Re: [OMPI users] spin-wait backoff

2010-09-04 Thread Ralph Castain
On Sep 3, 2010, at 5:10 PM, David Singleton wrote:
> On 09/03/2010 10:05 PM, Jeff Squyres wrote:
>> On Sep 3, 2010, at 12:16 AM, Ralph Castain wrote:
>>
>>> Backing off the polling rate requires more application-specific logic like
>>> that offered below, so it is a little difficult for us to i

Re: [OMPI users] spin-wait backoff

2010-09-03 Thread David Singleton
On 09/03/2010 10:05 PM, Jeff Squyres wrote: On Sep 3, 2010, at 12:16 AM, Ralph Castain wrote: Backing off the polling rate requires more application-specific logic like that offered below, so it is a little difficult for us to implement at the MPI library level. Not saying we eventually won't

Re: [OMPI users] spin-wait backoff

2010-09-03 Thread Jeff Squyres
On Sep 3, 2010, at 12:16 AM, Ralph Castain wrote:
> Backing off the polling rate requires more application-specific logic like
> that offered below, so it is a little difficult for us to implement at the
> MPI library level. Not saying we eventually won't - just not sure anyone
> quite knows ho

Re: [OMPI users] spin-wait backoff

2010-09-03 Thread Ralph Castain
In the upcoming 1.5 series, we will introduce a new "sensor" framework to help resolve such issues. Among other things, it will automatically track (if requested) the size of a sentinel file, cpu usage, and memory footprint and will terminate the job if any exceed user-specified limits (e.g., fi

Re: [OMPI users] spin-wait backoff

2010-09-02 Thread Douglas Guptill
Hi David:
On Fri, Sep 03, 2010 at 10:50:02AM +1000, David Singleton wrote:
>
> I'm sure this has been discussed before but having watched hundreds of
> thousands of cpuhrs being wasted by difficult-to-detect hung jobs, I'd
> be keen to know why there isn't some sort of "spin-wait backoff" option.

[OMPI users] spin-wait backoff

2010-09-02 Thread David Singleton
I'm sure this has been discussed before but having watched hundreds of thousands of cpuhrs being wasted by difficult-to-detect hung jobs, I'd be keen to know why there isn't some sort of "spin-wait backoff" option. For example, a way to specify spin-wait for x seconds/cycles/iterations then backo
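For readers skimming the thread, the "spin-wait for x iterations, then back off" idea can be sketched at the application level roughly as follows. This is only an illustration under stated assumptions, not anything Open MPI provides: the names `wait_with_backoff`, `poll_fn`, and `countdown` are hypothetical, and the completion check is a plain callback so the sketch compiles without MPI. In a real MPI program the callback might wrap a non-blocking test such as `MPI_Test`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical completion check. In an MPI application this might wrap
   MPI_Test on a pending request; here it is just a callback. */
typedef bool (*poll_fn)(void *ctx);

/* Poll `done` up to spin_iters times without sleeping. If it is still not
   complete, sleep between polls, doubling the sleep from min_ns up to a
   cap of max_ns. Returns the number of sleeps that were taken. */
static long wait_with_backoff(poll_fn done, void *ctx, long spin_iters,
                              long min_ns, long max_ns)
{
    assert(min_ns > 0 && max_ns >= min_ns);

    /* Phase 1: pure spin, for latency-sensitive completions. */
    for (long i = 0; i < spin_iters; i++) {
        if (done(ctx)) {
            return 0;
        }
    }

    /* Phase 2: back off, so a hung or long wait stops burning a core. */
    long sleeps = 0;
    long delay = min_ns;
    while (!done(ctx)) {
        struct timespec ts = { delay / 1000000000L, delay % 1000000000L };
        nanosleep(&ts, NULL);
        sleeps++;
        if (delay < max_ns) {
            delay *= 2;
            if (delay > max_ns) {
                delay = max_ns;
            }
        }
    }
    return sleeps;
}

/* Demo predicate (hypothetical): reports "not done" for *ctx polls,
   then "done" on every poll after that. */
static bool countdown(void *ctx)
{
    int *remaining = ctx;
    if (*remaining <= 0) {
        return true;
    }
    (*remaining)--;
    return false;
}
```

With the `countdown` predicate, a call such as `wait_with_backoff(countdown, &n, 10, 1000, 8000)` finishes inside the spin phase when `n` is small and only starts sleeping once the spin budget is exhausted. The spin count and sleep bounds are application-specific choices, which is the difficulty raised in the replies about doing this inside the MPI library.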