Re: [HACKERS] "pgstat wait timeout" just got a lot more common on Windows

Magnus Hagander Thu, 10 May 2012 08:27:58 -0700

On May 10, 2012 4:59 PM, "Tom Lane" <t...@sss.pgh.pa.us> wrote:
>
> I wrote:
> > Last night I changed the stats collector process to use
> > WaitLatchOrSocket instead of a periodic forced wakeup to see whether
> > the postmaster has died.  This morning I observe that several Windows
> > buildfarm members are showing regression test failures caused by
> > unexpected "pgstat wait timeout" warnings.  Everybody else is fine.
>
> > This suggests that there is something broken in the Windows
> > implementation of WaitLatchOrSocket.  I wonder whether it also
> > tells us something we did not know about the underlying cause of
> > those messages.  Not sure what though.  Ideas?  Can anyone who
> > knows Windows take another look at WaitLatchOrSocket?
>
> Anybody have any clues about that?  If not, I think I'll have to revert
> the pgstat changes for beta1, which isn't really forward progress.


Haven't had time to look at the code itself, and won't before wrap time.
Sorry.

> I spent some time staring at the Windows WaitLatchOrSocket code myself.
> The only thing I could find that seemed wrong is that in the event
> array, we list the latch's event before pgwin32_signal_event.  The
> Microsoft documentation I looked at says that if more than one event
> is ready, WaitForMultipleObjects reports the first such array member.
> This means that if the latch is already set when control gets here,
> signal handlers will not be serviced.

Yeah, that does seem wrong.

>  That doesn't match what would
> happen on a Unix machine, so it seems like at least a violation of the
> POLA.  Hence I think we oughta swap the order of those two array
> elements.  (Same issue in PGSemaphoreLock, btw, and I'm suspicious of
> pgwin32_select.)  I do not however

Maybe we need a loop that checks for all events?
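
A rough sketch of that idea, written as a portable model rather than
real Win32 code. The names first_signaled and all_signaled are made up
for illustration, and the bool array only stands in for the event
handles; none of this is taken from the actual port code.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a wait-any call such as
 * WaitForMultipleObjects(..., bWaitAll = FALSE): report only the
 * lowest array index whose event is signaled, or -1 if none is.
 */
static int
first_signaled(const bool *signaled, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        if (signaled[i])
            return (int) i;
    }
    return -1;
}

/*
 * The "loop that checks all events" variant: scan the whole array
 * and return a bitmask with bit i set for every signaled event, so
 * an event at a higher index is not hidden by one at a lower index.
 */
static unsigned
all_signaled(const bool *signaled, size_t n)
{
    unsigned mask = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (signaled[i])
            mask |= 1u << i;
    }
    return mask;
}
```

With the latch event at index 0 and the signal event at index 1, both
set, first_signaled() reports only index 0, while all_signaled() returns
a mask with both bits set. In real code the per-event check would have
to account for auto-reset events, which this model ignores.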

> see a way that that would explain the
> pgstat failures, because the stats collector's latch really shouldn't
> ever get set during normal regression test runs.

So could there be something wrong in the other end, meaning the latch
*does* get set?

/Magnus
