On May 10, 2012 4:59 PM, "Tom Lane" <t...@sss.pgh.pa.us> wrote:
>
> I wrote:
> > Last night I changed the stats collector process to use
> > WaitLatchOrSocket instead of a periodic forced wakeup to see whether
> > the postmaster has died. This morning I observe that several Windows
> > buildfarm members are showing regression test failures caused by
> > unexpected "pgstat wait timeout" warnings. Everybody else is fine.
> >
> > This suggests that there is something broken in the Windows
> > implementation of WaitLatchOrSocket. I wonder whether it also
> > tells us something we did not know about the underlying cause of
> > those messages. Not sure what though. Ideas? Can anyone who
> > knows Windows take another look at WaitLatchOrSocket?
>
> Anybody have any clues about that? If not, I think I'll have to revert
> the pgstat changes for beta1, which isn't really forward progress.

Haven't had time to look at the code itself, and won't before wrap time.
Sorry.

> I spent some time staring at the Windows WaitLatchOrSocket code myself.
> The only thing I could find that seemed wrong is that in the event
> array, we list the latch's event before pgwin32_signal_event. The
> Microsoft documentation I looked at says that if more than one event
> is ready, WaitForMultipleObjects reports the first such array member.
> This means that if the latch is already set when control gets here,
> signal handlers will not be serviced.

Yeah, that does seem wrong.

> That doesn't match what would
> happen on a Unix machine, so it seems like at least a violation of the
> POLA. Hence I think we oughta swap the order of those two array
> elements. (Same issue in PGSemaphoreLock, btw, and I'm suspicious of
> pgwin32_select.) I do not however

Maybe we need a loop that checks for all events?

> see a way that that would explain the
> pgstat failures, because the stats collector's latch really shouldn't
> ever get set during normal regression test runs.

So could there be something wrong in the other end, meaning the latch
*does* get set?

/Magnus
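
For anyone following along, the ordering point can be illustrated with a small portable C model. This is only a sketch under stated assumptions: it does not call the Windows API or any PostgreSQL code. It just models the documented rule that a wait on several events reports the lowest ready array index. The names EV_LATCH, EV_SIGNAL, wait_any, first_reported and collect_all are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two event slots discussed in the thread. */
enum { EV_LATCH, EV_SIGNAL, EV_COUNT };

/*
 * Model of the "lowest ready index wins" rule: given the order in which
 * events are listed in the wait array, return the array index that a
 * wait-for-any call would report, or -1 if nothing is signaled.
 */
static int wait_any(const int *order, const bool *signaled, size_t n)
{
    assert(order != NULL && signaled != NULL);
    for (size_t i = 0; i < n; i++)
        if (signaled[order[i]])
            return (int) i;
    return -1;
}

/* Which event kind does the waiter get told about for this ordering? */
static int first_reported(const int *order, const bool *signaled, size_t n)
{
    int idx = wait_any(order, signaled, n);
    return idx < 0 ? -1 : order[idx];
}

/*
 * The alternative raised in the reply: after waking up, check every slot
 * instead of trusting only the reported index. Returns a bitmask with one
 * bit per ready event kind.
 */
static unsigned collect_all(const int *order, const bool *signaled, size_t n)
{
    unsigned mask = 0;
    for (size_t i = 0; i < n; i++)
        if (signaled[order[i]])
            mask |= 1u << order[i];
    return mask;
}
```

With both events ready, listing the latch first means only the latch is reported and the signal event goes unnoticed on that pass. Listing the signal event first reverses that, and the check-everything loop sees both regardless of order.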