There is no firewall in place; more precisely, the Windows Firewall is 
configured to be off, and no other firewall is installed on the system.

To get to this point in the code, WSARecv() must have failed with 
WSAEWOULDBLOCK.  The socket is set for overlapped IO and is a datagram socket.  
MSDN documentation says that for an overlapped socket this means there are too 
many outstanding overlapped IO requests.  I don't know if "too many" applies to 
just this socket or to the system as a whole.  The documentation isn't clear 
about how to handle the return code in this situation.

We don't need to know if this is a Kernel issue, a bug in winsock, or an 
undocumented behaviour.  Regardless, it can be treated as a fault.

Knowing that it is possible for WaitForMultipleObjectsEx to lock up means that 
it is not safe to call with an INFINITE timeout.  The workaround that's being 
discussed is beginning to look like the one at line 172 of socket.c.  It's bad 
enough that there is a WSASend in pgwin32_waitforsinglesocket().  I doubt you 
also want to add a WSARecv.  There should be a cleaner way to handle both of 
these situations.
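
To make the idea concrete, here is a rough sketch of a receive loop that never 
waits forever.  It is not the code in socket.c.  The enums, callback types and 
the fake_* helpers are all made up for illustration; a real version would back 
do_wait with WaitForMultipleObjectsEx() using a finite timeout and do_recv 
with WSARecv().  The only point is that a lost wakeup costs one timeout 
interval instead of hanging the process.

```c
#include <assert.h>

/* Hypothetical stand-ins for the Win32 wait and WSARecv() outcomes. */
typedef enum { WAIT_SIGNALLED, WAIT_TIMED_OUT } wait_result;
typedef enum { RECV_OK, RECV_WOULD_BLOCK, RECV_FAILED } recv_result;

/* Callbacks so the loop can be exercised without winsock. */
typedef wait_result (*wait_fn) (void *ctx, int timeout_ms);
typedef recv_result (*recv_fn) (void *ctx);

/*
 * Retry the receive until it succeeds or fails hard.  Between attempts,
 * wait for at most timeout_ms.  Returns 0 on success, -1 on hard error.
 */
int
recv_with_bounded_wait(recv_fn do_recv, wait_fn do_wait,
                       void *ctx, int timeout_ms)
{
    assert(timeout_ms > 0);     /* an unbounded wait is the bug being avoided */

    for (;;)
    {
        recv_result r = do_recv(ctx);

        if (r == RECV_OK)
            return 0;
        if (r == RECV_FAILED)
            return -1;

        /*
         * Would block.  Whether the wait is signalled or times out, fall
         * through and retry the receive, so a missed event is recovered.
         */
        (void) do_wait(ctx, timeout_ms);
    }
}

/* Fake socket: the event is never signalled, data shows up after N tries. */
typedef struct
{
    int recv_calls;
    int wait_calls;
    int ready_after;
} fake_sock;

recv_result
fake_recv(void *ctx)
{
    fake_sock *s = (fake_sock *) ctx;

    s->recv_calls++;
    return (s->recv_calls > s->ready_after) ? RECV_OK : RECV_WOULD_BLOCK;
}

wait_result
fake_wait(void *ctx, int timeout_ms)
{
    fake_sock *s = (fake_sock *) ctx;

    (void) timeout_ms;
    s->wait_calls++;
    return WAIT_TIMED_OUT;      /* simulate the wakeup that never arrives */
}
```

With the fake socket set to become readable after three failed attempts, the 
loop still returns success even though the wait never reports a signal.  The 
trade-off is an extra receive attempt per timeout interval when the socket is 
idle.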

I am planning to eventually kill the stats collector and see if that clears up 
the hanging issue, but I want to keep the system state in place for a bit 
longer in case there are other diagnostic steps I should try.  I've 
exhausted everything I could think of.

-Luke


-----Original Message-----
From: Nikhil Sontakke [mailto:nikhil.sonta...@enterprisedb.com]
Sent: Monday, August 03, 2009 10:38 AM
To: Magnus Hagander
Cc: Alvaro Herrera; Luke Koops; pgsql-bugs@postgresql.org
Subject: Re: [BUGS] BUG #4958: Stats collector hung on WaitForMultipleObjectsEx 
while attempting to recv a datagram

Hi,

>>>>>
>>>>> Maybe. I'm unsure if it's enough to just try another
>>>>> WaitForSingleObjectEx() on it, or if we need to actually issue a
>>>>> WSARecv() on it as well. Maybe it would be enough to just change
>>>>> the INFINITE on line 318 of socket.c to a fixed value. That will
>>>>> loop down to WSARecv() which should exit with WSAEWOULDBLOCK which
>>>>> will cause us to do a short sleep and come back. But we'd have to
>>>>> change the limit of 5 somehow then, since in theory we should wait
>>>>> forever. Maybe that outer loop should just be a for(;;), what do
>>>>> you think?
>>>>>
>>>>
>>>> Yes, line 318 seems to be a much better location to me. If Windows
>>>> and its socket logic behaves properly most of the time :), most
>>>> of the calls should come out within the first few tries, so
>>>> changing 5 to an infinite loop shouldn't hurt those normal use
>>>> cases in theory.
>>>>
>>>> OTOH, I was wondering what if we kill the stats collector and on a
>>>> restart the socket communication resumes properly. Would that
>>>> conclusively mean that it is a flaw in our code?
>>>
>>> No, if we kill the stats collector that will destroy all sockets,
>>> and when the new one starts all the sockets it operates on are fresh
>>> and new. So it doesn't show that the flaw is in our code - but it
>>> also doesn't show that it's in the kernel or runtime libraries.
>>>
>>
>> AFAICS in the code, the inherited pgStatSock socket FD remains the
>> same across the restart of the stats collector process...
>
> Partially correct, I think.
>
> Each backend has its own handle on win32, since we use EXEC_BACKEND
> (this includes the "utility processes" like the stats collector). When
> we start the new one, we are going to use DuplicateHandle() in
> save_backend_variables(). This will therefore get it a new handle, but
> they are both pointing to the same kernel object. I don't know if
> WaitForMultipleObjectsEx() is going to see these as two different
> objects or not, but I think it does.
>

Hmm, got it. Nothing like adding more confusion into the mix :)

Regards,
Nikhils
--
http://www.enterprisedb.com
