On Sat, Apr 1, 2023 at 2:35 PM Andres Freund <and...@anarazel.de> wrote:
> I wonder if we ought to bite the bullet and replace the use of
> WaitForMultipleObjects() with RegisterWaitForSingleObject() and then use
> GetQueuedCompletionStatus() to wait. The fairness issue here is a motivation,
> but the bigger one is that that'd get us out from under the
> MAXIMUM_WAIT_OBJECTS (IIRC 64) limit. Afaict that'd also allow us to read
> multiple notifications at once, using GetQueuedCompletionStatusEx().

Interesting.  So a callback would run in a system-managed thread, and
that would post a custom message in an IOCP for us to read, kinda like
the fake waitpid() thing?  Seems a bit gross and context-switchy but I
agree that the 64 event limit is also terrible.
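
For illustration, here is roughly the shape that scheme could take.  This is
a hypothetical sketch only, not code from any patch: the helper names and
structure are made up.  Each handle gets its own registration, the
thread-pool callback forwards the notification to an IOCP, and the waiter
can then drain several notifications per call:

	#include <windows.h>

	static HANDLE wait_iocp = NULL;

	/* Thread-pool callback: forward the signaled handle's identity to the IOCP. */
	static VOID CALLBACK
	wait_callback(PVOID context, BOOLEAN timed_out)
	{
		if (!timed_out)
			PostQueuedCompletionStatus(wait_iocp, 0, (ULONG_PTR) context, NULL);
	}

	/* Register one event handle; no MAXIMUM_WAIT_OBJECTS cap applies here. */
	static void
	register_wait(HANDLE event, void *user_data, HANDLE *registration)
	{
		if (wait_iocp == NULL)
			wait_iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
		RegisterWaitForSingleObject(registration, event, wait_callback,
									user_data, INFINITE, WT_EXECUTEDEFAULT);
	}

	/* Drain up to nentries notifications in a single call. */
	static ULONG
	wait_for_notifications(OVERLAPPED_ENTRY *entries, ULONG nentries,
						   DWORD timeout_ms)
	{
		ULONG		nreturned = 0;

		if (!GetQueuedCompletionStatusEx(wait_iocp, entries, nentries,
										 &nreturned, timeout_ms, FALSE))
			return 0;			/* WAIT_TIMEOUT or error */
		return nreturned;
	}

That trades a thread-pool context switch per notification for freedom from
the MAXIMUM_WAIT_OBJECTS cap, which is the trade-off described above.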

> Medium term that'd also be a small step towards using readiness based APIs in
> a few places...

Yeah, that would be cool.

> > I think we could get the same effect as pgwin32_select() more cheaply
> > by doing the initial WaitForMultipleObjects() call with the caller's
> > timeout exactly as we do today, and then, while we have space,
> > repeatedly calling
> > WaitForMultipleObjects(handles=&events[last_signaled_event_index + 1],
> > timeout=0) until it reports timeout.
>
> That would make sense, and should indeed be reasonable cost-wise.

Cool.
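
To sketch the shape of that idea (a hypothetical standalone helper, not the
patch itself; the real change lives in WaitEventSetWaitBlock() in the
attached patch, and the names here are illustrative only):

	#include <windows.h>

	static int
	collect_ready_handles(HANDLE *handles, int nhandles, DWORD timeout_ms,
						  int *ready, int max_ready)
	{
		int			nready = 0;
		int			offset = 0;
		DWORD		rc;

		/* First call blocks for the caller's timeout, as before. */
		rc = WaitForMultipleObjects(nhandles, handles, FALSE, timeout_ms);
		while (rc >= WAIT_OBJECT_0 &&
			   rc < WAIT_OBJECT_0 + (DWORD) (nhandles - offset))
		{
			int			index = offset + (int) (rc - WAIT_OBJECT_0);

			ready[nready++] = index;

			/* Stop if the output buffer is full or no handles remain. */
			if (nready == max_ready || index + 1 == nhandles)
				break;

			/* Re-poll only the handles after the one that fired; don't block. */
			offset = index + 1;
			rc = WaitForMultipleObjects(nhandles - offset, handles + offset,
										FALSE, 0);
		}

		return nready;
	}

Later handles get a zero-timeout poll, so earlier sockets can't starve them
but an empty set costs almost nothing extra.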

> > I mention this now because I'm not sure whether to consider this an
> > 'open item' for 16, or merely an enhancement for 17.  I guess the
> > former, because someone might call that a new denial of service
> > vector.  On the other hand, if you fill up the listen queue for socket
> > 1 with enough vigour, you're also denying service to socket 1, so I
> > don't know if it's worth worrying about.  Opinions on that?
>
> I'm not sure either. It doesn't strike me as a particularly relevant
> bottleneck. And the old approach of doing more work for every single
> connection also made many connections worse, I think?

Alright, let's see if anyone else thinks this is worth fixing for 16.

> > diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
> > index f4123e7de7..cc7b572008 100644
> > --- a/src/backend/storage/ipc/latch.c
> > +++ b/src/backend/storage/ipc/latch.c
> > @@ -2025,6 +2025,8 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
> >        */
> >       cur_event = (WaitEvent *) &set->events[rc - WAIT_OBJECT_0 - 1];
> >
> > +loop:
> > +
>
> As far as I can tell, we'll never see WL_LATCH_SET or WL_POSTMASTER_DEATH. I
> think it'd look cleaner to move the body of if (cur_event->events & 
> WL_SOCKET_MASK)
> into a separate function that we then also can call further down.

We could see them, AFAICS, and I don't see much advantage in making
that assumption?  Shouldn't we just shove it in a loop, just like the
other platforms' implementations?  Done in this version, which is best
viewed with git show --ignore-all-space.

> Seems like we could use returned_events to get nevents in the way you want it,
> without adding even more ++/-- to each of the different events?

Yeah.  This time I use reported_events.  I also fixed a maths failure:
I'd forgotten to use rc - WAIT_OBJECT_0, suggesting that CI never
reached the code.
From 125deab4030d7cf8918df19fb67fc4c8bee27570 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 1 Apr 2023 12:36:12 +1300
Subject: [PATCH v2] Teach WaitEventSetWait to report multiple events on
 Windows.

The WaitEventSet implementation on Windows has always reported only one
event at a time, and always the "lowest" in its event array.  Since
commit 7389aad6 started using WaitEventSet to handle incoming socket
connections, this unfairness changes the accept priority when using
multiple server sockets.  If one of them has a non-empty listen queue
due to incoming connections, the other might never be serviced.  The
previous coding based on select() was fair in that way.

Fix, by teaching WaitEventSetWait() to poll for extra events.  No change
in behavior in the common case of callers with nevents=1 (for example
the main FEBE socket code), but for the postmaster's main loop we'll
drain all the events that can fit in the output buffer, which is
deliberately set large enough to handle the maximum possible number of
sockets.  This brings the Windows behavior in line with Unix.

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKG%2BA2dk29hr5zRP3HVJQ-_PncNJM6HVQ7aaYLXLRBZU-xw%40mail.gmail.com

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index f4123e7de7..334236aefc 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -1933,14 +1933,10 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 #elif defined(WAIT_USE_WIN32)
 
 /*
- * Wait using Windows' WaitForMultipleObjects().
+ * Wait using Windows' WaitForMultipleObjects().  Each call only "consumes" one
+ * event, so we keep calling until we've filled up our output buffer.
  *
- * Unfortunately this will only ever return a single readiness notification at
- * a time.  Note that while the official documentation for
- * WaitForMultipleObjects is ambiguous about multiple events being "consumed"
- * with a single bWaitAll = FALSE call,
- * https://blogs.msdn.microsoft.com/oldnewthing/20150409-00/?p=44273 confirms
- * that only one event is "consumed".
+ * https://blogs.msdn.microsoft.com/oldnewthing/20150409-00/?p=44273
  */
 static inline int
 WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
@@ -2025,106 +2021,145 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 	 */
 	cur_event = (WaitEvent *) &set->events[rc - WAIT_OBJECT_0 - 1];
 
-	occurred_events->pos = cur_event->pos;
-	occurred_events->user_data = cur_event->user_data;
-	occurred_events->events = 0;
-
-	if (cur_event->events == WL_LATCH_SET)
-	{
-		/*
-		 * We cannot use set->latch->event to reset the fired event if we
-		 * aren't waiting on this latch now.
-		 */
-		if (!ResetEvent(set->handles[cur_event->pos + 1]))
-			elog(ERROR, "ResetEvent failed: error code %lu", GetLastError());
-
-		if (set->latch && set->latch->is_set)
-		{
-			occurred_events->fd = PGINVALID_SOCKET;
-			occurred_events->events = WL_LATCH_SET;
-			occurred_events++;
-			returned_events++;
-		}
-	}
-	else if (cur_event->events == WL_POSTMASTER_DEATH)
-	{
-		/*
-		 * Postmaster apparently died.  Since the consequences of falsely
-		 * returning WL_POSTMASTER_DEATH could be pretty unpleasant, we take
-		 * the trouble to positively verify this with PostmasterIsAlive(),
-		 * even though there is no known reason to think that the event could
-		 * be falsely set on Windows.
-		 */
-		if (!PostmasterIsAliveInternal())
-		{
-			if (set->exit_on_postmaster_death)
-				proc_exit(1);
-			occurred_events->fd = PGINVALID_SOCKET;
-			occurred_events->events = WL_POSTMASTER_DEATH;
-			occurred_events++;
-			returned_events++;
-		}
-	}
-	else if (cur_event->events & WL_SOCKET_MASK)
+	for (;;)
 	{
-		WSANETWORKEVENTS resEvents;
-		HANDLE		handle = set->handles[cur_event->pos + 1];
-
-		Assert(cur_event->fd);
+		int			next_pos;
+		int			count;
 
-		occurred_events->fd = cur_event->fd;
+		occurred_events->pos = cur_event->pos;
+		occurred_events->user_data = cur_event->user_data;
+		occurred_events->events = 0;
 
-		ZeroMemory(&resEvents, sizeof(resEvents));
-		if (WSAEnumNetworkEvents(cur_event->fd, handle, &resEvents) != 0)
-			elog(ERROR, "failed to enumerate network events: error code %d",
-				 WSAGetLastError());
-		if ((cur_event->events & WL_SOCKET_READABLE) &&
-			(resEvents.lNetworkEvents & FD_READ))
+		if (cur_event->events == WL_LATCH_SET)
 		{
-			/* data available in socket */
-			occurred_events->events |= WL_SOCKET_READABLE;
-
-			/*------
-			 * WaitForMultipleObjects doesn't guarantee that a read event will
-			 * be returned if the latch is set at the same time.  Even if it
-			 * did, the caller might drop that event expecting it to reoccur
-			 * on next call.  So, we must force the event to be reset if this
-			 * WaitEventSet is used again in order to avoid an indefinite
-			 * hang.  Refer https://msdn.microsoft.com/en-us/library/windows/desktop/ms741576(v=vs.85).aspx
-			 * for the behavior of socket events.
-			 *------
+			/*
+			 * We cannot use set->latch->event to reset the fired event if we
+			 * aren't waiting on this latch now.
 			 */
-			cur_event->reset = true;
-		}
-		if ((cur_event->events & WL_SOCKET_WRITEABLE) &&
-			(resEvents.lNetworkEvents & FD_WRITE))
-		{
-			/* writeable */
-			occurred_events->events |= WL_SOCKET_WRITEABLE;
-		}
-		if ((cur_event->events & WL_SOCKET_CONNECTED) &&
-			(resEvents.lNetworkEvents & FD_CONNECT))
-		{
-			/* connected */
-			occurred_events->events |= WL_SOCKET_CONNECTED;
+			if (!ResetEvent(set->handles[cur_event->pos + 1]))
+				elog(ERROR, "ResetEvent failed: error code %lu", GetLastError());
+
+			if (set->latch && set->latch->is_set)
+			{
+				occurred_events->fd = PGINVALID_SOCKET;
+				occurred_events->events = WL_LATCH_SET;
+				occurred_events++;
+				returned_events++;
+			}
 		}
-		if ((cur_event->events & WL_SOCKET_ACCEPT) &&
-			(resEvents.lNetworkEvents & FD_ACCEPT))
+		else if (cur_event->events == WL_POSTMASTER_DEATH)
 		{
-			/* incoming connection could be accepted */
-			occurred_events->events |= WL_SOCKET_ACCEPT;
+			/*
+			 * Postmaster apparently died.  Since the consequences of falsely
+			 * returning WL_POSTMASTER_DEATH could be pretty unpleasant, we
+			 * take the trouble to positively verify this with
+			 * PostmasterIsAlive(), even though there is no known reason to
+			 * think that the event could be falsely set on Windows.
+			 */
+			if (!PostmasterIsAliveInternal())
+			{
+				if (set->exit_on_postmaster_death)
+					proc_exit(1);
+				occurred_events->fd = PGINVALID_SOCKET;
+				occurred_events->events = WL_POSTMASTER_DEATH;
+				occurred_events++;
+				returned_events++;
+			}
 		}
-		if (resEvents.lNetworkEvents & FD_CLOSE)
+		else if (cur_event->events & WL_SOCKET_MASK)
 		{
-			/* EOF/error, so signal all caller-requested socket flags */
-			occurred_events->events |= (cur_event->events & WL_SOCKET_MASK);
-		}
+			WSANETWORKEVENTS resEvents;
+			HANDLE		handle = set->handles[cur_event->pos + 1];
 
-		if (occurred_events->events != 0)
-		{
-			occurred_events++;
-			returned_events++;
+			Assert(cur_event->fd);
+
+			occurred_events->fd = cur_event->fd;
+
+			ZeroMemory(&resEvents, sizeof(resEvents));
+			if (WSAEnumNetworkEvents(cur_event->fd, handle, &resEvents) != 0)
+				elog(ERROR, "failed to enumerate network events: error code %d",
+					 WSAGetLastError());
+			if ((cur_event->events & WL_SOCKET_READABLE) &&
+				(resEvents.lNetworkEvents & FD_READ))
+			{
+				/* data available in socket */
+				occurred_events->events |= WL_SOCKET_READABLE;
+
+				/*------
+				 * WaitForMultipleObjects doesn't guarantee that a read event
+				 * will be returned if the latch is set at the same time.  Even
+				 * if it did, the caller might drop that event expecting it to
+				 * reoccur on next call.  So, we must force the event to be
+				 * reset if this WaitEventSet is used again in order to avoid
+				 * an indefinite hang.
+				 *
+				 * Refer
+				 * https://msdn.microsoft.com/en-us/library/windows/desktop/ms741576(v=vs.85).aspx
+				 * for the behavior of socket events.
+				 *------
+				 */
+				cur_event->reset = true;
+			}
+			if ((cur_event->events & WL_SOCKET_WRITEABLE) &&
+				(resEvents.lNetworkEvents & FD_WRITE))
+			{
+				/* writeable */
+				occurred_events->events |= WL_SOCKET_WRITEABLE;
+			}
+			if ((cur_event->events & WL_SOCKET_CONNECTED) &&
+				(resEvents.lNetworkEvents & FD_CONNECT))
+			{
+				/* connected */
+				occurred_events->events |= WL_SOCKET_CONNECTED;
+			}
+			if ((cur_event->events & WL_SOCKET_ACCEPT) &&
+				(resEvents.lNetworkEvents & FD_ACCEPT))
+			{
+				/* incoming connection could be accepted */
+				occurred_events->events |= WL_SOCKET_ACCEPT;
+			}
+			if (resEvents.lNetworkEvents & FD_CLOSE)
+			{
+				/* EOF/error, so signal all caller-requested socket flags */
+				occurred_events->events |= (cur_event->events & WL_SOCKET_MASK);
+			}
+
+			if (occurred_events->events != 0)
+			{
+				occurred_events++;
+				returned_events++;
+			}
 		}
+
+		/* Is the output buffer full? */
+		if (returned_events == nevents)
+			break;
+
+		/* Have we run out of possible events? */
+		next_pos = cur_event->pos + 1;
+		if (next_pos == set->nevents)
+			break;
+
+		/*
+		 * Poll the rest of the event handles in the array starting at
+		 * next_pos being careful to skip over the initial signal handle too.
+		 * This time we use a zero timeout.
+		 */
+		count = set->nevents - next_pos;
+		rc = WaitForMultipleObjects(count,
+									set->handles + 1 + next_pos,
+									false,
+									0);
+
+		/*
+		 * We don't distinguish between errors and WAIT_TIMEOUT here because
+		 * we already have events to report.
+		 */
+		if (rc < WAIT_OBJECT_0 || rc >= WAIT_OBJECT_0 + count)
+			break;
+
+		/* We have another event to decode. */
+		cur_event = &set->events[next_pos + (rc - WAIT_OBJECT_0)];
 	}
 
 	return returned_events;
-- 
2.40.0
