Microsecond-based timeouts

Thomas Munro Sun, 12 Mar 2023 22:23:56 -0700

Hi,

Over in [1], I thought for a moment that a new function
WaitLatchUs(..., timeout_us, ...) was going to be useful to fix that
bug report, at least in master, until I realised that the required
Linux syscall, epoll_pwait2(), is a little too new (for example, RHEL 9
only shipped in May '22, and Debian 12 is expected to be declared
"stable" in a few months).  So I'm kicking this proof-of-concept over
into a new thread to talk about in the next cycle, in case it turns out
to be useful later.


There probably isn't too much call for very high resolution sleeping.
Most time-based sleeping is probably bad, but when it's legitimately
used to spread CPU or I/O out (instead of illegitimate use for
polling-based algorithms), it seems nice to be able to use all the
accuracy your hardware can provide, and yet it is still important to
be able to process other kinds of events, so WaitLatchUs() seems like
a better building block than pg_usleep().
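
To illustrate the difference between a plain sleep and a wait that can
still react to other events, here is a minimal standalone sketch.  It
is not PostgreSQL code: interruptible_nap() is a hypothetical helper
that uses plain poll() on a pipe as a stand-in for a latch.  Note that
poll() only accepts milliseconds, which is the limitation discussed in
this thread.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): nap for up to timeout_ms,
 * but return early if data becomes readable on wake_fd.
 *
 * Returns 1 if woken by wake_fd, 0 on timeout, -1 on error.
 */
static int
interruptible_nap(int wake_fd, int timeout_ms)
{
	struct pollfd pfd;
	int			rc;

	assert(timeout_ms >= 0);

	pfd.fd = wake_fd;
	pfd.events = POLLIN;
	pfd.revents = 0;

	/* Unlike usleep(), this returns as soon as wake_fd is readable. */
	rc = poll(&pfd, 1, timeout_ms);
	if (rc < 0)
		return -1;				/* e.g. EINTR; the caller decides what to do */
	return rc > 0 ? 1 : 0;
}
```

A caller would create the pipe once, keep the read end for napping and
let some other party write a byte to the write end to cut the nap short.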

One question is whether it'd be better to use nanoseconds instead,
since the relevant high-resolution primitives use those under the
covers (struct timespec).  On the other hand, microseconds are a good
match for our TimestampTz which is the ultimate source of many of our
timeout decisions.  I suppose we could also consider an interface with
an absolute timeout instead, and then stop thinking about the units so
much.
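
As a rough sketch of what the unit question amounts to, the following
standalone C shows the microsecond-to-timespec conversion (the same
arithmetic the patch uses for kevent() and epoll_pwait2()) and how an
absolute deadline could be turned into a remaining time.  The helper
names us_to_timespec() and remaining_us() are made up for this example,
and remaining_us() does not guard against overflow the way the patch's
TimestampDifferenceMicroseconds() does.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Convert a non-negative microsecond timeout to a struct timespec. */
static struct timespec
us_to_timespec(int64_t timeout_us)
{
	struct timespec ts;

	assert(timeout_us >= 0);
	ts.tv_sec = (time_t) (timeout_us / 1000000);
	ts.tv_nsec = (long) (timeout_us % 1000000) * 1000;
	return ts;
}

/*
 * With an absolute deadline, the caller only computes "time left" just
 * before sleeping.  Both arguments are microsecond timestamps on the same
 * clock; a deadline in the past yields zero.
 */
static int64_t
remaining_us(int64_t deadline_us, int64_t now_us)
{
	return deadline_us > now_us ? deadline_us - now_us : 0;
}
```

For example, a 1500250us timeout becomes tv_sec = 1 and
tv_nsec = 500250000.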

As mentioned in that other thread, the only systems that currently
seem to be able to sleep for less than 1ms through these multiplexing
APIs are Linux 5.11+ (epoll_pwait2()), FreeBSD (kevent()) and macOS
(also kevent()).  Everything else will round up to milliseconds, either
at the kernel interface (because poll(), epoll_wait() and
WaitForMultipleObjects() take those) or later inside the kernel due to
kernel tick rounding.  There might be ways to do better on Windows with
separate timer events, but I don't know.
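
The rounding at the kernel interface can be sketched as below.  This is
a standalone illustration in the spirit of the poll() and Windows paths
in the attached patch, not a copy of them: us_to_ms_round_up() is a
hypothetical name, and it clamps before adding so that the addition
cannot overflow.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Convert a microsecond timeout to the int milliseconds that poll() or
 * epoll_wait() expect.  Negative means "wait forever" and is passed
 * through as -1; positive values are rounded up so that we never sleep
 * for less time than requested, and capped at INT_MAX.
 */
static int
us_to_ms_round_up(int64_t timeout_us)
{
	int64_t		ms;

	if (timeout_us < 0)
		return -1;
	if (timeout_us > (int64_t) INT_MAX * 1000)
		return INT_MAX;

	ms = (timeout_us + 999) / 1000;
	assert(ms <= INT_MAX);
	return (int) ms;
}
```

So a request for 1us or 1000us both become 1ms, and 1001us becomes 2ms.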

[1] https://www.postgresql.org/message-id/flat/CAAKRu_b-q0hXCBUCAATh0Z4Zi6UkiC0k2DFgoD3nC-r3SkR3tg%40mail.gmail.com

From e99b7d31831f31888a9433a83d3e64ccbe2cc5c7 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Fri, 10 Mar 2023 15:16:47 +1300
Subject: [PATCH 1/3] Support microsecond based timeouts in WaitEventSet API.

WaitLatch() can only wait for whole numbers of milliseconds, a
limitation inherited ultimately from poll() and similar interfaces.  In
the past it didn't matter much as sleep times were very inaccurate in
practice on common systems, but Linux and others can now be accurate
down to small fractions of a millisecond.  In order to be able to
replace pg_usleep() calls, provide WaitLatchUs().  Just like
pg_usleep(), the actual resolution of the sleeping depends on the OS and
hardware.

For Linux, this requires epoll_pwait2() (Linux 5.11), otherwise we have
to round to milliseconds for epoll_wait().  For macOS and *BSD, kevent()
has always supported nanosecond-based timeouts, but only macOS and
FreeBSD are known to support high resolution timers (other BSDs tested
currently round up to kernel ticks so WaitLatch() already couldn't sleep
for only 1ms).  For Solaris and AIX, we currently use poll() and that
requires rounding up to milliseconds, so no improvement over WaitLatch()
there.  Likewise for Windows (which already couldn't sleep for only 1ms
due to internal rounding to tick size).

Discussion: https://postgr.es/m/CAAKRu_b-q0hXCBUCAATh0Z4Zi6UkiC0k2DFgoD3nC-r3SkR3tg%40mail.gmail.com
---
 configure                       |   2 +-
 configure.ac                    |   1 +
 meson.build                     |   1 +
 src/backend/storage/ipc/latch.c | 146 ++++++++++++++++++++++++--------
 src/include/pg_config.h.in      |   3 +
 src/include/storage/latch.h     |  13 ++-
 src/tools/msvc/Solution.pm      |   1 +
 7 files changed, 128 insertions(+), 39 deletions(-)

diff --git a/configure b/configure
index e35769ea73..914361f91b 100755
--- a/configure
+++ b/configure
@@ -15699,7 +15699,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in backtrace_symbols copyfile getifaddrs getpeerucred inet_pton kqueue mbstowcs_l memset_s posix_fallocate ppoll pthread_is_threaded_np setproctitle setproctitle_fast strchrnul strsignal syncfs sync_file_range uselocale wcstombs_l
+for ac_func in backtrace_symbols copyfile epoll_pwait2 getifaddrs getpeerucred inet_pton kqueue mbstowcs_l memset_s posix_fallocate ppoll pthread_is_threaded_np setproctitle setproctitle_fast strchrnul strsignal syncfs sync_file_range uselocale wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.ac b/configure.ac
index af23c15cb2..4249f8002c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1794,6 +1794,7 @@ LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 AC_CHECK_FUNCS(m4_normalize([
 	backtrace_symbols
 	copyfile
+	epoll_pwait2
 	getifaddrs
 	getpeerucred
 	inet_pton
diff --git a/meson.build b/meson.build
index d4384f1bf6..fe9b0470aa 100644
--- a/meson.build
+++ b/meson.build
@@ -2344,6 +2344,7 @@ func_checks = [
   # when enabling asan the dlopen check doesn't notice that -ldl is actually
   # required. Just checking for dlsym() ought to suffice.
   ['dlsym', {'dependencies': [dl_dep], 'define': false}],
+  ['epoll_pwait2'],
   ['explicit_bzero'],
   ['fdatasync', {'dependencies': [rt_dep, posix4_dep], 'define': false}], # Solaris
   ['getifaddrs'],
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index f4123e7de7..ba9ccb19ac 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -194,7 +194,7 @@ static void WaitEventAdjustPoll(WaitEventSet *set, WaitEvent *event);
 static void WaitEventAdjustWin32(WaitEventSet *set, WaitEvent *event);
 #endif
 
-static inline int WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+static inline int WaitEventSetWaitBlock(WaitEventSet *set, int64 cur_timeout_us,
 										WaitEvent *occurred_events, int nevents);
 
 /*
@@ -475,10 +475,9 @@ DisownLatch(Latch *latch)
  * to wait for. If the latch is already set (and WL_LATCH_SET is given), the
  * function returns immediately.
  *
- * The "timeout" is given in milliseconds. It must be >= 0 if WL_TIMEOUT flag
- * is given.  Although it is declared as "long", we don't actually support
- * timeouts longer than INT_MAX milliseconds.  Note that some extra overhead
- * is incurred when WL_TIMEOUT is given, so avoid using a timeout if possible.
+ * The "timeout" is given in microseconds.  It must be >= 0 if WL_TIMEOUT flag
+ * is given.  Note that some extra overhead is incurred when WL_TIMEOUT is
+ * given, so avoid using a timeout if possible.
  *
  * The latch must be owned by the current process, ie. it must be a
  * process-local latch initialized with InitLatch, or a shared latch
@@ -489,8 +488,8 @@ DisownLatch(Latch *latch)
  * we return all of them in one call, but we will return at least one.
  */
 int
-WaitLatch(Latch *latch, int wakeEvents, long timeout,
-		  uint32 wait_event_info)
+WaitLatchUs(Latch *latch, int wakeEvents, int64 timeout_us,
+			uint32 wait_event_info)
 {
 	WaitEvent	event;
 
@@ -510,15 +509,32 @@ WaitLatch(Latch *latch, int wakeEvents, long timeout,
 	LatchWaitSet->exit_on_postmaster_death =
 		((wakeEvents & WL_EXIT_ON_PM_DEATH) != 0);
 
-	if (WaitEventSetWait(LatchWaitSet,
-						 (wakeEvents & WL_TIMEOUT) ? timeout : -1,
-						 &event, 1,
-						 wait_event_info) == 0)
+	if (WaitEventSetWaitUs(LatchWaitSet,
+						   (wakeEvents & WL_TIMEOUT) ? timeout_us : -1,
+						   &event, 1,
+						   wait_event_info) == 0)
 		return WL_TIMEOUT;
 	else
 		return event.events;
 }
 
+/*
+ * Like WaitLatchUs(), but with the timeout in milliseconds.
+ *
+ * The "timeout" is given in milliseconds. It must be >= 0 if WL_TIMEOUT flag
+ * is given.  Although it is declared as "long", we don't actually support
+ * timeouts longer than INT_MAX milliseconds.  Note that some extra overhead
+ * is incurred when WL_TIMEOUT is given, so avoid using a timeout if possible.
+ */
+int
+WaitLatch(Latch *latch, int wakeEvents, long timeout_ms,
+		  uint32 wait_event_info)
+{
+	return WaitLatchUs(latch, wakeEvents,
+					   timeout_ms <= 0 ? timeout_ms : (int64) timeout_ms * 1000,
+					   wait_event_info);
+}
+
 /*
  * Like WaitLatch, but with an extra socket argument for WL_SOCKET_*
  * conditions.
@@ -537,8 +553,8 @@ WaitLatch(Latch *latch, int wakeEvents, long timeout,
  * WaitEventSet instead; that's more efficient.
  */
 int
-WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
-				  long timeout, uint32 wait_event_info)
+WaitLatchOrSocketUs(Latch *latch, int wakeEvents, pgsocket sock,
+					int64 timeout_us, uint32 wait_event_info)
 {
 	int			ret = 0;
 	int			rc;
@@ -546,9 +562,9 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
 	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
 
 	if (wakeEvents & WL_TIMEOUT)
-		Assert(timeout >= 0);
+		Assert(timeout_us >= 0);
 	else
-		timeout = -1;
+		timeout_us = -1;
 
 	if (wakeEvents & WL_LATCH_SET)
 		AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET,
@@ -575,7 +591,7 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
 		AddWaitEventToSet(set, ev, sock, NULL, NULL);
 	}
 
-	rc = WaitEventSetWait(set, timeout, &event, 1, wait_event_info);
+	rc = WaitEventSetWaitUs(set, timeout_us, &event, 1, wait_event_info);
 
 	if (rc == 0)
 		ret |= WL_TIMEOUT;
@@ -591,6 +607,20 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
 	return ret;
 }
 
+/*
+ * Like WaitLatchOrSocket, but with timeout in milliseconds.
+ */
+int
+WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
+				  long timeout_ms, uint32 wait_event_info)
+{
+	return WaitLatchOrSocketUs(latch,
+							   wakeEvents,
+							   sock,
+							   timeout_ms > 0 ? (int64) timeout_ms * 1000 : timeout_ms,
+							   wait_event_info);
+}
+
 /*
  * Sets a latch and wakes up anyone waiting on it.
  *
@@ -1380,14 +1410,14 @@ WaitEventAdjustWin32(WaitEventSet *set, WaitEvent *event)
  * values associated with the registered event.
  */
 int
-WaitEventSetWait(WaitEventSet *set, long timeout,
-				 WaitEvent *occurred_events, int nevents,
-				 uint32 wait_event_info)
+WaitEventSetWaitUs(WaitEventSet *set, int64 timeout_us,
+				   WaitEvent *occurred_events, int nevents,
+				   uint32 wait_event_info)
 {
 	int			returned_events = 0;
 	instr_time	start_time;
 	instr_time	cur_time;
-	long		cur_timeout = -1;
+	int64		cur_timeout = -1;
 
 	Assert(nevents > 0);
 
@@ -1395,11 +1425,11 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 	 * Initialize timeout if requested.  We must record the current time so
 	 * that we can determine the remaining timeout if interrupted.
 	 */
-	if (timeout >= 0)
+	if (timeout_us >= 0)
 	{
 		INSTR_TIME_SET_CURRENT(start_time);
-		Assert(timeout >= 0 && timeout <= INT_MAX);
-		cur_timeout = timeout;
+		Assert(timeout_us >= 0);
+		cur_timeout = timeout_us;
 	}
 	else
 		INSTR_TIME_SET_ZERO(start_time);
@@ -1487,11 +1517,11 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 			returned_events = rc;
 
 		/* If we're not done, update cur_timeout for next iteration */
-		if (returned_events == 0 && timeout >= 0)
+		if (returned_events == 0 && timeout_us >= 0)
 		{
 			INSTR_TIME_SET_CURRENT(cur_time);
 			INSTR_TIME_SUBTRACT(cur_time, start_time);
-			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+			cur_timeout = timeout_us - INSTR_TIME_GET_MICROSEC(cur_time);
 			if (cur_timeout <= 0)
 				break;
 		}
@@ -1505,6 +1535,20 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 	return returned_events;
 }
 
+/*
+ * Like WaitEventSetWaitUs(), but the timeout specified in milliseconds.
+ */
+int
+WaitEventSetWait(WaitEventSet *set, long timeout_ms,
+				 WaitEvent *occurred_events, int nevents,
+				 uint32 wait_event_info)
+{
+	return WaitEventSetWaitUs(set,
+							  timeout_ms <= 0 ? timeout_ms : (int64) timeout_ms * 1000,
+							  occurred_events,
+							  nevents,
+							  wait_event_info);
+}
 
 #if defined(WAIT_USE_EPOLL)
 
@@ -1517,17 +1561,31 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
  * easy.
  */
 static inline int
-WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+WaitEventSetWaitBlock(WaitEventSet *set, int64 cur_timeout_us,
 					  WaitEvent *occurred_events, int nevents)
 {
 	int			returned_events = 0;
 	int			rc;
 	WaitEvent  *cur_event;
 	struct epoll_event *cur_epoll_event;
+#ifdef HAVE_EPOLL_PWAIT2
+	struct timespec nap;
+#endif
 
 	/* Sleep */
+#ifdef HAVE_EPOLL_PWAIT2
+	nap.tv_sec = cur_timeout_us / 1000000;
+	nap.tv_nsec = (cur_timeout_us % 1000000) * 1000;
+	rc = epoll_pwait2(set->epoll_fd, set->epoll_ret_events,
+					  Min(nevents, set->nevents_space),
+					  cur_timeout_us >= 0 ? &nap : NULL,
+					  NULL);
+#else
 	rc = epoll_wait(set->epoll_fd, set->epoll_ret_events,
-					Min(nevents, set->nevents_space), cur_timeout);
+					Min(nevents, set->nevents_space),
+					cur_timeout_us >= 0 ? (cur_timeout_us + 999) / 1000
+					: cur_timeout_us);
+#endif
 
 	/* Check return code */
 	if (rc < 0)
@@ -1653,7 +1711,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
  * with separate system calls.
  */
 static int
-WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+WaitEventSetWaitBlock(WaitEventSet *set, int64 cur_timeout_us,
 					  WaitEvent *occurred_events, int nevents)
 {
 	int			returned_events = 0;
@@ -1663,12 +1721,12 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 	struct timespec timeout;
 	struct timespec *timeout_p;
 
-	if (cur_timeout < 0)
+	if (cur_timeout_us < 0)
 		timeout_p = NULL;
 	else
 	{
-		timeout.tv_sec = cur_timeout / 1000;
-		timeout.tv_nsec = (cur_timeout % 1000) * 1000000;
+		timeout.tv_sec = cur_timeout_us / 1000000;
+		timeout.tv_nsec = (cur_timeout_us % 1000000) * 1000;
 		timeout_p = &timeout;
 	}
 
@@ -1806,16 +1864,25 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
  * but requires iterating through all of set->pollfds.
  */
 static inline int
-WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+WaitEventSetWaitBlock(WaitEventSet *set, int64 cur_timeout_us,
 					  WaitEvent *occurred_events, int nevents)
 {
 	int			returned_events = 0;
 	int			rc;
 	WaitEvent  *cur_event;
 	struct pollfd *cur_pollfd;
+	int			cur_timeout_ms;
+
+	/* Round up to the nearest millisecond, and cap at INT_MAX. */
+	if (cur_timeout_us >= PG_INT64_MAX - 999)
+		cur_timeout_ms = INT_MAX;
+	else if (cur_timeout_us > 0)
+		cur_timeout_ms = Min((int64) INT_MAX, (cur_timeout_us + 999) / 1000);
+	else
+		cur_timeout_ms = cur_timeout_us;
 
 	/* Sleep */
-	rc = poll(set->pollfds, set->nevents, (int) cur_timeout);
+	rc = poll(set->pollfds, set->nevents, cur_timeout_ms);
 
 	/* Check return code */
 	if (rc < 0)
@@ -1943,12 +2010,21 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
  * that only one event is "consumed".
  */
 static inline int
-WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+WaitEventSetWaitBlock(WaitEventSet *set, int64 cur_timeout_us,
 					  WaitEvent *occurred_events, int nevents)
 {
 	int			returned_events = 0;
 	DWORD		rc;
 	WaitEvent  *cur_event;
+	int			cur_timeout_ms;
+
+	/* Round up to the nearest millisecond, and cap at INT_MAX. */
+	if (cur_timeout_us >= PG_INT64_MAX - 999)
+		cur_timeout_ms = INT_MAX;
+	else if (cur_timeout_us > 0)
+		cur_timeout_ms = Min((int64) INT_MAX, (cur_timeout_us + 999) / 1000);
+	else
+		cur_timeout_ms = cur_timeout_us;
 
 	/* Reset any wait events that need it */
 	for (cur_event = set->events;
@@ -2000,7 +2076,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 	 * Need to wait for ->nevents + 1, because signal handle is in [0].
 	 */
 	rc = WaitForMultipleObjects(set->nevents + 1, set->handles, FALSE,
-								cur_timeout);
+								cur_timeout_ms);
 
 	/* Check return code */
 	if (rc == WAIT_FAILED)
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 20c82f5979..c1f1fc6e70 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -149,6 +149,9 @@
 /* Define to 1 if you have the <editline/readline.h> header file. */
 #undef HAVE_EDITLINE_READLINE_H
 
+/* Define to 1 if you have the `epoll_pwait2' function. */
+#undef HAVE_EPOLL_PWAIT2
+
 /* Define to 1 if you have the <execinfo.h> header file. */
 #undef HAVE_EXECINFO_H
 
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 99cc47874a..756c3114ed 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -180,13 +180,20 @@ extern int	AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
 							  Latch *latch, void *user_data);
 extern void ModifyWaitEvent(WaitEventSet *set, int pos, uint32 events, Latch *latch);
 
-extern int	WaitEventSetWait(WaitEventSet *set, long timeout,
+extern int	WaitEventSetWait(WaitEventSet *set, long timeout_ms,
 							 WaitEvent *occurred_events, int nevents,
 							 uint32 wait_event_info);
-extern int	WaitLatch(Latch *latch, int wakeEvents, long timeout,
+extern int	WaitEventSetWaitUs(WaitEventSet *set, int64 timeout_us,
+							   WaitEvent *occurred_events, int nevents,
+							   uint32 wait_event_info);
+extern int	WaitLatch(Latch *latch, int wakeEvents, long timeout_ms,
 					  uint32 wait_event_info);
+extern int	WaitLatchUs(Latch *latch, int wakeEvents, int64 timeout_us,
+						uint32 wait_event_info);
 extern int	WaitLatchOrSocket(Latch *latch, int wakeEvents,
-							  pgsocket sock, long timeout, uint32 wait_event_info);
+							  pgsocket sock, long timeout_ms, uint32 wait_event_info);
+extern int	WaitLatchOrSocketUs(Latch *latch, int wakeEvents,
+								pgsocket sock, int64 timeout_us, uint32 wait_event_info);
 extern void InitializeLatchWaitSet(void);
 extern int	GetNumRegisteredWaitEvents(WaitEventSet *set);
 extern bool WaitEventSetCanReportClosed(void);
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index 5eaea6355e..f88fffa5e2 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -247,6 +247,7 @@ sub GenerateFiles
 		HAVE_DECL_STRNLEN                           => 1,
 		HAVE_EDITLINE_HISTORY_H                     => undef,
 		HAVE_EDITLINE_READLINE_H                    => undef,
+		HAVE_EPOLL_PWAIT2                           => undef,
 		HAVE_EXECINFO_H                             => undef,
 		HAVE_EXPLICIT_BZERO                         => undef,
 		HAVE_FSEEKO                                 => 1,
-- 
2.39.2

From 9b5e8922b7c603f902fecfb30e24a962b6c08176 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Fri, 10 Mar 2023 16:22:59 +1300
Subject: [PATCH 2/3] Use microsecond-based naps for vacuum_cost_delay sleep.

Now that we have microsecond support in the WaitEventSet API, we can use
the standard programming pattern to implement the high resolution sleep
in vacuum_delay_point().

XXX We wouldn't be able to do this until Linux 5.11 is in common stable
distributions, otherwise the sleep would lose precision when changing
from the pg_usleep() coding.

Reported-by: Melanie Plageman <melanieplage...@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_b-q0hXCBUCAATh0Z4Zi6UkiC0k2DFgoD3nC-r3SkR3tg%40mail.gmail.com
---
 src/backend/commands/vacuum.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 2e12baf8eb..f379e60dca 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2232,10 +2232,10 @@ vacuum_delay_point(void)
 		if (msec > VacuumCostDelay * 4)
 			msec = VacuumCostDelay * 4;
 
-		(void) WaitLatch(MyLatch,
-						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-						 msec,
-						 WAIT_EVENT_VACUUM_DELAY);
+		(void) WaitLatchUs(MyLatch,
+						   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+						   msec * 1000,
+						   WAIT_EVENT_VACUUM_DELAY);
 		ResetLatch(MyLatch);
 
 		VacuumCostBalance = 0;
-- 
2.39.2

From dd40a3e3c28c69466bc2e8c2a223608ac51e05b7 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Fri, 10 Mar 2023 16:19:42 +1300
Subject: [PATCH 3/3] Use microsecond-based naps in walreceiver.

Since anything based on timestamp differences is really in microseconds
under the covers, we might as well use the new higher resolution API for
waiting.

XXX For illustration; there would be many other places that could change
like this
---
 src/backend/replication/walreceiver.c | 16 ++++++++--------
 src/backend/utils/adt/timestamp.c     | 20 ++++++++++++++++++++
 src/include/utils/timestamp.h         |  2 ++
 3 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index f6446da2d6..18c66c0c63 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -445,7 +445,7 @@ WalReceiverMain(void)
 				pgsocket	wait_fd = PGINVALID_SOCKET;
 				int			rc;
 				TimestampTz nextWakeup;
-				long		nap;
+				int64		nap;
 
 				/*
 				 * Exit walreceiver if we're not in recovery. This should not
@@ -530,7 +530,7 @@ WalReceiverMain(void)
 
 				/* Calculate the nap time, clamping as necessary. */
 				now = GetCurrentTimestamp();
-				nap = TimestampDifferenceMilliseconds(now, nextWakeup);
+				nap = TimestampDifferenceMicroseconds(now, nextWakeup);
 
 				/*
 				 * Ideally we would reuse a WaitEventSet object repeatedly
@@ -544,12 +544,12 @@ WalReceiverMain(void)
 				 * avoiding some system calls.
 				 */
 				Assert(wait_fd != PGINVALID_SOCKET);
-				rc = WaitLatchOrSocket(MyLatch,
-									   WL_EXIT_ON_PM_DEATH | WL_SOCKET_READABLE |
-									   WL_TIMEOUT | WL_LATCH_SET,
-									   wait_fd,
-									   nap,
-									   WAIT_EVENT_WAL_RECEIVER_MAIN);
+				rc = WaitLatchOrSocketUs(MyLatch,
+										 WL_EXIT_ON_PM_DEATH | WL_SOCKET_READABLE |
+										 WL_TIMEOUT | WL_LATCH_SET,
+										 wait_fd,
+										 nap,
+										 WAIT_EVENT_WAL_RECEIVER_MAIN);
 				if (rc & WL_LATCH_SET)
 				{
 					ResetLatch(MyLatch);
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index de93db89d4..52f6568397 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -1719,6 +1719,26 @@ TimestampDifferenceMilliseconds(TimestampTz start_time, TimestampTz stop_time)
 		return (long) ((diff + 999) / 1000);
 }
 
+/*
+ * TimestampDifferenceMicroseconds -- convert the difference between two
+ * 		timestamps into microseconds
+ *
+ * Compute a wait time for WaitLatchUs().
+ */
+int64
+TimestampDifferenceMicroseconds(TimestampTz start_time, TimestampTz stop_time)
+{
+	TimestampTz diff;
+
+	/* Deal with zero or negative elapsed time quickly. */
+	if (start_time >= stop_time)
+		return 0;
+	/* To not fail with timestamp infinities, we must detect overflow. */
+	if (pg_sub_s64_overflow(stop_time, start_time, &diff))
+		return PG_INT64_MAX;
+	return diff;
+}
+
 /*
  * TimestampDifferenceExceeds -- report whether the difference between two
  *		timestamps is >= a threshold (expressed in milliseconds)
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h
index edd59dc432..1caa15221d 100644
--- a/src/include/utils/timestamp.h
+++ b/src/include/utils/timestamp.h
@@ -100,6 +100,8 @@ extern void TimestampDifference(TimestampTz start_time, TimestampTz stop_time,
 								long *secs, int *microsecs);
 extern long TimestampDifferenceMilliseconds(TimestampTz start_time,
 											TimestampTz stop_time);
+extern int64 TimestampDifferenceMicroseconds(TimestampTz start_time,
+											 TimestampTz stop_time);
 extern bool TimestampDifferenceExceeds(TimestampTz start_time,
 									   TimestampTz stop_time,
 									   int msec);
-- 
2.39.2
