Hi,

Currently, using EXPLAIN (ANALYZE) without TIMING OFF regularly changes the resulting timing enough that the times aren't meaningful. E.g.:
CREATE TABLE lotsarows(key int not null);
INSERT INTO lotsarows SELECT generate_series(1, 50000000);
VACUUM FREEZE lotsarows;

-- best of three:
SELECT count(*) FROM lotsarows;
Time: 1923.394 ms (00:01.923)

-- best of three:
EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 2319.830 ms (00:02.320)

-- best of three:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4202.649 ms (00:04.203)

TIMING ON nearly *doubles* the execution time compared to TIMING OFF.

A profile of this shows that we spend a good bit of cycles "normalizing" timestamps etc. That seems pretty unnecessary, forced on us only by the use of struct timespec. So the first attached patch just turns instr_time into a 64-bit integer counting nanoseconds. That helps, a tiny bit:

EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4179.302 ms (00:04.179)

but obviously doesn't move the needle.

Looking at a profile it's easy to confirm that we spend a lot of time acquiring the time:

-   95.49%     0.00%  postgres  postgres  [.] agg_retrieve_direct (inlined)
   - agg_retrieve_direct (inlined)
      - 79.27% fetch_input_tuple
         - ExecProcNode (inlined)
            - 75.72% ExecProcNodeInstr
               + 25.22% SeqNext
               - 21.74% InstrStopNode
                  + 17.80% __GI___clock_gettime (inlined)
               - 21.44% InstrStartNode
                  + 19.23% __GI___clock_gettime (inlined)
               + 4.06% ExecScan
      + 13.09% advance_aggregates (inlined)
        1.06% MemoryContextReset

And that's even though linux avoids a syscall (in most cases) to acquire the time: unless the kernel detects a reason not to do so, it implements clock_gettime() by executing 'rdtscp' and multiplying the result by kernel-provided factors to turn the cycles into time. Some of the time is spent doing function calls, dividing into struct timespec, etc., but most of it is the rdtscp instruction itself:

 65.30 │1    63:   rdtscp

The reason for that is largely that rdtscp waits until all prior instructions have finished (though it allows later instructions to already start). And we execute it multiple times for each tuple.

In the second attached prototype patch I've changed instr_time to count in cpu cycles instead of nanoseconds, and then just turned the cycles into seconds in INSTR_TIME_GET_DOUBLE() (more about that part later). When using rdtsc that results in *vastly* lower overhead:

┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                           QUERY PLAN                                                            │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate  (cost=846239.20..846239.21 rows=1 width=8) (actual time=2610.235..2610.235 rows=1 loops=1)                           │
│   ->  Seq Scan on lotsarows  (cost=0.00..721239.16 rows=50000016 width=0) (actual time=0.006..1512.886 rows=50000000 loops=1)   │
│ Planning Time: 0.028 ms                                                                                                         │
│ Execution Time: 2610.256 ms                                                                                                     │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(4 rows)

Time: 2610.589 ms (00:02.611)

And there are still some smaller improvements that could be made on top of that.

As a comparison, here's the time when using rdtscp directly in instr_time, instead of going through clock_gettime:

Time: 3481.162 ms (00:03.481)

That shows pretty well how big the cost of the added pipeline stalls is, and how important out-of-order execution is for decent performance...

In my opinion, for the use in InstrStartNode(), InstrStopNode() etc. we do *not* want to wait for prior instructions to finish, since that actually leads to the timing being less accurate, rather than more.
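To make the per-call cost difference easy to reproduce outside of postgres, here's a rough standalone micro-benchmark (just my sketch, not part of the attached patches; the absolute numbers will of course vary by CPU and clocksource, it's the relative ordering that's interesting):

/*
 * Compare the per-call cost of clock_gettime(), rdtsc and rdtscp.
 * Build with e.g.: gcc -O2 -o tsc_bench tsc_bench.c
 */
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>

#define ITERATIONS 100000000

static double
seconds_since(const struct timespec *start)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return (now.tv_sec - start->tv_sec) +
		(now.tv_nsec - start->tv_nsec) / 1e9;
}

int
main(void)
{
	struct timespec ts, start;
	unsigned int aux;
	uint64_t sink = 0;

	/* vDSO clock_gettime(), i.e. what instr_time uses today */
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (long i = 0; i < ITERATIONS; i++)
	{
		clock_gettime(CLOCK_MONOTONIC, &ts);
		sink += ts.tv_nsec;
	}
	printf("clock_gettime: %.1f ns/call\n",
		   seconds_since(&start) * 1e9 / ITERATIONS);

	/* raw rdtsc, doesn't wait for prior instructions */
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (long i = 0; i < ITERATIONS; i++)
		sink += __rdtsc();
	printf("rdtsc:         %.1f ns/call\n",
		   seconds_since(&start) * 1e9 / ITERATIONS);

	/* rdtscp, waits for all prior instructions to finish */
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (long i = 0; i < ITERATIONS; i++)
		sink += __rdtscp(&aux);
	printf("rdtscp:        %.1f ns/call\n",
		   seconds_since(&start) * 1e9 / ITERATIONS);

	/* keep the compiler from discarding the loop results */
	return sink == 0;
}

Note that a tight loop like this mostly measures the instructions in isolation; in real usage, as in ExecProcNodeInstr(), the cost of the rdtscp serialization also depends on how much work is in flight around the measurement point.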
There are other cases where that'd be different, e.g. measuring how long an entire query takes or such (though there it's probably irrelevant which of the two is used).

Above I skipped a bit over the details of how to turn the cycles returned by rdtsc into time:

On x86 CPUs of the last ~12 years rdtsc doesn't return the number of cycles that have actually been run, but instead the number of 'reference cycles'. That's important because otherwise things like turbo mode and lower power modes would lead to completely bogus times. Thus, knowing the "base frequency" of the CPU lets us turn the difference between two rdtsc return values into seconds.

In the attached prototype I just determined the number of cycles using cpuid(0x16). That's only available since Skylake (I think). On older CPUs we'd have to look at /proc/cpuinfo or /sys/devices/system/cpu/cpu0/cpufreq/base_frequency.

There are also other issues with using rdtsc directly: On older CPUs, in particular older multi-socket systems, the tsc will not be fully synchronized across cores. There are CPUID bits that'd let us check whether the tsc is suitable or not. The more current issue is that things like virtual machines being migrated can lead to rdtsc suddenly returning a different value / advancing at a different frequency. But that is supposed to be solved these days, by having virtualization technologies set frequency multipliers and offsets which then cause rdtsc[p] to return something meaningful, even after migration.

The attached patches are really just a prototype. I'm also not really planning to work on getting this into a "production ready" patchset anytime soon. I developed it primarily because the timing overhead made it too hard to nail down in which part of a query tree performance changed. If somebody else wants to continue from here...

I do think it'd be a pretty significant improvement if we could reduce the timing overhead of EXPLAIN ANALYZE by this much, even if it requires a bunch of low-level code.

Greetings,

Andres Freund
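PS: For anyone who wants to play with the cycles-to-seconds conversion separately, here's the same approach as in the second patch spelled out as a standalone program (a sketch only; the tsc_to_sec_multiplier() helper name and the error handling are made up for illustration, and a real version would need the /proc//sysfs fallback mentioned above):

/*
 * Convert rdtsc deltas to seconds using the base frequency reported by
 * CPUID leaf 0x16 (EAX = base frequency in MHz, Skylake onwards).
 */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <cpuid.h>
#include <x86intrin.h>

static double
tsc_to_sec_multiplier(void)
{
	unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

	if (__get_cpuid(0x16, &eax, &ebx, &ecx, &edx) && eax != 0)
		return 1.0 / ((double) eax * 1000 * 1000);

	/* leaf not supported / nothing reported, caller needs a fallback */
	return 0;
}

int
main(void)
{
	double		to_sec = tsc_to_sec_multiplier();
	uint64_t	before;
	uint64_t	after;

	if (to_sec == 0)
	{
		fprintf(stderr, "cpuid leaf 0x16 unusable, would need sysfs fallback\n");
		return 1;
	}

	before = __rdtsc();
	sleep(1);
	after = __rdtsc();

	/* should print roughly 1.0 on a CPU with a stable, synchronized tsc */
	printf("measured: %.6f s\n", (double) (after - before) * to_sec);
	return 0;
}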
>From 325140820e5745bea901bffd685626aa84a16d23 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Thu, 28 May 2020 14:52:48 -0700
Subject: [PATCH v1 1/2] WIP: Change instr_time to just store nanoseconds,
 that's cheaper.

Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/portability/instr_time.h | 62 ++++++++++++----------------
 1 file changed, 26 insertions(+), 36 deletions(-)

diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index d6459327ccf..fc058d548a8 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -83,63 +83,53 @@
 #define PG_INSTR_CLOCK	CLOCK_REALTIME
 #endif
 
-typedef struct timespec instr_time;
+typedef int64 instr_time;
+#define NS_PER_S	INT64CONST(1000000000)
+#define US_PER_S	INT64CONST(1000000)
+#define MS_PER_S	INT64CONST(1000)
 
-#define INSTR_TIME_IS_ZERO(t)	((t).tv_nsec == 0 && (t).tv_sec == 0)
+#define NS_PER_MS	INT64CONST(1000000)
+#define NS_PER_US	INT64CONST(1000)
 
-#define INSTR_TIME_SET_ZERO(t)	((t).tv_sec = 0, (t).tv_nsec = 0)
+#define INSTR_TIME_IS_ZERO(t)	((t) == 0)
 
-#define INSTR_TIME_SET_CURRENT(t)	((void) clock_gettime(PG_INSTR_CLOCK, &(t)))
+#define INSTR_TIME_SET_ZERO(t)	((t) = 0)
+
+static inline instr_time pg_clock_gettime_ns(void)
+{
+	struct timespec tmp;
+
+	clock_gettime(PG_INSTR_CLOCK, &tmp);
+
+	return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
+}
+
+#define INSTR_TIME_SET_CURRENT(t) \
+	(t) = pg_clock_gettime_ns()
 
 #define INSTR_TIME_ADD(x,y) \
 	do { \
-		(x).tv_sec += (y).tv_sec; \
-		(x).tv_nsec += (y).tv_nsec; \
-		/* Normalize */ \
-		while ((x).tv_nsec >= 1000000000) \
-		{ \
-			(x).tv_nsec -= 1000000000; \
-			(x).tv_sec++; \
-		} \
+		(x) += (y); \
 	} while (0)
 
 #define INSTR_TIME_SUBTRACT(x,y) \
 	do { \
-		(x).tv_sec -= (y).tv_sec; \
-		(x).tv_nsec -= (y).tv_nsec; \
-		/* Normalize */ \
-		while ((x).tv_nsec < 0) \
-		{ \
-			(x).tv_nsec += 1000000000; \
-			(x).tv_sec--; \
-		} \
+		(x) -= (y); \
 	} while (0)
 
 #define INSTR_TIME_ACCUM_DIFF(x,y,z) \
 	do { \
-		(x).tv_sec += (y).tv_sec - (z).tv_sec; \
-		(x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \
-		/* Normalize after each add to avoid overflow/underflow of tv_nsec */ \
-		while ((x).tv_nsec < 0) \
-		{ \
-			(x).tv_nsec += 1000000000; \
-			(x).tv_sec--; \
-		} \
-		while ((x).tv_nsec >= 1000000000) \
-		{ \
-			(x).tv_nsec -= 1000000000; \
-			(x).tv_sec++; \
-		} \
+		(x) += (y) - (z); \
 	} while (0)
 
 #define INSTR_TIME_GET_DOUBLE(t) \
-	(((double) (t).tv_sec) + ((double) (t).tv_nsec) / 1000000000.0)
+	((double) (t) / NS_PER_S)
 
 #define INSTR_TIME_GET_MILLISEC(t) \
-	(((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 1000000.0)
+	((double) (t) / NS_PER_MS)
 
 #define INSTR_TIME_GET_MICROSEC(t) \
-	(((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) ((t).tv_nsec / 1000))
+	((double) (t) / NS_PER_US)
 
 #else							/* !HAVE_CLOCK_GETTIME */
-- 
2.25.0.114.g5b0ca878e0
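To be clear, patch 1 keeps the macro API unchanged, so callers such as instrument.c continue to look roughly like this (an illustrative sketch, not an excerpt from the patch):

	instr_time	start;
	instr_time	end;
	instr_time	total;

	INSTR_TIME_SET_ZERO(total);

	INSTR_TIME_SET_CURRENT(start);
	/* ... the work being measured ... */
	INSTR_TIME_SET_CURRENT(end);

	/* total += end - start; now a plain int64 add instead of timespec math */
	INSTR_TIME_ACCUM_DIFF(total, end, start);

	elog(LOG, "took %.3f ms", INSTR_TIME_GET_MILLISEC(total));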
>From 29a12302bab6b2fb8c2475834510b90c4a6197ce Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Thu, 11 Jun 2020 19:38:18 -0700
Subject: [PATCH v1 2/2] WIP: Use cpu reference cycles, via rdtsc, to measure
 time for instrumentation.

---
 src/include/portability/instr_time.h | 68 ++++++++++++++++++++++++----
 1 file changed, 60 insertions(+), 8 deletions(-)

diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index fc058d548a8..8b2f9a2e707 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -83,7 +83,9 @@
 #define PG_INSTR_CLOCK	CLOCK_REALTIME
 #endif
 
+/* time in baseline cpu cycles */
 typedef int64 instr_time;
+
 #define NS_PER_S	INT64CONST(1000000000)
 #define US_PER_S	INT64CONST(1000000)
 #define MS_PER_S	INT64CONST(1000)
@@ -95,17 +97,67 @@ typedef int64 instr_time;
 
 #define INSTR_TIME_SET_ZERO(t)	((t) = 0)
 
-static inline instr_time pg_clock_gettime_ns(void)
+#include <x86intrin.h>
+#include <cpuid.h>
+
+/*
+ * Return the factor that a number of cycles needs to be multiplied with to
+ * end up with seconds.
+ *
+ * FIXME: The cold portion should probably be out-of-line. And it'd be better
+ * to not recompute this in every file that uses this. Best would probably be
+ * to require explicit initialization of cycles_to_sec, because having a
+ * branch really is unnecessary.
+ *
+ * FIXME: We should probably not unnecessarily use floating point math
+ * here. And it's likely that the numbers are small enough that we are running
+ * into floating point inaccuracies already. Probably worthwhile to be a good
+ * bit smarter.
+ *
+ * FIXME: This would need to be conditional, with a fallback to something not
+ * rdtsc based.
+ */
+static inline double __attribute__((const))
+get_cycles_to_sec(void)
 {
-	struct timespec tmp;
+	static double cycles_to_sec = 0;
 
-	clock_gettime(PG_INSTR_CLOCK, &tmp);
+	/*
+	 * Compute baseline cpu performance, which determines the speed at which
+	 * rdtsc advances.
+	 */
+	if (unlikely(cycles_to_sec == 0))
+	{
+		uint32 cpuinfo[4] = {0};
 
-	return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
+		__get_cpuid(0x16, cpuinfo, cpuinfo + 1, cpuinfo + 2, cpuinfo + 3);
+		cycles_to_sec = 1 / ((double) cpuinfo[0] * 1000 * 1000);
+	}
+
+	return cycles_to_sec;
+}
+
+static inline instr_time pg_clock_gettime_ref_cycles(void)
+{
+	/*
+	 * The rdtscp waits for all in-flight instructions to finish (but allows
+	 * later instructions to start concurrently). That's good for some timing
+	 * situations (when the time is supposed to cover all the work), but
+	 * terrible for others (when sub-parts of work are measured, because then
+	 * the pipeline stall due to the wait changes the overall timing).
+	 */
+#if 0
+	unsigned int aux;
+	int64 tsc = __rdtscp(&aux);
+
+	return tsc;
+#else
+
+	return __rdtsc();
+#endif
 }
 
 #define INSTR_TIME_SET_CURRENT(t) \
-	(t) = pg_clock_gettime_ns()
+	(t) = pg_clock_gettime_ref_cycles()
 
 #define INSTR_TIME_ADD(x,y) \
 	do { \
@@ -123,13 +175,13 @@ static inline instr_time pg_clock_gettime_ns(void)
 	} while (0)
 
 #define INSTR_TIME_GET_DOUBLE(t) \
-	((double) (t) / NS_PER_S)
+	((double) (t) * get_cycles_to_sec())
 
 #define INSTR_TIME_GET_MILLISEC(t) \
-	((double) (t) / NS_PER_MS)
+	((double) (t) * (get_cycles_to_sec() * MS_PER_S))
 
 #define INSTR_TIME_GET_MICROSEC(t) \
-	((double) (t) / NS_PER_US)
+	((double) (t) * (get_cycles_to_sec() * US_PER_S))
 
 #else							/* !HAVE_CLOCK_GETTIME */
-- 
2.25.0.114.g5b0ca878e0