Hi,

Currently, using EXPLAIN (ANALYZE) without TIMING OFF regularly changes the resulting timing enough that the times aren't meaningful. E.g.:
CREATE TABLE lotsarows(key int not null);
INSERT INTO lotsarows SELECT generate_series(1, 50000000);
VACUUM FREEZE lotsarows;

-- best of three:
SELECT count(*) FROM lotsarows;
Time: 1923.394 ms (00:01.923)

-- best of three:
EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 2319.830 ms (00:02.320)

-- best of three:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4202.649 ms (00:04.203)

TIMING ON nearly *doubles* the execution time compared to TIMING OFF.

A profile of this shows that we spend a good bit of cycles "normalizing" timestamps etc. That seems pretty unnecessary, forced on us only by the use of struct timespec. So the first attached patch just turns instr_time into a 64-bit integer counting nanoseconds. That helps, a tiny bit:

EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4179.302 ms (00:04.179)

but obviously doesn't move the needle.

Looking at a profile it's easy to confirm that we spend a lot of time acquiring the time:

-   95.49%     0.00%  postgres  postgres  [.] agg_retrieve_direct (inlined)
   - agg_retrieve_direct (inlined)
      - 79.27% fetch_input_tuple
         - ExecProcNode (inlined)
            - 75.72% ExecProcNodeInstr
               + 25.22% SeqNext
               - 21.74% InstrStopNode
                  + 17.80% __GI___clock_gettime (inlined)
               - 21.44% InstrStartNode
                  + 19.23% __GI___clock_gettime (inlined)
               + 4.06% ExecScan
      + 13.09% advance_aggregates (inlined)
        1.06% MemoryContextReset

And that's even though linux avoids a syscall (in most cases) to acquire the time: unless the kernel detects a reason not to do so, it implements clock_gettime() by executing 'rdtscp' and multiplying the result by kernel-provided factors to turn the cycles into time. Some of the time is spent doing function calls, dividing into struct timespec, etc., but most of it is the rdtscp instruction itself:

 65.30 │1    63:   rdtscp

The reason for that is largely that rdtscp waits until all prior instructions have finished (though it allows later instructions to already start). And we execute it multiple times for each tuple.

In the second attached prototype patch I've changed instr_time to count in cpu cycles instead of nanoseconds, and then just turned the cycles into seconds in INSTR_TIME_GET_DOUBLE() (more about that part later). When using rdtsc that results in *vastly* lower overhead:

┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                           QUERY PLAN                                                            │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate  (cost=846239.20..846239.21 rows=1 width=8) (actual time=2610.235..2610.235 rows=1 loops=1)                           │
│   ->  Seq Scan on lotsarows  (cost=0.00..721239.16 rows=50000016 width=0) (actual time=0.006..1512.886 rows=50000000 loops=1)   │
│ Planning Time: 0.028 ms                                                                                                         │
│ Execution Time: 2610.256 ms                                                                                                     │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(4 rows)

Time: 2610.589 ms (00:02.611)

And there are still some smaller improvements that could be made on top of that.

As a comparison, here's the time when using rdtscp directly in instr_time, instead of going through clock_gettime:

Time: 3481.162 ms (00:03.481)

That shows pretty well how big the cost of the added pipeline stalls is, and how important out-of-order execution is for decent performance...

In my opinion, for the use in InstrStartNode(), InstrStopNode() etc. we do *not* want to wait for prior instructions to finish, since that actually leads to the timing being less accurate, rather than more.
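To make the per-call cost difference easy to reproduce outside of postgres, here's a rough standalone micro-benchmark (just my sketch, not part of the attached patches; the absolute numbers will of course vary by CPU and clocksource, it's the relative ordering that's interesting):

/*
 * Compare the per-call cost of clock_gettime(), rdtsc and rdtscp.
 * Build with e.g.: gcc -O2 -o tsc_bench tsc_bench.c
 */
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>

#define ITERATIONS 100000000

static double
seconds_since(const struct timespec *start)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return (now.tv_sec - start->tv_sec) +
		(now.tv_nsec - start->tv_nsec) / 1e9;
}

int
main(void)
{
	struct timespec ts, start;
	unsigned int aux;
	uint64_t sink = 0;

	/* vDSO clock_gettime(), i.e. what instr_time uses today */
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (long i = 0; i < ITERATIONS; i++)
	{
		clock_gettime(CLOCK_MONOTONIC, &ts);
		sink += ts.tv_nsec;
	}
	printf("clock_gettime: %.1f ns/call\n",
		   seconds_since(&start) * 1e9 / ITERATIONS);

	/* raw rdtsc, doesn't wait for prior instructions */
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (long i = 0; i < ITERATIONS; i++)
		sink += __rdtsc();
	printf("rdtsc:         %.1f ns/call\n",
		   seconds_since(&start) * 1e9 / ITERATIONS);

	/* rdtscp, waits for all prior instructions to finish */
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (long i = 0; i < ITERATIONS; i++)
		sink += __rdtscp(&aux);
	printf("rdtscp:        %.1f ns/call\n",
		   seconds_since(&start) * 1e9 / ITERATIONS);

	/* keep the compiler from discarding the loop results */
	return sink == 0;
}

Note that a tight loop like this mostly measures the instructions in isolation; in real usage, as in ExecProcNodeInstr(), the cost of the rdtscp serialization also depends on how much work is in flight around the measurement point.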
There are other cases where that'd be different, e.g. measuring how long an entire query takes or such (though there it's probably irrelevant which of the two is used).

Above I skipped a bit over the details of how to turn the cycles returned by rdtsc into time:

On x86 CPUs of the last ~12 years rdtsc doesn't return the number of cycles that have actually been run, but instead the number of 'reference cycles'. That's important because otherwise things like turbo mode and lower power modes would lead to completely bogus times. Thus, knowing the "base frequency" of the CPU lets us turn the difference between two rdtsc return values into seconds.

In the attached prototype I just determined the number of cycles using cpuid(0x16). That's only available since Skylake (I think). On older CPUs we'd have to look at /proc/cpuinfo or /sys/devices/system/cpu/cpu0/cpufreq/base_frequency.

There are also other issues with using rdtsc directly: On older CPUs, in particular older multi-socket systems, the tsc will not be fully synchronized across cores. There are CPUID bits that'd let us check whether the tsc is suitable or not. The more current issue is that things like virtual machines being migrated can lead to rdtsc suddenly returning a different value / advancing at a different frequency. But that is supposed to be solved these days, by having virtualization technologies set frequency multipliers and offsets which then cause rdtsc[p] to return something meaningful, even after migration.

The attached patches are really just a prototype. I'm also not really planning to work on getting this into a "production ready" patchset anytime soon. I developed it primarily because the timing overhead made it too hard to nail down in which part of a query tree performance changed. If somebody else wants to continue from here...

I do think it'd be a pretty significant improvement if we could reduce the timing overhead of EXPLAIN ANALYZE by this much, even if it requires a bunch of low-level code.

Greetings,

Andres Freund
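PS: For anyone who wants to play with the cycles-to-seconds conversion separately, here's the same approach as in the second patch spelled out as a standalone program (a sketch only; the tsc_to_sec_multiplier() helper name and the error handling are made up for illustration, and a real version would need the /proc//sysfs fallback mentioned above):

/*
 * Convert rdtsc deltas to seconds using the base frequency reported by
 * CPUID leaf 0x16 (EAX = base frequency in MHz, Skylake onwards).
 */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <cpuid.h>
#include <x86intrin.h>

static double
tsc_to_sec_multiplier(void)
{
	unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

	if (__get_cpuid(0x16, &eax, &ebx, &ecx, &edx) && eax != 0)
		return 1.0 / ((double) eax * 1000 * 1000);

	/* leaf not supported / nothing reported, caller needs a fallback */
	return 0;
}

int
main(void)
{
	double		to_sec = tsc_to_sec_multiplier();
	uint64_t	before;
	uint64_t	after;

	if (to_sec == 0)
	{
		fprintf(stderr, "cpuid leaf 0x16 unusable, would need sysfs fallback\n");
		return 1;
	}

	before = __rdtsc();
	sleep(1);
	after = __rdtsc();

	/* should print roughly 1.0 on a CPU with a stable, synchronized tsc */
	printf("measured: %.6f s\n", (double) (after - before) * to_sec);
	return 0;
}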
>From 325140820e5745bea901bffd685626aa84a16d23 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Thu, 28 May 2020 14:52:48 -0700
Subject: [PATCH v1 1/2] WIP: Change instr_time to just store nanoseconds,
 that's cheaper.

Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/portability/instr_time.h | 62 ++++++++++++----------------
 1 file changed, 26 insertions(+), 36 deletions(-)

diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index d6459327ccf..fc058d548a8 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -83,63 +83,53 @@
 #define PG_INSTR_CLOCK	CLOCK_REALTIME
 #endif
 
-typedef struct timespec instr_time;
+typedef int64 instr_time;
+#define NS_PER_S	INT64CONST(1000000000)
+#define US_PER_S	INT64CONST(1000000)
+#define MS_PER_S	INT64CONST(1000)
 
-#define INSTR_TIME_IS_ZERO(t)	((t).tv_nsec == 0 && (t).tv_sec == 0)
+#define NS_PER_MS	INT64CONST(1000000)
+#define NS_PER_US	INT64CONST(1000)
 
-#define INSTR_TIME_SET_ZERO(t)	((t).tv_sec = 0, (t).tv_nsec = 0)
+#define INSTR_TIME_IS_ZERO(t)	((t) == 0)
 
-#define INSTR_TIME_SET_CURRENT(t)	((void) clock_gettime(PG_INSTR_CLOCK, &(t)))
+#define INSTR_TIME_SET_ZERO(t)	((t) = 0)
+
+static inline instr_time pg_clock_gettime_ns(void)
+{
+	struct timespec tmp;
+
+	clock_gettime(PG_INSTR_CLOCK, &tmp);
+
+	return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
+}
+
+#define INSTR_TIME_SET_CURRENT(t) \
+	(t) = pg_clock_gettime_ns()
 
 #define INSTR_TIME_ADD(x,y) \
 	do { \
-		(x).tv_sec += (y).tv_sec; \
-		(x).tv_nsec += (y).tv_nsec; \
-		/* Normalize */ \
-		while ((x).tv_nsec >= 1000000000) \
-		{ \
-			(x).tv_nsec -= 1000000000; \
-			(x).tv_sec++; \
-		} \
+		(x) += (y); \
 	} while (0)
 
 #define INSTR_TIME_SUBTRACT(x,y) \
 	do { \
-		(x).tv_sec -= (y).tv_sec; \
-		(x).tv_nsec -= (y).tv_nsec; \
-		/* Normalize */ \
-		while ((x).tv_nsec < 0) \
-		{ \
-			(x).tv_nsec += 1000000000; \
-			(x).tv_sec--; \
-		} \
+		(x) -= (y); \
 	} while (0)
 
 #define INSTR_TIME_ACCUM_DIFF(x,y,z) \
 	do { \
-		(x).tv_sec += (y).tv_sec - (z).tv_sec; \
-		(x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \
-		/* Normalize after each add to avoid overflow/underflow of tv_nsec */ \
-		while ((x).tv_nsec < 0) \
-		{ \
-			(x).tv_nsec += 1000000000; \
-			(x).tv_sec--; \
-		} \
-		while ((x).tv_nsec >= 1000000000) \
-		{ \
-			(x).tv_nsec -= 1000000000; \
-			(x).tv_sec++; \
-		} \
+		(x) += (y) - (z); \
 	} while (0)
 
 #define INSTR_TIME_GET_DOUBLE(t) \
-	(((double) (t).tv_sec) + ((double) (t).tv_nsec) / 1000000000.0)
+	((double) (t) / NS_PER_S)
 
 #define INSTR_TIME_GET_MILLISEC(t) \
-	(((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 1000000.0)
+	((double) (t) / NS_PER_MS)
 
 #define INSTR_TIME_GET_MICROSEC(t) \
-	(((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) ((t).tv_nsec / 1000))
+	((double) (t) / NS_PER_US)
 
 #else							/* !HAVE_CLOCK_GETTIME */
-- 
2.25.0.114.g5b0ca878e0
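To be clear, patch 1 keeps the macro API unchanged, so callers such as instrument.c continue to look roughly like this (an illustrative sketch, not an excerpt from the patch):

	instr_time	start;
	instr_time	end;
	instr_time	total;

	INSTR_TIME_SET_ZERO(total);

	INSTR_TIME_SET_CURRENT(start);
	/* ... the work being measured ... */
	INSTR_TIME_SET_CURRENT(end);

	/* total += end - start; now a plain int64 add instead of timespec math */
	INSTR_TIME_ACCUM_DIFF(total, end, start);

	elog(LOG, "took %.3f ms", INSTR_TIME_GET_MILLISEC(total));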
>From 29a12302bab6b2fb8c2475834510b90c4a6197ce Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Thu, 11 Jun 2020 19:38:18 -0700
Subject: [PATCH v1 2/2] WIP: Use cpu reference cycles, via rdtsc, to measure
 time for instrumentation.

---
 src/include/portability/instr_time.h | 68 ++++++++++++++++++++++++----
 1 file changed, 60 insertions(+), 8 deletions(-)

diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index fc058d548a8..8b2f9a2e707 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -83,7 +83,9 @@
 #define PG_INSTR_CLOCK	CLOCK_REALTIME
 #endif
 
+/* time in baseline cpu cycles */
 typedef int64 instr_time;
+
 #define NS_PER_S	INT64CONST(1000000000)
 #define US_PER_S	INT64CONST(1000000)
 #define MS_PER_S	INT64CONST(1000)
@@ -95,17 +97,67 @@ typedef int64 instr_time;
 
 #define INSTR_TIME_SET_ZERO(t)	((t) = 0)
 
-static inline instr_time pg_clock_gettime_ns(void)
+#include <x86intrin.h>
+#include <cpuid.h>
+
+/*
+ * Return the factor that a number of cycles needs to be multiplied with to
+ * end up with seconds.
+ *
+ * FIXME: The cold portion should probably be out-of-line. And it'd be better
+ * to not recompute this in every file that uses this. Best would probably be
+ * to require explicit initialization of cycles_to_sec, because having a
+ * branch really is unnecessary.
+ *
+ * FIXME: We should probably not unnecessarily use floating point math
+ * here. And it's likely that the numbers are small enough that we are running
+ * into floating point inaccuracies already. Probably worthwhile to be a good
+ * bit smarter.
+ *
+ * FIXME: This would need to be conditional, with a fallback to something not
+ * rdtsc based.
+ */
+static inline double __attribute__((const))
+get_cycles_to_sec(void)
 {
-	struct timespec tmp;
+	static double cycles_to_sec = 0;
 
-	clock_gettime(PG_INSTR_CLOCK, &tmp);
+	/*
+	 * Compute baseline cpu performance, which determines the speed at which
+	 * rdtsc advances.
+	 */
+	if (unlikely(cycles_to_sec == 0))
+	{
+		uint32 cpuinfo[4] = {0};
 
-	return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
+		__get_cpuid(0x16, cpuinfo, cpuinfo + 1, cpuinfo + 2, cpuinfo + 3);
+		cycles_to_sec = 1 / ((double) cpuinfo[0] * 1000 * 1000);
+	}
+
+	return cycles_to_sec;
+}
+
+static inline instr_time pg_clock_gettime_ref_cycles(void)
+{
+	/*
+	 * The rdtscp waits for all in-flight instructions to finish (but allows
+	 * later instructions to start concurrently). That's good for some timing
+	 * situations (when the time is supposed to cover all the work), but
+	 * terrible for others (when sub-parts of work are measured, because then
+	 * the pipeline stall due to the wait changes the overall timing).
+	 */
+#if 0
+	unsigned int aux;
+	int64 tsc = __rdtscp(&aux);
+
+	return tsc;
+#else
+
+	return __rdtsc();
+#endif
 }
 
 #define INSTR_TIME_SET_CURRENT(t) \
-	(t) = pg_clock_gettime_ns()
+	(t) = pg_clock_gettime_ref_cycles()
 
 #define INSTR_TIME_ADD(x,y) \
 	do { \
@@ -123,13 +175,13 @@ static inline instr_time pg_clock_gettime_ns(void)
 	} while (0)
 
 #define INSTR_TIME_GET_DOUBLE(t) \
-	((double) (t) / NS_PER_S)
+	((double) (t) * get_cycles_to_sec())
 
 #define INSTR_TIME_GET_MILLISEC(t) \
-	((double) (t) / NS_PER_MS)
+	((double) (t) * (get_cycles_to_sec() * MS_PER_S))
 
 #define INSTR_TIME_GET_MICROSEC(t) \
-	((double) (t) / NS_PER_US)
+	((double) (t) * (get_cycles_to_sec() * US_PER_S))
 
 #else							/* !HAVE_CLOCK_GETTIME */
-- 
2.25.0.114.g5b0ca878e0