Hi hackers,
Since current patch is in the commitfest with the status 'Ready for
committer', I’d like to summarize what it does, the problems it
addresses, and answer the key questions raised in the discussion thread.
Enabling pg_stat_statements can cause a performance drop due to two main
reasons:
* Contention on the exclusive lock, which blocks the hash table when
allocating or deallocating an entry.
* Contention on the spinlock, which blocks access to an existing entry.
The exclusive lock issue can be mitigated by normalizing similar queries
to the same queryid. There is an ongoing thread discussing this approach
for IN and ARRAY queries: [0]. Another solution is integrating
pg_stat_statements into central pgstats with the custom APIs, without
pushing the module into core [1].
However, even if we do all that, the spinlock contention problem will
persist for very frequent queries. This is exactly what I aim to solve
with sampling. Other approaches, such as replacing the spinlock with
atomic variables or an lwlock, have not yielded good results [2]. Even
track_planning was disabled by default to reduce contention, but on
high-core systems, the issue still remains.So far, I see no alternative
solution to this problem.
The benchmark results on different number of CPUs demonstrating the
impact of this patch can be found here:
pgbench -c{CPUs} -j20 -S -Mprepared -T120
CPUs | sample_rate | tps | CPU waits | ClientRead wait | SpinDelay wait
192 | 1.0 | 484338| 9568 | 929 | 11107
192 | 0.75 | 909547| 12079 | 2100 | 4781
192 | 0.5 |1028594| 13253 | 3378 | 174
192 | 0.25 |1019507| 13397 | 3423 | -
192 | 0.0 |1015425| 13106 | 3502 | -
48 | 1.0 | 643251| 911 | 932 | 44
48 | 0.75 | 653946| 939 | 939 | 3
48 | 0.5 | 651654| 841 | 932 | -
48 | 0.25 | 652668| 860 | 910 | -
48 | 0.0 | 659111| 849 | 882 | -
32 | 1.0 | 620667| 1782 | 560 | -
32 | 0.75 | 620663| 1736 | 554 | -
32 | 0.5 | 624094| 1741 | 648 | -
32 | 0.25 | 628638| 1702 | 576 | -
32 | 0.0 | 630483| 1638 | 574 | -
Some suggested sampling based on execution time instead of a random
generator. While this is a good idea, it has a critical limitation.
Execution time can only be checked in pgss_ExecutorEnd. However,
pgss_planner already interacts with the spinlock, and we can not sample
based on execution time at that stage. Finding a criterion that works
for both pgss_planner and pgss_ExecutorEnd is difficult. This is why
random sampling remains the most viable option.
BTW, Instead of using a bool flag to check if we are inside
pgss_post_parse_analyze, we now pass a pgssStoreKind to pgss_enabled.
This makes the logic clearer and more readable, explicitly indicating
where we need to check whether a query should be sampled. I made it in
new v15 patch.
Are there any remaining concerns or alternative suggestions regarding
this approach? I'm happy to discuss further.
[0]: https://commitfest.postgresql.org/51/2837/
[1]: https://www.postgresql.org/message-id/Zz0MFPq1s4WFxxpL%40paquier.xyz
[2]: https://commitfest.postgresql.org/29/2634/
--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.
From c9540fac9f6d6ad6d15acdaad844ca6c4ecba455 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdoki...@tantorlabs.ru>
Date: Mon, 10 Feb 2025 11:00:46 +0300
Subject: [PATCH] Allow setting sample rate for pg_stat_statements
New configuration parameter pg_stat_statements.sample_rate makes it
possible to track just a fraction of the queries meeting the configured
threshold, to reduce the amount of tracking.
---
contrib/pg_stat_statements/Makefile | 2 +-
.../pg_stat_statements/expected/sampling.out | 174 ++++++++++++++++++
contrib/pg_stat_statements/meson.build | 1 +
.../pg_stat_statements/pg_stat_statements.c | 53 +++++-
contrib/pg_stat_statements/sql/sampling.sql | 50 +++++
doc/src/sgml/pgstatstatements.sgml | 19 ++
6 files changed, 291 insertions(+), 8 deletions(-)
create mode 100644 contrib/pg_stat_statements/expected/sampling.out
create mode 100644 contrib/pg_stat_statements/sql/sampling.sql
diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 241c02587b..b70bdfaf2d 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -20,7 +20,7 @@ LDFLAGS_SL += $(filter -lm, $(LIBS))
REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_stat_statements/pg_stat_statements.conf
REGRESS = select dml cursors utility level_tracking planning \
user_activity wal entry_timestamp privileges extended \
- parallel cleanup oldextversions
+ parallel sampling cleanup oldextversions
# Disabled because these tests require "shared_preload_libraries=pg_stat_statements",
# which typical installcheck users do not have (e.g. buildfarm clients).
NO_INSTALLCHECK = 1
diff --git a/contrib/pg_stat_statements/expected/sampling.out b/contrib/pg_stat_statements/expected/sampling.out
new file mode 100644
index 0000000000..2204215f64
--- /dev/null
+++ b/contrib/pg_stat_statements/expected/sampling.out
@@ -0,0 +1,174 @@
+--
+-- sample statements
+--
+-- top-level tracking - simple query protocol
+SHOW pg_stat_statements.track;
+ pg_stat_statements.track
+--------------------------
+ top
+(1 row)
+
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT 1 AS "int";
+ int
+-----
+ 1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls
+-------+-------
+(0 rows)
+
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+ int
+-----
+ 1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls
+--------------------+-------
+ SELECT $1 AS "int" | 1
+(1 row)
+
+-- top-level tracking - extended query protocol
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT 1 \parse stmt
+\bind_named stmt \g
+ ?column?
+----------
+ 1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls
+-------+-------
+(0 rows)
+
+SET pg_stat_statements.sample_rate = 1.0;
+\bind_named stmt \g
+ ?column?
+----------
+ 1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls
+-----------+-------
+ SELECT $1 | 1
+(1 row)
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+DEALLOCATE stmt;
+-- nested tracking - simple query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SET pg_stat_statements.sample_rate = 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN
+------------
+ Result
+(1 row)
+
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN
+------------
+ Result
+(1 row)
+
+SET pg_stat_statements.sample_rate = 0;
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN
+------------
+ Result
+(1 row)
+
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN
+------------
+ Result
+(1 row)
+
+SELECT toplevel, calls, query FROM pg_stat_statements
+ ORDER BY query COLLATE "C";
+ toplevel | calls | query
+----------+-------+----------------------------------------------------
+ t | 2 | EXPLAIN (COSTS OFF) SELECT $1
+ f | 2 | SELECT $1
+ t | 1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
+ t | 2 | SET pg_stat_statements.sample_rate = $1
+(4 rows)
+
+-- nested tracking - extended query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SET pg_stat_statements.sample_rate = 1;
+EXPLAIN (COSTS OFF) SELECT 1; \parse stmt
+ QUERY PLAN
+------------
+ Result
+(1 row)
+
+\bind_named stmt \g
+ QUERY PLAN
+------------
+ Result
+(1 row)
+
+\bind_named stmt \g
+ QUERY PLAN
+------------
+ Result
+(1 row)
+
+SET pg_stat_statements.sample_rate = 0;
+\bind_named stmt \g
+ QUERY PLAN
+------------
+ Result
+(1 row)
+
+\bind_named stmt \g
+ QUERY PLAN
+------------
+ Result
+(1 row)
+
+SELECT toplevel, calls, query FROM pg_stat_statements
+ ORDER BY query COLLATE "C";
+ toplevel | calls | query
+----------+-------+-----------------------------------------
+ t | 2 | EXPLAIN (COSTS OFF) SELECT $1
+ f | 3 | SELECT $1
+ t | 1 | SET pg_stat_statements.sample_rate = $1
+(3 rows)
+
diff --git a/contrib/pg_stat_statements/meson.build b/contrib/pg_stat_statements/meson.build
index 4446af58c5..351354a6a5 100644
--- a/contrib/pg_stat_statements/meson.build
+++ b/contrib/pg_stat_statements/meson.build
@@ -54,6 +54,7 @@ tests += {
'privileges',
'extended',
'parallel',
+ 'sampling',
'cleanup',
'oldextversions',
],
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index bebf8134eb..9e79e6ac33 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,6 +50,7 @@
#include "access/parallel.h"
#include "catalog/pg_authid.h"
#include "common/int.h"
+#include "common/pg_prng.h"
#include "executor/instrument.h"
#include "funcapi.h"
#include "jit/jit.h"
@@ -256,6 +257,9 @@ typedef struct pgssSharedState
/* Current nesting depth of planner/ExecutorRun/ProcessUtility calls */
static int nesting_level = 0;
+/* Is the current top-level query to be sampled? */
+static bool is_query_sampled = false;
+
/* Saved hook values */
static shmem_request_hook_type prev_shmem_request_hook = NULL;
static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
@@ -294,12 +298,14 @@ static bool pgss_track_utility = true; /* whether to track utility commands */
static bool pgss_track_planning = false; /* whether to track planning
* duration */
static bool pgss_save = true; /* whether to save stats across shutdown */
+static double pgss_sample_rate = 1.0; /* fraction of statements to track */
-#define pgss_enabled(level) \
+#define pgss_enabled(level, skip_sampling_check) \
(!IsParallelWorker() && \
(pgss_track == PGSS_TRACK_ALL || \
- (pgss_track == PGSS_TRACK_TOP && (level) == 0)))
+ (pgss_track == PGSS_TRACK_TOP && (level) == 0)) && \
+ (skip_sampling_check == PGSS_INVALID || current_query_sampled()))
#define record_gc_qtexts() \
do { \
@@ -373,6 +379,7 @@ static char *generate_normalized_query(JumbleState *jstate, const char *query,
static void fill_in_constant_lengths(JumbleState *jstate, const char *query,
int query_loc);
static int comp_location(const void *a, const void *b);
+static bool current_query_sampled(void);
/*
@@ -414,6 +421,19 @@ _PG_init(void)
NULL,
NULL);
+ DefineCustomRealVariable("pg_stat_statements.sample_rate",
+ "Fraction of queries to track.",
+ NULL,
+ &pgss_sample_rate,
+ 1.0,
+ 0.0,
+ 1.0,
+ PGC_SUSET,
+ 0,
+ NULL,
+ NULL,
+ NULL);
+
DefineCustomEnumVariable("pg_stat_statements.track",
"Selects which statements are tracked by pg_stat_statements.",
NULL,
@@ -836,7 +856,7 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
prev_post_parse_analyze_hook(pstate, query, jstate);
/* Safety check... */
- if (!pgss || !pgss_hash || !pgss_enabled(nesting_level))
+ if (!pgss || !pgss_hash || !pgss_enabled(nesting_level, PGSS_INVALID))
return;
/*
@@ -894,7 +914,7 @@ pgss_planner(Query *parse,
* pgss_store needs it. We also ignore query without queryid, as it would
* be treated as a utility statement, which may not be the case.
*/
- if (pgss_enabled(nesting_level)
+ if (pgss_enabled(nesting_level, PGSS_PLAN)
&& pgss_track_planning && query_string
&& parse->queryId != UINT64CONST(0))
{
@@ -999,7 +1019,8 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
* counting of optimizable statements that are directly contained in
* utility statements.
*/
- if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != UINT64CONST(0))
+ if (pgss_enabled(nesting_level, PGSS_EXEC) &&
+ queryDesc->plannedstmt->queryId != UINT64CONST(0))
{
/*
* Set up to track total elapsed time in ExecutorRun. Make sure the
@@ -1068,7 +1089,7 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
uint64 queryId = queryDesc->plannedstmt->queryId;
if (queryId != UINT64CONST(0) && queryDesc->totaltime &&
- pgss_enabled(nesting_level))
+ pgss_enabled(nesting_level, PGSS_EXEC))
{
/*
* Make sure stats accumulation is done. (Note: it's okay if several
@@ -1111,7 +1132,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
uint64 saved_queryId = pstmt->queryId;
int saved_stmt_location = pstmt->stmt_location;
int saved_stmt_len = pstmt->stmt_len;
- bool enabled = pgss_track_utility && pgss_enabled(nesting_level);
+ bool enabled = pgss_track_utility && pgss_enabled(nesting_level, PGSS_EXEC);
/*
* Force utility statements to get queryId zero. We do this even in cases
@@ -3011,3 +3032,21 @@ comp_location(const void *a, const void *b)
return pg_cmp_s32(l, r);
}
+
+/*
+ * Determine whether the current query should be sampled.
+ *
+ * At the beginning of each top-level statement, decide whether we'll
+ * sample this statement. If nested-statement tracking is enabled,
+ * either all nested statements will be tracked or none will.
+ */
+static bool
+current_query_sampled(void)
+{
+ if (nesting_level == 0)
+ is_query_sampled = pgss_sample_rate != 0.0 &&
+ (pgss_sample_rate == 1.0 ||
+ pg_prng_double(&pg_global_prng_state) <= pgss_sample_rate);
+
+ return is_query_sampled;
+}
\ No newline at end of file
diff --git a/contrib/pg_stat_statements/sql/sampling.sql b/contrib/pg_stat_statements/sql/sampling.sql
new file mode 100644
index 0000000000..b09f45991b
--- /dev/null
+++ b/contrib/pg_stat_statements/sql/sampling.sql
@@ -0,0 +1,50 @@
+--
+-- sample statements
+--
+
+-- top-level tracking - simple query protocol
+SHOW pg_stat_statements.track;
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+
+-- top-level tracking - extended query protocol
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT 1 \parse stmt
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SET pg_stat_statements.sample_rate = 1.0;
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+DEALLOCATE stmt;
+
+-- nested tracking - simple query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SET pg_stat_statements.sample_rate = 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+SET pg_stat_statements.sample_rate = 0;
+EXPLAIN (COSTS OFF) SELECT 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+SELECT toplevel, calls, query FROM pg_stat_statements
+ ORDER BY query COLLATE "C";
+
+-- nested tracking - extended query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SET pg_stat_statements.sample_rate = 1;
+EXPLAIN (COSTS OFF) SELECT 1; \parse stmt
+\bind_named stmt \g
+\bind_named stmt \g
+SET pg_stat_statements.sample_rate = 0;
+\bind_named stmt \g
+\bind_named stmt \g
+SELECT toplevel, calls, query FROM pg_stat_statements
+ ORDER BY query COLLATE "C";
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index 501b468e9a..cd3fdaeffe 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -936,6 +936,25 @@
</listitem>
</varlistentry>
+ <varlistentry>
+ <term>
+ <varname>pg_stat_statements.sample_rate</varname> (<type>real</type>)
+ <indexterm>
+ <primary><varname>pg_stat_statements.sample_rate</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+
+ <listitem>
+ <para>
+ <varname>pg_stat_statements.sample_rate</varname> causes pg_stat_statements to only
+ track a fraction of the statements in each session. The default is <literal>1</literal>,
+ meaning track all the queries. Setting this to <literal>0</literal> disables sampled statements
+ tracking, the same as setting <varname>pg_stat_statements.track</varname> to <literal>none</literal>.
+ In case of nested statements, either all will be tracked or none. Only superusers can change this setting.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term>
<varname>pg_stat_statements.save</varname> (<type>boolean</type>)
--
2.34.1