Re: pg_stat_advisor extension
Hi hackers,

I'm reaching out again regarding the patch that adds the new extension 'pg_stat_advisor', aimed at enhancing query plan efficiency through suggestions to create statistics. I understand the community is busy, but I would greatly value any feedback or thoughts on this extension. Thank you for your time and consideration.

Best regards,
Ilia Evdokimov,
Tantor Labs LLC.
Re: pg_stat_advisor extension
Hi hackers, I've encountered and addressed errors in the "0001-pg_stat_advisor-extension.patch" when applying it to the main branch, specifically trailing whitespace issues at lines 117 and 118: ``` 0001-pg_stat_advisor-extension.patch:117: trailing whitespace. QUERY PLAN 0001-pg_stat_advisor-extension.patch:118: trailing whitespace. warning: 2 lines add whitespace errors. ``` An updated patch is attached for review I welcome your insights, feedback, and evaluations regarding the necessity of integrating this new extension into PostgreSQL. Kind regards, Ilia Evdokimov, Tantor Labs LLC. From 6316706c42996219e507bb6ded9dd1e872180e38 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Tue, 6 Feb 2024 18:11:04 +0300 Subject: [PATCH] pg_stat_advisor extension --- contrib/Makefile | 1 + contrib/meson.build | 1 + contrib/pg_stat_advisor/.gitignore| 3 + contrib/pg_stat_advisor/Makefile | 20 + contrib/pg_stat_advisor/README.md | 85 .../expected/pg_stat_advisor.out | 96 contrib/pg_stat_advisor/meson.build | 30 ++ contrib/pg_stat_advisor/pg_stat_advisor.c | 477 ++ .../pg_stat_advisor/sql/pg_stat_advisor.sql | 50 ++ 9 files changed, 763 insertions(+) create mode 100644 contrib/pg_stat_advisor/.gitignore create mode 100644 contrib/pg_stat_advisor/Makefile create mode 100644 contrib/pg_stat_advisor/README.md create mode 100644 contrib/pg_stat_advisor/expected/pg_stat_advisor.out create mode 100644 contrib/pg_stat_advisor/meson.build create mode 100644 contrib/pg_stat_advisor/pg_stat_advisor.c create mode 100644 contrib/pg_stat_advisor/sql/pg_stat_advisor.sql diff --git a/contrib/Makefile b/contrib/Makefile index da4e2316a3..da9a4ceeaa 100644 --- a/contrib/Makefile +++ b/contrib/Makefile @@ -34,6 +34,7 @@ SUBDIRS = \ pg_buffercache \ pg_freespacemap \ pg_prewarm \ + pg_stat_advisor \ pg_stat_statements \ pg_surgery \ pg_trgm \ diff --git a/contrib/meson.build b/contrib/meson.build index c12dc906ca..a20d99443b 100644 --- a/contrib/meson.build +++ b/contrib/meson.build @@ -49,6 +49,7 @@ subdir('pgcrypto') subdir('pg_freespacemap') subdir('pg_prewarm') subdir('pgrowlocks') +subdir('pg_stat_advisor') subdir('pg_stat_statements') subdir('pgstattuple') subdir('pg_surgery') diff --git a/contrib/pg_stat_advisor/.gitignore b/contrib/pg_stat_advisor/.gitignore new file mode 100644 index 00..913175ff6e --- /dev/null +++ b/contrib/pg_stat_advisor/.gitignore @@ -0,0 +1,3 @@ +/log/ +/results/ +/tmp_check/ diff --git a/contrib/pg_stat_advisor/Makefile b/contrib/pg_stat_advisor/Makefile new file mode 100644 index 00..f31b939e8a --- /dev/null +++ b/contrib/pg_stat_advisor/Makefile @@ -0,0 +1,20 @@ +# contrib/pg_stat_advisor/Makefile + +MODULE_big = pg_stat_advisor +OBJS = \ + $(WIN32RES) \ + pg_stat_advisor.o +PGFILEDESC = "pg_stat_advisor - analyze query performance and recommend the creation of additional statistics" + +REGRESS = pg_stat_advisor + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/pg_stat_advisor +top_builddir = ../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/pg_stat_advisor/README.md b/contrib/pg_stat_advisor/README.md new file mode 100644 index 00..f9610f2ed5 --- /dev/null +++ b/contrib/pg_stat_advisor/README.md @@ -0,0 +1,85 @@ +## pg_stat_advisor - PostgreSQL advisor to create statistics + +pg_stat_advisor is a PostgreSQL extension designed to analyze query performance and recommend the creation of additional statistics to improve query plan. 
+ +Append pg_stat_advisor to shared_preload_libraries configuration parameter in your postgresql.conf file then restart the PostgreSQL database to apply the changes. Or you can use "LOAD 'pg_stat_advisor';"command +``` +LOAD 'pg_stat_advisor'; +``` + +There is the pg_stat_advisor.suggest_statistics_threshold GUC that can be used to set a suggest_statistics_threshold. It is the the ratio of total tuples produced compared to the planned rows. If parameter is set by 0, the printing switches off. + +For example: +``` +SET pg_stat_advisor.suggest_statistics_threshold = 1.0; +``` + +Examples: + + +``` +postgres=# create table t (i int, j int); +CREATE TABLE +postgres=# insert into t select i/10, i/100 from generate_series(1, 100) i; +INSERT 0 100 +postgres=# analyze t; +ANALYZE +postgres=# explain analyze select * from t where i = 100 and j = 10; + QUERY PLAN + +-- +
Re: pg_stat_advisor extension
Dear Team,

Firstly, I would like to extend my sincere apologies for the confusion and technical oversights in our previous discussions regarding the 'pg_stat_advisor' extension. To address this and facilitate a clearer, more focused dialogue, I have initiated a new thread to consolidate our discussions on this matter. For context, our previous conversation can be found here: https://www.postgresql.org/message-id/flat/4681151706615977%40mail.yandex.ru.

The 'pg_stat_advisor' extension is designed to improve query plans. It operates by suggesting when to create extended statistics, particularly in queries where the current selectivity estimates fall short. This is controlled through the GUC parameter 'pg_stat_advisor.suggest_statistics_threshold', which assesses the ratio of total tuples to planned rows. This feature is instrumental in identifying scenarios where the planner's estimates could be improved.

You can install the extension by:
```
LOAD 'pg_stat_advisor';
SET pg_stat_advisor.suggest_statistics_threshold = 1.0;
```

Example:
```
EXPLAIN ANALYZE SELECT * FROM t WHERE i = 100 AND j = 10;
NOTICE: pg_stat_advisor suggestion: CREATE STATISTICS t_i_j ON i, j FROM t
                         QUERY PLAN
```

After the EXPLAIN ANALYZE command you can see a message suggesting the creation of statistics named 't_i_j' on columns 'i' and 'j' of table 't'.

Thank you for your understanding, patience, and continued support.

Best regards,
Ilia Evdokimov,
Tantor Labs LLC.
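P.S. For completeness, acting on the suggestion is then just a copy-and-paste away (the statistics name below is the one proposed by the extension; any name would work, and the name can even be omitted):
```
CREATE STATISTICS t_i_j ON i, j FROM t;
ANALYZE t;
EXPLAIN ANALYZE SELECT * FROM t WHERE i = 100 AND j = 10;
```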
Re: pg_stat_advisor extension
On Feb 8 2024 at 00:00:00 jian he >INT MAX >should be 1.0? I don’t know why Konstantin Knizhnik used the ratio of actual tuples to the planned ones, but most who start testing my extension expect that it will be a coefficient from 0 to 1, which will be the ratio of the estimated tuples to the actual ones. Therefore, I changed the value of this coefficient the other way around and now the value can be from 0 to 1. The patch with changes has been attached. > now CREATE STATISTICS, the statistics name is optional I constructed the name of the statistics so that the user could copy the line with 'CREATE STATISTICS' with the mouse and execute this command faster. But if the user wants ITS name, he can do it manually. > here you can explicitly mention the statistics kind would be great I agree with you. That would be my next step. That's why I'm doing it now. > Also since the documentation is limited, more comments explainingSuggestMultiColumnStatisticsForNode would be great. overall the comments are very little, it should be more (that's my opinion). Yes, certainly. I'll do it in the next patch. I'm looking forward to your thoughts and feedback. Regards, Ilia Evdokimov, Tantor Labs LLC. From f87f4a57e532d57f43dab4764d08ddf83d9f3d8f Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Thu, 8 Feb 2024 16:00:57 +0300 Subject: [PATCH] 'pg_stat_advisor' extension. This serves as a hook into the executor for determining total rows and planned rows. The process starts by checking the `pg_stat_advisor.suggest_statistics_threshold` GUC parameter. If it's set to 0.0, the extension does not proceed further. When the parameter is greater than 0.0, the extension evaluates the accuracy of planned rows. A suggestion for creating statistics is made if the ratio of total rows to planned rows is greater than or equal to this threshold. Only then does it extract the relation and columns from the query. The extension checks pg_statistic_ext for existing relevant statistics. If no statistics are found, it prints a notice suggesting the creation of statistics, using the naming format 'relationName_columns'. 
Author: Ilia Evdokimov --- contrib/Makefile | 1 + contrib/meson.build | 1 + contrib/pg_stat_advisor/.gitignore| 3 + contrib/pg_stat_advisor/Makefile | 20 + contrib/pg_stat_advisor/README.md | 85 +++ .../expected/pg_stat_advisor.out | 96 contrib/pg_stat_advisor/meson.build | 30 ++ contrib/pg_stat_advisor/pg_stat_advisor.c | 482 ++ .../pg_stat_advisor/sql/pg_stat_advisor.sql | 50 ++ 9 files changed, 768 insertions(+) create mode 100644 contrib/pg_stat_advisor/.gitignore create mode 100644 contrib/pg_stat_advisor/Makefile create mode 100644 contrib/pg_stat_advisor/README.md create mode 100644 contrib/pg_stat_advisor/expected/pg_stat_advisor.out create mode 100644 contrib/pg_stat_advisor/meson.build create mode 100644 contrib/pg_stat_advisor/pg_stat_advisor.c create mode 100644 contrib/pg_stat_advisor/sql/pg_stat_advisor.sql diff --git a/contrib/Makefile b/contrib/Makefile index da4e2316a3..da9a4ceeaa 100644 --- a/contrib/Makefile +++ b/contrib/Makefile @@ -34,6 +34,7 @@ SUBDIRS = \ pg_buffercache \ pg_freespacemap \ pg_prewarm \ + pg_stat_advisor \ pg_stat_statements \ pg_surgery \ pg_trgm \ diff --git a/contrib/meson.build b/contrib/meson.build index c12dc906ca..a20d99443b 100644 --- a/contrib/meson.build +++ b/contrib/meson.build @@ -49,6 +49,7 @@ subdir('pgcrypto') subdir('pg_freespacemap') subdir('pg_prewarm') subdir('pgrowlocks') +subdir('pg_stat_advisor') subdir('pg_stat_statements') subdir('pgstattuple') subdir('pg_surgery') diff --git a/contrib/pg_stat_advisor/.gitignore b/contrib/pg_stat_advisor/.gitignore new file mode 100644 index 00..913175ff6e --- /dev/null +++ b/contrib/pg_stat_advisor/.gitignore @@ -0,0 +1,3 @@ +/log/ +/results/ +/tmp_check/ diff --git a/contrib/pg_stat_advisor/Makefile b/contrib/pg_stat_advisor/Makefile new file mode 100644 index 00..f31b939e8a --- /dev/null +++ b/contrib/pg_stat_advisor/Makefile @@ -0,0 +1,20 @@ +# contrib/pg_stat_advisor/Makefile + +MODULE_big = pg_stat_advisor +OBJS = \ + $(WIN32RES) \ + pg_stat_advisor.o +PGFILEDESC = "pg_stat_advisor - analyze query performance and recommend the creation of additional statistics" + +REGRESS = pg_stat_advisor + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/pg_stat_advisor +top_builddir = ../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/pg_stat_advisor/README.md b/contrib/pg_stat_advisor/README
Re: pg_stat_advisor extension
Our further discussion of this new extension takes place in this thread: https://www.postgresql.org/message-id/flat/f822b674-9697-43b9-931b-4d69729a26ff%40tantorlabs.com. Due to technical difficulties in the current thread, I would only be able to continue the conversation in HTML format, which would make the messages inconvenient for everyone to read. I apologize for the inconvenience caused.

Regards,
Ilia Evdokimov,
Tantor Labs LLC.
Re: pg_stat_advisor extension
On Feb 08 2024 at 07:14:18, Andrei Lepikhov wrote:

> 1. In the case of parallel workers the plan_rows value has a different semantics than the number of rows predicted. Just explore get_parallel_divisor().

Yes, this is a very weighty and important issue. I need to think about it very carefully.

> 2. The extension recommends new statistics immediately upon finding an error. But what if the reason for the error is stale statistics? Or this error may be raised for only one specific set of constants, and estimation will be done well in another 99% of cases for the same expression. According to No.2, it might make sense to collect and track clause combinations and cardinality errors found and let the DBA make decisions on their own.

Your proposal is very interesting. In my opinion, it is worth considering updating the extended statistics if they are truly stale, and reporting that in a separate message suggesting a statistics update. If I succeed, then in the next patch I will add the kind of extended statistics to the message, deal with the parallel workers, and update statistics if necessary. If you have additional suggestions and thoughts, feel free to write them in this thread.

Regards,
Ilia Evdokimov,
Tantor Labs LLC.
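P.S. A quick way to observe the per-worker semantics, as I understand them (a sketch; it assumes a table t large enough for the planner to pick a parallel plan): in the resulting output, the row estimate of the Parallel Seq Scan node has already been divided by the parallel divisor, so comparing it directly with the total number of rows produced at the Gather node would overstate the estimation error.
```
SET max_parallel_workers_per_gather = 2;
EXPLAIN (ANALYZE) SELECT * FROM t WHERE i = 100 AND j = 10;
```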
list_free in addRangeTableEntryForJoin
Hi Hackers I have identified a potential memory leak in the `addRangeTableEntryForJoin()` function. The second parameter of `addRangeTableEntryForJoin()`, `colnames`, is a `List*` that is concatenated with another `List*`, `eref->colnames`, using `list_concat()`. We need to pass only the last `numaliases` elements of the list, for which we use `list_copy_tail`. This function creates a copy of the `colnames` list, resulting in `colnames` pointing to the current list that will not be freed. Consequently, a new list is already concatenated. To address this issue, I have invoked `list_free(colnames)` afterwards. If anyone is aware of where the list is being freed or has any suggestions for improvement, I would greatly appreciate your input. Best Regards, Ilia Evdokimov, TantorLabs LCC From cd7aa7ac5904501085a980944dc3c18f42721c06 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Mon, 10 Jun 2024 13:10:31 +0300 Subject: [PATCH] After concatenation two lists where the second one is from list_copy_tail do not free it --- src/backend/parser/parse_relation.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c index 2f64eaf0e3..8230cf27cc 100644 --- a/src/backend/parser/parse_relation.c +++ b/src/backend/parser/parse_relation.c @@ -2258,8 +2258,11 @@ addRangeTableEntryForJoin(ParseState *pstate, /* fill in any unspecified alias columns */ if (numaliases < list_length(colnames)) + { eref->colnames = list_concat(eref->colnames, list_copy_tail(colnames, numaliases)); + list_free(colnames); + } if (numaliases > list_length(colnames)) ereport(ERROR, -- 2.34.1
Re: list_free in addRangeTableEntryForJoin
>But callers of addRangeTableEntryForJoin(), expects to handle a list or NIL, if we free the memory I've thoroughly reviewed all callers of the `addRangeTableEntryForJoin()` function and confirmed that the list is not used after this function is called. Since `addRangeTableEntryForJoin()` is the last function to use this list, it would be more efficient to free the `List` at the point of its declaration. I'll attach new patch where I free the lists. Regards, Ilia Evdokimov, Tantor Labs LCC From 853c5bc854bcc03e791cf32cc8cad7b257ec558f Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Mon, 10 Jun 2024 15:09:12 +0300 Subject: [PATCH] After concatenation two lists where the second one is from list_copy_tail do not free it --- src/backend/parser/analyze.c | 2 ++ src/backend/parser/parse_clause.c | 2 ++ 2 files changed, 4 insertions(+) diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c index 28fed9d87f..c0708a6e3e 100644 --- a/src/backend/parser/analyze.c +++ b/src/backend/parser/analyze.c @@ -1943,6 +1943,8 @@ transformSetOperationStmt(ParseState *pstate, SelectStmt *stmt) if (pstate->p_hasAggs || qry->groupClause || qry->groupingSets || qry->havingQual) parseCheckAggregates(pstate, qry); + list_free(targetnames); + return qry; } diff --git a/src/backend/parser/parse_clause.c b/src/backend/parser/parse_clause.c index 8118036495..25f8c84a9f 100644 --- a/src/backend/parser/parse_clause.c +++ b/src/backend/parser/parse_clause.c @@ -1620,6 +1620,8 @@ transformFromClauseItem(ParseState *pstate, Node *n, *top_nsitem = nsitem; *namespace = lappend(my_namespace, nsitem); + list_free(res_colnames); + return (Node *) j; } else -- 2.34.1
Re: list_free in addRangeTableEntryForJoin
>Now you need to analyze whether the memory in question is not managed by a Context I've already analyzed. Let's explain details: 1. analyze.c 1718: List* targetnames; 1815: targetnames = NIL; 1848: targetnames = lappend(targetnames, makeString(colName)); 1871: addRangeTableEntryForJoin(...); => list_free(targetnames); 2. parse_clause.c 1163: List* res_colnames; 1289: res_colnames = NIL; 1324: foreach(col, res_colnames); 1396: res_colnames = lappend(res_colnames, lfirst(ucol)); 1520, 1525: extractRemainingColumns(...); | 290: *res_colnames = lappend(*res_colnames, lfirst(lc)); 1543: addRangeTableEntryForJoin(...); => list_free(res_colnames); As you can see, there are no other pointers working with this block of memory, and all operations above are either read-only or append nodes to the lists. If I am mistaken, please correct me. Furthermore, I will call `list_free` immediately after `addRangeTableEntryForJoin()`. The new patch is attached. Thanks for reviewing. I'm looking forward to new suggestions. Regards, Ilia Evdokimov, Tantor Labs LCC From b545770d57ac3ca137e9ad97c004576e77213648 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Mon, 10 Jun 2024 16:39:14 +0300 Subject: [PATCH] After concatenation two lists where the second one is from list_copy_tail do not free it --- src/backend/parser/analyze.c | 2 ++ src/backend/parser/parse_clause.c | 2 ++ 2 files changed, 4 insertions(+) diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c index 28fed9d87f..50c47a64ed 100644 --- a/src/backend/parser/analyze.c +++ b/src/backend/parser/analyze.c @@ -1880,6 +1880,8 @@ transformSetOperationStmt(ParseState *pstate, SelectStmt *stmt) NULL, false); + list_free(targetnames); + sv_namespace = pstate->p_namespace; pstate->p_namespace = NIL; diff --git a/src/backend/parser/parse_clause.c b/src/backend/parser/parse_clause.c index 8118036495..2fa6c03be7 100644 --- a/src/backend/parser/parse_clause.c +++ b/src/backend/parser/parse_clause.c @@ -1552,6 +1552,8 @@ transformFromClauseItem(ParseState *pstate, Node *n, j->alias, true); + list_free(res_colnames); + /* Verify that we correctly predicted the join's RT index */ Assert(j->rtindex == nsitem->p_rtindex); /* Cross-check number of columns, too */ -- 2.34.1
Re: pg_stat_advisor extension
1. In the case of parallel workers the plan_rows value has a different semantics than the number of rows predicted. Just explore get_parallel_divisor(). 2. The extension recommends new statistics immediately upon an error finding. But what if the reason for the error is stale statistics? Or this error may be raised for only one specific set of constants, and estimation will be done well in another 99.% of cases for the same expression. The new parameter, `pg_stat_advisor.analyze_scale_factor`, can suggest the execution of the ANALYZE command on specific tables. The extension now evaluates the ratio of `n_live_tup` (number of live tuples) to `n_mod_since_analyze` (number of modifications since last analyze) in the `pg_stat_all_tables` catalog. If this ratio exceeds the value specified in `analyze_scale_factor`, the extension will suggest an update to the table's statistics. There are a lot of parameters that influences on estimated rows. Statistics might not help improve estimated rows. This feature is designed to provide users with data-driven insights to decide whether updating statistics via the ANALYZE command could potentially improve query performance. By suggesting rather than automatically executing statistics updates, we empower you to make informed decisions based on the specific needs and conditions of your database environment. I've developed an extension that provides suggestions on whether to update or create statistics for your PostgreSQL database, without executing any changes. This approach allows you to consider various parameters that influence row estimates and make informed decisions about optimizing your database's performance. Your feedback is invaluable, and we look forward to hearing about your experiences and any improvements you might suggest. Best regards, Ilia Evdokimov Tantor Labs LLC. From eb998bea96a3640d240afa63e08cc8cf98925bf7 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Mon, 13 May 2024 14:41:59 +0300 Subject: [PATCH] 'pg_stat_advisor' extension This service as a hook into executor. It has two GUC-parameters. pg_stat_advisor.analyze_scale_factor: if ratio of pg_stat_all_tables.n_live_tup to pg_stat_all_tables.n_mod_since_analyze is greater than pg_stat_advisor.analyze_scale_factor extension prints suggestion executing ANALYZE command. 
pg_stat_advisor.suggest_statistics_threshold: the ratio of total rows to planned rows is greater than or equal to this threshold the extension prints suggestion executing the creation of statistics, using the naming format 'relationName_columns' --- contrib/Makefile | 1 + contrib/pg_stat_advisor/Makefile | 20 + .../expected/pg_stat_advisor.out | 52 ++ contrib/pg_stat_advisor/meson.build | 30 + contrib/pg_stat_advisor/pg_stat_advisor.c | 560 ++ .../pg_stat_advisor/sql/pg_stat_advisor.sql | 24 + 6 files changed, 687 insertions(+) create mode 100644 contrib/pg_stat_advisor/Makefile create mode 100644 contrib/pg_stat_advisor/expected/pg_stat_advisor.out create mode 100644 contrib/pg_stat_advisor/meson.build create mode 100644 contrib/pg_stat_advisor/pg_stat_advisor.c create mode 100644 contrib/pg_stat_advisor/sql/pg_stat_advisor.sql diff --git a/contrib/Makefile b/contrib/Makefile index abd780f277..d6ce2fe562 100644 --- a/contrib/Makefile +++ b/contrib/Makefile @@ -33,6 +33,7 @@ SUBDIRS = \ pg_buffercache \ pg_freespacemap \ pg_prewarm \ + pg_stat_advisor \ pg_stat_statements \ pg_surgery \ pg_trgm \ diff --git a/contrib/pg_stat_advisor/Makefile b/contrib/pg_stat_advisor/Makefile new file mode 100644 index 00..f31b939e8a --- /dev/null +++ b/contrib/pg_stat_advisor/Makefile @@ -0,0 +1,20 @@ +# contrib/pg_stat_advisor/Makefile + +MODULE_big = pg_stat_advisor +OBJS = \ + $(WIN32RES) \ + pg_stat_advisor.o +PGFILEDESC = "pg_stat_advisor - analyze query performance and recommend the creation of additional statistics" + +REGRESS = pg_stat_advisor + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/pg_stat_advisor +top_builddir = ../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/pg_stat_advisor/expected/pg_stat_advisor.out b/contrib/pg_stat_advisor/expected/pg_stat_advisor.out new file mode 100644 index 00..8f3dab2c2f --- /dev/null +++ b/contrib/pg_stat_advisor/expected/pg_stat_advisor.out @@ -0,0 +1,52 @@ +LOAD 'pg_stat_advisor'; +SET pg_stat_advisor.analyze_scale_factor = 0.4; +SET pg_stat_advisor.suggest_statistics_threshold = 0.11; +CREATE TABLE my_tbl(fld_1 INTEGER, fld_2 BIGINT) WITH (autovacuum_enabled = false); +INSERT INTO my_tbl (fld_1, fld_2) +SELECT + i/100 as fld_1, + i/500 as fld_2 +FROM generate_series(1, 1000) s(i); +ANALYZE my_tbl; +INSERT
Adding NOTICE for differences between estimated and actual rows
Hi hackers, It's quite common that poor query performance can be attributed to inaccurate row estimations by the planner. To make it easier to detect these discrepancies, rather than scrutinizing the estimates manually, it would be helpful to output a dedicated |NOTICE| message. In the current patch, I've introduced a new GUC parameter called 'estimated_rows_scale_factor'. If the ratio of the estimated rows to the actual rows is less than this factor, or if the estimated rows significantly exceed the actual rows (when the ratio is greater than this factor), a NOTICE message will be printed. The message reads: "Estimated rows (%.0f) less(greater) than actual rows (%.0f)." Here is an example: CREATE TABLE t(a int, b int); INSERT INTO t SELECT x/10, x FROM generate_series(1,1000) g(x); ANALYZE; SET estimated_rows_scale_factor = 0.9; EXPLAIN ANALYZE SELECT * FROM t WHERE a > 10 AND b <= 200; NOTICE: Estimated rows (1000) greater than actual rows (91). QUERY PLAN - Gather (cost=1000.00..107848.00 rows=1000 width=8) (actual time=0.446..122.476 rows=91 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Seq Scan on t (cost=0.00..106748.00 rows=417 width=8) (actual time=77.657..118.000 rows=30 loops=3) Filter: ((a > 10) AND (b <= 200)) Rows Removed by Filter: 303 Planning Time: 0.097 ms Execution Time: 122.502 ms (8 rows) EXPLAIN ANALYZE SELECT * FROM t WHERE a = 10 AND b <= 200; NOTICE: Estimated rows (1) less than actual rows (10). QUERY PLAN -- Gather (cost=1000.00..107748.10 rows=1 width=8) (actual time=0.280..104.752 rows=10 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Seq Scan on t (cost=0.00..106748.00 rows=1 width=8) (actual time=66.493..101.102 rows=3 loops=3) Filter: ((b <= 200) AND (a = 10)) Rows Removed by Filter: 330 Planning Time: 0.129 ms Execution Time: 104.768 ms (8 rows) If you have any suggestions regarding the wording of the message, its placement, or if you'd like to see a different criterion used, I would greatly appreciate your feedback. Looking forward to your thoughts and suggestions. -- Regards, Ilia Evdokimov, Tantor Labs LCC. From 8e038f16d45d5f8c39c7204fc661ed4b7f504a63 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Thu, 5 Sep 2024 18:15:01 +0300 Subject: [PATCH v1] Add message when estimated rows differ from actual rows If the ratio of estimated rows to actual rows is less than the 'estimated_rows_scale_factor', or if the estimated rows exceed the actual rows, a NOTICE message will be printed. This helps to identify discrepancies between the planner's estimations and the actual query execution, enabling better analysis and optimization. --- src/backend/commands/explain.c | 31 + src/backend/utils/misc/guc_tables.c | 11 ++ src/include/utils/guc.h | 1 + 3 files changed, 43 insertions(+) diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index 11df4a04d4..da435cfee1 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -419,6 +419,35 @@ ExplainResultDesc(ExplainStmt *stmt) return tupdesc; } +/* + * Notice whether estimated rows is bad. 
+ */ +static void +ExplainPrintEstimatedRowsAttention(ExplainState *es, QueryDesc *queryDesc) +{ + double rows; + double nloops; + PlanState *planstate = queryDesc->planstate; + + if (es->analyze && + planstate->instrument && planstate->instrument->nloops > 0) + { + nloops = planstate->instrument->nloops; + rows = planstate->instrument->ntuples / nloops; + + if(queryDesc->planstate->plan->plan_rows / rows < estimated_rows_scale_factor) + ereport(NOTICE, + (errmsg("Estimated rows (%.0f) less than actual rows (%.0f).", +queryDesc->planstate->plan->plan_rows, +rows), errhidestmt(true))); + else if (rows / queryDesc->planstate->plan->plan_rows < estimated_rows_scale_factor) + ereport(NOTICE, + (errmsg("Estimated rows (%.0f) greater than actual rows (%.0f).", +queryDesc->planstate->plan->plan_rows, +rows), errhidestmt(true))); + } +} + /* * ExplainOneQuery - * print out the execution plan for one Query @@ -721,6 +750,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es, /* Create textual dump of plan tree */ ExplainPrintPlan(es, queryDesc); + ExplainPrintEstimatedRowsAttention(es, queryDesc); +
Re: Adding NOTICE for differences between estimated and actual rows
On 05.09.2024 23:32, Euler Taveira wrote: On Thu, Sep 5, 2024, at 1:05 PM, Ilia Evdokimov wrote: It's quite common that poor query performance can be attributed to inaccurate row estimations by the planner. To make it easier to detect these discrepancies, rather than scrutinizing the estimates manually, it would be helpful to output a dedicated |NOTICE| message. I don't know if NOTICE is a good UI for an inaccurate estimation. The main issue with your proposal is that it does not indicate where it is. It is easier to inspect small query plans but what if you have a plan with hundreds of lines? IMO the client should provide this feature. The shell provides a way to change the color and/or style from the output. I have a perl script that reads an EXPLAIN output and mark with different colors (red, yellow) if the estimations are off. psql could do the same. In your case if the output was changed to something like: \033[0;1;31mGather (cost=1000.00..107848.00 rows=1000 width=8) (actual time=0.446..122.476 rows=91 loops=1)\033[0m Workers Planned: 2 Workers Launched: 2 -> \033[0;1;31mParallel Seq Scan on t (cost=0.00..106748.00 rows=417 width=8) (actual time=77.657..118.000 rows=30 loops=3)\033[0m Filter: ((a > 10) AND (b <= 200)) Rows Removed by Filter: 303 Planning Time: 0.097 ms Execution Time: 122.502 ms (8 rows) Note "\033[0;1;31m" and "\033[0m" that means foreground bold red and default, respectively. Another alternative if you don't want to modify psql is to use the pager. Create a script that contains your logic to apply color and/or style to the desired (sub)string(s). The following example that I extracted from [1] can apply colors to psql output. $ cat /tmp/pcc.pl #!/usr/bin/perl -n print "\033[1m\033[35m$1\033[36m$2\033[32m$3\033[33m$4\033[m" while /([|+-]+)|([0-9]+)|([a-zA-Z_]+)|([^\w])/g; and then you can start psql as: $ PAGER="/c/mypager.pl" psql [1] https://stackoverflow.com/questions/5947742/how-to-change-the-output-color-of-echo-in-linux/28938235#28938235 -- Euler Taveira EDB https://www.enterprisedb.com/ Yes, you are right. It probably doesn't make sense to implement such a notification on the server side. It makes more sense to handle this on the client side, where there are many different tools, including your suggestion, to highlight inaccurate estimates. Thank you very much for the review! -- Regards, Ilia Evdokimov, Tantor Labs LCC.
Re: Add support for (Var op Var) clause in extended MCV statistics
Hi everyone, I've taken a closer look at the patch and realized that we don't need the new function 'mcv_clause_selectivity_var_op_var()' we can use 'mcv_clause_selectivity()' instead. I'm attaching the updated patch and test generator. -- Regards, Ilia Evdokimov, Tantor Labs LCC. #!/usr/bin/python3 import psycopg2 import random import select import hashlib def generate_conditions(nclauses, attributes = ['a', 'b', 'c', 'd'], operators = ['<', '<=', '=', '!=', '>=', '>']): if nclauses == 1: cols = [random.choice(attributes), random.choice(attributes)] oper = ' ' + random.choice(operators) + ' ' clause = oper.join(cols) if random.randint(0,100) < 50: clause = 'NOT ' + clause return clause nparts = random.randint(2, nclauses) # distribute the clauses between query parts nclauses_parts = [1 for p in range(0, nparts)] for x in range(0, nclauses - nparts): nclauses_parts[random.randint(0, nparts) - 1] += 1 parts = [] for p in range(0, nparts): parts.append('(' + generate_conditions(nclauses_parts[p], attributes, operators) + ')') c = random.choice([' AND ', ' OR ']) return c.join(parts) def generate_data(nrows, attributes = ['a', 'b', 'c', 'd']): sql = 'insert into t (' + ','.join(attributes) + ') select ' attrs = [] for attr in attributes: x = random.choice([-1, 2, 5, 10, 20, 30]) if x == -1: x = random.randint(5, 20) expr = '(random() * ' + str(x) + ')::int' else: expr = 'mod(i,' + str(x) + ')' if random.randint(0,100) < 50: x = random.choice([2, 5, 10, 20, 30]) attrs.append('case when mod(i,' + str(x) + ') = 0 then null else ' + expr + ' end') else: attrs.append(expr) sql += ', '.join(attrs) + ' from generate_series(1,' + str(nrows) + ') s(i)' return sql def wait(conn): while True: state = conn.poll() if state == psycopg2.extensions.POLL_OK: break elif state == psycopg2.extensions.POLL_WRITE: select.select([], [conn.fileno()], []) elif state == psycopg2.extensions.POLL_READ: select.select([conn.fileno()], [], []) else: raise psycopg2.OperationalError("poll() returned %s" % state) def run_everywhere(conns, queries): curs = [conn.cursor() for conn in conns] for q in queries: [cur.execute(q) for cur in curs] [wait(conn) for conn in conns] [cur.close() for cur in curs] if __name__ == '__main__': conns = [ psycopg2.connect('host=localhost port=5001 user=postgres dbname=postgres', async_=True), psycopg2.connect('host=localhost port=5002 user=postgres dbname=postgres', async_=True), psycopg2.connect('host=localhost port=5003 user=postgres dbname=postgres', async_=True), psycopg2.connect('host=localhost port=5004 user=postgres dbname=postgres', async_=True)] [wait(conn) for conn in conns] curs = [conn.cursor() for conn in conns] # 100 data sets for cnt in [3, 10, 100]: for d in range(1,100): # generate the data on all versions data_sql = generate_data(cnt) data_md5 = hashlib.md5(data_sql.encode('utf-8')).hexdigest() with open('data.csv', 'a') as f: f.write('%s\t%s\n' % (data_md5, data_sql)) run_everywhere(conns, ['drop table if exists t', 'create table t (a int, b int, c int, d int)', data_sql, 'commit', 'analyze']) # generate the clauses conditions = [] for c in range(1, 6): for q in range(1,100): conditions.append({'clauses' : c, 'conditions' : generate_conditions(c)}) with open('results.csv', 'a') as f: for conds in conditions: sql = "select * from check_estimated_rows('select * from t where " + conds['conditions'] + "')" [cur.execute(sql) for cur in curs] [wait(conn) for conn in conns] r = [cur.fetchone() for cur in curs] actual_rows = r[0][1] estimated_rows = [str(x[0]) for x in r] 
f.write(('%s\t%s\t%s\t%s\t%s\t%s\t%s\n') % (data_md5, cnt, conds['clauses'], conds['conditions'], 'no', actual_rows, '\t'.join(estimated_rows))) run_everywhere(conns, ['create statistics s (mcv) on a, b, c, d from t', 'commit', 'analyze']) with open('results.csv', 'a') as f: for conds in conditions: sql = "select * from check_estimated_rows('select * from t where " + conds['conditions'] + "')" [cur.execute(sql) for cur in curs]
Re: refactor the CopyOneRowTo
Hi, hackers,

I'm sure this refactoring is useful, because it makes it more readable whether a datum value is handled as binary or not. However, I can see one small improvement: we can declare the variable 'bytea *outputbytes' inside the 'else' branch, because it is not used anywhere else.

Regards,
Ilia Evdokimov,
Tantor Labs LLC.
Re: Vacuum statistics
Hello, everyone!

Thank you for your interesting patch with extended statistics about autovacuum. Have you considered not creating a new table in pg_catalog, but instead saving the statistics in an existing one? I mean pg_class, or pg_stat_progress_analyze and pg_stat_progress_vacuum.

P.S. If I sent this mail twice, I'm sorry :)

Regards,
Ilia Evdokimov,
Tantor Labs.
Add support for (Var op Var) clause in extended MCV statistics
Hi hackers, I'd like to submit a patch that improves the estimated rows for queries containing (Var op Var) clauses by applying extended MCV statistics. *New functions:* * mcv_clauselist_selectivity_var_op_var() - calculates the selectivity for (Var op Var) clauses. * is_opclause_var_op_var() - Checks whether a clause is of the (Var op Var) form. *Implementation Details:* * A new 'if' statement was added to the 'clause_selectivity_ext()' function to handle (Var op Var) clauses. This allows the process to locate matching MCV extended statistics and calculate selectivity using the newly introduced function. * Additionally, I added 'if' statement in statext_is_compatible_clause_internal() function to determine which columns are included in the clause, find matching extended statistics, and then calculate selectivity through the new function. I did the same in mcv_get_match_bitmap() to check what values are true for (Var op Var). * To support this, I created a new enum type to differentiate between OR/AND and (Var op Var) clauses. *Examples:* create table t (a int, b int); insert into t select mod(i,10), mod(i,10)+1 from generate_series(1,10) s(i); analyze t; explain select * from t where a < b; ` Estimated: 3 Actual: 10 explain select * from t where a > b; ` Estimated: 3 Actual: 10 create statistics s (mcv) on a,b from t; analyze t; explain select * from t where a < b; ` Estimated without patch: 3 Estimated with patch: 10 Actual: 10 explain select * from t where a > b; ` Estimated without patch: 3 Estimated with patch: 10 Actual: 10 If you want to see more examples, see regress tests in the patch. *Previous thread:* This feature was originally developed two years ago in [1], and at that time, the approach was almost the same. My implementation uses dedicated functions and 'if' statements directly for better readability and maintainability. Additionally, there was a bug in the previous approach that has been resolved with my patch. Here’s an example of the bug and its fix: CREATE TABLE foo (a int, b int); INSERT INTO foo SELECT x/10+1, x FROM generate_series(1,1) g(x); ANALYZE foo; EXPLAIN ANALYZE SELECT * FROM foo WHERE a = 1 OR (b > 0 AND b < 10); ` Estimated: 18 Actual: 9 CREATE STATISTICS foo_s (mcv) ON a,b FROM foo; ANALYZE foo; EXPLAIN ANALYZE SELECT * FROM foo WHERE a = 1 OR (b > 0 AND b < 10); ` Estimated previous patch: 18 Estimated current patch: 9 Actual: 9 [1]: https://www.postgresql.org/message-id/flat/9e0a12e0-c05f-b193-ed3d-fe88bc1e5fe1%40enterprisedb.com I look forward to any feedback or suggestions from the community. Best regars, Ilia Evdokimov Tantor Labs LLC. From fd468972f0ce27291523a28fdf0d9966c2fdf6e1 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Sat, 10 Aug 2024 14:35:25 +0300 Subject: [PATCH] Add support for (Var op Var) clause in extended MCV statistics Added a new leaf to the existing clauses tree, allowing the calculation of selectivity specifically for (Var op Var) clauses. The new function for this selectivity calculation has been integratedinto the extended statistics mechanism, ensuring accurate application during query planning. 
--- src/backend/optimizer/path/clausesel.c| 25 +- src/backend/statistics/README | 6 +- src/backend/statistics/README.mcv | 6 +- src/backend/statistics/extended_stats.c | 108 +++- src/backend/statistics/mcv.c | 184 -- .../statistics/extended_stats_internal.h | 6 + src/include/statistics/statistics.h | 11 +- src/test/regress/expected/stats_ext.out | 553 ++ src/test/regress/sql/stats_ext.sql| 166 ++ 9 files changed, 982 insertions(+), 83 deletions(-) diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c index 0ab021c1e8..eec6c6182c 100644 --- a/src/backend/optimizer/path/clausesel.c +++ b/src/backend/optimizer/path/clausesel.c @@ -152,7 +152,7 @@ clauselist_selectivity_ext(PlannerInfo *root, */ s1 = statext_clauselist_selectivity(root, clauses, varRelid, jointype, sjinfo, rel, - &estimatedclauses, false); + &estimatedclauses, AND_TYPE); } /* @@ -384,7 +384,7 @@ clauselist_selectivity_or(PlannerInfo *root, */ s1 = statext_clauselist_selectivity(root, clauses, varRelid, jointype, sjinfo, rel, - &estimatedclauses, true); + &estimatedclauses, OR_TYPE); } /* @@ -691,6 +691,7 @@ clause_selectivity_ext(PlannerInfo *root, Selectivity s1 = 0.5; /* default for any unhandled clause type */ RestrictI
Re: Add support for (Var op Var) clause in extended MCV statistics
Another issue mentioned in [1] involves cases where the clause is of the form (A op A). In my view, this isn't related to the current patch, as it can be addressed by rewriting the clause, similar to transforming A = A into A IS NOT NULL. This adjustment would result in more accurate estimation.

[1]: https://www.postgresql.org/message-id/7C0F91B5-8A43-428B-8D31-556458720305%40enterprisedb.com

Ilia Evdokimov,
Tantor Labs LLC.
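P.S. A small illustration of the existing rewrite, as far as I understand the current behaviour (so treat the comments as expectations, not guarantees): a mergejoinable self-equality is already turned into a NULL test, while other self-comparisons are kept as-is.
```
EXPLAIN SELECT * FROM t WHERE a = a;   -- expected: Filter: (a IS NOT NULL)
EXPLAIN SELECT * FROM t WHERE a < a;   -- expected: Filter: (a < a)
```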
Re: Add support for (Var op Var) clause in extended MCV statistics
On 12.8.24 14:53, Tomas Vondra wrote: I agree, and I'm grateful someone picked up the original patch. I'll try to help to keep it moving forward. If the thread gets stuck, feel free to ping me to take a look. Good. Thank you! I started reviewing it and want to suggest some changes to better code: I think we should consider the case where the expression is not neither an OpExpr and VarOpVar expression. Do you have some specific type of clauses in mind? Most of the extended statistics only really handles this type of clauses, so I'm not sure it's feasible to extend that - at least not in this patch. I agree with Alena that we need to consider the following clauses: (Expr op Var), (Var op Expr) and (Expr op Expr). And we need to return false in these cases because we did it before my patch in /* Check if the expression has the right shape */ if (!examine_opclause_args(expr->args, &clause_expr, NULL, NULL)) return false; In is_opclause_var_op_var() function it is really useless local Node *expr_left, *expr_right variables. However, we can't assign them NULL at the begin because if I passed not-null pointers I have to return the values. Otherwise remain them NULL. Nevertheless, thank you for review, Alena. Have you tested this code with any benchmarks? FWIW I think we need to test two things - that it (a) improves the estimates and (b) does not have significant overhead. Yes, but only TPC-B. And the performance did not drop. In general, it'd be better to do more tests and those listed by Tomas with new attached patch.From fa67b0fa34408c0f1b0c9f079b84e7c71f3b5599 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Sat, 10 Aug 2024 14:35:25 +0300 Subject: [PATCH] Add support for (Var op Var) clause in extended MCV statistics Added a new leaf to the existing clauses tree, allowing the calculation of selectivity specifically for (Var op Var) clauses. The new function for this selectivity calculation has been integratedinto the extended statistics mechanism, ensuring accurate application during query planning. --- src/backend/optimizer/path/clausesel.c| 25 +- src/backend/statistics/README | 6 +- src/backend/statistics/README.mcv | 6 +- src/backend/statistics/extended_stats.c | 108 +++- src/backend/statistics/mcv.c | 186 -- .../statistics/extended_stats_internal.h | 6 + src/include/statistics/statistics.h | 11 +- src/test/regress/expected/stats_ext.out | 553 ++ src/test/regress/sql/stats_ext.sql| 166 ++ 9 files changed, 983 insertions(+), 84 deletions(-) diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c index 0ab021c1e8..cb39bd 100644 --- a/src/backend/optimizer/path/clausesel.c +++ b/src/backend/optimizer/path/clausesel.c @@ -152,7 +152,7 @@ clauselist_selectivity_ext(PlannerInfo *root, */ s1 = statext_clauselist_selectivity(root, clauses, varRelid, jointype, sjinfo, rel, - &estimatedclauses, false); + &estimatedclauses, AND_CLAUSE); } /* @@ -384,7 +384,7 @@ clauselist_selectivity_or(PlannerInfo *root, */ s1 = statext_clauselist_selectivity(root, clauses, varRelid, jointype, sjinfo, rel, - &estimatedclauses, true); + &estimatedclauses, OR_CLAUSE); } /* @@ -691,6 +691,7 @@ clause_selectivity_ext(PlannerInfo *root, Selectivity s1 = 0.5; /* default for any unhandled clause type */ RestrictInfo *rinfo = NULL; bool cacheable = false; + Node *src = clause; if (clause == NULL) /* can this still happen? 
*/ return s1; @@ -832,6 +833,7 @@ clause_selectivity_ext(PlannerInfo *root, { OpExpr *opclause = (OpExpr *) clause; Oid opno = opclause->opno; + List *clauses = list_make1(src); if (treat_as_join_clause(root, clause, rinfo, varRelid, sjinfo)) { @@ -842,6 +844,25 @@ clause_selectivity_ext(PlannerInfo *root, jointype, sjinfo); } + else if(use_extended_stats) + { + /* Check whether clauses are from one relation */ + RelOptInfo *rel = find_single_rel_for_clauses(root, clauses); + Bitmapset *estimatedclauses = NULL; + if (rel && rel->rtekind == RTE_RELATION && rel->statlist != NIL) +s1 = statext_clauselist_selectivity(root, clauses, varRelid, + jointype, sjinfo, rel, + &estimatedclauses, VAR_OP_VAR_CLAUSE); + + if (bms_num_members(estimatedclauses) != 1) + { +/* If there is no multi-column MCV statistics */ +s1 = restriction_selectivity(root, opno, + opclause->args, + opclause->inputcollid, + varRelid); + } + } else { /* Estimate selectivity for a restriction clause. */ diff --git a/src/backend/statistics/README b/src/backend/statistics/README index 13a97a
Re: change regexp_substr first argument make tests more easier to understand.
Hi everybody,

The current tests for regexp_instr() and regexp_substr() that use the string 'abcabcabc' are really hard to read: you have to spend time figuring out what happens in these tests and whether they are actually correct. I'd rather change the string to 'abcdefghi', just as in the query SELECT regexp_substr('abcdefghi', 'd.q') IS NULL AS t;

Regards,
Ilia Evdokimov,
Tantor Labs LLC.
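P.S. A couple of illustrative calls; the expected results in the comments follow from the documented behaviour of these functions:
```
SELECT regexp_substr('abcdefghi', 'c.e');               -- 'cde'
SELECT regexp_instr('abcdefghi', 'fgh');                 -- 6
SELECT regexp_substr('abcdefghi', 'd.q') IS NULL AS t;   -- t
```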
Re: Vacuum statistics
On 10.8.24 22:37, Andrei Zubkov wrote:

> Hi, Ilia!
>
>> Do you consider not to create new table in pg_catalog but to save statistics in existing table? I mean pg_class or pg_stat_progress_analyze, pg_stat_progress_vacuum?
>
> Thank you for your interest in our patch! The *_progress views are not our case. They hold online statistics while a vacuum is in progress; once work on a table is done, the entry is gone from those views. The idea of this patch is the opposite - it doesn't provide online statistics but accumulates statistics about resources consumed by all vacuum passes over all relations. It's much closer to pg_stat_all_tables than to pg_stat_progress_vacuum. It seems pg_class is not the right place because it is not a statistics view - it holds the current relation state and nothing about the relation workload. Maybe pg_stat_all_tables is the right place, but I have several thoughts about why it is not:
> - Some statistics provided by this patch are really vacuum specific. I don't think we want them in the relation statistics view.
> - Postgres is extremely extensible. I'm sure someday there will be table AMs that do not need vacuum at all.
> Right now vacuum-specific workload views seem the optimal choice to me.
>
> Regards,

Agreed. They are not good places to store such statistics.

I have some suggestions:

1. In pgstatfuncs.c, in the functions tuplestore_put_for_database() and tuplestore_put_for_relation(), you can remove the 'nulls' array if you're sure that the columns cannot be NULL.

2. These functions are almost the same, so I would think of writing one function that takes the type 'ExtVacReportType' as a parameter.
Re: Vacuum statistics
And I have one suggestion for pg_stat_vacuum_database: I suppose we should add a column with the database's name after the 'dboid' column, because it is difficult to read the statistics without it. We could call it 'datname', just like in the 'pg_stat_database' view.

Regards,
Ilia Evdokimov,
Tantor Labs LLC.
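P.S. In the meantime the name can be obtained with an explicit join (a sketch, assuming the view keeps the 'dboid' column):
```
SELECT d.datname, v.*
FROM pg_stat_vacuum_database v
JOIN pg_database d ON d.oid = v.dboid;
```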
Re: Vacuum statistics
On 15.8.24 11:49, Alena Rybakina wrote:

>> I have some suggestions:
>> 1. pgstatfuncs.c in functions tuplestore_put_for_database() and tuplestore_put_for_relation you can remove 'nulls' array if you're sure that columns cannot be NULL.
>
> We need to use this for the tuplestore_putvalues function. With this function, we fill the table with the values of the statistics.

Ah, right! I'm sorry.

>> 2. These functions are almost the same and I would think of writing one function depending of type 'ExtVacReportType'
>
> I'm not sure that I fully understand what you mean. Can you explain it more clearly, please?

Ah, I didn't notice that the size of all three tables is different. Therefore, it won't be possible to write one function instead of two to avoid code duplication. My mistake.
Remove redundant NULL check in clause_selectivity_ext() function
Hi everyone, In the `clause_selectivity_ext()` function, there’s a comment regarding a NULL clause check that asks, "can this still happen?". I decided to investigate whether this scenario is still valid. Here’s what I found after reviewing the relevant cases: - approx_tuple_count(): The function iterates over a loop of other clauses. - get_foreign_key_join_selectivity(): The function is invoked within an `if (clause)` condition. - consider_new_or_clause(): The clause is generated by `make_restrictinfo()`, which never returns NULL. - clauselist_selectivity_ext(): This function either checks `list_length(clauses) == 1` before being called, or it is called within a loop of clauses. In other cases, the function is called recursively from `clause_selectivity_ext()`. If you are aware of any situations where a NULL clause could be passed or if I’ve missed anything, please let me know. I’m also open to any other suggestions. -- Regards, Ilia Evdokimov, Tantor Labs LCC. From fab0575ef1350cb700b6fc079230397ecb5ca19d Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Mon, 19 Aug 2024 12:08:53 +0300 Subject: [PATCH] Remove redundant NULL check for clause during selectivity calculation The function clause_selectivity_ext() takes a Node *clause parameter, which is either iterated over in a loop with other clauses or passed as a RestrictInfo from other functions. Since these functions guarantee that clause cannot be NULL, the NULL check is unnecessary and has been removed. --- src/backend/optimizer/path/clausesel.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c index 0ab021c1e8..5992f96ed2 100644 --- a/src/backend/optimizer/path/clausesel.c +++ b/src/backend/optimizer/path/clausesel.c @@ -692,9 +692,6 @@ clause_selectivity_ext(PlannerInfo *root, RestrictInfo *rinfo = NULL; bool cacheable = false; - if (clause == NULL) /* can this still happen? */ - return s1; - if (IsA(clause, RestrictInfo)) { rinfo = (RestrictInfo *) clause; -- 2.34.1
Re: Remove redundant NULL check in clause_selectivity_ext() function
On 19.8.24 18:02, Tom Lane wrote:

> Ilia Evdokimov writes:
>> In the `clause_selectivity_ext()` function, there's a comment regarding a NULL clause check that asks, "can this still happen?". I decided to investigate whether this scenario is still valid. Here's what I found after reviewing the relevant cases:
>> - approx_tuple_count(): The function iterates over a loop of other clauses.
>> - get_foreign_key_join_selectivity(): The function is invoked within an `if (clause)` condition.
>> - consider_new_or_clause(): The clause is generated by `make_restrictinfo()`, which never returns NULL.
>> - clauselist_selectivity_ext(): This function either checks `list_length(clauses) == 1` before being called, or it is called within a loop of clauses.
>
> That list_length check doesn't prove anything about whether the list element is NULL, though. While I suspect that you may be right that the case doesn't occur anymore (if it ever did), I'm inclined to leave this test in place. It's cheap enough compared to what the rest of the function will do, and more importantly we can't assume that all interesting call sites are within Postgres core. There are definitely extensions calling clauselist_selectivity and related functions. It's possible that some of them rely on clause_selectivity not crashing on a NULL. Certainly, such an assumption could be argued to be a bug they should fix; but I'm disinclined to make them jump through that hoop for a vanishingly small performance improvement.
>
> Also, there are boatloads of other places where the planner has possibly-redundant checks for null clause pointers. It's likely that some of the other ones are more performance-critical than this. But I wouldn't be in favor of retail removal of the others either. Maybe with a more holistic approach we could remove a whole lot of them and make a measurable improvement; but it would require some careful thought about what invariants we want to assume. There's not really any design principles about this right now, and where we've ended up is that most functions operating on expression trees assume they've got to defend against NULL inputs. To remove those checks, we'd need a clear understanding of where caller-side checks need to be added instead.
>
> regards, tom lane

Let's assume that this check needs to remain, and the length check doesn't guarantee anything. However, I'm a bit concerned that there's a NULL check here, but it's missing in the clauselist_selectivity_ext() function. For the reasons mentioned above, I would suggest the following: either we perform the NULL check in both places, or we don't perform it in either.

-- 
Regards,
Ilia Evdokimov,
Tantor Labs LLC.
Re: Add support for (Var op Var) clause in extended MCV statistics
On 12.8.24 19:25, Tomas Vondra wrote: Is TPC-B really interesting/useful for this patch? The queries are super simple, with only a single clause (so it may not even get to the code handling extended statistics). Did you create any extended stats? No, it's not the case. I simply wanted to verify that other queries are not slowed down after applying my patch. I think you'll need to construct a custom test, with queries that have multiple (var op var) clauses, extended stats created, etc. And benchmark that. I used the test generator from a previous thread [1] and ran it with |default_statistics_target = 1000| to achieve more accurate estimates for 3000 rows. It would also be beneficial to run tests with 10,000 and 100,000 rows for a broader perspective. I've attached the python test. Here’s a breakdown of the issues: 1. (A op A) Clause: Before applying my patch, there were poor estimates for expressions like |(A op A)|. Currently, we only have correct estimates for the |(A = A)| clause, which transforms into |A IS NOT NULL|. Should I address this in this thread? I believe we should extend the same correction to clauses like |(A != A)|, |(A < A)|, and similar conditions. However, this issue is not for current thread. 2. AND Clauses: The estimates for AND clauses were inaccurate before my patch. I noticed code segments where I could add something specific for the |(Var op Var)| clause, but I'm unsure if I'm missing anything crucial. If my understanding is incorrect, I'd appreciate any guidance or corrections. FWIW I don't think it makes sense to benchmark the query execution - if the estimate improves, it's possible to get arbitrary speedup, but that's expected and mostly mostly irrelevant I think. What I'd focus on is benchmarking just the query planning - we need the overhead to be negligible (or at least small) so that it does not hurt people with already good plans. BTW can you elaborate why you are interested in this patch? Do you just think it's interesting/useful, or do you have a workload where it would actually help? I'm asking because me being uncertain how beneficial this is in practice (not just being nice in theory) was one of the reasons why I didn't do more work on this in 2021. regards I have two reasons for pursuing this. Firstly, I've encountered some of these queries in practice, although they are quite rare. While it might be easy to dismiss these cases due to their infrequency, I believe that we shouldn't overlook the opportunity to develop better handling for them, regardless of how seldom they occur. Secondly, I see that you're working on improving estimates for JOIN clauses in thread [2]. I believe that enhancing estimates for these rare cases could also benefit future work on JOIN queries, particularly those with multiple |ON (T1.column = T2.column)| conditions, which are essentially |(Var op Var)| clauses. My idea is to start with non-JOIN queries, and then apply the same approach to improve JOIN estimates. Of course, I might be wrong, but I think this approach has potential. P.S. If I sent this mail twice I'm sorry. I wanted to sent results of the test, and it was not sent to hackers because of big size of attached file. Now I sent only test. [1]: https://www.postgresql.org/message-id/ecc0b08a-518d-7ad6-17ed-a5e962fc4f5f%40enterprisedb.com [2]: https://www.postgresql.org/message-id/flat/c8c0ff31-3a8a-7562-bbd3-78b2ec65f16c%40enterprisedb.com -- Regards, Ilia Evdokimov, Tantor Labs LCC. 
#!/usr/bin/python3 import psycopg2 import random import select import hashlib def generate_conditions(nclauses, attributes = ['a', 'b', 'c', 'd'], operators = ['<', '<=', '=', '!=', '>=', '>']): if nclauses == 1: cols = [random.choice(attributes), random.choice(attributes)] oper = ' ' + random.choice(operators) + ' ' clause = oper.join(cols) if random.randint(0,100) < 50: clause = 'NOT ' + clause return clause nparts = random.randint(2, nclauses) # distribute the clauses between query parts nclauses_parts = [1 for p in range(0, nparts)] for x in range(0, nclauses - nparts): nclauses_parts[random.randint(0, nparts) - 1] += 1 parts = [] for p in range(0, nparts): parts.append('(' + generate_conditions(nclauses_parts[p], attributes, operators) + ')') c = random.choice([' AND ', ' OR ']) return c.join(parts) def generate_data(nrows, attributes = ['a', 'b', 'c', 'd']): sql = 'insert into t (' + ','.join(attributes) + ') select ' attrs = [] for attr in attributes: x = random.choice([-1, 2, 5, 10, 20, 30]) if x == -1: x = random.randint(5,
Re: Vacuum statistics
Are you certain that all tables are included in `pg_stat_vacuum_tables`? I'm asking because of the following:

SELECT count(*) FROM pg_stat_all_tables;
 count
-------
   108
(1 row)

SELECT count(*) FROM pg_stat_vacuum_tables;
 count
-------
    20
(1 row)

-- 
Regards,
Ilia Evdokimov,
Tantor Labs LLC.
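P.S. The missing relations can be listed with something like the following (a rough check; it assumes both views expose relname and ignores possible name collisions across schemas):
```
SELECT s.schemaname, s.relname
FROM pg_stat_all_tables s
LEFT JOIN pg_stat_vacuum_tables v ON v.relname = s.relname
WHERE v.relname IS NULL;
```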
Re: ANALYZE ONLY
On 20.8.24 10:42, Jelte Fennema-Nio wrote:

On Tue, 20 Aug 2024 at 07:52, Michael Harris wrote:

1. Would such a feature be welcomed? Are there any traps I might not have thought of?

I think this sounds like a reasonable feature.

2. The existing ANALYZE command has the following structure:

ANALYZE [ ( option [, ...] ) ] [ table_and_columns [, ...] ]

It would be easiest to add ONLY as another option, but that doesn't look quite right to me - surely the ONLY should be attached to the table name? An alternative would be:

ANALYZE [ ( option [, ...] ) ] [ONLY] [ table_and_columns [, ...] ]

Any feedback or advice would be great.

Personally I'd prefer a new option to be added. But I agree ONLY isn't a good name then. Maybe something like SKIP_PARTITIONS.

Hi everyone,

Your proposal is indeed interesting, but I have a question: can't your issue be resolved by properly configuring |autovacuum| instead of developing a new feature for |ANALYZE|? From my perspective, |ANALYZE| is intended to forcibly gather statistics from all partitioned tables. If the goal is to ensure that statistics are updated at the right moment, adjusting the |autovacuum_analyze_threshold| and |autovacuum_analyze_scale_factor| parameters might be the solution.

-- Regards, Ilia Evdokimov, Tantor Labs LLC.
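For reference, the per-table trigger autovacuum uses to decide when a table is due for ANALYZE is essentially a simple threshold check; here is a rough sketch of that decision (illustrative names, not the exact code from autovacuum.c):

```c
#include <stdbool.h>

/*
 * Rough sketch of autovacuum's "is this table due for ANALYZE?" check.
 * ANALYZE is triggered once enough rows have changed since the last ANALYZE.
 */
static bool
table_needs_autoanalyze(double reltuples,             /* rows in the table */
                        double changes_since_analyze, /* ins+upd+del since last ANALYZE */
                        double analyze_threshold,     /* autovacuum_analyze_threshold */
                        double analyze_scale_factor)  /* autovacuum_analyze_scale_factor */
{
    double anlthresh = analyze_threshold + analyze_scale_factor * reltuples;

    return changes_since_analyze > anlthresh;
}
```

Both parameters can also be set per table via storage parameters, so individual partitions can be tuned separately.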
Re: Use function smgrclose() to replace the loop
On 14.08.2024 09:32, Steven Niu wrote:

Hi, Kirill, Junwang,

I made this patch to address the refactor issue in our previous email discussion. https://www.postgresql.org/message-id/flat/CABBtG=cdtcbdcbk7mcsy6bjr3s5xutog0vsffuw8olduyyc...@mail.gmail.com

That is, the for loop in the functions smgrdestroy() and smgrdounlinkall() can be replaced with smgrclose():

for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
    smgrsw[reln->smgr_which].smgr_close(reln, forknum);

-->

smgrclose(rels[i]);

Please let me know if you have any questions.

Best Regards, Steven from Highgo.com

Hello,

Are you sure we can replace the loop with smgrclose()? I see it is not quite the same.

Regards, Ilia Evdokimov, Tantor Labs LLC.
Re: Vacuum statistics
On 08.10.2024 19:18, Alena Rybakina wrote:

Made a rebase on a fresh master branch.

-- Regards, Alena Rybakina Postgres Professional

Thank you for rebasing.

I have noticed that when I create a table or an index on this table, there is no information about the table or index in pg_stat_vacuum_tables and pg_stat_vacuum_indexes until we perform a VACUUM. Example:

CREATE TABLE t (i INT, j INT);
INSERT INTO t SELECT i/10, i/100 FROM GENERATE_SERIES(1,100) i;
SELECT * FROM pg_stat_vacuum_tables WHERE relname = 't';
(0 rows)

CREATE INDEX ON t (i);
SELECT * FROM pg_stat_vacuum_indexes WHERE relname = 't_i_idx';
...
(0 rows)

I can see the entries only after VACUUM has been run or autovacuum has executed. I would suggest adding a line about the relation even if it has not yet been processed by vacuum.

Interestingly, this issue does not occur with pg_stat_vacuum_database:

CREATE DATABASE example_db;
SELECT * FROM pg_stat_vacuum_database WHERE dbname = 'example_db';
 dboid | dbname     | ...
-------+------------+-----
 ...   | example_db | ...
(1 row)

BTW, I recommend renaming the view pg_stat_vacuum_database to pg_stat_vacuum_databases for consistency with pg_stat_vacuum_tables and pg_stat_vacuum_indexes.

-- Regards, Ilia Evdokimov, Tantor Labs LLC.
Re: Check for tuplestorestate nullness before dereferencing
On 14.10.2024 12:25, Alexander Kuznetsov wrote:

Hello everyone,

I'd like to propose adding a check for the nullness of tuplestorestate before dereferencing it in src/backend/executor/nodeMaterial.c. The patch is attached. I am proposing this fix based on the assumption that tuplestorestate could be NULL, since there is a check for it when calculating eof_tuplestore at line 85. However, since this code hasn't been changed since 2006 and hasn't caused any issues, it is possible that the check for (tuplestorestate == NULL) is redundant when calculating eof_tuplestore.

Hi Alexander,

The 'tuplestorestate' variable may be initialized at line 64 if it is NULL. You should consider initializing this variable earlier.

Regards, Ilia Evdokimov, Tantor Labs LLC.
Re: Vacuum statistics
On 08.11.2024 22:23, Alena Rybakina wrote:

Hi! Thank you for the review!

On 07.11.2024 17:49, Ilia Evdokimov wrote:

Thank you for fixing it.

1) I have found some typos in the test output files (out-files) when running 'make check' and 'make check-world'. These typos might cause minor discrepancies in test results. You may already be aware of them, but I wanted to bring them to your attention in case they haven't been noticed. I believe these can be fixed quickly.

Yes, I'll fix it)

2) Additionally, I observed that when we create a table and insert some rows, executing the VACUUM FULL command does not update the information in 'pg_stat_get_vacuum_tables'. However, running the VACUUM command does update this information as expected. This seems inconsistent, and it might be a bug. Example:

CREATE TABLE t (i INT, j INT) WITH (autovacuum_enabled = false);
INSERT INTO t SELECT i/10, i/100 FROM GENERATE_SERIES(1,100) i;

SELECT * FROM pg_stat_get_vacuum_tables WHERE relname = 't';
 schema | relname | relid | total_blks_read | ...
--------+---------+-------+-----------------+----
 public | t       | 21416 |               0 | ...
(1 row)

VACUUM FULL;
SELECT * FROM pg_stat_get_vacuum_tables WHERE relname = 't';
 schema | relname | relid | total_blks_read | ...
--------+---------+-------+-----------------+----
 public | t       | 21416 |               0 | ...
(1 row)

VACUUM;
SELECT * FROM pg_stat_get_vacuum_tables WHERE relname = 't';
 schema | relname | relid | total_blks_read | ...
--------+---------+-------+-----------------+----
 public | t       | 21416 |            4425 | ...
(1 row)

A vacuum full operation doesn't call a vacuum operation, so we can't collect statistics for it. Furthermore, it is a different operation than vacuum because it completely rebuilds the table and indexes, so it looks like your previous table and its indexes were completely removed. To sum up, I think it isn't a bug that the statistics aren't showing here.

Ah, you're right. This table does contain the statistics for it. Everything is okay then. Sorry for the confusion.

Regards, Ilia Evdokimov, Tantor Labs LLC.
Re: Vacuum statistics
On 22.10.2024 22:30, Alena Rybakina wrote:

Hi!

On 16.10.2024 14:01, Alena Rybakina wrote:

Thank you for rebasing.

I have noticed that when I create a table or an index on this table, there is no information about the table or index in pg_stat_vacuum_tables and pg_stat_vacuum_indexes until we perform a VACUUM. Example:

CREATE TABLE t (i INT, j INT);
INSERT INTO t SELECT i/10, i/100 FROM GENERATE_SERIES(1,100) i;
SELECT * FROM pg_stat_vacuum_tables WHERE relname = 't';
(0 rows)

CREATE INDEX ON t (i);
SELECT * FROM pg_stat_vacuum_indexes WHERE relname = 't_i_idx';
...
(0 rows)

I can see the entries only after VACUUM has been run or autovacuum has executed. I would suggest adding a line about the relation even if it has not yet been processed by vacuum.

Interestingly, this issue does not occur with pg_stat_vacuum_database:

CREATE DATABASE example_db;
SELECT * FROM pg_stat_vacuum_database WHERE dbname = 'example_db';
 dboid | dbname     | ...
-------+------------+-----
 ...   | example_db | ...
(1 row)

BTW, I recommend renaming the view pg_stat_vacuum_database to pg_stat_vacuum_databases for consistency with pg_stat_vacuum_tables and pg_stat_vacuum_indexes.

Thanks for the review. I'm investigating this. I agree with the renaming, I will do it in the next version of the patch.

I fixed it. I added a left outer join to the vacuum views and a coalesce call for converting NULL values. I also fixed the code for getting database statistics - we can get it through the existing pgstat_fetch_stat_dbentry function - and fixed a couple of comments. I attached a diff file, as well as new versions of patches.

-- Regards, Alena Rybakina Postgres Professional

Thank you for fixing it.

1) I have found some typos in the test output files (out-files) when running 'make check' and 'make check-world'. These typos might cause minor discrepancies in test results. You may already be aware of them, but I wanted to bring them to your attention in case they haven't been noticed. I believe these can be fixed quickly.

2) Additionally, I observed that when we create a table and insert some rows, executing the VACUUM FULL command does not update the information in 'pg_stat_get_vacuum_tables'. However, running the VACUUM command does update this information as expected. This seems inconsistent, and it might be a bug. Example:

CREATE TABLE t (i INT, j INT) WITH (autovacuum_enabled = false);
INSERT INTO t SELECT i/10, i/100 FROM GENERATE_SERIES(1,100) i;

SELECT * FROM pg_stat_get_vacuum_tables WHERE relname = 't';
 schema | relname | relid | total_blks_read | ...
--------+---------+-------+-----------------+----
 public | t       | 21416 |               0 | ...
(1 row)

VACUUM FULL;
SELECT * FROM pg_stat_get_vacuum_tables WHERE relname = 't';
 schema | relname | relid | total_blks_read | ...
--------+---------+-------+-----------------+----
 public | t       | 21416 |               0 | ...
(1 row)

VACUUM;
SELECT * FROM pg_stat_get_vacuum_tables WHERE relname = 't';
 schema | relname | relid | total_blks_read | ...
--------+---------+-------+-----------------+----
 public | t       | 21416 |            4425 | ...
(1 row)

Regards, Ilia Evdokimov, Tantor Labs LLC.
Re: Sample rate added to pg_stat_statements
On 22.11.2024 09:08, Alexander Korotkov wrote:

On Wed, Nov 20, 2024 at 12:07 AM Michael Paquier wrote:

On Tue, Nov 19, 2024 at 09:39:21AM -0500, Greg Sabino Mullane wrote:

Oh, and a +1 in general to the patch, OP, although it would also be nice to start finding the bottlenecks that cause such performance issues.

FWIW, I'm not eager to integrate this proposal without looking at this exact argument in depth. One piece of it would be to see how much of such "bottlenecks" we would be able to get rid of by integrating pg_stat_statements into the central pgstats with the custom APIs, without pushing the module into core. This means that we would combine the existing hash of pgss to shrink to 8 bytes for objid rather than 13 bytes now as the current code relies on (toplevel, userid, queryid) for the entry lookup (entry removal is sniped with these three values as well, or dshash seq scans). The odds of conflicts still play in our favor even if we have a few million entries, or even ten times that.

If you run "pgbench -S -M prepared" on a pretty large machine with high concurrency, then the spin lock in pgss_store() could become pretty much of a bottleneck. And I'm not sure switching all counters to atomics could somehow improve the situation given we already have pretty many counters. I'm generally +1 for the approach taken in this thread. But I would suggest introducing a threshold value for a query execution time, and sample just everything below that threshold. The slower query shouldn't be sampled, because it can't be too frequent, and also it could be more valuable to be counted individually (while very fast queries probably only matter "in average").

-- Regards, Alexander Korotkov Supabase

I really liked your idea, and I'd like to propose an enhancement that I believe improves it further. Yes, if a query's execution time exceeds the threshold, it should always be tracked without sampling. However, for queries with execution times below the threshold, the sampling logic should prioritize shorter queries over those closer to the threshold. In my view, the ideal approach is for shorter queries to have the highest probability of being sampled, while queries closer to the threshold are less likely to be sampled. This behavior can be achieved with the following logic:

pg_stat_statements.sample_exectime_threshold * random(0, 1) < total_time

Here's how it works:

* As a query's execution time approaches zero, the probability of it being sampled approaches one.
* Conversely, as a query's execution time approaches the threshold, the probability of it being sampled approaches zero.

In other words, the sampling probability decreases linearly from 1 to 0 as the execution time gets closer to the threshold. I believe this approach offers an ideal user experience.

I have attached a new patch implementing this logic. Please let me know if you have any feedback regarding the comments in the code, the naming of variables, or the documentation. I'm always open to discussion.

-- Best regards, Ilia Evdokimov, Tantor Labs LLC.

From edba1bcefa85e995ec3e9df6c9e8d30adcd940b9 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov
Date: Tue, 26 Nov 2024 00:58:24 +0300
Subject: [PATCH] Add time-based sampling to pg_stat_statements

New configuration parameter pg_stat_statements.sample_exectime_threshold, which allows tracking only queries that meet a specified execution time threshold. Queries with execution times above the threshold are always tracked, while shorter queries are sampled probabilistically.
This helps reduce the overhead of tracking frequent, fast queries while preserving data for longer-running ones. --- .../pg_stat_statements/pg_stat_statements.c | 57 ++- doc/src/sgml/pgstatstatements.sgml| 20 +++ 2 files changed, 64 insertions(+), 13 deletions(-) diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index 49c657b3e0..6ec1e3e7ed 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -49,6 +49,7 @@ #include "access/parallel.h" #include "catalog/pg_authid.h" +#include "common/pg_prng.h" #include "common/int.h" #include "executor/instrument.h" #include "funcapi.h" @@ -289,6 +290,7 @@ static const struct config_enum_entry track_options[] = }; static int pgss_max = 5000; /* max # statements to track */ +static int pgss_sample_exectime_threshold = 0; /* Threshold for query execution sampling (msec) */ static int pgss_track = PGSS_TRACK_TOP; /* tracking level */ static bool pgss_track_utility = true; /* whether to track utility commands */ static bool pgss_track_planning = false; /* whether to track planning @@ -414,6 +416,19 @@ _PG_init(void) NULL, NULL); + DefineCustomIntV
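For illustration, here is a minimal standalone sketch of the decision rule described above (hypothetical names and units; this is not the code from the attached patch). It assumes total_time and the threshold are expressed in the same unit and that random01 is uniform in [0, 1):

```c
#include <stdbool.h>

/* Sketch: implements "sample_exectime_threshold * random(0, 1) < total_time". */
static bool
exectime_threshold_passed(double threshold_ms, double total_time_ms,
                          double random01)
{
    /* Any query at or above the threshold always passes the check. */
    return threshold_ms * random01 < total_time_ms;
}
```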
Re: Sample rate added to pg_stat_statements
Hi everyone, I believe we should also include this check in the pgss_ExecutorEnd() function because sampling in pgss_ExecutorEnd() ensures that a query not initially sampled in pgss_ExecutorStart() can still be logged if it meets the pg_stat_statements.sample_rate criteria. This approach adds flexibility by allowing critical queries to be captured while maintaining efficient sampling. I attached new version of patch. -- Best regards, Ilia Evdokimov, Tantor Labs LLC. From c33e1ae07e8eb4f797b47e7deb07e6322b1375a3 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Tue, 19 Nov 2024 12:53:52 +0300 Subject: [PATCH] Allow setting sample ratio for pg_stat_statements New configuration parameter pg_stat_statements.sample_ratio makes it possible to control just a fraction of the queries meeting the configured threshold, to reduce the amount of controlling. --- .../pg_stat_statements/pg_stat_statements.c | 25 --- doc/src/sgml/pgstatstatements.sgml| 17 + 2 files changed, 39 insertions(+), 3 deletions(-) diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index 49c657b3e0..42b3fee815 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -49,6 +49,7 @@ #include "access/parallel.h" #include "catalog/pg_authid.h" +#include "common/pg_prng.h" #include "common/int.h" #include "executor/instrument.h" #include "funcapi.h" @@ -294,7 +295,10 @@ static bool pgss_track_utility = true; /* whether to track utility commands */ static bool pgss_track_planning = false; /* whether to track planning * duration */ static bool pgss_save = true; /* whether to save stats across shutdown */ +static double pgss_sample_rate = 1; +/* Is the current top-level query to be sampled? */ +static bool current_query_sampled = false; #define pgss_enabled(level) \ (!IsParallelWorker() && \ @@ -414,6 +418,19 @@ _PG_init(void) NULL, NULL); + DefineCustomRealVariable("pg_stat_statements.sample_rate", + "Fraction of queries to process.", + NULL, + &pgss_sample_rate, + 1.0, + 0.0, + 1.0, + PGC_SUSET, + 0, + NULL, + NULL, + NULL); + DefineCustomEnumVariable("pg_stat_statements.track", "Selects which statements are tracked by pg_stat_statements.", NULL, @@ -989,6 +1006,8 @@ pgss_planner(Query *parse, static void pgss_ExecutorStart(QueryDesc *queryDesc, int eflags) { + current_query_sampled = (pg_prng_double(&pg_global_prng_state) < pgss_sample_rate); + if (prev_ExecutorStart) prev_ExecutorStart(queryDesc, eflags); else @@ -999,7 +1018,7 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags) * counting of optimizable statements that are directly contained in * utility statements. */ - if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != UINT64CONST(0)) + if (current_query_sampled && pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != UINT64CONST(0)) { /* * Set up to track total elapsed time in ExecutorRun. Make sure the @@ -1068,8 +1087,8 @@ pgss_ExecutorEnd(QueryDesc *queryDesc) { uint64 queryId = queryDesc->plannedstmt->queryId; - if (queryId != UINT64CONST(0) && queryDesc->totaltime && - pgss_enabled(nesting_level)) + if (current_query_sampled && queryId != UINT64CONST(0) && + queryDesc->totaltime && pgss_enabled(nesting_level)) { /* * Make sure stats accumulation is done. 
(Note: it's okay if several diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml index 501b468e9a..1e2533a802 100644 --- a/doc/src/sgml/pgstatstatements.sgml +++ b/doc/src/sgml/pgstatstatements.sgml @@ -936,6 +936,23 @@ + + + pg_stat_statements.sample_rate (real) + + pg_stat_statements.sample_rate configuration parameter + + + + + + pg_stat_statements.sample_rate causes pg_stat_statements to only + control a fraction of the statements in each session. The default is 1, + meaning control all the queries. Only superusers can change this setting. + + + + pg_stat_statements.save (boolean) -- 2.34.1
Sample rate added to pg_stat_statements
Hi hackers, Under high-load scenarios with a significant number of transactions per second, pg_stat_statements introduces substantial overhead due to the collection and storage of statistics. Currently, we are sometimes forced to disable pg_stat_statements or adjust the size of the statistics using pg_stat_statements.max, which is not always optimal. One potential solution to this issue could be query sampling in pg_stat_statements. A similar approach has been implemented in extensions like auto_explain and pg_store_plans, and it has proven very useful in high-load systems. However, this approach has its trade-offs, as it sacrifices statistical accuracy for improved performance. This patch introduces a new configuration parameter, pg_stat_statements.sample_rate for the pg_stat_statements extension. The patch provides the ability to control the sampling of query statistics in pg_stat_statements. This patch serves as a proof of concept (POC), and I would like to hear your thoughts on whether such an approach is viable and applicable. -- Best regards, Ilia Evdokimov, Tantor Labs LLC.From b19081bd257014f0c4046570097e4bc7b28a3126 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Mon, 18 Nov 2024 21:05:43 +0300 Subject: [PATCH] [PATCH] Allow setting sample ratio for pg_stat_statements New configuration parameter pg_stat_statements.sample_ratio makes it possible to control just a fraction of the queries meeting the configured threshold, to reduce the amount of controlling. --- .../pg_stat_statements/pg_stat_statements.c | 22 ++- doc/src/sgml/pgstatstatements.sgml| 17 ++ 2 files changed, 38 insertions(+), 1 deletion(-) diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index 49c657b3e0..51d1a48252 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -49,6 +49,7 @@ #include "access/parallel.h" #include "catalog/pg_authid.h" +#include "common/pg_prng.h" #include "common/int.h" #include "executor/instrument.h" #include "funcapi.h" @@ -73,6 +74,8 @@ PG_MODULE_MAGIC; +static double pgss_sample_rate = 1; + /* Location of permanent stats file (valid when database is shut down) */ #define PGSS_DUMP_FILE PGSTAT_STAT_PERMANENT_DIRECTORY "/pg_stat_statements.stat" @@ -295,6 +298,8 @@ static bool pgss_track_planning = false; /* whether to track planning * duration */ static bool pgss_save = true; /* whether to save stats across shutdown */ +/* Is the current top-level query to be sampled? */ +static bool current_query_sampled = false; #define pgss_enabled(level) \ (!IsParallelWorker() && \ @@ -414,6 +419,19 @@ _PG_init(void) NULL, NULL); + DefineCustomRealVariable("pg_stat_statements.sample_rate", + "Fraction of queries to process.", + NULL, + &pgss_sample_rate, + 1.0, + 0.0, + 1.0, + PGC_SUSET, + 0, + NULL, + NULL, + NULL); + DefineCustomEnumVariable("pg_stat_statements.track", "Selects which statements are tracked by pg_stat_statements.", NULL, @@ -989,6 +1007,8 @@ pgss_planner(Query *parse, static void pgss_ExecutorStart(QueryDesc *queryDesc, int eflags) { + current_query_sampled = (pg_prng_double(&pg_global_prng_state) < pgss_sample_rate); + if (prev_ExecutorStart) prev_ExecutorStart(queryDesc, eflags); else @@ -999,7 +1019,7 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags) * counting of optimizable statements that are directly contained in * utility statements. 
*/ - if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != UINT64CONST(0)) + if (current_query_sampled && pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != UINT64CONST(0)) { /* * Set up to track total elapsed time in ExecutorRun. Make sure the diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml index 501b468e9a..1e2533a802 100644 --- a/doc/src/sgml/pgstatstatements.sgml +++ b/doc/src/sgml/pgstatstatements.sgml @@ -936,6 +936,23 @@ + + + pg_stat_statements.sample_rate (real) + + pg_stat_statements.sample_rate configuration parameter + + + + + + pg_stat_statements.sample_rate causes pg_stat_statements to only + control a fraction of the statements in each session. The default is 1, + meaning control all the queries. Only superusers can change this setting. + + + + pg_stat_statements.save (boolean) -- 2.34.1
Re: Sample rate added to pg_stat_statements
On 19.11.2024 15:11, Andrey M. Borodin wrote:

On 18 Nov 2024, at 23:33, Ilia Evdokimov wrote:

Hi hackers,

Under high-load scenarios with a significant number of transactions per second, pg_stat_statements introduces substantial overhead due to the collection and storage of statistics. Currently, we are sometimes forced to disable pg_stat_statements or adjust the size of the statistics using pg_stat_statements.max, which is not always optimal. One potential solution to this issue could be query sampling in pg_stat_statements. A similar approach has been implemented in extensions like auto_explain and pg_store_plans, and it has proven very useful in high-load systems. However, this approach has its trade-offs, as it sacrifices statistical accuracy for improved performance. This patch introduces a new configuration parameter, pg_stat_statements.sample_rate, for the pg_stat_statements extension. The patch provides the ability to control the sampling of query statistics in pg_stat_statements. This patch serves as a proof of concept (POC), and I would like to hear your thoughts on whether such an approach is viable and applicable.

+1 for the idea. I heard a lot of complaints that pgss is costly. Most of them were using it wrong though. But at least it could give an easy way to rule out the performance impact of pgss.

Thank you for the review.

On 19 Nov 2024, at 15:09, Ilia Evdokimov wrote:

I believe we should also include this check in the pgss_ExecutorEnd() function because sampling in pgss_ExecutorEnd() ensures that a query not initially sampled in pgss_ExecutorStart() can still be logged if it meets the pg_stat_statements.sample_rate criteria. This approach adds flexibility by allowing critical queries to be captured while maintaining efficient sampling.

Is there a reason why pgss_ProcessUtility is excluded?

Best regards, Andrey Borodin.

Ah, you're right! Moreover, this check is needed not only in pgss_ProcessUtility() but in all places where pgss_enabled() is called. Therefore, it's better to move the sampling check directly into pgss_enabled().

However, another issue arises with the initialization of the 'current_query_sampled' variable, which records whether the query is sampled or not. Initializing it in pgss_ExecutorStart() is not sufficient, because pgss_post_parse_analyze() or pgss_ProcessUtility() might be called earlier. This could lead to different values of 'current_query_sampled' being used in these functions, which is undesirable. Run the regression tests for pg_stat_statements with the initialization in pgss_ExecutorStart(), and you'll see this. To avoid this, we need to find a function that is called earlier than all the others. In my opinion, pgss_post_parse_analyze() is a good candidate for this purpose.

If you have objections or better suggestions, feel free to share them. I attached the patch with fixes.

-- Best regards, Ilia Evdokimov, Tantor Labs LLC.

From 725d0a428040d46cfedac0ca707d7d7a27db8721 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov
Date: Tue, 19 Nov 2024 16:58:28 +0300
Subject: [PATCH] Allow setting sample ratio for pg_stat_statements

New configuration parameter pg_stat_statements.sample_ratio makes it possible to control just a fraction of the queries meeting the configured threshold, to reduce the amount of controlling.
--- .../pg_stat_statements/pg_stat_statements.c | 22 ++- doc/src/sgml/pgstatstatements.sgml| 17 ++ 2 files changed, 38 insertions(+), 1 deletion(-) diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index 49c657b3e0..01586c32b0 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -49,6 +49,7 @@ #include "access/parallel.h" #include "catalog/pg_authid.h" +#include "common/pg_prng.h" #include "common/int.h" #include "executor/instrument.h" #include "funcapi.h" @@ -294,12 +295,16 @@ static bool pgss_track_utility = true; /* whether to track utility commands */ static bool pgss_track_planning = false; /* whether to track planning * duration */ static bool pgss_save = true; /* whether to save stats across shutdown */ +static double pgss_sample_rate = 1; +/* Is the current top-level query to be sampled? */ +static bool current_query_sampled = false; #define pgss_enabled(level) \ (!IsParallelWorker() && \ (pgss_track == PGSS_TRACK_ALL || \ - (pgss_track == PGSS_TRACK_TOP && (level) == 0))) + (pgss_track == PGSS_TRACK_TOP && (level) == 0)) && \ + current_query_sampled) #define record_gc_qtexts() \ do { \ @@ -414,6 +419,19 @@ _PG_init(void) NULL, NULL); + DefineCustomRealVariable("pg_stat_statements.sample_rate", + "Fraction of queries to process.", +
Re: Showing applied extended statistics in explain Part 2
Hi everyone!

Thank you for your work.

1) While exploring extended statistics, I encountered a bug that occurs when using EXPLAIN (STATS) with queries containing OR conditions:

CREATE TABLE t (a int, b int, c int, d int);
INSERT INTO t SELECT x/10+1, x, x + 10, x * 2 FROM generate_series(1,1) g(x);
CREATE STATISTICS ON a, b FROM t;
CREATE STATISTICS ON c, d FROM t;
ANALYZE;

The following query works as expected:

EXPLAIN (STATS) SELECT * FROM t WHERE a > 0 AND b > 0 AND c > 0 AND d > 0;
                          QUERY PLAN
--------------------------------------------------------------
 Seq Scan on t  (cost=0.00..255.00 rows=1 width=16)
   Filter: ((a > 0) AND (b > 0) AND (c > 0) AND (d > 0))
   Ext Stats: public.t_a_b_stat Clauses: ((a > 0) AND (b > 0))
   Ext Stats: public.t_c_d_stat Clauses: ((c > 0) AND (d > 0))
(4 rows)

However, when using OR conditions, the following query results in an error:

EXPLAIN (ANALYZE, STATS) SELECT * FROM t WHERE a > 0 AND b > 0 OR c > 0 AND d > 0;
ERROR: unrecognized node type: 314

2) It would be great if the STATS flag appeared as an option when pressing Tab during query input in the psql command-line interface.

Best regards, Ilia Evdokimov, Tantor Labs LLC.
Re: Sample rate added to pg_stat_statements
On 20.11.2024 01:07, Michael Paquier wrote:

On Tue, Nov 19, 2024 at 09:39:21AM -0500, Greg Sabino Mullane wrote:

Oh, and a +1 in general to the patch, OP, although it would also be nice to start finding the bottlenecks that cause such performance issues.

FWIW, I'm not eager to integrate this proposal without looking at this exact argument in depth. One piece of it would be to see how much of such "bottlenecks" we would be able to get rid of by integrating pg_stat_statements into the central pgstats with the custom APIs, without pushing the module into core. This means that we would combine the existing hash of pgss to shrink to 8 bytes for objid rather than 13 bytes now as the current code relies on (toplevel, userid, queryid) for the entry lookup (entry removal is sniped with these three values as well, or dshash seq scans). The odds of conflicts still play in our favor even if we have a few million entries, or even ten times that.

This would also get rid of the pgss text file for the queries, which is a source of one of the bottlenecks, as we could just store query strings upper-bounded based on a postmaster GUC to control the size of the entries in the pgstats dshash. More normalization for IN and ANY clauses would also help a lot here; these are a cause of a lot of bloat. This integration is not something I will be able to work on for the PG18 dev cycle as I'm in full review/commit mode for the rest of this release, but I got some plans for it in PG19 except if somebody beats me to it.

-- Michael

I agree. Your proposal can indeed improve performance. Currently, I am working on these changes and will validate them with benchmarks. Once I have concrete results, I will open new threads to facilitate further discussion. However, in my opinion, the suggested improvements are not enough, and sampling is essential.

1. I agree with Greg that pgss is widely used. It's quite odd that sampling exists in 'auto_explain' but not in pgss.

2. If performance issues arise even after these improvements and it turns out that pgss is the cause, the only painless solution without restarting the instance is sampling. The current pgss parameters are not optimal for achieving this.

BTW, I forgot to include a case for nested statements. Either all will be tracked or none. I attached a new version of the patch.

-- Best regards, Ilia Evdokimov, Tantor Labs LLC.

From 68f5451019b261bf03a222f5a05ac57cd0fb9a24 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov
Date: Thu, 21 Nov 2024 11:24:03 +0300
Subject: [PATCH] Allow setting sample ratio for pg_stat_statements

New configuration parameter pg_stat_statements.sample_ratio makes it possible to track just a fraction of the queries meeting the configured threshold, to reduce the amount of tracking.
--- .../pg_stat_statements/pg_stat_statements.c | 23 ++- doc/src/sgml/pgstatstatements.sgml| 18 +++ 2 files changed, 40 insertions(+), 1 deletion(-) diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index 49c657b3e0..d06e0d8a44 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -49,6 +49,7 @@ #include "access/parallel.h" #include "catalog/pg_authid.h" +#include "common/pg_prng.h" #include "common/int.h" #include "executor/instrument.h" #include "funcapi.h" @@ -294,12 +295,16 @@ static bool pgss_track_utility = true; /* whether to track utility commands */ static bool pgss_track_planning = false; /* whether to track planning * duration */ static bool pgss_save = true; /* whether to save stats across shutdown */ +static double pgss_sample_rate = 1; +/* Is the current top-level query to be sampled? */ +static bool current_query_sampled = false; #define pgss_enabled(level) \ (!IsParallelWorker() && \ (pgss_track == PGSS_TRACK_ALL || \ - (pgss_track == PGSS_TRACK_TOP && (level) == 0))) + (pgss_track == PGSS_TRACK_TOP && (level) == 0)) && \ + current_query_sampled) #define record_gc_qtexts() \ do { \ @@ -414,6 +419,19 @@ _PG_init(void) NULL, NULL); + DefineCustomRealVariable("pg_stat_statements.sample_rate", + "Fraction of queries to process.", + NULL, + &pgss_sample_rate, + 1.0, + 0.0, + 1.0, + PGC_SUSET, + 0, + NULL, + NULL, + NULL); + DefineCustomEnumVariable("pg_stat_statements.track", "Selects which statements are tracked by pg_stat_statements.", NULL, @@ -835,6 +853,9 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate) if (prev_post_parse_analyze_hook) prev_post_parse_analyze_hook(pstate, query, jstate); + if (nesting_level
Typo in comment of auto_explain.c
Hi everyone,

I found a typo in a comment in auto_explain.c at line 282 (/* Enable per-node instrumentation iff log_analyze is required. */). Not 'iff' but 'if'.

-- Best regards, Ilia Evdokimov, Tantor Labs LLC.
Re: Typo in comment of auto_explain.c
On 27.11.2024 15:54, Heikki Linnakangas wrote:

On 27/11/2024 14:46, Ilia Evdokimov wrote:

Hi everyone, I found a typo in a comment in auto_explain.c at line 282 (/* Enable per-node instrumentation iff log_analyze is required. */). Not 'iff' but 'if'.

It's short for "if and only if". See https://en.wikipedia.org/wiki/If_and_only_if. There was some previous discussion on pgsql-hackers on whether that usage is common enough, although I don't find the thread right now. You're not the first one to think it's a typo :-).

Ah, now everything is clear. I need to look at the history of pgsql-hackers more often. Then there are no questions. Thanks.
Re: Vacuum statistics
In my opinion, the patches are semantically correct. However, not all dead code has been removed - I'm referring to pgstat_update_snapshot(). Also, the tests need to be fixed. -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
Re: Sample rate added to pg_stat_statements
On 26.11.2024 01:15, Ilia Evdokimov wrote: On 22.11.2024 09:08, Alexander Korotkov wrote: On Wed, Nov 20, 2024 at 12:07 AM Michael Paquier wrote: On Tue, Nov 19, 2024 at 09:39:21AM -0500, Greg Sabino Mullane wrote: Oh, and a +1 in general to the patch, OP, although it would also be nice to start finding the bottlenecks that cause such performance issues. FWIW, I'm not eager to integrate this proposal without looking at this exact argument in depth. One piece of it would be to see how much of such "bottlenecks" we would be able to get rid of by integrating pg_stat_statements into the central pgstats with the custom APIs, without pushing the module into core. This means that we would combine the existing hash of pgss to shrink to 8 bytes for objid rather than 13 bytes now as the current code relies on (toplevel, userid, queryid) for the entry lookup (entry removal is sniped with these three values as well, or dshash seq scans). The odds of conflicts still still play in our favor even if we have a few million entries, or even ten times that. If you run "pgbench -S -M prepared" on a pretty large machine with high concurrency, then spin lock in pgss_store() could become pretty much of a bottleneck. And I'm not sure switching all counters to atomics could somehow improve the situation given we already have pretty many counters. I'm generally +1 for the approach taken in this thread. But I would suggest introducing a threshold value for a query execution time, and sample just everything below that threshold. The slower query shouldn't be sampled, because it can't be too frequent, and also it could be more valuable to be counter individually (while very fast queries probably only matter "in average"). -- Regards, Alexander Korotkov Supabase I really liked your idea, and I’d like to propose an enhancement that I believe improves it further. Yes, if a query’s execution time exceeds the threshold, it should always be tracked without sampling. However, for queries with execution times below the threshold, the sampling logic should prioritize shorter queries over those closer to the threshold. In my view, the ideal approach is for shorter queries to have the highest probability of being sampled, while queries closer to the threshold are less likely to be sampled. This behavior can be achieved with the following logic: pg_stat_statements.sample_exectime_threshold * random(0, 1) < total_time Here’s how it works: * As a query’s execution time approaches zero, the probability of it being sampled approaches one. * Conversely, as a query’s execution time approaches the threshold, the probability of it being sampled approaches zero. In other words, the sampling probability decreases linearly from 1 to 0 as the execution time gets closer to the threshold. I believe this approach offers an ideal user experience. I have attached a new patch implementing this logic. Please let me know if you have any feedback regarding the comments in the code, the naming of variables or documentation. I’m always open to discussion. -- Best regards, Ilia Evdokimov, Tantor Labs LLC. I’ve been closely reviewing my last (v5-*.patch) patch on implementing time-based sampling, and I’ve realized that it might not be the best approach. Let me explain the reasons. * We can only perform sampling before the 'pgss_planner()' function. 
However, at that point, we don't yet know the query's execution time, since it only becomes available during 'pgss_ExecutorEnd()' or 'pgss_ProcessUtility()';

* If we wait to sample until the execution completes and we have the actual execution time, this introduces a problem. By that point, we might have already recorded the query's statistics into shared memory from 'pgss_planner()', making it too late to decide whether to sample the query;

* Delaying sampling until the execution finishes would require waiting for the execution time, which could introduce performance overhead. This defeats the purpose of sampling, which aims to reduce the cost of tracking queries.

The hook ordering sketched below summarizes the constraint. I believe we should reconsider the approach and revert to sampling based on v4-*.patch. If I've missed anything or there's an alternative way to implement time threshold-based sampling efficiently, I'd be grateful to hear your insights.

-- Best regards, Ilia Evdokimov, Tantor Labs LLC.
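As a reference for the ordering argument above, here is a rough sketch of the order in which pg_stat_statements' hooks see one top-level optimizable statement (hook names as in pg_stat_statements.c; the annotations are mine, not quotes from the source):

```c
/*
 * Rough sketch of the hook sequence for one top-level optimizable statement:
 *
 *   pgss_post_parse_analyze()   -- earliest hook, jstate is available here
 *   pgss_planner()              -- may already call pgss_store() for planning stats
 *   pgss_ExecutorStart()        -- queryDesc->totaltime instrumentation is set up
 *   pgss_ExecutorRun() / pgss_ExecutorFinish()
 *   pgss_ExecutorEnd()          -- total execution time is known only here
 *
 * So a per-query sampling decision has to be made no later than
 * pgss_planner(), i.e. before any execution time exists.
 */
```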
Re: Showing applied extended statistics in explain Part 2
On 19.11.2024 00:38, Tomas Vondra wrote:

On 11/18/24 22:15, Tomas Vondra wrote:

... So I think the correct solution is to not pass any expressions with RestrictInfo to deparse_expression(). Either by stripping the nodes, or by not adding them at all. The patch tries to do the stripping by maybe_extract_actual_clauses(), but that only looks at the top node, and that is not sufficient here. Maybe it would be possible to walk the whole tree, and remove all the RestrictInfo nodes - including intermediate ones, not just the top. But I'm not quite sure it wouldn't cause issues elsewhere (assuming it modifies the existing nodes). It still feels a bit like fixing a problem we shouldn't really have ...

To make this idea a bit more concrete, here's a patch removing all RestrictInfo nodes in show_stat_qual(). But still, it feels wrong.

regards

Yes, removing all 'RestrictInfos' during deparsing using 'expression_tree_mutator()' is not optimal. However, I don't see an alternative. Perhaps it could be done earlier in extended_stats.c to avoid the need for cleanup later ...

-- Best regards, Ilia Evdokimov, Tantor Labs LLC.
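For reference, a mutator that strips RestrictInfo nodes before deparsing could look roughly like the following. This is only a sketch of the idea being discussed, not the patch Tomas attached, and it assumes expression_tree_mutator() can recurse through the remaining node types in the clause tree:

```c
#include "postgres.h"

#include "nodes/nodeFuncs.h"
#include "nodes/pathnodes.h"

/*
 * Replace every RestrictInfo in the tree with its bare clause, so the
 * result can be handed to deparse_expression() safely.
 */
static Node *
strip_restrictinfo_mutator(Node *node, void *context)
{
	if (node == NULL)
		return NULL;

	if (IsA(node, RestrictInfo))
	{
		RestrictInfo *rinfo = (RestrictInfo *) node;

		/* recurse, in case RestrictInfos appear below this one as well */
		return strip_restrictinfo_mutator((Node *) rinfo->clause, context);
	}

	return expression_tree_mutator(node, strip_restrictinfo_mutator, context);
}
```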
Define STATS_MIN_ROWS for minimum rows of stats in ANALYZE
Hi hackers, The repeated use of the number 300 in the ANALYZE-related code creates redundancy and relies on scattered, sometimes unclear, comments to explain its purpose. This can make the code harder to understand, especially for new contributors who might not immediately understand its significance. To address this, I propose introducing a macro STATS_MIN_ROWS to represent this value and consolidating its explanation in a single place, making the code more consistent and readable. -- Best regards, Ilia Evdokimov, Tantor Labs LLC. From d3cbea3875fdc9866c86c64c0d0e0d838887040c Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Mon, 9 Dec 2024 15:50:40 +0300 Subject: [PATCH v1] Define STATS_MIN_ROWS for minimum rows of stats in ANALYZE This introduces a macro STATS_MIN_ROWS to represent the default minimum number of rows (300) sampled in ANALYZE. --- src/backend/commands/analyze.c| 26 ++- src/backend/statistics/extended_stats.c | 2 +- src/backend/tsearch/ts_typanalyze.c | 4 +-- src/backend/utils/adt/rangetypes_typanalyze.c | 7 +++-- src/include/statistics/statistics.h | 23 5 files changed, 31 insertions(+), 31 deletions(-) diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index 9a56de2282..ff7b6ffe8f 100644 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -1897,42 +1897,20 @@ std_typanalyze(VacAttrStats *stats) { /* Seems to be a scalar datatype */ stats->compute_stats = compute_scalar_stats; - /* - * The following choice of minrows is based on the paper - * "Random sampling for histogram construction: how much is enough?" - * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in - * Proceedings of ACM SIGMOD International Conference on Management - * of Data, 1998, Pages 436-447. Their Corollary 1 to Theorem 5 - * says that for table size n, histogram size k, maximum relative - * error in bin size f, and error probability gamma, the minimum - * random sample size is - * r = 4 * k * ln(2*n/gamma) / f^2 - * Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain - * r = 305.82 * k - * Note that because of the log function, the dependence on n is - * quite weak; even at n = 10^12, a 300*k sample gives <= 0.66 - * bin size error with probability 0.99. So there's no real need to - * scale for n, which is a good thing because we don't necessarily - * know it at this point. 
- * - */ - stats->minrows = 300 * stats->attstattarget; } else if (OidIsValid(eqopr)) { /* We can still recognize distinct values */ stats->compute_stats = compute_distinct_stats; - /* Might as well use the same minrows as above */ - stats->minrows = 300 * stats->attstattarget; } else { /* Can't do much but the trivial stuff */ stats->compute_stats = compute_trivial_stats; - /* Might as well use the same minrows as above */ - stats->minrows = 300 * stats->attstattarget; } + stats->minrows = (STATS_MIN_ROWS * stats->attstattarget); + return true; } diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c index 99fdf208db..451e2d1e9c 100644 --- a/src/backend/statistics/extended_stats.c +++ b/src/backend/statistics/extended_stats.c @@ -320,7 +320,7 @@ ComputeExtStatisticsRows(Relation onerel, MemoryContextDelete(cxt); /* compute sample size based on the statistics target */ - return (300 * result); + return (STATS_MIN_ROWS * result); } /* diff --git a/src/backend/tsearch/ts_typanalyze.c b/src/backend/tsearch/ts_typanalyze.c index ccafe42729..befd90c9d5 100644 --- a/src/backend/tsearch/ts_typanalyze.c +++ b/src/backend/tsearch/ts_typanalyze.c @@ -17,6 +17,7 @@ #include "catalog/pg_operator.h" #include "commands/vacuum.h" #include "common/hashfn.h" +#include "statistics/statistics.h" #include "tsearch/ts_type.h" #include "utils/builtins.h" #include "varatt.h" @@ -64,8 +65,7 @@ ts_typanalyze(PG_FUNCTION_ARGS) stats->attstattarget = default_statistics_target; stats->compute_stats = compute_tsvector_stats; - /* see comment about the choice of minrows in commands/analyze.c */ - stats->minrows = 300 * stats->attstattarget; + stats->minrows = (STATS_MIN_ROWS * stats->attstattarget); PG_RETURN_BOOL(true); } diff --git a/src/backend/utils/adt/rangetypes_typanalyze.c b/src/backend/utils/adt/rangetypes_typanalyze.c index 3773f98115..7fcac5621c 100644 --- a/src/backend/utils/adt/rangetypes_typanalyze.c +++ b/src/backend/utils/adt/rangetypes_typanalyze.c @@ -26,6 +26,7 @@ #include "catalog/pg_operator.h" #include "commands/vacuum.h" +#include "statistics/statistics.h" #include "utils/float.h" #include "util
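As a cross-check of the constant being replaced by STATS_MIN_ROWS, the sample-size formula quoted in the removed comment does indeed come out to roughly 300 rows per statistics-target unit (a worked verification of the numbers already stated there, with f = 0.5, gamma = 0.01, n = 10^6):

```latex
r = \frac{4 k \ln(2n/\gamma)}{f^{2}}
  = \frac{4 k \ln(2 \cdot 10^{6} / 0.01)}{0.5^{2}}
  = 16 k \ln(2 \times 10^{8})
  \approx 16 k \times 19.11
  \approx 305.8\, k
```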
Re: Sample rate added to pg_stat_statements
I completely forgot about ordering pg_stat_statements in the test, which is why the test failed in [0]. I've added ORDER BY query COLLATE "C" to avoid any non-deterministic ordering in the table. [0]: https://cirrus-ci.com/task/5724458477944832 -- Best regards, Ilia Evdokimov, Tantor Labs LLC. From be8a69ae36d0a20c945ff12bb385d634ce50a84e Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Tue, 7 Jan 2025 16:43:34 +0300 Subject: [PATCH] Allow setting sample ratio for pg_stat_statements New configuration parameter pg_stat_statements.sample_ratio makes it possible to track just a fraction of the queries meeting the configured threshold, to reduce the amount of tracking. --- .../pg_stat_statements/expected/select.out| 37 ++ .../pg_stat_statements/pg_stat_statements.c | 38 ++- contrib/pg_stat_statements/sql/select.sql | 11 ++ doc/src/sgml/pgstatstatements.sgml| 18 + 4 files changed, 102 insertions(+), 2 deletions(-) diff --git a/contrib/pg_stat_statements/expected/select.out b/contrib/pg_stat_statements/expected/select.out index 37a30af034..8485a7c451 100644 --- a/contrib/pg_stat_statements/expected/select.out +++ b/contrib/pg_stat_statements/expected/select.out @@ -153,6 +153,43 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t; t (1 row) +-- +-- sample statements +-- +SET pg_stat_statements.sample_rate = 0.0; +SELECT 1 AS "int"; + int +- + 1 +(1 row) + +SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C"; + query| calls ++--- + SELECT pg_stat_statements_reset() IS NOT NULL AS t | 1 +(1 row) + +SET pg_stat_statements.sample_rate = 1.0; +SELECT 1 AS "int"; + int +- + 1 +(1 row) + +SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C"; + query | calls ++--- + SELECT $1 AS "int" | 1 + SELECT pg_stat_statements_reset() IS NOT NULL AS t | 1 + SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C" | 0 +(3 rows) + +SELECT pg_stat_statements_reset() IS NOT NULL AS t; + t +--- + t +(1 row) + -- -- queries with locking clauses -- diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index bebf8134eb..4bab2a74ea 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -50,6 +50,7 @@ #include "access/parallel.h" #include "catalog/pg_authid.h" #include "common/int.h" +#include "common/pg_prng.h" #include "executor/instrument.h" #include "funcapi.h" #include "jit/jit.h" @@ -255,6 +256,7 @@ typedef struct pgssSharedState /* Current nesting depth of planner/ExecutorRun/ProcessUtility calls */ static int nesting_level = 0; +static double random_sample = 0.0; /* Saved hook values */ static shmem_request_hook_type prev_shmem_request_hook = NULL; @@ -294,12 +296,17 @@ static bool pgss_track_utility = true; /* whether to track utility commands */ static bool pgss_track_planning = false; /* whether to track planning * duration */ static bool pgss_save = true; /* whether to save stats across shutdown */ +static double pgss_sample_rate = 1; + +/* Is the current top-level query to be sampled? 
*/ +static bool current_query_sampled = false; #define pgss_enabled(level) \ (!IsParallelWorker() && \ (pgss_track == PGSS_TRACK_ALL || \ - (pgss_track == PGSS_TRACK_TOP && (level) == 0))) + (pgss_track == PGSS_TRACK_TOP && (level) == 0)) && \ + current_query_sampled) #define record_gc_qtexts() \ do { \ @@ -414,6 +421,19 @@ _PG_init(void) NULL, NULL); + DefineCustomRealVariable("pg_stat_statements.sample_rate", + "Fraction of queries to track.", + NULL, + &pgss_sample_rate, + 1.0, + 0.0, + 1.0, + PGC_SUSET, + 0, + NULL, + NULL, + NULL); + DefineCustomEnumVariable("pg_stat_statements.track", "Selects which statements are tracked by pg_stat_statements.", NULL, @@ -835,6 +855,11 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate) if (prev_post_parse_analyze_hook) prev_post_parse_analyze_hook(pstate, query, jstate); + random_sample = pg_prng_double(&pg_global_prng_state); + + if (nesting_level == 0) + current_query_sampled = random_sample < pgss_sample_rate; + /* Safety check... */ if (!pgss || !pgss_hash || !pgss_enabled(nesti
Re: Sample rate added to pg_stat_statements
On 07.01.2025 22:29, Sami Imseih wrote: You are right. This is absolutely unexpected for users. Thank you for the review. To fix this, I suggest storing a random number in the [0, 1) range in a separate variable, which will be used for comparisons in any place. We will compare 'sample_rate' and random value not only in pgss_post_parse_analyze(), but also in pgss_ProcessUtility() and pgss_planner(). I attached patch with test and fixes. I still think this is not the correct approach. It seems that post_parse_analyze should not have to deal with checking for a sample rate. This is because post_parse_analyze, which is the only hook with access to jstate, is only responsible for storing a normalized query text on disk and creating a not-yet user visible entry in the hash. i.e. pgss_store will never increment counters when called from pgss_post_parse_analyze. /* Increment the counts, except when jstate is not NULL */ if (!jstate) { What I think may be a better idea is to add something similar to auto_explain.c, but it should only be added to the top of pgss_planner, pgss_ExecutorStart and pgss_ProcessUtility. if (nesting_level == 0) { if (!IsParallelWorker()) current_query_sampled = pg_prng_double(&pg_global_prng_state) < pgss_sample_rate; else current_query_sampled = false; } This is needed for ExecutorStart and not ExecutorEnd because ExecutorStart is where the instrumentation is initialized with queryDesc->totaltime. The above code block could be turned into a macro and reused in the routines mentioned. However, it seems with this change, we can see a much higher likelihood of non-normalized query texts being stored. This is because jstate is only available during post_parse_analyze. Therefore, if the first time you are sampling the statement is in ExecutorEnd, then you will end up storing a non-normalized version of the query text, see below example with the attached v8. postgres=# set pg_stat_statements.sample_rate = 0; SET postgres=# select pg_stat_statements_reset(); pg_stat_statements_reset --- 2025-01-07 13:02:11.807952-06 (1 row) postgres=# SELECT 1 \parse stmt postgres=# \bind_named stmt \g ?column? -- 1 (1 row) postgres=# \bind_named stmt \g ?column? -- 1 (1 row) postgres=# SELECT query, calls FROM pg_stat_statements; query | calls ---+--- (0 rows) postgres=# set pg_stat_statements.sample_rate = 1; SET postgres=# \bind_named stmt \g ?column? -- 1 (1 row) postgres=# \bind_named stmt \g ?column? -- 1 (1 row) postgres=# SELECT query, calls FROM pg_stat_statements; query | calls ---+--- (0 rows) postgres=# \bind_named stmt \g ?column? -- 1 (1 row) postgres=# SELECT query, calls FROM pg_stat_statements; query| calls -+--- SELECT 1| 1 SELECT query, calls FROM pg_stat_statements | 1 (2 rows) postgres=# \bind_named stmt \g ?column? -- 1 (1 row) postgres=# SELECT query, calls FROM pg_stat_statements; query| calls -+--- SELECT 1| 2 SELECT query, calls FROM pg_stat_statements | 2 (2 rows) One idea is to make jstate available to all hooks, and completely remove reliance on post_parse_analyze in pg_stat_statements. I can't find the thread now, but I know this has come up in past discussions when troubleshooting gaps in query normalization. My concern is this feature will greatly increase the likelihood of non-normalized query texts. Also, with regards to the benchmarks, it seems sampling will be beneficial for workloads that are touching a small number of entries with high concurrency. This is why we see benefit for a standard pgbench workload. 
Sampling becomes less beneficial when there is a large set of queries being updated. I still think this is a good approach for workloads that touch a small set of entries.

Regards, Sami

Wow, thank you for pointing this out. Your solution looks interesting, but I'd like to explore other ways to solve the issue besides making jstate available to all hooks. If I don't find anything better, I'll go with yours.

-- Best regards, Ilia Evdokimov, Tantor Labs LLC.
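A minimal sketch of how the per-hook check quoted above could be factored into a macro, as Sami suggests (the macro name is made up for illustration; it relies on the pg_prng.h and access/parallel.h includes already used by pg_stat_statements.c):

```c
/*
 * Decide once per top-level statement whether it is sampled; nested
 * statements reuse the decision made for their top-level statement.
 */
#define pgss_update_sampling_decision() \
	do { \
		if (nesting_level == 0) \
		{ \
			if (!IsParallelWorker()) \
				current_query_sampled = \
					pg_prng_double(&pg_global_prng_state) < pgss_sample_rate; \
			else \
				current_query_sampled = false; \
		} \
	} while (0)
```

The same macro would then be invoked at the top of pgss_planner(), pgss_ExecutorStart(), and pgss_ProcessUtility(), as described above.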
Re: Sample rate added to pg_stat_statements
On 06.01.2025 18:57, Andrey M. Borodin wrote: On 6 Jan 2025, at 15:50, Ilia Evdokimov wrote: Any suggestions for improvements? The patch looks good to me, just a few nits. 0. Perhaps, we could have a test for edge values of 0 and 1. I do not insist, just an idea to think about. I would add tests, because they are never useless. I've added a simple test which verifies hash table with queries after setting sample_rate = 0.0 and sample_rate = 1.0. 1. This code seems a little different from your patch. It is trying to avoid engaging PRNG. I'm not sure it's a good idea, but still. Also, it uses "<=", not "<". xact_is_sampled = log_xact_sample_rate != 0 && (log_xact_sample_rate == 1 || pg_prng_double(&pg_global_prng_state) <= log_xact_sample_rate); Thanks! Best regards, Andrey Borodin. Are we sure we're discussing the same patch? Because these remarks refer to the 5 version of the patch, which I abandoned due to your remarks. On 06.01.2025 20:51, Sami Imseih wrote: Hi, I was looking at this patch, and I was specifically curious about how this works with prepared statements. The way the patch works now is that it determines if the statement is to be sampled at post_parse_analyze time which could lead to unexpected behavior. Let's take an example below in which the pg_stat_statements.sample_rate is set to 0 ( to mimic some sampling rate < 1 in which this query does not get sampled ). At that point, all subsequent executions of the statement will not get tracked at all. Is this what is expected for prepared statements? My concern is we will even lose more stats than what a user may expect. This of course will not be an issue for simple query. postgres=# set pg_stat_statements.sample_rate = 0; SET postgres=# select pg_stat_statements_reset(); pg_stat_statements_reset --- 2025-01-06 11:45:23.484793-06 (1 row) postgres=# SELECT $1 \parse stmt postgres=# postgres=# \bind_named stmt 1 \g ?column? -- 1 (1 row) postgres=# \bind_named stmt 1 \g ?column? -- 1 (1 row) postgres=# set pg_stat_statements.sample_rate = 1; SET postgres=# \bind_named stmt 1 \g ?column? -- 1 (1 row) postgres=# \bind_named stmt 1 \g ?column? -- 1 (1 row) postgres=# SELECT query, calls FROM pg_stat_statements; query | calls ---+--- (0 rows) Regards, Sami Imseih Amazon Web Services (AWS) You are right. This is absolutely unexpected for users. Thank you for the review. To fix this, I suggest storing a random number in the [0, 1) range in a separate variable, which will be used for comparisons in any place. We will compare 'sample_rate' and random value not only in pgss_post_parse_analyze(), but also in pgss_ProcessUtility() and pgss_planner(). I attached patch with test and fixes. If you have any objections or suggestions on how to improve it, I'm always open to feedback. -- Best regards, Ilia Evdokimov, Tantor Labs LLC. From 388c12632610cc6abddd6fc750a1b15712d62e63 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Tue, 7 Jan 2025 14:02:32 +0300 Subject: [PATCH] Allow setting sample ratio for pg_stat_statements New configuration parameter pg_stat_statements.sample_ratio makes it possible to track just a fraction of the queries meeting the configured threshold, to reduce the amount of tracking. 
--- .../pg_stat_statements/expected/select.out| 37 ++ .../pg_stat_statements/pg_stat_statements.c | 38 ++- contrib/pg_stat_statements/sql/select.sql | 11 ++ doc/src/sgml/pgstatstatements.sgml| 18 + 4 files changed, 102 insertions(+), 2 deletions(-) diff --git a/contrib/pg_stat_statements/expected/select.out b/contrib/pg_stat_statements/expected/select.out index 37a30af034..724ec9c8ac 100644 --- a/contrib/pg_stat_statements/expected/select.out +++ b/contrib/pg_stat_statements/expected/select.out @@ -153,6 +153,43 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t; t (1 row) +-- +-- sample statements +-- +SET pg_stat_statements.sample_rate = 0.0; +SELECT 1 AS "int"; + int +- + 1 +(1 row) + +SELECT query, calls FROM pg_stat_statements; + query| calls ++--- + SELECT pg_stat_statements_reset() IS NOT NULL AS t | 1 +(1 row) + +SET pg_stat_statements.sample_rate = 1.0; +SELECT 1 AS "int"; + int +- + 1 +(1 row) + +SELECT query, calls FROM pg_stat_statements; + query| calls ++--- + SELECT $1 AS "int" | 1 + SELECT pg_stat_statements_reset() IS NOT NULL AS t | 1 + SELECT query, calls FROM pg_stat_statements| 0 +(3 rows) + +SELECT pg_stat_statement
Re: Sample rate added to pg_stat_statements
On 10.12.2024 17:35, Ilia Evdokimov wrote: Hi everyone, I attached previous sampling patch for pg_stat_statements (v4). Suggestions like sampling based on execution time remain unfeasible, as pg_stat_statements can track query during query planning, not execution. To evaluate the implementation, I ran a benchmark creating 1,000 random tables and executing randomized JOIN queries on a small machine. When pg_stat_statements enabled, performance decreases, but reducing the sampling rate helps mitigate the impact and improves performance. I’d be interested in hearing your new thoughts. Are there areas where this patch could be improved, or other ideas worth exploring? -- Best regards. Ilia Evdokimov, Tantor Labs LLC. I've fixed some small typos in the documentation and in the GUC description in the attached patch. Any suggestions for improvements? -- Best regards, Ilia Evdokimov, Tantor Labs LLC. From e0955b12e6a71efd6fc5d406e3d93bda4ec57bbf Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Mon, 6 Jan 2025 13:38:22 +0300 Subject: [PATCH] Allow setting sample ratio for pg_stat_statements New configuration parameter pg_stat_statements.sample_ratio makes it possible to track just a fraction of the queries meeting the configured threshold, to reduce the amount of tracking. --- .../pg_stat_statements/pg_stat_statements.c | 23 ++- doc/src/sgml/pgstatstatements.sgml| 18 +++ 2 files changed, 40 insertions(+), 1 deletion(-) diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index bebf8134eb..a8e91f9469 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -50,6 +50,7 @@ #include "access/parallel.h" #include "catalog/pg_authid.h" #include "common/int.h" +#include "common/pg_prng.h" #include "executor/instrument.h" #include "funcapi.h" #include "jit/jit.h" @@ -294,12 +295,16 @@ static bool pgss_track_utility = true; /* whether to track utility commands */ static bool pgss_track_planning = false; /* whether to track planning * duration */ static bool pgss_save = true; /* whether to save stats across shutdown */ +static double pgss_sample_rate = 1; +/* Is the current top-level query to be sampled? */ +static bool current_query_sampled = false; #define pgss_enabled(level) \ (!IsParallelWorker() && \ (pgss_track == PGSS_TRACK_ALL || \ - (pgss_track == PGSS_TRACK_TOP && (level) == 0))) + (pgss_track == PGSS_TRACK_TOP && (level) == 0)) && \ + current_query_sampled) #define record_gc_qtexts() \ do { \ @@ -414,6 +419,19 @@ _PG_init(void) NULL, NULL); + DefineCustomRealVariable("pg_stat_statements.sample_rate", + "Fraction of queries to track.", + NULL, + &pgss_sample_rate, + 1.0, + 0.0, + 1.0, + PGC_SUSET, + 0, + NULL, + NULL, + NULL); + DefineCustomEnumVariable("pg_stat_statements.track", "Selects which statements are tracked by pg_stat_statements.", NULL, @@ -835,6 +853,9 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate) if (prev_post_parse_analyze_hook) prev_post_parse_analyze_hook(pstate, query, jstate); + if (nesting_level == 0) + current_query_sampled = (pg_prng_double(&pg_global_prng_state) < pgss_sample_rate); + /* Safety check... 
*/ if (!pgss || !pgss_hash || !pgss_enabled(nesting_level)) return; diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml index 501b468e9a..37f6788798 100644 --- a/doc/src/sgml/pgstatstatements.sgml +++ b/doc/src/sgml/pgstatstatements.sgml @@ -936,6 +936,24 @@ + + + pg_stat_statements.sample_rate (real) + + pg_stat_statements.sample_rate configuration parameter + + + + + + pg_stat_statements.sample_rate causes pg_stat_statements to only + track a fraction of the statements in each session. The default is 1, + meaning track all the queries. In case of nested statements, + either all will be tracked or none. Only superusers can change this setting. + + + + pg_stat_statements.save (boolean) -- 2.34.1
Re: Remove unused rel parameter in lookup_var_attr_stats
On 03.01.2025 18:42, Fabrízio de Royes Mello wrote: On Fri, Jan 3, 2025 at 11:09 AM Ilia Evdokimov wrote: Hi hackers, I've attached a small patch that removes the unused parameter 'rel' from the static function lookup_var_attr_stats() in extended_stats.c. While exploring src/backend/statistics, I noticed that there are several other parameters which appear to be unused. However, since the discussion around import/export statistics [1] and attribute_stats.c is still ongoing and many functions in other files are non-static, those parameters may yet prove useful. Therefore, I'm only removing the parameter from this static function. LGTM, looks like a leftover from a4d75c86. Regards, -- Fabrízio de Royes Mello True. Moreover, I checked the commit and found nothing else. -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
Removing unused parameter in compute_expr_stats
Hi hackers, I'm sharing a small patch that removes an unused parameter (totalrows) from compute_expr_stats function in extended_stats.c . This function originally appeared in commit a4d75c86bf15220df22de0a92c819ecef9db3849, which introduced extended statistics on expressions. From what I can see, total rows is never used in the current implementation, so this patch removes it If there are reasons to keep this parameter, or if there's anything I might have missed, I'd be happy to discuss further. -- Best regards, Ilia Evdokimov, Tantor Labs LLC. From 0de17516bec761474761bf2862724b23ccb11beb Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Thu, 26 Dec 2024 12:09:14 +0300 Subject: [PATCH v1] Remove unused totalrows parameter in compute_expr_stats The totalrows parameter in compute_expr_stats is never referenced anyway in the function or by its callers. It appears to have been introduced in commit a4d75c86bf15220df22de0a92c819ecef9db3849, but it remained unused --- src/backend/statistics/extended_stats.c | 13 + 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c index 99fdf208db..85da3edb10 100644 --- a/src/backend/statistics/extended_stats.c +++ b/src/backend/statistics/extended_stats.c @@ -89,9 +89,8 @@ typedef struct AnlExprData VacAttrStats *vacattrstat; /* statistics attrs to analyze */ } AnlExprData; -static void compute_expr_stats(Relation onerel, double totalrows, - AnlExprData *exprdata, int nexprs, - HeapTuple *rows, int numrows); +static void compute_expr_stats(Relation onerel, AnlExprData *exprdata, + int nexprs, HeapTuple *rows, int numrows); static Datum serialize_expr_stats(AnlExprData *exprdata, int nexprs); static Datum expr_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull); static AnlExprData *build_expr_data(List *exprs, int stattarget); @@ -220,8 +219,7 @@ BuildRelationExtStatistics(Relation onerel, bool inh, double totalrows, exprdata = build_expr_data(stat->exprs, stattarget); nexprs = list_length(stat->exprs); -compute_expr_stats(onerel, totalrows, - exprdata, nexprs, +compute_expr_stats(onerel, exprdata, nexprs, rows, numrows); exprstats = serialize_expr_stats(exprdata, nexprs); @@ -2107,9 +2105,8 @@ examine_opclause_args(List *args, Node **exprp, Const **cstp, * Compute statistics about expressions of a relation. */ static void -compute_expr_stats(Relation onerel, double totalrows, - AnlExprData *exprdata, int nexprs, - HeapTuple *rows, int numrows) +compute_expr_stats(Relation onerel, AnlExprData *exprdata, + int nexprs, HeapTuple *rows, int numrows) { MemoryContext expr_context, old_context; -- 2.34.1
Re: Exists pull-up application with JoinExpr
Hi Alena, Thank you for your work on subqueries with JOIN. Have you considered the scenario where in subquery includes a qual like (tc.aid = 1)? When I tried executing those queries I receive different results. In my opinion, to prevent this, we should add filters for such quals within the loop 'foreach (lc, all_clauses)' EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) SELECT * FROM ta WHERE EXISTS (SELECT * FROM tb, tc WHERE ta.id = tb.id AND tc.aid = 1); QUERY PLAN -- Hash Join (actual rows=1 loops=1) Hash Cond: (ta.id = tb.id) Buffers: local hit=3 -> Seq Scan on ta (actual rows=3 loops=1) Buffers: local hit=1 -> Hash (actual rows=3 loops=1) Buckets: 4096 Batches: 1 Memory Usage: 33kB Buffers: local hit=2 -> HashAggregate (actual rows=3 loops=1) Group Key: tb.id Batches: 1 Memory Usage: 121kB Buffers: local hit=2 -> Nested Loop (actual rows=3 loops=1) Buffers: local hit=2 -> Seq Scan on tb (actual rows=3 loops=1) Buffers: local hit=1 -> Materialize (actual rows=1 loops=3) Storage: Memory Maximum Storage: 17kB Buffers: local hit=1 -> Seq Scan on tc (actual rows=1 loops=1) Filter: (aid = 1) Rows Removed by Filter: 1 Buffers: local hit=1 (23 rows) EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) SELECT * FROM ta WHERE EXISTS (SELECT * FROM tb JOIN tc ON ta.id = tb.id WHERE tc.aid = 1); QUERY PLAN --- Seq Scan on ta (actual rows=1 loops=1) Filter: EXISTS(SubPlan 1) Rows Removed by Filter: 2 Buffers: local hit=6 SubPlan 1 -> Nested Loop (actual rows=0 loops=3) Buffers: local hit=5 -> Index Only Scan using tb_pkey on tb (actual rows=0 loops=3) Index Cond: (id = ta.id) Heap Fetches: 1 Buffers: local hit=4 -> Seq Scan on tc (actual rows=1 loops=1) Filter: (aid = 1) Buffers: local hit=1 (14 rows) -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
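To make the report above easier to reproduce, here is a minimal schema sketch. The thread does not include the original DDL, so the table definitions and contents below are assumptions inferred from the row counts and "local hit" buffer lines in the two plans (temporary tables, tb with a primary key); the exact plan shapes may differ on other setups.

```sql
-- Hypothetical schema matching the plans above (assumed, not from the thread).
CREATE TEMP TABLE ta (id int);
CREATE TEMP TABLE tb (id int PRIMARY KEY);
CREATE TEMP TABLE tc (aid int);

INSERT INTO ta VALUES (1), (4), (5);   -- only one id overlaps with tb
INSERT INTO tb VALUES (1), (2), (3);
INSERT INTO tc VALUES (1), (2);        -- one row passes the aid = 1 filter
ANALYZE ta, tb, tc;

-- The two query forms compared above; with a correct pull-up both must
-- return the same result set.
SELECT * FROM ta WHERE EXISTS (SELECT * FROM tb, tc WHERE ta.id = tb.id AND tc.aid = 1);
SELECT * FROM ta WHERE EXISTS (SELECT * FROM tb JOIN tc ON ta.id = tb.id WHERE tc.aid = 1);
```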
Re: Sample rate added to pg_stat_statements
On 04.02.2025 20:59, Sami Imseih wrote: To summarize the results of all benchmarks, I compiled them into a table: Thanks for compiling the benchmark data above. The main benefit of this patch will be to give the user a toggle if they are observing high spinlock contention due to pg_stat_statements which will likely occur on larger machines. I also did not see this thread [1] mentioned in the thread above, but one of the reasons pg_stat_statements.track_planning was turned off by default was due to the additional spinlock acquire to track the planning stats. Bringing this up as sample_rate might also be beneficial as well if a user decides to track planning. Regards, Sami [1] https://www.postgresql.org/message-id/flat/2895b53b033c47ccb22972b589050...@ex13d05uwc001.ant.amazon.com Thanks for the thread. As we can see, simply enabling or disabling track_planning, or other pg_stat_statements parameters, is too radical a measure for managing performance. With the introduction of sample_rate, users have more flexibility in controlling spinlock contention: they can balance overhead against statistics collection rather than completely disabling the feature. -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
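For readers who want to try the trade-off described above, a minimal usage sketch. The GUC name, range, and default are taken from the patch in this thread; since it is defined as PGC_SUSET, only a superuser can change it.

```sql
-- Assumes the patched pg_stat_statements is in shared_preload_libraries.
-- Track roughly 10% of top-level statements instead of all of them.
ALTER SYSTEM SET pg_stat_statements.sample_rate = 0.1;
SELECT pg_reload_conf();

-- Or per session, as a superuser:
SET pg_stat_statements.sample_rate = 0.1;

-- Revert to tracking everything (the default):
SET pg_stat_statements.sample_rate = 1.0;
```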
Re: Avoid possible deference NULL pointer (src/backend/optimizer/path/allpaths.c)
But what should we do if cheapest == NULL further on? Should get_cheapest_parameterized_child_path() then return NULL? If so, we would have to write it like this: if (cheapest == NULL || bms_equal(PATH_REQ_OUTER(cheapest), required_outer)) return cheapest; I'll look into this issue further. -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
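For clarity, here is a sketch of what that call site in get_cheapest_parameterized_child_path() would look like under the variant being discussed. This is only an illustration of the idea, not a tested change; the surrounding code is paraphrased from allpaths.c.

```c
/*
 * Sketch of the discussed variant: drop the Assert and fold the NULL case
 * into the existing test (illustration only, not a tested patch).
 */
cheapest = get_cheapest_path_for_pathkeys(childrel->pathlist,
                                          NIL, required_outer,
                                          TOTAL_COST, false);

if (cheapest == NULL ||
    bms_equal(PATH_REQ_OUTER(cheapest), required_outer))
    return cheapest;    /* possibly NULL -- the caller would have to cope */
```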
Re: explain analyze rows=%.0f
On 08.02.2025 00:01, Matheus Alcantara wrote: Em sex., 7 de fev. de 2025 às 17:41, Ilia Evdokimov escreveu: Strange, I don't have any problems to apply it on master. postgres$ git branch * master postgres$ git pull Already up to date. postgres$ git apply v6-0001-Clarify-display-of-rows-and-loops-as-decimal-fraction.patch postgres$ Just for reference I'm trying to apply based on commit fb056564ec5. https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=fb056564ec5bc1c18dd670c963c893cdb6de927e You are right, because two commits were appeared after creating v6-patch on partition_prune.out and patch v6 must not have applied on master. Then I created v7 patch rebased on fb056564ec5 . Thank for your remark! -- Best regards, Ilia Evdokimov, Tantor Labs LLC. From d94d30618b6f363a54afe5457ef8fe89476e096e Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Sat, 8 Feb 2025 00:15:54 +0300 Subject: [PATCH] Clarify display of rows and loops as decimal fractions When the average number of rows is small compared to the number of loops, both rows and loops are displayed as decimal fractions with two digits after the decimal point in EXPLAIN ANALYZE. --- doc/src/sgml/perform.sgml | 6 ++- src/backend/commands/explain.c| 49 +-- src/test/regress/expected/partition_prune.out | 10 ++-- src/test/regress/sql/partition_prune.sql | 2 +- 4 files changed, 44 insertions(+), 23 deletions(-) diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml index a502a2aaba..3f13d17fe9 100644 --- a/doc/src/sgml/perform.sgml +++ b/doc/src/sgml/perform.sgml @@ -757,7 +757,11 @@ WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2; comparable with the way that the cost estimates are shown. Multiply by the loops value to get the total time actually spent in the node. In the above example, we spent a total of 0.030 milliseconds -executing the index scans on tenk2. +executing the index scans on tenk2. If a subplan node +is executed multiple times and the average number of rows is less than one, +the rows and loops values are shown as a decimal fraction +(with two digits after the decimal point) to indicate that some rows +were actually processed rather than simply rounding down to zero. 
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index c24e66f82e..200294b756 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -1981,14 +1981,14 @@ ExplainNode(PlanState *planstate, List *ancestors, if (es->format == EXPLAIN_FORMAT_TEXT) { + appendStringInfo(es->str, " (actual"); if (es->timing) -appendStringInfo(es->str, - " (actual time=%.3f..%.3f rows=%.0f loops=%.0f)", - startup_ms, total_ms, rows, nloops); +appendStringInfo(es->str, " time=%.3f..%.3f", startup_ms, total_ms); + + if (nloops > 1 && planstate->instrument->ntuples < nloops) +appendStringInfo(es->str," rows=%.2f loops=%.2f)", rows, nloops); else -appendStringInfo(es->str, - " (actual rows=%.0f loops=%.0f)", - rows, nloops); +appendStringInfo(es->str," rows=%.0f loops=%.0f)", rows, nloops); } else { @@ -1999,8 +1999,16 @@ ExplainNode(PlanState *planstate, List *ancestors, ExplainPropertyFloat("Actual Total Time", "ms", total_ms, 3, es); } - ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es); - ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es); + if (nloops > 1 && planstate->instrument->ntuples < nloops) + { +ExplainPropertyFloat("Actual Rows", NULL, rows, 2, es); +ExplainPropertyFloat("Actual Loops", NULL, nloops, 2, es); + } + else + { +ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es); +ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es); + } } } else if (es->analyze) @@ -2052,14 +2060,14 @@ ExplainNode(PlanState *planstate, List *ancestors, if (es->format == EXPLAIN_FORMAT_TEXT) { ExplainIndentText(es); +appendStringInfo(es->str, "actual"); if (es->timing) - appendStringInfo(es->str, - "actual time=%.3f..%.3f rows=%.0f loops=%.0f\n", - startup_ms, total_ms, rows, nloops); + appendStringInfo(es->str, " time=%.3f..%.3f", startup_ms, total_ms); + +if (nloops > 1 && planstate->instrument->ntuples < nloops) + appendStringInfo(es->str," rows=%.2f loops=%.2f)", rows, nloops); else - appendStringInfo(es->str, - "actual rows=%.0f loops=%.0f\n", - rows, nloops); + appendStringInfo(es->str," rows=%.0f loops=%.0f)", rows, nloop
Re: Sample rate added to pg_stat_statements
Hi hackers, Since the current patch is in the commitfest with the status 'Ready for committer', I'd like to summarize what it does, the problems it addresses, and answer the key questions raised in the discussion thread. Enabling pg_stat_statements can cause a performance drop for two main reasons:

* Contention on the exclusive lock, which blocks the hash table when allocating or deallocating an entry.
* Contention on the spinlock, which blocks access to an existing entry.

The exclusive lock issue can be mitigated by normalizing similar queries to the same queryid. There is an ongoing thread discussing this approach for IN and ARRAY queries: [0]. Another solution is integrating pg_stat_statements into central pgstats with the custom APIs, without pushing the module into core [1]. However, even if we do all that, the spinlock contention problem will persist for very frequent queries. This is exactly what I aim to solve with sampling. Other approaches, such as replacing the spinlock with atomic variables or an lwlock, have not yielded good results [2]. Even track_planning was disabled by default to reduce contention, but on high-core systems the issue still remains. So far, I see no alternative solution to this problem. The benchmark results for different numbers of CPUs demonstrating the impact of this patch can be found here: pgbench -c{CPUs} -j20 -S -Mprepared -T120

CPUs | sample_rate | tps     | CPU waits | ClientRead wait | SpinDelay wait
192  | 1.0         |  484338 |      9568 |             929 |          11107
192  | 0.75        |  909547 |     12079 |            2100 |           4781
192  | 0.5         | 1028594 |     13253 |            3378 |            174
192  | 0.25        | 1019507 |     13397 |            3423 |              -
192  | 0.0         | 1015425 |     13106 |            3502 |              -
48   | 1.0         |  643251 |       911 |             932 |             44
48   | 0.75        |  653946 |       939 |             939 |              3
48   | 0.5         |  651654 |       841 |             932 |              -
48   | 0.25        |  652668 |       860 |             910 |              -
48   | 0.0         |  659111 |       849 |             882 |              -
32   | 1.0         |  620667 |      1782 |             560 |              -
32   | 0.75        |  620663 |      1736 |             554 |              -
32   | 0.5         |  624094 |      1741 |             648 |              -
32   | 0.25        |  628638 |      1702 |             576 |              -
32   | 0.0         |  630483 |      1638 |             574 |              -

Some suggested sampling based on execution time instead of a random generator. While this is a good idea, it has a critical limitation. Execution time can only be checked in pgss_ExecutorEnd. However, pgss_planner already interacts with the spinlock, and we cannot sample based on execution time at that stage. Finding a criterion that works for both pgss_planner and pgss_ExecutorEnd is difficult. This is why random sampling remains the most viable option. BTW, instead of using a bool flag to check if we are inside pgss_post_parse_analyze, we now pass a pgssStoreKind to pgss_enabled. This makes the logic clearer and more readable, explicitly indicating where we need to check whether a query should be sampled. I made this change in the new v15 patch. Are there any remaining concerns or alternative suggestions regarding this approach? I'm happy to discuss further. [0]: https://commitfest.postgresql.org/51/2837/ [1]: https://www.postgresql.org/message-id/Zz0MFPq1s4WFxxpL%40paquier.xyz [2]: https://commitfest.postgresql.org/29/2634/ -- Best regards, Ilia Evdokimov, Tantor Labs LLC. From c9540fac9f6d6ad6d15acdaad844ca6c4ecba455 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Mon, 10 Feb 2025 11:00:46 +0300 Subject: [PATCH] Allow setting sample rate for pg_stat_statements New configuration parameter pg_stat_statements.sample_rate makes it possible to track just a fraction of the queries meeting the configured threshold, to reduce the amount of tracking.
--- contrib/pg_stat_statements/Makefile | 2 +- .../pg_stat_statements/expected/sampling.out | 174 ++ contrib/pg_stat_statements/meson.build| 1 + .../pg_stat_statements/pg_stat_statements.c | 53 +- contrib/pg_stat_statements/sql/sampling.sql | 50 + doc/src/sgml/pgstatstatements.sgml| 19 ++ 6 files changed, 291 insertions(+), 8 deletions(-) create mode 100644 contrib/pg_stat_statements/expected/sampling.out create mode 100644 contrib/pg_stat_statements/sql/sampling.sql diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile index 241c02587b..b70bdfaf2d 100644 --- a/contrib/pg_stat_statements/Makefile +++ b/contrib/pg_stat_statements/Makefile @@ -20,7 +20,7 @@ LDFLAGS_SL += $(filter -lm, $(LIBS)) REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_stat_statements/pg_stat_statements.conf REGRESS = select dml cursors utility level_tracking planning \ user_activity wal entry_timestamp privileges extended \ - parallel cleanup
Re: explain analyze rows=%.0f
On 07.02.2025 22:59, Matheus Alcantara wrote: Hi v6 is not applying on master, could you please rebase? git apply v6-0001-Clarify-display-of-rows-and-loops-as-decimal-fraction.patch error: patch failed: src/test/regress/expected/partition_prune.out:3041 error: src/test/regress/expected/partition_prune.out: patch does not apply Hi Strange, I don't have any problems applying it on master. postgres$ git branch * master postgres$ git pull Already up to date. postgres$ git apply v6-0001-Clarify-display-of-rows-and-loops-as-decimal-fraction.patch postgres$ -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
Re: explain analyze rows=%.0f
On 10.02.2025 18:32, Matheus Alcantara wrote: Thanks for the new patch version. -- v7-0001-Clarify-display-of-rows-and-loops-as-decimal-fraction.patch +if (nloops > 1 && planstate->instrument->ntuples < nloops) +appendStringInfo(es->str," rows=%.2f loops=%.2f)", rows, nloops); Sorry, but why do the 'loops' need to be changed? IIUC the issue is only with the rows field? When testing this patch I got some results similar with this: Index Scan using t_a_idx on t (cost=0.14..0.28 rows=1 width=8) (actual time=0.049..0.049 rows=0.50 loops=2.00) The only reason I added this was to make it clear to the user that loops > 1, along with the 'rows' value. If no one finds it useful and it only raises more questions, I can remove it. When the total number of returned tuples is less than the number of loops currently shows 'rows = 0'. This can mislead users into thinking that no rows were returned at all, even though some might have appeared occasionally. I think that this can happen when the returned rows and the loops are small enough to result in a 'row' value like 0.00045? I'm not sure if we have "bigger" values (e.g 1074(ntuples) / 117(nloops) which would result in 9.17 rows) this would also be true, what do you think? If you could provide an example of this case would be great! Based on what was discussed earlier in the thread, there are cases with large loops [0]. However, I believe it's better not to display average rows with excessively long digits or in scientific notation. And, of course, I agree with you regarding small values. I think we should also add a check to ensure that the total rows is actually greater than zero. When the total rows is zero, we could simply display it as an integer without decimals. It could help users average rows is very small but not zero. What do you think about this approach? -executing the index scans on tenk2. +executing the index scans on tenk2. If a subplan node +is executed multiple times and the average number of rows is less than one, +the rows and loops values are shown as a decimal fraction +(with two digits after the decimal point) to indicate that some rows +were actually processed rather than simply rounding down to zero. * I think that it would be good to mention what a 'row' value in decimal means. For example, if its says "0.1 rows" the user should assume that typically 0 rows will be returned but sometimes it can return 1 or more. * There are more spaces than necessary before "If a subplan node ..." * Maybe wrap 'rows' with ? I agree with the last two points. As for the first one—maybe we could simply state that the average rows value can be decimal, especially for very small values? [0]: https://www.postgresql.org/message-id/a9393107-6bb9-c835-50b7-c0f453a514b8%40postgrespro.ru -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
Re: Avoid possible deference NULL pointer (src/backend/optimizer/path/allpaths.c)
On 05.02.2025 21:56, Tom Lane wrote: It's not a bug. Since the call specifies NIL pathkeys (meaning it doesn't care about sort order) and does not insist on a parallel-safe path, there should always be a path that satisfies it. The only way it could fail to find a path is if the rel's pathlist is entirely empty, a case already rejected by the sole caller. Moreover, the argument that we might not have gotten a report is not credible. In a production build without an Assert, the next line would still dump core just as effectively if the result were NULL. While the proposed patch doesn't break anything, it's removing a logic cross-check that was put there intentionally. So I don't find it to be an improvement. regards, tom lane Ah, if this Assert was intentionally added to ensure that a path must always be found under the given conditions, and that any issues with this can be detected in the right place, then I agree. The patch likely makes things worse. -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
Re: Showing applied extended statistics in explain Part 2
On 05.02.2025 09:28, Tatsuro Yamada wrote: Hi All, Thank you everyone for your cooperation with comments on the patch and solution ideas. I am sorting through your review comments now. And after rebasing the patch, I plan to send a patch that addresses the comments as much as possible to -hackers by Feb 21 at the latest. Therefore, the status of the patch for this feature from last month's commit fest will be "moving to the next commit fest". P.S. Thanks for letting me know, Ilia. Thanks, Tatsuro Yamada Thanks for your work! Regarding dependency statistics, I noticed that we can't apply them individually, which makes it difficult to map specific clauses to individual stats. However, why not display the dependency statistics above the execution plan as a list? I believe it's crucial to highlight when dependency statistics are applied. Analyzing dependencies stats often takes the longest compared to other statistics, and if users could see the extended statistics for dependencies upfront, they might be able to remove them if they're not useful for the plan. -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
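To make the discussion concrete, here is a small, hypothetical example of the kind of dependencies statistics whose display in EXPLAIN is being discussed; the table, column names, and query are invented, and what exactly the plan would show is the open question of this thread.

```sql
-- Hypothetical example: city mostly determines country, so a functional
-- dependency is worth recording for correlated WHERE clauses.
CREATE TABLE orders (country text, city text, amount numeric);
CREATE STATISTICS orders_dep (dependencies) ON country, city FROM orders;
ANALYZE orders;

-- The proposal is to show, somewhere around the plan output, that orders_dep
-- was applied while estimating this clause list.
EXPLAIN SELECT * FROM orders WHERE country = 'US' AND city = 'Portland';
```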
Re: Avoid possible deference NULL pointer (src/backend/optimizer/path/allpaths.c)
On 05.01.2025 02:29, Ranier Vilela wrote: Hi. Per Coverity. All call sites of function *get_cheapest_path_for_pathkeys* checks for NULL returns. So, it is highly likely that the function will return NULL. IMO, the Assert in this particular call, is not fully effective. Fix removing the Assert and always check if the return is NULL. best regards, Ranier Vilela Hi! Thanks for noticing this. If this happens in the planner, it poses a serious risk of a segmentation fault that could crash the instance if a NULL pointer is dereferenced. Since checking for NULL is very cheap, I support this patch. -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
Tests for extended MCV stats compatibility with multi-filter conditions on a single column
Hi hackers, When a query uses multiple filters on a single column in the WHERE clause, and extended mcv statistics are applied on the column, the optimizer should take each clause into account when estimating selectivity. For example, this could be (A = 0 OR A = 1), (A BETWEEN 0 AND 5), or (A > 0 AND A <= 5). Such range conditions are quite common in practice. However, I did not find any tests covering these scenarios in the stats_ext.sql regression test. I propose adding tests for these cases. This will make the behavior of MCV extended statistics more clear to users and developers. Any feedback, suggestions, or additions to the tests are welcome and appreciated. -- Best regards, Ilia Evdokimov, Tantor Labs LLC. From e2b2e591e0df5a7e2ed589182b204781229d5754 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Fri, 7 Feb 2025 00:29:53 +0300 Subject: [PATCH v1] Add tests for extended MCV statistics with OR, BETWEEN Since extended MCV statistics statistics are compatible with conditions like OR (e.g., A = 0 OR A = 1); AND, BETWEEN (e.g., A >= 0 AND A < 3), this patch adds tests to verify their correct handling. --- src/test/regress/expected/stats_ext.out | 24 src/test/regress/sql/stats_ext.sql | 8 2 files changed, 32 insertions(+) diff --git a/src/test/regress/expected/stats_ext.out b/src/test/regress/expected/stats_ext.out index a4c7be487e..f26d07261e 100644 --- a/src/test/regress/expected/stats_ext.out +++ b/src/test/regress/expected/stats_ext.out @@ -2803,6 +2803,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a = 0 6 |102 (1 row) +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a = 0 OR a = 10) AND (b = 0 OR b = 10) AND (c = 0 OR c = 10)'); + estimated | actual +---+ + 1 |104 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a BETWEEN 7 AND 10) AND (b BETWEEN 7 AND 10) AND (c BETWEEN 7 AND 10)'); + estimated | actual +---+ + 1 |308 +(1 row) + CREATE STATISTICS mcv_lists_partial_stats (mcv) ON a, b, c FROM mcv_lists_partial; ANALYZE mcv_lists_partial; @@ -2854,6 +2866,18 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a = 0 108 |102 (1 row) +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a = 0 OR a = 10) AND (b = 0 OR b = 10) AND (c = 0 OR c = 10)'); + estimated | actual +---+ + 104 |104 +(1 row) + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a BETWEEN 7 AND 10) AND (b BETWEEN 7 AND 10) AND (c BETWEEN 7 AND 10)'); + estimated | actual +---+ + 309 |308 +(1 row) + DROP TABLE mcv_lists_partial; -- check the ability to use multiple MCV lists CREATE TABLE mcv_lists_multi ( diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql index 5c786b16c6..b6fe084675 100644 --- a/src/test/regress/sql/stats_ext.sql +++ b/src/test/regress/sql/stats_ext.sql @@ -1410,6 +1410,10 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a = 0 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a = 0 AND b = 0) OR (a = 0 AND c = 0) OR (b = 0 AND c = 0)'); +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a = 0 OR a = 10) AND (b = 0 OR b = 10) AND (c = 0 OR c = 10)'); + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a BETWEEN 7 AND 10) AND (b BETWEEN 7 AND 10) AND (c BETWEEN 7 AND 10)'); + CREATE STATISTICS mcv_lists_partial_stats (mcv) ON a, b, c FROM mcv_lists_partial; @@ -1431,6 +1435,10 @@ SELECT * 
FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a = 0 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a = 0 AND b = 0) OR (a = 0 AND c = 0) OR (b = 0 AND c = 0)'); +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a = 0 OR a = 10) AND (b = 0 OR b = 10) AND (c = 0 OR c = 10)'); + +SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_partial WHERE (a BETWEEN 7 AND 10) AND (b BETWEEN 7 AND 10) AND (c BETWEEN 7 AND 10)'); + DROP TABLE mcv_lists_partial; -- check the ability to use multiple MCV lists -- 2.34.1
Re: explain analyze rows=%.0f
On 10.02.2025 23:43, Matheus Alcantara wrote: When the total number of returned tuples is less than the number of loops currently shows 'rows = 0'. This can mislead users into thinking that no rows were returned at all, even though some might have appeared occasionally. I think that this can happen when the returned rows and the loops are small enough to result in a 'row' value like 0.00045? I'm not sure if we have "bigger" values (e.g 1074(ntuples) / 117(nloops) which would result in 9.17 rows) this would also be true, what do you think? If you could provide an example of this case would be great! Based on what was discussed earlier in the thread, there are cases with large loops [0]. However, I believe it's better not to display average rows with excessively long digits or in scientific notation. And, of course, I agree with you regarding small values. I think we should also add a check to ensure that the total rows is actually greater than zero. When the total rows is zero, we could simply display it as an integer without decimals. It could help users average rows is very small but not zero. What do you think about this approach? Yeah, I agree with you about the long digits. My question is more about why do we need the planstate->instrument->ntuples < nloops check? I tried to remove this check and I got a lot of EXPLAIN output that shows 'rows' values with .00, so I'm just trying to understand the reason. From what I've understood about this thread is that just avoiding .00 decimals of 'rows' values that could be just integers would be enough, is that right or I'm missing something here? I'm just worried if we could have a scenario where nloops > 1 && planstate->instrument->ntuples < nloops which would make the 'rows' not be formatted correctly. Sorry for missing your question earlier. If you notice in the code above, the variable(average) 'rows' is defined as: double rows = planstate->instrument->ntuples / nloops; This represents the total rows divided by the number of loops. The condition means that variable 'rows' will always between zero and one. Therefore, the average rows under such conditions cannot be greater than or even equal to one. I wrote this condition specifically to avoid the verbose expression 'rows > 0 && rows < 1'. However, since this might not be obvious to everyone, perhaps it'd be better to write is using 'rows' directly or add a comment explaining this logic. -executing the index scans on tenk2. +executing the index scans on tenk2. If a subplan node +is executed multiple times and the average number of rows is less than one, +the rows and loops values are shown as a decimal fraction +(with two digits after the decimal point) to indicate that some rows +were actually processed rather than simply rounding down to zero. * I think that it would be good to mention what a 'row' value in decimal means. For example, if its says "0.1 rows" the user should assume that typically 0 rows will be returned but sometimes it can return 1 or more. * There are more spaces than necessary before "If a subplan node ..." * Maybe wrap 'rows' with ? I agree with the last two points. As for the first one—maybe we could simply state that the average rows value can be decimal, especially for very small values? I'm just not sure about the "small values"; the 'rows' in decimal will only happen with small values? What would be a "small value" in this context? 
My main point here is more that I think that it would be good to mention *why* the 'rows' can be decimal, not just describe that it could be decimal. As for 'small values', it means that the average rows is between zero and one, to avoid rounding errors and misunderstanding. I think this would be ideal. -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
Re: pull-up subquery if JOIN-ON contains refs to upper-query
On 09.02.2025 18:14, Alena Rybakina wrote: Hi! I found another example where the transformation worked incorrectly and reconsidered the idea. As for conversion of exists_sublink_to_ANY, we need to get the flattened implicit-AND list of clauses and pull out the chunks of the WHERE clause that belong to the parent query, since we are called halfway through the parent's preprocess_expression() and earlier steps of preprocess_expression() wouldn't get applied to the pulled-up stuff unless we do them here. We also do some processing for vars depending on which side the var is on - if it's in a subquery, we only need to lower its level (varlevel) because subquery will be flatted, while for other vars that belong to the parent query, we need to do preparation to pull up the sub-select into top range table. For those expressions that we couldn't assign to either list, we define newWhere and apply both cases. When I run 'make -C contrib/ check', tests of postgres_fdw extension failed. I might be wrong, but you should be careful with LIMIT. -- Best regards, Ilia Evdokimov, Tantor Labs LLC. diff -U3 /home/ilia/postgres/contrib/postgres_fdw/expected/postgres_fdw.out /home/ilia/postgres/contrib/postgres_fdw/results/postgres_fdw.out --- /home/ilia/postgres/contrib/postgres_fdw/expected/postgres_fdw.out 2025-02-10 23:31:19.353059650 +0300 +++ /home/ilia/postgres/contrib/postgres_fdw/results/postgres_fdw.out 2025-02-10 23:46:43.249745683 +0300 @@ -4835,13 +4835,15 @@ EXISTS (SELECT 1 FROM ft4 WHERE ft4.c2 = ft2.c2)) AND ft2.c1 > 900 ORDER BY ft2.c1 LIMIT 10; - QUERY PLAN - Foreign Scan + QUERY PLAN + + Limit Output: ft2.c1, ft2.c2, ft2.c3, ft2.c4, ft2.c5, ft2.c6, ft2.c7, ft2.c8 - Relations: (public.ft2) SEMI JOIN ((public.ft2 ft2_1) SEMI JOIN (public.ft4)) - Remote SQL: SELECT r1."C 1", r1.c2, r1.c3, r1.c4, r1.c5, r1.c6, r1.c7, r1.c8 FROM "S 1"."T 1" r1 WHERE ((r1."C 1" > 900)) AND EXISTS (SELECT NULL FROM "S 1"."T 1" r3 WHERE ((r1."C 1" = r3."C 1")) AND EXISTS (SELECT NULL FROM "S 1"."T 3" r4 WHERE ((r3.c2 = r4.c2 ORDER BY r1."C 1" ASC NULLS LAST LIMIT 10::bigint -(4 rows) + -> Foreign Scan + Output: ft2.c1, ft2.c2, ft2.c3, ft2.c4, ft2.c5, ft2.c6, ft2.c7, ft2.c8 + Relations: (public.ft2) SEMI JOIN ((public.ft2 ft2_1) SEMI JOIN (public.ft4)) + Remote SQL: SELECT r1."C 1", r1.c2, r1.c3, r1.c4, r1.c5, r1.c6, r1.c7, r1.c8 FROM "S 1"."T 1" r1 WHERE ((r1."C 1" > 900)) AND EXISTS (SELECT NULL FROM "S 1"."T 1" r3 WHERE ((r1."C 1" = r3."C 1")) AND EXISTS (SELECT NULL FROM "S 1"."T 3" r4 WHERE ((r3.c2 = r4.c2 ORDER BY r1."C 1" ASC NULLS LAST +(6 rows) SELECT ft2.* FROM ft2 WHERE c1 = ANY ( @@ -4871,13 +4873,20 @@ EXISTS (SELECT 1 FROM ft4 WHERE ft4.c2 = ft2.c2) AND c1 = ftupper.c1 ) AND ftupper.c1 > 900 ORDER BY ftupper.c1 LIMIT 10; - QUERY PLAN -
Re: Move wal_buffers_full to WalUsage (and report it in pgss/explain)
Hi, Thank you for your work!

1. Perhaps in EXPLAIN you forgot to check that usage->wal_buffers_full > 0: if ((usage->wal_records > 0) || (usage->wal_fpi > 0) || (usage->wal_bytes > 0))

2. I have a small suggestion for pg_stat_statements: would it make sense to move wal_buffers_full next to wal_records, wal_fpi and wal_bytes? This way, all WAL-related information would be grouped together.

-- Best regards, Ilia Evdokimov, Tantor Labs LLC.
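A sketch of the check being suggested in point 1; the surrounding EXPLAIN code is paraphrased, since the patch itself is not quoted here.

```c
/*
 * Sketch: also consider wal_buffers_full when deciding whether to print
 * the WAL usage line in EXPLAIN, so a non-zero counter is not skipped
 * when the other three counters happen to be zero.
 */
if ((usage->wal_records > 0) || (usage->wal_fpi > 0) ||
    (usage->wal_bytes > 0) || (usage->wal_buffers_full > 0))
{
    /* ... emit the "WAL:" line ... */
}
```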
Re: 2025-02-13 release announcement draft
On 11.02.2025 03:09, Jonathan S. Katz wrote: Hi, Attached is a draft of the 2025-02-13 update release announcement. Please provide feedback no later than 2025-02-13 0:00 UTC. Thanks, Jonathan

"you may simply shutdown PostgreSQL"

should read

"you may simply shut down PostgreSQL"

-- Best regards, Ilia Evdokimov, Tantor Labs LLC.
Re: explain analyze rows=%.0f
On 12.02.2025 11:54, Andrei Lepikhov wrote: May we agree on a more general formula to print at least two meaningful digits if we have a fractional part? Examples:
- actual rows = 2, nloops = 2 -> rows = 1
- actual rows = 9, nloops = 5 -> rows = 1.8
- actual rows = 101, nloops = 100 -> rows = 1.0
- actual rows = 99, nloops = 100 -> rows = 0.99
It may guarantee that an EXPLAIN exposes most of the data passed the node, enough to tangle it with actual timing and tuples at the upper levels of the query. I think the idea of keeping two significant digits after the decimal point is quite reasonable. The thing is, rows=0.01 or something similar can only occur when loops is quite large. If we show the order of magnitude in rows, it will be easier for the user to estimate the order of total rows. For example, if we see this: rows=0.56 loops=4718040 the user can more quickly approximate the order of total rows for analyzing the upper levels of the query. However, keep in mind that I am against using the E notation, as many users have mentioned that they are not mathematicians and are not familiar with the concept of "E". I suggest that when thinking about what to change here, you start by considering how you'd adjust the docs at https://www.postgresql.org/docs/devel/using-explain.html to explain the new behavior. If you can't explain it clearly for users, then maybe it's not such a great idea. Agree. So do I. First, I'll think about how to explain it. -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
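To illustrate Andrei's examples above, here is a rough sketch of how the number of decimal digits could be chosen. This is a hypothetical helper, not part of any posted patch, and under this simplistic rule values below 0.01 would still come out as 0.00.

```c
#include <math.h>

/*
 * Hypothetical helper: choose how many decimal digits to print so that
 * roughly two significant digits survive, matching the examples above:
 *   2/2     -> "1"     (exact integer, keep the old integer output)
 *   9/5     -> "1.8"
 *   101/100 -> "1.0"
 *   99/100  -> "0.99"
 */
static int
rows_decimal_digits(double rows)
{
    if (rows == floor(rows))
        return 0;           /* no fractional part */
    if (rows < 1.0)
        return 2;           /* e.g. 0.99, 0.01 */
    return 1;               /* e.g. 1.8, and 1.01 prints as 1.0 */
}
```

A call site could then use it as appendStringInfo(es->str, " rows=%.*f", rows_decimal_digits(rows), rows).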
[PATCH] Fix incorrect range in pg_regress comment
Hi hackers, I noticed that a comment in pg_regress incorrectly states that alternative output files can be named filename{_i}.out with 0 < i <= 9. However, the actual valid range is 0 <= i <= 9. This patch corrects the comment. The fix is trivial but ensures that the documentation in the code accurately reflects the behavior of pg_regress. Attached is a small patch correcting this. -- Best regards, Ilia Evdokimov, Tantor Labs LLC. From d2ef1b2ff2f42bc38eb2d2d87201a2822d53d8b5 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Sun, 2 Feb 2025 21:35:10 +0300 Subject: [PATCH v1] Fix incorrect range in pg_regress comment The comment in pg_regress incorrectly stated that alternative output files could be named test{_i}.out with 0 < i <= 9. However, the valid range is actually 0 <= i <= 9. --- src/test/regress/pg_regress.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c index 61a234ae21..5d85dcc62f 100644 --- a/src/test/regress/pg_regress.c +++ b/src/test/regress/pg_regress.c @@ -1335,7 +1335,7 @@ make_directory(const char *dir) } /* - * In: filename.ext, Return: filename_i.ext, where 0 < i <= 9 + * In: filename.ext, Return: filename_i.ext, where 0 <= i <= 9 */ static char * get_alternative_expectfile(const char *expectfile, int i) -- 2.34.1
Re: Sample rate added to pg_stat_statements
Hi everyone, I attached previous sampling patch for pg_stat_statements (v4). Suggestions like sampling based on execution time remain unfeasible, as pg_stat_statements can track query during query planning, not execution. To evaluate the implementation, I ran a benchmark creating 1,000 random tables and executing randomized JOIN queries on a small machine. When pg_stat_statements enabled, performance decreases, but reducing the sampling rate helps mitigate the impact and improves performance. I’d be interested in hearing your new thoughts. Are there areas where this patch could be improved, or other ideas worth exploring? -- Best regards. Ilia Evdokimov, Tantor Labs LLC. From 68f5451019b261bf03a222f5a05ac57cd0fb9a24 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Thu, 21 Nov 2024 11:24:03 +0300 Subject: [PATCH] Allow setting sample ratio for pg_stat_statements New configuration parameter pg_stat_statements.sample_ratio makes it possible to track just a fraction of the queries meeting the configured threshold, to reduce the amount of tracking. --- .../pg_stat_statements/pg_stat_statements.c | 23 ++- doc/src/sgml/pgstatstatements.sgml| 18 +++ 2 files changed, 40 insertions(+), 1 deletion(-) diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index 49c657b3e0..d06e0d8a44 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -49,6 +49,7 @@ #include "access/parallel.h" #include "catalog/pg_authid.h" +#include "common/pg_prng.h" #include "common/int.h" #include "executor/instrument.h" #include "funcapi.h" @@ -294,12 +295,16 @@ static bool pgss_track_utility = true; /* whether to track utility commands */ static bool pgss_track_planning = false; /* whether to track planning * duration */ static bool pgss_save = true; /* whether to save stats across shutdown */ +static double pgss_sample_rate = 1; +/* Is the current top-level query to be sampled? */ +static bool current_query_sampled = false; #define pgss_enabled(level) \ (!IsParallelWorker() && \ (pgss_track == PGSS_TRACK_ALL || \ - (pgss_track == PGSS_TRACK_TOP && (level) == 0))) + (pgss_track == PGSS_TRACK_TOP && (level) == 0)) && \ + current_query_sampled) #define record_gc_qtexts() \ do { \ @@ -414,6 +419,19 @@ _PG_init(void) NULL, NULL); + DefineCustomRealVariable("pg_stat_statements.sample_rate", + "Fraction of queries to process.", + NULL, + &pgss_sample_rate, + 1.0, + 0.0, + 1.0, + PGC_SUSET, + 0, + NULL, + NULL, + NULL); + DefineCustomEnumVariable("pg_stat_statements.track", "Selects which statements are tracked by pg_stat_statements.", NULL, @@ -835,6 +853,9 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate) if (prev_post_parse_analyze_hook) prev_post_parse_analyze_hook(pstate, query, jstate); + if (nesting_level == 0) + current_query_sampled = (pg_prng_double(&pg_global_prng_state) < pgss_sample_rate); + /* Safety check... */ if (!pgss || !pgss_hash || !pgss_enabled(nesting_level)) return; diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml index 501b468e9a..d06349d097 100644 --- a/doc/src/sgml/pgstatstatements.sgml +++ b/doc/src/sgml/pgstatstatements.sgml @@ -936,6 +936,24 @@ + + + pg_stat_statements.sample_rate (real) + + pg_stat_statements.sample_rate configuration parameter + + + + + + pg_stat_statements.sample_rate causes pg_stat_statements to only + track a fraction of the statements in each session. 
The default is 1, + meaning track all the queries. In case of nested statements, + either all will be tracked or none. Only superusers can change this setting. + + + + pg_stat_statements.save (boolean) -- 2.34.1
Re: Define STATS_MIN_ROWS for minimum rows of stats in ANALYZE
On 09.12.2024 16:10, Ilia Evdokimov wrote: Hi hackers, The repeated use of the number 300 in the ANALYZE-related code creates redundancy and relies on scattered, sometimes unclear, comments to explain its purpose. This can make the code harder to understand, especially for new contributors who might not immediately understand its significance. To address this, I propose introducing a macro STATS_MIN_ROWS to represent this value and consolidating its explanation in a single place, making the code more consistent and readable. -- Best regards, Ilia Evdokimov, Tantor Labs LLC. Hi everyone, Currently, the value 300 is used as the basis for determining the number of rows sampled during ANALYZE, both for single-column and extended statistics. While this value has a well-established rationale for single-column statistics, its suitability for extended statistics remains uncertain, as no specific research has confirmed that this is an optimal choice for them. To better reflect this distinction, I propose introducing two macros: STATS_MIN_ROWS for single-column statistics and EXT_STATS_MIN_ROWS for extended statistics. This change separates the concerns of single-column and extended statistics sampling, making the code more explicit and easier to adapt if future research suggests a different approach for extended statistics. The values remain the same for now, but the introduction of distinct macros improves clarity and prepares the codebase for potential refinements. Does this seem like a reasonable approach to handling these differences? -- Best regards, Ilia Evdokimov, Tantor Labs LLC. From 164535616040447207c36ea47c52b206f4d77dcf Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Tue, 10 Dec 2024 16:25:29 +0300 Subject: [PATCH v2] Define macros for minimum rows of stats in ANALYZE This introduces two macros, STATS_MIN_ROWS and EXT_STATS_MIN_ROWS, to represent the default minimum number of rows sampled in ANALYZE. STATS_MIN_ROWS is used for single-column statistics, while EXT_STATS_MIN_ROWS is intended for extended statistics. Both macros replace the hardcoded value of 300, improving clarity. --- src/backend/commands/analyze.c| 31 +-- src/backend/statistics/extended_stats.c | 2 +- src/backend/tsearch/ts_typanalyze.c | 4 +-- src/backend/utils/adt/rangetypes_typanalyze.c | 7 ++--- .../statistics/extended_stats_internal.h | 11 +++ src/include/statistics/statistics.h | 23 ++ 6 files changed, 47 insertions(+), 31 deletions(-) diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index 9a56de2282..e358b0d828 100644 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -1897,42 +1897,25 @@ std_typanalyze(VacAttrStats *stats) { /* Seems to be a scalar datatype */ stats->compute_stats = compute_scalar_stats; - /* - * The following choice of minrows is based on the paper - * "Random sampling for histogram construction: how much is enough?" - * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in - * Proceedings of ACM SIGMOD International Conference on Management - * of Data, 1998, Pages 436-447. 
Their Corollary 1 to Theorem 5 - * says that for table size n, histogram size k, maximum relative - * error in bin size f, and error probability gamma, the minimum - * random sample size is - * r = 4 * k * ln(2*n/gamma) / f^2 - * Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain - * r = 305.82 * k - * Note that because of the log function, the dependence on n is - * quite weak; even at n = 10^12, a 300*k sample gives <= 0.66 - * bin size error with probability 0.99. So there's no real need to - * scale for n, which is a good thing because we don't necessarily - * know it at this point. - * - */ - stats->minrows = 300 * stats->attstattarget; } else if (OidIsValid(eqopr)) { /* We can still recognize distinct values */ stats->compute_stats = compute_distinct_stats; - /* Might as well use the same minrows as above */ - stats->minrows = 300 * stats->attstattarget; } else { /* Can't do much but the trivial stuff */ stats->compute_stats = compute_trivial_stats; - /* Might as well use the same minrows as above */ - stats->minrows = 300 * stats->attstattarget; } + /* + * For the scalar types, STATS_MIN_ROWS is derived from research on + * optimal sample sizes for histograms. For other cases, + * we assume the same value. + */ + stats->minrows = (STATS_MIN_ROWS * stats->attstattarget); + return true; } diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c index 99fdf208db..8cd024b5ce 100644 --- a/src/backend/statistics/extended_stats.c +++ b/src/backend/statistics/extended_stats.c @@ -320,7 +320,7 @@ ComputeExtStatisticsRow
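The patch excerpt above is truncated before the new header definitions, so as a rough sketch (the exact wording and placement are assumptions; the diffstat shows they go into src/include/statistics/statistics.h and extended_stats_internal.h), the proposal boils down to something like:

```c
/*
 * Sketch of the proposed macros (hypothetical wording, values from the
 * discussion above).
 *
 * 300 comes from the sample-size bound r = 4 * k * ln(2n/gamma) / f^2 with
 * f = 0.5, gamma = 0.01, n = 10^6, which gives r ≈ 305.82 * k.
 */
#define STATS_MIN_ROWS      300     /* single-column statistics */
#define EXT_STATS_MIN_ROWS  300     /* extended statistics; same value for now */
```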
Re: explain analyze rows=%.0f
On 01.08.2023 23:29, Daniel Gustafsson wrote: On 3 Jul 2023, at 18:34, Daniel Gustafsson wrote: On 8 Jun 2023, at 19:49, Ibrar Ahmed wrote: On Mon, Mar 20, 2023 at 7:56 PM Gregory Stark (as CFM) <stark@gmail.com> wrote: This patch was marked Returned with Feedback and then later Waiting on Author. And it hasn't had any updates since January. What is the state on this feedback? If it's already done we can set the patch to Ready for Commit and if not do you disagree with the proposed changes? If there is a consensus to modify the test cases' output, I am willing to make the necessary changes and adjust the patch accordingly. However, if there is a preference to keep the output of certain test cases unchanged, I can rebase and modify the patch accordingly to accommodate those preferences. As there hasn't been any other comments I suggest updating your patch to address Tom's comments to see if we can make progress here. Since there hasn't been any updates here, and the thread has been stalled, I'm marking this returned with feedback. Please feel free to resubmit a version of the patch addressing comments to a future CF. -- Daniel Gustafsson Hi everybody, When the total number of returned tuples is less than the number of loops, EXPLAIN ANALYZE currently shows 'rows = 0'. This can mislead users into thinking that no rows were returned at all, even though some might have appeared occasionally. For example, if there is 1 tuple over 100 loops, the average is 0.01 rows per loop, but as displayed it simply looks like zero. To clarify this situation, it has been suggested to display rows with two decimal places in scenarios where 'loops > 1 && ntuples < loops'. In other words, show something like 'rows = 0.01' instead of 'rows=0'. This minor change would make it evident that rows did occur, just very infrequently. For all other cases, the current formatting would remain the same. This approach aims to provide a more nuanced understanding of query execution behavior without introducing unnecessary complexity or false precision. I would appreciate any thoughts or feedback on this proposal. -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
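To make the motivating case concrete, here is a small, hypothetical way to produce such a plan. The table names and the forced nested loop are assumptions; the exact plan shape will depend on the planner and the installation.

```sql
-- Hypothetical reproduction: the inner index scan returns 1 tuple in total
-- across 100 loops, so today EXPLAIN ANALYZE prints "rows=0 loops=100" for
-- it, while the proposal would print something like "rows=0.01".
CREATE TABLE outer_t (id int);
INSERT INTO outer_t SELECT 20000 + g FROM generate_series(1, 99) g;
INSERT INTO outer_t VALUES (42);                 -- the single matching value
CREATE TABLE inner_t (id int PRIMARY KEY);
INSERT INTO inner_t SELECT generate_series(1, 10000);
ANALYZE outer_t, inner_t;

SET enable_hashjoin = off;
SET enable_mergejoin = off;

EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT * FROM outer_t o JOIN inner_t i ON i.id = o.id;
```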
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index a201ed3082..b3f7fa8743 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -1981,14 +1981,14 @@ ExplainNode(PlanState *planstate, List *ancestors, if (es->format == EXPLAIN_FORMAT_TEXT) { + appendStringInfo(es->str, " (actual"); if (es->timing) -appendStringInfo(es->str, - " (actual time=%.3f..%.3f rows=%.0f loops=%.0f)", - startup_ms, total_ms, rows, nloops); +appendStringInfo(es->str, " time=%.3f..%.3f", startup_ms, total_ms); + + if (nloops > 1 && planstate->instrument->ntuples < nloops) +appendStringInfo(es->str," rows=%.2f loops=%.2f)", rows, nloops); else -appendStringInfo(es->str, - " (actual rows=%.0f loops=%.0f)", - rows, nloops); +appendStringInfo(es->str," rows=%.0f loops=%.0f)", rows, nloops); } else { @@ -1999,8 +1999,16 @@ ExplainNode(PlanState *planstate, List *ancestors, ExplainPropertyFloat("Actual Total Time", "ms", total_ms, 3, es); } - ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es); - ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es); + if (nloops > 1 && planstate->instrument->ntuples < nloops) + { +ExplainPropertyFloat("Actual Rows", NULL, rows, 2, es); +ExplainPropertyFloat("Actual Loops", NULL, nloops, 2, es); + } + else + { +ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es); +ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es); + } } } else if (es->analyze) @@ -2052,14 +2060,14 @@ ExplainNode(PlanState *planstate, List *ancestors, if (es->format == EXPLAIN_FORMAT_TEXT) { ExplainIndentText(es); +appendStringInfo(es->str, "actual"); if (es->timing) - appendStringInfo(es->str, - "actual time=%.3f..%.3f rows=%.0f loops=%.0f\n", - startup_ms, total_ms, rows, nloops); + appendStringInfo(es->str, " time=%.3f..%.3f", startup_ms, total_ms); + +if (nloops > 1 && planstate->instrument->ntuples < nloops) + appendStringInfo(es->str," rows=%.2f loops=%.2f)", rows, nloops); else - appendStringInfo(es->str, - "actual rows=%.0f loops=%.0f\n", - rows, nloops); + appendStringInfo(es->str," rows=%.0f loops=%.0f)", rows, nloops); } else { @@ -2070,8 +2078
Re: explain analyze rows=%.0f
On 11.01.2025 17:02, Guillaume Lelarge wrote: Le sam. 11 janv. 2025 à 12:10, Ilia Evdokimov a écrit : On 11.01.2025 12:15, Guillaume Lelarge wrote: Thanks for your patch, this looks like a very interesting feature that I'd like to see in a future release. It did a quick run: patch OK, make OK, make install OK, but make check fails quite a lot on partition_prune.sql. I guess it would need some work on partition_prune.sql tests and perhaps also on the docs. Thanks again. -- Guillaume. Yes, certainly. I have fixed partition_prune.sql. In the documentation example for EXPLAIN ANALYZE where loops is greater than one, I updated how 'rows' and 'loops' values are displayed so they appear as decimal fractions with two digits after the decimal point. I attached fixed patch. Any suggestions? Thanks for the updated patch. "make check" is happy, so no more suggestions, except you should register it in the next commitfest. -- Guillaume. Done! https://commitfest.postgresql.org/52/5501/ -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
Re: explain analyze rows=%.0f
On 11.01.2025 12:15, Guillaume Lelarge wrote: Thanks for your patch, this looks like a very interesting feature that I'd like to see in a future release. It did a quick run: patch OK, make OK, make install OK, but make check fails quite a lot on partition_prune.sql. I guess it would need some work on partition_prune.sql tests and perhaps also on the docs. Thanks again. -- Guillaume. Yes, certainly. I have fixed partition_prune.sql. In the documentation example for EXPLAIN ANALYZE where loops is greater than one, I updated how 'rows' and 'loops' values are displayed so they appear as decimal fractions with two digits after the decimal point. I attached fixed patch. Any suggestions? -- Best regards, Ilia Evdokimov, Tantor Labs LLC. From a934fbd278350edf7a3a3e7fe620a0027ceb2580 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Sat, 11 Jan 2025 13:59:34 +0300 Subject: [PATCH] Clarify display of rows and loops as decimal fractions When the average number of rows is small compared to the number of loops, both rows and loops are displayed as decimal fractions with two digits after the decimal point in EXPLAIN ANALYZE. --- doc/src/sgml/perform.sgml | 2 +- src/backend/commands/explain.c| 49 +-- src/test/regress/expected/partition_prune.out | 10 ++-- src/test/regress/sql/partition_prune.sql | 2 +- 4 files changed, 40 insertions(+), 23 deletions(-) diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml index a502a2aaba..da7a19f4b8 100644 --- a/doc/src/sgml/perform.sgml +++ b/doc/src/sgml/perform.sgml @@ -730,7 +730,7 @@ WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2; -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.36 rows=10 width=0) (actual time=0.004..0.004 rows=10 loops=1) Index Cond: (unique1 < 10) Buffers: shared hit=2 - -> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.90 rows=1 width=244) (actual time=0.003..0.003 rows=1 loops=10) + -> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.90 rows=1 width=244) (actual time=0.003..0.003 rows=1.00 loops=10.00) Index Cond: (unique2 = t1.unique2) Buffers: shared hit=24 read=6 Planning: diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index c24e66f82e..200294b756 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -1981,14 +1981,14 @@ ExplainNode(PlanState *planstate, List *ancestors, if (es->format == EXPLAIN_FORMAT_TEXT) { + appendStringInfo(es->str, " (actual"); if (es->timing) -appendStringInfo(es->str, - " (actual time=%.3f..%.3f rows=%.0f loops=%.0f)", - startup_ms, total_ms, rows, nloops); +appendStringInfo(es->str, " time=%.3f..%.3f", startup_ms, total_ms); + + if (nloops > 1 && planstate->instrument->ntuples < nloops) +appendStringInfo(es->str," rows=%.2f loops=%.2f)", rows, nloops); else -appendStringInfo(es->str, - " (actual rows=%.0f loops=%.0f)", - rows, nloops); +appendStringInfo(es->str," rows=%.0f loops=%.0f)", rows, nloops); } else { @@ -1999,8 +1999,16 @@ ExplainNode(PlanState *planstate, List *ancestors, ExplainPropertyFloat("Actual Total Time", "ms", total_ms, 3, es); } - ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es); - ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es); + if (nloops > 1 && planstate->instrument->ntuples < nloops) + { +ExplainPropertyFloat("Actual Rows", NULL, rows, 2, es); +ExplainPropertyFloat("Actual Loops", NULL, nloops, 2, es); + } + else + { +ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es); +ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es); + } } } else if (es->analyze) @@ -2052,14 
+2060,14 @@ ExplainNode(PlanState *planstate, List *ancestors, if (es->format == EXPLAIN_FORMAT_TEXT) { ExplainIndentText(es); +appendStringInfo(es->str, "actual"); if (es->timing) - appendStringInfo(es->str, - "actual time=%.3f..%.3f rows=%.0f loops=%.0f\n", - startup_ms, total_ms, rows, nloops); + appendStringInfo(es->str, " time=%.3f..%.3f", startup_ms, total_ms); + +if (nloops > 1 && planstate->instrument->ntuples < nloops) + appendStringInfo(es->str," rows=%.2f loops=%.2f)", rows, nloops); else - appendStringInfo(es->str, - "actual rows=%.0f loops=%.0f\n", - rows, nloops); + appendStringInfo(es->str," rows=%.0f loops=%.0f)", rows, nloops); } else { @@
Re: Sample rate added to pg_stat_statements
On 09.01.2025 22:13, Alena Rybakina wrote: Hi! Thank you for the work with this subject. I looked at your patch and noticed that this part of the code is repeated several times:

if (nesting_level == 0)
{
    if (!IsParallelWorker())
        current_query_sampled = pg_prng_double(&pg_global_prng_state) < pgss_sample_rate;
    else
        current_query_sampled = false;
}

I think you should put this in a function like update_current_query_sampled. I've attached a diff file with the changes. Agree, thanks. -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
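For readers following along, a sketch of the helper Alena is suggesting; the name is hers, the body is just the repeated fragment above factored out, and the actual attached diff may differ.

```c
/*
 * Decide whether the current top-level statement should be sampled.
 * Sketch of factoring out the fragment quoted above.
 */
static void
update_current_query_sampled(void)
{
    if (nesting_level != 0)
        return;                 /* keep the decision made at the top level */

    if (!IsParallelWorker())
        current_query_sampled =
            pg_prng_double(&pg_global_prng_state) < pgss_sample_rate;
    else
        current_query_sampled = false;
}
```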
Re: Sample rate added to pg_stat_statements
On 22.11.2024 09:08, Alexander Korotkov wrote: On Wed, Nov 20, 2024 at 12:07 AM Michael Paquier wrote: On Tue, Nov 19, 2024 at 09:39:21AM -0500, Greg Sabino Mullane wrote: Oh, and a +1 in general to the patch, OP, although it would also be nice to start finding the bottlenecks that cause such performance issues. FWIW, I'm not eager to integrate this proposal without looking at this exact argument in depth. One piece of it would be to see how much of such "bottlenecks" we would be able to get rid of by integrating pg_stat_statements into the central pgstats with the custom APIs, without pushing the module into core. This means that we would combine the existing hash of pgss to shrink to 8 bytes for objid rather than 13 bytes now as the current code relies on (toplevel, userid, queryid) for the entry lookup (entry removal is sniped with these three values as well, or dshash seq scans). The odds of conflicts still still play in our favor even if we have a few million entries, or even ten times that. If you run "pgbench -S -M prepared" on a pretty large machine with high concurrency, then spin lock in pgss_store() could become pretty much of a bottleneck. And I'm not sure switching all counters to atomics could somehow improve the situation given we already have pretty many counters. I'm generally +1 for the approach taken in this thread. But I would suggest introducing a threshold value for a query execution time, and sample just everything below that threshold. The slower query shouldn't be sampled, because it can't be too frequent, and also it could be more valuable to be counter individually (while very fast queries probably only matter "in average"). -- Regards, Alexander Korotkov Supabase BTW, since we're performing sampling outside of pgss_post_parse_analyze(), I'd like to reconsider Alexander's proposal—to sample only those queries that have an execution time below a specified threshold, as defined in the hook where execution time is recorded. Queries exceeding this threshold would not be sampled. Certainly I'll fix all Andrey's comments. What are your thoughts on this approach? -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
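To make that idea concrete, a very rough sketch of what time-based sampling could look like in pgss_ExecutorEnd(). pgss_sample_threshold_ms is a hypothetical GUC; this is not part of any posted patch, and as discussed later in the thread it would not help pgss_planner, which runs before execution time is known.

```c
/*
 * Hypothetical sketch: sample cheap statements, always keep expensive ones.
 * queryDesc->totaltime->total is in seconds at this point.
 */
double      exec_ms = queryDesc->totaltime->total * 1000.0;

if (nesting_level == 0)
{
    if (exec_ms >= pgss_sample_threshold_ms)
        current_query_sampled = true;   /* slow queries are always tracked */
    else
        current_query_sampled =
            pg_prng_double(&pg_global_prng_state) < pgss_sample_rate;
}
```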
Re: Sample rate added to pg_stat_statements
On 20.01.2025 17:20, Ilia Evdokimov wrote: On 15.01.2025 20:16, Sami Imseih wrote: Probably, but first I suggest benchmarking with sampling applied to all queries. If the results are good, we can later filter certain queries based on different characteristics. Absolutely. The benchmark numbers to justify this feature are the next step. Thanks for your work on this! Regards, Sami

Hi hackers, I'd like to share the results of my benchmark. To stress the spinlock in pg_stat_statements while incrementing counters for popular entries, it's important to use the same query repeatedly. To avoid overloading pgss with normalization, the queries should not contain constants. I found that using the query 'SELECT now()' works best for this purpose. I ran the benchmark on a machine with 48 CPUs, which may not be sufficient to fully test sampling, but I was able to achieve around 1.5 million TPS using 'SELECT now()'. To load the CPUs to about 85-90%, I ran pgbench with -c 45 -j 45, using a custom pgbench_script.sql file containing 'SELECT now()'. The benchmark was conducted on a newly created 'pgbench' database, with processes like autovacuum, fsync, and checkpoints disabled (e.g., checkpoint_timeout = '24h'). I tested various 'sample_rate' values and compared the performance with pgss both enabled and disabled. The detailed results are attached in the 'sample_rate.txt' file, along with pgbench progress reports every 10 seconds.

pgbench -c 45 -j 45 -T 200 -f pgbench_script.sql --progress=10 pgbench

sample_rate | tps            | lat   | stddev | num of transactions
1.0         | 1207127.572580 | 0.037 | 0.030  | 241400836
0.8         | 1403551.516338 | 0.032 | 0.031  | 280673286
0.5         | 1658596.614064 | 0.027 | 0.012  | 331679344
0.2         | 1757747.509476 | 0.025 | 0.008  | 351507156
0.0         | 1760055.986259 | 0.025 | 0.008  | 351965559
pgss off    | 1828743.791205 | 0.024 | 0.008  | 365703762

If anyone has the capability to run this benchmark on machines with more CPUs or with different queries, it would be nice. I'd appreciate any suggestions or feedback. -- Best regards, Ilia Evdokimov, Tantor Labs LLC.

Hi hackers, I'd like to share the results of a new benchmark I conducted using the pgbench tool with the -M prepared -S flags! Previously, I shared a benchmark focused on extreme high-load scenarios, but now I've conducted a more realistic test case based on Alexander's suggestion [0].

Machine characteristics:
* 48 CPUs
* 374 GB RAM
* 3 TB Disk

To avoid interference from background processes, I disabled autovacuum and synchronous operations using the following settings in postgresql.auto.conf:

autovacuum = off
fsync = off
checkpoint_timeout = '24h'
shared_preload_libraries = 'pg_stat_statements'

Create a test database:

createdb pgbench
pgbench -i -s 3000 pgbench

If pg_stat_statements is enabled, reset its hash table before starting the benchmark:

psql -c 'SELECT pg_stat_statements_reset()'

Run the benchmark with the following command:

pgbench -c 46 -j 46 -T 200 -M prepared -S --progress=10 pgbench

Here are the results with and without pg_stat_statements, varying the sample_rate parameter:

sample_rate | tps            | lat   | stddev | num of transactions
1.0         | 1299070.105018 | 0.035 | 0.055  | 259781895
0.8         | 1393327.873249 | 0.033 | 0.039  | 278630669
0.5         | 1420303.856480 | 0.032 | 0.037  | 284023396
0.2         | 1432482.672301 | 0.032 | 0.037  | 286461834
0.0         | 1760055.986259 | 0.032 | 0.036  | 290169584
pgss off    | 1460920.805259 | 0.031 | 0.036  | 292144256

When pg_stat_statements is enabled, TPS decreases by about 10%.
However, by leveraging the new sampling feature (sample_rate), it is possible to significantly mitigate this overhead. For detailed benchmark results with the '-M prepared -S' flags, please see the attached file 'sample_rate_prepared.txt'.

P.S. Previously, I shared results under extreme high-load conditions using the query 'SELECT now()'. For clarity, I'm rewriting those results for better readability:

sample_rate | tps            | lat   | stddev | num of transactions
1.0         | 1207127.572580 | 0.037 | 0.030  | 241400836
0.8         | 1403551.516338 | 0.032 | 0.031  | 280673286
0.5         | 1658596.614064 | 0.027 | 0.012  | 331679344
0.2         | 1757747.509476 | 0.025 | 0.008  | 351507156
0.0         | 1760055.986259 | 0.025 | 0.008  | 351965559
pgss off    | 1828743.791205 | 0.024 | 0.008  | 365703762

Here, we observed a significant ~33% decrease in TPS when pgss was enabled without sampling. Any thoughts?

[0]: https://www.postgresql.org/message-id/CAPpHfdsTKAQqC3A48-MGQhrhfEamXZPb64w%3Dutk7thQcOMNr7Q%40mail.gmail
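To make it clearer where that overhead comes from: once a statement is tracked, its counters are updated under a per-entry spinlock, so with a single very hot entry every backend contends on the same lock. The fragment below is paraphrased from pgss_store() in pg_stat_statements.c (not a verbatim excerpt); sampling simply skips this path for some fraction of executions.

```
/*
 * Paraphrased sketch of the counter update in pgss_store(); field and type
 * names follow pg_stat_statements.c, but this is not an exact copy.
 */
{
	volatile pgssEntry *e = (volatile pgssEntry *) entry;

	SpinLockAcquire(&e->mutex);
	e->counters.calls[kind] += 1;
	e->counters.total_time[kind] += total_time;
	e->counters.rows += rows;
	/* ... many more counters are updated here ... */
	SpinLockRelease(&e->mutex);
}
```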
Re: Sample rate added to pg_stat_statements
On 23.01.2025 08:21, Andrey Borodin wrote: There’s a typo in the commit message (ratio instead of rate). Besides this the patch looks ready for committer. Best regards, Andrey Borodin. Fixed. Thank you for review! I noticed that the code has not enough comments, so I added additional ones to improve clarity. Additionally, I moved the update_current_query_sampled() function to the end of the file, as keeping it between hooks didn’t seem appropriate to me. All these changes are included in the updated patch (v12), which I have attached. The patch is part of the current 2025-01 Commitfest: https://commitfest.postgresql.org/51/5390/ -- Best regards, Ilia Evdokimov, Tantor Labs LLC. From ddbc6af2af511ff342b183cb13a9027edadc0ad3 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Thu, 23 Jan 2025 11:31:57 +0300 Subject: [PATCH v1] Allow setting sample ratio for pg_stat_statements New configuration parameter pg_stat_statements.sample_rate makes it possible to track just a fraction of the queries meeting the configured threshold, to reduce the amount of tracking. --- .../pg_stat_statements/expected/select.out| 76 +++ .../pg_stat_statements/pg_stat_statements.c | 73 +++--- contrib/pg_stat_statements/sql/select.sql | 21 + doc/src/sgml/pgstatstatements.sgml| 18 + 4 files changed, 179 insertions(+), 9 deletions(-) diff --git a/contrib/pg_stat_statements/expected/select.out b/contrib/pg_stat_statements/expected/select.out index 37a30af034..558d93fb46 100644 --- a/contrib/pg_stat_statements/expected/select.out +++ b/contrib/pg_stat_statements/expected/select.out @@ -153,6 +153,82 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t; t (1 row) +-- +-- sample statements +-- +SET pg_stat_statements.sample_rate = 0.0; +SELECT 1 AS "int"; + int +- + 1 +(1 row) + +SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C"; + query| calls ++--- + SELECT pg_stat_statements_reset() IS NOT NULL AS t | 1 +(1 row) + +SET pg_stat_statements.sample_rate = 1.0; +SELECT 1 AS "int"; + int +- + 1 +(1 row) + +SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C"; + query | calls ++--- + SELECT $1 AS "int" | 1 + SELECT pg_stat_statements_reset() IS NOT NULL AS t | 1 + SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C" | 0 +(3 rows) + +SELECT pg_stat_statements_reset() IS NOT NULL AS t; + t +--- + t +(1 row) + +set pg_stat_statements.sample_rate = 0.0; +select pg_stat_statements_reset() IS NOT NULL AS t; + t +--- + t +(1 row) + +SELECT 1 \parse stmt +\bind_named stmt \g + ?column? +-- +1 +(1 row) + +SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C"; + query | calls +---+--- +(0 rows) + +set pg_stat_statements.sample_rate = 1.0; +\bind_named stmt \g + ?column? 
+-- +1 +(1 row) + +SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C"; + query | calls ++--- + SELECT $1 | 1 + SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C" | 0 +(2 rows) + +select pg_stat_statements_reset() IS NOT NULL AS t; + t +--- + t +(1 row) + -- -- queries with locking clauses -- diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index bebf8134eb..92ba954d41 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -50,6 +50,7 @@ #include "access/parallel.h" #include "catalog/pg_authid.h" #include "common/int.h" +#include "common/pg_prng.h" #include "executor/instrument.h" #include "funcapi.h" #include "jit/jit.h" @@ -294,6 +295,10 @@ static bool pgss_track_utility = true; /* whether to track utility commands */ static bool pgss_track_planning = false; /* whether to track planning * duration */ static bool pgss_save = true; /* whether to save stats across shutdown */ +static double pgss_sample_rate = 1.0; /* fraction of statements to track */ + +/* Is the current top-level query to be sampled? */ +static bool current_query_sampled = false; #define pgss_enabled(level) \ @@ -373,6 +378,7 @@ static char *generate_normalized_query(JumbleState *jstate, const char *query, static v
Re: Sample rate added to pg_stat_statements
On 15.01.2025 20:16, Sami Imseih wrote: Probably, but first I suggest benchmarking with sampling applied to all queries. If the results are good, we can later filter certain queries based on different characteristics. Absolutely. The benchmark numbers to justify this feature are the next step. Thanks for your work on this! Regards, Sami

Hi hackers, I'd like to share the results of my benchmark. To stress the spinlock in pg_stat_statements while incrementing counters for popular entries, it's important to use the same query repeatedly. To avoid overloading pgss with normalization, the queries should not contain constants. I found that using the query 'SELECT now()' works best for this purpose. I ran the benchmark on a machine with 48 CPUs, which may not be sufficient to fully test sampling, but I was able to achieve around 1.5 million TPS using 'SELECT now()'. To load the CPUs to about 85-90%, I ran pgbench with -c 45 -j 45, using a custom pgbench_script.sql file containing 'SELECT now()'. The benchmark was conducted on a newly created 'pgbench' database, with processes like autovacuum, fsync, and checkpoints disabled (e.g., checkpoint_timeout = '24h'). I tested various 'sample_rate' values and compared the performance with pgss both enabled and disabled. The detailed results are attached in the 'sample_rate.txt' file, along with pgbench progress reports every 10 seconds.

pgbench -c 45 -j 45 -T 200 -f pgbench_script.sql --progress=10 pgbench

sample_rate | tps            | lat   | stddev | num of transactions
1.0         | 1207127.572580 | 0.037 | 0.030  | 241400836
0.8         | 1403551.516338 | 0.032 | 0.031  | 280673286
0.5         | 1658596.614064 | 0.027 | 0.012  | 331679344
0.2         | 1757747.509476 | 0.025 | 0.008  | 351507156
0.0         | 1760055.986259 | 0.025 | 0.008  | 351965559
pgss off    | 1828743.791205 | 0.024 | 0.008  | 365703762

If anyone has the capability to run this benchmark on machines with more CPUs or with different queries, it would be nice. I'd appreciate any suggestions or feedback. -- Best regards, Ilia Evdokimov, Tantor Labs LLC.

pg_stat_statements.sample_rate = 1.0
cpu = 85%
pgbench -c 45 -j 45 -T 200 -f pgbench_script.sql --progress=10 pgbench
pgbench (16.6) starting vacuum...end.
progress: 10.0 s, 1200445.1 tps, lat 0.037 ms stddev 0.038, 0 failed progress: 20.0 s, 1232634.5 tps, lat 0.036 ms stddev 0.030, 0 failed progress: 30.0 s, 1209760.7 tps, lat 0.037 ms stddev 0.027, 0 failed progress: 40.0 s, 1204625.5 tps, lat 0.037 ms stddev 0.026, 0 failed progress: 50.0 s, 1206608.7 tps, lat 0.037 ms stddev 0.027, 0 failed progress: 60.0 s, 1214047.9 tps, lat 0.037 ms stddev 0.034, 0 failed progress: 70.0 s, 1222126.5 tps, lat 0.037 ms stddev 0.036, 0 failed progress: 80.0 s, 1198452.9 tps, lat 0.037 ms stddev 0.030, 0 failed progress: 90.0 s, 1198871.7 tps, lat 0.037 ms stddev 0.027, 0 failed progress: 100.0 s, 1200786.9 tps, lat 0.037 ms stddev 0.028, 0 failed progress: 110.0 s, 1200419.5 tps, lat 0.037 ms stddev 0.030, 0 failed progress: 120.0 s, 1208394.4 tps, lat 0.037 ms stddev 0.030, 0 failed progress: 130.0 s, 1199204.3 tps, lat 0.037 ms stddev 0.028, 0 failed progress: 140.0 s, 1213919.6 tps, lat 0.037 ms stddev 0.037, 0 failed progress: 150.0 s, 1195670.1 tps, lat 0.037 ms stddev 0.028, 0 failed progress: 160.0 s, 1190519.2 tps, lat 0.038 ms stddev 0.029, 0 failed progress: 170.0 s, 1213450.1 tps, lat 0.037 ms stddev 0.027, 0 failed progress: 180.0 s, 1220433.0 tps, lat 0.037 ms stddev 0.029, 0 failed progress: 190.0 s, 1199628.7 tps, lat 0.037 ms stddev 0.028, 0 failed transaction type: pgbench_script.sql scaling factor: 1 query mode: simple number of clients: 45 number of threads: 45 maximum number of tries: 1 duration: 200 s number of transactions actually processed: 241400836 number of failed transactions: 0 (0.000%) latency average = 0.037 ms latency stddev = 0.030 ms initial connection time = 25.051 ms tps = 1207127.572580 (without initial connection time) == pg_stat_statements.sample_rate = 0.8 cpu = 90% pgbench -c 45 -j 45 -T 200 -f pgbench_script.sql --progress=10 pgbench pgbench (16.6) starting vacuum...end. progress: 10.0 s, 1429922.8 tps, lat 0.031 ms stddev 0.028, 0 failed progress: 20.0 s, 1200380.2 tps, lat 0.037 ms stddev 0.029, 0 failed progress: 30.0 s, 1198102.6 tps, lat 0.037 ms stddev 0.027, 0 failed progress: 40.0 s, 1225912.9 tps, lat 0.037 ms stddev 0.028, 0 failed progress: 50.0 s, 1234056.2 tps, lat 0.036 ms stddev 0.027, 0 failed progress: 60.0 s, 1234894.6 tps, lat 0.036 ms stddev 0.027, 0 failed progress: 70.0 s, 1235032.9 tps, lat 0.036 ms stddev 0.027, 0 failed progress: 80.0 s, 1548607.9 tps, lat 0.029 ms stddev 0.035, 0 failed progress: 90.0 s, 1553966.9 tps, lat 0.029 ms stddev 0.031, 0 failed progress: 100
Re: Sample rate added to pg_stat_statements
On 28.01.2025 02:00, Sami Imseih wrote: Hi, I have a few comments about the patch. 1/ tests are missing for pg_stat_statements.track=all and with a sample rate < 1 applied. Can we add tests for this case? I think I’ve come up with a good idea for implementing this. We can create a new file, sampling.sql, for regression tests, where we move written tests in select.out and check test with setting "pg_stat_statements.track = 'all'" and "pg_stat_statements.sample_rate = 0.5". Additionally, we can create two files in the 'expected' directory: sampling.out and sampling_1.out where we can verify either all nested statements are tracked or none of them. 2/ This declaration: +static bool current_query_sampled = false; should be moved near the declaration of nesting_level, the other local variable. +1 3/ I do not really like the calls to update_current_query_sampled() the way there are, and I think the pgss_enabled should encapsulate the new sample rate check as well. My thoughts are: 1/ change the name of the function from update_current_query_sampled to current_query_sampled Also, the static bool should be called is_query_sampled or something like that. The function should also allow you to skip the sample rate check, such as if called from pgss_post_parse_analyze. We can also remove the IsParallelWorker() check in this case, since that is done in pgss_enabled. something like this is what I am thinking: static bool current_query_sampled(bool skip) { if (skip) return true; if (nesting_level == 0) { is_query_sampled = pgss_sample_rate != 0.0 && (pgss_sample_rate == 1.0 || pg_prng_double(&pg_global_prng_state) <= pgss_sample_rate); } return is_query_sampled; } #define pgss_enabled(level, skip_sampling_check) \ (!IsParallelWorker() && \ (pgss_track == PGSS_TRACK_ALL || \ (pgss_track == PGSS_TRACK_TOP && (level) == 0)) && \ current_query_sampled(skip_sampling_check)) What do you think? I agree with all of these changes; they make the code more readable. Adding comments to the current_query_sampled() function will make it even clearer. 4/ Now we have pg_stat_statements.track = "off" which is effectively the same thing as pg_stat_statements.sample_rate = "0". Does this need to be called out in docs? +1, because we have the same thing in log_statement_sample_rate. All the changes mentioned above are included in the v13 patch. Since the patch status is 'Ready for Committer,' I believe it is now better for upstream inclusion, with improved details in tests and documentation. Do you have any further suggestions? -- Best regards, Ilia Evdokimov, Tantor Labs LLC. From 149688fe3bccce25f8d01dfe8c01444555dce03c Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Tue, 28 Jan 2025 05:31:20 +0300 Subject: [PATCH] Allow setting sample rate for pg_stat_statements New configuration parameter pg_stat_statements.sample_rate makes it possible to track just a fraction of the queries meeting the configured threshold, to reduce the amount of tracking. 
--- contrib/pg_stat_statements/Makefile | 2 +- .../pg_stat_statements/expected/sampling.out | 99 +++ .../expected/sampling_1.out | 97 ++ .../pg_stat_statements/pg_stat_statements.c | 59 +-- contrib/pg_stat_statements/sql/sampling.sql | 34 +++ doc/src/sgml/pgstatstatements.sgml| 19 6 files changed, 302 insertions(+), 8 deletions(-) create mode 100644 contrib/pg_stat_statements/expected/sampling.out create mode 100644 contrib/pg_stat_statements/expected/sampling_1.out create mode 100644 contrib/pg_stat_statements/sql/sampling.sql diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile index 241c02587b..b70bdfaf2d 100644 --- a/contrib/pg_stat_statements/Makefile +++ b/contrib/pg_stat_statements/Makefile @@ -20,7 +20,7 @@ LDFLAGS_SL += $(filter -lm, $(LIBS)) REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_stat_statements/pg_stat_statements.conf REGRESS = select dml cursors utility level_tracking planning \ user_activity wal entry_timestamp privileges extended \ - parallel cleanup oldextversions + parallel sampling cleanup oldextversions # Disabled because these tests require "shared_preload_libraries=pg_stat_statements", # which typical installcheck users do not have (e.g. buildfarm clients). NO_INSTALLCHECK = 1 diff --git a/contrib/pg_stat_statements/expected/sampling.out b/contrib/pg_stat_statements/expected/sampling.out new file mode 100644 index 00..6e73a0a318 --- /dev/null +++ b/contrib/pg_stat_statements/expected/sampling.out @@ -0,0 +1,99 @@ +-- +-- sample statements +-- +SET pg_stat_statements.
Re: Sample rate added to pg_stat_statements
On 28.01.2025 20:21, Sami Imseih wrote: All the changes mentioned above are included in the v13 patch. Since the patch status is 'Ready for Committer,' I believe it is now better for upstream inclusion, with improved details in tests and documentation. Do you have any further suggestions? I am not quite clear on the sample_1.out. I do like the idea of separating the sample tests, but I was thinking of something a bit more simple. What do you think of my attached, sampling.sql, test? It tests sample rate in both simple and extended query protocols and for both top level and nested levels? That sounds great! I've added your sample.sql file to my v14 patch. However, I was focused on testing sample_rate values between 0 and 1. The approach I came up with was using the sample{_1}.out files. I’ve removed the test involving those files for now, but if the committer prefers to keep it, I can reintroduce them. If anyone has the capability to run this benchmark on machines with more CPUs or with different queries, it would be nice. I’d appreciate any suggestions or feedback. I wanted to share some additional benchmarks I ran as well on a r8g.48xlarge ( 192 vCPUs, 1,536 GiB of memory) configured with 16GB of shared_buffers. I also attached the benchmark.sh script used to generate the output. The benchmark is running the select-only pgbench workload, so we have a single heavily contentious entry, which is the worst case. The test shows that the spinlock (SpinDelay waits) becomes an issue at high connection counts and will become worse on larger machines. A sample_rate going from 1 to .75 shows a 60% improvement; but this is on a single contentious entry. Most workloads will likely not see this type of improvement. I also could not really observe this type of difference on smaller machines ( i.e. 32 vCPUs), as expected. ## init pgbench -i -s500 ### 192 connections pgbench -c192 -j20 -S -Mprepared -T120 --progress 10 sample_rate = 1 tps = 484338.769799 (without initial connection time) waits - 11107 SpinDelay 9568 CPU 929 ClientRead 13 DataFileRead 3 BufferMapping sample_rate = .75 tps = 909547.562124 (without initial connection time) waits - 12079 CPU 4781 SpinDelay 2100 ClientRead sample_rate = .5 tps = 1028594.555273 (without initial connection time) waits - 13253 CPU 3378 ClientRead 174 SpinDelay sample_rate = .25 tps = 1019507.126313 (without initial connection time) waits - 13397 CPU 3423 ClientRead sample_rate = 0 tps = 1015425.288538 (without initial connection time) waits - 13106 CPU 3502 ClientRead ### 32 connections pgbench -c32 -j20 -S -Mprepared -T120 --progress 10 sample_rate = 1 tps = 620667.049565 (without initial connection time) waits - 1782 CPU 560 ClientRead sample_rate = .75 tps = 620663.131347 (without initial connection time) waits - 1736 CPU 554 ClientRead sample_rate = .5 tps = 624094.688239 (without initial connection time) waits - 1741 CPU 648 ClientRead sample_rate = .25 tps = 628638.538204 (without initial connection time) waits - 1702 CPU 576 ClientRead sample_rate = 0 tps = 630483.464912 (without initial connection time) waits - 1638 CPU 574 ClientRead Regards, Sami Thank you so much for benchmarking this on a pretty large machine with a large number of CPUs. The results look fantastic, and I truly appreciate your effort. BWT, I realized that the 'sampling' test needs to be added not only to the Makefile but also to meson.build. I've included that in the v14 patch. -- Best regards, Ilia Evdokimov, Tantor Labs LLC. 
From 7c9d45325e29a65259740d5255d39f9f57ee6fba Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Tue, 28 Jan 2025 23:04:34 +0300 Subject: [PATCH] Allow setting sample rate for pg_stat_statements New configuration parameter pg_stat_statements.sample_rate makes it possible to track just a fraction of the queries meeting the configured threshold, to reduce the amount of tracking. --- contrib/pg_stat_statements/Makefile | 2 +- .../pg_stat_statements/expected/sampling.out | 174 ++ contrib/pg_stat_statements/meson.build| 1 + .../pg_stat_statements/pg_stat_statements.c | 59 +- contrib/pg_stat_statements/sql/sampling.sql | 50 + doc/src/sgml/pgstatstatements.sgml| 19 ++ 6 files changed, 297 insertions(+), 8 deletions(-) create mode 100644 contrib/pg_stat_statements/expected/sampling.out create mode 100644 contrib/pg_stat_statements/sql/sampling.sql diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile index 241c02587b..b70bdfaf2d 100644 --- a/contrib/pg_stat_statements/Makefile +++ b/contrib/pg_stat_statements/Makefile @@ -20,7 +20,7 @@ LDFLAGS_SL += $(filter -lm, $(LIBS)) REGRESS_OPTS = --temp-config $(top_srcd
Re: Sample rate added to pg_stat_statements
On 29.01.2025 21:52, Ilia Evdokimov wrote: ... I also attached the benchmark.sh script used to generate the output. In my opinion, if we can't observe a spinlock bottleneck on 32 CPUs, we should determine the CPU count at which it appears. This will help us understand the scale of the problem. Does this make sense, or are there really no real workloads where the same query runs on more than 32 CPUs, and we've been trying to solve a non-existent problem?

I ran the same benchmark on 48 CPUs with -c 48 -j 20 for objectivity.

### 48 connections
pgbench -c48 -j20 -S -Mprepared -T120 --progress 10

sample_rate = 1
tps = 643251.640175 (without initial connection time)
waits - 932 ClientRead 911 CPU 44 SpinDelay

sample_rate = .75
tps = 653946.777122 (without initial connection time)
waits - 939 CPU 875 ClientRead 3 SpinDelay

sample_rate = .5
tps = 651654.348463 (without initial connection time)
waits - 932 ClientRead 841 CPU

sample_rate = .25
tps = 652668.807245 (without initial connection time)
waits - 910 ClientRead 860 CPU

sample_rate = 0
tps = 659111.347019 (without initial connection time)
waits - 882 ClientRead 849 CPU

There is a small amount of SpinDelay, as mentioned earlier. However, we can identify the threshold where the problem appears. To summarize the results of all benchmarks, I compiled them into a table:

CPUs | sample_rate | tps     | CPU waits | ClientRead waits | SpinDelay waits
192  | 1.0         |  484338 |  9568     |  929             | 11107
192  | 0.75        |  909547 | 12079     | 2100             |  4781
192  | 0.5         | 1028594 | 13253     | 3378             |   174
192  | 0.25        | 1019507 | 13397     | 3423             |     -
192  | 0.0         | 1015425 | 13106     | 3502             |     -
48   | 1.0         |  643251 |   911     |  932             |    44
48   | 0.75        |  653946 |   939     |  875             |     3
48   | 0.5         |  651654 |   841     |  932             |     -
48   | 0.25        |  652668 |   860     |  910             |     -
48   | 0.0         |  659111 |   849     |  882             |     -
32   | 1.0         |  620667 |  1782     |  560             |     -
32   | 0.75        |  620663 |  1736     |  554             |     -
32   | 0.5         |  624094 |  1741     |  648             |     -
32   | 0.25        |  628638 |  1702     |  576             |     -
32   | 0.0         |  630483 |  1638     |  574             |     -

-- Best regards, Ilia Evdokimov, Tantor Labs LLC.
Re: Sample rate added to pg_stat_statements
On 28.01.2025 23:50, Ilia Evdokimov wrote: If anyone has the capability to run this benchmark on machines with more CPUs or with different queries, it would be nice. I’d appreciate any suggestions or feedback. I wanted to share some additional benchmarks I ran as well on a r8g.48xlarge ( 192 vCPUs, 1,536 GiB of memory) configured with 16GB of shared_buffers. I also attached the benchmark.sh script used to generate the output. The benchmark is running the select-only pgbench workload, so we have a single heavily contentious entry, which is the worst case. The test shows that the spinlock (SpinDelay waits) becomes an issue at high connection counts and will become worse on larger machines. A sample_rate going from 1 to .75 shows a 60% improvement; but this is on a single contentious entry. Most workloads will likely not see this type of improvement. I also could not really observe this type of difference on smaller machines ( i.e. 32 vCPUs), as expected. ## init pgbench -i -s500 ### 192 connections pgbench -c192 -j20 -S -Mprepared -T120 --progress 10 sample_rate = 1 tps = 484338.769799 (without initial connection time) waits - 11107 SpinDelay 9568 CPU 929 ClientRead 13 DataFileRead 3 BufferMapping sample_rate = .75 tps = 909547.562124 (without initial connection time) waits - 12079 CPU 4781 SpinDelay 2100 ClientRead sample_rate = .5 tps = 1028594.555273 (without initial connection time) waits - 13253 CPU 3378 ClientRead 174 SpinDelay sample_rate = .25 tps = 1019507.126313 (without initial connection time) waits - 13397 CPU 3423 ClientRead sample_rate = 0 tps = 1015425.288538 (without initial connection time) waits - 13106 CPU 3502 ClientRead ### 32 connections pgbench -c32 -j20 -S -Mprepared -T120 --progress 10 sample_rate = 1 tps = 620667.049565 (without initial connection time) waits - 1782 CPU 560 ClientRead sample_rate = .75 tps = 620663.131347 (without initial connection time) waits - 1736 CPU 554 ClientRead sample_rate = .5 tps = 624094.688239 (without initial connection time) waits - 1741 CPU 648 ClientRead sample_rate = .25 tps = 628638.538204 (without initial connection time) waits - 1702 CPU 576 ClientRead sample_rate = 0 tps = 630483.464912 (without initial connection time) waits - 1638 CPU 574 ClientRead Regards, Sami Thank you so much for benchmarking this on a pretty large machine with a large number of CPUs. The results look fantastic, and I truly appreciate your effort. BWT, I realized that the 'sampling' test needs to be added not only to the Makefile but also to meson.build. I've included that in the v14 patch. -- Best regards, Ilia Evdokimov, Tantor Labs LLC. In my opinion, if we can't observe bottleneck of spinlock on 32 CPUs, we should determine the CPU count at which it becomes. This will help us understand the scale of the problem. Does this make sense, or are there really no real workloads where the same query runs on more than 32 CPUs, and we've been trying to solve a non-existent problem? -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
Re: explain analyze rows=%.0f
On 11.01.2025 14:10, Ilia Evdokimov wrote: On 11.01.2025 12:15, Guillaume Lelarge wrote: Thanks for your patch, this looks like a very interesting feature that I'd like to see in a future release. It did a quick run: patch OK, make OK, make install OK, but make check fails quite a lot on partition_prune.sql. I guess it would need some work on partition_prune.sql tests and perhaps also on the docs. Thanks again. -- Guillaume. Yes, certainly. I have fixed partition_prune.sql. In the documentation example for EXPLAIN ANALYZE where loops is greater than one, I updated how 'rows' and 'loops' values are displayed so they appear as decimal fractions with two digits after the decimal point. I attached fixed patch. Any suggestions? -- Best regards, Ilia Evdokimov, Tantor Labs LLC. I guess, it's not ideal to modify the existing example in the documentation of the v5 patch because readers wouldn't immediately understand why decimal fractions appear there. Instead, I'll add a brief note in the documentation clarifying how rows and loops are displayed when the average row count is below one. The changes the of documentation are attached v6 patch. If you have any other suggestions or different opinions, I'd be happy to discuss them. -- Best regards, Ilia Evdokimov, Tantor Labs LLC. From 08f92c7e11829045014598e1dcc042f3e5a1e1a3 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Mon, 13 Jan 2025 23:01:44 +0300 Subject: [PATCH] Clarify display of rows and loops as decimal fractions When the average number of rows is small compared to the number of loops, both rows and loops are displayed as decimal fractions with two digits after the decimal point in EXPLAIN ANALYZE. --- doc/src/sgml/perform.sgml | 6 ++- src/backend/commands/explain.c| 49 +-- src/test/regress/expected/partition_prune.out | 10 ++-- src/test/regress/sql/partition_prune.sql | 2 +- 4 files changed, 44 insertions(+), 23 deletions(-) diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml index a502a2aaba..3f13d17fe9 100644 --- a/doc/src/sgml/perform.sgml +++ b/doc/src/sgml/perform.sgml @@ -757,7 +757,11 @@ WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2; comparable with the way that the cost estimates are shown. Multiply by the loops value to get the total time actually spent in the node. In the above example, we spent a total of 0.030 milliseconds -executing the index scans on tenk2. +executing the index scans on tenk2. If a subplan node +is executed multiple times and the average number of rows is less than one, +the rows and loops values are shown as a decimal fraction +(with two digits after the decimal point) to indicate that some rows +were actually processed rather than simply rounding down to zero. 
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index c24e66f82e..200294b756 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -1981,14 +1981,14 @@ ExplainNode(PlanState *planstate, List *ancestors, if (es->format == EXPLAIN_FORMAT_TEXT) { + appendStringInfo(es->str, " (actual"); if (es->timing) -appendStringInfo(es->str, - " (actual time=%.3f..%.3f rows=%.0f loops=%.0f)", - startup_ms, total_ms, rows, nloops); +appendStringInfo(es->str, " time=%.3f..%.3f", startup_ms, total_ms); + + if (nloops > 1 && planstate->instrument->ntuples < nloops) +appendStringInfo(es->str," rows=%.2f loops=%.2f)", rows, nloops); else -appendStringInfo(es->str, - " (actual rows=%.0f loops=%.0f)", - rows, nloops); +appendStringInfo(es->str," rows=%.0f loops=%.0f)", rows, nloops); } else { @@ -1999,8 +1999,16 @@ ExplainNode(PlanState *planstate, List *ancestors, ExplainPropertyFloat("Actual Total Time", "ms", total_ms, 3, es); } - ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es); - ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es); + if (nloops > 1 && planstate->instrument->ntuples < nloops) + { +ExplainPropertyFloat("Actual Rows", NULL, rows, 2, es); +ExplainPropertyFloat("Actual Loops", NULL, nloops, 2, es); + } + else + { +ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es); +ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es); + } } } else if (es->analyze) @@ -2052,14 +2060,14 @@ ExplainNode(PlanState *planstate, List *ancestors, if (es->format == EXPLAIN_FORMAT_TEXT) { ExplainIndentText(es); +appendStringInfo(es->str, "actual"); if (es->timing) - appendStringInfo(es->str, -
Re: Sample rate added to pg_stat_statements
On 09.01.2025 22:13, Alena Rybakina wrote: Hi! Thank you for the work with this subject. I looked at your patch and noticed that this part of the code is repeated several times: if (nesting_level == 0) { if (!IsParallelWorker()) current_query_sampled = pg_prng_double(&pg_global_prng_state) < pgss_sample_rate; else current_query_sampled = false; } I think you should put this in a function like update_current_query_sampled. I've attached a diff file with the changes. There was redundant checking of nesting_level, so I removed one of them. The fixed version is included in patch v10. Alena, Sami – I apologize for not including you in the previous email. If you're interested in this approach, I'm open to any suggestions. [0]: https://www.postgresql.org/message-id/1b13d748-5e98-479c-9222-3253a734a038%40tantorlabs.com -- Best regards, Ilia Evdokimov, Tantor Labs LLC. From e8d9911bd7b1384112773977c1a30a476fc77471 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Tue, 14 Jan 2025 12:18:52 +0300 Subject: [PATCH v1] Allow setting sample ratio for pg_stat_statements New configuration parameter pg_stat_statements.sample_ratio makes it possible to track just a fraction of the queries meeting the configured threshold, to reduce the amount of tracking. --- .../pg_stat_statements/expected/select.out| 76 +++ .../pg_stat_statements/pg_stat_statements.c | 46 ++- contrib/pg_stat_statements/sql/select.sql | 21 + doc/src/sgml/pgstatstatements.sgml| 18 + 4 files changed, 158 insertions(+), 3 deletions(-) diff --git a/contrib/pg_stat_statements/expected/select.out b/contrib/pg_stat_statements/expected/select.out index 37a30af034..558d93fb46 100644 --- a/contrib/pg_stat_statements/expected/select.out +++ b/contrib/pg_stat_statements/expected/select.out @@ -153,6 +153,82 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t; t (1 row) +-- +-- sample statements +-- +SET pg_stat_statements.sample_rate = 0.0; +SELECT 1 AS "int"; + int +- + 1 +(1 row) + +SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C"; + query| calls ++--- + SELECT pg_stat_statements_reset() IS NOT NULL AS t | 1 +(1 row) + +SET pg_stat_statements.sample_rate = 1.0; +SELECT 1 AS "int"; + int +- + 1 +(1 row) + +SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C"; + query | calls ++--- + SELECT $1 AS "int" | 1 + SELECT pg_stat_statements_reset() IS NOT NULL AS t | 1 + SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C" | 0 +(3 rows) + +SELECT pg_stat_statements_reset() IS NOT NULL AS t; + t +--- + t +(1 row) + +set pg_stat_statements.sample_rate = 0.0; +select pg_stat_statements_reset() IS NOT NULL AS t; + t +--- + t +(1 row) + +SELECT 1 \parse stmt +\bind_named stmt \g + ?column? +-- +1 +(1 row) + +SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C"; + query | calls +---+--- +(0 rows) + +set pg_stat_statements.sample_rate = 1.0; +\bind_named stmt \g + ?column? 
+-- +1 +(1 row) + +SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C"; + query | calls ++--- + SELECT $1 | 1 + SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C" | 0 +(2 rows) + +select pg_stat_statements_reset() IS NOT NULL AS t; + t +--- + t +(1 row) + -- -- queries with locking clauses -- diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index bebf8134eb..d79402bac4 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -50,6 +50,7 @@ #include "access/parallel.h" #include "catalog/pg_authid.h" #include "common/int.h" +#include "common/pg_prng.h" #include "executor/instrument.h" #include "funcapi.h" #include "jit/jit.h" @@ -294,6 +295,10 @@ static bool pgss_track_utility = true; /* whether to track utility commands */ static bool pgss_track_planning = false; /* whether to track planning * duration */ static bool pgss_save = true; /* whether to save stats across shutdown */ +static double pgss_sample_rate = 1; + +/* Is the c
Re: Define STATS_MIN_ROWS for minimum rows of stats in ANALYZE
On 10.12.2024 16:32, Ilia Evdokimov wrote: On 09.12.2024 16:10, Ilia Evdokimov wrote: Hi hackers, The repeated use of the number 300 in the ANALYZE-related code creates redundancy and relies on scattered, sometimes unclear, comments to explain its purpose. This can make the code harder to understand, especially for new contributors who might not immediately understand its significance. To address this, I propose introducing a macro STATS_MIN_ROWS to represent this value and consolidating its explanation in a single place, making the code more consistent and readable. -- Best regards, Ilia Evdokimov, Tantor Labs LLC. Hi everyone, Currently, the value 300 is used as the basis for determining the number of rows sampled during ANALYZE, both for single-column and extended statistics. While this value has a well-established rationale for single-column statistics, its suitability for extended statistics remains uncertain, as no specific research has confirmed that this is an optimal choice for them. To better reflect this distinction, I propose introducing two macros: STATS_MIN_ROWS for single-column statistics and EXT_STATS_MIN_ROWS for extended statistics. This change separates the concerns of single-column and extended statistics sampling, making the code more explicit and easier to adapt if future research suggests a different approach for extended statistics. The values remain the same for now, but the introduction of distinct macros improves clarity and prepares the codebase for potential refinements. Does this seem like a reasonable approach to handling these differences? -- Best regards, Ilia Evdokimov, Tantor Labs LLC. Hi everyone, In my opinion, it is more appropriate to define||EXT_STATS_MIN_ROWS as STATS_MIN_ROWS. I also reverted some of the code comments and rewrote others. I attached patch. Any thoughts? -- Best regards, Ilia Evdokimov, Tantor Labs LLC. From d8e38a78857006189a149acc5b36df3d77fe1e40 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Fri, 3 Jan 2025 16:32:43 +0300 Subject: [PATCH] Define macros for minimum rows of stats in ANALYZE This introduces two macros, STATS_MIN_ROWS and EXT_STATS_MIN_ROWS, to represent the default minimum number of rows sampled in ANALYZE. STATS_MIN_ROWS is used for single-column statistics, while EXT_STATS_MIN_ROWS is intended for extended statistics. Both macros replace the hardcoded value of 300, improving clarity. --- src/backend/commands/analyze.c| 26 +++ src/backend/statistics/extended_stats.c | 2 +- src/backend/tsearch/ts_typanalyze.c | 5 ++-- src/backend/utils/adt/rangetypes_typanalyze.c | 5 ++-- .../statistics/extended_stats_internal.h | 11 src/include/statistics/statistics.h | 23 6 files changed, 45 insertions(+), 27 deletions(-) diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index 2a7769b1fd..4c02e4f18a 100644 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -1897,40 +1897,22 @@ std_typanalyze(VacAttrStats *stats) { /* Seems to be a scalar datatype */ stats->compute_stats = compute_scalar_stats; - /* - * The following choice of minrows is based on the paper - * "Random sampling for histogram construction: how much is enough?" - * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in - * Proceedings of ACM SIGMOD International Conference on Management - * of Data, 1998, Pages 436-447. 
Their Corollary 1 to Theorem 5 - * says that for table size n, histogram size k, maximum relative - * error in bin size f, and error probability gamma, the minimum - * random sample size is - * r = 4 * k * ln(2*n/gamma) / f^2 - * Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain - * r = 305.82 * k - * Note that because of the log function, the dependence on n is - * quite weak; even at n = 10^12, a 300*k sample gives <= 0.66 - * bin size error with probability 0.99. So there's no real need to - * scale for n, which is a good thing because we don't necessarily - * know it at this point. - * - */ - stats->minrows = 300 * stats->attstattarget; + /* see comment about the choice of minrows in statistics/statistics.h */ + stats->minrows = (STATS_MIN_ROWS * stats->attstattarget); } else if (OidIsValid(eqopr)) { /* We can still recognize distinct values */ stats->compute_stats = compute_distinct_stats; /* Might as well use the same minrows as above */ - stats->minrows = 300 * stats->attstattarget; + stats->minrows = (STATS_MIN_ROWS * stats->attstattarget); } else { /* Can't do much but the trivial stuff */ stats->compute_stats = compute_trivial_stats; /* Might as well use the same minrows as above */ - stats->minrows = 300 * stats->attstattarget; + stats->minrows = (STA
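To make the proposal concrete, here is a sketch of one possible shape for the relocated definitions; the actual statistics.h hunk is not reproduced in the quoted diff, so the macro layout and comment wording below are assumptions based on the analyze.c comment being removed above.

```
/*
 * Sketch (assumed layout) of the definitions the patch adds to
 * src/include/statistics/statistics.h.
 *
 * The 300 multiplier comes from "Random sampling for histogram construction:
 * how much is enough?" (Chaudhuri, Motwani, Narasayya, SIGMOD 1998): for
 * histogram size k, maximum relative bin-size error f = 0.5, error
 * probability gamma = 0.01, and table size n = 10^6, the minimum sample size
 * is r = 4 * k * ln(2n/gamma) / f^2 ~= 305.82 * k.  The dependence on n is
 * weak, so 300 * k is used without scaling for n.
 */
#define STATS_MIN_ROWS	300

/*
 * No dedicated research exists for extended statistics yet, so reuse the
 * single-column value; a separate macro makes it easy to change later.
 */
#define EXT_STATS_MIN_ROWS	STATS_MIN_ROWS
```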
Remove unused rel parameter in lookup_var_attr_stats
Hi hackers, I've attached a small patch that remove unused parameter 'rel' from the static function lookup_var_attr_stats() in extended_stats.c While exploring src/backend/statistics, I noticed that there are several other parameters which appear to be unused. However, since the discussion around import/export statistics [1] attribute_stats.c is still ongoing and many functions from other files are non-static, those parameters may yet prove useful. Therefore, I'm only removing the parameter from this static function. Any thoughts? [1]: https://www.postgresql.org/message-id/flat/CADkLM%3DcB0rF3p_FuWRTMSV0983ihTRpsH%2BOCpNyiqE7Wk0vUWA%40mail.gmail.com -- Best regards, Ilia Evdokimov, Tantor Labs LLC. From a59af002463c962f153c79202b5f091a24fa6199 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Fri, 3 Jan 2025 16:46:51 +0300 Subject: [PATCH] Remove unused rel parameter in lookup_var_attr_stats The rel parameter in lookup_var_attr_stats is unused, so remove it. This is a static function, so the parameter can easily be added again if it's ever needed. --- src/backend/statistics/extended_stats.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c index aeec3ece41..34dcb535e1 100644 --- a/src/backend/statistics/extended_stats.c +++ b/src/backend/statistics/extended_stats.c @@ -74,7 +74,7 @@ typedef struct StatExtEntry static List *fetch_statentries_for_relation(Relation pg_statext, Oid relid); -static VacAttrStats **lookup_var_attr_stats(Relation rel, Bitmapset *attrs, List *exprs, +static VacAttrStats **lookup_var_attr_stats(Bitmapset *attrs, List *exprs, int nvacatts, VacAttrStats **vacatts); static void statext_store(Oid statOid, bool inh, MVNDistinct *ndistinct, MVDependencies *dependencies, @@ -165,7 +165,7 @@ BuildRelationExtStatistics(Relation onerel, bool inh, double totalrows, * Check if we can build these stats based on the column analyzed. If * not, report this fact (except in autovacuum) and move on. */ - stats = lookup_var_attr_stats(onerel, stat->columns, stat->exprs, + stats = lookup_var_attr_stats(stat->columns, stat->exprs, natts, vacattrstats); if (!stats) { @@ -293,7 +293,7 @@ ComputeExtStatisticsRows(Relation onerel, * analyzed. If not, ignore it (don't report anything, we'll do that * during the actual build BuildRelationExtStatistics). */ - stats = lookup_var_attr_stats(onerel, stat->columns, stat->exprs, + stats = lookup_var_attr_stats(stat->columns, stat->exprs, natts, vacattrstats); if (!stats) @@ -687,7 +687,7 @@ examine_expression(Node *expr, int stattarget) * indicate to the caller that the stats should not be built. */ static VacAttrStats ** -lookup_var_attr_stats(Relation rel, Bitmapset *attrs, List *exprs, +lookup_var_attr_stats(Bitmapset *attrs, List *exprs, int nvacatts, VacAttrStats **vacatts) { int i = 0; -- 2.34.1
Re: Sample rate added to pg_stat_statements
On 15.01.2025 01:07, Sami Imseih wrote: Alena, Sami – I apologize for not including you in the previous email. If you're interested in this approach, I'm open to any suggestions. [0]: https://www.postgresql.org/message-id/1b13d748-5e98-479c-9222-3253a734a038%40tantorlabs.com Here are my thoughts on this: There is good reason to apply sample rate selectively, but I am not sure if execution time is the way to go. I would rather apply a sample rate on the most frequently called queries, since I can gather enough samples to draw conclusions about performance of the query. I just don't know if that can be done in a sensible way, because while we can check the number of calls in the entry, we will need to do that with a shared lock and spin lock, which will defeat the purpose of this patch. Agree. Indeed, we should reduce the load on the spin locks, but we can’t check the most popular called queries without inspecting the hash table and locking spin locks. This also got me thinking if maybe we should allow the user to disable sample rate for utility statements as those are less frequent in most workloads and a user may want to capture all such statements, i.e. DROP, CREATE, etc. Regards, Sami Probably, but first I suggest benchmarking with sampling applied to all queries. If the results are good, we can later filter certain queries based on different characteristics. On 06.01.2025 18:57, Andrey M. Borodin wrote: 1. This code seems a little different from your patch. It is trying to avoid engaging PRNG. I'm not sure it's a good idea, but still. Also, it uses "<=", not "<". xact_is_sampled = log_xact_sample_rate != 0 && (log_xact_sample_rate == 1 || pg_prng_double(&pg_global_prng_state) <= log_xact_sample_rate); Sorry for the delayed reply. Andrey was right about this suggestion—first, it makes the code more readable for others, and second, it avoids engaging the PRNG on edge values of 0.0 and 1.0. I’ve attached patch v11 with these changes. -- Best regards, Ilia Evdokimov, Tantor Labs LLC. From 838c7e591a5cf157e88bc82cd8adf549b60ce212 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Wed, 15 Jan 2025 12:37:36 +0300 Subject: [PATCH v1] Allow setting sample ratio for pg_stat_statements New configuration parameter pg_stat_statements.sample_ratio makes it possible to track just a fraction of the queries meeting the configured threshold, to reduce the amount of tracking. 
--- .../pg_stat_statements/expected/select.out| 76 +++ .../pg_stat_statements/pg_stat_statements.c | 48 +++- contrib/pg_stat_statements/sql/select.sql | 21 + doc/src/sgml/pgstatstatements.sgml| 18 + 4 files changed, 160 insertions(+), 3 deletions(-) diff --git a/contrib/pg_stat_statements/expected/select.out b/contrib/pg_stat_statements/expected/select.out index 37a30af034..558d93fb46 100644 --- a/contrib/pg_stat_statements/expected/select.out +++ b/contrib/pg_stat_statements/expected/select.out @@ -153,6 +153,82 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t; t (1 row) +-- +-- sample statements +-- +SET pg_stat_statements.sample_rate = 0.0; +SELECT 1 AS "int"; + int +- + 1 +(1 row) + +SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C"; + query| calls ++--- + SELECT pg_stat_statements_reset() IS NOT NULL AS t | 1 +(1 row) + +SET pg_stat_statements.sample_rate = 1.0; +SELECT 1 AS "int"; + int +- + 1 +(1 row) + +SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C"; + query | calls ++--- + SELECT $1 AS "int" | 1 + SELECT pg_stat_statements_reset() IS NOT NULL AS t | 1 + SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C" | 0 +(3 rows) + +SELECT pg_stat_statements_reset() IS NOT NULL AS t; + t +--- + t +(1 row) + +set pg_stat_statements.sample_rate = 0.0; +select pg_stat_statements_reset() IS NOT NULL AS t; + t +--- + t +(1 row) + +SELECT 1 \parse stmt +\bind_named stmt \g + ?column? +-- +1 +(1 row) + +SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C"; + query | calls +---+--- +(0 rows) + +set pg_stat_statements.sample_rate = 1.0; +\bind_named stmt \g + ?column? +-- +1 +(1 row) + +SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C"; + query | calls ++--- + SELECT $1
Re: Sample rate added to pg_stat_statements
On 08.01.2025 22:39, Ilia Evdokimov wrote: After the changes in v9-patch, won’t all the necessary queries now be normalized? Since there are no longer any restrictions in the parser hook, queries will be normalized as usual, and pgss_planner, pgss_ExecutorStart, and ExecutorEnd will simply fetch them from 'pgss_query_texts.stat' file. For now, this seems like the simplest approach instead of making JumbleState available to other hooks. Moreover, if we do proceed with that, we might be able to remove the 'pgss_query_texts.stat' file altogether and more other improvements. In my view, if no other options arise, we should address making JumbleState available to other hooks in a separate thread. If you have any suggestions, I'm always open to feedback. I attached v9 patch with changes and test above. Unfortunately, these changes do not achieve the intended sampling goal. I looked into this more deeply: while the sampled-out queries do not appear in pg_stat_statements, an entry is still allocated in the hash table after normalization, which, in my view, should not happen when sampling is in effect. Therefore, patch v9 is unlikely to meet our needs. -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
Re: Sample rate added to pg_stat_statements
ng JumbleState available to other hooks in a separate thread. If you have any suggestions, I'm always open to feedback. I attached v9 patch with changes and test above. -- Best regards, Ilia Evdokimov, Tantor Labs LLC. From b40ab06155cac38e57f0551a3858d0574d96865f Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Wed, 8 Jan 2025 22:15:02 +0300 Subject: [PATCH] Allow setting sample ratio for pg_stat_statements New configuration parameter pg_stat_statements.sample_ratio makes it possible to track just a fraction of the queries meeting the configured threshold, to reduce the amount of tracking. --- .../pg_stat_statements/expected/select.out| 76 +++ .../pg_stat_statements/pg_stat_statements.c | 54 - contrib/pg_stat_statements/sql/select.sql | 21 + doc/src/sgml/pgstatstatements.sgml| 18 + 4 files changed, 166 insertions(+), 3 deletions(-) diff --git a/contrib/pg_stat_statements/expected/select.out b/contrib/pg_stat_statements/expected/select.out index 37a30af034..558d93fb46 100644 --- a/contrib/pg_stat_statements/expected/select.out +++ b/contrib/pg_stat_statements/expected/select.out @@ -153,6 +153,82 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t; t (1 row) +-- +-- sample statements +-- +SET pg_stat_statements.sample_rate = 0.0; +SELECT 1 AS "int"; + int +- + 1 +(1 row) + +SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C"; + query| calls ++--- + SELECT pg_stat_statements_reset() IS NOT NULL AS t | 1 +(1 row) + +SET pg_stat_statements.sample_rate = 1.0; +SELECT 1 AS "int"; + int +- + 1 +(1 row) + +SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C"; + query | calls ++--- + SELECT $1 AS "int" | 1 + SELECT pg_stat_statements_reset() IS NOT NULL AS t | 1 + SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C" | 0 +(3 rows) + +SELECT pg_stat_statements_reset() IS NOT NULL AS t; + t +--- + t +(1 row) + +set pg_stat_statements.sample_rate = 0.0; +select pg_stat_statements_reset() IS NOT NULL AS t; + t +--- + t +(1 row) + +SELECT 1 \parse stmt +\bind_named stmt \g + ?column? +-- +1 +(1 row) + +SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C"; + query | calls +---+--- +(0 rows) + +set pg_stat_statements.sample_rate = 1.0; +\bind_named stmt \g + ?column? +-- +1 +(1 row) + +SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C"; + query | calls ++--- + SELECT $1 | 1 + SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C" | 0 +(2 rows) + +select pg_stat_statements_reset() IS NOT NULL AS t; + t +--- + t +(1 row) + -- -- queries with locking clauses -- diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index bebf8134eb..52957c4628 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -50,6 +50,7 @@ #include "access/parallel.h" #include "catalog/pg_authid.h" #include "common/int.h" +#include "common/pg_prng.h" #include "executor/instrument.h" #include "funcapi.h" #include "jit/jit.h" @@ -294,6 +295,10 @@ static bool pgss_track_utility = true; /* whether to track utility commands */ static bool pgss_track_planning = false; /* whether to track planning * duration */ static bool pgss_save = true; /* whether to save stats across shutdown */ +static double pgss_sample_rate = 1; + +/* Is the current top-level query to be sampled? 
*/ +static bool current_query_sampled = false; #define pgss_enabled(level) \ @@ -414,6 +419,19 @@ _PG_init(void) NULL, NULL); + DefineCustomRealVariable("pg_stat_statements.sample_rate", + "Fraction of queries to track.", + NULL, + &pgss_sample_rate, + 1.0, + 0.0, + 1.0, + PGC_SUSET, + 0, + NULL, + NULL, + NULL); + DefineCustomEnumVariable("pg_stat_statements.track", "Selects which statements are tracked by pg_stat_statements.", NULL, @@ -889,12 +907,21 @@ pgss_planner(Query *parse, { PlannedStmt *result; + if (nesting_level == 0) + { + if (!IsParalle
Re: Sample rate added to pg_stat_statements
On 09.01.2025 05:29, Sami Imseih wrote: Unfortunately, these changes do not achieve the intended sampling goal. I looked into this more deeply: while the sampled-out queries do not appear in pg_stat_statements, an entry is still allocated in the hash table after normalization, which, in my view, should not happen when sampling is in effect. Therefore, patch v9 is unlikely to meet our needs. pg_stat_statements creates entries as "sticky" initially to give them more time to stay in the hash before the first execution completes. It's not perfect, but it works for the majority of cases. So, what you are observing is how pg_stat_statements currently works. If an entry is popular enough, we will need it anyways ( even with the proposed sampling ). An entry that's not popular will eventually be aged out. From my understanding, what the proposed sampling will do is to reduce the overhead of incrementing counters of popular entries, because of the spinlock to update the counters. This is particularly the case with high concurrency on large machines ( high cpu count ), and especially when there is a small set of popular entries. IMO, This patch should also have a benchmark that proves that a user can benefit with sampling in those types of workloads. Ah, so patch version 9 might be the best fit to achieve this. I’ll need to benchmark it on a large, high-concurrency machine then. -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
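For readers less familiar with pg_stat_statements internals, the "sticky" mechanism Sami describes looks roughly like the fragment below, paraphrased from entry_dealloc() in pg_stat_statements.c (macro names and factors are quoted from memory and may differ slightly): entries created at parse time that never complete an execution have calls == 0, and their usage decays faster, so unpopular sampled-out entries are the first eviction candidates.

```
/*
 * Paraphrased sketch of the aging step in entry_dealloc(); not a verbatim
 * excerpt, names and constants are approximate.
 */
if (IS_STICKY(entry->counters))		/* calls == 0: allocated but never completed */
	entry->counters.usage *= STICKY_DECREASE_FACTOR;	/* decays quickly */
else
	entry->counters.usage *= USAGE_DECREASE_FACTOR;		/* decays slowly */
```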
Re: explain analyze rows=%.0f
On 12.02.2025 22:56, Robert Haas wrote: On Wed, Feb 12, 2025 at 2:55 PM Andrei Lepikhov wrote: On 13/2/2025 01:40, Tom Lane wrote: I was idly speculating yesterday about letting the Ryu code print the division result, so that we get a variable number of digits. Realistically, that'd probably result in many cases in more digits than anybody wants, so it's not a serious proposal. I'm cool with the fixed-two-digits approach to start with. Okay, since no one else voted for the meaningful-numbers approach, I would say that fixed size is better than nothing. It may cover some of my practical cases, but unfortunately, not the most problematic ones. I don't love it either, but I do think it is significantly better than nothing. I'm in favor of having some improvement rather than nothing at all—otherwise, we might never reach a consensus. 1. Documentation (v9-0001-Clarify-display-of-rows-as-decimal-fractions-DOC.patch) One thing that bothers me is that the documentation explains how to compute total time, but it does not clarify how to compute total rows. Maybe this was obvious to others before, but now that we are displaying |rows| as a fraction, we should explicitly document how to interpret it alongside total time. I believe it would be helpful to show a non-integer rows value in an example query. However, to achieve this, we need the index scan results to vary across iterations. One way to accomplish this is by using the condition t1.unique2 > t2.unique2. Additionally, I suggest making loops a round number (e.g., 100) for better readability, which can be achieved using t1.thousand < 10. The final query would look like this: EXPLAIN ANALYZE SELECT * FROM tenk1 t1, tenk2 t2 WHERE t1.thousand < 10 AND t1.unique2 > t2.unique2; I believe this is an ideal example for the documentation because it not only demonstrates fractional rows, but also keeps the execution plan nearly unchanged. While the estimated and actual average row counts become slightly rounded, I don't see another way to ensure different results for each index scan. I'm open to any feedback or suggestions for a better example to use in the documentation or additional explaining fractional rows in the text. 2. Code and tests (v9-0002-Clarify-display-of-rows-as-decimal-fractions.patch) I left the code and tests unchanged since we agreed on a fixed format of two decimal places. -- Best regards, Ilia Evdokimov, Tantor Labs LLC. From 789d98383988ef6d3b0bc2c6b9b5c73a83ffd6d4 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Thu, 13 Feb 2025 11:19:03 +0300 Subject: [PATCH v9] Clarify display of rows as decimal fractions When loops > 1, the average rows value is now displayed as a decimal fraction with two digits after the decimal point in EXPLAIN ANALYZE. Previously, the average rows value was always rounded to an integer, which could lead to misleading results when estimating total rows by multiplying rows by loop. This change ensures that users get a more accurate representation of the data flow through execution nodes. 
--- doc/src/sgml/perform.sgml | 41 --- 1 file changed, 21 insertions(+), 20 deletions(-) diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml index a502a2aaba..dd61ae4507 100644 --- a/doc/src/sgml/perform.sgml +++ b/doc/src/sgml/perform.sgml @@ -717,26 +717,26 @@ FROM tenk1 t1 WHERE t1.ten = (SELECT (random() * 10)::integer); EXPLAIN ANALYZE SELECT * FROM tenk1 t1, tenk2 t2 -WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2; - - QUERY PLAN &zwsp;-- - Nested Loop (cost=4.65..118.50 rows=10 width=488) (actual time=0.017..0.051 rows=10 loops=1) - Buffers: shared hit=36 read=6 - -> Bitmap Heap Scan on tenk1 t1 (cost=4.36..39.38 rows=10 width=244) (actual time=0.009..0.017 rows=10 loops=1) - Recheck Cond: (unique1 < 10) - Heap Blocks: exact=10 - Buffers: shared hit=3 read=5 written=4 - -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.36 rows=10 width=0) (actual time=0.004..0.004 rows=10 loops=1) - Index Cond: (unique1 < 10) +WHERE t1.thousand < 10 AND t1.unique2 > t2.unique2; + + QUERY PLAN +---&zwsp;- + Nested Loop (cost=5.40..11571.44 rows=356667 width=488) (actual time=0.042..117.205 rows=513832 loops=1) + Buffers: shared hit=19377 read=29 + -> Bitmap Heap Scan on tenk1 t1 (cost=5.11..233.60 rows=107 width=244) (actual time=0.021..0.103 rows=100 loops=1) + Recheck Cond: (thousand < 10) + Heap
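Coming back to point 1 above, the arithmetic I think the documentation should spell out looks like this (the numbers here are made up for illustration, not taken from the patched example output):

```
rows=2.50, loops=100   =>   total rows across all iterations ≈ 2.50 × 100 = 250
```

In other words, total rows is derived exactly like total time: multiply the per-loop average by the loops value.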
Re: explain analyze rows=%.0f
On 11.02.2025 20:41, Robert Haas wrote: On Tue, Feb 11, 2025 at 12:14 PM Andrei Lepikhov wrote: I support the idea in general, but I believe it should be expanded to cover all cases of parameterised plan nodes. Each rescan iteration may produce a different number of tuples, and rounding can obscure important data. For example, consider five loops of a scan node: the first loop returns nine tuples, and each of the others returns zero tuples. When we calculate the average, 9 divided by 5 equals 1.8. This results in an explanation that indicates "rows = 1," masking almost 40% of the data. Now, if we apply the same two loops but have a total of 900,000 tuples, then 400,000 masked tuples represent a significant portion of the data. Moreover, switching to a floating-point type for row explanations in each parameterised node would provide a more comprehensive view and add valuable information about the parameterisation of the node, which may not be immediately apparent. I agree strongly with all of this. I believe we should just implement what was agreed here: https://www.postgresql.org/message-id/21013.1243618236%40sss.pgh.pa.us Let's just display 2 fractional digits when nloops>1, else 0, and call it good.

Thank you for your review! With such an example, it's hard to disagree. This would really add valuable information. Taking all opinions into account, I have updated the patch to v8. I have also included a check for the case where there are only zeros after the decimal point; we do not want to clutter the rows value with unnecessary zeros.

-- Best regards, Ilia Evdokimov, Tantor Labs LLC.

From e1a46d3fa5a4d20b6ea4728b2a3955f375539295 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Tue, 11 Feb 2025 22:12:41 +0300 Subject: [PATCH] Clarify display of rows as decimal fractions When loops > 1, the average rows value is shown as a decimal fraction with two digits after the decimal point in EXPLAIN ANALYZE. --- doc/src/sgml/perform.sgml | 2 + src/backend/commands/explain.c| 55 +-- src/test/regress/expected/partition_prune.out | 18 +++--- src/test/regress/sql/partition_prune.sql | 2 +- 4 files changed, 51 insertions(+), 26 deletions(-) diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml index a502a2aaba..a01832040c 100644 --- a/doc/src/sgml/perform.sgml +++ b/doc/src/sgml/perform.sgml @@ -758,6 +758,8 @@ WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2; the loops value to get the total time actually spent in the node. In the above example, we spent a total of 0.030 milliseconds executing the index scans on tenk2. +If loops is greater than 1, +the rows is shown as a decimal unless it's an integer.
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index c24e66f82e..ae81aeea13 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -13,6 +13,8 @@ */ #include "postgres.h" +#include <math.h> + #include "access/xact.h" #include "catalog/pg_type.h" #include "commands/createas.h" @@ -69,6 +71,9 @@ typedef struct SerializeMetrics */ #define BYTES_TO_KILOBYTES(b) (((b) + 1023) / 1024) +/* Check if float/double has any decimal number */ +#define HAS_DECIMAL(x) (floor(x) != x) + static void ExplainOneQuery(Query *query, int cursorOptions, IntoClause *into, ExplainState *es, ParseState *pstate, ParamListInfo params); @@ -1981,14 +1986,15 @@ ExplainNode(PlanState *planstate, List *ancestors, if (es->format == EXPLAIN_FORMAT_TEXT) { + appendStringInfo(es->str, " (actual "); + if (es->timing) -appendStringInfo(es->str, - " (actual time=%.3f..%.3f rows=%.0f loops=%.0f)", - startup_ms, total_ms, rows, nloops); +appendStringInfo(es->str, "time=%.3f..%.3f ", startup_ms, total_ms); + + if (nloops > 1 && HAS_DECIMAL(rows)) +appendStringInfo(es->str, "rows=%.2f loops=%.0f)", rows, nloops); else -appendStringInfo(es->str, - " (actual rows=%.0f loops=%.0f)", - rows, nloops); +appendStringInfo(es->str, "rows=%.0f loops=%.0f)", rows, nloops); } else { @@ -1999,8 +2005,16 @@ ExplainNode(PlanState *planstate, List *ancestors, ExplainPropertyFloat("Actual Total Time", "ms", total_ms, 3, es); } - ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es); - ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es); + if (nloops > 1 && HAS_DECIMAL(rows)) + { +ExplainPropertyFloat("Actual Rows", NULL, rows, 2, es); +ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es); + } + else + { +ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es); +ExplainPropertyFloat("Ac
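To make the effect of the patch concrete, here is roughly how a case like Andrei's would render before and after the change; the plan line below is made up (nine tuples returned across ten loops), not output from a real run:

```
before:  (actual time=0.010..0.020 rows=1 loops=10)      <- 0.9 rounded away by %.0f
after:   (actual time=0.010..0.020 rows=0.90 loops=10)   <- two digits when loops > 1 and rows is fractional
```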
Re: Add missing tab completion for VACUUM and ANALYZE with ONLY option
On 19.03.2025 12:55, vignesh C wrote: Looks good overall. However, after pressing Tab, ONLY appears in the completion list, but it doesn't autocomplete after typing the first letter. In which case did you notice this? Ah, I just tested it using these examples from the regression tests: CREATE TABLE only_parted (a int, b text) PARTITION BY LIST (a); CREATE TABLE only_parted1 PARTITION OF only_parted FOR VALUES IN (1); ONLY doesn't autocomplete here because the table names begin with "only...", which clashes with completion of the keyword; this was my oversight. -- Everything seems fine with the ANALYZE command autocompletion. Regarding VACUUM, I'm not entirely convinced we should provide autocompletion in every case. I'd prefer to keep the behavior from the v3 patch, because the original intention behind adding the ONLY keyword was specifically for collecting statistics exclusively from parent tables. Therefore, autocompletion makes sense primarily for the ANALYZE case. Any thoughts? -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
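For anyone who wants to reproduce the clash, a sketch using the regression-test tables (the completion behaviour is described in the comments, not pasted verbatim from psql):

```
CREATE TABLE only_parted (a int, b text) PARTITION BY LIST (a);
-- typing "ANALYZE o" and pressing Tab now offers both the ONLY keyword and the
-- only_* tables, so the keyword alone does not complete past the shared prefix
ANALYZE ONLY only_parted;   -- typed out in full, it still works as expected
```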
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment
After considering the opinions expressed in this discussion, I tend to agree more with David. If we add info about estimated unique keys and estimated capacity, then any additional information - such as evict_ratio and hit_ratio - can also be calculated, as EXPLAIN ANALYZE provides all the necessary details to compute these values. For now, I’m attaching a rebased v4 patch, which includes the fix for ExplainPropertyText. -- Best regards, Ilia Evdokimov, Tantor Labs LLC. From b60e508583b1cb7f30814cef6f39c4b570693350 Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Fri, 21 Mar 2025 11:52:29 +0300 Subject: [PATCH v4] Show ndistinct and est_entries in EXPLAIN for Memoize --- src/backend/commands/explain.c | 16 src/backend/optimizer/path/costsize.c | 3 +++ src/backend/optimizer/plan/createplan.c | 10 +++--- src/backend/optimizer/util/pathnode.c | 3 +++ src/include/nodes/pathnodes.h | 2 ++ src/include/nodes/plannodes.h | 6 ++ 6 files changed, 37 insertions(+), 3 deletions(-) diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index 391b34a2af2..cb7e295ca8a 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -3628,6 +3628,22 @@ show_memoize_info(MemoizeState *mstate, List *ancestors, ExplainState *es) ExplainPropertyText("Cache Key", keystr.data, es); ExplainPropertyText("Cache Mode", mstate->binary_mode ? "binary" : "logical", es); + if (es->costs) + { + if (es->format == EXPLAIN_FORMAT_TEXT) + { + ExplainIndentText(es); + appendStringInfo(es->str, "Estimated Capacity: %u Estimated Distinct Lookup Keys: %0.0f\n", + ((Memoize *) plan)->est_entries, + ((Memoize *) plan)->est_unique_keys); + } + else + { + ExplainPropertyUInteger("Estimated Capacity", "", ((Memoize *) plan)->est_entries, es); + ExplainPropertyFloat("Estimated Distinct Lookup Keys", "", ((Memoize *) plan)->est_unique_keys, 0, es); + } + } + pfree(keystr.data); if (!es->analyze) diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c index f6f77b8fe19..e3abf9ccc26 100644 --- a/src/backend/optimizer/path/costsize.c +++ b/src/backend/optimizer/path/costsize.c @@ -2604,6 +2604,9 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath, mpath->est_entries = Min(Min(ndistinct, est_cache_entries), PG_UINT32_MAX); + /* Remember ndistinct for a potential EXPLAIN later */ + mpath->est_unique_keys = ndistinct; + /* * When the number of distinct parameter values is above the amount we can * store in the cache, then we'll have to evict some entries from the diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c index 75e2b0b9036..596b82c5dc9 100644 --- a/src/backend/optimizer/plan/createplan.c +++ b/src/backend/optimizer/plan/createplan.c @@ -284,7 +284,8 @@ static Material *make_material(Plan *lefttree); static Memoize *make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations, List *param_exprs, bool singlerow, bool binary_mode, - uint32 est_entries, Bitmapset *keyparamids); + uint32 est_entries, Bitmapset *keyparamids, + double est_unique_keys); static WindowAgg *make_windowagg(List *tlist, WindowClause *wc, int partNumCols, AttrNumber *partColIdx, Oid *partOperators, Oid *partCollations, int ordNumCols, AttrNumber *ordColIdx, Oid *ordOperators, Oid *ordCollations, @@ -1703,7 +1704,8 @@ create_memoize_plan(PlannerInfo *root, MemoizePath *best_path, int flags) plan = make_memoize(subplan, operators, collations, param_exprs, best_path->singlerow, best_path->binary_mode, - best_path->est_entries, 
keyparamids); + best_path->est_entries, keyparamids, + best_path->est_unique_keys); copy_generic_path_info(&plan->plan, (Path *) best_path); @@ -6636,7 +6638,8 @@ materialize_finished_plan(Plan *subplan) static Memoize * make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations, List *param_exprs, bool singlerow, bool binary_mode, - uint32 est_entries, Bitmapset *keyparamids) + uint32 est_entries, Bitmapset *keyparamids, + double est_unique_keys) { Memoize*node = makeNode(Memoize); Plan *plan = &node->plan; @@ -6654,6 +6657,7 @@ make_memoize(Plan *lefttree, Oid *hashoperators, Oid *collations, node->binary_mode = binary_mode; node->est_entries = est_entries; node->keyparamids = keyparamids; + node->est_unique_keys = est_unique_keys; return node; } diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c index 93e73cb44db..1fbcda99067 100644 --- a/src/backend/optimizer/util/pathnode.c +++ b/src/backend/optimizer/util/pathnode.c @@ -1701,6 +1701,9 @@ cre
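As a concrete reading of the two new fields, using the 721/655 numbers that come up in this thread (worked arithmetic only, not new plan output):

```
Estimated Distinct Lookup Keys = 721, Estimated Capacity = 655
  => roughly 721 - 655 = 66 distinct keys were not expected to fit in the cache
     at the same time, so the cost model assumed some evictions
```

With ANALYZE, the observed behaviour can then be checked against that expectation using the Hits/Misses/Evictions counters that Memoize already reports.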
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment
On 20.03.2025 13:37, David Rowley wrote: est_entries is a uint32, so %u is the correct format character for that type. I don't think you need to prefix all these properties with "Cache" just because the other two properties have that prefix. I also don't think the names you've chosen really reflect the meaning. How about something like: "Estimated Distinct Lookup Keys: 721 Estimated Capacity: 655", in that order. I think maybe having that as one line for format=text is better than 2 lines. The EXPLAIN output is already often taking up more lines in v18 than in v17, would be good to not make that even worse unnecessarily. LGTM I see the existing code there could use ExplainPropertyText rather than have a special case for text and non-text formats. That's likely my fault. If we're touching this code, then we should probably tidy that up. Do you want to create a precursor fixup patch for that? I fully agree with this suggestion. I'll start on that in a separate thread. + double ndistinct; /* Estimated number of distinct memoization keys, + * used for cache size evaluation. Kept for EXPLAIN */ Maybe this field in MemoizePath needs a better name. How about "est_unique_keys"? and also do the rename in struct Memoize. LGTM I'm also slightly concerned about making struct Memoize bigger. I had issues with a performance regression [1] for 908a96861 when increasing the WindowAgg struct size last year and the only way I found to make it go away was to shuffle the fields around so that the struct size didn't increase. I think we'll need to see a benchmark of a query that hits Memoize quite hard with a small cache size to see if the performance decreases as a result of adding the ndistinct field. It's unfortunate that we'll not have the luxury of squeezing this double into padding if we do see a slowdown. I tried rearranging the fields in the struct to avoid growing it after adding the 'ndistinct' field, but unfortunately sizeof(Memoize) did not stay the same. Therefore, benchmarking Memoize is definitely necessary. Thanks for the information! -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
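A possible starting point for such a benchmark, reusing the memoize.sql regression query quoted elsewhere in this thread; the work_mem setting is only a sketch to keep the cache small and force evictions, and the timings would have to be compared with and without the extra field:

```
SET work_mem = '64kB';
EXPLAIN ANALYZE
SELECT COUNT(*), AVG(t1.unique1)
FROM tenk1 t1
INNER JOIN tenk1 t2 ON t1.unique1 = t2.thousand
WHERE t2.unique1 < 1200;
```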
Re: Proposal - Allow extensions to set a Plan Identifier
On 18.03.2025 14:46, Ilia Evdokimov wrote: However, I have a question: what value should planid have when we execute the standard planner without using a planner_hook? For example, if pg_stat_statements is disabled, would planid default to zero? If zero is the expected default, should we set it explicitly (result->planid = 0;)? Sorry for the spam. I found the answer to this question in thread [0]. [0]: https://www.postgresql.org/message-id/flat/CAP53Pkyow59ajFMHGpmb1BK9WHDypaWtUsS_5DoYUEfsa_Hktg%40mail.gmail.com -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
Re: Add missing tab completion for VACUUM and ANALYZE with ONLY option
On 18.03.2025 17:57, Ilia Evdokimov wrote: Hi, Thank you for the patch. However, is it OK that 'ONLY' cannot be tab-completed in a VACUUM ANALYZE command?

CREATE TABLE only_parted (a int, b text) PARTITION BY LIST (a);
CREATE TABLE only_parted1 PARTITION OF only_parted FOR VALUES IN (1);
INSERT INTO only_parted VALUES (1, 'a');

VACUUM ANALYZE {Press Tab}
information_schema. only_parted only_parted1 public.

Although it works:

postgres=# VACUUM ANALYZE ONLY only_parted;
WARNING: VACUUM ONLY of partitioned table "only_parted" has no effect
VACUUM
postgres=#

-- Best regards, Ilia Evdokimov, Tantor Labs LLC.

Umar, Vignesh, sorry for not adding you to CC earlier! -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
Re: [PERF] Improve Cardinality Estimation for Joins with GROUP BY Having Single Clause
Hi, On commit 122a9af I can't see any problem with this query:

explain analyze select * from t1 left join (select a, max(b) from t2 group by a) t2 on t1.a = t2.a;
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Right Join  (cost=397.65..669.04 rows=1000 width=12) (actual time=9.008..11.549 rows=1000.00 loops=1)
   Hash Cond: (t2.a = t1.a)
   Buffers: shared read=90
   ->  HashAggregate  (cost=370.15..560.25 rows=19010 width=8) (actual time=8.103..10.160 rows=19010.00 loops=1)
         Group Key: t2.a
         Batches: 1  Memory Usage: 2321kB
         Buffers: shared read=85
         ->  Seq Scan on t2  (cost=0.00..275.10 rows=19010 width=8) (actual time=0.015..1.907 rows=19010.00 loops=1)
               Buffers: shared read=85
   ->  Hash  (cost=15.00..15.00 rows=1000 width=4) (actual time=0.365..0.366 rows=1000.00 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 44kB
         Buffers: shared read=5
         ->  Seq Scan on t1  (cost=0.00..15.00 rows=1000 width=4) (actual time=0.033..0.187 rows=1000.00 loops=1)
               Buffers: shared read=5
 Planning:
   Buffers: shared hit=31 read=16
 Planning Time: 0.751 ms
 Execution Time: 12.599 ms
(18 rows)

-- Best regards, Ilia Evdokimov, Tantor Labs LLC.
Re: Add missing tab completion for VACUUM and ANALYZE with ONLY option
On 18.03.2025 15:25, vignesh C wrote: On Thu, 27 Feb 2025 at 11:42, vignesh C wrote: On Thu, 6 Feb 2025 at 16:29, Umar Hayat wrote: This will include "ONLY" also when we display the tables too: + else if (HeadMatches("ANALYZE")) + COMPLETE_WITH_SCHEMA_QUERY_PLUS(Query_for_list_of_analyzables, "ONLY"); like below: postgres=# analyze only t1 information_schema. ONLY public. t1 The same issue exists with vacuum too: postgres=# vacuum only t1 information_schema. ONLY public. t1 The attached patch has the fixes for the above issue. Regards, Vignesh

Hi, Thank you for the patch. However, is it OK that 'ONLY' cannot be tab-completed in a VACUUM ANALYZE command?

CREATE TABLE only_parted (a int, b text) PARTITION BY LIST (a);
CREATE TABLE only_parted1 PARTITION OF only_parted FOR VALUES IN (1);
INSERT INTO only_parted VALUES (1, 'a');

VACUUM ANALYZE {Press Tab}
information_schema. only_parted only_parted1 public.

Although it works:

postgres=# VACUUM ANALYZE ONLY only_parted;
WARNING: VACUUM ONLY of partitioned table "only_parted" has no effect
VACUUM
postgres=#

-- Best regards, Ilia Evdokimov, Tantor Labs LLC.
Re: [PATCH] Optionally record Plan IDs to track plan changes for a query
On 12.02.2025 19:00, Alvaro Herrera wrote: On 2025-Feb-12, Julien Rouhaud wrote: On Wed, Feb 12, 2025 at 01:57:47PM +0100, Alvaro Herrera wrote: Anyway, I think that's different. We do support compute_query_id=off as a way for a custom module to compute completely different query IDs using their own algorithm, which I think is what you're referring to. However, the ability to affect the way the in-core algorithm works is a different thing: you still want in-core code to compute the query ID. I don't think that's the actual behavior, or at least not what it was supposed to be. What we should have is the ability to compute queryid, which can be either in core or done by an external module, but only one can / should be done. Yes, that's what I tried to say, but I don't understand why you say I said something different. Right now, the proposal in the other thread is that if you want to affect that algorithm in order to merge arrays to be considered a single query element regardless of its length, you set the GUC for that. Initially the GUC was in the core code. Then, based on review, the GUC was moved to the extension, _BUT_ the implementation was still in the core code: in order to activate it, the extension calls a function that modifies core code behavior. So there are more moving parts than before, and if you for whatever reason want that behavior but not the extension, then you need to write a C function. To me this is absurd. So what I suggest we do is return to having the GUC in the core code. I agree, although that probably breaks the queryid extensibility. It does? I haven't read the patch but IIUC if you want the feature to work you need to both change the queryid calculation but also the way the constants are recorded and the query text is normalized, and I don't know if extensions have access to it. Hmm. As for the query text: with Andrey's feature with the GUC in core, a query like this SELECT foo FROM tab WHERE col1 IN (1,2,3,4) will have in pg_stat_activity an identical query_id to a query like this SELECT foo FROM tab WHERE col1 IN (1,2,3,4,5) even though the query texts differ (in the number of elements in the array). I don't think this is a problem. This means that the query_id for two different queries can be identical, but that should be no surprise, precisely because the GUC that controls it is documented to do that. If pg_stat_statements is enabled with Andrey's patch, then the same query_id will have a single entry (which has stats for both executions of those queries) with that query_id, with a normalized query text that is going to be different from those two above; without Andrey's feature, the text would be SELECT foo FROM tab WHERE col1 IN ($1,$2,$3,$4); SELECT foo FROM tab WHERE col1 IN ($1,$2,$3,$4,$5); (that is, pg_stat_statements transformed the values into placeholders, but using exactly the same number of items in the array as the original queries). With Andrey's feature, it will be SELECT foo FROM tab WHERE col1 IN (...); that is, the query text has been modified and no longer matches exactly any of the queries in pg_stat_activity. But note that the query text already does not match what's in pg_stat_activity, even before Andrey's patch. I don't understand what you mean with "the way the constants are recorded". What constants are you talking about? pg_stat_statements purposefully discards any constants used in the query (obviously). If they have access and fail to do what the GUC asked then of course that's just a bug in that extension.
I don't understand what bug are you thinking that such hypothetical extension would have. (pg_stat_statements does of course have access to the query text and to the location of all constants). Now I admit I'm not sure what the solution would be for the problem discussed in this subthread. Apparently the problem is related to temp tables and their changing OIDs. I'm not sure what exactly the proposal for a GUC is. I'm not proposing anything, just explaining why pg_stat_statements is generally useless if you use temp tables as someone asked. Ah, okay. Well, where you see a deficiency, I see an opportunity for improvement :-) Hi everyone, I support the idea of computing the planid for temporary tables using 'pg_temp.rel_name'. Moreover, we have already started using this approach for computing queryid [0]. It seems reasonable to apply the same logic to the planid calculation as well. [0]: https://www.postgresql.org/message-id/flat/Z9mkqplmUpQ4xG52%40msg.df7cb.de#efb20f01bec32aeafd58e5d4ab0dfc16 -- Best regards, Ilia Evdokimov, Tantor Labs LLC.
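For illustration, a minimal sketch of the temp-table problem being referenced (the table name is made up): the same statement text currently ends up with different IDs because the relation OID changes every time the temp table is re-created, while hashing the schema-qualified name as 'pg_temp.rel_name' would keep the entries together.

```
CREATE TEMP TABLE scratch (id int);
SELECT count(*) FROM scratch WHERE id = 1;
DROP TABLE scratch;
CREATE TEMP TABLE scratch (id int);          -- same SQL text, new relation OID
SELECT count(*) FROM scratch WHERE id = 1;   -- currently hashes to a different query/plan ID
```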
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment
On 06.07.2023 11:27, Lukas Fittl wrote: On Thu, Jul 6, 2023 at 12:56 AM Daniel Gustafsson wrote: Lukas: do you have an updated patch for this commitfest to address David's comments? I have a draft - I should be able to post an updated patch in the next days. Thanks for checking! Thanks, Lukas -- Lukas Fittl

Hi hackers, While debugging a query execution plan involving Memoize, it'd be nice to determine how many unique keys would fit into the cache. The est_entries value provides some insight, but without knowing ndistinct, it is unclear whether the cache is large enough to hold all unique keys or if some will be evicted. Given its potential usefulness, I would like to work on this. I've attached a v2 patch with the changes.

Example from memoize.sql:

EXPLAIN SELECT COUNT(*),AVG(t1.unique1) FROM tenk1 t1 INNER JOIN tenk1 t2 ON t1.unique1 = t2.thousand WHERE t2.unique1 < 1200;
                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Aggregate  (cost=815.12..815.13 rows=1 width=40)
   ->  Nested Loop  (cost=0.30..809.12 rows=1200 width=4)
         ->  Seq Scan on tenk1 t2  (cost=0.00..470.00 rows=1200 width=4)
               Filter: (unique1 < 1200)
         ->  Memoize  (cost=0.30..0.41 rows=1 width=4)
               Cache Key: t2.thousand
               Cache Mode: logical
               Cache Estimated Entries: 655
               Cache Estimated NDistinct: 721
               ->  Index Only Scan using tenk1_unique1 on tenk1 t1  (cost=0.29..0.40 rows=1 width=4)
                     Index Cond: (unique1 = t2.thousand)
(11 rows)

Additionally, since this information would only be shown in EXPLAIN when costs are enabled, it should not cause any performance regression in normal execution. However, reviewers should be especially careful when verifying test outputs, as this change could affect plan details in regression tests. Any thoughts?

-- Best regards, Ilia Evdokimov, Tantor Labs LLC.

From 2e7d9bd9bfe58825fbdbb0100b9bd19d5bb7e53b Mon Sep 17 00:00:00 2001 From: Ilia Evdokimov Date: Thu, 20 Mar 2025 11:45:02 +0300 Subject: [PATCH v2] Show ndistinct and est_entries in EXPLAIN for Memoize --- src/backend/commands/explain.c | 12 src/backend/optimizer/path/costsize.c | 3 +++ src/backend/optimizer/plan/createplan.c | 10 +++--- src/backend/optimizer/util/pathnode.c | 3 +++ src/include/nodes/pathnodes.h | 2 ++ src/include/nodes/plannodes.h | 6 ++ 6 files changed, 33 insertions(+), 3 deletions(-) diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index 33a16d2d8e2..95110e63c7d 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -3629,6 +3629,11 @@ show_memoize_info(MemoizeState *mstate, List *ancestors, ExplainState *es) { ExplainPropertyText("Cache Key", keystr.data, es); ExplainPropertyText("Cache Mode", mstate->binary_mode ? "binary" : "logical", es); + if (es->costs) + { + ExplainPropertyFloat("Cache Estimated Entries", "", ((Memoize *) plan)->est_entries, 0, es); + ExplainPropertyFloat("Cache Estimated NDistinct", "", ((Memoize *) plan)->ndistinct, 0, es); + } } else { @@ -3636,6 +3641,13 @@ show_memoize_info(MemoizeState *mstate, List *ancestors, ExplainState *es) appendStringInfo(es->str, "Cache Key: %s\n", keystr.data); ExplainIndentText(es); appendStringInfo(es->str, "Cache Mode: %s\n", mstate->binary_mode ?
"binary" : "logical"); + if (es->costs) + { + ExplainIndentText(es); + appendStringInfo(es->str, "Cache Estimated Entries: %d\n", ((Memoize *) plan)->est_entries); + ExplainIndentText(es); + appendStringInfo(es->str, "Cache Estimated NDistinct: %0.0f\n", ((Memoize *) plan)->ndistinct); + } } pfree(keystr.data); diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c index 256568d05a2..9c0738da4fe 100644 --- a/src/backend/optimizer/path/costsize.c +++ b/src/backend/optimizer/path/costsize.c @@ -2604,6 +2604,9 @@ cost_memoize_rescan(PlannerInfo *root, MemoizePath *mpath, mpath->est_entries = Min(Min(ndistinct, est_cache_entries), PG_UINT32_MAX); + /* Remember ndistinct for a potential EXPLAIN later */ + mpath->ndistinct = ndistinct; + /* * When the number of distinct parameter values is above the amount we can * store in the cache, then we'll have to evict some entries from the diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c index 75e2b0b9036..79795a2b5f2 100644 --- a/src/backend/optimizer/plan/createplan.c +++ b/src/backend/optimizer/plan/createplan.c @@ -284,7 +284,8 @@ static Materia