Re: Removing unneeded self joins
On 1/3/2022 03:03, Greg Stark wrote:
> On Thu, 1 Jul 2021 at 02:38, Ronan Dunklau wrote:
>> Well in some cases they can't, when the query is not emitting redundant
>> predicates by itself but they are added by something else like a view or
>> a RLS policy. Maybe it would be worth it to allow spending a bit more
>> time planning for those cases?
>
> Yeah, I'm generally in favour of doing more work in the optimizer to
> save query authors work writing queries. My question is whether it
> handles cases like:
>
> select b.x, c.y
>   from t
>   join t2 as b on (b.id = t.id)
>   join t2 as c on (c.id = t.id)
>
> That is, if you join against the same table twice on the same qual.
> Does the EC mechanism turn this into a qual on b.id = c.id and then
> turn this into a self-join that can be removed?

Yes, the self-join removal machinery uses the EC mechanism as usual to collect all join clauses, so this case works (see the demo in the attachment).

Also, in the new version of the patch I fixed one silly bug in the check of a self-join candidate's expression operator: we can remove only expressions of the form F(arg1) = G(arg2).

-- 
regards, Andrey Lepikhov
Postgres Professional

From 70398361a0a0d9c6c3c7ddd1fd305ac11138e7b1 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov"
Date: Thu, 15 Jul 2021 15:26:13 +0300
Subject: [PATCH] Remove self-joins.

Remove inner joins of a relation to itself if we can prove that the join
can be replaced with a scan. We can prove the uniqueness using the
existing innerrel_is_unique machinery.

We can remove a self-join when, for each outer row:
1. At most one inner row matches the join clauses.
2. If the join target list contains any inner vars, an inner row must be
   (physically) the same row as the outer one.

In this patch we use Rowley's [1] approach to identify a self-join:
1. Collect all mergejoinable join quals which look like a.x = b.x
2. Check innerrel_is_unique() for the qual list from (1). If it returns
   true, then each outer row matches only the same row from the inner
   relation. This proves that the join is a self-join and can be
   replaced by a scan.

Some regression tests changed due to self-join removal logic.

[1] https://www.postgresql.org/message-id/raw/CAApHDvpggnFMC4yP-jUO7PKN%3DfXeErW5bOxisvJ0HvkHQEY%3DWw%40mail.gmail.com
---
 src/backend/optimizer/plan/analyzejoins.c | 888 +-
 src/backend/optimizer/plan/planmain.c     |   5 +
 src/backend/optimizer/util/joininfo.c     |   3 +
 src/backend/optimizer/util/relnode.c      |  26 +-
 src/backend/utils/misc/guc.c              |  10 +
 src/include/optimizer/pathnode.h          |   4 +
 src/include/optimizer/planmain.h          |   2 +
 src/test/regress/expected/equivclass.out  |  32 +
 src/test/regress/expected/join.out        | 426 +++
 src/test/regress/expected/sysviews.out    |   3 +-
 src/test/regress/sql/equivclass.sql       |  16 +
 src/test/regress/sql/join.sql             | 197 +
 12 files changed, 1583 insertions(+), 29 deletions(-)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 337f470d58..c5ac8e2bd4 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -22,6 +22,7 @@
  */
 #include "postgres.h"
 
+#include "catalog/pg_class.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/joininfo.h"
@@ -32,10 +33,12 @@
 #include "optimizer/tlist.h"
 #include "utils/lsyscache.h"
 
+bool enable_self_join_removal;
+
 /* local functions */
 static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo);
 static void remove_rel_from_query(PlannerInfo *root, int relid,
-                                  Relids joinrelids);
+                                  Relids joinrelids, int subst_relid);
 static List *remove_rel_from_joinlist(List *joinlist, int relid, int *nremoved);
 static bool rel_supports_distinctness(PlannerInfo *root, RelOptInfo *rel);
 static bool rel_is_distinct_for(PlannerInfo *root, RelOptInfo *rel,
@@ -47,6 +50,9 @@ static bool is_innerrel_unique_for(PlannerInfo *root,
                                    RelOptInfo *innerrel,
                                    JoinType jointype,
                                    List *restrictlist);
+static void change_rinfo(RestrictInfo* rinfo, Index from, Index to);
+static Bitmapset* change_relid(Relids relids, Index oldId, Index newId);
+static void change_varno(Expr *expr, Index oldRelid, Index newRelid);
 
 /*
@@ -86,7 +92,7 @@ restart:
 		remove_rel_from_query(root, innerrelid,
 							  bms_union(sjinfo->min_lefthand,
										sjinfo->min
Re: [POC] Fast COPY FROM command for the table with foreign partitions
On 29.12.2020 16:20, Hou, Zhijie wrote:
>> see new version in attachment.
>
> I took a look into the patch, and have some comments.
>
> 1.
> +	PG_FINALLY();
> +	{
> +		copy_fmstate = NULL; /* Detect problems */
>
> I don't quite understand this comment. Does it mean we want to detect
> something like a null reference?
>
> 2.
> +	PG_FINALLY();
> +	{
> ...
> +		if (!OK)
> +			PG_RE_THROW();
> +	}
>
> Is this PG_RE_THROW() necessary? IMO, PG_FINALLY will reproduce the
> PG_RE_THROW action if we get to the code block due to an error being
> thrown.

This is a leftover from the debugging stage; fixed.

> 3.
> +	ereport(ERROR,
> +			(errmsg("unexpected extra results during COPY of table: %s",
> +					PQerrorMessage(conn;
>
> I found some similar messages like the following:
> pg_log_warning("unexpected extra results during COPY of table \"%s\"",
>                tocEntryTag);
> How about using the existing message style?

That style is intended for use in frontend utilities, not in contrib extensions, I think.

> 4. I noticed some non-standard code comments [1]. I think it's better to
> comment like:
> /*
>  * line 1
>  * line 2
>  */
>
> [1]---
> +	/* Finish COPY IN protocol. It is needed to do after successful copy or
> +	 * after an error.
> +	 */
> +/*
> + *
> + * postgresExecForeignCopy
> +/*
> + *
> + * postgresBeginForeignCopy

Thanks, fixed. The attached patch is rebased on 107a2d4204.

-- 
regards, Andrey Lepikhov
Postgres Professional

From 1c5439d802b7654ee50dc4326b9bc24fc7f44677 Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov
Date: Mon, 14 Dec 2020 13:37:40 +0500
Subject: [PATCH 2/2] Fast COPY FROM into the foreign or sharded table.

This feature enables bulk COPY into a foreign table when multi-insert is
possible and the foreign table has a non-zero number of columns.

The FDW API is extended with the following routines:
* BeginForeignCopy
* EndForeignCopy
* ExecForeignCopy

BeginForeignCopy and EndForeignCopy initialize and free the CopyState
for the bulk COPY. ExecForeignCopy sends a 'COPY ... FROM STDIN'
command to the foreign server, iteratively sends tuples via the
CopyTo() machinery, and finally sends EOF on this connection.

The code that constructed the list of columns for a given foreign
relation in deparseAnalyzeSql() is moved into deparseRelColumnList()
and reused in deparseCopyFromSql().

Added TAP tests on specific corner cases of the COPY FROM STDIN
operation.

By analogy with CopyFrom(), the CopyState structure is extended with a
data_dest_cb callback, used to send the text representation of a tuple
to a custom destination. The PgFdwModifyState structure is extended
with a cstate field, needed to avoid repeated initialization of the
CopyState. Also for this reason, the CopyTo() routine is split into the
set of routines CopyToStart()/CopyTo()/CopyToFinish(). The
CopyInsertMethod enum is removed; this logic is now implemented via the
ri_usesMultiInsert field of the ResultRelInfo structure.

Discussion: https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru
Authors: Andrey Lepikhov, Ashutosh Bapat, Amit Langote
---
 contrib/postgres_fdw/deparse.c              |  60 ++--
 .../postgres_fdw/expected/postgres_fdw.out  |  46 ++-
 contrib/postgres_fdw/postgres_fdw.c         | 130 ++
 contrib/postgres_fdw/postgres_fdw.h         |   1 +
 contrib/postgres_fdw/sql/postgres_fdw.sql   |  45 ++
 doc/src/sgml/fdwhandler.sgml                |  73 ++
 src/backend/commands/copy.c                 |   4 +-
 src/backend/commands/copyfrom.c             | 126 ++---
 src/backend/commands/copyto.c               |  84 ---
 src/backend/executor/execMain.c             |   8 +-
 src/backend/executor/execPartition.c        |  27 +++-
 src/include/commands/copy.h                 |   8 +-
 src/include/foreign/fdwapi.h                |  15 ++
 13 files changed, 533 insertions(+), 94 deletions(-)

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index ca2f9f3215..b2a71faabc 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -184,6 +184,8 @@ static void appendAggOrderBy(List *orderList, List *targetList,
 static void appendFunctionName(Oid funcid, deparse_expr_cxt *context);
 static Node *deparseSortGroupClause(Index ref, List *tlist, bool force_colno,
 									deparse_expr_cxt *context);
+static List *deparseRelColumnList(StringInfo buf, Relation rel,
+								  bool enclose_in_parens);
 
 /*
  * Helper functions
 */
@@ -1763,6 +1765,20 @@ d
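The commit message above describes a Begin/Exec/End lifecycle for the new FDW COPY path, with data_dest_cb redirecting each formatted row to the remote connection. As a rough illustration (not the actual C API — the class and connection object here are invented for the sketch), the control flow can be modeled like this:

```python
# Toy model of the extended FDW COPY lifecycle from the commit message:
# BeginForeignCopy opens the COPY protocol, ExecForeignCopy streams rows
# (playing the role of data_dest_cb), EndForeignCopy terminates the protocol.
class FakeForeignConn:
    def __init__(self):
        self.sent = []        # everything "transmitted" to the remote server
        self.in_copy = False  # protocol state

class CopyFdw:
    def begin_foreign_copy(self, conn, table):
        conn.in_copy = True
        conn.sent.append(f"COPY {table} FROM STDIN")

    def exec_foreign_copy(self, conn, rows):
        assert conn.in_copy, "ExecForeignCopy outside Begin/End bracket"
        for row in rows:  # data_dest_cb: text representation -> remote
            conn.sent.append("\t".join(map(str, row)))

    def end_foreign_copy(self, conn):
        conn.sent.append("\\.")  # EOF marker finishes COPY IN
        conn.in_copy = False

conn, fdw = FakeForeignConn(), CopyFdw()
fdw.begin_foreign_copy(conn, "foo1")
fdw.exec_foreign_copy(conn, [(1, 2), (2, 3)])
fdw.end_foreign_copy(conn)
assert conn.sent == ["COPY foo1 FROM STDIN", "1\t2", "2\t3", "\\."]
assert not conn.in_copy
```

The point of the split is visible even in the toy: the expensive setup (Begin) happens once per COPY, not once per tuple, which is what makes the bulk path fast.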
Increase value of OUTER_VAR
Hi,

While experimenting with a large number of partitions, I hit the limit of 65000 range table entries in a query plan:

	if (IS_SPECIAL_VARNO(list_length(glob->finalrtable)))
		ereport(ERROR,
				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
				 errmsg("too many range table entries")));

Postgres otherwise works well with this many partitions. The constants INNER_VAR, OUTER_VAR and INDEX_VAR are used as values of 'var->varno', which is of integer type. As far as I can see, they were introduced in commit 1054097464, authored by Marc G. Fournier in 1996; the value 65000 was relevant to the size of the int type at that time. Maybe we should change these values to INT_MAX? (See the attached patch.)

-- 
regards, Andrey Lepikhov
Postgres Professional

From 98da77cdefd53dbf4ce0c740d1a0f356da970648 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov"
Date: Wed, 3 Mar 2021 11:22:32 +0300
Subject: [PATCH] Set values of special varnos to the upper bound of the int
 type
---
 src/include/nodes/primnodes.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index d4ce037088..9038d97886 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -168,9 +168,9 @@ typedef struct Expr
  * in the planner and doesn't correspond to any simple relation column may
  * have varnosyn = varattnosyn = 0.
  */
-#define	INNER_VAR	65000			/* reference to inner subplan */
-#define	OUTER_VAR	65001			/* reference to outer subplan */
-#define	INDEX_VAR	65002			/* reference to index column */
+#define	INNER_VAR	(INT_MAX - 2)	/* reference to inner subplan */
+#define	OUTER_VAR	(INT_MAX - 1)	/* reference to outer subplan */
+#define	INDEX_VAR	(INT_MAX)		/* reference to index column */
 
 #define IS_SPECIAL_VARNO(varno)	((varno) >= INNER_VAR)
-- 
2.29.2
Re: Increase value of OUTER_VAR
On 3/3/21 12:52, Julien Rouhaud wrote:
> On Wed, Mar 3, 2021 at 4:57 PM Amit Langote wrote:
>> On Wed, Mar 3, 2021 at 5:52 PM David Rowley wrote:
>>> Something like 1 million seems like a more realistic limit to me.
>>> That might still be on the high side, but it'll likely mean we'd not
>>> need to revisit this for quite a while.
>>
>> +1
>>
>> Also, I got reminded of this discussion from not so long ago:
>> https://www.postgresql.org/message-id/flat/16302-e45634e2c0e34e97%40postgresql.org
>
> +1

Thank you. OK, I changed the value to 1 million and explained this decision in the comment.

This issue is caused mostly by two cases:
1. Range partitioning on a timestamp column.
2. Hash partitioning.

Users choose range distribution by timestamp because they want to insert new data quickly and analyze the entire data set. Also, in some discussions I have seen Oracle users reporting issues with more than 1e5 partitions.

-- 
regards, Andrey Lepikhov
Postgres Professional

From a3c1ee9d2e197dee40aed81cb6a08695a8fa2917 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov"
Date: Wed, 3 Mar 2021 11:22:32 +0300
Subject: [PATCH] Increase values of special varnos to 1 million.

Use this value as a realistic limit on the number of range table entries.
Keep the restriction to detect possible errors which could cause
exceeding MaxAllocSize in palloc() on the elements of an array. This
value can be changed later to suit the needs of the real world.
---
 src/include/nodes/primnodes.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index d4ce037088..06016340a3 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -168,9 +168,9 @@ typedef struct Expr
  * in the planner and doesn't correspond to any simple relation column may
  * have varnosyn = varattnosyn = 0.
  */
-#define	INNER_VAR	65000	/* reference to inner subplan */
-#define	OUTER_VAR	65001	/* reference to outer subplan */
-#define	INDEX_VAR	65002	/* reference to index column */
+#define	INNER_VAR	1000000	/* reference to inner subplan */
+#define	OUTER_VAR	1000001	/* reference to outer subplan */
+#define	INDEX_VAR	1000002	/* reference to index column */
 
 #define IS_SPECIAL_VARNO(varno)	((varno) >= INNER_VAR)
-- 
2.29.2
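The scheme that both patch versions adjust can be sketched outside of C. Below is a small Python model (values taken from the messages above: ordinary varnos are range-table indexes, and everything at or above INNER_VAR is treated as special by IS_SPECIAL_VARNO):

```python
# Model of the special-varno scheme after raising the limit to 1 million.
# The names mirror the #defines in primnodes.h; this is only an illustration.
INT_MAX = 2**31 - 1

INNER_VAR = 1_000_000  # reference to inner subplan
OUTER_VAR = 1_000_001  # reference to outer subplan
INDEX_VAR = 1_000_002  # reference to index column

def is_special_varno(varno: int) -> bool:
    """Mirror of IS_SPECIAL_VARNO: special varnos sit above all RTE indexes."""
    return varno >= INNER_VAR

# A plan with 70000 range table entries -- impossible under the old 65000
# limit -- is now representable, because its varnos stay below INNER_VAR.
assert not is_special_varno(70_000)
assert all(is_special_varno(v) for v in (INNER_VAR, OUTER_VAR, INDEX_VAR))
# And 1 million still leaves enormous headroom in a 32-bit varno.
assert INDEX_VAR < INT_MAX
```

The trade-off the thread settles on is visible here: the limit only needs to be above any realistic number of range table entries while staying low enough that an out-of-range varno still signals a bug.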
Re: [POC] Fast COPY FROM command for the table with foreign partitions
On 11/23/20 7:49 AM, tsunakawa.ta...@fujitsu.com wrote:
> Hi Andrey-san,
>
> From: Tomas Vondra
>> I needed to look at this patch while working on something related, and
>> I found it got broken by 6973533650c a couple days ago. So here's a
>> fixed version, to keep cfbot happy. I haven't done any serious review
>> yet.
>
> Could I or my colleague continue this patch in a few days? It looks like
> it has been stalled for over a month.

I haven't found any problems with this patch that need to be corrected. It is waiting for action from the committers' side, I think.

-- 
regards, Andrey Lepikhov
Postgres Professional
Re: [POC] Fast COPY FROM command for the table with foreign partitions
On 11/24/20 9:27 AM, tsunakawa.ta...@fujitsu.com wrote:
> Andrey-san, Fujita-san,
>
> From: Etsuro Fujita
>> On Mon, Nov 23, 2020 at 5:39 PM Andrey Lepikhov wrote:
>>> On 11/23/20 7:49 AM, tsunakawa.ta...@fujitsu.com wrote:
>>>> Could I or my colleague continue this patch in a few days? It looks
>>>> like it has been stalled for over a month.
>>>
>>> I haven't found any problems with this patch that need to be
>>> corrected. It is waiting for action from the committers' side, I think.
>>
>> I'm planning to review this patch. I think it would be better for
>> another pair of eyes to take a look at it, though.
>
> There are the following two issues left untouched.
> https://www.postgresql.org/message-id/TYAPR01MB2990DC396B338C98F27C8ED3FE1F0%40TYAPR01MB2990.jpnprd01.prod.outlook.com

I disagree with your opinion about changing the interface of the ExecRelationAllowsMultiInsert routine. If you insist on the need for this change, we need another opinion.

-- 
regards, Andrey Lepikhov
Postgres Professional
Re: Removing unneeded self joins
Thank you for the review.

On 27.11.2020 21:49, Heikki Linnakangas wrote:
> On 31/10/2020 11:26, Andrey V. Lepikhov wrote:
>> +	/*
>> +	 * Process restrictlist to seperate out the self join quals from
>> +	 * the other quals. e.g x = x goes to selfjoinquals and a = b to
>> +	 * otherjoinquals.
>> +	 */
>> +	split_selfjoin_quals(root, restrictlist, &selfjoinquals,
>> +						 &otherjoinquals);
>> +
>> +	if (list_length(selfjoinquals) == 0)
>> +	{
>> +		/*
>> +		 * Have a chance to remove join if target list contains vars from
>> +		 * the only one relation.
>> +		 */
>
> I don't understand the logic here. If 'selfjoinquals' is empty, it means
> that there is no join qual between the two relations, right? How can we
> ever remove the join in that case? And how does the target list affect
> that? Can you give an example query of that?

Maybe it is a problem of variable naming. Following the idea of David Rowley, we split the quals into two subsets: {x = x} and the rest, for example {x = y}. The first set is the trivial self-join case: if we have a unique index on attribute 'x', then this join is a self-join. The second set gives us another chance: if the inner side is unique over the right side of the qual, and no vars from the inner side end up in the target list of the join, then this is a self-join case too. Example:

CREATE TABLE a(x int, y int);
CREATE UNIQUE INDEX ON a(x);
SELECT a1.* FROM a a1, a a2 WHERE a1.x = a2.x; -- self-join
CREATE UNIQUE INDEX ON a(y);
SELECT a1.* FROM a a1, a a2 WHERE a1.x = a2.y; -- self-join too

> --- a/src/test/regress/expected/join.out
> +++ b/src/test/regress/expected/join.out
> @@ -4553,11 +4553,13 @@ explain (costs off)
>  select p.* from
>    (parent p left join child c on (p.k = c.k)) join parent x on p.k = x.k
>    where p.k = 1 and p.k = 2;
> -          QUERY PLAN
> ---------------------------
> +                   QUERY PLAN
> +------------------------------------------------
>   Result
>     One-Time Filter: false
> -(2 rows)
> +   ->  Index Scan using parent_pkey on parent x
> +         Index Cond: (k = 1)
> +(4 rows)
>
>  -- bug 5255: this is not optimizable by join removal
>  begin;
>
> That doesn't seem like an improvement...

I investigated this case. It is an artifact of how the planner works: it simplifies dummy joins and dummy scans into different plans. Maybe it can cause some discussion, but this example is so rare that I don't think it is a real problem.

> My general impression of this patch is that it's a lot of code to deal
> with a rare special case. At the beginning of this thread there was
> discussion on the overhead that this might add to planning queries that
> don't benefit, but adding a lot of code isn't nice either, even if the
> performance is acceptable. That's not necessarily a show-stopper, but it
> does make me much less excited about this. I'm not sure what to suggest
> to do about that, except a really vague "Can you make it simpler?"

Currently the planner already removes useless outer joins and unique semijoins; the self-join removal feature continues its development in the same direction. It is useful, for example, for queries generated by ORM software. Most of the code is dedicated to removing the unnecessary relation and replacing one relid with another. We are trying to reuse the remove_rel_from_query() machinery; perhaps this will allow us to make the code shorter.

-- 
regards, Andrey Lepikhov
Postgres Professional
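The qual-splitting step described above can be sketched in a few lines. This is a schematic Python model, not the C implementation: each join qual between the two sides of a candidate self-join is reduced to the pair of attribute names it compares.

```python
# Schematic model of split_selfjoin_quals(): partition the restrict list
# into {x = x} quals (selfjoinquals) and {x = y} quals (otherjoinquals).
def split_selfjoin_quals(restrictlist):
    selfjoinquals, otherjoinquals = [], []
    for left_attr, right_attr in restrictlist:
        if left_attr == right_attr:
            selfjoinquals.append((left_attr, right_attr))
        else:
            otherjoinquals.append((left_attr, right_attr))
    return selfjoinquals, otherjoinquals

# WHERE a1.x = a2.x AND a1.x = a2.y, as in the SQL examples:
sjq, ojq = split_selfjoin_quals([("x", "x"), ("x", "y")])
assert sjq == [("x", "x")]  # trivially a self-join given a unique index on x
assert ojq == [("x", "y")]  # needs uniqueness on y plus the target-list check
```

The two output lists then feed the two removal paths: a non-empty selfjoinquals list is checked directly with the uniqueness machinery, while an empty one falls through to the target-list test that Heikki asked about.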
Cost overestimation of foreign JOIN
Hi,

While testing the Asynchronous Append feature with TPC-H queries, I found that the JOIN push-down technique is rarely used. For my servers, fdw_tuple_cost = 0.2 and fdw_startup_cost = 100. Exploring the code, I found in postgres_fdw's estimate_path_cost_size(), lines 2908-2909:

	run_cost += nrows * join_cost.per_tuple;
	nrows = clamp_row_est(nrows * fpinfo->joinclause_sel);

Above this, nrows is set to fpinfo_i->rows * fpinfo_o->rows.

Maybe lines 2908 and 2909 need to be swapped (see attachment)? In my case of two big partitioned tables and a small join result, this strongly influences the choice of the JOIN push-down strategy.

-- 
regards, Andrey Lepikhov
Postgres Professional

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index b6c72e1d1e..3047300c4b 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -2905,8 +2905,8 @@ estimate_path_cost_size(PlannerInfo *root,
 		 */
 		run_cost = fpinfo_i->rel_total_cost - fpinfo_i->rel_startup_cost;
 		run_cost += fpinfo_o->rel_total_cost - fpinfo_o->rel_startup_cost;
-		run_cost += nrows * join_cost.per_tuple;
 		nrows = clamp_row_est(nrows * fpinfo->joinclause_sel);
+		run_cost += nrows * join_cost.per_tuple;
 		run_cost += nrows * remote_conds_cost.per_tuple;
 		run_cost += fpinfo->local_conds_cost.per_tuple * retrieved_rows;
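The magnitude of the difference is easy to see with invented numbers (the figures below are purely illustrative; as the reply downthread explains, the current ordering is a deliberate worst-case assumption, not a bug):

```python
# Model of the two orderings in estimate_path_cost_size(): charge the
# per-tuple join cost on the cross product (current) or on the clamped
# join-result estimate (proposed swap).
def join_run_cost(rows_outer, rows_inner, joinclause_sel,
                  per_tuple_cost, swap=False):
    nrows = rows_outer * rows_inner          # worst case: cross product
    if not swap:
        run_cost = nrows * per_tuple_cost    # cost on the cross product
        nrows = max(1.0, nrows * joinclause_sel)   # clamp_row_est analogue
    else:
        nrows = max(1.0, nrows * joinclause_sel)   # clamp first...
        run_cost = nrows * per_tuple_cost    # ...then cost the join result
    return run_cost

# Two big inputs, tiny join result (selectivity 1e-6), per-tuple cost 0.01:
current = join_run_cost(1e6, 1e6, 1e-6, 0.01)            # ~1e10
proposed = join_run_cost(1e6, 1e6, 1e-6, 0.01, swap=True) # ~1e4
assert abs(current - 1e10) < 1.0
assert abs(proposed - 1e4) < 1.0
```

A six-order-of-magnitude gap in run_cost is more than enough to flip the planner away from the pushed-down join path, which matches the behavior observed with the partitioned TPC-H tables.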
Re: Cost overestimation of foreign JOIN
On 30.11.2020 22:38, Tom Lane wrote:
> Andrey Lepikhov writes:
>> Maybe it is needed to swap lines 2908 and 2909 (see attachment)?
>
> No; as explained in the comment immediately above here, we're assuming
> that the join conditions will be applied on the cross product of the
> input relations.

Thank you. Now it is clear to me.

> Now admittedly, that's a worst-case assumption, since it amounts to
> expecting that the remote server will do the join in the dumbest
> possible nested-loop way. If the remote can use a merge or hash join,
> for example, the cost is likely to be a lot less.

My goal is scaling Postgres across a set of identical servers running identical Postgres instances. If one server uses a hash-join node for this join, I think it is most likely that the other servers will also use a hash-join node for it. (Maybe you remember, I also use the statistics-copying technique to provide up-to-date statistics on partitions.) Tests show good results with such an approach. But maybe this is just my special case.

> But it is not the job of this code path to outguess the remote planner.
> It's certainly not appropriate to invent an unprincipled cost estimate
> as a substitute for trying to guess that.

Agreed.

> If you're unhappy with the planning results you get for this, why don't
> you have use_remote_estimate turned on?

I have a mixed load model. For large queries, the additional estimate queries are acceptable. But for the many simple SELECTs that touch a small portion of the data, the latency increased significantly, and I don't know how to switch the use_remote_estimate setting between these cases.

-- 
regards, Andrey Lepikhov
Postgres Professional
Re: [POC] Fast COPY FROM command for the table with foreign partitions
On 2/2/21 11:57, tsunakawa.ta...@fujitsu.com wrote:
> Hello, Andrey-san,
>
> From: Tang, Haiying
>>> Some time ago I suggested an additional optimization [1] which can
>>> speed up COPY by another 2-4 times. Maybe you can perform the
>>> benchmark for this solution too?
>> ...
>> But the patch no longer applies. I tried to apply it on e42b3c3bd6
>> (2021/1/26) but failed. Can you send a rebased version?
>
> When do you think you can submit the rebased version of them? (IIUC at
> the off-list HighGo meeting, you're planning to post them late this
> week, after the global snapshot patch.) Just in case you are not going
> to do them for the moment, can we rebase and/or further modify them so
> that they can be committed in PG 14?

Of course, you can rebase it.

-- 
regards, Andrey Lepikhov
Postgres Professional
Re: Clarifying/rationalizing Vars' varno/varattno/varnoold/varoattno
On 5/2/2020 01:24, Tom Lane wrote:
> I've not written any actual code, but am close to being ready to.

This thread gives us hope of getting started on solving some of the basic planner problems. But there has been no activity for a long time, as far as I can see. Have you tried to implement this idea? Is it still relevant?

-- 
regards, Andrey Lepikhov
Postgres Professional
Re: Clarifying/rationalizing Vars' varno/varattno/varnoold/varoattno
On 22/12/2021 20:42, Tom Lane wrote:
> Andrey Lepikhov writes:
>> On 5/2/2020 01:24, Tom Lane wrote:
>>> I've not written any actual code, but am close to being ready to.
>>
>> This thread gives us hope of getting started on solving some of the
>> basic planner problems. But there has been no activity for a long
>> time, as far as I can see. Have you tried to implement this idea? Is
>> it still relevant?
>
> It's been on the back burner for awhile :-(. I've not forgotten about
> it though.

I would like to try to develop this feature. The idea is clear to me, but the definition of the NullableVars structure is not obvious. Maybe you could prepare a sketch of this struct, or do you already have some draft code?

-- 
regards, Andrey Lepikhov
Postgres Professional
Re: Add index scan progress to pg_stat_progress_vacuum
On 21/12/2021 00:05, Peter Geoghegan wrote:
> * Some index AMs don't work like nbtree and GiST in that they cannot do
> their scan sequentially -- they have to do something like a
> logical/keyspace order scan instead, which is *totally* different to
> heapam (not just a bit different). There is no telling how many times
> each page will be accessed in these other index AMs, and in what order,
> even under optimal conditions. We should arguably not even try to
> provide any granular progress information here, since it'll probably be
> too messy.

Maybe we could add callbacks to the AM interface for the sending, receiving and representation of progress data? Each AM would then define its own set of progress parameters to send to the stats collector and show to users.

-- 
regards, Andrey Lepikhov
Postgres Professional
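The callback idea can be made concrete with a tiny sketch. Everything here is hypothetical — no such API exists; the point is only that each AM supplies both the parameter names and the mapping from its internal state to values, so a keyspace-order AM is not forced into nbtree's blocks_total/blocks_done shape:

```python
# Hypothetical per-AM progress interface: the AM declares which counters it
# exposes and how to compute them from its own (opaque) scan state.
class AmProgressHandle:
    def __init__(self, param_names, report_cb):
        self.param_names = list(param_names)  # what a progress view would show
        self.report_cb = report_cb            # AM state -> tuple of values

    def snapshot(self, am_state):
        values = self.report_cb(am_state)
        return dict(zip(self.param_names, values))

# An nbtree-like AM can report linear progress over its pages...
btree_progress = AmProgressHandle(
    ["blocks_total", "blocks_done"],
    lambda st: (st["nblocks"], st["scanned"]))

# ...while a keyspace-order AM might only expose an opaque work counter.
gin_progress = AmProgressHandle(
    ["entries_processed"],
    lambda st: (st["entries"],))

assert btree_progress.snapshot({"nblocks": 100, "scanned": 40}) == \
       {"blocks_total": 100, "blocks_done": 40}
assert gin_progress.snapshot({"entries": 7}) == {"entries_processed": 7}
```

This sidesteps Peter's objection: an AM that cannot give a meaningful fraction of completion simply declares a coarser parameter set instead of reporting misleading block counts.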
Multiple Query IDs for a rewritten parse tree
On 5/1/2022 10:13, Tom Lane wrote:
> I feel like we need to get away from the idea that there is just
> one query hash, and somehow let different extensions attach
> differently-calculated hashes to a query. I don't have any immediate
> ideas about how to do that in a reasonably inexpensive way.

Currently, the queryId field represents a query class (depending on the jumbling implementation). Extensions use it as a simple way to track a query from the creation of its parse tree to the end of its life, across all the hook calls an extension uses (remember possible plan caching). I know of at least two different uses of queryId:
1) Monitoring purposes - pg_stat_statements watches how often queries of a class show up in the database and collects statistics on each class.
2) Adaptive purposes - some extensions influence planner decisions during the optimization stage and want to learn from the performance shift observed at the end of the execution stage.

Different purposes may require different jumbling implementations, but users may want to use such extensions at the same time. So we should allow many different query IDs to be attached to a query (maybe it is better to call them 'query labels'?). Thinking about it for a while, I came up with three different ways to implement this:

1. queryId becomes a trivial 64-bit counter. Each extension can then distinguish any query from any other, track it across all hooks, run its own jumbling code, and store its queryId internally. The one big problem I see here is the growing overhead when there are many consumers of the queryId feature.

2. Instead of a single queryId, we store a list of (queryId, funcOid) pairs. An extension registers a callback for queryId generation, and the core forms the list of queryIds right after query tree rewriting. funcOid is needed to distinguish jumbling implementations. Here we would have to invent an additional node type for an element of the list.

3. Instead of queryId, we add a multi-purpose 'private' list to the Query struct. Any extension can add additional object(s) to this list (with registered object types, of course). As an example, I can imagine a convention for queryIds in this case - store a String node with the value '<extension name> - <queryId>'. This way we would need to implement a registered callback mechanism too.

I think the third way is the cheapest, most flexible and simplest to implement. Any thoughts, comments, criticism?

-- 
regards, Andrey Lepikhov
Postgres Professional
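Option 2 above — a per-query list of (queryId, generator) pairs filled right after rewrite — can be modeled in a few lines. This is only an illustrative Python sketch (the "funcOid" of the proposal is modeled as a plain integer generator id; the two lambdas stand in for two extensions' jumbling policies):

```python
# Hypothetical model of option 2: the core keeps one (gen_id, label) pair
# per registered queryId generator, computed once per rewritten query tree.
generators = {}

def register_queryid_generator(gen_id, func):
    """An extension registers its own jumbling callback under its own id."""
    generators[gen_id] = func

def compute_query_labels(query_tree):
    """Core side: run every registered generator once, after query rewrite."""
    return [(gid, f(query_tree)) for gid, f in sorted(generators.items())]

# Two policies: constants ignored (pg_stat_statements-like) vs. constants
# included (useful when matching frozen plans, as in the sr_plan case).
register_queryid_generator(1, lambda q: hash(q["shape"]))
register_queryid_generator(2, lambda q: hash((q["shape"], q["consts"])))

q1 = {"shape": "SELECT * FROM t WHERE x = $1", "consts": (42,)}
q2 = {"shape": "SELECT * FROM t WHERE x = $1", "consts": (43,)}
l1, l2 = dict(compute_query_labels(q1)), dict(compute_query_labels(q2))
assert l1[1] == l2[1]  # same class for the monitoring-style generator
assert l1[2] != l2[2]  # different labels once constants participate
```

The sketch also shows the cost concern with option 1: with N registered consumers, every query pays for N jumbling passes, so the per-generator callback list only wins if most installations register one or two generators.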
Re: Multiple Query IDs for a rewritten parse tree
On 10/1/2022 13:56, Julien Rouhaud wrote:
> On Mon, Jan 10, 2022 at 12:37:34PM +0500, Andrey V. Lepikhov wrote:
>> I think pg_stat_statements could live with the queryId generator of
>> the sr_plan extension. But it replaces all constants with $XXX
>> parameters in the query string. In our extension, the user defines
>> which plan is optimal and which constants can be used as parameters
>> in the plan.
>
> I don't know the details of that extension, but I think that it should
> work as long as you have the constants information in the jumble state,
> whatever the resulting normalized query string is, right?

Yes. The same input query string doesn't prove that a frozen query plan can be used, because the rewrite rules could have changed. So we use only the query tree; here we must have a custom jumbling implementation. queryId in this extension determines two things:
1. How many input queries will be compared with the query tree template of the frozen statement.
2. As a consequence, the performance overhead of unsuccessful comparisons.

One drawback I see here: creating or dropping our extension changes the behavior of pg_stat_statements, which distorts the DB load profile. Also, we have no guarantee that another extension will work correctly (or optimally) with such a queryId.

> But then, if generating 2 queryids is better for performance, does the
> extension really need an additional jumble state and/or normalized
> query string?

An additional jumble state isn't necessary, but it would be better for performance to collect pointers to all constant nodes during the hash generation process.

-- 
regards, Andrey Lepikhov
Postgres Professional
Re: Multiple Query IDs for a rewritten parse tree
On 10/1/2022 15:39, Julien Rouhaud wrote:
> On Mon, Jan 10, 2022 at 03:24:46PM +0500, Andrey Lepikhov wrote:
>> On 10/1/2022 13:56, Julien Rouhaud wrote:
>> Yes. The same input query string doesn't prove that a frozen query
>> plan can be used, because the rewrite rules could have changed. So we
>> use only the query tree.
>
> Yes, I'm fully aware of that. I wasn't asking about using the input
> query string but about the need for generating a dedicated normalized
> output query string that would be potentially different from the one
> generated by pg_stat_statements (or similar).

Thanks, now I get it. I don't remember a single situation where we would have needed to normalize a query string.

>>> But then, if generating 2 queryids is better for performance, does
>>> the extension really need an additional jumble state and/or
>>> normalized query string?
>>
>> An additional jumble state isn't necessary, but it would be better for
>> performance to collect pointers to all constant nodes during the hash
>> generation process.
>
> I agree, but it's even better for performance if this is collected only
> once. I still don't know if this extension (or any extension) would
> require something different from a common jumble state that would serve
> for all purposes, or if we can work with a single jumble state and only
> handle multiple queryids.

I think JumbleState is highly monitoring-specific, maybe even pg_stat_statements-specific (perhaps you can point out other examples). I have no idea how it could be used by an arbitrary extension. But it is not so difficult to imagine an implementation (as part of Tom's approach, described earlier) of such a kind of storage for each generation method.

-- 
regards, Andrey Lepikhov
Postgres Professional
Re: [POC] Fast COPY FROM command for the table with foreign partitions
On 5/3/21 21:54, tsunakawa.ta...@fujitsu.com wrote:
> I've managed to rebase it, although it took unexpectedly long. The
> patch is attached. It passes make check against core and postgres_fdw.
> I'll turn the CF status back to ready for committer shortly.

The _() macro in the postgresExecForeignCopy routine:

	if (PQputCopyEnd(conn, OK ? NULL : _("canceled by server")) <= 0)

uses gettext. Under Linux it compiles fine because (as I understand it) the standard implementation of gettext is used:

	objdump -t contrib/postgres_fdw/postgres_fdw.so | grep 'gettext'
	gettext@@GLIBC_2.2.5

but on MacOS (and maybe elsewhere) we need to explicitly link the libintl library in the Makefile:

	SHLIB_LINK += $(filter -lintl, $(LIBS))

Alternatively, we could avoid using gettext at all in this part of the code.

-- 
regards, Andrey Lepikhov
Postgres Professional
Re: [POC] Fast COPY FROM command for the table with foreign partitions
On 27.07.2020 21:34, Alexey Kondratov wrote:
> Hi Andrey,
>
> On 2020-07-23 09:23, Andrey V. Lepikhov wrote:
>> On 7/16/20 2:14 PM, Amit Langote wrote:
>>> Amit Langote
>>> EnterpriseDB: http://www.enterprisedb.com
>>
>> Version 5 of the patch. With changes caused by Amit's comments.
>
> Just got a segfault with your v5 patch by deleting from a foreign
> table. It seems that ShutdownForeignScan inside node->fdwroutine
> doesn't have a correct pointer to the required function. I haven't had
> a chance to look closer at the code, but you can easily reproduce this
> error with the attached script (patched Postgres binaries should be
> available in the PATH). It works well with master and fails with your
> patch applied.

I used master at a3ab7a707d and the v5 version of the patch with your script. No errors found. Can you check your test case?

-- 
regards, Andrey Lepikhov
Postgres Professional
Re: [POC] Fast COPY FROM command for the table with foreign partitions
On 8/7/20 2:14 PM, Amit Langote wrote: I was playing around with v5 and I noticed an assertion failure which I concluded is due to improper setting of ri_usesBulkModify. You can reproduce it with these steps. create extension postgres_fdw; create server lb foreign data wrapper postgres_fdw ; create user mapping for current_user server lb; create table foo (a int, b int) partition by list (a); create table foo1 (like foo); create foreign table ffoo1 partition of foo for values in (1) server lb options (table_name 'foo1'); create table foo2 (like foo); create foreign table ffoo2 partition of foo for values in (2) server lb options (table_name 'foo2'); create function print_new_row() returns trigger language plpgsql as $$ begin raise notice '%', new; return new; end; $$; create trigger ffoo1_br_trig before insert on ffoo1 for each row execute function print_new_row(); copy foo from stdin csv; Enter data to be copied followed by a newline. End with a backslash and a period on a line by itself, or an EOF signal. 1,2 2,3 \. NOTICE: (1,2) server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. #0 0x7f2d5e266337 in raise () from /lib64/libc.so.6 #1 0x7f2d5e267a28 in abort () from /lib64/libc.so.6 #2 0x00aafd5d in ExceptionalCondition (conditionName=0x7f2d37b468d0 "!resultRelInfo->ri_usesBulkModify || resultRelInfo->ri_FdwRoutine->BeginForeignCopyIn == NULL", errorType=0x7f2d37b46680 "FailedAssertion", fileName=0x7f2d37b4677f "postgres_fdw.c", lineNumber=1863) at assert.c:67 #3 0x7f2d37b3b0fe in postgresExecForeignInsert (estate=0x2456320, resultRelInfo=0x23a8f58, slot=0x23a9480, planSlot=0x0) at postgres_fdw.c:1862 #4 0x0066362a in CopyFrom (cstate=0x23a8d40) at copy.c:3331 The problem is that partition ffoo1's BR trigger prevents it from using multi-insert, but its ResultRelInfo.ri_usesBulkModify is true, which is copied from its parent. 
We should really check the same things for a partition that CopyFrom() checks for the main target relation (root parent) when deciding whether to use multi-insert. Thanks, I added a TAP test for this problem. However instead of duplicating the same logic to do so in two places (CopyFrom and ExecInitPartitionInfo), I think it might be a good idea to refactor the code to decide if multi-insert mode can be used for a given relation by checking its properties and put it in some place that both the main target relation and partitions need to invoke. InitResultRelInfo() seems to be one such place. +1 Also, it might be a good idea to use ri_usesBulkModify more generally than only for foreign relations as the patch currently does, because I can see that it can replace the variable insertMethod in CopyFrom(). Having both insertMethod and ri_usesBulkModify in each ResultRelInfo seems confusing and bug-prone. Finally, I suggest renaming ri_usesBulkModify to ri_usesMultiInsert to reflect its scope. Please check the attached delta patch that applies on top of v5 to see what that would look like. I merged your delta patch into the main patch (see v6 in attachment). It now seems closer to committable than before. -- regards, Andrey Lepikhov Postgres Professional >From da6c4fe8262df58164e2c4ab80e085a019c9c6c1 Mon Sep 17 00:00:00 2001 From: Andrey Lepikhov Date: Thu, 9 Jul 2020 11:16:56 +0500 Subject: [PATCH] Fast COPY FROM into the foreign or sharded table. This feature enables bulk COPY into a foreign table when multi-inserts are possible and the foreign table has a non-zero number of columns. The FDW API was extended with the following routines: * BeginForeignCopyIn * EndForeignCopyIn * ExecForeignCopyIn BeginForeignCopyIn and EndForeignCopyIn initialize and free the CopyState used for bulk COPY. The ExecForeignCopyIn routine sends a 'COPY ... FROM STDIN' command to the foreign server, sends tuples iteratively via the CopyTo() machinery, and finally sends EOF on the connection.
The code that constructed the list of columns for a given foreign relation in the deparseAnalyzeSql() routine was moved into deparseRelColumnList(). It is reused in deparseCopyFromSql(). Added TAP tests for the specific corner cases of the COPY FROM STDIN operation. By analogy with CopyFrom(), the CopyState structure was extended with a data_dest_cb callback. It is used to send the text representation of a tuple to a custom destination. The PgFdwModifyState structure is extended with the cstate field, which is needed to avoid repeated initialization of CopyState. Also for this reason the CopyTo() routine was split into the set of routines CopyToStart()/ CopyTo()/CopyToFinish(). Enum CopyInsertMethod was removed; this logic is now implemented via the ri_usesMultiInsert field of the ResultRelInfo structure. Discussion: https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru Authors: Andrey Lepik
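The refactoring discussed above, deciding multi-insert eligibility per relation in one place (InitResultRelInfo()) instead of copying the parent's flag, can be sketched as a standalone predicate. This is an illustrative Python model, not PostgreSQL code: only `ri_usesMultiInsert` comes from the thread, and the specific disqualifying properties shown (before-row triggers, instead-of triggers, volatile defaults, FDW COPY support) are assumptions standing in for the real checks in CopyFrom().

```python
from dataclasses import dataclass

@dataclass
class ResultRelInfo:
    """Simplified model of the per-relation state discussed in the thread."""
    has_before_row_insert_trigger: bool = False
    has_instead_of_insert_trigger: bool = False
    has_volatile_defaults: bool = False
    fdw_supports_copy_in: bool = True
    ri_usesMultiInsert: bool = False

def init_result_rel_info(rel: ResultRelInfo) -> ResultRelInfo:
    """Decide multi-insert eligibility in one place, for the root target
    and for partitions alike, instead of inheriting the parent's flag."""
    rel.ri_usesMultiInsert = (
        not rel.has_before_row_insert_trigger
        and not rel.has_instead_of_insert_trigger
        and not rel.has_volatile_defaults
        and rel.fdw_supports_copy_in
    )
    return rel
```

In this model a partition with a BEFORE ROW trigger (the crashing case in the report) gets its own `ri_usesMultiInsert = False` regardless of what was decided for the parent.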
Re: Ideas about a better API for postgres_fdw remote estimates
On 7/14/20 6:32 AM, Bruce Momjian wrote: On Mon, Jul 6, 2020 at 11:28:28AM -0400, Stephen Frost wrote: Yeah, thinking about it as a function that inspects partial planner results, it might be useful for other purposes besides postgres_fdw. As I said before, I don't think this necessarily has to be bundled as part of postgres_fdw. That still doesn't make it part of EXPLAIN. Providing it as a function rather than through EXPLAIN does make a bit more sense if we're going to skip things like the lookups you mention above. I'm still inclined to have it be a part of core rather than having it as postgres_fdw though. I'm not completely against it being part of postgres_fdw... but I would think that would really be appropriate if it's actually using something in postgres_fdw, but if everything that it's doing is part of core and nothing related specifically to the postgres FDW, then having it as part of core makes more sense to me. Also, having it as part of core would make it more appropriate for other tools to look at and adding that kind of inspection capability for partial planner results could be very interesting for tools like pgAdmin and such. I agree the statistics extraction should probably be part of core. There is the goal of FDWs returning data, and returning the data quickly. I think we can require all-new FDW servers to get improved performance. I am not even clear if we have a full understanding of the performance characteristics of FDWs yet. I know Tomas did some research on its DML behavior, but other than that, I haven't seen much. On a related note, I have wished to be able to see all the costs associated with plans not chosen, and I think others would like that as well. Getting multiple costs for a query goes in that direction.
During the implementation of sharding-related improvements I noticed that if we use a lot of foreign partitions, we get bad plans because vacuum doesn't update statistics of foreign tables. This is done by the ANALYZE command, but it is a very expensive operation for a foreign table. The problem with statistics is demonstrated by the TAP test from the first patch in the attachment. I implemented some FDW + pg core machinery to reduce the weight of the problem. The ANALYZE command on a foreign table executes a query on the foreign server that extracts the statistics tuple, serializes it into a json-formatted string and returns it to the caller. The caller deserializes this string, generates statistics for the foreign table and updates it. The second patch is a proof-of-concept. This patch speeds up the ANALYZE command and keeps statistics on a foreign table relevant after an autovacuum operation. Its effectiveness depends on the relevance of statistics on the remote server, but still. -- regards, Andrey Lepikhov Postgres Professional >From 329954981959ee3fc97e52266c89a436d02ddf5e Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Tue, 4 Aug 2020 09:29:37 +0500 Subject: [PATCH 1/2] TAP-Test on bad statistics. Add dummy postgres_fdw_stat() routine. It will return stat tuples for each input attribute.
--- contrib/postgres_fdw/Makefile | 4 ++-- contrib/postgres_fdw/expected/foreign_stat.out | 18 ++ .../postgres_fdw/postgres_fdw--1.0--1.1.sql| 11 +++ contrib/postgres_fdw/postgres_fdw.c| 17 + contrib/postgres_fdw/postgres_fdw.control | 2 +- contrib/postgres_fdw/sql/foreign_stat.sql | 10 ++ 6 files changed, 59 insertions(+), 3 deletions(-) create mode 100644 contrib/postgres_fdw/expected/foreign_stat.out create mode 100644 contrib/postgres_fdw/postgres_fdw--1.0--1.1.sql create mode 100644 contrib/postgres_fdw/sql/foreign_stat.sql diff --git a/contrib/postgres_fdw/Makefile b/contrib/postgres_fdw/Makefile index ee8a80a392..15a6f6c353 100644 --- a/contrib/postgres_fdw/Makefile +++ b/contrib/postgres_fdw/Makefile @@ -14,9 +14,9 @@ PG_CPPFLAGS = -I$(libpq_srcdir) SHLIB_LINK_INTERNAL = $(libpq) EXTENSION = postgres_fdw -DATA = postgres_fdw--1.0.sql +DATA = postgres_fdw--1.0.sql postgres_fdw--1.0--1.1.sql -REGRESS = postgres_fdw +REGRESS = postgres_fdw foreign_stat ifdef USE_PGXS PG_CONFIG = pg_config diff --git a/contrib/postgres_fdw/expected/foreign_stat.out b/contrib/postgres_fdw/expected/foreign_stat.out new file mode 100644 index 00..28a470bccc --- /dev/null +++ b/contrib/postgres_fdw/expected/foreign_stat.out @@ -0,0 +1,18 @@ +CREATE TABLE ltable (a int, b real); +CREATE FOREIGN TABLE ftable (a int) server loopback options (table_name 'ltable'); +INSERT INTO ltable (a, b) (SELECT *, 1.01 FROM generate_series(1, 1E4)); +VACUUM; +-- Check statistic interface routine +SELECT * FROM postgres_fdw_stat('public', 'test', 'a'); + postgres_fdw_stat +--- + +(1 row) + +EXPLAIN (TIMING OFF, SUMMARY O
Re: Ideas about a better API for postgres_fdw remote estimates
On 8/29/20 9:22 PM, Stephen Frost wrote: I implemented some FDW + pg core machinery to reduce the weight of the problem. The ANALYZE command on a foreign table executes a query on the foreign server that extracts the statistics tuple, serializes it into a json-formatted string and returns it to the caller. The caller deserializes this string, generates statistics for the foreign table and updates it. The second patch is a proof-of-concept. Isn't this going to create a version dependency that we'll need to deal with..? What if a newer major version has some kind of improved ANALYZE command, in terms of what it looks at or stores, and it's talking to an older server? When I was considering the issue with ANALYZE and FDWs, I had been thinking it'd make sense to just change the query that's built in deparseAnalyzeSql() to have a TABLESAMPLE clause, but otherwise run in more-or-less the same manner as today. If we don't like the available TABLESAMPLE methods then we could add a new one that's explicitly the 'right' sample for an ANALYZE call and use that when it's available on the remote side. Not sure if it'd make any sense for ANALYZE itself to start using that same TABLESAMPLE code, but maybe? Not that I think it'd be much of an issue if it's independent either, with appropriate comments to note that we should probably try to make them match up for the sake of FDWs. This approach does not contradict your idea. It is a lightweight opportunity to reduce the cost of ANALYZE when we have a set of servers running up-to-date versions of the system catalog and the FDW. This patch speeds up the ANALYZE command and keeps statistics on a foreign table relevant after an autovacuum operation. Its effectiveness depends on the relevance of statistics on the remote server, but still. If we do decide to go down this route, wouldn't it mean we'd have to solve the problem of what to do when it's a 9.6 foreign server being queried from a v12 server and dealing with any difference in the statistics structures of the two?
Seems like we would... in which case I would say that we should pull that bit out and make it general, and use it for pg_upgrade too, which would benefit a great deal from having the ability to upgrade stats between major versions also. That's a much bigger piece to take on, of course, but seems to be what's implied with this approach for the FDW. Thank you for this use case. We can add a "version" field to the statistics string (btree uses versioning too). As you can see, in this patch we are only trying to get statistics. If for some reason this does not work out, then we take the difficult route. Moreover, I believe this strategy should only apply when we analyze a relation implicitly. If the user executes analysis explicitly by the command "ANALYZE ", we need to perform a fair analysis of the table. -- regards, Andrey Lepikhov Postgres Professional
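The versioning idea above — a "version" field in the serialized statistics string, with a fallback to the "difficult route" (a regular sampling ANALYZE) on mismatch — reduces to a small protocol. This is a hedged Python sketch of that flow, not the patch's actual code; the payload schema and function names are invented for illustration.

```python
import json

def make_stats_payload(columns, version=1):
    """Remote side: serialize per-column statistics with an explicit
    format version, so incompatible majors can be detected."""
    return json.dumps({"version": version, "columns": columns})

def try_import_remote_stats(payload, supported_version=1):
    """Local side: accept the remotely-extracted statistics only if the
    format version matches; otherwise return None so the caller falls
    back to an ordinary row-sampling ANALYZE."""
    doc = json.loads(payload)
    if doc.get("version") != supported_version:
        return None  # incompatible format: take the difficult route
    return doc["columns"]
```

A v12 server talking to a 9.6 remote would simply get `None` here and analyze the foreign table the ordinary way.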
Re: Ideas about a better API for postgres_fdw remote estimates
On 8/31/20 6:19 PM, Ashutosh Bapat wrote: On Mon, Aug 31, 2020 at 3:36 PM Andrey V. Lepikhov wrote: I agree that this task needs an API for serialization/deserialization of statistics: pg_load_relation_statistics(json_string text); pg_get_relation_statistics(relname text); We can use a version number for resolving conflicts between different statistics implementations. The "Load" function will validate the values[] anyarray while deserializing the input json string to the datatype of the relation column. This is a valuable feature. Analysing a foreign table by fetching rows from the foreign server isn't very efficient. In fact the current FDW API for doing that forces that inefficiency by requiring the FDW to return a sample of rows that will be analysed by the core. That's why I see that your patch introduces a new API to get foreign rel stats. I don't think there's any point in maintaining these two APIs just for analysing a table. Instead we should have only one FDW API which will do whatever it wants and return statistics that can be understood by the core, and let the core install it in the catalogs. I believe that's doable. I think the same. In case of PostgreSQL it could get the stats available as is from the foreign server, convert it into a form that the core understands and return it. The patch introduces a new function postgres_fdw_stat() which will be available only from version 14 onwards. Can we use row_to_json(), which is available in all the supported versions, instead? I started from here. But we need to convert the starelid, staop[] and stacoll[] OIDs into a portable format. Also we need to explicitly specify the type of each values[] array. And no one guarantees that the anyarray values[] can't contain an array of complex-type values, containing OIDs, that can't be correctly converted to database objects on another server... These considerations required me to add a new postgres_fdw_stat() routine that can be moved into the core.
In case of some other foreign server, an FDW will be responsible for returning statistics in a form that the core will understand. It may fetch rows from the foreign server, or be a bit smarter and fetch the statistics and convert them. I don't think I fully understood your idea. Please explain in more detail if possible. This also means that FDWs will have to deal with the statistics format that the core understands and thus will need changes in their code with every version in the worst case. But AFAIR, PostgreSQL supports different forms of statistics, so the problem may not remain that severe if FDWs and the core agree on some bare minimum format that the core supports for long. I don't think the FDW needs to know anything about the internals of statistics. It only needs to execute a query like "SELECT extract_statistics(namespace.relation);" and apply the text representation with a function call like store_statistics(const char *stat); All validation and pg_statistic update operations will be performed in the core. I think the patch has some other problems, like it works only for regular tables on the foreign server, but a foreign table can be pointing to any relation, like a materialized view, partitioned table or a foreign table on the foreign server, all of which have statistics associated with them. Ok. It was implemented for discussion and testing, and as a base for development. I didn't look closely but it does not consider that the foreign table may not have all the columns from the relation on the foreign server or may have different names. Here we get the full statistics from the remote server and extract statistics only for the columns included in the tuple descriptor of the foreign table. But I think those problems are kind of secondary. We have to agree on the design first. +1. I only want to point out the following. In previous threads statistics were converted row by row. I suggest serializing all statistics tuples for the relation into a single json string.
At the apply phase we can filter out unneeded attributes. -- regards, Andrey Lepikhov Postgres Professional
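The filtering step described above — serialize the relation's full statistics on the remote side, then at apply time keep only the columns that exist in the local foreign table's tuple descriptor — can be sketched like this. A hedged illustration in Python; the JSON layout and function name are assumptions, not the patch's format.

```python
import json

def filter_stats_for_foreign_table(payload, local_columns):
    """Apply phase: keep only statistics for columns present in the local
    foreign table's tuple descriptor. Columns that exist only on the
    remote relation (or have different names locally) are dropped."""
    remote_columns = json.loads(payload)["columns"]
    return {name: stat
            for name, stat in remote_columns.items()
            if name in local_columns}
```

This is why serializing all statistics tuples into a single string is workable even when the foreign table declares only a subset of the remote columns.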
Re: Make query ID more portable
On 12/10/21 13:35, Julien Rouhaud wrote: Hi, On Tue, Oct 12, 2021 at 4:12 PM Andrey V. Lepikhov wrote: See the patch in attachment as a POC. The main idea here is to break JumbleState down into a 'clocations' part, which is what a post parse hook is really interested in, and a 'context data' part, which is needed to build a query or subquery signature (hash) and, I guess, isn't really needed by any extensions. There have been quite a lot of threads about that in the past, and almost every time people wanted to change how the hash was computed. So it seems to me that extensions would actually be quite interested in that. This is even more the case now that an extension can be used to replace the queryid calculation only and keep the rest of the extension relying on it as is. Yes, I know. I have been using such a self-made queryID for four years, and I will use it further. But the core jumbling code is good, fast and much easier to maintain. The purpose of this work is to extend jumbling for more flexible use, to avoid meaningless copying of this code into an extension. I think it doesn't add much complexity or overhead. I think the biggest change in your patch is: case RTE_RELATION: - APP_JUMB(rte->relid); - JumbleExpr(jstate, (Node *) rte->tablesample); + { + char *relname = regclassout_ext(rte->relid, true); + + APP_JUMB_STRING(relname); + JumbleExpr(jstate, (Node *) rte->tablesample, ctx); APP_JUMB(rte->inh); break; Have you done any benchmarks on an OLTP workload? Adding catalog access there is likely to add significant overhead. Yes, I should do benchmarking. But I guess the main goal of the query ID is monitoring, which can be switched off if necessary. This part was made for a demo. It can be replaced by a hook, for example. Also, why use only the fully qualified relation name for stable hashes? At least operators and functions should also be treated the same way. If you do that you will probably have way too much overhead to be usable in a busy production environment.
Why not use the new possibility of a 3rd-party extension for the queryid calculation that exactly suits your need? I fully agree with these arguments. This code is a POC. The main part here is breaking down JumbleState, using a local context for subqueries, and sorting the range table entry hashes. I think we can call one routine (APP_JUMB_OBJECT(), as an example) for all OIDs in this code. It would allow an extension to intercept this call and replace the OID with an arbitrary value. -- regards, Andrey Lepikhov Postgres Professional
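The interception idea above — one routine for all OIDs, so an extension can substitute an arbitrary (portable) value — can be modeled with a resolver callback. This is a toy Python analogue of the jumbling, not the patch's C code: hashing the resolver's output instead of the raw OID is what makes the ID survive OID differences between instances.

```python
import hashlib

def jumble_query(rte_entries, resolve=lambda oid, qualified_name: qualified_name):
    """Toy jumble over range-table entries, each given as (oid, name).
    Every object reference goes through a resolver callback, mimicking
    the proposed APP_JUMB_OBJECT() hook. The default resolver uses the
    fully qualified name, so two instances with different OIDs for the
    same objects produce the same ID."""
    h = hashlib.sha256()
    for oid, name in rte_entries:
        h.update(str(resolve(oid, name)).encode())
    return h.hexdigest()
```

An instance-local resolver (`lambda oid, name: oid`) reproduces today's behavior: fast, rename-tolerant, but unstable across instances.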
Re: Make query ID more portable
On 12/10/21 18:45, Bruce Momjian wrote: On Tue, Oct 12, 2021 at 09:40:47AM -0400, Tom Lane wrote: Andrey Lepikhov writes: But the core jumbling code is good, fast and much easier to maintain. Also, the current code handles renames of schemas and objects, but this would not. Yes, it is a good option if an extension works only in the context of one node. But my efforts are directed at cross-instance usage of monitoring data; as an example, it may be useful for sharding. Also, I guess it is natural for a user to expect that renaming an object also changes the queries and resets the monitoring data related to that object. -- regards, Andrey Lepikhov Postgres Professional
Re: Make query ID more portable
On 12/10/21 18:40, Tom Lane wrote: Andrey Lepikhov writes: But the core jumbling code is good, fast and much easier to maintain. A bigger issue is that query ID stability isn't something we are going to promise on a large scale --- for example, what if a new release adds some new fields to struct Query? So I'm not sure that "query IDs should survive dump/reload" is a useful goal to consider. It's certainly not something that could be reached by anything even remotely like the existing code. Thank you for the explanation. I think the problem of queryId is that it encapsulates two different meanings: 1. It allows an extension to match a query between the post parse and execution stages. In this sense, queryId should be as unique as possible for each query. 2. For pg_stat_statements purposes (and my project too) it represents a query class and should be stable against permutations of range table entries, clauses, etc. For example: "SELECT * FROM a,b;" and "SELECT * FROM b,a;" should have the same queryId. This issue may be solved in an extension with the following approach: 1. Force whatever unique queryId value the extension wants in a post parse hook. 2. Generalize the JumbleQuery routine code to generate a kind of query class signature. The attached patch is a first sketch of such a change. -- regards, Andrey Lepikhov Postgres Professional >From ef4033935b25c90fd8c5b6f10a0646a7ede385e3 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Wed, 13 Oct 2021 10:22:53 +0500 Subject: [PATCH] Add callback for jumbling of an OID value. Query jumbling machinery depends on OIDs of objects. It is fine for a specific instance, because it allows surviving object renames. But it is bad for the cross-instance case: even with the same code version and the same schema, you can't guarantee stability of OIDs.
So an extension can't rely on this technique. This patch changes the interface of the JumbleQuery routine to allow an extension to use it for generating a query signature with a specific algorithm for OID jumbling. --- src/backend/commands/explain.c | 6 +-- src/backend/parser/analyze.c | 12 ++--- src/backend/tcop/postgres.c | 6 +-- src/backend/utils/misc/queryjumble.c | 81 ++-- src/include/utils/queryjumble.h | 14 - 5 files changed, 67 insertions(+), 52 deletions(-) diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index 10644dfac4..b2f10e828e 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -166,7 +166,7 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt, { ExplainState *es = NewExplainState(); TupOutputState *tstate; - JumbleState *jstate = NULL; + JumbleState jstate; Query *query; List *rewritten; ListCell *lc; @@ -246,10 +246,10 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt, query = castNode(Query, stmt->query); if (IsQueryIdEnabled()) - jstate = JumbleQuery(query, pstate->p_sourcetext); + query->queryId = JumbleQuery(query, pstate->p_sourcetext, NULL, &jstate); if (post_parse_analyze_hook) - (*post_parse_analyze_hook) (pstate, query, jstate); + (*post_parse_analyze_hook) (pstate, query, &jstate); /* * Parse analysis was done already, but we still have to run the rule diff --git a/src/backend/parser/analyze.c b/src/backend/parser/analyze.c index 146ee8dd1e..f9c3991a94 100644 --- a/src/backend/parser/analyze.c +++ b/src/backend/parser/analyze.c @@ -113,7 +113,7 @@ parse_analyze(RawStmt *parseTree, const char *sourceText, { ParseState *pstate = make_parsestate(NULL); Query *query; - JumbleState *jstate = NULL; + JumbleState jstate; Assert(sourceText != NULL); /* required as of 8.4 */ @@ -127,10 +127,10 @@ parse_analyze(RawStmt *parseTree, const char *sourceText, query = transformTopLevelStmt(pstate, parseTree); if (IsQueryIdEnabled()) - jstate = JumbleQuery(query, sourceText); +
query->queryId = JumbleQuery(query, sourceText, NULL, &jstate); if (post_parse_analyze_hook) - (*post_parse_analyze_hook) (pstate, query, jstate); + (*post_parse_analyze_hook) (pstate, query, &jstate); free_parsestate(pstate); @@ -152,7 +152,7 @@ parse_analyze_varparams(RawStmt *parseTree, const char *sourceText, { ParseState *pstate = make_parsestate(NULL); Query *query; - JumbleState *jstate = NULL; + JumbleState jstate; Assert(sourceText != NULL); /* required as of 8.4 */ @@ -166,10 +166,10 @@ parse_analyze_varparams(RawStmt *parseTree, const char *sourceText, check_variable_parameters(pstate, query); if (IsQueryIdEnabled()) -
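The "query class" meaning of queryId described above — "SELECT * FROM a,b;" and "SELECT * FROM b,a;" hashing identically — can be achieved by hashing each range table entry separately and combining the per-entry hashes in sorted order, which is the role of the sorted RTE hashes mentioned earlier in the thread. A toy Python sketch of that idea, not the patch's implementation:

```python
import hashlib

def _rte_hash(name):
    """Hash one range table entry independently of its position."""
    return hashlib.sha256(name.encode()).hexdigest()

def query_class_signature(rte_names):
    """Combine per-RTE hashes in sorted order, so the signature is stable
    under permutations of the FROM list while still distinguishing
    different sets of relations."""
    h = hashlib.sha256()
    for part in sorted(_rte_hash(n) for n in rte_names):
        h.update(part.encode())
    return h.hexdigest()
```

The uniqueness-oriented meaning (matching a specific query between the post parse and execution stages) would instead keep positional information, which is why the two meanings pull in opposite directions.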
Re: Make query ID more portable
On 14/10/21 10:40, Julien Rouhaud wrote: On Thu, Oct 14, 2021 at 12:37 PM Andrey Lepikhov wrote: On 12/10/21 18:45, Bruce Momjian wrote: On Tue, Oct 12, 2021 at 09:40:47AM -0400, Tom Lane wrote: Andrey Lepikhov writes: I think that there are just too many arbitrary decisions that could be made on what exactly should be a query identifier to have a single in-core implementation. Yes, and I use such a custom approach too. But the core jumbling code implements a good idea and can be generalized for reuse. The patch from the previous letter, together with breaking down JumbleState, would allow developers to build their code on the queryjumble.c module with smaller changes. If you do sharding, you already have to properly configure each node, so configuring your custom query id extension shouldn't be a big problem. My project is about adaptive query optimization techniques. It is not obvious how to match the post-parse and execution phases (without a field in the Query struct) because of nested queries. Also, if we use queryId in an extension, we interfere with pg_stat_statements. -- regards, Andrey Lepikhov Postgres Professional
Re: Suggestions on message transfer among backends
On 11/03/2019 18:36, Andy Fan wrote: Hi: I need some functionality which requires message exchange among different backends (connections). Specifically, I need a shared hash map and a message queue. Message queue: it should be many writers, one reader. It looks like a POSIX message queue should be OK, but Postgres doesn't use it. Is there any equivalent in PG? Shared hash map: the number of items can be fixed and the value size can be fixed as well. Any keywords or explanation will be extremely helpful. You may use shm_mq (shared memory queue) and hash tables (dynahash.c) in shared memory (see ShmemInitHash() + shmem_startup_hook) Thanks -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
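The suggested combination — a many-writer/one-reader queue (shm_mq) plus a hash table (dynahash) — can be modeled process-locally with Python threads. This is only an analogy for the shape of the data flow; the real mechanisms live in PostgreSQL shared memory and are set up via ShmemInitHash() and shmem_startup_hook, as noted above.

```python
import queue
import threading

def run_demo(n_writers=4, msgs_per_writer=10):
    """Many writer threads push into one queue; a single reader drains it
    and maintains a hash map keyed by writer id (the dynahash analogue)."""
    q = queue.Queue()
    counts = {}

    def writer(wid):
        for _ in range(msgs_per_writer):
            q.put(wid)
        q.put(None)  # per-writer end-of-stream marker

    threads = [threading.Thread(target=writer, args=(w,))
               for w in range(n_writers)]
    for t in threads:
        t.start()

    finished = 0
    while finished < n_writers:
        item = q.get()
        if item is None:
            finished += 1
        else:
            counts[item] = counts.get(item, 0) + 1

    for t in threads:
        t.join()
    return counts
```

In PostgreSQL the queue capacity, message framing, and blocking behavior all come from shm_mq rather than a library queue, but the single-reader/multi-writer discipline is the same.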
Re: Reduce amount of WAL generated by CREATE INDEX for gist, gin and sp-gist
On 25/03/2019 15:21, Heikki Linnakangas wrote: On 25/03/2019 09:57, David Steele wrote: On 2/6/19 2:08 PM, Andrey Lepikhov wrote: The patchset had a problem with all-zero pages that appeared at the index build stage: the generic_log_relation() routine sends all pages into the WAL, so the lsn field of an all-zero page was initialized and the PageIsVerified() routine detected it as a bad page. The solution may be: 1. To improve the index build algorithms and eliminate the possibility of unused pages appearing. 2. To mark each page as 'dirty' right after initialization; in this case we will get an 'empty' page instead of an all-zero one. 3. To not write all-zero pages into the WAL. Hmm. When do we create all-zero pages during index build? That seems pretty surprising. GiST uses buffered pages. During a GiST build it is possible (very rarely) that no index tuple is written to a page before a new block is allocated, and the page remains an all-zero page. You can't hit this problem in the current GiST code, because it writes only changed pages into the WAL. But the idea of the patch is to traverse the blocks of the index relation one by one after the index build finishes and write these blocks to the WAL. In this case we will see all-zero pages and need to check for them. On 04.02.2019 10:04, Michael Paquier wrote: On Tue, Dec 18, 2018 at 10:41:48AM +0500, Andrey Lepikhov wrote: Ok. It is used only for demonstration. The latest patch set needs a rebase, so moved to next CF, waiting on author as this got no reviews. The patch no longer applies so marked Waiting on Author. Alexander, Heikki, are either of you planning to review the patch in this CF? I had another quick look. I still think using the "generic xlog AM" for this is a wrong level of abstraction, and we should use the XLOG_FPI records for this directly. We can extend XLOG_FPI so that it can store multiple pages in a single record, if it doesn't already handle it.
Another counter-point to using the generic xlog record is that you're currently doing two unnecessary memcpy's of all pages in the index, in GenericXLogRegisterBuffer() and GenericXLogFinish(). That's not free. I guess the generic_log_relation() function can stay where it is, but it should use XLogRegisterBuffer() and XLogInsert() directly. Ok. This patch has waited for feedback for a long time. I will check the GiST code changes from the previous review and try to follow your advice. - Heikki -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
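The loop being discussed — group non-empty pages into WAL records of at most MAX_GENERIC_XLOG_PAGES, skipping all-zero pages entirely — reduces to a simple batching scheme. A hedged Python model of that control flow (the constant's real value lives in generic_xlog.h; `None` stands in for an all-zero page):

```python
MAX_GENERIC_XLOG_PAGES = 4  # illustrative; the real constant is defined in generic_xlog.h

def batch_pages(pages):
    """Group non-empty pages into WAL-record-sized batches. All-zero
    pages (modeled as None) are skipped entirely, mirroring the fix for
    pages that SP-GiST allocated but never wrote to. Any partial batch
    left at the end of the relation is flushed as a final record."""
    records, current = [], []
    for page in pages:
        if page is None:        # all-zero page: do not log
            continue
        current.append(page)
        if len(current) == MAX_GENERIC_XLOG_PAGES:
            records.append(current)
            current = []
    if current:
        records.append(current)
    return records
```

Whether each batch becomes a generic xlog record or (per Heikki's suggestion) a multi-page XLOG_FPI record changes the WAL format, not this grouping logic.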
Re: Reduce amount of WAL generated by CREATE INDEX for gist, gin and sp-gist
On 26/03/2019 15:59, Heikki Linnakangas wrote: On 26/03/2019 11:29, Andrey Lepikhov wrote: On 25/03/2019 15:21, Heikki Linnakangas wrote: Hmm. When do we create all-zero pages during index build? That seems pretty surprising. GiST uses buffered pages. During a GiST build it is possible (very rarely) that no index tuple is written to a page before a new block is allocated, and the page remains an all-zero page. You can't hit this problem in the current GiST code, because it writes only changed pages into the WAL. Looking at the code, I don't see how that could happen. The only place where the GiST index file is extended is in gistNewBuffer(), and all callers of that initialize the page immediately after the call. What am I missing? Sorry, this issue was found in the SP-GiST AM. You can reproduce it: 1. Apply the v2 version of the patch set (see attachment). 2. In the generic_log_relation() routine, log when PageIsNew(buf). 3. Run the script t1.sql (in attachment). This problem can be resolved by calling MarkBufferDirty() after SpGistInitBuffer() in the allocNewBuffer() routine, but in this case we would write more pages to the WAL than necessary. To avoid that, in the patch '0001-Relation-into-WAL-function' I do not write new pages to the WAL. The attached patch set is not the final version; it is needed only to demonstrate the 'all-zero pages' issue. The suggestion to use XLOG_FPI records directly will be considered in v3. -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company t1.sql Description: application/sql >From d3093aa9a7628979b892d31449eda6228ef169ce Mon Sep 17 00:00:00 2001 From: "Andrey V.
Lepikhov" Date: Mon, 1 Apr 2019 08:33:46 +0500 Subject: [PATCH 1/4] Relation-into-WAL-function --- src/backend/access/transam/generic_xlog.c | 48 +++ src/include/access/generic_xlog.h | 3 ++ 2 files changed, 51 insertions(+) diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c index 5b00b7275b..c22e361747 100644 --- a/src/backend/access/transam/generic_xlog.c +++ b/src/backend/access/transam/generic_xlog.c @@ -542,3 +542,51 @@ generic_mask(char *page, BlockNumber blkno) mask_unused_space(page); } + +/* + * Function to write generic xlog for every existing block of a relation. + * Caller is responsible for locking the relation exclusively. + */ +void +generic_log_relation(Relation rel) +{ + BlockNumber blkno; + BlockNumber nblocks; + int npbuf = 0; + GenericXLogState *state = NULL; + Bufferbufpack[MAX_GENERIC_XLOG_PAGES]; + + CHECK_FOR_INTERRUPTS(); + nblocks = RelationGetNumberOfBlocks(rel); + + /* + * Iterate over all index pages and WAL-logging it. Pages are grouping into + * the packages before adding to a WAL-record. Zero-pages are + * not logged. 
+ */ + for (blkno = 0; blkno < nblocks; blkno++) + { + Buffer buf; + + buf = ReadBuffer(rel, blkno); + if (!PageIsNew(BufferGetPage(buf))) + { + if (npbuf == 0) +state = GenericXLogStart(rel); + + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); + GenericXLogRegisterBuffer(state, buf, GENERIC_XLOG_FULL_IMAGE); + bufpack[npbuf++] = buf; + } + else + ReleaseBuffer(buf); + + if ((npbuf == MAX_GENERIC_XLOG_PAGES) || (blkno == nblocks-1)) + { + GenericXLogFinish(state); + + for (; npbuf > 0; npbuf--) +UnlockReleaseBuffer(bufpack[npbuf-1]); + } + } +} diff --git a/src/include/access/generic_xlog.h b/src/include/access/generic_xlog.h index cb5b5b713a..e3bbf014cc 100644 --- a/src/include/access/generic_xlog.h +++ b/src/include/access/generic_xlog.h @@ -42,4 +42,7 @@ extern const char *generic_identify(uint8 info); extern void generic_desc(StringInfo buf, XLogReaderState *record); extern void generic_mask(char *pagedata, BlockNumber blkno); +/* other utils */ +extern void generic_log_relation(Relation rel); + #endif /* GENERIC_XLOG_H */ -- 2.17.1 >From 9a0172346c8a942b6a493aca8c47452256a2932f Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Mon, 1 Apr 2019 08:37:32 +0500 Subject: [PATCH 2/4] GIN-Optimal-WAL-Usage --- src/backend/access/gin/ginbtree.c | 6 ++--- src/backend/access/gin/gindatapage.c | 9 src/backend/access/gin/ginentrypage.c | 2 +- src/backend/access/gin/gininsert.c| 30 ++-- src/backend/access/gin/ginutil.c | 4 ++-- src/backend/access/gin/ginvacuum.c| 2 +- src/backend/access/gin/ginxlog.c | 33 --- src/backend/access/rmgrdesc/gindesc.c | 6 - src/include/access/gin.h | 3 ++- src/include/access/ginxlog.h | 2 -- 10 files changed, 26 insertions(+), 71 deletions(-) diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c index 533949e46a..9f82eef8c3 100644 --- a/src/backend/access/gin/ginbtree.c +++ b/src/backend/access/gin/ginbtree.c @@ -396,7 +396,7 @@ ginP
Re: Removing unneeded self joins
On 02/08/2019 04:54, Thomas Munro wrote: On Thu, Jun 27, 2019 at 6:42 PM Andrey Lepikhov wrote: Version v17 of the patch, which fixes the bug, is attached. While moving this to the September CF, I noticed that it needs to be updated for the recent pg_list.h API changes. The patch was updated: 1. Changes caused by the pg_list.h API changes. 2. Fixed the problem of joint changes of clause_relids and required_relids [1]. 3. Added eclass mentions of the removed relation to the kept relation (the eclass_indexes field was introduced by commit 3373c71553). [1] https://www.postgresql.org/message-id/flat/5c21029d-81a2-c999-6744-6a898fcc9a19%40postgrespro.ru -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company >From 72ac38de7fb55fe741677ea75b280eecb443e978 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Fri, 2 Aug 2019 11:01:47 +0500 Subject: [PATCH] Remove self joins v18 --- src/backend/nodes/outfuncs.c | 19 +- src/backend/optimizer/path/indxpath.c | 28 +- src/backend/optimizer/path/joinpath.c |6 +- src/backend/optimizer/plan/analyzejoins.c | 1021 +++-- src/backend/optimizer/plan/planmain.c |7 +- src/backend/optimizer/util/pathnode.c |3 +- src/backend/optimizer/util/relnode.c | 26 +- src/include/nodes/nodes.h |1 + src/include/nodes/pathnodes.h | 22 +- src/include/optimizer/pathnode.h |4 + src/include/optimizer/paths.h |3 +- src/include/optimizer/planmain.h |7 +- src/test/regress/expected/equivclass.out | 32 + src/test/regress/expected/join.out| 256 +- src/test/regress/sql/equivclass.sql | 17 + src/test/regress/sql/join.sql | 127 +++ 16 files changed, 1480 insertions(+), 99 deletions(-) diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c index 86c31a48c9..9d511d1d1b 100644 --- a/src/backend/nodes/outfuncs.c +++ b/src/backend/nodes/outfuncs.c @@ -2270,7 +2270,8 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node) WRITE_OID_FIELD(userid); WRITE_BOOL_FIELD(useridiscurrent); /* we don't try to print fdwroutine or
fdw_private */ - /* can't print unique_for_rels/non_unique_for_rels; BMSes aren't Nodes */ + WRITE_NODE_FIELD(unique_for_rels); + /* can't print non_unique_for_rels; BMSes aren't Nodes */ WRITE_NODE_FIELD(baserestrictinfo); WRITE_UINT_FIELD(baserestrict_min_security); WRITE_NODE_FIELD(joininfo); @@ -2342,6 +2343,19 @@ _outStatisticExtInfo(StringInfo str, const StatisticExtInfo *node) WRITE_BITMAPSET_FIELD(keys); } +static void +_outUniqueRelInfo(StringInfo str, const UniqueRelInfo *node) +{ + WRITE_NODE_TYPE("UNIQUERELINFO"); + + WRITE_BITMAPSET_FIELD(outerrelids); + if (node->index) + { + WRITE_OID_FIELD(index->indexoid); + WRITE_NODE_FIELD(column_values); + } +} + static void _outEquivalenceClass(StringInfo str, const EquivalenceClass *node) { @@ -4108,6 +4122,9 @@ outNode(StringInfo str, const void *obj) case T_StatisticExtInfo: _outStatisticExtInfo(str, obj); break; + case T_UniqueRelInfo: +_outUniqueRelInfo(str, obj); +break; case T_ExtensibleNode: _outExtensibleNode(str, obj); break; diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c index 5f339fdfde..b57be54df6 100644 --- a/src/backend/optimizer/path/indxpath.c +++ b/src/backend/optimizer/path/indxpath.c @@ -3568,7 +3568,8 @@ ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel, * relation_has_unique_index_for * Determine whether the relation provably has at most one row satisfying * a set of equality conditions, because the conditions constrain all - * columns of some unique index. + * columns of some unique index. If index_info is not null, it is set to + * point to a new UniqueRelInfo containing the index and conditions. * * The conditions can be represented in either or both of two ways: * 1. 
A list of RestrictInfo nodes, where the caller has already determined @@ -3589,12 +3590,16 @@ ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel, bool relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel, List *restrictlist, - List *exprlist, List *oprlist) + List *exprlist, List *oprlist, + UniqueRelInfo **unique_info) { ListCell *ic; Assert(list_length(exprlist) == list_length(oprlist)); + if (unique_info) + *unique_info = NULL; + /* Short-circuit if no indexes... */ if (rel->indexlist == NIL) return false; @@ -3645,6 +3650,7 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel, { IndexOptInfo *ind = (IndexOptInfo *) lfirst(ic); int c; + List *column_values = NIL; /* * If the index is not unique, or not immediately enforced, or if it's @@ -3693,6 +3699,9 @@ relation_ha
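The core test the patch threads through relation_has_unique_index_for(), namely that a self join is removable only when the equality quals constrain all columns of some unique index (so at most one inner row matches, and it is physically the same row as the outer one), can be sketched in plain Python. This is an illustrative toy, not PostgreSQL code; every name below is invented for the example:

```python
# Toy sketch of the self-join removability test discussed in this thread.
# rel_a / rel_b are aliases of base tables; join_quals is a list of
# (col_a, col_b) pairs taken from mergejoinable clauses a.col = b.col;
# unique_indexes is a list of column-name sets, one per unique index.

def self_join_is_removable(rel_a, rel_b, join_quals, unique_indexes):
    if rel_a != rel_b:
        return False  # not a join of a table to itself
    # Only quals equating the *same* column on both sides pin the inner
    # row to the outer row.
    equated = {ca for ca, cb in join_quals if ca == cb}
    # If some unique index is fully covered by the equated columns, at
    # most one inner row matches, and it is the same physical row.
    return any(set(idx) <= equated for idx in unique_indexes)

# t JOIN t AS t2 ON t.id = t2.id, with a unique index on (id): removable.
assert self_join_is_removable("t", "t", [("id", "id")], [{"id"}])
# Joining on a non-unique column is not removable.
assert not self_join_is_removable("t", "t", [("x", "x")], [{"id"}])
```

The real implementation works on RestrictInfo quals and equivalence classes rather than column names, but the shape of the decision is the same.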
Re: Removing unneeded self joins
v.21 in the attachment fixes a small bug: now we move all non-mergejoinable clauses from joininfo to baserestrictinfo, because the relation will not be joined. On 30/09/2019 13:29, Konstantin Knizhnik wrote: Slightly refactored version of the patch with more comments. -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company >From 4540345416f1d5ac71395773621d022d15b529f1 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Tue, 5 Nov 2019 17:57:23 +0500 Subject: [PATCH] Remove self joins v21 --- src/backend/nodes/outfuncs.c | 19 +- src/backend/optimizer/path/indxpath.c | 28 +- src/backend/optimizer/path/joinpath.c |6 +- src/backend/optimizer/plan/analyzejoins.c | 1155 - src/backend/optimizer/plan/planmain.c |7 +- src/backend/optimizer/util/pathnode.c |3 +- src/backend/optimizer/util/relnode.c | 26 +- src/backend/utils/mmgr/aset.c |1 - src/include/nodes/nodes.h |1 + src/include/nodes/pathnodes.h | 22 +- src/include/optimizer/pathnode.h |4 + src/include/optimizer/paths.h |3 +- src/include/optimizer/planmain.h |7 +- src/test/regress/expected/equivclass.out | 32 + src/test/regress/expected/join.out| 275 src/test/regress/expected/updatable_views.out | 15 +- src/test/regress/sql/equivclass.sql | 17 + src/test/regress/sql/join.sql | 139 ++ 18 files changed, 1653 insertions(+), 107 deletions(-) diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c index b0dcd02ff6..dc8cb9cec0 100644 --- a/src/backend/nodes/outfuncs.c +++ b/src/backend/nodes/outfuncs.c @@ -2274,7 +2274,8 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node) WRITE_OID_FIELD(userid); WRITE_BOOL_FIELD(useridiscurrent); /* we don't try to print fdwroutine or fdw_private */ - /* can't print unique_for_rels/non_unique_for_rels; BMSes aren't Nodes */ + WRITE_NODE_FIELD(unique_for_rels); + /* can't print non_unique_for_rels; BMSes aren't Nodes */ WRITE_NODE_FIELD(baserestrictinfo); WRITE_UINT_FIELD(baserestrict_min_security); WRITE_NODE_FIELD(joininfo); @@
-2346,6 +2347,19 @@ _outStatisticExtInfo(StringInfo str, const StatisticExtInfo *node) WRITE_BITMAPSET_FIELD(keys); } +static void +_outUniqueRelInfo(StringInfo str, const UniqueRelInfo *node) +{ + WRITE_NODE_TYPE("UNIQUERELINFO"); + + WRITE_BITMAPSET_FIELD(outerrelids); + if (node->index) + { + WRITE_OID_FIELD(index->indexoid); + WRITE_NODE_FIELD(column_values); + } +} + static void _outEquivalenceClass(StringInfo str, const EquivalenceClass *node) { @@ -4122,6 +4136,9 @@ outNode(StringInfo str, const void *obj) case T_StatisticExtInfo: _outStatisticExtInfo(str, obj); break; + case T_UniqueRelInfo: +_outUniqueRelInfo(str, obj); +break; case T_ExtensibleNode: _outExtensibleNode(str, obj); break; diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c index 37b257cd0e..3d0d03b887 100644 --- a/src/backend/optimizer/path/indxpath.c +++ b/src/backend/optimizer/path/indxpath.c @@ -3564,7 +3564,8 @@ ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel, * relation_has_unique_index_for * Determine whether the relation provably has at most one row satisfying * a set of equality conditions, because the conditions constrain all - * columns of some unique index. + * columns of some unique index. If index_info is not null, it is set to + * point to a new UniqueRelInfo containing the index and conditions. * * The conditions can be represented in either or both of two ways: * 1. A list of RestrictInfo nodes, where the caller has already determined @@ -3585,12 +3586,16 @@ ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel, bool relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel, List *restrictlist, - List *exprlist, List *oprlist) + List *exprlist, List *oprlist, + UniqueRelInfo **unique_info) { ListCell *ic; Assert(list_length(exprlist) == list_length(oprlist)); + if (unique_info) + *unique_info = NULL; + /* Short-circuit if no indexes... 
*/ if (rel->indexlist == NIL) return false; @@ -3641,6 +3646,7 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel, { IndexOptInfo *ind = (IndexOptInfo *) lfirst(ic); int c; + List *column_values = NIL; /* * If the index is not unique, or not immediately enforced, or if it's @@ -3689,6 +3695,9 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel, if (match_index_to_operand(rexpr, c, ind)) { matched = true; /* column is unique */ + column_values = lappend(column_values, rinfo->outer_is_left +
Re: pg_waldump and PREPARE
On 08/11/2019 05:53, Fujii Masao wrote: On Fri, Nov 8, 2019 at 9:41 AM Michael Paquier wrote: On Tue, Sep 03, 2019 at 10:00:08AM +0900, Fujii Masao wrote: Sorry for the long delay... Yes, I will update the patch if necessary. Fujii-san, are you planning to update this patch then? I have switched it as waiting on author. No because there has been nothing to update in the latest patch for now unless I'm missing something. So I'm just waiting for some new review comments against the latest patch to come :) Can I switch the status back to "Needs review"? Regards, One issue is that your patch provides little information. Investigation of WAL errors often requires information on xid, subxacts, and delete-on-abort/commit rels; more rarely, invalidation messages etc. -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Re: pg_waldump and PREPARE
On 08/11/2019 09:26, Kyotaro Horiguchi wrote: Hello. At Fri, 8 Nov 2019 08:23:41 +0500, Andrey Lepikhov wrote in Can I switch the status back to "Needs review"? Regards, One issue is that your patch provides little information. Investigation of WAL errors often requires information on xid, subxacts, and delete-on-abort/commit rels; more rarely, invalidation messages etc. Basically agreed, but it can be very large in certain cases, even if it is rare. Maybe this is the reason for introducing a “verbose” option? -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Re: pg_waldump and PREPARE
12.11.2019 12:41, Fujii Masao wrote: Ok, I changed the patch that way. Attached is the latest version of the patch. Regards, I did not see any problems in this version of the patch. The information displayed by pg_waldump for the PREPARE record is sufficient for use. -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Optimization of NestLoop join in the case of guaranteed empty inner subtree
During NestLoop execution we have a bad corner case: if the outer subtree contains tuples, the join node will rescan the inner subtree for every outer tuple even though the inner subtree does not return any tuples. To reproduce the problem see 'problem.sql' in the attachment; the output of explain analyze is in 'problem_explain.txt'. As you can see, the executor scans each of the 1e5 outer tuples despite the fact that the inner side can't return any tuples. Teodor Sigaev and I developed a patch to solve this problem. The result of explain analyze can be found in 'optimized_execution.txt'. -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company problem.sql Description: application/sql
 QUERY PLAN
--
 Limit (cost=0.29..1515.83 rows=1 width=80) (actual time=25.194..25.194 rows=0 loops=1)
   -> Nested Loop (cost=0.29..504674.12 rows=333 width=80) (actual time=25.193..25.193 rows=0 loops=1)
        Join Filter: (big.id = small.id)
        -> Index Scan Backward using big_pkey on big (cost=0.29..5152.29 rows=10 width=44) (actual time=0.005..12.933 rows=10 loops=1)
        -> Materialize (cost=0.00..22.66 rows=333 width=36) (actual time=0.000..0.000 rows=0 loops=10)
             -> Seq Scan on small (cost=0.00..21.00 rows=333 width=36) (actual time=0.145..0.145 rows=0 loops=1)
                  Filter: ((id)::numeric > '1000'::numeric)
                  Rows Removed by Filter: 1000
 Planning Time: 0.215 ms
 Execution Time: 25.214 ms
(10 rows)
>From 40ffbd2d9ee5954e5ae09f1e5a929ae2f2b17b8c Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Mon, 9 Dec 2019 18:25:04 +0500 Subject: [PATCH] Skip scan of outer subtree if inner of the NestedLoop node is guaranteed empty.
--- src/backend/executor/nodeMaterial.c | 11 +++ src/backend/executor/nodeNestloop.c | 5 + src/include/nodes/execnodes.h | 4 src/test/regress/expected/partition_prune.out | 8 4 files changed, 24 insertions(+), 4 deletions(-) diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c index cc93bbe45b..98771647f6 100644 --- a/src/backend/executor/nodeMaterial.c +++ b/src/backend/executor/nodeMaterial.c @@ -135,6 +135,8 @@ ExecMaterial(PlanState *pstate) if (TupIsNull(outerslot)) { node->eof_underlying = true; + if (tuplestorestate && tuplestore_tuple_count(tuplestorestate) == 0) +node->ss.ps.guaranteed_empty = true; return NULL; } @@ -348,6 +350,9 @@ ExecReScanMaterial(MaterialState *node) node->tuplestorestate = NULL; if (outerPlan->chgParam == NULL) ExecReScan(outerPlan); + else +/* At first look no cases can lead to this code. But still. */ +node->ss.ps.guaranteed_empty = false; node->eof_underlying = false; } else @@ -363,6 +368,12 @@ ExecReScanMaterial(MaterialState *node) */ if (outerPlan->chgParam == NULL) ExecReScan(outerPlan); + + /* + * The node suppress tuplestore and guaranteed_empty trick isn't work. 
+ */ + Assert(node->ss.ps.guaranteed_empty == false); + node->eof_underlying = false; } } diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c index fc6667ef82..ea26a7d9d1 100644 --- a/src/backend/executor/nodeNestloop.c +++ b/src/backend/executor/nodeNestloop.c @@ -164,6 +164,11 @@ ExecNestLoop(PlanState *pstate) { ENL1_printf("no inner tuple, need new outer tuple"); + if (innerPlan->guaranteed_empty && +(node->js.jointype == JOIN_INNER || + node->js.jointype == JOIN_SEMI)) +return NULL; + node->nl_NeedNewOuter = true; if (!node->nl_MatchedOuter && diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h index 692438d6df..c7f1a14526 100644 --- a/src/include/nodes/execnodes.h +++ b/src/include/nodes/execnodes.h @@ -1007,6 +1007,9 @@ typedef struct PlanState * ExecConditionalAssignProjectionInfo()). If no projection is necessary * ExecConditionalAssignProjectionInfo() defaults those fields to the scan * operations. + * + * If guaranteed_empty field is true, this means that no tuple will be + * returned by the node. */ const TupleTableSlotOps *scanops; const TupleTableSlotOps *outerops; @@ -1020,6 +1023,7 @@ typedef struct PlanState bool outeropsset; bool inneropsset; bool resultopsset; + bool guaranteed_empty; } PlanState; /* diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out index f9eeda60e6..04cfe0944e 100644 --- a/src/test/re
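The shortcut the patch implements can be sketched outside the executor. The following is an illustrative toy in plain Python, not the C patch: for an inner (or semi) join whose inner side is materialized and does not depend on the outer row, once the inner side is known to be empty the join cannot emit anything, so the remaining outer tuples need not be fetched. All names are invented for the example:

```python
# Toy nested-loop inner join with the "guaranteed empty inner" shortcut.
# rescan_inner() re-reads the (materialized) inner side from the start;
# outer_fetched counts outer tuples actually pulled, to show the effect.

def nested_loop_inner_join(outer, rescan_inner):
    results = []
    outer_fetched = 0
    inner_known_empty = None          # unknown until the first rescan
    for o in outer:
        outer_fetched += 1
        produced = False
        for i in rescan_inner():
            produced = True
            if o == i:                # the join qual
                results.append((o, i))
        if inner_known_empty is None:
            inner_known_empty = not produced
        if inner_known_empty:
            break   # inner is guaranteed empty: stop scanning the outer side
    return results, outer_fetched

# Empty inner: only one outer tuple is ever fetched, not 100000.
rows, fetched = nested_loop_inner_join(range(100000), lambda: iter(()))
assert rows == [] and fetched == 1
# Non-empty inner: the join behaves normally.
rows, fetched = nested_loop_inner_join([1, 2, 3], lambda: iter([2]))
assert rows == [(2, 2)] and fetched == 3
```

This mirrors the patch's idea of having Materialize report an empty tuplestore so that nodeNestloop.c can return early for JOIN_INNER and JOIN_SEMI.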
Re: Optimization of NestLoop join in the case of guaranteed empty inner subtree
On 12/11/19 8:49 PM, Tom Lane wrote: Andrey Lepikhov writes: During NestLoop execution we have bad corner case: if outer subtree contains tuples the join node will scan inner subtree even if it does not return any tuples. So the first question about corner-case optimizations like this is always "how much overhead does it add in the normal case where it fails to gain anything?". I see no performance numbers in your proposal. I thought it was trivial, but a quick study shows no measurable difference. I do not much like anything about the code, either: as written it's only helpful for an especially narrow corner case (so narrow that I wonder if it really ever helps at all: surely calling a nodeMaterial whose tuplestore is empty doesn't cost much). Scanning a large outer relation can be very costly. If you play with analytical queries you can find cases where a nested loop materializes zero tuples. At least one such case is finding data gaps. Also, this optimization already exists in the hash join logic. But that doesn't stop it from adding a bool to the generic PlanState struct, with global implications. What I'd expected from your text description is that nodeNestLoop would remember whether its inner child had returned zero rows the first time, and assume that subsequent executions could be skipped unless the inner child's parameters change. This is the note I was waiting for. I agree with you that adding a bool variable to PlanState is excessive. See another version of the optimization in the attachment. -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company >From a92617b82d922d5ebac7342bd8c212e0eb5b4553 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Mon, 9 Dec 2019 18:25:04 +0500 Subject: [PATCH] Skip scan of outer subtree if inner of the NestedLoop node is guaranteed empty.
--- src/backend/executor/nodeNestloop.c | 8 src/include/nodes/execnodes.h | 1 + src/test/regress/expected/partition_prune.out | 8 3 files changed, 13 insertions(+), 4 deletions(-) diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c index fc6667ef82..4a7da5406d 100644 --- a/src/backend/executor/nodeNestloop.c +++ b/src/backend/executor/nodeNestloop.c @@ -164,6 +164,11 @@ ExecNestLoop(PlanState *pstate) { ENL1_printf("no inner tuple, need new outer tuple"); + if (node->nl_InnerEmpty && list_length(nl->nestParams) == 0 && +(node->js.jointype == JOIN_INNER || + node->js.jointype == JOIN_SEMI)) +return NULL; + node->nl_NeedNewOuter = true; if (!node->nl_MatchedOuter && @@ -200,6 +205,8 @@ ExecNestLoop(PlanState *pstate) */ continue; } + else + node->nl_InnerEmpty = false; /* * at this point we have a new pair of inner and outer tuples so we @@ -327,6 +334,7 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags) { case JOIN_INNER: case JOIN_SEMI: + nlstate->nl_InnerEmpty = true; break; case JOIN_LEFT: case JOIN_ANTI: diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h index 692438d6df..8829433347 100644 --- a/src/include/nodes/execnodes.h +++ b/src/include/nodes/execnodes.h @@ -1847,6 +1847,7 @@ typedef struct NestLoopState JoinState js;/* its first field is NodeTag */ bool nl_NeedNewOuter; bool nl_MatchedOuter; + bool nl_InnerEmpty; TupleTableSlot *nl_NullInnerTupleSlot; } NestLoopState; diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out index f9eeda60e6..04cfe0944e 100644 --- a/src/test/regress/expected/partition_prune.out +++ b/src/test/regress/expected/partition_prune.out @@ -2455,9 +2455,9 @@ update ab_a1 set b = 3 from ab where ab.a = 1 and ab.a = ab_a1.a; Heap Blocks: exact=1 -> Bitmap Index Scan on ab_a1_b2_a_idx (actual rows=1 loops=1) Index Cond: (a = 1) - -> Bitmap Heap Scan on ab_a1_b3 ab_2 (actual rows=0 loops=1) + -> Bitmap Heap Scan on 
ab_a1_b3 ab_2 (never executed) Recheck Cond: (a = 1) - -> Bitmap Index Scan on ab_a1_b3_a_idx (actual rows=0 loops=1) + -> Bitmap Index Scan on ab_a1_b3_a_idx (never executed) Index Cond: (a = 1) -> Materialize (actual rows=0 loops=1) -> Bitmap Heap Scan on ab_a1_b1 ab_a1_1 (actual rows=0 loops=1) @@ -2496,9 +2496,9 @@ update ab_a1 set b = 3 from ab where ab.a = 1 and ab.a = ab_a1.a; Heap Blocks: exact=1 -> Bitmap Index Scan on ab_a1_b2_a_idx (actual rows=1 loops=1) Index Cond: (a = 1) - -
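A note on the `list_length(nl->nestParams) == 0` condition in the revised patch above: when the nested loop passes parameters to its inner side, an inner scan that was empty for one outer row may still produce rows for the next one, so "the inner was empty once" proves nothing. A tiny plain-Python illustration (the names are invented, not from the patch):

```python
# Why the nl_InnerEmpty shortcut is only safe when the inner side does not
# depend on the outer row: a parameterized inner scan can be empty for one
# outer value and non-empty for another.

inner_table = [5, 6, 7]

def parameterized_inner(outer_val):
    # behaves like an inner index scan driven by a join parameter
    return [v for v in inner_table if v == outer_val]

outer = [1, 6]
assert parameterized_inner(outer[0]) == []    # empty for the first outer row
assert parameterized_inner(outer[1]) == [6]   # ...but not for the second
```

Only an unparameterized inner side (empty nestParams) re-executes identically for every outer tuple, which is what makes the early exit valid.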
Re: NOT IN subquery optimization
At the top of the thread your co-author justified this work with "findings about the performance of PostgreSQL, MySQL, and Oracle on various subqueries:" https://antognini.ch/2017/12/how-well-a-query-optimizer-handles-subqueries/ I ran this test set with your "not_in ..." patch. Your optimization improves only the results of the D10-D13 queries. Nothing has changed for the bad plans of the E20-E27 and F20-F27 queries. For example, we can replace the E20 query:
SELECT * FROM large WHERE n IN (SELECT n FROM small WHERE small.u = large.u); - execution time: 1370 ms, by
SELECT * FROM large WHERE EXISTS (SELECT n,u FROM small WHERE (small.u = large.u) AND (large.n = small.n)) AND n IS NOT NULL; - execution time: 0.112 ms
the E21 query:
SELECT * FROM large WHERE n IN (SELECT nn FROM small WHERE small.u = large.u); - 1553 ms, by
SELECT * FROM large WHERE EXISTS (SELECT nn FROM small WHERE (small.u = large.u) AND (small.nn = large.n)); - 0.194 ms
and the F27 query:
SELECT * FROM large WHERE nn NOT IN (SELECT nn FROM small WHERE small.nu = large.u); - 1653.048 ms, by
SELECT * FROM large WHERE NOT EXISTS (SELECT nn,nu FROM small WHERE (small.nu = large.u) AND (small.nn = large.nn)); - 274.019 ms
Are you planning to make another patch for these cases? Also I tried to increase work_mem up to 2GB: maybe hashed subqueries can improve the situation? But hashing did not improve the execution time of the queries significantly. On your test cases (from the comments of the patch) subquery hashing has the same execution time for queries No.13-17. For queries No.1-12 it is not as slow as without hashing, but it still works more slowly (by up to three orders of magnitude) than the NOT IN optimization. On 12/2/19 9:25 PM, Li, Zheng wrote: Here is the latest rebased patch. -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
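The extra AND n IS NOT NULL in the E20 rewrite is not cosmetic: NOT IN and NOT EXISTS differ exactly when NULLs are involved, which is what makes NOT IN hard for a planner to convert to an anti-join. A toy simulation of SQL's three-valued logic in plain Python (None stands in for SQL NULL; this is not planner code):

```python
# SQL semantics of "x NOT IN (subquery)": FALSE on a match, UNKNOWN (None)
# if x is NULL or the subquery result contains a NULL, TRUE otherwise.
def sql_not_in(x, values):
    if not values:
        return True               # NOT IN over an empty set is TRUE
    if x is None:
        return None               # NULL compared to anything is unknown
    if x in [v for v in values if v is not None]:
        return False
    return None if None in values else True

# SQL semantics of NOT EXISTS(SELECT .. WHERE v = x): the WHERE clause keeps
# only rows where v = x is TRUE, so NULLs on either side never match.
def sql_not_exists(x, values):
    return not any(v is not None and x is not None and v == x for v in values)

# A NULL in the subquery filters the row under NOT IN but not under
# NOT EXISTS -- the two are only interchangeable with NULLs excluded.
assert sql_not_in(1, [2, None]) is None       # row dropped (not TRUE)
assert sql_not_exists(1, [2, None]) is True   # row kept
assert sql_not_in(3, [1, 2]) is True
```

This is why the transformed queries above must pin down NULL-ness (IS NOT NULL, or NOT NULL constraints) before NOT IN can be treated as NOT EXISTS.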
Re: NOT IN subquery optimization
On 1/7/20 12:34 AM, Li, Zheng wrote: Hi Andrey, Thanks for the comment! The unimproved cases you mentioned all fall into the category “correlated subquery”. This category is explicitly disallowed by existing code to convert to join in convert_ANY_sublink_to_join: /* * The sub-select must not refer to any Vars of the parent query. (Vars of * higher levels should be okay, though.) */ if (contain_vars_of_level((Node *) subselect, 1)) return NULL; I think this is also the reason why hashed subplan is not used for such subqueries. It's probably not always safe to convert a correlated subquery to join. We need to find out/prove when it’s safe/unsafe to convert such ANY subquery if we were to do so. Maybe this part of the code contains a logical error? You optimize only the special case of the "NOT IN" expression that is equivalent to NOT EXISTS, yet the convert_EXISTS_sublink_to_join() routine is allowed to contain vars of the parent query. Could you give a trivial example of this problem? -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Re: Removing unneeded self joins
Rebased version v.22. - Added the enable_self_join_removal GUC (true by default) - The joinquals of the relation being removed are redistributed in accordance with the remove_rel_from_query() machinery. -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company >From 2d194aab7f8e43805a61901de34b28fcc4e136b4 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Mon, 27 Jan 2020 14:51:20 +0500 Subject: [PATCH] Remove self joins v.22 --- src/backend/nodes/outfuncs.c | 19 +- src/backend/optimizer/path/indxpath.c | 28 +- src/backend/optimizer/path/joinpath.c |6 +- src/backend/optimizer/plan/analyzejoins.c | 1174 - src/backend/optimizer/plan/planmain.c |7 +- src/backend/optimizer/util/pathnode.c |3 +- src/backend/optimizer/util/relnode.c | 26 +- src/backend/utils/misc/guc.c | 11 + src/backend/utils/mmgr/aset.c |1 - src/include/nodes/nodes.h |1 + src/include/nodes/pathnodes.h | 22 +- src/include/optimizer/pathnode.h |4 + src/include/optimizer/paths.h |3 +- src/include/optimizer/planmain.h |7 +- src/test/regress/expected/equivclass.out | 32 + src/test/regress/expected/join.out| 275 src/test/regress/expected/updatable_views.out | 15 +- src/test/regress/sql/equivclass.sql | 17 + src/test/regress/sql/join.sql | 139 ++ 19 files changed, 1683 insertions(+), 107 deletions(-) diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c index d76fae44b8..7b2b2641cd 100644 --- a/src/backend/nodes/outfuncs.c +++ b/src/backend/nodes/outfuncs.c @@ -2278,7 +2278,8 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node) WRITE_OID_FIELD(userid); WRITE_BOOL_FIELD(useridiscurrent); /* we don't try to print fdwroutine or fdw_private */ - /* can't print unique_for_rels/non_unique_for_rels; BMSes aren't Nodes */ + WRITE_NODE_FIELD(unique_for_rels); + /* can't print non_unique_for_rels; BMSes aren't Nodes */ WRITE_NODE_FIELD(baserestrictinfo); WRITE_UINT_FIELD(baserestrict_min_security); WRITE_NODE_FIELD(joininfo); @@ -2350,6 +2351,19 @@
_outStatisticExtInfo(StringInfo str, const StatisticExtInfo *node) WRITE_BITMAPSET_FIELD(keys); } +static void +_outUniqueRelInfo(StringInfo str, const UniqueRelInfo *node) +{ + WRITE_NODE_TYPE("UNIQUERELINFO"); + + WRITE_BITMAPSET_FIELD(outerrelids); + if (node->index) + { + WRITE_OID_FIELD(index->indexoid); + WRITE_NODE_FIELD(column_values); + } +} + static void _outEquivalenceClass(StringInfo str, const EquivalenceClass *node) { @@ -4131,6 +4145,9 @@ outNode(StringInfo str, const void *obj) case T_StatisticExtInfo: _outStatisticExtInfo(str, obj); break; + case T_UniqueRelInfo: +_outUniqueRelInfo(str, obj); +break; case T_ExtensibleNode: _outExtensibleNode(str, obj); break; diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c index 2a50272da6..0c6e7eee74 100644 --- a/src/backend/optimizer/path/indxpath.c +++ b/src/backend/optimizer/path/indxpath.c @@ -3564,7 +3564,8 @@ ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel, * relation_has_unique_index_for * Determine whether the relation provably has at most one row satisfying * a set of equality conditions, because the conditions constrain all - * columns of some unique index. + * columns of some unique index. If index_info is not null, it is set to + * point to a new UniqueRelInfo containing the index and conditions. * * The conditions can be represented in either or both of two ways: * 1. A list of RestrictInfo nodes, where the caller has already determined @@ -3585,12 +3586,16 @@ ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel, bool relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel, List *restrictlist, - List *exprlist, List *oprlist) + List *exprlist, List *oprlist, + UniqueRelInfo **unique_info) { ListCell *ic; Assert(list_length(exprlist) == list_length(oprlist)); + if (unique_info) + *unique_info = NULL; + /* Short-circuit if no indexes... 
*/ if (rel->indexlist == NIL) return false; @@ -3641,6 +3646,7 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel, { IndexOptInfo *ind = (IndexOptInfo *) lfirst(ic); int c; + List *column_values = NIL; /* * If the index is not unique, or not immediately enforced, or if it's @@ -3689,6 +3695,9 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel, if (match_index_to_operand(rexpr, c, ind)) { matched = true; /* column is unique */ + column_values = lappend(column_values, rinfo->outer_is_left +
Re: Reduce amount of WAL generated by CREATE INDEX for gist, gin and sp-gist
On 25/03/2019 15:21, Heikki Linnakangas wrote: I had another quick look. I still think using the "generic xlog AM" for this is a wrong level of abstraction, and we should use the XLOG_FPI records for this directly. We can extend XLOG_FPI so that it can store multiple pages in a single record, if it doesn't already handle it. Another counter-point to using the generic xlog record is that you're currently doing two unnecessary memcpy's of all pages in the index, in GenericXLogRegisterBuffer() and GenericXLogFinish(). That's not free. I guess the generic_log_relation() function can stay where it is, but it should use XLogRegisterBuffer() and XLogInsert() directly. Patch set v.3 uses XLOG_FPI records directly. As a benchmark I use the script (test.sql in the attachment), which shows the WAL size increment during index build. In the table below you can see the influence of the patch on WAL growth.
Results
===
 AM      | master | patch
 GIN     | 347 MB | 66 MB
 GiST    | 157 MB | 43 MB
 SP-GiST | 119 MB | 38 MB
-- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company >From 2edd0ada13b4749487d0f046191ef2bcf8b11ca3 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Tue, 2 Apr 2019 09:42:59 +0500 Subject: [PATCH 1/4] Relation-into-WAL-function --- src/backend/access/transam/generic_xlog.c | 63 +++ src/include/access/generic_xlog.h | 3 ++ 2 files changed, 66 insertions(+) diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c index 5b00b7275b..0d5025f78a 100644 --- a/src/backend/access/transam/generic_xlog.c +++ b/src/backend/access/transam/generic_xlog.c @@ -16,6 +16,7 @@ #include "access/bufmask.h" #include "access/generic_xlog.h" #include "access/xlogutils.h" +#include "catalog/pg_control.h" #include "miscadmin.h" #include "utils/memutils.h" @@ -542,3 +543,65 @@ generic_mask(char *page, BlockNumber blkno) mask_unused_space(page); } + +/* + * Function for WAL-logging all pages of a relation.
+ * Caller is responsible for locking the relation exclusively. + */ +void +log_relation(Relation rel) +{ + BlockNumber blkno = 0; + BlockNumber nblocks; + Bufferbufpack[XLR_MAX_BLOCK_ID]; + + CHECK_FOR_INTERRUPTS(); + + /* + * Iterate over all index pages and WAL-logging it. Pages are grouping into + * the packages before adding to a WAL-record. Zero pages are not logged. + */ + nblocks = RelationGetNumberOfBlocks(rel); + while (blkno < nblocks) + { + XLogRecPtr recptr; + int8 nbufs = 0; + int8 i; + + /* + * Assemble package of relation blocks. Try to combine the maximum + * possible number of blocks in one record. + */ + while (nbufs < XLR_MAX_BLOCK_ID && blkno < nblocks) + { + Buffer buf = ReadBuffer(rel, blkno); + + if (!PageIsNew(BufferGetPage(buf))) +bufpack[nbufs++] = buf; + else +ReleaseBuffer(buf); + blkno++; + } + + XLogBeginInsert(); + XLogEnsureRecordSpace(nbufs, 0); + + START_CRIT_SECTION(); + for (i = 0; i < nbufs; i++) + { + LockBuffer(bufpack[i], BUFFER_LOCK_EXCLUSIVE); + XLogRegisterBuffer(i, bufpack[i], REGBUF_FORCE_IMAGE | REGBUF_STANDARD); + } + + recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI); + + for (i = 0; i < nbufs; i++) + { + Page page = BufferGetPage(bufpack[i]); + PageSetLSN(page, recptr); + MarkBufferDirty(bufpack[i]); + UnlockReleaseBuffer(bufpack[i]); + } + END_CRIT_SECTION(); + } +} diff --git a/src/include/access/generic_xlog.h b/src/include/access/generic_xlog.h index cb5b5b713a..8abfa486c7 100644 --- a/src/include/access/generic_xlog.h +++ b/src/include/access/generic_xlog.h @@ -42,4 +42,7 @@ extern const char *generic_identify(uint8 info); extern void generic_desc(StringInfo buf, XLogReaderState *record); extern void generic_mask(char *pagedata, BlockNumber blkno); +/* other utils */ +extern void log_relation(Relation rel); + #endif /* GENERIC_XLOG_H */ -- 2.17.1 >From 8857657efa8f3347010d3b251315f800dd03bf8d Mon Sep 17 00:00:00 2001 From: "Andrey V. 
Lepikhov" Date: Tue, 2 Apr 2019 09:43:20 +0500 Subject: [PATCH 2/4] GIN-Optimal-WAL-Usage --- src/backend/access/gin/ginbtree.c | 6 ++--- src/backend/access/gin/gindatapage.c | 9 src/backend/access/gin/ginentrypage.c | 2 +- src/backend/access/gin/gininsert.c| 30 ++-- src/backend/access/gin/ginutil.c | 4 ++-- src/backend/access/gin/ginvacuum.c| 2 +- src/backend/access/gin/ginxlog.c | 33 --- src/backend/access/rmgrdesc/gindesc.c | 6 - src/include/access/gin.h | 3 ++- src/include/access/ginxlog.h | 2 -- 10 files changed, 26 insertions(+), 71 deletions(-) diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c index 533949e46a..9f82eef8
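The log_relation() loop added in the patch above batches page images: it walks every block of the relation, skips still-zeroed (new) pages, and registers up to XLR_MAX_BLOCK_ID buffers per XLOG_FPI record. The grouping logic can be sketched in plain Python (an illustrative approximation, not the C code; 32 stands in for XLR_MAX_BLOCK_ID and the names are invented):

```python
# Rough sketch of grouping relation blocks into WAL records during an
# index build: zero pages are not logged, and each record carries at most
# XLR_MAX_BLOCK_ID full-page images.

XLR_MAX_BLOCK_ID = 32

def batch_blocks(blocks, is_new):
    """blocks: iterable of block numbers; is_new(blkno) -> True for pages
    that are still zeroed.  Returns a list of per-record block groups."""
    records = []
    batch = []
    for blkno in blocks:
        if is_new(blkno):
            continue                     # zero pages are skipped entirely
        batch.append(blkno)
        if len(batch) == XLR_MAX_BLOCK_ID:
            records.append(batch)        # emit one XLOG_FPI-style record
            batch = []
    if batch:
        records.append(batch)
    return records

# 70 blocks with every 7th one still empty: 60 logged pages in 2 records.
recs = batch_blocks(range(70), lambda b: b % 7 == 0)
assert [len(r) for r in recs] == [32, 28]
```

Packing many page images per record is where the WAL savings in the results table come from, compared with one generic-xlog record per page plus its extra copies.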
Re: Failure in contrib test _int on loach
On 10/04/2019 20:25, Heikki Linnakangas wrote: On 09/04/2019 19:11, Anastasia Lubennikova wrote: 05.04.2019 19:41, Anastasia Lubennikova writes: In the attachment, you can find a patch with a test that allows reproducing the bug not randomly, but on every run. Now I'm trying to find a way to fix the issue. The problem was caused by incorrect detection of the page to insert the new tuple into after a split. If gistinserttuple() of the tuple formed by gistgetadjusted() had to split the page, we must go back to the parent and descend back to the child that's a better fit for the new tuple. Previously this was handled by the code block with the following comment: * Concurrent split detected. There's no guarantee that the * downlink for this page is consistent with the tuple we're * inserting anymore, so go back to parent and rechoose the best * child. After introducing GistBuildNSN this code path became unreachable. To fix it, I added a new flag to detect such splits during index build. Isn't it possible that the grandparent page is also split, so that we'd need to climb further up? Based on Anastasia's idea I prepared an alternative solution to fix the bug (see attachment). It utilizes the idea of a linear increment of the LSN/NSN. The WAL write process is used to set the NSN value to 1 for each block of the index relation. I hope this can be a fairly clear and safe solution. -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company >From 59e1519a0a48b879777820ff68116c68ed31e684 Mon Sep 17 00:00:00 2001 From: "Andrey V.
Lepikhov" Date: Thu, 11 Apr 2019 10:52:39 +0500 Subject: [PATCH] Alt fix for gist_optimized_wal_intarray_test bug --- src/backend/access/gin/gininsert.c | 2 +- src/backend/access/gist/gist.c | 4 ++-- src/backend/access/gist/gistbuild.c | 17 +++-- src/backend/access/spgist/spginsert.c | 2 +- src/backend/access/transam/xloginsert.c | 4 +++- src/include/access/gist.h | 6 -- src/include/access/gist_private.h | 2 ++ src/include/access/xloginsert.h | 6 +- 8 files changed, 29 insertions(+), 14 deletions(-) diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c index 55eab14617..a63f33b429 100644 --- a/src/backend/access/gin/gininsert.c +++ b/src/backend/access/gin/gininsert.c @@ -413,7 +413,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo) { log_newpage_range(index, MAIN_FORKNUM, 0, RelationGetNumberOfBlocks(index), - true); + true, NULL); } /* diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c index 2db790c840..56f6ce04db 100644 --- a/src/backend/access/gist/gist.c +++ b/src/backend/access/gist/gist.c @@ -492,7 +492,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate, * we don't need to be able to detect concurrent splits yet.) */ if (is_build) - recptr = GistBuildLSN; + recptr = gistBuildLSN++; else { if (RelationNeedsWAL(rel)) @@ -559,7 +559,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate, MarkBufferDirty(leftchildbuf); if (is_build) - recptr = GistBuildLSN; + recptr = gistBuildLSN++; else { if (RelationNeedsWAL(rel)) diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c index 8e81eda517..31118b54cf 100644 --- a/src/backend/access/gist/gistbuild.c +++ b/src/backend/access/gist/gistbuild.c @@ -76,6 +76,13 @@ typedef struct GistBufferingMode bufferingMode; } GISTBuildState; +/* + * A bogus LSN / NSN value used during index build. 
Must be smaller than any + * real or fake unlogged LSN, so that after an index build finishes, all the + * splits are considered completed. + */ +XLogRecPtr gistBuildLSN = 0; + /* prototypes for private functions */ static void gistInitBuffering(GISTBuildState *buildstate); static int calculatePagesPerBuffer(GISTBuildState *buildstate, int levelStep); @@ -107,6 +114,12 @@ static void gistMemorizeParent(GISTBuildState *buildstate, BlockNumber child, static void gistMemorizeAllDownlinks(GISTBuildState *buildstate, Buffer parent); static BlockNumber gistGetParent(GISTBuildState *buildstate, BlockNumber child); +static void +gistbuild_log_mask(char *page) +{ + GistPageSetNSN((Page) page, (uint64) 1); +} + /* * Main entry point to GiST index build. Initially calls insert over and over, * but switches to more efficient buffering build algorithm after a certain @@ -180,7 +193,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo) GISTInitBuffer(buffer, F_LEAF); MarkBufferDirty(buffer); - PageSetLSN(page, GistBuildLSN); + PageSetLSN(page, gistBuildLSN++); UnlockReleaseBuffer(buffer); @@ -222,7 +235,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo) { log_newpage_range(index, MAIN_FORKNUM, 0, RelationGetNumberOfBlocks(inde
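To make the mechanism in the patch above concrete, here is a toy model (Python, for illustration only; the real code is C in gist.c and gistbuild.c) of how GiST uses NSNs to detect a concurrent page split, and why a constant bogus LSN during index build defeats the check while an incrementing counter, like the patch's gistBuildLSN++, restores it. All names and numbers here are invented.

```python
# Toy model (illustration only, not PostgreSQL code) of GiST's
# NSN-based concurrent-split detection.

class Page:
    def __init__(self):
        self.lsn = 0   # stamped every time the page is modified
        self.nsn = 0   # stamped on the left half when the page is split

_counter = [0]

def next_lsn():
    _counter[0] += 1
    return _counter[0]

def split(page):
    """Split `page`, stamping LSN/NSN the way gistplacetopage() does."""
    right = Page()
    lsn = next_lsn()
    page.lsn = right.lsn = lsn
    page.nsn = lsn                 # "this page was split at this moment"
    return right

def concurrent_split_detected(downlink_lsn, child):
    # If the child's NSN is newer than the LSN at which the parent's
    # downlink was read, the child was split behind our back and we
    # must move right / rechoose the best child.
    return child.nsn > downlink_lsn

child = Page()
downlink_lsn = next_lsn()          # LSN observed while reading the parent
split(child)                       # somebody splits the child afterwards
assert concurrent_split_detected(downlink_lsn, child)

# With a constant bogus LSN (the pre-patch behaviour during index
# build) every comparison degenerates and the split becomes invisible:
CONST = 1
stale_child = Page()
stale_child.nsn = CONST
assert not concurrent_split_detected(CONST, stale_child)
```

With a monotonic counter, a split performed after the downlink was read always carries a strictly greater NSN, so the detection works during index build exactly as it does at run time.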
Re: Failure in contrib test _int on loach
On 11/04/2019 13:14, Heikki Linnakangas wrote: On 11/04/2019 09:10, Andrey Lepikhov wrote: On 10/04/2019 20:25, Heikki Linnakangas wrote: On 09/04/2019 19:11, Anastasia Lubennikova wrote: After introducing GistBuildNSN this code path became unreachable. To fix it, I added a new flag to detect such splits during index build. Isn't it possible that the grandparent page is also split, so that we'd need to climb further up? Based on Anastasia's idea I prepared an alternative solution to fix the bug (see attachment). It uses a linearly incremented LSN/NSN counter: each assignment during index build takes the next counter value, and at WAL-write time the NSN value of each block of the index relation is set to 1. I hope this can be a fairly clear and safe solution. That's basically the same idea as always using the "fake LSN" during index build, like the original version of this patch did. It's got the problem that I mentioned at https://www.postgresql.org/message-id/090fb3cb-1ca4-e173-ecf7-47d41ebac...@iki.fi: * Using "fake" unlogged LSNs for GiST index build seemed fishy. I could not convince myself that it was safe in all corner cases. In a recently initdb'd cluster, it's theoretically possible that the fake LSN counter overtakes the real LSN value, and that could lead to strange behavior. For example, how would the buffer manager behave if there was a dirty page in the buffer cache with an LSN value greater than the current WAL flush pointer? I think you'd get "ERROR: xlog flush request %X/%X is not satisfied --- flushed only to %X/%X". Perhaps the risk is theoretical; the real WAL begins at XLOG_SEG_SIZE, so with the default WAL segment size the index build would have to do about 16 million page splits. The index would have to be at least 150 GB for that. But it seems possible, and more so with non-default segment and page size settings. As I see in bufmgr.c, XLogFlush() can't be called during an index build. In the log_newpage_range() call we can use a mask to set the NSN (and LSN) value to 1.
Perhaps we could start at 1, but instead of using a global counter, whenever a page is split, take the parent's LSN value and increment it by one. Then different branches of the tree could use the same values, which would reduce the consumption of counter values. Yet another idea would be to start the counter at 1, but check that it doesn't overtake the WAL insert pointer. If it's about to overtake it, just generate some dummy WAL. But it seems best to deal with this in gistdoinsert(). I think Anastasia's approach of adding a flag to GISTInsertStack can be made to work, if we set the flag somewhere in gistinserttuples() or gistplacetopage(), whenever a page is split. That way, if it needs to split multiple levels, the flag is set on all of the corresponding GISTInsertStack entries. Yet another trivial fix would be to just always restart the tree descent from the root in gistdoinsert() if a page is split. Not as efficient, but probably negligible in practice. Agreed. -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
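The "check that it doesn't overtake the WAL insert pointer" idea above can be sketched as follows. This is an illustrative model only: WalStub and all the numbers are invented stand-ins for the real WAL machinery.

```python
# Sketch: hand out fake build-time LSNs starting from 1, but never let
# the counter reach the real WAL insert pointer; when it is about to,
# advance the WAL with a dummy record.  Illustration only.

class WalStub:
    def __init__(self, insert_ptr):
        self.insert_ptr = insert_ptr
        self.dummy_records = 0

    def emit_dummy_record(self, size=64):
        self.dummy_records += 1
        self.insert_ptr += size

class FakeLsnAllocator:
    def __init__(self, wal):
        self.wal = wal
        self.counter = 0

    def next(self):
        self.counter += 1
        # Keep fake LSNs strictly behind the real insert pointer, so a
        # dirty page can never carry an LSN the WAL has not reached yet.
        while self.counter >= self.wal.insert_ptr:
            self.wal.emit_dummy_record()
        return self.counter

wal = WalStub(insert_ptr=10)
alloc = FakeLsnAllocator(wal)
lsns = [alloc.next() for _ in range(100)]
assert lsns == list(range(1, 101))            # monotonic, gap-free
assert all(l < wal.insert_ptr for l in lsns)  # never overtakes the WAL
assert wal.dummy_records >= 1                 # dummy WAL was generated
```

The invariant this preserves is exactly the one the buffer-manager concern is about: no page LSN ever exceeds what the WAL can flush.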
Do CustomScan as not projection capable node
Can we include the CustomScan node in the list of nodes that do not support projection? The reason is that a custom node can contain quite arbitrary logic that does not guarantee projection support. Secondly, if the planner does not need a separate Result node, it just assigns the tlist to the subplan (i.e., changes the target list of the custom node) and does not change custom_scan_tlist. Perhaps I do not fully understand the logic of using the custom_scan_tlist field. But if our custom node does not build its own custom_scan_tlist in the PlanCustomPath() routine (it may use tlist as the base for custom_scan_tlist), we will get errors in the set_customscan_references() call. -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
>From 938904d179e0a4e31cbb20fb70243d2b980d8dc2 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov"
Date: Fri, 19 Apr 2019 09:34:09 +0500
Subject: [PATCH] Add CustomScan node in the list of nodes that do not support
 projection
---
 src/backend/optimizer/plan/createplan.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index efe073a3ee..682d4d4429 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -6691,6 +6691,7 @@ is_projection_capable_path(Path *path)
 		case T_ModifyTable:
 		case T_MergeAppend:
 		case T_RecursiveUnion:
+		case T_CustomScan:
 			return false;
 		case T_Append:
-- 
2.17.1
Re: Do CustomScan as not projection capable node
On 22/04/2019 18:40, Robert Haas wrote: On Fri, Apr 19, 2019 at 12:45 AM Tom Lane wrote: I don't buy this for a minute. Where do you think projection is going to happen? There isn't any existing node type that *couldn't* support projection if we insisted that that be done across-the-board. I think it's mostly just a legacy thing that some don't. I think there may actually be some good reasons for that. If something like an Append or Material node projects, it seems to me that this means that we built the wrong tlist for its input(s). That justification doesn't apply to custom scans, though. The main reason for my question was incomplete information about the custom_scan_tlist / fdw_scan_tlist parameter. In the process of testing my custom node, I encountered an error in setrefs.c caused by optimization of the projection operation. In order to reliably understand how to use custom_scan_tlist properly, I had to study in detail the mechanics of the FDW plan generator, and now the problem is solved. We have only three references to this parameter in the hackers mailing list, a brief reference on postgresql.org, and limited comments in two patches: 1a8a4e5 and e7cb7ee. It is possible that custom_scan_tlist is designed too nontrivially, and it may need some comments describing in more detail how to use it. -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Re: [PATCH] XLogReadRecord returns pointer to currently read page
On 22.10.2018 2:06, Heikki Linnakangas wrote: On 17/08/2018 06:47, Andrey Lepikhov wrote: I propose a patch to fix one small code defect. The XLogReadRecord() function reads the pages of a WAL segment that contain a WAL record. Then it creates a readRecordBuf buffer in the private memory of a backend and copies the record from the pages to the readRecordBuf buffer. The 'record' pointer is set to the beginning of the readRecordBuf buffer. But if the WAL record is fully placed on one page, the XLogReadRecord() function forgets to point the "record" pointer at the beginning of the readRecordBuf buffer. In this case, XLogReadRecord() returns a pointer to an internal xlog page. This patch fixes the defect. Hmm. I agree it looks weird the way it is. But I wonder, why do we even copy the record to readRecordBuf, if it fits on the WAL page? Returning a pointer to the xlog page buffer seems OK in that case. What if we remove the memcpy(), instead? It depends on the PostgreSQL roadmap. If we want to compress main data and encode something to reduce the WAL-record size, then the representation of the WAL record in shared buffers (packed) and in local memory (unpacked) will be different, and the patch is needed. If something like this should not be in the PostgreSQL core, then it is more efficient to remove the memcpy(), of course. I vote for the patch. -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Re: [PATCH] XLogReadRecord returns pointer to currently read page
On 23.10.2018 0:53, Heikki Linnakangas wrote: I'd expect the decompression to read from the on-disk buffer, and unpack to readRecordBuf; I still don't see a need to copy the packed record to readRecordBuf. If there is a need for that, though, the patch that implements the packing or compression can add the memcpy() where it needs it. I agree with that. Eventually, the placement of the WAL record can be determined by comparing the record, readBuf and readRecordBuf pointers. A new version of the patch is in the attachment. -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company >From 36fd35dc75658f471efbc64fe2a3f204f0aa27e4 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Tue, 23 Oct 2018 10:17:55 +0500 Subject: [PATCH] WAL-record-buffer-pointer-fix --- src/backend/access/transam/xlogreader.c | 27 - 1 file changed, 13 insertions(+), 14 deletions(-) diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 0768ca7822..c5e019bf77 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -353,19 +353,6 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) gotheader = false; } - /* - * Enlarge readRecordBuf as needed. - */ - if (total_len > state->readRecordBufSize && - !allocate_recordbuf(state, total_len)) - { - /* We treat this as a "bogus data" condition */ - report_invalid_record(state, "record length %u at %X/%X too long", - total_len, - (uint32) (RecPtr >> 32), (uint32) RecPtr); - goto err; - } - len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ; if (total_len > len) { @@ -375,6 +362,19 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) char *buffer; uint32 gotlen; + /* + * Enlarge readRecordBuf as needed. 
+ */ + if (total_len > state->readRecordBufSize && + !allocate_recordbuf(state, total_len)) + { + /* We treat this as a "bogus data" condition */ + report_invalid_record(state, "record length %u at %X/%X too long", + total_len, + (uint32) (RecPtr >> 32), (uint32) RecPtr); + goto err; + } + /* Copy the first fragment of the record from the first page. */ memcpy(state->readRecordBuf, state->readBuf + RecPtr % XLOG_BLCKSZ, len); @@ -479,7 +479,6 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) state->EndRecPtr = RecPtr + MAXALIGN(total_len); state->ReadRecPtr = RecPtr; - memcpy(state->readRecordBuf, record, total_len); } /* -- 2.17.1
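The shape of the fix in the patch above can be sketched outside of C: return a view straight into the page buffer when the record fits on one page, and assemble fragments into a private buffer only when the record crosses a page boundary. The page layout and sizes below are invented for illustration.

```python
# Simplified model of the patched XLogReadRecord() behaviour: copy into
# a private buffer (readRecordBuf's analogue) only for records that
# cross a page boundary.  Illustration only.

BLCKSZ = 16  # toy page size

def read_record(pages, rec_ptr, total_len):
    """Return (record_bytes, copied) for a record starting at rec_ptr."""
    page_no, off = divmod(rec_ptr, BLCKSZ)
    room = BLCKSZ - off
    if total_len <= room:
        # Fits on one page: point into the page buffer, no copy needed.
        return pages[page_no][off:off + total_len], False
    # Crosses a page boundary: gather the fragments into a private
    # buffer (the only case that needs readRecordBuf in the patch).
    buf = bytearray(pages[page_no][off:])
    while len(buf) < total_len:
        page_no += 1
        need = total_len - len(buf)
        buf += pages[page_no][:need]
    return bytes(buf), True

pages = [bytes(range(i * BLCKSZ, (i + 1) * BLCKSZ)) for i in range(2)]
rec, copied = read_record(pages, rec_ptr=4, total_len=8)
assert (rec, copied) == (bytes(range(4, 12)), False)   # one page, no copy
rec, copied = read_record(pages, rec_ptr=12, total_len=8)
assert (rec, copied) == (bytes(range(12, 20)), True)   # crosses boundary
```

As in the C patch, the caller can tell which case occurred by where the returned data lives, and the private buffer is only ever enlarged on the boundary-crossing path.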
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On 19.10.2018 0:54, Peter Geoghegan wrote: I would welcome any theories as to what could be the problem here. I think that this is fixable, since the picture for the patch is very positive, provided you only focus on bgwriter/checkpoint activity and on-disk sizes. It seems likely that there is a very specific gap in my understanding of how the patch affects buffer cleaning. I have the same problem with a background heap & index cleaner (based on your patch). In this case the bottleneck is the WAL record that I need to write for each cleaned block, and the locks that are held while the WAL record is written. Maybe you could run a test without writing any data to disk? -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Re: [PATCH] XLogReadRecord returns pointer to currently read page
On 26.10.2018 10:33, Kyotaro HORIGUCHI wrote: Hello. At Tue, 23 Oct 2018 10:25:27 +0500, Andrey Lepikhov wrote in <2553f2b0-0e39-eb0e-d382-6c0ed08ca...@postgrespro.ru> On 23.10.2018 0:53, Heikki Linnakangas wrote: I'd expect the decompression to read from the on-disk buffer, and unpack to readRecordBuf, I still don't see a need to copy the packed record to readRecordBuf. If there is a need for that, though, the patch that implements the packing or compression can add the memcpy() where it needs it. I agree with it. Eventually, placement of the WAL-record can be defined by comparison the record, readBuf and readRecordBuf pointers. In attachment new version of the patch. This looks quite clear and what it does is reasonable to me. Applies cleanly on top of current master and no regression seen. Just one comment. This patch leaves the following code. > /* Record does not cross a page boundary */ > if (!ValidXLogRecord(state, record, RecPtr)) > goto err; > > state->EndRecPtr = RecPtr + MAXALIGN(total_len); !> > state->ReadRecPtr = RecPtr; > } The empty line (marked by '!') looks a bit too much standing out after this patch. Could you remove the line? Then could you transpose the two assignments if you don't mind? Thanks, see attachment. -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company >From ec1df6c2b41fdfe2c79e3f0944653057e394c535 Mon Sep 17 00:00:00 2001 From: "Andrey V. 
Lepikhov" Date: Tue, 23 Oct 2018 10:17:55 +0500 Subject: [PATCH] WAL record buffer pointer fix --- src/backend/access/transam/xlogreader.c | 30 - 1 file changed, 14 insertions(+), 16 deletions(-) diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 0768ca7822..82a16e78f3 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -353,19 +353,6 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) gotheader = false; } - /* - * Enlarge readRecordBuf as needed. - */ - if (total_len > state->readRecordBufSize && - !allocate_recordbuf(state, total_len)) - { - /* We treat this as a "bogus data" condition */ - report_invalid_record(state, "record length %u at %X/%X too long", - total_len, - (uint32) (RecPtr >> 32), (uint32) RecPtr); - goto err; - } - len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ; if (total_len > len) { @@ -375,6 +362,19 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) char *buffer; uint32 gotlen; + /* + * Enlarge readRecordBuf as needed. + */ + if (total_len > state->readRecordBufSize && + !allocate_recordbuf(state, total_len)) + { + /* We treat this as a "bogus data" condition */ + report_invalid_record(state, "record length %u at %X/%X too long", + total_len, + (uint32) (RecPtr >> 32), (uint32) RecPtr); + goto err; + } + /* Copy the first fragment of the record from the first page. */ memcpy(state->readRecordBuf, state->readBuf + RecPtr % XLOG_BLCKSZ, len); @@ -476,10 +476,8 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) if (!ValidXLogRecord(state, record, RecPtr)) goto err; - state->EndRecPtr = RecPtr + MAXALIGN(total_len); - state->ReadRecPtr = RecPtr; - memcpy(state->readRecordBuf, record, total_len); + state->EndRecPtr = RecPtr + MAXALIGN(total_len); } /* -- 2.17.1
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
I have done a code review. For now it covers the first patch, v6-0001..., dedicated to logical duplicate ordering. The documentation is full and clear. All non-trivial logic is commented accurately. The patch applies cleanly on top of current master. Regression tests pass, and my "Retail Indextuple deletion" use cases work without mistakes. But I have two comments on the code. The new BTScanInsert structure reduces the parameter lists of many functions and looks fine. But it contains an optimization part (the 'restorebinsrch' field et al.) that is used very locally in the code, in the _bt_findinsertloc() -> _bt_binsrch() call path. Maybe you could localize this logic in a separate struct, which would be passed to _bt_binsrch() as a pointer; other routines could pass a NULL value. That might simplify use of the struct. Due to the optimization, _bt_binsrch() has doubled in size. Maybe you could move this part to a service routine? -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On 03.11.2018 5:00, Peter Geoghegan wrote: The DESC heap TID sort order thing probably needs to go. I'll probably have to go fix the regression test failures that occur when ASC heap TID order is used. (Technically those failures are a pre-existing problem, a problem that I mask by using DESC order...which is weird. The problem is masked in the master branch by accidental behaviors around nbtree duplicates, which is something that deserves to die. DESC order is closer to the accidental current behavior.) I applied your patches on top of master. After test corrections (related to TID ordering in the DROP...CASCADE operation on index relations) 'make check-world' passed successfully many times. In the case of the 'create view' regression test (the 'drop cascades to 62 other objects' problem), I verified Álvaro Herrera's hypothesis [1] and it is true. You can verify it by tracking the return value of the object_address_present_add_flags() routine. Some doubts remain, however, regarding the 'triggers' test. Could you specify which test failures you mean? [1] https://www.postgresql.org/message-id/20180504022601.fflymidf7eoencb2%40alvherre.pgsql -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
On 04.11.2018 9:31, Peter Geoghegan wrote: On Sat, Nov 3, 2018 at 8:52 PM Andrey Lepikhov wrote: I applied your patches on top of master. After test corrections (related to TID ordering in the DROP...CASCADE operation on index relations) 'make check-world' passed successfully many times. In the case of the 'create view' regression test (the 'drop cascades to 62 other objects' problem), I verified Álvaro Herrera's hypothesis [1] and it is true. You can verify it by tracking the return value of the object_address_present_add_flags() routine. I'll have to go and fix the problem directly, so that ASC sort order can be used. Some doubts remain, however, regarding the 'triggers' test. Could you specify which test failures you mean? Not sure what you mean. The order of items that are listed in the DETAIL for a cascading DROP can have an "implementation defined" order. I think that this is an example of the more general problem -- what you call the 'drop cascades to 62 other objects' problem is a more specific subproblem, or, if you prefer, a more specific symptom of the same problem. I mean that your code has no problems that I can detect with the regression tests or with the retail index tuple deletion patch. The difference in the number of dropped objects is not a problem. It is caused by the code at position 2293 ('else if (thisobj->objectSubId == 0)') in catalog/dependency.c, and it is legal behavior: a column object is deleted without any report because we already decided to drop its whole table. Also, I checked the triggers test. The difference in the ERROR message 'cannot drop trigger trg1' is caused by a different order of tuples in the relation with the dependDependerIndexId relid. It is legal behavior, and we can simply replace the test results. Do you know of any other problems with the patch? -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Re: Fixing findDependentObjects()'s dependency on scan order (regressions in DROP diagnostic messages)
to fixing the processing order of this findDependentObjects() pg_depend scan so that we reliably get the user-visible behavior we already tacitly expect? -- Peter Geoghegan -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Re: [PATCH] XLogReadRecord returns pointer to currently read page
On 16.11.2018 11:23, Michael Paquier wrote: On Thu, Nov 15, 2018 at 06:12:38PM +0900, Kyotaro HORIGUCHI wrote: This patch eliminates unnecessary copying that was done for non-continued records. Now the return value of XLogReadRecord directly points into the page buffer held in XLogReaderState. It is safe because no caller site uses the returned pointer beyond the replacement of the buffer content at the next call to the same function. I was looking at this patch, and shouldn't we worry about compatibility with plugins or utilities which look directly at what's in readRecordBuf for the record contents? Let's not forget that the contents of XLogReaderState are public. Drawing on my own experience, I have clarified some comments to avoid such mistakes in the future (see attachment). -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company >From 7de252405ef5d5979fe2711515c0c6402abc0e06 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Mon, 19 Nov 2018 07:08:28 +0500 Subject: [PATCH] Some clarifications on readRecordBuf comments --- src/backend/access/transam/xlogreader.c | 4 ++-- src/include/access/xlogreader.h | 6 +- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index c5e019bf77..88d0bcf48a 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -209,8 +209,8 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength) * If the reading fails for some other reason, NULL is also returned, and * *errormsg is set to a string with details of the failure. * - * The returned pointer (or *errormsg) points to an internal buffer that's - * valid until the next call to XLogReadRecord. + * The returned pointer (or *errormsg) points to an internal read-only buffer + * that's valid until the next call to XLogReadRecord. 
*/ XLogRecord * XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg) diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h index 40116f8ecb..0837625b95 100644 --- a/src/include/access/xlogreader.h +++ b/src/include/access/xlogreader.h @@ -185,7 +185,11 @@ struct XLogReaderState */ TimeLineID nextTLI; - /* Buffer for current ReadRecord result (expandable) */ + /* + * Buffer for current ReadRecord result (expandable). + * Used in the case, than current ReadRecord result don't fit on the + * currently read WAL page. + */ char *readRecordBuf; uint32 readRecordBufSize; -- 2.17.1
Re: [PATCH] XLogReadRecord returns pointer to currently read page
On 20.11.2018 6:30, Michael Paquier wrote: On Mon, Nov 19, 2018 at 10:48:06AM +0500, Andrey Lepikhov wrote: According to my experience, I clarify some comments to avoid this mistakes in the future (see attachment). No objections from here. - * The returned pointer (or *errormsg) points to an internal buffer that's - * valid until the next call to XLogReadRecord. + * The returned pointer (or *errormsg) points to an internal read-only buffer + * that's valid until the next call to XLogReadRecord. Not sure that this bit adds much. Ok - /* Buffer for current ReadRecord result (expandable) */ + /* +* Buffer for current ReadRecord result (expandable). +* Used in the case, than current ReadRecord result don't fit on the +* currently read WAL page. +*/ Yeah, this one is not entirely true now. What about the following instead: - /* Buffer for current ReadRecord result (expandable) */ + /* +* Buffer for current ReadRecord result (expandable), used when a record +* crosses a page boundary. +*/ I agree -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Re: Removing unneeded self joins
On 12/6/22 23:46, Andres Freund wrote: This doesn't pass the main regression tests due to a plan difference. https://cirrus-ci.com/task/5537518245380096 https://api.cirrus-ci.com/v1/artifact/task/5537518245380096/testrun/build/testrun/regress/regress/regression.diffs diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/join.out /tmp/cirrus-ci-build/build/testrun/regress/regress/results/join.out --- /tmp/cirrus-ci-build/src/test/regress/expected/join.out 2022-12-05 19:11:52.453920838 + +++ /tmp/cirrus-ci-build/build/testrun/regress/regress/results/join.out 2022-12-05 19:15:21.864183651 + @@ -5806,7 +5806,7 @@ Nested Loop Join Filter: (sj_t3.id = sj_t1.id) -> Nested Loop - Join Filter: (sj_t3.id = sj_t2.id) + Join Filter: (sj_t2.id = sj_t3.id) -> Nested Loop Semi Join -> Nested Loop -> HashAggregate This change in the test behaviour is induced by the a5fc4641 "Avoid making commutatively-duplicate clauses in EquivalenceClasses." Nothing special, as I see. Attached patch fixes this. -- Regards Andrey Lepikhov Postgres Professional From 3e546637561bf4c6d195bc7c95b1e05263e797e2 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Wed, 5 Oct 2022 16:58:34 +0500 Subject: [PATCH] Remove self-joins. A Self Join Removal (SJR) feature removes inner join of plain table to itself in a query plan if can be proved that the join can be replaced with a scan. The proof based on innerrel_is_unique machinery. We can remove a self-join when for each outer row: 1. At most one inner row matches the join clauses. 2. If the join target list contains any inner vars, an inner row must be (physically) the same row as the outer one. In this patch we use Rowley's [1] approach to identify a self-join: 1. Collect all mergejoinable join quals which look like a.x = b.x 2. Check innerrel_is_unique() for the qual list from (1). If it returns true, then outer row matches only the same row from the inner relation. 3. 
If uniqueness of outer relation deduces from baserestrictinfo clause like 'x=const' on unique field(s), check that inner has the same clause with the same constant(s). 4. Compare RowMarks of inner and outer relations. They must have the same strength. Some regression tests changed due to self-join removal logic. [1] https://www.postgresql.org/message-id/raw/CAApHDvpggnFMC4yP-jUO7PKN%3DfXeErW5bOxisvJ0HvkHQEY%3DWw%40mail.gmail.com --- doc/src/sgml/config.sgml | 16 + src/backend/optimizer/path/indxpath.c | 38 + src/backend/optimizer/plan/analyzejoins.c | 1046 - src/backend/optimizer/plan/planmain.c |5 + src/backend/utils/misc/guc_tables.c | 22 + src/include/optimizer/paths.h |3 + src/include/optimizer/planmain.h |7 + src/test/regress/expected/equivclass.out | 32 + src/test/regress/expected/join.out| 774 +++ src/test/regress/expected/sysviews.out|3 +- src/test/regress/sql/equivclass.sql | 16 + src/test/regress/sql/join.sql | 340 +++ src/tools/pgindent/typedefs.list |2 + 13 files changed, 2278 insertions(+), 26 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 8e4145979d..2f9948d5f8 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -5311,6 +5311,22 @@ ANY num_sync ( + enable_self_join_removal (boolean) + + enable_self_join_removal configuration parameter + + + + +Enables or disables the query planner's optimization which analyses +query tree and replaces self joins with semantically equivalent single +scans. Take into consideration only plain tables. +The default is on. 
+ + + + enable_seqscan (boolean) diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c index 914bfd90bc..8d57c68b1f 100644 --- a/src/backend/optimizer/path/indxpath.c +++ b/src/backend/optimizer/path/indxpath.c @@ -3494,6 +3494,21 @@ bool relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel, List *restrictlist, List *exprlist, List *oprlist) +{ + return relation_has_unique_index_ext(root, rel, restrictlist, + exprlist, oprlist, NULL); +} + +/* + * relation_has_unique_index_ext + * if extra_clauses isn't NULL, return baserestrictinfo clauses which were + * used to derive uniqueness. + */ +bool +relation_has_unique_index_ext(PlannerInfo *root, RelOptInfo *rel, + List *restrictlist, + List *exprlist, List *oprlist, + List **extra_clauses) { ListCell *ic; @@ -3549,6 +3564,7 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel, { IndexOptInfo *ind = (IndexOptInfo *) lfirst(ic); int c; + List *exprs = NIL; /* *
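As a rough illustration of the detection recipe in the commit message above (not the patch's actual C code, which works on RelOptInfo, RestrictInfo and EquivalenceClass structures), the decision can be sketched like this; the relation/index/restriction encoding is entirely invented.

```python
# Illustrative sketch of the self-join removal conditions: the joined
# relations must be the same table, the equated columns must cover a
# unique index (the innerrel_is_unique() analogue), and any 'x = const'
# restrictions on the two sides must match.  Not PostgreSQL code.

def self_join_removable(table_a, table_b, join_quals, unique_indexes,
                        restrictions_a, restrictions_b):
    if table_a != table_b:
        return False                      # not a self-join at all
    # Step 1: keep only mergejoinable quals equating the same column on
    # both sides (a.x = b.x).
    cols = {qa for qa, qb in join_quals if qa == qb}
    # Step 2: some unique index must be fully covered by those columns,
    # so each outer row matches at most one (the same) inner row.
    if not any(set(idx) <= cols for idx in unique_indexes):
        return False
    # Step 3: 'x = const' clauses must match exactly on both sides,
    # otherwise the two scans would filter different rows.
    return restrictions_a == restrictions_b

# t join t on (t1.id = t2.id), with id unique -> removable
assert self_join_removable("t", "t", [("id", "id")], [("id",)], {}, {})
# joining on a non-unique column -> keep the join
assert not self_join_removable("t", "t", [("payload", "payload")],
                               [("id",)], {}, {})
# different constant filters on each side -> keep the join
assert not self_join_removable("t", "t", [("id", "id")], [("id",)],
                               {"x": 1}, {"x": 2})
```

The third check corresponds to item 3 of the commit message: uniqueness derived from baserestrictinfo clauses is only transferable if both sides carry the same constants.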
Optimization issue of branching UNION ALL
Hi, a client report on a corner case has turned up a possible minor non-optimality in the procedure that transforms a simple UNION ALL statement tree. The complaint is about an auto-generated query with 1E4 simple UNION ALLs (see t.sh, which generates a demo script). The reason: in REL_11_STABLE it is planned and executed in a second, but REL_12_STABLE and beyond make matters worse: planning such a query needs many gigabytes of RAM. A superficial study revealed possibly unnecessary operations that could be avoided: 1. Walking across the query by calling substitute_phv_relids() even if lastPHId shows that no PHV is present. 2. Iterative passes along the append_rel_list to replace vars in the translated_vars field. I can't grasp the real necessity of scanning the whole append_rel_list while flattening a UNION ALL leaf subquery. No one can reference this leaf, can they? In the attachment you can see a sketch that reduces the number of planner cycles/copies. -- Regards Andrey Lepikhov Postgres Professional t.sh Description: application/shellscript diff --git a/src/backend/optimizer/prep/prepjointree.c b/src/backend/optimizer/prep/prepjointree.c index 08a73fb9d86..3739e3fe7ba 100644 --- a/src/backend/optimizer/prep/prepjointree.c +++ b/src/backend/optimizer/prep/prepjointree.c @@ -131,7 +131,7 @@ static bool find_dependent_phvs_in_jointree(PlannerInfo *root, Node *node, int varno); static void substitute_phv_relids(Node *node, int varno, Relids subrelids); -static void fix_append_rel_relids(List *append_rel_list, int varno, +static void fix_append_rel_relids(PlannerInfo *root, int varno, Relids subrelids); static Node *find_jointree_node_for_rel(Node *jtnode, int relid); @@ -1156,8 +1156,9 @@ pull_up_simple_subquery(PlannerInfo *root, Node *jtnode, RangeTblEntry *rte, Relids subrelids; subrelids = get_relids_in_jointree((Node *) subquery->jointree, false); - substitute_phv_relids((Node *) parse, varno, subrelids); - fix_append_rel_relids(root->append_rel_list, varno, 
subrelids); + if (root->glob->lastPHId != 0) + substitute_phv_relids((Node *) parse, varno, subrelids); + fix_append_rel_relids(root, varno, subrelids); } /* @@ -2064,17 +2065,25 @@ perform_pullup_replace_vars(PlannerInfo *root, * use PHVs for safety. (This analysis could be made tighter but it seems * unlikely to be worth much trouble.) */ - foreach(lc, root->append_rel_list) + if (containing_appendrel) { - AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(lc); - bool save_need_phvs = rvcontext->need_phvs; + bool save_need_phvs = rvcontext->need_phvs; - if (appinfo == containing_appendrel) - rvcontext->need_phvs = false; - appinfo->translated_vars = (List *) - pullup_replace_vars((Node *) appinfo->translated_vars, rvcontext); + rvcontext->need_phvs = false; + containing_appendrel->translated_vars = (List *) + pullup_replace_vars((Node *) containing_appendrel->translated_vars, +rvcontext); rvcontext->need_phvs = save_need_phvs; } + else + { + foreach(lc, root->append_rel_list) + { + AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(lc); + appinfo->translated_vars = (List *) +pullup_replace_vars((Node *) appinfo->translated_vars, rvcontext); + } + } /* * Replace references in the joinaliasvars lists of join RTEs. @@ -3273,7 +3282,7 @@ remove_result_refs(PlannerInfo *root, int varno, Node *newjtloc) subrelids = get_relids_in_jointree(newjtloc, false); Assert(!bms_is_empty(subrelids)); substitute_phv_relids((Node *) root->parse, varno, subrelids); - fix_append_rel_relids(root->append_rel_list, varno, subrelids); + fix_append_rel_relids(root, varno, subrelids); } /* @@ -3492,7 +3501,7 @@ substitute_phv_relids(Node *node, int varno, Relids subrelids) * We assume we may modify the AppendRelInfo nodes in-place. 
*/ static void -fix_append_rel_relids(List *append_rel_list, int varno, Relids subrelids) +fix_append_rel_relids(PlannerInfo *root, int varno, Relids subrelids) { ListCell *l; int subvarno = -1; @@ -3503,7 +3512,7 @@ fix_append_rel_relids(List *append_rel_list, int varno, Relids subrelids) * AppendRelInfo nodes refer to it. So compute it on first use. Note that * bms_singleton_member will complain if set is not singleton. */ - foreach(l, append_rel_list) + foreach(l, root->append_rel_list) { AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(l); @@ -3518,8 +3527,9 @@ fix_append_rel_relids(List *append_rel_list, int varno, Relids subrelids) } /* Also fix up any PHVs in its translated vars */ - substitute_phv_relids((Node *) appinfo->translated_vars, - varno, subrelids); + if (root->glob->lastPHId != 0) + substitute_phv_relids((Node *) appinfo->translated_vars, + varno, subrelids); } }
Re: Optimization issue of branching UNION ALL
On 22/12/2022 06:50, Tom Lane wrote: 2. Iterative passes along the append_rel_list to replace vars in the translated_vars field. I can't grasp the real necessity of passing the whole append_rel_list while flattening a UNION ALL leaf subquery - no one can reference this leaf, can they? After thinking about that for awhile, I believe we can go further: the containing_appendrel is actually the *only* part of the upper query that needs to be adjusted. So we could do something like the attached. This passes check-world, but I don't have quite enough confidence in it to just commit it. Thanks. I wrote that letter because I had some doubts too, but the only weak point I could imagine is that the SQL standard may change someday. Your code looks better than my previous attempt. -- regards, Andrey Lepikhov Postgres Professional
Re: POC, WIP: OR-clause support for indexes
On 12/26/15 23:04, Teodor Sigaev wrote: I'd like to present OR-clause support for indexes. Although OR-clauses could be supported by a BitmapOr index scan, it isn't very effective and such a scan loses any ordering existing in the index. We (with Alexander Korotkov) presented results at this year's Vienna conference. In short, it provides a performance improvement: EXPLAIN ANALYZE SELECT count(*) FROM tst WHERE id = 5 OR id = 500 OR id = 5000; ... The problems on the way which I see for now: 1. Calculating cost. Right now it's just a simple transformation of the costs computed for a BitmapOr path. I'd like to hope that's possible, so the indexes' estimation functions could be left untouched and could keep believing that all clauses are implicitly ANDed. 2. I'd like to add such support to btree, but it seems that it should be a separate patch. The btree search algorithm doesn't use any kind of stack of pages, and the algorithm to walk over the btree isn't yet clear to me. 3. I could be missing some places which still assume an implicitly-ANDed list of clauses, although the regression tests pass fine. I support such a cunning approach. But the specific case you demonstrated above could be optimized independently at an earlier stage. If we convert: (F(A) = ConstStableExpr_1) OR (F(A) = ConstStableExpr_2) to F(A) IN (ConstStableExpr_1, ConstStableExpr_2) we see a significant execution speedup. For example, using demo.sql to estimate the maximum positive effect, we see about a 40% execution and 100% planning speedup. To avoid unnecessary overhead induced by the optimization, such a transformation may be made at the planning stage (when we have cardinality estimations and have pruned partitions) but before the creation of relation scan paths. That way we can avoid planning overhead and a non-optimal BitmapOr in the case of many ORs, possibly aggravated by many indexes on the relation. For example, such an operation could be executed in create_index_paths() before processing rel->indexlist.
-- Regards Andrey Lepikhov Postgres Professional demo.sql Description: application/sql
Re: [PATCH] random_normal function
On 1/9/23 23:52, Tom Lane wrote: BTW, if this does bring the probability of failure down to the one-in-a-billion range, I think we could also nuke the whole "ignore:" business, simplifying pg_regress and allowing the random test to be run in parallel with others. With the 'ignore' option we have gotten used to covering some time-dependent features with tests, such as "statement_timeout", "idle_in_transaction_session_timeout", the usage of user timeouts in extensions, and so on. We have used the pg_sleep() function to interrupt a query at a certain execution phase. But on some platforms, especially in containers, query execution time can vary so widely that the pg_sleep() timeout required to remove the dependency on query execution time has become unacceptable. So the "ignore" option was the best choice. Now, are isolation tests the only remaining option for creating stable execution-time-dependent tests? Or is there some test machinery I'm not aware of? -- Regards Andrey Lepikhov Postgres Professional
Re: [PATCH] random_normal function
On 1/19/23 11:01, Tom Lane wrote: Andrey Lepikhov writes: On 1/9/23 23:52, Tom Lane wrote: BTW, if this does bring the probability of failure down to the one-in-a-billion range, I think we could also nuke the whole "ignore:" business, simplifying pg_regress and allowing the random test to be run in parallel with others. We have used the pg_sleep() function to interrupt a query at a certain execution phase. But on some platforms, especially in containers, query execution time can vary so widely that the pg_sleep() timeout required to remove the dependency on query execution time has become unacceptable. So the "ignore" option was the best choice. But does such a test have any actual value? If your test infrastructure ignores the result, what makes you think you'd notice if the test did indeed detect a problem? Yes: it is good for catching SEGFAULTs and assertion failures, which may be frequent given the complexity of timeout logic. I think "ignore:" was a kluge we put in twenty-plus years ago when our testing standards were a lot lower, and it's way past time we got rid of it. OK, I will try to invent an alternative way of deep (and stable) testing of timeouts. Thank you for the answer. -- Regards Andrey Lepikhov Postgres Professional
[POC] Allow an extension to add data into Query and PlannedStmt nodes
Hi, previously we have read in this mailing list some controversial opinions on queryId generation and the jumbling technique. Here we don't intend to solve those problems, but to help an extension at least not conflict with others on the queryId value. Extensions may need to pass some query-related data through all stages of query planning and execution. As a trivial example, pg_stat_statements uses the queryId at the end of execution to save some statistics. One more reason: extensions currently conflict on the queryId value and the logic of changing it. With this patch, that can be managed. This patch introduces the structure 'ExtensionData', which allows managing a list of entries with a couple of interface functions, addExtensionDataToNode() and GetExtensionData(). Keep in mind a possible future hiding of this structure from the public interface. An extension should invent a symbolic key to identify its data. It may invent as many additional keys as it wants, but the best option is no more than one entry per extension. Usage of this machinery is demonstrated by the pg_stat_statements example - here we introduced a Bigint node just for natively storing the queryId value. A ruthless pgbench benchmark shows that we get some overhead: 1.6% in default mode, 4% in prepared mode, ~0.1% in extended mode. An optimization that avoids copying the queryId, by storing it in the node pointer field directly, keeps this overhead within 0.5% for all these modes but increases complexity, so here we demonstrate the non-optimized variant. Some questions still cause doubts: - QueryRewrite: should we copy extension fields from the parent parse tree to the rewritten ones? - Do we need to invent a registration procedure to do away with the names of entries and use some compact integer IDs? - Do we need to optimize this structure to avoid a copy for simple data types, for example, by inventing something like A_Const?
All in all, in our opinion, this issue will tend to grow with the increasing number of extensions that use planner and executor hooks for their own purposes. So, any thoughts will be useful. -- Regards Andrey Lepikhov Postgres Professional From 944ce61d7ff934727240d90ee7620bfb69ad3a5a Mon Sep 17 00:00:00 2001 From: Andrey Lepikhov Date: Wed, 22 Mar 2023 16:59:30 +0500 Subject: [PATCH] Add on more field into Query and PlannedStmt structures to allow an extension to pass some data across all the execution stages of specific instance of the query. --- .../pg_stat_statements/pg_stat_statements.c | 57 -- src/backend/commands/extension.c | 103 ++ src/backend/executor/execParallel.c | 1 + src/backend/nodes/value.c | 12 ++ src/backend/optimizer/plan/planner.c | 1 + src/backend/rewrite/rewriteHandler.c | 6 + src/backend/tcop/postgres.c | 2 + src/include/commands/extension.h | 4 + src/include/nodes/parsenodes.h| 23 src/include/nodes/plannodes.h | 2 + src/include/nodes/value.h | 10 ++ 11 files changed, 209 insertions(+), 12 deletions(-) diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index 5285c3f7fa..fc47661c86 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -49,6 +49,7 @@ #include "access/parallel.h" #include "catalog/pg_authid.h" +#include "commands/extension.h" #include "common/hashfn.h" #include "executor/instrument.h" #include "funcapi.h" @@ -107,6 +108,14 @@ static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100; !IsA(n, PrepareStmt) && \ !IsA(n, DeallocateStmt)) +#define GET_QUERYID(node) \ + (Bigint *) GetExtensionData(node->ext_field, "pg_stat_statements") + +#define INSERT_QUERYID(node, queryid) \ + castNode(Bigint, AddExtensionDataToNode((Node *) node, \ + "pg_stat_statements", \ + (Node *) makeBigint((int64) queryid), \ + true)) /* * Extension version number, for supporting older extension versions' objects */ @@ -821,6 +830,13
@@ error: static void pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate) { + Bigint *queryId; + + if ((queryId = GET_QUERYID(query)) == NULL) + queryId = INSERT_QUERYID(query, query->queryId); + + Assert(queryId); + if (prev_post_parse_analyze_hook) prev_post_parse_analyze_hook(pstate, query, jstate); @@ -837,7 +853,7 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate) { if (pgss_track_utility && !PGSS_HANDLED_UTILITY(query->utilityStmt)) { - query->queryId = UINT64CONST(0); + queryId->val = UINT64CONST(0); return;
Re: [POC] Allow an extension to add data into Query and PlannedStmt nodes
On 30/3/2023 12:57, Julien Rouhaud wrote: Extensions may need to pass some query-related data through all stages of query planning and execution. As a trivial example, pg_stat_statements uses the queryId at the end of execution to save some statistics. One more reason: extensions now conflict on the queryId value and the logic of changing it. With this patch, that can be managed. I just had a quick look at the patch, and IIUC it doesn't really help on that side, as there's still a single official "queryid" that's computed, stored everywhere and later used by pg_stat_statements (which then stores it additionally to that official queryid). Thank you for your attention! This patch doesn't try to solve the problem of queryId uniqueness. In this patch we change pg_stat_statements so that it doesn't set 0 into the queryId field according to its internal logic. Any other extension should do the same: use the queryId as you see fit, but don't erase it - erase your private copy in the ext_field instead. Ruthless pgbench benchmark shows that we got some overhead: 1.6% - in default mode 4% - in prepared mode ~0.1% in extended mode. That's a quite significant overhead. But the only reason to accept such a change is to actually use it to store additional data, so it would be interesting to see what the overhead is like once you store at least 2 different values there. Yeah, but as I said earlier, it can be reduced to a much smaller value with a simple optimization. Here I intentionally avoided it so as to discuss the core concept. - Do we need to invent a registration procedure to do away with the names of entries and use some compact integer IDs? Note that the patch as proposed doesn't have any defense against two extensions trying to register something with the same name, or update a stored value, as AddExtensionDataToNode() simply prepends the new value to the list.
You can actually update the value by just storing the new value, but it will add a significant overhead to every other extension that wants to read another value. Thanks a lot! The attached patch implements such an idea: an extension can allocate some entries and use these private IDs to add its data. I hope an extension would prefer to use only one entry for all its data to keep the overhead down, but still. The API is also quite limited as each stored value has a single identifier. What if your extension needs to store multiple things? Since it's all based on Node you can't really store some custom struct, so you probably have to end up with things like "my_extension.my_val1", "my_extension.my_val2" which isn't great. The main idea here is that if an extension holds a custom struct and wants to pass it along through all planning and execution stages, it should use an extensible node with custom read/write/copy routines. -- regards, Andrey Lepikhov Postgres Professional From ab101322330684e9839e46c26f70ad5462e40dac Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Thu, 30 Mar 2023 21:49:32 +0500 Subject: [PATCH 2/2] Add ExtendedData entry registration routines to use entryId instead of symbolic name.
--- .../pg_stat_statements/pg_stat_statements.c | 7 +- src/backend/commands/extension.c | 69 +++ src/include/commands/extension.h | 6 +- src/include/nodes/parsenodes.h| 2 +- 4 files changed, 64 insertions(+), 20 deletions(-) diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index fc47661c86..5b21163365 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -109,11 +109,11 @@ static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100; !IsA(n, DeallocateStmt)) #define GET_QUERYID(node) \ - (Bigint *) GetExtensionData(node->ext_field, "pg_stat_statements") + (Bigint *) GetExtensionData(node->ext_field, extendedEntryID) #define INSERT_QUERYID(node, queryid) \ castNode(Bigint, AddExtensionDataToNode((Node *) node, \ - "pg_stat_statements", \ + extendedEntryID, \ (Node *) makeBigint((int64) queryid), \ true)) /* @@ -298,6 +298,7 @@ static bool pgss_track_utility = true; /* whether to track utility commands */ static bool pgss_track_planning = false; /* whether to track planning * duration */ static bool pgss_save = true; /* wh
Re: Removing unneeded self joins
On 5/7/2023 21:28, Andrey Lepikhov wrote: Hi, During the significant code revision in v.41 I lost some replacement operations. Here is the fix, plus extra tests to check this in the future. Also, Tom added the JoinDomain structure five months ago, and I added code to replace relids there too. One more thing: I found out that we didn't replace SJs defined by baserestrictinfos if no self-join clause existed for the join. Now it is fixed, and a test has been added. To grasp the changes readily, see the delta file in the attachment. Here is the new patch in the attachment, rebased on current master and with some minor gaffes fixed. -- regards, Andrey Lepikhov Postgres Professional From 70bb5cf3d11b2797f1a9c7b04740435135229d29 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Tue, 12 Sep 2023 18:25:51 +0700 Subject: [PATCH] Remove self-joins. Self Join Elimination (SJE) feature removes an inner join of a plain table to itself in the query tree if is proved that the join can be replaced with a scan without impact to the query result. Self join and inner relation are replaced with the outer in query, equivalence classes and planner info structures. Also, inner restrictlist moves to the outer one with removing duplicated clauses. Thus, this optimization reduces length of range table list (especially it make sense for partitioned relations), reduces number of restriction clauses === selectivity estimations and potentially can improve total planner prediction for the query. The SJE proof based on innerrel_is_unique machinery. We can remove a self-join when for each outer row: 1. At most one inner row matches the join clause. 2. Each matched inner row must be (physically) the same row as the outer one. In this patch we use the next approach to identify a self-join: 1. Collect all mergejoinable join quals which look like a.x = b.x 2. Add to the list above baseretrictinfo of inner table. 3. Check innerrel_is_unique() for the qual list.
If it returns false, skip this pair of joining tables. 4. Check uniqueness, proved by the baserestrictinfo clauses. To prove possibility of the self-join elimination inner and outer clauses must have exact match. Relation replacement procedure is not trivial and it is partly combined with the one, used to remove useless left joins. Tests, covering this feature, added to the join.sql. Some regression tests changed due to self-join removal logic. --- doc/src/sgml/config.sgml | 16 + src/backend/optimizer/path/indxpath.c | 38 + src/backend/optimizer/plan/analyzejoins.c | 1094 - src/backend/optimizer/plan/planmain.c |5 + src/backend/utils/misc/guc_tables.c | 22 + src/include/optimizer/paths.h |3 + src/include/optimizer/planmain.h |7 + src/test/regress/expected/equivclass.out | 32 + src/test/regress/expected/join.out| 824 - src/test/regress/expected/sysviews.out|3 +- src/test/regress/expected/updatable_views.out | 17 +- src/test/regress/sql/equivclass.sql | 16 + src/test/regress/sql/join.sql | 360 ++ src/tools/pgindent/typedefs.list |2 + 14 files changed, 2375 insertions(+), 64 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 6bc1b215db..43c07b0d3b 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -5299,6 +5299,22 @@ ANY num_sync ( + enable_self_join_removal (boolean) + + enable_self_join_removal configuration parameter + + + + + Enables or disables the query planner's optimization which analyses +query tree and replaces self joins with semantically equivalent single +scans. Take into consideration only plain tables. +The default is on. 
+ + + + enable_seqscan (boolean) diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c index 6a93d767a5..508285d1ef 100644 --- a/src/backend/optimizer/path/indxpath.c +++ b/src/backend/optimizer/path/indxpath.c @@ -3494,6 +3494,21 @@ bool relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel, List *restrictlist, List *exprlist, List *oprlist) +{ + return relation_has_unique_index_ext(root, rel, restrictlist, + exprlist, oprlist, NULL); +} + +/* + * relation_has_unique_index_ext + * if extra_clauses isn't NULL, return baserestrictinfo clauses which were + * used to derive uniqueness. + */ +bool +relation_has_unique_index_ext(PlannerInfo *root, RelOptInfo *rel, +
Re: POC: GROUP BY optimization
Hi, Here is the patch rebased on the current master. Also, I fixed some minor slips and one static analyzer warning. This is just for adding to the next commitfest and enforcing work with this patch. One extra difference in newly added postgres_fdw tests is caused by this patch - see changes in the query plan in attachment. -- regards, Andrey Lepikhov Postgres Professional From 33953655c9ac3f9ec64b80c9f2a2ff38bd178745 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Wed, 13 Sep 2023 11:20:03 +0700 Subject: [PATCH] Explore alternative orderings of group-by pathkeys during optimization. When evaluating a query with a multi-column GROUP BY clause using sort, the cost may depend heavily on the order in which the keys are compared when building the groups. Grouping does not imply any ordering, so we can compare the keys in arbitrary order, and a Hash Agg leverages this. But for Group Agg, we simply compared keys in the order specified in the query. This commit explores alternative ordering of the keys, trying to find a cheaper one. In principle, we might generate grouping paths for all permutations of the keys and leave the rest to the optimizer. But that might get very expensive, so we try to pick only a couple interesting orderings based on both local and global information. When planning the grouping path, we explore statistics (number of distinct values, cost of the comparison function) for the keys and reorder them to minimize comparison costs. Intuitively, it may be better to perform more expensive comparisons (for complex data types, etc.) last because maybe the cheaper comparisons will be enough. Similarly, the higher the cardinality of a key, the lower the probability we'll need to compare more keys. The patch generates and costs various orderings, picking the cheapest ones. The ordering of group keys may interact with other parts of the query, some of which may not be known while planning the grouping. 
For example, there may be an explicit ORDER BY clause or some other ordering-dependent operation higher up in the query, and using the same ordering may allow using either incremental sort or even eliminating the sort entirely. The patch generates orderings and picks those, minimizing the comparison cost (for various path keys), and then adds orderings that might be useful for operations higher up in the plan (ORDER BY, etc.). Finally, it always keeps the ordering specified in the query, assuming the user might have additional insights. This introduces a new GUC enable_group_by_reordering so that the optimization may be disabled if needed. --- .../postgres_fdw/expected/postgres_fdw.out| 36 +- src/backend/optimizer/path/costsize.c | 363 ++- src/backend/optimizer/path/equivclass.c | 13 +- src/backend/optimizer/path/pathkeys.c | 602 ++ src/backend/optimizer/plan/planner.c | 465 -- src/backend/optimizer/util/pathnode.c | 4 +- src/backend/utils/adt/selfuncs.c | 38 +- src/backend/utils/misc/guc_tables.c | 10 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/nodes/pathnodes.h | 10 + src/include/optimizer/cost.h | 4 +- src/include/optimizer/paths.h | 7 + src/include/utils/selfuncs.h | 5 + src/test/regress/expected/aggregates.out | 244 ++- .../regress/expected/incremental_sort.out | 2 +- src/test/regress/expected/join.out| 51 +- src/test/regress/expected/merge.out | 15 +- .../regress/expected/partition_aggregate.out | 44 +- src/test/regress/expected/partition_join.out | 75 +-- src/test/regress/expected/sysviews.out| 3 +- src/test/regress/expected/union.out | 60 +- src/test/regress/sql/aggregates.sql | 99 +++ src/test/regress/sql/incremental_sort.sql | 2 +- 23 files changed, 1771 insertions(+), 382 deletions(-) diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 144c114d0f..63af7feabe 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out 
@@ -2319,18 +2319,21 @@ SELECT t1."C 1" FROM "S 1"."T 1" t1, LATERAL (SELECT DISTINCT t2.c1, t3.c1 FROM -- join with pseudoconstant quals EXPLAIN (VERBOSE, COSTS OFF) SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1 AND CURRENT_USER = SESSION_USER) ORDER BY t1.c3, t1.c1 OFFSET 100 LIMIT 10; -
Re: POC: GROUP BY optimization
On 20/7/2023 18:46, Tomas Vondra wrote: On 7/20/23 08:37, Andrey Lepikhov wrote: On 3/10/2022 21:56, Tom Lane wrote: Revert "Optimize order of GROUP BY keys". This reverts commit db0d67db2401eb6238ccc04c6407a4fd4f985832 and several follow-on fixes. ... Since we're hard up against the release deadline for v15, let's revert these changes for now. We can always try again later. It may be time to restart the project. As a first step, I rebased the patch on the current master. It wasn't trivial because of some recent optimizations (a29eab, 1349d27 and 8d83a5d). Now, let's repeat the review and rewrite the current patch according to the reasons given in the revert commit. 1) procost = 1.0 - I guess we could make this more realistic by doing some microbenchmarks and tuning the costs for the most expensive cases. OK, some thoughts on this part of the task. As far as I can see, we have not that many different operators: 26 with fixed width and 16 with variable width:

SELECT o.oid, oprcode, typname, typlen FROM pg_operator o
JOIN pg_type t ON (oprleft = t.oid)
WHERE (oprname = '<') AND oprleft = oprright AND typlen > 0
ORDER BY o.oid;

SELECT o.oid, oprcode, typname, typlen FROM pg_operator o
JOIN pg_type t ON (oprleft = t.oid)
WHERE (oprname = '<') AND oprleft = oprright AND typlen < 0
ORDER BY o.oid;

A benchmarking procedure for fixed-length types can look something like this:

CREATE OR REPLACE FUNCTION pass_sort(typ regtype)
RETURNS TABLE (nrows integer, exec_time float) AS $$
DECLARE
  data json;
  step integer;
BEGIN
  SET work_mem = '1GB';
  FOR step IN 0..3 LOOP
    SELECT pow(100, step)::integer INTO nrows;
    DROP TABLE IF EXISTS test CASCADE;
    EXECUTE format('CREATE TABLE test AS SELECT gs::%s AS x
                    FROM generate_series(1,%s) AS gs;', typ, nrows);
    EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, FORMAT JSON)
      SELECT * FROM test ORDER BY (x) DESC INTO data;
    SELECT data->0->'Execution Time' INTO exec_time;
    RETURN NEXT;
  END LOOP;
END;
$$ LANGUAGE plpgsql VOLATILE;

Execution of SELECT * FROM pass_sort('integer'); shows
quite linear growth of the execution time. So, using '2.0 * cpu_operator_cost' as the cost for the integer type (as a basis), we can calculate costs for the other operators. Variable-width types, I think, could require a more complex technique to check the dependency on length. Does this way look worthwhile? -- regards, Andrey Lepikhov Postgres Professional
Re: [PoC] Reducing planning time when tables have many partitions
On 25/8/2023 14:39, Yuya Watari wrote: Hello, On Wed, Aug 9, 2023 at 8:54 PM David Rowley wrote: I think the best way to move this forward is to explore not putting partitioned table partitions in EMs and instead see if we can translate to top-level parent before lookups. This might just be too complex to translate the Exprs all the time and it may add overhead unless we can quickly determine somehow that we don't need to attempt to translate the Expr when the given Expr is already from the top-level parent. If that can't be made to work, then maybe that shows the current patch has merit. Based on your suggestion, I have experimented with not putting child EquivalenceMembers in an EquivalenceClass. I have attached a new patch, v20, to this email. The following is a summary of v20. While working on self-join removal in the nearby thread [1], I ran into a problem that gives one more argument for working in this new direction, in addition to the previous ones. By indexing positions in the list of equivalence members, we make optimizations like join elimination more complicated: they may need to remove some clauses and equivalence class members. To change the lists of derived clauses or ec_members, we would have to go through all the index lists and fix them, which is a non-trivial operation. [1] https://www.postgresql.org/message-id/flat/64486b0b-0404-e39e-322d-0801154901f3%40postgrespro.ru -- regards, Andrey Lepikhov Postgres Professional
Re: Oversight in reparameterize_path_by_child leading to executor crash
On 23/8/2023 12:37, Richard Guo wrote: If we go with the "tablesample scans can't be reparameterized" approach in the back branches, I'm a little concerned about what happens if we find more cases in the future where we need to modify RTEs for reparameterization. So I spent some time seeking and have managed to find one: there might be lateral references in a scan path's restriction clauses, and currently reparameterize_path_by_child fails to adjust them. This may help you: in [1], we designed a feature where the partitionwise join technique can be applied to a JOIN of partitioned and non-partitioned tables. Unfortunately, it has dropped out of community discussion, but we still support it for sharding usage: it is helpful for implementing 'global' tables in a distributed configuration. There we got stuck on the same problem with lateral relids adjustment. So this patch can give you a more general view of the problem. [1] Asymmetric partition-wise JOIN https://www.postgresql.org/message-id/flat/CAOP8fzaVL_2SCJayLL9kj5pCA46PJOXXjuei6-3aFUV45j4LJQ%40mail.gmail.com -- regards, Andrey Lepikhov Postgres Professional
Re: disfavoring unparameterized nested loops
On 29/9/2022 21:32, Benjamin Coutu wrote: I'd like to revamp this important discussion. As is well described in this fairly recent paper https://www.vldb.org/pvldb/vol9/p204-leis.pdf (which also looks at Postgres), "estimation errors quickly grow as the number of joins increases, and that these errors are usually the reason for bad plans" - I think we can all get behind that statement. While nested loop joins work great when cardinality estimates are correct, they are notoriously bad when the optimizer underestimates, and they degrade very fast in such cases - the good vs. bad here is very asymmetric. On the other hand, hash joins degrade much more gracefully - they are considered very robust against underestimation. The above-mentioned paper illustrates that all major DBMSs (including Postgres) tend to underestimate, and usually that underestimation increases drastically with the number of joins (see Figures 3+4 of the paper). Now, a simple approach to guarding against bad plans that arise from underestimation could be to use what I would call a nested-loop-conviction-multiplier based on the current depth of the join tree, e.g. for a base table that multiplier would obviously be 1, but it would then grow (e.g.) quadratically. That conviction-multiplier would *NOT* be used to skew the cardinality estimates themselves, but rather be applied to the overall nested loop join cost at each particular stage of the plan when comparing it to other, more robust join strategies like hash or sort-merge joins. That way, when we can be sure to have a good estimate at the bottom of the join tree, we treat all things equal, but favor nested loops less and less as we move up the join tree for the sake of robustness.
Also, we can expand the multiplier whenever we fall back to using the default cardinality constant, as surely all bets are off at that point - we should definitely treat nested loop joins as out of favor in this instance, and that could easily be incorporated by simply increasing the conviction-multiplier. What are your thoughts on this simple idea - is it perhaps too simple? In my practice, a parameterized nested loop reduces execution time, sometimes drastically. If your query touches a lot of tables but extracts only a tiny part of the data, and you have good index coverage, PNL works great. Moreover, I have pondered extending parameterization to subqueries and groupings. What would you say about a different approach: a hybrid join? MS SQL Server has such a feature [1], and, according to their description, it requires low overhead. They start with a HashJoin and switch to a NestLoop if the inner input contains too few tuples. Wouldn't that solve the issue? [1] https://techcommunity.microsoft.com/t5/sql-server-blog/introducing-batch-mode-adaptive-joins/ba-p/385411 -- regards, Andrey Lepikhov Postgres Professional
Re: [PATCH] Add extra statistics to explain for Nested Loop
On 31/7/2022 10:49, Julien Rouhaud wrote: On Sat, Jul 30, 2022 at 08:54:33PM +0800, Julien Rouhaud wrote: Anyway, 1% is in my opinion still too much overhead for extensions that won't get any extra information. I have read the whole thread and still can't understand something: what valuable data can these extra statistics provide if no parameterized node exists in the plan? Also, regarding min/max time in EXPLAIN, I guess it would be necessary only in rare cases. Usually the execution time will correlate with the number of tuples scanned, won't it? So maybe skip the time boundaries in the instrumentation structure? In my experience, knowing the total number of tuples bubbled up from a parameterized node is enough to decide on further optimizations. Maybe simplify this feature down to a single total_rows field, used when nloops > 1 and parameters are present? And finally: if someone wants a lot of additional statistics, why not provide them through an extension? That only requires adding a hook at the point of node explanation and some effort to make instrumentation extensible. Honestly, though, I don't have code or concrete ideas for that so far. -- regards, Andrey Lepikhov Postgres Professional
Re: Postgres picks suboptimal index after building of an extended statistics
On 12/8/2021 06:26, Tomas Vondra wrote: On 8/11/21 2:48 AM, Peter Geoghegan wrote: On Wed, Jun 23, 2021 at 7:19 AM Andrey V. Lepikhov wrote: Ivan Frolkov reported a problem with choosing a non-optimal index during query optimization. This problem appeared after building extended statistics. Any thoughts on this, Tomas? Thanks for reminding me, I missed / forgot about this thread. I agree the current behavior is unfortunate, but I'm not convinced the proposed patch is fixing the right place - doesn't this mean the index costing won't match the row estimates displayed by EXPLAIN? I wonder if we should teach clauselist_selectivity about UNIQUE indexes, and improve the cardinality estimates directly, not just costing for index scans. Also, is it correct that the patch calculates num_sa_scans only when (numIndexTuples >= 0.0)? I can't stop thinking about this issue. It is bizarre when Postgres chooses a non-unique index while a unique index gives us a proven upper bound on the scan. I don't see a reason to teach the clauselist_selectivity() routine to estimate UNIQUE indexes. We would add some cycles, but it would work with btree indexes only. Maybe change compare_path_costs_fuzzily() and add a heuristic, for example: "If the selectivity of both paths gives us no more than 1 row, prefer the unique index, or the index with the lowest selectivity." -- regards, Andrey Lepikhov Postgres Professional
Re: POC: GUC option for skipping shared buffers in core dumps
Hi, The current approach is not portable enough: we want to use this feature on Windows/macOS and other systems as well. So, I've tried to develop another strategy - detaching shmem and DSM blocks before executing the abort() routine. As far as I can see, it helps and reduces the size of the core file. -- regards, Andrey Lepikhov Postgres Professional diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c index 5810f8825e..4d7bf2c0e4 100644 --- a/src/backend/bootstrap/bootstrap.c +++ b/src/backend/bootstrap/bootstrap.c @@ -325,7 +325,7 @@ BootstrapModeMain(int argc, char *argv[], bool check_only) { SetProcessingMode(NormalProcessing); CheckerModeMain(); - abort(); + pg_abort(); } /* diff --git a/src/backend/main/main.c b/src/backend/main/main.c index ed11e8be7f..34ac874ad0 100644 --- a/src/backend/main/main.c +++ b/src/backend/main/main.c @@ -197,7 +197,7 @@ main(int argc, char *argv[]) else PostmasterMain(argc, argv); /* the functions above should not return */ - abort(); + pg_abort(); } diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 54e9bfb8c4..fc32a6bb1b 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -1469,7 +1469,7 @@ PostmasterMain(int argc, char *argv[]) */ ExitPostmaster(status != STATUS_OK); - abort();/* not reached */ + pg_abort(); /* not reached */ } diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index e250b0567e..3123b388ab 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -385,7 +385,7 @@ WalSndShutdown(void) whereToSendOutput = DestNone; proc_exit(0); - abort();/* keep the compiler quiet */ + pg_abort(); /* keep the compiler quiet */ } /* diff --git a/src/backend/utils/error/assert.c b/src/backend/utils/error/assert.c index 719dd7b309..f422d42440 100644 --- a/src/backend/utils/error/assert.c +++ b/src/backend/utils/error/assert.c @@ -19,6 +19,32 @@ #include #endif +#include 
+#include + +int core_dump_no_shared_buffers = COREDUMP_INCLUDE_ALL; + +/* + * Remember, at the same time someone can work with shared memory, write them to + * disk and so on. + */ +void +pg_abort(void) +{ + if (core_dump_no_shared_buffers != COREDUMP_INCLUDE_ALL) + { + if (core_dump_no_shared_buffers == COREDUMP_EXCLUDE_ALL || + core_dump_no_shared_buffers == COREDUMP_EXCLUDE_DSM) + dsm_detach_all(); + + if (core_dump_no_shared_buffers == COREDUMP_EXCLUDE_ALL || + core_dump_no_shared_buffers == COREDUMP_EXCLUDE_SHMEM) + PGSharedMemoryDetach(); + } + + abort(); +} + /* * ExceptionalCondition - Handles the failure of an Assert() * @@ -63,5 +89,5 @@ ExceptionalCondition(const char *conditionName, sleep(100); #endif - abort(); + pg_abort(); } diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index 8e1f3e8521..f6c863ca68 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -601,7 +601,7 @@ errfinish(const char *filename, int lineno, const char *funcname) * children... 
*/ fflush(NULL); - abort(); + pg_abort(); } /* diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c index bdb26e2b77..95e205e8d1 100644 --- a/src/backend/utils/misc/guc_tables.c +++ b/src/backend/utils/misc/guc_tables.c @@ -427,6 +427,14 @@ static const struct config_enum_entry debug_logical_replication_streaming_option {NULL, 0, false} }; +static const struct config_enum_entry core_dump_no_shared_buffers_options[] = { + {"none", COREDUMP_INCLUDE_ALL, false}, + {"shmem", COREDUMP_EXCLUDE_SHMEM, false}, + {"dsm", COREDUMP_EXCLUDE_DSM, false}, + {"all", COREDUMP_EXCLUDE_ALL, false}, + {NULL, 0, false} +}; + StaticAssertDecl(lengthof(ssl_protocol_versions_info) == (PG_TLS1_3_VERSION + 2), "array length mismatch"); @@ -4971,6 +4979,17 @@ struct config_enum ConfigureNamesEnum[] = NULL, NULL, NULL }, + { + {"core_dump_no_shared_buffers", PGC_POSTMASTER, DEVELOPER_OPTIONS, + NULL, + NULL, + GUC_NOT_IN_SAMPLE + }, + &core_dump_no_shared_buffers, + COREDUMP_I
Re: RFC: Logging plan of the running query
On 25/9/2023 14:21, torikoshia wrote: On 2023-09-20 14:39, Lepikhov Andrei wrote: Hmm, as a test, I made sure to call ProcessLogQueryPlanInterrupt() on all CFI using v28-0002-Testing-attempt-logging-plan-on-ever-CFI-call.patch, and then ran the following query but did not cause any problems. ``` =# CREATE TABLE test(); =# CREATE OR REPLACE FUNCTION ddl() RETURNS void AS $$ BEGIN EXECUTE format('ALTER TABLE test ADD COLUMN x integer;'); PERFORM pg_sleep(5); END; $$ LANGUAGE plpgsql VOLATILE; =# SELECT ddl(); ``` Is this the case you're worrying about? I didn't find a problem either. I just feel uncomfortable if, at the moment of interruption, we hold a descriptor for a query other than the one that has been executing and holding resources. -- regards, Andrey Lepikhov Postgres Professional
Re: POC: GROUP BY optimization
On 20/7/2023 18:46, Tomas Vondra wrote: 2) estimating quicksort comparisons - This relies on ndistinct estimates, and I'm not sure how much more reliable we can make those. Probably not much :-( Not sure what to do about this, the only thing I can think of is to track "reliability" of the estimates and only do the reordering if we have high confidence in the estimates. That means we'll miss some optimization opportunities, but it should limit the risk. On this issue, I see two options: 1. Go through the grouping column list and find the most reliable column. If we have a unique column, or a column whose statistics show a number of distinct values significantly higher than the ndistinct of the other grouping columns, we can place this column first in the grouping. That should guarantee the reliability of the decision, shouldn't it? 2. If we have extended statistics on distinct values, and these statistics cover some set of leading columns in the grouping list, we can optimize those positions. That also looks reliable. Any thoughts? -- regards, Andrey Lepikhov Postgres Professional
Re: RFC: Logging plan of the running query
On 28/9/2023 09:04, torikoshia wrote: On 2023-09-25 18:49, Andrey Lepikhov wrote: On 25/9/2023 14:21, torikoshia wrote: On 2023-09-20 14:39, Lepikhov Andrei wrote: Hmm, as a test, I made sure to call ProcessLogQueryPlanInterrupt() on all CFI using v28-0002-Testing-attempt-logging-plan-on-ever-CFI-call.patch, and then ran the following query but did not cause any problems. ``` =# CREATE TABLE test(); =# CREATE OR REPLACE FUNCTION ddl() RETURNS void AS $$ BEGIN EXECUTE format('ALTER TABLE test ADD COLUMN x integer;'); PERFORM pg_sleep(5); END; $$ LANGUAGE plpgsql VOLATILE; =# SELECT ddl(); ``` Is this the case you're worrying about? I didn't find a problem either. I just feel uncomfortable if, at the moment of interruption, we hold a descriptor for a query other than the one that has been executing and holding resources. I think that "descriptor" here refers to ActiveQueryDesc, in which case it is updated at the beginning of ExecutorRun(), so I am wondering if the situation you're worried about would not occur. As you can see, in my example we have only DDL and no queries with plans. In this case, Postgres doesn't call ExecutorRun() simply because there is no plan - but locks will still be obtained. -- regards, Andrey Lepikhov Postgres Professional
Re: NOT IN subquery optimization
You should do a small rebase (conflict with 911e7020770) and run pgindent on the patch to fix problems with long lines and whitespace. I am reviewing your patch in small steps. Questions: 1. In the find_innerjoined_rels() routine you stop descending on the JOIN_FULL node type. I think this is wrong because if a var has a NOT NULL constraint, a full join can't change it to NULL. 2. The convert_NOT_IN_to_join() routine is OK, but its name is misleading. Maybe you could use something like make_NOT_IN_to_join_quals()? 3. pull_up_sublinks_qual_recurse(). Comment: "Return pullout predicate (x is NOT NULL)..." - maybe change it to "Return pullout predicate (x is NOT NULL or NOT EXISTS...)"? 4. is_node_nonnullable(): I think one more case of a non-nullable var may be a foreign key constraint. -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Re: Removing unneeded self joins
On 4/1/20 8:34 PM, David Steele wrote: This patch no longer applies cleanly on src/test/regress/sql/equivclass.sql: http://cfbot.cputube.org/patch_27_1712.log The CF entry has been updated to Waiting on Author. v.23 in attachment: 1. The patch applies cleanly. 2. Add checks: two potentially self joined relations may belong to different rules of order restriction in join_info_list. 3. Add test for item 2. The CF entry has been updated to Needs review. -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company >From e74d3c2549737305419b3c29301d29c1e191d6ae Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Fri, 3 Apr 2020 09:31:58 +0500 Subject: [PATCH] Remove Self Joins v. 23 --- src/backend/nodes/outfuncs.c | 19 +- src/backend/optimizer/path/indxpath.c | 28 +- src/backend/optimizer/path/joinpath.c |6 +- src/backend/optimizer/plan/analyzejoins.c | 1178 - src/backend/optimizer/plan/planmain.c |7 +- src/backend/optimizer/util/pathnode.c |3 +- src/backend/optimizer/util/relnode.c | 26 +- src/backend/utils/misc/guc.c | 11 + src/backend/utils/mmgr/aset.c |1 - src/include/nodes/nodes.h |1 + src/include/nodes/pathnodes.h | 22 +- src/include/optimizer/pathnode.h |4 + src/include/optimizer/paths.h |3 +- src/include/optimizer/planmain.h | 10 +- src/test/regress/expected/equivclass.out | 32 + src/test/regress/expected/join.out| 330 + src/test/regress/expected/updatable_views.out | 15 +- src/test/regress/sql/equivclass.sql | 16 + src/test/regress/sql/join.sql | 169 +++ 19 files changed, 1773 insertions(+), 108 deletions(-) diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c index eb168ffd6d..a1a9ae1ac1 100644 --- a/src/backend/nodes/outfuncs.c +++ b/src/backend/nodes/outfuncs.c @@ -2281,7 +2281,8 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node) WRITE_OID_FIELD(userid); WRITE_BOOL_FIELD(useridiscurrent); /* we don't try to print fdwroutine or fdw_private */ - /* can't print 
unique_for_rels/non_unique_for_rels; BMSes aren't Nodes */ + WRITE_NODE_FIELD(unique_for_rels); + /* can't print non_unique_for_rels; BMSes aren't Nodes */ WRITE_NODE_FIELD(baserestrictinfo); WRITE_UINT_FIELD(baserestrict_min_security); WRITE_NODE_FIELD(joininfo); @@ -2353,6 +2354,19 @@ _outStatisticExtInfo(StringInfo str, const StatisticExtInfo *node) WRITE_BITMAPSET_FIELD(keys); } +static void +_outUniqueRelInfo(StringInfo str, const UniqueRelInfo *node) +{ + WRITE_NODE_TYPE("UNIQUERELINFO"); + + WRITE_BITMAPSET_FIELD(outerrelids); + if (node->index) + { + WRITE_OID_FIELD(index->indexoid); + WRITE_NODE_FIELD(column_values); + } +} + static void _outEquivalenceClass(StringInfo str, const EquivalenceClass *node) { @@ -4135,6 +4149,9 @@ outNode(StringInfo str, const void *obj) case T_StatisticExtInfo: _outStatisticExtInfo(str, obj); break; + case T_UniqueRelInfo: +_outUniqueRelInfo(str, obj); +break; case T_ExtensibleNode: _outExtensibleNode(str, obj); break; diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c index 2a50272da6..0c6e7eee74 100644 --- a/src/backend/optimizer/path/indxpath.c +++ b/src/backend/optimizer/path/indxpath.c @@ -3564,7 +3564,8 @@ ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel, * relation_has_unique_index_for * Determine whether the relation provably has at most one row satisfying * a set of equality conditions, because the conditions constrain all - * columns of some unique index. + * columns of some unique index. If index_info is not null, it is set to + * point to a new UniqueRelInfo containing the index and conditions. * * The conditions can be represented in either or both of two ways: * 1. 
A list of RestrictInfo nodes, where the caller has already determined @@ -3585,12 +3586,16 @@ ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel, bool relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel, List *restrictlist, - List *exprlist, List *oprlist) + List *exprlist, List *oprlist, + UniqueRelInfo **unique_info) { ListCell *ic; Assert(list_length(exprlist) == list_length(oprlist)); + if (unique_info) + *unique_info = NULL; + /* Short-circuit if no indexes... */ if (rel->indexlist == NIL) return false; @@ -3641,6 +3646,7 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel, { IndexOptInfo *ind = (IndexOptInfo *) lfirst(ic); int c; + List *column_values = NIL; /* * If the index is not unique, or not immediately enforced, or if it's @@ -368
Re: POC: GROUP BY optimization
On 7/22/21 3:58 AM, Tomas Vondra wrote: 4) I'm not sure it's actually a good idea to pass tuplesPerPrevGroup to estimate_num_groups_incremental. In principle yes, if we use "group size" from the previous step, then the returned value is the number of new groups after adding the "new" pathkey. But even if we ignore the issues with amplification mentioned in (3), there's an issue with non-linear behavior in estimate_num_groups, because at some point it's calculating D(N,n,p) = n * (1 - ((N-p)/N)^(N/n)) where N - total rows, p - sample size, n - number of distinct values. And if we have (N1,n1) and (N2,n2) then the ratio of calculated estimated (which is pretty much what calculating group size does) D(N2,n2,p2) / D(N1,n1,p1) which will differ depending on p1 and p2. And if we're tweaking the tuplesPerPrevGroup all the time, that's really annoying, as it may make the groups smaller or larger, which is unpredictable and annoying, and I wonder if it might go against the idea of penalizing tuplesPerPrevGroup to some extent. We could simply use the input "tuples" value here, and then divide the current and previous estimate to calculate the number of new groups. tuplesPerPrevGroup is only an upper bound for the estimation. I think we should use the result of the previous estimation as a limit for the next one during incremental estimation. Maybe we can simply clamp the tuplesPerPrevGroup value to ensure it never grows? See the attached patch on top of the previous fixes. -- regards, Andrey Lepikhov Postgres Professional diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c index 70af9c91d5..4e26cd275d 100644 --- a/src/backend/optimizer/path/costsize.c +++ b/src/backend/optimizer/path/costsize.c @@ -2025,8 +2025,10 @@ compute_cpu_sort_cost(PlannerInfo *root, List *pathkeys, int nPresortedKeys, /* * Real-world distribution isn't uniform but now we don't have a way to * determine that, so, add multiplier to get closer to worst case.
+* Guard this value against irrational decisions. */ - tuplesPerPrevGroup = ceil(1.5 * tuplesPerPrevGroup / nGroups); + tuplesPerPrevGroup = Min(tuplesPerPrevGroup, +ceil(1.5 * tuplesPerPrevGroup / nGroups)); /* * We could skip all following columns for cost estimation, because we
Re: POC: GROUP BY optimization
On 22/1/2022 01:34, Tomas Vondra wrote: The other thing we could do is reduce the coefficient gradually - so it'd be 1.5 for the first pathkey, then 1.25 for the next one, and so on. But it seems somewhat arbitrary (I certainly don't have some sound theoretical justification ...). I think there is no reason to increase complexity without any theory behind it. A simple solution lets us notice problems much faster, if they arise. ... I've skipped the path_save removal in planner.c, because that seems incorrect - if there are multiple pathkeys, we must start with the original path, not the modified one we built in the last iteration. Or am I missing something You are right, I misunderstood the idea of the path_save variable. -- regards, Andrey Lepikhov Postgres Professional
Re: Multiple Query IDs for a rewritten parse tree
On 29/1/2022 12:51, Julien Rouhaud wrote: 4. We should reserve position of default in-core generator From the discussion above I was under the impression that the core generator should be distinguished by a predefined kind. I don't really like this approach. IIUC, this patch as-is is meant to break pg_stat_statements extensibility. If kind == 0 is reserved for in-core queryid then you can't use pg_stat_statement with a different queryid generator anymore. Yes, that is one more problem. Maybe, if we want to make it extensible, we could think about hooks in pg_stat_statements too? The patch also reserves kind == -1 for pg_stat_statements internal purpose, and I'm not really sure why that's needed. My idea is that tags with positive numbers are reserved for generator results, which are performance-critical. As I saw during the implementation, pg_stat_statements makes additional changes to the queryId (no matter which ones). Because our purpose is to avoid interference in this place, I invented negative values, where extensions can store their queryIds, whether based on a generator or not. Maybe it is redundant - the main idea here was to highlight the issue. I'm also unsure of how are extensions supposed to cooperate in general, as I feel that queryids should be implemented for some "intent" (like monitoring, planning optimization...). That being said I'm not sure if e.g. AQO heuristics are too specific for its need or if it could be shared with other extension that might be dealing with similar concerns. I think it depends on the specific purpose of an extension. -- regards, Andrey Lepikhov Postgres Professional
Re: Merging statistics from children instead of re-sampling everything
On 21/1/2022 01:25, Tomas Vondra wrote: But I don't have a very good idea what to do about statistics that we can't really merge. For some types of statistics it's rather tricky to reasonably merge the results - ndistinct is a simple example, although we could work around that by building and merging hyperloglog counters. I think, as a first step in this direction, we can reduce the number of pulled tuples. We don't really need to pull all tuples from a remote server. To construct a reservoir, we can pull only a tuple sample. The reservoir method needs only a few arguments to return a sample, just as when reading tuples locally. Also, to fetch these parts of the sample asynchronously, we can obtain the size of each partition in a preliminary step of the analysis. In my opinion, even this solution can drastically reduce the severity of the problem. -- regards, Andrey Lepikhov Postgres Professional
Re: explain analyze rows=%.0f
On 22/7/2022 16:47, Amit Kapila wrote: I feel the discussion has slightly deviated which makes it unclear whether this patch is required or not? After a quick review, I want to share my thoughts. First of all, we have been waiting for this feature for years. Clients often send us an EXPLAIN where we see something like "rows=0, loops=100". Without verbose mode, I can't even tell whether this node produces any rows or not. So, I think this feature is useful mostly for parameterized plans. Also, printing two or even three decimal digits isn't meaningful - sometimes we have a plan where the number of loops is about 1E6 or even 1E7, but the real number of rows is 100 or 1000. To overcome this issue, I see two options: 1. Show the exact number of tuples without division by loops (fair, but invasive and bad for automation tools). 2. Show rows in scientific format, like X.XXEXX. I vote for the second option. -- regards, Andrey Lepikhov Postgres Professional
Re: [PoC] Reducing planning time when tables have many partitions
On 2/11/2022 15:27, Yuya Watari wrote: Hello, I noticed that the previous patch does not apply to the current HEAD. I attached the rebased version to this email. I'm still reviewing your patch. For the most part it seems OK, but do you really need both the eq_sources and eq_derives lists now? As I see it, every access to these lists is guided by eclass_source_indexes and eclass_derive_indexes respectively. Maybe merge them? -- regards, Andrey Lepikhov Postgres Professional
Re: [PoC] Reducing planning time when tables have many partitions
On 2/11/2022 15:27, Yuya Watari wrote: I noticed that the previous patch does not apply to the current HEAD. I attached the rebased version to this email. Looking into the find_em_for_rel() changes, I see that you replaced if (bms_is_subset(em->em_relids, rel->relids)) with an assertion. According to get_ecmember_indexes(), the em_relids field of the returned equivalence members can contain relids not mentioned in the relation. I don't understand why it works now. For example, we can sort by t1.x but have an expression t1.x=t1.y*t2.z. Or have I missed something? If it is not a mistake, maybe add a comment explaining why the assertion cannot fail? -- regards, Andrey Lepikhov Postgres Professional
Re: Removing unneeded self joins
On 12/3/21 14:05, Hywel Carver wrote: I've built and tested this, and it seems to function correctly to me. One question I have is whether the added "IS NOT NULL" filters can be omitted when they're unnecessary. Some of the resulting plans included an "IS NOT NULL" filter on a non-nullable column. To be clear, this is still an improvement (to me) without that. New version of the feature, deeply refactored with the main goal of reducing the code size, and rebased on current master. Here I didn't work on the 'unnecessary IS NOT NULL filter'. -- regards, Andrey Lepikhov Postgres Professional From eee0c5f0de35d8cb83e6c4ca7749020acb18a4d1 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Wed, 28 Apr 2021 18:27:53 +0500 Subject: [PATCH] Remove self-joins. Remove inner joins of a relation to itself if it can be proven that such a join can be replaced with a scan. We can build the required proofs of uniqueness using the existing innerrel_is_unique machinery. We can remove a self-join when, for each outer row: 1. At most one inner row matches the join clauses. 2. If the join target list contains any inner vars, the inner row is (physically) the same row as the outer one. In this patch we use Rowley's [1] approach to identify a self-join: 1. Collect all mergejoinable join quals that look like a.x = b.x 2. Collect all other join quals. 3. Check innerrel_is_unique() for the qual list from (1). If it returns true, then the outer row matches only the same row from the inner relation. This proves that the join is a self-join and can be replaced by a scan. Some regression tests changed due to self-join removal logic.
[1] https://www.postgresql.org/message-id/raw/CAApHDvpggnFMC4yP-jUO7PKN%3DfXeErW5bOxisvJ0HvkHQEY%3DWw%40mail.gmail.com --- src/backend/optimizer/plan/analyzejoins.c | 942 +- src/backend/optimizer/plan/planmain.c | 5 + src/backend/optimizer/util/joininfo.c | 3 + src/backend/optimizer/util/relnode.c | 26 +- src/backend/utils/misc/guc.c | 10 + src/include/optimizer/pathnode.h | 4 + src/include/optimizer/planmain.h | 2 + src/test/regress/expected/equivclass.out | 32 + src/test/regress/expected/join.out| 331 src/test/regress/expected/sysviews.out| 3 +- src/test/regress/sql/equivclass.sql | 16 + src/test/regress/sql/join.sql | 166 12 files changed, 1511 insertions(+), 29 deletions(-) diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c index 37eb64bcef..dd5c4d2bd3 100644 --- a/src/backend/optimizer/plan/analyzejoins.c +++ b/src/backend/optimizer/plan/analyzejoins.c @@ -22,6 +22,7 @@ */ #include "postgres.h" +#include "catalog/pg_class.h" #include "nodes/nodeFuncs.h" #include "optimizer/clauses.h" #include "optimizer/joininfo.h" @@ -32,10 +33,12 @@ #include "optimizer/tlist.h" #include "utils/lsyscache.h" +bool enable_self_join_removal; + /* local functions */ static bool join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo); static void remove_rel_from_query(PlannerInfo *root, int relid, - Relids joinrelids); + Relids joinrelids, int subst_relid); static List *remove_rel_from_joinlist(List *joinlist, int relid, int *nremoved); static bool rel_supports_distinctness(PlannerInfo *root, RelOptInfo *rel); static bool rel_is_distinct_for(PlannerInfo *root, RelOptInfo *rel, @@ -47,6 +50,9 @@ static bool is_innerrel_unique_for(PlannerInfo *root, RelOptInfo *innerrel, JoinType jointype, List *restrictlist); +static void change_rinfo(RestrictInfo* rinfo, Index from, Index to); +static Bitmapset* change_relid(Relids relids, Index oldId, Index newId); +static void change_varno(Expr *expr, Index oldRelid, Index newRelid); /* @@ 
-86,7 +92,7 @@ restart: remove_rel_from_query(root, innerrelid, bms_union(sjinfo->min_lefthand, - sjinfo->min_righthand)); + sjinfo->min_righthand), 0); /* We verify that exactly one reference gets removed from joinlist */ nremoved = 0; @@ -300,7 +306,10 @@ join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo) /* * Remove the target relid from the planner's data structures, having - * determined that there is no need to include it in the
Re: Asynchronous Append on postgres_fdw nodes.
On 6/5/21 22:12, Stephen Frost wrote: * Etsuro Fujita (etsuro.fuj...@gmail.com) wrote: I think the user should be careful about this. How about adding a note about it to the “Asynchronous Execution Options” section in postgres-fdw.sgml, like the attached? +1 ... then again, it'd really be better if we could figure out a way to just do the right thing here. I haven't looked at this in depth but I would think that the overhead of async would be well worth it just about any time there's more than one foreign server involved. Is it not reasonable to have a heuristic where we disable async in the cases where there's only one foreign server, but have it enabled all the other time? While continuing to allow users to manage it explicitly if they want. Benchmarking SELECT from foreign partitions hosted on the same server, I see these results: With async append: 1 partition - 178 ms; 4 - 263; 8 - 450; 16 - 860; 32 - 1740. Without: 1 - 178 ms; 4 - 583; 8 - 1140; 16 - 2302; 32 - 4620. So, these results show that we have a reason to use async append even in the case where there's only one foreign server. -- regards, Andrey Lepikhov Postgres Professional
Re: Asynchronous Append on postgres_fdw nodes.
On 6/5/21 11:45, Etsuro Fujita wrote: On Tue, Apr 27, 2021 at 9:31 PM Etsuro Fujita wrote: The patch fixes the issue, but I don’t think it’s the right way to go, because it requires an extra ExecProcNode() call, which wouldn’t be efficient. Also, the patch wouldn’t address another issue I noticed in EXPLAIN ANALYZE for async-capable nodes that the command wouldn’t measure the time spent in such nodes accurately. For the case of async-capable node using postgres_fdw, it only measures the time spent in ExecProcNode() in ExecAsyncRequest()/ExecAsyncNotify(), missing the time spent in other things such as creating a cursor in ExecAsyncRequest(). :-(. To address both issues, I’d like to propose the attached, in which I added instrumentation support to ExecAsyncRequest()/ExecAsyncConfigureWait()/ExecAsyncNotify(). I think this would not only address the reported issue more efficiently, but allow to collect timing for async-capable nodes more accurately. OK, I agree with the approach, but the following test case fails: EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) SELECT * FROM ( (SELECT * FROM f1) UNION ALL (SELECT * FROM f2) ) q1 LIMIT 100; ERROR: InstrUpdateTupleCount called on node not yet executed The initialization script is in the attachment. -- regards, Andrey Lepikhov Postgres Professional t1.sql Description: application/sql
Re: Asynchronous Append on postgres_fdw nodes.
On 6/5/21 14:11, Etsuro Fujita wrote: On Tue, Apr 27, 2021 at 3:57 PM Andrey V. Lepikhov wrote: One more question. Append chooses async plans at the stage of Append plan creation. Later, the planner performs some optimizations, such as eliminating trivial Subquery nodes. So, AsyncAppend is impossible in some situations, for example: (SELECT * FROM f1 WHERE a < 10) UNION ALL (SELECT * FROM f2 WHERE a < 10); But it works for the query: SELECT * FROM (SELECT * FROM f1 UNION ALL SELECT * FROM f2) AS q1 WHERE a < 10; As far as I understand, this is not a hard limit. I think so, but IMO I think this would be an improvement rather than a bug fix. We can choose async subplans at the beginning of the execution stage. For a demo, I prepared a patch (see attachment). It solves the problem and passes the regression tests. Thanks for the patch! IIUC, another approach to this would be the patch you proposed before [1]. Right? Yes. I think the new solution will be better. -- regards, Andrey Lepikhov Postgres Professional
Re: Asynchronous Append on postgres_fdw nodes.
On 10/5/21 08:03, Etsuro Fujita wrote: On Fri, May 7, 2021 at 7:32 PM Andrey Lepikhov I think a simple fix for this would be just remove the check whether the instr->running flag is set or not in InstrUpdateTupleCount(). Attached is an updated patch, in which I also updated a comment in execnodes.h and docs in fdwhandler.sgml to match the code in nodeAppend.c, and fixed typos in comments in nodeAppend.c. Your patch fixes the problem. But I found two more problems: 1. EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) SELECT * FROM ( (SELECT * FROM f1) UNION ALL (SELECT * FROM f2) UNION ALL (SELECT * FROM l3) ) q1 LIMIT 6709;

QUERY PLAN
--
Limit (actual rows=6709 loops=1)
  ->  Append (actual rows=6709 loops=1)
        ->  Async Foreign Scan on f1 (actual rows=1 loops=1)
        ->  Async Foreign Scan on f2 (actual rows=1 loops=1)
        ->  Seq Scan on l3 (actual rows=6708 loops=1)

Here we scan 6710 tuples at the low level but append only 6709. Where did we lose one tuple? 2. SELECT * FROM ( (SELECT * FROM f1) UNION ALL (SELECT * FROM f2) UNION ALL (SELECT * FROM f3 WHERE a > 0) ) q1 LIMIT 3000;

QUERY PLAN
--
Limit (actual rows=3000 loops=1)
  ->  Append (actual rows=3000 loops=1)
        ->  Async Foreign Scan on f1 (actual rows=0 loops=1)
        ->  Async Foreign Scan on f2 (actual rows=0 loops=1)
        ->  Foreign Scan on f3 (actual rows=3000 loops=1)

Here we give preference to the synchronous scan. Why? -- regards, Andrey Lepikhov Postgres Professional
Defer selection of asynchronous subplans until the executor initialization stage
On 7/5/21 21:05, Etsuro Fujita wrote: I think it would be better to start a new thread for this, and add the patch to the next CF so that it doesn’t get lost. The current implementation of async append chooses asynchronous subplans at the phase of append plan creation. This is a safe approach, but we lose some optimizations, such as flattening of trivial subqueries, and can't execute some simple queries asynchronously. For example: EXPLAIN (ANALYZE, TIMING OFF, SUMMARY OFF, COSTS OFF) (SELECT * FROM f1 WHERE a < 10) UNION ALL (SELECT * FROM f2 WHERE a < 10); But, as far as I understand, we can choose these subplans later, at the append initialization phase, when all optimizations have already been applied. The attachment contains an implementation of the proposed approach. The initial script for the example is in the parent thread [1]. [1] https://www.postgresql.org/message-id/a38bb206-8340-9528-5ef6-37de2d5cb1a3%40postgrespro.ru -- regards, Andrey Lepikhov Postgres Professional From 395b1d62389cf40520a4afd87c11301aa2b17df2 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Tue, 11 May 2021 08:43:03 +0500 Subject: [PATCH] Defer selection of asynchronous subplans to the executor initial phase.
--- contrib/postgres_fdw/postgres_fdw.c | 10 +- src/backend/executor/execAmi.c | 7 +++ src/backend/executor/nodeAppend.c | 19 +++ src/backend/nodes/copyfuncs.c | 1 - src/backend/nodes/outfuncs.c| 1 - src/backend/nodes/readfuncs.c | 1 - src/backend/optimizer/plan/createplan.c | 17 + src/include/nodes/plannodes.h | 1 - src/include/optimizer/planmain.h| 1 + 9 files changed, 33 insertions(+), 25 deletions(-) diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index 4ff58d9c27..3e151a6790 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -1245,6 +1245,7 @@ postgresGetForeignPlan(PlannerInfo *root, boolhas_final_sort = false; boolhas_limit = false; ListCell *lc; + ForeignScan *fsplan; /* * Get FDW private data created by postgresGetForeignUpperPaths(), if any. @@ -1429,7 +1430,7 @@ postgresGetForeignPlan(PlannerInfo *root, * field of the finished plan node; we can't keep them in private state * because then they wouldn't be subject to later planner processing. 
*/ - return make_foreignscan(tlist, + fsplan = make_foreignscan(tlist, local_exprs, scan_relid, params_list, @@ -1437,6 +1438,13 @@ postgresGetForeignPlan(PlannerInfo *root, fdw_scan_tlist, fdw_recheck_quals, outer_plan); + + /* If appropriate, consider participation in async operations */ + fsplan->scan.plan.async_capable = (enable_async_append && + best_path->path.pathkeys == NIL && + !fsplan->scan.plan.parallel_safe && + is_async_capable_path((Path *)best_path)); + return fsplan; } /* diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c index b3726a54f3..4e70f4eb54 100644 --- a/src/backend/executor/execAmi.c +++ b/src/backend/executor/execAmi.c @@ -524,6 +524,9 @@ ExecSupportsBackwardScan(Plan *node) if (node->parallel_aware) return false; + if (node->async_capable) + return false; + switch (nodeTag(node)) { case T_Result: @@ -536,10 +539,6 @@ ExecSupportsBackwardScan(Plan *node) { ListCell *l; - /* With async, tuples may be interleaved, so can't back up. */ - if (((Append *) node)->nasyncplans > 0) - return false; - foreach(l, ((Append *) node)->appendplans) { if (!ExecSupportsBackwardScan((Plan *) lfirst(l))) diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c index 3c1f12adaf..363cf9f4a5 100644 --- a/src/backend/executor/nodeAppend.c +++ b/src/backend/executor/nodeAppend.c @@ -117,6 +117,8 @@ ExecInitAppend(Append *node, EState *estate, int eflags) int firstvalid; int
Re: Defer selection of asynchronous subplans until the executor initialization stage
On 11/5/21 08:55, Zhihong Yu wrote:
> + /* Check to see if subplan can be executed asynchronously */
> + if (subplan->async_capable)
> + {
> + subplan->async_capable = false;
>
> It seems the if statement is not needed: you can directly assign false to subplan->async_capable.

Thank you, I agree with you. A close look at the postgres_fdw regression tests shows at least one open problem with this approach: we need to handle situations where all but one partition are pruned and no Append node exists at all.

--
regards,
Andrey Lepikhov
Postgres Professional
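The problem mentioned above can be sketched as follows. The schema and server name are hypothetical, and the plan shape is illustrative of stock PostgreSQL behavior, not a definitive reproduction:

```sql
-- Hypothetical setup: a partitioned table with two foreign partitions
-- on a server named 'loopback'.
CREATE TABLE parted (a int) PARTITION BY RANGE (a);
CREATE FOREIGN TABLE parted_1 PARTITION OF parted
    FOR VALUES FROM (0) TO (100) SERVER loopback;
CREATE FOREIGN TABLE parted_2 PARTITION OF parted
    FOR VALUES FROM (100) TO (200) SERVER loopback;

-- Plan-time pruning leaves a single surviving subplan, so the planner
-- elides the Append node entirely (illustrative plan shape):
EXPLAIN (COSTS OFF) SELECT * FROM parted WHERE a = 5;
--  Foreign Scan on parted_1
-- With no Append node above the scan, there is no driver for asynchronous
-- execution, which is the case the deferred-selection approach must handle.
```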
Re: Asynchronous Append on postgres_fdw nodes.
On 11/5/21 12:24, Etsuro Fujita wrote:
> On Tue, May 11, 2021 at 11:58 AM Andrey Lepikhov wrote:
> The extra tuple, which is from f1 or f2, would have been kept in the Append node's as_asyncresults, not returned from the Append node to the Limit node. The async Foreign Scan nodes would fetch tuples before the Append node asks for them, so the fetched tuples may or may not be used.

Ok.

>> -> Append (actual rows=3000 loops=1)
>>   -> Async Foreign Scan on f1 (actual rows=0 loops=1)
>>   -> Async Foreign Scan on f2 (actual rows=0 loops=1)
>>   -> Foreign Scan on f3 (actual rows=3000 loops=1)
>> Here we give preference to the synchronous scan. Why?
> This would be expected behavior, and the reason is to avoid performance degradation; you might think it would be better to execute the async Foreign Scan nodes more aggressively, but it would require waiting/polling for file descriptor events many times, which is expensive and might cause performance degradation. I think there is room for improvement, though.

Yes, I agree with you. Maybe you could add a note to the documentation on async_capable, for example: "... Synchronous and asynchronous scanning strategies can be mixed by the optimizer in one scan plan of a partitioned table or a UNION ALL command. For performance reasons, synchronous scans execute before the first asynchronous scan. ..."

--
regards,
Andrey Lepikhov
Postgres Professional
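For reference, the async_capable behavior under discussion is controlled per server or per foreign table in postgres_fdw. A minimal sketch of how a user would enable it (the server and table names are hypothetical):

```sql
-- async_capable is a postgres_fdw server/table option (PostgreSQL 14+);
-- 'loopback' and 'f1' are hypothetical names.
ALTER SERVER loopback OPTIONS (ADD async_capable 'true');

-- It can also be overridden for an individual foreign table:
ALTER FOREIGN TABLE f1 OPTIONS (ADD async_capable 'true');
```

The global enable_async_append GUC must also be on (its default) for the planner to consider asynchronous subplans at all.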
Re: Asymmetric partition-wise JOIN
Next version of the patch. For searching any problems I forced this patch during 'make check' tests. Some bugs were found and fixed. -- regards, Andrey Lepikhov Postgres Professional From 101614b504b0b17e201d2375c8af61cfc671e51d Mon Sep 17 00:00:00 2001 From: Andrey Lepikhov Date: Fri, 2 Apr 2021 11:02:20 +0500 Subject: [PATCH] Asymmetric partitionwise join. Teach optimizer to consider partitionwise join of non-partitioned table with each partition of partitioned table. Disallow asymmetric machinery for joining of two partitioned (or appended) relations because it could cause CPU and memory huge consumption during reparameterization of NestLoop path. --- src/backend/optimizer/path/joinpath.c| 9 + src/backend/optimizer/path/joinrels.c| 184 +++ src/backend/optimizer/plan/setrefs.c | 13 +- src/backend/optimizer/util/appendinfo.c | 37 ++- src/backend/optimizer/util/pathnode.c| 9 +- src/backend/optimizer/util/relnode.c | 19 +- src/include/optimizer/paths.h| 7 +- src/test/regress/expected/partition_join.out | 306 +++ src/test/regress/sql/partition_join.sql | 126 9 files changed, 682 insertions(+), 28 deletions(-) diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c index b67b517770..7a1a7b2e86 100644 --- a/src/backend/optimizer/path/joinpath.c +++ b/src/backend/optimizer/path/joinpath.c @@ -335,6 +335,15 @@ add_paths_to_joinrel(PlannerInfo *root, if (set_join_pathlist_hook) set_join_pathlist_hook(root, joinrel, outerrel, innerrel, jointype, &extra); + + /* +* 7. If outer relation is delivered from partition-tables, consider +* distributing inner relation to every partition-leaf prior to +* append these leafs. 
+*/ + try_asymmetric_partitionwise_join(root, joinrel, + outerrel, innerrel, + jointype, &extra); } /* diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c index 0dbe2ac726..e6bd2ea5fe 100644 --- a/src/backend/optimizer/path/joinrels.c +++ b/src/backend/optimizer/path/joinrels.c @@ -16,6 +16,7 @@ #include "miscadmin.h" #include "optimizer/appendinfo.h" +#include "optimizer/cost.h" #include "optimizer/joininfo.h" #include "optimizer/pathnode.h" #include "optimizer/paths.h" @@ -1551,6 +1552,189 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2, } } +/* + * Build RelOptInfo on JOIN of each partition of the outer relation and the inner + * relation. Return List of such RelOptInfo's. Return NIL, if at least one of + * these JOINs are impossible to build. + */ +static List * +extract_asymmetric_partitionwise_subjoin(PlannerInfo *root, + RelOptInfo *joinrel, + AppendPath *append_path, + RelOptInfo *inner_rel, + JoinType jointype, + JoinPathExtraData *extra) +{ + List*result = NIL; + ListCell*lc; + + foreach (lc, append_path->subpaths) + { + Path*child_path = lfirst(lc); + RelOptInfo *child_rel = child_path->parent; + Relids child_join_relids; + Relids parent_relids; + RelOptInfo *child_join_rel; + SpecialJoinInfo *child_sjinfo; + List*child_restrictlist; + + child_join_relids = bms_union(child_rel->relids, inner_rel->relids); + parent_relids = bms_union(append_path->path.parent->relids, + inner_rel->relids); + + child_sjinfo = build_child_join_sjinfo(root, extra->sjinfo, + child_rel->relids, + inner_rel->relids); + child_restrictlist = (List *) + adjust_appendrel_attrs_multilevel(root, (Node *)extra->restrictlist, + child_join_r
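The plan shape that asymmetric partitionwise join aims for can be illustrated with a hypothetical schema: a plain (non-partitioned) table is joined to each partition of a partitioned table separately, below the Append. The schema and plan are illustrative sketches, not output of the patch:

```sql
-- Hypothetical schema: a hash-partitioned table and a plain table.
CREATE TABLE parts (id int, payload text) PARTITION BY HASH (id);
CREATE TABLE parts_0 PARTITION OF parts FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE parts_1 PARTITION OF parts FOR VALUES WITH (MODULUS 2, REMAINDER 1);
CREATE TABLE dict (id int, name text);

-- Without the patch, the join sits above the Append:
--   Hash Join
--     ->  Append (parts_0, parts_1)
--     ->  Hash (dict)
-- With asymmetric partitionwise join, the inner relation is distributed
-- to every partition, producing an illustrative plan like:
--   Append
--     ->  Hash Join (parts_0, dict)
--     ->  Hash Join (parts_1, dict)
EXPLAIN (COSTS OFF) SELECT * FROM parts p JOIN dict d ON p.id = d.id;
```

Pushing the join below the Append lets each child join use partition-local statistics and, potentially, better per-partition join strategies.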
Re: [PoC] Reducing planning time when tables have many partitions
On 2/6/23 06:47, Yuya Watari wrote:
> Of course, I'm not sure if my approach in v16-0003 is ideal, but it may help solve your concern above. Since simple_rel_array[0] is no longer necessary with my patch, I removed the setup_append_rel_entry() function in v16-0004. However, to make the patch work, I needed to change some assertions in v16-0005. For more details, please see the commit message of v16-0005. After this work, the attached patches passed all regression tests in my environment. Instead of my approach, imitating the following change to get_eclass_indexes_for_relids() is also a possible solution. Ignoring NULL RelOptInfos enables us to avoid the segfault, but we have to adjust EquivalenceMemberIterator to match the result, and I'm not sure if this idea is correct.

As I see it, you moved the indexes from RelOptInfo to PlannerInfo. Maybe it would be better to move them into RangeTblEntry instead?

--
Regards
Andrey Lepikhov
Postgres Professional
Re: [POC] Allow flattening of subquery with a link to upper query
On 9/13/22 16:40, Andrey Lepikhov wrote: On 5/9/2022 12:22, Richard Guo wrote: On Fri, Sep 2, 2022 at 7:09 PM Andrey Lepikhov mailto:a.lepik...@postgrespro.ru>> wrote: To resolve both issues, lower outer join passes through pull_sublinks_* into flattening routine (see attachment). I've added these cases into subselect.sql In attachment - new version of the patch, rebased onto current master. -- Regards Andrey Lepikhov Postgres Professional From 6462578f8789cb831f45ebfad65308d6afabb833 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Wed, 28 Sep 2022 13:42:12 +0300 Subject: [PATCH] Transform correlated subquery of type N-J [1] into ordinary join query. With many restrictions, check each subquery and pull up expressions, references upper query block. Works for operators '=' and 'IN'. Machinery of converting ANY subquery to JOIN stays the same with minor changes in walker function. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Pass a lower outer join through the pullup_sublink recursive procedure to find out some situations when qual references outer side of an outer join. [1] Kim, Won. “On optimizing an SQL-like nested query.” ACM Trans. Database Syst. 7 (1982): 443-469. 
--- src/backend/optimizer/plan/subselect.c| 320 +++- src/backend/optimizer/prep/prepjointree.c | 71 ++-- src/backend/optimizer/util/tlist.c| 2 +- src/backend/optimizer/util/var.c | 8 + src/backend/utils/misc/guc_tables.c | 10 + src/include/optimizer/optimizer.h | 1 + src/include/optimizer/subselect.h | 3 +- src/include/optimizer/tlist.h | 1 + src/test/regress/expected/prepare.out | 18 + src/test/regress/expected/subselect.out | 438 ++ src/test/regress/sql/prepare.sql | 7 + src/test/regress/sql/subselect.sql| 162 12 files changed, 1004 insertions(+), 37 deletions(-) diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c index 92e3338584..29da43bb4d 100644 --- a/src/backend/optimizer/plan/subselect.c +++ b/src/backend/optimizer/plan/subselect.c @@ -32,6 +32,7 @@ #include "optimizer/planner.h" #include "optimizer/prep.h" #include "optimizer/subselect.h" +#include "optimizer/tlist.h" #include "parser/parse_relation.h" #include "rewrite/rewriteManip.h" #include "utils/builtins.h" @@ -65,6 +66,8 @@ typedef struct inline_cte_walker_context } inline_cte_walker_context; +bool optimize_correlated_subqueries = true; + static Node *build_subplan(PlannerInfo *root, Plan *plan, PlannerInfo *subroot, List *plan_params, SubLinkType subLinkType, int subLinkId, @@ -1229,6 +1232,288 @@ inline_cte_walker(Node *node, inline_cte_walker_context *context) return expression_tree_walker(node, inline_cte_walker, context); } +static bool +contain_placeholders(Node *node, inline_cte_walker_context *context) +{ + if (node == NULL) + return false; + + if (IsA(node, Query)) + return query_tree_walker((Query *) node, contain_placeholders, context, 0); + else if (IsA(node, PlaceHolderVar)) + { + return true; + } + + return expression_tree_walker(node, contain_placeholders, context); +} + +/* + * To be pullable all clauses of flattening correlated subquery should be + * anded and mergejoinable (XXX: really necessary?) 
+ */ +static bool +quals_is_pullable(Node *quals) +{ + if (!contain_vars_of_level(quals, 1)) + return true; + + if (quals && IsA(quals, OpExpr)) + { + OpExpr *expr = (OpExpr *) quals; + Node *leftarg; + + /* Contains only one expression */ + leftarg = linitial(expr->args); + if (!op_mergejoinable(expr->opno, exprType(leftarg))) /* Is it really necessary ? */ + return false; + + if (contain_placeholders(quals, NULL)) + return false; + + return true; + } + else if (is_andclause(quals)) + { + ListCell *l; + + foreach(l, ((BoolExpr *) quals)->args) + { + Node *andarg = (Node *) lfirst(l); + + if (!IsA(andarg, OpExpr)) +return false; + if (!quals_is_pullable(andarg)) +return false; + } + + return true; + } + + return false; +} + +typedef struct +{ + Query *subquery; + int newvarno; + List *pulling_quals; + bool varlevel_up; +} correlated_t; + +static Node * +pull_subquery_clauses_mutator(Node *node, correlated_t *ctx) +{ + if (node == NULL) + return NULL; + + if (IsA(node, OpExpr) && !ctx->varlevel_up) + { + if (!contain_vars_of_level(node, 1)) + return node; + + /* + * The expression contains links to upper relation. It will be pulled to + * uplevel. All links into vars of upper levels must be changed. + */ + + ctx->varlevel_up = true; + ctx->pulling_quals = + lappend(ctx->pulling_quals, + expression_tree_mutator(node, + pull_subquery_clauses_mutator, + (
Re: [POC] Allow flattening of subquery with a link to upper query
On 5/10/2022 02:45, Zhihong Yu wrote: Hi, For contain_placeholders(): + if (IsA(node, Query)) + return query_tree_walker((Query *) node, contain_placeholders, context, 0); + else if (IsA(node, PlaceHolderVar)) Fixed The `else` is not needed. For correlated_t struct, it would be better if the fields have comments. Ok, I've added some comments. + * (for grouping, as an example). So, revert its status to + * a full valued entry. full valued -> fully valued Fixed -- regards, Andrey Lepikhov Postgres Professional From da570914e53bc34e6bf3291649725a041a8b929c Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Wed, 28 Sep 2022 13:42:12 +0300 Subject: [PATCH] Transform correlated subquery of type N-J [1] into ordinary join query. With many restrictions, check each subquery and pull up expressions, references upper query block. Works for operators '=' and 'IN'. Machinery of converting ANY subquery to JOIN stays the same with minor changes in walker function. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Pass a lower outer join through the pullup_sublink recursive procedure to find out some situations when qual references outer side of an outer join. [1] Kim, Won. “On optimizing an SQL-like nested query.” ACM Trans. Database Syst. 7 (1982): 443-469. 
--- src/backend/optimizer/plan/subselect.c| 328 +++- src/backend/optimizer/prep/prepjointree.c | 71 ++-- src/backend/optimizer/util/tlist.c| 2 +- src/backend/optimizer/util/var.c | 8 + src/backend/utils/misc/guc_tables.c | 10 + src/include/optimizer/optimizer.h | 1 + src/include/optimizer/subselect.h | 3 +- src/include/optimizer/tlist.h | 1 + src/test/regress/expected/prepare.out | 18 + src/test/regress/expected/subselect.out | 438 ++ src/test/regress/sql/prepare.sql | 7 + src/test/regress/sql/subselect.sql| 162 src/tools/pgindent/typedefs.list | 1 + 13 files changed, 1013 insertions(+), 37 deletions(-) diff --git a/src/backend/optimizer/plan/subselect.c b/src/backend/optimizer/plan/subselect.c index 92e3338584..be19d85524 100644 --- a/src/backend/optimizer/plan/subselect.c +++ b/src/backend/optimizer/plan/subselect.c @@ -32,6 +32,7 @@ #include "optimizer/planner.h" #include "optimizer/prep.h" #include "optimizer/subselect.h" +#include "optimizer/tlist.h" #include "parser/parse_relation.h" #include "rewrite/rewriteManip.h" #include "utils/builtins.h" @@ -65,6 +66,8 @@ typedef struct inline_cte_walker_context } inline_cte_walker_context; +bool optimize_correlated_subqueries = true; + static Node *build_subplan(PlannerInfo *root, Plan *plan, PlannerInfo *subroot, List *plan_params, SubLinkType subLinkType, int subLinkId, @@ -1229,6 +1232,296 @@ inline_cte_walker(Node *node, inline_cte_walker_context *context) return expression_tree_walker(node, inline_cte_walker, context); } +static bool +contain_placeholders(Node *node, inline_cte_walker_context *context) +{ + if (node == NULL) + return false; + + if (IsA(node, Query)) + return query_tree_walker((Query *) node, contain_placeholders, context, 0); + if (IsA(node, PlaceHolderVar)) + return true; + + return expression_tree_walker(node, contain_placeholders, context); +} + +/* + * To be pullable all clauses of flattening correlated subquery should be + * anded and mergejoinable (XXX: really necessary?) 
+ */ +static bool +quals_is_pullable(Node *quals) +{ + if (!contain_vars_of_level(quals, 1)) + return true; + + if (quals && IsA(quals, OpExpr)) + { + OpExpr *expr = (OpExpr *) quals; + Node *leftarg; + + /* Contains only one expression */ + leftarg = linitial(expr->args); + if (!op_mergejoinable(expr->opno, exprType(leftarg))) /* Is it really necessary ? */ + return false; + + if (contain_placeholders(quals, NULL)) + return false; + + return true; + } + else if (is_andclause(quals)) + { + ListCell *l; + + foreach(l, ((BoolExpr *) quals)->args) + { + Node *andarg = (Node *) lfirst(l); + + if (!IsA(andarg, OpExpr)) + return false; + if (!quals_is_pullable(andarg)) + return false; + } + + return true; + } + + return false; +} + +/*
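The class of transformation this patch targets (Kim's type N/J nested queries) can be sketched as follows. The tables are hypothetical, and the rewrite shown is an illustrative sketch that is only equivalent under the patch's restrictions (the pulled-up qual is a mergejoinable equality, and at most one inner row matches each outer row):

```sql
-- A correlated subquery: the inner block references the upper query
-- only through an equality qual on e.dept_id.
SELECT * FROM emp e
WHERE e.salary IN (SELECT d.max_salary FROM dept d
                   WHERE d.id = e.dept_id);

-- The transformation pulls the correlated clause up and converts the
-- sublink into an ordinary join, roughly:
SELECT e.* FROM emp e, dept d
WHERE d.id = e.dept_id
  AND e.salary = d.max_salary;
```

The benefit is that the planner can then choose any join strategy and join order for the flattened query, instead of executing the sublink as a (possibly repeated) subplan per outer row.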
Re: Removing unneeded self joins
New version, rebased onto current master. Nothing special, just rebase. -- regards, Andrey Lepikhov Postgres Professional From 03aab7a2431032166c9ea5f52fbcccaf7168abec Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Wed, 5 Oct 2022 16:58:34 +0500 Subject: [PATCH] Remove self-joins. A Self Join Removal (SJR) feature removes inner join of plain table to itself in a query plan if can be proved that the join can be replaced with a scan. The proof based on innerrel_is_unique machinery. We can remove a self-join when for each outer row: 1. At most one inner row matches the join clauses. 2. If the join target list contains any inner vars, an inner row must be (physically) the same row as the outer one. In this patch we use Rowley's [1] approach to identify a self-join: 1. Collect all mergejoinable join quals which look like a.x = b.x 2. Check innerrel_is_unique() for the qual list from (1). If it returns true, then outer row matches only the same row from the inner relation. 3. If uniqueness of outer relation deduces from baserestrictinfo clause like 'x=const' on unique field(s), check that inner has the same clause with the same constant(s). 4. Compare RowMarks of inner and outer relations. They must have the same strength. Some regression tests changed due to self-join removal logic. 
[1] https://www.postgresql.org/message-id/raw/CAApHDvpggnFMC4yP-jUO7PKN%3DfXeErW5bOxisvJ0HvkHQEY%3DWw%40mail.gmail.com --- doc/src/sgml/config.sgml | 16 + src/backend/optimizer/path/indxpath.c | 38 + src/backend/optimizer/plan/analyzejoins.c | 1046 - src/backend/optimizer/plan/planmain.c |5 + src/backend/utils/misc/guc_tables.c | 22 + src/include/optimizer/paths.h |3 + src/include/optimizer/planmain.h |7 + src/test/regress/expected/equivclass.out | 32 + src/test/regress/expected/join.out| 774 +++ src/test/regress/expected/sysviews.out|3 +- src/test/regress/sql/equivclass.sql | 16 + src/test/regress/sql/join.sql | 340 +++ src/tools/pgindent/typedefs.list |2 + 13 files changed, 2278 insertions(+), 26 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index d750290f13..5ce2d4d2fa 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -5290,6 +5290,22 @@ ANY num_sync ( + enable_self_join_removal (boolean) + + enable_self_join_removal configuration parameter + + + + +Enables or disables the query planner's optimization which analyses +query tree and replaces self joins with semantically equivalent single +scans. Take into consideration only plain tables. +The default is on. + + + + enable_seqscan (boolean) diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c index c31fcc917d..51f672a65c 100644 --- a/src/backend/optimizer/path/indxpath.c +++ b/src/backend/optimizer/path/indxpath.c @@ -3513,6 +3513,21 @@ bool relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel, List *restrictlist, List *exprlist, List *oprlist) +{ + return relation_has_unique_index_ext(root, rel, restrictlist, + exprlist, oprlist, NULL); +} + +/* + * relation_has_unique_index_ext + * if extra_clauses isn't NULL, return baserestrictinfo clauses which were + * used to derive uniqueness. 
+ */ +bool +relation_has_unique_index_ext(PlannerInfo *root, RelOptInfo *rel, + List *restrictlist, + List *exprlist, List *oprlist, + List **extra_clauses) { ListCell *ic; @@ -3568,6 +3583,7 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel, { IndexOptInfo *ind = (IndexOptInfo *) lfirst(ic); int c; + List *exprs = NIL; /* * If the index is not unique, or not immediately enforced, or if it's @@ -3616,6 +3632,24 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel, if (match_index_to_operand(rexpr, c, ind)) { matched = true; /* column is unique */ + + if (bms_membership(rinfo->clause_relids) == BMS_SINGLETON) + { + MemoryContext oldMemCtx = +
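The effect of self-join removal on a query plan can be sketched as follows. The table is hypothetical, and the "with patch" plan is an illustrative shape, not captured output:

```sql
CREATE TABLE t (id int PRIMARY KEY, payload text);

-- An inner self-join on a unique key: for each outer row, at most one
-- inner row matches, and it is physically the same row, so the join can
-- be replaced with a single scan.
EXPLAIN (COSTS OFF)
SELECT a.* FROM t a JOIN t b ON a.id = b.id;
-- Without the patch (illustrative):   With the patch (illustrative):
--   Hash Join                           Seq Scan on t a
--     Hash Cond: (a.id = b.id)            Filter: (id IS NOT NULL)
--     ->  Seq Scan on t a
--     ->  Hash
--           ->  Seq Scan on t b
-- The residual "id IS NOT NULL" filter preserves inner-join semantics,
-- since rows with a NULL join key would not have matched anything.
```

Such joins rarely appear in hand-written queries, but they do arise from views, row-level security policies, and ORM-generated SQL, which is the motivation for doing this work in the planner.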
Re: document the need to analyze partitioned tables
On 10/5/22 13:37, Laurenz Albe wrote:
> On Mon, 2022-03-28 at 15:05 +0200, Tomas Vondra wrote:
>> I've pushed the last version, and backpatched it to 10 (not sure I'd call it a bugfix, but I certainly agree with Justin it's worth mentioning in the docs, even on older branches).
> I'd like to suggest an improvement to this. The current wording could be read to mean that dead tuples won't get cleaned up in partitioned tables. By the way, where are the statistics of a partitioned table used? The actual tables scanned are always the partitions, and in the execution plans that I have seen, the optimizer always used the statistics of the partitions.

For example, they are used to estimate the selectivity of a join clause:

CREATE TABLE test (id integer, val integer) PARTITION BY hash (id);
CREATE TABLE test_0 PARTITION OF test FOR VALUES WITH (modulus 2, remainder 0);
CREATE TABLE test_1 PARTITION OF test FOR VALUES WITH (modulus 2, remainder 1);
INSERT INTO test (SELECT q, q FROM generate_series(1,10) AS q);
VACUUM ANALYZE test;
INSERT INTO test (SELECT q, q%2 FROM generate_series(11,200) AS q);
VACUUM ANALYZE test_0,test_1;
EXPLAIN (ANALYZE, TIMING OFF, SUMMARY OFF) SELECT * FROM test t1, test t2 WHERE t1.id = t2.val;
VACUUM ANALYZE test;
EXPLAIN (ANALYZE, TIMING OFF, SUMMARY OFF) SELECT * FROM test t1, test t2 WHERE t1.id = t2.val;

Here, without actual statistics on the parent table, we make a wrong prediction.

--
Regards
Andrey Lepikhov
Postgres Professional
Re: Fast COPY FROM based on batch insert
On 10/7/22 11:18, Etsuro Fujita wrote:
> On Tue, Sep 27, 2022 at 6:03 PM Etsuro Fujita wrote:
>> I will review the patch a bit more, but I feel that it is in good shape.
> One thing I noticed is this bit added to CopyMultiInsertBufferFlush() to run triggers on the foreign table.
>
> + /* Run AFTER ROW INSERT triggers */
> + if (resultRelInfo->ri_TrigDesc != NULL &&
> +     (resultRelInfo->ri_TrigDesc->trig_insert_after_row ||
> +      resultRelInfo->ri_TrigDesc->trig_insert_new_table))
> + {
> +     Oid relid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
> +
> +     for (i = 0; i < inserted; i++)
> +     {
> +         TupleTableSlot *slot = rslots[i];
> +
> +         /*
> +          * AFTER ROW Triggers might reference the tableoid column,
> +          * so (re-)initialize tts_tableOid before evaluating them.
> +          */
> +         slot->tts_tableOid = relid;
> +
> +         ExecARInsertTriggers(estate, resultRelInfo,
> +                              slot, NIL,
> +                              cstate->transition_capture);
> +     }
> + }
>
> Since foreign tables cannot have transition tables, we have trig_insert_new_table=false. So I simplified the if test and added an assertion ensuring trig_insert_new_table=false. Attached is a new version of the patch. I tweaked some comments a bit as well. I think the patch is committable. So I plan on committing it next week if there are no objections.

I reviewed the patch one more time. Only one question: bistate and ri_FdwRoutine are strongly bound to each other. Maybe add an assertion on (ri_FdwRoutine XOR bistate)? Just to prevent possible errors in the future.

--
Regards
Andrey Lepikhov
Postgres Professional
Re: Fast COPY FROM based on batch insert
On 10/12/22 07:56, Etsuro Fujita wrote:
> On Tue, Oct 11, 2022 at 3:06 PM Andrey Lepikhov wrote:
>> I reviewed the patch one more time. Only one question: bistate and ri_FdwRoutine are strongly bound to each other. Maybe add an assertion on (ri_FdwRoutine XOR bistate)? Just to prevent possible errors in the future.
> You mean the bistate member of CopyMultiInsertBuffer?

Yes.

> We do not use that member at all for foreign tables, so the patch avoids initializing that member in CopyMultiInsertBufferInit() when called for a foreign table. So we have bistate = NULL for foreign tables (and bistate != NULL for plain tables), as you mentioned above. I think it is a good idea to add such assertions. How about adding them to CopyMultiInsertBufferFlush() and CopyMultiInsertBufferCleanup() like the attached? In the attached I updated comments a bit further as well.

Yes, that is quite enough.

--
Regards
Andrey Lepikhov
Postgres Professional
Re: Fast COPY FROM based on batch insert
On 28/10/2022 16:12, Etsuro Fujita wrote:
> On Thu, Oct 13, 2022 at 6:58 PM Etsuro Fujita wrote:
>> I have committed the patch after tweaking comments a little bit further.
> I think there is another patch that improves performance of COPY FROM for foreign tables using COPY FROM STDIN, but if Andrey (or anyone else) wants to work on it again, I think it would be better to create a new CF entry for it (and start a new thread for it). So I plan to close this in the November CF unless they think otherwise. Anyway, thanks for the patch, Andrey! Thanks for reviewing, Ian and Zhihong!

Thanks. I studied the performance of this code in comparison to bulk INSERTs. This patch seems to improve the speed of insertion by about 20%. Also, this patch is very invasive, so I don't have any plans to work on it now.

--
regards,
Andrey Lepikhov
Postgres Professional