Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?
Thanks for the detailed explanation.

On Sat, Feb 19, 2022 at 2:27 AM Robert Haas wrote:
> On Fri, Feb 18, 2022 at 12:56 AM Andy Fan wrote:
> > What do you think about moving on this feature? The items known by me
> > are: a). Make sure the estimation error can be fixed, or discuss whether my current
> > solution is workable. b). Just distribute some selectivity restrictinfo to
> > RelOptInfo. c). See how hard it is to treat the original / derived qual equally.
> > d). Reduce the extra planner cost as much as possible. Any other important
> > items I missed?
>
> I think it's not realistic to do anything here for PostgreSQL 15.
> Considering that it's almost the end of February and feature freeze
> will probably be in perhaps 5-6 weeks, in order to get something
> committed at this point,

I didn't expect that we could commit it very soon;) Actually my expectation was that more people would care about the direction of this feature. I care about it, but obviously that's not enough. So I summarized the direction I want to go and am letting more people see if that's right.

> Tom doesn't think we need this at all, and you and I and
> Tomas all have somewhat different ideas on what approach we ought to
> be taking,

Agreed. IMO, the estimation error looks like a serious issue that we all agree needs a solution, but currently we have different ways to handle it. I very much hope that we can have a discussion about this stuff.

> and the patches appear to be at a POC level at this point rather than
> something that's close to being ready to ship,

This is very true, since there is no consensus on an approach so far; a PoC would be enough for now.

> It seems to me that the thing to do here is see if you can build
> consensus on an approach. Just saying that we ought to think the
> patches you've already got are good enough is not going to get you
> anywhere.

I truly understand this. No matter which approach I argue for, the only reason is that I think it is the best one, not whether it comes from me.

> I do understand that the political element of this problem
> is frustrating to you, as it is to many people. But consider the
> alternative: suppose the way things worked around here is that any
> committer could commit anything they liked without needing the
> approval of any other committer, or even over their objections. Well,
> it would be chaos.

I think this is a fact.

> People would be constantly reverting or rewriting
> things that other people had done, and everybody would probably be
> pissed off at each other all the time, and the quality would go down
> the tubes and nobody would use PostgreSQL any more.
>
> But the reason it's frustrating is because the PostgreSQL
> community is a community of human beings, and there's nothing more
> frustrating in the world than the stuff other human beings do.

That's new knowledge for me about how committers think about others' patches:) It is reasonable. Committing the patch is not my only goal; thinking things through more completely is also a great thing to get out of the discussion. It's just that sometimes being ignored is frustrating (and I also truly understand that everyone's energy is limited).

> However, it's equally true that we get further working together than
> we would individually. I think Tom is wrong about the merits of doing
> something in this area, but I also think he's incredibly smart and
> thoughtful and one of the best technologists I've ever met, and
> probably just one of the absolute best technologists on Planet Earth.
> And I also have to consider, and this is really important, the
> possibility that Tom is right about this issue and I am wrong. So far
> Tom hasn't replied to what I wrote, but I hope he does. Maybe he'll
> admit that I have some valid points. Maybe he'll tell me why he thinks
> I'm wrong. Maybe I'll learn about some problem that I haven't
> considered from his response, and maybe that will lead to a refinement
> of the idea that will make it better.

+1. Just to be more precise: are you also confused about why this should not be done at all? IIUC, I get 3 reasons from Tom's reply: a) planning cost, b) estimation error, c) extra qual execution is bad.

> I don't know, but it's certainly
> happened in plenty of other cases. And that's how PostgreSQL gets to
> be this pretty amazing database that it is. So, yeah, building
> consensus is frustrating and it takes a long time and sometimes it
> feels like other people are obstructing you needlessly and sometimes
> that's probably true. But there's not a realistic alternative. Nobody
> here is smart enough to create software that is as good as what all of
> us create together.

+1.

--
Best Regards
Andy Fan
Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?
>>> It actually deals with a more general form of this case, because the
>>> clauses don't need to reference the same attribute - so for example this
>>> would work too, assuming there is extended stats object on the columns
>>> on each side:
>>>
>>> P(A.c = B.d | (A.e < 42) & (B.f < 42))
>>
>> That'd be cool.
>
> Yeah, but the patch implementing this still needs more work.

Thanks for that patch. It has been on my study list for a long time, and it can fix another real case I ran into some days ago. I spent a day studying it again yesterday, but the results do not deserve sharing at the current stage.

As for the purpose here: if we have extended statistics, I believe it can work well. But requiring extended statistics for this feature does not look very reasonable to me. Do you think we can go further in that direction for the issue here? It would also be great if you could take a look at commit 3 [1]. IIUC, it can solve the issue and is pretty straightforward.

[1] https://www.postgresql.org/message-id/CAKU4AWrdeQZ8xvf%3DDVhndUs%3DRGn8oVoSJvYK3Yj7uWq2%3Ddt%3DMg%40mail.gmail.com

--
Best Regards
Andy Fan
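For anyone reproducing the quoted example, a minimal sketch of the extended statistics setup it assumes might look like the following; the table, column, and statistics names are invented here to match the P(A.c = B.d | (A.e < 42) & (B.f < 42)) formula above:

CREATE TABLE a (c int, e int);
CREATE TABLE b (d int, f int);

-- extended statistics on the columns of each side, so the planner has the
-- multi-column information the quoted approach relies on
CREATE STATISTICS a_stats (dependencies, mcv) ON c, e FROM a;
CREATE STATISTICS b_stats (dependencies, mcv) ON d, f FROM b;
ANALYZE a, b;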
Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?
Thanks Tom for joining.

> I'm not in favor of complicating the EquivalenceClass
> mechanism for this, because (b) what it definitely will do
> is make ECs harder to understand and reason about.

I don't want to oppose this on purpose, and I don't insist on the current strategy, but I can't understand the comment here (and I'm not sure about others), so I just want to point it out. IMO, the semantics of ec_filter is that every EM in this EC can have this filter, and I like this method very much. If something needs improving, it may be that the content of ec_filter is not generic enough. For example:

select * from t1, t2 where t1.a = t2.a and t2.a > 3;

Here the EC filter is "t2.a > 3". Why is it "t2.a" rather than a more generic placeholder meaning "any EM" in this EC? I can double-check the patch to see if that would help. Maybe I'm focusing too much on the current situation; could you describe the drawbacks at this semantic level in more detail?

> If we develop a
> separate mechanism that can infer things from inequalities, and it
> _only_ kicks in when there are some inequalities, that might work out okay.

I will try to make this part clearer. The current mechanism includes 3 steps.

1). Gather the inequalities_qual_list during deconstruct_jointree.
2). After root->eq_classes is built, scan each of the above quals to find out whether there is an EC match; if yes, add it to the EC. There are some fast paths here.
3). Compose the qual in ec_filter with the members in ec_members, then distribute the results to the relations.

Step 1 makes sure only inequalities are checked. Are you unhappy with the cost of step 2 here? For a case like "SELECT * FROM t1, t2 WHERE t1.a = t2.a AND t1.b > 3;" we have to go through step 2 and get nothing in the end. For a case like "FROM t1, t2, t3 WHERE t1.a = t2.a and t3.c > 3", the qual "t3.c > 3" can be discarded quickly with an EC->relids check.

> But because of that, I don't even like the 0001 patch in this series.
> I've not looked at the subsequent ones.

I agree that the 0001 patch should be the first one to reach an agreement on.

--
Best Regards
Andy Fan
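To spell out the intended semantics with the example above (a hypothetical rewrite for illustration, not output of the patch): because t1.a and t2.a are members of the same equivalence class, attaching "t2.a > 3" as an ec_filter would let the planner act as if the filter had been applied to every member:

-- original query
select * from t1, t2 where t1.a = t2.a and t2.a > 3;

-- effective form once the ec_filter is distributed to all equivalence members
select * from t1, t2 where t1.a = t2.a and t2.a > 3 and t1.a > 3;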
Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?
> I don't think 0001 is right either, although maybe for somewhat
> different reasons. First, I think it only considers VAR OP CONST style
> clauses, but that is leaving money on the table, because given a.x =
> b.x AND mumble(a.x), we can decide to instead test mumble(b.x) if the
> equality operator in question has is-binary-identical semantics. It
> does not seem necessary for a first patch to deal with both that and
> the somewhat more pleasing case where we're making deductions based on
> operator families ... but we shouldn't commit to a design for the VAR
> OP CONST case without understanding how it could be generalized.

I can follow up on this, and +1 to the statement.

> Second, it looks to me like the patch takes the rather naive strategy
> of enforcing the derived clauses everywhere that they can legally be
> put, which seems certain not to be optimal.

If we can reach some agreement (after more discussion) that the EC filter is acceptable at the semantics level, I think we may have some chance to improve things at the execution level.

--
Best Regards
Andy Fan
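A hypothetical illustration of the first quoted point, with invented table and column names, assuming a.x and b.x share an equality operator with the is-binary-identical semantics described above, and with mumble() standing in for an arbitrary predicate:

-- given the join clause a.x = b.x, a clause written against a.x ...
SELECT * FROM a JOIN b ON a.x = b.x WHERE mumble(a.x);

-- ... could instead (or additionally) be tested against b.x:
SELECT * FROM a JOIN b ON a.x = b.x WHERE mumble(b.x);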
Re: Condition pushdown: why (=) is pushed down into join, but BETWEEN or >= is not?
Hi:

I just tested more cases for the estimation issue for this feature, and we can see that **we get a more accurate/stable estimation than before**. Here are the test cases and results (comparing the master version and the patched version).

create table ec_t110 as select i::int as a from generate_series(1, 110) i;
create table ec_t200 as select i::int as a from generate_series(1, 200) i;
create table ec_t500 as select i::int as a from generate_series(1, 500) i;
create table ec_t800 as select i::int as a from generate_series(1, 800) i;
create table ec_t1000 as select i::int as a from generate_series(1, 1000) i;
analyze;

-- 2 table joins.
explain analyze select * from ec_t1000, ec_t110 where ec_t1000.a = ec_t110.a and ec_t1000.a > 100; -- (0.9)
explain analyze select * from ec_t1000, ec_t110 where ec_t1000.a = ec_t110.a and ec_t110.a > 100; -- (0.1)

-- 3 table joins.
explain analyze select * from ec_t1000, ec_t110, ec_t200 where ec_t1000.a = ec_t110.a and ec_t110.a = ec_t200.a and ec_t1000.a > 100;
explain analyze select * from ec_t1000, ec_t110, ec_t200 where ec_t1000.a = ec_t110.a and ec_t110.a = ec_t200.a and ec_t110.a > 100;
explain analyze select * from ec_t1000, ec_t110, ec_t200 where ec_t1000.a = ec_t110.a and ec_t110.a = ec_t200.a and ec_t200.a > 100;

-- 4 table joins.
explain analyze select * from ec_t1000, ec_t110, ec_t200, ec_t500 where ec_t1000.a = ec_t110.a and ec_t110.a = ec_t200.a and ec_t500.a = ec_t200.a and ec_t1000.a > 100;
explain analyze select * from ec_t1000, ec_t110, ec_t200, ec_t500 where ec_t1000.a = ec_t110.a and ec_t110.a = ec_t200.a and ec_t500.a = ec_t200.a and ec_t110.a > 100;
explain analyze select * from ec_t1000, ec_t110, ec_t200, ec_t500 where ec_t1000.a = ec_t110.a and ec_t110.a = ec_t200.a and ec_t500.a = ec_t200.a and ec_t200.a > 100;
explain analyze select * from ec_t1000, ec_t110, ec_t200, ec_t500 where ec_t1000.a = ec_t110.a and ec_t110.a = ec_t200.a and ec_t500.a = ec_t200.a and ec_t500.a > 100;

-- 5 table joins.
explain analyze select * from ec_t1000, ec_t110, ec_t200, ec_t500, ec_t800 where ec_t1000.a = ec_t110.a and ec_t110.a = ec_t200.a and ec_t500.a = ec_t200.a and ec_t500.a = ec_t800.a and ec_t1000.a > 100;
explain analyze select * from ec_t1000, ec_t110, ec_t200, ec_t500, ec_t800 where ec_t1000.a = ec_t110.a and ec_t110.a = ec_t200.a and ec_t500.a = ec_t200.a and ec_t500.a = ec_t800.a and ec_t110.a > 100;
explain analyze select * from ec_t1000, ec_t110, ec_t200, ec_t500, ec_t800 where ec_t1000.a = ec_t110.a and ec_t110.a = ec_t200.a and ec_t500.a = ec_t200.a and ec_t500.a = ec_t800.a and ec_t200.a > 100;
explain analyze select * from ec_t1000, ec_t110, ec_t200, ec_t500, ec_t800 where ec_t1000.a = ec_t110.a and ec_t110.a = ec_t200.a and ec_t500.a = ec_t200.a and ec_t500.a = ec_t800.a and ec_t500.a > 100;
explain analyze select * from ec_t1000, ec_t110, ec_t200, ec_t500, ec_t800 where ec_t1000.a = ec_t110.a and ec_t110.a = ec_t200.a and ec_t500.a = ec_t200.a and ec_t500.a = ec_t800.a and ec_t800.a > 100;

| Query Id | Real rows | Est. rows at master | Est. rows with patch | table # |
|----------+-----------+---------------------+----------------------+---------|
|        1 |        10 |                  99 |                   10 |       2 |
|        2 |        10 |                  10 |                   10 |       2 |
|        3 |        10 |                  20 |                   11 |       3 |
|        4 |        10 |                   2 |                   11 |       3 |
|        5 |        10 |                  11 |                   11 |       3 |
|        6 |        10 |                  10 |                    9 |       4 |
|        7 |        10 |                   1 |                    9 |       4 |
|        8 |        10 |                   6 |                    9 |       4 |
|        9 |        10 |                   9 |                    9 |       4 |
|       10 |        10 |                   8 |                    8 |       5 |
|       11 |        10 |                   1 |                    8 |       5 |
|       12 |        10 |                   5 |                    8 |       5 |
|       13 |        10 |                   7 |                    8 |       5 |
|       14 |        10 |                   8 |                    8 |       5 |

In the past, we could only use the qual the user provided to do the estimation.
Now, since we introduced the CorrectiveQuals design, we still keep only one qual counted, but we can choose the best one in CorrectiveQuals no matter which one the user provided. We gain a better and more stable estimation because of this. I'm happy with the overall design but not very confident about the method to "choose the best one to keep", so I ran as many test cases as I could to find anything wrong; so far so good. I'm also open to discussing how to keep only one qual in CorrectiveQuals (rather than choosing the best one). Assume we have just one EC filter in this query for simplicity. At the beginning, all the baserels have been impacte
Re: Hybrid Hash/Nested Loop joins and caching results from subplans
Thanks for working on the new version.

On Fri, Dec 4, 2020 at 10:41 PM David Rowley wrote:
> I also noticed that the code I'd written to build the cache lookup expression
> included a step to deform the outer tuple. This was unnecessary and
> slowed down the expression evaluation.

I thought it would be something like my 3rd suggestion in [1], but after I read the code, it looks like it isn't. Could you explain what this change is? I probably missed something.

[1] https://www.postgresql.org/message-id/CAApHDvqvGZUPKHO%2B4Xp7Lm_q1OXBo2Yp1%3D5pVnEUcr4dgOXxEg%40mail.gmail.com

--
Best Regards
Andy Fan
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
Thank you Heikki for your attention.

On Mon, Nov 30, 2020 at 11:20 PM Heikki Linnakangas wrote:
> On 30/11/2020 16:30, Jesper Pedersen wrote:
> > On 11/30/20 5:04 AM, Heikki Linnakangas wrote:
> >> On 26/11/2020 16:58, Andy Fan wrote:
> >>> This patch has stopped moving for a while, any suggestion about
> >>> how to move on is appreciated.
> >>
> >> The question on whether UniqueKey.exprs should be a list of
> >> EquivalenceClasses or PathKeys is unresolved. I don't have an opinion
> >> on that, but I'd suggest that you pick one or the other and just go
> >> with it. If it turns out to be a bad choice, then we'll change it.
> >
> > In this case I think it is matter of deciding if we are going to use
> > EquivalenceClasses or Exprs before going further; there has been work
> > ongoing in this area for a while, so having a clear direction from a
> > committer would be greatly appreciated.
>
> Plain Exprs are not good enough, because you need to know which operator
> the expression is unique on. Usually, it's the default = operator in the
> default btree opclass for the datatype, but it could be something else,
> too.

Actually I can't understand this; could you explain it a bit more? Based on my current knowledge, when we run "SELECT DISTINCT a FROM t", we never care about which operator is used for the uniqueness.

> There's some precedence for PathKeys, as we generate PathKeys to
> represent the DISTINCT column in PlannerInfo->distinct_pathkeys. On the
> other hand, I've always found it confusing that we use PathKeys to
> represent DISTINCT and GROUP BY, which are not actually sort orderings.

OK, I have the same confusion now:)

> Perhaps it would make sense to store EquivalenceClass+opfamily in
> UniqueKey, and also replace distinct_pathkeys and group_pathkeys with
> UniqueKeys.

I can understand why we need an EquivalenceClass for UniqueKey, but I can't understand why we need the opfamily here.

For anyone who is interested in these patchsets, here is my plan now:
1). I will try EquivalenceClass rather than Expr in UniqueKey and add opfamily if needed.
2). I will start a new thread to continue this topic. The current thread is too long, which may scare away some people who might be interested in it.
3). I will give up patches 5 & 6 for now. One reason is that I am not happy with the current implementation; the other is that I want to make the patchset smaller to make review easier. I will not give them up forever; after the main part of this patchset is committed, I will continue with them in a new thread.

Thanks everyone for your input.

--
Best Regards
Andy Fan
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
Thank you Tom and Heikki for your input. On Sun, Dec 6, 2020 at 4:40 AM Tom Lane wrote: > Heikki Linnakangas writes: > >> I can understand why we need EquivalenceClass for UniqueKey, but I can't > >> understand why we need opfamily here. > > > Thinking a bit harder, I guess we don't. Because EquivalenceClass > > includes the operator family already, in the ec_opfamilies field. > > No. EquivalenceClasses only care about equality, which is why they > might potentially mention several opfamilies that share an equality > operator. If you care about sort order, you *cannot* rely on an > EquivalenceClass to depict that. Now, abstract uniqueness also only > cares about equality, but if you are going to implement it via sort- > and-unique then you need to settle on a sort order. > I think UniqueKey only cares about equality. Even DISTINCT / groupBy can be implemented with sort, but UniqueKey only care about the result of DISTINCT/GROUPBY, so it doesn't matter IIUC. > > I agree we are overspecifying DISTINCT by settling on a sort operator at > parse time, rather than considering all the possibilities at plan time. > But given that opfamilies sharing equality are mostly a hypothetical > use-case, I'm not in a big hurry to fix it. Before we had ASC/DESC > indexes, there was a real use-case for making a "reverse sort" opclass, > with the same equality as the type's regular opclass but the opposite sort > order. But that's ancient history now, and I've seen few other plausible > use-cases. > > I have not been following this thread closely enough to understand > why we need a new "UniqueKeys" data structure at all. Currently the UniqueKey is defined as a List of Expr, rather than EquivalenceClasses. A complete discussion until now can be found at [1] (The messages I replied to also care a lot and the information is completed). This patch has stopped at this place for a while, I'm planning to try EquivalenceClasses, but any suggestion would be welcome. > But if the > motivation is only to remove this overspecification, I humbly suggest > that it ain't worth the trouble. > > regards, tom lane > [1] https://www.postgresql.org/message-id/CAKU4AWqy3Uv67%3DPR8RXG6LVoO-cMEwfW_LMwTxHdGrnu%2Bcf%2BdA%40mail.gmail.com -- Best Regards Andy Fan
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
On Mon, Dec 7, 2020 at 4:16 PM Jesper Pedersen wrote: > Hi, > > On 12/5/20 10:38 PM, Andy Fan wrote: > > Currently the UniqueKey is defined as a List of Expr, rather than > > EquivalenceClasses. > > A complete discussion until now can be found at [1] (The messages I > replied > > to also > > care a lot and the information is completed). This patch has stopped at > > this place for > > a while, I'm planning to try EquivalenceClasses, but any suggestion > would > > be welcome. > > > > > > Unfortunately I think we need a RfC style patch of both versions in > their minimum implementation. > > Hopefully this will make it easier for one or more committers to decide > on the right direction since they can do a side-by-side comparison of > the two solutions. > > I do get the exact same idea. Actually I have made EquivalenceClasses works with baserel last weekend and then I realized it is hard to compare the 2 situations without looking into the real/Poc code, even for very experienced people. I will submit a new patch after I get the partitioned relation, subquery works. Hope I can make it in one week. > Just my $0.02. > > Thanks for working on this Andy ! > > Best regards, > Jesper > > -- Best Regards Andy Fan
initscan for MVCC snapshot
Hi:

I see initscan calls RelationGetNumberOfBlocks every time and rescan calls initscan as well. In my system, RelationGetNumberOfBlocks is expensive (the reason doesn't deserve a talk..), so in a nested loop + Bitmap Heap Scan case, the impact will be huge. The comments of initscan are below.

/*
 * Determine the number of blocks we have to scan.
 *
 * It is sufficient to do this once at scan start, since any tuples added
 * while the scan is in progress will be invisible to my snapshot anyway.
 * (That is not true when using a non-MVCC snapshot.  However, we couldn't
 * guarantee to return tuples added after scan start anyway, since they
 * might go into pages we already scanned.  To guarantee consistent
 * results for a non-MVCC snapshot, the caller must hold some higher-level
 * lock that ensures the interesting tuple(s) won't change.)
 */

I still do not fully understand the comments. It looks like we only need to call it multiple times for a non-MVCC snapshot. IIUC, is the following change reasonable?

===

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1b2f70499e..8238eabd8b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -211,6 +211,7 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	ParallelBlockTableScanDesc bpscan = NULL;
 	bool		allow_strat;
 	bool		allow_sync;
+	bool		is_mvcc = scan->rs_base.rs_snapshot && IsMVCCSnapshot(scan->rs_base.rs_snapshot);
 
 	/*
 	 * Determine the number of blocks we have to scan.
@@ -229,7 +230,8 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 		scan->rs_nblocks = bpscan->phs_nblocks;
 	}
 	else
-		scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_base.rs_rd);
+		if (scan->rs_nblocks == -1 || !is_mvcc)
+			scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_base.rs_rd);
 
 	/*
 	 * If the table is large relative to NBuffers, use a bulk-read access
@@ -1210,6 +1212,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	else
 		scan->rs_base.rs_key = NULL;
 
+	scan->rs_nblocks = -1;
 	initscan(scan, key, false);

--
Best Regards
Andy Fan
Re: initscan for MVCC snapshot
On Mon, Dec 7, 2020 at 8:26 PM Andy Fan wrote: > Hi: > I see initscan calls RelationGetwNumberOfBlocks every time and rescan > calls > initscan as well. In my system, RelationGetNumberOfBlocks is expensive > (the reason > doesn't deserve a talk.. ), so in a nest loop + Bitmap heap scan case, > the > impact will be huge. The comments of initscan are below. > > /* > * Determine the number of blocks we have to scan. > * > * It is sufficient to do this once at scan start, since any tuples added > * while the scan is in progress will be invisible to my snapshot anyway. > * (That is not true when using a non-MVCC snapshot. However, we couldn't > * guarantee to return tuples added after scan start anyway, since they > * might go into pages we already scanned. To guarantee consistent > * results for a non-MVCC snapshot, the caller must hold some higher-level > * lock that ensures the interesting tuple(s) won't change.) > */ > > I still do not fully understand the comments. Looks we only need to call > multi times for non-MVCC snapshot, IIUC, does the following change > reasonable? > > === > > diff --git a/src/backend/access/heap/heapam.c > b/src/backend/access/heap/heapam.c > index 1b2f70499e..8238eabd8b 100644 > --- a/src/backend/access/heap/heapam.c > +++ b/src/backend/access/heap/heapam.c > @@ -211,6 +211,7 @@ initscan(HeapScanDesc scan, ScanKey key, bool > keep_startblock) > ParallelBlockTableScanDesc bpscan = NULL; > boolallow_strat; > boolallow_sync; > + boolis_mvcc = scan->rs_base.rs_snapshot && > IsMVCCSnapshot(scan->rs_base.rs_snapshot); > > /* > * Determine the number of blocks we have to scan. > @@ -229,7 +230,8 @@ initscan(HeapScanDesc scan, ScanKey key, bool > keep_startblock) > scan->rs_nblocks = bpscan->phs_nblocks; > } > else > - scan->rs_nblocks = > RelationGetNumberOfBlocks(scan->rs_base.rs_rd); > + if (scan->rs_nblocks == -1 || !is_mvcc) > + scan->rs_nblocks = > RelationGetNumberOfBlocks(scan->rs_base.rs_rd); > > /* > * If the table is large relative to NBuffers, use a bulk-read > access > @@ -1210,6 +1212,7 @@ heap_beginscan(Relation relation, Snapshot snapshot, > else > scan->rs_base.rs_key = NULL; > > + scan->rs_nblocks = -1; > initscan(scan, key, false); > > -- > Best Regards > Andy Fan > I have tested this with an ext4 file system, and I can get a 7%+ performance improvement for the given test case. Here are the steps: create table t(a int, b char(8000)); insert into t select i, i from generate_series(1, 100)i; create index on t(a); delete from t where a <= 1; vacuum t; alter system set enable_indexscan to off; select pg_reload_conf(); cat 1.sql select * from generate_series(1, 1)i, t where i = t.a; bin/pgbench -f 1.sql postgres -T 300 -c 10 Without this patch: latency average = 61.806 ms with this patch: latency average = 57.484 ms I think the result is good and I think we can probably make this change. However, I'm not sure about it. -- Best Regards Andy Fan
Re: initscan for MVCC snapshot
On Thu, Dec 10, 2020 at 7:31 PM Andy Fan wrote: > > > On Mon, Dec 7, 2020 at 8:26 PM Andy Fan wrote: > >> Hi: >> I see initscan calls RelationGetwNumberOfBlocks every time and rescan >> calls >> initscan as well. In my system, RelationGetNumberOfBlocks is expensive >> (the reason >> doesn't deserve a talk.. ), so in a nest loop + Bitmap heap scan case, >> the >> impact will be huge. The comments of initscan are below. >> >> /* >> * Determine the number of blocks we have to scan. >> * >> * It is sufficient to do this once at scan start, since any tuples added >> * while the scan is in progress will be invisible to my snapshot anyway. >> * (That is not true when using a non-MVCC snapshot. However, we couldn't >> * guarantee to return tuples added after scan start anyway, since they >> * might go into pages we already scanned. To guarantee consistent >> * results for a non-MVCC snapshot, the caller must hold some higher-level >> * lock that ensures the interesting tuple(s) won't change.) >> */ >> >> I still do not fully understand the comments. Looks we only need to call >> multi times for non-MVCC snapshot, IIUC, does the following change >> reasonable? >> >> === >> >> diff --git a/src/backend/access/heap/heapam.c >> b/src/backend/access/heap/heapam.c >> index 1b2f70499e..8238eabd8b 100644 >> --- a/src/backend/access/heap/heapam.c >> +++ b/src/backend/access/heap/heapam.c >> @@ -211,6 +211,7 @@ initscan(HeapScanDesc scan, ScanKey key, bool >> keep_startblock) >> ParallelBlockTableScanDesc bpscan = NULL; >> boolallow_strat; >> boolallow_sync; >> + boolis_mvcc = scan->rs_base.rs_snapshot && >> IsMVCCSnapshot(scan->rs_base.rs_snapshot); >> >> /* >> * Determine the number of blocks we have to scan. >> @@ -229,7 +230,8 @@ initscan(HeapScanDesc scan, ScanKey key, bool >> keep_startblock) >> scan->rs_nblocks = bpscan->phs_nblocks; >> } >> else >> - scan->rs_nblocks = >> RelationGetNumberOfBlocks(scan->rs_base.rs_rd); >> + if (scan->rs_nblocks == -1 || !is_mvcc) >> + scan->rs_nblocks = >> RelationGetNumberOfBlocks(scan->rs_base.rs_rd); >> >> /* >> * If the table is large relative to NBuffers, use a bulk-read >> access >> @@ -1210,6 +1212,7 @@ heap_beginscan(Relation relation, Snapshot snapshot, >> else >> scan->rs_base.rs_key = NULL; >> >> + scan->rs_nblocks = -1; >> initscan(scan, key, false); >> >> -- >> Best Regards >> Andy Fan >> > > I have tested this with an ext4 file system, and I can get a 7%+ > performance improvement > for the given test case. Here are the steps: > > create table t(a int, b char(8000)); > insert into t select i, i from generate_series(1, 100)i; > create index on t(a); > delete from t where a <= 1; > vacuum t; > alter system set enable_indexscan to off; > select pg_reload_conf(); > > cat 1.sql > select * from generate_series(1, 1)i, t where i = t.a; > > bin/pgbench -f 1.sql postgres -T 300 -c 10 > > Without this patch: latency average = 61.806 ms > with this patch: latency average = 57.484 ms > > I think the result is good and I think we can probably make this change. > However, I'm not > sure about it. > > The plan which was used is below, in the rescan of Bitmap Heap Scan, mdnblocks will be called 1 times in current implementation, Within my patch, it will be only called once. postgres=# explain (costs off) select * from generate_series(1, 1)i, t where i = t.a; QUERY PLAN -- Nested Loop -> Function Scan on generate_series i -> Bitmap Heap Scan on t Recheck Cond: (a = i.i) -> Bitmap Index Scan on t_a_idx Index Cond: (a = i.i) (6 rows) -- Best Regards Andy Fan
Re: Cache relation sizes?
On Thu, Nov 19, 2020 at 1:02 PM Thomas Munro wrote: > On Tue, Nov 17, 2020 at 10:48 PM Amit Kapila > wrote: > > Yeah, it is good to verify VACUUM stuff but I have another question > > here. What about queries having functions that access the same > > relation (SELECT c1 FROM t1 WHERE c1 <= func(); assuming here function > > access the relation t1)? Now, here I think because the relation 't1' > > is already opened, it might use the same value of blocks from the > > shared cache even though the snapshot for relation 't1' when accessed > > from func() might be different. Am, I missing something, or is it > > dealt in some way? > > I think it should be covered by the theory about implicit memory > barriers snapshots, but to simplify things I have now removed the > lock-free stuff from the main patch (0001), because it was a case of > premature optimisation and it distracted from the main concept. The > main patch has 128-way partitioned LWLocks for the mapping table, and > then per-relfilenode spinlocks for modifying the size. There are > still concurrency considerations, which I think are probably handled > with the dirty-update-wins algorithm you see in the patch. In short: > due to extension and exclusive locks, size changes AKA dirty updates > are serialised, but clean updates can run concurrently, so we just > have to make sure that clean updates never clobber dirty updates -- do > you think that is on the right path? > Hi Thomas: Thank you for working on it. I spent one day studying the patch and I want to talk about one question for now. What is the purpose of calling smgrimmedsync to evict a DIRTY sr (what will happen if we remove it and the SR_SYNCING and SR_JUST_DIRTIED flags)? -- Best Regards Andy Fan
Re: Cache relation sizes?
Hi Thomas, Thank you for your quick response. On Thu, Dec 17, 2020 at 3:05 PM Thomas Munro wrote: > Hi Andy, > > On Thu, Dec 17, 2020 at 7:29 PM Andy Fan wrote: > > I spent one day studying the patch and I want to talk about one question > for now. > > What is the purpose of calling smgrimmedsync to evict a DIRTY sr (what > will happen > > if we remove it and the SR_SYNCING and SR_JUST_DIRTIED flags)? > > The underlying theory of the patch is that after fsync() succeeds, the > new size is permanent, but before that it could change at any time > because of asynchronous activities in the kernel*. Therefore, if you > want to evict the size from shared memory, you have to fsync() the > file first. I used smgrimmedsync() to do that. > > *I suspect that it can really only shrink with network file systems > such as NFS and GlusterFS, where the kernel has to take liberties to > avoid network round trips for every file extension, but I don't know > the details on all 9 of our supported OSes and standards aren't very > helpful for understanding buffered I/O. > Let me try to understand your point. Suppose process 1 extends a file to 2 blocks from 1 block, and fsync is not called, then a). the lseek *may* still return 1 based on the comments in the ReadBuffer_common ("because of buggy Linux kernels that sometimes return an seek SEEK_END result that doesn't account for a recent write"). b). When this patch is introduced, we would get 2 blocks for sure if we can get the size from cache. c). After user reads the 2 blocks from cache and then the cache is evicted, and user reads the blocks again, it is possible to get 1 again. Do we need to consider case a) in this patch? and Is case c). the situation you want to avoid and do we have to introduce fsync to archive this? Basically I think case a) should not happen by practice so we don't need to handle case c) and finally we don't need the fsync/SR_SYNCING/SR_JUST_DIRTIED flag. I'm not opposed to adding them, I am just interested in the background in case you are also willing to discuss it. I would suggest the below change for smgrnblock_fast, FYI @@ -182,7 +182,11 @@ smgrnblocks_fast(SMgrRelation reln, ForkNumber forknum) /* * The generation doesn't match, the shared relation must have been * evicted since we got a pointer to it. We'll need to do more work. +* and invalid the fast path for next time. */ + reln->smgr_shared = NULL; } -- Best Regards Andy Fan
Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY
On Tue, Aug 4, 2020 at 7:49 AM Alvaro Herrera wrote: > I've been working on the ability to detach a partition from a > partitioned table, without causing blockages to concurrent activity. > I think this operation is critical for some use cases. > > This would be a very great feature. When we can't handle thousands of partitions very well, and user agree to detach some old partitions automatically, the blocking issue here would be a big blocker for this solution. Thanks for working on this! > There was a lot of great discussion which ended up in Robert completing > a much sought implementation of non-blocking ATTACH. DETACH was > discussed too because it was a goal initially, but eventually dropped > from that patch altogether. Nonetheless, that thread provided a lot of > useful input to this implementation. Important ones: > > [1] > https://postgr.es/m/CA+TgmoYg4x7AH=_QSptvuBKf+3hUdiCa4frPkt+RvXZyjX1n=w...@mail.gmail.com > [2] > https://postgr.es/m/CA+TgmoaAjkTibkEr=xJg3ndbRsHHSiYi2SJgX69MVosj=lj...@mail.gmail.com > [3] > https://postgr.es/m/CA+TgmoY13KQZF-=hntrt9uywyx3_oyoqpu9iont49jggidp...@mail.gmail.com > > Attached is a patch that implements > ALTER TABLE ... DETACH PARTITION .. CONCURRENTLY. > > In the previous thread we were able to implement the concurrent model > without the extra keyword. For this one I think that won't work; my > implementation works in two transactions so there's a restriction that > you can't run it in a transaction block. Also, there's a wait phase > that makes it slower than the non-concurrent one. Those two drawbacks > make me think that it's better to keep both modes available, just like > we offer both CREATE INDEX and CREATE INDEX CONCURRENTLY. > > Why two transactions? The reason is that in order for this to work, we > make a catalog change (mark it detached), and commit so that all > concurrent transactions can see the change. A second transaction waits > for anybody who holds any lock on the partitioned table and grabs Access > Exclusive on the partition (which now no one cares about, if they're > looking at the partitioned table), where the DDL action on the partition > can be completed. > > ALTER TABLE is normally unable to run in two transactions. I hacked it > (0001) so that the relation can be closed and reopened in the Exec phase > (by having the rel as part of AlteredTableInfo: when ATRewriteCatalogs > returns, it uses that pointer to close the rel). It turns out that this > is sufficient to make that work. This means that ALTER TABLE DETACH > CONCURRENTLY cannot work as part of a multi-command ALTER TABLE, but > that's alreay enforced by the grammar anyway. > > DETACH CONCURRENTLY doesn't work if a default partition exists. It's > just too problematic a case; you would still need to have AEL on the > default partition. > > > I haven't yet experimented with queries running in a standby in tandem > with a detach. > > -- > Álvaro Herrera > -- Best Regards Andy Fan
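For reference, usage of the proposed command would presumably look like the following sketch (table names invented); as described above, the CONCURRENTLY form runs in two transactions and therefore can't be used inside a transaction block, while the existing blocking form stays available, much like CREATE INDEX vs CREATE INDEX CONCURRENTLY:

-- non-blocking form proposed in this patch
ALTER TABLE measurement DETACH PARTITION measurement_y2019 CONCURRENTLY;

-- existing blocking form, still usable inside a transaction block
ALTER TABLE measurement DETACH PARTITION measurement_y2019;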
Re: Cache relation sizes?
On Thu, Dec 24, 2020 at 6:59 AM Thomas Munro wrote:
> On Thu, Dec 17, 2020 at 10:22 PM Andy Fan wrote:
> > Let me try to understand your point. Suppose process 1 extends a file to
> > 2 blocks from 1 block, and fsync is not called, then a). the lseek *may* still
> > return 1 based on the comments in ReadBuffer_common ("because
> > of buggy Linux kernels that sometimes return an lseek SEEK_END result
> > that doesn't account for a recent write"). b). When this patch is introduced,
> > we would get 2 blocks for sure if we can get the size from cache. c). After
> > the user reads the 2 blocks from cache and then the cache is evicted, and the
> > user reads the blocks again, it is possible to get 1 again.
> >
> > Do we need to consider case a) in this patch? And is case c) the situation you
> > want to avoid, and do we have to introduce fsync to achieve this? Basically I
> > think case a) should not happen in practice, so we don't need to handle case
> > c) and finally we don't need the fsync/SR_SYNCING/SR_JUST_DIRTIED flags.
> > I'm not opposed to adding them, I am just interested in the background in case
> > you are also willing to discuss it.
>
> Sorry for the lack of background -- there was some discussion on the
> thread "Optimize dropping of relation buffers using dlist", but it's
> long and mixed up with other topics. I'll try to summarise here.
>
> It's easy to reproduce files shrinking asynchronously on network
> filesystems that reserve space lazily[1],

Thanks for the explanation. After I read that thread, I am more confused now. In [1], we have the test result below.

$ cat magic_shrinking_file.c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main()
{
	int fd;
	char buffer[8192] = {0};

	fd = open("/mnt/test_loopback_remote/dir/file", O_RDWR | O_APPEND);
	if (fd < 0) {
		perror("open");
		return EXIT_FAILURE;
	}
	printf("lseek(..., SEEK_END) = %jd\n", lseek(fd, 0, SEEK_END));
	printf("write(...) = %zd\n", write(fd, buffer, sizeof(buffer)));
	printf("lseek(..., SEEK_END) = %jd\n", lseek(fd, 0, SEEK_END));
	printf("fsync(...) = %d\n", fsync(fd));
	printf("lseek(..., SEEK_END) = %jd\n", lseek(fd, 0, SEEK_END));

	return EXIT_SUCCESS;
}
$ cc magic_shrinking_file.c
$ ./a.out
lseek(..., SEEK_END) = 9670656
write(...) = 8192
lseek(..., SEEK_END) = 9678848
fsync(...) = -1
lseek(..., SEEK_END) = 9670656

I got two pieces of information from the above. a) Before the fsync, lseek(fd, 0, SEEK_END) returns a correct value; however, after the fsync, it returns a wrong value. b) The fsync failed (the return value is -1) in the above case.

I'm more confused because of point a). Since the fsync can't correct the wrong value, what is the point of calling fsync in this patch? (I agree that it is good to fix this reliability problem within this patch, but I'm doubtful that fsync can help in this case.) Am I missing something obvious?

I read your patch in detail some days ago and feel it is in pretty good shape, and I think you understand the existing issues very well, like: a) smgr_alloc_sr uses a FIFO design. b) SR_PARTITION_LOCK doesn't really use a partition lock. c) You seem to have some concern about the concurrency issue, in which I didn't find anything wrong. d) How to handle a DIRTY sr during eviction. I planned to recheck item c) before this reply, but it looks like time does not permit. May I know what your plan is for this patch?

[1] https://www.postgresql.org/message-id/CA%2BhUKGLZTfKuXir9U4JQkY%3Dk3Tb6M_3toGrGOK_fa2d4MPQQ_w%40mail.gmail.com

--
Best Regards
Andy Fan
Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)
Thank you all, friends!

On Fri, Feb 12, 2021 at 9:02 AM David Rowley wrote:
> On Wed, 10 Feb 2021 at 16:18, Andy Fan wrote:
> > v1-0001-Introduce-notnullattrs-field-in-RelOptInfo-to-ind.patch
> >
> > Introduce notnullattrs field in RelOptInfo to indicate which attr are not null
> > in current query. The not null is judged by checking pg_attribute and query's
> > restrictinfo. The info is only maintained at Base RelOptInfo and Partition's
> > RelOptInfo level so far.
> >
> > Any thoughts?
>
> I'm not that happy with what exactly the definition is of
> RelOptInfo.notnullattrs.
>
> The comment for the field says:
> + /* Not null attrs, start from -FirstLowInvalidHeapAttributeNumber */
>
> So you could expect someone to assume that these are a Bitmapset of
> attnums for all columns in the relation marked as NOT NULL. However,
> that's not true since you use find_nonnullable_vars() to chase down
> quals that filter out NULL values and you mark those too.

The comment might be unclear, but the behavior is on purpose. I want to find more cases which can make an attr NOT NULL; all of them are useful for the UniqueKey stuff.

> The reason I don't really like this is that it really depends where
> you want to use RelOptInfo.notnullattrs. If someone wants to use it
> to optimise something before the base quals are evaluated then they
> might be unhappy that they found some NULLs.

Do you mean the notnullattrs is not set correctly before the base quals are evaluated? I think we have lots of data structures which are set just after some stage, but notnullattrs is special because it is set at more than one stage. However, I doubt that is unacceptable; some fields even change their meaning at different stages, like Var->varno. If users misunderstand it, they will probably find out at the testing stage.

> I think you either need to explain in detail what the field means or
> separate out the two meanings somehow.

Agreed. Besides the fact that the not-null information comes from 2 places (metadata and quals), it also means it is based on the relation rather than the RelTarget. For example: A is not null, but in "SELECT return_null_udf(A) FROM t", return_null_udf(A) can be NULL. I think this is not well documented either. How about just changing the comment to:

1. /* Not null attrs, start from -FirstLowInvalidHeapAttributeNumber, the NOT NULL
    * comes from pg_attribute and quals at different planning stages. */

or

2. /* Not null attrs, start from -FirstLowInvalidHeapAttributeNumber, the NOT NULL
    * comes from pg_attribute and quals at different planning stages. And it just means
    * the base attr rather than RelOptInfo->reltarget. */

I don't like separating them into 2 fields because it may also make the usage a bit harder.

--
Best Regards
Andy Fan (https://www.aliyun.com/)
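To make the documented convention concrete, a tiny sketch of how a caller would consult the field is below. This is not part of the patch: the helper name is invented, and it assumes the proposed RelOptInfo.notnullattrs field exists.

/* Hypothetical helper: is attno of this base relation known NOT NULL? */
static bool
attr_is_known_not_null(RelOptInfo *rel, AttrNumber attno)
{
	/* attnums are stored offset by -FirstLowInvalidHeapAttributeNumber, per the comment */
	return bms_is_member(attno - FirstLowInvalidHeapAttributeNumber,
						 rel->notnullattrs);
}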
Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)
On Thu, Feb 11, 2021 at 9:09 PM Ashutosh Bapat wrote: > Can this information be part of PathTarget structure and hence part of > RelOptInfo::reltarget, so that it can be extended to join, group and > other kinds of RelOptInfo in future? I think you want to expand this field in a more generic way. For example: SELECT udf(a) FROM t WHERE a is not null; In current implementation, I only knows a is not null, nothing about if udf(a) is null or not. And we can't present anything for joinrel as well since it is just attno. At the same time, looks we can't tell if UDF(A) is null even if the UDF is strict and A is not null? > In fact, it might be easy to do that in this patch itself. > Actually I can't think out the method:) > On Wed, Feb 10, 2021 at 8:57 AM Andy Fan wrote: > > > > > > On Wed, Feb 10, 2021 at 11:18 AM Andy Fan > wrote: > >> > >> Hi: > >> > >> This patch is the first patch in UniqueKey patch series[1], since I > need to revise > >> that series many times but the first one would be not that often, so > I'd like to > >> submit this one for review first so that I don't need to maintain it > again and again. > >> > >> v1-0001-Introduce-notnullattrs-field-in-RelOptInfo-to-ind.patch > >> > >> Introduce notnullattrs field in RelOptInfo to indicate which attr are > not null > >> in current query. The not null is judged by checking pg_attribute and > query's > >> restrictinfo. The info is only maintained at Base RelOptInfo and > Partition's > >> RelOptInfo level so far. > >> > >> > >> Any thoughts? > >> > >> [1] > https://www.postgresql.org/message-id/CAKU4AWr1BmbQB4F7j22G%2BNS4dNuem6dKaUf%2B1BK8me61uBgqqg%40mail.gmail.com > >> -- > >> Best Regards > >> Andy Fan (https://www.aliyun.com/) > > > > > > Add the missed patch.. > > > > -- > > Best Regards > > Andy Fan (https://www.aliyun.com/) > > > > -- > Best Wishes, > Ashutosh Bapat > -- Best Regards Andy Fan (https://www.aliyun.com/)
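To illustrate the strictness point above with a made-up function: STRICT only guarantees a NULL result for NULL input, so a strict UDF can still return NULL for a non-NULL input, which is why strictness alone cannot mark udf(a) as not-null.

-- hypothetical strict function that still returns NULL for some non-NULL inputs
CREATE FUNCTION maybe_null(i int) RETURNS int
    LANGUAGE sql STRICT
    AS $$ SELECT CASE WHEN i % 2 = 0 THEN i ELSE NULL END $$;

SELECT maybe_null(a) FROM t WHERE a IS NOT NULL;  -- can still produce NULLs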
Re: How to get Relation tuples in C function
On Sun, Feb 14, 2021 at 5:23 AM Tom Lane wrote:
> Patrick Handja writes:
> > I would like to know if there is a better way to pass a relation or the
> > relation name (CString) as a parameter in a C function and thus be able to
> > manipulate its tuples. The documentation is available here:
> > https://www.postgresql.org/docs/13/xfunc-c.html#id-1.8.3.13.11. But it is
> > not quite clear enough on how to retrieve tuples.
>
> The thing I'd recommend you do is use SPI [1], which lets you execute
> SQL queries from inside a C function. If you don't want to do that
> for whatever reason, you need to open the relation, set up a scan,
> and fetch tuples from the scan, relying on low-level APIs that tend
> to change from version to version. contrib/pageinspect or
> contrib/pgstattuple might offer usable sample code, although with any
> prototype you might look at, it's going to be hard to see the forest
> for the trees.
>
> regards, tom lane
>
> [1] https://www.postgresql.org/docs/current/spi.html

Thank you Tom for the reply. What would be the difference between using SPI and writing a pure SQL UDF and then calling it with DirectFunctionCall1? I just ran into a similar situation some days ago. Currently I think DirectFunctionCall1 doesn't need to maintain a connection, but SPI has to do that.

--
Best Regards
Andy Fan (https://www.aliyun.com/)
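For readers landing on this question, a minimal sketch of the SPI route Tom recommends might look like the following; the table name (my_table) and the function name are placeholders, not anything from this thread:

#include "postgres.h"
#include "executor/spi.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(my_table_count);

Datum
my_table_count(PG_FUNCTION_ARGS)
{
	int64		result = 0;

	/* SPI use must be wrapped in an explicit connect/finish pair */
	if (SPI_connect() != SPI_OK_CONNECT)
		elog(ERROR, "SPI_connect failed");

	if (SPI_exec("SELECT * FROM my_table", 0) == SPI_OK_SELECT)
		result = (int64) SPI_processed;		/* tuples fetched into SPI_tuptable */

	SPI_finish();

	PG_RETURN_INT64(result);
}

The matching SQL declaration would be along the lines of CREATE FUNCTION my_table_count() RETURNS bigint AS 'MODULE_PATHNAME', 'my_table_count' LANGUAGE C STRICT. DirectFunctionCall1, by contrast, only invokes an already-compiled C function on a Datum and never runs SQL, so it does not need the SPI connect/finish bookkeeping discussed above.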
Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)
On Tue, Feb 16, 2021 at 12:01 PM David Rowley wrote: > On Fri, 12 Feb 2021 at 15:18, Andy Fan wrote: > > > > On Fri, Feb 12, 2021 at 9:02 AM David Rowley > wrote: > >> The reason I don't really like this is that it really depends where > >> you want to use RelOptInfo.notnullattrs. If someone wants to use it > >> to optimise something before the base quals are evaluated then they > >> might be unhappy that they found some NULLs. > >> > > > > Do you mean the notnullattrs is not set correctly before the base quals > are > > evaluated? I think we have lots of data structures which are set just > after some > > stage. but notnullattrs is special because it is set at more than 1 > stage. However > > I'm doubtful it is unacceptable, Some fields ever change their meaning > at different > > stages like Var->varno. If a user has a misunderstanding on it, it > probably will find it > > at the testing stage. > > You're maybe focusing too much on your use case for notnullattrs. It > only cares about NULLs in the result for each query level. > > thinks of an example... > > OK, let's say I decided that COUNT(*) is faster than COUNT(id) so > decided that I might like to write a patch which rewrite the query to > use COUNT(*) when it was certain that "id" could not contain NULLs. > > The query is: > > SELECT p.partid, p.partdesc,COUNT(s.saleid) FROM part p LEFT OUTER > JOIN sales s ON p.partid = s.partid GROUP BY p.partid; > > sale.saleid is marked as NOT NULL in pg_attribute. As the writer of > the patch, I checked the comment for notnullattrs and it says "Not > null attrs, start from -FirstLowInvalidHeapAttributeNumber", so I > should be ok to assume since sales.saleid is marked in notnullattrs > that I can rewrite the query?! > > The documentation about the RelOptInfo.notnullattrs needs to be clear > what exactly it means. I'm not saying your representation of how to > record NOT NULL in incorrect. I'm saying that you need to be clear > what exactly is being recorded in that field. > > If you want it to mean "attribute marked here cannot output NULL > values at this query level", then you should say something along those > lines. > I think I get what you mean. You are saying notnullattrs is only correct at the *current* stage, namely set_rel_size. It would not be true after that, but the data is still there. That would cause some confusion. I admit that is something I didn't realize before. I checked other fields of RelOptInfo, looks no one filed works like this, so I am not really happy with this design now. I'm OK with saying more things along these lines. That can be done we all understand each other well. Any better design is welcome as well. I think the "Var represents null stuff" is good, until I see your comments below. > However, having said that, because this is a Bitmapset of > pg_attribute.attnums, it's only possible to record Vars from base > relations. It does not seem like you have any means to record > attributes that are normally NULLable, but cannot produce NULL values > due to a strict join qual. > > e.g: SELECT t.nullable FROM t INNER JOIN j ON t.nullable = j.something; > > I'd expect the RelOptInfo for t not to contain a bit for the > "nullable" column, but there's no way to record the fact that the join > RelOptInfo for {t,j} cannot produce a NULL for that column. It might > be quite useful to know that for the UniqueKeys patch. > > The current patch can detect t.nullable is not null correctly. That is done by find_nonnullable_vars(qual) and deconstruct_recure stage. 
> I know there's another discussion here between Ashutosh and Tom about > PathTarget's and Vars. I had the Var idea too once myself [1] but it > was quickly shot down. Tom's reasoning there in [1] seems legit. I > guess we'd need some sort of planner version of Var and never confuse > it with the Parse version of Var. That sounds like quite a big > project which would have quite a lot of code churn. I'm not sure how > acceptable it would be to have Var represent both these things. It > gets complex when you do equal(var1, var2) and expect that to return > true when everything matches apart from the notnull field. We > currently have this issue with the "location" field and we even have a > special macro which just ignores those in equalfuncs.c. I imagine not > many people would like to expand that to other fields. > > Thanks for sharing this. > It would be good to agree on the correct representation for Vars that > cannot produce NULLs so that we don't
Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)
On Tue, Feb 16, 2021 at 10:03 PM Andy Fan wrote: > > > On Tue, Feb 16, 2021 at 12:01 PM David Rowley > wrote: > >> On Fri, 12 Feb 2021 at 15:18, Andy Fan wrote: >> > >> > On Fri, Feb 12, 2021 at 9:02 AM David Rowley >> wrote: >> >> The reason I don't really like this is that it really depends where >> >> you want to use RelOptInfo.notnullattrs. If someone wants to use it >> >> to optimise something before the base quals are evaluated then they >> >> might be unhappy that they found some NULLs. >> >> >> > >> > Do you mean the notnullattrs is not set correctly before the base quals >> are >> > evaluated? I think we have lots of data structures which are set just >> after some >> > stage. but notnullattrs is special because it is set at more than 1 >> stage. However >> > I'm doubtful it is unacceptable, Some fields ever change their meaning >> at different >> > stages like Var->varno. If a user has a misunderstanding on it, it >> probably will find it >> > at the testing stage. >> >> You're maybe focusing too much on your use case for notnullattrs. It >> only cares about NULLs in the result for each query level. >> >> thinks of an example... >> >> OK, let's say I decided that COUNT(*) is faster than COUNT(id) so >> decided that I might like to write a patch which rewrite the query to >> use COUNT(*) when it was certain that "id" could not contain NULLs. >> >> The query is: >> >> SELECT p.partid, p.partdesc,COUNT(s.saleid) FROM part p LEFT OUTER >> JOIN sales s ON p.partid = s.partid GROUP BY p.partid; >> >> sale.saleid is marked as NOT NULL in pg_attribute. As the writer of >> the patch, I checked the comment for notnullattrs and it says "Not >> null attrs, start from -FirstLowInvalidHeapAttributeNumber", so I >> should be ok to assume since sales.saleid is marked in notnullattrs >> that I can rewrite the query?! >> >> The documentation about the RelOptInfo.notnullattrs needs to be clear >> what exactly it means. I'm not saying your representation of how to >> record NOT NULL in incorrect. I'm saying that you need to be clear >> what exactly is being recorded in that field. >> >> If you want it to mean "attribute marked here cannot output NULL >> values at this query level", then you should say something along those >> lines. >> > > I think I get what you mean. You are saying notnullattrs is only correct > at the *current* stage, namely set_rel_size. It would not be true after > that, but the data is still there. That would cause some confusion. I > admit > that is something I didn't realize before. I checked other fields of > RelOptInfo, > looks no one filed works like this, so I am not really happy with this > design > now. I'm OK with saying more things along these lines. That can be done > we all understand each other well. Any better design is welcome as well. > I think the "Var represents null stuff" is good, until I see your > comments below. > > > >> However, having said that, because this is a Bitmapset of >> pg_attribute.attnums, it's only possible to record Vars from base >> relations. It does not seem like you have any means to record >> attributes that are normally NULLable, but cannot produce NULL values >> due to a strict join qual. >> >> e.g: SELECT t.nullable FROM t INNER JOIN j ON t.nullable = j.something; >> >> I'd expect the RelOptInfo for t not to contain a bit for the >> "nullable" column, but there's no way to record the fact that the join >> RelOptInfo for {t,j} cannot produce a NULL for that column. It might >> be quite useful to know that for the UniqueKeys patch. 
>> >> > The current patch can detect t.nullable is not null correctly. That > is done by find_nonnullable_vars(qual) and deconstruct_recure stage. > > >> I know there's another discussion here between Ashutosh and Tom about >> PathTarget's and Vars. I had the Var idea too once myself [1] but it >> was quickly shot down. Tom's reasoning there in [1] seems legit. I >> guess we'd need some sort of planner version of Var and never confuse >> it with the Parse version of Var. That sounds like quite a big >> project which would have quite a lot of code churn. I'm not sure how >> acceptable it would be to have Var represent both these things. It >> gets complex when you do equal(var1, var2) and expect that
Re: How to get Relation tuples in C function
On Sun, Feb 14, 2021 at 7:56 PM Michael Paquier wrote: > On Sun, Feb 14, 2021 at 09:29:08AM +0800, Andy Fan wrote: > > Thank you tom for the reply. What would be the difference between the > > SPI and "write a pure SQL UDF" and call it with DirectFunctionCall1? I > > just ran into a similar situation some days before. Currently I think > > DirectFunctionCall1 doesn't need to maintain a connection but SPI has to > > do that. > > Hard to say without knowing your use case. A PL function is more > simple to maintain than a C function, though usually less performant > from the pure point of view of its operations. A SQL function could > finish by being inlined, allowing the planner to apply optimizations > as it would know the function body. Going with SPI has the advantage > to have code able to work without any changes across major versions, > which is a no-brainer when it comes to long-term maintenance. > -- > Michael > Thank you Michael for the response. -- Best Regards Andy Fan (https://www.aliyun.com/)
Re: Implementing Incremental View Maintenance
On Tue, Feb 16, 2021 at 9:33 AM Yugo NAGATA wrote:
> Hi,
>
> Attached is a rebased patch (v22a).

Thanks for the patch. Do you think it would be helpful to mention, when posting a patch, the latest commit it was generated against? If so, when others want to review it, they would usually know which commit to apply the patch on without asking for a new rebase.

--
Best Regards
Andy Fan (https://www.aliyun.com/)
Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)
On Tue, Feb 16, 2021 at 12:01 PM David Rowley wrote: > On Fri, 12 Feb 2021 at 15:18, Andy Fan wrote: > > > > On Fri, Feb 12, 2021 at 9:02 AM David Rowley > wrote: > >> The reason I don't really like this is that it really depends where > >> you want to use RelOptInfo.notnullattrs. If someone wants to use it > >> to optimise something before the base quals are evaluated then they > >> might be unhappy that they found some NULLs. > >> > > > > Do you mean the notnullattrs is not set correctly before the base quals > are > > evaluated? I think we have lots of data structures which are set just > after some > > stage. but notnullattrs is special because it is set at more than 1 > stage. However > > I'm doubtful it is unacceptable, Some fields ever change their meaning > at different > > stages like Var->varno. If a user has a misunderstanding on it, it > probably will find it > > at the testing stage. > > You're maybe focusing too much on your use case for notnullattrs. It > only cares about NULLs in the result for each query level. > > thinks of an example... > > OK, let's say I decided that COUNT(*) is faster than COUNT(id) so > decided that I might like to write a patch which rewrite the query to > use COUNT(*) when it was certain that "id" could not contain NULLs. > > The query is: > > SELECT p.partid, p.partdesc,COUNT(s.saleid) FROM part p LEFT OUTER > JOIN sales s ON p.partid = s.partid GROUP BY p.partid; > > sale.saleid is marked as NOT NULL in pg_attribute. As the writer of > the patch, I checked the comment for notnullattrs and it says "Not > null attrs, start from -FirstLowInvalidHeapAttributeNumber", so I > should be ok to assume since sales.saleid is marked in notnullattrs > that I can rewrite the query?! > > The documentation about the RelOptInfo.notnullattrs needs to be clear > what exactly it means. I'm not saying your representation of how to > record NOT NULL in incorrect. I'm saying that you need to be clear > what exactly is being recorded in that field. > > If you want it to mean "attribute marked here cannot output NULL > values at this query level", then you should say something along those > lines. > > However, having said that, because this is a Bitmapset of > pg_attribute.attnums, it's only possible to record Vars from base > relations. It does not seem like you have any means to record > attributes that are normally NULLable, but cannot produce NULL values > due to a strict join qual. > > e.g: SELECT t.nullable FROM t INNER JOIN j ON t.nullable = j.something; > > I'd expect the RelOptInfo for t not to contain a bit for the > "nullable" column, but there's no way to record the fact that the join > RelOptInfo for {t,j} cannot produce a NULL for that column. It might > be quite useful to know that for the UniqueKeys patch. > I checked again and found I do miss the check on JoinExpr->quals. I have fixed it in v3 patch. Thanks for the review! In the attached v3, commit 1 is the real patch, and commit 2 is just add some logs to help local testing. notnull.sql/notnull.out is the test case for this patch, both commit 2 and notnull.* are not intended to be committed at last. Besides the above fix in v3, I changed the comments alongs the notnullattrs as below and added a true positive helper function is_var_nullable. + /* +* Not null attrs, the values are calculated by looking into pg_attribute and quals +* However both cases are not reliable in some outer join cases. 
So when +* we want to check if a Var is nullable, the function is_var_nullable is a good +* place to start with, as it is designed to be true-positive. +*/ + Bitmapset *notnullattrs; I also found that separating the two meanings is unnecessary, since both of them are unreliable in the outer join case and we just want to know which attrs are nullable, no matter how we know it. The example below shows why not-null-by-qual is not reliable either (at least with the current implementation): create table n1(a int, b int not null); create table n2(a int, b int not null); create table n3(a int, b int not null); select * from n1 left join n2 on n1.a = n2.a full join n3 on n2.a = n3.a; In this case, when we check (n1 left join n2 on n1.a = n2.a), we know n1.a is not nullable. However, after the full join with n3, that changes. > I know there's another discussion here between Ashutosh and Tom about > PathTarget's and Vars. I had the Var idea too once myself [1] but it > was quickly shot down. Tom's reasoning there in [1] seems legit. I > guess we'd need some sort of planner version of Var and never confuse i
Re: Extend more usecase for planning time partition pruning and init partition pruning.
On Mon, Feb 8, 2021 at 3:43 PM Andy Fan wrote: > > > On Mon, Jan 25, 2021 at 10:21 AM Andy Fan > wrote: > >> >> >> On Sun, Jan 24, 2021 at 6:34 PM Andy Fan >> wrote: >> >>> Hi: >>> >>> I recently found a use case like this. SELECT * FROM p, q WHERE >>> p.partkey = >>> q.colx AND (q.colx = $1 OR q.colx = $2); Then we can't do either >>> planning time >>> partition prune or init partition prune. Even though we have run-time >>> partition pruning work at last, it is too late in some cases since we >>> have >>> to init all the plan nodes in advance. In my case, there are 10+ >>> partitioned relation in one query and the execution time is short, so >>> the >>> init plan a lot of plan nodes cares a lot. >>> >>> The attached patches fix this issue. It just get the "p.partkey = q.colx" >>> case in root->eq_classes or rel->joinlist (outer join), and then check >>> if there >>> is some baserestrictinfo in another relation which can be used for >>> partition >>> pruning. To make the things easier, both partkey and colx must be Var >>> expression in implementation. >>> >>> - v1-0001-Make-some-static-functions-as-extern-and-extend-C.patch >>> >>> Just some existing refactoring and extending ChangeVarNodes to be able >>> to change var->attno. >>> >>> - v1-0002-Build-some-implied-pruning-quals-to-extend-the-us.patch >>> >>> Do the real job. >>> >>> Thought? >>> >>> >>> >>> -- >>> Best Regards >>> Andy Fan (https://www.aliyun.com/) >>> >> >> >> Some results from this patch. >> >> create table p (a int, b int, c character varying(8)) partition by >> list(c); >> create table p1 partition of p for values in ('01'); >> create table p2 partition of p for values in ('02'); >> create table p3 partition of p for values in ('03'); >> create table q (a int, c character varying(8), b int) partition by >> list(c); >> create table q1 partition of q for values in ('01'); >> create table q2 partition of q for values in ('02'); >> create table q3 partition of q for values in ('03'); >> >> Before the patch: >> postgres=# explain (costs off) select * from p inner join q on p.c = q.c >> and q.c > '02'; >> QUERY PLAN >> >> Hash Join >>Hash Cond: ((p.c)::text = (q.c)::text) >>-> Append >> -> Seq Scan on p1 p_1 >> -> Seq Scan on p2 p_2 >> -> Seq Scan on p3 p_3 >>-> Hash >> -> Seq Scan on q3 q >>Filter: ((c)::text > '02'::text) >> (9 rows) >> >> After the patch: >> >> QUERY PLAN >> >> Hash Join >>Hash Cond: ((p.c)::text = (q.c)::text) >>-> Seq Scan on p3 p >>-> Hash >> -> Seq Scan on q3 q >>Filter: ((c)::text > '02'::text) >> (6 rows) >> >> >> Before the patch: >> postgres=# explain (costs off) select * from p inner join q on p.c = q.c >> and (q.c = '02' or q.c = '01'); >> QUERY PLAN >> >> >> Hash Join >>Hash Cond: ((p.c)::text = (q.c)::text) >>-> Append >> -> Seq Scan on p1 p_1 >> -> Seq Scan on p2 p_2 >> -> Seq Scan on p3 p_3 >>-> Hash >> -> Append >>-> Seq Scan on q1 q_1 >> Filter: (((c)::text = '02'::text) OR ((c)::text >> = '01'::text)) >>-> Seq Scan on q2 q_2 >> Filter: (((c)::text = '02'::text) OR ((c)::text >> = '01'::text)) >> (12 rows) >> >> After the patch: >> QUERY PLAN >> >> >> Hash Join >>Hash Cond: ((p.c)::text = (q.c)::text) >>
UniqueKey on Partitioned table.
Suppose we have a partitioned table defined as below: P(a, b, c, d) partition by (a) p1 (a=1) p2 (a=2) p3 (a=3) Since in PG we can define different indexes on different partitions, each child may have different UniqueKeys, and some of them might be invalidated at the parent's level. For example: 1). We only define a unique index p1(c), so (c) would be a UniqueKey of p1 only, and it is invalid at the appendrel level. 2). We define a unique index p_n(c) on each child rel, so every child rel has UniqueKey (c). However, it is invalid on the appendrel as well. 3). We define a unique index p_n(a, c); since a is the partition key, (a, c) is valid for both the child rels and the parent rel. In my earlier v1 implementation[1], I maintained the child rels exactly the same way as a non-partitioned table. But when calculating the UniqueKey for the partitioned table, I first introduced a global_unique_indexlist which handles the above 3 cases. The indexes for case 1 and case 2 will not be in global_unique_indexlist, but the indexes in case 3 will be, even though they are only built at the child level. After we have built the global_unique_indexlist on the appendrel, we build the UniqueKey exactly the same way as for a non-partitioned table. The reasons I'm not happy with that method now are: 1). the global_unique_indexlist is built in a hard way. 2). I have to totally ignore the UniqueKey at the child level and re-compute it at the appendrel level. 3). The 3 cases should rarely happen in real life, I guess. When I re-implemented the UniqueKey with EquivalenceClass, I re-thought how to handle the above. Now my preferred idea is just not to handle it. When building the UniqueKey on the parent rel, we handle just 2 cases: 1). If the appendrel only has 1 child, we just copy (and modify if needed, due to the col-order-mismatch case) the UniqueKey. 2). Only handle the unique indexes defined at the top level, which yields the situation below. create unique index on p(a, b); --> (A, B) will be the UniqueKey of p. create unique index on p_nn(a, b); --> (A, B) will not be the UniqueKey of p, even if we create it on ALL the child rels. The result is not perfect but I think it is practical. Any suggestions? The attached is a UniqueKey with EquivalenceClass patch; I have only completed the single relation part and it may have bugs. I attach it here for design review only, and the not-null-attrs patch is just v1, which we can continue discussing on the original thread[2]. [1] https://www.postgresql.org/message-id/CAKU4AWr1BmbQB4F7j22G%2BNS4dNuem6dKaUf%2B1BK8me61uBgqqg%40mail.gmail.com [2] https://www.postgresql.org/message-id/flat/caku4awpqjaqjwq2x-ar9g3+zhrzu1k8hnp7a+_mluov-n5a...@mail.gmail.com -- Best Regards Andy Fan (https://www.aliyun.com/) v1-0001-Introduce-notnullattrs-field-in-RelOptInfo-to-ind.patch Description: Binary data v1-0002-UniqueKey-with-EquivalenceClass-for-single-rel-on.patch Description: Binary data
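[Editor's note] For illustration, here is a minimal SQL sketch of the three index layouts described above. The table and index names are hypothetical, and each case uses a different column only so that all three can coexist in one schema (the original description uses column c throughout):

create table p (a int, b int, c int, d int) partition by list (a);
create table p1 partition of p for values in (1);
create table p2 partition of p for values in (2);
create table p3 partition of p for values in (3);

-- case 1: a unique index on a single child only;
-- (b) is a UniqueKey of p1, but not of the appendrel p
create unique index on p1 (b);

-- case 2: the "same" unique index created on every child individually;
-- (c) is a UniqueKey of each child, but still not of p, because the same
-- c value may appear in different partitions
create unique index on p1 (c);
create unique index on p2 (c);
create unique index on p3 (c);

-- case 3: a unique index that contains the partition key;
-- (a, d) is a UniqueKey of each child and of p as a whole
create unique index on p1 (a, d);
create unique index on p2 (a, d);
create unique index on p3 (a, d);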
Re: Hybrid Hash/Nested Loop joins and caching results from subplans
l_cost < 20) + return NULL; > I used imdbpy2sql.py to parse the imdb database files and load the > data into PostgreSQL. This seemed to work okay apart from the > movie_info_idx table appeared to be missing. Many of the 113 join > order benchmark queries need this table. I followed the steps in [1] and changed a few things with the attached patch. In the end I got 2367725 rows. But you are probably running into a different problem, since none of my changes touch the movie_info_idx table. [1] https://github.com/gregrahn/join-order-benchmark -- Best Regards Andy Fan (https://www.aliyun.com/) 0001-fix.patch Description: Binary data
Re: Extend more usecase for planning time partition pruning and init partition pruning.
On Fri, Feb 19, 2021 at 6:03 PM Andy Fan wrote: > > > On Mon, Feb 8, 2021 at 3:43 PM Andy Fan wrote: > >> >> >> On Mon, Jan 25, 2021 at 10:21 AM Andy Fan >> wrote: >> >>> >>> >>> On Sun, Jan 24, 2021 at 6:34 PM Andy Fan >>> wrote: >>> >>>> Hi: >>>> >>>> I recently found a use case like this. SELECT * FROM p, q WHERE >>>> p.partkey = >>>> q.colx AND (q.colx = $1 OR q.colx = $2); Then we can't do either >>>> planning time >>>> partition prune or init partition prune. Even though we have run-time >>>> partition pruning work at last, it is too late in some cases since we >>>> have >>>> to init all the plan nodes in advance. In my case, there are 10+ >>>> partitioned relation in one query and the execution time is short, so >>>> the >>>> init plan a lot of plan nodes cares a lot. >>>> >>>> The attached patches fix this issue. It just get the "p.partkey = >>>> q.colx" >>>> case in root->eq_classes or rel->joinlist (outer join), and then check >>>> if there >>>> is some baserestrictinfo in another relation which can be used for >>>> partition >>>> pruning. To make the things easier, both partkey and colx must be Var >>>> expression in implementation. >>>> >>>> - v1-0001-Make-some-static-functions-as-extern-and-extend-C.patch >>>> >>>> Just some existing refactoring and extending ChangeVarNodes to be able >>>> to change var->attno. >>>> >>>> - v1-0002-Build-some-implied-pruning-quals-to-extend-the-us.patch >>>> >>>> Do the real job. >>>> >>>> Thought? >>>> >>>> >>>> >>>> -- >>>> Best Regards >>>> Andy Fan (https://www.aliyun.com/) >>>> >>> >>> >>> Some results from this patch. >>> >>> create table p (a int, b int, c character varying(8)) partition by >>> list(c); >>> create table p1 partition of p for values in ('01'); >>> create table p2 partition of p for values in ('02'); >>> create table p3 partition of p for values in ('03'); >>> create table q (a int, c character varying(8), b int) partition by >>> list(c); >>> create table q1 partition of q for values in ('01'); >>> create table q2 partition of q for values in ('02'); >>> create table q3 partition of q for values in ('03'); >>> >>> Before the patch: >>> postgres=# explain (costs off) select * from p inner join q on p.c = q.c >>> and q.c > '02'; >>> QUERY PLAN >>> >>> Hash Join >>>Hash Cond: ((p.c)::text = (q.c)::text) >>>-> Append >>> -> Seq Scan on p1 p_1 >>> -> Seq Scan on p2 p_2 >>> -> Seq Scan on p3 p_3 >>>-> Hash >>> -> Seq Scan on q3 q >>>Filter: ((c)::text > '02'::text) >>> (9 rows) >>> >>> After the patch: >>> >>> QUERY PLAN >>> >>> Hash Join >>>Hash Cond: ((p.c)::text = (q.c)::text) >>>-> Seq Scan on p3 p >>>-> Hash >>> -> Seq Scan on q3 q >>>Filter: ((c)::text > '02'::text) >>> (6 rows) >>> >>> >>> Before the patch: >>> postgres=# explain (costs off) select * from p inner join q on p.c = q.c >>> and (q.c = '02' or q.c = '01'); >>> QUERY PLAN >>> >>> >>> Hash Join >>>Hash Cond: ((p.c)::text = (q.c)::text) >>>-> Append >>> -> Seq Scan on p1 p_1 >>> -> Seq Scan on p2 p_2 >>> -> Seq Scan on p3 p_3 >>>-> Hash >>> -> Append >>>-> Seq Scan on q1 q_1 >>> Filter: (((c)::text = '02'::text) OR ((c)::text >>> = '00
Re: Hybrid Hash/Nested Loop joins and caching results from subplans
On Mon, Feb 22, 2021 at 9:21 AM Justin Pryzby wrote: > On Tue, Feb 16, 2021 at 11:15:51PM +1300, David Rowley wrote: > > To summarise here, the planner performance gets a fair bit worse with > > the patched code. With master, summing the average planning time over > > each of the queries resulted in a total planning time of 765.7 ms. > > After patching, that went up to 1097.5 ms. I was pretty disappointed > > about that. > > I have a couple ideas; > > - default enable_resultcache=off seems okay. In plenty of cases, planning >time is unimportant. This is the "low bar" - if we can do better and > enable >it by default, that's great. > > - Maybe this should be integrated into nestloop rather than being a > separate >plan node. That means that it could be dynamically enabled during >execution, maybe after a few loops or after checking that there's at > least >some minimal number of repeated keys and cache hits. cost_nestloop > would >consider whether to use a result cache or not, and explain would show > the >cache stats as a part of nested loop. +1 for this idea now.. I am always confused why there is no such node in Oracle even if it is so aggressive to do performance improvement and this function looks very promising. After realizing the costs in planner, I think planning time might be an answer (BTW, I am still not sure Oracle did this). In this case, I propose there'd still >be a GUC to disable it. > > - Maybe cost_resultcache() can be split into initial_cost and final_cost >parts, same as for nestloop ? I'm not sure how it'd work, since >initial_cost is supposed to return a lower bound, and resultcache tries > to >make things cheaper. initial_cost would just add some operator/tuple > costs >to make sure that resultcache of a unique scan is more expensive than >nestloop alone. estimate_num_groups is at least O(n) WRT >rcpath->param_exprs, so maybe you charge 100*list_length(param_exprs) * >cpu_operator_cost in initial_cost and then call estimate_num_groups in >final_cost. We'd be estimating the cost of estimating the cost... > > - Maybe an initial implementation of this would only add a result cache > if the >best plan was already going to use a nested loop, even though a cached >nested loop might be cheaper than other plans. This would avoid most >planner costs, and give improved performance at execution time, but > leaves >something "on the table" for the future. > > > +cost_resultcache_rescan(PlannerInfo *root, ResultCachePath *rcpath, > > + Cost *rescan_startup_cost, Cost *rescan_total_cost) > > +{ > > + double tuples = rcpath->subpath->rows; > > + double calls = rcpath->calls; > ... > > + /* estimate on the distinct number of parameter values */ > > + ndistinct = estimate_num_groups(root, rcpath->param_exprs, calls, > NULL, > > + &estinfo); > > Shouldn't this pass "tuples" and not "calls" ? > > -- > Justin > -- Best Regards Andy Fan (https://www.aliyun.com/)
pg_temp_%d namespace creation can invalidate all the cached plan in other backends
Planning is expensive and we use plancache to bypass its effect. I find the $subject recently which is caused by we register NAMESPACEOID invalidation message for pg_temp_%s as well as other normal namespaces. Is it a must? We can demo the issue with the below case: Sess1: create table t (a int); prepare s as select * from t; postgres=# execute s; INFO: There is no cached plan now a --- (0 rows) postgres=# execute s; -- The plan is cached. a --- (0 rows) Sess2: create temp table m (a int); Sess1: postgres=# execute s; -- The cached plan is reset. INFO: There is no cached plan now a --- (0 rows) What I want to do now is bypass the invalidation message totally if it is a pg_temp_%d namespace. (RELATION_IS_OTHER_TEMP). With this change, the impact is not only the plan cache is not reset but also all the other stuff in SysCacheInvalidate/CallSyscacheCallbacks will not be called (for pg_temp_%d change only). I think pg_temp_%d is not meaningful for others, so I think the bypassing is OK. I still have not kicked off any coding so far, I want to know if it is a correct thing to do? -- Best Regards Andy Fan (https://www.aliyun.com/)
Re: pg_temp_%d namespace creation can invalidate all the cached plan in other backends
On Tue, Feb 23, 2021 at 12:07 PM Andy Fan wrote: > Planning is expensive and we use plancache to bypass its effect. I find the > $subject recently which is caused by we register NAMESPACEOID invalidation > message for pg_temp_%s as well as other normal namespaces. Is it a > must? > > We can demo the issue with the below case: > > Sess1: > create table t (a int); > prepare s as select * from t; > postgres=# execute s; > INFO: There is no cached plan now > a > --- > (0 rows) > > postgres=# execute s; -- The plan is cached. > a > --- > (0 rows) > > > Sess2: > create temp table m (a int); > > Sess1: > > postgres=# execute s; -- The cached plan is reset. > INFO: There is no cached plan now > a > --- > (0 rows) > > > What I want to do now is bypass the invalidation message totally if it is > a pg_temp_%d > namespace. (RELATION_IS_OTHER_TEMP). > Please ignore the word "RELATION_IS_OTHER_TEMP", it is pasted here by accident.. > With this change, the impact is not only > the plan cache is not reset but also all the other stuff in > SysCacheInvalidate/CallSyscacheCallbacks will not be called (for > pg_temp_%d change > only). I think pg_temp_%d is not meaningful for others, so I think the > bypassing is OK. > I still have not kicked off any coding so far, I want to know if it is a > correct thing to do? > > -- > Best Regards > Andy Fan (https://www.aliyun.com/) > -- Best Regards Andy Fan (https://www.aliyun.com/)
Re: pg_temp_%d namespace creation can invalidate all the cached plan in other backends
On Tue, Feb 23, 2021 at 1:50 PM Tom Lane wrote: > Andy Fan writes: > > Planning is expensive and we use plancache to bypass its effect. I find > the > > $subject recently which is caused by we register NAMESPACEOID > invalidation > > message for pg_temp_%s as well as other normal namespaces. Is it a > > must? > > Since we don't normally delete those namespaces once they exist, > the number of such events is negligible over the life of a database > (at least in production scenarios). I did miss this part in my test. Thanks for pointing it out. > I'm having a very hard time > getting excited about spending effort here. > While I admit this should happen rarely in production, I still think we may need to fix it. It is a kind of tech debt. For example, suppose my application has a latency spike at time xx:yy:zz (assume it does happen, even if rarely): very few DBAs could find the reason easily, and even if one did, it would still be hard to make others understand and be convinced. So why shouldn't we just avoid it if the effort is not huge? (I did hit this in a production case, where the incident started from this invalidation message and crashed in ResetPlanCache. I'm using a modified version, so the crash is probably not the community version's fault, and we will fix it separately.) > Also, you can't just drop the inval event, because even if you > believe it's irrelevant to other backends (a questionable > assumption), it certainly is relevant locally. > Thanks for this hint! Would finding a place to run SysCacheInvalidate/CallSyscacheCallbacks locally be enough to fix this issue? -- Best Regards Andy Fan (https://www.aliyun.com/)
Re: Make Append Cost aware of some run time partition prune case
Hi Ryan: On Thu, Mar 4, 2021 at 8:14 AM Ryan Lambert wrote: > On Mon, Nov 9, 2020 at 5:44 PM Andy Fan wrote: > >> Currently the cost model of append path sums the cost/rows for all the >> subpaths, it usually works well until we run into the run-time partition >> prune >> case. The first result is that generic plans will rarely be used for >> some cases. >> For instance, SELECT * FROM p WHERE pkey = $1; The custom plan will only >> count the cost of one partition, however generic plan will count the cost >> for all the >> partitions even we are sure that only 1 partition will survive. Another >> impact >> is that planners may choose a wrong plan. for example, SELECT * FROM >> t1, p >> WHERE t1.a = p.pkey; The cost/rows of t1 nest loop p is estimated highly >> improperly. This patch wants to help this case to some extent. >> > > Greetings, > > I was referred to this patch by Amit as a possible improvement for an > issue I noticed recently. I had a test setup where I expected run-time > pruning to kick in but it did not. I am trying to test this patch to see > if it helps for that scenario, but ran into an error running make install > against the current master (commit 0a687c8f1). > > costsize.c: In function ‘cost_append’: > costsize.c:2171:32: error: ‘AppendPath’ {aka ‘struct AppendPath’} has no > member named ‘partitioned_rels’ > 2171 | List *partitioned_rels = apath->partitioned_rels; > |^~ > make[4]: *** [: costsize.o] Error 1 > make[4]: Leaving directory > '/var/lib/postgresql/git/postgresql/src/backend/optimizer/path' > make[3]: *** [../../../src/backend/common.mk:39: path-recursive] Error 2 > make[3]: Leaving directory > '/var/lib/postgresql/git/postgresql/src/backend/optimizer' > make[2]: *** [common.mk:39: optimizer-recursive] Error 2 > make[2]: Leaving directory '/var/lib/postgresql/git/postgresql/src/backend' > make[1]: *** [Makefile:42: install-backend-recurse] Error 2 > make[1]: Leaving directory '/var/lib/postgresql/git/postgresql/src' > make: *** [GNUmakefile:11: install-src-recurse] Error 2 > > Thanks, > > Ryan Lambert > > Thanks for checking. This patch is on a very old master and the code is too complex since I wanted to handle a full scenario of a run time partition prune, which has lots of troubles and not very practical I think. so I am not happy with that now. I have implemented a new one, which only handles 1 level of partitioned table, and only 1 partition key. and only handle the eq operators like partkey = $1 / partkey in ($1, $2) / parkey = $1 or partkey = $2; The patch works well in my user case. I can send one on the latest master shortly, and hope it is helpful for you as well. (At the same time, I also ran into a case that we can expand more init partition prune case [1], you can check that one if you like. I am happy with that patch now). [1] https://www.postgresql.org/message-id/flat/CAKU4AWq4NLxu5JF9%2Bd%3Do%3DA636-%3DeFNFmPx%2BkJ44ezTm%3DikZ73w%40mail.gmail.com -- Best Regards Andy Fan (https://www.aliyun.com/)
Re: Make Append Cost aware of some run time partition prune case
> > I have implemented a new one, which only handles 1 level of partitioned > table, and > only 1 partition key. and only handle the eq operators like partkey = $1 > / partkey in ($1, $2) > / parkey = $1 or partkey = $2; The patch works well in my user case. I > can send > one on the latest master shortly, and hope it is helpful for you as well. > > Hi: Here is the new patch for this topic, which has proven to work in my limited use case (applying similar logic on pg11). It is incomplete, since more cases should be covered but are not yet; I am uploading it now just for discussion and testing. Design principles: 1). The cost of an AppendPath should be reduced for either init partition prune or run-time partition prune. All of startup_cost, total_cost and rows should be adjusted. As for the startup_cost, we should adjust it carefully, or else we can end up with a run-time cost less than 0. 2). When we merge the subpaths from a sub-partitioned rel via accumulate_append_subpath, currently we just merge the subpaths and discard the cost/rows of the child AppendPath. After this feature is in, we may consider using the AppendPath's cost as well during this stage. 3). When a join is involved, adjusting the AppendPath is not enough, since the estimated rows for a join relation only care about rel1->rows and rel2->rows; Path.rows is not considered. So we need to adjust rel->rows as well, and only for the init partition prune case (appendrel->rows for planning-time partition prune is handled already). The biggest problem with the above is that I don't know how to adjust the cost for a Parallel Append Path. Currently I just ignore it, which would cause some queries that should use a Parallel Append Path not to. Things I don't want to handle unless we can show their value: 1). Operators like >, <, BETWEEN ... AND. 2). An incomplete partition key appearing in the prune quals, e.g. we have partition key (a, b) but only a = 1 as a restrictinfo. The main reasons I don't want to handle them are: 1). they would be uncommon; 2). they introduce extra complexity; 3). we can't estimate them well anyway, e.g. for partkey > $1, what would the prune ratio be? Things I don't handle so far: 1). the accumulate_append_subpath stuff. 2). MergeAppend. 3). Multiple partition keys. -- Best Regards Andy Fan (https://www.aliyun.com/) v1-0001-adjust-cost-model-for-partition-prune-case.patch Description: Binary data
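[Editor's note] For reference, a minimal sketch of the plan shape this cost adjustment targets (hypothetical table and prepared-statement names; the exact EXPLAIN output varies by version and settings):

create table pt (a int, b int) partition by list (a);
create table pt1 partition of pt for values in (1);
create table pt2 partition of pt for values in (2);
create table pt3 partition of pt for values in (3);

-- force the generic plan so init pruning (rather than planning-time pruning) applies
set plan_cache_mode = force_generic_plan;
prepare q(int) as select * from pt where a = $1;

-- the Append node should show "Subplans Removed: 2" from init pruning,
-- yet its estimated cost/rows still sum over all three children,
-- which is what the proposed adjustment wants to change
explain execute q(1);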
Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)
On Fri, Mar 5, 2021 at 12:00 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Thu, Feb 18, 2021 at 08:58:13PM +0800, Andy Fan wrote: > > Thanks for continuing work on this patch! > > > On Tue, Feb 16, 2021 at 12:01 PM David Rowley > wrote: > > > > > On Fri, 12 Feb 2021 at 15:18, Andy Fan > wrote: > > > > > > > > On Fri, Feb 12, 2021 at 9:02 AM David Rowley > > > wrote: > > > >> The reason I don't really like this is that it really depends where > > > >> you want to use RelOptInfo.notnullattrs. If someone wants to use it > > > >> to optimise something before the base quals are evaluated then they > > > >> might be unhappy that they found some NULLs. > > > >> > > > > > > > > Do you mean the notnullattrs is not set correctly before the base > quals > > > are > > > > evaluated? I think we have lots of data structures which are set > just > > > after some > > > > stage. but notnullattrs is special because it is set at more than 1 > > > stage. However > > > > I'm doubtful it is unacceptable, Some fields ever change their > meaning > > > at different > > > > stages like Var->varno. If a user has a misunderstanding on it, it > > > probably will find it > > > > at the testing stage. > > > > > > You're maybe focusing too much on your use case for notnullattrs. It > > > only cares about NULLs in the result for each query level. > > > > > > thinks of an example... > > > > > > OK, let's say I decided that COUNT(*) is faster than COUNT(id) so > > > decided that I might like to write a patch which rewrite the query to > > > use COUNT(*) when it was certain that "id" could not contain NULLs. > > > > > > The query is: > > > > > > SELECT p.partid, p.partdesc,COUNT(s.saleid) FROM part p LEFT OUTER > > > JOIN sales s ON p.partid = s.partid GROUP BY p.partid; > > > > > > sale.saleid is marked as NOT NULL in pg_attribute. As the writer of > > > the patch, I checked the comment for notnullattrs and it says "Not > > > null attrs, start from -FirstLowInvalidHeapAttributeNumber", so I > > > should be ok to assume since sales.saleid is marked in notnullattrs > > > that I can rewrite the query?! > > > > > > The documentation about the RelOptInfo.notnullattrs needs to be clear > > > what exactly it means. I'm not saying your representation of how to > > > record NOT NULL in incorrect. I'm saying that you need to be clear > > > what exactly is being recorded in that field. > > > > > > If you want it to mean "attribute marked here cannot output NULL > > > values at this query level", then you should say something along those > > > lines. > > > > > > However, having said that, because this is a Bitmapset of > > > pg_attribute.attnums, it's only possible to record Vars from base > > > relations. It does not seem like you have any means to record > > > attributes that are normally NULLable, but cannot produce NULL values > > > due to a strict join qual. > > > > > > e.g: SELECT t.nullable FROM t INNER JOIN j ON t.nullable = j.something; > > > > > > I'd expect the RelOptInfo for t not to contain a bit for the > > > "nullable" column, but there's no way to record the fact that the join > > > RelOptInfo for {t,j} cannot produce a NULL for that column. It might > > > be quite useful to know that for the UniqueKeys patch. > > > > > > > I checked again and found I do miss the check on JoinExpr->quals. I have > > fixed it in v3 patch. Thanks for the review! > > > > In the attached v3, commit 1 is the real patch, and commit 2 is just add > > some logs to help local testing. 
notnull.sql/notnull.out is the test > case > > for > > this patch, both commit 2 and notnull.* are not intended to be committed > > at last. > > Just to clarify, this version of notnullattrs here is the latest one, > and another one from "UniqueKey on Partitioned table" thread should be > disregarded? > Actually they are different sections for UniqueKey. Since I don't want to mess two topics in one place, I open another thread. The topic here is how to represent a not null attribute, which is a precondition for all UniqueKey stuff. The thread " UniqueKey on Partitioned table[1] " is talking a
Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)
On Fri, Mar 5, 2021 at 4:16 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Fri, Mar 05, 2021 at 10:22:45AM +0800, Andy Fan wrote: > > > > I checked again and found I do miss the check on JoinExpr->quals. I > have > > > > fixed it in v3 patch. Thanks for the review! > > > > > > > > In the attached v3, commit 1 is the real patch, and commit 2 is > just add > > > > some logs to help local testing. notnull.sql/notnull.out is the test > > > case > > > > for > > > > this patch, both commit 2 and notnull.* are not intended to be > committed > > > > at last. > > > > > > Just to clarify, this version of notnullattrs here is the latest one, > > > and another one from "UniqueKey on Partitioned table" thread should be > > > disregarded? > > > > > > > Actually they are different sections for UniqueKey. Since I don't want > to > > mess > > two topics in one place, I open another thread. The topic here is how to > > represent > > a not null attribute, which is a precondition for all UniqueKey stuff. > The > > thread > > " UniqueKey on Partitioned table[1] " is talking about how to maintain > the > > UniqueKey on a partitioned table only. > > Sure, those two threads are addressing different topics. But [1] also > includes the patch for notnullattrs (I guess it's the same as one of the > older versions from this thread), so it would be good to specify which > one should be used to avoid any confusion. > > > > I'm not sure about this, but couldn't be there still > > > some cases when a Var belongs to nullable_baserels, but still has some > > > constraints preventing it from being nullable (e.g. a silly example > when > > > the not nullable column belong to the table, and the query does full > > > join of this table on itself using this column)? > > > > > > Do you say something like "SELECT * FROM t1 left join t2 on t1.a = t2.a > > WHERE > > t2.b = 3; "? In this case, the outer join will be reduced to inner join > > at > > reduce_outer_join stage, which means t2 will not be shown in > > nullable_baserels. > > Nope, as I said it's a bit useless example of full self join t1 on > itself. In this case not null column "a" will be considered as nullable, > but following your description for is_var_nullable it's fine (although > couple of commentaries to this function are clearly necessary). > > > > Is this function necessary for the following patches? I've got an > > > impression that the discussion in this thread was mostly evolving about > > > correct description when notnullattrs could be used, not making it > > > bullet proof. > > > > > > > Exactly, that is the blocker issue right now. I hope more authorities can > > give > > some suggestions to move on. > > Hm...why essentially a documentation question is the blocker? Or if you > mean it's a question of the patch scope, are there any arguments for > extending it? > I treat the below comment as the blocker issue: > It would be good to agree on the correct representation for Vars that > cannot produce NULLs so that we don't shut the door on classes of > optimisation that require something other than what you need for your > case. David/Tom/Ashutosh, do you mind to share more insights to this? I mean the target is the patch is in a committable state. -- Best Regards Andy Fan (https://www.aliyun.com/)
Re: Extend more usecase for planning time partition pruning and init partition pruning.
Hi Amit: Thanks for your review! On Thu, Mar 4, 2021 at 5:07 PM Amit Langote wrote: > Hi Andy, > > On Sun, Jan 24, 2021 at 7:34 PM Andy Fan wrote: > > I recently found a use case like this. SELECT * FROM p, q WHERE > p.partkey = > > q.colx AND (q.colx = $1 OR q.colx = $2); Then we can't do either > planning time > > partition prune or init partition prune. Even though we have run-time > > partition pruning work at last, it is too late in some cases since we > have > > to init all the plan nodes in advance. In my case, there are 10+ > > partitioned relation in one query and the execution time is short, so > the > > init plan a lot of plan nodes cares a lot. > > > > The attached patches fix this issue. It just get the "p.partkey = q.colx" > > case in root->eq_classes or rel->joinlist (outer join), and then check > if there > > is some baserestrictinfo in another relation which can be used for > partition > > pruning. To make the things easier, both partkey and colx must be Var > > expression in implementation. > > > > - v1-0001-Make-some-static-functions-as-extern-and-extend-C.patch > > > > Just some existing refactoring and extending ChangeVarNodes to be able > > to change var->attno. > > > > - v1-0002-Build-some-implied-pruning-quals-to-extend-the-us.patch > > IIUC, your proposal is to transpose the "q.b in (1, 2)" in the > following query as "p.a in (1, 2)" and pass it down as a pruning qual > for p: > > select * from p, q where p.a = q.b and q.b in (1, 2); > > or "(q.b = 1 or q.b = 2)" in the following query as "(p.a = 1 or p.a = 2)": > > select * from p, q where p.a = q.b and (q.b = 1 or q.b = 2); > > Yes, you understand me correctly. > While that transposition sounds *roughly* valid, I have some questions > about the approach: > > * If the transposed quals are assumed valid to use for partition > pruning, could they also not be used by, say, the surviving > partitions' index scan paths? So, perhaps, it doesn't seem right that > partprune.c builds the clauses on-the-fly for pruning and dump them > once done. > > * On that last part, I wonder if partprune.c isn't the wrong place to > determine that "q.b in (1, 2)" and "p.a in (1, 2)" are in fact > equivalent. That sort of thing is normally done in the phase of > planning when distribute_qual_to_rels() runs and any equivalences > found stored in PlannerInfo.eq_classes. Have you investigated why the > process_ machinery doesn't support working with ScalarArrayOpExpr and > BoolExpr to begin with? > > * Or maybe have you considered generalizing what > build_implied_pruning_quals() does so that other places like > indxpath.c can use the facility? > > Actually at the beginning of this work, I do think I should put the implied quals to baserestictinfo in the distribute_qual_for_rels stage. That probably can fix all the issues you reported. However that probably more complex than what I did with more risks and I have a very limited timeline to handle the real custom issue, so I choose this strategy. But it is the time to re-think the baserestrictinfo way now. I will spend some time in this direction, Thank you for this kind of push-up:) I just checked this stuff on Oracle, Oracle does use this strategy. SQL> explain plan for select * from t1, t2 where t1.a = t2.a and t1.a > 2; Explained. 
SQL> select * from table(dbms_xplan.display); PLAN_TABLE_OUTPUT Plan hash value: 1838229974 --- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --- | 0 | SELECT STATEMENT | | 1 |52 | 4 (0)| 00:00:01 | |* 1 | HASH JOIN | | 1 |52 | 4 (0)| 00:00:01 | |* 2 | TABLE ACCESS FULL| T1 | 1 |26 | 2 (0)| 00:00:01 | |* 3 | TABLE ACCESS FULL| T2 | 1 |26 | 2 (0)| 00:00:01 | --- PLAN_TABLE_OUTPUT Predicate Information (identified by operation id): --- 1 - access("T1"."A"="T2"."A") * 2 - filter("T1"."A">2) 3 - filter("T2"."A">2)* 17 rows selected. postgres=# explain (costs off) select * from t1, t2 where t1.a = t2.a and t1.a > 2; QUERY PLAN --- Merge Join Merge Cond: (t1.a = t2.a) -> Sort Sort Key: t1.a -> Seq Scan on t1 Filter: (a > 2) -> Sort Sort Key: t2.a -> Seq Scan on t2 (9 rows) -- Best Regards Andy Fan (https://www.aliyun.com/)
Re: Huge memory consumption on partitioned table with FKs
On Fri, Mar 5, 2021 at 5:00 AM Tom Lane wrote: > Amit Langote writes: > > Updated patch attached. > > This claim seems false on its face: > > > All child constraints of a given foreign key constraint can use the > > same RI query and the resulting plan, that is, no need to create as > > many copies of the query and the plan as there are partitions, as > > happens now due to the child constraint OID being used in the key > > for ri_query_cache. > > What if the child tables don't have the same physical column numbers > as the parent? My point below is a bit off-topic, but I want to share it here. Since we implement a partitioned table in PG with the inherited class, it has much more flexibility than other databases. Like in PG, we allow different partitions have different physical order, different indexes, maybe different index states. that would cause our development work hard in many places and cause some runtime issues as well (like catalog memory usage), have we discussed limiting some flexibility so that we can have better coding/running experience? I want to do some research in this direction, but it would be better that I can listen to any advice from others. More specifically, I want to reduce the memory usage of Partitioned table/index as the first step. In my testing, each IndexOptInfo will use 2kB memory in each backend. > The comment claiming that it's okay if riinfo->fk_attnums > doesn't match seems quite off point, because the query plan is still > going to need to use the correct column numbers. Even if column numbers > are the same, the plan would also contain table and index OIDs that could > only be right for one partition. > > I could imagine translating a parent plan to apply to a child instead of > building it from scratch, but that would take a lot of code we don't have > (there's no rewriteHandler infrastructure for plan nodes). > > Maybe I'm missing something, because I see that the cfbot claims > this is passing, and I'd sure like to think that our test coverage > is not so thin that it'd fail to detect probing the wrong partition > for foreign key matches. But that's what it looks like this patch > will do. > > regards, tom lane > > > -- Best Regards Andy Fan (https://www.aliyun.com/)
Re: Huge memory consumption on partitioned table with FKs
On Mon, Mar 8, 2021 at 3:43 PM Andy Fan wrote: > > > On Fri, Mar 5, 2021 at 5:00 AM Tom Lane wrote: > >> Amit Langote writes: >> > Updated patch attached. >> >> This claim seems false on its face: >> >> > All child constraints of a given foreign key constraint can use the >> > same RI query and the resulting plan, that is, no need to create as >> > many copies of the query and the plan as there are partitions, as >> > happens now due to the child constraint OID being used in the key >> > for ri_query_cache. >> >> What if the child tables don't have the same physical column numbers >> as the parent? > > > My point below is a bit off-topic, but I want to share it here. Since > we implement a partitioned table in PG with the inherited class, it has > much > more flexibility than other databases. Like in PG, we allow different > partitions > have different physical order, different indexes, maybe different index > states. > that would cause our development work hard in many places and cause some > runtime issues as well (like catalog memory usage), have we discussed > limiting some flexibility so that we can have better coding/running > experience? > I want to do some research in this direction, but it would be better that > I can > listen to any advice from others. More specifically, I want to reduce the > memory > usage of Partitioned table/index as the first step. In my testing, each > IndexOptInfo > will use 2kB memory in each backend. > As for the compatible issue, will it be ok to introduce a new concept like " CREATE TABLE p (a int) partitioned by list(a) RESTRICTED". We can add these limitation to restricted partitioned relation only. > The comment claiming that it's okay if riinfo->fk_attnums >> doesn't match seems quite off point, because the query plan is still >> going to need to use the correct column numbers. Even if column numbers >> are the same, the plan would also contain table and index OIDs that could >> only be right for one partition. >> >> I could imagine translating a parent plan to apply to a child instead of >> building it from scratch, but that would take a lot of code we don't have >> (there's no rewriteHandler infrastructure for plan nodes). >> >> Maybe I'm missing something, because I see that the cfbot claims >> this is passing, and I'd sure like to think that our test coverage >> is not so thin that it'd fail to detect probing the wrong partition >> for foreign key matches. But that's what it looks like this patch >> will do. >> >> regards, tom lane >> >> >> > > -- > Best Regards > Andy Fan (https://www.aliyun.com/) > -- Best Regards Andy Fan (https://www.aliyun.com/)
Re: Huge memory consumption on partitioned table with FKs
On Mon, Mar 8, 2021 at 8:42 PM Amit Langote wrote: > Hi Andy, > > On Mon, Mar 8, 2021 at 8:39 PM Andy Fan wrote: > > On Mon, Mar 8, 2021 at 3:43 PM Andy Fan > wrote: > >> My point below is a bit off-topic, but I want to share it here. Since > >> we implement a partitioned table in PG with the inherited class, it has > much > >> more flexibility than other databases. Like in PG, we allow different > partitions > >> have different physical order, different indexes, maybe different > index states. > >> that would cause our development work hard in many places and cause some > >> runtime issues as well (like catalog memory usage), have we discussed > >> limiting some flexibility so that we can have better coding/running > experience? > >> I want to do some research in this direction, but it would be better > that I can > >> listen to any advice from others. More specifically, I want to reduce > the memory > >> usage of Partitioned table/index as the first step. In my testing, > each IndexOptInfo > >> will use 2kB memory in each backend. > > > > > > As for the compatible issue, will it be ok to introduce a new concept > like " > > CREATE TABLE p (a int) partitioned by list(a) RESTRICTED". We can add > these > > limitation to restricted partitioned relation only. > > I think you'd agree that the topics you want to discuss deserve a > separate discussion thread. You may refer to this discussion in that > new thread if you think that your proposals can solve the problem > being discussed here more generally, which would of course be great. > > Sure, I can prepare more data and start a new thread for this. -- Best Regards Andy Fan (https://www.aliyun.com/)
Advance xmin aggressively on Read Commit isolation level
The impact of a long transaction is huge in PostgreSQL: with an accidental long-running transaction, for example, tables may become very large and cannot shrink again even after the transaction completes, unless a VACUUM FULL is used. I have 2 ideas about this. One is in the Read Committed level, we can advance xmin aggressively. suppose it started at t1, and complete a query at t2. the xmin should be t1 currently. Can we advance the xmin to t2 since it is read committed level, The data older than t2 will never be used? Another one is can we force to clean up the old tuples which are older than xxx? If users want to access that, we can just raise errors. Oracle uses this strategy and the error code is ORA-01555. -- Best Regards Andy Fan
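[Editor's note] As an observational aid for the situation described above, a small sketch for finding which sessions can currently hold back the oldest-xmin horizon (what an individual idle session shows depends on its registered snapshots, open cursors and any assigned xid):

-- sessions whose advertised xmin or assigned xid can hold back cleanup
select pid, state, xact_start, backend_xid, backend_xmin
from pg_stat_activity
where backend_xmin is not null or backend_xid is not null
order by xact_start;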
Re: Advance xmin aggressively on Read Commit isolation level
On Fri, Nov 6, 2020 at 4:54 PM Thomas Munro wrote: > On Fri, Nov 6, 2020 at 9:48 PM Andy Fan wrote: > > I have 2 ideas about this. One is in the Read Committed level, we can > advance xmin > > aggressively. suppose it started at t1, and complete a query at t2. the > xmin should > > be t1 currently. Can we advance the xmin to t2 since it is read > committed level, > > The data older than t2 will never be used? Another one is can we force > to clean > > up the old tuples which are older than xxx? If users want to access > that, > > we can just raise errors. Oracle uses this strategy and the error code > is > > ORA-01555. > > Hi Andy, > > For the second idea, we have old_snapshot_threshold which does exactly > that since 9.6. There have been some questions about whether it works > correctly, though: see https://commitfest.postgresql.org/30/2682/ if > you would like to help look into that :-) > Hi Thomas: This is exactly what I want, and I am very interested in it. Thanks for the information! -- Best Regards Andy Fan
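[Editor's note] For completeness, a minimal sketch of enabling the knob Thomas mentions (old_snapshot_threshold is a postmaster-level setting, so it requires a server restart):

alter system set old_snapshot_threshold = '60min';
-- restart the server for this to take effect; afterwards, a query whose
-- snapshot is older than the threshold may fail with
--   ERROR: snapshot too old
-- when it tries to read data that has since been pruned away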
Re: Allow "snapshot too old" error, to prevent bloat
On Sun, Nov 8, 2020 at 5:14 PM Tom Lane wrote: > Jim Nasby writes: > > On 2/15/15 10:36 AM, Tom Lane wrote: > >> I wonder if we couldn't achieve largely the same positive effects > through > >> adding a simple transaction-level timeout option. > > > A common use-case is long-running reports hitting relatively stable data > > in a database that also has tables with a high churn rate (ie: queue > > tables). In those scenarios your only options right now are to suffer > > huge amounts of bloat in the high-churn or not do your reporting. A > > simple transaction timeout only "solves" this by denying you reporting > > queries. > > Agreed, but Kevin's proposal has exactly the same problem only worse, > because (a) the reporting transactions might or might not fail (per > Murphy, they'll fail consistently only when you're on deadline), and > (b) if they do fail, there's nothing you can do short of increasing the > slack db-wide. > > > An idea that I've had on this would be some way to "lock down" the > > tables that a long-running transaction could access. That would allow > > vacuum to ignore any snapshots that transaction had for tables it wasn't > > accessing. That's something that would be deterministic. > > There might be something in that, but again it's not much like this patch. > The key point I think we're both making is that nondeterministic failures > are bad, In my experience, long transactions caused by a genuinely long-running query are not that common. A more common case I have met is a user starting a transaction and forgetting to close it, and I am wondering whether we can do something for that situation. Since most users run at the Read Committed isolation level, after a query completes the next query will use a fresh snapshot, so there should be no need for such a session to hold back the oldest xmin. Would it be correct to advance the oldest xmin in this case? If yes, what would be the blockers in implementing this? -- Best Regards Andy Fan
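[Editor's note] Since the quoted discussion mentions a transaction-level timeout, a sketch of the closest existing knob for the forgot-to-close-the-transaction case (the value chosen here is arbitrary):

alter system set idle_in_transaction_session_timeout = '10min';
select pg_reload_conf();
-- sessions that stay "idle in transaction" longer than this are terminated,
-- releasing their locks and letting cleanup move on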
Re: Hybrid Hash/Nested Loop joins and caching results from subplans
On Fri, Nov 6, 2020 at 6:13 AM David Rowley wrote: > On Mon, 2 Nov 2020 at 20:43, David Rowley wrote: > > > > On Tue, 20 Oct 2020 at 22:30, David Rowley wrote: > > I did some further tests this time with some tuple deforming. Again, > > it does seem that v9 is slower than v8. > > > > Graphs attached > > > > Looking at profiles, I don't really see any obvious reason as to why > > this is. I'm very much inclined to just pursue the v8 patch (separate > > Result Cache node) and just drop the v9 idea altogether. > > Nobody raised any objections, so I'll start taking a more serious look > at the v8 version (the patch with the separate Result Cache node). > > One thing that I had planned to come back to about now is the name > "Result Cache". I admit to not thinking for too long on the best name > and always thought it was something to come back to later when there's > some actual code to debate a better name for. "Result Cache" was > always a bit of a placeholder name. > > Some other names that I'd thought of were: > > "MRU Hash" > "MRU Cache" > "Parameterized Tuple Cache" (bit long) > "Parameterized Cache" > "Parameterized MRU Cache" > > I think "Tuple Cache" would be OK which means it is a cache for tuples. Telling MRU/LRU would be too internal for an end user and "Parameterized" looks redundant given that we have said "Cache Key" just below the node name. Just my $0.01. -- Best Regards Andy Fan
Re: Hybrid Hash/Nested Loop joins and caching results from subplans
On Mon, Nov 2, 2020 at 3:44 PM David Rowley wrote: > On Tue, 20 Oct 2020 at 22:30, David Rowley wrote: > > > > So far benchmarking shows there's still a regression from the v8 > > version of the patch. This is using count(*). An earlier test [1] did > > show speedups when we needed to deform tuples returned by the nested > > loop node. I've not yet repeated that test again. I was disappointed > > to see v9 slower than v8 after having spent about 3 days rewriting the > > patch > > I did some further tests this time with some tuple deforming. Again, > it does seem that v9 is slower than v8. > I run your test case on v8 and v9, I can produce a stable difference between them. v8: statement latencies in milliseconds: 1603.611 select count(*) from hundredk hk inner join lookup l on hk.thousand = l.a; v9: statement latencies in milliseconds: 1772.287 select count(*) from hundredk hk inner join lookup l on hk.thousand = l.a; then I did a perf on the 2 version, Is it possible that you called tts_minimal_clear twice in the v9 version? Both ExecClearTuple and ExecStoreMinimalTuple called tts_minimal_clear on the same slot. With the following changes: diff --git a/src/backend/executor/execMRUTupleCache.c b/src/backend/executor/execMRUTupleCache.c index 3553dc26cb..b82d8e98b8 100644 --- a/src/backend/executor/execMRUTupleCache.c +++ b/src/backend/executor/execMRUTupleCache.c @@ -203,10 +203,9 @@ prepare_probe_slot(MRUTupleCache *mrucache, MRUCacheKey *key) TupleTableSlot *tslot = mrucache->tableslot; int numKeys = mrucache->nkeys; - ExecClearTuple(pslot); - if (key == NULL) { + ExecClearTuple(pslot); /* Set the probeslot's values based on the current parameter values */ for (int i = 0; i < numKeys; i++) pslot->tts_values[i] = ExecEvalExpr(mrucache->param_exprs[i], @@ -641,7 +640,7 @@ ExecMRUTupleCacheFetch(MRUTupleCache *mrucache) { mrucache->state = MRUCACHE_FETCH_NEXT_TUPLE; - ExecClearTuple(mrucache->cachefoundslot); + // ExecClearTuple(mrucache->cachefoundslot); slot = mrucache->cachefoundslot; ExecStoreMinimalTuple(mrucache->last_tuple->mintuple, slot, false); return slot; @@ -740,7 +739,7 @@ ExecMRUTupleCacheFetch(MRUTupleCache *mrucache) return NULL; } - ExecClearTuple(mrucache->cachefoundslot); + // ExecClearTuple(mrucache->cachefoundslot); slot = mrucache->cachefoundslot; ExecStoreMinimalTuple(mrucache->last_tuple->mintuple, slot, false); return slot; v9 has the following result: 1608.048 select count(*) from hundredk hk inner join lookup l on hk.thousand = l.a; > Graphs attached > > Looking at profiles, I don't really see any obvious reason as to why > this is. I'm very much inclined to just pursue the v8 patch (separate > Result Cache node) and just drop the v9 idea altogether. > > David > -- Best Regards Andy Fan
Re: Hybrid Hash/Nested Loop joins and caching results from subplans
On Mon, Nov 9, 2020 at 10:07 AM David Rowley wrote: > On Mon, 9 Nov 2020 at 03:52, Andy Fan wrote: > > then I did a perf on the 2 version, Is it possible that you called > tts_minimal_clear twice in > > the v9 version? Both ExecClearTuple and ExecStoreMinimalTuple called > tts_minimal_clear > > on the same slot. > > > > With the following changes: > > Thanks for finding that. After applying that fix I did a fresh set of > benchmarks on the latest master, latest master + v8 and latest master > + v9 using the attached script. (resultcachebench2.sh.txt) > > I ran this on my zen2 AMD64 machine and formatted the results into the > attached resultcache_master_vs_v8_vs_v9.csv file > > If I load this into PostgreSQL: > > # create table resultcache_bench (tbl text, target text, col text, > latency_master numeric(10,3), latency_v8 numeric(10,3), latency_v9 > numeric(10,3)); > # copy resultcache_bench from > '/path/to/resultcache_master_vs_v8_vs_v9.csv' with(format csv); > > and run: > > # select col,tbl,target, sum(latency_v8) v8, sum(latency_v9) v9, > round(avg(latency_v8/latency_v9)*100,1) as v8_vs_v9 from > resultcache_bench group by 1,2,3 order by 2,1,3; > > I've attached the results of the above query. (resultcache_v8_vs_v9.txt) > > Out of the 24 tests done on each branch, only 6 of 24 are better on v9 > compared to v8. So v8 wins on 75% of the tests. I think either version is OK for me and I like this patch overall. However I believe v9 should be no worse than v8 all the time, Is there any theory to explain your result? v9 never wins using > the lookup1 table (1 row per lookup). It only runs on 50% of the > lookup100 queries (100 inner rows per outer row). However, despite the > draw in won tests for the lookup100 test, v8 takes less time overall, > as indicated by the following query: > > postgres=# select round(avg(latency_v8/latency_v9)*100,1) as v8_vs_v9 > from resultcache_bench where tbl='lookup100'; > v8_vs_v9 > -- > 99.3 > (1 row) > > Ditching the WHERE clause and simply doing: > > postgres=# select round(avg(latency_v8/latency_v9)*100,1) as v8_vs_v9 > from resultcache_bench; > v8_vs_v9 > -- > 96.2 > (1 row) > > indicates that v8 is 3.8% faster than v9. Altering that query > accordingly indicates v8 is 11.5% faster than master and v9 is only 7% > faster than master. > > Of course, scaling up the test will yield both versions being even > more favourable then master, but the point here is comparing v8 to v9. > > David > -- Best Regards Andy Fan
Make Append Cost aware of some run time partition prune case
Filter: (a = $1) -> Seq Scan on a1_2 a_2 (cost=0.00..90.50 rows=5000 width=12) Filter: (a = $1) -> Append (cost=0.00..231.00 rows=1 width=12) << Here Subplans Removed: 2 -> Seq Scan on a1_1 a_4 (cost=0.00..90.50 rows=5000 width=12) Filter: (a = $2) -> Seq Scan on a1_2 a_5 (cost=0.00..90.50 rows=5000 width=12) Filter: (a = $2) (15 rows) -- add a limit to make sure the runtime partitions prune. explain select * from generate_series(1, 10) i(i) join lateral (select * from a where a.a = (i.i % 2 + 1) and a.b = (i.i % 2 + 1) and a.c = (i.i % 10) limit 1000) m on true; QUERY PLAN Nested Loop (cost=0.00..2030.10 rows=1 width=16) -> Function Scan on generate_series i (cost=0.00..0.10 rows=10 width=4) -> Limit (cost=0.00..183.00 rows=1000 width=12) -> Append (cost=0.00..183.00 rows=1000 width=12) << Here -> Seq Scan on a1_1 a_1 (cost=0.00..178.00 rows=1000 width=12) Filter: ((c = (i.i % 10)) AND (a = ((i.i % 2) + 1)) AND (b = ((i.i % 2) + 1))) -> Seq Scan on a1_2 a_2 (cost=0.00..178.00 rows=1000 width=12) Filter: ((c = (i.i % 10)) AND (a = ((i.i % 2) + 1)) AND (b = ((i.i % 2) + 1))) -> Seq Scan on a2_1 a_3 (cost=0.00..178.00 rows=1000 width=12) Filter: ((c = (i.i % 10)) AND (a = ((i.i % 2) + 1)) AND (b = ((i.i % 2) + 1))) -> Seq Scan on a2_2 a_4 (cost=0.00..178.00 rows=1000 width=12) Filter: ((c = (i.i % 10)) AND (a = ((i.i % 2) + 1)) AND (b = ((i.i % 2) + 1))) (12 rows) Any thoughts about this? Thanks [1] https://www.postgresql.org/message-id/CA%2BTgmoZHYoAL4HYwnGO25B8CxCB%2BvNMdf%2B7rbUzYykR4sU9yUA%40mail.gmail.com -- Best Regards Andy Fan v1-0001-Adjust-Append-Path-cost-model-for-runtime-partiti.patch Description: Binary data part_runtime_prune.sql Description: Binary data
Re: Hybrid Hash/Nested Loop joins and caching results from subplans
On Tue, Nov 10, 2020 at 7:55 AM David Rowley wrote: > On Tue, 10 Nov 2020 at 12:49, Tom Lane wrote: > > > > Alvaro Herrera writes: > > > Are you taking into account the possibility that generated machine code > > > is a small percent slower out of mere bad luck? I remember someone > > > suggesting that they can make code 2% faster or so by inserting random > > > no-op instructions in the binary, or something like that. So if the > > > difference between v8 and v9 is that small, then it might be due to > this > > > kind of effect. > > > > Yeah. I believe what this arises from is good or bad luck about relevant > > tight loops falling within or across cache lines, and that sort of thing. > > We've definitely seen performance changes up to a couple percent with > > no apparent change to the relevant code. > > I do happen to prefer having the separate Result Cache node (v8), so > from my point of view, even if the performance was equal, I'd rather > have v8. I understand that some others feel different though. > > While I am curious about what caused the tiny difference, I admit that which direction this patch should go is more important. I am not sure whether people are convinced that v8 and v9 have similar performance; the current data show they are similar. I want to profile and read the code more, but I don't know which part I should pay attention to, so any hints on why v9 should be noticeably better in theory would be very helpful. After that, I'd like to read the code or profile more carefully. -- Best Regards Andy Fan
Have we tried to treat CTE as SubQuery in planner?
Hi: Take the following example: insert into cte1 select i, i from generate_series(1, 100)i; create index on cte1(a); explain with cte1 as (select * from cte1) select * from c where a = 1; It needs to do seq scan on the above format, however it is pretty quick if we change the query to select * from (select * from cte1) c where a = 1; I know how we treat cte and subqueries differently currently, I just don't know why we can't treat cte as a subquery, so lots of subquery related technology can apply to it. Do we have any discussion about this? Thanks -- Best Regards Andy Fan
Re: Have we tried to treat CTE as SubQuery in planner?
On Sat, Nov 14, 2020 at 2:14 PM Tom Lane wrote: > Andy Fan writes: > > Take the following example: > > > insert into cte1 select i, i from generate_series(1, 100)i; > > create index on cte1(a); > > > explain > > with cte1 as (select * from cte1) > > select * from c where a = 1; > > > It needs to do seq scan on the above format, however it is pretty > > quick if we change the query to > > select * from (select * from cte1) c where a = 1; > > This example seems both confused and out of date. Since we changed > the rules on materializing CTEs (in 608b167f9), I get > > Sorry, I should have tested it again on the HEAD, and 608b167f9 is exactly the thing I mean. regression=# create table c as select i as a, i from generate_series(1, > 100)i; > SELECT 100 > regression=# create index on c(a); > CREATE INDEX > regression=# explain > regression-# with cte1 as (select * from c) > regression-# select * from cte1 where a = 1; > QUERY PLAN > -- > Bitmap Heap Scan on c (cost=95.17..4793.05 rows=5000 width=8) >Recheck Cond: (a = 1) >-> Bitmap Index Scan on c_a_idx (cost=0.00..93.92 rows=5000 width=0) > Index Cond: (a = 1) > (4 rows) > > regards, tom lane > -- Best Regards Andy Fan
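For completeness, a minimal sketch of the other direction on the same table c (the pre-608b167f9 behaviour can still be requested explicitly):

explain
with cte1 as materialized (select * from c)
select * from cte1 where a = 1;

Here MATERIALIZED keeps the CTE as an optimization fence, so the index on c(a) is not used, while the plain WITH form above is inlined since 608b167f9.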
Re: Have we tried to treat CTE as SubQuery in planner?
On Sat, Nov 14, 2020 at 2:44 PM Jesse Zhang wrote: > Hi, > > On Fri, Nov 13, 2020 at 10:04 PM Andy Fan wrote: > > > > Hi: > > > > Take the following example: > > > > insert into cte1 select i, i from generate_series(1, 100)i; > > create index on cte1(a); > > > > explain > > with cte1 as (select * from cte1) > > select * from c where a = 1; > > > > ITYM: > > EXPLAIN > WITH c AS (SELECT * FROM cte1) > SELECT * FROM c WHERE a = 1; > > I'm also guessing your table DDL is: > > CREATE TABLE cte1 (a int, b int); > > > It needs to do seq scan on the above format, however it is pretty > > quick if we change the query to > > select * from (select * from cte1) c where a = 1; > > Does it? On HEAD, I got the following plan: > > You understand me correctly, just too busy recently and make me make mistakes like this. Sorry about that:( > (without stats): > Bitmap Heap Scan on foo >Recheck Cond: (a = 1) >-> Bitmap Index Scan on foo_a_idx > Index Cond: (a = 1) > > (with stats): > Index Scan using foo_a_idx on foo >Index Cond: (a = 1) > > > > > > > I know how we treat cte and subqueries differently currently, > > I just don't know why we can't treat cte as a subquery, so lots of > > subquery related technology can apply to it. Do we have any > > discussion about this? > > This was brought up a few times, the most recent one I can recall was a > little bit over two years ago [1] > > [1] https://postgr.es/m/87sh48ffhb@news-spur.riddles.org.uk And I should have searched "CTE" at least for a while.. -- Best Regards Andy Fan
Different results between PostgreSQL and Oracle for "for update" statement
We can reproduce this difference with the following steps. create table su (a int, b int); insert into su values(1, 1); - session 1: begin; update su set b = 2 where b = 1; - sess 2: select * from su where a in (select a from su where b = 1) for update; - sess 1: commit; Then session 2 can get the result. PostgreSQL: a | b ---+--- 1 | 2 (1 row) Oracle: It gets 0 rows. Oracle's plan is pretty similar to Postgres. PLAN_TABLE_OUTPUT Plan hash value: 2828511618 - | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | - | 0 | SELECT STATEMENT || 1 | 52 | 4 (0)| 00:00:01 | | 1 | FOR UPDATE |||| || | 2 | BUFFER SORT |||| || |* 3 |HASH JOIN SEMI|| 1 | 52 | 4 (0)| 00:00:01 | | 4 | TABLE ACCESS FULL| SU | 1 | 26 | 2 (0)| 00:00:01 | |* 5 | TABLE ACCESS FULL| SU | 1 | 26 | 2 (0)| 00:00:01 | PLAN_TABLE_OUTPUT - Any thoughts on who is wrong? -- Best Regards Andy Fan
Re: Different results between PostgreSQL and Oracle for "for update" statement
On Thu, Nov 19, 2020 at 11:49 PM Tom Lane wrote: > Andy Fan writes: > > create table su (a int, b int); > > insert into su values(1, 1); > > > - session 1: > > begin; > > update su set b = 2 where b = 1; > > > - sess 2: > > select * from su where a in (select a from su where b = 1) for update; > > This'd probably work the way you expect if there were "for update" > in the sub-select as well. As is, the sub-select will happily > return "1". > > regards, tom lane > Hi Tom: Thank you for your attention. Your suggestion would fix the issue. However The difference will cause some risks when users move their application from Oracle to PostgreSQL. So I'd like to think which behavior is more reasonable. In our current pg strategy, we would get select * from su where a in (select a from su where b = 1) for update; a | b ---+--- 1 | 2 (1 row) The data is obtained from 4 steps: 1. In the subquery, it gets su(a=1, b=1). 2. in the outer query, it is blocked. 3. session 1 is committed. sess 2 can continue. 4. outer query gets su(a=1, b=2). By comparing the result in step 1 & step 4 in the same query, it looks like we are using 2 different snapshots for the same query. I think it is a bit strange. If we think it is an issue, I think we can fix it with something like this: +/* + * We should use the same RowMarkType for the RangeTblEntry + * if the underline relation is the same. Doesn't handle SubQuery for now. + */ +static RowMarkType +select_rowmark_type_for_rangetable(RangeTblEntry *rte, + List *rowmarks, + LockClauseStrength strength) +{ + ListCell *lc; + foreach(lc, rowmarks) + { + PlanRowMark *rc = lfirst_node(PlanRowMark, lc); + if (rc->relid == rte->relid) + return rc->markType; + } + return select_rowmark_type(rte, strength); +} + /* * preprocess_rowmarks - set up PlanRowMarks if needed */ @@ -2722,6 +2743,7 @@ preprocess_rowmarks(PlannerInfo *root) newrc->strength = rc->strength; newrc->waitPolicy = rc->waitPolicy; newrc->isParent = false; + newrc->relid = rte->relid; prowmarks = lappend(prowmarks, newrc); } @@ -2742,18 +2764,18 @@ preprocess_rowmarks(PlannerInfo *root) newrc = makeNode(PlanRowMark); newrc->rti = newrc->prti = i; newrc->rowmarkId = ++(root->glob->lastRowMarkId); - newrc->markType = select_rowmark_type(rte, LCS_NONE); + newrc->markType = select_rowmark_type_for_rangetable(rte, prowmarks, LCS_NONE); newrc->allMarkTypes = (1 << newrc->markType); newrc->strength = LCS_NONE; newrc->waitPolicy = LockWaitBlock; /* doesn't matter */ newrc->isParent = false; - + newrc->relid = rte->relid; prowmarks = lappend(prowmarks, newrc); } - root->rowMarks = prowmarks; } + /* * Select RowMarkType to use for a given table */ diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h index ac5685da64..926086a69a 100644 --- a/src/include/nodes/plannodes.h +++ b/src/include/nodes/plannodes.h @@ -1103,6 +1103,7 @@ typedef struct PlanRowMark LockClauseStrength strength;/* LockingClause's strength, or LCS_NONE */ LockWaitPolicy waitPolicy; /* NOWAIT and SKIP LOCKED options */ boolisParent; /* true if this is a "dummy" parent entry */ + Oid relid; /* relation oid */ } PlanRowMark; Do you think it is a bug and the above method is the right way to go? -- Best Regards Andy Fan
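For reference, the variant Tom suggests (locking inside the sub-select as well) would look roughly like this on the same su table:

select * from su where a in (select a from su where b = 1 for update) for update;

With the inner FOR UPDATE, session 2 should block in the sub-select and re-check b = 1 against the updated row once session 1 commits, so it would return no rows, matching the Oracle result for this example.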
Re: Different results between PostgreSQL and Oracle for "for update" statement
Hi Andreas: Thanks for your input. On Fri, Nov 20, 2020 at 9:37 PM Andreas Karlsson wrote: > On 11/20/20 9:57 AM, Andy Fan wrote: > > Thank you for your attention. Your suggestion would fix the issue. > However > > The difference will cause some risks when users move their application > > from Oracle > > to PostgreSQL. So I'd like to think which behavior is more reasonable. > > I think PostgreSQL's behavior is more reasonable since it only locks the > rows it claims to lock and no extra rows. This makes the code easy to > reason about. And PostgreSQL does not re-evaluate sub queries after > grabbing the lock which while it might be surprising to some people is > also a quite nice consistent behavior in practice as long as you are > aware of it. > I admit my way is bad after reading your below question, but I would not think *it might be surprising to some people* is a good signal for a design. Would you think "re-evaluate the quals" after grabbing the lock should be a good idea? And do you know if any other database uses the postgres's way or Oracle's way? I just heard Oracle might do the re-check just some minutes before reading your reply and I also found Oracle doesn't lock the extra rows per my test. > I do not see why these two scenarios should behave differently (which I > think they would with your proposed patch): > > Good question! I think my approach doesn't make sense now! -- Best Regards Andy Fan
Re: Different results between PostgreSQL and Oracle for "for update" statement
Thank all of you for your great insight! On Sat, Nov 21, 2020 at 9:04 AM Peter Geoghegan wrote: > On Fri, Nov 20, 2020 at 3:04 PM Andreas Karlsson > wrote: > > I am sadly not familiar enough with Oracle or have access to any Oracle > > license so I cannot comment on how Oracle have implemented their behvior > > or what tradeoffs they have made. > > I bet that Oracle does a statement-level rollback for READ COMMITTED > mode's conflict handling. I'd agree with you about this point, this difference can cause more different behavior between Postgres & Oracle (not just select .. for update). create table dml(a int, b int); insert into dml values(1, 1), (2,2); -- session 1: begin; delete from dml where a in (select min(a) from dml); --session 2: delete from dml where a in (select min(a) from dml); -- session 1: commit; In Oracle: 1 row deleted in sess 2. In PG: 0 rows are deleted. > I'm not sure if this means that it locks multiple rows or not. This is something not really exists and you can ignore this part:) About the statement level rollback, Another difference is related. create table t (a int primary key, b int); begin; insert into t values(1,1); insert into t values(1, 1); commit; Oracle : t has 1 row, PG: t has 0 row (since the whole transaction is aborted). I don't mean we need to be the same as Oracle, but to support a customer who comes from Oracle, it would be good to know the difference. -- Best Regards Andy Fan
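As an illustration only, the Oracle-like outcome of that last example can be emulated on the client side with a savepoint per statement:

create table t (a int primary key, b int);
begin;
insert into t values(1,1);
savepoint s1;
insert into t values(1, 1);  -- fails with a duplicate key error
rollback to savepoint s1;
commit;

This leaves one row in t, since only the failed statement is rolled back rather than the whole transaction.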
Re: Different results between PostgreSQL and Oracle for "for update" statement
On Sat, Nov 21, 2020 at 11:27 PM Pavel Stehule wrote: > > > so 21. 11. 2020 v 9:59 odesílatel Andy Fan > napsal: > >> Thank all of you for your great insight! >> >> On Sat, Nov 21, 2020 at 9:04 AM Peter Geoghegan wrote: >> >>> On Fri, Nov 20, 2020 at 3:04 PM Andreas Karlsson >>> wrote: >>> > I am sadly not familiar enough with Oracle or have access to any Oracle >>> > license so I cannot comment on how Oracle have implemented their >>> behvior >>> > or what tradeoffs they have made. >>> >>> I bet that Oracle does a statement-level rollback for READ COMMITTED >>> mode's conflict handling. >> >> >> I'd agree with you about this point, this difference can cause more >> different >> behavior between Postgres & Oracle (not just select .. for update). >> >> create table dml(a int, b int); >> insert into dml values(1, 1), (2,2); >> >> -- session 1: >> begin; >> delete from dml where a in (select min(a) from dml); >> >> --session 2: >> delete from dml where a in (select min(a) from dml); >> >> -- session 1: >> commit; >> >> In Oracle: 1 row deleted in sess 2. >> In PG: 0 rows are deleted. >> >> >>> I'm not sure if this means that it locks multiple rows or not. >> >> >> This is something not really exists and you can ignore this part:) >> >> About the statement level rollback, Another difference is related. >> >> create table t (a int primary key, b int); >> begin; >> insert into t values(1,1); >> insert into t values(1, 1); >> commit; >> >> Oracle : t has 1 row, PG: t has 0 row (since the whole transaction is >> aborted). >> >> I don't mean we need to be the same as Oracle, but to support a >> customer who comes from Oracle, it would be good to know the >> difference. >> > > yes, it would be nice to be better documented, somewhere - it should not > be part of Postgres documentation. Unfortunately, people who know Postgres > perfectly do not have the same knowledge about Oracle. > > Some differences are documented in Orafce documentation > https://github.com/orafce/orafce/tree/master/doc > > orafce project is awesome! > but I am afraid so there is nothing about the different behaviour of > snapshots. > > https://github.com/orafce/orafce/pull/120 is opened for this. -- Best Regards Andy Fan
Re: Different results between PostgreSQL and Oracle for "for update" statement
On Sun, Nov 22, 2020 at 5:56 AM Peter Geoghegan wrote: > On Sat, Nov 21, 2020 at 12:58 AM Andy Fan > wrote: > > I don't mean we need to be the same as Oracle, but to support a > > customer who comes from Oracle, it would be good to know the > > difference. > > Actually, it is documented here: > https://www.postgresql.org/docs/devel/transaction-iso.html > > The description starts with: "UPDATE, DELETE, SELECT FOR UPDATE, and > SELECT FOR SHARE commands behave the same as SELECT in terms of > searching for target rows...". > > I imagine that the number of application developers that are aware of > this specific aspect of transaction isolation in PostgreSQL (READ > COMMITTED conflict handling/EvalPlanQual()) is extremely small. In > practice it doesn't come up that often. Totally agree with that. > Though Postgres hackers tend to think about it a lot because it is hard to > maintain. > Hackers may care about this if they run into a real user case :) > I'm not saying that that's good or bad. Just that that has been my > experience. > > I am sure that some application developers really do understand the > single most important thing about READ COMMITTED mode's behavior: each > new command gets its own MVCC snapshot. But I believe that Oracle is > no different. So in practice application developers probably don't > notice any difference between READ COMMITTED mode in practically all > cases. (Again, just my opinion.) > > -- > Peter Geoghegan > -- Best Regards Andy Fan
Re: Hybrid Hash/Nested Loop joins and caching results from subplans
Hi David: I did a review on the v8, it looks great to me. Here are some tiny things noted, just FYI. 1. modified src/include/utils/selfuncs.h @@ -70,9 +70,9 @@ * callers to provide further details about some assumptions which were made * during the estimation. */ -#define SELFLAG_USED_DEFAULT (1 << 0) /* Estimation fell back on one of - * the DEFAULTs as defined above. - */ +#define SELFLAG_USED_DEFAULT (1 << 0) /* Estimation fell back on one + * of the DEFAULTs as defined + * above. */ Looks nothing has changed. 2. leading spaces is not necessary in comments. /* * ResultCacheTuple Stores an individually cached tuple */ typedef struct ResultCacheTuple { MinimalTuple mintuple; /* Cached tuple */ struct ResultCacheTuple *next; /* The next tuple with the same parameter * values or NULL if it's the last one */ } ResultCacheTuple; 3. We define ResultCacheKey as below. /* * ResultCacheKey * The hash table key for cached entries plus the LRU list link */ typedef struct ResultCacheKey { MinimalTuple params; dlist_node lru_node; /* Pointer to next/prev key in LRU list */ } ResultCacheKey; Since we store it as a MinimalTuple, we need some FETCH_INNER_VAR step for each element during the ResultCacheHash_equal call. I am thinking if we can store a "Datum *" directly to save these steps. exec_aggvalues/exec_aggnulls looks a good candidate for me, except that the name looks not good. IMO, we can rename exec_aggvalues/exec_aggnulls and try to merge EEOP_AGGREF/EEOP_WINDOW_FUNC into a more generic step which can be reused in this case. 4. I think the ExecClearTuple in prepare_probe_slot is not a must, since the data tts_values/tts_flags/tts_nvalid are all reset later, and tts_tid is not real used in our case. Since both prepare_probe_slot and ResultCacheHash_equal are in pretty hot path, we may need to consider it. static inline void prepare_probe_slot(ResultCacheState *rcstate, ResultCacheKey *key) { ... ExecClearTuple(pslot); ... } static void tts_virtual_clear(TupleTableSlot *slot) { if (unlikely(TTS_SHOULDFREE(slot))) { VirtualTupleTableSlot *vslot = (VirtualTupleTableSlot *) slot; pfree(vslot->data); vslot->data = NULL; slot->tts_flags &= ~TTS_FLAG_SHOULDFREE; } slot->tts_nvalid = 0; slot->tts_flags |= TTS_FLAG_EMPTY; ItemPointerSetInvalid(&slot->tts_tid); } -- Best Regards Andy Fan
Re: Hybrid Hash/Nested Loop joins and caching results from subplans
On Sun, Nov 22, 2020 at 9:21 PM Andy Fan wrote: > Hi David: > > I did a review on the v8, it looks great to me. Here are some tiny > things noted, > just FYI. > > 1. modified src/include/utils/selfuncs.h > @@ -70,9 +70,9 @@ > * callers to provide further details about some assumptions which were > made > * during the estimation. > */ > -#define SELFLAG_USED_DEFAULT (1 << 0) /* Estimation fell back on one of > - * the DEFAULTs as defined above. > - */ > +#define SELFLAG_USED_DEFAULT (1 << 0) /* Estimation fell back on one > + * of the DEFAULTs as defined > + * above. */ > > Looks nothing has changed. > > > 2. leading spaces is not necessary in comments. > > /* > * ResultCacheTuple Stores an individually cached tuple > */ > typedef struct ResultCacheTuple > { > MinimalTuple mintuple; /* Cached tuple */ > struct ResultCacheTuple *next; /* The next tuple with the same parameter > * values or NULL if it's the last one */ > } ResultCacheTuple; > > > 3. We define ResultCacheKey as below. > > /* > * ResultCacheKey > * The hash table key for cached entries plus the LRU list link > */ > typedef struct ResultCacheKey > { > MinimalTuple params; > dlist_node lru_node; /* Pointer to next/prev key in LRU list */ > } ResultCacheKey; > > Since we store it as a MinimalTuple, we need some FETCH_INNER_VAR step for > each element during the ResultCacheHash_equal call. I am thinking if we > can > store a "Datum *" directly to save these steps. > exec_aggvalues/exec_aggnulls looks > a good candidate for me, except that the name looks not good. IMO, we can > rename exec_aggvalues/exec_aggnulls and try to merge > EEOP_AGGREF/EEOP_WINDOW_FUNC into a more generic step which can be > reused in this case. > > 4. I think the ExecClearTuple in prepare_probe_slot is not a must, since > the > data tts_values/tts_flags/tts_nvalid are all reset later, and tts_tid is > not > real used in our case. Since both prepare_probe_slot > and ResultCacheHash_equal are in pretty hot path, we may need to consider > it. > > static inline void > prepare_probe_slot(ResultCacheState *rcstate, ResultCacheKey *key) > { > ... > ExecClearTuple(pslot); > ... > } > > > static void > tts_virtual_clear(TupleTableSlot *slot) > { > if (unlikely(TTS_SHOULDFREE(slot))) > { > VirtualTupleTableSlot *vslot = (VirtualTupleTableSlot *) slot; > > pfree(vslot->data); > vslot->data = NULL; > > slot->tts_flags &= ~TTS_FLAG_SHOULDFREE; > } > > slot->tts_nvalid = 0; > slot->tts_flags |= TTS_FLAG_EMPTY; > ItemPointerSetInvalid(&slot->tts_tid); > } > > -- > Best Regards > Andy Fan > add 2 more comments. 1. I'd suggest adding Assert(false); in RC_END_OF_SCAN case to make the error clearer. case RC_END_OF_SCAN: /* * We've already returned NULL for this scan, but just in case * something call us again by mistake. */ return NULL; 2. Currently we handle the (!cache_store_tuple(node, outerslot))) case by set it to RC_CACHE_BYPASS_MODE. The only reason for the cache_store_tuple failure is we can't cache_reduce_memory. I guess if cache_reduce_memory failed once, it would not succeed later(no more tuples can be stored, nothing is changed). So I think we can record this state and avoid any new cache_reduce_memory call. /* * If we failed to create the entry or failed to store the * tuple in the entry, then go into bypass mode. */ if (unlikely(entry == NULL || !cache_store_tuple(node, outerslot))) to if (unlikely(entry == NULL || node->memory_cant_be_reduced || !cache_store_tuple(node, outerslot))) -- Best Regards Andy Fan
About adding a new field to a struct in primnodes.h
Hi: For example, suppose we added a new field to a node in primnodes.h: struct FuncExpr { + int newf; }; then we modified the copy/read/out functions for this node. In _readFuncExpr, we probably add something like static FuncExpr *_readFuncExpr(..) { .. + READ_INT_FIELD(newf); }; Then we will get a compatible issue if we create a view with the node in the older version and access the view with the new binary. I think we can bypass this issue easily with something like READ_INT_FIELD_UNMUST(newf, defaultvalue); However, I didn't see any code like this in our code base. Does it not work, or is it just not considered worth doing? -- Best Regards Andy Fan
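To make the compatibility concern a bit more concrete, a small sketch (the table and view names here are made up): the node string that _readFuncExpr has to parse is the one stored in the catalog at view creation time, e.g.

create table t1 (a text);
create view v1 as select upper(a) from t1;
select ev_action from pg_rewrite where ev_class = 'v1'::regclass;

The stored string contains a FUNCEXPR node written by the old binary, so a newer binary whose _readFuncExpr expects an extra field cannot read it, which is why a catalog version bump is required today.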
Re: About adding a new field to a struct in primnodes.h
On Tue, Nov 24, 2020 at 11:11 PM Alvaro Herrera wrote: > On 2020-Nov-24, Andy Fan wrote: > > > then we modified the copy/read/out functions for this node. In > > _readFuncExpr, > > we probably add something like > > > [ ... ] > > > Then we will get a compatible issue if we create a view with the node in > > the older version and access the view with the new binary. > > When nodes are modified, you have to increment CATALOG_VERSION_NO which > makes the new code incompatible with a datadir previously created Thank you Alvaro. I just think this issue can be avoided without causing the incompatible issue. IIUC, after the new binary isn't compatible with the datadir, the user has to dump/restore the whole database or use pg_upgrade. It is kind of extra work. -- for precisely this reason. > I probably didn't get the real point of this, sorry about that. -- Best Regards Andy Fan
Re: About adding a new field to a struct in primnodes.h
On Wed, Nov 25, 2020 at 8:10 AM Andy Fan wrote: > > > On Tue, Nov 24, 2020 at 11:11 PM Alvaro Herrera > wrote: > >> On 2020-Nov-24, Andy Fan wrote: >> >> > then we modified the copy/read/out functions for this node. In >> > _readFuncExpr, >> > we probably add something like >> >> > [ ... ] >> >> > Then we will get a compatible issue if we create a view with the node in >> > the older version and access the view with the new binary. >> >> When nodes are modified, you have to increment CATALOG_VERSION_NO which >> makes the new code incompatible with a datadir previously created > > > Thank you Alvaro. I just think this issue can be avoided without causing > the incompatible issue. IIUC, after the new binary isn't compatible with > the datadir, the user has to dump/restore the whole database or use > pg_upgrade. It is kind of extra work. > > > -- for precisely this reason. >> > > I probably didn't get the real point of this, sorry about that. > > -- > Best Regards > Andy Fan > What I mean here is something like below. diff --git a/src/backend/nodes/read.c b/src/backend/nodes/read.c index 8c1e39044c..c3eba00639 100644 --- a/src/backend/nodes/read.c +++ b/src/backend/nodes/read.c @@ -29,6 +29,7 @@ /* Static state for pg_strtok */ static const char *pg_strtok_ptr = NULL; +static const char *pg_strtok_extend(int *length, bool testonly); /* State flag that determines how readfuncs.c should treat location fields */ #ifdef WRITE_READ_PARSE_PLAN_TREES @@ -102,6 +103,20 @@ stringToNodeWithLocations(const char *str) #endif +const char* +pg_strtok(int *length) +{ + return pg_strtok_extend(length, false); +} + +/* + * Just peak the next filed name without changing the global state. + */ +const char* +pg_peak_next_field(int *length) +{ + return pg_strtok_extend(length, true); +} /* * * the lisp token parser @@ -149,7 +164,7 @@ stringToNodeWithLocations(const char *str) * as a single token. */ const char * -pg_strtok(int *length) +pg_strtok_extend(int *length, bool testonly) { const char *local_str; /* working pointer to string */ const char *ret_str; /* start of token to return */ @@ -162,7 +177,8 @@ pg_strtok(int *length) if (*local_str == '\0') { *length = 0; - pg_strtok_ptr = local_str; + if (!testonly) + pg_strtok_ptr = local_str; return NULL; /* no more tokens */ } @@ -199,7 +215,8 @@ pg_strtok(int *length) if (*length == 2 && ret_str[0] == '<' && ret_str[1] == '>') *length = 0; - pg_strtok_ptr = local_str; + if (!testonly) + pg_strtok_ptr = local_str; return ret_str; } -- the below is a demo code to use it. 
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c index 76ab5ae8b7..c19cd45793 100644 --- a/src/backend/nodes/readfuncs.c +++ b/src/backend/nodes/readfuncs.c @@ -689,13 +689,27 @@ _readFuncExpr(void) READ_OID_FIELD(funcid); READ_OID_FIELD(funcresulttype); - READ_BOOL_FIELD(funcretset); + token = pg_peak_next_field(&length); + if (memcmp(token, ":funcretset", strlen(":funcretset")) == 0) + { + READ_BOOL_FIELD(funcretset); + } + else + local_node->funcretset = false; + READ_BOOL_FIELD(funcvariadic); READ_ENUM_FIELD(funcformat, CoercionForm); - READ_OID_FIELD(funccollid); + READ_OID_FIELD(funccollid); READ_OID_FIELD(inputcollid); READ_NODE_FIELD(args); - READ_LOCATION_FIELD(location); + +token = pg_peak_next_field(&length); + if (memcmp(token, ":location", strlen(":location")) == 0) + { + READ_LOCATION_FIELD(location); + } + else + local_node->location = -1; READ_DONE(); } After writing it, I am feeling that this will waste a bit of performance since we need to token a part of the string twice. But looks the overhead looks good to me and can be avoided if we refactor the code again with "read_fieldname_or_nomove(char *fieldname, int *length) " function. -- Best Regards Andy Fan
Re: About adding a new field to a struct in primnodes.h
On Wed, Nov 25, 2020 at 9:40 AM Tom Lane wrote: > Andy Fan writes: > > What I mean here is something like below. > > What exactly would be the value of that? > > There is work afoot, or at least on people's to-do lists, to mechanize > creation of the outfuncs/readfuncs/etc code directly from the Node > struct declarations. So special cases for particular fields are going > to be looked on with great disfavor, I agree with this, but I don't think there is no value in my suggestion unless I missed something. Per my current understanding, the code is too easy to make the datadir incompatible with binary, which requires user to do some extra work and my suggestion is to avoid that and it is the value. I think it is possible to make this strategy work with the "mechanize creation of the outfuncs/readfuncs/ect node", but there is no point to try it unless we make agreement on if we should do that. > even if you can make a case that > there's some reason to do it. (Which I'm not seeing. We've certainly > never had to do it in the past.) > > regards, tom lane > -- Best Regards Andy Fan
Re: About adding a new field to a struct in primnodes.h
On Wed, Nov 25, 2020 at 11:54 AM Tom Lane wrote: > Andy Fan writes: > > On Wed, Nov 25, 2020 at 9:40 AM Tom Lane wrote: > >> What exactly would be the value of that? > >> ... > > > I agree with this, but I don't think there is no value in my suggestion > > unless I missed something. Per my current understanding, the code > > is too easy to make the datadir incompatible with binary, which requires > > user to do some extra work and my suggestion is to avoid that and > > it is the value. > > In practice, a major-version upgrade is also going to involve things like > these: > > Actually I thought if this would be the main reason for this case, now you confirmed that. Thank you. > We already sweat a great deal to make user table contents be upwards > compatible across major versions. I think that trying to take on > some similar guarantee with respect to system catalog contents would > serve mostly to waste a lot of developer manpower that can be put to > much better uses. > > Less experienced people know what to do, but real experienced people know what not to do. Thanks for sharing the background in detail! Actually I may struggle too much in a case where I port a feature with node modification into a lower version which I have to consider the compatible issue. The feature is inline-cte, We added a new field in CommonTableExpr which is used to hint optimizer if to think about the "inline" or not. From the porting aspect, the new field will cause the incompatible issue, however a hint would not. I know porting will not be a concern of the community, and I would not argue about hints in this thread, I just share my little experience for reference here. Thank you in all. -- Best Regards Andy Fan
Re: [doc] plan invalidation when statistics are update
On Wed, Nov 25, 2020 at 1:13 PM Fujii Masao wrote: > > > On 2020/11/24 23:14, Fujii Masao wrote: > > > > > > On 2020/11/19 14:33, torikoshia wrote: > >> On 2020-11-18 11:35, Fujii Masao wrote: > >> > >> Thanks for your comment! > >> > >>> On 2020/11/18 11:04, torikoshia wrote: > >>>> Hi, > >>>> > >>>> AFAIU, when the planner statistics are updated, generic plans are > invalidated and PostgreSQL recreates. However, the manual doesn't seem to > explain it explicitly. > >>>> > >>>>https://www.postgresql.org/docs/devel/sql-prepare.html > >>>> > >>>> I guess this case is included in 'whenever database objects used in > the statement have definitional (DDL) changes undergone', but I feel it's > hard to infer. > >>>> > >>>> Since updates of the statistics can often happen, how about > describing this case explicitly like an attached patch? > >>> > >>> +1 to add that note. > >>> > >>> - statement. Also, if the value of linkend="guc-search-path"/> changes > >>> + statement. For example, when the planner statistics of the > statement > >>> + are updated, PostgreSQL re-analyzes and > >>> + re-plans the statement. > >>> > >>> I don't think "For example," is necessary. > >>> > >>> "planner statistics of the statement" sounds vague? Does the statement > >>> is re-analyzed and re-planned only when the planner statistics of > database > >>> objects used in the statement are updated? If yes, we should describe > >>> that to make the note a bit more explicitly? > >> > >> Yes. As far as I confirmed, updating statistics which are not used in > >> prepared statements doesn't trigger re-analyze and re-plan. > >> > >> Since plan invalidations for DDL changes and statistcal changes are > caused > >> by PlanCacheRelCallback(Oid 'relid'), only the prepared statements using > >> 'relid' relation seem invalidated.> Attached updated patch. > > > > Thanks for confirming that and updating the patch! > > force re-analysis and re-planning of the statement before using it > whenever database objects used in the statement have undergone > definitional (DDL) changes since the previous use of the prepared > - statement. Also, if the value of > changes > + statement. Similarly, whenever the planner statistics of database > + objects used in the statement have updated, re-analysis and re-planning > + happen. > > "been" should be added between "have" and "updated" in the above "objects > used in the statement have updated"? > > I'm inclined to add "since the previous use of the prepared statement" into > also the second description, to make it clear. But if we do that, it's > better > to merge the above two description into one, as follows? > > whenever database objects used in the statement have undergone > - definitional (DDL) changes since the previous use of the prepared > + definitional (DDL) changes or the planner statistics of them have > + been updated since the previous use of the prepared > statement. Also, if the value of > changes > > > +1 for documenting this case since I just spent time reading code last week for it. and +1 for the above sentence to describe this case. -- Best Regards Andy Fan
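A quick way to observe the behaviour being documented (a sketch, assuming any table t with some data and an integer column a):

prepare p as select * from t where a = $1;
execute p(1);            -- plan built with the current statistics
analyze t;               -- updates planner statistics and invalidates the cached plan
explain execute p(1);    -- the statement is re-analyzed and re-planned here

and, as noted above, only prepared statements that reference the analyzed relation are invalidated.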
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
This patch has stopped moving for a while; any suggestions about how to move it forward are appreciated. -- Best Regards Andy Fan
Re: Hybrid Hash/Nested Loop joins and caching results from subplans
On Fri, Nov 27, 2020 at 8:10 AM David Rowley wrote: > Thanks for having another look at this. > > > On Sun, Nov 22, 2020 at 9:21 PM Andy Fan > wrote: > > add 2 more comments. > > > > 1. I'd suggest adding Assert(false); in RC_END_OF_SCAN case to make the > error clearer. > > > > case RC_END_OF_SCAN: > > /* > > * We've already returned NULL for this scan, but just in case > > * something call us again by mistake. > > */ > > return NULL; > > This just took some inspiration from nodeMaterial.c where it says: > > /* > * If necessary, try to fetch another row from the subplan. > * > * Note: the eof_underlying state variable exists to short-circuit further > * subplan calls. It's not optional, unfortunately, because some plan > * node types are not robust about being called again when they've already > * returned NULL. > */ > > I'm not feeling a pressing need to put an Assert(false); in there as > it's not what nodeMaterial.c does. nodeMaterial is nodeResultCache's > sister node which can also be seen below Nested Loops. > > OK, even though I am not quite understanding the above now, I will try to figure it by myself. I'm OK with this decision. > > 2. Currently we handle the (!cache_store_tuple(node, outerslot))) case > by set it > >to RC_CACHE_BYPASS_MODE. The only reason for the cache_store_tuple > failure is > >we can't cache_reduce_memory. I guess if cache_reduce_memory > >failed once, it would not succeed later(no more tuples can be stored, > >nothing is changed). So I think we can record this state and avoid > any new > >cache_reduce_memory call. > > > > /* > > * If we failed to create the entry or failed to store the > > * tuple in the entry, then go into bypass mode. > > */ > > if (unlikely(entry == NULL || > > !cache_store_tuple(node, outerslot))) > > > > to > > > > if (unlikely(entry == NULL || node->memory_cant_be_reduced || > > !cache_store_tuple(node, outerslot))) > > The reason for RC_CACHE_BYPASS_MODE is if there's a single set of > parameters that have so many results that they, alone, don't fit in > the cache. We call cache_reduce_memory() whenever we go over our > memory budget. That function returns false if it was unable to free > enough memory without removing the "specialkey", which in this case is > the current cache entry that's being populated. Later, when we're > caching some entry that isn't quite so large, we still want to be able > to cache that. In that case, we'll have removed the remnants of the > overly large entry that didn't fit to way for newer and, hopefully, > smaller entries. No problems. I'm not sure why there's a need for > another flag here. > > Thanks for the explanation, I'm sure I made some mistakes before at this part. -- Best Regards Andy Fan
cost_sort vs cost_agg
Currently the cost_sort doesn't consider the number of columns to sort, which means the cost of SELECT * FROM t ORDER BY a; equals with the SELECT * FROM t ORDER BY a, b; which is obviously wrong. The impact of this is when we choose the plan for SELECT DISTINCT * FROM t ORDER BY c between: Sort Sort Key: c -> HashAggregate Group Key: c, a, b, d, e, f, g, h, i, j, k, l, m, n and Unique -> Sort Sort Key: c, a, b, d, e, f, g, h, i, j, k, l, m, n Since "Sort (c)" has the same cost as "Sort (c, a, b, d, e, f, g, h, i, j, k, l, m, n)", and Unique node on a sorted input is usually cheaper than HashAggregate, so the later one will win usually which might bad at many places. My patch v1 did a simple improvement for cost_sort, which will consider the number of cols to sort. The main part is below: cost_sort: Assert(numSortCols); /* Include the default cost-per-comparison */ + comparison_cost += (2.0 * cpu_operator_cost * numSortCols); However it still chooses a wrong plan in the simple case below. create table wcols (a int , b int, c int, d int, e int, f int, g int, h int, i int, j int, k int, l int, m int, n int); insert into wcols select i, i , i, i , i, i , i, i, i, i, i, i, i, i from generate_series(1, 100)i; select distinct * from wcols order by c; Optimizer chose HashAggregate with my patch, but it takes 6s. after set enable_hashagg = off, it takes 2s. The main reason is both cost_sort and cost_agg doesn't consider the real hash function or real sort function, they use cpu_operator_cost instead. If we really want to fix this issue, shall we a). figure the real pg_proc.oid for sort and hash during planning stage and costing with that? b). in cost_sort, we may consider the nature order of input data for the ordered column as well? c). store the Oids in SortPath and AggPath to avoid the double calculation during createPlan stage? or any better suggestion? Thanks -- Best Regards Andy Fan (https://www.aliyun.com/) v1-0001-Improve-the-cost_sort-v1.patch Description: Binary data
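As a rough worked example of what that one-liner changes, with the default cpu_operator_cost = 0.0025: a sort on a single key now adds 2.0 * 0.0025 * 1 = 0.005 per comparison, while a sort on all 14 columns of wcols adds 2.0 * 0.0025 * 14 = 0.07 per comparison, so "Sort (c)" is no longer costed the same as "Sort (c, a, b, d, e, f, g, h, i, j, k, l, m, n)".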
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
Hi Masahiko: On Fri, Jan 22, 2021 at 9:15 PM Masahiko Sawada wrote: > Hi Andy, > > On Mon, Dec 7, 2020 at 9:15 PM Andy Fan wrote: > > > > > > > > On Mon, Dec 7, 2020 at 4:16 PM Jesper Pedersen < > jesper.peder...@redhat.com> wrote: > >> > >> Hi, > >> > >> On 12/5/20 10:38 PM, Andy Fan wrote: > >> > Currently the UniqueKey is defined as a List of Expr, rather than > >> > EquivalenceClasses. > >> > A complete discussion until now can be found at [1] (The messages I > replied > >> > to also > >> > care a lot and the information is completed). This patch has stopped > at > >> > this place for > >> > a while, I'm planning to try EquivalenceClasses, but any suggestion > would > >> > be welcome. > >> > > >> > > >> > >> Unfortunately I think we need a RfC style patch of both versions in > >> their minimum implementation. > >> > >> Hopefully this will make it easier for one or more committers to decide > >> on the right direction since they can do a side-by-side comparison of > >> the two solutions. > >> > > > > I do get the exact same idea. Actually I have made EquivalenceClasses > > works with baserel last weekend and then I realized it is hard to compare > > the 2 situations without looking into the real/Poc code, even for very > > experienced people. I will submit a new patch after I get the > partitioned > > relation, subquery works. Hope I can make it in one week. > > Status update for a commitfest entry. > > Are you planning to submit a new patch? Or is there any blocker for > this work? This patch entry on CF app has been in state Waiting on > Author for a while. If there is any update on that, please reflect on > CF app. > > > I agree that the current status is "Waiting on author", and no block issue for others. I plan to work on this in 1 month. I have to get my current urgent case completed first. Sorry for the delay action and thanks for asking. -- Best Regards Andy Fan (https://www.aliyun.com/)
Extend more usecase for planning time partition pruning and init partition pruning.
Hi: I recently found a use case like this. SELECT * FROM p, q WHERE p.partkey = q.colx AND (q.colx = $1 OR q.colx = $2); Then we can't do either planning time partition prune or init partition prune. Even though we have run-time partition pruning work at last, it is too late in some cases since we have to init all the plan nodes in advance. In my case, there are 10+ partitioned relation in one query and the execution time is short, so the init plan a lot of plan nodes cares a lot. The attached patches fix this issue. It just get the "p.partkey = q.colx" case in root->eq_classes or rel->joinlist (outer join), and then check if there is some baserestrictinfo in another relation which can be used for partition pruning. To make the things easier, both partkey and colx must be Var expression in implementation. - v1-0001-Make-some-static-functions-as-extern-and-extend-C.patch Just some existing refactoring and extending ChangeVarNodes to be able to change var->attno. - v1-0002-Build-some-implied-pruning-quals-to-extend-the-us.patch Do the real job. Thought? -- Best Regards Andy Fan (https://www.aliyun.com/) v1-0001-Make-some-static-functions-as-extern-and-extend-C.patch Description: Binary data v1-0002-Build-some-implied-pruning-quals-to-extend-the-us.patch Description: Binary data
Re: Extend more usecase for planning time partition pruning and init partition pruning.
On Sun, Jan 24, 2021 at 6:34 PM Andy Fan wrote: > Hi: > > I recently found a use case like this. SELECT * FROM p, q WHERE > p.partkey = > q.colx AND (q.colx = $1 OR q.colx = $2); Then we can't do either planning > time > partition prune or init partition prune. Even though we have run-time > partition pruning work at last, it is too late in some cases since we have > to init all the plan nodes in advance. In my case, there are 10+ > partitioned relation in one query and the execution time is short, so the > init plan a lot of plan nodes cares a lot. > > The attached patches fix this issue. It just get the "p.partkey = q.colx" > case in root->eq_classes or rel->joinlist (outer join), and then check if > there > is some baserestrictinfo in another relation which can be used for > partition > pruning. To make the things easier, both partkey and colx must be Var > expression in implementation. > > - v1-0001-Make-some-static-functions-as-extern-and-extend-C.patch > > Just some existing refactoring and extending ChangeVarNodes to be able > to change var->attno. > > - v1-0002-Build-some-implied-pruning-quals-to-extend-the-us.patch > > Do the real job. > > Thought? > > > > -- > Best Regards > Andy Fan (https://www.aliyun.com/) > Some results from this patch. create table p (a int, b int, c character varying(8)) partition by list(c); create table p1 partition of p for values in ('01'); create table p2 partition of p for values in ('02'); create table p3 partition of p for values in ('03'); create table q (a int, c character varying(8), b int) partition by list(c); create table q1 partition of q for values in ('01'); create table q2 partition of q for values in ('02'); create table q3 partition of q for values in ('03'); Before the patch: postgres=# explain (costs off) select * from p inner join q on p.c = q.c and q.c > '02'; QUERY PLAN Hash Join Hash Cond: ((p.c)::text = (q.c)::text) -> Append -> Seq Scan on p1 p_1 -> Seq Scan on p2 p_2 -> Seq Scan on p3 p_3 -> Hash -> Seq Scan on q3 q Filter: ((c)::text > '02'::text) (9 rows) After the patch: QUERY PLAN Hash Join Hash Cond: ((p.c)::text = (q.c)::text) -> Seq Scan on p3 p -> Hash -> Seq Scan on q3 q Filter: ((c)::text > '02'::text) (6 rows) Before the patch: postgres=# explain (costs off) select * from p inner join q on p.c = q.c and (q.c = '02' or q.c = '01'); QUERY PLAN Hash Join Hash Cond: ((p.c)::text = (q.c)::text) -> Append -> Seq Scan on p1 p_1 -> Seq Scan on p2 p_2 -> Seq Scan on p3 p_3 -> Hash -> Append -> Seq Scan on q1 q_1 Filter: (((c)::text = '02'::text) OR ((c)::text = '01'::text)) -> Seq Scan on q2 q_2 Filter: (((c)::text = '02'::text) OR ((c)::text = '01'::text)) (12 rows) After the patch: QUERY PLAN Hash Join Hash Cond: ((p.c)::text = (q.c)::text) -> Append -> Seq Scan on p1 p_1 -> Seq Scan on p2 p_2 -> Hash -> Append -> Seq Scan on q1 q_1 Filter: (((c)::text = '02'::text) OR ((c)::text = '01'::text)) -> Seq Scan on q2 q_2 Filter: (((c)::text = '02'::text) OR ((c)::text = '01'::text)) (11 rows) Before the patch: postgres=# explain (costs off) select * from p left join q on p.c = q.c where (q.c = '02' or q.c = '01'); QUERY PLAN Hash Join Hash Cond: ((p.c)::text = (q.c)::text) -> Append -> Seq Scan on p1 p_1 -> Seq Scan on p2 p_2 -> Seq Scan on p3 p_3 -> Hash -> Append -> Seq Scan on q1 q_1 Filter: (((c)::text = '02'::text) OR ((c)::text = '01'::text)) -> Seq Scan on q2
Re: Extend more usecase for planning time partition pruning and init partition pruning.
On Mon, Jan 25, 2021 at 10:21 AM Andy Fan wrote: > > > On Sun, Jan 24, 2021 at 6:34 PM Andy Fan wrote: > >> Hi: >> >> I recently found a use case like this. SELECT * FROM p, q WHERE >> p.partkey = >> q.colx AND (q.colx = $1 OR q.colx = $2); Then we can't do either >> planning time >> partition prune or init partition prune. Even though we have run-time >> partition pruning work at last, it is too late in some cases since we >> have >> to init all the plan nodes in advance. In my case, there are 10+ >> partitioned relation in one query and the execution time is short, so the >> init plan a lot of plan nodes cares a lot. >> >> The attached patches fix this issue. It just get the "p.partkey = q.colx" >> case in root->eq_classes or rel->joinlist (outer join), and then check if >> there >> is some baserestrictinfo in another relation which can be used for >> partition >> pruning. To make the things easier, both partkey and colx must be Var >> expression in implementation. >> >> - v1-0001-Make-some-static-functions-as-extern-and-extend-C.patch >> >> Just some existing refactoring and extending ChangeVarNodes to be able >> to change var->attno. >> >> - v1-0002-Build-some-implied-pruning-quals-to-extend-the-us.patch >> >> Do the real job. >> >> Thought? >> >> >> >> -- >> Best Regards >> Andy Fan (https://www.aliyun.com/) >> > > > Some results from this patch. > > create table p (a int, b int, c character varying(8)) partition by list(c); > create table p1 partition of p for values in ('01'); > create table p2 partition of p for values in ('02'); > create table p3 partition of p for values in ('03'); > create table q (a int, c character varying(8), b int) partition by list(c); > create table q1 partition of q for values in ('01'); > create table q2 partition of q for values in ('02'); > create table q3 partition of q for values in ('03'); > > Before the patch: > postgres=# explain (costs off) select * from p inner join q on p.c = q.c > and q.c > '02'; > QUERY PLAN > > Hash Join >Hash Cond: ((p.c)::text = (q.c)::text) >-> Append > -> Seq Scan on p1 p_1 > -> Seq Scan on p2 p_2 > -> Seq Scan on p3 p_3 >-> Hash > -> Seq Scan on q3 q >Filter: ((c)::text > '02'::text) > (9 rows) > > After the patch: > > QUERY PLAN > > Hash Join >Hash Cond: ((p.c)::text = (q.c)::text) >-> Seq Scan on p3 p >-> Hash > -> Seq Scan on q3 q >Filter: ((c)::text > '02'::text) > (6 rows) > > > Before the patch: > postgres=# explain (costs off) select * from p inner join q on p.c = q.c > and (q.c = '02' or q.c = '01'); > QUERY PLAN > > > Hash Join >Hash Cond: ((p.c)::text = (q.c)::text) >-> Append > -> Seq Scan on p1 p_1 > -> Seq Scan on p2 p_2 > -> Seq Scan on p3 p_3 >-> Hash > -> Append >-> Seq Scan on q1 q_1 > Filter: (((c)::text = '02'::text) OR ((c)::text = > '01'::text)) >-> Seq Scan on q2 q_2 > Filter: (((c)::text = '02'::text) OR ((c)::text = > '01'::text)) > (12 rows) > > After the patch: > QUERY PLAN > > > Hash Join >Hash Cond: ((p.c)::text = (q.c)::text) >-> Append > -> Seq Scan on p1 p_1 > -> Seq Scan on p2 p_2 >-> Hash > -> Append >-> Seq Scan on q1 q_1 > Filter: (((c)::text = '02'::text) OR ((c)::text = > '01'::text)) >-> Seq Scan on q2 q_2 > Filter: (((c)::text = '02'::text) OR ((c)::text = > '01'::text)) > (11 rows) > > Before the patch: > postgres=# explain (costs off) select * from p
Re: cost_sort vs cost_agg
Thank you Ashutosh. On Fri, Jan 15, 2021 at 7:18 PM Ashutosh Bapat wrote: > On Thu, Jan 14, 2021 at 7:12 PM Andy Fan wrote: > > > > Currently the cost_sort doesn't consider the number of columns to sort, > which > > means the cost of SELECT * FROM t ORDER BY a; equals with the SELECT * > > FROM t ORDER BY a, b; which is obviously wrong. The impact of this is > when we > > choose the plan for SELECT DISTINCT * FROM t ORDER BY c between: > > > > Sort > >Sort Key: c > >-> HashAggregate > > Group Key: c, a, b, d, e, f, g, h, i, j, k, l, m, n > > > > and > > > > Unique > >-> Sort > > Sort Key: c, a, b, d, e, f, g, h, i, j, k, l, m, n > > > > > > Since "Sort (c)" has the same cost as "Sort (c, a, b, d, e, f, g, h, i, > j, k, > > l, m, n)", and Unique node on a sorted input is usually cheaper than > > HashAggregate, so the later one will win usually which might bad at many > > places. > > I can imagine that HashAggregate + Sort will perform better if there > are very few distinct rows but otherwise, Unique on top of Sort would > be a better strategy since it doesn't need two operations. > > Thanks for the hint, I will consider the distinct rows as a factor in the next patch. > > > > Optimizer chose HashAggregate with my patch, but it takes 6s. after set > > enable_hashagg = off, it takes 2s. > > This example actually shows that using Unique is better than > HashAggregate + Sort. May be you want to try with some data which has > very few distinct rows. > > -- Best Regards Andy Fan (https://www.aliyun.com/)
Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)
Hi: This patch is the first patch in UniqueKey patch series[1], since I need to revise that series many times but the first one would be not that often, so I'd like to submit this one for review first so that I don't need to maintain it again and again. v1-0001-Introduce-notnullattrs-field-in-RelOptInfo-to-ind.patch Introduce notnullattrs field in RelOptInfo to indicate which attr are not null in current query. The not null is judged by checking pg_attribute and query's restrictinfo. The info is only maintained at Base RelOptInfo and Partition's RelOptInfo level so far. Any thoughts? [1] https://www.postgresql.org/message-id/CAKU4AWr1BmbQB4F7j22G%2BNS4dNuem6dKaUf%2B1BK8me61uBgqqg%40mail.gmail.com -- Best Regards Andy Fan (https://www.aliyun.com/)
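As a small sketch of the two sources mentioned above (the table name is made up):

create table t2 (a int not null, b int);
select b from t2 where b > 1;

Here a is known not null from pg_attribute.attnotnull, and b is known not null in this particular query because the strict qual b > 1 in its restrictinfo list can never pass for a NULL value, so both attributes would end up in the relation's notnullattrs.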
Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)
On Wed, Feb 10, 2021 at 11:18 AM Andy Fan wrote: > Hi: > > This patch is the first patch in UniqueKey patch series[1], since I need > to revise > that series many times but the first one would be not that often, so I'd > like to > submit this one for review first so that I don't need to maintain it again > and again. > > v1-0001-Introduce-notnullattrs-field-in-RelOptInfo-to-ind.patch > > Introduce notnullattrs field in RelOptInfo to indicate which attr are not > null > in current query. The not null is judged by checking pg_attribute and > query's > restrictinfo. The info is only maintained at Base RelOptInfo and > Partition's > RelOptInfo level so far. > > > Any thoughts? > > [1] > https://www.postgresql.org/message-id/CAKU4AWr1BmbQB4F7j22G%2BNS4dNuem6dKaUf%2B1BK8me61uBgqqg%40mail.gmail.com > > -- > Best Regards > Andy Fan (https://www.aliyun.com/) > Add the missed patch.. -- Best Regards Andy Fan (https://www.aliyun.com/) v1-0001-Introduce-notnullattrs-field-in-RelOptInfo-to-ind.patch Description: Binary data
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
On Sun, Jan 24, 2021 at 6:26 PM Andy Fan wrote: > Hi Masahiko: > > On Fri, Jan 22, 2021 at 9:15 PM Masahiko Sawada > wrote: > >> Hi Andy, >> >> On Mon, Dec 7, 2020 at 9:15 PM Andy Fan wrote: >> > >> > >> > >> > On Mon, Dec 7, 2020 at 4:16 PM Jesper Pedersen < >> jesper.peder...@redhat.com> wrote: >> >> >> >> Hi, >> >> >> >> On 12/5/20 10:38 PM, Andy Fan wrote: >> >> > Currently the UniqueKey is defined as a List of Expr, rather than >> >> > EquivalenceClasses. >> >> > A complete discussion until now can be found at [1] (The messages I >> replied >> >> > to also >> >> > care a lot and the information is completed). This patch has stopped >> at >> >> > this place for >> >> > a while, I'm planning to try EquivalenceClasses, but any >> suggestion would >> >> > be welcome. >> >> > >> >> > >> >> >> >> Unfortunately I think we need a RfC style patch of both versions in >> >> their minimum implementation. >> >> >> >> Hopefully this will make it easier for one or more committers to decide >> >> on the right direction since they can do a side-by-side comparison of >> >> the two solutions. >> >> >> > >> > I do get the exact same idea. Actually I have made EquivalenceClasses >> > works with baserel last weekend and then I realized it is hard to >> compare >> > the 2 situations without looking into the real/Poc code, even for very >> > experienced people. I will submit a new patch after I get the >> partitioned >> > relation, subquery works. Hope I can make it in one week. >> >> Status update for a commitfest entry. >> >> Are you planning to submit a new patch? Or is there any blocker for >> this work? This patch entry on CF app has been in state Waiting on >> Author for a while. If there is any update on that, please reflect on >> CF app. >> >> >> I agree that the current status is "Waiting on author", and no block > issue for others. > I plan to work on this in 1 month. I have to get my current urgent case > completed first. > Sorry for the delay action and thanks for asking. > > > I'd start to continue this work today. At the same time, I will split the multi-patch series into some dedicated small chunks for easier review. The first one is just for adding a notnullattrs in RelOptInfo struct, in thread [1]. https://www.postgresql.org/message-id/flat/CAKU4AWpQjAqJwQ2X-aR9g3%2BZHRzU1k8hNP7A%2B_mLuOv-n5aVKA%40mail.gmail.com -- Best Regards Andy Fan (https://www.aliyun.com/)
Re: Can I assume a relation would not be invalidated from ExecutorRun to ExecutorEnd
> If the relation is open, > the relcache entry can't be destroyed altogether, but it can be > rebuilt: see RelationClearRelation(). Thanks! This is amazing new knowledge for me! > They are in fact stable in a certain sense - if we have the relation open we hold a reference count on it, and so the Relation pointer itself will remain valid. This sounds amazing as well. > But > the data it points to can change in various ways, and different > members of the RelationData struct are handled differently. Separately > from the reference count, the heavyweight lock that we also hold on > the relation as a condition of opening it prevents certain kinds of > changes, so that even if the relation cache entry is rebuilt, certain > particular fields will be unaffected. Which fields are protected in > this way will depend on what kind of lock is held. It's hard to speak > in general terms. Amazing++; > The best advice I can give you is (1) look exactly > what RelationClearRelation() is going to do to the fields you care > about if a rebuild happens, (2) err on the side of assuming that > things can change under you, and (3) try running your code under > debug_discard_caches = 1. It will be slow that way, but it's pretty > effective in finding places where you've made unsafe assumptions. > > Thanks! I now clearly see what was wrong in my previous understanding, namely my assumption that once a relation is opened with some lock, its content will never change until RelationClose. It will take time to fill the gap, but I'd like to say "thank you!" first. -- Best Regards Andy Fan
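For anyone following along, a minimal sketch of that last suggestion (debug_discard_caches is a GUC in v14 and later and needs an assert-enabled build; older branches need a CLOBBER_CACHE_ALWAYS build):

set debug_discard_caches = 1;
select * from t3 where a = 1;  -- t3 stands for whatever query exercises the code in question

With that setting every catalog access discards and rebuilds cache entries, so code that assumes a relcache entry's fields stay untouched between ExecutorRun and ExecutorEnd tends to fail very quickly.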
Proposal for col LIKE $1 with generic Plan
Thanks to the get_index_clause_from_support, we can use index for WHERE a like 'abc%' case. However this currently only works on custom plan. Think about the case where the planning takes lots of time, custom plan will not give a good result. so I want to see if we should support this for a generic plan as well. The first step of this is we need to find an operator to present prefix is just literal. which means '%' is just '%', not match any characters. After trying 'like' and '~' operator, I find none of them can be used. for example: PREPARE s AS SELECT * FROM t WHERE a LIKE ('^' || $1); EXECUTE s('%abc'); '%' is still a special character to match any characters. So '~' is. So I think we need to define an new operator like text(a) ~^ text(b), which means a is prefixed with b literally. For example: 'abc' ~^ 'ab` -> true 'abc' ~^ 'ab%' -> false so the above case can be written as: PREPARE s AS SELECT * FROM t WHERE a ~^ $1; The second step is we need to define greater string for $1 just like make_greater_string. Looks we have to know the exact value to make the greater string, so we can define a new FuncExpr like this: Index Cond: ((md5 >= $1::text) AND (md5 < make_greater_string_fn($1)). This may not be able to fix handle the current make_greater_string return NULL case. so we may define another FuncExpr like text_less_than_or_null Index Cond: ((md5 >= $1::text) AND (text_less_than_or_null(md5, make_greater_string_fn($1))) Is this a right thing to do and a right method? Thanks -- Best Regards Andy Fan (https://www.aliyun.com/)
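A minimal way to see the problem described above (a sketch, assuming a table t with a btree index on a that is usable for LIKE, e.g. text_pattern_ops or the C collation):

prepare s as select * from t where a like $1;
set plan_cache_mode = force_custom_plan;
explain execute s('abc%');   -- index scan, the support function sees the constant prefix
set plan_cache_mode = force_generic_plan;
explain execute s('abc%');   -- seq scan, the prefix is unknown at plan time

so the index is only usable as long as we keep re-planning with the actual parameter value.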
Re: Proposal for col LIKE $1 with generic Plan
On Thu, Mar 25, 2021 at 10:15 AM Andy Fan wrote: > Thanks to the get_index_clause_from_support, we can use index for WHERE a > like > 'abc%' case. However this currently only works on custom plan. Think about > the > case where the planning takes lots of time, custom plan will not give a > good > result. so I want to see if we should support this for a generic plan as > well. > > The first step of this is we need to find an operator to present prefix is > just > literal. which means '%' is just '%', not match any characters. After > trying > 'like' and '~' operator, I find none of them can be used. for example: > > PREPARE s AS SELECT * FROM t WHERE a LIKE ('^' || $1); > EXECUTE s('%abc'); > > '%' is still a special character to match any characters. So '~' is. So > I think > we need to define an new operator like text(a) ~^ text(b), which means a > is prefixed with b literally. For example: > > 'abc' ~^ 'ab` -> true > 'abc' ~^ 'ab%' -> false > > so the above case can be written as: > > PREPARE s AS SELECT * FROM t WHERE a ~^ $1; > > During the PoC coding, I found we already have ^@ operator for this [1], but we don't implement that for BTree index so far. So I will try gist index for my current user case and come back to this thread later. Thanks! [1] https://www.postgresql.org/message-id/20180416155036.36070396%40wp.localdomain -- Best Regards Andy Fan (https://www.aliyun.com/)
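For reference, a couple of quick examples of that existing operator (^@ is the starts_with operator):

select 'abc' ^@ 'ab';   -- true
select 'abc' ^@ 'ab%';  -- false, '%' is matched literally

which is exactly the literal-prefix semantics wanted for ~^ above, just without btree index support so far.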
Re: UniqueKey on Partitioned table.
On Sat, Mar 27, 2021 at 3:07 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Sat, Feb 20, 2021 at 10:25:59AM +0800, Andy Fan wrote: > > > > The attached is a UnqiueKey with EquivalenceClass patch, I just complete > the > > single relation part and may have bugs. I just attached it here for > design > > review only. and the not-null-attrs is just v1 which we can continue > > discussing on the original thread[2]. > > Thanks for the patch. After a short look through it I'm a bit confused > and wanted to clarify, now uniquekeys list could contain both Expr and > EquivalenceClass? > Yes. That's because I don't want to create a new EquivalenceClass (which would make PlannerInfo->eq_classes longer) when we don't already have one, so I just used an Expr instead for that case. However, during testing I unexpectedly found some EquivalenceClasses with only one EquivalenceMember. -- Best Regards Andy Fan (https://www.aliyun.com/)
Re: Extend more usecase for planning time partition pruning and init partition pruning.
Hi David: On Mon, Mar 8, 2021 at 9:34 AM David Rowley wrote: > On Thu, 4 Mar 2021 at 22:07, Amit Langote wrote: > > * Or maybe have you considered generalizing what > > build_implied_pruning_quals() does so that other places like > > indxpath.c can use the facility? > > I agree with doing it another way. There's plenty of other queries > which we could produce a better plan for if EquivalenceClass knew > about things like IN conditions and >=, >, < and <= btree ops. > > It seems wrong to code anything in this regard that's specific to > partition pruning. > > Please see [1] for an idea. IIRC, the implementation was not well > received and there were concerns about having to evaluate additional > needless quals. That part I think can be coded around. The trick will > be to know when and when not to use additional quals. > > The show stopper for me was having a more efficient way to find if a > given Expr exists in an EquivalenceClass. This is why I didn't take > the idea further, at the time. My implementation in that patch > required lots of looping to find if a given Expr had an existing > EquivalenceMember, to which there was a danger of that becoming slow > for complex queries. > > I'm unsure right now if it would be possible to build standard > EquivalenceMembers and EquivalenceFilters in the same pass. I think > it might require 2 passes since you only can use IN and range type > quals for Exprs that actually have a EquivalenceMember. So you need to > wait until you're certain there's some equality OpExpr before adding > EquivalenceFilters. (Pass 1 can perhaps remember if anything looks > interesting and then skip pass 2 if there's not...??) > > EquivalenceClass might be slightly faster now since we have > RelOptInfo.eclass_indexes. However, I've not checked to see if the > indexes will be ready in time for when you'd be building the > additional filters. I'm guessing that they wouldn't be since you'd > still be building the EquivalenceClasses at that time. Certainly, > process_equivalence() could do much faster lookups of Exprs if there > was some global index for all EquivalenceMembers. However, > equalfuncs.c only gives us true or false if two nodes are equal(). > We'd need to either have a -1, 0, +1 value or be able to hash nodes > and put things into a hash table. Else we're stuck trawling through > lists comparing each item 1-by-1. That's pretty slow. Especially with > complex queries. > > Both Andres and I have previously suggested ways to improve Node > searching. My idea is likely easier to implement, as it just changed > equalfuncs.c to add a function that returns -1, 0, +1 so we could use > a binary search tree to index Nodes. Andres' idea [2] is likely the > better of the two. Please have a look at that. It'll allow us to > easily build a function to hash nodes and put them in a hash table. > > To get [1], the implementation will need to be pretty smart. There's > concern about the idea. See [3]. You'll need to ensure you're not > adding too much planner overhead and also not slowing down execution > for cases by adding additional qual evals that are redundant. > > It's going to take some effort to make everyone happy here. > I truly understand what you are saying here, and believe that needs some more hard work to do. I am not sure I am prepared to do that at current stage. So I will give up this idea now and continue to work with this when time is permitted. I have marked the commitfest entry as "Returned with Feedback". Thanks for the detailed information! 
> David > > [1] > https://www.postgresql.org/message-id/flat/CAKJS1f9fPdLKM6%3DSUZAGwucH3otbsPk6k0YT8-A1HgjFapL-zQ%40mail.gmail.com#024ad18e19bb9b6c022fb572edc8c992 > [2] > https://www.postgresql.org/message-id/flat/20190828234136.fk2ndqtld3onfrrp%40alap3.anarazel.de > [3] > https://www.postgresql.org/message-id/flat/30810.1449335...@sss.pgh.pa.us#906319f5e212fc3a6a682f16da079f04 > -- Best Regards Andy Fan (https://www.aliyun.com/)
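To make the kind of query being discussed concrete, here is a hedged sketch (the schema is invented) where an EquivalenceClass-derived range qual would allow extra pruning:

CREATE TABLE p (k int, v int) PARTITION BY RANGE (k);
-- SELECT * FROM p p1 JOIN p p2 ON p1.k = p2.k WHERE p1.k >= 100;
-- Today "p1.k >= 100" prunes p1's partitions, but p2 is scanned in full;
-- deriving "p2.k >= 100" from the equivalence p1.k = p2.k would let the
-- planner prune p2's partitions as well.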
Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)
On Tue, Feb 16, 2021 at 12:01 PM David Rowley wrote: > On Fri, 12 Feb 2021 at 15:18, Andy Fan wrote: > > > > On Fri, Feb 12, 2021 at 9:02 AM David Rowley > wrote: > >> The reason I don't really like this is that it really depends where > >> you want to use RelOptInfo.notnullattrs. If someone wants to use it > >> to optimise something before the base quals are evaluated then they > >> might be unhappy that they found some NULLs. > >> > > > > Do you mean the notnullattrs is not set correctly before the base quals > are > > evaluated? I think we have lots of data structures which are set just > after some > > stage. but notnullattrs is special because it is set at more than 1 > stage. However > > I'm doubtful it is unacceptable, Some fields ever change their meaning > at different > > stages like Var->varno. If a user has a misunderstanding on it, it > probably will find it > > at the testing stage. > > You're maybe focusing too much on your use case for notnullattrs. It > only cares about NULLs in the result for each query level. > > thinks of an example... > > OK, let's say I decided that COUNT(*) is faster than COUNT(id) so > decided that I might like to write a patch which rewrite the query to > use COUNT(*) when it was certain that "id" could not contain NULLs. > > The query is: > > SELECT p.partid, p.partdesc,COUNT(s.saleid) FROM part p LEFT OUTER > JOIN sales s ON p.partid = s.partid GROUP BY p.partid; > > sale.saleid is marked as NOT NULL in pg_attribute. As the writer of > the patch, I checked the comment for notnullattrs and it says "Not > null attrs, start from -FirstLowInvalidHeapAttributeNumber", so I > should be ok to assume since sales.saleid is marked in notnullattrs > that I can rewrite the query?! > > The documentation about the RelOptInfo.notnullattrs needs to be clear > what exactly it means. I'm not saying your representation of how to > record NOT NULL in incorrect. I'm saying that you need to be clear > what exactly is being recorded in that field. > > If you want it to mean "attribute marked here cannot output NULL > values at this query level", then you should say something along those > lines. > > However, having said that, because this is a Bitmapset of > pg_attribute.attnums, it's only possible to record Vars from base > relations. It does not seem like you have any means to record > attributes that are normally NULLable, but cannot produce NULL values > due to a strict join qual. > > e.g: SELECT t.nullable FROM t INNER JOIN j ON t.nullable = j.something; > > I'd expect the RelOptInfo for t not to contain a bit for the > "nullable" column, but there's no way to record the fact that the join > RelOptInfo for {t,j} cannot produce a NULL for that column. It might > be quite useful to know that for the UniqueKeys patch. > > I read your comments again and find I miss your point before. So I'd summarize the my current understanding to make sure we are in the same place for further working. I want to define a notnullattrs on RelOptInfo struct. The not nullable may comes from catalog definition or quals on the given query. For example: CREATE TABLE t(a INT NOT NULL, nullable INT); SELECT * FROM t; ==> a is not null for sure by definition. SELECT * FROM t WHERE nullable > 3; ==> nullable is not null as well by qual. However the thing becomes complex with the below 2 cases. 1. SELECT * FROM t INNER JOIN j on t.nullable = q.b; We know t.b will not be null **finally**. 
But the current plan may be something like this:

                QUERY PLAN
 ------------------------------------------
  Merge Join
    Merge Cond: (t.nullable = j.something)
    ->  Sort
          Sort Key: t.nullable
          ->  Seq Scan on t
    ->  Sort
          Sort Key: j.something
          ->  Seq Scan on j
 (8 rows)

which means the path "Seq Scan on t" may still return some null values. At the very least, we should not assume t.nullable is "not nullable" at the base relation stage.

2. SELECT t.a FROM j LEFT JOIN t ON t.b = t.a; Even though t.a is NOT NULL by definition, it may still be null **finally** due to the outer join. My current patch doesn't handle these 2 cases well, since the attribute is marked as NOT NULL in both cases. > I know there's another discussion here between Ashutosh and Tom about > PathTarget's and Vars. I had the Var idea too once myself [1] but it > was quickly shot down. Tom's reasoning there in [1] seems legit. I > guess we'd need some sort of planner version of Var and never confuse > it with the Parse version of Var. That sounds like quite a big > project which woul
Re: UniqueKey on Partitioned table.
On Tue, Mar 30, 2021 at 4:16 AM David Rowley wrote: > On Tue, 30 Mar 2021 at 02:27, Ashutosh Bapat > wrote: > > > > On Sat, Mar 27, 2021 at 11:44 AM Andy Fan > wrote: > > > > > > On Sat, Mar 27, 2021 at 3:07 AM Dmitry Dolgov <9erthali...@gmail.com> > wrote: > > >> Thanks for the patch. After a short look through it I'm a bit confused > > >> and wanted to clarify, now uniquekeys list could contain both Expr and > > >> EquivalenceClass? > > > > > > > > > Yes, That's because I don't want to create a new EquivalenceClass > (which > > > would make the PlannerInfo->eq_classes longer) if we don't have > > > one , then I just used one Expr instead for this case. > > > However during the > > > test, I found some EquivalenceClass with only 1 EquivalenceMember > > > unexpectedly. > > > > > > > Pathkeys may induce single member ECs. Why UniqueKeys are an exception? > > When working with UniqueKey, I do want to make PlannerInfo.eq_classes short, so I don't want to create a new EC for UniqueKey only. After I realized we have so single-member ECs, I doubt if the "Expr in UniqueKey" will be executed in real. I still didn't get enough time to do more research about this. I doubt that it should be. get_eclass_for_sort_expr() makes > single-member ECs for sorts. Thanks for this hint. I can check more cases like this. > I imagine the UniqueKey stuff should > copy that... However, get_eclass_for_sort_expr() can often dominate the planning effort in queries to partitioned tables with a large > number of partitions when the query has an ORDER BY. Perhaps Andy is > trying to sidestep that issue? > Yes. a long PlannerInfo.eq_classes may make some finding slow, and in my UniqueKey patch, I am trying to not make it longer. > I mentioned a few things in [1] on what I think about this. > > David > > [1] > https://www.postgresql.org/message-id/CAApHDvoDMyw=htuw-258yqnk4bhw6cpguju_gzbh4x+rnoe...@mail.gmail.com > I appreciate all of the people who helped on this patch and others. I would like to share more of my planning. As for the UniqueKey patch, there are some design decisions that need to be made. In my mind, the order would be: a). How to present the notnullattrs probably in [1] b). How to present the element in UniqueKey. Prue EquivalenceClasses or Mix of Expr and EquivalenceClass as we just talked about. c). How to maintain the UniqueKey Partitioned table in the beginning of this thread. As for a) & c). I have my current proposal for discussion. as for b) I think I need more thinking about this. Based on the idea above, I am not willing to move too fast on the following issue unless the previous issue can be addressed. Any feedback/suggestion about my current planning is welcome. [1] https://www.postgresql.org/message-id/CAKU4AWp3WKyrMKNdg46BvQRD7xkNL9UsLLcLhd5ao%3DFSbnaN_Q%40mail.gmail.com -- Best Regards Andy Fan (https://www.aliyun.com/)
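As an illustrative example of where such single-member ECs come from (the table is made up): an ORDER BY on a bare expression makes get_eclass_for_sort_expr() build an EquivalenceClass containing just that one expression.

SELECT * FROM t ORDER BY a + 1;   -- creates a single-member EC for "a + 1"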
Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)
> > I assume we want to know if a Var is nullable with a function like. > is_var_notnullable(Var *var, Relids relids). If so, we can define the > data as below: > > struct RelOptInfo { > > Bitmapset** notnullattrs; > .. > }; > > After this we can implement the function as: > > bool > is_var_notnullable(Var* var, Relids relids) > { > RelOptInfo *rel = find_rel_by_relids(reldis); > return bms_is_member(var->varattno, rel->notnullattrs[var->varno]); > } > > Probably we can make some hackers to reduce the notnullattrs's memory usage > overhead. > > To be more precise, to make rel->notnullattrs shorter, we can apply the following methods: 1). When the Relids has only a single element, we can always use a 1-element array rather than sizing the array by the varno. 2). For multi-element relids, we can use max(varno) as the length of rel->notnullattrs. 3). In some cases the notnullattrs of a baserel does not change in later stages, so we can simply reuse the same Bitmapset * there. -- Best Regards Andy Fan (https://www.aliyun.com/)
Re: What to call an executor node which lazily caches tuples in a hash table?
On Wed, Mar 31, 2021 at 7:45 AM Zhihong Yu wrote: > Hi, > I was reading this part of the description: > > the Result Cache's > hash table is much smaller than the hash join's due to result cache only > caching useful values rather than all tuples from the inner side of the > join. > > I think the word 'Result' should be part of the cache name considering the > above. > > Cheers > > On Tue, Mar 30, 2021 at 4:30 PM David Rowley wrote: > >> Hackers, >> >> Over on [1] I've been working on adding a new type of executor node >> which caches tuples in a hash table belonging to a given cache key. >> >> The current sole use of this node type is to go between a >> parameterized nested loop and the inner node in order to cache >> previously seen sets of parameters so that we can skip scanning the >> inner scan for parameter values that we've already cached. The node >> could also be used to cache results from correlated subqueries, >> although that's not done yet. >> >> The cache limits itself to not use more than hash_mem by evicting the >> least recently used entries whenever more space is needed for new >> entries. >> >> Currently, in the patch, the node is named "Result Cache". That name >> was not carefully thought out. I just needed to pick something when >> writing the code. >> >> Here's an EXPLAIN output with the current name: >> >> postgres=# explain (costs off) select relkind,c from pg_class c1, >> lateral (select count(*) c from pg_class c2 where c1.relkind = >> c2.relkind) c2; >> QUERY PLAN >> >> Nested Loop >>-> Seq Scan on pg_class c1 >>-> Result Cache >> Cache Key: c1.relkind >> -> Aggregate >>-> Seq Scan on pg_class c2 >> Filter: (c1.relkind = relkind) >> (7 rows) >> >> I just got off a team call with Andres, Thomas and Melanie. During the >> call I mentioned that I didn't like the name "Result Cache". Many name >> suggestions followed: >> >> Here's a list of a few that were mentioned: >> >> Probe Cache >> Tuple Cache >> Keyed Materialize >> Hash Materialize >> Result Cache >> Cache >> Hash Cache >> Lazy Hash >> Reactive Hash >> Parameterized Hash >> Parameterized Cache >> Keyed Inner Cache >> MRU Cache >> MRU Hash >> >> I was hoping to commit the final patch pretty soon, but thought I'd >> have another go at seeing if we can get some consensus on a name >> before doing that. Otherwise, I'd sort of assumed that we'd just reach >> some consensus after everyone complained about the current name after >> the feature is committed. >> >> My personal preference is "Lazy Hash", but I feel it might be better >> to use the word "Reactive" instead of "Lazy". >> >> There was some previous discussion on the name in [2]. I suggested >> some other names in [3]. Andy voted for "Tuple Cache" in [4] >> >> Votes? Other suggestions? >> >> (I've included all the people who have shown some previous interest in >> naming this node.) >> >> David >> >> [1] >> https://www.postgresql.org/message-id/flat/CAApHDvrPcQyQdWERGYWx8J%2B2DLUNgXu%2BfOSbQ1UscxrunyXyrQ%40mail.gmail.com >> [2] >> https://www.postgresql.org/message-id/CA%2BTgmoZMxLeanqrS00_p3xNsU3g1v3EKjNZ4dM02ShRxxLiDBw%40mail.gmail.com >> [3] >> https://www.postgresql.org/message-id/CAApHDvoj_sH1H3JVXgHuwnxf1FQbjRVOqqgxzOgJX13NiA9-cg%40mail.gmail.com >> [4] >> https://www.postgresql.org/message-id/CAKU4AWoshM0JoymwBK6PKOFDMKg-OO6qtSVU_Piqb0dynxeL5w%40mail.gmail.com >> >> >> I want to share some feelings about other keywords. 
"Materialize" is used for the Materialize executor node, which writes data to disk when memory is not enough, and it is also used in "Materialized View", where all the data is stored on disk. This gives me the feeling that "Materialize" usually has something to do with disk, but our result cache node doesn't. Also, DBAs check plans more often than PostgreSQL developers do, so something like MRU might be too internal for them; developers who want that level of detail can just read the source code. When naming it, we may also want to think about non-native English speakers, for whom overly advanced words can be uncomfortable. Actually, when I read "Reactive", I had to google its meaning; I knew reactive programming, but I do not really understand "reactive hash". And compared with Hash Join, "Hash" may mislead people into thinking the result may be spilled to disk as well, so I prefer "Cache" over "Hash". In the end, I still want to vote for "Tuple(s) Cache", which sounds simple and sufficient. I wondered whether we need to put "Lazy" in the node name since we do build the cache lazily, but then I noticed we don't call "Materialize" "Lazy Materialize", so I think we can stay consistent. > I was hoping to commit the final patch pretty soon Very glad to see it, thanks for the great feature. -- Best Regards Andy Fan (https://www.aliyun.com/)
Re: UniqueKey on Partitioned table.
On Wed, Mar 31, 2021 at 9:12 PM Ashutosh Bapat wrote: > > b). How to present the element > > in UniqueKey. Prue EquivalenceClasses or Mix of Expr and > EquivalenceClass as > > we just talked about. > I think the reason we add ECs for sort expressions is to use > transitive relationship. The EC may start with a single member but > later in the planning that member might find partners which are all > equivalent. Result ordered by one is also ordered by the other. The > same logic applies to UniqueKey as well, isn't it. In a result if a > set of columns makes a row unique, the set of columns represented by > the other EC member should be unique. Though a key will start as a > singleton it might EC partners later and thus thus unique key will > transition to all the members. With that logic UniqueKey should use > just ECs instead of bare expressions. > TBH, I haven't thought about this too hard, but I think that by the time we build the UniqueKeys, all the ECs have already been built. So can you think of a case where we start with a single-member EC that gains more members later, for the UniqueKey cases? -- Best Regards Andy Fan (https://www.aliyun.com/)
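One illustrative case for the transitivity argument above (the schema is invented):

CREATE TABLE t (a int PRIMARY KEY, b int);
-- SELECT DISTINCT b FROM t WHERE a = b;
-- Here a and b end up in one EquivalenceClass, so a UniqueKey recorded as
-- that EC lets the planner see that b is unique and skip the DISTINCT;
-- a UniqueKey holding only the bare Expr "a" would not.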
Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)
> > However the thing becomes complex with the below 2 cases. > > 1. SELECT * FROM t INNER JOIN j on t.nullable = q.b; > We know t.b will be not null **finally**. But the current plan may > something > like this: >
>                 QUERY PLAN
> ------------------------------------------
>  Merge Join
>    Merge Cond: (t.nullable = j.something)
>    ->  Sort
>          Sort Key: t.nullable
>          ->  Seq Scan on t
>    ->  Sort
>          Sort Key: j.something
>          ->  Seq Scan on j
> (8 rows)
>
> which means the Path "Seq Scan on t" still contains some null values. At > least, > we should not assume t.nullable is "not nullable" the base relation stage. > > 2. SELECT t.a FROM j LEFT JOIN t ON t.b = t.a; > Even the t.a is not null by definition, but it may have null **finally** > due to > the outer join. >

The above 2 cases have been addressed by defining the notnullattrs on every RelOptInfo and maintaining them at every join. However, per an offline discussion with David, IIUC, there is one more case to think about. CREATE TABLE t (a INT, b INT); SELECT * FROM t WHERE a = 1 and b = 2; We know b is not null after the qual b = 2 has been evaluated, but it may still be nullable when only a = 1 has been evaluated. I prefer not to handle that case, and instead define the semantics of notnullattrs as being correct only after all the quals on its RelOptInfo have been evaluated.

> It would be good to agree on the correct representation for Vars that >> cannot produce NULLs so that we don't shut the door on classes of >> optimisation that require something other than what you need for your >> case. >> >> > Looks we have to maintain not null on the general RelOptInfo level rather > than Base > RelOptInfo. But I don't want to teach Var about the notnull so far. The > reasons are: 1). > We need to maintain the Planner version and Parser version due to the VIEW > case. > 2). We have to ignore the extra part for equal(Var, Var) . 3). Var is > usually shared among > different RelOptInfo. which means we have to maintain different copies for > this purpose IIUC. > > I assume we want to know if a Var is nullable with a function like. > is_var_notnullable(Var *var, Relids relids). If so, we can define the > data as below: > > struct RelOptInfo { > > Bitmapset** notnullattrs; > .. > }; > > After this we can implement the function as:

/*
 * is_var_notnullable
 *		Check whether the var is known not-null for a given RelOptInfo after
 *		all the quals on it have been evaluated.
 *
 * var is the var to check; relids identifies the RelOptInfo we check against.
 */
bool
is_var_notnullable(Var *var, Relids relids)
{
	RelOptInfo *rel = find_rel_by_relids(relids);

	return bms_is_member(var->varattno, rel->notnullattrs[var->varno]);
}

Do you think this is a reasonable solution?

> bool > is_var_notnullable(Var* var, Relids relids) > { > RelOptInfo *rel = find_rel_by_relids(reldis); > return bms_is_member(var->varattno, rel->notnullattrs[var->varno]); > } > > Probably we can make some hackers to reduce the notnullattrs's memory usage > overhead. > > -- Best Regards Andy Fan (https://www.aliyun.com/)
Cost model improvement for run-time partition prune
Currently the cost model ignores the initial partition prune and run time partition prune totally. This impacts includes: 1). The cost of Nest Loop path is highly overrated. 2). And the rows estimator can be very wrong as well some time. We can use the following cases to demonstrate. CREATE TABLE p (c_type INT, v INT) partition by list(c_type); SELECT 'create table p_'|| i || ' partition of p for values in ('|| i|| ');' from generate_series(1, 100) i; \gexec SELECT 'insert into p select ' || i ||', v from generate_series(1, 100) v;' from generate_series(1, 100) i; \gexec ANALYZE P; CREATE INDEX on p(v); Case 1: PREPARE s AS SELECT * FROM generate_series(1, 10) i JOIN p ON i = p.v; EXPLAIN execute s; postgres=# EXPLAIN execute s; QUERY PLAN - Nested Loop (cost=0.43..8457.60 rows=1029 width=12) -> Function Scan on generate_series i (cost=0.00..0.10 rows=10 width=4) -> Append (cost=0.42..844.75 rows=100 width=8) -> Index Scan using p_1_v_idx on p_1 (cost=0.42..8.44 rows=1 width=8) Index Cond: (v = i.i) -> Index Scan using p_2_v_idx on p_2 (cost=0.42..8.44 rows=1 width=8) Index Cond: (v = i.i) -> Index Scan using p_3_v_idx on p_3 (cost=0.42..8.44 rows=1 width=8) Index Cond: (v = i.i) ... We can see the cost/rows of Append Path is highly overrated. (the rows should be 1 rather than 100, cost should be 8.44 rather than 844). Case 2: PREPARE s2 AS SELECT * FROM p a JOIN p b ON a.v = b.v and a.c_type = $1 and a.v < 10; EXPLAIN execute s2(3); postgres=# EXPLAIN execute s2(3); QUERY PLAN - Gather (cost=1000.85..5329.91 rows=926 width=16) Workers Planned: 1 -> Nested Loop (cost=0.85..4237.31 rows=545 width=16) -> Parallel Index Scan using p_3_v_idx on p_3 a (cost=0.42..8.56 rows=5 width=8) Index Cond: (v < 10) Filter: (c_type = 3) -> Append (cost=0.42..844.75 rows=100 width=8) -> Index Scan using p_1_v_idx on p_1 b_1 (cost=0.42..8.44 rows=1 width=8) Index Cond: (v = a.v) -> Index Scan using p_2_v_idx on p_2 b_2 (cost=0.42..8.44 rows=1 width=8) Index Cond: (v = a.v) ... set plan_cache_mode = force_generic_plan; EXPLAIN ANALYZE execute s2(3); QUERY PLAN Gather (cost=1000.85..312162.57 rows=93085 width=16) Workers Planned: 2 -> Nested Loop (cost=0.85..301854.07 rows=38785 width=16) -> Parallel Append (cost=0.42..857.94 rows=400 width=8) Subplans Removed: 99 -> Parallel Index Scan using p_3_v_idx on p_3 a_1 (cost=0.42..8.56 rows=5 width=8) Index Cond: (v < 10) Filter: (c_type = $1) -> Append (cost=0.42..751.49 rows=100 width=8) ... We can see the rows for Gather node changed from 926 to 93085, while the actual rows is 900 rows. The reason for case 2 is because we adjust the rel->tuples for plan time partition prune, but we did nothing for initial partition prune. I would like to aim to fix both of the issues. The scope I want to cover at the first stage are 1. Only support limited operators like '=', 'in', 'partkey = $1 OR partkey = $2'; which means the operators like '>', '<', 'BETWEEN .. AND ' are not supported. 2. Only supporting all the partition keys are used in prune quals. for example, if we have partkey (p1, p2). but user just have p1 = $1 in the quals. 3. All the other cases should be supported. The reason I put some limits above is because 1). they are not common. 2). there are no way to guess the reasonable ratio. The design principle are: 1). Adjust the AppendPath's cost and rows for both initial partition prune and run time partition prune in cost_append. The ratio is just 1/nparts for all the supported case, even for partkey in ($1, $2, $3). 2). 
Adjust rel->tuples for the initial partition prune only. 3). Use the adjusted AppendPath's cost/rows for the sub-partitioned case, around accumulate_append_subpath. I have implemented this for one partitioning level and one partition key only at [1], and I have tested it on my real use case; the algorithm appears to work well. I am planning to implement the full version soon. Any suggestions on the design/scope? [1] https://www.postgresql.org/message-id/CAKU4AWpO4KegS6tw8UUnWA4GWr-Di%3DWBmuQnnyjxFGA0MhEHyA%40mail.gmail.com -- Best Regards Andy Fan (https://www.aliyun.com/)
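For illustration, applying the proposed 1/nparts adjustment to case 1 above: with 100 partitions the Append estimate would become rows = 100 / 100 = 1 and cost ≈ 844.75 / 100 ≈ 8.45, which matches the single surviving index scan (rows=1, cost=8.44). Scaling rel->tuples the same way for the initial prune in case 2 would bring the Gather row estimate back near the actual 900 rows instead of 93085.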
Re: Wired if-statement in gen_partprune_steps_internal
On Thu, Apr 8, 2021 at 7:59 PM Amit Langote wrote: > On Thu, Apr 8, 2021 at 7:41 PM David Rowley wrote: > > On Thu, 8 Apr 2021 at 21:04, Amit Langote > wrote: > > > Maybe, we should also updated the description of node struct as > > > follows to consider that last point: > >> > > > * PartitionPruneStepOp - Information to prune using a set of mutually > ANDed > > > * OpExpr and any IS [ NOT ] NULL clauses > > > > I didn't add that. I wasn't really sure if I understood why we'd talk > > about PartitionPruneStepCombine in the PartitionPruneStepOp. I thought > > the overview in gen_partprune_steps_internal was ok to link the two > > together and explain why they're both needed. > > Sorry, maybe the way I wrote it was a bit confusing, but I meant to > suggest that we do what I have quoted above from my last email. That > is, we should clarify in the description of PartitionPruneStepOp that > it contains information derived from OpExprs and in some cases also IS > [ NOT ] NULL clauses. > > Thanks for the commit. > > -- > Amit Langote > EDB: http://www.enterprisedb.com > Thanks for the patch. Recently I am reading the partition prune code again, and want to propose some tiny changes. That is helpful for me and hope it is helpful for others as well, especially for the people who are not familiar with these codes. -- v1-0001-Document-enhancement-for-RelOptInfo.partexprs-nul.patch Just add comments for RelOptInfo.partexprs & nullable_partexprs to remind the reader nullable_partexprs is just for partition wise join. and use bms_add_member(relinfo->all_partrels, childRTindex); instead of bms_add_members(relinfo->all_partrels, childrelinfo->relids); which would be more explicit to say add the child rt index to all_partrels. -- v1-0002-Split-gen_prune_steps_from_exprs-into-some-smalle.patch Just split the gen_prune_steps_from_opexps into some smaller chunks. The benefits are the same as smaller functions. -- Best Regards Andy Fan (https://www.aliyun.com/) v1-0001-Document-enhancement-for-RelOptInfo.partexprs-nul.patch Description: Binary data v1-0002-Split-gen_prune_steps_from_exprs-into-some-smalle.patch Description: Binary data
Re: Difference for Binary format vs Text format for client-server communication
On Sun, Jul 26, 2020 at 1:49 AM Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > On 2020-07-16 18:52, Andy Fan wrote: > > The reason I ask this is because I have a task to make numeric output > > similar to oracle. > > > > Oracle: > > > > SQL> select 2 / 1.0 from dual; > > > > 2/1.0 > > -- > > 2 > > > > PG: > > > > postgres=# select 2 / 1.0; > >?column? > > > > 2. > > (1 row) > > > > If the user uses text format, I can just hack some numeric_out function, > > but if they > > use binary format, looks I have to change the driver they used for it. > > Am I > > understand it correctly? > > I think what you should be looking at is why the numeric division > function produces that scale and possibly make changes there. Thanks. I think you are talking about the select_div_scale function, which is called before the real division work in div_var, so it will be hard to hack at that point. Besides that, Oracle returns the zero-trimmed version whether or not a division is involved (I forgot to mention that at first). In the end, I just hacked the numeric_out function, and now it works like Oracle. However, that only works for the text format. I tried JDBC, and it uses the text format by default. The solution is not good enough, but it is OK for my purposes for now. IIUC, if a driver uses the text protocol for a data type, it works like this: 1). the server gets a value in binary format; 2). the server converts it to a string and sends it over the network; 3). the client receives the string; 4). the client converts the string to the given data type. That looks much more complex than the binary protocol, so why is the text protocol chosen by default? > By the > time the type's output or send function is invoked, that's already > decided. The output/send functions are not the place to make scale or > other semantic adjustments. > > -- > Peter Eisentraut http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services > -- Best Regards Andy Fan
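A small illustration of the point that the scale is fixed by the division itself, not by the output function, plus one SQL-level option (trim_scale() is available in PostgreSQL 13 and later):

SELECT 2 / 1.0;              -- 2.0000000000000000, scale chosen by select_div_scale
SELECT trim_scale(2 / 1.0);  -- 2, trailing zeros removed before output, so the
                             -- result looks the same under text and binary protocol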
Allows Extend Protocol support CURSOR_OPT_HOLD with prepared stmt.
I have a use case like this:

rs = prepared_stmt.execute(1);
while (rs.next())
{
    // do something with the result and commit the transaction.
    conn.commit();
}

The driver uses the extended protocol in this case. It works like this: 1). Parse -> PreparedStmt. 2). Bind -> bind the prepared stmt to a Portal; there is no chance to set the CURSOR_OPT_HOLD option. 3). Execute. 4). Commit - the portal is dropped at this stage. 5). When fetching the next batch of results, we get the error "Portal doesn't exist". There are several ways to work around this, but none of them is perfect. 1. Run the prepared stmt in a dedicated connection (the number of connections is doubled). 2. Use a WITH HOLD cursor. It doesn't support any bind parameters, so we would have to create a cursor for each dedicated id. 3. Don't commit the transaction -- a long transaction with many rows locked. I have several questions about this case: 1. How about carrying cursorOptions information in the Bind protocol message, so that we can set portal->cursorOptions accordingly? If so, how do we stay compatible with existing drivers? 2. Currently I want to add a new GUC parameter: if it is set to true, the server creates a holdable portal; otherwise nothing changes. Then the user can set it to true in the above case and reset it to false afterwards. Is there any issue with this method? -- Best Regards Andy Fan
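For comparison, a rough sketch of workaround 2 above (identifiers are made up), showing both why a WITH HOLD cursor survives the commit and why it cannot replace the prepared statement:

BEGIN;
DECLARE c CURSOR WITH HOLD FOR SELECT * FROM big_table WHERE id > 10;
COMMIT;            -- the WITH HOLD cursor survives the commit
FETCH 100 FROM c;  -- ...but the query text is fixed; there is no $1 to bind per execution
CLOSE c;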
Re: Allows Extend Protocol support CURSOR_OPT_HOLD with prepared stmt.
> > > 2. Currently I want to add a new GUC parameter, if set it to true, server > will > create a holdable portal, or else nothing changed. Then let the user set > it to true in the above case and reset it to false afterward. Is there > any issue > with this method? > > I forgot to say that in this case, the user has to drop the holdable portal explicitly. -- Best Regards Andy Fan
Re: Index Skip Scan (new UniqueKeys)
Hi David: Thanks for looking into this. On Fri, Jul 31, 2020 at 11:07 AM David Rowley wrote: > On Mon, 13 Jul 2020 at 10:18, Floris Van Nee > wrote: > > One question about the unique keys - probably for Andy or David: I've > looked in the archives to find arguments for/against using Expr nodes or > EquivalenceClasses in the Unique Keys patch. However, I couldn't really > find a clear answer about why the current patch uses Expr rather than > EquivalenceClasses. At some point David mentioned "that probably Expr nodes > were needed rather than EquivalenceClasses", but it's not really clear to > me why. What were the thoughts behind this? > > I'm still not quite sure on this either way. I did think > EquivalenceClasses were more suitable before I wrote the POC patch for > unique keys. But after that, I had in mind that Exprs might be > better. The reason I thought this was due to the fact that the > DISTINCT clause list is a bunch of Exprs and if the UniqueKeys were > EquivalenceClasses then checking to see if the DISTINCT can be skipped > turned into something more complex that required looking through lists > of ec_members rather than just checking if the uniquekey exprs were a > subset of the DISTINCT clause. > > Thinking about it a bit harder, if we did use Exprs then it would mean > it a case like the following wouldn't work for Andy's DISTINCT no-op > stuff. > > CREATE TABLE xy (x int primary key, y int not null); > > SELECT DISTINCT y FROM xy WHERE x=y; > > whereas if we use EquivalenceClasses then we'll find that we have an > EC with x,y in it and can skip the DISTINCT since we have a UniqueKey > containing that EquivalenceClass. > Also, looking at what Andy wrote to make a case like the following > work in his populate_baserel_uniquekeys() function in the 0002 patch: > > CREATE TABLE ab (a int, b int, primary key(a,b)); > SELECT DISTINCT a FROM ab WHERE b = 1; > > it's a bit uninspiring. Really what we want here when checking if we > can skip doing the DISTINCT is a UniqueKey set using > EquivalenceClasses as we can just insist that any unmatched UniqueKey > items have an ec_is_const == true. However, that means we have to loop > through the ec_members of the EquivalenceClasses in the uniquekeys > during the DISTINCT check. That's particularly bad when you consider > that in a partitioned table case there might be an ec_member for each > child partition and there could be 1000s of child partitions and > following those ec_members chains is going to be too slow. > > My current thoughts are that we should be using EquivalenceClasses but > we should first add some infrastructure to make them perform better. > My current thoughts are that we do something like what I mentioned in > [1] or something more like what Andres mentions in [2]. After that, > we could either make EquivalenceClass.ec_members a hash table or > binary search tree. Or even perhaps just have a single hash table/BST > for all EquivalenceClasses that allows very fast lookups from {Expr} > -> {EquivalenceClass}. I think an Expr can only belong in a single > non-merged EquivalenceClass. So when we do merging of > EquivalenceClasses we could just repoint that data structure to point > to the new EquivalenceClass. We'd never point to ones that have > ec_merged != NULL. This would also allow us to fix the poor > performance in regards to get_eclass_for_sort_expr() for partitioned > tables. 
> > So, it seems the patch dependency chain for skip scans just got a bit > longer :-( > > I admit that EquivalenceClasses have better expressive power. There are 2 more cases we could handle better with EquivalenceClasses. SELECT DISTINCT a, b, c FROM t WHERE a = b; Currently the UniqueKey is (a, b, c), but it would better be (a, c) and (b, c). A similar case happens for GROUP BY. Even after realizing this, I am still hesitant to do it, due to the complexity. If we do it, we may have to maintain EquivalenceClasses in one more place, or make the existing EquivalenceClasses list longer; for example: SELECT pk FROM t; The current infrastructure doesn't create any EquivalenceClass for pk, so we would have to create a new one in this case and reuse existing ones in other cases. Finally, since EquivalenceClasses are not very transparent to upper-level users, we would depend on the infrastructure change for looking up an EquivalenceClass quickly from an Expr. I thought again about the case you provided above; IIUC, there is such an issue for a joinrel, and then we can just add an EC check for populate_baserel_uniquekeys. As for the DISTINCT/GROUP BY case, we should build the UniqueKeys from root->distinct_pathkeys and root->group_pathkeys, where the EquivalenceClasses are already there. I am still not insisting on either Expr or EquivalenceClasses right now; if we need to change it to EquivalenceClasses, I'd first see how many more places we need to take care of before doing that. -- Best Regards Andy Fan
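A hedged sketch of the first case mentioned above (the table definition is invented):

CREATE TABLE t (a int, b int, c int, UNIQUE (a, b, c));
-- SELECT DISTINCT a, b, c FROM t WHERE a = b;
-- Among the rows that pass a = b, (a, c) already identifies a row, so an
-- EquivalenceClass-based UniqueKey could record (a, c) / (b, c) instead of
-- the wider (a, b, c) that bare Exprs give.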
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
Hi Floris: On Thu, Jul 23, 2020 at 3:22 AM Floris Van Nee wrote: > Hi Andy, > > > > A small thing I found: > > > > +static List * > > +get_exprs_from_uniqueindex(IndexOptInfo *unique_index, > > + >List *const_exprs, > > + >List *const_expr_opfamilies, > > + >Bitmapset *used_varattrs, > > + >bool *useful, > > + >bool *multi_nullvals) > > … > > + indexpr_item = list_head(unique_index->indexprs); > > + for(c = 0; c < unique_index->ncolumns; c++) > > + { > > > > I believe the for loop must be over unique_index->nkeycolumns, rather than > columns. It shouldn’t include the extra non-key columns. This can currently > lead to invalid memory accesses as well a few lines later when it does an > array access of unique_index->opfamily[c] – this array only has nkeycolumns > entries. > You are correct, I would include this in the next version patch, Thank you for this checking! -- Andy Fan Best Regards > > > > > *From:* Andy Fan > *Sent:* Sunday 19 July 2020 5:03 AM > *To:* Dmitry Dolgov <9erthali...@gmail.com> > *Cc:* David Rowley ; PostgreSQL Hackers < > pgsql-hackers@lists.postgresql.org>; Tom Lane ; > Ashutosh Bapat ; rushabh.lat...@gmail.com > *Subject:* Re: [PATCH] Keeps tracking the uniqueness with UniqueKey > [External] > > > > Fixed a test case in v10. > > > > -- > > Best Regards > > Andy Fan > -- Best Regards Andy Fan
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Thu, Aug 6, 2020 at 2:22 AM Tom Lane wrote: > Robert Haas writes: > > On Wed, Aug 5, 2020 at 1:30 PM Tom Lane wrote: > >> I'm strongly tempted to convert the trailing Assert to an actual > >> test-and-elog, too, but didn't do so here. > > > I was thinking about that, too. +1 for taking that step. > > Will do. > > In the longer term, it's annoying that we have no test methodology > for this other than "manually set a breakpoint here". One of the methods I see is we can just add some GUC variable for some action injection. basically it adds some code based on the GUC like this; if (shall_delay_planning) { sleep(10) }; AFAIK, MongoDB uses much such technology in their test framework. First it defines the fail point [1], and then does code injection if the fail point is set [2]. At last, during the test it can set a fail point like a GUC, but with more attributes [3]. If that is useful in PG as well and it is not an urgent task, I would like to help in this direction. [1] https://github.com/mongodb/mongo/search?q=MONGO_FAIL_POINT_DEFINE&unscoped_q=MONGO_FAIL_POINT_DEFINE [2] https://github.com/mongodb/mongo/blob/d4e7ea57599b44353b5393afedee8ae5670837b3/src/mongo/db/repl/repl_set_config.cpp#L475 [3] https://github.com/mongodb/mongo/blob/e07c2d29aded5a30ff08b5ce6a436b6ef6f44014/src/mongo/shell/replsettest.js#L1427 If we're going > to allow plan-relevant DDL changes to happen with less than full table > lock, I think we need to improve that. I spent a little bit of time > just now trying to build an isolationtester case for this, and failed > completely. So I wonder if we can create some sort of test module that > allows capture of a plan tree and then execution of that plan tree later > (even after relcache inval would normally have forced replanning). > Obviously that could not be a normal SQL-accessible feature, because > some types of invals would make the plan completely wrong, but for > testing purposes it'd be mighty helpful to check that a stale plan > still works. > > regards, tom lane > > > -- Best Regards Andy Fan
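Purely as a sketch of how such a GUC-driven injection point might be exercised (the GUC and table names are made up):

-- session 1: arrange for the planner to pause at the injected point
SET delay_planning_seconds = 10;   -- hypothetical GUC read by the injected code
EXPLAIN SELECT * FROM some_partitioned_table;
-- session 2, while session 1 is paused: run the concurrent DDL under test
ALTER TABLE some_partitioned_table ATTACH PARTITION some_part FOR VALUES IN (42);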
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Thu, Aug 6, 2020 at 12:02 PM Tom Lane wrote: > Andy Fan writes: > > On Thu, Aug 6, 2020 at 2:22 AM Tom Lane wrote: > >> In the longer term, it's annoying that we have no test methodology > >> for this other than "manually set a breakpoint here". > > > One of the methods I see is we can just add some GUC variable for some > > action injection. basically it adds some code based on the GUC like > this; > > See my straw-man proposal downthread. I'm not very excited about putting > things like this into the standard build, because it's really hard to be > sure that there are no security-hazard-ish downsides of putting in ways to > get at testing behaviors from standard SQL. And then there's the question > of whether you're adding noticeable overhead to production builds. So a > loadable module that can use some existing hook to provide the needed > behavior seems like a better plan to me, whenever we can do it that way. > > In general, though, it seems like we've seldom regretted investments in > test tooling. > > regards, tom lane > Thanks for your explanation, I checked it again and it looks a very clean method. The attached is a draft patch based on my understanding. Hope I didn't misunderstand you.. -- Best Regards Andy Fan v1-0001-test_module-used-for-concurrency-case-simulation.patch Description: Binary data
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Thu, Aug 6, 2020 at 10:42 PM Tom Lane wrote: > Andy Fan writes: > > On Thu, Aug 6, 2020 at 12:02 PM Tom Lane wrote: > >> See my straw-man proposal downthread. > > > Thanks for your explanation, I checked it again and it looks a very clean > > method. The attached is a draft patch based on my understanding. Hope > > I didn't misunderstand you.. > > Ah, I was going to play arond with that today, but you beat me to it ;-) > > Very glad to be helpful. > A few thoughts after a quick look at the patch: > > * I had envisioned that there's a custom GUC controlling the lock ID > used; this would allow blocking different sessions at different points, > if we ever need that. Also, I'd make the GUC start out as zero which > means "do nothing", so that merely loading the module has no immediate > effect. > > I forgot to say I didn't get the point of the guc variable in the last thread, now I think it is a smart idea, so added it. In this way, one session can only be blocked at one place, it may not be an issue in practise. * Don't really see the point of the before-planning lock. > > yes.. it was removed now. * Rather than exposing internal declarations from lockfuncs.c, you > could just write calls to pg_advisory_lock_int8 etc. using > DirectFunctionCall1. > > Thanks for sharing it, this method looks pretty good. > * We need some better name than "test_module". I had vaguely thought > about "delay_execution", but am surely open to better ideas. > > Both names look good to me, delay_execution looks better, it is used in v2. Attached is the v2 patch. -- Best Regards Andy Fan v2-0001-delay_execution-used-for-concurrency-case-simulat.patch Description: Binary data
Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)
On Fri, Aug 7, 2020 at 8:32 AM Tom Lane wrote: > Andy Fan writes: > > Attached is the v2 patch. > > Forgot to mention that I'd envisioned adding this as a src/test/modules/ > module; contrib/ is for things that we intend to expose to users, which > I think this isn't. > > I played around with this and got the isolation test I'd experimented > with yesterday to work with it. If you revert 7a980dfc6 then the > attached patch will expose that bug. Interestingly, I had to add an > explicit AcceptInvalidationMessages() call to reproduce the bug; so > apparently we do none of those between planning and execution in the > ordinary query code path. Arguably, that means we're testing a scenario > somewhat different from what can happen in live databases, but I think > it's OK. Amit's recipe for reproducing the bug works because there are > other relation lock acquisitions (and hence AcceptInvalidationMessages > calls) later in planning than where he asked us to wait. So this > effectively tests a scenario where a very late A.I.M. call within the > planner detects an inval event for some already-planned relation, and > that seems like a valid-enough scenario. > > Anyway, attached find a reviewed version of your patch plus a test > scenario contributed by me (I was too lazy to split it into two > patches). Barring objections, I'll push this tomorrow or so. > > (BTW, I checked and found that this test does *not* expose the problems > with Amit's original patch. Not sure if it's worth trying to finagle > it so it does.) > > regards, tom lane > > + * delay_execution.c + * Test module to allow delay between parsing and execution of a query. I am not sure if we need to limit the scope to "between parsing and execution", IMO, it can be used at any place where we have a hook so that delay_execution can inject the lock_unlock logic with a predefined lock id. Probably the test writer only wants one place blocked, then delay_execution.xxx_lock_id can be set so that only the given lock ID is considered. Just my 0.01 cents. -- Best Regards Andy Fan
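Roughly how a module like this ends up being driven from an isolation test, sketched with an arbitrary lock id (the GUC name follows the patch discussed here, so treat it as indicative):

-- blocking session
SELECT pg_advisory_lock(12345);
-- test session
LOAD 'delay_execution';
SET delay_execution.post_planning_lock_id = 12345;
SELECT * FROM parted;   -- plans, then blocks on the advisory lock before executing
-- blocking session: perform the concurrent partition addition, then release
ALTER TABLE parted ATTACH PARTITION parted_new FOR VALUES IN (42);
SELECT pg_advisory_unlock(12345);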
Can I test Extended Query in core test framework
I want to write some test cases using the extended query protocol in the core test system. Basically it looks like:

PreparedStatement preparedStatement = conn.prepareStatement("select * from bigtable");
preparedStatement.setFetchSize(4);
ResultSet rs = preparedStatement.executeQuery();
while (rs.next())
{
    System.out.println(rs.getInt(1));
    // conn.commit();
    conn.rollback();
}

However, I can't find a way to do that after checking the examples in src/test/xxx/t/xxx.pl, where the most commonly used object is PostgresNode, which doesn't have such abilities. Can I do this in the core test system? I tried grepping for '\->prepare' and '\->execute' and got nothing. Am I missing something? -- Best Regards Andy Fan
Re: Can I test Extended Query in core test framework
Thank you Ashutosh for your reply. On Tue, Aug 11, 2020 at 9:06 PM Ashutosh Bapat wrote: > You could run PREPARE and EXECUTE as SQL commands from psql. Please > take a look at the documentation of those two commands. I haven't > looked at TAP infrastructure, but you could open a psql session to a > running server and send an arbitrary number of SQL queries through it. > > PREPARE & EXECUTE do not go through the extended query protocol; they still go through exec_simple_query. What I did was hack some of the exec_bind_message [1] logic, which is why I want to test extended queries. [1] https://www.postgresql.org/message-id/CAKU4AWqvwmo=nlpga_ohxb4f+u4ts1_3yry9m6xtjlt9dkh...@mail.gmail.com -- Best Regards Andy Fan
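For what it's worth, one way to drive the extended protocol without a JDBC driver is pgbench: its -M extended and -M prepared modes send Parse/Bind/Execute messages instead of simple queries (e.g. pgbench -M prepared -n -f test.sql -t 1). However, pgbench always runs each portal to completion, so the partial-fetch-then-commit pattern above still needs a driver, or a test that speaks the protocol directly, to reproduce.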