Re: NOT IN subquery optimization

2020-04-08 Thread David Steele
On 3/26/20 4:58 PM, Li, Zheng wrote: >BTW, so far as I can see, the only reason you're bothering with the whole thing is to compare the size of the subquery output with work_mem, because that's all that subplan_is_hashable does. I wonder whether that consideration is even sti

Re: NOT IN subquery optimization

2020-04-01 Thread Andrey Lepikhov
You should do small rebase (conflict with 911e7020770) and pgindent of the patch to repair problems with long lines and backspaces. I am reviewing your patch in small steps. Questions: 1. In the find_innerjoined_rels() routine you stop descending on JOIN_FULL node type. I think it is wrong beca

Re: NOT IN subquery optimization

2020-03-26 Thread Li, Zheng
>BTW, so far as I can see, the only reason you're bothering with the whole thing is to compare the size of the subquery output with work_mem, because that's all that subplan_is_hashable does. I wonder whether that consideration is even still necessary in the wake of 1f39bce02. If

Re: NOT IN subquery optimization

2020-03-24 Thread Tom Lane
"Li, Zheng" writes: > * I find it entirely unacceptable to stick some planner temporary > fields into struct SubLink. If you need that storage you'll have > to find some other place to put it. But in point of fact I don't > think you need it; it doesn't look to me to be critical

Re: NOT IN subquery optimization

2020-03-24 Thread Li, Zheng
Hi Tom, Thanks for the feedback. * I find it entirely unacceptable to stick some planner temporary fields into struct SubLink. If you need that storage you'll have to find some other place to put it. But in point of fact I don't think you need it; it doesn't look to me to be

Re: NOT IN subquery optimization

2020-01-08 Thread Andrey Lepikhov
On 1/7/20 12:34 AM, Li, Zheng wrote: Hi Andrey, Thanks for the comment! The unimproved cases you mentioned all fall into the category “correlated subquery”. This category is explicitly disallowed by existing code to convert to join in convert_ANY_sublink_to_join: /* * The sub-se

Re: NOT IN subquery optimization

2020-01-06 Thread Li, Zheng
Hi Andrey, Thanks for the comment! The unimproved cases you mentioned all fall into the category “correlated subquery”. This category is explicitly disallowed by existing code to convert to join in convert_ANY_sublink_to_join: /* * The sub-select must not refer to any Vars of the paren

Re: NOT IN subquery optimization

2020-01-04 Thread Andrey Lepikhov
At the top of the thread your co-author argued the beginning of this work with "findings about the performance of PostgreSQL, MySQL, and Oracle on various subqueries:" https://antognini.ch/2017/12/how-well-a-query-optimizer-handles-subqueries/ I launched this test set with your "not_in ..." pa

Re: NOT IN subquery optimization

2019-11-30 Thread Michael Paquier
On Wed, Jun 26, 2019 at 09:26:16PM +, Li, Zheng wrote: > Let me know if you have any comments. I have one: the latest patch visibly applies, but fails to build because of the recent API changes around lists in the backend code. So a rebase is in order. The discussion has not moved a iota in t

Re: NOT IN subquery optimization

2019-09-25 Thread Alvaro Herrera
On 2019-Aug-02, Thomas Munro wrote: > Hi Zheng, Jim, > > With my Commitfest doozer hat on, I have moved this entry to the > September 'fest. I noticed in passing that it needs to be adjusted > for the new pg_list.h API. It'd be good to get some feedback from > reviewers on these two competing p

Re: NOT IN subquery optimization

2019-08-01 Thread Thomas Munro
On Sat, Jun 29, 2019 at 4:19 AM Li, Zheng wrote:> > Resending patch v2.2, looks like the previous submission did not get attached > to the original thread. > > This version fixed an issue that involves CTE. Because we call > subquery_planner before deciding whether to proceed with the transforma

Re: NOT IN subquery optimization

2019-03-05 Thread David Rowley
On Wed, 6 Mar 2019 at 12:25, David Rowley wrote: > That sounds fine. I'll take mine elsewhere since I didn't start this thread. Moved to https://www.postgresql.org/message-id/CAKJS1f82pqjqe3WT9_xREmXyG20aOkHc-XqkKZG_yMA7JVJ3Tw%40mail.gmail.com -- David Rowley http://www.2ndQ

Re: NOT IN subquery optimization

2019-03-05 Thread David Rowley
On Wed, 6 Mar 2019 at 03:37, Tom Lane wrote: > > David Steele writes: > > I'm not sure if I have an issue with competing patches on the same > > thread. I've seen that before and it can lead to a good outcome. It > > case, as you say, also lead to confusion. > > It's a bit of a shame that the c

Re: NOT IN subquery optimization

2019-03-05 Thread Tom Lane
David Steele writes: > I'm not sure if I have an issue with competing patches on the same > thread. I've seen that before and it can lead to a good outcome. It > case, as you say, also lead to confusion. It's a bit of a shame that the cfbot will only be testing one of them at a time if we lea

Re: NOT IN subquery optimization

2019-03-05 Thread David Rowley
On Sun, 3 Mar 2019 at 01:34, Jim Finnerty wrote: > I agree with Tom's comment above - when the cost of the NOT IN is dominated > by the cost of the outer scan (i.e. when the cardinality of the outer > relation is large relative to the cardinality of the subquery), and if the > inner cardinality is

Re: NOT IN subquery optimization

2019-03-05 Thread David Steele
On 3/5/19 10:53 AM, David Rowley wrote: On Tue, 5 Mar 2019 at 21:21, David Steele wrote: On 2/27/19 2:26 AM, David Rowley wrote: FWIW, I did add this to the March CF, but I set the target version to 13. I wasn't considering this for PG12. I see Zheng was, but I agree with you on PG13 being t

Re: Re: NOT IN subquery optimization

2019-03-05 Thread David Rowley
On Tue, 5 Mar 2019 at 21:21, David Steele wrote: > > On 2/27/19 2:26 AM, David Rowley wrote: > > FWIW, I did add this to the March CF, but I set the target version to > > 13. I wasn't considering this for PG12. I see Zheng was, but I agree > > with you on PG13 being the target for this. > > Looks

Re: Re: NOT IN subquery optimization

2019-03-05 Thread David Steele
On 2/27/19 2:26 AM, David Rowley wrote: On Wed, 27 Feb 2019 at 13:13, Tom Lane wrote: "Li, Zheng" writes: However, given that the March CommitFest is imminent and the runtime smarts patent concerns David had pointed out (which I was not aware of before), we would not move that direction at

Re: NOT IN subquery optimization

2019-03-03 Thread David Rowley
On Mon, 4 Mar 2019 at 11:06, Tom Lane wrote: > > David Rowley writes: > > On Mon, 4 Mar 2019 at 04:42, Tom Lane wrote: > >> You absolutely will get errors during btree insertions and searches > >> if a datatype's btree comparison functions ever return NULL (for > >> non-NULL inputs). > > > I und

Re: NOT IN subquery optimization

2019-03-03 Thread Tom Lane
David Rowley writes: > On Mon, 4 Mar 2019 at 04:42, Tom Lane wrote: >> You absolutely will get errors during btree insertions and searches >> if a datatype's btree comparison functions ever return NULL (for >> non-NULL inputs). > I understand this is the case if an index happens to be used, but

Re: NOT IN subquery optimization

2019-03-03 Thread David Rowley
On Mon, 4 Mar 2019 at 04:42, Tom Lane wrote: > > David Rowley writes: > > On Sun, 3 Mar 2019 at 17:11, Tom Lane wrote: > >> (At the code level, this is implicit in the fact that the comparison > >> function will be called via FunctionCall2Coll or a sibling, and those > >> all throw an error if t

Re: NOT IN subquery optimization

2019-03-03 Thread Tom Lane
David Rowley writes: > On Sun, 3 Mar 2019 at 17:11, Tom Lane wrote: >> (At the code level, this is implicit in the fact that the comparison >> function will be called via FunctionCall2Coll or a sibling, and those >> all throw an error if the called function returns NULL.) > Ah okay. I can get it

Re: NOT IN subquery optimization

2019-03-03 Thread David Rowley
On Sun, 3 Mar 2019 at 17:11, Tom Lane wrote: > (At the code level, this is implicit in the fact that the comparison > function will be called via FunctionCall2Coll or a sibling, and those > all throw an error if the called function returns NULL.) > > Now, it doesn't say in so many words that the c

Re: NOT IN subquery optimization

2019-03-02 Thread Tom Lane
David Rowley writes: > On Sun, 3 Mar 2019 at 05:25, Tom Lane wrote: >> My initial thought about plugging that admittedly-academic point is >> to insist that the join operator be both strict and a member of a >> btree opclass (hash might be OK too; less sure about other index types). > Why strict

Re: NOT IN subquery optimization

2019-03-02 Thread David Rowley
On Sun, 3 Mar 2019 at 05:25, Tom Lane wrote: > Hmm ... thinking about the strictness angle some more: what we really > need to optimize NOT IN, IIUC, is an assumption that the join operator > will never return NULL. While not having NULL inputs is certainly a > *necessary* condition for that (ass

Re: NOT IN subquery optimization

2019-03-02 Thread Tom Lane
David Rowley writes: > On Sat, 2 Mar 2019 at 13:45, Tom Lane wrote: >> Yeah --- that has a nontrivial risk of making things significantly worse, >> which makes it a hard sell. I think the most reasonable bet here is >> simply to not perform the transformation if we can't prove the inner side >>

Re: NOT IN subquery optimization

2019-03-02 Thread David Rowley
On Sat, 2 Mar 2019 at 13:45, Tom Lane wrote: > > David Rowley writes: > > I think you're fighting a losing battle here with adding OR quals to > > the join condition. > > Yeah --- that has a nontrivial risk of making things significantly worse, > which makes it a hard sell. I think the most reas

Re: NOT IN subquery optimization

2019-03-01 Thread Tom Lane
David Rowley writes: > I think you're fighting a losing battle here with adding OR quals to > the join condition. Yeah --- that has a nontrivial risk of making things significantly worse, which makes it a hard sell. I think the most reasonable bet here is simply to not perform the transformation

Re: NOT IN subquery optimization

2019-03-01 Thread David Rowley
On Sat, 2 Mar 2019 at 12:39, Li, Zheng wrote: > However, if s.a is nullable, we would do this transformation: > select count(*) from big b where not exists(select 1 from small s > where s.a = b.a or s.a is null); I understand you're keen to make this work, but you're assuming again that f

Re: NOT IN subquery optimization

2019-03-01 Thread Li, Zheng
The current transformation would not add "or s.a is NULL" in the example provided since it is non-nullable. You will be comparing these two cases in terms of the transformation: select count(*) from big b where not exists(select 1 from small s where s.a = b.a); Time: 51.416 ms select count(*) f

Re: NOT IN subquery optimization

2019-03-01 Thread David Rowley
On Sat, 2 Mar 2019 at 12:13, Tom Lane wrote: > > "Li, Zheng" writes: > > Although adding "or var is NULL" to the anti join condition forces the > > planner to choose nested loop anti join, it is always faster compared to > > the original plan. > > TBH, I am *really* skeptical of sweeping claims

Re: NOT IN subquery optimization

2019-03-01 Thread Tom Lane
"Li, Zheng" writes: > Although adding "or var is NULL" to the anti join condition forces the > planner to choose nested loop anti join, it is always faster compared to the > original plan. TBH, I am *really* skeptical of sweeping claims like that. The existing code will typically produce a has

Re: NOT IN subquery optimization

2019-03-01 Thread Li, Zheng
Thanks all for the feedbacks! I'm working on a refined patch. Although adding "or var is NULL" to the anti join condition forces the planner to choose nested loop anti join, it is always faster compared to the original plan. In order to enable the transformation from NOT IN to anti join when the

Re: NOT IN subquery optimization

2019-03-01 Thread Tom Lane
David Rowley writes: > On Sat, 2 Mar 2019 at 05:44, Tom Lane wrote: >> I'm not sure if the second one is actually a semantics bug or just a >> misoptimization? But yeah, +1 for putting in some simple tests for >> corner cases right now. Anyone want to propose a specific patch? > The second is

Re: NOT IN subquery optimization

2019-03-01 Thread David Rowley
On Sat, 2 Mar 2019 at 05:44, Tom Lane wrote: > > Andres Freund writes: > > I've not checked, but could we please make sure these cases are covered > > in the regression tests today with a single liner? > > I'm not sure if the second one is actually a semantics bug or just a > misoptimization? Bu

Re: NOT IN subquery optimization

2019-03-01 Thread Tom Lane
Andres Freund writes: > On March 1, 2019 4:53:03 AM PST, David Rowley > wrote: >> On Fri, 1 Mar 2019 at 15:27, Richard Guo wrote: >>> 1. The patch would give wrong results when the inner side is empty. >>> 2. Because of the new added predicate 'OR (var is NULL)', we cannot >>> use hash join or

Re: NOT IN subquery optimization

2019-03-01 Thread Andres Freund
Hi, On March 1, 2019 4:53:03 AM PST, David Rowley wrote: >On Fri, 1 Mar 2019 at 15:27, Richard Guo wrote: >> I have reviewed your patch. Good job except two issues I can find: >> >> 1. The patch would give wrong results when the inner side is empty. >In this >> case, the whole data from outer s

Re: NOT IN subquery optimization

2019-03-01 Thread David Rowley
On Fri, 1 Mar 2019 at 15:27, Richard Guo wrote: > I have reviewed your patch. Good job except two issues I can find: > > 1. The patch would give wrong results when the inner side is empty. In this > case, the whole data from outer side should be in the outputs. But with the > patch, we will lose t

Re: NOT IN subquery optimization

2019-02-28 Thread Richard Guo
On Tue, Feb 26, 2019 at 6:51 AM Li, Zheng wrote: > Resend the patch with a whitespace removed so that "git apply patch" works > directly. > > Hi Zheng, I have reviewed your patch. Good job except two issues I can find: 1. The patch would give wrong results when the inner side is empty. In this

Re: NOT IN subquery optimization

2019-02-26 Thread Richard Guo
On Wed, Feb 27, 2019 at 4:52 AM David Rowley wrote: > On Wed, 27 Feb 2019 at 03:07, Jim Finnerty wrote: > > If you're proposing to do that for this thread then I can take my > planner only patch somewhere else. I only posted my patch as I pretty > much already had what I thought you were origin

Re: NOT IN subquery optimization

2019-02-26 Thread David Rowley
On Wed, 27 Feb 2019 at 13:41, Li, Zheng wrote: > We still check for inner side's nullability, when it is nullable we > append a "var is NULL" to the anti join condition. So every outer > tuple is going to evaluate to true on the join condition when there > is indeed a null entry in the inner. Tha

Re: NOT IN subquery optimization

2019-02-26 Thread Li, Zheng
I'm totally fine with setting the target to PG13. -- I'm interested to know how this works without testing for inner nullability. If any of the inner side's join exprs are NULL then no records can match. What do you propose to work around that? -- We still check for inner side's nullability, whe

Re: NOT IN subquery optimization

2019-02-26 Thread David Rowley
On Wed, 27 Feb 2019 at 13:13, Tom Lane wrote: > > "Li, Zheng" writes: > > However, given that the March CommitFest is imminent and the runtime smarts > > patent concerns David had pointed out (which I was not aware of before), we > > would not move that direction at the moment. > > > I propose

Re: NOT IN subquery optimization

2019-02-26 Thread David Rowley
On Wed, 27 Feb 2019 at 13:05, Li, Zheng wrote: > -With the latest fix (for the empty table case), our patch does the > transformation as long as the outer is non-nullable regardless of the inner > nullability, experiments show that the results are always faster. Hi Zheng, I'm interested to kno

Re: NOT IN subquery optimization

2019-02-26 Thread Tom Lane
"Li, Zheng" writes: > However, given that the March CommitFest is imminent and the runtime smarts > patent concerns David had pointed out (which I was not aware of before), we > would not move that direction at the moment. > I propose that we collaborate to build one patch from the two patches

Re: NOT IN subquery optimization

2019-02-26 Thread Li, Zheng
I agree we will need some runtime smarts (such as a new anti join type as pointed out by Richard) to "ultimately" account for all the cases of NOT IN queries. However, given that the March CommitFest is imminent and the runtime smarts patent concerns David had pointed out (which I was not aware

Re: NOT IN subquery optimization

2019-02-26 Thread David Rowley
On Wed, 27 Feb 2019 at 03:07, Jim Finnerty wrote: > > The problem is that the special optimization that was proposed for the case > where the subquery has no WHERE clause isn't valid when the subquery returns > no rows. That additional optimization needs to be removed, and preferably > replaced w

Re: NOT IN subquery optimization

2019-02-26 Thread Richard Guo
Greenplum Database does this optimization. The idea is to use a new join type, let's call it JOIN_LASJ_NOTIN, and its semantic regarding NULL is defined as below: 1. If there is a NULL in the outer side, and the inner side is empty, the NULL should be part of the outputs. 2. If there is a NULL

Re: NOT IN subquery optimization

2019-02-25 Thread David Rowley
On Tue, 26 Feb 2019 at 11:51, Li, Zheng wrote: > Resend the patch with a whitespace removed so that "git apply patch" works > directly. I had a quick look at this and it seems to be broken for the empty table case I mentioned up thread. Quick example: Setup: create table t1 (a int); create ta

Re: NOT IN subquery optimization

2019-02-25 Thread David Rowley
On Fri, 22 Feb 2019 at 09:44, David Rowley wrote: > I've attached the rebased and still broken version. I set about trying to make a less broken version of this. A quick reminder of the semantics of NOT IN: 1. WHERE NOT IN(SELECT FROM table); If table is non-empty: will filter out rows where

Re: NOT IN subquery optimization

2019-02-21 Thread David Rowley
On Thu, 21 Feb 2019 at 16:27, Jim Finnerty wrote: > We can always correctly transform a NOT IN to a correlated NOT EXISTS. In > almost all cases it is more efficient to do so. In the one case that we've > found that is slower it does come down to a more general costing issue, so > that's probabl

Re: NOT IN subquery optimization

2019-02-20 Thread Tom Lane
Jim Finnerty writes: > re: The idea that's been kicked around in the past is to detect whether the > subselect's output column(s) can be proved NOT NULL, and if so, convert > to an antijoin just like NOT EXISTS > basically, yes. this will handle nullability of both the outer and inner > correlat

Re: NOT IN subquery optimization

2019-02-20 Thread Tom Lane
Jim Finnerty writes: > We have been working on improved optimization of NOT IN, and we would like > to share this optimizaton with the community. The idea that's been kicked around in the past is to detect whether the subselect's output column(s) can be proved NOT NULL, and if so, convert to an a