On Tue, Dec 8, 2015 at 11:12 PM, David Rowley
<david.row...@2ndquadrant.com> wrote:
> My point was more of a response to the general condition that we cannot
> add any planner overhead unless the extra CPU cycles spent are almost
> certainly going to improve the plan.
I hope that's an overstatement of the requirement, because otherwise it
boils down to "don't improve the planner".  Most things we do to try to
improve plans are going to generate paths that will only sometimes be
better than the paths we're already generating.  "Almost certainly" is a
high bar to clear.

> hmm, well I think it's more than likely that my view on this has been
> skewed by the fact that we push equality quals down when the eclass
> contains a Const, without any regard for whether that could generate a
> worse plan or add needless CPU overhead for checking the quals.  If you
> rewrite your above paragraph with equality instead of inequality then
> it still applies.  A JOIN B ON A.X = B.X WHERE A.X = 1000000 will cause
> B.X = 1000000 to be pushed into B, but what if B.X values are all set
> to 1000000?  I've not seen anyone complain about us doing that.  This
> is the reason I limited the patch to only deal with members of the
> BTREE op-class: I assumed that if there's some BTREE index helping with
> the equality qual, then that same index may well just help with a qual
> using the <, <=, > or >= operator, and also I thought that in most
> cases all these btree inequality operations will have about the same
> CPU cost as equality.
>
> Would you be able to explain the difference between the two? Or is it
> just a question of the likelihood?

Likelihood, I guess.  I mean, typically if you are doing an equijoin on
two or more relations, the join column is unique, so there's only going
to be one row in each of A and B that is equal to a given constant.  If
A.X and/or B.X contain duplicate values, then the output is going to
have a number of rows equal to the product of the number of duplicates
in each, sort of like a Cartesian join.  That scenario could happen, but
it's a pretty strange kind of query which I would be disinclined to
spend a lot of time optimizing.
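To make the duplicate-value scenario concrete, here is a toy sketch (my
own illustration, with made-up data, not anything from the patch): when
both sides of an equijoin hold only duplicates of the pushed-down
constant, the join output is the product of the duplicate counts.

```python
# Toy illustration: an equijoin on a column where both sides contain
# only duplicates of the pushed-down constant behaves like a Cartesian
# product of the duplicate groups.
a_x = [1000000] * 4   # A.X: four rows, all equal to the constant
b_x = [1000000] * 5   # B.X: five rows, all equal to the constant

# Naive nested-loop equijoin: A JOIN B ON A.X = B.X WHERE A.X = 1000000
out = [(x, y) for x in a_x if x == 1000000
              for y in b_x if y == x]

print(len(out))  # 20 rows: product of the duplicate counts, 4 * 5
```

With unique join keys, each constant matches at most one row per side,
which is the common case being contrasted here.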
OTOH, something like

    SELECT * FROM orders o, lines l
    WHERE l.order_id = o.order_id AND o.order_id > 1000000

is a pretty normal query, at least IMHO.  Worrying about how that's
going to perform with various data distributions seems like a pretty
reasonable thing to care about.

>> Furthermore, there are some cases involving parameterized paths where
>> enforcing the inequality multiple times is definitely bad: for
>> example, if we've got a nested loop where the outer side is a seq scan
>> that enforces the condition and the inner side is an index probe, it
>> is just a waste to retest it on the inner side.  We already know that
>> the outer row passes the inequality, so the inner row will necessarily
>> pass also.  This doesn't apply to merge or hash joins, and it also
>> doesn't apply to all nested loops: scans that aren't parameterized by
>> the equivalence-class column can still benefit from separate
>> enforcement of the inequality.
>
> I guess that could be fixed by somehow marking these pushed quals as
> optional and having parameterised scans ignore optional quals.

Yeah.

> I have to admit that I didn't give it much thought, as all of the cases
> that I tested turned what used to be a nested loop with a parameterised
> inner index scan into a merge join, which was faster in my test case.
> Of course, that does not mean that I claim that this merge join will be
> faster in all cases or that the plan will switch to using a merge join
> in all cases.

Interesting that it turned into a merge join and that that was faster.

> Again, this is one of the reasons that I limited it to btree operators
> only.  I don't quite know how to estimate this either, as the extra
> rows being filtered may mean that a sort no longer spills to disk, or a
> hash join no longer multi-batches.
> I'd imagine filtering would be extra worthwhile in such cases, so I
> doubt setting the threshold to some constant value is the correct way,
> and most likely considering the path with and without the qual(s) would
> be far too costly.

It might be OK, particularly for OLTP queries, but it's certainly not
going to be the cheapest thing anybody ever did.

> Well, I agree with that.  I've got no interest in slowing anything
> down.  I've been busy working with big databases for quite a while now,
> so perhaps my point of view is coming from that direction still.

Yeah.

> So anyway, it's quite clear that we don't want the patch in its current
> form, and I'm not volunteering any more time at this stage to improve
> it, so for this reason I won't add it to the January commitfest.

Too bad, but I understand.  I hope you come back around to working on
this further at some point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers