On Mon, Jun 21, 2021 at 12:52:39PM -0400, Tom Lane wrote:
> You're throwing around a lot of pejorative adjectives there without
> having bothered to quantify any of them.  This sounds less like a sound
> argument to me than like a witch trial.
>
> Reflecting for a bit on the ancient principle that "the only good numbers
> in computer science are 0, 1, and N", it seems to me that we could make
> an effort to classify RelOptInfos as provably empty, provably at most one
> row, and others.  (This would be a property of relations, not individual
> paths, so it needn't bloat struct Path.)  We already understand about
> provably-empty rels, so this is just generalizing that idea a little.
> Once we know about that, at least for the really-common cases like unique
> keys, I'd be okay with a hard rule that we don't consider unparameterized
> nestloop joins with an outer side that falls in the "N" category.
> Unless there's no alternative, of course.
>
> Another thought that occurs to me here is that maybe we could get rid of
> the enable_material knob in favor of forcing (or at least encouraging)
> materialization when the outer side isn't provably one row.
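Tom's 0/1/N classification is concrete enough to sketch in a few lines.
Purely as an illustration --- the names and structure below are mine and
don't correspond to anything that exists in the planner --- the hard rule
he describes might look roughly like this:

    /*
     * Illustrative sketch only: classify each relation as provably
     * empty, provably at most one row, or "N" rows, and refuse to
     * generate an unparameterized nestloop whose outer side falls in
     * the "N" class unless no alternative join path exists.
     */
    #include <stdbool.h>
    #include <stdio.h>

    typedef enum RelRowClass
    {
        REL_ROWS_ZERO,          /* provably empty */
        REL_ROWS_AT_MOST_ONE,   /* e.g. lookup on a unique key */
        REL_ROWS_MANY           /* everything else: "N" */
    } RelRowClass;

    static bool
    consider_unparameterized_nestloop(RelRowClass outer_class,
                                      bool have_other_join_path)
    {
        if (outer_class == REL_ROWS_MANY && have_other_join_path)
            return false;       /* hard rule: don't even generate the path */
        return true;
    }

    int
    main(void)
    {
        printf("outer is N, hash join available  -> consider nestloop? %d\n",
               consider_unparameterized_nestloop(REL_ROWS_MANY, true));
        printf("outer is at most one row         -> consider nestloop? %d\n",
               consider_unparameterized_nestloop(REL_ROWS_AT_MOST_ONE, true));
        return 0;
    }

The "unless there's no alternative" escape hatch is what keeps this from
being a planner failure when a hash or merge join simply isn't possible.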
There were a lot of interesting ideas in this thread and I want to analyze
some of them.  First, there is the common assumption (not stated) that
over-estimating by 5% and under-estimating by 5% cause the same harm, which
is clearly false.  If I go to a restaurant and estimate the bill to be 5%
higher or 5% lower, assuming I have sufficient funds, under- or
over-estimating is probably fine.  If I am driving and under-estimate the
traction of my tires, I am probably fine, but if I over-estimate their
traction by 5%, I might crash.  Closer to home, Postgres is more tolerant
of memory usage under-estimation than over-estimation:

	https://momjian.us/main/blogs/pgblog/2018.html#December_7_2018

What I hear Robert saying is that unparameterized nested loops are much
more sensitive to misestimation than hash joins, and only slightly faster
than hash joins when the row counts are estimated accurately, so we might
want to avoid them by default.  Tom is saying that if we know the outer
side will have zero or one row, we should still use unparameterized nested
loops because those are not more sensitive to misestimation than hash
joins, and are slightly faster.

If that is accurate, I think the big question is how common are cases where
the outer side can't be proven to have zero or one row and nested loops are
enough of a win to risk their greater sensitivity to misestimation.  If
such cases are uncommon, it seems we could just code the optimizer to use
hash joins in those cases without a user-visible knob, beyond the knob that
already turns off nested loop joins.

Peter's comment about having nodes in the executor that adjust to the row
counts they find is interesting, and eventually might be necessary once we
are sure we have gone as far as we can without it.

-- 
  Bruce Momjian  <br...@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.
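P.S.  To put rough numbers on the sensitivity argument, here is a toy
comparison.  The costs are invented and the models are grossly simplified
(an unparameterized nestloop rescans the inner side once per outer row,
while a hash join pays a one-time build cost plus a small per-row probe):

    #include <stdio.h>

    int
    main(void)
    {
        double  inner_scan_cost = 1000.0;   /* cost of one scan of the inner side */
        double  hash_build_cost = 1500.0;   /* one-time cost to build the hash table */
        double  probe_cost      = 0.01;     /* per-outer-row probe cost */
        double  outer_rows[]    = {1, 10, 1000, 1000000};

        for (int i = 0; i < 4; i++)
        {
            double  n = outer_rows[i];
            double  nestloop = n * inner_scan_cost;
            double  hashjoin = hash_build_cost + n * probe_cost;

            printf("outer rows %8.0f: nestloop %14.0f   hashjoin %10.0f\n",
                   n, nestloop, hashjoin);
        }
        return 0;
    }

With one outer row the nestloop is the slight winner because it skips the
hash build; at a million outer rows it is roughly five orders of magnitude
more expensive, while the hash join barely moves.  That is the kind of
asymmetry I am describing above.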