On Fri, Apr 24, 2009 at 10:49 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> As far as I can tell, the focus on trying to estimate the number of
>> tuples per bucket is entirely misguided.  Supposing the relation is
>> mostly unique so that the values don't cluster too much, the right
>> answer is (of course) NTUP_PER_BUCKET.
>
> But the entire point of that code is to arrive at a sane estimate
> when the inner relation *isn't* mostly unique and *does* cluster.
> So I think you're being much too hasty to conclude that it's wrong.
A few further thoughts on this... the patch you attached computes jselec
(the fraction of outer tuples with a match) and avgmatch (the number of
matches per tuple that has at least one match).  This seems like a really
good idea, but perhaps we should do it even when performing an inner/left
join, if that's possible.

Regardless of the join type, when there is no match, we need to scan the
entire bucket.  There's no reason to assume we'll hit more populated
buckets more often.  So, if there are 2^b virtual buckets and n inner
tuples, then the bucket we hit will on average contain n/(2^b) tuples,
and the chance of a hash collision for any given tuple in that bucket is
about 2^-(32-b), multiplied by whatever fudge factor you want to allow
for imperfect hashing (call it f).  The total number of hash collisions
per unmatched tuple will be the number of tuples in the bucket times the
chance of a collision per tuple, or in other words
n/(2^b) * 2^-(32-b) * f = n * 2^-32 * f, independent of the bucket count.

For an inner/left join, when there is a match, we still need to scan the
entire bucket, but now we know we'll have to evaluate the hash quals at
least avgmatch times.  In addition, we risk having to evaluate the hash
quals for each tuple in the bucket that doesn't match, if the hash values
happen to collide.  In this case we can only have hash collisions with
the tuples we're not matching, so the number of hash qual evaluations
should be about (n - avgmatch) * 2^-32 * f + avgmatch.

For a semi/anti join, when there is a match, we only need to scan until
we hit the first matching tuple.  If we let the number of collisions, c,
be (n - avgmatch) * 2^-32 * f, then the fraction of the bucket we need to
scan is about c / (c + avgmatch).  Since the matches might not be
randomly distributed within the bucket, multiply by 2 as in your existing
patch, but with a maximum of 1.0.
So the number of hash qual evaluations will be:

MIN(1.0, 2.0 * c / (c + avgmatch)) * (c + avgmatch) = MIN(c + avgmatch, 2.0 * c)

...Robert

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers