On Fri, Apr 24, 2009 at 8:09 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> So it appears to me that instead of taking an average-case correction
> as is done in this patch and the old coding, we have to explicitly model
> the matched-tuple and unmatched-tuple cases separately.  For hashjoins,
> what I think we should do is estimate the bucket scanning cost for
> unmatched tuples assuming a perfectly uniform bucket distribution,
> rather than the worst-case bucket size that's used in the current code
> (and is a reasonable estimate for matched tuples).  This is to reflect
> the fact that an unmatched outer tuple might or might not hit a
> populated bucket, and which bucket it hits is probably uncorrelated with
> the distribution of the inner tuples.
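If I'm following, the shape of the estimate you're describing works out to something like the sketch below. This is only to check my understanding - none of these are real planner fields; match_fraction, hash_compare_cost, and the rest are invented for illustration:

/*
 * Sketch only -- not real costsize.c code.  All names below are invented
 * for illustration; the point is just the shape of costing matched and
 * unmatched outer tuples separately instead of charging an average-case
 * bucket scan to everything.
 */
#include <stdio.h>

static double
sketch_hash_qual_run_cost(double outer_rows, double match_fraction,
                          double inner_rows, double inner_ndistinct,
                          double num_buckets,
                          double qual_cost_per_tuple,
                          double hash_compare_cost)
{
    double  matched_outer = outer_rows * match_fraction;
    double  unmatched_outer = outer_rows - matched_outer;

    /*
     * A matched outer tuple scans the bucket holding its join partners;
     * roughly inner_rows / inner_ndistinct of the entries there share its
     * hash value, and those are the ones that cost us the hash quals.
     */
    double  matched_cost = matched_outer *
        (inner_rows / inner_ndistinct) * qual_cost_per_tuple;

    /*
     * An unmatched outer tuple lands in an average-sized bucket (uniform
     * distribution assumption), and nearly every entry there fails the
     * stored-hash-value comparison, which is only an integer compare.
     */
    double  unmatched_cost = unmatched_outer *
        (inner_rows / num_buckets) * hash_compare_cost;

    return matched_cost + unmatched_cost;
}

int
main(void)
{
    /* purely hypothetical numbers, just to show the two terms */
    printf("%g\n",
           sketch_hash_qual_run_cost(100000.0, 0.5, 100000.0, 100000.0,
                                     16384.0, 0.0025, 0.0001));
    return 0;
}

The key point is that the cheap hash-value comparison, not the qual evaluation, is what the unmatched probes mostly pay for - more on that below.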
I've been investigating this some more as well and have come to the conclusion that the following code (and estimate_hash_bucketsize) are pretty much wrong.

> /*
>  * The number of tuple comparisons needed is the number of outer tuples
> ! * times the typical number of tuples in a hash bucket, which is the inner
> ! * relation size times its bucketsize fraction.  At each one, we need to
> ! * evaluate the hashjoin quals.  But actually, charging the full qual eval
> ! * cost at each tuple is pessimistic, since we don't evaluate the quals
> ! * unless the hash values match exactly.  For lack of a better idea, halve
> ! * the cost estimate to allow for that.
>  */
> startup_cost += hash_qual_cost.startup;
> run_cost += hash_qual_cost.per_tuple *
> !     outer_path_rows * clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;

As far as I can tell, the focus on trying to estimate the number of tuples per bucket is entirely misguided. Supposing the relation is mostly unique, so that the values don't cluster too much, the right answer is (of course) NTUP_PER_BUCKET. estimate_hash_bucketsize() backs into this number, but it places far too much importance on it. If you increase NTUP_PER_BUCKET to, say, 50, the planner concludes that hash joins are really, really expensive and avoids them like the plague, but in reality they're only a little slower. Why? Because the extra tuples that get thrown into each bucket generally don't have the same hash value (and if they did, they would have been in the bucket either way...), so they get rejected with a simple integer comparison, which is much cheaper than hash_qual_cost.per_tuple.

It seems to me that what we should really be attempting to estimate here is the likely number of times we're going to have to evaluate the hash quals per outer tuple. Your idea of estimating the matched and unmatched cases separately is a lot better than anything I had come up with. In the unmatched case, the cost of a hash probe is nearly zero whether or not the bucket it lands in is populated, because the probability of a real hash collision is quite low and nothing else requires evaluating the hash quals. In the matched case, we expect to probe about inner_rows / inner_ndistinct tuples - unless it's a SEMI or ANTI join, in which case we expect to probe exactly one row.

As a side note, the code in estimate_hash_bucketsize that scales up the tuples-per-bucket estimate by the ratio of the most common value's frequency to the average frequency looks pretty sketchy as well. Consider a relation with 10000 values, all distinct, so that each value has frequency 0.0001. Now add a duplicate copy of one value: that value becomes an MCV with almost twice the frequency of anything else in the table, and estimate_hash_bucketsize() essentially *doubles* its estimate of how many items are in a bucket. That doesn't make a lot of sense. At the very least, we ought not to do this scaling when the outer rel is more or less distinct on the hash key; even if the inner rel has a fair number of duplicates, things will still be pretty fast.

...Robert
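P.S. To make the point about extra bucket-mates being cheap a bit more concrete, the probe of a bucket has roughly this shape. This is a simplified sketch, not the actual nodeHash.c code; the type and function names here are invented:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a hash table entry: each entry keeps the full
 * hash value that was computed when the inner tuple was inserted. */
typedef struct SketchHashEntry
{
    struct SketchHashEntry *next;       /* next entry in the same bucket */
    uint32_t                hashvalue;  /* stored hash of the inner tuple */
    void                   *tuple;      /* the inner tuple itself */
} SketchHashEntry;

/*
 * Scan one bucket for matches against an outer tuple whose hash value is
 * probe_hashvalue.  eval_hash_quals stands in for the (comparatively
 * expensive) evaluation of the hash clauses.
 */
int
sketch_probe_bucket(SketchHashEntry *bucket, uint32_t probe_hashvalue,
                    bool (*eval_hash_quals)(void *inner, void *outer),
                    void *outer, int *quals_evaluated)
{
    int                 matches = 0;
    SketchHashEntry    *entry;

    for (entry = bucket; entry != NULL; entry = entry->next)
    {
        /*
         * Bucket-mates with a different stored hash value are rejected by
         * this integer comparison and never reach the hash quals -- which
         * is why piling more loosely-related tuples into a bucket costs
         * so little.
         */
        if (entry->hashvalue != probe_hashvalue)
            continue;

        /* Only an exact hash-value match pays hash_qual_cost.per_tuple. */
        (*quals_evaluated)++;
        if (eval_hash_quals(entry->tuple, outer))
            matches++;
    }
    return matches;
}

So everything that is in the bucket only because of NTUP_PER_BUCKET costs basically one integer comparison per probe.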
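And on the estimate_hash_bucketsize point, the scaling step I'm objecting to boils down, as I understand it, to something like this (again a simplification for illustration, not the real selfuncs.c code):

/*
 * Simplified version of the skew adjustment I'm describing: scale the
 * estimated bucket fraction up by how much the most common value's
 * frequency exceeds the average frequency.
 */
double
sketch_skew_adjustment(double estfract, double mcvfreq, double avgfreq)
{
    if (avgfreq > 0.0 && mcvfreq > avgfreq)
        estfract *= mcvfreq / avgfreq;
    return estfract;
}

In the example above, the 10000 distinct values give avgfreq of about 0.0001; adding one duplicate pushes mcvfreq to roughly 2/10001 while avgfreq stays near 1/10001, so the adjustment multiplies the bucket estimate by nearly 2 - on the strength of a single extra row.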