Hi Joel,

Joel Jacobson <[email protected]> wrote on Thu, Feb 26, 2026 at 04:32:
>
> On Wed, Feb 25, 2026, at 13:19, Joel Jacobson wrote:
> > On Tue, Feb 24, 2026, at 19:21, Joel Jacobson wrote:
> >> This bug seems to sometimes cause the wrong table, the larger table, to
> >> be hashed in a Hash Join, and the smaller table to be used for probing.
>
> To help reviewers, instead of relying on benchmark results,
> I realized it's much better if we can actually prove the calculation
> of skew_ratio is incorrect, and that it becomes correct with the fix.
>
> I therefore added debugging here:
>
>  	if (avgfreq > 0.0 && *mcv_freq > avgfreq)
> +	{
> +		elog(DEBUG1, "mcv_freq=%g, avgfreq=%g, skew_ratio=%g",
> +			 *mcv_freq, avgfreq, *mcv_freq / avgfreq);
>  		estfract *= *mcv_freq / avgfreq;
> +	}
>
> And created this minimal schema to prove the incorrect calculation:
>
> CREATE TABLE orders (
>     order_id bigint NOT NULL,
>     tracking_code bigint,
>     region int NOT NULL
> );
>
> CREATE TABLE shipments (
>     shipment_id bigint NOT NULL,
>     tracking_code bigint NOT NULL
> );
>
> -- Skewed tracking_code distribution:
> -- 10% of rows share a single "hot" tracking_code (value 1),
> -- the rest get unique codes.
> -- This ensures mcv_freq > avgfreq, triggering the skew adjustment
> -- even in the fixed version of estimate_hash_bucket_stats.
>
> INSERT INTO orders
> SELECT
>     g,
>     CASE
>         WHEN g > 150000 THEN NULL
>         WHEN g <= 15000 THEN 1
>         ELSE g
>     END,
>     (g % 1000) + 1
> FROM generate_series(1, 200000) g;
>
> INSERT INTO shipments
> SELECT g, CASE WHEN g <= 15000 THEN 1 ELSE g END
> FROM generate_series(1, 150000) g;
>
> CREATE INDEX ON orders (region);
>
> ANALYZE orders;
> ANALYZE shipments;
>
> -- Ground truth of the skew_ratio for orders WHERE region = 42:
> WITH base AS (
>     SELECT
>         count(*)::numeric AS total,
>         count(tracking_code)::numeric AS nonnull,
>         count(DISTINCT tracking_code) AS ndistinct
>     FROM orders
>     WHERE region = 42
> ),
> mcv AS (
>     SELECT count(*)::numeric AS mcv_count
>     FROM orders
>     WHERE region = 42 AND tracking_code IS NOT NULL
>     GROUP BY tracking_code
>     ORDER BY count(*) DESC
>     LIMIT 1
> )
> SELECT
>     round(mcv_count / total, 6) AS mcv_freq,
>     round(nonnull / total / ndistinct, 6) AS avgfreq,
>     round((mcv_count / total) / (nonnull / total / ndistinct), 6) AS skew_ratio
> FROM base, mcv;
>
>  mcv_freq | avgfreq  | skew_ratio
> ----------+----------+------------
>  0.075000 | 0.005515 |  13.600000
> (1 row)
>
> SET client_min_messages = DEBUG1;
>
> SELECT *
> FROM orders o
> JOIN shipments s ON s.tracking_code = o.tracking_code
> WHERE o.region = 42;
>
> -- HEAD (65707ed):
> DEBUG: mcv_freq=0.0748333, avgfreq=8.69725e-06, skew_ratio=8604.25
>
> -- 0001-Fix-estimate_hash_bucket_stats-to-use-filtered-ndist.patch:
> DEBUG: mcv_freq=0.0738667, avgfreq=0.00871124, skew_ratio=8.47947
>
> In HEAD, the skew_ratio for orders.tracking_code is wrongly estimated
> to 8604.25, when the ground truth is 13.6, which the fixed estimate,
> 8.47947, is a good approximation of.
>
> The fix only moves the computation of avgfreq from before the
> ndistinct adjustment, to after it:
>
> diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
> index 29fec65559..d19c4b5d96 100644
> --- a/src/backend/utils/adt/selfuncs.c
> +++ b/src/backend/utils/adt/selfuncs.c
> @@ -4456,9 +4456,6 @@ estimate_hash_bucket_stats(PlannerInfo *root, Node *hashkey, double nbuckets,
>  	else
>  		stanullfrac = 0.0;
>
> -	/* Compute avg freq of all distinct data values in raw relation */
> -	avgfreq = (1.0 - stanullfrac) / ndistinct;
> -
>  	/*
>  	 * Adjust ndistinct to account for restriction clauses.  Observe we are
>  	 * assuming that the data distribution is affected uniformly by the
> @@ -4473,6 +4470,9 @@ estimate_hash_bucket_stats(PlannerInfo *root, Node *hashkey, double nbuckets,
>  		ndistinct = clamp_row_est(ndistinct);
>  	}
>
> +	/* Compute avg freq of all distinct data values in the filtered relation */
> +	avgfreq = (1.0 - stanullfrac) / ndistinct;
> +
>  	/*
>  	 * Initial estimate of bucketsize fraction is 1/nbuckets as long as the
>  	 * number of buckets is less than the expected number of distinct values;
>
> /Joel
I think your analysis is correct. Since bd3e3e9, mcv_freq has been
calculated from RelOptInfo.rows, which accounts for restriction clauses,
but avgfreq is still computed for the raw relation.

I tried another approach: replacing vardata.rel->rows with
vardata.rel->tuples. That also produced the correct plan, but
partition_join.sql showed a lot of plan diffs, as you said in your first
thread. And vardata.rel->tuples may be zero for an empty relation, so I
agree with your fix.

I have added Tom to the cc list; he may know more about this.

--
Thanks,
Tender Wang
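P.S. To make the mismatch concrete, here is a small illustrative sketch
(Python, not PostgreSQL code; variable names are mine) using the numbers
from the DEBUG output quoted above. estimate_hash_bucket_stats scales
estfract by mcv_freq / avgfreq, so an avgfreq computed from the raw
relation's much larger ndistinct is far too small and inflates the ratio:

```python
# Values taken from the DEBUG output quoted above (illustrative only;
# these variable names are not PostgreSQL identifiers).
mcv_freq_head = 0.0748333      # mcv_freq on HEAD (65707ed)
avgfreq_raw = 8.69725e-06      # avgfreq from the raw relation's ndistinct
mcv_freq_fixed = 0.0738667     # mcv_freq with the patch applied
avgfreq_filtered = 0.00871124  # avgfreq after the ndistinct adjustment

# The skew adjustment multiplies estfract by mcv_freq / avgfreq.
skew_head = mcv_freq_head / avgfreq_raw
skew_fixed = mcv_freq_fixed / avgfreq_filtered

print(f"HEAD skew_ratio:    {skew_head:.2f}")   # ~8604, vs ground truth 13.6
print(f"patched skew_ratio: {skew_fixed:.2f}")  # ~8.48, close to ground truth
```

The three-orders-of-magnitude overestimate on HEAD is what flips the
choice of which side of the Hash Join gets hashed.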
