I wrote:
> Thinking a bit harder, we are comparing these costs:
> [ theoretical arguments trimmed ]
I spent some effort on actually measuring timings of the v6 patch,
and concluded that this is all splitting hairs that we don't need
to split. The actual crossover between hash-loses and hash-wins
is higher than my theoretical argument suggested, but still
probably less than 100 MCVs on each side. I think we should go with
(sslot1.nvalues + sslot2.nvalues) >= 200
and call it good.
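For context, here is a minimal sketch (in Python, with illustrative names;
the real eqjoinsel code works with Datums and calls the join operator and
hash function through the fmgr machinery) of the two matching strategies
the threshold chooses between:

```python
# Illustrative sketch only -- not PostgreSQL code. Plain Python values
# stand in for the Datums that eqjoinsel actually compares.

def match_mcvs_nested(mcv1, mcv2):
    """O(n1 * n2) pairwise comparisons: cheap when both MCV lists are short."""
    return [any(v1 == v2 for v2 in mcv2) for v1 in mcv1]

def match_mcvs_hash(mcv1, mcv2):
    """Hash one side once, then probe: wins once the lists get long."""
    hashed = set(mcv2)
    return [v1 in hashed for v1 in mcv1]

def use_hash(nvalues1, nvalues2, threshold=200):
    """The heuristic arrived at above: hash iff the combined MCV-list
    length reaches EQJOINSEL_MCV_HASH_THRESHOLD (200)."""
    return nvalues1 + nvalues2 >= threshold
```

Both strategies yield the same matches; the threshold only trades constant
factors, which is why the exact crossover point isn't worth splitting hairs
over.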
To arrive at this result, I built the v6 patchset with
EQJOINSEL_MCV_HASH_THRESHOLD changed to either 0 (to force hashing)
or 1000000 (to prevent it). I then ran the attached scripts with
different values of "nstats" and collected timings from the postmaster
log output produced by the 0001 patch.
The scripts are designed to test both the cheap-comparisons scenario
(integer columns) and the expensive-comparisons scenario (text columns
with a case-insensitive ICU collation). My motivation for splitting
them into a setup and a test step was to allow the tests to be run
repeatedly against the same underlying data. (Although I soon realized
that because VACUUM ANALYZE takes a random sample each time, the stats
we're working from aren't totally the same each time anyway.)
Also, you'll notice that the test data is based on log(random()), which
I did to roughly approximate a zipfian distribution. If you remove
the log() call you'll get a flat distribution instead, but it didn't
seem to change the conclusions much.
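To see the shape of that distribution, here is a quick Python check (not
part of the attached scripts) of what (-log(random())*10000)::int produces.
Note that SQL log() is base-10, so this is an exponential distribution with
mean 10000/ln(10), i.e. small values are far more frequent than large ones:

```python
import math
import random
from collections import Counter

random.seed(42)  # fixed seed so the demonstration is repeatable
N = 100_000

# Same expression as in the SQL scripts; SQL log() is log base 10.
vals = [int(-math.log10(random.random()) * 10000) for _ in range(N)]

counts = Counter(vals)
# Frequency mass in a low bin vs. an equally wide high bin:
low = sum(c for v, c in counts.items() if v < 1000)
high = sum(c for v, c in counts.items() if 20000 <= v < 21000)
```

The low bin carries roughly 100x the mass of the high one, which is the
kind of skew the MCV machinery cares about. Removing the log() call makes
the distribution flat, as noted above.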
regards, tom lane
-- setup script
drop table if exists tint1, tint2, ttext1, ttext2;
create collation ignore_accent_case (provider = icu, deterministic = false,
locale = 'und-u-ks-level1');
create table tint1 as
select (-log(random())*10000)::int as x from generate_series(1,1000000);
create table tint2 as
select (-log(random())*10000)::int as x from generate_series(1,1000000);
create table ttext1 (x text collate ignore_accent_case);
insert into ttext1
select md5((-log(random())*10000)::int::text) as x from
generate_series(1,1000000);
create table ttext2 (x text collate ignore_accent_case);
insert into ttext2
select md5((-log(random())*10000)::int::text) as x from
generate_series(1,1000000);
-- test script
\set nstats 10000
alter table tint1 alter column x set statistics :nstats;
alter table tint2 alter column x set statistics :nstats;
alter table ttext1 alter column x set statistics :nstats;
alter table ttext2 alter column x set statistics :nstats;
vacuum analyze tint1, tint2, ttext1, ttext2;
\echo nstats = :nstats
select tablename, array_length(most_common_vals,1)
from pg_stats where tablename in ('tint1', 'tint2', 'ttext1', 'ttext2')
order by 1;
explain select * from tint1 t1 join tint2 t2 on t1.x = t2.x;
explain select * from ttext1 t1 join ttext2 t2 on t1.x = t2.x;