On 9/12/24 12:12, David Rowley wrote:
> On Thu, 12 Sept 2024 at 21:51, Andrei Lepikhov <lepi...@gmail.com> wrote:
>> The initial problem causes a wrong cost_sort estimation. Right now I'm
>> thinking about providing cost_sort() the sort clauses instead of (or in
>> addition to) the pathkeys.
> 
> I'm not quite sure why the sort clauses matter any more than the
> EquivalenceClass. If the EquivalenceClass defines that all members
> will have the same value for any given row, then, if we had to choose
> any single member to drive the n_distinct estimate from, isn't the
> most accurate distinct estimate from the member with the smallest
> n_distinct estimate? (That assumes the less distinct member has every
> value the more distinct member has, which might not be true)
> 
> David
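
To make that concrete (a toy example of mine, just to spell the reasoning
out): every value that survives the join has to appear on both sides, so
the join output cannot have more distinct values than the least distinct
member of the equivalence class:

CREATE TEMP TABLE a AS SELECT generate_series(1, 6) AS v;
CREATE TEMP TABLE b AS SELECT 1 + (g % 2) AS v FROM generate_series(1, 6) AS g;

-- a.v has 6 distinct values, b.v has 2; the join output has
-- count(DISTINCT v) = 2, i.e. min(6, 2)
SELECT count(DISTINCT a.v) FROM a JOIN b ON a.v = b.v;

(Per the caveat above, this is only exact when the less distinct member's
values are a subset of the other's - in general the minimum is an upper
bound.)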
How large can the cost difference get? My assumption was that it's
correlated with how different the ndistincts are for the two sides, so I
tried this:

CREATE TABLE t1 (x integer, y integer, z text);
CREATE TABLE t2 (x integer, y integer, z text);

INSERT INTO t1 (x, y) SELECT x, 1 FROM generate_series(1, 1000000) AS x;
INSERT INTO t2 (x, y) SELECT mod(x, 1000), 1 FROM generate_series(1, 1000000) AS x;

CREATE INDEX ON t1 (x);
CREATE INDEX ON t2 (x);
CREATE INDEX ON t1 (y);
CREATE INDEX ON t2 (y);

VACUUM ANALYZE;

This changes the ndistinct for t2.x from 1M to 1k. I expected the cost
difference to get much larger, but in fact it does not really change:

  GroupAggregate  (cost=38.99..37886.88 rows=992 width=16) (actual rows=1 loops=1)

  GroupAggregate  (cost=37874.26..37904.04 rows=992 width=16) (actual rows=1 loops=1)

That is pretty significant: the total cost difference is tiny - I'd even
say it does not matter in practice (it seems well within the possible
impact of collecting a different random sample) - but the startup cost
changes in a rather absurd way, while the rest of the plan is exactly the
same. We even know this:

  ->  Incremental Sort  (cost=38.99..37869.52 rows=992 width=8)
        Sort Key: t1.x, t1.y
        Presorted Key: t1.x

in both cases. There's literally no other difference between these plans
visible in the explain ...

I'm not sure how to fix this, but it seems estimate_num_groups() needs to
do things differently. And I agree that looking for the minimum ndistinct
seems like the right approach - but isn't estimate_num_groups() supposed
to already do that? The comment says:

 * 3. If the list contains Vars of different relations that are known equal
 * due to equivalence classes, then drop all but one of the Vars from each
 * known-equal set, keeping the one with smallest estimated # of values
 * (since the extra values of the others can't appear in joined rows).
 * Note the reason we only consider Vars of different relations is that
 * if we considered ones of the same rel, we'd be double-counting the
 * restriction selectivity of the equality in the next step.

I haven't debugged this, but how come this doesn't do the trick?

regards

-- 
Tomas Vondra
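
PS: To sanity-check the ndistinct estimates the planner is working with
here, one can query pg_stats directly:

-- expected after the VACUUM ANALYZE above: t1.x => -1 (all values
-- distinct), t2.x => roughly 1000
SELECT tablename, attname, n_distinct
  FROM pg_stats
 WHERE tablename IN ('t1', 't2') AND attname = 'x';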