On Tue, Aug 25, 2020 at 12:27 AM David Rowley <dgrowle...@gmail.com> wrote:
> On Sat, 22 Aug 2020 at 00:35, iulian dragos > <iulian.dra...@databricks.com> wrote: > > I am trying to understand why the query planner insists on using a hash > join, and how to make it choose the better option, which in this case would > be a nested loop. > > > | -> Index Scan using > test_result_module_result_id_idx on test_result (cost=0.57..6911.17 > rows=4331 width=12) (actual time=0.002..0.002 rows=1 loops=14824) | > > | Index Cond: (module_result_id = > module_result.id) > | > > You might want to check if the pg_stats view reports a realistic > n_distinct value for test_result.module_result_id. If the > pg_class.retuples is correct for that relation then that would > indicate the n_distinct estimate is about 115000. Going by the number > of rows you've mentioned it would appear a more realistic value for > that would be -0.4. which is 0 - 1 / (500000000 / 200000000.0). > However, that's assuming each module_result has a test_result. You > could run a SELECT COUNT(DISTINCT module_result_id) FROM test_result; > to get a better idea. > > If ANALYZE is not getting you a good value for n_distinct, then you > can overwrite it. See [1], search for n_distinct. > Thanks for the tip! Indeed, `n_distinct` isn't right. I found it in pg_stats set at 131736.0, but the actual number is much higher: 210104361. I tried to set it manually, but the plan is still the same (both the actual number and a percentage, -0.4, as you suggested): > ALTER TABLE test_result ALTER COLUMN module_result_id SET (n_distinct=210104361) You're about to run a destructive command. Do you want to proceed? (y/n): y Your call! ALTER TABLE Time: 0.205s > > David > > [1] https://www.postgresql.org/docs/current/sql-altertable.html >