________________________________
From: Bruce Momjian <br...@momjian.us> Sent: Wednesday, 12 October 2022 1:30 AM

>On Tue, Oct 11, 2022 at 09:59:43AM -0400, Tom Lane wrote:
>> David Rowley <dgrowle...@gmail.com> writes:
>> > It feels like something is a bit lacking in our cost model here. I'm
>> > just not sure what that is.
>>
>> The example you show is the same old problem that we've understood for
>> decades: for cost-estimation purposes, we assume that matching rows
>> are more or less evenly distributed in the table.  Their actual
>> location doesn't matter that much if you're scanning the whole table;
>> but if you're hoping that a LIMIT will be able to stop after scanning
>> just a few rows, it does matter.
>
> We do have a correlation statistics value for each column but I am
> unclear if that would help here.

This might give someone an idea -  the best query I come up with was

   explain analyze select distinct 2 from tbl where (fld = 230) limit 1;


Limit  (cost=0.56..28349.28 rows=1 width=4) (actual time=0.039..0.040 rows=1 
loops=1)
  ->  Unique  (cost=0.56..28349.28 rows=1 width=4) (actual time=0.039..0.039 
rows=1 loops=1)
        ->  Index Only Scan using idx on tbl  (cost=0.56..28349.28 rows=995241 
width=4) (actual time=0.038..0.038 rows=1 loops=1)
              Index Cond: (fld = 230)
              Heap Fetches: 0
Planning Time: 0.066 ms
Execution Time: 0.047 ms


With the distinct and the limit, the planner somehow knows to push the either 
the distinct or the limit into the index only scan so the unique for distinct 
only had 1 row and the outer limit only had 1 row.  Without the limit, the 
distinct still does the index only scan but has to do the unique on the million 
rows and execution time goes to about 100ms.


fld is mostly ordered - it's a serial primary key in another table.  The 
cardinality of the 131 distinct values is an exponential distribution. Of the 
20m rows,   the fld values ordered by count is 8m, 5m 2m, 1m, 1m, .... down to 
about 10k.  index is btree with stats target of 1000.  table is analyzed and 
vacuum frozen.  there is a "create statistics" on this table for n:1 
relationship between another field and this one.


Without the distinct, choosing a different value with lower number of rows 
changed the plan to index only scan with limit somewhere between 3.7% and 4.7% 
of the table.  With a brin index on a similar size/distributed table that is in 
fld order, that changed to somewhere between 0.6% and 0.7%.



Reply via email to