Is your transaction_id more or less monotonic with respect to the date? If so, something like the following can help:
with tenders_filtered as (
    select * from tenders
    where country = 'Mexico'
      and "date" >= '2023-01-01' and "date" < '2024-01-01'
)
Select * from tenders_filtered
inner join items on transaction_id = tender_transaction_id
where tender_transaction_id
    between (select min(transaction_id) from tenders_filtered)
        and (select max(transaction_id) from tenders_filtered)

This assumes you have an index on items(tender_transaction_id) and that the
transaction_id range covers only a small subset (less than, say, 5%) of the
table.

If your transaction_id is not monotonic, you can consider introducing something
that is, or even adding denormalized field(s) with country and/or date to your
items table (see the sketch after the quoted message below).

Another option is to use a window function to get the count, e.g.

Select *, count(*) OVER () as cnt from tenders
inner join items on transaction_id = tender_transaction_id
where country = 'Colombia'
  and "date" >= '2023-01-01' and "date" < '2024-01-01'

This would at least save you from doing a second call.

Mon, 26 Feb 2024 at 16:26, Chema <ch...@interneta.org> wrote:

> Dear pgsqlers,
>
> I'm trying to optimize simple queries on two tables (tenders & items) with
> a couple million records. Besides the resulting records, the app also
> displays the count of total results. Doing count() takes as much time as
> the other query (which can be 30+ secs), so it's an obvious target for
> optimization. I'm already caching count() results for the most common
> conditions (country & year) in a materialized table, which practically
> halves response time. The tables are updated sparingly, and only with bulk
> COPYs. Now I'm looking for ways to optimize queries with other conditions.
>
> Reading around, it seems many people are still using this 2005 snippet
> <https://www.postgresql.org/message-id/20050810133157.ga46...@winnie.fuhr.org>
> to obtain the row count estimate from Explain:
>
> CREATE FUNCTION count_estimate(query text) RETURNS integer AS $$
> DECLARE
>     rec record;
>     rows integer;
> BEGIN
>     FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
>         rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
>         EXIT WHEN rows IS NOT NULL;
>     END LOOP;
>     RETURN rows;
> END;
> $$ LANGUAGE plpgsql VOLATILE STRICT;
>
> Is this still the current best practice? Any tips to increase precision?
> Currently it can over- *or* under-estimate the actual row count by about a
> million, as seen in the sample queries (1,955,297 instead of 1,001,200;
> 162,080 instead of 1,292,010).
>
> Any other tips to improve the query are welcome, of course. There's a big
> disparity between the two sample query plans even though only the
> filtered country changes.
>
> I already raised default_statistics_target up to 2k (the planner wasn't
> using indexes at all with low values). Gotta get it even higher?
>
> These are my custom settings:
>
> shared_buffers = 256MB                  # min 128kB
> work_mem = 128MB                        # min 64kB
> maintenance_work_mem = 254MB            # min 1MB
> effective_cache_size = 2GB
> default_statistics_target = 2000
> random_page_cost = 1.0                  # same scale as above
>
> Sample query:
>
> Explain Analyze
> Select * from tenders inner join items on transaction_id = tender_transaction_id
> where country = 'Colombia'
>   and "date" >= '2023-01-01' and "date" < '2024-01-01'
>
>                                      QUERY PLAN
> Gather  (cost=253837.99..1506524.32 rows=1955297 width=823) (actual time=51433.592..63239.809 rows=1001200 loops=1)
>   Workers Planned: 2
>   Workers Launched: 2
>   ->  Parallel Hash Join  (cost=252837.99..1309994.62 rows=814707 width=823) (actual time=51361.920..61729.142 rows=333733 loops=3)
>         Hash Cond: (items.tender_transaction_id = tenders.transaction_id)
>         ->  Parallel Seq Scan on items  (cost=0.00..1048540.46 rows=3282346 width=522) (actual time=1.689..56887.108 rows=2621681 loops=3)
>         ->  Parallel Hash  (cost=247919.56..247919.56 rows=393475 width=301) (actual time=2137.473..2137.476 rows=333733 loops=3)
>               Buckets: 1048576  Batches: 1  Memory Usage: 219936kB
>               ->  Parallel Bitmap Heap Scan on tenders  (cost=16925.75..247919.56 rows=393475 width=301) (actual time=385.315..908.865 rows=333733 loops=3)
>                     Recheck Cond: ((country = 'Colombia'::text) AND (date >= '2023-01-01'::date) AND (date < '2024-01-01'::date))
>                     Heap Blocks: exact=24350
>                     ->  Bitmap Index Scan on tenders_country_and_date_index  (cost=0.00..16689.67 rows=944339 width=0) (actual time=423.213..423.214 rows=1001200 loops=1)
>                           Index Cond: ((country = 'Colombia'::text) AND (date >= '2023-01-01'::date) AND (date < '2024-01-01'::date))
> Planning Time: 12.784 ms
> JIT:
>   Functions: 33
>   Options: Inlining true, Optimization true, Expressions true, Deforming true
>   Timing: Generation 14.675 ms, Inlining 383.349 ms, Optimization 1023.521 ms, Emission 651.442 ms, Total 2072.987 ms
> Execution Time: 63378.033 ms
>
> Explain Analyze
> Select * from tenders inner join items on transaction_id = tender_transaction_id
> where country = 'Mexico'
>   and "date" >= '2023-01-01' and "date" < '2024-01-01'
>
>                                      QUERY PLAN
> Gather  (cost=1000.99..414258.70 rows=162080 width=823) (actual time=52.538..7006.128 rows=1292010 loops=1)
>   Workers Planned: 2
>   Workers Launched: 2
>   ->  Nested Loop  (cost=0.99..397050.70 rows=67533 width=823) (actual time=40.211..4087.081 rows=430670 loops=3)
>         ->  Parallel Index Scan using tenders_country_and_date_index on tenders  (cost=0.43..45299.83 rows=32616 width=301) (actual time=4.376..59.760 rows=1218 loops=3)
>               Index Cond: ((country = 'Mexico'::text) AND (date >= '2023-01-01'::date) AND (date < '2024-01-01'::date))
>         ->  Index Scan using items_tender_transaction_id_index on items  (cost=0.56..10.67 rows=11 width=522) (actual time=0.321..3.035 rows=353 loops=3655)
>               Index Cond: (tender_transaction_id = tenders.transaction_id)
> Planning Time: 7.808 ms
> JIT:
>   Functions: 27
>   Options: Inlining false, Optimization false, Expressions true, Deforming true
>   Timing: Generation 17.785 ms, Inlining 0.000 ms, Optimization 5.080 ms, Emission 93.274 ms, Total 116.138 ms
> Execution Time: 7239.427 ms
>
> Thanks in advance!
>
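
On the denormalization idea mentioned above, here is a minimal sketch of what I
mean, assuming items can simply carry a copy of the tender's country (the column
and index names are made up, and the column would have to be maintained on every
bulk COPY):

-- Hypothetical: copy the filter column onto items so items can be scanned alone.
ALTER TABLE items ADD COLUMN country text;

-- One-time backfill from tenders.
UPDATE items i
SET country = t.country
FROM tenders t
WHERE t.transaction_id = i.tender_transaction_id;

-- Index to support the country filter directly on items.
CREATE INDEX items_country_index ON items (country);

With that in place the join (and the count) no longer has to visit tenders just
to apply the country filter.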
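As for the count_estimate() snippet you quoted: it still works, and its accuracy
is bounded by the planner's row estimates, so raising statistics is the main
lever. Just as an illustration, it would be called like this for one of your
sample queries (dollar-quoting avoids having to double the embedded quotes):

SELECT count_estimate($q$
    SELECT * FROM tenders
    INNER JOIN items ON transaction_id = tender_transaction_id
    WHERE country = 'Colombia'
      AND "date" >= '2023-01-01' AND "date" < '2024-01-01'
$q$);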
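And since the tables only change with bulk COPYs, the count cache you describe
could be a materialized view refreshed after each load. A minimal sketch,
assuming you want the joined result count per country and year (the view name is
made up):

CREATE MATERIALIZED VIEW tender_item_counts AS
SELECT t.country,
       extract(year FROM t."date") AS year,
       count(*) AS cnt
FROM tenders t
INNER JOIN items i ON i.tender_transaction_id = t.transaction_id
GROUP BY t.country, extract(year FROM t."date");

-- After each bulk COPY:
REFRESH MATERIALIZED VIEW tender_item_counts;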