Re: Faster distinct query?

Israel Brewster Wed, 22 Sep 2021 18:22:25 -0700

> On Sep 22, 2021, at 5:10 PM, David Rowley <dgrowle...@gmail.com> wrote:
> 
> On Thu, 23 Sept 2021 at 08:27, David G. Johnston
> <david.g.johns...@gmail.com> wrote:
>> 
>> On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis <mle...@entrata.com> wrote:
>>> If there is correlation between station & channel, then you might look at 
>>> creating a multivariate statistics object and analyzing the table so the 
>>> planner can make better choices
>> 
>> 
>> There is no where clause so I'm doubtful there is much to be gained going 
>> down this path.  The Index-Only scan seems like an optimal way to obtain 
>> this data and the existing query already does that.  The aggregation path 
>> might vary though it seems like that shouldn't be the case here.
> 
> ndistinct extended statistics would be used to estimate the number of
> groups in the GROUP BY clause for the version of the query that
> performs GROUP BY station, channel.   We've not seen the EXPLAIN
> ANALYZE for that query, so don't know if there's any use for extended
> statistics there.  However, if the planner was to think there were
> more groups than there actually are, then it would be less inclined to
> do parallelise the GROUP BY.  I think writing the query in such a way
> that allows it to be parallelised is likely going to result in some
> quite good performance improvements. i.e:
> 
> select station, array_agg(channel) as channels
> FROM(
>   SELECT station,channel FROM data GROUP BY station,channel
> ) AS sub
> group by station;
>


Ah, yes indeed. That version runs in about 30 seconds rather than 5 minutes! 
See the explain analyze output here: https://explain.depesz.com/s/L5Bf 
<https://explain.depesz.com/s/L5Bf> It looks more complicated, but being able 
to run parallel definitely makes a difference, and there may be some other 
improvements in there that I’m not aware of as well!

Still not quite fast enough for real-time queries, but certainly fast enough to 
keep a materialized view updated.

And this is why I love postgresql and this community - when something isn’t 
working as well as I would like, there is usually a way to improve it 
drastically :-)

Thanks!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145


> which is just the same as Michael's version but without DISTINCT.
> Also, Tom's point about lots of heap fetches is going to count for
> quite a bit too, especially so if I/O plays a large part in the total
> query time.
> 
> David

Re: Faster distinct query?

Reply via email to