Re: Faster distinct query?

Israel Brewster Wed, 22 Sep 2021 14:54:12 -0700

> On Sep 22, 2021, at 1:50 PM, Ryan Booz <[email protected]> wrote:
> 
> Cool. I'd be interested to see the explain on it if you ever try it again. On 
> that cardinality, I'd expect it to be really fast, so I'm interested to see 
> if the (SkipScan) nodes were actually used.


With timescaledb extension installed, the explain is what I posted in the 
original message (https://explain.depesz.com/s/mtxB#html 
<https://explain.depesz.com/s/mtxB#html>). Without timescaledb installed, the 
explain looks the same, except it takes twice as long to run.

Unless I missed something in your message, i.e. some sort of tweak to the query 
to get it to use the timescaledb features?

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

> 
> On Wed, Sep 22, 2021 at 5:35 PM Israel Brewster <[email protected] 
> <mailto:[email protected]>> wrote:
> 
>> On Sep 22, 2021, at 12:49 PM, Ryan Booz <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> [Timescale Dev Advocate here]
>> I realize this might not be the most accepted answer (could be interpreted 
>> as trying to "sell" something), but feels like an opportunity to talk about 
>> DISTINCT queries and opportunities. Because you have that index, Timescale 
>> 2.3 added a "Skip Scan" query planner node that works on regular BTree 
>> indexes (it doesn't have to be time-series/TimescaleDB Hypertable data at 
>> all). In this case, your distinct query would likely run in a few 
>> milliseconds based on the counts you mention (170 stations, 3 channels per 
>> station), and then the outer aggregation would do the GROUP BY. So, you 
>> **could** add the TimescaleDB extension to your database (or a copy of) and 
>> give it a try. You don't actually need to use any TimescaleDB features 
>> otherwise.
> 
> I had actually already done that, as I was considering, in spite of past 
> negative experiences with timescaledb, experimenting with it on this DB to 
> see if it worked any better with this data. Out of curiosity, I tried 
> removing the timescaledb extension, whereupon the query in question took 
> roughly twice as long. So you are right that installing timescaledb speeds 
> things up, even when not using any timescaledb specific functions. So that 
> was a good call. Thanks!
> 
> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory 
> Geophysical Institute - UAF 
> 2156 Koyukuk Drive 
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell:  907-328-9145
> 
>> 
>> A writeup of why this kind of DISTINCT query is slow in PostgreSQL (for now) 
>> and what we did to overcome it: 
>> https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
>>  
>> <https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/>
>> Plans for a similar feature in PostgreSQL proper that we'd totally support 
>> but hasn't made forward progress yet: 
>> https://commitfest.postgresql.org/19/1741/ 
>> <https://commitfest.postgresql.org/19/1741/>
>> Anyway, it might be worth a shot. HTH
>> 
>> Ryan B
>> 
>> On Wed, Sep 22, 2021 at 4:27 PM David G. Johnston 
>> <[email protected] <mailto:[email protected]>> wrote:
>> On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis <[email protected] 
>> <mailto:[email protected]>> wrote:
>> In the future, please share the plan returned by explain analyze, and some 
>> data about how many rows in the involved tables,
>> 
>> I believe we consider it acceptable to link to an explain viewer, which is 
>> what the OP did.  Reading explain output in email has its own challenges, 
>> and I'd rather have the website than a text attachment.
>> 
>> 
>> How does the below work? It should do a very simple index scan only, then 
>> aggregate the relative few rows after the fact.
>> 
>> select station, array_agg(distinct(channel)) as channels
>> FROM(
>> SELECT station,channel FROM data GROUP BY station,channel
>> ) AS sub
>> group by station;
>> 
>> Yeah, am pondering this too, though seems like the queries should be 
>> identical so the plan/execution should be the same either way.
>> 
>> 
>> If there is correlation between station & channel, then you might look at 
>> creating a multivariate statistics object and analyzing the table so the 
>> planner can make better choices
>> 
>> There is no where clause so I'm doubtful there is much to be gained going 
>> down this path.  The Index-Only scan seems like an optimal way to obtain 
>> this data and the existing query already does that.  The aggregation path 
>> might vary though it seems like that shouldn't be the case here.
>> 
>> David J.
>

Re: Faster distinct query?

Reply via email to