Re: Faster distinct query?

Israel Brewster Wed, 22 Sep 2021 15:23:11 -0700

> On Sep 22, 2021, at 2:05 PM, Ryan Booz <[email protected]> wrote:
> 
> Ah. I didn't realize that. If SkipScan was chosen, you'd actually see it as 
> one of the execution nodes. I also realize I was making a few assumptions 
> about your data, are channels shared among stations, or are all channels 
> unique (like an ID) per station? That would impact the index and approach.


Ok, that may be a good point: “channel” is currently a varchar column, 
containing something like ‘BHZ’, ‘EHZ’, ‘BHE’ etc. There are only a handful of 
possible channels that I am currently aware of, which are shared among stations 
- most stations have a ‘BHZ’ channel, for example. That would be fairly simple 
to normalize out if that would help.

> 
> Something like:
> 
> station | channel
> ----------|-----------
> 1            1
> 1            2
> 2            3
> 2            4
> 
> or:
> station | channel
> ----------|-----------
> 1            1
> 1            2
> 2            1
> 2            2
> 
> 
> 
> 
> On Wed, Sep 22, 2021 at 5:53 PM Israel Brewster <[email protected] 
> <mailto:[email protected]>> wrote:
>> On Sep 22, 2021, at 1:50 PM, Ryan Booz <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> Cool. I'd be interested to see the explain on it if you ever try it again. 
>> On that cardinality, I'd expect it to be really fast, so I'm interested to 
>> see if the (SkipScan) nodes were actually used.
> 
> With timescaledb extension installed, the explain is what I posted in the 
> original message (https://explain.depesz.com/s/mtxB#html 
> <https://explain.depesz.com/s/mtxB#html>). Without timescaledb installed, the 
> explain looks the same, except it takes twice as long to run.
> 
> Unless I missed something in your message, i.e. some sort of tweak to the 
> query to get it to use the timescaledb features?
> 
> ---
> Israel Brewster
> Software Engineer
> Alaska Volcano Observatory 
> Geophysical Institute - UAF 
> 2156 Koyukuk Drive 
> Fairbanks AK 99775-7320
> Work: 907-474-5172
> cell:  907-328-9145
> 
>> 
>> On Wed, Sep 22, 2021 at 5:35 PM Israel Brewster <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>>> On Sep 22, 2021, at 12:49 PM, Ryan Booz <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>> [Timescale Dev Advocate here]
>>> I realize this might not be the most accepted answer (could be interpreted 
>>> as trying to "sell" something), but feels like an opportunity to talk about 
>>> DISTINCT queries and opportunities. Because you have that index, Timescale 
>>> 2.3 added a "Skip Scan" query planner node that works on regular BTree 
>>> indexes (it doesn't have to be time-series/TimescaleDB Hypertable data at 
>>> all). In this case, your distinct query would likely run in a few 
>>> milliseconds based on the counts you mention (170 stations, 3 channels per 
>>> station), and then the outer aggregation would do the GROUP BY. So, you 
>>> **could** add the TimescaleDB extension to your database (or a copy of) and 
>>> give it a try. You don't actually need to use any TimescaleDB features 
>>> otherwise.
>> 
>> I had actually already done that, as I was considering, in spite of past 
>> negative experiences with timescaledb, experimenting with it on this DB to 
>> see if it worked any better with this data. Out of curiosity, I tried 
>> removing the timescaledb extension, whereupon the query in question took 
>> roughly twice as long. So you are right that installing timescaledb speeds 
>> things up, even when not using any timescaledb specific functions. So that 
>> was a good call. Thanks!
>> 
>> ---
>> Israel Brewster
>> Software Engineer
>> Alaska Volcano Observatory 
>> Geophysical Institute - UAF 
>> 2156 Koyukuk Drive 
>> Fairbanks AK 99775-7320
>> Work: 907-474-5172
>> cell:  907-328-9145
>> 
>>> 
>>> A writeup of why this kind of DISTINCT query is slow in PostgreSQL (for 
>>> now) and what we did to overcome it: 
>>> https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
>>>  
>>> <https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/>
>>> Plans for a similar feature in PostgreSQL proper that we'd totally support 
>>> but hasn't made forward progress yet: 
>>> https://commitfest.postgresql.org/19/1741/ 
>>> <https://commitfest.postgresql.org/19/1741/>
>>> Anyway, it might be worth a shot. HTH
>>> 
>>> Ryan B
>>> 
>>> On Wed, Sep 22, 2021 at 4:27 PM David G. Johnston 
>>> <[email protected] <mailto:[email protected]>> wrote:
>>> On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> In the future, please share the plan returned by explain analyze, and some 
>>> data about how many rows in the involved tables,
>>> 
>>> I believe we consider it acceptable to link to an explain viewer, which is 
>>> what the OP did.  Reading explain output in email has its own challenges, 
>>> and I'd rather have the website than a text attachment.
>>> 
>>> 
>>> How does the below work? It should do a very simple index scan only, then 
>>> aggregate the relative few rows after the fact.
>>> 
>>> select station, array_agg(distinct(channel)) as channels
>>> FROM(
>>> SELECT station,channel FROM data GROUP BY station,channel
>>> ) AS sub
>>> group by station;
>>> 
>>> Yeah, am pondering this too, though seems like the queries should be 
>>> identical so the plan/execution should be the same either way.
>>> 
>>> 
>>> If there is correlation between station & channel, then you might look at 
>>> creating a multivariate statistics object and analyzing the table so the 
>>> planner can make better choices
>>> 
>>> There is no where clause so I'm doubtful there is much to be gained going 
>>> down this path.  The Index-Only scan seems like an optimal way to obtain 
>>> this data and the existing query already does that.  The aggregation path 
>>> might vary though it seems like that shouldn't be the case here.
>>> 
>>> David J.
>> 
>

Re: Faster distinct query?

Reply via email to