Ah. I didn't realize that. If SkipScan was chosen, you'd actually see it as one of the execution nodes. I also realize I was making a few assumptions about your data, are channels shared among stations, or are all channels unique (like an ID) per station? That would impact the index and approach.
Something like: station | channel ----------|----------- 1 1 1 2 2 3 2 4 or: station | channel ----------|----------- 1 1 1 2 2 1 2 2 On Wed, Sep 22, 2021 at 5:53 PM Israel Brewster <ijbrews...@alaska.edu> wrote: > On Sep 22, 2021, at 1:50 PM, Ryan Booz <r...@timescale.com> wrote: > > Cool. I'd be interested to see the explain on it if you ever try it again. > On that cardinality, I'd expect it to be really fast, so I'm interested to > see if the (SkipScan) nodes were actually used. > > > With timescaledb extension installed, the explain is what I posted in the > original message (https://explain.depesz.com/s/mtxB#html). Without > timescaledb installed, the explain looks the same, except it takes twice as > long to run. > > Unless I missed something in your message, i.e. some sort of tweak to the > query to get it to use the timescaledb features? > > --- > Israel Brewster > Software Engineer > Alaska Volcano Observatory > Geophysical Institute - UAF > 2156 Koyukuk Drive > Fairbanks AK 99775-7320 > Work: 907-474-5172 > cell: 907-328-9145 > > > On Wed, Sep 22, 2021 at 5:35 PM Israel Brewster <ijbrews...@alaska.edu> > wrote: > >> >> On Sep 22, 2021, at 12:49 PM, Ryan Booz <r...@timescale.com> wrote: >> >> [Timescale Dev Advocate here] >> I realize this might not be the most accepted answer (could be >> interpreted as trying to "sell" something), but feels like an opportunity >> to talk about DISTINCT queries and opportunities. Because you have that >> index, Timescale 2.3 added a "Skip Scan" query planner node that works on >> regular BTree indexes (it doesn't have to be time-series/TimescaleDB >> Hypertable data at all). In this case, your distinct query would likely run >> in a few milliseconds based on the counts you mention (170 stations, 3 >> channels per station), and then the outer aggregation would do the GROUP >> BY. So, you **could** add the TimescaleDB extension to your database (or a >> copy of) and give it a try. You don't actually need to use any TimescaleDB >> features otherwise. >> >> >> I had actually already done that, as I was considering, in spite of past >> negative experiences with timescaledb, experimenting with it on this DB to >> see if it worked any better with this data. Out of curiosity, I tried >> removing the timescaledb extension, whereupon the query in question took >> roughly twice as long. So you are right that installing timescaledb speeds >> things up, even when not using any timescaledb specific functions. So that >> was a good call. Thanks! >> >> --- >> Israel Brewster >> Software Engineer >> Alaska Volcano Observatory >> Geophysical Institute - UAF >> 2156 Koyukuk Drive >> Fairbanks AK 99775-7320 >> Work: 907-474-5172 >> cell: 907-328-9145 >> >> >> >> - A writeup of why this kind of DISTINCT query is slow in PostgreSQL >> (for now) and what we did to overcome it: >> >> https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/ >> - Plans for a similar feature in PostgreSQL proper that we'd totally >> support but hasn't made forward progress yet: >> https://commitfest.postgresql.org/19/1741/ >> >> Anyway, it might be worth a shot. HTH >> >> Ryan B >> >> On Wed, Sep 22, 2021 at 4:27 PM David G. Johnston < >> david.g.johns...@gmail.com> wrote: >> >>> On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis <mle...@entrata.com> >>> wrote: >>> >>>> In the future, please share the plan returned by explain analyze, and >>>> some data about how many rows in the involved tables, >>>> >>> >>> I believe we consider it acceptable to link to an explain viewer, which >>> is what the OP did. Reading explain output in email has its own >>> challenges, and I'd rather have the website than a text attachment. >>> >>> >>>> How does the below work? It should do a very simple index scan only, >>>> then aggregate the relative few rows after the fact. >>>> >>>> select station, array_agg(distinct(channel)) as channels >>>> FROM( >>>> SELECT station,channel FROM data GROUP BY station,channel >>>> ) AS sub >>>> group by station; >>>> >>> >>> Yeah, am pondering this too, though seems like the queries should be >>> identical so the plan/execution should be the same either way. >>> >>> >>>> If there is correlation between station & channel, then you might look >>>> at creating a multivariate statistics object and analyzing the table so the >>>> planner can make better choices >>>> >>> >>> There is no where clause so I'm doubtful there is much to be gained >>> going down this path. The Index-Only scan seems like an optimal way to >>> obtain this data and the existing query already does that. The aggregation >>> path might vary though it seems like that shouldn't be the case here. >>> >>> David J. >>> >> >> >