> On Sep 22, 2021, at 2:05 PM, Ryan Booz <r...@timescale.com> wrote: > > Ah. I didn't realize that. If SkipScan was chosen, you'd actually see it as > one of the execution nodes. I also realize I was making a few assumptions > about your data, are channels shared among stations, or are all channels > unique (like an ID) per station? That would impact the index and approach.
Ok, that may be a good point: “channel” is currently a varchar column, containing something like ‘BHZ’, ‘EHZ’, ‘BHE’ etc. There are only a handful of possible channels that I am currently aware of, which are shared among stations - most stations have a ‘BHZ’ channel, for example. That would be fairly simple to normalize out if that would help. > > Something like: > > station | channel > ----------|----------- > 1 1 > 1 2 > 2 3 > 2 4 > > or: > station | channel > ----------|----------- > 1 1 > 1 2 > 2 1 > 2 2 > > > > > On Wed, Sep 22, 2021 at 5:53 PM Israel Brewster <ijbrews...@alaska.edu > <mailto:ijbrews...@alaska.edu>> wrote: >> On Sep 22, 2021, at 1:50 PM, Ryan Booz <r...@timescale.com >> <mailto:r...@timescale.com>> wrote: >> >> Cool. I'd be interested to see the explain on it if you ever try it again. >> On that cardinality, I'd expect it to be really fast, so I'm interested to >> see if the (SkipScan) nodes were actually used. > > With timescaledb extension installed, the explain is what I posted in the > original message (https://explain.depesz.com/s/mtxB#html > <https://explain.depesz.com/s/mtxB#html>). Without timescaledb installed, the > explain looks the same, except it takes twice as long to run. > > Unless I missed something in your message, i.e. some sort of tweak to the > query to get it to use the timescaledb features? > > --- > Israel Brewster > Software Engineer > Alaska Volcano Observatory > Geophysical Institute - UAF > 2156 Koyukuk Drive > Fairbanks AK 99775-7320 > Work: 907-474-5172 > cell: 907-328-9145 > >> >> On Wed, Sep 22, 2021 at 5:35 PM Israel Brewster <ijbrews...@alaska.edu >> <mailto:ijbrews...@alaska.edu>> wrote: >> >>> On Sep 22, 2021, at 12:49 PM, Ryan Booz <r...@timescale.com >>> <mailto:r...@timescale.com>> wrote: >>> >>> [Timescale Dev Advocate here] >>> I realize this might not be the most accepted answer (could be interpreted >>> as trying to "sell" something), but feels like an opportunity to talk about >>> DISTINCT queries and opportunities. Because you have that index, Timescale >>> 2.3 added a "Skip Scan" query planner node that works on regular BTree >>> indexes (it doesn't have to be time-series/TimescaleDB Hypertable data at >>> all). In this case, your distinct query would likely run in a few >>> milliseconds based on the counts you mention (170 stations, 3 channels per >>> station), and then the outer aggregation would do the GROUP BY. So, you >>> **could** add the TimescaleDB extension to your database (or a copy of) and >>> give it a try. You don't actually need to use any TimescaleDB features >>> otherwise. >> >> I had actually already done that, as I was considering, in spite of past >> negative experiences with timescaledb, experimenting with it on this DB to >> see if it worked any better with this data. Out of curiosity, I tried >> removing the timescaledb extension, whereupon the query in question took >> roughly twice as long. So you are right that installing timescaledb speeds >> things up, even when not using any timescaledb specific functions. So that >> was a good call. Thanks! >> >> --- >> Israel Brewster >> Software Engineer >> Alaska Volcano Observatory >> Geophysical Institute - UAF >> 2156 Koyukuk Drive >> Fairbanks AK 99775-7320 >> Work: 907-474-5172 >> cell: 907-328-9145 >> >>> >>> A writeup of why this kind of DISTINCT query is slow in PostgreSQL (for >>> now) and what we did to overcome it: >>> https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/ >>> >>> <https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/> >>> Plans for a similar feature in PostgreSQL proper that we'd totally support >>> but hasn't made forward progress yet: >>> https://commitfest.postgresql.org/19/1741/ >>> <https://commitfest.postgresql.org/19/1741/> >>> Anyway, it might be worth a shot. HTH >>> >>> Ryan B >>> >>> On Wed, Sep 22, 2021 at 4:27 PM David G. Johnston >>> <david.g.johns...@gmail.com <mailto:david.g.johns...@gmail.com>> wrote: >>> On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis <mle...@entrata.com >>> <mailto:mle...@entrata.com>> wrote: >>> In the future, please share the plan returned by explain analyze, and some >>> data about how many rows in the involved tables, >>> >>> I believe we consider it acceptable to link to an explain viewer, which is >>> what the OP did. Reading explain output in email has its own challenges, >>> and I'd rather have the website than a text attachment. >>> >>> >>> How does the below work? It should do a very simple index scan only, then >>> aggregate the relative few rows after the fact. >>> >>> select station, array_agg(distinct(channel)) as channels >>> FROM( >>> SELECT station,channel FROM data GROUP BY station,channel >>> ) AS sub >>> group by station; >>> >>> Yeah, am pondering this too, though seems like the queries should be >>> identical so the plan/execution should be the same either way. >>> >>> >>> If there is correlation between station & channel, then you might look at >>> creating a multivariate statistics object and analyzing the table so the >>> planner can make better choices >>> >>> There is no where clause so I'm doubtful there is much to be gained going >>> down this path. The Index-Only scan seems like an optimal way to obtain >>> this data and the existing query already does that. The aggregation path >>> might vary though it seems like that shouldn't be the case here. >>> >>> David J. >> >