Re: Faster distinct query?

2021-09-22 Thread Mladen Gogala
On 9/22/21 16:20, David G. Johnston wrote: I'd probably turn that index into a foreign key that just ensures that every (station,channel) that appears in the data table also appears on the lookup table.  Grouping and array-ifying the lookup table would be trivial.  Either modify the applicati

Re: Faster distinct query?

2021-09-22 Thread Ryan Booz
Sorry - break for dinner! So much happens on a global scale in a few hours. :-)! I took a few minutes and created a simple example here of what I imagine you have on that table. I only inserted ~80 million rows of test data, but hopefully, it's somewhat representative. TimescaleDB's current impl

Re: Faster distinct query?

2021-09-22 Thread David Rowley
On Thu, 23 Sept 2021 at 13:21, Israel Brewster wrote: > Ah, yes indeed. That version runs in about 30 seconds rather than 5 minutes! > See the explain analyze output here: https://explain.depesz.com/s/L5Bf It > looks more complicated, but being able to run parallel definitely makes a > differen

Re: Faster distinct query?

2021-09-22 Thread David Rowley
On Thu, 23 Sept 2021 at 13:28, David G. Johnston wrote: > Agreed, though if the query author needs to do that here we’ve violated the > spirit of the declarative SQL language. At first blush nothing about the > original query seems like it should be preventing parallelism. Each worker > build

Re: Faster distinct query?

2021-09-22 Thread David G. Johnston
On Wednesday, September 22, 2021, David Rowley wrote: > I think writing the query in such a way > that allows it to be parallelised is likely going to result in some > quite good performance improvements. i.e: > Agreed, though if the query author needs to do that here we’ve violated the spirit o

Re: Faster distinct query?

2021-09-22 Thread Israel Brewster
> On Sep 22, 2021, at 5:10 PM, David Rowley wrote: > > On Thu, 23 Sept 2021 at 08:27, David G. Johnston > wrote: >> >> On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis wrote: >>> If there is correlation between station & channel, then you might look at >>> creating a multivariate statistics obje

Re: Faster distinct query?

2021-09-22 Thread David Rowley
On Thu, 23 Sept 2021 at 08:27, David G. Johnston wrote: > > On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis wrote: >> If there is correlation between station & channel, then you might look at >> creating a multivariate statistics object and analyzing the table so the >> planner can make better ch

Re: Faster distinct query?

2021-09-22 Thread David Rowley
On Thu, 23 Sept 2021 at 08:21, Michael Lewis wrote: > select station, array_agg(distinct(channel)) as channels > FROM( > SELECT station,channel FROM data GROUP BY station,channel > ) AS sub > group by station; Since the subquery is grouping by station, channel, then there's no need for the DISTIN

Re: Faster distinct query?

2021-09-22 Thread Tom Lane
Michael Lewis writes: > On Wed, Sep 22, 2021 at 2:48 PM Tom Lane wrote: >> The "index-only" scan is reported to do 86m heap fetches along the >> way to returning 812m rows, so the data is apparently pretty dirty. > Do you say that because you would expect many more than 10 tuples per page? No,

Re: Faster distinct query?

2021-09-22 Thread Michael Lewis
On Wed, Sep 22, 2021 at 2:48 PM Tom Lane wrote: > "David G. Johnston" writes: > > There is no where clause so I'm doubtful there is much to be gained going > > down this path. The Index-Only scan seems like an optimal way to obtain > > this data and the existing query already does that. > > The

Re: Faster distinct query?

2021-09-22 Thread Israel Brewster
> On Sep 22, 2021, at 2:05 PM, Ryan Booz wrote: > > Ah. I didn't realize that. If SkipScan was chosen, you'd actually see it as > one of the execution nodes. I also realize I was making a few assumptions > about your data, are channels shared among stations, or are all channels > unique (like

Re: Faster distinct query?

2021-09-22 Thread Ryan Booz
Ah. I didn't realize that. If SkipScan was chosen, you'd actually see it as one of the execution nodes. I also realize I was making a few assumptions about your data, are channels shared among stations, or are all channels unique (like an ID) per station? That would impact the index and approach.

Re: Faster distinct query?

2021-09-22 Thread Israel Brewster
> On Sep 22, 2021, at 1:50 PM, Ryan Booz wrote: > > Cool. I'd be interested to see the explain on it if you ever try it again. On > that cardinality, I'd expect it to be really fast, so I'm interested to see > if the (SkipScan) nodes were actually used. With timescaledb extension installed, th

Re: Faster distinct query?

2021-09-22 Thread Ryan Booz
Cool. I'd be interested to see the explain on it if you ever try it again. On that cardinality, I'd expect it to be really fast, so I'm interested to see if the (SkipScan) nodes were actually used. On Wed, Sep 22, 2021 at 5:35 PM Israel Brewster wrote: > > On Sep 22, 2021, at 12:49 PM, Ryan Booz

Re: Faster distinct query?

2021-09-22 Thread Israel Brewster
> On Sep 22, 2021, at 12:49 PM, Ryan Booz wrote: > > [Timescale Dev Advocate here] > I realize this might not be the most accepted answer (could be interpreted as > trying to "sell" something), but feels like an opportunity to talk about > DISTINCT queries and opportunities. Because you have t

Re: Faster distinct query?

2021-09-22 Thread Israel Brewster
> On Sep 22, 2021, at 12:48 PM, Tom Lane wrote: > > "David G. Johnston" writes: >> There is no where clause so I'm doubtful there is much to be gained going >> down this path. The Index-Only scan seems like an optimal way to obtain >> this data and the existing query already does that. > > The

Re: Faster distinct query?

2021-09-22 Thread Ryan Booz
[Timescale Dev Advocate here] I realize this might not be the most accepted answer (could be interpreted as trying to "sell" something), but feels like an opportunity to talk about DISTINCT queries and opportunities. Because you have that index, Timescale 2.3 added a "Skip Scan" query planner node

Re: Faster distinct query?

2021-09-22 Thread Tom Lane
"David G. Johnston" writes: > There is no where clause so I'm doubtful there is much to be gained going > down this path. The Index-Only scan seems like an optimal way to obtain > this data and the existing query already does that. The "index-only" scan is reported to do 86m heap fetches along t

Re: Faster distinct query?

2021-09-22 Thread Israel Brewster
> On Sep 22, 2021, at 12:20 PM, David G. Johnston > wrote: > > On Wed, Sep 22, 2021 at 1:05 PM Israel Brewster > wrote: > To work around the issue, I created a materialized view that I can update > periodically, and of course I can query said view in no time flat.

Re: Faster distinct query?

2021-09-22 Thread David G. Johnston
On Wed, Sep 22, 2021 at 1:21 PM Michael Lewis wrote: > In the future, please share the plan returned by explain analyze, and some > data about how many rows in the involved tables, > I believe we consider it acceptable to link to an explain viewer, which is what the OP did. Reading explain outp

Re: Faster distinct query?

2021-09-22 Thread Michael Lewis
In the future, please share the plan returned by explain analyze, and some data about how many rows in the involved tables, what type of system you are running it on, any changes from default configs, etc. How does the below work? It should do a very simple index scan only, then aggregate the rela

Re: Faster distinct query?

2021-09-22 Thread David G. Johnston
On Wed, Sep 22, 2021 at 1:05 PM Israel Brewster wrote: > To work around the issue, I created a materialized view that I can update > periodically, and of course I can query said view in no time flat. However, > I’m concerned that as the dataset grows, the time it takes to refresh the > view will

Faster distinct query?

2021-09-22 Thread Israel Brewster
I was wondering if there was any way to improve the performance of this query: SELECT station,array_agg(distinct(channel)) as channels FROM data GROUP BY station; The explain execution plan can be found here: https://explain.depesz.com/s/mtxB#html and it

Re: Remove duplicated row in pg_largeobject_metadata

2021-09-22 Thread Tom Lane
Tobias Meyer writes: > I was able to correlate the transaction ids (xmin) of the duplicate rows in > pg_largeobject_meta with times stored in another table, and found the > duplicates were all added on one specific day, while the originals nicely > spread out over time. I was further able to trace

Re: Remove duplicated row in pg_largeobject_metadata

2021-09-22 Thread Tobias Meyer
> > > >> A possible theory is that pg_largeobject_metadata_oid_index has been >> corrupt for a long time, allowing a lot of duplicate entries to be made. >> However, unless pg_largeobject's pg_largeobject_loid_pn_index is *also* >> corrupt, you'd think that creation of such duplicates would still b

Re: Timestamp with vs without time zone.

2021-09-22 Thread Rob Sargent
On 9/22/21 9:56 AM, Michael Lewis wrote: On Wed, Sep 22, 2021 at 12:44 AM cen > wrote: On 21. 09. 21 23:34, Michael Lewis wrote: Related to this current discussion and exchange of ideas... is there a best practice for retrieving data in such a way as t

Re: Timestamp with vs without time zone.

2021-09-22 Thread Michael Lewis
On Wed, Sep 22, 2021 at 12:44 AM cen wrote: > On 21. 09. 21 23:34, Michael Lewis wrote: > > Related to this current discussion and exchange of ideas... is there a > best practice for retrieving data in such a way as the rows are localized > to a timezone for where/group by purposes. That is, if I

Re: Timestamp with vs without time zone.

2021-09-22 Thread Karsten Hilbert
Am Wed, Sep 22, 2021 at 08:11:50PM +1200 schrieb Tim Uckun: > A korean user will fetch you single page app as static HTML from S3 > with cloudfront. It will hit your japanese API server, which will > fetch the data from your japanese read only replica with the master > being in Australia. > > The

Re: Question about behavior of conditional indexes

2021-09-22 Thread Michael Lewis
Just to clarify, I suggested fillfactor likely could *not* help because you are indexing a column that has the value change. The idea with reducing fillfactor is that the rows can be updated with the new versions remaining in the same 8KB block in the file for table/row storage. If the new version

Re: Question about behavior of conditional indexes

2021-09-22 Thread Koen De Groote
Thanks for the replies, everyone. Gavin - I can't upgrade to a more recent version, at least not for the foreseeable future. From what I'm reading, it's the best path forward, but there's considerations to be made that I can't overrule. Ninad - As I suspected about VACUUM and VACUUM FULL. Thanks

Re: Timestamp with vs without time zone.

2021-09-22 Thread Gavin Flower
On 22/09/21 20:11, Tim Uckun wrote: I'll add another layer of complication. You have a database server hosted in Australia, and that's also where your web server and api server is. You have customers all over the world though so you set up additional API servers in Europe, USA, Japan etc. A kor

Re: Timestamp with vs without time zone.

2021-09-22 Thread Tim Cross
Tim Uckun writes: > I'll add another layer of complication. > > You have a database server hosted in Australia, and that's also where > your web server and api server is. You have customers all over the > world though so you set up additional API servers in Europe, USA, > Japan etc. > > A korea

Re: Timestamp with vs without time zone.

2021-09-22 Thread Tim Uckun
I'll add another layer of complication. You have a database server hosted in Australia, and that's also where your web server and api server is. You have customers all over the world though so you set up additional API servers in Europe, USA, Japan etc. A korean user will fetch you single page ap

Surprising results from tests intended to show the difference in semantics between transaction_timestamp(), statement_timestamp(), and clock_timestamp()

2021-09-22 Thread Bryn Llewellyn
I'm surprised by the results that I describe below. Please help me interpret them. Briefly, why does transaction_timestamp() report a later value than statement_timestamp() when they're both invoked in the very first statement after "start transaction". (They report identical values in an impli