1. Show us the PG version, the view definition, the exact query that's slow, and its EXPLAIN (ANALYZE) output.
2. Presumably there's an index on each table's *_hash column?
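For example, something along these lines (just a sketch: the join query is the one proposed in the quoted message below, and the indexes assume that join; adjust names to the real schema):

  -- Server version
  SELECT version();

  -- Plan, timings, and buffer usage for the slow query
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT o."timestamp", n.attributes
  FROM   original_table o
  JOIN   new_table n
    ON   time_bucket('1 day', o."timestamp") = n.day
   AND   o.hash = n.hash;

  -- Indexes that join would normally want on each side
  CREATE INDEX ON new_table (day, hash);
  CREATE INDEX ON original_table (hash);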
On Tue, Feb 13, 2024 at 8:48 AM Adrian Garcia Badaracco <adr...@adriangb.com> wrote:
> Thank you for the reply Ron.
>
> Yes, there are many fewer (<1% as many) rows in new_table.
>
> Thanks for making me think of normalization; I hadn’t seen it that way.
> Although there is no theoretical relationship between the values in the
> other columns of the original table and the attributes column, in practice
> there is a strong correlation, so I guess what I am trying to capture here
> is taking advantage of that correlation, while not completely depending on
> it because it can be broken.
>
> In any case, whatever theoretical framework is put around this solution, I
> am also interested in the practical aspects, in particular the case of
> selecting a subset of columns from the view that I know doesn’t need the
> join but the query planner thinks does.
>
> On Tue, Feb 13, 2024 at 3:16 AM Ron Johnson <ronljohnso...@gmail.com> wrote:
>> On Mon, Feb 12, 2024 at 10:12 PM Adrian Garcia Badaracco <adr...@adriangb.com> wrote:
>>> I am using Timescale, so I'll be mentioning some timestamp stuff, but I
>>> think this is a general Postgres question for the most part.
>>>
>>> I have a table with some fixed, small columns (id, timestamp, etc.) and
>>> a large JSONB column (let's call it `attributes`). `attributes` has
>>> 1000s of schemas, but given a schema, there's a lot of duplication.
>>> Across all rows, more than 99% of the data is duplicated (as measured by
>>> `count(attributes)` vs `count(distinct attributes)`).
>>>
>>> I can't normalize `attributes` into real columns because it is quite
>>> variable (remember, 1000s of schemas).
>>>
>>> My best idea is to make a table like `(day timestamptz, hash text,
>>> attributes jsonb)` and then in my original table replace `attributes`
>>> with a reference to `new_table`.
>>
>> Meaning that there are many fewer rows in new_table?
>>
>>> I can then make a view that joins them: `select original_table.timestamp,
>>> new_table.attributes from original join new_table on (time_bucket('1 day',
>>> timestamp) = day AND original.hash = new_table.hash)` or something like
>>> that. The idea of time bucketing into 1 day is to balance write and read
>>> speed (by relying on Timescale to do efficient time partitioning, data
>>> retention, etc.).
>>>
>>> I recognize this is essentially creating a key-value store in Postgres
>>> and also janky compression, so I am cautious about it.
>>
>> If my interpretation (that there are many fewer rows in new_table) is
>> correct, then you've stumbled into the Second Normal Form of database
>> design: https://en.wikipedia.org/wiki/Second_normal_form#Example
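For reference, here is the proposed design from the quoted message written out as a concrete sketch (table, view, and key names are illustrative, not the actual schema; time_bucket() is the TimescaleDB bucketing function mentioned in the thread):

  -- Deduplicated attribute store: one row per (day, hash)
  CREATE TABLE new_table (
      day        timestamptz NOT NULL,
      hash       text        NOT NULL,
      attributes jsonb       NOT NULL,
      PRIMARY KEY (day, hash)
  );

  -- original_table keeps its small fixed columns plus hash and drops the
  -- large attributes column; this view reassembles the original row shape
  CREATE VIEW attributes_view AS
  SELECT o.id,
         o."timestamp",
         n.attributes
  FROM   original_table o
  JOIN   new_table n
    ON   time_bucket('1 day', o."timestamp") = n.day
   AND   o.hash = n.hash;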