On Mon, Feb 12, 2024 at 10:12 PM Adrian Garcia Badaracco <adr...@adriangb.com> wrote:
> I am using Timescale so I'll be mentioning some timestamp stuff but I
> think this is a general postgres question for the most part.
>
> I have a table with some fixed, small columns (id, timestamp, etc) and a
> large JSONB column (let's call it `attributes`). `attributes` has 1000s of
> schemas, but given a schema, there's a lot of duplication. Across all rows,
> more than 99% of the data is duplicated (as measured by `count(attributes)`
> vs `count(distinct attributes)`).
>
> I can't normalize `attributes` into real columns because it is quite
> variable (remember 1000s of schemas).
>
> My best idea is to make a table like `(day timestamptz, hash text,
> attributes jsonb)` and then in my original table replace `attributes` with
> a reference to `new_table`.

Meaning that there are many fewer rows in new_table?

> I can then make a view that joins them `select original_table.timestamp,
> new_table.attributes from original join new_table on (time_bucket('1 day',
> timestamp) = day AND original.hash = new_table.hash)` or something like
> that. The idea of time bucketing into 1 day is to balance write and read
> speed (by relying on timescale to do efficient time partitioning, data
> retention, etc.).
>
> I recognize this is essentially creating a key-value store in postgres and
> also janky compression, so I am cautious about it.

If my interpretation (that there are many fewer rows in new_table) is
correct, then you've stumbled into the Second Normal Form of database
design: https://en.wikipedia.org/wiki/Second_normal_form#Example
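For concreteness, here is a rough sketch of the layout as I read your
description. Table, column, and view names are just placeholders taken from
your message, and it assumes the timescaledb extension is installed for
time_bucket():

-- deduplicated attribute storage, one row per (day, hash)
create table new_table (
    day        timestamptz not null,
    hash       text        not null,
    attributes jsonb       not null,
    primary key (day, hash)
);

-- original table keeps only the hash; the jsonb lives in new_table
-- (column list is illustrative; "timestamp" is quoted because it is a type name)
create table original_table (
    id          bigint      not null,
    "timestamp" timestamptz not null,
    hash        text        not null
);
-- select create_hypertable('original_table', 'timestamp');  -- if partitioning with Timescale

-- view that stitches the attributes back onto the original rows
create view original_with_attributes as
select o.id,
       o."timestamp",
       n.attributes
from original_table o
join new_table n
  on time_bucket('1 day', o."timestamp") = n.day
 and o.hash = n.hash;

The composite primary key on (day, hash) both enforces the deduplication and
gives the join an index to use.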