On original_hashed, I think I'd try moving start_timestamp into its own index.
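Untested, but something along these lines, reusing the definitions from your gist (whether the original three-column composite index should also stay is a separate question):

```sql
-- Sketch only, not benchmarked: give the range filter on start_timestamp
-- a dedicated index, and cover the view's join keys separately.
CREATE INDEX ON original_hashed (start_timestamp);
CREATE INDEX ON original_hashed (time_bucket('1 day', start_timestamp), hash);
```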
On Tue, Feb 13, 2024 at 12:02 PM Adrian Garcia Badaracco <adr...@adriangb.com> wrote:

> I'm using PostgreSQL 15.5.
>
> Here's a self-contained example. I included links to public pgMustard
> query plans.
>
> Gist link:
> https://gist.github.com/adriangb/05a01cca99a438de2a286945903c44f4
>
> Also copied below for archiving:
>
> ```sql
> CREATE OR REPLACE FUNCTION random_bytes(length integer)
> RETURNS bytea AS $$
> DECLARE
>   bytes bytea := '';
>   i integer := 0;
> BEGIN
>   -- generate up to 1024 bytes at a time with gen_random_bytes()
>   WHILE i < length LOOP
>     bytes := bytes || gen_random_bytes(least(1024, length - i));
>     i := i + 1024;
>   END LOOP;
>
>   RETURN bytes;
> END;
> $$ LANGUAGE plpgsql;
>
> DROP TABLE IF EXISTS original, original_hashed, dictionary_table CASCADE;
> DROP VIEW IF EXISTS joined_view;
>
> CREATE TABLE original AS (
>   WITH large_random_bytes AS (
>     SELECT n AS id, random_bytes(4096 + n) AS attributes
>     FROM generate_series(1, 1000) n
>   )
>   SELECT
>     -- An incrementing timestamp
>     '2024-01-01 00:00:00.00+00'::timestamptz + (n || ' seconds')::interval AS start_timestamp,
>     -- Another relatively small column: a random-ish string (n % 100, so 100 distinct values)
>     'random_string_' || (n % 100) AS event_name,
>     -- The attributes column
>     lrb.attributes AS attributes
>   FROM generate_series(0, 1000000) n
>   JOIN large_random_bytes lrb ON n % 1000 = lrb.id
> );
> CREATE INDEX ON original(start_timestamp);
>
> CREATE TABLE dictionary_table AS (
>   SELECT DISTINCT
>     time_bucket('1 day', start_timestamp) AS start_timestamp_range,
>     attributes,
>     md5(attributes) AS hash
>   FROM original
> );
> CREATE INDEX ON dictionary_table (start_timestamp_range, hash);
>
> CREATE TABLE original_hashed AS (
>   SELECT
>     start_timestamp,
>     event_name,
>     md5(attributes) AS hash
>   FROM original
> );
> CREATE INDEX ON original_hashed (start_timestamp, time_bucket('1 day', start_timestamp), hash);
>
> CREATE VIEW joined_view AS (
>   SELECT
>     original_hashed.start_timestamp,
>     original_hashed.event_name,
>     dictionary_table.attributes
>   FROM original_hashed
>   LEFT JOIN dictionary_table ON (
>     time_bucket('1 day', original_hashed.start_timestamp) = dictionary_table.start_timestamp_range
>     AND
>     original_hashed.hash = dictionary_table.hash
>   )
> );
>
> -- Select all data
> explain (analyze, buffers, verbose, settings, format json)
> SELECT *
> FROM joined_view
> WHERE (
>   start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval '1 day')
>   AND
>   start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval '2 day')
> );
> -- https://app.pgmustard.com/#/explore/f8b3b7d5-c236-4326-9a67-20fa4cdff4bc
>
> explain (analyze, buffers, verbose, settings, format json)
> SELECT *
> FROM original
> WHERE (
>   start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval '1 day')
>   AND
>   start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval '2 day')
> );
> -- https://app.pgmustard.com/#/explore/bcad3767-b8aa-4a37-acc7-7667266f29be
>
> -- A relatively selective aggregation query
> explain (analyze, buffers, verbose, settings, format json)
> SELECT count(*)
> FROM joined_view
> WHERE (
>   get_byte(attributes, 4) < 100
>   AND
>   start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval '1 day')
>   AND
>   start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval '2 day')
> );
> -- https://app.pgmustard.com/#/explore/c5b68839-4c66-4b14-8de5-757bfc22fdb3
>
> explain (analyze, buffers, verbose, settings, format json)
> SELECT
> count(*)
> FROM original
> WHERE (
>   get_byte(attributes, 4) < 100
>   AND
>   start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval '1 day')
>   AND
>   start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval '2 day')
> );
> -- https://app.pgmustard.com/#/explore/831d9c4c-e911-4caf-a6c3-eee5343e44d8
>
> -- Select only some columns
> explain (analyze, buffers, verbose, settings, format json)
> SELECT start_timestamp
> FROM joined_view
> WHERE (
>   start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval '1 day')
>   AND
>   start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval '2 day')
> );
> -- https://app.pgmustard.com/#/explore/5f617bfa-7816-4f8d-9fd7-24388c08978c
>
> explain (analyze, buffers, verbose, settings, format json)
> SELECT start_timestamp
> FROM original
> WHERE (
>   start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval '1 day')
>   AND
>   start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval '2 day')
> );
> -- https://app.pgmustard.com/#/explore/f05de3e8-ca8d-4404-ad33-69f103a3b85a
>
> -- A relatively selective query
> explain (analyze, buffers, verbose, settings, format json)
> SELECT *
> FROM joined_view
> WHERE (
>   get_byte(attributes, 4) < 100
>   AND
>   start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval '1 day')
>   AND
>   start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval '2 day')
> );
> -- https://app.pgmustard.com/#/explore/a75b20bd-07d5-4402-a0d8-22419682307a
>
> explain (analyze, buffers, verbose, settings, format json)
> SELECT *
> FROM original
> WHERE (
>   get_byte(attributes, 4) < 100
>   AND
>   start_timestamp > ('2024-01-01 00:00:00.00+00'::timestamptz + interval '1 day')
>   AND
>   start_timestamp < ('2024-01-01 00:00:00.00+00'::timestamptz + interval '2 day')
> );
> -- https://app.pgmustard.com/#/explore/47ef84f0-a96e-4baa-af40-2ec241cbb6e2
> ```
>
> On Tue, Feb 13, 2024 at 11:04 AM Ron Johnson <ronljohnso...@gmail.com>
> wrote:
>
>> 1. Show us the PG version, view definition, the exact query that's
>> slow, and the EXPLAIN (ANALYZE).
>> 2. Presumably there's an index on each table's *_hash column?
>>
>> On Tue, Feb 13, 2024 at 8:48 AM Adrian Garcia Badaracco
>> <adr...@adriangb.com> wrote:
>>
>>> Thank you for the reply, Ron.
>>>
>>> Yes, new_table has many fewer rows (<1% of the number of rows in the
>>> original table).
>>>
>>> Thanks for making me think of normalization; I hadn't seen it that
>>> way. Although there is no theoretical relationship between the other
>>> columns in the original table and the attributes column, in practice
>>> there is a strong correlation, so I guess what I am trying to capture
>>> here is taking advantage of that correlation, while not completely
>>> depending on it, because it can be broken.
>>>
>>> In any case, whatever theoretical framework is put around this
>>> solution, I am also interested in the practical aspects, in particular
>>> the case of selecting a subset of columns from the view that I know
>>> doesn't need the join, but the query planner thinks it does.
>>>
>>> On Tue, Feb 13, 2024 at 3:16 AM Ron Johnson <ronljohnso...@gmail.com>
>>> wrote:
>>>
>>>> On Mon, Feb 12, 2024 at 10:12 PM Adrian Garcia Badaracco
>>>> <adr...@adriangb.com> wrote:
>>>>
>>>>> I am using Timescale, so I'll be mentioning some timestamp stuff,
>>>>> but I think this is a general Postgres question for the most part.
>>>>> I have a table with some fixed, small columns (id, timestamp, etc.)
>>>>> and a large JSONB column (let's call it `attributes`). `attributes`
>>>>> has 1000s of schemas, but given a schema, there's a lot of
>>>>> duplication. Across all rows, more than 99% of the data is duplicated
>>>>> (as measured by `count(attributes)` vs `count(distinct attributes)`).
>>>>>
>>>>> I can't normalize `attributes` into real columns because it is quite
>>>>> variable (remember, 1000s of schemas).
>>>>>
>>>>> My best idea is to make a table like `(day timestamptz, hash text,
>>>>> attributes jsonb)` and then in my original table replace `attributes`
>>>>> with a reference to `new_table`.
>>>>
>>>> Meaning that there are many fewer rows in new_table?
>>>>
>>>>> I can then make a view that joins them: `select
>>>>> original_table.timestamp, new_table.attributes from original join
>>>>> new_table on (time_bucket('1 day', timestamp) = day AND
>>>>> original.hash = new_table.hash)` or something like that. The idea of
>>>>> time bucketing into 1 day is to balance write and read speed (by
>>>>> relying on Timescale to do efficient time partitioning, data
>>>>> retention, etc.).
>>>>>
>>>>> I recognize this is essentially creating a key-value store in
>>>>> Postgres and also janky compression, so I am cautious about it.
>>>>
>>>> If my interpretation (that there are many fewer rows in new_table) is
>>>> correct, then you've stumbled into the Second Normal Form of database
>>>> design: https://en.wikipedia.org/wiki/Second_normal_form#Example
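On the other practical question quoted above (selecting a subset of columns from the view that shouldn't need the join): Postgres can only eliminate a LEFT JOIN when it can prove the inner side is unique on the join keys, and the plain index on dictionary_table doesn't prove that. An untested sketch against the gist's tables; it assumes no md5 collisions within a day bucket, so the index can build:

```sql
-- Hypothetical, untested: a UNIQUE index on the join keys lets the planner
-- drop the LEFT JOIN for queries like
--   SELECT start_timestamp FROM joined_view WHERE ...
-- that reference no dictionary_table columns.
CREATE UNIQUE INDEX ON dictionary_table (start_timestamp_range, hash);
```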