Hi, first time posting; I hope I have included the relevant information.

I am trying to understand the performance of a query which is intended to 
retrieve a subset of the following table:

        Table "contracts.bis_person_alle_endringer"
                      Column              |           Type           | 
Collation | Nullable | Default 
        
----------------------------------+--------------------------+-----------+----------+---------
         person_id                        | uuid                     |          
 | not null | 
         dpd_gyldig_fra_dato              | date                     |          
 | not null | 
         dpd_i_kraft_fra_dato             | date                     |          
 | not null | 
         dpd_i_kraft_til_dato             | date                     |          
 | not null | 
         dpd_endret_tidspunkt             | timestamp with time zone |          
 | not null | 
         dpd_bis_foedselsnummer           | text                     |          
 |          | 
         dpd_bis_treffkilde_id            | text                     |          
 |          | 
        ... [omitted for brevity] ...
         dpd_endret_av                    | text                     |          
 |          | 
         dpd_bis_kjoenn_id                | text                     |          
 |          | 
        Indexes:
            "bis_person_alle_endringer_by_person_id" btree (person_id)
            "bis_person_alle_endringer_unique_descending" UNIQUE, btree 
(dpd_bis_foedselsnummer, dpd_gyldig_fra_dato DESC, dpd_endret_tidspunkt DESC)



        dpd=> SELECT relname, relpages, reltuples, relallvisible, relkind, relnatts, relhassubclass, reloptions, pg_table_size(oid) FROM pg_class WHERE relname='bis_person_alle_endringer';
                  relname          | relpages |  reltuples  | relallvisible | relkind | relnatts | relhassubclass | reloptions | pg_table_size
        ---------------------------+----------+-------------+---------------+---------+----------+----------------+------------+---------------
         bis_person_alle_endringer |  9367584 | 1.09584e+08 |       6392129 | r       |      106 | f              |            |   76760489984
        (1 row)
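
For reference, that pg_table_size works out to roughly 71 GB; it can be printed in human-readable form with:

        select pg_size_pretty(pg_table_size('contracts.bis_person_alle_endringer'));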

I have omitted most of the columns, as there are 106 in total. The omitted 
columns have data types text, numeric, or date, and all are nullable.

To create the subsets, I (or rather my application) will receive lists of 
records that should be matched according to some business logic. Each of these 
lists is read into a temporary table:

                   Table "pg_temp_9.records_to_filter_on"
               Column        | Type | Collation | Nullable | Default
        ---------------------+------+-----------+----------+---------
         foedselsnummer      | text |           |          |
         tariff_dato         | date |           |          |
         versjons_dato       | date |           |          |
         kjent_i_system_dato | date |           |          |
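
For concreteness, each load is equivalent to something like the sketch below; 
the file path and CSV format are made up for illustration, as the real load 
goes through the application:

        create temporary table records_to_filter_on (
            foedselsnummer      text,
            tariff_dato         date,
            versjons_dato       date,
            kjent_i_system_dato date
        );
        -- illustrative source; in reality the application streams the rows in
        copy records_to_filter_on from '/tmp/records_to_filter_on.csv'
            with (format csv, header);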

The subset is then created by the following query, which for each input row 
finds the latest record in contracts.bis_person_alle_endringer (by 
dpd_gyldig_fra_dato, then dpd_endret_tidspunkt) that satisfies the business 
logic, if any. Note that the final WHERE clause discards input rows without a 
match, so the left join effectively behaves as an inner join:

    select * from records_to_filter_on r
    left join lateral (
        select * from contracts.bis_person_alle_endringer b
        where b.dpd_bis_foedselsnummer = r.foedselsnummer and
            r.kjent_i_system_dato >= b.dpd_endret_tidspunkt and
            r.tariff_dato > b.dpd_gyldig_fra_dato
        order by b.dpd_gyldig_fra_dato desc, b.dpd_endret_tidspunkt desc
        limit 1
    ) c on true
    where c.person_id is not null and
        r.versjons_dato < c.dpd_i_kraft_til_dato;

The temporary table records_to_filter_on and the result of the above query will 
typically contain 1-5 million rows (the returned subsets are used for training 
machine learning models).

I've created a sample data set with 3.75 million rows and run EXPLAIN (ANALYZE, 
BUFFERS) on the query; the resulting plan is at 
https://explain.dalibo.com/plan/U41 (and also attached). Running the full 
EXPLAIN (ANALYZE, BUFFERS) takes about 30 minutes, which seems quite slow. 
However, as I am new to Postgres, I find the output difficult to interpret: 
most of the time is spent in an index scan, which to my understanding is 
"good". Still, I don't understand Postgres well enough to judge whether this is 
the best I can achieve (or at least close to it), or whether the query should 
be rewritten. Alternatively, is it simply unrealistic to expect faster 
performance given the size of the table and the hardware of the database 
instance?
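
In case it is relevant: if I read the plan correctly, the index being scanned 
is bis_person_alle_endringer_unique_descending. I assume its size can be 
checked as follows, and I can post that number if useful:

        select pg_size_pretty(pg_relation_size('contracts.bis_person_alle_endringer_unique_descending'));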

I am running PostgreSQL 11.9 on x86_64-pc-linux-gnu using AWS Aurora on a 
db.t3.large instance (2 vCPUs, 8 GiB RAM; 
https://aws.amazon.com/rds/instance-types/), so the table is roughly nine times 
the size of memory. The output of

        SELECT name, current_setting(name), source
          FROM pg_settings
          WHERE source NOT IN ('default', 'override');

is attached in pg_settings.conf.

I realize that these questions are a little vague, but any guidance would be 
much appreciated.

Thanks, Simen Lønsethagen
