Re: Potential performance issues related to group by and covering index

Pavel Stehule Tue, 02 Mar 2021 01:09:49 -0800

út 2. 3. 2021 v 9:53 odesílatel Liu, Xinyu <[email protected]> napsal:


>
>
>
>
>
>
>
>
>
>
> * Hello, We have 2 TPC-H queries which fetch the same tuples but have
> significant query execution time differences (4.3 times). We are sharing a
> pair of TPC-H queries that exhibit this performance difference: First
> query: SELECT "ps_comment",         "ps_suppkey",         "ps_supplycost",
>        "ps_partkey",         "ps_availqty"  FROM   "partsupp"  WHERE
> "ps_partkey" + 16 < 1          OR "ps_partkey" = 2  GROUP
> BY "ps_partkey",            "ps_suppkey",            "ps_availqty",
>           "ps_supplycost",            "ps_comment"  Second query: SELECT
> "ps_comment",         "ps_suppkey",         "ps_supplycost",
>        "ps_partkey",         "ps_availqty"  FROM   "partsupp"  WHERE
> "ps_partkey" + 16 < 1          OR "ps_partkey" = 2  GROUP  BY
> "ps_comment",            "ps_suppkey",            "ps_supplycost",
>           "ps_partkey",            "ps_availqty"  * Actual Behavior We
> executed both queries on the TPC-H benchmark of scale factor 5: the first
> query takes over 1.7 seconds, while the second query only takes 0.4
> seconds. We think the time difference results from different plans
> selected. Specifically, in the first (slow) query, the DBMS performs an
> index scan on table partsupp using the covering index (ps_partkey,
> ps_suppkey), while the second (fast) query performs a parallel scan on
> (ps_suppkey, ps_partkey). *  Query Execution Plan - First query:
>                                                        QUERY
> PLAN
> ----------------------------------------------------------------------------------------------------------------------------------------------
>  Group  (cost=0.43..342188.58 rows=399262 width=144) (actual
> time=0.058..1737.659 rows=4 loops=1)    Group Key: ps_partkey, ps_suppkey
>    Buffers: shared hit=123005 read=98055    ->  Index Scan using
> partsupp_pkey on partsupp  (cost=0.43..335522.75 rows=1333167 width=144)
> (actual time=0.055..1737.651 rows=4 loops=1)          Filter: (((ps_partkey
> + 16) < 1) OR (ps_partkey = 2))          Rows Removed by Filter: 3999996
>          Buffers: shared hit=123005 read=98055  Planning Time: 0.926 ms
>  Execution Time: 1737.754 ms (9 rows) *
>

In this case there is brutal overestimation. Probably due  planner
unfriendly written predicate ps_partkey + 16  < 1) OR ps_parkey = 2. You
can try to rewrite this predicate to ps_parthkey < -15 OR ps_parkey = 2

Regards

Pavel



>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> * - Second query:
>                                                                               
>                                                             QUERY
> PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------------
>  Group  (cost=250110.68..350438.93 rows=399262 width=144) (actual
> time=400.353..400.361 rows=4 loops=1)    Group Key: ps_suppkey, ps_partkey
>    Buffers: shared hit=5481 read=24093    ->  Gather Merge
> (cost=250110.68..346446.31 rows=798524 width=144) (actual
> time=400.351..406.741 rows=4 loops=1)          Workers Planned: 2
>          Workers Launched: 2          Buffers: shared hit=15151 read=72144
>          ->  Group  (cost=249110.66..253276.80 rows=399262 width=144)
> (actual time=395.882..395.883 rows=1 loops=3)                Group Key:
> ps_suppkey, ps_partkey                Buffers: shared hit=15151 read=72144
>                ->  Sort  (cost=249110.66..250499.37 rows=555486 width=144)
> (actual time=395.880..395.881 rows=1 loops=3)                      Sort
> Key: ps_suppkey, ps_partkey                      Sort Method: quicksort
> Memory: 25kB                      Worker 0:  Sort Method: quicksort
> Memory: 25kB                      Worker 1:  Sort Method: quicksort
> Memory: 25kB                      Buffers: shared hit=15151 read=72144
>                      ->  Parallel Seq Scan on partsupp
> (cost=0.00..116363.88 rows=555486 width=144) (actual time=395.518..395.615
> rows=1 loops=3)                            Filter: (((ps_partkey + 16) < 1)
> OR (ps_partkey = 2))                            Rows Removed by Filter:
> 1333332                            Buffers: shared hit=15065 read=72136
>  Planning Time: 0.360 ms  Execution Time: 406.880 ms (22 rows) *Expected
> Behavior Since these two queries are semantically equivalent, we were
> hoping that PostgreSQL would evaluate them in roughly the same amount of
> time. It looks to me that different order of group by clauses triggers
> different plans: when the group by clauses (ps_partkey, ps_suppkey) is the
> same as the covering index, it will trigger an index scan on associated
> columns; however,  when the group by clauses have different order than the
> covering index (ps_suppkey, ps_partkey), the index scan will not be
> triggered. Given that the user might not pay close attention to this subtle
> difference, I was wondering if it is worth making these two queries have
> the same and predictable performance on Postgresql. *Test Environment
> Ubuntu 20.04 machine "Linux panda 5.4.0-40-generic #44-Ubuntu SMP Tue Jun
> 23 00:01:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux" PostgreSQL v12.3
> Database: TPC-H benchmark (with scale factor 5) The description of table
> partsupp is as follows: tpch5=# \d partsupp;                          Table
> "public.partsupp"     Column     |          Type          | Collation |
> Nullable | Default
> ---------------+------------------------+-----------+----------+---------
>  ps_partkey    | integer                |           | not null |
>  ps_suppkey    | integer                |           | not null |
>  ps_availqty   | integer                |           | not null |
>  ps_supplycost | numeric(15,2)          |           | not null |
>  ps_comment    | character varying(199) |           | not null |  Indexes:
>     "partsupp_pkey" PRIMARY KEY, btree (ps_partkey, ps_suppkey) Foreign-key
> constraints:     "partsupp_fk1" FOREIGN KEY (ps_suppkey) REFERENCES
> supplier(s_suppkey)     "partsupp_fk2" FOREIGN KEY (ps_partkey) REFERENCES
> part(p_partkey) Referenced by:     TABLE "lineitem" CONSTRAINT
> "lineitem_fk2" FOREIGN KEY (l_partkey, l_suppkey) REFERENCES
> partsupp(ps_partkey, ps_suppkey) *Here are the steps for reproducing our
> observations: 1. Download the dataset from the link:
> https://drive.google.com/file/d/13rFa1BNDi4e2RmXBn-yEQkcqt6lsBu1c/view?usp=sharing
> <https://drive.google.com/file/d/13rFa1BNDi4e2RmXBn-yEQkcqt6lsBu1c/view?usp=sharing>
> 2. Set up TPC-H benchmark tar xzvf tpch5_postgresql.tar.gz cd
> tpch5_postgresql db=tpch5 createdb $db psql -d $db < dss.ddl for i in `ls
> *.tbl` do     echo $i     name=`echo $i|cut -d'.' -f1`     psql -d $db -c
> "COPY $name FROM '`pwd`/$i' DELIMITER '|' ENCODING 'LATIN1';" done psql -d
> $db < dss_postgres.ri 1. Execute the queries *
>

Re: Potential performance issues related to group by and covering index

Reply via email to