Potential performance issues related to group by and covering index

Liu, Xinyu Tue, 02 Mar 2021 00:53:31 -0800

Hello,


We have 2 TPC-H queries which fetch the same tuples but have significant query 
execution time differences (4.3 times).


We are sharing a pair of TPC-H queries that exhibit this performance difference:


First query:

SELECT "ps_comment",

       "ps_suppkey",

       "ps_supplycost",

       "ps_partkey",

       "ps_availqty"

FROM   "partsupp"

WHERE  "ps_partkey" + 16 < 1

        OR "ps_partkey" = 2

GROUP  BY "ps_partkey",

          "ps_suppkey",

          "ps_availqty",

          "ps_supplycost",

          "ps_comment"


Second query:

SELECT "ps_comment",

       "ps_suppkey",

       "ps_supplycost",

       "ps_partkey",

       "ps_availqty"

FROM   "partsupp"

WHERE  "ps_partkey" + 16 < 1

        OR "ps_partkey" = 2

GROUP  BY "ps_comment",

          "ps_suppkey",

          "ps_supplycost",

          "ps_partkey",

          "ps_availqty"


* Actual Behavior

We executed both queries on the TPC-H benchmark of scale factor 5: the first 
query takes over 1.7 seconds, while the second query only takes 0.4 seconds.
We think the time difference results from different plans selected. 
Specifically, in the first (slow) query, the DBMS performs an index scan on 
table partsupp using the covering index (ps_partkey, ps_suppkey), while the 
second (fast) query performs a parallel scan on (ps_suppkey, ps_partkey).


*  Query Execution Plan

  *   First query:

                                                       QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------

 Group  (cost=0.43..342188.58 rows=399262 width=144) (actual 
time=0.058..1737.659 rows=4 loops=1)

   Group Key: ps_partkey, ps_suppkey

   Buffers: shared hit=123005 read=98055

   ->  Index Scan using partsupp_pkey on partsupp  (cost=0.43..335522.75 
rows=1333167 width=144) (actual time=0.055..1737.651 rows=4 loops=1)

         Filter: (((ps_partkey + 16) < 1) OR (ps_partkey = 2))

         Rows Removed by Filter: 3999996

         Buffers: shared hit=123005 read=98055

 Planning Time: 0.926 ms

 Execution Time: 1737.754 ms

(9 rows)





  *   Second query:

                                                                                
                                                          QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------

 Group  (cost=250110.68..350438.93 rows=399262 width=144) (actual 
time=400.353..400.361 rows=4 loops=1)

   Group Key: ps_suppkey, ps_partkey

   Buffers: shared hit=5481 read=24093

   ->  Gather Merge  (cost=250110.68..346446.31 rows=798524 width=144) (actual 
time=400.351..406.741 rows=4 loops=1)

         Workers Planned: 2

         Workers Launched: 2

         Buffers: shared hit=15151 read=72144

         ->  Group  (cost=249110.66..253276.80 rows=399262 width=144) (actual 
time=395.882..395.883 rows=1 loops=3)

               Group Key: ps_suppkey, ps_partkey

               Buffers: shared hit=15151 read=72144

               ->  Sort  (cost=249110.66..250499.37 rows=555486 width=144) 
(actual time=395.880..395.881 rows=1 loops=3)

                     Sort Key: ps_suppkey, ps_partkey

                     Sort Method: quicksort  Memory: 25kB

                     Worker 0:  Sort Method: quicksort  Memory: 25kB

                     Worker 1:  Sort Method: quicksort  Memory: 25kB

                     Buffers: shared hit=15151 read=72144

                     ->  Parallel Seq Scan on partsupp  (cost=0.00..116363.88 
rows=555486 width=144) (actual time=395.518..395.615 rows=1 loops=3)

                           Filter: (((ps_partkey + 16) < 1) OR (ps_partkey = 2))

                           Rows Removed by Filter: 1333332

                           Buffers: shared hit=15065 read=72136

 Planning Time: 0.360 ms

 Execution Time: 406.880 ms

(22 rows)







*Expected Behavior

Since these two queries are semantically equivalent, we were hoping that 
PostgreSQL would evaluate them in roughly the same amount of time.
It looks to me that different order of group by clauses triggers different 
plans: when the group by clauses (ps_partkey, ps_suppkey) is the same as the 
covering index, it will trigger an index scan on associated columns;
however,  when the group by clauses have different order than the covering 
index (ps_suppkey, ps_partkey), the index scan will not be triggered.
Given that the user might not pay close attention to this subtle difference, I 
was wondering if it is worth making these two queries have the same and 
predictable performance on Postgresql.


*Test Environment

Ubuntu 20.04 machine "Linux panda 5.4.0-40-generic #44-Ubuntu SMP Tue Jun 23 
00:01:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux"

PostgreSQL v12.3

Database: TPC-H benchmark (with scale factor 5)

The description of table partsupp is as follows:

tpch5=# \d partsupp;

                         Table "public.partsupp"

    Column     |          Type          | Collation | Nullable | Default

---------------+------------------------+-----------+----------+---------

 ps_partkey    | integer                |           | not null |

 ps_suppkey    | integer                |           | not null |

 ps_availqty   | integer                |           | not null |

 ps_supplycost | numeric(15,2)          |           | not null |

 ps_comment    | character varying(199) |           | not null |

Indexes:

    "partsupp_pkey" PRIMARY KEY, btree (ps_partkey, ps_suppkey)

Foreign-key constraints:

    "partsupp_fk1" FOREIGN KEY (ps_suppkey) REFERENCES supplier(s_suppkey)

    "partsupp_fk2" FOREIGN KEY (ps_partkey) REFERENCES part(p_partkey)

Referenced by:

    TABLE "lineitem" CONSTRAINT "lineitem_fk2" FOREIGN KEY (l_partkey, 
l_suppkey) REFERENCES partsupp(ps_partkey, ps_suppkey)






*Here are the steps for reproducing our observations:

  1.  Download the dataset from the link: 
https://drive.google.com/file/d/13rFa1BNDi4e2RmXBn-yEQkcqt6lsBu1c/view?usp=sharing

  2.  Set up TPC-H benchmark

tar xzvf tpch5_postgresql.tar.gz

cd tpch5_postgresql

db=tpch5

createdb $db

psql -d $db < dss.ddl

for i in `ls *.tbl`

do

    echo $i

    name=`echo $i|cut -d'.' -f1`

    psql -d $db -c "COPY $name FROM '`pwd`/$i' DELIMITER '|' ENCODING 'LATIN1';"

done

psql -d $db < dss_postgres.ri

  1.  Execute the queries

Potential performance issues related to group by and covering index

Reply via email to