Ășt 2. 3. 2021 v 9:53 odesĂlatel Liu, Xinyu <li...@gatech.edu> napsal:
> > > > > > > > > > > * Hello, We have 2 TPC-H queries which fetch the same tuples but have > significant query execution time differences (4.3 times). We are sharing a > pair of TPC-H queries that exhibit this performance difference: First > query: SELECT "ps_comment", "ps_suppkey", "ps_supplycost", > "ps_partkey", "ps_availqty" FROM "partsupp" WHERE > "ps_partkey" + 16 < 1 OR "ps_partkey" = 2 GROUP > BY "ps_partkey", "ps_suppkey", "ps_availqty", > "ps_supplycost", "ps_comment" Second query: SELECT > "ps_comment", "ps_suppkey", "ps_supplycost", > "ps_partkey", "ps_availqty" FROM "partsupp" WHERE > "ps_partkey" + 16 < 1 OR "ps_partkey" = 2 GROUP BY > "ps_comment", "ps_suppkey", "ps_supplycost", > "ps_partkey", "ps_availqty" * Actual Behavior We > executed both queries on the TPC-H benchmark of scale factor 5: the first > query takes over 1.7 seconds, while the second query only takes 0.4 > seconds. We think the time difference results from different plans > selected. Specifically, in the first (slow) query, the DBMS performs an > index scan on table partsupp using the covering index (ps_partkey, > ps_suppkey), while the second (fast) query performs a parallel scan on > (ps_suppkey, ps_partkey). * Query Execution Plan - First query: > QUERY > PLAN > ---------------------------------------------------------------------------------------------------------------------------------------------- > Group (cost=0.43..342188.58 rows=399262 width=144) (actual > time=0.058..1737.659 rows=4 loops=1) Group Key: ps_partkey, ps_suppkey > Buffers: shared hit=123005 read=98055 -> Index Scan using > partsupp_pkey on partsupp (cost=0.43..335522.75 rows=1333167 width=144) > (actual time=0.055..1737.651 rows=4 loops=1) Filter: (((ps_partkey > + 16) < 1) OR (ps_partkey = 2)) Rows Removed by Filter: 3999996 > Buffers: shared hit=123005 read=98055 Planning Time: 0.926 ms > Execution Time: 1737.754 ms (9 rows) * > In this case there is brutal overestimation. Probably due planner unfriendly written predicate ps_partkey + 16 < 1) OR ps_parkey = 2. You can try to rewrite this predicate to ps_parthkey < -15 OR ps_parkey = 2 Regards Pavel > > > > > > > > > > > > > > > > > > > > > > > > * - Second query: > > QUERY > PLAN > --------------------------------------------------------------------------------------------------------------------------------------------------- > Group (cost=250110.68..350438.93 rows=399262 width=144) (actual > time=400.353..400.361 rows=4 loops=1) Group Key: ps_suppkey, ps_partkey > Buffers: shared hit=5481 read=24093 -> Gather Merge > (cost=250110.68..346446.31 rows=798524 width=144) (actual > time=400.351..406.741 rows=4 loops=1) Workers Planned: 2 > Workers Launched: 2 Buffers: shared hit=15151 read=72144 > -> Group (cost=249110.66..253276.80 rows=399262 width=144) > (actual time=395.882..395.883 rows=1 loops=3) Group Key: > ps_suppkey, ps_partkey Buffers: shared hit=15151 read=72144 > -> Sort (cost=249110.66..250499.37 rows=555486 width=144) > (actual time=395.880..395.881 rows=1 loops=3) Sort > Key: ps_suppkey, ps_partkey Sort Method: quicksort > Memory: 25kB Worker 0: Sort Method: quicksort > Memory: 25kB Worker 1: Sort Method: quicksort > Memory: 25kB Buffers: shared hit=15151 read=72144 > -> Parallel Seq Scan on partsupp > (cost=0.00..116363.88 rows=555486 width=144) (actual time=395.518..395.615 > rows=1 loops=3) Filter: (((ps_partkey + 16) < 1) > OR (ps_partkey = 2)) Rows Removed by Filter: > 1333332 Buffers: shared hit=15065 read=72136 > Planning Time: 0.360 ms Execution Time: 406.880 ms (22 rows) *Expected > Behavior Since these two queries are semantically equivalent, we were > hoping that PostgreSQL would evaluate them in roughly the same amount of > time. It looks to me that different order of group by clauses triggers > different plans: when the group by clauses (ps_partkey, ps_suppkey) is the > same as the covering index, it will trigger an index scan on associated > columns; however, when the group by clauses have different order than the > covering index (ps_suppkey, ps_partkey), the index scan will not be > triggered. Given that the user might not pay close attention to this subtle > difference, I was wondering if it is worth making these two queries have > the same and predictable performance on Postgresql. *Test Environment > Ubuntu 20.04 machine "Linux panda 5.4.0-40-generic #44-Ubuntu SMP Tue Jun > 23 00:01:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux" PostgreSQL v12.3 > Database: TPC-H benchmark (with scale factor 5) The description of table > partsupp is as follows: tpch5=# \d partsupp; Table > "public.partsupp" Column | Type | Collation | > Nullable | Default > ---------------+------------------------+-----------+----------+--------- > ps_partkey | integer | | not null | > ps_suppkey | integer | | not null | > ps_availqty | integer | | not null | > ps_supplycost | numeric(15,2) | | not null | > ps_comment | character varying(199) | | not null | Indexes: > "partsupp_pkey" PRIMARY KEY, btree (ps_partkey, ps_suppkey) Foreign-key > constraints: "partsupp_fk1" FOREIGN KEY (ps_suppkey) REFERENCES > supplier(s_suppkey) "partsupp_fk2" FOREIGN KEY (ps_partkey) REFERENCES > part(p_partkey) Referenced by: TABLE "lineitem" CONSTRAINT > "lineitem_fk2" FOREIGN KEY (l_partkey, l_suppkey) REFERENCES > partsupp(ps_partkey, ps_suppkey) *Here are the steps for reproducing our > observations: 1. Download the dataset from the link: > https://drive.google.com/file/d/13rFa1BNDi4e2RmXBn-yEQkcqt6lsBu1c/view?usp=sharing > <https://drive.google.com/file/d/13rFa1BNDi4e2RmXBn-yEQkcqt6lsBu1c/view?usp=sharing> > 2. Set up TPC-H benchmark tar xzvf tpch5_postgresql.tar.gz cd > tpch5_postgresql db=tpch5 createdb $db psql -d $db < dss.ddl for i in `ls > *.tbl` do echo $i name=`echo $i|cut -d'.' -f1` psql -d $db -c > "COPY $name FROM '`pwd`/$i' DELIMITER '|' ENCODING 'LATIN1';" done psql -d > $db < dss_postgres.ri 1. Execute the queries * >