Hi Hackers,

Recently, I took some performance measurements for CREATE TABLE AS. 
https://www.postgresql.org/message-id/34549865667a4a3bb330ebfd035f85d3%40G08CNEXMBPEKD05.g08.fujitsu.local

While doing so, I found an issue with unbalanced tuple distribution among 
workers (99% of the tuples are read by one worker) in a specific case, which 
causes the underlying parallel select to deliver none of the performance gain 
we expect.
It's happening on master (HEAD). 

I don't think this is normal behavior, because we pay the costs that parallel 
mode requires but don't get the benefits we want.
So, is there a way to improve this so that we get the same benefits as plain 
parallel select?

Below are the test details:
1. Test environment
  CentOS 8.2, 128GB RAM, 40 processors (Intel(R) Xeon(R) Silver 4210 CPU @ 
2.20GHz), SAS disk
2. Test execution
  PSA test_unbalance_workers.sql, which includes my test data and steps.
3. Test results
  CREATE TABLE ... AS SELECT ..., where the underlying select queries 130 
million rows (about 5GB) out of 200 million rows (about 8GB of source data). 
Each case was run 30 times.

|max_parallel_workers_per_gather     | Execution Time (ms) | %reg |
|------------------------------------|---------------------|------|
|max_parallel_workers_per_gather = 2 |          141002.030 | -1%  |
|max_parallel_workers_per_gather = 4 |          140957.221 | -1%  |
|max_parallel_workers_per_gather = 8 |          142445.061 |  0%  |
|max_parallel_workers_per_gather = 0 |          142580.535 | base |
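
For reference, the core of the attached script is roughly along these lines. 
This is a hedged sketch, not the script itself: only the table name x1, the 
columns a/b/c, func_restricted(), the 200-million row count, and the CTAS 
statement come from this mail; the column types and the function body are 
assumptions.

```sql
-- Hypothetical reconstruction of the test setup (column types and the
-- function body are assumptions; names come from the query plan below).
CREATE TABLE x1 (a int, b int, c int);
INSERT INTO x1 SELECT i, i, i FROM generate_series(1, 200000000) i;

-- A PARALLEL RESTRICTED function in the target list must be evaluated in
-- the leader, which yields the "serial insert + parallel select" shape:
-- the scan runs under Gather, but the function and the insert do not.
CREATE FUNCTION func_restricted() RETURNS int AS
$$ SELECT 1 $$ LANGUAGE sql PARALLEL RESTRICTED;

ANALYZE x1;

EXPLAIN ANALYZE VERBOSE
CREATE TABLE test1 AS
SELECT func_restricted(), b, c FROM x1 WHERE a % 2 = 0 OR a % 3 = 0;
```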

According to the above results, we get almost no benefit, especially when 
max_parallel_workers_per_gather is increased to 8 or more.
I think parallel select fails to deliver the desired performance because the 
tuples are distributed unevenly among the workers, as shown in the query plan.

Query plan:
max_parallel_workers_per_gather = 8; note worker 4, which reads 99% of the tuples.
postgres=# explain analyze verbose create table test1 as select 
func_restricted(),b,c from x1 where a%2=0 or a%3=0;     
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..2351761.77 rows=1995002 width=12) (actual 
time=0.411..64451.616 rows=133333333 loops=1)
   Output: func_restricted(), b, c
   Workers Planned: 7
   Workers Launched: 7
   ->  Parallel Seq Scan on public.x1  (cost=0.00..1652511.07 rows=285000 
width=8) (actual time=0.016..3553.626 rows=16666667 loops=8)
         Output: b, c
         Filter: (((x1.a % 2) = 0) OR ((x1.a % 3) = 0))
         Rows Removed by Filter: 8333333
         Worker 0:  actual time=0.014..21.415 rows=126293 loops=1
         Worker 1:  actual time=0.015..21.564 rows=126293 loops=1
         Worker 2:  actual time=0.014..21.575 rows=126294 loops=1
         Worker 3:  actual time=0.016..21.701 rows=126293 loops=1
         Worker 4:  actual time=0.019..28263.677 rows=132449393 loops=1
         Worker 5:  actual time=0.019..21.470 rows=126180 loops=1
         Worker 6:  actual time=0.015..34.441 rows=126293 loops=1
 Planning Time: 0.210 ms
 Execution Time: 142392.808 ms

Occurrence conditions:
1. The query plan has the shape "serial insert + parallel select".
2. The underlying select queries a large amount of data (e.g. 130 million rows 
out of 200 million). From what I've tested so far, it doesn't happen with small 
data sizes (millions of rows).

Given the above, IMHO, my guess is that the leader's write rate can't keep up 
with the workers' read rate, so tuples from one worker back up in its queue 
and accumulate more and more.
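
One way to test this guess (a sketch of my own, not part of the attached 
script) is to run the same select without the serial CTAS on top. If the skew 
comes from leader-side backpressure while writing the table, then with the 
leader only gathering tuples, the per-worker row counts in EXPLAIN (ANALYZE, 
VERBOSE) should come out roughly even:

```sql
-- Same underlying select, but with the serial insert removed: the leader
-- only drains the tuple queues, so if the skew is caused by the leader
-- being too slow to write, it should disappear here.
EXPLAIN (ANALYZE, VERBOSE)
SELECT func_restricted(), b, c FROM x1 WHERE a % 2 = 0 OR a % 3 = 0;
```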

Any thoughts ?

Regards,
Tang

Attachment: test_unbalance_workers.sql
