On Fri, Jan 22, 2021 at 5:16 PM Tang, Haiying <tanghy.f...@cn.fujitsu.com> wrote:
>
> Hi Bharath,
>
> I'm trying to take some performance measurements on your patch v23.
> But when I started, I found an issue with unbalanced tuple distribution among the workers (99% of the tuples are read by one worker) in a specific case, which makes the "parallel select" part yield no performance gain.
> I then found that it's not introduced by your patch, because it also happens on master (HEAD). But I don't know how to deal with it, so I'm putting it here to see if anybody knows what's going wrong or has good ideas to deal with this issue.
>
> Here are the conditions to reproduce the issue:
> 1. A high-spec CPU environment (say, above 20 processors). On smaller CPUs it also happens, but less obviously (40% of the tuples on one worker in my tests).
> 2. The query plan is "serial insert + parallel select"; I have reproduced this behavior with CTAS, SELECT INTO, and INSERT INTO ... SELECT.
> 3. The select part needs to read a large data set (e.g. 100 million rows out of 200 million).
>
> Given the above, IMHO, I guess it may be caused by the leader's write rate not keeping up with the workers' read rate, so the tuples of one worker get stuck in its queue and pile up more and more.
>
> Below is my test info:
> 1. Test environment
> CentOS 8.2, 128G RAM, 40 processors, SAS disk
>
> 2. Test data preparation
> create table x(a int, b int, c int);
> create index on x(a);
> insert into x select generate_series(1,200000000),floor(random()*(10001-1)+1),floor(random()*(10001-1)+1);
>
> 3. Test results
> *Patched CTAS*: please look at worker 2; 99% of the tuples are read by it.
> explain analyze verbose create table test(a,b,c) as select a,floor(random()*(10001-1)+1),c from x where b%2=0;
>                                                            QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------
>  Gather  (cost=1000.00..1942082.77 rows=1000001 width=16) (actual time=0.203..24023.686 rows=100006268 loops=1)
>    Output: a, floor(((random() * '10000'::double precision) + '1'::double precision)), c
>    Workers Planned: 4
>    Workers Launched: 4
>    ->  Parallel Seq Scan on public.x  (cost=0.00..1831082.66 rows=250000 width=8) (actual time=0.016..4367.035 rows=20001254 loops=5)
>          Output: a, c
>          Filter: ((x.b % 2) = 0)
>          Rows Removed by Filter: 19998746
>          Worker 0:  actual time=0.016..19.265 rows=94592 loops=1
>          Worker 1:  actual time=0.027..31.422 rows=94574 loops=1
>          Worker 2:  actual time=0.014..21744.549 rows=99627749 loops=1
>          Worker 3:  actual time=0.015..19.347 rows=94586 loops=1
>  Planning Time: 0.098 ms
>  Execution Time: 91054.828 ms
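(If the queue-backpressure theory above is right, a plan in which the workers send only a handful of tuples back to the leader should not show this skew. A minimal probe on the same table x, as a sketch only; I haven't measured this myself:

-- sketch: with a parallel aggregate, workers return only partial counts,
-- so the tuple queues stay near-empty and the leader is never the bottleneck
explain (analyze, verbose) select count(*) from x where b % 2 = 0;

Under the hypothesis, the per-worker "rows" of the Parallel Seq Scan here should come out nearly even, rather than 99% on one worker.)
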
> *Non-patched CTAS*: please look at worker 0; again, 99% of the tuples are read by it.
> explain analyze verbose create table test(a,b,c) as select a,floor(random()*(10001-1)+1),c from x where b%2=0;
>                                                            QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------
>  Gather  (cost=1000.00..1942082.77 rows=1000001 width=16) (actual time=0.283..19216.157 rows=100003148 loops=1)
>    Output: a, floor(((random() * '10000'::double precision) + '1'::double precision)), c
>    Workers Planned: 4
>    Workers Launched: 4
>    ->  Parallel Seq Scan on public.x  (cost=0.00..1831082.66 rows=250000 width=8) (actual time=0.020..4380.360 rows=20000630 loops=5)
>          Output: a, c
>          Filter: ((x.b % 2) = 0)
>          Rows Removed by Filter: 19999370
>          Worker 0:  actual time=0.013..21805.647 rows=99624833 loops=1
>          Worker 1:  actual time=0.016..19.790 rows=94398 loops=1
>          Worker 2:  actual time=0.013..35.340 rows=94423 loops=1
>          Worker 3:  actual time=0.035..19.849 rows=94679 loops=1
>  Planning Time: 0.083 ms
>  Execution Time: 91151.097 ms
>
> I'm still working on the performance tests on your patch; if I make some progress, I will post my results here.

Thanks a lot for the tests. In your test case, parallel insertions are not being picked because the Gather node has some projections (floor(((random() * '10000'::double precision) + '1'::double precision))) to perform. That's expected. Whenever parallel insertions are chosen for CTAS, we should see a "Create target_table" line under the Gather node [1], and the actual row count for the Gather node should be 0 (but in your test it is rows=100006268) in the explain analyze output. Coming to your test case, if it's modified to something like [1], where the Gather node has no projections, then parallel insertions will be chosen.
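As far as I know, that projection has to stay above the Gather node because random() is parallel-restricted, so it may only be evaluated in the leader. A quick catalog check shows this (a sketch; the query is mine, not part of the test above):

select proname, proparallel from pg_proc where proname = 'random';
-- proparallel = 'r' (parallel restricted): allowed in a parallel plan,
-- but only in the leader, i.e. above the Gather node
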
[1] I did this test on my development system; I will run it on some performance system and post my observations here.

postgres=# explain (analyze, verbose) create table test(a,b,c) as select a,b,c from x where b%2=0;
                                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..3846.71 rows=1000 width=12) (actual time=5581.308..5581.379 rows=0 loops=1)
   Output: a, b, c
   Workers Planned: 1
   Workers Launched: 1
   ->  Create test
   ->  Parallel Seq Scan on public.x  (cost=0.00..2846.71 rows=588 width=12) (actual time=0.014..29.512 rows=50023 loops=2)
         Output: a, b, c
         Filter: ((x.b % 2) = 0)
         Rows Removed by Filter: 49977
         Worker 0:  actual time=0.015..29.751 rows=49419 loops=1
 Planning Time: 1574.584 ms
 Execution Time: 6437.562 ms
(12 rows)

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com