Hi David,

> Over the past few days I've been gathering some benchmark results
> together to show the sort performance improvements in PG15 [1].
> One of the test cases I did was to demonstrate Heikki's change to use
> a k-way merge (65014000b).

> The test I did to try this out was along the lines of:

> set max_parallel_workers_per_gather = 0;
> create table t (a bigint not null, b bigint not null, c bigint not
> null, d bigint not null, e bigint not null, f bigint not null);
> insert into t select x,x,x,x,x,x from generate_Series(1,140247142) x; -- 10GB!
> vacuum freeze t;

> The query I ran was:

> select * from t order by a offset 140247142;

I redid this test here:

Windows 10 64 bits
msvc 2019 64 bits
RAM 8GB
SSD 256 GB

HEAD (default configuration):
Time: 229396,551 ms (03:49,397)

PATCHED:
Time: 220887,346 ms (03:40,887)

> I tested various sizes of work_mem starting at 4MB and doubled that
> all the way to 16GB. For many of the smaller values of work_mem the
> performance is vastly improved by Heikki's change, however for
> work_mem = 64MB I detected quite a large slowdown. PG14 took 20.9
> seconds and PG15 beta 1 took 29 seconds!

> I've been trying to get to the bottom of this today and finally have
> discovered this is due to the tuple size allocations in the sort being
> exactly 64 bytes. Prior to 40af10b57 (Use Generation memory contexts
> to store tuples in sorts) the tuple for the sort would be stored in an
> aset context. After 40af10b57 we'll use a generation context. The
> idea with that change is that the generation context does no
> power-of-2 round ups for allocations, so we save memory in most cases.
> However, due to this particular test having a tuple size of 64-bytes,
> there was no power-of-2 wastage with aset.

> The problem is that generation chunks have a larger chunk header than
> aset do due to having to store the block pointer that the chunk
> belongs to so that GenerationFree() can increment the nfree chunks in
> the block. aset.c does not require this as freed chunks just go onto a
> freelist that's global to the entire context.

> Basically, for my test query, the slowdown is because instead of being
> able to store 620702 tuples per tape over 226 tapes with an aset
> context, we can now only store 576845 tuples per tape resulting in
> requiring 244 tapes when using the generation context.

> If I had added column "g" to make the tuple size 72 bytes causing
> aset's code to round allocations up to 128 bytes and generation.c to
> maintain the 72 bytes then the sort would have stored 385805 tuples
> over 364 batches for aset and 538761 tuples over 261 batches using the
> generation context. That would have been a huge win.

> So it basically looks like I discovered a very bad case that causes a
> significant slowdown. Yet other cases that are not an exact power of
> 2 stand to gain significantly from this change.

> One thing 40af10b57 does is stops those terrible performance jumps
> when the tuple size crosses a power-of-2 boundary. The performance
> should be more aligned to the size of the data being sorted now...
> Unfortunately, that seems to mean regressions for large sorts with
> power-of-2 sized tuples.

It seems to me that the solution would be to use aset allocations when
the size of the tuples is a power of 2?

if (state->sortopt & TUPLESORT_ALLOWBOUNDED ||
    (state->memtupsize & (state->memtupsize - 1)) == 0)
    state->tuplecontext = AllocSetContextCreate(state->sortcontext,
                                                "Caller tuples",
                                                ALLOCSET_DEFAULT_SIZES);
else
    state->tuplecontext = GenerationContextCreate(state->sortcontext,
                                                  "Caller tuples",
                                                  ALLOCSET_DEFAULT_SIZES);

I took a look and tried some improvements to see if I had a better result.
Would you mind taking a look and testing?
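As an aside, to make the chunk-header arithmetic above concrete, here is a
rough standalone sketch (not PostgreSQL code). The header sizes are
assumptions for a typical 64-bit, non-assert build: 16 bytes for an aset.c
chunk and 24 bytes for a generation.c chunk, the extra 8 bytes being the
block pointer you mention; aset.c additionally rounds small requests up to
the next power of 2, while generation.c stores the exact size.

/*
 * Illustration only; the header sizes are assumed, not taken from the
 * PostgreSQL headers.
 */
#include <stdio.h>
#include <stddef.h>

#define ASET_CHUNK_HDR  16      /* assumed aset.c chunk header */
#define GEN_CHUNK_HDR   24      /* assumed generation.c header (+block pointer) */

/* aset.c-style power-of-2 round up for small requests */
static size_t
pow2_roundup(size_t sz)
{
    size_t      p = 8;

    while (p < sz)
        p <<= 1;
    return p;
}

int
main(void)
{
    size_t      tuple_sizes[] = {64, 72};

    for (int i = 0; i < 2; i++)
    {
        size_t      sz = tuple_sizes[i];

        printf("tuple %3zu bytes: aset %3zu bytes/tuple, generation %3zu bytes/tuple\n",
               sz,
               ASET_CHUNK_HDR + pow2_roundup(sz),
               GEN_CHUNK_HDR + sz);
    }
    return 0;
}

Under those assumptions a 64-byte tuple costs 88 bytes per tuple in a
generation context versus 80 in an aset context (the regression case), while
a 72-byte tuple costs 96 versus 144, which is why the non-power-of-2 cases
gain so much from 40af10b57.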
regards,
Ranier Vilela
CREATE TABLE t_test (x numeric);
INSERT INTO t_test SELECT random() FROM generate_series(1, 5000000);
ANALYZE;
SHOW work_mem;

HEAD:
postgres=# explain analyze SELECT * FROM t_test ORDER BY x;
                                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Gather Merge  (cost=397084.73..883229.71 rows=4166666 width=11) (actual time=1326.281..2718.040 rows=5000000 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Sort  (cost=396084.71..401293.04 rows=2083333 width=11) (actual time=1287.168..1520.520 rows=1666667 loops=3)
         Sort Key: x
         Sort Method: external merge  Disk: 24880kB
         Worker 0:  Sort Method: external merge  Disk: 24776kB
         Worker 1:  Sort Method: external merge  Disk: 23960kB
         ->  Parallel Seq Scan on t_test  (cost=0.00..47861.33 rows=2083333 width=11) (actual time=0.241..135.730 rows=1666667 loops=3)
 Planning Time: 0.054 ms
 Execution Time: 2837.789 ms
(11 rows)

PATCHED:
postgres=# explain analyze SELECT * FROM t_test ORDER BY x;
                                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Gather Merge  (cost=397084.73..883229.71 rows=4166666 width=11) (actual time=1283.818..2696.469 rows=5000000 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Sort  (cost=396084.71..401293.04 rows=2083333 width=11) (actual time=1253.459..1485.734 rows=1666667 loops=3)
         Sort Key: x
         Sort Method: external merge  Disk: 25136kB
         Worker 0:  Sort Method: external merge  Disk: 24096kB
         Worker 1:  Sort Method: external merge  Disk: 24304kB
         ->  Parallel Seq Scan on t_test  (cost=0.00..47861.33 rows=2083333 width=11) (actual time=0.256..133.065 rows=1666667 loops=3)
 Planning Time: 0.055 ms
 Execution Time: 2816.495 ms
(11 rows)

David's benchmark:

set max_parallel_workers_per_gather = 0;
create table t (a bigint not null, b bigint not null, c bigint not
null, d bigint not null, e bigint not null, f bigint not null);
insert into t select x,x,x,x,x,x from generate_Series(1,140247142) x; -- 10GB!
vacuum freeze t;
select * from t order by a offset 140247142;

HEAD:
postgres=# select * from t order by a offset 140247142;
 a | b | c | d | e | f
---+---+---+---+---+---
(0 rows)

Time: 229396,551 ms (03:49,397)
Time: 231432,750 ms (03:51,433)

PATCHED:
postgres=# explain analyze select * from t order by a offset 140247142;
 a | b | c | d | e | f
---+---+---+---+---+---
(0 rows)

Time: 220887,346 ms (03:40,887)
Time: 222652,487 ms (03:42,652)

Ronan's benchmark:

Setup 1: single table, 1 000 000 tuples, no index

CREATE TABLE tbench (
    a int,
    b int
);
INSERT INTO tbench (a, b) SELECT a, b FROM generate_series(1, 100) a, generate_series(1, 10000) b;

Test 1: Single-column ordered select (order by b since the table is already sorted by a)

select b from tbench order by b;

HEAD:
postgres=# explain analyze select b from tbench order by b;
                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=266422.45..271422.45 rows=2000000 width=4) (actual time=425.062..559.464 rows=2000000 loops=1)
   Sort Key: b
   Sort Method: external merge  Disk: 23528kB
   ->  Seq Scan on tbench  (cost=0.00..28850.00 rows=2000000 width=4) (actual time=0.056..168.422 rows=2000000 loops=1)
 Planning Time: 0.058 ms
 Execution Time: 605.315 ms
(6 rows)

PATCHED:
postgres=# explain analyze select b from tbench order by b;
                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=266422.45..271422.45 rows=2000000 width=4) (actual time=421.642..557.750 rows=2000000 loops=1)
   Sort Key: b
   Sort Method: external merge  Disk: 23528kB
   ->  Seq Scan on tbench  (cost=0.00..28850.00 rows=2000000 width=4) (actual time=0.047..167.848 rows=2000000 loops=1)
 Planning Time: 0.080 ms
 Execution Time: 603.583 ms
(6 rows)

Test 2: Ordered sum (using b so that the input is not presorted)

select sum(b order by b) from tbench;

HEAD:
postgres=# explain analyze select sum(b order by b) from tbench;
                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=33850.00..33850.01 rows=1 width=8) (actual time=547.316..547.317 rows=1 loops=1)
   ->  Seq Scan on tbench  (cost=0.00..28850.00 rows=2000000 width=4) (actual time=0.074..134.147 rows=2000000 loops=1)
 Planning Time: 0.060 ms
 Execution Time: 547.339 ms
(4 rows)

PATCHED:
postgres=# explain analyze select sum(b order by b) from tbench;
                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=33850.00..33850.01 rows=1 width=8) (actual time=544.565..544.566 rows=1 loops=1)
   ->  Seq Scan on tbench  (cost=0.00..28850.00 rows=2000000 width=4) (actual time=0.044..134.305 rows=2000000 loops=1)
 Planning Time: 0.050 ms
 Execution Time: 544.633 ms
(4 rows)

Test 3: Ordered sum + group by

select b, sum(a order by a) from tbench GROUP BY b;

HEAD:
postgres=# explain analyze select b, sum(a order by a) from tbench GROUP BY b;
                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=266422.45..281522.48 rows=10003 width=12) (actual time=740.557..1201.470 rows=10000 loops=1)
   Group Key: b
   ->  Sort  (cost=266422.45..271422.45 rows=2000000 width=8) (actual time=740.490..905.244 rows=2000000 loops=1)
         Sort Key: b
         Sort Method: external merge  Disk: 35216kB
         ->  Seq Scan on tbench  (cost=0.00..28850.00 rows=2000000 width=8) (actual time=0.040..192.818 rows=2000000 loops=1)
 Planning Time: 0.080 ms
 Execution Time: 1206.248 ms
(8 rows)

PATCHED:
postgres=# explain analyze select b, sum(a order by a) from tbench GROUP BY b;
                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=266422.45..281522.48 rows=10003 width=12) (actual time=744.886..1196.399 rows=10000 loops=1)
   Group Key: b
   ->  Sort  (cost=266422.45..271422.45 rows=2000000 width=8) (actual time=744.822..906.223 rows=2000000 loops=1)
         Sort Key: b
         Sort Method: external merge  Disk: 35216kB
         ->  Seq Scan on tbench  (cost=0.00..28850.00 rows=2000000 width=8) (actual time=0.053..198.255 rows=2000000 loops=1)
 Planning Time: 0.084 ms
 Execution Time: 1200.807 ms
(8 rows)

Setup 2: same as before, but adding an index on (b, a)

CREATE INDEX ON tbench (b, a);

Test 2: Ordered sum

select sum(a order by a) from tbench;

HEAD:
postgres=# explain analyze select sum(a order by a) from tbench;
                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=33850.00..33850.01 rows=1 width=8) (actual time=417.679..417.679 rows=1 loops=1)
   ->  Seq Scan on tbench  (cost=0.00..28850.00 rows=2000000 width=4) (actual time=0.036..134.878 rows=2000000 loops=1)
 Planning Time: 1.185 ms
 Execution Time: 417.732 ms
(4 rows)

PATCHED:
postgres=# explain analyze select sum(a order by a) from tbench;
                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=33850.00..33850.01 rows=1 width=8) (actual time=421.826..421.827 rows=1 loops=1)
   ->  Seq Scan on tbench  (cost=0.00..28850.00 rows=2000000 width=4) (actual time=0.063..134.843 rows=2000000 loops=1)
 Planning Time: 0.070 ms
 Execution Time: 421.885 ms
(4 rows)

Test 3: Ordered sum + group by

select a, sum(b order by b) from tbench GROUP BY a;

HEAD:
postgres=# explain analyze select a, sum(b order by b) from tbench GROUP BY a;
                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=199979.56..214980.56 rows=100 width=12) (actual time=502.326..1048.130 rows=100 loops=1)
   Group Key: a
   ->  Sort  (cost=199979.56..204979.56 rows=2000000 width=8) (actual time=496.952..654.468 rows=2000000 loops=1)
         Sort Key: a
         Sort Method: external merge  Disk: 35216kB
         ->  Seq Scan on tbench  (cost=0.00..28850.00 rows=2000000 width=8) (actual time=0.064..136.186 rows=2000000 loops=1)
 Planning Time: 0.137 ms
 Execution Time: 1052.026 ms
(8 rows)

PATCHED:
postgres=# explain analyze select a, sum(b order by b) from tbench GROUP BY a;
                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=199979.56..214980.56 rows=100 width=12) (actual time=485.994..1033.928 rows=100 loops=1)
   Group Key: a
   ->  Sort  (cost=199979.56..204979.56 rows=2000000 width=8) (actual time=480.617..637.878 rows=2000000 loops=1)
         Sort Key: a
         Sort Method: external merge  Disk: 35216kB
         ->  Seq Scan on tbench  (cost=0.00..28850.00 rows=2000000 width=8) (actual time=0.066..136.371 rows=2000000 loops=1)
 Planning Time: 0.117 ms
 Execution Time: 1037.586 ms
(8 rows)
002-improve-sort.patch