On Wed, 4 Aug 2021 at 02:10, Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
> A review would be nice, although it can wait - It'd be interesting to
> know if those patches help with the workload(s) you've been looking at.
I tried out the v2 set of patches using the attached scripts. The
attached spreadsheet includes the original tests and compares master
with the patch which uses the generation context vs that patch plus
your v2 patch.

I've also included 4 additional tests, each of which starts with a
1 column table and then adds another 32 columns, testing the
performance after adding each additional column. I did this because I
wanted to see if the performance was more similar to master when the
allocations had less power of 2 wastage from allocset. If, for
example, you look at row 123 of the spreadsheet, you can see that both
patched and unpatched the allocations were 272 bytes each, yet there
was still a 50% performance improvement with just the generation
context patch when compared to master.

Looking at the spreadsheet, you'll also notice that in the 2 column
test of each of the 4 new tests, the number of bytes used for each
allocation is larger with the generation context: 56 vs 48. This is
due to the GenerationChunk struct being larger than the AllocSet's
version by 8 bytes, because it also stores a pointer to the owning
GenerationBlock (a simplified sketch of the two chunk headers is
included at the end of this mail). So with the patch there are some
cases where we'll use slightly more memory.

Additional tests:

1. Sort 10000 tuples on a column with values 0-99 in memory.
2. As #1 but with 1 million tuples.
3. As #1 but with a large OFFSET to remove the overhead of sending to
   the client.
4. As #2 but with a large OFFSET.

Test #3 above is the most similar one to the original tests and shows
similar gains. When the sort becomes larger (the 1 million tuple
test), the gains reduce. This indicates the gains are coming from
improved CPU cache efficiency due to the removal of the power of 2
wastage in memory allocations.

All of the tests show that the patches to improve the allocation
efficiency of generation.c don't help to improve the results of these
test cases. I wondered if it's maybe worth trying to see what happens
if, instead of doubling the allocations each time, we quadruple them
instead. I didn't try this.

David
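For reference, here's a rough sketch of the two chunk headers that
accounts for that 8 byte difference. This is my own simplified
illustration of the non-assert-build layouts in aset.c and
generation.c (the MEMORY_CONTEXT_CHECKING fields and alignment padding
are left out), so don't treat it as the exact source declarations:

    /*
     * Simplified sketch of the per-chunk headers on a 64-bit,
     * non-assert build, to show where the extra 8 bytes per
     * allocation comes from.  Illustrative only.
     */
    #include <stdio.h>
    #include <stddef.h>

    typedef size_t Size;

    /* aset.c: chunk size plus a pointer back to the owning set */
    typedef struct AllocChunkData
    {
        Size    size;    /* usable space in the chunk */
        void   *aset;    /* owning set, or freelist link when free */
    } AllocChunkData;    /* 16 bytes */

    /* generation.c: additionally tracks the block the chunk is on */
    typedef struct GenerationChunk
    {
        Size    size;    /* usable space in the chunk */
        void   *block;   /* GenerationBlock owning this chunk */
        void   *context; /* owning GenerationContext, NULL when freed */
    } GenerationChunk;   /* 24 bytes */

    int
    main(void)
    {
        printf("AllocChunkData:  %zu bytes\n", sizeof(AllocChunkData));
        printf("GenerationChunk: %zu bytes\n", sizeof(GenerationChunk));
        return 0;
    }

On a 64-bit build this should print 16 and 24, which lines up with the
48 vs 56 bytes per allocation seen in the 2 column test once the
32 byte allocation itself is added on top of each header.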
#!/bin/bash

# Benchmark sorting table t, starting with a single bigint column and
# adding another 32 bigint columns one at a time.  After each added
# column, report the sort's memory usage and the pgbench tps.

sec=5
dbname=postgres
records=1000000
mod=100

psql -c "drop table if exists t" $dbname
psql -c "create table t (a bigint not null)" $dbname
psql -c "insert into t select x % $mod from generate_series(1,$records) x" $dbname
psql -c "vacuum freeze t" $dbname

for i in {1..32}
do
	echo $i
	# Report whether the sort fits in memory and how much it uses.
	psql -c "explain analyze select * from t order by a offset 1000000000" $dbname | grep -E "Memory|Disk"
	# The large OFFSET removes the overhead of sending rows to the client.
	echo "select * from t order by a offset 1000000000" > bench.sql
	for loops in {1..3}
	do
		pgbench -n -M prepared -T $sec -f bench.sql $dbname | grep tps
	done
	psql -c "alter table t add column c$i bigint" $dbname
	psql -c "update t set c$i = a" $dbname
	psql -c "vacuum full t" $dbname
	psql -c "vacuum freeze t" $dbname
done
generation context tuplesort.ods
#!/bin/bash

# Same benchmark as above, but with a 10000 row table so the sort
# stays small.

sec=5
dbname=postgres
records=10000
mod=100

psql -c "drop table if exists t" $dbname
psql -c "create table t (a bigint not null)" $dbname
psql -c "insert into t select x % $mod from generate_series(1,$records) x" $dbname
psql -c "vacuum freeze t" $dbname

for i in {1..32}
do
	echo $i
	psql -c "explain analyze select * from t order by a offset 1000000000" $dbname | grep -E "Memory|Disk"
	echo "select * from t order by a offset 1000000000" > bench.sql
	for loops in {1..3}
	do
		pgbench -n -M prepared -T $sec -f bench.sql $dbname | grep tps
	done
	psql -c "alter table t add column c$i bigint" $dbname
	psql -c "update t set c$i = a" $dbname
	psql -c "vacuum full t" $dbname
	psql -c "vacuum freeze t" $dbname
done