Hackers,

While reviewing and benchmarking 91e9e89dc (Make nodeSort.c use Datum sorts for single column sorts), I noticed that we store the tuple data in a separate memory context: if the sort fits in memory we just reset that context when we're done, and if it does not, we dump the tuples to disk in the same order they're added and reset the context at that point. There is a little pfree() work going on via writetup_heap(), which I think could possibly just be removed to get some additional gains.
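For reference, the pfree() work in question is the per-tuple cleanup at the tail of writetup_heap(), which from memory looks roughly like this (paraphrased, not verbatim from the current sources):

    if (!state->slabAllocatorUsed)
    {
        /* adjust the sort's memory accounting, then free the tuple's copy */
        FREEMEM(state, GetMemoryChunkSpace(tuple));
        heap_free_minimal_tuple(tuple);   /* effectively one pfree() per tuple written */
    }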
Anyway, this context that stores tuples uses the standard aset.c allocator, which has the usual power-of-2 wastage plus the additional overhead of freelists etc. I wondered how much faster it would go if I set it to use a generation context instead of an aset.c one.

After running make installcheck to create the tenk1 table and then running the attached tuplesortbench script, I get this:

Master:
work_mem = '4MB';
  Sort Method: external merge  Disk: 2496kB
work_mem = '4GB';
  Sort Method: quicksort  Memory: 5541kB

Patched:
work_mem = '4MB';
  Sort Method: quicksort  Memory: 3197kB

So it seems to save quite a bit of memory by getting away from rounding allocations up to the next power of 2.

Performance-wise, there are some pretty good gains. (Results in TPS)

work_mem = '4GB';

Test    master   gen sort   compare
Test1   317.2    665.6      210%
Test2   228.6    388.9      170%
Test3   207.4    330.7      159%
Test4   185.5    279.4      151%
Test5   292.2    563.9      193%

If I drop work_mem down to the standard setting, the unpatched version goes to disk but the patched version does not, and the gains get a little bigger.

work_mem = '4MB';

Test    master   gen sort   compare
Test1   177.5    658.2      371%
Test2   149.7    385.2      257%
Test3   137.5    330.0      240%
Test4   129.0    275.1      213%
Test5   161.7    546.4      338%

The patch is just a simple 1-liner at the moment. I likely do need to adjust what I'm passing as the blockSize to GenerationContextCreate(). Maybe a better number would be something calculated from work_mem, e.g. Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64), so that we allocate at most a 16th of work_mem per chunk, but never more than 8MB. I don't think changing this will affect the performance of the above very much.

David
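For readers without the attachment, the 1-liner presumably amounts to something like the following change to the tuple context set up in tuplesort.c's tuplesort_begin_common(). This is only a sketch based on the PG14-era memory context APIs; the actual hunk in generation_tuplesort.patch and the blockSize it passes may differ:

    -   state->tuplecontext = AllocSetContextCreate(state->sortcontext,
    -                                               "Caller tuples",
    -                                               ALLOCSET_DEFAULT_SIZES);
    +   /*
    +    * generation.c chunks aren't rounded up to a power of 2 and carry a
    +    * smaller header, which suits tuplesort's allocate-in-order,
    +    * reset-in-bulk usage pattern.
    +    */
    +   state->tuplecontext = GenerationContextCreate(state->sortcontext,
    +                                                 "Caller tuples",
    +                                                 ALLOCSET_DEFAULT_MAXSIZE);

And the work_mem-derived block size suggested above would be roughly:

    /* work_mem is in kB, so work_mem * 64 bytes is a 16th of work_mem, capped at 8MB */
    Size    blockSize = Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64);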
#!/bin/bash

sec=10

# This should improve
echo "select * from tenk1 order by two offset 1000000" > bench.sql
echo Test1
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps

# This should improve
echo "select * from tenk1 order by tenthous offset 1000000" > bench.sql
echo Test2
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps

# This should improve
echo "select * from tenk1 order by stringu1 offset 1000000" > bench.sql
echo Test3
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps

# This should improve
echo "select * from tenk1 order by stringu2 offset 1000000" > bench.sql
echo Test4
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps

# This should improve
echo "select * from tenk1 order by twenty offset 1000000" > bench.sql
echo Test5
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps
pgbench -n -f bench.sql -T $sec regression | grep tps
generation_tuplesort.patch
Description: Binary data