On Sun, Jan 10, 2021 at 4:06 PM Peter Geoghegan <p...@bowt.ie> wrote: > Attached is v13, which has this tweak, and other miscellaneous cleanup > based on review from both Victor and Heikki. I consider this version > of the patch to be committable. I intend to commit something close to > it in the next week, probably no later than Thursday. I still haven't > got to the bottom of the shellsort question raised by Heikki. I intend > to do further performance validation before committing the patch.
I benchmarked the patch with one array and without the shellsort specialization using two patches on top of v13, both of which are attached. This benchmark was similar to other low cardinality index benchmarks I've run in the past (with indexes named fiver, tenner, score, plus a pgbench_accounts INCLUDE unique index instead of the regular primary key). I used pgbench scale 500, for 30 minutes, no rate limit. One run with 16 clients, another with 32 clients. Original v13: patch.r1c32.bench.out: "tps = 32709.772257 (including connections establishing)" "latency average = 0.974 ms" "latency stddev = 0.514 ms" patch.r1c16.bench.out: "tps = 34670.929998 (including connections establishing)" "latency average = 0.461 ms" "latency stddev = 0.314 ms" v13 + attached simplifying patches: patch.r1c32.bench.out: "tps = 31848.632770 (including connections establishing)" "latency average = 1.000 ms" "latency stddev = 0.548 ms" patch.r1c16.bench.out: "tps = 33864.099333 (including connections establishing)" "latency average = 0.472 ms" "latency stddev = 0.399 ms" Clearly the optimization work still has some value, since we're looking at about a 2% - 3% increase in throughput here. This seems like it makes the cost of retaining the optimizations acceptable. The difference is much less visible with a rate-limit, which is rather more realistic. To some extent the sort is hot here because the patched version of Postgres updates tuples as fast as it can, and must therefore delete from the index as fast as it can. The sort itself was consistently near the top consumer of CPU cycles according to "perf top", even if it didn't get as bad as what I saw in earlier versions of the patch months ago. There are actually two sorts involved here (not just the heapam.c shellsort). We need to sort the deltids array twice -- once in heapam.c, and a second time (to restore the original leaf-page-wise order) in nbtpage.c, using qsort(). I'm pretty sure that the latter sort also matters, though it matters less than the heapam.c shellsort. I'm going to proceed with committing the original version of the patch -- I feel that this settles it. The code would be a bit tidier without two arrays or the shellsort, but it really doesn't make that much difference. Whereas the benefit is quite visible, and will be something that all varieties of index tuple deletion see a performance benefit from (not just bottom-up deletion). -- Peter Geoghegan
0006-Experiment-One-array-again.patch
Description: Binary data
0007-Experiment-Remove-shellsort-just-use-qsort.patch
Description: Binary data