Hi,

On 2021-07-20 16:50:09 +1200, David Rowley wrote:
> I've not taken the time to study the patch but I was running some
> other benchmarks today on a small scale pgbench readonly test and I
> took this patch for a spin to see if I could see the same performance
> gains.

Thanks!

> This is an AMD 3990x machine that seems to get the most throughput
> from pgbench with 132 processes
>
> I did: pgbench -T 240 -P 10 -c 132 -j 132 -S -M prepared
> --random-seed=12345 postgres
>
> master = dd498998a
>
> Master: 3816959.53 tps
> Patched: 3820723.252 tps
>
> I didn't quite get the same 2-3% as you did, but it did come out
> faster than on master.

It would not be at all surprising to me if AMD, in recent microarchitectures,
did a better job at removing stack management overhead (e.g. by better register
renaming, or by resolving dependencies on %rsp in a smarter way) than Intel
has. My numbers were from a Cascade Lake CPU (Xeon 5215), which, despite being
released in 2019, effectively is a moderately polished (or maybe shoehorned)
microarchitecture from 2015 due to all the Intel troubles, whereas Zen2 is
from 2019.

It's also possible that my attempts at avoiding the stack management just
didn't work with your compiler, either due to vendor (I know that gcc is better
at it than clang), version, or compiler flags (e.g. -fno-omit-frame-pointer
could make it harder, -fno-optimize-sibling-calls would disable it).

A third plausible explanation for the difference is that at a client count of
132, the bottlenecks lie sufficiently elsewhere that memory management
efficiency improvements just don't show a meaningful gain.

Any chance you could show a `perf annotate AllocSetAlloc` and `perf annotate
palloc` from a patched run? And perhaps how high their percentages of the
total work are, e.g. using something like

  perf report -g none|grep -E 'AllocSetAlloc|palloc|MemoryContextAlloc|pfree'

It'd be interesting to know where the bottlenecks on a Zen2 machine are.

Greetings,

Andres Freund