On Mon, 1 Jul 2024 at 21:17, Andy Fan <zhihuifan1...@163.com> wrote:
> Yet another wonderful optimization! I just want to know how you found
> this optimization (CPU cache hit) case and decided it was worth some
> time, because before we invest our time in optimizing something, it is
> better to know that we can get some measurable improvement for the
> time spent. As for this case, 30% is a really huge number, even if it
> is an artificial case.
>
> Another case: Andrew introduced NullableDatum 5 years ago and said
> using it in TupleTableSlot could be CPU-cache friendly. I can follow
> that, but how much could it improve in an ideal case -- is it possible
> to forecast that somehow? I ask here because both cases are optimizing
> for the CPU cache.
Have a look at:

perf stat --pid=<backend pid>

On my AMD Zen4 machine running the 16 extra column test from the script
in my last email, I see:

$ echo master && perf stat --pid=389510 sleep 10
master

 Performance counter stats for process id '389510':

          9990.65 msec task-clock:u              #    0.999 CPUs utilized
                0      context-switches:u        #    0.000 /sec
                0      cpu-migrations:u          #    0.000 /sec
                0      page-faults:u             #    0.000 /sec
      49407204156      cycles:u                  #    4.945 GHz
         18529494      stalled-cycles-frontend:u #    0.04% frontend cycles idle
          8505168      stalled-cycles-backend:u  #    0.02% backend cycles idle
     165442142326      instructions:u            #    3.35  insn per cycle
                                                 #    0.00  stalled cycles per insn
      39409877343      branches:u                #    3.945 G/sec
        146350275      branch-misses:u           #    0.37% of all branches

      10.001012132 seconds time elapsed

$ echo patched && perf stat --pid=380216 sleep 10
patched

 Performance counter stats for process id '380216':

          9989.14 msec task-clock:u              #    0.998 CPUs utilized
                0      context-switches:u        #    0.000 /sec
                0      cpu-migrations:u          #    0.000 /sec
                0      page-faults:u             #    0.000 /sec
      49781280456      cycles:u                  #    4.984 GHz
         22922276      stalled-cycles-frontend:u #    0.05% frontend cycles idle
         24259785      stalled-cycles-backend:u  #    0.05% backend cycles idle
     213688149862      instructions:u            #    4.29  insn per cycle
                                                 #    0.00  stalled cycles per insn
      44147675129      branches:u                #    4.420 G/sec
         14282567      branch-misses:u           #    0.03% of all branches

      10.005034271 seconds time elapsed

You can see the branch predictor has done a *much* better job in the
patched code vs master, with about 10x fewer misses. That should have
helped contribute to the "insn per cycle" increase. 4.29 is quite good
for postgres; I often see that figure around 0.5. According to [1]
(relating to Zen4), "We get a ridiculous 12 NOPs per cycle out of the
micro-op cache". I'm unsure how micro-ops translate to the "insn per
cycle" shown in perf stat. I thought 4-5 was about the maximum pipeline
width of today's era of CPUs. Maybe someone else can explain better
than I can.
In simpler terms: generally, the higher the "insn per cycle", the
better; likewise, the lower the stall and branch-miss percentages, the
better. However, you'll notice that the patched version has more
frontend and backend stalls. I assume this is because the improved
branch prediction lets more instructions complete per cycle, causing
memory and instruction-decode stalls to occur more frequently;
effectively (I think) it's just hitting the next bottleneck(s): memory
and instruction decoding. At least, modern CPUs should be able to
out-pace RAM in many workloads, so perhaps it's not that surprising
that "backend cycles idle" has gone up given such a large increase in
instructions per cycle from the improved branch prediction.

It would be nice to see this tested on some modern Intel CPU. A 13th
or 14th series, for example, or even any Intel from the past 5 years
would be better than nothing.

David

[1] https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-frontend-and-execution-engine/