On Mon, 1 Jul 2024 at 21:17, Andy Fan <zhihuifan1...@163.com> wrote:
> Yet another wonderful optimization! I just want to know how you found
> this optimization (CPU cache hit) case and decided it was worth some
> time, because before we invest our time in optimizing something, it is
> better to know that we can get some measurable improvement for the
> time spent. As for this case, 30% is a really huge number, even if it
> is an artificial case.
>
> Another case: Andrew introduced NullableDatum 5 years ago and said
> using it in TupleTableSlot could be CPU-cache friendly. I can follow
> that, but how much could it improve in an ideal case -- is it possible
> to forecast that somehow? I ask here because both cases are optimizing
> for the CPU cache.
Have a look at:

perf stat --pid=<backend pid>

On my AMD Zen4 machine running the 16 extra column test from the script
in my last email, I see:

$ echo master && perf stat --pid=389510 sleep 10
master

 Performance counter stats for process id '389510':

          9990.65 msec task-clock:u              #    0.999 CPUs utilized
                0      context-switches:u        #    0.000 /sec
                0      cpu-migrations:u          #    0.000 /sec
                0      page-faults:u             #    0.000 /sec
      49407204156      cycles:u                  #    4.945 GHz
         18529494      stalled-cycles-frontend:u #    0.04% frontend cycles idle
          8505168      stalled-cycles-backend:u  #    0.02% backend cycles idle
     165442142326      instructions:u            #    3.35  insn per cycle
                                                 #    0.00  stalled cycles per insn
      39409877343      branches:u                #    3.945 G/sec
        146350275      branch-misses:u           #    0.37% of all branches

      10.001012132 seconds time elapsed

$ echo patched && perf stat --pid=380216 sleep 10
patched

 Performance counter stats for process id '380216':

          9989.14 msec task-clock:u              #    0.998 CPUs utilized
                0      context-switches:u        #    0.000 /sec
                0      cpu-migrations:u          #    0.000 /sec
                0      page-faults:u             #    0.000 /sec
      49781280456      cycles:u                  #    4.984 GHz
         22922276      stalled-cycles-frontend:u #    0.05% frontend cycles idle
         24259785      stalled-cycles-backend:u  #    0.05% backend cycles idle
     213688149862      instructions:u            #    4.29  insn per cycle
                                                 #    0.00  stalled cycles per insn
      44147675129      branches:u                #    4.420 G/sec
         14282567      branch-misses:u           #    0.03% of all branches

      10.005034271 seconds time elapsed

You can see the branch predictor has done a *much* better job in the
patched code vs master, with about 10x fewer misses. That should have
helped contribute to the "insn per cycle" increase. 4.29 is quite good
for postgres; I often see that figure around 0.5. According to [1]
(relating to Zen4), "We get a ridiculous 12 NOPs per cycle out of the
micro-op cache". I'm unsure how micro-ops translate to the "insn per
cycle" shown in perf stat. I thought 4-5 was about the maximum pipeline
width of today's era of CPUs. Maybe someone else can explain better
than I can.
In simpler terms: generally, the higher the "insn per cycle", the
better; likewise, the lower the stall and branch-miss percentages, the
better. However, you'll notice that the patched version has more
frontend and backend stalls. I assume this is because the improved
branch prediction lets more instructions complete per cycle, causing
memory and instruction-decode stalls to occur more frequently;
effectively (I think) it's just hitting the next bottleneck(s): memory
and instruction decoding. At least, modern CPUs should be able to
out-pace RAM in many workloads, so perhaps it's not that surprising
that "backend cycles idle" has gone up given such a large increase in
instructions per cycle from the improved branch prediction.

It would be nice to see this tested on some modern Intel CPU. A 13th
or 14th series, for example, or even any Intel from the past 5 years
would be better than nothing.

David

[1] https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-frontend-and-execution-engine/