On Mon, Jul 1, 2024 at 5:07 PM David Rowley <dgrowle...@gmail.com> wrote:
> cycles idle
>       8505168      stalled-cycles-backend:u  #  0.02% backend cycles idle
>  165442142326      instructions:u            #  3.35  insn per cycle
>                                              #  0.00  stalled cycles per insn
>   39409877343      branches:u                #  3.945 G/sec
>     146350275      branch-misses:u           #  0.37% of all branches
>
> patched
> cycles idle
>      24259785      stalled-cycles-backend:u  #  0.05% backend cycles idle
>  213688149862      instructions:u            #  4.29  insn per cycle
>                                              #  0.00  stalled cycles per insn
>   44147675129      branches:u                #  4.420 G/sec
>      14282567      branch-misses:u           #  0.03% of all branches
>
> You can see the branch predictor has done a *much* better job in the
> patched code vs master with about 10x fewer misses. This should have
> helped contribute to the "insn per cycle" increase. 4.29 is quite
> good for postgres. I often see that around 0.5. According to [1]
> (relating to Zen4), "We get a ridiculous 12 NOPs per cycle out of the
> micro-op cache". I'm unsure how micro-ops translate to "insn per
> cycle" that's shown in perf stat. I thought 4-5 was about the maximum
> pipeline size from today's era of CPUs.

Nice!

"insn per cycle" counts micro-ops retired, i.e. it excludes those
executed speculatively on a mispredicted branch. That article mentions
that 6 micro-ops per cycle can enter the backend from the frontend, but
that can happen only with internally cached ops, since only 4
instructions per cycle can be decoded. In specific cases, CPUs can fuse
multiple front-end instructions into a single macro-op, which I think
means a pair of micro-ops that can "travel together" as one. The authors
concluded further down that "Zen 4's reorder buffer is also special,
because each entry can hold up to 4 NOPs. Pairs of NOPs are likely fused
by the decoders, and pairs of fused NOPs are fused again at the rename
stage."