On Mon, Jul 1, 2024 at 5:07 PM David Rowley <dgrowle...@gmail.com> wrote:
> cycles idle
>       8505168      stalled-cycles-backend:u  #  0.02% backend cycles idle
>  165442142326      instructions:u            #  3.35  insn per cycle
>                                              #  0.00  stalled cycles per insn
>   39409877343      branches:u                #  3.945 G/sec
>     146350275      branch-misses:u           #  0.37% of all branches
>
> patched
> cycles idle
>      24259785      stalled-cycles-backend:u  #  0.05% backend cycles idle
>  213688149862      instructions:u            #  4.29  insn per cycle
>                                              #  0.00  stalled cycles per insn
>   44147675129      branches:u                #  4.420 G/sec
>      14282567      branch-misses:u           #  0.03% of all branches
>
> You can see the branch predictor has done a *much* better job in the
> patched code vs master with about 10x fewer misses. This should have
> helped contribute to the "insn per cycle" increase. 4.29 is quite
> good for postgres. I often see that around 0.5. According to [1]
> (relating to Zen4), "We get a ridiculous 12 NOPs per cycle out of the
> micro-op cache". I'm unsure how micro-ops translate to "insn per
> cycle" that's shown in perf stat. I thought 4-5 was about the maximum
> pipeline size from today's era of CPUs.

Nice!

"insn per cycle" counts micro-ops retired, i.e. it excludes those
executed speculatively on a mispredicted branch. That article mentions
that 6 micro-ops per cycle can enter the backend from the frontend, but
that can happen only with internally cached ops, since only 4
instructions per cycle can be decoded. In specific cases, CPUs can fuse
multiple front-end instructions into a single macro-op, which I think
means a pair of micro-ops that can "travel together" as one. The authors
concluded further down that "Zen 4's reorder buffer is also special,
because each entry can hold up to 4 NOPs. Pairs of NOPs are likely fused
by the decoders, and pairs of fused NOPs are fused again at the rename
stage."