Hi!

A while ago Steve complained about x86 being weird for having different NOPs [1].
Having cursed the same thing before, I figured it was time to look at the NOP situation.

32bit simply isn't a performance target anymore, so all we need is a set of NOPs that works on all x86_64 chips.

x86_64 has two main NOP variants, NOPL and prefix NOP. NOPL was introduced by P6 and is architecturally mandated for x86_64. However, some uarchs made the choice to limit NOPL decoding to a single port, which obviously limits NOPL throughput. Other uarchs have (severe) decoding penalties for excessive (>~3) prefixes, hobbling prefix NOP throughput.

But the thing is, all the modern uarchs can handle both without issue; that is AMD K10 (2007) and later and Intel Ivy Bridge (2012) and later. The only exception is Atom, which has the prefix penalty.

Since ultimate performance of a 10-year-old chip (Intel Sandy Bridge, 2011) is simply irrelevant today, remove the variable NOPs and use NOPL everywhere. This gives us deterministic NOPs and restores sanity.

[1] https://lkml.kernel.org/r/20210302105827.34036...@gandalf.local.home
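
For illustration, here is a rough userspace sketch of what a single fixed NOP table could look like. It is not the code from the series. The byte sequences are the multi-byte NOP encodings recommended in the Intel SDM, as far as I know them; `nop_table`, `MAX_NOP_LEN` and `fill_nops()` are hypothetical names made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* Longest single NOP in the table below (an assumption for this sketch). */
#define MAX_NOP_LEN 8

/*
 * One fixed encoding per length, indexed by length in bytes.
 * 0x90 is the classic one-byte NOP, 0x66 0x90 adds a single
 * operand-size prefix, and the longer entries are NOPL
 * (opcode 0x0f 0x1f) with progressively larger addressing forms.
 * Row 0 is unused.
 */
static const unsigned char nop_table[MAX_NOP_LEN + 1][MAX_NOP_LEN] = {
	[1] = { 0x90 },
	[2] = { 0x66, 0x90 },
	[3] = { 0x0f, 0x1f, 0x00 },
	[4] = { 0x0f, 0x1f, 0x40, 0x00 },
	[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 },
	[6] = { 0x66, 0x0f, 0x1f, 0x44, 0x00, 0x00 },
	[7] = { 0x0f, 0x1f, 0x80, 0x00, 0x00, 0x00, 0x00 },
	[8] = { 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 },
};

/*
 * Hypothetical helper: pad buf[0..len) with NOPs, always taking the
 * longest entry that still fits. The result depends only on len,
 * never on which CPU we happen to run on.
 */
static void fill_nops(unsigned char *buf, size_t len)
{
	while (len > 0) {
		size_t n = len < MAX_NOP_LEN ? len : MAX_NOP_LEN;

		memcpy(buf, nop_table[n], n);
		buf += n;
		len -= n;
	}
}
```

With one table like this, an 11-byte hole always gets padded with the 8-byte NOPL followed by the 3-byte NOPL, with no per-vendor selection at boot.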