https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81614
--- Comment #3 from Cody Gray <cody at codygray dot com> ---
(In reply to Uroš Bizjak from comment #1)
> Partial register stalls were discussed many times in the past, but
> apparently the compiler still produces fastest code when partial register
> stalls are enabled on latest target processors (e.g. -mtune=intel).

I don't understand what that means. -mtune=intel does *not* fix the partial register stall problem. It should. All Intel CPUs prior to Haswell would absolutely experience partial register stalls on this code, resulting in a performance degradation.

-mtune-ctrl=partial_reg_stall does get the correct code, but I wasn't aware of this option and I believe I shouldn't have to be. If a developer is getting sub-optimal code even when he is asking the compiler to tune for his specific microarchitecture, then the optimizer has a bug.

This is not an issue where there are arguments on either side. There is absolutely no benefit to generating the code that the compiler currently does. It is the same number of bytes to OR the BYTE-sized registers as it is to OR the DWORD-sized registers, while the former will run faster on the vast majority of CPUs and won't be any slower on the others.

> Also, it is hard to confirm tuning PRs without hard benchmark data.

No, it really isn't. I know that's a canned response, likely brought about by hard-won experience with a lot of dubious "tuning" feature requests, but it's just a cop-out in this case, if not outright dismissive.

Partial register stalls are a well-documented phenomenon, confirmed by multiple sources, and have been a significant source of performance degradation since the Pentium Pro was released circa 1995. Agner Fog's manuals, as cited above, are really the authoritative reference when it comes to performance tuning on x86, and they provide confirmation of this in spades.
In fact, I would argue that an accurate conceptual understanding of the microarchitecture is often a better guide than one-off microbenchmarks, since the latter are so difficult to craft and therefore so often misleading. For example, the effects of the stall might be masked by the overhead of the function call, but when the code is inlined or *certainly* when it is executed within an inner loop, there will be a significant performance degradation.

Again, if this were an issue where I was proposing bloating the size of the code for a small payoff in speed, I could see how you might be skeptical. But there is literally no downside to making this change.

You could possibly argue that -mtune-ctrl=partial_reg_stall should not be turned on when tuning for Haswell and later microarchitectures, as Haswell was the first to alleviate the visible performance penalties associated with reading from a full 32-bit register after writing to a partial 8-bit "view" of that same register. However, this applies *only* to the low-byte register (e.g., AL, CL, DL, etc.). With the high-byte registers (e.g., AH, CH, DH, etc.), there is still a loss in performance because an extra µop has to be inserted between the write to the 8-bit register and the read from the 32-bit register. This increases the latency by one clock cycle, and so unless the xH partial registers are treated differently from the xL partial registers, applying the optimizations described would still result in a performance win, especially since there is no drawback.
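
For readers without the original test case at hand, the following is a minimal, hypothetical C sketch of the kind of function where this pattern can show up. It is an assumption for illustration only, not the code from the report, and the name `either_greater` is invented. On x86, a compiler may materialize each comparison into a byte register with a SETcc instruction and then combine the two results with an OR; the question discussed above is whether that OR is done on the 8-bit registers that were just written or on the enclosing 32-bit registers.

```c
#include <stdbool.h>

/* Hypothetical example (not taken from the bug report).
 * Each comparison yields 0 or 1. A compiler targeting x86 may compute
 * each result with SETcc into a byte register (for instance AL and DL)
 * and then OR the two together. Reading the full 32-bit register right
 * after such a byte write is the access pattern that can trigger a
 * partial register stall on affected microarchitectures. */
bool either_greater(int a, int b, int c, int d)
{
    /* Bitwise OR on purpose: both comparisons are evaluated,
     * so no branch is needed to combine them. */
    return (a > b) | (c > d);
}
```

One way to inspect what a given compiler actually emits is to compile to assembly (for example with `gcc -O2 -S`) once with `-mtune=intel` and once with `-mtune-ctrl=partial_reg_stall` added, and compare the operand size of the OR instruction. The exact output depends on the compiler version and target, so no particular listing is assumed here.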