https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81614

--- Comment #3 from Cody Gray <cody at codygray dot com> ---
(In reply to Uroš Bizjak from comment #1)
> Partial register stalls were discussed many times in the past, but
> apparently the compiler still produces fastest code when partial register
> stalls are enabled on latest target processors (e.g. -mtune=intel).

I don't understand what that means. -mtune=intel does *not* fix the partial
register stall problem. It should. All Intel CPUs prior to Haswell would
absolutely experience partial register stalls on this code, resulting in a
performance degradation.

-mtune-ctrl=partial_reg_stall does get the correct code, but I wasn't aware of
this option and I believe I shouldn't have to be. If a developer is getting
sub-optimal code even when he is asking the compiler to tune for his specific
microarchitecture, then the optimizer has a bug.

This is not an issue where there are arguments on either side. There is
absolutely no benefit to generating the code that the compiler currently does.
It is the same number of bytes to OR the BYTE-sized registers as it is to OR
the DWORD-sized registers, while the former will run faster on the vast
majority of CPUs and won't be any slower on the others.
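
To make the size argument concrete, here is a minimal sketch. The function
either_equal is a hypothetical reduction, not the test case attached to this
report, and the instruction sequences in the trailing comment are illustrative
only; actual output depends on the GCC version and flags used.

```c
#include <stdbool.h>

/* Hypothetical reduction (not the test case from this PR): two comparison
   results are combined with a bitwise OR. */
bool either_equal(int a, int b, int c)
{
    bool first  = (a == c);  /* typically materialized with SETcc into a byte register */
    bool second = (b == c);
    return first | second;
}

/* Illustrative sequences (Intel syntax, the CMP instructions omitted):

       DWORD-sized OR                 BYTE-sized OR
       sete  al                       sete  al
       sete  dl                       sete  dl
       or    eax, edx   ; 09 D0       or    al, dl   ; 08 D0

   Both OR encodings are two bytes long. The left form reads the full 32-bit
   registers immediately after their low bytes were written; the right form
   only reads the bytes that were just written. */
```

Compiling a function like this with and without -mtune-ctrl=partial_reg_stall
and comparing the assembly (gcc -O2 -S) should show which form is chosen.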

> Also, it is hard to confirm tuning PRs without hard benchmark data.

No, it really isn't. I know that's a canned response, likely brought about by
hard-won experience with a lot of dubious "tuning" feature requests, but it's
just a cop-out in this case, if not outright dismissive. Partial register
stalls are a well-documented phenomenon, confirmed by multiple sources, and
have been a significant source of performance degradation since the Pentium Pro
was released circa 1995.

Agner Fog's manuals, as cited above, are really the authoritative reference
when it comes to performance tuning on x86, and they provide confirmation of
this in spades. In fact, I would argue that an accurate conceptual
understanding of the microarchitecture is often a better guide than one-off
microbenchmarks, since the latter are so difficult to craft and therefore so
often misleading. For example, the effects of the stall might be masked by the
overhead of the function call, but when the code is inlined or *certainly* when
it is executed within an inner loop, there will be a significant performance
degradation.

Again, if this were an issue where I was proposing bloating the size of the
code for a small payoff in speed, I could see how you might be skeptical. But
there is literally no downside to making this change.

You could possibly argue that -mtune-ctrl=partial_reg_stall should not be
turned on when tuning for Haswell and later microarchitectures, as Haswell was
the first to alleviate the visible performance penalties associated with
reading from a full 32-bit register after writing to a partial 8-bit "view" of
that same register. However, this applies *only* to the low-byte register
(e.g., AL, CL, DL, etc.). With the high-byte registers (e.g., AH, CH, DH,
etc.), there is still a loss in performance because an extra µop has to be
inserted between the write to the 8-bit register and the read from the 32-bit
register. This increases the latency by one clock cycle, and so unless the xH
partial registers are treated differently from the xL partial registers,
applying the optimizations described would still result in a performance win,
especially since there is no drawback.
