[Bug target/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2

crazylht at gmail dot com via Gcc-bugs Tue, 01 Mar 2022 01:35:13 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908


--- Comment #29 from Hongtao.liu <crazylht at gmail dot com> ---
From Agner Fog's excellent optimization manuals
(https://www.agner.org/optimize/microarchitecture.pdf).

For ICX/TGL:

An aligned write of 128 bits or more followed by a read of one or both of the
two halves or the four quarters, etc., has little or no penalty. A partial read
that does not fit into the halves or quarters fails to forward. The
write-to-read latency is 19-20 clock cycles when forwarding fails.
A read that is bigger than the write, or a read that covers both written and
unwritten bytes, fails to forward. The write-to-read latency is 19-20 clock
cycles.
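
To make the rule above concrete, here is a rough sketch of it as a C predicate.
It is only my reading of the quoted text, not a model of the hardware: the
names (fwd_result, predict_forwarding) are made up for illustration, and
"halves, quarters, etc." is assumed to mean a power-of-two sized chunk that
starts at a multiple of its own size inside the store. Cases the quoted text
says nothing about (stores narrower than 128 bits that fully contain the load,
unaligned stores, non-overlapping accesses) are reported as unspecified rather
than guessed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: what the quoted rule predicts for a store
   followed by a load. */
enum fwd_result { FWD_OK, FWD_FAIL, FWD_UNSPECIFIED };

static int is_pow2(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Sizes are in bytes. This is an assumption-laden reading of the quoted
   text for ICX/TGL, not a description of the actual hardware logic. */
enum fwd_result predict_forwarding(uintptr_t st_addr, size_t st_size,
                                   uintptr_t ld_addr, size_t ld_size)
{
    assert(st_size > 0 && ld_size > 0);

    /* The load does not touch the stored bytes: nothing to forward. */
    if (ld_addr + ld_size <= st_addr || st_addr + st_size <= ld_addr)
        return FWD_UNSPECIFIED;

    /* "A read that is bigger than the write, or a read that covers both
       written and unwritten bytes, fails to forward." */
    if (ld_addr < st_addr || ld_addr + ld_size > st_addr + st_size)
        return FWD_FAIL;

    /* The quoted rule only covers aligned writes of 128 bits or more. */
    if (st_size < 16 || !is_pow2(st_size) || st_addr % st_size != 0)
        return FWD_UNSPECIFIED;

    /* Assumed meaning of "halves, quarters, etc.": a power-of-two chunk
       starting at a multiple of its own size within the store. */
    if (is_pow2(ld_size) && (ld_addr - st_addr) % ld_size == 0)
        return FWD_OK;

    /* "A partial read that does not fit into the halves or quarters
       fails to forward." */
    return FWD_FAIL;
}
```

For example, predict_forwarding(0x1000, 8, 0x1000, 16), an 8-byte store
followed by a 16-byte load from the same address, returns FWD_FAIL, since the
read is bigger than the write.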

And from the Intel software optimization guide:

There are several cases in which data is passed through memory, and the store
may need to be separated from the load:
• Spills, save and restore registers in a stack frame.
• Parameter passing.
• Global and volatile variables.
• Type conversion between integer and floating-point.
• When compilers do not analyze code that is inlined, forcing variables that
  are involved in the interface with inlined code to be in memory, creating
  more memory variables and preventing the elimination of redundant loads.
