https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908
--- Comment #29 from Hongtao.liu <crazylht at gmail dot com> ---
From Agner Fog's excellent optimization manuals (https://www.agner.org/optimize/microarchitecture.pdf).

For ICX/TGL:

An aligned write of 128 bits or more followed by a read of one or both of the two halves or the four quarters, etc., has little or no penalty. A partial read that does not fit into the halves or quarters fails to forward. The write-to-read latency is 19-20 clock cycles when forwarding fails.

A read that is bigger than the write, or a read that covers both written and unwritten bytes, fails to forward. The write-to-read latency is 19-20 clock cycles.

And from the Intel software optimization guide:

There are several cases in which data is passed through memory, and the store may need to be separated from the load:
• Spills, save and restore registers in a stack frame.
• Parameter passing.
• Global and volatile variables.
• Type conversion between integer and floating-point.
• When compilers do not analyze code that is inlined, forcing variables that are involved in the interface with inlined code to be in memory, creating more memory variables and preventing the elimination of redundant loads.
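
To make the ICX/TGL rules quoted above concrete, here is a rough sketch in C of a predicate that guesses whether a load can forward from an earlier store. It is only a simplified model under stated assumptions, not anything taken from GCC or from either manual: the names mem_access and forwards are hypothetical, "halves or quarters, etc." is read as "the load is the store width divided by a whole number and sits at a multiple of its own size inside the store", and the alignment and the 128-bit minimum width of the store are not checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one memory access; not a real API. */
typedef struct {
    size_t offset; /* byte offset of the access from a common base */
    size_t size;   /* access width in bytes */
} mem_access;

/* Returns true when this simplified model predicts that the load
   can be forwarded from the store without the 19-20 cycle penalty. */
bool forwards(mem_access store, mem_access load)
{
    assert(store.size > 0 && load.size > 0);

    /* A read that is bigger than the write fails to forward. */
    if (load.size > store.size)
        return false;

    /* A read that covers both written and unwritten bytes fails. */
    if (load.offset < store.offset ||
        load.offset + load.size > store.offset + store.size)
        return false;

    /* Assumption: the read must be a whole half, quarter, etc. of the
       store, i.e. its size divides the store size and it starts at a
       multiple of its own size relative to the start of the store. */
    size_t rel = load.offset - store.offset;
    return store.size % load.size == 0 && rel % load.size == 0;
}
```

Under this model a 16-byte store followed by an 8-byte load of its upper half forwards, an 8-byte load at offset 4 does not, and an 8-byte scalar store followed by a 16-byte vector load does not either.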