https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435
Maxim Egorushkin <maxim.yegorushkin at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |maxim.yegorushkin at gmail dot com

--- Comment #11 from Maxim Egorushkin <maxim.yegorushkin at gmail dot com> ---
I have been looking for an option to align only specific loops to a specific boundary.

In particular, I often have nested loops where the innermost loops are the hottest and require 64-byte L1i cache line alignment, while the outer loops should minimize padding.

When I extract the inner loops into separate functions wrapped with `#pragma GCC optimize ("-falign-loops=64")`, that achieves the desired loop alignment, but prevents the loop function from being inlined. Forcing inlining of the loop function with `inline __attribute__((always_inline))` removes the effect of `#pragma GCC optimize ("-falign-loops=64")` for the loop function.

AMD CPU manuals recommend aligning the last byte of the loop machine code to the last byte of a 64-byte L1i cache line, rather than aligning the first byte of the loop to the first byte of the cache line. That makes perfect sense and produces the least amount of padding.

If my memory still serves me right, gcc-11 or gcc-12 did exactly that, which prompted me to examine the AMD CPU manuals for possible clues in the first place, and that is where I found this align-the-last-loop-byte-to-the-end-of-L1i-cache-line advice.
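
For reference, a minimal sketch of the extract-and-pragma workaround described in the comment. The function names `sum_row` and `sum_matrix` and the summation workload are hypothetical, chosen only for illustration; the pragma string is the one quoted in the comment. The `push_options`/`pop_options` pair and the compiler guard are additions assumed here so that the option is scoped to the single function and other compilers skip the pragma.

```c
#include <stddef.h>
#include <stdint.h>

/* Scope the alignment request to the hot loop only (GCC-specific pragmas). */
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC push_options
#pragma GCC optimize ("-falign-loops=64")
#endif

/* Hypothetical hot inner loop, extracted into its own function so that the
   per-function option above applies to it. Per the comment, this function
   is then not inlined into its caller. */
uint64_t sum_row(const uint32_t *row, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += row[i];
    return sum;
}

#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC pop_options
#endif

/* Outer loop: compiled with the default options, so it gets no extra
   64-byte padding request. */
uint64_t sum_matrix(const uint32_t *m, size_t rows, size_t cols)
{
    uint64_t total = 0;
    for (size_t r = 0; r < rows; ++r)
        total += sum_row(m + r * cols, cols);
    return total;
}
```

Whether the inner loop actually ends up on a 64-byte boundary depends on the GCC version and target, so it is worth checking the generated assembly (for example with `gcc -O2 -S`) for the alignment directive in front of the loop rather than assuming it.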