On 08/12/2016 04:00 PM, Michael Matz wrote:
On Thu, 11 Aug 2016, Denys Vlasenko wrote:
This change makes it possible to align functions to 64-byte boundaries
*IF* this does not introduce a huge amount of padding.
The patch drops the forced alignment to 8 when the requested alignment is
higher than 8. Before the patch, -falign-functions=9 generated
.p2align 4,,8
.p2align 3
which means: "align to 16 if the skip is 8 bytes or less; else align to 8".
After this change, ".p2align 3" is not emitted.
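
The effect of the two directives can be modelled in a few lines. This is only a sketch of the documented GNU as behaviour of `.p2align power,,max_skip` (the alignment is not done at all if it would need more than max_skip bytes of padding); the helper names p2align, before_patch and after_patch are made up for illustration and are not part of GCC or binutils.

```python
def p2align(addr, power, max_skip=None):
    """Model `.p2align power,,max_skip`: return the address after padding."""
    boundary = 1 << power
    pad = -addr % boundary          # bytes needed to reach the next boundary
    if max_skip is not None and pad > max_skip:
        return addr                 # too much padding: directive does nothing
    return addr + pad


def before_patch(addr):
    # .p2align 4,,8 followed by the unconditional .p2align 3
    return p2align(p2align(addr, 4, 8), 3)


def after_patch(addr):
    # only .p2align 4,,8 remains
    return p2align(addr, 4, 8)


if __name__ == "__main__":
    for addr in (0x1008, 0x1003, 0x1001):
        print(hex(addr), "->", hex(before_patch(addr)), hex(after_patch(addr)))
```

For a start address of 0x1003 the old sequence pads to 0x1008, while the new one leaves the address unchanged.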
It is dropped because I ultimately want to do something like
-falign-functions=64,8 - IOW, I want to align functions to 64 bytes, but
only if that generates padding of less than 8 bytes - otherwise I want
*no alignment at all*.
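
To get a feel for how little padding such a policy would cost, the sketch below enumerates every possible start offset within a 64-byte block. It assumes that "64,8" means "align to 64 only if fewer than 8 padding bytes are needed, otherwise emit nothing"; that reading, and the function name padding_64_8, are assumptions made for illustration, not an existing GCC option syntax.

```python
def padding_64_8(offset, boundary=64, limit=8):
    """Padding emitted under the assumed '64,8' policy for a given offset."""
    pad = -offset % boundary
    return pad if pad < limit else 0   # otherwise: no alignment at all


if __name__ == "__main__":
    pads = [padding_64_8(off) for off in range(64)]
    aligned = sum(1 for off, p in enumerate(pads) if (off + p) % 64 == 0)
    print("offsets that end up 64-byte aligned:", aligned, "of 64")
    print("average padding in bytes:", sum(pads) / 64)
```

Under these assumptions, only the offsets within 7 bytes of the next boundary (plus the already aligned one) end up aligned, and the average padding stays well below one byte per function.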
Have you tested the performance impact of your patch? Note that the macro
you changed is used for function and code label alignment. So, unless I
misunderstand something, that means that if the large alignment can't be
achieved for e.g. a loop start label, you won't align it at all anymore.
This should be fairly catastrophic for any loopy benchmark, so anything
like this would have to be checked on a couple benchmarks from cpu2000
(possibly cpu2006), which has some that are extremely alignment sensitive.
Even for function labels I'd find no alignment at all strange, and I don't
see why you'd want this.
For many generations now, x86 CPUs have had cache lines of at least 32, and
usually 64, bytes. Decoders fetch instructions in blocks of 32 or 64 bytes,
never less.
Instructions which are "misaligned" (for example, starting at byte 5) within
a cacheline but still fitting into one cacheline are fetched in one go,
with no penalty. Aligning to 8 bytes within a cacheline does not speed
anything up; it simply wastes bytes.
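
The distinction between "misaligned but inside one cacheline" and "straddling two cachelines" can be checked with simple arithmetic. The following is a minimal sketch assuming a 64-byte line; crosses_line is a hypothetical helper, and the instruction lengths are arbitrary examples.

```python
LINE = 64  # assumed cache line size in bytes


def crosses_line(start, length, line=LINE):
    """True if bytes [start, start + length) span more than one cache line."""
    return start // line != (start + length - 1) // line


if __name__ == "__main__":
    print(crosses_line(5, 4))    # starts at byte 5, stays inside line 0
    print(crosses_line(62, 4))   # starts at byte 62, spills into line 1
```

An instruction at byte 5 is not 8-byte aligned, yet it lies entirely within one line, which is the case described above as having no penalty.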