Re: [PATCH] Extend -falign-FOO=N to N[,M]: the second number is max padding

Denys Vlasenko Mon, 15 Aug 2016 11:49:31 -0700

On 08/15/2016 03:30 PM, Richard Biener wrote:

On Mon, Aug 15, 2016 at 1:53 PM, Denys Vlasenko <dvlas...@redhat.com> wrote:

On 08/15/2016 11:45 AM, Richard Biener wrote:


Thus. For this CPU, alignment of loops to 8 bytes is wrong: it helps if it
happens to align a loop to 16 bytes, but it may in fact hurt performance if it
happens to align a loop to 16+8 bytes and this pushes loop's body end over the
next 16-byte boundary, as it happens in the above example.

I suspect something similar was seen some time ago on a different, earlier CPU,
and on _that_ CPU decoder/loop buffer idiosyncrasies are such that it likes
8-byte alignment.

It's not true that such alignment is always a win.



It looks to me that all you want is to drop the 8-byte alignment on
entities that are smaller than a cacheline.



I don't think it can be simplified to this.

An example. A loop 122 bytes long fits into either two or three 64-byte
cachelines, depending on where it starts. If it starts in bytes 0..6 of a
cacheline, it fits into two cachelines. If it starts 7 bytes or more into a
cacheline, it doesn't fit.

8-byte alignment is worse for such a loop than not doing it.
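
To make the arithmetic concrete, here is a minimal Python sketch that counts how many 64-byte cachelines a loop touches for a given start offset. The helper names (`cachelines_spanned`, `align_up`) and the sample start offset of 3 are illustrative assumptions, not anything from GCC.

```python
CACHELINE = 64  # bytes; the cacheline size used in the example above


def cachelines_spanned(start_offset, length, line=CACHELINE):
    """Number of cachelines touched by `length` bytes starting at `start_offset`."""
    first = start_offset // line
    last = (start_offset + length - 1) // line
    return last - first + 1


def align_up(offset, alignment):
    """Round `offset` up to the next multiple of `alignment`."""
    return -(-offset // alignment) * alignment


if __name__ == "__main__":
    loop_len = 122
    start = 3                    # hypothetical unaligned start offset
    aligned = align_up(start, 8)  # what an 8-byte alignment would produce
    print(cachelines_spanned(start, loop_len))    # 2
    print(cachelines_spanned(aligned, loop_len))  # 3
```

For this particular start offset, padding the loop to an 8-byte boundary pushes its end into a third cacheline.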

It's even worse for the use case which prompted me to create these patches:
-falign-functions. Linux kernel people want to align all functions
to 64 bytes, but only if the necessary padding is, say, 9 bytes or less.
The rationale is that function calls are often "cold", i.e. the function body
is not in L1, and it would be even slower if the first insn(s) required
two L1 loads, not one, to be decoded.

Hence -falign-functions=64,10. This would be a very efficient packing:
only ~15% of all functions would need any padding (the remaining 85%
would start 10 or more bytes before end of cacheline and thus need
no padding), and among those 15% the average padding length would be
only 5 bytes. With very small code size increase, we'd gain a lot
in speed.
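
The ~15% and 5-byte figures can be checked with a short Python sketch. It assumes function start offsets are uniformly distributed within a cacheline and that "N,M" means "pad to N only when fewer than M bytes of padding are needed"; the function names are hypothetical.

```python
def padding(offset, n, m):
    """Padding emitted at `offset` for an N,M alignment: pad up to the next
    multiple of N only if that needs at most M-1 bytes, else emit nothing."""
    need = (-offset) % n
    return need if need < m else 0


def padding_stats(n, m):
    """Return (fraction of start offsets that get padded, mean padding among
    those), assuming every offset 0..N-1 is equally likely."""
    pads = [padding(o, n, m) for o in range(n)]
    padded = [p for p in pads if p > 0]
    return len(padded) / n, sum(padded) / len(padded)


if __name__ == "__main__":
    fraction, average = padding_stats(64, 10)
    print(fraction, average)  # 0.140625 5.0
```

Under these assumptions 9 of the 64 possible start offsets get padded, with 1 to 9 bytes each.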

This nice optimistic picture is currently destroyed by unnecessary
and not-asked-for "subalignment" to 8 bytes, which now adds 4.5 bytes
of padding on average *to every function*, as a "bonus" making
it *less* efficient versus instruction fetch, not more efficient!


IOW: I am proposing to remove this code because it seems arbitrary: it helped
on one particular CPU model, and maybe only on some particular benchmarks.
On other CPUs, or in other scenarios, it's harmful.
It should not be done unconditionally for all CPUs and all programs.

If there is value in the ability to do a "subalignment" within a larger
alignment, maybe we can make it a separate option, and let the user specify it
if they want?


Controlling this separately makes sense IMHO.  Changing the default for
generic tuning has to be backed up with measurements and old CPUs not
benchmarked should retain the old value when tuned for them.

Let me rephrase the desire again.  The desire is to maximize the number
of instructions fetched with the first cacheline for any label that is branched
(forward) to.  A side-effect may be avoiding penalties for CPUs that have
an instruction started at only N-byte aligned space (not sure that exists
for an ISA with 1-byte opcodes).  For labels that are branched backward to
(thus loops) the desire is to minimize the number of cachelines that need
to be fetched to get the whole loop covered - ISTR CPUs have limits on that
number when it comes to handling loops with loop caches.  Branch target
buffers may also not like too many targets per cache-line -- I expect
8 2-byte functions in a cache-line to be very bad here.

If the situation cannot be improved on any of the above, any additional
"artificial" alignment only makes things worse (by enlarging code).


I have an idea.

Since I am extending -falign-foo directives anyway, I can add even more
functionality to them. Such as:

-falign-functions=N[,M[,N2[,M2]]]

This would emit

                .balign N,M-1
                [.balign N2,M2-1]   // only if N2 > 1

N2 can be made to default to 8 if N is > 8. This is exactly the current
behavior on x86.

For the use case I described (about kernel aligning functions to 64 bytes)
the desired flag would be:

-falign-functions=64,10,1
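
As a rough illustration of the proposed mapping from flag value to directives, here is a Python sketch. It is not the patch itself: the function name is hypothetical, the default for an omitted M (always align) and for an omitted M2 are assumptions, and the directives are spelled `.balign N,,M-1` because in GNU as the second operand is the fill value and the third is the maximum skip.

```python
from typing import List


def balign_directives(spec: str) -> List[str]:
    """Turn an "N[,M[,N2[,M2]]]" string into the directives sketched above."""
    parts = [int(p) for p in spec.split(",")]
    if not 1 <= len(parts) <= 4:
        raise ValueError("expected N[,M[,N2[,M2]]]")
    n = parts[0]
    # Assumption: an omitted M means "always align", i.e. max skip N-1.
    m = parts[1] if len(parts) > 1 else n
    # Per the proposal, N2 defaults to 8 when N > 8; otherwise no subalignment.
    n2 = parts[2] if len(parts) > 2 else (8 if n > 8 else 1)
    # Assumption: an omitted M2 means "always subalign", i.e. max skip N2-1.
    m2 = parts[3] if len(parts) > 3 else n2
    out = [f".balign {n},,{m - 1}"]
    if n2 > 1:  # the second directive is only emitted if N2 > 1
        out.append(f".balign {n2},,{m2 - 1}")
    return out


if __name__ == "__main__":
    print(balign_directives("64,10,1"))  # ['.balign 64,,9']
    print(balign_directives("64,10"))    # ['.balign 64,,9', '.balign 8,,7']
```

With `64,10,1` the subalignment is switched off explicitly, while plain `64,10` keeps the 8-byte subalignment through the default.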

Does this look good to you?
