On 06/29/16 07:59, James Greenhalgh wrote:
On Tue, Jun 21, 2016 at 02:39:23PM +0100, Wilco Dijkstra wrote:
ping


From: Wilco Dijkstra
Sent: 03 June 2016 11:51
To: GCC Patches
Cc: nd; philipp.toms...@theobroma-systems.com; pins...@gmail.com; 
jim.wil...@linaro.org; benedikt.hu...@theobroma-systems.com; Evandro Menezes
Subject: [PATCH][AArch64] Increase code alignment
Increase loop alignment on Cortex cores to 8 and set function alignment to
16.  This makes things consistent across big.LITTLE cores, improves
performance of benchmarks with tight loops and reduces performance variations
due to small changes in code layout. It looks like almost all AArch64 cores agree
on alignment of 16 for functions, and 8 for loops and branches, so we should
change -mcpu=generic as well if there is no disagreement - feedback welcome.

OK for commit?

Hi Wilco,

Sorry for the delay.

This patch is OK for trunk.

I hope we can continue the discussion as to whether there is a set of
values for -mcpu=generic that better suits the range of cores we now
support.

After Wilco's patch, and using the values in the proposed vulcan and
qdf24xx structures, and with the comments from this thread, the state of
the tuning structures will be:

-mcpu=    : function-jump-loop alignment

cortex-a35: 16-8-8
cortex-a53: 16-8-8
cortex-a57: 16-8-8
cortex-a72: 16-8-8
cortex-a73: 16-8-4 (Though I'm guessing this is just an ordering issue from
                     when Kyrill copied the cost table. Kyrill/Wilco, do you
                     want to spin an update for the Cortex-A73 alignment?
                     Consider it preapproved if you do.)
exynos-m1 : 4-4-4  (2nd choice: 16-4-16, 3rd choice: 8-4-4)
thunderx  : 8-8-8  (But 16-8-8 acceptable/maybe better)
xgene1    : 16-8-16
qdf24xx   : 16-8-16
vulcan    : 16-8-16

Generic is currently set to 8-8-4, which doesn't look representative of
these individual cores at all.

Running down that list, I see a very compelling case for 16 for function
alignment (from comments in this thread, it is second choice, but not too
bad for exynos-m1, thunderx is maybe better at 16, but should not be
worse). I also see a close to unanimous case for 8 for jump alignment,
though I note that in Evandro's experiments 8 never came out as "best"
for exynos-m1. Did you get a feel in your experiments for what the
performance penalty of aligning jumps to 8 would be? That generic is
currently set to 8 seems like a good tie-breaker if the performance impact
for exynos-m1 would be minimal, as it gives no change from the current
behaviour.

For loop alignment I see a split straight down the middle between 8
and 16 (exynos-m1/cortex-a73 are the outliers at 4, but second place for
exynos-m1 was 16-4-16, and cortex-a73 might just be an artifact of when the
table was copied).
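
The per-field counts discussed above can be reproduced with a small Python sketch. The values are copied from the list in this message (first-choice values only, with cortex-a73 counted at its current loop alignment of 4); the names TUNINGS, FIELDS and tally are hypothetical and only serve the illustration.

```python
from collections import Counter

# (function, jump, loop) alignment per -mcpu value, as listed in this thread.
TUNINGS = {
    "cortex-a35": (16, 8, 8),
    "cortex-a53": (16, 8, 8),
    "cortex-a57": (16, 8, 8),
    "cortex-a72": (16, 8, 8),
    "cortex-a73": (16, 8, 4),
    "exynos-m1": (4, 4, 4),
    "thunderx": (8, 8, 8),
    "xgene1": (16, 8, 16),
    "qdf24xx": (16, 8, 16),
    "vulcan": (16, 8, 16),
}
FIELDS = ("function", "jump", "loop")


def tally(tunings):
    """Count how many cores pick each alignment value, per field."""
    return {
        field: Counter(values[i] for values in tunings.values())
        for i, field in enumerate(FIELDS)
    }


votes = tally(TUNINGS)
for field in FIELDS:
    # most_common() lists (alignment, number_of_cores) pairs, largest count first.
    print(field, votes[field].most_common())
```

This only counts cores equally; it says nothing about how much each core gains or loses at a given setting.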

From that, and as a starting point for discussion, I'd suggest that
16-8-8 or 16-8-16 are most representative of the core support that has
been added to GCC.

Wilco, the Cortex-A cores tip the balance in favour of 16-8-8. As you've
most recently looked at the cost of setting loop alignment to 16, do you
have any comments on why you chose to standardize on 16-8-8 rather than
16-8-16, or what the cost would be of 16-8-16?

I've widened the CC list, and I'd appreciate any comments on what we all
think will be the right alignment parameters for -mcpu=generic.

This is a very good summary. However, I think that it should also consider the effects on code size.

Evidently, an alignment of 16 has the greatest probability of increasing code size, whereas an alignment of 4 doesn't increase it. Likewise, what is aligned also matters, since each kind of target occurs with a different frequency. Arguably, the function alignment has the least effect on code size, followed by jumps and then loops, based on typical frequencies. In specific cases, the weights may move somewhat, of course.
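
As a rough illustration of why a larger alignment costs more padding, here is a minimal Python sketch. It assumes fixed 4-byte AArch64 instructions, a target address equally likely to fall on any 4-byte boundary, and no cap on the padding inserted (the compiler can limit how many bytes it skips, which this ignores). The function names are hypothetical, and the figures are a model, not measurements.

```python
INSN_BYTES = 4  # AArch64 instructions are a fixed 4 bytes


def padding_bytes(address, alignment):
    """Bytes of padding needed to move `address` up to a multiple of `alignment`."""
    return (-address) % alignment


def expected_padding(alignment):
    """Average padding per aligned label, assuming each 4-byte offset is equally likely."""
    offsets = range(0, alignment, INSN_BYTES)
    return sum(padding_bytes(o, alignment) for o in offsets) / len(offsets)


for alignment in (4, 8, 16):
    print(alignment, expected_padding(alignment))
```

Under these assumptions the average padding per aligned label is 0 bytes at alignment 4, 2 bytes at 8 and 6 bytes at 16. The total cost then scales with how often each kind of label (function, jump, loop) occurs.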

As I stated in my previous reply, I also tracked the code size when experimenting with different alignments. It was clear from my data that the jump alignment was the most critical to code size (~3% on average), followed by loop alignment (<1%) and function alignment (<1%). Therefore, I'd argue against setting the alignment for jumps to 16. The case can be made for an alignment of 8 for them, where the cost to code size is more modest.

From a performance perspective, generic alignments at 16-8-8 would rank 4th on Exynos M1, with a negligible code size penalty (<1%) over the current 8-4-4.

HTH

--
Evandro Menezes
