On Fri, Aug 5, 2016 at 12:18 AM, Andrew Pinski <pins...@gmail.com> wrote:
> Hi,
> On ThunderX, a load (or store) pair of two words (32 bits each) is
> slower in some cases than doing two separate loads/stores.  For some
> internal benchmarks, this change provides a 2-5% improvement.
>
> This patch disables the forming of load/store pairs for SImode when
> tuning for ThunderX.  I used the tuning-flags route so it can be
> overridden if needed later on, or if someone else wants to use the
> same method for their core.
>
> OK?  Bootstrapped and tested on aarch64-linux-gnu with no regressions.
Here is a new version based on feedback both on the list and off.  I added
a check that the alignment is at least 8 bytes, since it is an alignment of
less than 8 bytes that causes the slowdown.  I also added two new testcases
to check that the load-pair optimization is still performed when it is
profitable.

OK?  Bootstrapped and tested on aarch64-linux-gnu with no regressions.

Thanks,
Andrew Pinski

ChangeLog:
	* config/aarch64/aarch64-tuning-flags.def (slow_ldpw): New tuning
	option.
	* config/aarch64/aarch64.c (thunderx_tunings): Enable
	AARCH64_EXTRA_TUNE_SLOW_LDPW.
	(aarch64_operands_ok_for_ldpstp): Return false if
	AARCH64_EXTRA_TUNE_SLOW_LDPW is set, the mode is SImode and the
	alignment is less than 8 bytes.
	(aarch64_operands_adjust_ok_for_ldpstp): Likewise.

testsuite/ChangeLog:
	* gcc.target/aarch64/thunderxloadpair.c: New testcase.
	* gcc.target/aarch64/thunderxnoloadpair.c: New testcase.

>
> Thanks,
> Andrew Pinski
>
> ChangeLog:
> * config/aarch64/aarch64-tuning-flags.def (slow_ldpw): New tuning option.
> * config/aarch64/aarch64.c (thunderx_tunings): Enable
> AARCH64_EXTRA_TUNE_SLOW_LDPW.
> (aarch64_operands_ok_for_ldpstp): Return false if
> AARCH64_EXTRA_TUNE_SLOW_LDPW and the mode was SImode.
> (aarch64_operands_adjust_ok_for_ldpstp): Likewise.
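To make the intent concrete, here is roughly the codegen difference being
targeted (an illustrative sketch only; the expected instructions in the
comments are my reading of the new MEM_ALIGN check, not output shipped with
the patch, and the register numbers are placeholders):

/* Illustrative only; compile with -O2 -mcpu=thunderx.  */
struct ldp   { long long c; int a, b; };  /* a->a is 8-byte aligned.  */
struct noldp { int a, b; };               /* a->a is only 4-byte aligned.  */

int
f1 (struct ldp *a)
{
  /* Expected: a single "ldp wN, wM, [x0, 8]"; the pair is still formed
     because the known alignment is at least 8 bytes.  */
  return a->a + a->b;
}

int
f2 (struct noldp *a)
{
  /* Expected: two separate "ldr" instructions; the pair is deliberately
     not formed because the known alignment is below 8 bytes.  */
  return a->a + a->b;
}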
Index: config/aarch64/aarch64-tuning-flags.def
===================================================================
--- config/aarch64/aarch64-tuning-flags.def	(revision 239228)
+++ config/aarch64/aarch64-tuning-flags.def	(working copy)
@@ -29,3 +29,4 @@
      AARCH64_TUNE_ to give an enum name. */
 
 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
+AARCH64_EXTRA_TUNING_OPTION ("slow_ldpw", SLOW_LDPW)
Index: config/aarch64/aarch64.c
===================================================================
--- config/aarch64/aarch64.c	(revision 239228)
+++ config/aarch64/aarch64.c	(working copy)
@@ -712,7 +712,7 @@
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_SLOW_LDPW)	/* tune_flags.  */
 };
 
 static const struct tune_params xgene1_tunings =
@@ -13593,6 +13593,15 @@
   if (MEM_VOLATILE_P (mem_1) || MEM_VOLATILE_P (mem_2))
     return false;
 
+  /* If we have SImode and slow ldp, check that the alignment is at least
+     8 bytes.  */
+  if (mode == SImode
+      && (aarch64_tune_params.extra_tuning_flags
+	  & AARCH64_EXTRA_TUNE_SLOW_LDPW)
+      && !optimize_size
+      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
+    return false;
+
   /* Check if the addresses are in the form of [base+offset].  */
   extract_base_offset_in_addr (mem_1, &base_1, &offset_1);
   if (base_1 == NULL_RTX || offset_1 == NULL_RTX)
@@ -13752,6 +13761,15 @@
       return false;
     }
 
+  /* If we have SImode and slow ldp, check that the alignment is at least
+     8 bytes.  */
+  if (mode == SImode
+      && (aarch64_tune_params.extra_tuning_flags
+	  & AARCH64_EXTRA_TUNE_SLOW_LDPW)
+      && !optimize_size
+      && MEM_ALIGN (mem_1) < 8 * BITS_PER_UNIT)
+    return false;
+
   if (REG_P (reg_1) && FP_REGNUM_P (REGNO (reg_1)))
     rclass_1 = FP_REGS;
   else
Index: testsuite/gcc.target/aarch64/thunderxloadpair.c
===================================================================
--- testsuite/gcc.target/aarch64/thunderxloadpair.c	(nonexistent)
+++ testsuite/gcc.target/aarch64/thunderxloadpair.c	(working copy)
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=thunderx" } */
+
+struct ldp
+{
+  long long c;
+  int a, b;
+};
+
+
+int f(struct ldp *a)
+{
+  return a->a + a->b;
+}
+
+
+/* We know the alignment of a->a to be 8 bytes, so it is profitable
+   to do ldp.  */
+/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 1 } } */
+
Index: testsuite/gcc.target/aarch64/thunderxnoloadpair.c
===================================================================
--- testsuite/gcc.target/aarch64/thunderxnoloadpair.c	(nonexistent)
+++ testsuite/gcc.target/aarch64/thunderxnoloadpair.c	(working copy)
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=thunderx" } */
+
+struct noldp
+{
+  int a, b;
+};
+
+
+int f(struct noldp *a)
+{
+  return a->a + a->b;
+}
+
+/* We know the alignment of a->a to be only 4 bytes, so it is not
+   profitable to do ldp.  */
+/* { dg-final { scan-assembler-not "ldp\tw\[0-9\]+, w\[0-9\]" } } */
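As a usage note, the new check keys purely off MEM_ALIGN, so code that
guarantees a larger alignment keeps the pairing even on ThunderX.  A minimal
sketch of that, assuming an aligned(8) attribute on the otherwise
under-aligned struct (this variant is not part of the patch or its
testsuite):

/* Illustrative only; compile with -O2 -mcpu=thunderx.  */
struct aligned_noldp
{
  int a, b;
} __attribute__ ((aligned (8)));

int
g (struct aligned_noldp *a)
{
  /* Here MEM_ALIGN (mem_1) is 8 * BITS_PER_UNIT, so the new check does
     not reject the pair and an ldp should still be generated.  */
  return a->a + a->b;
}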