Re: [17/n] PR85694: AArch64 support for AVG_FLOOR/CEIL

2018-07-03 Thread James Greenhalgh
On Fri, Jun 29, 2018 at 04:24:58AM -0500, Richard Sandiford wrote:
> This patch adds AArch64 patterns for the new AVG_FLOOR/CEIL operations.
> AVG_FLOOR is [SU]HADD and AVG_CEIL is [SU]RHADD.
> 
> Tested on aarch64-linux-gnu (with and without SVE).  OK to install?


OK.

Thanks,
James

> 2018-06-29  Richard Sandiford  
> 
> gcc/
>   PR tree-optimization/85694
>   * config/aarch64/iterators.md (HADD, RHADD): New int iterators.
>   (u): Handle UNSPEC_SHADD, UNSPEC_UHADD, UNSPEC_SRHADD and
>   UNSPEC_URHADD.
>   * config/aarch64/aarch64-simd.md (avg3_floor)
>   (avg3_ceil): New patterns.
> 
> gcc/testsuite/
>   PR tree-optimization/85694
>   * lib/target-supports.exp (check_effective_target_vect_avg_qi):
>   Return true for AArch64 without SVE.
>   * gcc.target/aarch64/vect_hadd_1.h: New file.
>   * gcc.target/aarch64/vect_shadd_1.c: New test.
>   * gcc.target/aarch64/vect_srhadd_1.c: Likewise.
>   * gcc.target/aarch64/vect_uhadd_1.c: Likewise.
>   * gcc.target/aarch64/vect_urhadd_1.c: Likewise.
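
For reference, a minimal sketch of the scalar semantics behind these patterns
(not taken from the patch itself): AVG_FLOOR is a halving add and AVG_CEIL is
a rounding halving add, which is what [SU]HADD and [SU]RHADD compute without
needing a wider intermediate type.

#include <stdint.h>

/* Sketch: what the [SU]HADD / [SU]RHADD forms compute per element, shown
   here for unsigned bytes with an explicit widening of the sum.  */
static inline uint8_t
avg_floor_u8 (uint8_t a, uint8_t b)
{
  return (uint8_t) (((uint16_t) a + (uint16_t) b) >> 1);
}

static inline uint8_t
avg_ceil_u8 (uint8_t a, uint8_t b)
{
  return (uint8_t) (((uint16_t) a + (uint16_t) b + 1) >> 1);
}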


Re: [PATCH][GCC][AArch64] Simplify movmem code by always doing overlapping copies when larger than 8 bytes.

2018-07-03 Thread James Greenhalgh
On Tue, Jun 19, 2018 at 09:09:27AM -0500, Tamar Christina wrote:
> Hi All,



OK.

Thanks,
James

> Thanks,
> Tamar
> 
> gcc/
> 2018-06-19  Tamar Christina  
> 
>   * config/aarch64/aarch64.c (aarch64_expand_movmem): Fix mode size.
> 
> gcc/testsuite/
> 2018-06-19  Tamar Christina  
> 
>   * gcc.target/aarch64/struct_cpy.c: New.
> 
> -- 


Re: [AArch64][PATCH 1/2] Fix addressing printing of LDP/STP

2018-07-17 Thread James Greenhalgh
On Mon, Jun 25, 2018 at 03:48:13AM -0500, Andre Simoes Dias Vieira wrote:
> On 18/06/18 09:08, Andre Simoes Dias Vieira wrote:
> > Hi Richard,
> > 
> > Sorry for the delay I have been on holidays.  I had a look and I think you 
> > are right.  With these changes Umq and Uml seem to have the same 
> > functionality though, so I would suggest using only one.  Maybe use a 
> > different name for both, removing both Umq and Uml in favour of Umn, where 
> > the n indicates it narrows the addressing mode.  How does that sound to you?
> > 
> > I also had a look at Ump, but that one is used in the parallel pattern for 
> > STP/LDP which does not use this "narrowing". So we should leave that one as 
> > is.
> > 
> > Cheers,
> > Andre
> > 
> > 
> > From: Richard Sandiford 
> > Sent: Thursday, June 14, 2018 12:28:16 PM
> > To: Andre Simoes Dias Vieira
> > Cc: gcc-patches@gcc.gnu.org; nd
> > Subject: Re: [AArch64][PATCH 1/2] Fix addressing printing of LDP/STP
> > 
> > Andre Simoes Dias Vieira  writes:
> >> @@ -5716,10 +5717,17 @@ aarch64_classify_address (struct 
> >> aarch64_address_info *info,
> >>unsigned int vec_flags = aarch64_classify_vector_mode (mode);
> >>bool advsimd_struct_p = (vec_flags == (VEC_ADVSIMD | VEC_STRUCT));
> >>bool load_store_pair_p = (type == ADDR_QUERY_LDP_STP
> >> + || type == ADDR_QUERY_LDP_STP_N
> >>   || mode == TImode
> >>   || mode == TFmode
> >>   || (BYTES_BIG_ENDIAN && advsimd_struct_p));
> >>
> >> +  /* If we are dealing with ADDR_QUERY_LDP_STP_N that means the incoming 
> >> mode
> >> + corresponds to the actual size of the memory being loaded/stored and 
> >> the
> >> + mode of the corresponding addressing mode is half of that.  */
> >> +  if (type == ADDR_QUERY_LDP_STP_N && known_eq (GET_MODE_SIZE (mode), 16))
> >> +mode = DFmode;
> >> +
> >>bool allow_reg_index_p = (!load_store_pair_p
> >>   && (known_lt (GET_MODE_SIZE (mode), 16)
> >>   || vec_flags == VEC_ADVSIMD
> > 
> > I don't know whether it matters in practice, but that description also
> > applies to Umq, not just Uml.  It might be worth changing it too so
> > that things stay consistent.
> > 
> > Thanks,
> > Richard
> > 
> Hi all,
> 
> This is a reworked patched, replacing Umq and Uml with Umn now.
> 
> Bootstrapped and tested on aarch64-none-linux-gnu.
> 
> Is this OK for trunk?

OK. Does this also need backporting to 8?

Thanks,
James

> 
> gcc
> 2018-06-25  Andre Vieira  
> 
> * config/aarch64/aarch64-simd.md (aarch64_simd_mov):
> Replace
> Umq with Umn.
> (store_pair_lanes): Likewise.
> * config/aarch64/aarch64-protos.h (aarch64_addr_query_type): Add new
> enum value 'ADDR_QUERY_LDP_STP_N'.
> * config/aarch64/aarch64.c (aarch64_addr_query_type): Likewise.
> (aarch64_print_address_internal): Add declaration.
> (aarch64_print_ldpstp_address): Remove.
> (aarch64_classify_address): Adapt mode for 'ADDR_QUERY_LDP_STP_N'.
> (aarch64_print_operand): Change printing of 'y'.
> * config/aarch64/predicates.md (aarch64_mem_pair_lanes_operand): Use
> new enum value 'ADDR_QUERY_LDP_STP_N', don't hardcode mode and use
> 'true' rather than '1'.
> * gcc/config/aarch64/constraints.md (Uml): Likewise.
> (Uml): Rename to Umn.
> (Umq): Remove.
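
For intuition (not part of the patch): the mode is "narrowed" because the
LDP/STP immediate offset is scaled by the size of one register of the pair,
not by the size of the whole transfer.  A rough, illustrative check for a
16-byte access done as a pair of 8-byte registers might look like this
sketch (not the GCC implementation):

#include <stdbool.h>
#include <stdint.h>

/* Illustrative only: LDP/STP of two 8-byte registers takes a signed 7-bit
   immediate scaled by 8, so a 16-byte access must be validated against the
   8-byte element range rather than a 16-byte (TImode-sized) range.  */
static bool
ldp_stp_offset_ok_for_8byte_pair (int64_t offset)
{
  return offset >= -512 && offset <= 504 && (offset % 8) == 0;
}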



Re: [AArch64][PATCH 2/2] PR target/83009: Relax strict address checking for store pair lanes

2018-07-17 Thread James Greenhalgh
On Mon, Jun 25, 2018 at 03:48:43AM -0500, Andre Simoes Dias Vieira wrote:
> On 14/06/18 12:47, Richard Sandiford wrote:
> > Kyrill  Tkachov  writes:
> >> Hi Andre,
> >> On 07/06/18 18:02, Andre Simoes Dias Vieira wrote:
> >>> Hi,
> >>>
> >>> See below a patch to address PR 83009.
> >>>
> >>> Tested with aarch64-linux-gnu bootstrap and regtests for c, c++ and 
> >>> fortran.
> >>> Ran the adjusted testcase for -mabi=ilp32.
> >>>
> >>> Is this OK for trunk?
> >>>
> >>> Cheers,
> >>> Andre
> >>>
> >>> PR target/83009: Relax strict address checking for store pair lanes
> >>>
> >>> The operand constraint for the memory address of store/load pair lanes was
> >>> enforcing strictly hardware registers be allowed as memory addresses.  We 
> >>> want
> >>> to relax that such that these patterns can be used by combine, prior
> >>> to reload.
> >>> During register allocation the register constraint will enforce the 
> >>> correct
> >>> register is chosen.
> >>>
> >>> gcc
> >>> 2018-06-07  Andre Vieira 
> >>>
> >>> PR target/83009
> >>> * config/aarch64/predicates.md (aarch64_mem_pair_lanes_operand): 
> >>> Make
> >>> address check not strict prior to reload.
> >>>
> >>> gcc/testsuite
> >>> 2018-06-07 Andre Vieira 
> >>>
> >>> PR target/83009
> >>> * gcc/target/aarch64/store_v2vec_lanes.c: Add extra tests.
> >>
> >> diff --git a/gcc/config/aarch64/predicates.md 
> >> b/gcc/config/aarch64/predicates.md
> >> index 
> >> f0917af8b4cec945ba4e38e4dc670200f8812983..30aa88838671bf343a883077c2b606a035c030dd
> >>  100644
> >> --- a/gcc/config/aarch64/predicates.md
> >> +++ b/gcc/config/aarch64/predicates.md
> >> @@ -227,7 +227,7 @@
> >>   (define_predicate "aarch64_mem_pair_lanes_operand"
> >> (and (match_code "mem")
> >>  (match_test "aarch64_legitimate_address_p (GET_MODE (op), XEXP 
> >> (op, 0),
> >> -true,
> >> +reload_completed,
> >>  ADDR_QUERY_LDP_STP_N)")))
> >>   
> >>
> >> If you want to enforce strict checking during reload and later then I 
> >> think you need to use reload_in_progress || reload_completed ?
> > 
> > That was the old way, but it would be lra_in_progress now.
> > However...
> > 
> >> I guess that would be equivalent to !can_create_pseudo ().
> > 
> > We should never see pseudos when reload_completed, so the choice
> > shouldn't really matter then.  And I don't think we should use
> > lra_in_progress either, since that would make the checks stricter
> > before RA has actually happened, which would likely lead to an
> > unrecognisable insn ICE if recog is called during one of the LRA
> > subpasses.
> > 
> > So unless we know a reason otherwise, I think this should just
> > be "false" (like it already is for aarch64_mem_pair_operand).
> > 
> > Thanks,
> > Richard
> > 
> Changed it to false.
> 
> Bootstrapped and regression testing for aarch64-none-linux-gnu.
> 
> Is this OK for trunk?

OK.

Thanks,
James


> gcc
> 2018-06-25  Andre Vieira  
> 
> PR target/83009
> * config/aarch64/predicates.md (aarch64_mem_pair_lanes_operand):
> Make
> address check not strict.
> 
> gcc/testsuite
> 2018-06-25  Andre Vieira  
> 
> PR target/83009
> * gcc/target/aarch64/store_v2vec_lanes.c: Add extra tests.



Re: [PATCH, GCC, AARCH64] Add support for +profile extension

2018-07-17 Thread James Greenhalgh
On Mon, Jul 09, 2018 at 08:20:53AM -0500, Andre Vieira (lists) wrote:
> Hi,
> 
> This patch adds support for the Statistical Profiling Extension (SPE) on
> AArch64. Even though the compiler will not generate code any differently
> given this extension, it will need to pass it on to the assembler in
> order to let it correctly assemble inline asm containing accesses to the
> extension's system registers.  The same applies when using the
> preprocessor on an assembly file as this first must pass through cc1.
> 
> I left the hwcaps string for SPE empty as the kernel does not define a
> feature string for this extension.  The current effect of this is that the
> driver will disable the profile feature bit in GCC.  This is OK though
> because we don't, nor do we ever, enable this feature bit, as codegen is
> not affected by the SPE support and, more importantly, the driver will still
> pass the extension down to the assembler regardless.

Please make these conditions clear in the documentation. Something like:

> +@item profile
> +Enable the Statistical Profiling extension.  This option only changes
> +the behavior of the assembler, and does not change code generation.

Maybe worded better...

> 
> Bootstrapped aarch64-none-linux-gnu and ran regression tests.
> 
> Is it OK for trunk?
> 
> gcc/ChangeLog:
> 2018-07-09  Andre Vieira  
> 
>   * config/aarch64/aarch64-option-extensions.def: New entry for profile
>   extension.
>   * config/aarch64/aarch64.h (AARCH64_FL_PROFILE): New.
>   * doc/invoke.texi (aarch64-feature-modifiers): New entry for profile
>   extension.
> 
> gcc/testsuite/ChangeLog:
> 2018-07-09 Andre Vieira 
> 
>   * gcc.target/aarch64/profile.c: New test.

This test will fail for targets with old assemblers. That isn't ideal; we
don't normally add these assembler tests for new instructions for that
reason. Personally I'd drop the test down to a compile-only and scan the
assembler for "+profile".

OK with those changes.

Thanks,
James


> diff --git a/gcc/config/aarch64/aarch64-option-extensions.def 
> b/gcc/config/aarch64/aarch64-option-extensions.def
> index 
> 5fe5e3f7dddf622a48a5b9458ef30449a886f395..69ab796a4e1a959b89ebb55b599919c442cfb088
>  100644
> --- a/gcc/config/aarch64/aarch64-option-extensions.def
> +++ b/gcc/config/aarch64/aarch64-option-extensions.def
> @@ -105,4 +105,7 @@ AARCH64_OPT_EXTENSION("fp16fml", AARCH64_FL_F16FML, 
> AARCH64_FL_FP | AARCH64_FL_F
> Disabling "sve" just disables "sve".  */
>  AARCH64_OPT_EXTENSION("sve", AARCH64_FL_SVE, AARCH64_FL_FP | AARCH64_FL_SIMD 
> | AARCH64_FL_F16, 0, "sve")
>  
> +/* Enabling/Disabling "profile" does not enable/disable any other feature.  
> */
> +AARCH64_OPT_EXTENSION("profile", AARCH64_FL_PROFILE, 0, 0, "")
> +
>  #undef AARCH64_OPT_EXTENSION
> diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
> index 
> f284e74bfb8c9bab2aa22cc6c5a67750cbbba3c2..c1218503bab19323eee1cca8b7e4bea8fbfcf573
>  100644
> --- a/gcc/config/aarch64/aarch64.h
> +++ b/gcc/config/aarch64/aarch64.h
> @@ -158,6 +158,9 @@ extern unsigned aarch64_architecture_version;
>  #define AARCH64_FL_SHA3(1 << 18)  /* Has ARMv8.4-a SHA3 and 
> SHA512.  */
>  #define AARCH64_FL_F16FML (1 << 19)  /* Has ARMv8.4-a FP16 extensions.  
> */
>  
> +/* Statistical Profiling extensions.  */
> +#define AARCH64_FL_PROFILE(1 << 20)
> +
>  /* Has FP and SIMD.  */
>  #define AARCH64_FL_FPSIMD (AARCH64_FL_FP | AARCH64_FL_SIMD)
>  
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 
> 56cd122b0d7b420e2b16ceb02907860879d3b9d7..4ca68a563297482afc75abed4a31c106af38caf7
>  100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -14813,6 +14813,8 @@ instructions. Use of this option with architectures 
> prior to Armv8.2-A is not su
>  @item sm4
>  Enable the sm3 and sm4 crypto extension.  This also enables Advanced SIMD 
> instructions.
>  Use of this option with architectures prior to Armv8.2-A is not supported.
> +@item profile
> +Enable the Statistical Profiling extension.
>  
>  @end table
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/profile.c 
> b/gcc/testsuite/gcc.target/aarch64/profile.c
> new file mode 100644
> index 
> ..db51b4746dd60009d784bc0b37ea99b2f120d856
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/profile.c
> @@ -0,0 +1,9 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-std=gnu99 -march=armv8.2-a+profile" } */
> +
> +int foo (void)
> +{
> +  int ret;
> +  asm ("mrs  %0, pmblimitr_el1" : "=r" (ret));
> +  return ret;
> +}
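
A compile-only variant along the lines suggested above might look like the
following sketch (it assumes the compiler forwards the extension string to the
assembler via the .arch directive, so scanning the output for "+profile" is
sufficient):

/* { dg-do compile } */
/* { dg-options "-march=armv8.2-a+profile" } */

int
foo (void)
{
  return 0;
}

/* { dg-final { scan-assembler "\\+profile" } } */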



Re: [PATCH][Aarch64] v2: Arithmetic overflow subv patterns [Patch 3/4]

2018-07-19 Thread James Greenhalgh
On Wed, Jun 13, 2018 at 03:06:05AM -0500, Michael Collison wrote:
> Updated previous patch:
> 
> https://gcc.gnu.org/ml/gcc-patches/2018-06/msg00508.html
> 
> With coding style feedback from Richard Sandiford: (that also apply to this 
> patch)
> 
>  https://gcc.gnu.org/ml/gcc-patches/2018-06/msg00508.html
> 
> Bootstrapped and tested on aarch64-linux-gnu. Okay for trunk?

OK.

Thanks,
James

> 
> 2018-05-31  Michael Collison  
>   Richard Henderson 
> 
>   * config/aarch64/aarch64.md (subv4, usubv4): New patterns.
>   (subti): Handle op1 zero.
>   (subvti4, usub4ti4): New.
>   (*sub3_compare1_imm): New.
>   (sub3_carryinCV): New.
>   (*sub3_carryinCV_z1_z2, *sub3_carryinCV_z1): New.
>   (*sub3_carryinCV_z2, *sub3_carryinCV): New.



Re: [PATCH][Aarch64] v2: Arithmetic overflow addv patterns [Patch 2/4]

2018-07-19 Thread James Greenhalgh
On Wed, Jun 13, 2018 at 02:57:45AM -0500, Michael Collison wrote:
> Updated with Richard's style and mismatched mode comments.
> 
> Okay for trunk?

OK.

Thanks,
James



Re: [PATCH][GCC][AARCH64] Canonicalize aarch64 widening simd plus insns

2018-07-24 Thread James Greenhalgh
On Thu, Jul 19, 2018 at 07:35:22AM -0500, Matthew Malcomson wrote:
> Hi again.
> 
> Providing an updated patch to include the formatting suggestions.

Please try not to top-post replies; it makes the conversation thread
harder to follow (reply continues below!).
 
> On 12/07/18 11:39, Sudakshina Das wrote:
> > Hi Matthew
> >
> > On 12/07/18 11:18, Richard Sandiford wrote:
> >> Looks good to me FWIW (not a maintainer), just a minor formatting thing:
> >>
> >> Matthew Malcomson  writes:
> >>> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> >>> b/gcc/config/aarch64/aarch64-simd.md
> >>> index 
> >>> aac5fa146ed8dde4507a0eb4ad6a07ce78d2f0cd..67b29cbe2cad91e031ee23be656ec61a403f2cf9
> >>>  
> >>> 100644
> >>> --- a/gcc/config/aarch64/aarch64-simd.md
> >>> +++ b/gcc/config/aarch64/aarch64-simd.md
> >>> @@ -3302,38 +3302,78 @@
> >>>     DONE;
> >>>   })
> >>>   -(define_insn "aarch64_w"
> >>> +(define_insn "aarch64_subw"
> >>>     [(set (match_operand: 0 "register_operand" "=w")
> >>> -    (ADDSUB: (match_operand: 1 "register_operand" 
> >>> "w")
> >>> -    (ANY_EXTEND:
> >>> -  (match_operand:VD_BHSI 2 "register_operand" "w"]
> >>> +    (minus:
> >>> + (match_operand: 1 "register_operand" "w")
> >>> + (ANY_EXTEND:
> >>> +   (match_operand:VD_BHSI 2 "register_operand" "w"]
> >>
> >> The (minus should be under the "(match_operand":
> >>
> >> (define_insn "aarch64_subw"
> >>    [(set (match_operand: 0 "register_operand" "=w")
> >> (minus: (match_operand: 1 "register_operand" "w")
> >>    (ANY_EXTEND:
> >>  (match_operand:VD_BHSI 2 "register_operand" "w"]
> >>
> >> Same for the other patterns.
> >>
> >> Thanks,
> >> Richard
> >>
> >
> > You will need a maintainer's approval but this looks good to me.
> > Thanks for doing this. I would only point out one other nit which you
> > can choose to ignore:
> >
> > +/* Ensure
> > +   saddw2 and one saddw for the function add()
> > +   ssubw2 and one ssubw for the function subtract()
> > +   uaddw2 and one uaddw for the function uadd()
> > +   usubw2 and one usubw for the function usubtract() */
> > +
> > +/* { dg-final { scan-assembler-times "\[ \t\]ssubw2\[ \t\]+" 1 } } */
> > +/* { dg-final { scan-assembler-times "\[ \t\]ssubw\[ \t\]+" 1 } } */
> > +/* { dg-final { scan-assembler-times "\[ \t\]saddw2\[ \t\]+" 1 } } */
> > +/* { dg-final { scan-assembler-times "\[ \t\]saddw\[ \t\]+" 1 } } */
> > +/* { dg-final { scan-assembler-times "\[ \t\]usubw2\[ \t\]+" 1 } } */
> > +/* { dg-final { scan-assembler-times "\[ \t\]usubw\[ \t\]+" 1 } } */
> > +/* { dg-final { scan-assembler-times "\[ \t\]uaddw2\[ \t\]+" 1 } } */
> > +/* { dg-final { scan-assembler-times "\[ \t\]uaddw\[ \t\]+" 1 } } */
> >
> > The scan-assembly directives for the different
> > functions can be placed right below each of them and that would
> > make it easier to read the expected results in the test and you
> > can get rid of the comments saying the same.

Thanks for the first-line review Sudi.

OK for trunk.

Thanks,
James
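
As an illustration of Sudi's layout suggestion, each scan directive would sit
directly under the function it checks, something like this sketch (the body is
a placeholder; the function and mnemonic names follow the quoted comment):

/* Placeholder: whatever body vectorizes to one ssubw2 plus one ssubw.  */
void subtract (/* ... */);
/* { dg-final { scan-assembler-times "\[ \t\]ssubw2\[ \t\]+" 1 } } */
/* { dg-final { scan-assembler-times "\[ \t\]ssubw\[ \t\]+" 1 } } */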



Re: [Patch] [Aarch64] PR 86538 - Define __ARM_FEATURE_LSE if LSE is available

2018-07-24 Thread James Greenhalgh
On Tue, Jul 24, 2018 at 03:22:02PM -0500, Steve Ellcey wrote:
> This is a patch for PR 86538, to define an __ARM_FEATURE_LSE macro
> when LSE is available.  Richard Earnshaw closed PR 86538 as WONTFIX
> because the ACLE (Arm C Language Extension) does not require this
> macro and because he is concerned that it might encourage people to
> use inline assembly instead of the __sync and atomic intrinsics.
> (See actual comments in the defect report.)
> 
> While I agree that we want people to use the intrinsics I still think
> there are use cases where people may want to know if LSE is available
> or not and there is currently no (simple) way to determine if this feature
> is available since it can be turned on and off independently of the
> architecture used.  Also, as a general principle, I think any feature
> that can be toggled on or off by the compiler should provide a way for
> users to determine what its state is.

Well, we blow that design principle all over the place (find me a macro
which tells you whether AARCH64_EXTRA_TUNE_SLOW_UNALIGNED_LDPW is on for
example :-) )

A better design principle would be that if we think language programmers
may want to compile in different C code depending on a compiler option, we
should consider adding a feature macro.

> So what do other ARM maintainers and users think?  Is this a useful
> feature to have in GCC?

I'm with Richard on this one.

Whether LSE is available or not at compile time, the best user strategy is
to use the C11/C++11 atomic extensions. That's where the memory model is
well defined, well reasoned about, and well implemented.

Purely in ACLE we're not keen on providing macros that don't provide choice
to a C-level programmer (i.e. change the presence of intrinsics).

You could well imagine an inline asm programmer wanting to choose between an
LSE path and an Armv8.0-A path; but I can't imagine what they would want to
do on that path that couldn't be expressed better in the C language. You
might say they want to validate presence of the instruction; but that will
need to be a dynamic check outside of ACLE anyway.

All of which is to say, I don't think that this is a necessary macro. Each
time I've seen it requested by a user, we've told them the same thing: what
do you want to express here that isn't better expressed by C atomic
primitives?

I'd say this patch isn't desirable for trunk. I'd be interested in use cases
that need a static decision on presence of LSE that are not better expressed
using higher level language features.

Thanks,
James
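
For illustration, the approach being recommended (let the compiler pick
between LSE and an exclusive-load/store loop rather than keying C code off a
feature macro) is simply a matter of using the standard atomics, as in this
sketch:

#include <stdatomic.h>

/* Sketch: built with an LSE-enabled architecture (e.g. -march=armv8.1-a),
   GCC emits an LSE atomic here; without LSE it emits an LDXR/STXR retry
   loop.  No feature macro is required in the source.  */
int
fetch_add (atomic_int *p, int v)
{
  return atomic_fetch_add_explicit (p, v, memory_order_seq_cst);
}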



Re: [PATCH][GCC][AArch64] Set default values for stack-clash and do basic validation in back-end. [Patch (5/6)]

2018-07-31 Thread James Greenhalgh
On Tue, Jul 24, 2018 at 05:27:05AM -0500, Tamar Christina wrote:
> Hi All,
> 
> This patch is a cascade update from having to re-spin the configure patch 
> (no# 4 in the series).
> 
> This patch enforces that the default guard size for stack-clash protection for
> AArch64 be 64KB unless the user has overridden it via configure in which case
> the user value is used as long as that value is within the valid range.
> 
> It also does some basic validation to ensure that the guard size is only 4KB 
> or
> 64KB and also enforces that for aarch64 the stack-clash probing interval is
> equal to the guard size.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> Target was tested with stack clash on and off by default.
> 
> Ok for trunk?

This is OK with the style changes below.

Thanks,
James

> gcc/
> 2018-07-24  Tamar Christina  
> 
>   PR target/86486
>   * config/aarch64/aarch64.c (aarch64_override_options_internal):
>   Add validation for stack-clash parameters and set defaults.
> 
> > -Original Message-
> > From: Tamar Christina 
> > Sent: Wednesday, July 11, 2018 12:23
> > To: gcc-patches@gcc.gnu.org
> > Cc: nd ; James Greenhalgh ;
> > Richard Earnshaw ; Marcus Shawcroft
> > 
> > Subject: [PATCH][GCC][AArch64] Set default values for stack-clash and do
> > basic validation in back-end. [Patch (5/6)]
> > 
> > Hi All,
> > 
> > This patch enforces that the default guard size for stack-clash protection 
> > for
> > AArch64 be 64KB unless the user has overridden it via configure in which case
> > the user value is used as long as that value is within the valid range.
> > 
> > It also does some basic validation to ensure that the guard size is only 
> > 4KB or
> > 64KB and also enforces that for aarch64 the stack-clash probing interval is
> > equal to the guard size.
> > 
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > Target was tested with stack clash on and off by default.
> > 
> > Ok for trunk?
> > 
> > Thanks,
> > Tamar
> > 
> > gcc/
> > 2018-07-11  Tamar Christina  
> > 
> > PR target/86486
> > * config/aarch64/aarch64.c (aarch64_override_options_internal):
> > Add validation for stack-clash parameters.
> > 
> > --

> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 
> e2c34cdfc96a1d3f99f7e4834c66a7551464a518..30c62c406e10793fe041d54c73316a6c8d7c229f
>  100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -10916,6 +10916,37 @@ aarch64_override_options_internal (struct 
> gcc_options *opts)
>opts->x_param_values,
>global_options_set.x_param_values);
>  
> +  /* If the user hasn't change it via configure then set the default to 64 KB

s/change/changed/

> + for the backend.  */
> +  maybe_set_param_value (PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE,
> +  DEFAULT_STK_CLASH_GUARD_SIZE == 0
> +? 16 : DEFAULT_STK_CLASH_GUARD_SIZE,
> +  opts->x_param_values,
> +  global_options_set.x_param_values);
> +
> +  /* Validate the guard size.  */
> +  int guard_size = PARAM_VALUE (PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE);
> +  if (guard_size != 12 && guard_size != 16)
> +  error ("only values 12 (4 KB) and 16 (64 KB) are supported for guard "

Formatting is wrong, two spaces to indent error.

> +  "size.  Given value %d (%llu KB) is out of range.\n",

No \n on errors. s/out of range/invalid/

> +  guard_size, (1ULL << guard_size) / 1024ULL);
> +
> +  /* Enforce that interval is the same size as size so the mid-end does the
> + right thing.  */
> +  maybe_set_param_value (PARAM_STACK_CLASH_PROTECTION_PROBE_INTERVAL,
> +  guard_size,
> +  opts->x_param_values,
> +  global_options_set.x_param_values);
> +
> +  /* The maybe_set calls won't update the value if the user has explicitly 
> set
> + one.  Which means we need to validate that probing interval and guard 
> size
> + are equal.  */
> +  int probe_interval
> += PARAM_VALUE (PARAM_STACK_CLASH_PROTECTION_PROBE_INTERVAL);
> +  if (guard_size != probe_interval)
> +error ("stack clash guard size '%d' must be equal to probing interval "
> +"'%d'\n", guard_size, probe_interval);

No \n on errors.

> +
>/* Enable sw prefetching at specified optimization level for
>   CPUS that have prefetch.  Lower optimization level threshold by 1
>   when profiling is enabled.  */
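
For concreteness, the first diagnostic with the style comments above applied
would read roughly as follows (a sketch of the intended shape, not the final
committed code):

  /* Validate the guard size.  */
  int guard_size = PARAM_VALUE (PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE);
  if (guard_size != 12 && guard_size != 16)
    error ("only values 12 (4 KB) and 16 (64 KB) are supported for guard "
           "size.  Given value %d (%llu KB) is invalid",
           guard_size, (1ULL << guard_size) / 1024ULL);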
> 



Re: [PATCH][GCC][AArch64] Cleanup the AArch64 testsuite when stack-clash is on [Patch (6/6)]

2018-07-31 Thread James Greenhalgh
On Tue, Jul 24, 2018 at 05:28:03AM -0500, Tamar Christina wrote:
> Hi All,
> 
> This patch cleans up the testsuite when a run is done with stack clash
> protection turned on.
> 
> Concretely this switches off -fstack-clash-protection for a couple of tests:
> 
> * sve: We don't yet support stack-clash-protection and sve, so for now turn 
> these off.
> * assembler scan: some tests are quite fragile in that they check for exact
>assembly output, e.g. check for exact amount of sub etc.  These won't
>match now.
> * vla: Some of the ubsan tests use negative array indices.  Because the
>    arrays weren't used before, the incorrect $sp wouldn't have been used.
>    The correct value is restored on ret.  Now however we probe the $sp,
>    which causes a segfault.
> * params: When testing the parameters we have to skip these on AArch64 
> because of our
>   custom constraints on them.  We already test them separately so 
> this isn't a
>   loss.
> 
> Note that the testsuite is not entirely clean due to a gdb failure caused by
> alloca with
> stack clash. On AArch64 we output an incorrect .loc directive, but this is 
> already the
> case with the current implementation in GCC and is a bug unrelated to this 
> patch series.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu and no 
> issues.
> Both targets were tested with stack clash on and off by default.
> 
> Ok for trunk?

For each of the generic tests you skip because of mismatched bounds, I think
we should ensure we have an equivalent test checking that behaviour in
gcc.target/aarch64/ . If we have that, it might be good to cross-reference
them with a comment above your skip lines.

> * vla: Some of the ubsan tests use negative array indices.  Because the
>    arrays weren't used before, the incorrect $sp wouldn't have been used.
>    The correct value is restored on ret.  Now however we probe the $sp,
>    which causes a segfault.

This is interesting behaviour; is it a desirable side effect of your changes?

Otherwise, this patch is OK.

Thanks,
James


> gcc/testsuite/
> 2018-07-24  Tamar Christina  
> 
>   PR target/86486
>   * gcc.dg/pr82788.c: Skip for AArch64.
>   * gcc.dg/guality/vla-1.c: Turn off stack-clash.
>   * gcc.target/aarch64/subsp.c: Likewise.
>   * gcc.target/aarch64/sve/mask_struct_load_3.c: Likewise.
>   * gcc.target/aarch64/sve/mask_struct_store_3.c: Likewise.
>   * gcc.target/aarch64/sve/mask_struct_store_4.c: Likewise.
>   * gcc.dg/params/blocksort-part.c: Skip stack-clash checks
>   on AArch64.
>   * gcc.dg/stack-check-10.c: Add AArch64 specific checks.
>   * gcc.dg/stack-check-5.c: Add AArch64 specific checks.
>   * gcc.dg/stack-check-6a.c: Skip on AArch64, we don't support this.
>   * testsuite/lib/target-supports.exp
>   (check_effective_target_frame_pointer_for_non_leaf): AArch64 does not
>   require frame pointer for non-leaf functions.
> 
> > -Original Message-
> > From: Tamar Christina 
> > Sent: Wednesday, July 11, 2018 12:23
> > To: gcc-patches@gcc.gnu.org
> > Cc: nd ; James Greenhalgh ;
> > Richard Earnshaw ; Marcus Shawcroft
> > 
> > Subject: [PATCH][GCC][AArch64] Cleanup the AArch64 testsuite when stack-
> > clash is on [Patch (6/6)]
> > 
> > Hi All,
> > 
> > This patch cleans up the testsuite when a run is done with stack clash
> > protection turned on.
> > 
> > Concretely this switches off -fstack-clash-protection for a couple of tests:
> > 
> > * sve: We don't yet support stack-clash-protection and sve, so for now turn
> > these off.
> > * assembler scan: some tests are quite fragile in that they check for exact
> >assembly output, e.g. check for exact amount of sub etc.  These won't
> >match now.
> > * vla: Some of the ubsan tests negative array indices. Because the arrays
> > weren't
> >used before the incorrect $sp wouldn't have been used. The correct
> > value is
> >restored on ret.  Now however we probe the $sp which causes a 
> > segfault.
> > * params: When testing the parameters we have to skip these on AArch64
> > because of our
> >   custom constraints on them.  We already test them separately so 
> > this
> > isn't a
> >   loss.
> > 
> > Note that the testsuite is not entirely clean due to a gdb failure caused by
> > alloca
> > with stack clash. On AArch64 we output an incorrect .loc directive, but 
> > this is
> > already the case with the current implementation in


Re: [PATCH][AARCH64] PR target/84521 Fix frame pointer corruption with -fomit-frame-pointer with __builtin_setjmp

2018-07-31 Thread James Greenhalgh
On Thu, Jul 12, 2018 at 12:01:09PM -0500, Sudakshina Das wrote:
> Hi Eric
> 
> On 27/06/18 12:22, Wilco Dijkstra wrote:
> > Eric Botcazou wrote:
> > 
> >>> This test can easily be changed not to use optimize since it doesn't look
> >>> like it needs it. We really need to test these builtins properly,
> >>> otherwise they will continue to fail on most targets.
> >>
> >> As far as I can see PR target/84521 has been reported only for Aarch64 so 
> >> I'd
> >> just leave the other targets alone (and avoid propagating FUD if possible).
> > 
> > It's quite obvious from PR84521 that this is an issue affecting all targets.
> > Adding better generic tests for __builtin_setjmp can only be a good thing.
> > 
> > Wilco
> > 
> 
> This conversation seems to have died down and I would like to
> start it again. I would agree with Wilco's suggestion about
> keeping the test in the generic folder. I have removed the
> optimize attribute and the effect is still the same. It passes
> on AArch64 with this patch and it currently fails on x86
> trunk (gcc version 9.0.0 20180712 (experimental) (GCC))
> on -O1 and above.


I don't see where the FUD comes in here; either this builtin has defined
semantics across targets and they are adhered to, or the builtin doesn't have
well-defined semantics, or the targets fail to implement those semantics.

I think this should go in as is. If other targets are unhappy with the
failing test they should fix their target or skip the test if it is not
appropriate.

You may want to CC some of the maintainers of platforms you know to fail as
a courtesy on the PR (add your testcase, and add failing targets and their
maintainers to that PR) before committing so it doesn't come as a complete
surprise.

This is OK with some attempt to get target maintainers involved in the
conversation before commit.

Thanks,
James

> diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
> index f284e74..9792d28 100644
> --- a/gcc/config/aarch64/aarch64.h
> +++ b/gcc/config/aarch64/aarch64.h
> @@ -473,7 +473,9 @@ extern unsigned aarch64_architecture_version;
>  #define EH_RETURN_STACKADJ_RTX   gen_rtx_REG (Pmode, R4_REGNUM)
>  #define EH_RETURN_HANDLER_RTX  aarch64_eh_return_handler_rtx ()
>  
> -/* Don't use __builtin_setjmp until we've defined it.  */
> +/* Don't use __builtin_setjmp until we've defined it.
> +   CAUTION: This macro is only used during exception unwinding.
> +   Don't fall for its name.  */
>  #undef DONT_USE_BUILTIN_SETJMP
>  #define DONT_USE_BUILTIN_SETJMP 1
>  
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 01f35f8..4266a3d 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -3998,7 +3998,7 @@ static bool
>  aarch64_needs_frame_chain (void)
>  {
>/* Force a frame chain for EH returns so the return address is at FP+8.  */
> -  if (frame_pointer_needed || crtl->calls_eh_return)
> +  if (frame_pointer_needed || crtl->calls_eh_return || 
> cfun->has_nonlocal_label)
>  return true;
>  
>/* A leaf function cannot have calls or write LR.  */
> @@ -12218,6 +12218,13 @@ aarch64_expand_builtin_va_start (tree valist, rtx 
> nextarg ATTRIBUTE_UNUSED)
>expand_expr (t, const0_rtx, VOIDmode, EXPAND_NORMAL);
>  }
>  
> +/* Implement TARGET_BUILTIN_SETJMP_FRAME_VALUE.  */
> +static rtx
> +aarch64_builtin_setjmp_frame_value (void)
> +{
> +  return hard_frame_pointer_rtx;
> +}
> +
>  /* Implement TARGET_GIMPLIFY_VA_ARG_EXPR.  */
>  
>  static tree
> @@ -17744,6 +17751,9 @@ aarch64_run_selftests (void)
>  #undef TARGET_FOLD_BUILTIN
>  #define TARGET_FOLD_BUILTIN aarch64_fold_builtin
>  
> +#undef TARGET_BUILTIN_SETJMP_FRAME_VALUE
> +#define TARGET_BUILTIN_SETJMP_FRAME_VALUE aarch64_builtin_setjmp_frame_value
> +
>  #undef TARGET_FUNCTION_ARG
>  #define TARGET_FUNCTION_ARG aarch64_function_arg
>  
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index a014a01..d5f33d8 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -6087,6 +6087,30 @@
>DONE;
>  })
>  
> +;; This is broadly similar to the builtins.c except that it uses
> +;; temporaries to load the incoming SP and FP.
> +(define_expand "nonlocal_goto"
> +  [(use (match_operand 0 "general_operand"))
> +   (use (match_operand 1 "general_operand"))
> +   (use (match_operand 2 "general_operand"))
> +   (use (match_operand 3 "general_operand"))]
> +  ""
> +{
> +rtx label_in = copy_to_reg (operands[1]);
> +rtx fp_in = copy_to_reg (operands[3]);
> +rtx sp_in = copy_to_reg (operands[2]);
> +
> +emit_move_insn (hard_frame_pointer_rtx, fp_in);
> +emit_stack_restore (SAVE_NONLOCAL, sp_in);
> +
> +emit_use (hard_frame_pointer_rtx);
> +emit_use (stack_pointer_rtx);
> +
> +emit_indirect_jump (label_in);
> +
> +DONE;
> +})
> +
>  ;; Helper for aarch64.c code.
>  (define_expand "set_clobber_cc"
>[(parallel [(set (match_operand 0)
> diff --git a/gcc/testsuite/gcc.c
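
A minimal sketch of the kind of generic __builtin_setjmp/__builtin_longjmp
test under discussion (illustrative only, not the testcase from the PR):

#include <stdlib.h>

void *buf[5];

__attribute__ ((noinline, noclone)) void
bar (void)
{
  /* Jump back to the still-active __builtin_setjmp in main.  */
  __builtin_longjmp (buf, 1);
}

int
main (void)
{
  if (__builtin_setjmp (buf) == 0)
    {
      bar ();
      abort ();
    }
  return 0;
}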

Re: [AArch64] Add support for 16-bit FMOV immediates

2018-07-31 Thread James Greenhalgh
On Wed, Jul 18, 2018 at 12:47:27PM -0500, Richard Sandiford wrote:
> aarch64_float_const_representable_p was still returning false for
> HFmode, so we wouldn't use 16-bit FMOV immediate.  E.g. before the
> patch:
> 
> __fp16 foo (void) { return 0x1.1p-3; }
> 
> gave:
> 
>mov w0, 12352
>fmovh0, w0
> 
> with -march=armv8.2-a+fp16, whereas now it gives:
> 
>fmovh0, 1.328125e-1
> 
> Tested on aarch64-linux-gnu, both with and without SVE.  OK to install?

OK.

Thanks,
James

> 
> Richard
> 
> 
> 2018-07-18  Richard Sandiford  
> 
> gcc/
>   * config/aarch64/aarch64.c (aarch64_float_const_representable_p):
>   Allow HFmode constants if TARGET_FP_F16INST.
> 
> gcc/testsuite/
>   * gcc.target/aarch64/f16_mov_immediate_1.c: Expect fmov immediate
>   to be used.
>   * gcc.target/aarch64/f16_mov_immediate_2.c: Likewise.
>   * gcc.target/aarch64/f16_mov_immediate_3.c: Force +nofp16.
>   * gcc.target/aarch64/sve/single_1.c: Except fmov immediate to be used
>   for .h.
>   * gcc.target/aarch64/sve/single_2.c: Likewise.
>   * gcc.target/aarch64/sve/single_3.c: Likewise.
>   * gcc.target/aarch64/sve/single_4.c: Likewise.
> 
> Index: gcc/config/aarch64/aarch64.c
> ===
> --- gcc/config/aarch64/aarch64.c  2018-07-18 18:45:26.0 +0100
> +++ gcc/config/aarch64/aarch64.c  2018-07-18 18:45:27.025332090 +0100
> @@ -14908,8 +14908,8 @@ aarch64_float_const_representable_p (rtx
>if (!CONST_DOUBLE_P (x))
>  return false;
>  
> -  /* We don't support HFmode constants yet.  */
> -  if (GET_MODE (x) == VOIDmode || GET_MODE (x) == HFmode)
> +  if (GET_MODE (x) == VOIDmode
> +  || (GET_MODE (x) == HFmode && !TARGET_FP_F16INST))
>  return false;
>  
>r = *CONST_DOUBLE_REAL_VALUE (x);
> Index: gcc/testsuite/gcc.target/aarch64/f16_mov_immediate_1.c
> ===
> --- gcc/testsuite/gcc.target/aarch64/f16_mov_immediate_1.c2018-07-18 
> 18:45:26.0 +0100
> +++ gcc/testsuite/gcc.target/aarch64/f16_mov_immediate_1.c2018-07-18 
> 18:45:27.025332090 +0100
> @@ -44,6 +44,6 @@ __fp16 f5 ()
>return a;
>  }
>  
> -/* { dg-final { scan-assembler-times "mov\tw\[0-9\]+, #?19520"   3 } 
> } */
> -/* { dg-final { scan-assembler-times "movi\tv\[0-9\]+\\\.4h, 0xbc, lsl 8"  1 
> } } */
> -/* { dg-final { scan-assembler-times "movi\tv\[0-9\]+\\\.4h, 0x4c, lsl 8"  1 
> } } */
> +/* { dg-final { scan-assembler-times {fmov\th[0-9]+, #?1\.7e\+1}  3 } } */
> +/* { dg-final { scan-assembler-times {fmov\th[0-9]+, #?-1\.0e\+0} 1 } } */
> +/* { dg-final { scan-assembler-times {fmov\th[0-9]+, #?1\.6e\+1}  1 } } */
> Index: gcc/testsuite/gcc.target/aarch64/f16_mov_immediate_2.c
> ===
> --- gcc/testsuite/gcc.target/aarch64/f16_mov_immediate_2.c2018-07-18 
> 18:45:26.0 +0100
> +++ gcc/testsuite/gcc.target/aarch64/f16_mov_immediate_2.c2018-07-18 
> 18:45:27.025332090 +0100
> @@ -40,6 +40,4 @@ float16_t f3(void)
>  /* { dg-final { scan-assembler-times "movi\tv\[0-9\]+\\\.4h, 0x5c, lsl 8" 1 
> } } */
>  /* { dg-final { scan-assembler-times "movi\tv\[0-9\]+\\\.4h, 0x7c, lsl 8" 1 
> } } */
>  
> -/* { dg-final { scan-assembler-times "mov\tw\[0-9\]+, 19520"  1 
> } } */
> -/* { dg-final { scan-assembler-times "fmov\th\[0-9\], w\[0-9\]+"  1 
> } } */
> -
> +/* { dg-final { scan-assembler-times {fmov\th[0-9]+, #?1.7e\+1}   1 
> } } */
> Index: gcc/testsuite/gcc.target/aarch64/f16_mov_immediate_3.c
> ===
> --- gcc/testsuite/gcc.target/aarch64/f16_mov_immediate_3.c2018-07-18 
> 18:45:26.0 +0100
> +++ gcc/testsuite/gcc.target/aarch64/f16_mov_immediate_3.c2018-07-18 
> 18:45:27.025332090 +0100
> @@ -1,6 +1,8 @@
>  /* { dg-do compile } */
>  /* { dg-options "-O2" } */
>  
> +#pragma GCC target "+nofp16"
> +
>  __fp16 f4 ()
>  {
>__fp16 a = 0.1;
> Index: gcc/testsuite/gcc.target/aarch64/sve/single_1.c
> ===
> --- gcc/testsuite/gcc.target/aarch64/sve/single_1.c   2018-07-18 
> 18:45:26.0 +0100
> +++ gcc/testsuite/gcc.target/aarch64/sve/single_1.c   2018-07-18 
> 18:45:27.025332090 +0100
> @@ -36,7 +36,7 @@ TEST_LOOP (double, 3.0)
>  /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.s, #6\n} 1 } } */
>  /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #7\n} 1 } } */
>  /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #8\n} 1 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.h, #15360\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tfmov\tz[0-9]+\.h, #1\.0e\+0\n} 1 } } 
> */
>  /* { dg-final { scan-assembler-times {\tfmov\tz[0-9]+\.s, #2\.0e\+0\n} 1 } } 
> */
>  /* { dg-final { scan-assembler-times {\tfmov\tz[0-9]+\.d, #3\.0e\+0\n

Re: [PATCH][AArch64] Implement new intrinsics vabsd_s64 and vnegd_s64

2018-07-31 Thread James Greenhalgh
On Fri, Jul 20, 2018 at 04:37:34AM -0500, Vlad Lazar wrote:
> Hi,
> 
> The patch adds implementations for the NEON intrinsics vabsd_s64 and 
> vnegd_s64.
> (https://developer.arm.com/products/architecture/cpu-architecture/a-profile/docs/ihi0073/latest/arm-neon-intrinsics-reference-architecture-specification)
> 
> Bootstrapped and regtested on aarch64-none-linux-gnu and there are no 
> regressions.
> 
> OK for trunk?
> 
> +__extension__ extern __inline int64_t
> +__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
> +vnegd_s64 (int64_t __a)
> +{
> +  return -__a;
> +}

Does this give the correct behaviour for the minimum value of int64_t? That
would be undefined behaviour in C, but well-defined under ACLE.

Thanks,
James
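
For what it's worth, one way to implement vnegd_s64 so that the ACLE-defined
behaviour also holds for INT64_MIN, without relying on C-level signed
overflow, is a sketch like this (illustrative; not necessarily what the final
patch does):

#include <stdint.h>

/* Sketch: negate in unsigned arithmetic, which is fully defined; converting
   the result back to int64_t is implementation-defined in C, but GCC defines
   it as modulo 2^64, so INT64_MIN maps back to INT64_MIN as ACLE requires.  */
__extension__ extern __inline int64_t
__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
vnegd_s64 (int64_t __a)
{
  return (int64_t) (- (uint64_t) __a);
}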



Re: [gen/AArch64] Generate helpers for substituting iterator values into pattern names

2018-07-31 Thread James Greenhalgh
On Fri, Jul 13, 2018 at 04:15:41AM -0500, Richard Sandiford wrote:
> Given a pattern like:
> 
>   (define_insn "aarch64_frecpe" ...)
> 
> the SVE ACLE implementation wants to generate the pattern for a
> particular (non-constant) mode.  This patch automatically generates
> helpers to do that, specifically:
> 
>   // Return CODE_FOR_nothing on failure.
>   insn_code maybe_code_for_aarch64_frecpe (machine_mode);
> 
>   // Assert that the code exists.
>   insn_code code_for_aarch64_frecpe (machine_mode);
> 
>   // Return NULL_RTX on failure.
>   rtx maybe_gen_aarch64_frecpe (machine_mode, rtx, rtx);
> 
>   // Assert that generation succeeds.
>   rtx gen_aarch64_frecpe (machine_mode, rtx, rtx);
> 
> Many patterns don't have sensible names when all <...>s are removed.
> E.g. "2" would give a base name "2".  The new functions
> therefore require explicit opt-in, which should also help to reduce
> code bloat.
> 
> The (arbitrary) opt-in syntax I went for was to prefix the pattern
> name with '@', similarly to the existing '*' marker.
> 
> The patch also makes config/aarch64 use the new routines in cases where
> they obviously apply.  This was mostly straight-forward, but it seemed
> odd that we defined:
> 
>aarch64_reload_movcp<...>
> 
> but then only used it with DImode, never SImode.  If we should be
> using Pmode instead of DImode, then that's a simple change,
> but should probably be a separate patch.
> 
> Tested on aarch64-linux-gnu (with and without SVE), aarch64_be-elf
> and x86_64-linux-gnu.  I think I can self-approve the gen* bits,
> but OK for the AArch64 parts?

For what it is worth, I like the change to AArch64, and would support it
when you get consensus around the new syntax from other targets.

You only have to look at something like:

> -  rtx (*gen) (rtx, rtx, rtx);
> -
> -  switch (src_mode)
> -{
> -case E_V8QImode:
> -  gen = gen_aarch64_simd_combinev8qi;
> -  break;
> -case E_V4HImode:
> -  gen = gen_aarch64_simd_combinev4hi;
> -  break;
> -case E_V2SImode:
> -  gen = gen_aarch64_simd_combinev2si;
> -  break;
> -case E_V4HFmode:
> -  gen = gen_aarch64_simd_combinev4hf;
> -  break;
> -case E_V2SFmode:
> -  gen = gen_aarch64_simd_combinev2sf;
> -  break;
> -case E_DImode:
> -  gen = gen_aarch64_simd_combinedi;
> -  break;
> -case E_DFmode:
> -  gen = gen_aarch64_simd_combinedf;
> -  break;
> -default:
> -  gcc_unreachable ();
> -}
> -
> -  emit_insn (gen (dst, src1, src2));
> +  emit_insn (gen_aarch64_simd_combine (src_mode, dst, src1, src2));

To understand this is a Good Thing for code maintainability.

Thanks,
James


> 
> Any objections to this approach or syntax?
> 
> Richard
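
As a usage note (a fragment mirroring the API described above): the maybe_*
helpers return CODE_FOR_nothing / NULL_RTX when no pattern exists for the
requested mode, so callers can test and fall back instead of asserting:

  /* Sketch: prefer the frecpe pattern when it exists for MODE.  */
  rtx pat = maybe_gen_aarch64_frecpe (mode, dst, src);
  if (pat)
    emit_insn (pat);
  else
    /* ... fall back to some other expansion ...  */;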


Re: [GCC][PATCH][Aarch64] Stop redundant zero-extension after UMOV when in DI mode

2018-07-31 Thread James Greenhalgh
On Thu, Jul 26, 2018 at 11:52:15AM -0500, Sam Tebbs wrote:



> > Thanks for making the changes and adding more test cases. I do however
> > see that you are only covering 2 out of 4 new
> > *aarch64_get_lane_zero_extenddi<> patterns. The
> > *aarch64_get_lane_zero_extendsi<> were already existing. I don't mind
> > those tests. I would just ask you to add the other two new patterns
> > as well. Also since the different versions of the instruction generate
> > same instructions (like foo_16qi and foo_8qi both give out the same
> > instruction), I would suggest using a -fdump-rtl-final (or any relevant
> > rtl dump) with the dg-options and using a scan-rtl-dump to scan the
> > pattern name. Something like:
> > /* { dg-do compile } */
> > /* { dg-options "-O3 -fdump-rtl-final" } */
> > ...
> > ...
> > /* { dg-final { scan-rtl-dump "aarch64_get_lane_zero_extenddiv16qi" 
> > "final" } } */
> >
> > Thanks
> > Sudi
> 
> Hi Sudi,
> 
> Thanks again. Here's an update that adds 4 more tests, so all 8 patterns
> generated are now tested for!

This is OK for trunk, thanks for the patch (and thanks Sudi for the review!)

Thanks,
James

> 
> Below is the updated changelog
> 
> gcc/
> 2018-07-26  Sam Tebbs  
> 
>      * config/aarch64/aarch64-simd.md
>      (*aarch64_get_lane_zero_extendsi):
>      Rename to...
> (*aarch64_get_lane_zero_extend): ... This.
>      Use GPI iterator instead of SI mode.
> 
> gcc/testsuite
> 2018-07-26  Sam Tebbs  
> 
>      * gcc.target/aarch64/extract_zero_extend.c: New file
> 



Re: [PATCH] [AArch64, Falkor] Switch to using Falkor-specific vector costs

2018-07-31 Thread James Greenhalgh
On Wed, Jul 25, 2018 at 01:10:34PM -0500, Luis Machado wrote:
> The adjusted vector costs give Falkor a reasonable boost in performance for FP
> benchmarks (both CPU2017 and CPU2006) and don't change INT benchmarks that
> much. About 0.7% for CPU2017 FP and 1.54% for CPU2006 FP.
> 
> OK for trunk?

OK if this is what works best for your subtarget.

Thanks,
James

> 
> gcc/ChangeLog:
> 
> 2018-07-25  Luis Machado  
> 
>   * config/aarch64/aarch64.c (qdf24xx_vector_cost): New.
>   (qdf24xx_tunings) : Set to qdf24xx_vector_cost.
> ---
>  gcc/config/aarch64/aarch64.c | 22 +-
>  1 file changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index fa01475..d443aee 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -430,6 +430,26 @@ static const struct cpu_vector_cost generic_vector_cost =
>1 /* cond_not_taken_branch_cost  */
>  };
>  
> +/* Qualcomm QDF24xx costs for vector insn classes.  */
> +static const struct cpu_vector_cost qdf24xx_vector_cost =
> +{
> +  1, /* scalar_int_stmt_cost  */
> +  1, /* scalar_fp_stmt_cost  */
> +  1, /* scalar_load_cost  */
> +  1, /* scalar_store_cost  */
> +  1, /* vec_int_stmt_cost  */
> +  3, /* vec_fp_stmt_cost  */
> +  2, /* vec_permute_cost  */
> +  1, /* vec_to_scalar_cost  */
> +  1, /* scalar_to_vec_cost  */
> +  1, /* vec_align_load_cost  */
> +  1, /* vec_unalign_load_cost  */
> +  1, /* vec_unalign_store_cost  */
> +  1, /* vec_store_cost  */
> +  3, /* cond_taken_branch_cost  */
> +  1  /* cond_not_taken_branch_cost  */
> +};
> +
>  /* ThunderX costs for vector insn classes.  */
>  static const struct cpu_vector_cost thunderx_vector_cost =
>  {
> @@ -890,7 +910,7 @@ static const struct tune_params qdf24xx_tunings =
>&qdf24xx_extra_costs,
>&qdf24xx_addrcost_table,
>&qdf24xx_regmove_cost,
> -  &generic_vector_cost,
> +  &qdf24xx_vector_cost,
>&generic_branch_cost,
>&generic_approx_modes,
>4, /* memmov_cost  */
> -- 
> 2.7.4
> 


Re: [PATCH] [AArch64, Falkor] Adjust Falkor's sign extend reg+reg address cost

2018-07-31 Thread James Greenhalgh
On Wed, Jul 25, 2018 at 01:35:23PM -0500, Luis Machado wrote:
> Adjust Falkor's register_sextend cost from 4 to 3.  This fixes a testsuite
> failure in gcc.target/aarch64/extend.c:ldr_sxtw where GCC was generating
> a sbfiz instruction rather than a load with sign extension.
> 
> No performance changes.

OK if this is what is best for your subtarget.

Thanks,
James

> 
> gcc/ChangeLog:
> 
> 2018-07-25  Luis Machado  
> 
>   * config/aarch64/aarch64.c (qdf24xx_addrcost_table)
>   : Set to 3.


[AArch64] Fix categorisation of the frecp* insns.

2013-09-03 Thread James Greenhalgh

Hi,

It looks like the frecp instructions got miscategorised as TARGET_FLOAT
instructions when they are in fact TARGET_SIMD instructions.

Move them to the right file, give them a simd_type, drop their "type"
and "v8type" and clean up the useless types from aarch64.md.

Also, where possible merge patterns.

Tested on aarch64-none-elf with no regression.

Thanks,
James

---
2013-09-03  James Greenhalgh   

* config/aarch64/aarch64.md
(type): Remove frecpe, frecps, frecpx.
(aarch64_frecp): Move to aarch64-simd.md,
fix to be a TARGET_SIMD instruction.
(aarch64_frecps): Remove.
* config/aarch64/aarch64-simd.md
(aarch64_frecp): New, moved from aarch64.md
(aarch64_frecps): Handle all float/vector of float modes.
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index f4b929edf44fbeebe6da2568a3aa76138eca0609..c085fb9c49958c5f402a28c0b39fe45ec1aadbc7 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4179,13 +4179,23 @@ (define_insn "aarch64_frecpe"
(set_attr "simd_mode" "")]
 )
 
+(define_insn "aarch64_frecp"
+  [(set (match_operand:GPF 0 "register_operand" "=w")
+	(unspec:GPF [(match_operand:GPF 1 "register_operand" "w")]
+		FRECP))]
+  "TARGET_SIMD"
+  "frecp\\t%0, %1"
+  [(set_attr "simd_type" "simd_frecp")
+   (set_attr "mode" "")]
+)
+
 (define_insn "aarch64_frecps"
-  [(set (match_operand:VDQF 0 "register_operand" "=w")
-	(unspec:VDQF [(match_operand:VDQF 1 "register_operand" "w")
-		 (match_operand:VDQF 2 "register_operand" "w")]
+  [(set (match_operand:VALLF 0 "register_operand" "=w")
+	(unspec:VALLF [(match_operand:VALLF 1 "register_operand" "w")
+		 (match_operand:VALLF 2 "register_operand" "w")]
 		UNSPEC_FRECPS))]
   "TARGET_SIMD"
-  "frecps\\t%0., %1., %2."
+  "frecps\\t%0, %1, %2"
   [(set_attr "simd_type" "simd_frecps")
(set_attr "simd_mode" "")]
 )
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 47532fca2c550e8ec9b63898511ef6c276943a45..a46dd5813acd9c800a6c519544fb7bca5de993d9 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -240,9 +240,6 @@ (define_attr "v8type"
fmovf2i,\
fmovi2f,\
fmul,\
-   frecpe,\
-   frecps,\
-   frecpx,\
frint,\
fsqrt,\
load_acq,\
@@ -3946,29 +3943,6 @@ (define_insn "smin3"
(set_attr "mode" "")]
 )
 
-(define_insn "aarch64_frecp"
-  [(set (match_operand:GPF 0 "register_operand" "=w")
-	(unspec:GPF [(match_operand:GPF 1 "register_operand" "w")]
-		FRECP))]
-  "TARGET_FLOAT"
-  "frecp\\t%0, %1"
-  [(set_attr "v8type" "frecp")
-   (set_attr "type" "ffarith")
-   (set_attr "mode" "")]
-)
-
-(define_insn "aarch64_frecps"
-  [(set (match_operand:GPF 0 "register_operand" "=w")
-	(unspec:GPF [(match_operand:GPF 1 "register_operand" "w")
-		 (match_operand:GPF 2 "register_operand" "w")]
-		UNSPEC_FRECPS))]
-  "TARGET_FLOAT"
-  "frecps\\t%0, %1, %2"
-  [(set_attr "v8type" "frecps")
-   (set_attr "type" "ffarith")
-   (set_attr "mode" "")]
-)
-
 ;; ---
 ;; Reload support
 ;; ---

[Patch AArch64] Obvious - Fix return types for vaddvq_64

2013-09-04 Thread James Greenhalgh

The vaddvq_s64 and vaddvq_u64 intrinsics are defined to return 32-bit
types. This is clearly wrong, so fix them to return int64_t and uint64_t
as expected.

Regression tested with a run through aarch64.exp and sanity checked.

OK for trunk?

Thanks,
James

---
gcc/

2013-09-04  James Greenhalgh  

* config/aarch64/arm_neon.h (vaddvq_64): Fix return types.
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index e289a0d..29d1378 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -17033,7 +17033,7 @@ vaddvq_s32 (int32x4_t __a)
   return vgetq_lane_s32 (__builtin_aarch64_reduc_splus_v4si (__a), 0);
 }
 
-__extension__ static __inline int32_t __attribute__ ((__always_inline__))
+__extension__ static __inline int64_t __attribute__ ((__always_inline__))
 vaddvq_s64 (int64x2_t __a)
 {
   return vgetq_lane_s64 (__builtin_aarch64_reduc_splus_v2di (__a), 0);
@@ -17060,7 +17060,7 @@ vaddvq_u32 (uint32x4_t __a)
 		__builtin_aarch64_reduc_uplus_v4si ((int32x4_t) __a), 0);
 }
 
-__extension__ static __inline uint32_t __attribute__ ((__always_inline__))
+__extension__ static __inline uint64_t __attribute__ ((__always_inline__))
 vaddvq_u64 (uint64x2_t __a)
 {
   return vgetq_lane_u64 ((uint64x2_t)

[Patch AArch64] Fix register constraints for lane intrinsics.

2013-09-06 Thread James Greenhalgh

Hi,

Most of the vector-by-element instructions in AArch64 have the restriction
that, if the vector they are taking an element from has type "h"
then it must be in a register from the lower half of the vector register
set (i.e. v0-v15). While we have imposed that restriction in places, we
have not been consistent.

Fix that.

Tested with aarch64.exp with no regressions.

OK for trunk?

Thanks,
James

---
gcc/

2013-09-06  James Greenhalgh  

* config/aarch64/aarch64-simd.md
(aarch64_sqdmll_n_internal): Use
 iterator to ensure correct register choice.
(aarch64_sqdmll2_n_internal): Likewise.
(aarch64_sqdmull_n): Likewise.
(aarch64_sqdmull2_n_internal): Likewise.
* config/aarch64/arm_neon.h
(vml_lane_16): Use 'x' constraint for element vector.
(vml_n_16): Likewise.
(vmll_high_lane_16): Likewise.
(vmll_high_n_16): Likewise.
(vmll_lane_16): Likewise.
(vmll_n_16): Likewise.
(vmul_lane_16): Likewise.
(vmul_n_16): Likewise.
(vmull_lane_16): Likewise.
(vmull_n_16): Likewise.
(vmull_high_lane_16): Likewise.
(vmull_high_n_16): Likewise.
(vqrdmulh_n_s16): Likewise.
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index c085fb9c49958c5f402a28c0b39fe45ec1aadbc7..5161e48dfcea91910b9dc2ee68219c3ede55f4aa 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -2797,7 +2797,7 @@ (define_insn "aarch64_sqdml
 		  (match_operand:VD_HSI 2 "register_operand" "w"))
 		(sign_extend:
 		  (vec_duplicate:VD_HSI
-		(match_operand: 3 "register_operand" "w"
+		(match_operand: 3 "register_operand" ""
 	  (const_int 1]
   "TARGET_SIMD"
   "sqdmll\\t%0, %2, %3.[0]"
@@ -2955,7 +2955,7 @@ (define_insn "aarch64_sqdml
   (match_operand:VQ_HSI 4 "vect_par_cnst_hi_half" "")))
 	  (sign_extend:
 (vec_duplicate:
-		  (match_operand: 3 "register_operand" "w"
+		  (match_operand: 3 "register_operand" ""
 	(const_int 1]
   "TARGET_SIMD"
   "sqdmll2\\t%0, %2, %3.[0]"
@@ -3083,7 +3083,7 @@ (define_insn "aarch64_sqdmull_n"
 		 (match_operand:VD_HSI 1 "register_operand" "w"))
 	   (sign_extend:
  (vec_duplicate:VD_HSI
-   (match_operand: 2 "register_operand" "w")))
+   (match_operand: 2 "register_operand" "")))
 	   )
 	 (const_int 1)))]
   "TARGET_SIMD"
@@ -3193,7 +3193,7 @@ (define_insn "aarch64_sqdmull2_n_i
(match_operand:VQ_HSI 3 "vect_par_cnst_hi_half" "")))
 	   (sign_extend:
  (vec_duplicate:
-   (match_operand: 2 "register_operand" "w")))
+   (match_operand: 2 "register_operand" "")))
 	   )
 	 (const_int 1)))]
   "TARGET_SIMD"
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index 29d1378..e20d34e 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -7146,7 +7146,7 @@ vld1q_dup_u64 (const uint64_t * a)
int16x4_t result;\
__asm__ ("mla %0.4h, %2.4h, %3.h[%4]"\
 : "=w"(result)  \
-: "0"(a_), "w"(b_), "w"(c_), "i"(d) \
+: "0"(a_), "w"(b_), "x"(c_), "i"(d) \
 : /* No clobbers */);   \
result;  \
  })
@@ -7174,7 +7174,7 @@ vld1q_dup_u64 (const uint64_t * a)
uint16x4_t result;   \
__asm__ ("mla %0.4h, %2.4h, %3.h[%4]"\
 : "=w"(result)  \
-: "0"(a_), "w"(b_), "w"(c_), "i"(d) \
+: "0"(a_), "w"(b_), "x"(c_), "i"(d) \
 : /* No clobbers */);   \
result;  \
  })
@@ -7202,7 +7202,7 @@ vld1q_dup_u64 (const uint64_t * a)
int16x4_t result;\
__asm__ ("mla %0.4h, %2.4h, %3.h[%4]"  

[AArch64] Fix types of second parameter to qtbl/qtbx intrinsics

2013-09-06 Thread James Greenhalgh

Hi,

The signed variants of the qtbl and qtbx intrinsics currently
take an int8x<8,16> for their control vector parameter.
This should be a uint8x<8,16> parameter.

Fixed as attached and checked against aarch64.exp on aarch64-none-elf
with no regressions.

Is this OK to commit?

I have some similar patches kicking around in my tree; these feel
obvious, but I'd like to check that others share that perspective
before I go committing anything!

Thanks,
James

---
gcc/

2013-09-06  James Greenhalgh  

* config/aarch64/arm_neon.h
(vqtbl<1,2,3,4>_s8): Fix control vector parameter type.
(vqtbx<1,2,3,4>_s8): Likewise.

gcc/testsuite/

2013-09-06  James Greenhalgh  

* gcc.target/aarch64/table-intrinsics.c
(qtbl_tests8_< ,2,3,4>): Fix control vector parameter type.
(qtb_tests8_< ,2,3,4>): Likewise.
(qtblq_tests8_< ,2,3,4>): Likewise.
(qtbxq_tests8_< ,2,3,4>): Likewise.
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index e20d34e..5864f2c 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -15973,7 +15973,7 @@ vqtbl1_p8 (poly8x16_t a, uint8x8_t b)
 }
 
 __extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
-vqtbl1_s8 (int8x16_t a, int8x8_t b)
+vqtbl1_s8 (int8x16_t a, uint8x8_t b)
 {
   int8x8_t result;
   __asm__ ("tbl %0.8b, {%1.16b}, %2.8b"
@@ -16006,7 +16006,7 @@ vqtbl1q_p8 (poly8x16_t a, uint8x16_t b)
 }
 
 __extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
-vqtbl1q_s8 (int8x16_t a, int8x16_t b)
+vqtbl1q_s8 (int8x16_t a, uint8x16_t b)
 {
   int8x16_t result;
   __asm__ ("tbl %0.16b, {%1.16b}, %2.16b"
@@ -16028,7 +16028,7 @@ vqtbl1q_u8 (uint8x16_t a, uint8x16_t b)
 }
 
 __extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
-vqtbl2_s8 (int8x16x2_t tab, int8x8_t idx)
+vqtbl2_s8 (int8x16x2_t tab, uint8x8_t idx)
 {
   int8x8_t result;
   __asm__ ("ld1 {v16.16b, v17.16b}, %1\n\t"
@@ -16064,7 +16064,7 @@ vqtbl2_p8 (poly8x16x2_t tab, uint8x8_t idx)
 }
 
 __extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
-vqtbl2q_s8 (int8x16x2_t tab, int8x16_t idx)
+vqtbl2q_s8 (int8x16x2_t tab, uint8x16_t idx)
 {
   int8x16_t result;
   __asm__ ("ld1 {v16.16b, v17.16b}, %1\n\t"
@@ -16100,7 +16100,7 @@ vqtbl2q_p8 (poly8x16x2_t tab, uint8x16_t idx)
 }
 
 __extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
-vqtbl3_s8 (int8x16x3_t tab, int8x8_t idx)
+vqtbl3_s8 (int8x16x3_t tab, uint8x8_t idx)
 {
   int8x8_t result;
   __asm__ ("ld1 {v16.16b - v18.16b}, %1\n\t"
@@ -16136,7 +16136,7 @@ vqtbl3_p8 (poly8x16x3_t tab, uint8x8_t idx)
 }
 
 __extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
-vqtbl3q_s8 (int8x16x3_t tab, int8x16_t idx)
+vqtbl3q_s8 (int8x16x3_t tab, uint8x16_t idx)
 {
   int8x16_t result;
   __asm__ ("ld1 {v16.16b - v18.16b}, %1\n\t"
@@ -16172,7 +16172,7 @@ vqtbl3q_p8 (poly8x16x3_t tab, uint8x16_t idx)
 }
 
 __extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
-vqtbl4_s8 (int8x16x4_t tab, int8x8_t idx)
+vqtbl4_s8 (int8x16x4_t tab, uint8x8_t idx)
 {
   int8x8_t result;
   __asm__ ("ld1 {v16.16b - v19.16b}, %1\n\t"
@@ -16209,7 +16209,7 @@ vqtbl4_p8 (poly8x16x4_t tab, uint8x8_t idx)
 
 
 __extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
-vqtbl4q_s8 (int8x16x4_t tab, int8x16_t idx)
+vqtbl4q_s8 (int8x16x4_t tab, uint8x16_t idx)
 {
   int8x16_t result;
   __asm__ ("ld1 {v16.16b - v19.16b}, %1\n\t"
@@ -16246,7 +16246,7 @@ vqtbl4q_p8 (poly8x16x4_t tab, uint8x16_t idx)
 
 
 __extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
-vqtbx1_s8 (int8x8_t r, int8x16_t tab, int8x8_t idx)
+vqtbx1_s8 (int8x8_t r, int8x16_t tab, uint8x8_t idx)
 {
   int8x8_t result = r;
   __asm__ ("tbx %0.8b,{%1.16b},%2.8b"
@@ -16279,7 +16279,7 @@ vqtbx1_p8 (poly8x8_t r, poly8x16_t tab, uint8x8_t idx)
 }
 
 __extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
-vqtbx1q_s8 (int8x16_t r, int8x16_t tab, int8x16_t idx)
+vqtbx1q_s8 (int8x16_t r, int8x16_t tab, uint8x16_t idx)
 {
   int8x16_t result = r;
   __asm__ ("tbx %0.16b,{%1.16b},%2.16b"
@@ -16312,7 +16312,7 @@ vqtbx1q_p8 (poly8x16_t r, poly8x16_t tab, uint8x16_t idx)
 }
 
 __extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
-vqtbx2_s8 (int8x8_t r, int8x16x2_t tab, int8x8_t idx)
+vqtbx2_s8 (int8x8_t r, int8x16x2_t tab, uint8x8_t idx)
 {
   int8x8_t result = r;
   __asm__ ("ld1 {v16.16b, v17.16b}, %1\n\t"
@@ -16349,7 +16349,7 @@ vqtbx2_p8 (poly8x8_t r, poly8x16x2_t tab, uint8x8_t idx)
 
 
 __extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
-vqtbx2q_s8 (int8x16_t r, int8x16x2_t tab, int8x16_t idx)
+vqtbx2q_s8 (int8x16_t r, int8x16x2_t tab, uin

[ARM,AARCH64] Insn type reclassification. Split f_cvt type.

2013-09-06 Thread James Greenhalgh

This patch splits the f_cvt attribute to:

 * f_cvt conversions between float representations.
 * f_cvti2f conversions from int to float.
 * f_cvtf2i conversions from float to int.

Then we update the pipeline descriptions to reflect this change.
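
As a rough illustration (not part of the patch), the three kinds of
conversion and the AArch64 instructions they map to:

  double f_to_d (float x) { return (double) x; }  /* fcvt   -> f_cvt     */
  int    f_to_i (float x) { return (int) x; }     /* fcvtzs -> f_cvtf2i  */
  float  i_to_f (int x)   { return (float) x; }   /* scvtf  -> f_cvti2f  */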

Regression tested for aarch64-none-elf and arm-none-eabi and sanity
checked. Bootstrapped in series with other type splitting patches.

OK?

Thanks,
James

---
2013-09-06  James Greenhalgh  

* config/arm/types.md
(type): Split f_cvt as f_cvt, f_cvtf2i, f_cvti2f.
* config/aarch64/aarch64.md
(l2): Update with
new attributes.
(fix_trunc2): Likewise.
(fixuns_trunc2): Likewise.
(float2): Likewise.
* config/arm/vfp.md
(*truncsisf2_vfp): Update with new attributes.
(*truncsidf2_vfp): Likewise.
(fixuns_truncsfsi2): Likewise.
(fixuns_truncdfsi2): Likewise.
(*floatsisf2_vfp): Likewise.
(*floatsidf2_vfp): Likewise.
(floatunssisf2): Likewise.
(floatunssidf2): Likewise.
(*combine_vcvt_f32_): Likewise.
(*combine_vcvt_f64_): Likewise.
* config/arm/arm1020e.md: Update with new attributes.
* config/arm/cortex-a15-neon.md: Update with new attributes.
* config/arm/cortex-a5.md: Update with new attributes.
* config/arm/cortex-a53.md: Update with new attributes.
* config/arm/cortex-a7.md: Update with new attributes.
* config/arm/cortex-a8-neon.md: Update with new attributes.
* config/arm/cortex-a9.md: Update with new attributes.
* config/arm/cortex-m4-fpu.md: Update with new attributes.
* config/arm/cortex-r4f.md: Update with new attributes.
* config/arm/marvell-pj4.md: Update with new attributes.
* config/arm/vfp11.md: Update with new attributes.
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 4dfd2ab83d00601dc8192ad47fec2c1e404d1264..6a4a975bb89c48311659db0091c76266d29cdba2 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -3685,7 +3685,7 @@ (define_insn "l<
   "TARGET_FLOAT"
   "fcvt\\t%0, %1"
   [(set_attr "v8type" "fcvtf2i")
-   (set_attr "type" "f_cvt")
+   (set_attr "type" "f_cvtf2i")
(set_attr "mode" "")
(set_attr "mode2" "")]
 )
@@ -3785,7 +3785,7 @@ (define_insn "fix_trunc0, %1"
   [(set_attr "v8type" "fcvtf2i")
-   (set_attr "type" "f_cvt")
+   (set_attr "type" "f_cvtf2i")
(set_attr "mode" "")
(set_attr "mode2" "")]
 )
@@ -3796,7 +3796,7 @@ (define_insn "fixuns_trunc0, %1"
   [(set_attr "v8type" "fcvtf2i")
-   (set_attr "type" "f_cvt")
+   (set_attr "type" "f_cvtf2i")
(set_attr "mode" "")
(set_attr "mode2" "")]
 )
@@ -3807,7 +3807,7 @@ (define_insn "float2
   "TARGET_FLOAT"
   "scvtf\\t%0, %1"
   [(set_attr "v8type" "fcvti2f")
-   (set_attr "type" "f_cvt")
+   (set_attr "type" "f_cvti2f")
(set_attr "mode" "")
(set_attr "mode2" "")]
 )
diff --git a/gcc/config/arm/arm1020e.md b/gcc/config/arm/arm1020e.md
index 615c6a5b16de647cbd8c0fa947f8b763a1353ee3..e16e862c1f49b36f75ba1faf20c2095fb9aeacdf 100644
--- a/gcc/config/arm/arm1020e.md
+++ b/gcc/config/arm/arm1020e.md
@@ -289,7 +289,7 @@ (define_insn_reservation "v10_farith" 5
 
 (define_insn_reservation "v10_cvt" 5
  (and (eq_attr "vfp10" "yes")
-  (eq_attr "type" "f_cvt"))
+  (eq_attr "type" "f_cvt,f_cvti2f,f_cvtf2i"))
  "1020a_e+v10_fmac")
 
 (define_insn_reservation "v10_fmul" 6
diff --git a/gcc/config/arm/cortex-a15-neon.md b/gcc/config/arm/cortex-a15-neon.md
index f1cac9e1af88bd5e3f0d87ff50c44376ad82d441..b5d14e7f7f9c3965e02e0d6e0edf0044df341812 100644
--- a/gcc/config/arm/cortex-a15-neon.md
+++ b/gcc/config/arm/cortex-a15-neon.md
@@ -471,7 +471,7 @@ (define_insn_reservation "cortex_a15_vfp
 
 (define_insn_reservation "cortex_a15_vfp_cvt" 6
   (and (eq_attr "tune" "cortexa15")
-   (eq_attr "type" "f_cvt"))
+   (eq_attr "type" "f_cvt,f_cvtf2i,f_cvti2f"))
   "ca15_issue1,ca15_cx_vfp")
 
 (define_insn_reservation "cortex_a15_vfp_cmpd" 8
diff --git a/gcc/config/arm/cortex-a5.md b/gcc/config/arm/cortex-a5.md
index 8930baf8daff5be2d2872324cd41fd5a1cd03778..54c8c420324a155523bc961917c475c5aeb86a96 100644
--- a/gcc/config/arm/cortex-a5.md
+++ b/gcc/config/arm/cortex-a5.md
@@ -168,7 +168,8 @@ (define_insn_reservation 

[Patch ARM AARCH64] Split "type" attributes: fdiv

2013-09-06 Thread James Greenhalgh

Hi,

The type attributes "fdivs,fdivd" should be split as:

fdivs -> fsqrts, fdivs
fdivd -> fsqrtd, fdivd

Do this and update pipelines as needed.
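
For illustration (not part of the patch), the two kinds of operation
that now get distinct types:

  float div_f (float a, float b) { return a / b; }        /* fdiv  -> fdivs  */
  /* With -fno-math-errno this should be a single fsqrt.  */
  float sqrt_f (float a) { return __builtin_sqrtf (a); }  /* fsqrt -> fsqrts */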

Regression tested on aarch64-none-elf and arm-none-eabi and
bootstrapped in series with other type splitting patches.

OK?

Thanks,
James

---
2013-09-06  James Greenhalgh  

* config/arm/types.md: Split fdiv as fsqrt, fdiv.
* config/arm/arm.md (core_cycles): Remove fdiv.
* config/arm/vfp.md:
(*sqrtsf2_vfp): Update for attribute changes.
(*sqrtdf2_vfp): Likewise.
* config/aarch64/aarch64.md:
(sqrt2): Update for attribute changes.
* config/arm/arm1020e.md: Update with new attributes.
* config/arm/cortex-a15-neon.md: Update with new attributes.
* config/arm/cortex-a5.md: Update with new attributes.
* config/arm/cortex-a53.md: Update with new attributes.
* config/arm/cortex-a7.md: Update with new attributes.
* config/arm/cortex-a8-neon.md: Update with new attributes.
* config/arm/cortex-a9.md: Update with new attributes.
* config/arm/cortex-m4-fpu.md: Update with new attributes.
* config/arm/cortex-r4f.md: Update with new attributes.
* config/arm/marvell-pj4.md: Update with new attributes.
* config/arm/vfp11.md: Update with new attributes.
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 6a4a975bb89c48311659db0091c76266d29cdba2..ded37efb4c86130af8dd82db66d50cc227bfeff0 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -3903,7 +3903,7 @@ (define_insn "sqrt2"
   "TARGET_FLOAT"
   "fsqrt\\t%0, %1"
   [(set_attr "v8type" "fsqrt")
-   (set_attr "type" "fdiv")
+   (set_attr "type" "fsqrt")
(set_attr "mode" "")]
 )
 
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 5ed8ee7dc6293bf93869545bef4cd3f60966908b..6c0fbf44288c9f6e077fe2d9836cd5c1e2042a0a 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -335,7 +335,6 @@ (define_attr "core_cycles" "single,multi
 alus_shift_imm, alus_shift_reg, bfm, csel, rev, logic_imm, logic_reg,\
 logic_shift_imm, logic_shift_reg, logics_imm, logics_reg,\
 logics_shift_imm, logics_shift_reg, extend, shift_imm, float, fcsel,\
-fdivd, fdivs,\
 wmmx_wor, wmmx_wxor, wmmx_wand, wmmx_wandn, wmmx_wmov, wmmx_tmcrr,\
 wmmx_tmrrc, wmmx_wldr, wmmx_wstr, wmmx_tmcr, wmmx_tmrc, wmmx_wadd,\
 wmmx_wsub, wmmx_wmul, wmmx_wmac, wmmx_wavg2, wmmx_tinsr, wmmx_textrm,\
diff --git a/gcc/config/arm/arm1020e.md b/gcc/config/arm/arm1020e.md
index e16e862c1f49b36f75ba1faf20c2095fb9aeacdf..8cf0890d9300527962b14f60e08c190155616425 100644
--- a/gcc/config/arm/arm1020e.md
+++ b/gcc/config/arm/arm1020e.md
@@ -299,12 +299,12 @@ (define_insn_reservation "v10_fmul" 6
 
 (define_insn_reservation "v10_fdivs" 18
  (and (eq_attr "vfp10" "yes")
-  (eq_attr "type" "fdivs"))
+  (eq_attr "type" "fdivs, fsqrts"))
  "1020a_e+v10_ds*14")
 
 (define_insn_reservation "v10_fdivd" 32
  (and (eq_attr "vfp10" "yes")
-  (eq_attr "type" "fdivd"))
+  (eq_attr "type" "fdivd, fsqrtd"))
  "1020a_e+v10_fmac+v10_ds*28")
 
 (define_insn_reservation "v10_floads" 4
diff --git a/gcc/config/arm/cortex-a15-neon.md b/gcc/config/arm/cortex-a15-neon.md
index b5d14e7f7f9c3965e02e0d6e0edf0044df341812..057507a762ab546e37b4a32a9771b4098a693d55 100644
--- a/gcc/config/arm/cortex-a15-neon.md
+++ b/gcc/config/arm/cortex-a15-neon.md
@@ -501,12 +501,12 @@ (define_insn_reservation "cortex_a15_vfp
 
 (define_insn_reservation "cortex_a15_vfp_divs" 10
   (and (eq_attr "tune" "cortexa15")
-   (eq_attr "type" "fdivs"))
+   (eq_attr "type" "fdivs, fsqrts"))
   "ca15_issue1,ca15_cx_ik")
 
 (define_insn_reservation "cortex_a15_vfp_divd" 18
   (and (eq_attr "tune" "cortexa15")
-   (eq_attr "type" "fdivd"))
+   (eq_attr "type" "fdivd, fsqrtd"))
   "ca15_issue1,ca15_cx_ik")
 
 ;; Define bypasses.
diff --git a/gcc/config/arm/cortex-a5.md b/gcc/config/arm/cortex-a5.md
index 54c8c420324a155523bc961917c475c5aeb86a96..03d3cc99106f4e8875b649b06dbd1341f18a5f55 100644
--- a/gcc/config/arm/cortex-a5.md
+++ b/gcc/config/arm/cortex-a5.md
@@ -233,14 +233,14 @@ (define_insn_reservation "cortex_a5_fpma
 
 (define_insn_reservation "cortex_a5_fdivs" 14
   (and (eq_attr "tune" "cortexa5")
-   (eq_attr "type" "fdivs"))
+   (eq_attr "type" "fdivs, fsqrts"))
   "cortex_a5_ex1, cortex_a5_

[Patch AArch64] Fix types for some multiply instructions.

2013-09-06 Thread James Greenhalgh

Hi,

We don't really need to split the types on these
instructions. The ARM backend already has suitable descriptions
of things like mla and smlal. Use them.
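
For reference, the sort of code these patterns cover (sketch, not part
of the patch):

  int mla (int a, int b, int c)
  {
    return a + b * c;                 /* madd  -> type "mla"   */
  }

  long long mlal (long long acc, int b, int c)
  {
    return acc + (long long) b * c;   /* smaddl -> type "mlal" */
  }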

Regression tested on aarch64-none-elf with no regressions.

OK?

Thanks,
James
---
2013-09-06  James Greenhalgh  

* config/aarch64/aarch64.md
(*madd): Fix type attribute.
(*maddsi_uxtw): Likewise.
(*msub): Likewise.
(*msubsi_uxtw): Likewise.
(maddsidi4): Likewise.
(msubsidi4): Likewise.
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index ded37efb4c86130af8dd82db66d50cc227bfeff0..e28764da5dd608259098d3150783e6eacd09be27 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -2281,7 +2281,7 @@ (define_insn "*madd"
   ""
   "madd\\t%0, %1, %2, %3"
   [(set_attr "v8type" "madd")
-   (set_attr "type" "mul")
+   (set_attr "type" "mla")
(set_attr "mode" "")]
 )
 
@@ -2295,7 +2295,7 @@ (define_insn "*maddsi_uxtw"
   ""
   "madd\\t%w0, %w1, %w2, %w3"
   [(set_attr "v8type" "madd")
-   (set_attr "type" "mul")
+   (set_attr "type" "mla")
(set_attr "mode" "SI")]
 )
 
@@ -2308,7 +2308,7 @@ (define_insn "*msub"
   ""
   "msub\\t%0, %1, %2, %3"
   [(set_attr "v8type" "madd")
-   (set_attr "type" "mul")
+   (set_attr "type" "mla")
(set_attr "mode" "")]
 )
 
@@ -2323,7 +2323,7 @@ (define_insn "*msubsi_uxtw"
   ""
   "msub\\t%w0, %w1, %w2, %w3"
   [(set_attr "v8type" "madd")
-   (set_attr "type" "mul")
+   (set_attr "type" "mla")
(set_attr "mode" "SI")]
 )
 
@@ -2373,7 +2373,7 @@ (define_insn "maddsidi4"
   ""
   "maddl\\t%0, %w1, %w2, %3"
   [(set_attr "v8type" "maddl")
-   (set_attr "type" "mul")
+   (set_attr "type" "mlal")
(set_attr "mode" "DI")]
 )
 
@@ -2387,7 +2387,7 @@ (define_insn "msubsidi4"
   ""
   "msubl\\t%0, %w1, %w2, %3"
   [(set_attr "v8type" "maddl")
-   (set_attr "type" "mul")
+   (set_attr "type" "mlal")
(set_attr "mode" "DI")]
 )
 

[AArch64, ARM] Rename the FCPYS type to FMOV

2013-09-06 Thread James Greenhalgh

Hi,

This patch updates the AArch64 backend such that floating point
moves are correctly categorized with type "FMOV".

Then in the ARM backend we rename "FCPYS" to "FMOV" everywhere
where it is appropriate to do so.

Regression tested on aarch64-none-elf and arm-none-eabi with no
regressions.

OK?

Thanks,
James
---
gcc/

2013-09-06  James Greenhalgh  

* config/arm/types.md (type): Rename fcpys to fmov.
* config/arm/vfp.md
(*arm_movsi_vfp): Rename type fcpys as fmov.
(*thumb2_movsi_vfp): Likewise
(*movhf_vfp_neon): Likewise
(*movhf_vfp): Likewise
(*movsf_vfp): Likewise
(*thumb2_movsf_vfp): Likewise
(*movsfcc_vfp): Likewise
(*thumb2_movsfcc_vfp): Likewise
* config/aarch64/aarch64-simd.md
(move_lo_quad_): Replace type mov_reg with fmovs.
* config/aarch64/aarch64.md
(*movsi_aarch64): Replace type mov_reg with fmovs.
(*movdi_aarch64): Likewise
(*movsf_aarch64): Likewise
(*movdf_aarch64): Likewise
* config/arm/arm.c
(cortexa7_older_only): Rename TYPE_FCPYS to TYPE_FMOV.
* config/arm/iwmmxt.md
(*iwmmxt_movsi_insn): Rename type fcpys as fmov.
* config/arm/arm1020e.md: Update with new attributes.
* config/arm/cortex-a15-neon.md: Update with new attributes.
* config/arm/cortex-a5.md: Update with new attributes.
* config/arm/cortex-a53.md: Update with new attributes.
* config/arm/cortex-a7.md: Update with new attributes.
* config/arm/cortex-a8-neon.md: Update with new attributes.
* config/arm/cortex-a9.md: Update with new attributes.
* config/arm/cortex-m4-fpu.md: Update with new attributes.
* config/arm/cortex-r4f.md: Update with new attributes.
* config/arm/marvell-pj4.md: Update with new attributes.
* config/arm/vfp11.md: Update with new attributes.
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index c085fb9c49958c5f402a28c0b39fe45ec1aadbc7..882fe4a19368d6ef6f9ef862dffce6d6307c5ac3 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1052,7 +1052,7 @@ (define_insn "move_lo_quad_"
fmov\\t%d0, %1
dup\\t%d0, %1"
   [(set_attr "v8type" "*,fmov,*")
-   (set_attr "type" "*,mov_reg,*")
+   (set_attr "type" "*,fmov,*")
(set_attr "simd_type" "simd_dup,*,simd_dup")
(set_attr "simd_mode" "")
(set_attr "simd" "yes,*,yes")
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index e28764da5dd608259098d3150783e6eacd09be27..db6aa1d3fa15e17095ba26a64e020d098e9fa6c0 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -831,7 +831,7 @@ (define_insn "*movsi_aarch64"
fmov\\t%s0, %s1"
   [(set_attr "v8type" "move,move,move,alu,load1,load1,store1,store1,adr,adr,fmov,fmov,fmov")
(set_attr "type" "mov_reg,mov_reg,mov_reg,mov_imm,load1,load1,store1,store1,\
- adr,adr,mov_reg,mov_reg,mov_reg")
+ adr,adr,fmov,fmov,fmov")
(set_attr "mode" "SI")
(set_attr "fp" "*,*,*,*,*,yes,*,yes,*,*,yes,yes,yes")]
 )
@@ -858,7 +858,7 @@ (define_insn "*movdi_aarch64"
movi\\t%d0, %1"
   [(set_attr "v8type" "move,move,move,alu,load1,load1,store1,store1,adr,adr,fmov,fmov,fmov,fmov")
(set_attr "type" "mov_reg,mov_reg,mov_reg,mov_imm,load1,load1,store1,store1,\
- adr,adr,mov_reg,mov_reg,mov_reg,mov_reg")
+ adr,adr,fmov,fmov,fmov,fmov")
(set_attr "mode" "DI")
(set_attr "fp" "*,*,*,*,*,yes,*,yes,*,*,yes,yes,yes,*")
(set_attr "simd" "*,*,*,*,*,*,*,*,*,*,*,*,*,yes")]
@@ -961,8 +961,8 @@ (define_insn "*movsf_aarch64"
   [(set_attr "v8type" "fmovi2f,fmovf2i,\
 		   fmov,fconst,fpsimd_load,\
 		   fpsimd_store,fpsimd_load,fpsimd_store,fmov")
-   (set_attr "type" "f_mcr,f_mrc,mov_reg,fconsts,\
- f_loads,f_stores,f_loads,f_stores,mov_reg")
+   (set_attr "type" "f_mcr,f_mrc,fmov,fconsts,\
+ f_loads,f_stores,f_loads,f_stores,fmov")
(set_attr "mode" "SF")]
 )
 
@@ -984,7 +984,7 @@ (define_insn "*movdf_aarch64"
   [(set_attr "v8type" "fmovi2f,fmovf2i,\
 		   fmov,fconst,fpsimd_load,\
 		   fpsimd_store,fpsimd_load,fpsimd_store,move")
-   (set_attr "type" "f_mcr,f_mrc,mov_reg,fconstd,\
+   (set_attr "type" "f_mcr,f_mrc,fmov,fconstd,\

[AArch64] Use "multiple" for type, where more than one instruction is used for a move

2013-09-06 Thread James Greenhalgh

Hi,

We could introduce a whole new type for insns which generate two moves,
but we have already introduced a "multiple" classification for
types in the ARM backend, so use that in place of "mov_reg" where
appropriate.

Regression tested on aarch64-none-elf and arm-none-eabi with no
regressions.

OK?

Thanks,
James

---
gcc/

2013-09-06  James Greenhalgh  

* config/aarch64/aarch64.md
(*movti_aarch64): Use "multiple" for type where v8type is "move2".
(*movtf_aarch64): Likewise.
* config/arm/arm.md
(thumb1_movdi_insn): Use "multiple" for type where more than one
instruction is used for a move.
(*arm32_movhf): Likewise.
(*thumb_movdf_insn): Likewise.
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index db6aa1d3fa15e17095ba26a64e020d098e9fa6c0..96705862de6b53322828fd60df15207af4b2ed61 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -906,7 +906,7 @@ (define_insn "*movti_aarch64"
str\\t%q1, %0"
   [(set_attr "v8type" "move2,fmovi2f,fmovf2i,*, \
 		   load2,store2,store2,fpsimd_load,fpsimd_store")
-   (set_attr "type" "mov_reg,f_mcr,f_mrc,*, \
+   (set_attr "type" "multiple,f_mcr,f_mrc,*, \
 		 load2,store2,store2,f_loadd,f_stored")
(set_attr "simd_type" "*,*,*,simd_move,*,*,*,*,*")
(set_attr "mode" "DI,DI,DI,TI,DI,DI,DI,TI,TI")
@@ -1024,7 +1024,7 @@ (define_insn "*movtf_aarch64"
ldp\\t%0, %H0, %1
stp\\t%1, %H1, %0"
   [(set_attr "v8type" "logic,move2,fmovi2f,fmovf2i,fconst,fconst,fpsimd_load,fpsimd_store,fpsimd_load2,fpsimd_store2")
-   (set_attr "type" "logic_reg,mov_reg,f_mcr,f_mrc,fconstd,fconstd,\
+   (set_attr "type" "logic_reg,multiple,f_mcr,f_mrc,fconstd,fconstd,\
  f_loadd,f_stored,f_loadd,f_stored")
(set_attr "mode" "DF,DF,DF,DF,DF,DF,TF,TF,DF,DF")
(set_attr "length" "4,8,8,8,4,4,4,4,4,4")
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 6c0fbf44288c9f6e077fe2d9836cd5c1e2042a0a..fd0b1cbdccd23ad4d18be50417c7532b29840b91 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -6141,7 +6141,7 @@ (define_insn "*thumb1_movdi_insn"
 }
   }"
   [(set_attr "length" "4,4,6,2,2,6,4,4")
-   (set_attr "type" "multiple,mov_reg,multiple,load2,store2,load2,store2,mov_reg")
+   (set_attr "type" "multiple,multiple,multiple,load2,store2,load2,store2,multiple")
(set_attr "pool_range" "*,*,*,*,*,1018,*,*")]
 )
 
@@ -7221,7 +7221,7 @@ (define_insn "*arm32_movhf"
 }
   "
   [(set_attr "conds" "unconditional")
-   (set_attr "type" "load1,store1,mov_reg,mov_reg")
+   (set_attr "type" "load1,store1,mov_reg,multiple")
(set_attr "length" "4,4,4,8")
(set_attr "predicable" "yes")]
 )
@@ -7466,7 +7466,7 @@ (define_insn "*thumb_movdf_insn"
 }
   "
   [(set_attr "length" "4,2,2,6,4,4")
-   (set_attr "type" "multiple,load2,store2,load2,store2,mov_reg")
+   (set_attr "type" "multiple,load2,store2,load2,store2,multiple")
(set_attr "pool_range" "*,*,*,1018,*,*")]
 )
 

[AArch64, ARM] Introduce "mrs" type attribute.

2013-09-06 Thread James Greenhalgh

Hi,

This patch adds an "mrs" type to be used to categorize instructions
which read or write from a special/system/co-processor register.

Then we add this type to all the pipeline descriptions. This probably
ends up as a miscategorization in most cases as we put "mrs" in the
same category as "multiple" in the pipelines. This will give the most
consistent behaviour with what came before.
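
As a concrete example of an "mrs" consumer, a sketch assuming a
local-exec style TLS access (illustrative only):

  static __thread int counter;

  int
  bump (void)
  {
    /* The TLS access starts with mrs xN, tpidr_el0, i.e. the
       aarch64_load_tp_hard pattern below.  */
    return ++counter;
  }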

Regression tested on aarch64-none-elf and arm-none-eabi with no
regressions.

OK?

Thanks,
James

---
gcc/

2013-09-06  James Greenhalgh  

* config/arm/types.md (type): Add "mrs" type.
* config/aarch64/aarch64.md
(aarch64_load_tp_hard): Make type "mrs".
* config/arm/arm.md
(load_tp_hard): Make type "mrs".
* config/arm/cortex-a15.md: Update with new attributes.
* config/arm/cortex-a5.md: Update with new attributes.
* config/arm/cortex-a53.md: Update with new attributes.
* config/arm/cortex-a7.md: Update with new attributes.
* config/arm/cortex-a8.md: Update with new attributes.
* config/arm/cortex-a9.md: Update with new attributes.
* config/arm/cortex-m4.md: Update with new attributes.
* config/arm/cortex-r4.md: Update with new attributes.
* config/arm/fa526.md: Update with new attributes.
* config/arm/fa606te.md: Update with new attributes.
* config/arm/fa626te.md: Update with new attributes.
* config/arm/fa726te.md: Update with new attributes.
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 96705862de6b53322828fd60df15207af4b2ed61..5aa127bcb47912f1986007d4491b86e92c23 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -4134,7 +4134,7 @@ (define_insn "aarch64_load_tp_hard"
   ""
   "mrs\\t%0, tpidr_el0"
   [(set_attr "v8type" "mrs")
-   (set_attr "type" "mov_reg")
+   (set_attr "type" "mrs")
(set_attr "mode" "DI")]
 )
 
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index fd0b1cbdccd23ad4d18be50417c7532b29840b91..8a482b570ec039aad888d7d7d902b48f7e453abc 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -12453,7 +12453,7 @@ (define_insn "load_tp_hard"
   "TARGET_HARD_TP"
   "mrc%?\\tp15, 0, %0, c13, c0, 3\\t@ load_tp_hard"
   [(set_attr "predicable" "yes")
-   (set_attr "type" "mov_reg")]
+   (set_attr "type" "mrs")]
 )
 
 ;; Doesn't clobber R1-R3.  Must use r0 for the first operand.
diff --git a/gcc/config/arm/cortex-a15.md b/gcc/config/arm/cortex-a15.md
index 6b1559260246a11e6d74f7f467dbeae761d934ea..ccad62076089b5e095f472fdbf298ba7226ae4ec 100644
--- a/gcc/config/arm/cortex-a15.md
+++ b/gcc/config/arm/cortex-a15.md
@@ -68,7 +68,7 @@ (define_insn_reservation "cortex_a15_alu
 shift_imm,shift_reg,\
 mov_imm,mov_reg,\
 mvn_imm,mvn_reg,\
-multiple,no_insn"))
+mrs,multiple,no_insn"))
   "ca15_issue1,(ca15_sx1,ca15_sx1_alu)|(ca15_sx2,ca15_sx2_alu)")
 
 ;; ALU ops with immediate shift
diff --git a/gcc/config/arm/cortex-a5.md b/gcc/config/arm/cortex-a5.md
index fa3e9d59c91028214ca7aa1be2c6668b4af5e6d3..22e0a08f38e7620cef745d28c5373e2daf957f7d 100644
--- a/gcc/config/arm/cortex-a5.md
+++ b/gcc/config/arm/cortex-a5.md
@@ -64,7 +64,7 @@ (define_insn_reservation "cortex_a5_alu"
 adr,bfm,rev,\
 shift_imm,shift_reg,\
 mov_imm,mov_reg,mvn_imm,mvn_reg,\
-multiple,no_insn"))
+mrs,multiple,no_insn"))
   "cortex_a5_ex1")
 
 (define_insn_reservation "cortex_a5_alu_shift" 2
diff --git a/gcc/config/arm/cortex-a53.md b/gcc/config/arm/cortex-a53.md
index 33b5ca30150fc57e7a3c4886c01b9e8092fc3ffa..48d0d03853f147d2d7cc15c1208304617b9c1ec4 100644
--- a/gcc/config/arm/cortex-a53.md
+++ b/gcc/config/arm/cortex-a53.md
@@ -73,7 +73,7 @@ (define_insn_reservation "cortex_a53_alu
 adr,bfm,csel,rev,\
 shift_imm,shift_reg,\
 mov_imm,mov_reg,mvn_imm,mvn_reg,\
-multiple,no_insn"))
+mrs,multiple,no_insn"))
   "cortex_a53_slot_any")
 
 (define_insn_reservation "cortex_a53_alu_shift" 2
diff --git a/gcc/config/arm/cortex-a7.md b/gcc/config/arm/cortex-a7.md
index ba9da8046ebd4a886c425238ab28df5ab9d85a8a..a72a88d90af1c5491115ee84af47ec6d4f593535 100644
--- a/gcc/config/arm/cortex-a7.md
+++ b/gcc/config/arm/cortex-a7.md
@@ -110,7 +110,7 @@ (define_insn_reservation "cortex_a7_alu_
 log

[AArch64] Use neon__2 where appropriate as "type".

2013-09-06 Thread James Greenhalgh

Hi,

The final (!!!) patch in the series making types equivalent between
AArch64 and ARM backends deals with insns in the AArch64 backend
which generate ldp and stp. We could invent a new type for these and
add that type to all the pipeline descriptions, but I think the types
neon_ldm_2 and neon_stm_2 describe them adequately.
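
For reference, the kind of code that typically exercises the
load_pair/store_pair patterns touched here (illustrative sketch only):

  struct pair { double lo, hi; };

  void
  copy_pair (struct pair *dst, const struct pair *src)
  {
    /* Usually copied with one load pair and one store pair.  */
    *dst = *src;
  }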

Tested on aarch64-none-elf with no regressions.

OK?

Thanks,
James

---
gcc/

2013-09-06  James Greenhalgh  

* config/aarch64/aarch64.md
(*movtf_aarch64): Use neon_dm_2 as type where v8type
is fpsimd_2.
(load_pair): Likewise.
(store_pair): Likewise.
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 5aa127bcb47912f1986007d4491b86e92c23..f37f98f9994bb773785d8573a7efd1e625b5e23a 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -1025,7 +1025,7 @@ (define_insn "*movtf_aarch64"
stp\\t%1, %H1, %0"
   [(set_attr "v8type" "logic,move2,fmovi2f,fmovf2i,fconst,fconst,fpsimd_load,fpsimd_store,fpsimd_load2,fpsimd_store2")
(set_attr "type" "logic_reg,multiple,f_mcr,f_mrc,fconstd,fconstd,\
- f_loadd,f_stored,f_loadd,f_stored")
+ f_loadd,f_stored,neon_ldm_2,neon_stm_2")
(set_attr "mode" "DF,DF,DF,DF,DF,DF,TF,TF,DF,DF")
(set_attr "length" "4,8,8,8,4,4,4,4,4,4")
(set_attr "fp" "*,*,yes,yes,*,yes,yes,yes,*,*")
@@ -1090,7 +1090,7 @@ (define_insn "load_pair"
 			   GET_MODE_SIZE (mode)))"
   "ldp\\t%0, %2, %1"
   [(set_attr "v8type" "fpsimd_load2")
-   (set_attr "type" "f_load")
+   (set_attr "type" "neon_ldm_2")
(set_attr "mode" "")]
 )
 
@@ -1106,8 +1106,8 @@ (define_insn "store_pair"
 			   XEXP (operands[0], 0),
 			   GET_MODE_SIZE (mode)))"
   "stp\\t%1, %3, %0"
-  [(set_attr "v8type" "fpsimd_load2")
-   (set_attr "type" "f_load")
+  [(set_attr "v8type" "fpsimd_store2")
+   (set_attr "type" "neon_stm_2")
(set_attr "mode" "")]
 )
 

[AArch64] Fix parameters to vcvtx_high

2013-09-06 Thread James Greenhalgh

Hi,

vcvtx_high_f32_f64 should have two parameters, a float32x2 which
provides the lower half of the target vector, and a float64x2
which will be converted to the higher half of the target vector.

Fix thusly.
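
A usage sketch of the corrected signature (not part of the patch):

  float32x4_t
  narrow_high (float32x2_t lo, float64x2_t hi)
  {
    /* Lanes 0-1 come from lo; hi is narrowed (round-to-odd) into lanes 2-3.  */
    return vcvtx_high_f32_f64 (lo, hi);
  }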

Tested with aarch64.exp on aarch64-none-elf.

OK?

Thanks,
James

---
gcc/

2013-09-06  James Greenhalgh  

* config/aarch64/arm_neon.h
(vcvtx_high_f32_f64): Fix parameters.
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index 5864f2c..47b45f4 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -5756,12 +5756,12 @@ vcvtx_f32_f64 (float64x2_t a)
 }
 
 __extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
-vcvtx_high_f32_f64 (float64x2_t a)
+vcvtx_high_f32_f64 (float32x2_t a, float64x2_t b)
 {
   float32x4_t result;
   __asm__ ("fcvtxn2 %0.4s,%1.2d"
: "=w"(result)
-   : "w"(a)
+   : "w" (b), "0"(a)
: /* No clobbers */);
   return result;
 }

[AArch64] obvious - Fix parameter to vrsqrte_f64

2013-09-09 Thread James Greenhalgh

Hi,

vrsqrte_f64 is currently defined to take a float64x2_t, but it should
take a float64x1_t.
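
A usage sketch with the corrected type (illustrative only):

  float64x1_t
  approx_rsqrt (float64x1_t x)
  {
    /* A single frsqrte on a D register.  */
    return vrsqrte_f64 (x);
  }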

I've committed the attached, obvious fix as revision 202407.

James

---
2013-09-09  James Greenhalgh  

* config/aarch64/arm_neon.h (vrsqrte_f64): Fix parameter type.
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index ac94516..23b1116 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -12764,11 +12764,11 @@ vrsqrte_f32 (float32x2_t a)
   return result;
 }
 
-__extension__ static __inline float64x2_t __attribute__ ((__always_inline__))
-vrsqrte_f64 (float64x2_t a)
+__extension__ static __inline float64x1_t __attribute__ ((__always_inline__))
+vrsqrte_f64 (float64x1_t a)
 {
-  float64x2_t result;
-  __asm__ ("frsqrte %0.2d,%1.2d"
+  float64x1_t result;
+  __asm__ ("frsqrte %d0,%d1"
: "=w"(result)
: "w"(a)
: /* No clobbers */);

[AArch64] Prevent generic pipeline description from dominating other pipeline descriptions.

2013-09-10 Thread James Greenhalgh

Hi,

Looking at the way we handle the generic scheduler in AArch64, the order
of includes suggests that we will always match to the generic unit, even
when tuning for a non-generic pipeline.

This patch explicitly prevents that by defining a "generic_sched" attribute,
which is only true when the generic scheduler should be used. We could
probably get away with just including the generic scheduler last in the
list of pipeline descriptions, but this solution is more robust and
prevents us from erroneously using the generic scheduler units where we
would otherwise return "nothing".

Regression tested on aarch64-none-elf with no regressions.

OK?

Thanks,
James

---
2013-09-10  James Greenhalgh  

* config/aarch64/aarch64.md (generic_sched): New.
* config/aarch64/aarch64-generic.md (load): Make conditional
on generic_sched attribute.
(nonload): Likewise.
diff --git a/gcc/config/aarch64/aarch64-generic.md b/gcc/config/aarch64/aarch64-generic.md
index cbb75600389efe69f16dd30837ad02b2b254232e..12faac84348c72c44c1c144d268ea9751a0665ac 100644
--- a/gcc/config/aarch64/aarch64-generic.md
+++ b/gcc/config/aarch64/aarch64-generic.md
@@ -30,9 +30,11 @@ (define_attr "is_load" "yes,no"
 	(const_string "no")))
 
 (define_insn_reservation "load" 2
-  (eq_attr "is_load" "yes")
+  (and (eq_attr "generic_sched" "yes")
+   (eq_attr "is_load" "yes"))
   "core")
 
 (define_insn_reservation "nonload" 1
-  (eq_attr "is_load" "no")
+  (and (eq_attr "generic_sched" "yes")
+   (eq_attr "is_load" "no"))
   "core")
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index f37f98f9994bb773785d8573a7efd1e625b5e23a..b56254dbbd78aae9f135b1aeabcf7c43f0f5fd84 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -308,6 +308,14 @@ (define_attr "enabled" "no,yes"
 ;; Processor types.
 (include "aarch64-tune.md")
 
+;; True if the generic scheduling description should be used.
+
+(define_attr "generic_sched" "yes,no"
+  (const (if_then_else
+  (eq_attr "tune" "large,small,cortexa53")
+  (const_string "no")
+  (const_string "yes"
+
 ;; Scheduling
 (include "aarch64-generic.md")
 (include "large.md")

Re: [1/4] Using gen_int_mode instead of GEN_INT

2013-09-10 Thread James Greenhalgh

Hi,

This seems to have caused PR58373. The bug occurs where GEN_INT would
previously have been used to build a constant of vector mode.

These pop in a few places when building for AArch64, though I did
the debugging using gcc.target/aarch64/vect-fcm-eq-d.c

Here we could get into the situation where
simplify_unary_operation_1 would see (V2DI: NOT (NEG X))
and try to generate (V2DI: PLUS (X - 1)). In turn, we would reach the
call in plus_constant to gen_int_mode (-1, v2di), followed by
trunc_int_for_mode (-1, v2di) and this assert would trigger:

   /* You want to truncate to a _what_?  */
   gcc_assert (SCALAR_INT_MODE_P (mode));

In this fix I catch the case where gen_int_mode has been asked to
build a vector mode, and call trunc_int_for_mode on the inner mode
of that vector. A similar fix could sit in trunc_int_for_mode if
you would prefer.

Bootstrapped on x86_64-unknown-linux-gnu with no issues and regression
tested for aarch64-none-elf with the expected benefits and no regressions.

Thanks,
James

---
gcc/

2013-09-10  James Greenhalgh  

PR rtl-optimization/58383
* emit-rtl.c (gen_int_mode): Handle vector modes.
diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c
index 8a7b8a5..cf26bf1 100644
--- a/gcc/emit-rtl.c
+++ b/gcc/emit-rtl.c
@@ -417,7 +417,13 @@ gen_rtx_CONST_INT (enum machine_mode mode ATTRIBUTE_UNUSED, HOST_WIDE_INT arg)
 rtx
 gen_int_mode (HOST_WIDE_INT c, enum machine_mode mode)
 {
-  return GEN_INT (trunc_int_for_mode (c, mode));
+  HOST_WIDE_INT c1;
+  if (SCALAR_INT_MODE_P (mode))
+    c1 = trunc_int_for_mode (c, mode);
+  else
+    c1 = trunc_int_for_mode (c, GET_MODE_INNER (mode));
+
+  return GEN_INT (c1);
 }
 
 /* CONST_DOUBLEs might be created from pairs of integers, or from

Re: [1/4] Using gen_int_mode instead of GEN_INT

2013-09-10 Thread James Greenhalgh
On Tue, Sep 10, 2013 at 08:09:42PM +0100, Richard Sandiford wrote:
> Sorry for the breakage.  gen_int_mode and GEN_INT really are only for
> scalar integers though.  (So is plus_constant.)  Vector constants should
> be CONST_VECTORs rather than CONST_INTs.
> 
> I think the gcc.target/aarch64/vect-fcm-eq-d.c failure is from a latent
> bug in the way (neg (not ...)) and (not (neg ...)) are handled.
> Could you give the attached patch a go?

Thanks Richard, this patch fixes the test FAILs I was seeing.

Cheers,
James



[AArch64] Implement vmul_lane_<16,32,64> intrinsics in C

2013-09-13 Thread James Greenhalgh

Hi,

This patch converts the vmul_lane_<16,32,64> intrinsics
in arm_neon.h to a C implementation.

To support this, we add some patterns for the combiner to pick
up. We need a few patterns for this.

mul3_elt covers vmul_lane, vmulq_laneq variants, where the number
of lanes selected from matches those multiplied.

mul3_elt_ covers the vmul_laneq and vmulq_lane
variants, where the number of lanes selected from differs from
those multiplied.

mul3_elt_to_128df is needed as, when the input is a 64-bit scalar
value, there is no lane on which to vec_select so the previous
patterns would not match.

mul3_elt_to_64v2df is needed as, when the output is a 64-bit scalar
there is no need for a vec_duplicate before the multiply.
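
As a usage sketch, the kind of code the combiner should now turn into a
single multiply-by-element (illustrative only):

  float32x2_t
  scale_by_lane (float32x2_t v, float32x2_t s)
  {
    /* Expected: fmul v0.2s, v0.2s, v1.s[1].  */
    return vmul_lane_f32 (v, s, 1);
  }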

Regression tested on aarch64-none-elf with no regressions.

Thanks,
James Greenhalgh

---
gcc/

2013-09-13  James Greenhalgh  

* config/aarch64/aarch64-simd.md (aarch64_mul3_elt): New.
(aarch64_mul3_elt_): Likewise.
(aarch64_mul3_elt_to_128df): Likewise.
(aarch64_mul3_elt_to_64v2df): Likewise.
* config/aarch64/iterators.md (VEL): Also handle DFmode.
(VMUL): New.
(VMUL_CHANGE_NLANES): Likewise.
(h_con): Likewise.
(f): Likewise.
* config/aarch64/arm_neon.h
(vmul_lane_<16,32,64>): Convert to C implementation.

gcc/testsuite/

2013-09-13  James Greenhalgh  

* gcc.target/aarch64/mul_intrinsic_1.c: New.
* gcc.target/aarch64/fmul_intrinsic_1.c: Likewise.
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 9805197a22b084ea37425b692560949b5ff75e62..04d5794ffcae73a8b33844f3147e4315747deb69 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -582,6 +582,59 @@ (define_insn "mul3"
(set_attr "simd_mode" "")]
 )
 
+(define_insn "*aarch64_mul3_elt"
+ [(set (match_operand:VMUL 0 "register_operand" "=w")
+(mult:VMUL
+  (vec_duplicate:VMUL
+	  (vec_select:
+	(match_operand:VMUL 1 "register_operand" "")
+	(parallel [(match_operand:SI 2 "immediate_operand")])))
+  (match_operand:VMUL 3 "register_operand" "w")))]
+  "TARGET_SIMD"
+  "mul\\t%0., %3., %1.[%2]"
+  [(set_attr "simd_type" "simd_mul_elt")
+   (set_attr "simd_mode" "")]
+)
+
+(define_insn "*aarch64_mul3_elt_"
+  [(set (match_operand:VMUL_CHANGE_NLANES 0 "register_operand" "=w")
+ (mult:VMUL_CHANGE_NLANES
+   (vec_duplicate:VMUL_CHANGE_NLANES
+	  (vec_select:
+	(match_operand: 1 "register_operand" "")
+	(parallel [(match_operand:SI 2 "immediate_operand")])))
+  (match_operand:VMUL_CHANGE_NLANES 3 "register_operand" "w")))]
+  "TARGET_SIMD"
+  "mul\\t%0., %3., %1.[%2]"
+  [(set_attr "simd_type" "simd_mul_elt")
+   (set_attr "simd_mode" "")]
+)
+
+(define_insn "*aarch64_mul3_elt_to_128df"
+  [(set (match_operand:V2DF 0 "register_operand" "=w")
+ (mult:V2DF
+   (vec_duplicate:V2DF
+	 (match_operand:DF 2 "register_operand" "w"))
+  (match_operand:V2DF 1 "register_operand" "w")))]
+  "TARGET_SIMD"
+  "fmul\\t%0.2d, %1.2d, %2.d[0]"
+  [(set_attr "simd_type" "simd_fmul_elt")
+   (set_attr "simd_mode" "V2DF")]
+)
+
+(define_insn "*aarch64_mul3_elt_to_64v2df"
+  [(set (match_operand:DF 0 "register_operand" "=w")
+ (mult:DF
+   (vec_select:DF
+	 (match_operand:V2DF 1 "register_operand" "w")
+	 (parallel [(match_operand:SI 2 "immediate_operand")]))
+   (match_operand:DF 3 "register_operand" "w")))]
+  "TARGET_SIMD"
+  "fmul\\t%0.2d, %3.2d, %1.d[%2]"
+  [(set_attr "simd_type" "simd_fmul_elt")
+   (set_attr "simd_mode" "V2DF")]
+)
+
 (define_insn "neg2"
   [(set (match_operand:VDQ 0 "register_operand" "=w")
 	(neg:VDQ (match_operand:VDQ 1 "register_operand" "w")))]
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index 23b1116..6c9dd79 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -9501,136 +9501,6 @@ vmovq_n_u64 (uint64_t a)
   return result;
 }
 
-#define vmul_lane_f32(a, b, c)  \
-  __extension__ \
-({  \
-   float32x2_t b_ = (b);   

[AArch64] Implement vset_lane intrinsics in C

2013-09-13 Thread James Greenhalgh

Hi,

The vset_lane_<8,16,32,64> intrinsics are currently
written using assembler, but can be easily expressed
in C.

As I expect we will want to efficiently compose these intrinsics
I've added them as macros, just as was done with the vget_lane
intrinsics.
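
A minimal usage sketch (not part of the patch):

  int32x2_t
  set_lane_1 (int32x2_t v, int32_t x)
  {
    /* With the C implementation this should end up as a single ins.  */
    return vset_lane_s32 (x, v, 1);
  }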

Regression tested for aarch64-none-elf and a new testcase
added to ensure these intrinsics generate the expected
instruction.

OK?

Thanks,
James

---
gcc/

2013-09-13  James Greenhalgh  

* config/aarch64/arm_neon.h
(__aarch64_vset_lane_any): New.
(__aarch64_vset_lane_<8,16,32,64>): Likewise.
(vset_lane_<8,16,32,64>): Use new macros.

gcc/testsuite

2013-09-13  James Greenhalgh  

* gcc.target/aarch64/vect_set_lane_1.c: New.
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index cb58602..6335ddf 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -508,6 +508,58 @@ typedef struct poly16x8x4_t
 #define __aarch64_vgetq_lane_u64(__a, __b) \
   __aarch64_vget_lane_any (v2di, (uint64_t), (int64x2_t), __a, __b)
 
+/* __aarch64_vset_lane internal macros.  */
+#define __aarch64_vset_lane_any(__source, __v, __index) \
+  (__v[__index] = __source, __v)
+
+#define __aarch64_vset_lane_f32(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vset_lane_f64(__source, __v, __index) (__source)
+#define __aarch64_vset_lane_p8(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vset_lane_p16(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vset_lane_s8(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vset_lane_s16(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vset_lane_s32(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vset_lane_s64(__source, __v, __index) (__source)
+#define __aarch64_vset_lane_u8(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vset_lane_u16(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vset_lane_u32(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vset_lane_u64(__source, __v, __index) (__source)
+
+/* __aarch64_vset_laneq internal macros.  */
+#define __aarch64_vsetq_lane_f32(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vsetq_lane_f64(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vsetq_lane_p8(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vsetq_lane_p16(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vsetq_lane_s8(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vsetq_lane_s16(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vsetq_lane_s32(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vsetq_lane_s64(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vsetq_lane_u8(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vsetq_lane_u16(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vsetq_lane_u32(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+#define __aarch64_vsetq_lane_u64(__source, __v, __index) \
+   __aarch64_vset_lane_any (__source, __v, __index)
+
 /* __aarch64_vdup_lane internal macros.  */
 #define __aarch64_vdup_lane_any(__size, __q1, __q2, __a, __b) \
   vdup##__q1##_n_##__size (__aarch64_vget##__q2##_lane_##__size (__a, __b))
@@ -3969,6 +4021,154 @@ vreinterpretq_u32_p16 (poly16x8_t __a)
   return (uint32x4_t) __builtin_aarch64_reinterpretv4siv8hi ((int16x8_t) __a);
 }
 
+/* vset_lane.  */
+
+__extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
+vset_lane_f32 (float32_t __a, float32x2_t __v, const int __index)
+{
+  return __aarch64_vset_lane_f32 (__a, __v, __index);
+}
+
+__extension__ static __inline float64x1_t __attribute__ ((__always_inline__))
+vset_lane_f64 (float64_t __a, float64x1_t __v, const int __index)
+{
+  return __aarch64_vset_lane_f64 (__a, __v, __index);
+}
+
+__extension__ static __inline poly8x8_t __attribute__ ((__always_inline__))
+vset_lane_p8 (poly8_t __a, poly8x8_t __v, const int __index)
+{
+  return __aarch64_vset_lane_p8 (__a, __v, __index);
+}
+
+__extension__ static __inline poly16x4_t __attribute__ ((__always_inline__))
+vset_lane_p16 (poly16_t __a, poly16x4_t __v, const int __index)
+{
+  return __aarch64_vset_lane

[AArch64] Implement vcopy intrinsics.

2013-09-13 Thread James Greenhalgh

Hi,

This patch adds intrinsics for vcopy_lane_<8,16,32,64>.

These are implemented in an optimal way using the vget_lane and vset_lane
intrinsics and a combine pattern.
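
The rough shape of the composition, for one element type (illustrative
sketch; the real arm_neon.h definitions cover all the types):

  #define vcopy_lane_s32(a, lane1, b, lane2) \
    vset_lane_s32 (vget_lane_s32 ((b), (lane2)), (a), (lane1))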

I've added a testcase and run a full regression run for aarch64-none-elf.

OK?

Thanks,
James

---
gcc/

2013-09-13  James Greenhalgh  

* config/aarch64/aarch64-simd.md
(*aarch64_simd_vec_copy_lane): New.
(*aarch64_simd_vec_copy_lane_): Likewise.
* config/aarch64/arm_neon.h
(vcopy_lane_<8,16,32,64>): Remove asm implementations.
(vcopy_lane_<8,16,32,64>): Implement optimally.

gcc/testsuite

2013-09-13  James Greenhalgh  

* gcc.target/aarch64/vect_copy_lane_1.c: New.
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index f13cd5b7cdbdff95bbc378a76a6dd05de031487d..9703dd934a2f8335ffc5086e8a421db609fe0236 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -750,6 +750,54 @@ (define_insn "aarch64_simd_vec_set
(set_attr "simd_mode" "")]
 )
 
+(define_insn_and_split "*aarch64_simd_vec_copy_lane"
+  [(set (match_operand:VALL 0 "register_operand" "=w")
+	(vec_merge:VALL
+	(vec_duplicate:VALL
+	  (vec_select:
+		(match_operand:VALL 3 "register_operand" "w")
+		(parallel
+		  [(match_operand:SI 4 "immediate_operand" "i")])))
+	(match_operand:VALL 1 "register_operand" "0")
+	(match_operand:SI 2 "immediate_operand" "i")))]
+  "TARGET_SIMD"
+  "ins\t%0.[%p2], %3.[%4]";
+  "reload_completed
+   && REGNO (operands[0]) == REGNO (operands[3])
+   && (exact_log2 (INTVAL (operands[2])) == INTVAL (operands[4]))"
+  [(const_int 0)]
+  {
+emit_note (NOTE_INSN_DELETED);
+DONE;
+  }
+  [(set_attr "simd_type" "simd_ins")
+   (set_attr "simd_mode" "")]
+)
+
+(define_insn_and_split "*aarch64_simd_vec_copy_lane_"
+  [(set (match_operand:VALL 0 "register_operand" "=w")
+	(vec_merge:VALL
+	(vec_duplicate:VALL
+	  (vec_select:
+		(match_operand: 3 "register_operand" "w")
+		(parallel
+		  [(match_operand:SI 4 "immediate_operand" "i")])))
+	(match_operand:VALL 1 "register_operand" "0")
+	(match_operand:SI 2 "immediate_operand" "i")))]
+  "TARGET_SIMD"
+  "ins\t%0.[%p2], %3.[%4]";
+  "reload_completed
+   && REGNO (operands[0]) == REGNO (operands[3])
+   && (exact_log2 (INTVAL (operands[2])) == INTVAL (operands[4]))"
+  [(const_int 0)]
+  {
+emit_note (NOTE_INSN_DELETED);
+DONE;
+  }
+  [(set_attr "simd_type" "simd_ins")
+   (set_attr "simd_mode" "")]
+)
+
 (define_insn "aarch64_simd_lshr"
  [(set (match_operand:VDQ 0 "register_operand" "=w")
(lshiftrt:VDQ (match_operand:VDQ 1 "register_operand" "w")
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index 6335ddf..64f8825 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -5538,162 +5538,6 @@ vcntq_u8 (uint8x16_t a)
   return result;
 }
 
-#define vcopyq_lane_f32(a, b, c, d) \
-  __extension__ \
-({  \
-   float32x4_t c_ = (c);\
-   float32x4_t a_ = (a);\
-   float32x4_t result;  \
-   __asm__ ("ins %0.s[%2], %3.s[%4]"\
-: "=w"(result)  \
-: "0"(a_), "i"(b), "w"(c_), "i"(d)  \
-: /* No clobbers */);   \
-   result;  \
- })
-
-#define vcopyq_lane_f64(a, b, c, d) \
-  __extension__ \
-({  \
-   float64x2_t c_ = (c);\
-   float64x2_t a_ = (a);\
-   float64x2_t result;  \
-   __asm__ ("ins %0.d[%2], %3.d[%4]"\
-: "=w"(result)  \
-: "0"(a_), "i"(b), "w&qu

Re: [AArch64] Implement vset_lane intrinsics in C

2013-09-13 Thread James Greenhalgh
On Fri, Sep 13, 2013 at 07:39:08PM +0100, Andrew Pinski wrote:
> I don't think this works for big-endian due to the way ARM decided the
> lanes don't match up with array entry there.

Hi Andrew,

Certainly for the testcase I've added in this patch there are no issues.

Vector indexing should work consistently between big and little endian
AArch64 targets. So,

  int32_t foo[4] = {0, 1, 2, 3};
  int32x4_t a = vld1q_s32 (foo);
  int b = a[1];
  return b;

Should return '1' whatever your endianness. Throwing together a quick
test case, that is the case for current trunk. Do you have a testcase
where this goes wrong?

Thanks,
James



Re: [AArch64] Fix parameters to vcvtx_high

2013-09-16 Thread James Greenhalgh
*ping*

Cheers,
James

On Fri, Sep 06, 2013 at 04:06:08PM +0100, James Greenhalgh wrote:
> 
> Hi,
> 
> vcvtx_high_f32_f64 should have two parameters, a float32x2 which
> provides the lower half of the target vector, and a float64x2
> which will be converted to the higher half of the target vector.
> 
> Fix thusly.
> 
> Tested with aarch64.exp on aarch64-none-elf.
> 
> OK?
> 
> Thanks,
> James
> 
> ---
> gcc/
> 
> 2013-09-06  James Greenhalgh  
> 
>   * config/aarch64/arm_neon.h
>   (vcvtx_high_f32_f64): Fix parameters.
> 

> diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
> index 5864f2c..47b45f4 100644
> --- a/gcc/config/aarch64/arm_neon.h
> +++ b/gcc/config/aarch64/arm_neon.h
> @@ -5756,12 +5756,12 @@ vcvtx_f32_f64 (float64x2_t a)
>  }
>  
>  __extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
> -vcvtx_high_f32_f64 (float64x2_t a)
> +vcvtx_high_f32_f64 (float32x2_t a, float64x2_t b)
>  {
>float32x4_t result;
>__asm__ ("fcvtxn2 %0.4s,%1.2d"
> : "=w"(result)
> -   : "w"(a)
> +   : "w" (b), "0"(a)
> : /* No clobbers */);
>return result;
>  }



Re: [AArch64] Implement vset_lane intrinsics in C

2013-09-16 Thread James Greenhalgh
On Fri, Sep 13, 2013 at 10:47:01PM +0100, Andrew Pinski wrote:
> On Fri, Sep 13, 2013 at 11:57 AM, James Greenhalgh
>  wrote:
> > Should return '1' whatever your endianness. Throwing together a quick
> > test case, that is the case for current trunk. Do you have a testcase
> > where this goes wrong?
> 
> I was not thinking of that but rather the definition of lanes in ARM64
> is different from the element numbering due to the memory ordering on big-endian.
> That is lane 0 is element 3 in big-endian.  Or is this only for
> aarch32 where the issue is located?
> 
> Thanks,
> Andrew Pinski

Well, AArch64 has the AArch32 style memory ordering for vectors,
which I think is different from what other big-endian architectures
use, but gives consistent behaviour between vector and array indexing.

So, take the easy case of a byte array

  uint8_t foo [8] = {0, 1, 2, 3, 4, 5, 6, 7}

We would expect both the big and little endian toolchains to lay
this out in memory as:

   0x0 ... 0x8
  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |

And element 0 would give us '0'. If we take the same array and load it
as a vector with ld1.b, both big and little-endian toolchains would load
it as:

   bit 128 ..   bit 64   bit 0
   lane 16   | lane 7 |   |  lane 0 |
  |. |7   | 6 | 5 | 4 | 3 | 2 | 1 |   0 |

So lane 0 is '0', we're OK so far!

For a short array:

  uint16_t foo [4] = {0x0a0b, 0x1a1b, 0x2a2b, 0x3a3b};

The little endian compiler would lay memory out as:

   0x0 ...0x8
  | 0b | 0a | 1b | 1a | 2b | 2a | 3b | 3a |

And the big endian compiler would lay out memory as:

   0x0 ...0x8
  | 0a | 0b | 1a | 1b | 2a | 2b | 3a | 3b |

In both cases, element 0 is '0x0a0b'. If we load this array as a
vector with ld1.h both big and little-endian compilers will load
the vector as:

   bit 128 ..  bit 64bit 0
   lane 16   | lane 3  |   | lane 0  |
  |. | 3b | 3a | 2b | 2a | 1b | 1a | 0b | 0a |

And lane 0 is '0x0a0b' So we are OK again!

Lanes and elements should match under our model. Which I don't think
is true of other architectures, where I think the whole vector object
is arranged big endian, such that we would need to lay our byte array
out as:

   0x0 ... 0x8
  | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |

For it to be correctly loaded, at which point there is a discrepancy
between element and lane.

But as I say, that is other architectures. AArch64 should be consistent.
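
A quick sketch to make the claim checkable: the function below should
return 0x1a1b regardless of endianness.

  #include <arm_neon.h>

  uint16_t
  lane_check (void)
  {
    uint16_t foo[4] = {0x0a0b, 0x1a1b, 0x2a2b, 0x3a3b};
    uint16x4_t v = vld1_u16 (foo);
    return vget_lane_u16 (v, 1);
  }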

Thanks,
James


Re: [AArch64] Implement vset_lane intrinsics in C

2013-09-16 Thread James Greenhalgh
On Mon, Sep 16, 2013 at 10:29:37AM +0100, James Greenhalgh wrote:
> The little endian compiler would lay memory out as:
> 
>0x0 ...0x8
>   | 0b | 0a | 1b | 1a | 2b | 2a | 3b | 3a |
> 
> And the big endian compiler would lay out memory as:
> 
>0x0 ...0x8
>   | 0a | 0b | 1a | 1b | 2a | 2b | 3a | 3b |
> 
> In both cases, element 0 is '0x0a0b'. If we load this array as a
> vector with ld1.h both big and little-endian compilers will load
> the vector as:
> 
>bit 128 ..  bit 64bit 0
>lane 16   | lane 3  |   | lane 0  |
>   |. | 3b | 3a | 2b | 2a | 1b | 1a | 0b | 0a |
> 

Ugh, I knew I would make a mistake somewhere!

This should, of course, be loaded as:

bit 128 ..  bit 64bit 0
lane 16   | lane 3  |   | lane 0  |
   |. | 3a | 3b | 2a | 2b | 1a | 1b | 0a | 0b |
 
James



Re: [PATCH][AARCH64]Replace gen_rtx_PLUS with plus_constant

2013-09-21 Thread James Greenhalgh
On Fri, Sep 20, 2013 at 03:40:59PM +0100, Renlin Li wrote:
> Thank you, can you please commit it for me?
> 
> Kind regards,
> Renlin Li
> 
> On 09/20/13 15:26, Marcus Shawcroft wrote:
> > On 20 September 2013 15:18, Renlin Li  wrote:
> >
> >> 2013-09-20  Renlin Li  
> >>
> >>  * config/aarch64/aarch64.c (aarch64_expand_prologue): Use 
> >> plus_constant.
> >>  (aarch64_expand_epilogue): Likewise.
> >>  (aarch64_legitimize_reload_address): Likewise.
 
Hi Renlin,

This patch appears to have caused a number of regressions on
an aarch64-none-elf test run.

I see Internal Compiler Errors along these lines:

../src/gcc/gcc/testsuite/gcc.dg/vect/slp-multitypes-2.c: In function 'main1':
../src/gcc/gcc/testsuite/gcc.dg/vect/slp-multitypes-2.c:68:1: error: insn does 
not satisfy its constraints:
 }
 ^
(insn 182 472 183 (set (reg:QI 2 x2 [277])
(mem/c:QI (plus:DI (reg:DI 3 x3)
(const_int 6264 [0x1878])) [0 b1+0 S1 A64])) 
../src/gcc/gcc/testsuite/gcc.dg/vect/slp-multitypes-2.c:39 30 {*movqi_aarch64}
 (nil))
../src/gcc/gcc/testsuite/gcc.dg/vect/slp-multitypes-2.c:68:1: internal compiler 
error: in final_scan_insn, at final.c:2886
0x8b8135 _fatal_insn(char const*, rtx_def const*, char const*, int, char const*)
/work/gcc-clean/src/gcc/gcc/rtl-error.c:109
0x8b815f _fatal_insn_not_found(rtx_def const*, char const*, int, char const*)
/work/gcc-clean/src/gcc/gcc/rtl-error.c:120
0x6e1822 final_scan_insn(rtx_def*, _IO_FILE*, int, int, int*)
/work/gcc-clean/src/gcc/gcc/final.c:2886
0x6e1b01 final(rtx_def*, _IO_FILE*, int)
/work/gcc-clean/src/gcc/gcc/final.c:2017
0x6e1d49 rest_of_handle_final
/work/gcc-clean/src/gcc/gcc/final.c:4422
0x6e1d49 execute
/work/gcc-clean/src/gcc/gcc/final.c:4497
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See  for instructions.

Thanks,
James



Re: [PATCH][AARCH64]Replace gen_rtx_PLUS with plus_constant

2013-09-23 Thread James Greenhalgh
On Sat, Sep 21, 2013 at 09:34:34AM +0100, James Greenhalgh wrote:
> On Fri, Sep 20, 2013 at 03:40:59PM +0100, Renlin Li wrote:
> > Thank you, can you please commit it for me?
> > 
> > Kind regards,
> > Renlin Li
> > 
> > On 09/20/13 15:26, Marcus Shawcroft wrote:
> > > On 20 September 2013 15:18, Renlin Li  wrote:
> > >
> > >> 2013-09-20  Renlin Li  
> > >>
> > >>  * config/aarch64/aarch64.c (aarch64_expand_prologue): Use 
> > >> plus_constant.
> > >>  (aarch64_expand_epilogue): Likewise.
> > >>  (aarch64_legitimize_reload_address): Likewise.
>  
> Hi Renlin,
> 
> This patch appears to have caused a number of regressions on
> an aarch64-none-elf test run.

Marcus, Renlin,

I'm also seeing issues building an AArch64 Linux toolchain with this patch
applied, so I've reverted it for now.

Thanks,
James



Re: [PATCH] Refactor type handling in get_alias_set, fix PR58513

2013-09-26 Thread James Greenhalgh

On Tue, Sep 24, 2013 at 12:00:48PM +0100, Richard Biener wrote:
> 2013-09-24  Richard Biener  
>
> * g++.dg/vect/pr58513.cc: New testcase.
>

Hi,

This testcase fails for arm and aarch64 targets when using -fPIC.
As discussed on IRC this can be fixed by making op static.

After asking Richard Earnshaw to explain -fPIC and dynamic linking to
me, I've committed this (now obvious) patch as r202947.

Thanks,
James

---
gcc/testsuite/

2013-09-26  James Greenhalgh  

* g++.dg/vect/pr58513.cc (op): Make static.

diff --git a/gcc/testsuite/g++.dg/vect/pr58513.cc b/gcc/testsuite/g++.dg/vect/pr58513.cc
index 2563047..08a175c 100644
--- a/gcc/testsuite/g++.dg/vect/pr58513.cc
+++ b/gcc/testsuite/g++.dg/vect/pr58513.cc
@@ -1,7 +1,7 @@
 // { dg-do compile }
 // { dg-require-effective-target vect_int }
 
-int op (const int& x, const int& y) { return x + y; }
+static int op (const int& x, const int& y) { return x + y; }
 
 void foo(int* a)
 {

Re: [PATCH] Make jump thread path carry more information

2013-09-27 Thread James Greenhalgh
On Fri, Sep 27, 2013 at 04:32:10PM +0100, Jeff Law wrote:
> If you could pass along a .i file it'd be helpful in case I want to look 
> at something under the debugger.

I've opened http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58553 to save
everyone's inboxes.

Let me know if I can do anything else to help.

Cheers,
James



[ARM, AArch64] Make aarch-common.c files more robust.

2013-09-30 Thread James Greenhalgh

Hi,

Recently I've found myself getting a number of segfaults from
code calling into the arm_early_load/alu_dep functions in
aarch-common.c. These functions expect a particular form
for the RTX patterns they work over, but some of them do
not validate this form.

This patch fixes that, removing segmentation faults I see
when tuning for Cortex-A15 and Cortex-A7.

Tested on aarch64-none-elf and arm-none-eabi with no regressions.

OK?

Thanks,
James

---
gcc/

2013-09-30  James Greenhalgh  

* config/arm/aarch-common.c
(arm_early_load_addr_dep): Add sanity checking.
(arm_no_early_alu_shift_dep): Likewise.
(arm_no_early_alu_shift_value_dep): Likewise.
(arm_no_early_mul_dep): Likewise.
(arm_no_early_store_addr_dep): Likewise.
diff --git a/gcc/config/arm/aarch-common.c b/gcc/config/arm/aarch-common.c
index 69366af..ea50848 100644
--- a/gcc/config/arm/aarch-common.c
+++ b/gcc/config/arm/aarch-common.c
@@ -44,7 +44,12 @@ arm_early_load_addr_dep (rtx producer, rtx consumer)
 value = COND_EXEC_CODE (value);
   if (GET_CODE (value) == PARALLEL)
 value = XVECEXP (value, 0, 0);
+
+  if (GET_CODE (value) != SET)
+return 0;
+
   value = XEXP (value, 0);
+
   if (GET_CODE (addr) == COND_EXEC)
 addr = COND_EXEC_CODE (addr);
   if (GET_CODE (addr) == PARALLEL)
@@ -54,6 +59,10 @@ arm_early_load_addr_dep (rtx producer, rtx consumer)
   else
 addr = XVECEXP (addr, 0, 0);
 }
+
+  if (GET_CODE (addr) != SET)
+return 0;
+
   addr = XEXP (addr, 1);
 
   return reg_overlap_mentioned_p (value, addr);
@@ -74,21 +83,41 @@ arm_no_early_alu_shift_dep (rtx producer, rtx consumer)
 value = COND_EXEC_CODE (value);
   if (GET_CODE (value) == PARALLEL)
 value = XVECEXP (value, 0, 0);
+
+  if (GET_CODE (value) != SET)
+return 0;
+
   value = XEXP (value, 0);
+
   if (GET_CODE (op) == COND_EXEC)
 op = COND_EXEC_CODE (op);
   if (GET_CODE (op) == PARALLEL)
 op = XVECEXP (op, 0, 0);
+
+  if (GET_CODE (op) != SET)
+return 0;
+
   op = XEXP (op, 1);
 
+  if (!INSN_P (op))
+return 0;
+
   early_op = XEXP (op, 0);
+
   /* This is either an actual independent shift, or a shift applied to
  the first operand of another operation.  We want the whole shift
  operation.  */
   if (REG_P (early_op))
 early_op = op;
 
-  return !reg_overlap_mentioned_p (value, early_op);
+  if (GET_CODE (op) == ASHIFT
+  || GET_CODE (op) == ROTATE
+  || GET_CODE (op) == ASHIFTRT
+  || GET_CODE (op) == LSHIFTRT
+  || GET_CODE (op) == ROTATERT)
+return !reg_overlap_mentioned_p (value, early_op);
+  else
+return 0;
 }
 
 /* Return nonzero if the CONSUMER instruction (an ALU op) does not
@@ -106,13 +135,25 @@ arm_no_early_alu_shift_value_dep (rtx producer, rtx consumer)
 value = COND_EXEC_CODE (value);
   if (GET_CODE (value) == PARALLEL)
 value = XVECEXP (value, 0, 0);
+
+  if (GET_CODE (value) != SET)
+return 0;
+
   value = XEXP (value, 0);
+
   if (GET_CODE (op) == COND_EXEC)
 op = COND_EXEC_CODE (op);
   if (GET_CODE (op) == PARALLEL)
 op = XVECEXP (op, 0, 0);
+
+  if (GET_CODE (op) != SET)
+return 0;
+
   op = XEXP (op, 1);
 
+  if (!INSN_P (op))
+return 0;
+
   early_op = XEXP (op, 0);
 
   /* This is either an actual independent shift, or a shift applied to
@@ -121,7 +162,14 @@ arm_no_early_alu_shift_value_dep (rtx producer, rtx consumer)
   if (!REG_P (early_op))
 early_op = XEXP (early_op, 0);
 
-  return !reg_overlap_mentioned_p (value, early_op);
+  if (GET_CODE (op) == ASHIFT
+  || GET_CODE (op) == ROTATE
+  || GET_CODE (op) == ASHIFTRT
+  || GET_CODE (op) == LSHIFTRT
+  || GET_CODE (op) == ROTATERT)
+return !reg_overlap_mentioned_p (value, early_op);
+  else
+return 0;
 }
 
 /* Return nonzero if the CONSUMER (a mul or mac op) does not
@@ -138,11 +186,20 @@ arm_no_early_mul_dep (rtx producer, rtx consumer)
 value = COND_EXEC_CODE (value);
   if (GET_CODE (value) == PARALLEL)
 value = XVECEXP (value, 0, 0);
+
+  if (GET_CODE (value) != SET)
+return 0;
+
   value = XEXP (value, 0);
+
   if (GET_CODE (op) == COND_EXEC)
 op = COND_EXEC_CODE (op);
   if (GET_CODE (op) == PARALLEL)
 op = XVECEXP (op, 0, 0);
+
+  if (GET_CODE (op) != SET)
+return 0;
+
   op = XEXP (op, 1);
 
   if (GET_CODE (op) == PLUS || GET_CODE (op) == MINUS)
@@ -169,11 +226,20 @@ arm_no_early_store_addr_dep (rtx producer, rtx consumer)
 value = COND_EXEC_CODE (value);
   if (GET_CODE (value) == PARALLEL)
 value = XVECEXP (value, 0, 0);
+
+  if (GET_CODE (value) != SET)
+return 0;
+
   value = XEXP (value, 0);
+
   if (GET_CODE (addr) == COND_EXEC)
 addr = COND_EXEC_CODE (addr);
   if (GET_CODE (addr) == PARALLEL)
 addr = XVECEXP (addr, 0, 0);
+
+  if (GET_CODE (addr) != SET)
+return 0;
+
   addr = XEXP (addr, 0);
 
   return !reg_overlap_mentioned_p (value, addr);

[PATCH] [AArch64] Refactor Advanced SIMD builtin initialisation.

2012-10-05 Thread James Greenhalgh

Hi,

This patch refactors the initialisation code for the Advanced
SIMD builtins under the AArch64 target. The patch has been
regression tested on aarch64-none-elf.

OK for aarch64-branch?

(If yes, someone will have to commit this for me as I do not
have commit rights)

Thanks,
James Greenhalgh

---
2012-09-07  James Greenhalgh  
Tejas Belagod  

* config/aarch64/aarch64-builtins.c
(aarch64_simd_builtin_type_bits): Rename to...
(aarch64_simd_builtin_type_mode): ...this, make sequential.
(aarch64_simd_builtin_datum): Refactor members where possible.
(VAR1, VAR2, ..., VAR12): Update accordingly.
(aarch64_simd_builtin_data): Update accordingly.
(init_aarch64_simd_builtins): Refactor.
(aarch64_simd_builtin_compare): Remove.
(locate_simd_builtin_icode): Likewise.
diff --git a/gcc/config/aarch64/aarch64-builtins.c b/gcc/config/aarch64/aarch64-builtins.c
index 429a0df..9142aca 100644
--- a/gcc/config/aarch64/aarch64-builtins.c
+++ b/gcc/config/aarch64/aarch64-builtins.c
@@ -31,27 +31,28 @@
 #include "diagnostic-core.h"
 #include "optabs.h"
 
-enum aarch64_simd_builtin_type_bits
+enum aarch64_simd_builtin_type_mode
 {
-  T_V8QI = 0x0001,
-  T_V4HI = 0x0002,
-  T_V2SI = 0x0004,
-  T_V2SF = 0x0008,
-  T_DI = 0x0010,
-  T_DF = 0x0020,
-  T_V16QI = 0x0040,
-  T_V8HI = 0x0080,
-  T_V4SI = 0x0100,
-  T_V4SF = 0x0200,
-  T_V2DI = 0x0400,
-  T_V2DF = 0x0800,
-  T_TI = 0x1000,
-  T_EI = 0x2000,
-  T_OI = 0x4000,
-  T_XI = 0x8000,
-  T_SI = 0x1,
-  T_HI = 0x2,
-  T_QI = 0x4
+  T_V8QI,
+  T_V4HI,
+  T_V2SI,
+  T_V2SF,
+  T_DI,
+  T_DF,
+  T_V16QI,
+  T_V8HI,
+  T_V4SI,
+  T_V4SF,
+  T_V2DI,
+  T_V2DF,
+  T_TI,
+  T_EI,
+  T_OI,
+  T_XI,
+  T_SI,
+  T_HI,
+  T_QI,
+  T_MAX
 };
 
 #define v8qi_UP  T_V8QI
@@ -76,8 +77,6 @@ enum aarch64_simd_builtin_type_bits
 
 #define UP(X) X##_UP
 
-#define T_MAX 19
-
 typedef enum
 {
   AARCH64_SIMD_BINOP,
@@ -124,68 +123,121 @@ typedef struct
 {
   const char *name;
   const aarch64_simd_itype itype;
-  const int bits;
-  const enum insn_code codes[T_MAX];
-  const unsigned int num_vars;
-  unsigned int base_fcode;
+  enum aarch64_simd_builtin_type_mode mode;
+  const enum insn_code code;
+  unsigned int fcode;
 } aarch64_simd_builtin_datum;
 
 #define CF(N, X) CODE_FOR_aarch64_##N##X
 
 #define VAR1(T, N, A) \
-  #N, AARCH64_SIMD_##T, UP (A), { CF (N, A) }, 1, 0
+  {#N, AARCH64_SIMD_##T, UP (A), CF (N, A), 0}
 #define VAR2(T, N, A, B) \
-  #N, AARCH64_SIMD_##T, UP (A) | UP (B), { CF (N, A), CF (N, B) }, 2, 0
+  VAR1 (T, N, A), \
+  VAR1 (T, N, B)
 #define VAR3(T, N, A, B, C) \
-  #N, AARCH64_SIMD_##T, UP (A) | UP (B) | UP (C), \
-  { CF (N, A), CF (N, B), CF (N, C) }, 3, 0
+  VAR2 (T, N, A, B), \
+  VAR1 (T, N, C)
 #define VAR4(T, N, A, B, C, D) \
-  #N, AARCH64_SIMD_##T, UP (A) | UP (B) | UP (C) | UP (D), \
-  { CF (N, A), CF (N, B), CF (N, C), CF (N, D) }, 4, 0
+  VAR3 (T, N, A, B, C), \
+  VAR1 (T, N, D)
 #define VAR5(T, N, A, B, C, D, E) \
-  #N, AARCH64_SIMD_##T, UP (A) | UP (B) | UP (C) | UP (D) | UP (E), \
-  { CF (N, A), CF (N, B), CF (N, C), CF (N, D), CF (N, E) }, 5, 0
+  VAR4 (T, N, A, B, C, D), \
+  VAR1 (T, N, E)
 #define VAR6(T, N, A, B, C, D, E, F) \
-  #N, AARCH64_SIMD_##T, UP (A) | UP (B) | UP (C) | UP (D) | UP (E) | UP (F), \
-  { CF (N, A), CF (N, B), CF (N, C), CF (N, D), CF (N, E), CF (N, F) }, 6, 0
+  VAR5 (T, N, A, B, C, D, E), \
+  VAR1 (T, N, F)
 #define VAR7(T, N, A, B, C, D, E, F, G) \
-  #N, AARCH64_SIMD_##T, UP (A) | UP (B) | UP (C) | UP (D) \
-			| UP (E) | UP (F) | UP (G), \
-  { CF (N, A), CF (N, B), CF (N, C), CF (N, D), CF (N, E), CF (N, F), \
-CF (N, G) }, 7, 0
+  VAR6 (T, N, A, B, C, D, E, F), \
+  VAR1 (T, N, G)
 #define VAR8(T, N, A, B, C, D, E, F, G, H) \
-  #N, AARCH64_SIMD_##T, UP (A) | UP (B) | UP (C) | UP (D) \
-		| UP (E) | UP (F) | UP (G) \
-		| UP (H), \
-  { CF (N, A), CF (N, B), CF (N, C), CF (N, D), CF (N, E), CF (N, F), \
-CF (N, G), CF (N, H) }, 8, 0
+  VAR7 (T, N, A, B, C, D, E, F, G), \
+  VAR1 (T, N, H)
 #define VAR9(T, N, A, B, C, D, E, F, G, H, I) \
-  #N, AARCH64_SIMD_##T, UP (A) | UP (B) | UP (C) | UP (D) \
-		| UP (E) | UP (F) | UP (G) \
-		| UP (H) | UP (I), \
-  { CF (N, A), CF (N, B), CF (N, C), CF (N, D), CF (N, E), CF (N, F), \
-CF (N, G), CF (N, H), CF (N, I) }, 9, 0
+  VAR8 (T, N, A, B, C, D, E, F, G, H), \
+  VAR1 (T, N, I)
 #define VAR10(T, N, A, B, C, D, E, F, G, H, I, J) \
-  #N, AARCH64_SIMD_##T, UP (A) | UP (B) | UP (C) | UP (D) \
-		| UP (E) | UP (F) | UP (G) \
-		| UP (H) | UP (I) | UP (J), \
-  { CF (N, A), CF (N, B), CF (N, C), CF (N, D), CF (N, E), CF (N, F), \
-CF (N, G), CF (N, H), CF (N, I), CF (N, J) }, 10, 0
-
+  VAR9 (T, N, A, B, C, D, E, F, G, H, I), \
+  VAR1 (T, N, J)
 #define VAR11(T, N, A, B, C, D, E, F, G, H, I, J, K) \
-  #N, AARCH64_SIMD_##T, UP (A) | UP (B) | UP (C) | UP (D) \
-		| UP (E) | UP (F) | UP (G) \
-		| UP (H) | UP (I) | UP (J) | UP (K), \
-  { CF (N, A),

[PATCH] [AArch64] Add vcond, vcondu support.

2012-10-09 Thread James Greenhalgh

Hi,

This patch adds support for vcond and vcondu to the AArch64
backend.

Tested with no regressions on aarch64-none-elf.

OK for aarch64-branch?

(If so, someone will have to commit for me, as I do not
have commit rights.)

Thanks
James Greenhalgh

---
2012-09-11  James Greenhalgh  
Tejas Belagod  

* config/aarch64/aarch64-simd.md
(aarch64_simd_bsl_internal): New pattern.
(aarch64_simd_bsl): Likewise.
(aarch64_vcond_internal): Likewise.
(vcondu): Likewise.
(vcond): Likewise.
* config/aarch64/iterators.md (UNSPEC_BSL): Add to define_constants.
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index a7ddfb1..c9b5e17 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1467,6 +1467,150 @@
(set_attr "simd_mode" "V2SI")]
 )
 
+;; vbsl_* intrinsics may compile to any of vbsl/vbif/vbit depending on register
+;; allocation.  For an intrinsic of form:
+;;   vD = bsl_* (vS, vN, vM)
+;; We can use any of:
+;;   bsl vS, vN, vM  (if D = S)
+;;   bit vD, vN, vS  (if D = M, so 1-bits in vS choose bits from vN, else vM)
+;;   bif vD, vM, vS  (if D = N, so 0-bits in vS choose bits from vM, else vN)
+
+(define_insn "aarch64_simd_bsl_internal"
+  [(set (match_operand:VDQ 0 "register_operand"		 "=w,w,w")
+	(unspec:VDQ [(match_operand:VDQ 1 "register_operand" " 0,w,w")
+		 (match_operand:VDQ 2 "register_operand" " w,w,0")
+ (match_operand:VDQ 3 "register_operand" " w,0,w")]
+UNSPEC_BSL))]
+  "TARGET_SIMD"
+  "@
+  bsl\\t%0., %2., %3.
+  bit\\t%0., %2., %1.
+  bif\\t%0., %3., %1."
+)
+
+(define_expand "aarch64_simd_bsl"
+  [(set (match_operand:VDQ 0 "register_operand")
+(unspec:VDQ [(match_operand: 1 "register_operand")
+  (match_operand:VDQ 2 "register_operand")
+  (match_operand:VDQ 3 "register_operand")]
+ UNSPEC_BSL))]
+  "TARGET_SIMD"
+{
+  /* We can't alias operands together if they have different modes.  */
+  operands[1] = gen_lowpart (mode, operands[1]);
+})
+
+(define_expand "aarch64_vcond_internal"
+  [(set (match_operand:VDQ 0 "register_operand")
+	(if_then_else:VDQ
+	  (match_operator 3 "comparison_operator"
+	[(match_operand:VDQ 4 "register_operand")
+	 (match_operand:VDQ 5 "nonmemory_operand")])
+	  (match_operand:VDQ 1 "register_operand")
+	  (match_operand:VDQ 2 "register_operand")))]
+  "TARGET_SIMD"
+{
+  int inverse = 0, has_zero_imm_form = 0;
+  rtx mask = gen_reg_rtx (mode);
+
+  switch (GET_CODE (operands[3]))
+{
+case LE:
+case LT:
+case NE:
+  inverse = 1;
+  /* Fall through.  */
+case GE:
+case GT:
+case EQ:
+  has_zero_imm_form = 1;
+  break;
+case LEU:
+case LTU:
+  inverse = 1;
+  break;
+default:
+  break;
+}
+
+  if (!REG_P (operands[5])
+  && (operands[5] != CONST0_RTX (mode) || !has_zero_imm_form))
+operands[5] = force_reg (mode, operands[5]);
+
+  switch (GET_CODE (operands[3]))
+{
+case LT:
+case GE:
+  emit_insn (gen_aarch64_cmge (mask, operands[4], operands[5]));
+  break;
+
+case LE:
+case GT:
+  emit_insn (gen_aarch64_cmgt (mask, operands[4], operands[5]));
+  break;
+
+case LTU:
+case GEU:
+  emit_insn (gen_aarch64_cmhs (mask, operands[4], operands[5]));
+  break;
+
+case LEU:
+case GTU:
+  emit_insn (gen_aarch64_cmhi (mask, operands[4], operands[5]));
+  break;
+
+case NE:
+case EQ:
+  emit_insn (gen_aarch64_cmeq (mask, operands[4], operands[5]));
+  break;
+
+default:
+  gcc_unreachable ();
+}
+
+  if (inverse)
+emit_insn (gen_aarch64_simd_bsl (operands[0], mask, operands[2],
+operands[1]));
+  else
+emit_insn (gen_aarch64_simd_bsl (operands[0], mask, operands[1],
+operands[2]));
+
+  DONE;
+})
+
+(define_expand "vcond"
+  [(set (match_operand:VDQ 0 "register_operand")
+	(if_then_else:VDQ
+	  (match_operator 3 "comparison_operator"
+	[(match_operand:VDQ 4 "register_operand")
+	 (match_operand:VDQ 5 "nonmemory_operand")])
+	  (match_operand:VDQ 1 "register_operand")
+	  (match_operand:VDQ 2 "register_operand")))]
+  "TARGET_SIMD"
+{
+  emit_insn (gen_aarch64_vcond_internal (operands[0], operands[1],
+	   operands[2], operands[3],
+	   operands[4], operands[5]));
+  DONE;
+})
+
+
+(define_expand "vcondu"
+  [(set (match_operand:VDQ 0 "register_operand")
+	(if_then_else:VDQ
+	  (match_operator 3 "comparison_operator&

RE: [PATCH] [AArch64] Add vcond, vcondu support.

2012-10-26 Thread James Greenhalgh
> -Original Message-
> From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
> ow...@gcc.gnu.org] On Behalf Of Marcus Shawcroft
> Sent: 15 October 2012 12:37
> To: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] [AArch64] Add vcond, vcondu support.
> 
> On 09/10/12 12:08, James Greenhalgh wrote:
> >
> > Hi,
> >
> > This patch adds support for vcond and vcondu to the AArch64
> > backend.
> >
> > Tested with no regressions on aarch64-none-elf.
> >
> > OK for aarch64-branch?
> >
> > (If so, someone will have to commit for me, as I do not
> > have commit rights.)
> >
> > Thanks
> > James Greenhalgh
> >
> > ---
> > 2012-09-11  James Greenhalgh
> > Tejas Belagod
> >
> > * config/aarch64/aarch64-simd.md
> > (aarch64_simd_bsl_internal): New pattern.
> > (aarch64_simd_bsl): Likewise.
> > (aarch64_vcond_internal): Likewise.
> > (vcondu): Likewise.
> > (vcond): Likewise.
> > * config/aarch64/iterators.md (UNSPEC_BSL): Add to
> define_constants.
> 
> OK
> /Marcus
> 

Hi Marcus,

Thanks for the review, could someone please commit this patch
for me as I do not have SVN write access.

Regards,
James Greenhalgh


RE: [PATCH] [AArch64] Refactor Advanced SIMD builtin initialisation.

2012-10-26 Thread James Greenhalgh
> -Original Message-
> From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
> ow...@gcc.gnu.org] On Behalf Of Marcus Shawcroft
> Sent: 15 October 2012 12:35
> To: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] [AArch64] Refactor Advanced SIMD builtin
> initialisation.
> 
> On 05/10/12 16:52, James Greenhalgh wrote:
> >
> > Hi,
> >
> > This patch refactors the initialisation code for the Advanced
> > SIMD builtins under the AArch64 target. The patch has been
> > regression tested on aarch64-none-elf.
> >
> > OK for aarch64-branch?
> >
> > (If yes, someone will have to commit this for me as I do not
> > have commit rights)
> >
> > Thanks,
> > James Greenhalgh
> >
> > ---
> > 2012-09-07  James Greenhalgh
> > Tejas Belagod
> >
> > * config/aarch64/aarch64-builtins.c
> > (aarch64_simd_builtin_type_bits): Rename to...
> > (aarch64_simd_builtin_type_mode): ...this, make sequential.
> > (aarch64_simd_builtin_datum): Refactor members where possible.
> > (VAR1, VAR2, ..., VAR12): Update accordingly.
> > (aarch64_simd_builtin_data): Update accordingly.
> > (init_aarch64_simd_builtins): Refactor.
> > (aarch64_simd_builtin_compare): Remove.
> > (locate_simd_builtin_icode): Likewise.
> 
> OK and backport to aarch64-4.7-branch please.
> 
> /Marcus
> 

Hi Marcus,

Thanks for the review, could someone please commit this to the
appropriate branches for me, as I do not have SVN write access.

Regards,
James Greenhalgh 


[Patch] Add myself to MAINTAINERS as Write After Approval

2012-10-29 Thread James Greenhalgh
Hi,

This patch adds me to the Write After Approval section of MAINTAINERS.

Regards,
James Greenhalgh

--

2012-10-26  James Greenhalgh  

* MAINTAINERS (Write After Approval): Add myself.

wap.patch
Description: wap.patch


RE: [PATCH] [AArch64] Add vcond, vcondu support.

2012-10-30 Thread James Greenhalgh
> -Original Message-
> From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
> ow...@gcc.gnu.org] On Behalf Of Marcus Shawcroft
> Sent: 15 October 2012 12:37
> To: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] [AArch64] Add vcond, vcondu support.
> 
> On 09/10/12 12:08, James Greenhalgh wrote:
> >
> > Hi,
> >
> > This patch adds support for vcond and vcondu to the AArch64
> > backend.
> >
> > Tested with no regressions on aarch64-none-elf.
> >
> > OK for aarch64-branch?
> >
> > (If so, someone will have to commit for me, as I do not
> > have commit rights.)
> >
> > Thanks
> > James Greenhalgh
> >
> > ---
> > 2012-09-11  James Greenhalgh
> > Tejas Belagod
> >
> > * config/aarch64/aarch64-simd.md
> > (aarch64_simd_bsl_internal): New pattern.
> > (aarch64_simd_bsl): Likewise.
> > (aarch64_vcond_internal): Likewise.
> > (vcondu): Likewise.
> > (vcond): Likewise.
> > * config/aarch64/iterators.md (UNSPEC_BSL): Add to
> define_constants.
> 
> OK
> /Marcus
> 

Hi,

Committed as revision 192985.

Thanks,
James Greenhalgh


[AArch64] Fix early-clobber operands to vtbx[1,3]

2013-10-11 Thread James Greenhalgh

Hi,

The vtbx intrinsics are implemented in assembly without noting
that their tmp1 operand is early-clobber. This can, when the
wind blows the wrong way, result in us making a total mess of
the state of registers.

Fix by marking the required operands as early-clobber.
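
As a rough, standalone sketch of the hazard (this is not the arm_neon.h
code; the function below is invented for illustration): the template writes
its temporary before the last read of an input, so without the '&' the
register allocator is free to give the temporary and that input the same
register.

  #include <arm_neon.h>

  uint8x8_t
  tbl_then_compare (uint8x16_t table, uint8x8_t idx)
  {
    uint8x8_t result, tmp;
    __asm__ ("tbl  %1.8b, {%2.16b}, %3.8b\n\t"  /* writes tmp first ...        */
             "cmhs %0.8b, %3.8b, %1.8b"         /* ... but idx is read again   */
             : "=w" (result), "=&w" (tmp)       /* "=&w" here, not plain "=w"  */
             : "w" (table), "w" (idx)
             : /* No clobbers */);
    return result;
  }

With plain "=w" on tmp, GCC may allocate tmp and idx to the same register,
so the tbl corrupts idx before the cmhs reads it; "=&w" rules that out.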

Regression tested against aarch64.exp with no problems.

OK?

Thanks,
James

---
2013-10-11  James Greenhalgh  

* config/aarch64/arm_neon.h
(vtbx<1,3>_8): Fix register constraints.
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index 482d7d0..f7c9db6 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -15636,7 +15636,7 @@ vtbx1_s8 (int8x8_t r, int8x8_t tab, int8x8_t idx)
 	   "cmhs %0.8b, %3.8b, %0.8b\n\t"
 	   "tbl %1.8b, {%2.16b}, %3.8b\n\t"
 	   "bsl %0.8b, %4.8b, %1.8b\n\t"
-   : "+w"(result), "=w"(tmp1)
+   : "+w"(result), "=&w"(tmp1)
: "w"(temp), "w"(idx), "w"(r)
: /* No clobbers */);
   return result;
@@ -15652,7 +15652,7 @@ vtbx1_u8 (uint8x8_t r, uint8x8_t tab, uint8x8_t idx)
 	   "cmhs %0.8b, %3.8b, %0.8b\n\t"
 	   "tbl %1.8b, {%2.16b}, %3.8b\n\t"
 	   "bsl %0.8b, %4.8b, %1.8b\n\t"
-   : "+w"(result), "=w"(tmp1)
+   : "+w"(result), "=&w"(tmp1)
: "w"(temp), "w"(idx), "w"(r)
: /* No clobbers */);
   return result;
@@ -15668,7 +15668,7 @@ vtbx1_p8 (poly8x8_t r, poly8x8_t tab, uint8x8_t idx)
 	   "cmhs %0.8b, %3.8b, %0.8b\n\t"
 	   "tbl %1.8b, {%2.16b}, %3.8b\n\t"
 	   "bsl %0.8b, %4.8b, %1.8b\n\t"
-   : "+w"(result), "=w"(tmp1)
+   : "+w"(result), "=&w"(tmp1)
: "w"(temp), "w"(idx), "w"(r)
: /* No clobbers */);
   return result;
@@ -15723,7 +15723,7 @@ vtbx3_s8 (int8x8_t r, int8x8x3_t tab, int8x8_t idx)
 	   "cmhs %0.8b, %3.8b, %0.8b\n\t"
 	   "tbl %1.8b, {v16.16b - v17.16b}, %3.8b\n\t"
 	   "bsl %0.8b, %4.8b, %1.8b\n\t"
-   : "+w"(result), "=w"(tmp1)
+   : "+w"(result), "=&w"(tmp1)
: "Q"(temp), "w"(idx), "w"(r)
: "v16", "v17", "memory");
   return result;
@@ -15742,7 +15742,7 @@ vtbx3_u8 (uint8x8_t r, uint8x8x3_t tab, uint8x8_t idx)
 	   "cmhs %0.8b, %3.8b, %0.8b\n\t"
 	   "tbl %1.8b, {v16.16b - v17.16b}, %3.8b\n\t"
 	   "bsl %0.8b, %4.8b, %1.8b\n\t"
-   : "+w"(result), "=w"(tmp1)
+   : "+w"(result), "=&w"(tmp1)
: "Q"(temp), "w"(idx), "w"(r)
: "v16", "v17", "memory");
   return result;
@@ -15761,7 +15761,7 @@ vtbx3_p8 (poly8x8_t r, poly8x8x3_t tab, uint8x8_t idx)
 	   "cmhs %0.8b, %3.8b, %0.8b\n\t"
 	   "tbl %1.8b, {v16.16b - v17.16b}, %3.8b\n\t"
 	   "bsl %0.8b, %4.8b, %1.8b\n\t"
-   : "+w"(result), "=w"(tmp1)
+   : "+w"(result), "=&w"(tmp1)
: "Q"(temp), "w"(idx), "w"(r)
: "v16", "v17", "memory");
   return result;

Re: [AArch64] Fix early-clobber operands to vtbx[1,3]

2013-10-12 Thread James Greenhalgh
On Fri, Oct 11, 2013 at 07:52:48PM +0100, Marcus Shawcroft wrote:
> > 2013-10-11  James Greenhalgh  
> >
> > * config/aarch64/arm_neon.h
> > (vtbx<1,3>_8): Fix register constraints.
> >
> > OK?
> 
> OK, and back port to 4.8 please.
> /Marcus
> 

Hi Marcus,

I've committed this as revision 203478, but 4.8 is currently
frozen for release, so Jakub (+CC) will have to approve it.

This patch is small, not very controversial and only affects
the AArch64 tree.

Otherwise, I'll backport this when 4.8 opens again.

Thanks,
James
From ba67f60eb238b71c55cc4363f5061b6e6810990a Mon Sep 17 00:00:00 2001
From: James Greenhalgh 
Date: Fri, 13 Sep 2013 17:18:23 +0100
Subject: [AArch64] Fix early-clobber operands to vtbx[1,3]


Hi,

The vtbx intrinsics are implemented in assembly without noting
that their tmp1 operand is early-clobber. This can, when the
wind blows the wrong way, result in us making a total mess of
the state of registers.

Fix by marking the required operands as early-clobber.

Regression tested against aarch64.exp with no problems.

OK?

Thanks,
James

---
2013-10-11  James Greenhalgh  

	* config/aarch64/arm_neon.h
	(vtbx<1,3>_8): Fix register constraints.



diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index 482d7d0..f7c9db6 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -15636,7 +15636,7 @@ vtbx1_s8 (int8x8_t r, int8x8_t tab, int8x8_t idx)
 	   "cmhs %0.8b, %3.8b, %0.8b\n\t"
 	   "tbl %1.8b, {%2.16b}, %3.8b\n\t"
 	   "bsl %0.8b, %4.8b, %1.8b\n\t"
-   : "+w"(result), "=w"(tmp1)
+   : "+w"(result), "=&w"(tmp1)
: "w"(temp), "w"(idx), "w"(r)
: /* No clobbers */);
   return result;
@@ -15652,7 +15652,7 @@ vtbx1_u8 (uint8x8_t r, uint8x8_t tab, uint8x8_t idx)
 	   "cmhs %0.8b, %3.8b, %0.8b\n\t"
 	   "tbl %1.8b, {%2.16b}, %3.8b\n\t"
 	   "bsl %0.8b, %4.8b, %1.8b\n\t"
-   : "+w"(result), "=w"(tmp1)
+   : "+w"(result), "=&w"(tmp1)
: "w"(temp), "w"(idx), "w"(r)
: /* No clobbers */);
   return result;
@@ -15668,7 +15668,7 @@ vtbx1_p8 (poly8x8_t r, poly8x8_t tab, uint8x8_t idx)
 	   "cmhs %0.8b, %3.8b, %0.8b\n\t"
 	   "tbl %1.8b, {%2.16b}, %3.8b\n\t"
 	   "bsl %0.8b, %4.8b, %1.8b\n\t"
-   : "+w"(result), "=w"(tmp1)
+   : "+w"(result), "=&w"(tmp1)
: "w"(temp), "w"(idx), "w"(r)
: /* No clobbers */);
   return result;
@@ -15723,7 +15723,7 @@ vtbx3_s8 (int8x8_t r, int8x8x3_t tab, int8x8_t idx)
 	   "cmhs %0.8b, %3.8b, %0.8b\n\t"
 	   "tbl %1.8b, {v16.16b - v17.16b}, %3.8b\n\t"
 	   "bsl %0.8b, %4.8b, %1.8b\n\t"
-   : "+w"(result), "=w"(tmp1)
+   : "+w"(result), "=&w"(tmp1)
: "Q"(temp), "w"(idx), "w"(r)
: "v16", "v17", "memory");
   return result;
@@ -15742,7 +15742,7 @@ vtbx3_u8 (uint8x8_t r, uint8x8x3_t tab, uint8x8_t idx)
 	   "cmhs %0.8b, %3.8b, %0.8b\n\t"
 	   "tbl %1.8b, {v16.16b - v17.16b}, %3.8b\n\t"
 	   "bsl %0.8b, %4.8b, %1.8b\n\t"
-   : "+w"(result), "=w"(tmp1)
+   : "+w"(result), "=&w"(tmp1)
: "Q"(temp), "w"(idx), "w"(r)
: "v16", "v17", "memory");
   return result;
@@ -15761,7 +15761,7 @@ vtbx3_p8 (poly8x8_t r, poly8x8x3_t tab, uint8x8_t idx)
 	   "cmhs %0.8b, %3.8b, %0.8b\n\t"
 	   "tbl %1.8b, {v16.16b - v17.16b}, %3.8b\n\t"
 	   "bsl %0.8b, %4.8b, %1.8b\n\t"
-   : "+w"(result), "=w"(tmp1)
+   : "+w"(result), "=&w"(tmp1)
: "Q"(temp), "w"(idx), "w"(r)
: "v16", "v17", "memory");
   return result;





[ARM/AARCH64] Remodel type attribute values for Neon Instructions.

2013-10-15 Thread James Greenhalgh
Hi,

The historical neon_type attribute formed groups over the Neon
instructions which were well suited for modelling the Cortex-A8
pipeline, but were cumbersome for other processor models.

The AArch64 has another classification "simd_type". This is, with a
few exceptions, and when augmented by simd_mode, suitably high
resolution for our needs. However, this is not integrated in to the
"type" attribute, which we would ideally like to be the One True
instruction classification attribute.

This patch series aims to solve these problems by defining a new,
high resolution classification across the Neon instructions. From
this we can derive two benefits:

  * Convergence between the A32 and A64 backends.
  * Better Pipeline modeling.

The patch series first introduces the new Neon type classifications,
then updates config/arm/neon.md and config/aarch64/aarch64-simd.md
to use the new classifications.

We then update the pipeline models for the new types. For Cortex-A8
and Cortex-A9, this simply means reforming the old groups. For Cortex-A15,
this is a chance to form new groups with which we can better model the
pipeline latencies.

Finally, we can remove the old types and config/arm/neon-schedgen.ml.

The patch series has been bootstrapped on a Chromebook and the full
testsuite run with no regressions. All pipeline models have been
checked against some sample neon intrinsics code to ensure the new
schedules are sensible, and there are no holes in the pipeline models.

Thanks,
James

---
James Greenhalgh (10):
  [ARM] [1/10] Add new types to describe Neon insns.
  [AArch64] [Neon types 2/10] Update Current type attributes to new Neon
Types.
  [ARM] [Neon types 3/10] Update Current type attributes to new Neon
Types.
  [AArch64] [Neon types 4/10] Add type attributes to all simd insns
  [ARM] [Neon types 5/10] Update Cortex-A8 pipeline model
  [ARM] [Neon types 6/10] Cortex-A9 neon pipeline changes
  [ARM] [Neon types 7/10] Cortex-A15 neon pipeline changes
  [ARM] [Neon types 8/10] Cortex-A7 neon pipeline model
  [ARM] [Neon types 9/10] Remove old neon types
  [ARM] [Neon types 10/10] Remove neon-schedgen.ml

[ARM] [1/10] Add new types to describe Neon insns.

2013-10-15 Thread James Greenhalgh
ad2_4reg_q
neon_vld3_vld4
   neon_load3_3reg, neon_load3_3reg_q,
   neon_load4_4reg, neon_load4_4reg_q
neon_vld1_vld2_lane
   f_loads, f_loadd, f_stores, f_stored,
   neon_load1_one_lane, neon_load1_one_lane_q,
   neon_load2_one_lane, neon_load2_one_lane_q
neon_vld3_vld4_lane
   neon_load3_one_lane, neon_load3_one_lane_q,
   neon_load4_one_lane, neon_load4_one_lane_q
neon_vst1_1_2_regs_vst2_2_regs
   neon_store1_1reg, neon_store1_1reg_q,
   neon_store1_2reg, neon_store1_2reg_q,
   neon_store2_2reg, neon_store2_2reg_q
neon_vst1_3_4_regs
   neon_store1_3reg, neon_store1_3reg_q,
   neon_store1_4reg, neon_store1_4reg_q
neon_vst2_4_regs_vst3_vst4
   neon_store2_4reg, neon_store2_4reg_q,
   neon_store3_3reg, neon_store3_3reg_q,
   neon_store4_4reg, neon_store4_4reg_q
neon_vst1_vst2_lane
   neon_store1_one_lane, neon_store1_one_lane_q,
   neon_store2_one_lane, neon_store2_one_lane_q
neon_vst3_vst4_lane
   neon_store3_one_lane, neon_store3_one_lane_q,
   neon_store4_one_lane, neon_store4_one_lane_q
neon_mcr
   neon_from_gp, f_mcr
neon_mcr_2_mcrr
   neon_from_gp_q, f_mcrr
neon_mrc
   neon_to_gp, f_mrc
neon_mrrc
   neon_to_gp_q, f_mrrc

Bootstrapped in series, and sanity checked.

Thanks,
James

---
gcc

2013-10-15  James Greenhalgh  

* config/arm/types.md: Add new types for Neon insns.
diff --git a/gcc/config/arm/types.md b/gcc/config/arm/types.md
index 7a96438fd48d5e52dda4508ed637695c8290f492..7cb8aa87a261856e3b89d325a45e6a87d976f697 100644
--- a/gcc/config/arm/types.md
+++ b/gcc/config/arm/types.md
@@ -247,7 +247,6 @@
 ; neon_int_4
 ; neon_int_5
 ; neon_ldm_2
-; neon_ldr
 ; neon_mcr_2_mcrr
 ; neon_mcr
 ; neon_mla_ddd_16_scalar_qdd_32_16_long_scalar
@@ -266,7 +265,6 @@
 ; neon_shift_2
 ; neon_shift_3
 ; neon_stm_2
-; neon_str
 ; neon_vaba_qqq
 ; neon_vaba
 ; neon_vld1_1_2_regs
@@ -289,6 +287,299 @@
 ; neon_vst2_4_regs_vst3_vst4
 ; neon_vst3_vst4_lane
 ; neon_vst3_vst4
+;
+; neon_add
+; neon_add_q
+; neon_add_widen
+; neon_add_long
+; neon_qadd
+; neon_qadd_q
+; neon_add_halve
+; neon_add_halve_q
+; neon_add_halve_narrow_q
+; neon_sub
+; neon_sub_q
+; neon_sub_widen
+; neon_sub_long
+; neon_qsub
+; neon_qsub_q
+; neon_sub_halve
+; neon_sub_halve_q
+; neon_sub_halve_narrow_q
+; neon_abs
+; neon_abs_q
+; neon_neg
+; neon_neg_q
+; neon_qneg
+; neon_qneg_q
+; neon_qabs
+; neon_qabs_q
+; neon_abd
+; neon_abd_q
+; neon_abd_long
+; neon_minmax
+; neon_minmax_q
+; neon_compare
+; neon_compare_q
+; neon_compare_zero
+; neon_compare_zero_q
+; neon_arith_acc
+; neon_arith_acc_q
+; neon_reduc_add
+; neon_reduc_add_q
+; neon_reduc_add_long
+; neon_reduc_add_acc
+; neon_reduc_add_acc_q
+; neon_reduc_minmax
+; neon_reduc_minmax_q
+; neon_logic
+; neon_logic_q
+; neon_tst
+; neon_tst_q
+; neon_shift_imm
+; neon_shift_imm_q
+; neon_shift_imm_narrow_q
+; neon_shift_imm_long
+; neon_shift_reg
+; neon_shift_reg_q
+; neon_shift_acc
+; neon_shift_acc_q
+; neon_sat_shift_imm
+; neon_sat_shift_imm_q
+; neon_sat_shift_imm_narrow_q
+; neon_sat_shift_reg
+; neon_sat_shift_reg_q
+; neon_ins
+; neon_ins_q
+; neon_move
+; neon_move_q
+; neon_move_narrow_q
+; neon_permute
+; neon_permute_q
+; neon_zip
+; neon_zip_q
+; neon_tbl1
+; neon_tbl1_q
+; neon_tbl2
+; neon_tbl2_q
+; neon_tbl3
+; neon_tbl3_q
+; neon_tbl4
+; neon_tbl4_q
+; neon_bsl
+; neon_bsl_q
+; neon_cls
+; neon_cls_q
+; neon_cnt
+; neon_cnt_q
+; neon_ext
+; neon_ext_q
+; neon_rbit
+; neon_rbit_q
+; neon_rev
+; neon_rev_q
+; neon_mul_b
+; neon_mul_b_q
+; neon_mul_h
+; neon_mul_h_q
+; neon_mul_s
+; neon_mul_s_q
+; neon_mul_b_long
+; neon_mul_h_long
+; neon_mul_s_long
+; neon_mul_h_scalar
+; neon_mul_h_scalar_q
+; neon_mul_s_scalar
+; neon_mul_s_scalar_q
+; neon_mul_h_scalar_long
+; neon_mul_s_scalar_long
+; neon_sat_mul_b
+; neon_sat_mul_b_q
+; neon_sat_mul_h
+; neon_sat_mul_h_q
+; neon_sat_mul_s
+; neon_sat_mul_s_q
+; neon_sat_mul_b_long
+; neon_sat_mul_h_long
+; neon_sat_mul_s_long
+; neon_sat_mul_h_scalar
+; neon_sat_mul_h_scalar_q
+; neon_sat_mul_s_scalar
+; neon_sat_mul_s_scalar_q
+; neon_sat_mul_h_scalar_long
+; neon_sat_mul_s_scalar_long
+; neon_mla_b
+; neon_mla_b_q
+; neon_mla_h
+; neon_mla_h_q
+; neon_mla_s
+; neon_mla_s_q
+; neon_mla_b_long
+; neon_mla_h_long
+; neon_mla_s_long
+; neon_mla_h_scalar
+; neon_mla_h_scalar_q
+; neon_mla_s_scalar
+; neon_mla_s_scalar_q
+; neon_mla_h_scalar_long
+; neon_mla_s_scalar_long
+; neon_sat_mla_b_long
+; neon_sat_mla_h_long
+; neon_sat_mla_s_long
+; neon_sat_mla_h_scalar_long
+; neon_sat_mla_s_scalar_long
+; neon_to_gp
+; neon_to_gp_q
+; neon_from_gp
+; neon_from_gp_q
+; neon_ldr
+; neon_load1_1reg
+; neon_load1_1reg_q
+; neon_load1_2reg
+; neon_load1_2reg_q
+; neon_load1_3reg
+; neon_load1_3reg_q
+; neon_load1_4reg
+; neon_load1_4reg_q
+; neon_load1_all_lanes
+; neon_load1_all_lanes_q
+; neon_load1_one_lane
+; neon_load1_one_lane_q
+; neon_load2_2reg
+; neon_load2_2reg_q
+; neon_load2_4reg
+; neon_load2_4reg_q
+; neon_load2_all_lanes
+; neon_load2_all_lanes_q
+; neon_load2_one_lane
+; neon_load2_one_lane_q
+; neon_

[AArch64] [Neon types 2/10] Update Current type attributes to new Neon Types.

2013-10-15 Thread James Greenhalgh

Hi,

This patch transforms:
  neon_ldm_2 to neon_load1_2reg
  neon_stm_2 to neon_store1_2reg

in the aarch64 backend. This is just an administrative change, as
we have no cores consuming these types in the aarch64 backend.

Tested on aarch64-none-elf with no regressions.

Thanks,
James

---
gcc/

2013-10-15  James Greenhalgh  

* config/aarch64/aarch64.md (movtf_aarch64): Update type attribute.
(load_pair): Update type attribute.
(store_pair): Update type attribute.
* config/aarch64/iterators.md (q): New.
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index f3e004b6c3e300e4769bf9f8b49596282b42906b..01664665e7d309f2cf2076bdc3ca6e0825612cea 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -1033,7 +1033,7 @@ (define_insn "*movtf_aarch64"
stp\\t%1, %H1, %0"
   [(set_attr "v8type" "logic,move2,fmovi2f,fmovf2i,fconst,fconst,fpsimd_load,fpsimd_store,fpsimd_load2,fpsimd_store2")
(set_attr "type" "logic_reg,multiple,f_mcr,f_mrc,fconstd,fconstd,\
- f_loadd,f_stored,neon_ldm_2,neon_stm_2")
+ f_loadd,f_stored,neon_load1_2reg,neon_store1_2reg")
(set_attr "mode" "DF,DF,DF,DF,DF,DF,TF,TF,DF,DF")
(set_attr "length" "4,8,8,8,4,4,4,4,4,4")
(set_attr "fp" "*,*,yes,yes,*,yes,yes,yes,*,*")
@@ -1098,7 +1098,7 @@ (define_insn "load_pair"
 			   GET_MODE_SIZE (mode)))"
   "ldp\\t%0, %2, %1"
   [(set_attr "v8type" "fpsimd_load2")
-   (set_attr "type" "neon_ldm_2")
+   (set_attr "type" "neon_load1_2reg")
(set_attr "mode" "")]
 )
 
@@ -1115,7 +1115,7 @@ (define_insn "store_pair"
 			   GET_MODE_SIZE (mode)))"
   "stp\\t%1, %3, %0"
   [(set_attr "v8type" "fpsimd_store2")
-   (set_attr "type" "neon_stm_2")
+   (set_attr "type" "neon_store1_2reg")
(set_attr "mode" "")]
 )
 
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index ec8d813fa3f53ad822d94ccf42ac0619380d7e3b..13c6d958826a593dfcc54e31756ef9978dda9e4b 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -566,6 +566,15 @@ (define_mode_attr f [(V8QI "")  (V16QI "
 		 (V2SF "f") (V4SF  "f")
 		 (V2DF "f") (DF"f")])
 
+;; Defined to '_q' for 128-bit types.
+(define_mode_attr q [(V8QI "") (V16QI "_q")
+(V4HI "") (V8HI  "_q")
+(V2SI "") (V4SI  "_q")
+(DI   "") (V2DI  "_q")
+(V2SF "") (V4SF  "_q")
+  (V2DF  "_q")
+(QI "") (HI "") (SI "") (DI "") (SF "") (DF "")])
+
 ;; ---
 ;; Code Iterators
 ;; ---

[ARM] [Neon types 9/10] Remove old neon types

2013-10-15 Thread James Greenhalgh

Hi,

Now that we have ported the pipeline models, this patch removes the
old neon types.

Bootstrapped on a chromebook in series and sanity checked.

Thanks,
James

---
gcc/

2013-10-15  James Greenhalgh  

* config/arm/types: Remove old neon types.
diff --git a/gcc/config/arm/types.md b/gcc/config/arm/types.md
index 7cb8aa87a261856e3b89d325a45e6a87d976f697..1c4b9e33c7e5fb35b1fcdb987eb94286aab70d23 100644
--- a/gcc/config/arm/types.md
+++ b/gcc/config/arm/types.md
@@ -227,67 +227,6 @@
 ;
 ; The classification below is for NEON instructions.
 ;
-; neon_bp_2cycle
-; neon_bp_3cycle
-; neon_bp_simple
-; neon_fp_vadd_ddd_vabs_dd
-; neon_fp_vadd_qqq_vabs_qq
-; neon_fp_vmla_ddd_scalar
-; neon_fp_vmla_ddd
-; neon_fp_vmla_qqq_scalar
-; neon_fp_vmla_qqq
-; neon_fp_vmul_ddd
-; neon_fp_vmul_qqd
-; neon_fp_vrecps_vrsqrts_ddd
-; neon_fp_vrecps_vrsqrts_qqq
-; neon_fp_vsum
-; neon_int_1
-; neon_int_2
-; neon_int_3
-; neon_int_4
-; neon_int_5
-; neon_ldm_2
-; neon_mcr_2_mcrr
-; neon_mcr
-; neon_mla_ddd_16_scalar_qdd_32_16_long_scalar
-; neon_mla_ddd_32_qqd_16_ddd_32_scalar_qdd_64_32_long_scalar_qdd_64_32_long
-; neon_mla_ddd_8_16_qdd_16_8_long_32_16_long
-; neon_mla_qqq_32_qqd_32_scalar
-; neon_mla_qqq_8_16
-; neon_mrc
-; neon_mrrc
-; neon_mul_ddd_16_scalar_32_16_long_scalar
-; neon_mul_ddd_8_16_qdd_16_8_long_32_16_long
-; neon_mul_qdd_64_32_long_qqd_16_ddd_32_scalar_64_32_long_scalar
-; neon_mul_qqd_32_scalar
-; neon_mul_qqq_8_16_32_ddd_32
-; neon_shift_1
-; neon_shift_2
-; neon_shift_3
-; neon_stm_2
-; neon_vaba_qqq
-; neon_vaba
-; neon_vld1_1_2_regs
-; neon_vld1_3_4_regs
-; neon_vld1_vld2_lane
-; neon_vld2_2_regs_vld1_vld2_all_lanes
-; neon_vld2_4_regs
-; neon_vld3_vld4_all_lanes
-; neon_vld3_vld4_lane
-; neon_vld3_vld4
-; neon_vmov
-; neon_vqneg_vqabs
-; neon_vqshl_vrshl_vqrshl_qqq
-; neon_vshl_ddd
-; neon_vsma
-; neon_vsra_vrsra
-; neon_vst1_1_2_regs_vst2_2_regs
-; neon_vst1_3_4_regs
-; neon_vst1_vst2_lane
-; neon_vst2_4_regs_vst3_vst4
-; neon_vst3_vst4_lane
-; neon_vst3_vst4
-;
 ; neon_add
 ; neon_add_q
 ; neon_add_widen
@@ -772,66 +711,6 @@ (define_attr "type"
   wmmx_wunpckih,\
   wmmx_wunpckil,\
   wmmx_wxor,\
-  neon_bp_2cycle,\
-  neon_bp_3cycle,\
-  neon_bp_simple,\
-  neon_fp_vadd_ddd_vabs_dd,\
-  neon_fp_vadd_qqq_vabs_qq,\
-  neon_fp_vmla_ddd_scalar,\
-  neon_fp_vmla_ddd,\
-  neon_fp_vmla_qqq_scalar,\
-  neon_fp_vmla_qqq,\
-  neon_fp_vmul_ddd,\
-  neon_fp_vmul_qqd,\
-  neon_fp_vrecps_vrsqrts_ddd,\
-  neon_fp_vrecps_vrsqrts_qqq,\
-  neon_fp_vsum,\
-  neon_int_1,\
-  neon_int_2,\
-  neon_int_3,\
-  neon_int_4,\
-  neon_int_5,\
-  neon_ldm_2,\
-  neon_mcr_2_mcrr,\
-  neon_mcr,\
-  neon_mla_ddd_16_scalar_qdd_32_16_long_scalar,\
-  neon_mla_ddd_32_qqd_16_ddd_32_scalar_qdd_64_32_long_scalar_qdd_64_32_long,\
-  neon_mla_ddd_8_16_qdd_16_8_long_32_16_long,\
-  neon_mla_qqq_32_qqd_32_scalar,\
-  neon_mla_qqq_8_16,\
-  neon_mrc,\
-  neon_mrrc,\
-  neon_mul_ddd_16_scalar_32_16_long_scalar,\
-  neon_mul_ddd_8_16_qdd_16_8_long_32_16_long,\
-  neon_mul_qdd_64_32_long_qqd_16_ddd_32_scalar_64_32_long_scalar,\
-  neon_mul_qqd_32_scalar,\
-  neon_mul_qqq_8_16_32_ddd_32,\
-  neon_shift_1,\
-  neon_shift_2,\
-  neon_shift_3,\
-  neon_stm_2,\
-  neon_vaba_qqq,\
-  neon_vaba,\
-  neon_vld1_1_2_regs,\
-  neon_vld1_3_4_regs,\
-  neon_vld1_vld2_lane,\
-  neon_vld2_2_regs_vld1_vld2_all_lanes,\
-  neon_vld2_4_regs,\
-  neon_vld3_vld4_all_lanes,\
-  neon_vld3_vld4_lane,\
-  neon_vld3_vld4,\
-  neon_vmov,\
-  neon_vqneg_vqabs,\
-  neon_vqshl_vrshl_vqrshl_qqq,\
-  neon_vshl_ddd,\
-  neon_vsma,\
-  neon_vsra_vrsra,\
-  neon_vst1_1_2_regs_vst2_2_regs,\
-  neon_vst1_3_4_regs,\
-  neon_vst1_vst2_lane,\
-  neon_vst2_4_regs_vst3_vst4,\
-  neon_vst3_vst4_lane,\
-  neon_vst3_vst4,\
 \
   neon_add,\
   neon_add_q,\

[ARM] [Neon types 10/10] Remove neon-schedgen.ml

2013-10-15 Thread James Greenhalgh

Hi,

After refactoring all the Neon "type" attributes, neon-schedgen.ml is
out of date and only serves to distract.

This patch removes the script.

I've run a bootstrap for arm just to ensure that no funky Make
machinery remains.

OK?

Thanks,
James

---
2013-10-15  James Greenhalgh  

* config/arm/neon-schedgen.ml: Remove.
diff --git a/gcc/config/arm/cortex-a9-neon.md b/gcc/config/arm/cortex-a9-neon.md
index ba005d464f1402b904438c4fd03541a5e18ba3f1..cd6b7a4fd36d40c50c78d5cdb7ca484652770cb4 100644
--- a/gcc/config/arm/cortex-a9-neon.md
+++ b/gcc/config/arm/cortex-a9-neon.md
@@ -330,8 +330,6 @@ (define_insn_reservation "ca9_neon_mrrc"
(eq_attr "cortex_a9_neon_type" "neon_mrrc"))
   "ca9_issue_vfp_neon + cortex_a9_neon_mcr")
 
-;; The remainder of this file is auto-generated by neon-schedgen.
-
 ;; Instructions using this reservation read their source operands at N2, and
 ;; produce a result at N3.
 (define_insn_reservation "cortex_a9_neon_int_1" 3
diff --git a/gcc/config/arm/neon-schedgen.ml b/gcc/config/arm/neon-schedgen.ml
deleted file mode 100644
index b369956..000
--- a/gcc/config/arm/neon-schedgen.ml
+++ /dev/null
@@ -1,543 +0,0 @@
-(* Emission of the core of the Cortex-A8 NEON scheduling description.
-   Copyright (C) 2007-2013 Free Software Foundation, Inc.
-   Contributed by CodeSourcery.
-   This file is part of GCC.
-
-   GCC is free software; you can redistribute it and/or modify it under
-   the terms of the GNU General Public License as published by the Free
-   Software Foundation; either version 3, or (at your option) any later
-   version.
-
-   GCC is distributed in the hope that it will be useful, but WITHOUT ANY
-   WARRANTY; without even the implied warranty of MERCHANTABILITY or
-   FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
-   for more details.
-
-   You should have received a copy of the GNU General Public License
-   along with GCC; see the file COPYING3.  If not see
-   <http://www.gnu.org/licenses/>.
-*)
-
-(* This scheduling description generator works as follows.
-   - Each group of instructions has source and destination requirements
- specified and a list of cores supported. This is then filtered
- and per core scheduler descriptions are generated out.
- The reservations generated are prefixed by the name of the
- core and the check is performed on the basis of what the tuning
- string is. Running this will generate Neon scheduler descriptions
- for all cores supported.
-
- The source requirements may be specified using
- Source (the stage at which all source operands not otherwise
- described are read), Source_m (the stage at which Rm operands are
- read), Source_n (likewise for Rn) and Source_d (likewise for Rd).
-   - For each group of instructions the earliest stage where a source
- operand may be required is calculated.
-   - Each group of instructions is selected in turn as a producer.
- The latencies between this group and every other group are then
- calculated, yielding up to four values for each combination:
-	1. Producer -> consumer Rn latency
-	2. Producer -> consumer Rm latency
-	3. Producer -> consumer Rd (as a source) latency
-	4. Producer -> consumer worst-case latency.
- Value 4 is calculated from the destination availability requirements
- of the consumer and the earliest source availability requirements
- of the producer.
-   - The largest Value 4 calculated for the current producer is the
- worse-case latency, L, for that instruction group.  This value is written
- out in a define_insn_reservation for the producer group.
-   - For each producer and consumer pair, the latencies calculated above
- are collated.  The average (of up to four values) is calculated and
- if this average is different from the worst-case latency, an
- unguarded define_bypass construction is issued for that pair.
- (For each pair only one define_bypass construction will be emitted,
- and at present we do not emit specific guards.)
-*)
-
-let find_with_result fn lst =
-  let rec scan = function
-  [] -> raise Not_found
-| l::ls -> 
-  match fn l with
-  Some result -> result
-   | _ -> scan ls in
-scan lst
-
-let n1 = 1 and n2 = 2 and n3 = 3 and n4 = 4 and n5 = 5 and n6 = 6
-and n7 = 7 and n8 = 8 and n9 = 9
-
-type availability = Source of int
-  | Source_n of int
-  | Source_m of int
-  | Source_d of int
-  | Dest of int
-		  | Dest_n_after of int * int
-
-type guard = Guard_none | Guard_only_m | Guard_only_n | Guard_only_d
-
-(* Reservation behaviors.  All but the last row here correspond to one
-   pipeline each.  Each constructor will correspond to one
-   define_reservation.  *)
-type reservation =
-  Mul | Mul_2cycle | Mul_4cycle
-| Shift | Shift_2c

[ARM] [Neon types 8/10] Cortex-A7 neon pipeline model

2013-10-15 Thread James Greenhalgh

Hi,

This patch updates the A7 pipeline for the new Neon types.

Sanity checked and tested with some neon intrinsics code to see
schedule quality.

Thanks,
James

---
gcc/

2013-10-15  James Greenhalgh  

* config/arm/cortex-a7.md
(cortex_a7_neon_type): New.
(cortex_a7_neon_mul): Update for new types.
(cortex_a7_neon_mla): Likewise.
(cortex_a7_neon): Likewise.
diff --git a/gcc/config/arm/cortex-a7.md b/gcc/config/arm/cortex-a7.md
index a72a88d90af1c5491115ee84af47ec6d4f593535..7db6c5b24fb9cfe0c8a6a9837798736bd94b7788 100644
--- a/gcc/config/arm/cortex-a7.md
+++ b/gcc/config/arm/cortex-a7.md
@@ -20,6 +20,45 @@
 ;; along with GCC; see the file COPYING3.  If not see
 ;; <http://www.gnu.org/licenses/>.
 
+(define_attr "cortex_a7_neon_type"
+  "neon_mul, neon_mla, neon_other"
+  (cond [
+  (eq_attr "type" "neon_mul_b, neon_mul_b_q,\
+	   neon_mul_h, neon_mul_h_q,\
+			   neon_mul_s, neon_mul_s_q,\
+			   neon_mul_b_long, neon_mul_h_long,\
+			   neon_mul_s_long, neon_mul_h_scalar,\
+			   neon_mul_h_scalar_q, neon_mul_s_scalar,\
+			   neon_mul_s_scalar_q, neon_mul_h_scalar_long,\
+			   neon_mul_s_scalar_long,\
+			   neon_sat_mul_b, neon_sat_mul_b_q,\
+			   neon_sat_mul_h, neon_sat_mul_h_q,\
+			   neon_sat_mul_s, neon_sat_mul_s_q,\
+			   neon_sat_mul_b_long, neon_sat_mul_h_long,\
+			   neon_sat_mul_s_long,\
+			   neon_sat_mul_h_scalar, neon_sat_mul_h_scalar_q,\
+			   neon_sat_mul_s_scalar, neon_sat_mul_s_scalar_q,\
+			   neon_sat_mul_h_scalar_long,\
+			   neon_sat_mul_s_scalar_long,\
+			   neon_fp_mul_s, neon_fp_mul_s_q,\
+			   neon_fp_mul_s_scalar, neon_fp_mul_s_scalar_q")
+ (const_string "neon_mul")
+  (eq_attr "type" "neon_mla_b, neon_mla_b_q, neon_mla_h,\
+	   neon_mla_h_q, neon_mla_s, neon_mla_s_q,\
+			   neon_mla_b_long, neon_mla_h_long,\
+   neon_mla_s_long,\
+			   neon_mla_h_scalar, neon_mla_h_scalar_q,\
+			   neon_mla_s_scalar, neon_mla_s_scalar_q,\
+			   neon_mla_h_scalar_long, neon_mla_s_scalar_long,\
+			   neon_sat_mla_b_long, neon_sat_mla_h_long,\
+			   neon_sat_mla_s_long,\
+			   neon_sat_mla_h_scalar_long,\
+   neon_sat_mla_s_scalar_long,\
+			   neon_fp_mla_s, neon_fp_mla_s_q,\
+			   neon_fp_mla_s_scalar, neon_fp_mla_s_scalar_q")
+ (const_string "neon_mla")]
+   (const_string "neon_other")))
+
 (define_automaton "cortex_a7")
 
 
@@ -227,14 +266,7 @@ (define_insn_reservation "cortex_a7_fpmu
 
 (define_insn_reservation "cortex_a7_neon_mul" 4
   (and (eq_attr "tune" "cortexa7")
-   (eq_attr "type"
-"neon_mul_ddd_8_16_qdd_16_8_long_32_16_long,\
- neon_mul_qqq_8_16_32_ddd_32,\
- neon_mul_qdd_64_32_long_qqd_16_ddd_32_scalar_64_32_long_scalar,\
- neon_mul_ddd_16_scalar_32_16_long_scalar,\
- neon_mul_qqd_32_scalar,\
- neon_fp_vmul_ddd,\
- neon_fp_vmul_qqd"))
+   (eq_attr "cortex_a7_neon_type" "neon_mul"))
   "(cortex_a7_both+cortex_a7_fpmul_pipe)*2")
 
 (define_insn_reservation "cortex_a7_fpmacs" 8
@@ -244,16 +276,7 @@ (define_insn_reservation "cortex_a7_fpma
 
 (define_insn_reservation "cortex_a7_neon_mla" 8
   (and (eq_attr "tune" "cortexa7")
-   (eq_attr "type"
-"neon_mla_ddd_8_16_qdd_16_8_long_32_16_long,\
- neon_mla_qqq_8_16,\
- neon_mla_ddd_32_qqd_16_ddd_32_scalar_qdd_64_32_long_scalar_qdd_64_32_long,\
- neon_mla_qqq_32_qqd_32_scalar,\
- neon_mla_ddd_16_scalar_qdd_32_16_long_scalar,\
- neon_fp_vmla_ddd,\
- neon_fp_vmla_qqq,\
- neon_fp_vmla_ddd_scalar,\
- neon_fp_vmla_qqq_scalar"))
+   (eq_attr "cortex_a7_neon_type" "neon_mla"))
   "cortex_a7_both+cortex_a7_fpmul_pipe")
 
 (define_bypass 4 "cortex_a7_fpmacs,cortex_a7_neon_mla"
@@ -366,21 +389,6 @@ (define_bypass 2 "cortex_a7_f_loads, cor
 
 (define_insn_reservation "cortex_a7_neon" 4
   (and (eq_attr "tune" "cortexa7")
-   (eq_attr "type"
-"neon_mul_ddd_8_16_qdd_16_8_long_32_16_long,\
- neon_mul_qqq_8_16_32_ddd_32,\
- neon_mul_qdd_64_32_long_qqd_16_ddd_32_scalar_64_32_long_scalar,\
- neon_mla_ddd_8_16_qdd_16_8_long_32_16_long,\
- neon_mla_qqq_8_16,\
- neon_mla_ddd_32_qqd_16_ddd_32_scalar_qdd_64_32_long_scalar_qdd_64_32_long,\
- neon_mla_qqq_32_qqd_32_scalar,\
-   

[AArch64] Fix output template for Scalar Neon->Neon register move.

2013-10-16 Thread James Greenhalgh

Hi,

To move a scalar char/short/int around in the vector registers there
is no such instruction as:
  dup v0, v0.h[0]
But there is:
  dup h0, v0.h[0]
(Alternatively there is dup v0.4h, v0.h[0], but I don't think that
is what we are aiming for).

Fix the output template we are using to reflect this.

aarch64.exp came back clean and the correct instruction form is
now generated.

OK?

Thanks,
James

---
2013-10-14  James Greenhalgh  

* config/aarch64/aarch64.md
(*mov_aarch64): Fix output template for DUP (element) Scalar.
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 01664665e7d309f2cf2076bdc3ca6e0825612cea..758be47420e95fad74c57c1a9dcb7934b87c141e 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -789,7 +789,7 @@ (define_insn "*mov_aarch64"
  case 8:
return "dup\t%0., %w1";
  case 9:
-   return "dup\t%0, %1.[0]";
+   return "dup\t%0, %1.[0]";
  default:
gcc_unreachable ();
  }

[AArch64] Fix types for vcvt_n intrinsics.

2013-10-17 Thread James Greenhalgh

Hi,

I spotted that the types of arguments to these intrinsics are wrong,
which results in all sorts of fun issues!

Fixed thusly, regression tested with aarch64.exp on aarch64-none-elf
with no issues.

OK?

Thanks,
James

---
2013-10-17  James Greenhalgh  

* config/aarch64/arm_neon.h
(vcvt_n_<32,64>_<32,64>): Correct argument types.
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index f7c9db6..55aa742 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -5442,7 +5442,7 @@ static float32x2_t vdup_n_f32 (float32_t);
   __extension__ \
 ({  \
int64_t a_ = (a);\
-   int64_t result;  \
+   float64_t result;\
__asm__ ("scvtf %d0,%d1,%2"  \
 : "=w"(result)  \
 : "w"(a_), "i"(b)   \
@@ -5454,7 +5454,7 @@ static float32x2_t vdup_n_f32 (float32_t);
   __extension__ \
 ({  \
uint64_t a_ = (a);   \
-   uint64_t result; \
+   float64_t result;\
__asm__ ("ucvtf %d0,%d1,%2"  \
 : "=w"(result)  \
 : "w"(a_), "i"(b)   \
@@ -5466,7 +5466,7 @@ static float32x2_t vdup_n_f32 (float32_t);
   __extension__ \
 ({  \
float64_t a_ = (a);  \
-   float64_t result;\
+   int64_t result;  \
__asm__ ("fcvtzs %d0,%d1,%2" \
 : "=w"(result)  \
 : "w"(a_), "i"(b)   \
@@ -5478,7 +5478,7 @@ static float32x2_t vdup_n_f32 (float32_t);
   __extension__ \
 ({  \
float64_t a_ = (a);  \
-   float64_t result;\
+   uint64_t result;  \
__asm__ ("fcvtzu %d0,%d1,%2" \
 : "=w"(result)  \
 : "w"(a_), "i"(b)   \
@@ -5586,7 +5586,7 @@ static float32x2_t vdup_n_f32 (float32_t);
   __extension__ \
 ({  \
int32_t a_ = (a);\
-   int32_t result;  \
+   float32_t result;\
__asm__ ("scvtf %s0,%s1,%2"  \
 : "=w"(result)  \
 : "w"(a_), "i"(b)   \
@@ -5598,7 +5598,7 @@ static float32x2_t vdup_n_f32 (float32_t);
   __extension__ \
 ({  \
uint32_t a_ = (a);   \
-   uint32_t result; \
+   float32_t result;\
__asm__ ("ucvtf %s0,%s1,%2"  \
 : "=w"(result)  \
 : "w"(a_), "i"(b)   \
@@ -5610,7 +5610,7 @@ static float32x2_t vdup_n_f32 (float32_t);
   __extension__ \
 ({  \

Re: [Patch tree-ssa] RFC: Enable path threading for control variables (PR tree-optimization/54742).

2013-10-18 Thread James Greenhalgh
On Fri, Oct 18, 2013 at 11:55:08AM +0100, Richard Biener wrote:
> I suppose with Jeffs recent work on jump-threading through paths
> this case in handled and the patch in this thread is obsolete or can
> be reworked?

Yes, this patch is now obsolete, Jeff's solution is much more
elegant :-)

Thanks,
James



[AArch64] Fix size of memory store for the vst_lane intrinsics

2013-10-29 Thread James Greenhalgh

Hi,

The vst_lane intrinsics should write
(sizeof (lane_type) * n) bytes to memory.

In their current form, their asm constraints suggest a write size of
(sizeof (vector_type) * n). This is anywhere from 1 to 16 times too
much data, which can cause huge headaches with dead store elimination.

This patch better models how much data we will be writing, which in
turn lets us eliminate the memory clobber. Together, we avoid the
problems with dead store elimination.
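
For illustration, a stripped-down sketch of the approach (not the actual
arm_neon.h macro; the struct and function names are invented): the asm
output operand is typed as a struct covering exactly the bytes the st2
writes, so the constraint describes an 8-byte store and the "memory"
clobber can go.

  #include <arm_neon.h>
  #include <stdint.h>

  typedef struct { int32_t val[2]; } two_s32_t;  /* exactly the bytes st2 writes */

  __extension__ static __inline void
  __attribute__ ((__always_inline__))
  sketch_vst2_lane_s32 (int32_t *ptr, int32x2x2_t b, const int c)
  {
    two_s32_t *p = (two_s32_t *) ptr;
    __asm__ ("ld1 {v16.2s, v17.2s}, %1\n\t"
             "st2 {v16.s, v17.s}[%2], %0"
             : "=Q" (*p)               /* an 8-byte store, not 16 or more */
             : "Q" (b), "i" (c)
             : "v16", "v17");          /* no "memory" clobber needed      */
  }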

Tested with aarch64.exp and checked the C++ neon mangling test which
often breaks when you do these ugly casts.

OK?

Thanks,
James

---
gcc/

2013-10-29  James Greenhalgh  

* config/aarch64/arm_neon.h
(__ST2_LANE_FUNC): Better model data size.
(__ST3_LANE_FUNC): Likewise.
(__ST4_LANE_FUNC): Likewise.
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index 787ff15..7a63ea1 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -14704,16 +14704,19 @@ __LD4_LANE_FUNC (uint64x2x4_t, uint64_t, 2d, d, u64, q)
 
 #define __ST2_LANE_FUNC(intype, ptrtype, regsuffix,			\
 			lnsuffix, funcsuffix, Q)			\
+  typedef struct { ptrtype __x[2]; } __ST2_LANE_STRUCTURE_##intype;	\
   __extension__ static __inline void	\
   __attribute__ ((__always_inline__))	\
-  vst2 ## Q ## _lane_ ## funcsuffix (const ptrtype *ptr,		\
+  vst2 ## Q ## _lane_ ## funcsuffix (ptrtype *ptr,			\
  intype b, const int c)		\
   {	\
+__ST2_LANE_STRUCTURE_##intype *__p =\
+(__ST2_LANE_STRUCTURE_##intype *)ptr;	\
 __asm__ ("ld1 {v16." #regsuffix ", v17." #regsuffix "}, %1\n\t"	\
 	 "st2 {v16." #lnsuffix ", v17." #lnsuffix "}[%2], %0\n\t"	\
-	 : "=Q"(*(intype *) ptr)	\
+	 : "=Q"(*__p)		\
 	 : "Q"(b), "i"(c)		\
-	 : "memory", "v16", "v17");	\
+	 : "v16", "v17");		\
   }
 
 __ST2_LANE_FUNC (int8x8x2_t, int8_t, 8b, b, s8,)
@@ -14743,16 +14746,19 @@ __ST2_LANE_FUNC (uint64x2x2_t, uint64_t, 2d, d, u64, q)
 
 #define __ST3_LANE_FUNC(intype, ptrtype, regsuffix,			\
 			lnsuffix, funcsuffix, Q)			\
+  typedef struct { ptrtype __x[3]; } __ST3_LANE_STRUCTURE_##intype;	\
   __extension__ static __inline void	\
   __attribute__ ((__always_inline__))	\
-  vst3 ## Q ## _lane_ ## funcsuffix (const ptrtype *ptr,		\
+  vst3 ## Q ## _lane_ ## funcsuffix (ptrtype *ptr,			\
  intype b, const int c)		\
   {	\
+__ST3_LANE_STRUCTURE_##intype *__p =\
+(__ST3_LANE_STRUCTURE_##intype *)ptr;	\
 __asm__ ("ld1 {v16." #regsuffix " - v18." #regsuffix "}, %1\n\t"	\
 	 "st3 {v16." #lnsuffix " - v18." #lnsuffix "}[%2], %0\n\t"	\
-	 : "=Q"(*(intype *) ptr)	\
+	 : "=Q"(*__p)		\
 	 : "Q"(b), "i"(c)		\
-	 : "memory", "v16", "v17", "v18");\
+	 : "v16", "v17", "v18");	\
   }
 
 __ST3_LANE_FUNC (int8x8x3_t, int8_t, 8b, b, s8,)
@@ -14782,16 +14788,19 @@ __ST3_LANE_FUNC (uint64x2x3_t, uint64_t, 2d, d, u64, q)
 
 #define __ST4_LANE_FUNC(intype, ptrtype, regsuffix,			\
 			lnsuffix, funcsuffix, Q)			\
+  typedef struct { ptrtype __x[4]; } __ST4_LANE_STRUCTURE_##intype;	\
   __extension__ static __inline void	\
   __attribute__ ((__always_inline__))	\
-  vst4 ## Q ## _lane_ ## funcsuffix (const ptrtype *ptr,		\
+  vst4 ## Q ## _lane_ ## funcsuffix (ptrtype *ptr,			\
  intype b, const int c)		\
   {	\
+__ST4_LANE_STRUCTURE_##intype *__p =\
+(__ST4_LANE_STRUCTURE_##intype *)ptr;	\
 __asm__ ("ld1 {v16." #regsuffix " - v19." #regsuffix "}, %1\n\t"	\
 	 "st4 {v16." #lnsuffix " - v19." #lnsuffix "}[%2], %0\n\t"	\
-	 : "=Q"(*(intype *) ptr)	\
+	 : "=Q"(*__p)		\
 	 : "Q"(b), "i"(c)		\
-	 : "memory", "v16", "v17", "v18", "v19");			\
+	 : "v16", "v17", "v18", "v19");\
   }
 
 __ST4_LANE_FUNC (int8x8x4_t, int8_t, 8b, b, s8,)

Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-11-01 Thread James Greenhalgh
On Fri, Nov 01, 2013 at 02:03:52AM +, Cong Hou wrote:
> 3. Add the document for SAD_EXPR.

I think this patch should also document the new Standard Names usad and
ssad in doc/md.texi?

Your Changelog is missing the change to doc/generic.texi.

Thanks,
James



Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-11-04 Thread James Greenhalgh
On Fri, Nov 01, 2013 at 04:48:53PM +, Cong Hou wrote:
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 2a5a2e1..8f5d39a 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -4705,6 +4705,16 @@ wider mode, is computed and added to operand 3.
> Operand 3 is of a mode equal or
>  wider than the mode of the product. The result is placed in operand 0, which
>  is of the same mode as operand 3.
> 
> +@cindex @code{ssad@var{m}} instruction pattern
> +@item @samp{ssad@var{m}}
> +@cindex @code{usad@var{m}} instruction pattern
> +@item @samp{usad@var{m}}
> +Compute the sum of absolute differences of two signed/unsigned elements.
> +Operand 1 and operand 2 are of the same mode. Their absolute difference, 
> which
> +is of a wider mode, is computed and added to operand 3. Operand 3 is of a 
> mode
> +equal or wider than the mode of the absolute difference. The result is placed
> +in operand 0, which is of the same mode as operand 3.
> +
>  @cindex @code{ssum_widen@var{m3}} instruction pattern
>  @item @samp{ssum_widen@var{m3}}
>  @cindex @code{usum_widen@var{m3}} instruction pattern
> diff --git a/gcc/expr.c b/gcc/expr.c
> index 4975a64..1db8a49 100644

I'm not sure I follow, and if I do - I don't think it matches what
you have implemented for i386.

From your text description I would guess the series of operations to be:

  v1 = widen (operands[1])
  v2 = widen (operands[2])
  v3 = abs (v1 - v2)
  operands[0] = v3 + operands[3]

But if I understand the behaviour of PSADBW correctly, what you have
actually implemented is:

  v1 = widen (operands[1])
  v2 = widen (operands[2])
  v3 = abs (v1 - v2)
  v4 = reduce_plus (v3)
  operands[0] = v4 + operands[3]

To my mind, synthesizing the reduce_plus step will be wasteful for targets
who do not get this for free with their Absolute Difference step. Imagine a
simple loop where we have synthesized the reduce_plus, we compute partial
sums each loop iteration, though we would be better to leave the reduce_plus
step until after the loop. "REDUC_PLUS_EXPR" would be the appropriate
Tree code for this.
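
A scalar sketch of the distinction, in plain C with invented function names
(a 4-element array stands in for the vector accumulator); both functions
compute the same value, but the second performs the horizontal add on every
iteration, which is what a PSADBW-style 'usad' implies:

  #include <stdint.h>
  #include <stdlib.h>

  /* Reduction left until after the loop: one reduce_plus at the end.  */
  uint32_t
  sad_reduce_after_loop (const uint8_t *a, const uint8_t *b, size_t n)
  {
    uint32_t acc[4] = { 0, 0, 0, 0 };
    for (size_t i = 0; i + 4 <= n; i += 4)
      for (int lane = 0; lane < 4; lane++)
        acc[lane] += abs (a[i + lane] - b[i + lane]);
    return acc[0] + acc[1] + acc[2] + acc[3];
  }

  /* Reduction implied on every iteration, as PSADBW gives for free but
     other targets would have to synthesize inside the loop.  */
  uint32_t
  sad_reduce_each_iteration (const uint8_t *a, const uint8_t *b, size_t n)
  {
    uint32_t acc = 0;
    for (size_t i = 0; i + 4 <= n; i += 4)
      {
        uint32_t partial = 0;
        for (int lane = 0; lane < 4; lane++)
          partial += abs (a[i + lane] - b[i + lane]);
        acc += partial;
      }
    return acc;
  }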

I would prefer to see this Tree code not imply the reduce_plus.

Thanks,
James



Re: [PATCH] Try to avoid vector mode punning in SET_DEST on i?86

2013-11-04 Thread James Greenhalgh
On Thu, Oct 31, 2013 at 04:49:47PM +, Jakub Jelinek wrote:

> 2013-10-31  Jakub Jelinek  
> 
> * optabs.c (expand_vec_perm): Avoid vector mode punning
> SUBREGs in SET_DEST.
> * expmed.c (store_bit_field_1): Likewise.
> * config/i386/sse.md (movdi_to_sse, vec_pack_sfix_trunc_v2df,
> vec_pack_sfix_v2df, vec_shl_, vec_shr_,
> vec_interleave_high, vec_interleave_low): Likewise.
> * config/i386/i386.c (ix86_expand_vector_move_misalign,
> ix86_expand_sse_movcc, ix86_expand_int_vcond, ix86_expand_vec_perm,
> ix86_expand_sse_unpack, ix86_expand_args_builtin,
> ix86_expand_vector_init_duplicate, ix86_expand_vector_set,
> emit_reduc_half, expand_vec_perm_blend, expand_vec_perm_pshufb,
> expand_vec_perm_interleave2, expand_vec_perm_pshufb2,
> expand_vec_perm_vpshufb2_vpermq,
> expand_vec_perm_vpshufb2_vpermq_even_odd, expand_vec_perm_even_odd_1,
> expand_vec_perm_broadcast_1, expand_vec_perm_vpshufb4_vpermq2,
> ix86_expand_sse2_mulv4si3, ix86_expand_pinsr): Likewise.
> (expand_vec_perm_palignr): Likewise.  Modify a copy of *d rather
> than *d itself.
> 
> --- gcc/optabs.c.jj 2013-10-29 09:25:45.0 +0100
> +++ gcc/optabs.c2013-10-31 13:20:40.384808642 +0100
> @@ -6674,7 +6674,7 @@ expand_vec_perm (enum machine_mode mode,
> }
>tmp = gen_rtx_CONST_VECTOR (qimode, vec);
>sel = gen_lowpart (qimode, sel);
> -  sel = expand_vec_perm (qimode, sel, sel, tmp, NULL);
> +  sel = expand_vec_perm (qimode, gen_reg_rtx (qimode), sel, tmp, NULL);
>gcc_assert (sel != NULL);
> 
>/* Add the byte offset to each byte element.  */

This hunk causes issues on AArch64 and ARM.

We look to see which permute operation we should generate in
aarch64_expand_vec_perm_const. If we notice that all elements in the
selector would select from op0, we copy op0 to op1 and generate appropriate code.

With this hunk applied we end up selecting from the register
generated by gen_reg_rtx (qimode), rather than from 'sel' as we
intended. Thus we lose the value of 'sel' and everything starts to
go wrong!
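
As a standalone model of the failure mode (ordinary C, not GCC internals;
everything below is invented for illustration): the shortcut of aliasing the
two input vectors is only safe while the first operand really carries the
data, which the new call no longer guarantees.

  /* idx[i] < n selects lane i of op0; idx[i] >= n selects lane i - n of op1.  */
  void
  permute_const_model (const unsigned char *op0, const unsigned char *op1,
                       const unsigned char *idx, unsigned char *out, int n)
  {
    int all_from_op0 = 1;
    for (int i = 0; i < n; i++)
      if (idx[i] >= n)
        all_from_op0 = 0;

    /* The backend's "copy op0 to op1" trick: harmless when op0 holds the
       data, fatal when op0 is a scratch register and 'sel' was in op1.  */
    if (all_from_op0)
      op1 = op0;

    for (int i = 0; i < n; i++)
      out[i] = idx[i] < n ? op0[idx[i]] : op1[idx[i] - n];
  }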

The hunk looks suspicious to me (why do we pick a register out of
thin air?), and reverting it fixes the problems I see. Could you
give me a pointer as to what this hunk fixes on i?86?

I don't think I can fix this up in the expander very easily?

Thanks,
James




Re: [PATCH] Introducing SAD (Sum of Absolute Differences) operation to GCC vectorizer.

2013-11-05 Thread James Greenhalgh
On Mon, Nov 04, 2013 at 06:30:55PM +, Cong Hou wrote:
> On Mon, Nov 4, 2013 at 2:06 AM, James Greenhalgh
>  wrote:
> > On Fri, Nov 01, 2013 at 04:48:53PM +, Cong Hou wrote:
> >> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> >> index 2a5a2e1..8f5d39a 100644
> >> --- a/gcc/doc/md.texi
> >> +++ b/gcc/doc/md.texi
> >> @@ -4705,6 +4705,16 @@ wider mode, is computed and added to operand 3.
> >> Operand 3 is of a mode equal or
> >>  wider than the mode of the product. The result is placed in operand 0, 
> >> which
> >>  is of the same mode as operand 3.
> >>
> >> +@cindex @code{ssad@var{m}} instruction pattern
> >> +@item @samp{ssad@var{m}}
> >> +@cindex @code{usad@var{m}} instruction pattern
> >> +@item @samp{usad@var{m}}
> >> +Compute the sum of absolute differences of two signed/unsigned elements.
> >> +Operand 1 and operand 2 are of the same mode. Their absolute difference, 
> >> which
> >> +is of a wider mode, is computed and added to operand 3. Operand 3 is of a 
> >> mode
> >> +equal or wider than the mode of the absolute difference. The result is 
> >> placed
> >> +in operand 0, which is of the same mode as operand 3.
> >> +
> >>  @cindex @code{ssum_widen@var{m3}} instruction pattern
> >>  @item @samp{ssum_widen@var{m3}}
> >>  @cindex @code{usum_widen@var{m3}} instruction pattern
> >> diff --git a/gcc/expr.c b/gcc/expr.c
> >> index 4975a64..1db8a49 100644
> >
> > I'm not sure I follow, and if I do - I don't think it matches what
> > you have implemented for i386.
> >
> > From your text description I would guess the series of operations to be:
> >
> >   v1 = widen (operands[1])
> >   v2 = widen (operands[2])
> >   v3 = abs (v1 - v2)
> >   operands[0] = v3 + operands[3]
> >
> > But if I understand the behaviour of PSADBW correctly, what you have
> > actually implemented is:
> >
> >   v1 = widen (operands[1])
> >   v2 = widen (operands[2])
> >   v3 = abs (v1 - v2)
> >   v4 = reduce_plus (v3)
> >   operands[0] = v4 + operands[3]
> >
> > To my mind, synthesizing the reduce_plus step will be wasteful for targets
> > who do not get this for free with their Absolute Difference step. Imagine a
> > simple loop where we have synthesized the reduce_plus, we compute partial
> > sums each loop iteration, though we would be better to leave the reduce_plus
> > step until after the loop. "REDUC_PLUS_EXPR" would be the appropriate
> > Tree code for this.
> 
> What do you mean when you use "synthesizing" here? For each pattern,
> the only synthesized operation is the one being returned from the
> pattern recognizer. In this case, it is USAD_EXPR. The recognition of
> reduce sum is necessary as we need corresponding prolog and epilog for
> reductions, which is already done before pattern recognition. Note
> that reduction is not a pattern but is a type of vector definition. A
> vectorization pattern can still be a reduction operation as long as
> STMT_VINFO_RELATED_STMT of this pattern is a reduction operation. You
> can check the other two reduction patterns: widen_sum_pattern and
> dot_prod_pattern for reference.

My apologies for not being clear. What I mean is, for a target which does
not have a dedicated PSADBW instruction, the individual steps of
'usad' must be "synthesized" in such a way as to match the expected
behaviour of the tree code.

So, I must expand 'usadm' to a series of equivalent instructions
as USAD_EXPR expects.

If USAD_EXPR requires me to emit a reduction on each loop iteration,
I think that will be inefficient compared to performing the reduction
after the loop body.

To a first approximation on ARM, I would expect from your description
of 'usad' that generating,

 VABAL   ops[3], ops[1], ops[2]
 (Vector widening Absolute Difference and Accumulate)

would fulfil the requirements.

But to match the behaviour you have implemented in the i386
backend I would be required to generate:

VABAL   ops[3], ops[1], ops[2]
VPADD   ops[3], ops[3], ops[3] (add one set of pairs)
VPADD   ops[3], ops[3], ops[3] (and the other)
VANDops[0], ops[3], MASK   (clear high lanes)

Which additionally performs the (redundant) vector reduction
and high lane zeroing step on each loop iteration.

My comment is that your documentation and implementation are
inconsistent so I am not sure which behaviour you intend for USAD_EXPR.

Additionally, I think it would be more generic to choose the first
behaviour, rather than requiring a wasteful decomposition to match
a very particular i386 opcode.
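
To make the loop-shape point concrete, here is a scalar C sketch of where I
would want the reduction to sit (the lane array stands in for a vector
accumulator; the names are illustrative):

  #include <stdint.h>
  #include <stdlib.h>

  uint32_t
  sad_loop (const uint8_t *a, const uint8_t *b, int n)
  {
    uint32_t lane[4] = { 0, 0, 0, 0 };
    for (int i = 0; i < n; i++)
      lane[i & 3] += abs ((int) a[i] - (int) b[i]);  /* VABAL-style step */
    /* The reduce_plus happens once, after the loop, not per iteration.  */
    return lane[0] + lane[1] + lane[2] + lane[3];
  }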

Thanks,
James



[Patch AArch64] GCC 6 regression in vector performance. - Fix vector initialization to happen with lane load instructions.

2016-01-20 Thread James Greenhalgh

Hi,

In a number of cases where we try to create vectors we end up spilling to the
stack and then filling. This is one example distilled from a couple of
micro-benchmarks where the issue shows up. The reason for the extra cost
in this case is the unnecessary use of the stack. The patch attempts to
finesse this by using lane loads or vector inserts to produce the right
results.

This patch is mostly Ramana's work, I've just cleaned it up a little.

This has been in a number of our trees lately, and we haven't seen any
regressions. I've also bootstrapped and tested it, and run a set of
benchmarks to show no regressions on Cortex-A57 or Cortex-A53.

The patch fixes some regressions caused by the more aggressive vectorization
in GCC6, so I'd like to propose it to go in even though we are in Stage 4.

OK?

Thanks,
James

---
gcc/

2016-01-20  James Greenhalgh  
Ramana Radhakrishnan  

* config/aarch64/aarch64.c (aarch64_expand_vector_init): Refactor,
always use lane loads to construct non-constant vectors.

gcc/testsuite/

2016-01-20  James Greenhalgh  
Ramana Radhakrishnan  

* gcc.target/aarch64/vector_initialization_nostack.c: New.

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 03bc1b9..3787b38 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -10985,28 +10985,37 @@ aarch64_simd_make_constant (rtx vals)
 return NULL_RTX;
 }
 
+/* Expand a vector initialisation sequence, such that TARGET is
+   initialised to contain VALS.  */
+
 void
 aarch64_expand_vector_init (rtx target, rtx vals)
 {
   machine_mode mode = GET_MODE (target);
   machine_mode inner_mode = GET_MODE_INNER (mode);
+  /* The number of vector elements.  */
   int n_elts = GET_MODE_NUNITS (mode);
+  /* The number of vector elements which are not constant.  */
   int n_var = 0;
   rtx any_const = NULL_RTX;
+  /* The first element of vals.  */
+  rtx v0 = XVECEXP (vals, 0, 0);
   bool all_same = true;
 
+  /* Count the number of variable elements to initialise.  */
   for (int i = 0; i < n_elts; ++i)
 {
   rtx x = XVECEXP (vals, 0, i);
-  if (!CONST_INT_P (x) && !CONST_DOUBLE_P (x))
+  if (!(CONST_INT_P (x) || CONST_DOUBLE_P (x)))
 	++n_var;
   else
 	any_const = x;
 
-  if (i > 0 && !rtx_equal_p (x, XVECEXP (vals, 0, 0)))
-	all_same = false;
+  all_same &= rtx_equal_p (x, v0);
 }
 
+  /* No variable elements, hand off to aarch64_simd_make_constant which knows
+ how best to handle this.  */
   if (n_var == 0)
 {
   rtx constant = aarch64_simd_make_constant (vals);
@@ -11020,14 +11029,15 @@ aarch64_expand_vector_init (rtx target, rtx vals)
   /* Splat a single non-constant element if we can.  */
   if (all_same)
 {
-  rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, 0));
+  rtx x = copy_to_mode_reg (inner_mode, v0);
   aarch64_emit_move (target, gen_rtx_VEC_DUPLICATE (mode, x));
   return;
 }
 
-  /* Half the fields (or less) are non-constant.  Load constant then overwrite
- varying fields.  Hope that this is more efficient than using the stack.  */
-  if (n_var <= n_elts/2)
+  /* Initialise a vector which is part-variable.  We want to first try
+ to build those lanes which are constant in the most efficient way we
+ can.  */
+  if (n_var != n_elts)
 {
   rtx copy = copy_rtx (vals);
 
@@ -11054,31 +11064,21 @@ aarch64_expand_vector_init (rtx target, rtx vals)
 	  XVECEXP (copy, 0, i) = subst;
 	}
   aarch64_expand_vector_init (target, copy);
+}
 
-  /* Insert variables.  */
-  enum insn_code icode = optab_handler (vec_set_optab, mode);
-  gcc_assert (icode != CODE_FOR_nothing);
+  /* Insert the variable lanes directly.  */
 
-  for (int i = 0; i < n_elts; i++)
-	{
-	  rtx x = XVECEXP (vals, 0, i);
-	  if (CONST_INT_P (x) || CONST_DOUBLE_P (x))
-	continue;
-	  x = copy_to_mode_reg (inner_mode, x);
-	  emit_insn (GEN_FCN (icode) (target, x, GEN_INT (i)));
-	}
-  return;
-}
+  enum insn_code icode = optab_handler (vec_set_optab, mode);
+  gcc_assert (icode != CODE_FOR_nothing);
 
-  /* Construct the vector in memory one field at a time
- and load the whole vector.  */
-  rtx mem = assign_stack_temp (mode, GET_MODE_SIZE (mode));
   for (int i = 0; i < n_elts; i++)
-emit_move_insn (adjust_address_nv (mem, inner_mode,
-i * GET_MODE_SIZE (inner_mode)),
-		XVECEXP (vals, 0, i));
-  emit_move_insn (target, mem);
-
+{
+  rtx x = XVECEXP (vals, 0, i);
+  if (CONST_INT_P (x) || CONST_DOUBLE_P (x))
+	continue;
+  x = copy_to_mode_reg (inner_mode, x);
+  emit_insn (GEN_FCN (icode) (target, x, GEN_INT (i)));
+}
 }
 
 static unsigned HOST_WIDE_INT
diff --git a/gcc/testsuite/gcc.target/aarch64/vector_initialization_nostack.c b/gcc/testsuite/gcc.target/aarch64/vector_initialization_nostack.c
new file mode 100644
index 0

Re: [PATCH 2/4 v2][AArch64] Add support for FCCMP

2016-01-21 Thread James Greenhalgh
On Wed, Jan 06, 2016 at 02:44:47PM -0600, Evandro Menezes wrote:
> Hi, Wilco.
> 
> On 01/06/2016 06:04 AM, Wilco Dijkstra wrote:
> >>Here's what I had in mind when I inquired about distinguishing FCMP from
> >>FCCMP.  As you can see in the patch, Exynos is the only target that
> >>cares about it, but I wonder if ThunderX or Xgene would too.
> >>
> >>What do you think?
> >The new attributes look fine (I've got a similar outstanding change), however
> >please don't add them to non-AArch64 cores. We only need it for thunderx.md,
> >cortex-a53.md, cortex-a57.md, xgene1.md and exynos-m1.md.
> 
> Add support for the FCCMP insn types
> 
> 2016-01-04  Evandro Menezes  
> 
> gcc/
> * config/aarch64/aarch64.md (fccmp): Change insn type.
> (fccmpe): Likewise.
> * config/aarch64/thunderx.md (thunderx_fcmp): Add
>"fccmp{s,d}" types.
> * config/arm/cortex-a53.md (cortex_a53_fpalu): Likewise.
> * config/arm/cortex-a57.md (cortex_a57_fp_cmp): Likewise.
> * config/arm/xgene1.md (xgene1_fcmp): Likewise.
> * config/arm/exynos-m1.md (exynos_m1_fp_ccmp): New insn
>reservation.
> * config/arm/types.md (fccmps): Add new insn type.
> (fccmpd): Likewise.
> 
> Got it.  Here's an updated patch.  Again, assuming that your
> original patch is in place.  Perhaps you can build on it.

If we don't have any targets which care about the fccmps/fccmpd split in
the code base, do we really need it? Can we just follow the example of
fcsel?

> diff --git a/gcc/config/arm/types.md b/gcc/config/arm/types.md
> index 321ff89..daf7162 100644
> --- a/gcc/config/arm/types.md
> +++ b/gcc/config/arm/types.md
> @@ -70,6 +70,7 @@
>  ; f_rint[d,s]double/single floating point rount to integral.
>  ; f_store[d,s]   double/single store to memory.  Used for VFP unit.
>  ; fadd[d,s]  double/single floating-point scalar addition.
> +; fccmp[d,s] double/single floating-point conditional compare.

Can we follow the convention fcsel uses of calling out "From ARMv8-A:"
for this type?

Thanks,
James



Re: [PATCH 2/4 v2][AArch64] Add support for FCCMP

2016-01-21 Thread James Greenhalgh
On Thu, Jan 21, 2016 at 11:13:29AM +, Wilco Dijkstra wrote:
> James Greenhalgh  wrote:
> > If we don't have any targets which care about the fccmps/fccmpd split in
> > the code base, do we really need it? Can we just follow the example of
> > fcsel?
> 
> If we do that then we should also change fcmps/d to fcmp to keep the f(c)cmp
> attributes orthogonal. However it seems better to have all FP operations use
> {s|d} postfix as the convention (rather than assume that all current and 
> future
> microarchitectures will treat float and double identically on all operations),
> so fcsel should ideally be fixed.

Adding values to this type attribute is a pretty lightweight change, and
each new type attribute has a small cost in compiler build-time and scheduler
performance. Given this, I don't see any need to design for the future, and
I don't see why we'd want to add more of them than we need to.

The fcmps/fcmpd split is used in cortex-a15-neon.md and cortex-r4f.md so
doesn't make a good comparison.

If we support a target in future which would benefit from different
modeling for fccmps and fccmpd we can split the value then.

Thanks,
James



Re: [AARCH64][ACLE][NEON] Implement vcvt*_s64_f64 and vcvt*_u64_f64 NEON intrinsics.

2016-01-21 Thread James Greenhalgh
On Wed, Jan 13, 2016 at 05:44:30PM +, Bilyan Borisov wrote:
> This patch implements all the vcvtR_s64_f64 and vcvtR_u64_f64 vector
> intrinsics, where R is ['',a,m,n,p]. Since these intrinsics are
> identical in semantics to the corresponding scalar variants, they are
> implemented in terms of them, with appropriate packing and unpacking
> of vector arguments. New test cases, covering all the intrinsics were
> also added.

This patch is very low risk, gets us another step towards closing pr58693,
and was posted before the Stage 3 deadline. This is OK for trunk.

Thanks,
James

> 
> Cross tested on aarch64-none-elf and aarch64-none-linux-gnu.
> Bootstrapped and
> tested on aarch64-none-linux-gnu.
> 
> ---
> 
> gcc/
> 
> 2015-XX-XX  Bilyan Borisov  
> 
>   * config/aarch64/arm_neon.h (vcvt_s64_f64): New intrinsic.
>   (vcvt_u64_f64): Likewise.
>   (vcvta_s64_f64): Likewise.
>   (vcvta_u64_f64): Likewise.
>   (vcvtm_s64_f64): Likewise.
>   (vcvtm_u64_f64): Likewise.
>   (vcvtn_s64_f64): Likewise.
>   (vcvtn_u64_f64): Likewise.
>   (vcvtp_s64_f64): Likewise.
>   (vcvtp_u64_f64): Likewise.
> 
> gcc/testsuite/
> 
> 2015-XX-XX  Bilyan Borisov  
> 
>   * gcc.target/aarch64/simd/vcvt_s64_f64_1.c: New.
>   * gcc.target/aarch64/simd/vcvt_u64_f64_1.c: Likewise.
>   * gcc.target/aarch64/simd/vcvta_s64_f64_1.c: Likewise.
>   * gcc.target/aarch64/simd/vcvta_u64_f64_1.c: Likewise.
>   * gcc.target/aarch64/simd/vcvtm_s64_f64_1.c: Likewise.
>   * gcc.target/aarch64/simd/vcvtm_u64_f64_1.c: Likewise.
>   * gcc.target/aarch64/simd/vcvtn_s64_f64_1.c: Likewise.
>   * gcc.target/aarch64/simd/vcvtn_u64_f64_1.c: Likewise.
>   * gcc.target/aarch64/simd/vcvtp_s64_f64_1.c: Likewise.
>   * gcc.target/aarch64/simd/vcvtp_u64_f64_1.c: Likewise.



Re: [PATCH 2/4 v2][AArch64] Add support for FCCMP

2016-01-21 Thread James Greenhalgh
On Thu, Jan 21, 2016 at 01:58:31PM -0600, Evandro Menezes wrote:
> Hi, James.
> 
> On 01/21/16 03:24, James Greenhalgh wrote:
> >On Wed, Jan 06, 2016 at 02:44:47PM -0600, Evandro Menezes wrote:
> >>On 01/06/2016 06:04 AM, Wilco Dijkstra wrote:
> >>>>Here's what I had in mind when I inquired about distinguishing FCMP from
> >>>>FCCMP.  As you can see in the patch, Exynos is the only target that
> >>>>cares about it, but I wonder if ThunderX or Xgene would too.
> >>>>
> >>>>What do you think?
> >>>The new attributes look fine (I've got a similar outstanding change), 
> >>>however
> >>>please don't add them to non-AArch64 cores. We only need it for 
> >>>thunderx.md,
> >>>cortex-a53.md, cortex-a57.md, xgene1.md and exynos-m1.md.
> >> Add support for the FCCMP insn types
> >>
> >> 2016-01-04  Evandro Menezes  
> >>
> >> gcc/
> >> * config/aarch64/aarch64.md (fccmp): Change insn type.
> >> (fccmpe): Likewise.
> >> * config/aarch64/thunderx.md (thunderx_fcmp): Add
> >>"fccmp{s,d}" types.
> >> * config/arm/cortex-a53.md (cortex_a53_fpalu): Likewise.
> >> * config/arm/cortex-a57.md (cortex_a57_fp_cmp): Likewise.
> >> * config/arm/xgene1.md (xgene1_fcmp): Likewise.
> >> * config/arm/exynos-m1.md (exynos_m1_fp_ccmp): New insn
> >>reservation.
> >> * config/arm/types.md (fccmps): Add new insn type.
> >> (fccmpd): Likewise.
> >>
> >>Got it.  Here's an updated patch.  Again, assuming that your
> >>original patch is in place.  Perhaps you can build on it.
> >If we don't have any targets which care about the fccmps/fccmpd split in
> >the code base, do we really need it? Can we just follow the example of
> >fcsel?
> 
> The Exynos M1 does care about the difference between FCMP and FCCMP,
> as can be seen in the patch.

> More explicitly:
> 
>(define_insn_reservation "exynos_m1_fp_cmp" 4
>   (and (eq_attr "tune" "exynosm1")
>(eq_attr "type" "fcmps, fcmpd"))
>   "em1_nmisc")
> 
>(define_insn_reservation "exynos_m1_fp_ccmp" 7
>   (and (eq_attr "tune" "exynosm1")
>(eq_attr "type" "fccmps, fccmpd"))
>   "em1_st, em1_nmisc")
> 

I think I was unclear. Your exynos-m1 model cares about splitting fcmp[s/d]
and fccmp, but it doesn't care about splitting fccmp into fccmps/fccmpd. It
is the split to fccmps/fccmpd that I think is unnecessary at this time.

> >>diff --git a/gcc/config/arm/types.md b/gcc/config/arm/types.md
> >>index 321ff89..daf7162 100644
> >>--- a/gcc/config/arm/types.md
> >>+++ b/gcc/config/arm/types.md
> >>@@ -70,6 +70,7 @@
> >>  ; f_rint[d,s]double/single floating point rount to integral.
> >>  ; f_store[d,s]   double/single store to memory.  Used for VFP unit.
> >>  ; fadd[d,s]  double/single floating-point scalar addition.
> >>+; fccmp[d,s] double/single floating-point conditional compare.
> >Can we follow the convention fcsel uses of calling out "From ARMv8-A:"
> >for this type?
> >
> 
> I'm not sure I follow.  Though I didn't refer to the ISA spec, I
> used the description from it for the *fccmp* type.
> 
> Please, advise.

Something like:

; fccmp        From ARMv8-A: floating point conditional compare.

Just to capture that this instruction is only available for cores implementing
ARMv8-A.

Thanks,
James



[Patch Obvious] gcc.dg/vect/bb-slp-pr68892.c requires vectorization of doubles

2016-01-22 Thread James Greenhalgh

Hi,

As title. This testcase fails on arm-none-linux-gnueabihf, because we don't
have vectorization of doubles there.

Committed as obvious as revision 232731.

Thanks,
James

---
2016-01-22  James Greenhalgh  

* gcc.dg/vect/bb-slp-pr68892.c: Require vect_double.
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pr68892.c b/gcc/testsuite/gcc.dg/vect/bb-slp-pr68892.c
index 648fe481..ba51b76 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-pr68892.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pr68892.c
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
 /* { dg-additional-options "-fvect-cost-model=dynamic" } */
+/* { dg-require-effective-target vect_double } */
 
 double a[128][128];
 double b[128];


Re: [AArch64] Remove AARCH64_EXTRA_TUNE_RECIP_SQRT from Cortex-A57 tuning

2016-01-25 Thread James Greenhalgh
On Mon, Jan 11, 2016 at 12:04:43PM +, James Greenhalgh wrote:
> 
> Hi,
> 
> I've seen a couple of large performance issues caused by expanding
> the high-precision reciprocal square root for Cortex-A57, so I'd like
> to turn it off by default.
> 
> This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from
> Spec2000, good (~5.5%) for gromacs from Spec2006, and very good (>10%) for
> some private microbenchmark kernels which stress the divide/sqrt/multiply
> units. It therefore seems to me to be the correct choice to make across
> a number of workloads.
> 
> Bootstrapped and tested on aarch64-none-linux-gnu with no issues.
> 
> OK?

*Ping*

Thanks,
James

> ---
> 2015-12-11  James Greenhalgh  
> 
>   * config/aarch64/aarch64.c (cortexa57_tunings): Remove
>   AARCH64_EXTRA_TUNE_RECIP_SQRT.
> 

> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 1d5d898..999c9fc 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings =
>0, /* max_case_values.  */
>0, /* cache_line_size.  */
>tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS
> -   | AARCH64_EXTRA_TUNE_RECIP_SQRT)  /* tune_flags.  */
> +  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)   /* tune_flags.  */
>  };
>  
>  static const struct tune_params cortexa72_tunings =



Re: [Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt

2016-01-25 Thread James Greenhalgh
On Mon, Jan 11, 2016 at 11:53:39AM +, James Greenhalgh wrote:
> 
> Hi,
> 
> I'd like to switch the logic around in aarch64.c such that
> -mlow-precision-recip-sqrt causes us to always emit the low-precision
> software expansion for reciprocal square root. I have two reasons to do
> this; first is consistency across -mcpu targets, second is enabling more
> -mcpu targets to use the flag for peak tuning.
> 
> I don't much like that the precision we use for -mlow-precision-recip-sqrt
> differs between cores (and possibly compiler revisions). Yes, we're
> under -ffast-math but I take this flag to mean the user explicitly wants the
> low-precision expansion, and we should not diverge from that based on an
> internal decision as to what is optimal for performance in the
> high-precision case. I'd prefer to keep things as predictable as possible,
> and here that means always emitting the low-precision expansion when asked.
> 
> Judging by the comments in the thread proposing the reciprocal square
> root optimisation, this will benefit all cores currently supported by GCC.
> To be clear, we would still not expand in the high-precision case for any
> cores which do not explicitly ask for it. Currently that is Cortex-A57
> and xgene, though I will be proposing a patch to remove Cortex-A57 from
> that list shortly.
> 
> Which gives my second motivation for this patch. -mlow-precision-recip-sqrt
> is intended as a tuning flag for situations where performance is more
> important than precision, but the current logic requires setting an
> internal flag which also changes the performance characteristics where
> high-precision is needed. This conflates two decisions the target might
> want to make, and reduces the applicability of an option targets might
> want to enable for performance. In particular, I'd still like to see
> -mlow-precision-recip-sqrt continue to emit the cheaper, low-precision
> sequence for floats under Cortex-A57.
> 
> Based on that reasoning, this patch makes the appropriate change to the
> logic. I've checked with the current -mcpu values to ensure that behaviour
> without -mlow-precision-recip-sqrt does not change, and that behaviour
> with -mlow-precision-recip-sqrt is to emit the low precision sequences.
> 
> I've also put this through bootstrap and test on aarch64-none-linux-gnu
> with no issues.
> 
> OK?

*Ping*

Thanks,
James

> 2015-12-10  James Greenhalgh  
> 
>   * config/aarch64/aarch64.c (use_rsqrt_p): Always use software
>   reciprocal sqrt for -mlow-precision-recip-sqrt.
> 

> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 9142ac0..1d5d898 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -7485,8 +7485,9 @@ use_rsqrt_p (void)
>  {
>return (!flag_trapping_math
> && flag_unsafe_math_optimizations
> -   && (aarch64_tune_params.extra_tuning_flags
> -   & AARCH64_EXTRA_TUNE_RECIP_SQRT));
> +   && ((aarch64_tune_params.extra_tuning_flags
> +& AARCH64_EXTRA_TUNE_RECIP_SQRT)
> +   || flag_mrecip_low_precision_sqrt));
>  }
>  
>  /* Function to decide when to use
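
For readers unfamiliar with the expansion under discussion: as I understand
it, the sequence is an initial estimate followed by Newton-Raphson
refinement, and the low-precision variant simply runs fewer refinement
steps.  A rough scalar sketch (not the aarch64.c implementation):

  #include <math.h>

  float
  rsqrt_sketch (float a)
  {
    float x = 1.0f / sqrtf (a);         /* stands in for the FRSQRTE estimate */
    x = x * (3.0f - a * x * x) * 0.5f;  /* one FRSQRTS-style refinement step */
    return x;
  }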



Re: [AARCH64][ACLE][NEON] Implement vcvt*_s64_f64 and vcvt*_u64_f64 NEON intrinsics.

2016-01-25 Thread James Greenhalgh
On Thu, Jan 21, 2016 at 12:32:07PM +, James Greenhalgh wrote:
> On Wed, Jan 13, 2016 at 05:44:30PM +, Bilyan Borisov wrote:
> > This patch implements all the vcvtR_s64_f64 and vcvtR_u64_f64 vector
> > intrinsics, where R is ['',a,m,n,p]. Since these intrinsics are
> > identical in semantics to the corresponding scalar variants, they are
> > implemented in terms of them, with appropriate packing and unpacking
> > of vector arguments. New test cases, covering all the intrinsics were
> > also added.
> 
> This patch is very low risk, gets us another step towards closing pr58693,
> and was posted before the Stage 3 deadline. This is OK for trunk.

I realised you don't have commit access, so I've committed this on your
behalf as revision 232789.

Thanks,
James

> > gcc/
> > 
> > 2015-XX-XX  Bilyan Borisov  
> > 
> > * config/aarch64/arm_neon.h (vcvt_s64_f64): New intrinsic.
> > (vcvt_u64_f64): Likewise.
> > (vcvta_s64_f64): Likewise.
> > (vcvta_u64_f64): Likewise.
> > (vcvtm_s64_f64): Likewise.
> > (vcvtm_u64_f64): Likewise.
> > (vcvtn_s64_f64): Likewise.
> > (vcvtn_u64_f64): Likewise.
> > (vcvtp_s64_f64): Likewise.
> > (vcvtp_u64_f64): Likewise.
> > 
> > gcc/testsuite/
> > 
> > 2015-XX-XX  Bilyan Borisov  
> > 
> > * gcc.target/aarch64/simd/vcvt_s64_f64_1.c: New.
> > * gcc.target/aarch64/simd/vcvt_u64_f64_1.c: Likewise.
> > * gcc.target/aarch64/simd/vcvta_s64_f64_1.c: Likewise.
> > * gcc.target/aarch64/simd/vcvta_u64_f64_1.c: Likewise.
> > * gcc.target/aarch64/simd/vcvtm_s64_f64_1.c: Likewise.
> > * gcc.target/aarch64/simd/vcvtm_u64_f64_1.c: Likewise.
> > * gcc.target/aarch64/simd/vcvtn_s64_f64_1.c: Likewise.
> > * gcc.target/aarch64/simd/vcvtn_u64_f64_1.c: Likewise.
> > * gcc.target/aarch64/simd/vcvtp_s64_f64_1.c: Likewise.
> > * gcc.target/aarch64/simd/vcvtp_u64_f64_1.c: Likewise.
> 


Re: [PATCH, AArch64] Fix for PR67896 (C++ FE cannot distinguish __Poly{8,16,64,128}_t types)

2016-01-25 Thread James Greenhalgh
On Wed, Jan 20, 2016 at 09:27:41PM +0100, Roger Ferrer Ibáñez wrote:
> Hi James,
> 
> > This patch looks technically correct to me, though there is a small
> > style issue to correct (in-line below), and your ChangeLogs don't fit
> > our usual style.
> 
> thank you very much for the useful comments. I'm attaching a new
> version of the patch with the style issues (hopefully) ironed out.

Thanks, this version of the patch looks correct to me.

> > > P.S.: I haven't signed the copyright assignment to the FSF. The change
> > > is really small but I can do the paperwork if required.

I can't commit it on your behalf until we've heard back regarding whether
this needs a copyright assignment to the FSF, but once I've heard I'd
be happy to commit this for you. I'll expand the CC list a bit further
to see if we can get an answer on that.

Thanks again for the analysis and patch.

James

> gcc/ChangeLog:
> 
> 2016-01-19  Roger Ferrer Ibáñez  
> 
> PR target/67896
> * config/aarch64/aarch64-builtins.c
> (aarch64_init_simd_builtin_types): Do not set structural
> equality to __Poly{8,16,64,128}_t types.
> 
> gcc/testsuite/ChangeLog:
> 
> 2016-01-19  Roger Ferrer Ibáñez  
> 
> PR target/67896
> * gcc.target/aarch64/simd/pr67896.C: New.
> 
> -- 
> Roger Ferrer Ibáñez

> From 72c065f6a3f9d168baf357de1b567faa6042c03b Mon Sep 17 00:00:00 2001
> From: Roger Ferrer Ibanez 
> Date: Wed, 20 Jan 2016 21:11:42 +0100
> Subject: [PATCH] Do not set structural equality on polynomial types
> 
> ---
>  gcc/config/aarch64/aarch64-builtins.c   | 10 ++
>  gcc/testsuite/gcc.target/aarch64/simd/pr67896.C |  7 +++
>  2 files changed, 13 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/pr67896.C
> 
> diff --git a/gcc/config/aarch64/aarch64-builtins.c 
> b/gcc/config/aarch64/aarch64-builtins.c
> index bd7a8dd..40272ed 100644
> --- a/gcc/config/aarch64/aarch64-builtins.c
> +++ b/gcc/config/aarch64/aarch64-builtins.c
> @@ -610,14 +610,16 @@ aarch64_init_simd_builtin_types (void)
>enum machine_mode mode = aarch64_simd_types[i].mode;
>  
>if (aarch64_simd_types[i].itype == NULL)
> - aarch64_simd_types[i].itype =
> -   build_distinct_type_copy
> - (build_vector_type (eltype, GET_MODE_NUNITS (mode)));
> + {
> +   aarch64_simd_types[i].itype
> + = build_distinct_type_copy
> +   (build_vector_type (eltype, GET_MODE_NUNITS (mode)));
> +   SET_TYPE_STRUCTURAL_EQUALITY (aarch64_simd_types[i].itype);
> + }
>  
>tdecl = add_builtin_type (aarch64_simd_types[i].name,
>   aarch64_simd_types[i].itype);
>TYPE_NAME (aarch64_simd_types[i].itype) = tdecl;
> -  SET_TYPE_STRUCTURAL_EQUALITY (aarch64_simd_types[i].itype);
>  }
>  
>  #define AARCH64_BUILD_SIGNED_TYPE(mode)  \
> diff --git a/gcc/testsuite/gcc.target/aarch64/simd/pr67896.C 
> b/gcc/testsuite/gcc.target/aarch64/simd/pr67896.C
> new file mode 100644
> index 000..1f916e0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/simd/pr67896.C
> @@ -0,0 +1,7 @@
> +typedef __Poly8_t A;
> +typedef __Poly16_t A; /* { dg-error "conflicting declaration" } */
> +typedef __Poly64_t A; /* { dg-error "conflicting declaration" } */
> +typedef __Poly128_t A; /* { dg-error "conflicting declaration" } */
> +
> +typedef __Poly8x8_t B;
> +typedef __Poly16x8_t B; /* { dg-error "conflicting declaration" } */ 
> -- 
> 2.1.4
> 



Re: [PATCH][AArch64] Add vector permute cost

2016-01-26 Thread James Greenhalgh
On Tue, Dec 15, 2015 at 11:35:45AM +, Wilco Dijkstra wrote:
> 
> Add support for vector permute cost since various permutes can expand into a 
> complex
> sequence of instructions.  This fixes major performance regressions due to 
> recent changes
> in the SLP vectorizer (which now vectorizes more aggressively and emits many 
> complex 
> permutes).
> 
> Set the cost to > 1 for all microarchitectures so that the number of permutes 
> is usually zero
> and regressions disappear.  An example of the kind of code that might be 
> emitted for
> VEC_PERM_EXPR {0, 3} where registers happen to be in the wrong order:
> 
> adrp    x4, .LC16
> ldr     q5, [x4, #:lo12:.LC16]
> eor v1.16b, v1.16b, v0.16b
> eor v0.16b, v1.16b, v0.16b
> eor v1.16b, v1.16b, v0.16b
> tbl v0.16b, {v0.16b - v1.16b}, v5.16b
> 
> Regress passes. This fixes regressions that were introduced recently, so OK 
> for commit?

OK.

Thanks,
James

> ChangeLog:
> 2015-12-15  Wilco Dijkstra  
> 
>   * gcc/config/aarch64/aarch64.c (generic_vector_cost):
>   Set vec_permute_cost.
>   (cortexa57_vector_cost): Likewise.
>   (exynosm1_vector_cost): Likewise.
>   (xgene1_vector_cost): Likewise.
>   (aarch64_builtin_vectorization_cost): Use vec_permute_cost.
>   * gcc/config/aarch64/aarch64-protos.h (cpu_vector_cost):
>   Add vec_permute_cost entry.
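
For readers following along: VEC_PERM_EXPR {0, 3} on two two-element vectors
selects lane 0 of the first input and lane 1 of the second.  In GNU C the
equivalent can be written with __builtin_shuffle; a small illustration,
unrelated to the benchmarks above:

  typedef long long v2di __attribute__ ((vector_size (16)));

  v2di
  perm_0_3 (v2di a, v2di b)
  {
    /* Index 0 is lane 0 of a; index 3 is lane 1 of b.  */
    return __builtin_shuffle (a, b, (v2di) { 0, 3 });
  }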
 


[Patch AArch64] Restrict 16-bit sqrdml{sa}h instructions to FP_LO_REGS

2016-01-26 Thread James Greenhalgh

Hi,

In their forms using 16-bit lanes, the sqrdmlah and sqrdmlsh instructions
available when compiling with -march=armv8.1-a are only usable with
a register number in the range 0 to 15 for operand 3, as gas will point
out:

  Error: register number out of range 0 to 15 at
operand 3 -- `sqrdmlsh v2.4h,v4.4h,v23.h[5]'

This patch teaches GCC to avoid registers outside of this range when
appropriate, in the same fashion as we do for other instructions with
this limitation.

Tested on an internal testsuite targeting Neon intrinsics.

OK?

Thanks,
James

---
2016-01-25  James Greenhalgh  

* config/aarch64/aarch64.md
(arch64_sqrdmlh_lane): Fix register
constraints for operand 3.
(aarch64_sqrdmlh_laneq): Likewise.

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index e1f5682..0b46e78 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -3240,7 +3240,7 @@
 	  [(match_operand:VDQHS 1 "register_operand" "0")
 	   (match_operand:VDQHS 2 "register_operand" "w")
 	   (vec_select:
-	 (match_operand: 3 "register_operand" "w")
+	 (match_operand: 3 "register_operand" "")
 	 (parallel [(match_operand:SI 4 "immediate_operand" "i")]))]
 	  SQRDMLH_AS))]
"TARGET_SIMD_RDMA"
@@ -3258,7 +3258,7 @@
 	  [(match_operand:SD_HSI 1 "register_operand" "0")
 	   (match_operand:SD_HSI 2 "register_operand" "w")
 	   (vec_select:
-	 (match_operand: 3 "register_operand" "w")
+	 (match_operand: 3 "register_operand" "")
 	 (parallel [(match_operand:SI 4 "immediate_operand" "i")]))]
 	  SQRDMLH_AS))]
"TARGET_SIMD_RDMA"
@@ -3278,7 +3278,7 @@
 	  [(match_operand:VDQHS 1 "register_operand" "0")
 	   (match_operand:VDQHS 2 "register_operand" "w")
 	   (vec_select:
-	 (match_operand: 3 "register_operand" "w")
+	 (match_operand: 3 "register_operand" "")
 	 (parallel [(match_operand:SI 4 "immediate_operand" "i")]))]
 	  SQRDMLH_AS))]
"TARGET_SIMD_RDMA"
@@ -3296,7 +3296,7 @@
 	  [(match_operand:SD_HSI 1 "register_operand" "0")
 	   (match_operand:SD_HSI 2 "register_operand" "w")
 	   (vec_select:
-	 (match_operand: 3 "register_operand" "w")
+	 (match_operand: 3 "register_operand" "")
 	 (parallel [(match_operand:SI 4 "immediate_operand" "i")]))]
 	  SQRDMLH_AS))]
"TARGET_SIMD_RDMA"


Re: [PATCH, aarch64] Fix pr69305 -- addti miscompilation

2016-01-27 Thread James Greenhalgh
On Sun, Jan 24, 2016 at 03:19:35AM -0800, Richard Henderson wrote:
> As Jakub notes in the PR, the representation for add_compare and
> sub_compare were wrong.  And several of the add_carryin patterns
> were duplicates.
> 
> This adds a CC_Cmode for which only the Carry bit is valid.
> 
> The patch appears to generate moderately decent code.  For gcc7 we
> should look into why we'll prefer to mark an output REG_UNUSED
> instead of matching the pattern with that output removed.  This
> results in continuing to use adds (though simplifying adc) after
> we've proved that there will be no carry into the high part of an
> adds+adc pair.
> 
> Ok?
> 
> 
> r~

Hi Richard,

Some tiny nits below:

> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 71fc514..363785e 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -1755,6 +1755,44 @@
>[(set_attr "type" "alus_sreg,alus_imm,alus_imm")]
>  )
>  
> +(define_insn "*add3_compare1_cconly"

I don't understand the naming scheme; it got me a wee bit confused with
add3_compare0 and friends, where the 0 indicates a comparison with
zero...

> +  [(set (reg:CC_C CC_REGNUM)
> + (ne:CC_C
> +   (plus:
> + (zero_extend:
> +   (match_operand:GPI 0 "aarch64_reg_or_zero" "%rZ,rZ,rZ"))
> + (zero_extend:
> +   (match_operand:GPI 1 "aarch64_plus_operand" "r,I,J")))
> +   (zero_extend:
> + (plus:GPI (match_dup 0) (match_dup 1)]
> +  ""
> +  "@
> +  cmn\\t%0, %1
> +  cmn\\t%0, %1
> +  cmp\\t%0, #%n1"
> +  [(set_attr "type" "alus_sreg,alus_imm,alus_imm")]
> +)
> +
> +(define_insn "add3_compare1"
> +  [(set (reg:CC_C CC_REGNUM)
> + (ne:CC_C
> +   (plus:
> + (zero_extend:
> +   (match_operand:GPI 1 "aarch64_reg_or_zero" "%rZ,rZ,rZ"))
> + (zero_extend:
> +   (match_operand:GPI 2 "aarch64_plus_operand" "r,I,J")))
> +   (zero_extend:
> + (plus:GPI (match_dup 1) (match_dup 2)
> +   (set (match_operand:GPI 0 "register_operand" "=r,r,r")
> + (plus:GPI (match_dup 1) (match_dup 2)))]
> +  ""
> +  "@
> +  adds\\t%0, %1, %2
> +  adds\\t%0, %1, %2
> +  subs\\t%0, %1, #%n2"
> +  [(set_attr "type" "alus_sreg,alus_imm,alus_imm")]
> +)
> +
>  (define_insn "*adds_shift_imm_"
>[(set (reg:CC_NZ CC_REGNUM)
>   (compare:CC_NZ



> +;; Note that a single add with carry is matched by cinc,
> +;; and the adc_reg and csel types are matched into the same
> +;; pipelines by existing cores.

I can't see us remembering to update this comment on pipeline models
were it to ever become false. Maybe just drop it?

> @@ -2440,13 +2427,53 @@
>[(set_attr "type" "alu_ext")]
>  )
>  
> -(define_insn "sub3_carryin"
> -  [(set
> -(match_operand:GPI 0 "register_operand" "=r")
> -(minus:GPI (minus:GPI
> - (match_operand:GPI 1 "register_operand" "r")
> - (ltu:GPI (reg:CC CC_REGNUM) (const_int 0)))
> -(match_operand:GPI 2 "register_operand" "r")))]
> +;; The hardware description is op1 + ~op2 + C.
> +;;   = op1 + (-op2 + 1) + (1 - !C)
> +;;   = op1 - op2 - 1 + 1 - !C
> +;;   = op1 - op2 - !C.
> +;; We describe the later.

s/later/latter/

Otherwise, this is OK.

Thanks,
James



Re: [PATCH 4/4][AArch64] Cost CCMP instruction sequences to choose better expand order

2016-01-28 Thread James Greenhalgh
On Mon, Jan 25, 2016 at 08:09:39PM +, Wilco Dijkstra wrote:
> Andreas Schwab  wrote:
> 
> > FAIL: gcc.target/aarch64/ccmp_1.c scan-assembler-times \tcmp\tw[0-9]+, 0 4
> > FAIL: gcc.target/aarch64/ccmp_1.c scan-assembler adds\t
> > FAIL: gcc.target/aarch64/ccmp_1.c scan-assembler-times fccmpe\t.*0\\.0 1
> 
> Yes I noticed those too, and here is the fix. Richard's recent change added
> UNSPEC to the CCMP patterns to stop combine optimizing the CCMP CCmode
> immediate in a rare case. This requires a change to the CCMP cost calculation
> as the CCMP instruction with unspec is no longer recognized.
> 
> Fix the ccmp_1.c test to allow both '0' and 'wzr' on cmp - BTW is there a
> regular expression that correctly implements (0|xzr)? If I use that the test
> still fails somehow but \[0wzr\]+ works fine... Is the correct syntax
> documented somewhere?
> 
> Finally to ensure FCCMPE is emitted on relational compares, add
> -ffinite-math-only.

OK.

Thanks,
James

> 
> ChangeLog:
> 2016-01-25  Wilco Dijkstra  
> 
> gcc/
>   * config/aarch64/aarch64.c (aarch64_if_then_else_costs):
>   Remove CONST_INT_P check in CCMP cost calculation.
> 
> gcc/testsuite/
>   * gcc.target/aarch64/ccmp_1.c: Fix test issues.
> 



Re: [Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt

2016-02-01 Thread James Greenhalgh
On Mon, Jan 25, 2016 at 11:21:25AM +, James Greenhalgh wrote:
> On Mon, Jan 11, 2016 at 11:53:39AM +0000, James Greenhalgh wrote:
> > 
> > Hi,
> > 
> > I'd like to switch the logic around in aarch64.c such that
> > -mlow-precision-recip-sqrt causes us to always emit the low-precision
> > software expansion for reciprocal square root. I have two reasons to do
> > this; first is consistency across -mcpu targets, second is enabling more
> > -mcpu targets to use the flag for peak tuning.
> > 
> > I don't much like that the precision we use for -mlow-precision-recip-sqrt
> > differs between cores (and possibly compiler revisions). Yes, we're
> > under -ffast-math but I take this flag to mean the user explicitly wants the
> > low-precision expansion, and we should not diverge from that based on an
> > internal decision as to what is optimal for performance in the
> > high-precision case. I'd prefer to keep things as predictable as possible,
> > and here that means always emitting the low-precision expansion when asked.
> > 
> > Judging by the comments in the thread proposing the reciprocal square
> > root optimisation, this will benefit all cores currently supported by GCC.
> > To be clear, we would still not expand in the high-precision case for any
> > cores which do not explicitly ask for it. Currently that is Cortex-A57
> > and xgene, though I will be proposing a patch to remove Cortex-A57 from
> > that list shortly.
> > 
> > Which gives my second motivation for this patch. -mlow-precision-recip-sqrt
> > is intended as a tuning flag for situations where performance is more
> > important than precision, but the current logic requires setting an
> > internal flag which also changes the performance characteristics where
> > high-precision is needed. This conflates two decisions the target might
> > want to make, and reduces the applicability of an option targets might
> > want to enable for performance. In particular, I'd still like to see
> > -mlow-precision-recip-sqrt continue to emit the cheaper, low-precision
> > sequence for floats under Cortex-A57.
> > 
> > Based on that reasoning, this patch makes the appropriate change to the
> > logic. I've checked with the current -mcpu values to ensure that behaviour
> > without -mlow-precision-recip-sqrt does not change, and that behaviour
> > with -mlow-precision-recip-sqrt is to emit the low precision sequences.
> > 
> > I've also put this through bootstrap and test on aarch64-none-linux-gnu
> > with no issues.
> > 
> > OK?
> 
> *Ping*

*Pingx2*

Thanks,
James

> 
> Thanks,
> James
> 
> > 2015-12-10  James Greenhalgh  
> > 
> > * config/aarch64/aarch64.c (use_rsqrt_p): Always use software
> > reciprocal sqrt for -mlow-precision-recip-sqrt.
> > 
> 
> > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > index 9142ac0..1d5d898 100644
> > --- a/gcc/config/aarch64/aarch64.c
> > +++ b/gcc/config/aarch64/aarch64.c
> > @@ -7485,8 +7485,9 @@ use_rsqrt_p (void)
> >  {
> >return (!flag_trapping_math
> >   && flag_unsafe_math_optimizations
> > - && (aarch64_tune_params.extra_tuning_flags
> > - & AARCH64_EXTRA_TUNE_RECIP_SQRT));
> > + && ((aarch64_tune_params.extra_tuning_flags
> > +  & AARCH64_EXTRA_TUNE_RECIP_SQRT)
> > + || flag_mrecip_low_precision_sqrt));
> >  }
> >  
> >  /* Function to decide when to use
> 


Re: [AArch64] Remove AARCH64_EXTRA_TUNE_RECIP_SQRT from Cortex-A57 tuning

2016-02-01 Thread James Greenhalgh
On Mon, Jan 25, 2016 at 11:20:46AM +, James Greenhalgh wrote:
> On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote:
> > 
> > Hi,
> > 
> > I've seen a couple of large performance issues caused by expanding
> > the high-precision reciprocal square root for Cortex-A57, so I'd like
> > to turn it off by default.
> > 
> > This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from
> > Spec2000, good (~5.5%) for gromacs from Spec2006, and very good (>10%) for
> > some private microbenchmark kernels which stress the divide/sqrt/multiply
> > units. It therefore seems to me to be the correct choice to make across
> > a number of workloads.
> > 
> > Bootstrapped and tested on aarch64-none-linux-gnu with no issues.
> > 
> > OK?
> 
> *Ping*

*pingx2*

Thanks,
James

> > ---
> > 2015-12-11  James Greenhalgh  
> > 
> > * config/aarch64/aarch64.c (cortexa57_tunings): Remove
> > AARCH64_EXTRA_TUNE_RECIP_SQRT.
> > 
> 
> > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > index 1d5d898..999c9fc 100644
> > --- a/gcc/config/aarch64/aarch64.c
> > +++ b/gcc/config/aarch64/aarch64.c
> > @@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings =
> >0,   /* max_case_values.  */
> >0,   /* cache_line_size.  */
> >tune_params::AUTOPREFETCHER_WEAK,/* autoprefetcher_model.  */
> > -  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS
> > -   | AARCH64_EXTRA_TUNE_RECIP_SQRT)/* tune_flags.  */
> > +  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS) /* tune_flags.  */
> >  };
> >  
> >  static const struct tune_params cortexa72_tunings =
> 


Re: [PATCH][AArch64] Add TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS

2016-02-02 Thread James Greenhalgh
On Tue, Jan 26, 2016 at 05:39:24PM +, Wilco Dijkstra wrote:
> ping (note the regressions discussed below are addressed by 
> https://gcc.gnu.org/ml/gcc-patches/2016-01/msg01761.html)

OK, but please be extra vigilant for any fallout on AArch64 after this
and the follow-up linked above is applied.

Thanks,
James

> James Greenhalgh wrote:
> > On Wed, Dec 16, 2015 at 01:05:21PM +, Wilco Dijkstra wrote:
> > > James Greenhalgh wrote:
> > > > On Tue, Dec 15, 2015 at 10:54:49AM +, Wilco Dijkstra wrote:
> > > > > ping
> > > > >
> > > > > > -Original Message-
> > > > > > From: Wilco Dijkstra [mailto:wilco.dijks...@arm.com]
> > > > > > Sent: 06 November 2015 20:06
> > > > > > To: 'gcc-patches@gcc.gnu.org'
> > > > > > Subject: [PATCH][AArch64] Add TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS
> > > > > >
> > > > > > This patch adds support for the 
> > > > > > TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS
> > > > > > hook. When the cost of GENERAL_REGS and FP_REGS is identical, the 
> > > > > > register
> > > > > > allocator always uses ALL_REGS even when it has a much higher cost. 
> > > > > > The
> > > > > > hook changes the class to either FP_REGS or GENERAL_REGS depending 
> > > > > > on the
> > > > > > mode of the register. This results in better register allocation 
> > > > > > overall,
> > > > > > fewer spills and reduced codesize - particularly in SPEC2006 gamess.
> > > > > >
> > > > > > GCC regression passes with several minor fixes.
> > > > > >
> > > > > > OK for commit?
> > > > > >
> > > > > > ChangeLog:
> > > > > > 2015-11-06  Wilco Dijkstra  
> > > > > >
> > > > > >   * gcc/config/aarch64/aarch64.c
> > > > > >   (TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS): New define.
> > > > > >   (aarch64_ira_change_pseudo_allocno_class): New function.
> > > > > >   * gcc/testsuite/gcc.target/aarch64/cvtf_1.c: Build with -O2.
> > > > > >   * gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c
> > > > > >   (test_corners_sisd_di): Improve force to SIMD register.
> > > > > >   (test_corners_sisd_si): Likewise.
> > > > > >   * gcc/testsuite/gcc.target/aarch64/vdup_lane_2.c: Build with 
> > > > > > -O2.
> > > > > >   * gcc/testsuite/gcc.target/aarch64/vect-ld1r-compile-fp.c:
> > > > > >   Remove scan-assembler check for ldr.
> > > >
> > > > Drop the gcc/ from the ChangeLog.
> > > >
> > > > > > --
> > > > > >  gcc/config/aarch64/aarch64.c   | 22 
> > > > > > ++
> > > > > >  gcc/testsuite/gcc.target/aarch64/cvtf_1.c  |  2 +-
> > > > > >  gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c  |  4 ++--
> > > > > >  gcc/testsuite/gcc.target/aarch64/vdup_lane_2.c |  2 +-
> > > > > >  .../gcc.target/aarch64/vect-ld1r-compile-fp.c  |  1 -
> > > >
> > > > These testsuite changes concern me a bit, and you don't mention them 
> > > > beyond
> > > > saying they are minor fixes...
> > >
> > > Well any changes to register allocator preferencing would cause fallout in
> > > tests that are assuming which register is allocated, especially if they 
> > > use
> > > nasty inline assembler hacks to do so...
> >
> > Sure, but the testcases here each operate on data that should live in
> > FP_REGS given the initial conditions that the nasty hacks try to mimic -
> > that's what makes the regressions notable.
> >
> > >
> > > > > >  #define FCVTDEF(ftype,itype) \
> > > > > >  void \
> > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c 
> > > > > > b/gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c
> > > > > > index 363f554..8465c89 100644
> > > > > > --- a/gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c
> > > > > > +++ b/gcc/testsuite/gcc.target/aarch64/scalar_shift_1.c
> > > > > > @@ -186,9 +186,9 @@ test_corners_sisd_di (Int64x1 b)
> > > > > >  {
> > 

Re: [Patch AArch64] GCC 6 regression in vector performance. - Fix vector initialization to happen with lane load instructions.

2016-02-02 Thread James Greenhalgh
On Wed, Jan 20, 2016 at 03:22:11PM +, James Greenhalgh wrote:
> 
> Hi,
> 
> In a number of cases where we try to create vectors we end up spilling to the
> stack and then filling. This is one example distilled from a couple of
> micro-benchmarks where the issue shows up. The reason for the extra cost
> in this case is the unnecessary use of the stack. The patch attempts to
> finesse this by using lane loads or vector inserts to produce the right
> results.
> 
> This patch is mostly Ramana's work, I've just cleaned it up a little.
> 
> This has been in a number of our trees lately, and we haven't seen any
> regressions. I've also bootstrapped and tested it, and run a set of
> benchmarks to show no regressions on Cortex-A57 or Cortex-A53.
> 
> The patch fixes some regressions caused by the more aggressive vectorization
> in GCC6, so I'd like to propose it to go in even though we are in Stage 4.
> 
> OK?

*Ping*

I just ran into this while investigating another performance regression. It
would be nice to get this fixed.

Thanks,
James


> 
> Thanks,
> James
> 
> ---
> gcc/
> 
> 2016-01-20  James Greenhalgh  
>   Ramana Radhakrishnan  
> 
>   * config/aarch64/aarch64.c (aarch64_expand_vector_init): Refactor,
>   always use lane loads to construct non-constant vectors.
> 
> gcc/testsuite/
> 
> 2016-01-20  James Greenhalgh  
>   Ramana Radhakrishnan  
> 
>   * gcc.target/aarch64/vector_initialization_nostack.c: New.
> 

> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 03bc1b9..3787b38 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -10985,28 +10985,37 @@ aarch64_simd_make_constant (rtx vals)
>  return NULL_RTX;
>  }
>  
> +/* Expand a vector initialisation sequence, such that TARGET is
> +   initialised to contain VALS.  */
> +
>  void
>  aarch64_expand_vector_init (rtx target, rtx vals)
>  {
>machine_mode mode = GET_MODE (target);
>machine_mode inner_mode = GET_MODE_INNER (mode);
> +  /* The number of vector elements.  */
>int n_elts = GET_MODE_NUNITS (mode);
> +  /* The number of vector elements which are not constant.  */
>int n_var = 0;
>rtx any_const = NULL_RTX;
> +  /* The first element of vals.  */
> +  rtx v0 = XVECEXP (vals, 0, 0);
>bool all_same = true;
>  
> +  /* Count the number of variable elements to initialise.  */
>for (int i = 0; i < n_elts; ++i)
>  {
>rtx x = XVECEXP (vals, 0, i);
> -  if (!CONST_INT_P (x) && !CONST_DOUBLE_P (x))
> +  if (!(CONST_INT_P (x) || CONST_DOUBLE_P (x)))
>   ++n_var;
>else
>   any_const = x;
>  
> -  if (i > 0 && !rtx_equal_p (x, XVECEXP (vals, 0, 0)))
> - all_same = false;
> +  all_same &= rtx_equal_p (x, v0);
>  }
>  
> +  /* No variable elements, hand off to aarch64_simd_make_constant which knows
> + how best to handle this.  */
>if (n_var == 0)
>  {
>rtx constant = aarch64_simd_make_constant (vals);
> @@ -11020,14 +11029,15 @@ aarch64_expand_vector_init (rtx target, rtx vals)
>/* Splat a single non-constant element if we can.  */
>if (all_same)
>  {
> -  rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, 0));
> +  rtx x = copy_to_mode_reg (inner_mode, v0);
>aarch64_emit_move (target, gen_rtx_VEC_DUPLICATE (mode, x));
>return;
>  }
>  
> -  /* Half the fields (or less) are non-constant.  Load constant then overwrite
> - varying fields.  Hope that this is more efficient than using the stack.  */
> -  if (n_var <= n_elts/2)
> +  /* Initialise a vector which is part-variable.  We want to first try
> + to build those lanes which are constant in the most efficient way we
> + can.  */
> +  if (n_var != n_elts)
>  {
>rtx copy = copy_rtx (vals);
>  
> @@ -11054,31 +11064,21 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> XVECEXP (copy, 0, i) = subst;
>   }
>aarch64_expand_vector_init (target, copy);
> +}
>  
> -  /* Insert variables.  */
> -  enum insn_code icode = optab_handler (vec_set_optab, mode);
> -  gcc_assert (icode != CODE_FOR_nothing);
> +  /* Insert the variable lanes directly.  */
>  
> -  for (int i = 0; i < n_elts; i++)
> - {
> -   rtx x = XVECEXP (vals, 0, i);
> -   if (CONST_INT_P (x) || CONST_DOUBLE_P (x))
> - continue;
> -   x = copy_to_mode_reg (inner_mode, x);
> -   emit_insn (GEN_FCN (icode) (target, x, GEN_INT (i)));
> - }
> -  return;
> -  

Re: [aarch64] Improve TImode constant moves

2016-02-02 Thread James Greenhalgh
On Sun, Jan 24, 2016 at 02:54:32AM -0800, Richard Henderson wrote:
> This looks to be an incomplete transition of the aarch64 backend to
> CONST_WIDE_INT.  I haven't checked to see if it's a regression from
> gcc5, but I suspect not, since there should have been similar checks
> for CONST_DOUBLE.
> 
> This is probably gcc7 fodder, but it helped me debug another TImode PR.

When the time comes, this is OK.

Thanks,
James

>   * config/aarch64/aarch64.c (aarch64_rtx_costs): Handle CONST_WIDE_INT.
>   (aarch64_legitimate_constant_p): Accept CONST_SCALAR_INT_P.
>   * config/aarch64/predicates.md (aarch64_movti_operand): Accept
>   const_wide_int and const_scalar_int_operand.
>   (aarch64_reg_or_imm): Likewise.
> 
> 
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index df3dec0..38c7443 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -6227,6 +6227,17 @@ aarch64_rtx_costs (rtx x, machine_mode mode, int outer 
> ATTRIBUTE_UNUSED,
>   }
>return true;
>  
> +case CONST_WIDE_INT:
> +  *cost = 0;
> +  for (unsigned int n = CONST_WIDE_INT_NUNITS(x), i = 0; i < n; ++i)
> + {
> +   unsigned HOST_WIDE_INT e = CONST_WIDE_INT_ELT(x, i);
> +   if (e != 0)
> + *cost += COSTS_N_INSNS (aarch64_internal_mov_immediate
> + (NULL_RTX, GEN_INT (e), false, DImode));
> + }
> +  return true;
> +
>  case CONST_DOUBLE:
>if (speed)
>   {
> @@ -9400,6 +9411,9 @@ aarch64_legitimate_constant_p (machine_mode mode, rtx x)
>&& aarch64_valid_symref (XEXP (x, 0), GET_MODE (XEXP (x, 0
>  return true;
>  
> +  if (CONST_SCALAR_INT_P (x))
> +return true;
> +
>return aarch64_constant_address_p (x);
>  }
>  
> diff --git a/gcc/config/aarch64/predicates.md 
> b/gcc/config/aarch64/predicates.md
> index e96dc00..3eb33fa 100644
> --- a/gcc/config/aarch64/predicates.md
> +++ b/gcc/config/aarch64/predicates.md
> @@ -217,15 +217,15 @@
>(match_test "aarch64_mov_operand_p (op, mode)")
>  
>  (define_predicate "aarch64_movti_operand"
> -  (and (match_code "reg,subreg,mem,const_int")
> +  (and (match_code "reg,subreg,mem,const_int,const_wide_int")
> (ior (match_operand 0 "register_operand")
>   (ior (match_operand 0 "memory_operand")
> -  (match_operand 0 "const_int_operand")
> +  (match_operand 0 "const_scalar_int_operand")
>  
>  (define_predicate "aarch64_reg_or_imm"
> -  (and (match_code "reg,subreg,const_int")
> +  (and (match_code "reg,subreg,const_int,const_wide_int")
> (ior (match_operand 0 "register_operand")
> - (match_operand 0 "const_int_operand"
> + (match_operand 0 "const_scalar_int_operand"
>  
>  ;; True for integer comparisons and for FP comparisons other than LTGT or 
> UNEQ.
>  (define_special_predicate "aarch64_comparison_operator"



Re: [PATCH 4/4][AArch64] Cost CCMP instruction sequences to choose better expand order

2016-02-03 Thread James Greenhalgh
On Thu, Jan 28, 2016 at 02:33:20PM +, James Greenhalgh wrote:
> On Mon, Jan 25, 2016 at 08:09:39PM +, Wilco Dijkstra wrote:
> > Andreas Schwab  wrote:
> > 
> > > FAIL: gcc.target/aarch64/ccmp_1.c scan-assembler-times \tcmp\tw[0-9]+, 0 4
> > > FAIL: gcc.target/aarch64/ccmp_1.c scan-assembler adds\t
> > > FAIL: gcc.target/aarch64/ccmp_1.c scan-assembler-times fccmpe\t.*0\\.0 1
> > 
> > Yes I noticed those too, and here is the fix. Richard's recent change added
> > UNSPEC to the CCMP patterns to stop combine optimizing the CCMP CCmode
> > immediate in a rare case. This requires a change to the CCMP cost 
> > calculation
> > as the CCMP instruction with unspec is no longer recognized.
> > 
> > Fix the ccmp_1.c test to allow both '0' and 'wzr' on cmp - BTW is there a
> > regular expression that correctly implements (0|xzr)? If I use that the test
> > still fails somehow but \[0wzr\]+ works fine... Is the correct syntax
> > documented somewhere?
> > 
> > Finally to ensure FCCMPE is emitted on relational compares, add
> > -ffinite-math-only.
> > 
> > ChangeLog:
> > 2016-01-25  Wilco Dijkstra  
> > 
> > gcc/
> > * config/aarch64/aarch64.c (aarch64_if_then_else_costs):
> > Remove CONST_INT_P check in CCMP cost calculation.
> > 
> > gcc/testsuite/
> > * gcc.target/aarch64/ccmp_1.c: Fix test issues.

I'm still seeing:

  FAIL: gcc.target/aarch64/ccmp_1.c scan-assembler-times \\tcmp\\tw[0-9]+, (0|wzr) 4

Looking at the assembly generated for me with this testcase I see ccmp
with zero in 5 places:

  f3:
cmp w1, 34
ccmpw0, 19, 0, eq
csetw0, eq
ret
  f4:
cmp w0, 35
ccmpw1, 20, 0, eq
csetw0, eq
ret

  f7:
cmp w0, 0
ccmpw1, 7, 0, eq
csetw0, eq
ret

  f8:
cmp w1, 0
ccmpw0, 9, 0, eq
csetw0, eq
ret

  f11:
fcmpe   d0, #0.0
ccmpw0, 30, 0, mi
csetw0, eq
ret

Are these all expected? If so, can you spin the "obvious" patch to bump
this number to 5.

Thanks,
James



Re: [PATCH AArch64]Force register scaling out of mem ref and comment why

2016-02-04 Thread James Greenhalgh
On Thu, Feb 04, 2016 at 10:11:53AM +, Bin Cheng wrote:
> Hi,
> There is a performance regression caused by my previous change to
> aarch64_legitimize_address, in which I forced constant offset out of memory
> ref if the address expr is in the form of "reg1 + reg2 << scale + const".
> The intention is to reveal as many loop-invariant opportunities as possible,
> while depend on GIMPLE optimizers picking up CSE opportunities of "reg <<
> scale" among different memory references.
> 
> Though the assumption still holds, gimple optimizers are not powerful enough
> to pick up CSE opportunities of register scaling expressions at current time.
> Here comes a workaround: this patch forces register scaling expression out of
> memory ref, so that RTL CSE pass can handle common register scaling
> expressions issue, of course, at a cost of possibly missed loop invariants.
> 
> James and I collected perf data, fortunately this change can improve
> performance for several cases in various benchmarks, while doesn't cause big
> regression.  It also recovers big regression we observed before for the
> previous change.
> 
> I also added comment explaining why the workaround is necessary.  I also
> files PR69653 as an example showing tree optimizer should be improved.
> 
> Bootstrap and test on AArch64, is it OK?

OK.

Thanks,
James

> 
> Thanks,
> bin
> 
> 
> 2016-02-04  Bin Cheng  
> 
>   * config/aarch64/aarch64.c (aarch64_legitimize_address): Force
>   register scaling out of memory reference and comment why.
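
To illustrate the address shape in question with a made-up example (not
taken from the benchmarks): an access of the form base + (index << scale) +
constant arises from code like the following, and it is the scaled index
that the patch now keeps outside the memory reference so that RTL CSE can
share it between accesses.

  struct pt { long x, y; };

  long
  sum_xy (struct pt *p, long i)
  {
    /* Both loads want base + (i << 4) + small constant; forcing the scaled
       index out of the address lets i << 4 be computed once.  */
    return p[i].x + p[i].y;
  }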



Re: [PATCH][AArch64] PR target/69161: Don't use special predicate for CCmode comparisons in expressions that require matching modes

2016-02-04 Thread James Greenhalgh
On Fri, Jan 29, 2016 at 02:27:34PM +, Kyrill Tkachov wrote:
> Hi all,
> 
> In this PR we ICE during combine when trying to propagate a comparison into a
> vec_duplicate, that is, we end up creating the rtx:
> (vec_duplicate:V4SI (eq:CC_NZ (reg:CC_NZ 66 cc)
> (const_int 0 [0])))
> 
> The documentation for vec_duplicate says:
> "The output vector mode must have the same submodes as the input vector mode 
> or the scalar modes"
> So this is invalid RTL, which triggers an assert in simplify-rtx to that 
> effect.
> 
> It has been suggested on the PR that this is because we use a 
> special_predicate for
> aarch64_comparison_operator which means that it ignores the mode when 
> matching.
> This is fine when used in RTXes that don't need it, like if_then_else
> expressions, but can cause trouble when used in places where the modes do
> matter, like in
> SET operations. In this particular ICE the cause was the conditional store
> patterns that could end up matching an intermediate rtx during combine of
> (set (reg:SI) (eq:CC_NZ x y)).
> 
> The suggested solution is to define a separate predicate with the same
> conditions as aarch64_comparison_operator but make it not special, so it gets
> automatic mode checks to prevent such a situation.
> 
> This patch does that.
> Bootstrapped and tested on aarch64-linux-gnu.
> SPEC2006 codegen did not change with this patch, so there shouldn't be
> any code quality regressions.
> 
> Ok for trunk?

It would be good to leave a more detailed comment on
"aarch64_comparison_operator_mode" as to why we need it.

Otherwise, this is OK.

Thanks,
James

> 
> Thanks,
> Kyrill
> 
> 2016-01-29  Kyrylo Tkachov  
> 
> PR target/69161
> * config/aarch64/predicates.md (aarch64_comparison_operator_mode):
> New predicate.
> (aarch64_comparison_operator): Break overly long line into two.
> (aarch64_comparison_operation): Likewise.
> * config/aarch64/aarch64.md (cstorecc4): Use
> aarch64_comparison_operator_mode instead of
> aarch64_comparison_operator.
> (cstore4): Likewise.
> (aarch64_cstore): Likewise.
> (*cstoresi_insn_uxtw): Likewise.
> (cstore_neg): Likewise.
> (*cstoresi_neg_uxtw): Likewise.
> 
> 2016-01-29  Kyrylo Tkachov  
> 
> PR target/69161
> * gcc.c-torture/compile/pr69161.c: New test.



Re: [PATCH] Fix jit crash on aarch64, mips

2016-02-04 Thread James Greenhalgh
On Thu, Feb 04, 2016 at 10:31:27AM -0500, David Malcolm wrote:
> The jit testsuite was showing numerous segfaults and fatal
> errors for trunk on aarch64; typically on the 2nd iteration of each
> test, with errors like:
>  test-volatile.c.exe: fatal error: pass ‘rnreg’ not found but is referenced 
> by new pass ‘whole-program’
> where the new pass' name varies, and can be bogus, e.g.:
>  test-nested-loops.c.exe: fatal error: pass 'rnreg' not found but is 
> referenced by new pass '/tmp/libgccjit-FMb7g3/fake.c'
> 
> This is a regression relative to gcc 5.
> 
> The root cause is that aarch64_register_fma_steering builds and
> registers an "fma_steering" pass after "rnreg", but the
>   struct register_pass_info
> containing the arguments to register_pass is marked "static".
> Hence after the 1st iteration, the pointer to the pass isn't touched,
> and we have a use-after-free of the 1st iteration's pass_fma_steering.
> 
> The attached patch removes the "static" from the relevant local, so
> that the pass pointer is updated before each call to register_pass.
> 
> With this patch, the jit testsuite runs successfully (8514 passes) on
> gcc113 (aarch64-unknown-linux-gnu).
> 
> I used grep to see if there were any other
>   "static struct register_pass_info"
> in the code, and there's one in the mips backend, so I did the same
> change there (untested).
> 
> Bootstrap on aarch64 in progress; I don't have mips handy.
> 
> OK for trunk if it passes?

The AArch64 part is OK (assuming bootstrap and test succeed), thanks.

James

> 
> gcc/ChangeLog:
>   * config/aarch64/cortex-a57-fma-steering.c
>   (aarch64_register_fma_steering): Remove "static" from arguments
>   to register_pass.
>   * config/mips/frame-header-opt.c (mips_register_frame_header_opt):
>   Likewise.
> ---
>  gcc/config/aarch64/cortex-a57-fma-steering.c | 2 +-
>  gcc/config/mips/frame-header-opt.c   | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
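A stripped-down C++ sketch of the lifetime bug described above, using
hypothetical stand-in names rather than the real opt_pass/register_pass_info
types; it only exists to show why a function-local "static" with a dynamic
initializer hands out a stale pointer on the second JIT compilation:

  #include <cstdio>

  struct pass_sketch { int id; };              /* stand-in for opt_pass      */

  struct pass_info_sketch
  {
    pass_sketch *pass;                         /* heap-allocated pass object */
    const char *reference_name;                /* insert after, e.g. "rnreg" */
  };

  static pass_sketch *last_registered;

  static void
  register_pass_sketch (pass_info_sketch *info)
  {
    last_registered = info->pass;
  }

  static void
  register_backend_pass ()
  {
    /* BUG: "static" means the "new" below runs only on the first call, so
       every later call re-registers the first (by then deleted) pass.
       Dropping "static" builds a fresh struct, and a fresh pass, each time.  */
    static pass_info_sketch info = { new pass_sketch, "rnreg" };
    register_pass_sketch (&info);
  }

  int
  main ()
  {
    register_backend_pass ();
    pass_sketch *first = last_registered;
    delete first;               /* the pass is freed after compilation 1 */
    register_backend_pass ();   /* still hands back the stale pointer    */
    std::printf ("%s\n", last_registered == first ? "stale" : "fresh");
    return 0;
  }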



Re: [Patch AArch64] GCC 6 regression in vector performance. - Fix vector initialization to happen with lane load instructions.

2016-02-08 Thread James Greenhalgh
On Tue, Feb 02, 2016 at 10:29:29AM +, James Greenhalgh wrote:
> On Wed, Jan 20, 2016 at 03:22:11PM +0000, James Greenhalgh wrote:
> > 
> > Hi,
> > 
> > In a number of cases where we try to create vectors we end up spilling to 
> > the
> > stack and then filling. This is one example distilled from a couple of
> > micro-benchmarks where the issue shows up. The reason for the extra cost
> > in this case is the unnecessary use of the stack. The patch attempts to
> > finesse this by using lane loads or vector inserts to produce the right
> > results.
> > 
> > This patch is mostly Ramana's work, I've just cleaned it up a little.
> > 
> > This has been in a number of our trees lately, and we haven't seen any
> > regressions. I've also bootstrapped and tested it, and run a set of
> > benchmarks to show no regressions on Cortex-A57 or Cortex-A53.
> > 
> > The patch fixes some regressions caused by the more agressive vectorization
> > in GCC6, so I'd like to propose it to go in even though we are in Stage 4.
> > 
> > OK?
> 
> *Ping*

*ping^2*

Cheers,
James

> > 2016-01-20  James Greenhalgh  
> > Ramana Radhakrishnan  
> > 
> >     * config/aarch64/aarch64.c (aarch64_expand_vector_init): Refactor,
> > always use lane loads to construct non-constant vectors.
> > 
> > gcc/testsuite/
> > 
> > 2016-01-20  James Greenhalgh  
> > Ramana Radhakrishnan  
> > 
> > * gcc.target/aarch64/vector_initialization_nostack.c: New.
> > 
> 
> > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > index 03bc1b9..3787b38 100644
> > --- a/gcc/config/aarch64/aarch64.c
> > +++ b/gcc/config/aarch64/aarch64.c
> > @@ -10985,28 +10985,37 @@ aarch64_simd_make_constant (rtx vals)
> >  return NULL_RTX;
> >  }
> >  
> > +/* Expand a vector initialisation sequence, such that TARGET is
> > +   initialised to contain VALS.  */
> > +
> >  void
> >  aarch64_expand_vector_init (rtx target, rtx vals)
> >  {
> >machine_mode mode = GET_MODE (target);
> >machine_mode inner_mode = GET_MODE_INNER (mode);
> > +  /* The number of vector elements.  */
> >int n_elts = GET_MODE_NUNITS (mode);
> > +  /* The number of vector elements which are not constant.  */
> >int n_var = 0;
> >rtx any_const = NULL_RTX;
> > +  /* The first element of vals.  */
> > +  rtx v0 = XVECEXP (vals, 0, 0);
> >bool all_same = true;
> >  
> > +  /* Count the number of variable elements to initialise.  */
> >for (int i = 0; i < n_elts; ++i)
> >  {
> >rtx x = XVECEXP (vals, 0, i);
> > -  if (!CONST_INT_P (x) && !CONST_DOUBLE_P (x))
> > +  if (!(CONST_INT_P (x) || CONST_DOUBLE_P (x)))
> > ++n_var;
> >else
> > any_const = x;
> >  
> > -  if (i > 0 && !rtx_equal_p (x, XVECEXP (vals, 0, 0)))
> > -   all_same = false;
> > +  all_same &= rtx_equal_p (x, v0);
> >  }
> >  
> > +  /* No variable elements, hand off to aarch64_simd_make_constant which 
> > knows
> > + how best to handle this.  */
> >if (n_var == 0)
> >  {
> >rtx constant = aarch64_simd_make_constant (vals);
> > @@ -11020,14 +11029,15 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> >/* Splat a single non-constant element if we can.  */
> >if (all_same)
> >  {
> > -  rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, 0));
> > +  rtx x = copy_to_mode_reg (inner_mode, v0);
> >aarch64_emit_move (target, gen_rtx_VEC_DUPLICATE (mode, x));
> >return;
> >  }
> >  
> > -  /* Half the fields (or less) are non-constant.  Load constant then 
> > overwrite
> > - varying fields.  Hope that this is more efficient than using the 
> > stack.  */
> > -  if (n_var <= n_elts/2)
> > +  /* Initialise a vector which is part-variable.  We want to first try
> > + to build those lanes which are constant in the most efficient way we
> > + can.  */
> > +  if (n_var != n_elts)
> >  {
> >rtx copy = copy_rtx (vals);
> >  
> > @@ -11054,31 +11064,21 @@ aarch64_expand_vector_init (rtx target, rtx vals)
> >   XVECEXP (copy, 0, i) = subst;
> > }
> >aarch64_expand_vector_init (target, copy);
> > +}
> >  
> > -  /* Insert variables.  */
> > -  enum insn_code icode = optab_handler (vec_set_optab, mode);
> &
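As an illustration of the class of code the patch targets (a hypothetical
example, not the vector_initialization_nostack.c testcase itself), a vector
built entirely from non-constant scalars previously risked being expanded as
stores to the stack followed by a vector reload; with the change it can be
built with lane loads / element inserts instead:

  #include <arm_neon.h>

  float32x4_t
  make_vec (const float *a, const float *b)
  {
    /* Four variable elements: no constant splat is possible, so the
       constructor is expanded through aarch64_expand_vector_init.  */
    float32x4_t v = { a[0], b[1], a[2], b[3] };
    return v;
  }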

Re: [AArch64] Remove AARCH64_EXTRA_TUNE_RECIP_SQRT from Cortex-A57 tuning

2016-02-08 Thread James Greenhalgh
On Mon, Feb 01, 2016 at 02:00:01PM +, James Greenhalgh wrote:
> On Mon, Jan 25, 2016 at 11:20:46AM +0000, James Greenhalgh wrote:
> > On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote:
> > > 
> > > Hi,
> > > 
> > > I've seen a couple of large performance issues caused by expanding
> > > the high-precision reciprocal square root for Cortex-A57, so I'd like
> > > to turn it off by default.
> > > 
> > > This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from
> > > Spec2000, good (~5.5%) for gromacs from Spec2006, and very good (>10%) for
> > > some private microbenchmark kernels which stress the divide/sqrt/multiply
> > > units. It therefore seems to me to be the correct choice to make across
> > > a number of workloads.
> > > 
> > > Bootstrapped and tested on aarch64-none-linux-gnu with no issues.
> > > 
> > > OK?
> > 
> > *Ping*
> 
> *pingx2*

*ping^3*

Thanks,
James

> > > ---
> > > 2015-12-11  James Greenhalgh  
> > > 
> > >   * config/aarch64/aarch64.c (cortexa57_tunings): Remove
> > >   AARCH64_EXTRA_TUNE_RECIP_SQRT.
> > > 
> > 
> > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > > index 1d5d898..999c9fc 100644
> > > --- a/gcc/config/aarch64/aarch64.c
> > > +++ b/gcc/config/aarch64/aarch64.c
> > > @@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings =
> > >0, /* max_case_values.  */
> > >0, /* cache_line_size.  */
> > >tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> > > -  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS
> > > -   | AARCH64_EXTRA_TUNE_RECIP_SQRT)  /* tune_flags.  */
> > > +  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)   /* tune_flags.  */
> > >  };
> > >  
> > >  static const struct tune_params cortexa72_tunings =
> > 
> 


Re: [Patch AArch64] Use software sqrt expansion always for -mlow-precision-recip-sqrt

2016-02-08 Thread James Greenhalgh
On Mon, Feb 01, 2016 at 01:59:34PM +, James Greenhalgh wrote:
> On Mon, Jan 25, 2016 at 11:21:25AM +0000, James Greenhalgh wrote:
> > On Mon, Jan 11, 2016 at 11:53:39AM +0000, James Greenhalgh wrote:
> > > 
> > > Hi,
> > > 
> > > I'd like to switch the logic around in aarch64.c such that
> > > -mlow-precision-recip-sqrt causes us to always emit the low-precision
> > > software expansion for reciprocal square root. I have two reasons to do
> > > this; first is consistency across -mcpu targets, second is enabling more
> > > -mcpu targets to use the flag for peak tuning.
> > > 
> > > I don't much like that the precision we use for -mlow-precision-recip-sqrt
> > > differs between cores (and possibly compiler revisions). Yes, we're
> > > under -ffast-math but I take this flag to mean the user explicitly wants 
> > > the
> > > low-precision expansion, and we should not diverge from that based on an
> > > internal decision as to what is optimal for performance in the
> > > high-precision case. I'd prefer to keep things as predictable as possible,
> > > and here that means always emitting the low-precision expansion when 
> > > asked.
> > > 
> > > Judging by the comments in the thread proposing the reciprocal square
> > > root optimisation, this will benefit all cores currently supported by GCC.
> > > To be clear, we would still not expand in the high-precision case for any
> > > cores which do not explicitly ask for it. Currently that is Cortex-A57
> > > and xgene, though I will be proposing a patch to remove Cortex-A57 from
> > > that list shortly.
> > > 
> > > Which gives my second motivation for this patch. 
> > > -mlow-precision-recip-sqrt
> > > is intended as a tuning flag for situations where performance is more
> > > important than precision, but the current logic requires setting an
> > > internal flag which also changes the performance characteristics where
> > > high-precision is needed. This conflates two decisions the target might
> > > want to make, and reduces the applicability of an option targets might
> > > want to enable for performance. In particular, I'd still like to see
> > > -mlow-precision-recip-sqrt continue to emit the cheaper, low-precision
> > > sequence for floats under Cortex-A57.
> > > 
> > > Based on that reasoning, this patch makes the appropriate change to the
> > > logic. I've checked with the current -mcpu values to ensure that behaviour
> > > without -mlow-precision-recip-sqrt does not change, and that behaviour
> > > with -mlow-precision-recip-sqrt is to emit the low precision sequences.
> > > 
> > > I've also put this through bootstrap and test on aarch64-none-linux-gnu
> > > with no issues.
> > > 
> > > OK?
> > 
> > *Ping*
> 
> *Pingx2*

*Ping^3*

Thanks,
James

> > > 2015-12-10  James Greenhalgh  
> > > 
> > >   * config/aarch64/aarch64.c (use_rsqrt_p): Always use software
> > >   reciprocal sqrt for -mlow-precision-recip-sqrt.
> > > 
> > 
> > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > > index 9142ac0..1d5d898 100644
> > > --- a/gcc/config/aarch64/aarch64.c
> > > +++ b/gcc/config/aarch64/aarch64.c
> > > @@ -7485,8 +7485,9 @@ use_rsqrt_p (void)
> > >  {
> > >return (!flag_trapping_math
> > > && flag_unsafe_math_optimizations
> > > -   && (aarch64_tune_params.extra_tuning_flags
> > > -   & AARCH64_EXTRA_TUNE_RECIP_SQRT));
> > > +   && ((aarch64_tune_params.extra_tuning_flags
> > > +& AARCH64_EXTRA_TUNE_RECIP_SQRT)
> > > +   || flag_mrecip_low_precision_sqrt));
> > >  }
> > >  
> > >  /* Function to decide when to use
> > 
> 
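For illustration, a hypothetical kernel of the kind the option is aimed at:
under -ffast-math the 1/sqrt below can be expanded as an FRSQRTE estimate
refined by FRSQRTS steps instead of FSQRT followed by FDIV, and
-mlow-precision-recip-sqrt asks for the cheaper, lower-precision variant of
that expansion regardless of which core is being tuned for:

  #include <math.h>

  void
  normalise (float *x, float *y, int n)
  {
    for (int i = 0; i < n; i++)
      {
        /* 1/sqrt(x): the reciprocal square root pattern the expansion
           applies to.  */
        float s = 1.0f / sqrtf (x[i] * x[i] + y[i] * y[i]);
        x[i] *= s;
        y[i] *= s;
      }
  }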


Re: [Patch AArch64] Restrict 16-bit sqrdml{sa}h instructions to FP_LO_REGS

2016-02-08 Thread James Greenhalgh
On Tue, Jan 26, 2016 at 04:04:47PM +, James Greenhalgh wrote:
> 
> Hi,
> 
> In their forms using 16-bit lanes, the sqrdmlah and sqrdmlsh instructions
> available when compiling with -march=armv8.1-a are only usable with
> a register number in the range 0 to 15 for operand 3, as gas will point
> out:
> 
>   Error: register number out of range 0 to 15 at
> operand 3 -- `sqrdmlsh v2.4h,v4.4h,v23.h[5]'
> 
> This patch teaches GCC to avoid registers outside of this range when
> appropriate, in the same fashion as we do for other instructions with
> this limitation.
> 
> Tested on an internal testsuite targeting Neon intrinsics.
> 
> OK?

*ping*

Thanks,
James

> ---
> 2016-01-25  James Greenhalgh  
> 
>   * config/aarch64/aarch64-simd.md
>   (aarch64_sqrdmlh_lane): Fix register
>   constraints for operand 3.
>   (aarch64_sqrdmlh_laneq): Likewise.
> 

> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index e1f5682..0b46e78 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -3240,7 +3240,7 @@
> [(match_operand:VDQHS 1 "register_operand" "0")
>  (match_operand:VDQHS 2 "register_operand" "w")
>  (vec_select:
> -  (match_operand: 3 "register_operand" "w")
> +  (match_operand: 3 "register_operand" "")
>(parallel [(match_operand:SI 4 "immediate_operand" "i")]))]
> SQRDMLH_AS))]
> "TARGET_SIMD_RDMA"
> @@ -3258,7 +3258,7 @@
> [(match_operand:SD_HSI 1 "register_operand" "0")
>  (match_operand:SD_HSI 2 "register_operand" "w")
>  (vec_select:
> -  (match_operand: 3 "register_operand" "w")
> +  (match_operand: 3 "register_operand" "")
>(parallel [(match_operand:SI 4 "immediate_operand" "i")]))]
> SQRDMLH_AS))]
> "TARGET_SIMD_RDMA"
> @@ -3278,7 +3278,7 @@
> [(match_operand:VDQHS 1 "register_operand" "0")
>  (match_operand:VDQHS 2 "register_operand" "w")
>  (vec_select:
> -  (match_operand: 3 "register_operand" "w")
> +  (match_operand: 3 "register_operand" "")
>(parallel [(match_operand:SI 4 "immediate_operand" "i")]))]
> SQRDMLH_AS))]
> "TARGET_SIMD_RDMA"
> @@ -3296,7 +3296,7 @@
> [(match_operand:SD_HSI 1 "register_operand" "0")
>  (match_operand:SD_HSI 2 "register_operand" "w")
>  (vec_select:
> -  (match_operand: 3 "register_operand" "w")
> +  (match_operand: 3 "register_operand" "")
>(parallel [(match_operand:SI 4 "immediate_operand" "i")]))]
> SQRDMLH_AS))]
> "TARGET_SIMD_RDMA"
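An illustrative intrinsics use (assuming the ACLE vqrdmlah_lane_s16 intrinsic
and compilation with -march=armv8.1-a): the lane operand below becomes the
vN.h[lane] register of an sqrdmlah instruction, which for 16-bit lanes must be
one of v0-v15, the range the constraint change above enforces:

  #include <arm_neon.h>

  int16x4_t
  mla_lane (int16x4_t acc, int16x4_t a, int16x4_t b)
  {
    /* Maps to something like: sqrdmlah v0.4h, v1.4h, v2.h[1]  */
    return vqrdmlah_lane_s16 (acc, a, b, 1);
  }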



[Patch] Gate vect-mask-store-move-1.c correctly, and actually output the dump

2016-02-08 Thread James Greenhalgh

Hi,

As far as I can tell, this testcase will only vectorize for x86_64/i?86
targets, so it should be gated to only check for vectorization on those.

Additionally, this test wants to scan the vectorizer dumps, so we ought
to add -fdump-tree-vect-all to the options.

Checked on aarch64 (cross/native) and x86 with no issues.

OK?

Thanks,
James

---
2016-02-08  James Greenhalgh  

* gcc.dg/vect/vect-mask-store-move-1.c: Add dump option, and gate
check on x86_64/i?86.

diff --git a/gcc/testsuite/gcc.dg/vect/vect-mask-store-move-1.c b/gcc/testsuite/gcc.dg/vect/vect-mask-store-move-1.c
index e575f6d..3ef613d 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-mask-store-move-1.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-mask-store-move-1.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O3" } */
+/* { dg-options "-O3 -fdump-tree-vect-all" } */
 /* { dg-additional-options "-mavx2" { target { i?86-*-* x86_64-*-* } } } */
 
 #define N 256
@@ -16,4 +16,4 @@ void foo (int n)
   }
 }
 
-/* { dg-final { scan-tree-dump-times "Move stmt to created bb" 6 "vect" } } */
+/* { dg-final { scan-tree-dump-times "Move stmt to created bb" 6 "vect" { target { i?86-*-* x86_64-*-* } } } } */


Re: [Patch] Gate vect-mask-store-move-1.c correctly, and actually output the dump

2016-02-08 Thread James Greenhalgh
On Mon, Feb 08, 2016 at 04:29:31PM +0300, Yuri Rumyantsev wrote:
> Hi James,
> 
> Thanks for reporting this issue.
> I prepared a slightly different patch since we don't need to add the
> tree-vect dump option - it is on by default for all tests in the /vect
> directory.

Hm, I added that line as my test runs were showing:

  UNRESOLVED: gcc.dg/vect/vect-mask-store-move-1.c: dump file does not exist

I would guess the explicit 

  /* { dg-options "-O3" } */

is clobbering the vect.exp setup of flags?

This also affects the x86-64 results H.J. Lu is sending out:

  https://gcc.gnu.org/ml/gcc-testresults/2016-02/msg00824.html

Thanks,
James

> 
> gcc/testsuite/ChangeLog:
> 
> * gcc.dg/vect/vect-mask-store-move-1.c: Gate dump with x86 target.
> 
> 2016-02-08 16:07 GMT+03:00 James Greenhalgh :
> >
> > Hi,
> >
> > As far as I can tell, this testcase will only vectorize for x86_64/i?86
> > targets, so it should be gated to only check for vectorization on those.
> >
> > Additionally, this test wants to scan the vectorizer dumps, so we ought
> > to add -fdump-tree-vect-all to the options.
> >
> > Checked on aarch64 (cross/native) and x86 with no issues.
> >
> > OK?
> >
> > Thanks,
> > James
> >
> > ---
> > 2016-02-08  James Greenhalgh  
> >
> > * gcc.dg/vect/vect-mask-store-move-1.c: Add dump option, and gate
> > check on x86_64/i?86.
> >




Re: [Patch] Gate vect-mask-store-move-1.c correctly, and actually output the dump

2016-02-09 Thread James Greenhalgh

On Mon, Feb 08, 2016 at 03:24:14PM +0100, Richard Biener wrote:
> On Mon, Feb 8, 2016 at 2:40 PM, James Greenhalgh
>  wrote:
> > On Mon, Feb 08, 2016 at 04:29:31PM +0300, Yuri Rumyantsev wrote:
> >> Hi James,
> >>
> >> Thanks for reporting this issue.
> >> I prepared slightly different patch since we don't need to add
> >> tree-vect dump option - it is on by default for all tests in /vect
> >> directory.
> >
> > Hm, I added that line as my test runs were showing:
> >
> >   UNRESOLVED: gcc.dg/vect/vect-mask-store-move-1.c: dump file does not exist
> >
> > I would guess the explicit
> >
> >   /* { dg-options "-O3" } */
> >
> > is clobbering the vect.exp setup of flags?
>
> Yes.  Use { dg-additional-options "-O3" } instead.

I don't see why this test needs anything more than the default vect
options anyway... In which case, the patch would look like this.

Tested on x86-64 where the test passes, and on AArch64 where it is
correctly skipped.

OK?

Thanks,
James

---
2016-02-09  James Greenhalgh  

* gcc.dg/vect/vect-mask-store-move-1.c: Drop dg-options directive,
gate check on x86_64/i?86.

diff --git a/gcc/testsuite/gcc.dg/vect/vect-mask-store-move-1.c b/gcc/testsuite/gcc.dg/vect/vect-mask-store-move-1.c
index e575f6d..f5cae4f 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-mask-store-move-1.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-mask-store-move-1.c
@@ -1,5 +1,4 @@
 /* { dg-do compile } */
-/* { dg-options "-O3" } */
 /* { dg-additional-options "-mavx2" { target { i?86-*-* x86_64-*-* } } } */
 
 #define N 256
@@ -16,4 +15,4 @@ void foo (int n)
   }
 }
 
-/* { dg-final { scan-tree-dump-times "Move stmt to created bb" 6 "vect" } } */
+/* { dg-final { scan-tree-dump-times "Move stmt to created bb" 6 "vect" { target { i?86-*-* x86_64-*-* } } } } */
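A minimal sketch of the shape this thread converges on (illustrative only, not
the committed test): with no dg-options line the dump flags that vect.exp sets
up survive, and any extra flags go through dg-additional-options, which appends
to the defaults rather than replacing them:

  /* { dg-do compile } */
  /* { dg-additional-options "-mavx2" { target { i?86-*-* x86_64-*-* } } } */

  void
  foo (int *a, int n)
  {
    for (int i = 0; i < n; i++)
      a[i] += 1;
  }

  /* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" } } */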


Re: [PATCH][AArch64] Only update assembler .arch directive when necessary

2016-02-10 Thread James Greenhalgh
On Thu, Feb 04, 2016 at 01:50:31PM +, Kyrill Tkachov wrote:
> Hi all,
> 
> As part of the target attributes and pragmas support for GCC 6 I changed the
> aarch64 port to emit a .arch assembly directive for each function that
> describes the architectural features used by that function.  This is a change
> from GCC 5 behaviour where we output a single .arch directive at the
> beginning of the assembly file corresponding to architectural features given
> on the command line.

> Bootstrapped and tested on aarch64-none-linux-gnu.  With this patch I managed
> to build a recent allyesconfig Linux kernel where before the build would fail
> when assembling the LSE instructions.
> 
> Ok for trunk?

One comment, that I'm willing to be convinced on...

> 
> Thanks,
> Kyrill
> 
> 2016-02-04  Kyrylo Tkachov  
> 
> * config/aarch64/aarch64.c (struct aarch64_output_asm_info):
> New struct definition.
> (aarch64_previous_asm_output): New variable.
> (aarch64_declare_function_name): Only output .arch assembler
> directive if it will be different from the previously output
> directive.
> (aarch64_start_file): New function.
> (TARGET_ASM_FILE_START): Define.
> 
> 2016-02-04  Kyrylo Tkachov  
> 
> * gcc.target/aarch64/assembler_arch_1.c: Add -dA to dg-options.
> Delete unneeded -save-temps.
> * gcc.target/aarch64/assembler_arch_7.c: Likewise.
> * gcc.target/aarch64/target_attr_15.c: Scan assembly for
> .arch armv8-a\n.
> * gcc.target/aarch64/assembler_arch_1.c: New test.

> commit 2df0f24332e316b8d18d4571438f76726a0326e7
> Author: Kyrylo Tkachov 
> Date:   Wed Jan 27 12:54:54 2016 +
> 
> [AArch64] Only update assembler .arch directive when necessary
> 
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 5ca2ae8..0751440 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -11163,6 +11163,17 @@ aarch64_asm_preferred_eh_data_format (int code 
> ATTRIBUTE_UNUSED, int global)
> return (global ? DW_EH_PE_indirect : 0) | DW_EH_PE_pcrel | type;
>  }
>  
> +struct aarch64_output_asm_info
> +{
> +  const struct processor *arch;
> +  const struct processor *cpu;
> +  unsigned long isa_flags;

Why not just keep the last string you printed, and use a string compare
to decide whether to print or not? Sure we'll end up doing a bit more
work, but the logic becomes simpler to follow and we don't need to pass
around another struct...

Thanks,
James




Re: [PATCH][AArch64] Only update assembler .arch directive when necessary

2016-02-10 Thread James Greenhalgh
On Wed, Feb 10, 2016 at 10:32:16AM +, Kyrill Tkachov wrote:
> Hi James,
> 
> On 10/02/16 10:11, James Greenhalgh wrote:
> >On Thu, Feb 04, 2016 at 01:50:31PM +, Kyrill Tkachov wrote:
> >>Hi all,
> >>
> >>As part of the target attributes and pragmas support for GCC 6 I changed the
> >>aarch64 port to emit a .arch assembly directive for each function that
> >>describes the architectural features used by that function.  This is a 
> >>change
> >>from GCC 5 behaviour where we output a single .arch directive at the
> >>beginning of the assembly file corresponding to architectural features given
> >>on the command line.
> >
> >>Bootstrapped and tested on aarch64-none-linux-gnu.  With this patch I 
> >>managed
> >>to build a recent allyesconfig Linux kernel where before the build would 
> >>fail
> >>when assembling the LSE instructions.
> >>
> >>Ok for trunk?
> >One comment, that I'm willing to be convinced on...
> >
> >>Thanks,
> >>Kyrill
> >>
> >>2016-02-04  Kyrylo Tkachov  
> >>
> >> * config/aarch64/aarch64.c (struct aarch64_output_asm_info):
> >> New struct definition.
> >> (aarch64_previous_asm_output): New variable.
> >> (aarch64_declare_function_name): Only output .arch assembler
> >> directive if it will be different from the previously output
> >> directive.
> >> (aarch64_start_file): New function.
> >> (TARGET_ASM_FILE_START): Define.
> >>
> >>2016-02-04  Kyrylo Tkachov  
> >>
> >> * gcc.target/aarch64/assembler_arch_1.c: Add -dA to dg-options.
> >> Delete unneeded -save-temps.
> >> * gcc.target/aarch64/assembler_arch_7.c: Likewise.
> >> * gcc.target/aarch64/target_attr_15.c: Scan assembly for
> >> .arch armv8-a\n.
> >> * gcc.target/aarch64/assembler_arch_1.c: New test.
> >>commit 2df0f24332e316b8d18d4571438f76726a0326e7
> >>Author: Kyrylo Tkachov 
> >>Date:   Wed Jan 27 12:54:54 2016 +
> >>
> >> [AArch64] Only update assembler .arch directive when necessary
> >>
> >>diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> >>index 5ca2ae8..0751440 100644
> >>--- a/gcc/config/aarch64/aarch64.c
> >>+++ b/gcc/config/aarch64/aarch64.c
> >>@@ -11163,6 +11163,17 @@ aarch64_asm_preferred_eh_data_format (int code 
> >>ATTRIBUTE_UNUSED, int global)
> >> return (global ? DW_EH_PE_indirect : 0) | DW_EH_PE_pcrel | type;
> >>  }
> >>+struct aarch64_output_asm_info
> >>+{
> >>+  const struct processor *arch;
> >>+  const struct processor *cpu;
> >>+  unsigned long isa_flags;
> >Why not just keep the last string you printed, and use a string compare
> >to decide whether to print or not? Sure we'll end up doing a bit more
> >work, but the logic becomes simpler to follow and we don't need to pass
> >around another struct...
> 
> I did do it this way to avoid a string comparison (I try to avoid
> manual string manipulations where I can as they're so easy to get wrong)
> though this isn't on any hot path.
> We don't really pass the structure around anywhere, we just keep one
> instance. We'd have to do the same with a string i.e. keep a string
> object around that we'd strcpy (or C++ equivalent) a string to every time
> we wanted to update it, so I thought this approach is cleaner as the
> architecture features are already fully described by a pointer to
> an element in the static constant all_architectures table and an
> unsigned long holding the ISA flags.
> 
> If you insist I can change it to a string, but I personally don't
> think it's worth it.

Had you been working on a C string I probably wouldn't have noticed. But
you're already working with C++ strings in this function, so much of what
you are concerned about is straightforward.

I'd encourage you to try it using idiomatic string manipulation in C++, the
cleanup should be worth it.

Thanks,
James
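A minimal C++ sketch of the string-based bookkeeping suggested above
(hypothetical names, not the code that eventually went in): remember the last
.arch string printed and only emit a new directive when it changes:

  #include <cstdio>
  #include <string>

  /* Body of the last .arch directive emitted for this assembly file.  */
  static std::string aarch64_last_arch_string;

  static void
  aarch64_emit_arch_directive (std::FILE *stream, const std::string &arch_string)
  {
    if (arch_string != aarch64_last_arch_string)
      {
        std::fprintf (stream, "\t.arch %s\n", arch_string.c_str ());
        aarch64_last_arch_string = arch_string;
      }
  }

  int
  main ()
  {
    aarch64_emit_arch_directive (stdout, "armv8-a");
    aarch64_emit_arch_directive (stdout, "armv8-a");        /* suppressed */
    aarch64_emit_arch_directive (stdout, "armv8.1-a+lse");  /* printed    */
    return 0;
  }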


