Re: [GCC][PATCH][Aarch64] Add Bfloat16_t scalar type, vector types and machine modes to Aarch64 back-end [1/2]

Stam Markianos-Wright Tue, 07 Jan 2020 03:42:24 -0800

On 23/12/2019 16:57, Richard Sandiford wrote:
> Stam Markianos-Wright <stam.markianos-wri...@arm.com> writes:
>> On 12/19/19 10:01 AM, Richard Sandiford wrote:
>>>> +
>>>> +#pragma GCC push_options
>>>> +#pragma GCC target ("arch=armv8.2-a+bf16")
>>>> +#ifdef __ARM_FEATURE_BF16_SCALAR_ARITHMETIC
>>>> +
>>>> +typedef __bf16 bfloat16_t;
>>>> +
>>>> +
>>>> +#endif
>>>> +#pragma GCC pop_options
>>>> +
>>>> +#endif
>>>
>>> Are you sure we need the #ifdef?  The target pragma should guarantee
>>> that the macro's defined.
>>>
>>> But the validity of the typedef shouldn't depend on target options,
>>> so AFAICT this should just be:
>>>
>>> typedef __bf16 bfloat16_t;
>>
>> Ok so it's a case of "what do we want to happen if the user tries to use 
>> bfloats
>> without +bf16 enabled.
>>
>> So the intent of the ifdef was to not have bfloat16_t be visible if the macro
>> wasn't defined (i.e. not having any bf16 support), but I see now that this 
>> was
>> being negated by the target macro, anyway! Oops, my bad for not really
>> understanding that, sorry!
>>
>> If we have the types always visible, then the user may use them, resulting 
>> in an
>> ICE.
>>
>> But even if the #ifdef worked this still doesn't stop the user from trying to
>> use  __bf16 or __Bfloat16x4_t, __Bfloat16x8_t , which would still do produce 
>> an
>> ICE, so it's not a perfect solution anyway...
> 
> Right.  Or they could use #pragma GCC target to switch to a different
> non-bf16 target after including arm_bf16.h.
> 
>> One other thing I tried was the below change to aarch64-builtins.c which 
>> stops
>> __bf16 or the vector types from being registered at all:
>>
>> --- a/gcc/config/aarch64/aarch64-builtins.c
>> +++ b/gcc/config/aarch64/aarch64-builtins.c
>> @@ -759,26 +759,32 @@ aarch64_init_simd_builtin_types (void)
>>       aarch64_simd_types[Float64x1_t].eltype = double_type_node;
>>       aarch64_simd_types[Float64x2_t].eltype = double_type_node;
>>
>> -  /* Init Bfloat vector types with underlying __bf16 type.  */
>> -  aarch64_simd_types[Bfloat16x4_t].eltype = aarch64_bf16_type_node;
>> -  aarch64_simd_types[Bfloat16x8_t].eltype = aarch64_bf16_type_node;
>> +  if (TARGET_BF16_SIMD)
>> +    {
>> +      /* Init Bfloat vector types with underlying __bf16 type.  */
>> +      aarch64_simd_types[Bfloat16x4_t].eltype = aarch64_bf16_type_node;
>> +      aarch64_simd_types[Bfloat16x8_t].eltype = aarch64_bf16_type_node;
>> +    }
>>
>>       for (i = 0; i < nelts; i++)
>>         {
>>           tree eltype = aarch64_simd_types[i].eltype;
>>           machine_mode mode = aarch64_simd_types[i].mode;
>>
>> -      if (aarch64_simd_types[i].itype == NULL)
>> +      if (eltype != NULL)
>>            {
>> -         aarch64_simd_types[i].itype
>> -           = build_distinct_type_copy
>> -             (build_vector_type (eltype, GET_MODE_NUNITS (mode)));
>> -         SET_TYPE_STRUCTURAL_EQUALITY (aarch64_simd_types[i].itype);
>> -       }
>> +         if (aarch64_simd_types[i].itype == NULL)
>> +           {
>> +             aarch64_simd_types[i].itype
>> +               = build_distinct_type_copy
>> +               (build_vector_type (eltype, GET_MODE_NUNITS (mode)));
>> +             SET_TYPE_STRUCTURAL_EQUALITY (aarch64_simd_types[i].itype);
>> +           }
>>
>> -      tdecl = add_builtin_type (aarch64_simd_types[i].name,
>> -                               aarch64_simd_types[i].itype);
>> -      TYPE_NAME (aarch64_simd_types[i].itype) = tdecl;
>> +         tdecl = add_builtin_type (aarch64_simd_types[i].name,
>> +                                   aarch64_simd_types[i].itype);
>> +         TYPE_NAME (aarch64_simd_types[i].itype) = tdecl;
>> +       }
>>         }
>>
>>     #define AARCH64_BUILD_SIGNED_TYPE(mode)  \
>> @@ -1240,7 +1246,8 @@ aarch64_general_init_builtins (void)
>>
>>       aarch64_init_fp16_types ();
>>
>> -  aarch64_init_bf16_types ();
>> +  if (TARGET_BF16_FP)
>> +    aarch64_init_bf16_types ();
>>
>>       if (TARGET_SIMD)
>>         aarch64_init_simd_builtins ();
>>
>>
>>
>> But the problem in that case was that it the types could not be re-enabled 
>> using
>> a target pragma like:
>>
>> #pragma GCC push_options
>> #pragma GCC target ("+bf16")
>>
>> Inside the test.
>>
>> (i.e. the pragma caused the ifdef to be TRUE, but __bf16 was still not being
>> enabled afaict?)
>>
>> So I'm not sure what to do, presumably we do want some guard around the type 
>> so
>> as not to just ICE if the type is used without +bf16?
> 
> Other header files work both ways: you get the same definitions regardless
> of what the target was when the header file was included.  Then we need
> to raise an error if the user tries to do something that the current
> target doesn't support.
> 
> I suppose for bf16 we could either (a) try to raise an error whenever
> BF-related moves are emitted without the required target feature or
> (b) handle __bf16 types like __fp16 types.  The justification for
> (b) is that there aren't really any new instructions for moves;
> __bf16 is mostly a software construct as far as this specific
> patch goes.  (It's a different story for the intrinsics patch
> of course.)
> 
> I don't know which of (a) or (b) is better.  Whichever we go for,
> it would be good if clang and GCC were consistent here.


Following our downstream discussions we have implemented (b) by removing 
TARGET_xx restrictions on BFmode MOVs.

I also noticed an ICE when typing to move BF vector types to X registers 
("could 
not split insn"),so I have added the BF-enabled iterators to other patterns 
needed for this

Lmk if you spot any issues!

Also I've update the filenames of all our tests to make them a bit clearer:

C tests:

__ bfloat16_scalar_compile_1.c to bfloat16_scalar_compile_3.c: Compilation of 
scalar moves/loads/stores with "-march8.2-a+bf16", "-march8.2-a and +bf16 
target 
pragma", "-march8.2-a" (now does not error out at all). There now include 
register asms to check more MOV alternatives.

__ bfloat16_scalar_compile_4.c: The _Complex error test.

__ bfloat16_simd_compile_1.c to bfloat16_simd_compile_3.c: Likewise to 
x_scalar_x, but also include (vector) 0x1234.. compilation (no assembler scan).

I had also done a small c++ test, but have chosen to shift that to the [2/2] 
patch because it is currently being blocked by target_invalid_conversion.

Let know know if anything is missing!

> 
>>> It would be good to have more test coverage than this.  E.g.:
>>>
>>> - a test that includes arm_bf16.h, with just scalar tests.
>>
>> Done as test 2, but it is a small test. Is there anything I could add to it?
>> (I feel like ideally I'd want to try and force it down every alternative of 
>> the
>> RTL pattern)
> 
> register asms are one way of doing that, see e.g
> gcc.target/aarch64/sve/struct_move_1.c
> 
Added some to check the movs in/out of GPRs, too.

>>>
>>> - a test for _Complex bfloat16_t.
>>
>> I don't think we currently have a decision on whether this should be 
>> supported
>> or not.
>> AFAICT we also don't have complex __fp16 support either. I'm getting the same
>> error messages attempting to compile a _Complex __fp16 but it's always likely
>> I'm going at this wrong!
>>
>> Added test 5 to show you what I was trying to do and to catch the error 
>> messages
>> in their current form, but I'm not sure if I've done this right either, tbh!
> 
> Testing for an error is a good option if we don't intend to support this.
> The main reason for having a test is to make sure that there's no ICE.
> 
> So the test in the new patch LGTM, thanks.
Cheers!

> 
>>> - a test for moves involving:
>>>
>>>       typedef bfloat16_t v16bf __attribute__((vector_size(32)));
>>
>> Oh that's a good idea, thank you for pointing it out!
>>
>> See test 6 for reference.
>>
>> So for vector size 16, 128bits, this looks fine, loading and storing from q
>> registers (using aarch64_simd_movv8bf).
>>
>> For vector size 32, 256 bits, the compiler chooses to use 4*x-registers 
>> instead,
>> resulting in this piece of assembler
>>
>> stacktest2:
>>            sub     sp, sp, #64
>>            ldp     x2, x3, [x0]
>>            stp     x2, x3, [sp]
>>            ldp     x0, x1, [x0, 16]
>>            stp     x0, x1, [sp, 16]
>>            ldp     x0, x1, [sp]
>>            stp     x0, x1, [sp, 32]
>>            ldp     x2, x3, [sp, 16]
>>            stp     x2, x3, [sp, 48]
>>            stp     x0, x1, [x8]
>>            ldp     x0, x1, [sp, 48]
>>            stp     x0, x1, [x8, 16]
>>            add     sp, sp, 64
>>            ret
>>
>> Which looks strange using regular registers in movti mode, but I tested it 
>> with
>> float16 and float32 vectors and they the same also give the same result.
>>
>> However, using an integer vector generates:
>>
>> stacktest2:
>>            ld1     {v0.16b - v1.16b}, [x0]
>>            sub     sp, sp, #32
>>            st1     {v0.16b - v1.16b}, [sp]
>>            ld1     {v0.16b - v1.16b}, [sp]
>>            st1     {v0.16b - v1.16b}, [x8]
>>            add     sp, sp, 32
>>            ret
>>
>> from the aarch64_movoi pattern. So now I'm unsure whether to leave this as 
>> is or
>> to look into why all float modes are not being used through the seemingly 
>> more
>> efficient movoi pattern. What do you think?
>> (i intend to look into this further)
> 
> Haven't tried, but is this affected by -fno-split-wide-types?
Apparently not! I seem to be getting the same assembler in both cases. In 
investigating I got as far as finding that for float types the ld/str was going 
through aarch64_expand_cpymem which limits them to TImode for some reason (and 
removing the limit allowed them to use OImode,XImode, etc.), but I stopped 
there.
> 
> But here too the main thing is to make sure that there's no ICE when
> using the vectors.  Making it efficient can be (very low priority)
> follow-on work.
> 
> So it's probably best not to match any specific output here.
> Just testing that the moves compile is OK.
Done :) And integrated into the vector tests.
> 
>>> - a test that involves moving constants, for both scalars and vectors.
>>>     You can create zero scalar constants in C++ using bfloat16_t() etc.
>>>     For vectors it's possible to do things like:
>>>
>>>       typedef short v2bf __attribute__((vector_size(4)));
>>>       v2hi foo (void) { return (v2hi) 0x12345678; }
>>>
>>>     The same sort of things should work for bfloat16x4_t and bfloat16x8_t.
>>
>> Leaving this as an open issue for now because I'm not 100% sure what we
>> should/shouldn't be allowing past the tree-level target hooks.
>>
>> If we do want to block this we would do this in the [2/2] patch.
>> I will come back to it and create a scan-assembler test when I'm more clear 
>> on
>> what we should and shouldn't allow at the higher level :)
> 
> FWIW, I'm not sure we should go out of our way to disallow this.
> Preventing bfloat16_t() in C++ would IMO be unnatural.  And the
> "(vector) vector-sized-integer" syntax specifically treats the vector
> as a bundle of bits without really caring what the element type is.
> Even if we did manage to forbid the conversion in that context,
> it would still be possible to achieve the same thing using:
> 
>     v2hi
>     foo (void)
>     {
>       union { v2hi v; unsigned int i; } u;
>       u.i = 0x12345678;
>       return u.v;
>     }
> 
Added the compilation of "(vector) vector-sized-integer" in the vector tests.

But target_invalid_conversion in the [2/2] patch is a complication to this (as 
with bfloat_16t() in c++.

I was under the impression that the original intent of bfloat was for it to be 
storage only, with any initialisation happening through the float32 convert 
intrinsic.

Either I'd be happy to allow it, but it does feel like we'd slightly be going 
against what's the ACLE currently.
However, looking back at it now, it only mentions using ACLE intrinsics over C 
operators, so I'd be happy to allow this for vectors.

For scalars though, if we e.g. were to allow:

bfloat16_t (0x1234);

on a single bfloat, I don't see how we could still block conversions like:

bfloat16_t scalar1 = 0.1;
bfloat16_t scalar2 = 0;
bfloat16_t scalar3 = is_a_float;

Agreed that the union {} would still always slip through, though.

I'll also reply to the 2/2 email to show you what I currently have there.
Let me know of your thoughts on that!

Cheers,
Stam

> Thanks for the new patch, looks good apart from the points above and:
> 
>> +;; Iterator for all scalar floating point modes suitable for moving, 
>> including
>> +;; special BF type.(HF, SF, DF, TF and BF)
> 
> Nit: should be space rather than "." before "(".
Done
> 
>> +(define_mode_iterator GPF_TF_F16_MOV [(HF "") (BF "TARGET_BF16_FP") (SF "")
>> +                                  (DF "") (TF "")])
>> +
>>   ;; Double vector modes.
>>   (define_mode_iterator VDF [V2SF V4HF])
>>   
>> @@ -79,6 +87,9 @@
>>   ;; Double vector modes.
>>   (define_mode_iterator VD [V8QI V4HI V4HF V2SI V2SF])
>>   
>> +;; Double vector modes suitable for moving.  Includes BFmode.
>> +(define_mode_iterator VDMOV [V8QI V4HI V4HF V4BF V2SI V2SF])
>> +
>>   ;; All modes stored in registers d0-d31.
>>   (define_mode_iterator DREG [V8QI V4HI V4HF V2SI V2SF DF])
>>   
>> @@ -94,6 +105,9 @@
>>   ;; Quad vector modes.
>>   (define_mode_iterator VQ [V16QI V8HI V4SI V2DI V8HF V4SF V2DF])
>>   
>> +;; Quad vector modes suitable for moving.  Includes BFmode.
>> +(define_mode_iterator VQMOV [V16QI V8HI V4SI V2DI V8HF V8BF V4SF V2DF])
>> +
>>   ;; Copy of the above.
>>   (define_mode_iterator VQ2 [V16QI V8HI V4SI V2DI V8HF V4SF V2DF])
>>   
> 
> This looks a bit inconsistent: the scalar iterator requires
> TARGET_BF16_FP for bf16 modes, but the vector iterator doesn't.

Ah yes this was because I chose to put the TARGET_xx only on the 
define_expands. 
But all this has been removed for now (all the BFmode movs are unrestricted).

> 
>> @@ -160,6 +174,15 @@
>>   (define_mode_iterator VALL_F16 [V8QI V16QI V4HI V8HI V2SI V4SI V2DI
>>                              V4HF V8HF V2SF V4SF V2DF])
>>   
>> +;; All Advanced SIMD modes suitable for moving, loading, and storing,
>> +;; including special Bfloat vector types.
>> +(define_mode_iterator VALL_F16MOV [(V8QI "") (V16QI "") (V4HI "") (V8HI "")
>> +                               (V2SI "") (V4SI "") (V2DI "")
>> +                               (V4HF "") (V8HF "")
>> +                               (V4BF "TARGET_BF16_SIMD")
>> +                               (V8BF "TARGET_BF16_SIMD")
>> +                               (V2SF "") (V4SF "") (V2DF "")])
>> +
>>   ;; The VALL_F16 modes except the 128-bit 2-element ones.
>>   (define_mode_iterator VALL_F16_NO_V2Q [V8QI V16QI V4HI V8HI V2SI V4SI
>>                              V4HF V8HF V2SF V4SF])
> 
> whereas here we do check.  But that comes back to the (a)/(b) choice above

Agreed. Since we implemented (b) in this revision all these restrictions on the 
iterators have been removed.

.
> 
>> diff --git a/gcc/testsuite/gcc.target/aarch64/bfloat16_compile_1.c 
>> b/gcc/testsuite/gcc.target/aarch64/bfloat16_compile_1.c
>> new file mode 100644
>> index 00000000000..f2bef671deb
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/bfloat16_compile_1.c
>> @@ -0,0 +1,51 @@
>> +/* { dg-do assemble { target { aarch64*-*-* } } } */
>> +/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
>> +/* { dg-add-options arm_v8_2a_bf16_neon }  */
>> +/* { dg-additional-options "-O3 --save-temps" } */
>> +/* { dg-final { check-function-bodies "**" "" } } */
>> +
>> +#include <arm_neon.h>
>> +
>> +/*
>> +**stacktest1:
>> +**  ...
>> +**  str     h0, \[sp, [0-9]+\]
>> +**  ldr     h0, \[sp, [0-9]+\]
>> +**  ...
>> +**  ret
>> +*/
>> +bfloat16_t stacktest1 (bfloat16_t __a)
>> +{
>> +  volatile bfloat16_t b = __a;
>> +  return b;
>> +}
>> +
>> +/*
>> +**stacktest2:
>> +**  ...
>> +**  str     d0, \[sp, [0-9]+\]
>> +**  ldr     d0, \[sp, [0-9]+\]
>> +**  ...
>> +**  ret
>> +*/
>> +bfloat16x4_t stacktest2 (bfloat16x4_t __a)
>> +{
>> +  volatile bfloat16x4_t b = __a;
>> +  return b;
>> +}
>> +
>> +/*
>> +**stacktest3:
>> +**  ...
>> +**  str     q0, \[sp\]
>> +**  ldr     q0, \[sp\]
>> +**  ...
>> +**  ret
>> +*/
>> +bfloat16x8_t stacktest3 (bfloat16x8_t __a)
>> +{
>> +  volatile bfloat16x8_t b = __a;
>> +  return b;
>> +}
> 
> Might be a daft question, but why do we have an offset for the first
> two and not for the last one?  Might be worth hard-coding whatever
> offset we use.
> 
> If we use -fomit-frame-pointer then the whole function body should
> be stable: sub, str, ldr, add, ret.

Oh I don't know why to be honest, it just seemed to be how they were compiled.
In this case -fomit-frame-pointer doesn't seem to do anything to remove the 
offset (apparently this flag is enabled by default from -O, anyway).
So I've hard-coded the offset into the test for now. Lmk if this is what you 
meant!



> 
>> @@ -0,0 +1,21 @@
>> +/* { dg-do assemble { target { aarch64*-*-* } } } */
>> +/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
>> +/* { dg-add-options arm_v8_2a_bf16_neon }  */
>> +/* { dg-additional-options "-O3 --save-temps" } */
>> +/* { dg-final { check-function-bodies "**" "" } } */
>> +
>> +#include <arm_bf16.h>
>> +
>> +/*
>> +**stacktest1:
>> +**  ...
>> +**  str     h0, \[sp, [0-9]+\]
>> +**  ldr     h0, \[sp, [0-9]+\]
>> +**  ...
>> +**  ret
>> +*/
>> +bfloat16_t stacktest1 (bfloat16_t __a)
>> +{
>> +  volatile bfloat16_t b = __a;
>> +  return b;
>> +}
> 
> Same comment here.
Done
> 
>> diff --git a/gcc/testsuite/gcc.target/aarch64/bfloat16_compile_3.c 
>> b/gcc/testsuite/gcc.target/aarch64/bfloat16_compile_3.c
>> new file mode 100644
>> index 00000000000..9bcb53b32d8
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/bfloat16_compile_3.c
>> @@ -0,0 +1,25 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-march=armv8.2-a -O2" } */
>> +/* { dg-final { check-function-bodies "**" "" } } */
>> +
>> +#pragma GCC push_options
>> +#pragma GCC target ("+bf16")
>> +
>> +#include <arm_bf16.h>
>> +
>> +/*
>> +**stacktest1:
>> +**  ...
>> +**  str     h0, \[sp, [0-9]+\]
>> +**  ldr     h0, \[sp, [0-9]+\]
>> +**  ...
>> +**  ret
>> +*/
>> +bfloat16_t stacktest1 (bfloat16_t __a)
>> +{
>> +  volatile bfloat16_t b = __a;
>> +  return b;
>> +}
>> +
>> +#pragma GCC pop_options
> 
> Here too.  No real need for the push & pop, but keeping them is fine
> if that seems more obvious.
Same as above
Oh good to know! Yes I left the push & pop in, if only for the sake of clarity.

Cheers,
Stam
> 
> Thanks,
> Richard
>

diff --git a/gcc/config.gcc b/gcc/config.gcc
index c3d6464f3e6..075e46072d1 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -315,7 +315,7 @@ m32c*-*-*)
         ;;
 aarch64*-*-*)
 	cpu_type=aarch64
-	extra_headers="arm_fp16.h arm_neon.h arm_acle.h arm_sve.h"
+	extra_headers="arm_fp16.h arm_neon.h arm_bf16.h arm_acle.h arm_sve.h"
 	c_target_objs="aarch64-c.o"
 	cxx_target_objs="aarch64-c.o"
 	d_target_objs="aarch64-d.o"
diff --git a/gcc/config/aarch64/aarch64-builtins.c b/gcc/config/aarch64/aarch64-builtins.c
index 1bd2640a1ce..b2d6b761489 100644
--- a/gcc/config/aarch64/aarch64-builtins.c
+++ b/gcc/config/aarch64/aarch64-builtins.c
@@ -68,6 +68,9 @@
 #define hi_UP    E_HImode
 #define hf_UP    E_HFmode
 #define qi_UP    E_QImode
+#define bf_UP    E_BFmode
+#define v4bf_UP  E_V4BFmode
+#define v8bf_UP  E_V8BFmode
 #define UP(X) X##_UP
 
 #define SIMD_MAX_BUILTIN_ARGS 5
@@ -568,6 +571,10 @@ static tree aarch64_simd_intXI_type_node = NULL_TREE;
 tree aarch64_fp16_type_node = NULL_TREE;
 tree aarch64_fp16_ptr_type_node = NULL_TREE;
 
+/* Back-end node type for brain float (bfloat) types.  */
+tree aarch64_bf16_type_node = NULL_TREE;
+tree aarch64_bf16_ptr_type_node = NULL_TREE;
+
 /* Wrapper around add_builtin_function.  NAME is the name of the built-in
    function, TYPE is the function type, and CODE is the function subcode
    (relative to AARCH64_BUILTIN_GENERAL).  */
@@ -659,6 +666,8 @@ aarch64_simd_builtin_std_type (machine_mode mode,
       return float_type_node;
     case E_DFmode:
       return double_type_node;
+    case E_BFmode:
+      return aarch64_bf16_type_node;
     default:
       gcc_unreachable ();
     }
@@ -750,6 +759,10 @@ aarch64_init_simd_builtin_types (void)
   aarch64_simd_types[Float64x1_t].eltype = double_type_node;
   aarch64_simd_types[Float64x2_t].eltype = double_type_node;
 
+  /* Init Bfloat vector types with underlying __bf16 type.  */
+  aarch64_simd_types[Bfloat16x4_t].eltype = aarch64_bf16_type_node;
+  aarch64_simd_types[Bfloat16x8_t].eltype = aarch64_bf16_type_node;
+
   for (i = 0; i < nelts; i++)
     {
       tree eltype = aarch64_simd_types[i].eltype;
@@ -1059,6 +1072,19 @@ aarch64_init_fp16_types (void)
   aarch64_fp16_ptr_type_node = build_pointer_type (aarch64_fp16_type_node);
 }
 
+/* Initialize the backend REAL_TYPE type supporting bfloat types.  */
+static void
+aarch64_init_bf16_types (void)
+{
+  aarch64_bf16_type_node = make_node (REAL_TYPE);
+  TYPE_PRECISION (aarch64_bf16_type_node) = 16;
+  SET_TYPE_MODE (aarch64_bf16_type_node, BFmode);
+  layout_type (aarch64_bf16_type_node);
+
+  lang_hooks.types.register_builtin_type (aarch64_bf16_type_node, "__bf16");
+  aarch64_bf16_ptr_type_node = build_pointer_type (aarch64_bf16_type_node);
+}
+
 /* Pointer authentication builtins that will become NOP on legacy platform.
    Currently, these builtins are for internal use only (libgcc EH unwinder).  */
 
@@ -1214,6 +1240,8 @@ aarch64_general_init_builtins (void)
 
   aarch64_init_fp16_types ();
 
+  aarch64_init_bf16_types ();
+
   if (TARGET_SIMD)
     aarch64_init_simd_builtins ();
 
diff --git a/gcc/config/aarch64/aarch64-modes.def b/gcc/config/aarch64/aarch64-modes.def
index 6cd8ed0972a..1eeb8d88452 100644
--- a/gcc/config/aarch64/aarch64-modes.def
+++ b/gcc/config/aarch64/aarch64-modes.def
@@ -69,6 +69,13 @@ VECTOR_MODES (FLOAT, 16);     /*            V4SF V2DF.  */
 VECTOR_MODE (FLOAT, DF, 1);   /*                 V1DF.  */
 VECTOR_MODE (FLOAT, HF, 2);   /*                 V2HF.  */
 
+/* Bfloat16 modes.  */
+FLOAT_MODE (BF, 2, 0);
+ADJUST_FLOAT_FORMAT (BF, &arm_bfloat_half_format);
+
+VECTOR_MODE (FLOAT, BF, 4);   /*		 V4BF.  */
+VECTOR_MODE (FLOAT, BF, 8);   /*		 V8BF.  */
+
 /* Oct Int: 256-bit integer mode needed for 32-byte vector arguments.  */
 INT_MODE (OI, 32);
 
diff --git a/gcc/config/aarch64/aarch64-simd-builtin-types.def b/gcc/config/aarch64/aarch64-simd-builtin-types.def
index 76d4d130013..e885755bc92 100644
--- a/gcc/config/aarch64/aarch64-simd-builtin-types.def
+++ b/gcc/config/aarch64/aarch64-simd-builtin-types.def
@@ -50,3 +50,5 @@
   ENTRY (Float32x4_t, V4SF, none, 13)
   ENTRY (Float64x1_t, V1DF, none, 13)
   ENTRY (Float64x2_t, V2DF, none, 13)
+  ENTRY (Bfloat16x4_t, V4BF, none, 14)
+  ENTRY (Bfloat16x8_t, V8BF, none, 14)
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 4e28cf97516..cea9592695a 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -19,8 +19,8 @@
 ;; <http://www.gnu.org/licenses/>.
 
 (define_expand "mov<mode>"
-  [(set (match_operand:VALL_F16 0 "nonimmediate_operand")
-	(match_operand:VALL_F16 1 "general_operand"))]
+  [(set (match_operand:VALL_F16MOV 0 "nonimmediate_operand")
+	(match_operand:VALL_F16MOV 1 "general_operand"))]
   "TARGET_SIMD"
   "
   /* Force the operand into a register if it is not an
@@ -101,10 +101,10 @@
   [(set_attr "type" "neon_dup<q>")]
 )
 
-(define_insn "*aarch64_simd_mov<VD:mode>"
-  [(set (match_operand:VD 0 "nonimmediate_operand"
+(define_insn "*aarch64_simd_mov<VDMOV:mode>"
+  [(set (match_operand:VDMOV 0 "nonimmediate_operand"
 		"=w, m,  m,  w, ?r, ?w, ?r, w")
-	(match_operand:VD 1 "general_operand"
+	(match_operand:VDMOV 1 "general_operand"
 		"m,  Dz, w,  w,  w,  r,  r, Dn"))]
   "TARGET_SIMD
    && (register_operand (operands[0], <MODE>mode)
@@ -129,10 +129,10 @@
 		     mov_reg, neon_move<q>")]
 )
 
-(define_insn "*aarch64_simd_mov<VQ:mode>"
-  [(set (match_operand:VQ 0 "nonimmediate_operand"
+(define_insn "*aarch64_simd_mov<VQMOV:mode>"
+  [(set (match_operand:VQMOV 0 "nonimmediate_operand"
 		"=w, Umn,  m,  w, ?r, ?w, ?r, w")
-	(match_operand:VQ 1 "general_operand"
+	(match_operand:VQMOV 1 "general_operand"
 		"m,  Dz, w,  w,  w,  r,  r, Dn"))]
   "TARGET_SIMD
    && (register_operand (operands[0], <MODE>mode)
@@ -234,8 +234,8 @@
 
 
 (define_split
-  [(set (match_operand:VQ 0 "register_operand" "")
-      (match_operand:VQ 1 "register_operand" ""))]
+  [(set (match_operand:VQMOV 0 "register_operand" "")
+      (match_operand:VQMOV 1 "register_operand" ""))]
   "TARGET_SIMD && reload_completed
    && GP_REGNUM_P (REGNO (operands[0]))
    && GP_REGNUM_P (REGNO (operands[1]))"
@@ -246,8 +246,8 @@
 })
 
 (define_split
-  [(set (match_operand:VQ 0 "register_operand" "")
-        (match_operand:VQ 1 "register_operand" ""))]
+  [(set (match_operand:VQMOV 0 "register_operand" "")
+        (match_operand:VQMOV 1 "register_operand" ""))]
   "TARGET_SIMD && reload_completed
    && ((FP_REGNUM_P (REGNO (operands[0])) && GP_REGNUM_P (REGNO (operands[1])))
        || (GP_REGNUM_P (REGNO (operands[0])) && FP_REGNUM_P (REGNO (operands[1]))))"
@@ -258,8 +258,8 @@
 })
 
 (define_expand "@aarch64_split_simd_mov<mode>"
-  [(set (match_operand:VQ 0)
-        (match_operand:VQ 1))]
+  [(set (match_operand:VQMOV 0)
+        (match_operand:VQMOV 1))]
   "TARGET_SIMD"
   {
     rtx dst = operands[0];
@@ -295,8 +295,8 @@
 (define_insn "aarch64_simd_mov_from_<mode>low"
   [(set (match_operand:<VHALF> 0 "register_operand" "=r")
         (vec_select:<VHALF>
-          (match_operand:VQ 1 "register_operand" "w")
-          (match_operand:VQ 2 "vect_par_cnst_lo_half" "")))]
+          (match_operand:VQMOV 1 "register_operand" "w")
+          (match_operand:VQMOV 2 "vect_par_cnst_lo_half" "")))]
   "TARGET_SIMD && reload_completed"
   "umov\t%0, %1.d[0]"
   [(set_attr "type" "neon_to_gp<q>")
@@ -306,8 +306,8 @@
 (define_insn "aarch64_simd_mov_from_<mode>high"
   [(set (match_operand:<VHALF> 0 "register_operand" "=r")
         (vec_select:<VHALF>
-          (match_operand:VQ 1 "register_operand" "w")
-          (match_operand:VQ 2 "vect_par_cnst_hi_half" "")))]
+          (match_operand:VQMOV 1 "register_operand" "w")
+          (match_operand:VQMOV 2 "vect_par_cnst_hi_half" "")))]
   "TARGET_SIMD && reload_completed"
   "umov\t%0, %1.d[1]"
   [(set_attr "type" "neon_to_gp<q>")
@@ -1471,8 +1471,8 @@
 ;; On big-endian this is { zeroes, operand }
 
 (define_insn "move_lo_quad_internal_<mode>"
-  [(set (match_operand:VQ_NO2E 0 "register_operand" "=w,w,w")
-	(vec_concat:VQ_NO2E
+  [(set (match_operand:VQMOV_NO2E 0 "register_operand" "=w,w,w")
+	(vec_concat:VQMOV_NO2E
 	  (match_operand:<VHALF> 1 "register_operand" "w,r,r")
 	  (vec_duplicate:<VHALF> (const_int 0))))]
   "TARGET_SIMD && !BYTES_BIG_ENDIAN"
@@ -1501,8 +1501,8 @@
 )
 
 (define_insn "move_lo_quad_internal_be_<mode>"
-  [(set (match_operand:VQ_NO2E 0 "register_operand" "=w,w,w")
-	(vec_concat:VQ_NO2E
+  [(set (match_operand:VQMOV_NO2E 0 "register_operand" "=w,w,w")
+	(vec_concat:VQMOV_NO2E
 	  (vec_duplicate:<VHALF> (const_int 0))
 	  (match_operand:<VHALF> 1 "register_operand" "w,r,r")))]
   "TARGET_SIMD && BYTES_BIG_ENDIAN"
@@ -1531,8 +1531,8 @@
 )
 
 (define_expand "move_lo_quad_<mode>"
-  [(match_operand:VQ 0 "register_operand")
-   (match_operand:VQ 1 "register_operand")]
+  [(match_operand:VQMOV 0 "register_operand")
+   (match_operand:VQMOV 1 "register_operand")]
   "TARGET_SIMD"
 {
   if (BYTES_BIG_ENDIAN)
@@ -1549,11 +1549,11 @@
 ;; For big-endian this is { operand1, operand2 }
 
 (define_insn "aarch64_simd_move_hi_quad_<mode>"
-  [(set (match_operand:VQ 0 "register_operand" "+w,w")
-        (vec_concat:VQ
+  [(set (match_operand:VQMOV 0 "register_operand" "+w,w")
+        (vec_concat:VQMOV
           (vec_select:<VHALF>
                 (match_dup 0)
-                (match_operand:VQ 2 "vect_par_cnst_lo_half" ""))
+                (match_operand:VQMOV 2 "vect_par_cnst_lo_half" ""))
 	  (match_operand:<VHALF> 1 "register_operand" "w,r")))]
   "TARGET_SIMD && !BYTES_BIG_ENDIAN"
   "@
@@ -1563,12 +1563,12 @@
 )
 
 (define_insn "aarch64_simd_move_hi_quad_be_<mode>"
-  [(set (match_operand:VQ 0 "register_operand" "+w,w")
-        (vec_concat:VQ
+  [(set (match_operand:VQMOV 0 "register_operand" "+w,w")
+        (vec_concat:VQMOV
 	  (match_operand:<VHALF> 1 "register_operand" "w,r")
           (vec_select:<VHALF>
                 (match_dup 0)
-                (match_operand:VQ 2 "vect_par_cnst_lo_half" ""))))]
+                (match_operand:VQMOV 2 "vect_par_cnst_lo_half" ""))))]
   "TARGET_SIMD && BYTES_BIG_ENDIAN"
   "@
    ins\\t%0.d[1], %1.d[0]
@@ -1577,7 +1577,7 @@
 )
 
 (define_expand "move_hi_quad_<mode>"
- [(match_operand:VQ 0 "register_operand")
+ [(match_operand:VQMOV 0 "register_operand")
   (match_operand:<VHALF> 1 "register_operand")]
  "TARGET_SIMD"
 {
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 85cadef1be8..ddf5a84a3b5 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -1692,6 +1692,7 @@ aarch64_classify_vector_mode (machine_mode mode)
     case E_V2SImode:
     /* ...E_V1DImode doesn't exist.  */
     case E_V4HFmode:
+    case E_V4BFmode:
     case E_V2SFmode:
     case E_V1DFmode:
     /* 128-bit Advanced SIMD vectors.  */
@@ -1700,6 +1701,7 @@ aarch64_classify_vector_mode (machine_mode mode)
     case E_V4SImode:
     case E_V2DImode:
     case E_V8HFmode:
+    case E_V8BFmode:
     case E_V4SFmode:
     case E_V2DFmode:
       return TARGET_SIMD ? VEC_ADVSIMD : 0;
@@ -15603,6 +15605,10 @@ aarch64_gimplify_va_arg_expr (tree valist, tree type, gimple_seq *pre_p,
 	  field_t = aarch64_fp16_type_node;
 	  field_ptr_t = aarch64_fp16_ptr_type_node;
 	  break;
+	case E_BFmode:
+	  field_t = aarch64_bf16_type_node;
+	  field_ptr_t = aarch64_bf16_ptr_type_node;
+	  break;
 	case E_V2SImode:
 	case E_V4SImode:
 	    {
@@ -16116,6 +16122,8 @@ aarch64_vq_mode (scalar_mode mode)
       return V4SFmode;
     case E_HFmode:
       return V8HFmode;
+    case E_BFmode:
+      return V8BFmode;
     case E_SImode:
       return V4SImode;
     case E_HImode:
@@ -16149,6 +16157,8 @@ aarch64_simd_container_mode (scalar_mode mode, poly_int64 width)
 	    return V2SFmode;
 	  case E_HFmode:
 	    return V4HFmode;
+	  case E_BFmode:
+	    return V4BFmode;
 	  case E_SImode:
 	    return V2SImode;
 	  case E_HImode:
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 04dabd46437..b0492205610 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -1120,13 +1120,13 @@ extern enum aarch64_code_model aarch64_cmodel;
 #define AARCH64_VALID_SIMD_DREG_MODE(MODE) \
   ((MODE) == V2SImode || (MODE) == V4HImode || (MODE) == V8QImode \
    || (MODE) == V2SFmode || (MODE) == V4HFmode || (MODE) == DImode \
-   || (MODE) == DFmode)
+   || (MODE) == DFmode || (MODE) == V4BFmode)
 
 /* Modes valid for AdvSIMD Q registers.  */
 #define AARCH64_VALID_SIMD_QREG_MODE(MODE) \
   ((MODE) == V4SImode || (MODE) == V8HImode || (MODE) == V16QImode \
    || (MODE) == V4SFmode || (MODE) == V8HFmode || (MODE) == V2DImode \
-   || (MODE) == V2DFmode)
+   || (MODE) == V2DFmode || (MODE) == V8BFmode)
 
 #define ENDIAN_LANE_N(NUNITS, N) \
   (BYTES_BIG_ENDIAN ? NUNITS - 1 - N : N)
@@ -1174,6 +1174,11 @@ extern const char *host_detect_local_cpu (int argc, const char **argv);
 extern tree aarch64_fp16_type_node;
 extern tree aarch64_fp16_ptr_type_node;
 
+/* This type is the user-visible __bf16, and a pointer to that type.  Defined
+   in aarch64-builtins.c.  */
+extern tree aarch64_bf16_type_node;
+extern tree aarch64_bf16_ptr_type_node;
+
 /* The generic unwind code in libgcc does not initialize the frame pointer.
    So in order to unwind a function using a frame pointer, the very first
    function that is unwound must save the frame pointer.  That way the frame
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 34cb99e2897..85106910f74 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -1304,8 +1304,8 @@
 })
 
 (define_expand "mov<mode>"
-  [(set (match_operand:GPF_TF_F16 0 "nonimmediate_operand")
-	(match_operand:GPF_TF_F16 1 "general_operand"))]
+  [(set (match_operand:GPF_TF_F16_MOV 0 "nonimmediate_operand")
+	(match_operand:GPF_TF_F16_MOV 1 "general_operand"))]
   ""
   {
     if (!TARGET_FLOAT)
@@ -1321,11 +1321,11 @@
   }
 )
 
-(define_insn "*movhf_aarch64"
-  [(set (match_operand:HF 0 "nonimmediate_operand" "=w,w  , w,?r,w,w  ,w  ,w,m,r,m ,r")
-	(match_operand:HF 1 "general_operand"      "Y ,?rY,?r, w,w,Ufc,Uvi,m,w,m,rY,r"))]
-  "TARGET_FLOAT && (register_operand (operands[0], HFmode)
-    || aarch64_reg_or_fp_zero (operands[1], HFmode))"
+(define_insn "*mov<mode>_aarch64"
+  [(set (match_operand:HFBF 0 "nonimmediate_operand" "=w,w  , w,?r,w,w  ,w  ,w,m,r,m ,r")
+	(match_operand:HFBF 1 "general_operand"      "Y ,?rY,?r, w,w,Ufc,Uvi,m,w,m,rY,r"))]
+  "TARGET_FLOAT && (register_operand (operands[0], <MODE>mode)
+    || aarch64_reg_or_fp_zero (operands[1], <MODE>mode))"
   "@
    movi\\t%0.4h, #0
    fmov\\t%h0, %w1
diff --git a/gcc/config/aarch64/arm_bf16.h b/gcc/config/aarch64/arm_bf16.h
new file mode 100644
index 00000000000..884b6f3bc7a
--- /dev/null
+++ b/gcc/config/aarch64/arm_bf16.h
@@ -0,0 +1,32 @@
+/* Arm BF16 instrinsics include file.
+
+   Copyright (C) 2019 Free Software Foundation, Inc.
+   Contributed by Arm.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published
+   by the Free Software Foundation; either version 3, or (at your
+   option) any later version.
+
+   GCC is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
+   License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _AARCH64_BF16_H_
+#define _AARCH64_BF16_H_
+
+typedef __bf16 bfloat16_t;
+
+#endif
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index c7425346b86..eaba156e26c 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -73,6 +73,9 @@ typedef __fp16 float16_t;
 typedef float float32_t;
 typedef double float64_t;
 
+typedef __Bfloat16x4_t bfloat16x4_t;
+typedef __Bfloat16x8_t bfloat16x8_t;
+
 typedef struct int8x8x2_t
 {
   int8x8_t val[2];
@@ -34606,6 +34609,8 @@ vrnd64xq_f64 (float64x2_t __a)
 
 #pragma GCC pop_options
 
+#include "arm_bf16.h"
+
 #undef __aarch64_vget_lane_any
 
 #undef __aarch64_vdup_lane_any
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index e5fa31f6748..9fd05abe3b6 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -57,9 +57,16 @@
 ;; Iterator for all scalar floating point modes (HF, SF, DF)
 (define_mode_iterator GPF_HF [HF SF DF])
 
+;; Iterator for all 16-bit scalar floating point modes (HF, BF)
+(define_mode_iterator HFBF [HF BF])
+
 ;; Iterator for all scalar floating point modes (HF, SF, DF and TF)
 (define_mode_iterator GPF_TF_F16 [HF SF DF TF])
 
+;; Iterator for all scalar floating point modes suitable for moving, including
+;; special BF type (HF, SF, DF, TF and BF)
+(define_mode_iterator GPF_TF_F16_MOV [HF BF SF DF TF])
+
 ;; Double vector modes.
 (define_mode_iterator VDF [V2SF V4HF])
 
@@ -79,6 +86,9 @@
 ;; Double vector modes.
 (define_mode_iterator VD [V8QI V4HI V4HF V2SI V2SF])
 
+;; Double vector modes suitable for moving.  Includes BFmode.
+(define_mode_iterator VDMOV [V8QI V4HI V4HF V4BF V2SI V2SF])
+
 ;; All modes stored in registers d0-d31.
 (define_mode_iterator DREG [V8QI V4HI V4HF V2SI V2SF DF])
 
@@ -97,6 +107,12 @@
 ;; Copy of the above.
 (define_mode_iterator VQ2 [V16QI V8HI V4SI V2DI V8HF V4SF V2DF])
 
+;; Quad vector modes suitable for moving.  Includes BFmode.
+(define_mode_iterator VQMOV [V16QI V8HI V4SI V2DI V8HF V8BF V4SF V2DF])
+
+;; Quad vector modes suitable for moving.  Includes BFmode.
+(define_mode_iterator VQMOV_NO2E [V16QI V8HI V4SI V8HF V8BF V4SF])
+
 ;; Quad integer vector modes.
 (define_mode_iterator VQ_I [V16QI V8HI V4SI V2DI])
 
@@ -160,6 +176,11 @@
 (define_mode_iterator VALL_F16 [V8QI V16QI V4HI V8HI V2SI V4SI V2DI
 				V4HF V8HF V2SF V4SF V2DF])
 
+;; All Advanced SIMD modes suitable for moving, loading, and storing,
+;; including special Bfloat vector types.
+(define_mode_iterator VALL_F16MOV [V8QI V16QI V4HI V8HI V2SI V4SI V2DI
+				V4HF V8HF V4BF V8BF V2SF V4SF V2DF])
+
 ;; The VALL_F16 modes except the 128-bit 2-element ones.
 (define_mode_iterator VALL_F16_NO_V2Q [V8QI V16QI V4HI V8HI V2SI V4SI
 				V4HF V8HF V2SF V4SF])
@@ -226,6 +247,9 @@
 ;; Advanced SIMD modes for Q and H types.
 (define_mode_iterator VDQQH [V8QI V16QI V4HI V8HI])
 
+;; Advanced SIMD modes for BF vector types.
+(define_mode_iterator VBF [V4BF V8BF])
+
 ;; Advanced SIMD modes for H and S types.
 (define_mode_iterator VDQHS [V4HI V8HI V2SI V4SI])
 
@@ -745,6 +769,7 @@
 			  (V2SI "2") (V4SI "4")
 				     (V2DI "2")
 			  (V4HF "4") (V8HF "8")
+			  (V4BF "4") (V8BF "8")
 			  (V2SF "2") (V4SF "4")
 			  (V1DF "1") (V2DF "2")
 			  (DI "1") (DF "1")])
@@ -885,7 +910,8 @@
 			  (V8HF "16b") (V2SF  "8b")
 			  (V4SF "16b") (V2DF  "16b")
 			  (DI   "8b")  (DF    "8b")
-			  (SI   "8b")  (SF    "8b")])
+			  (SI   "8b")  (SF    "8b")
+			  (V4BF "8b")  (V8BF  "16b")])
 
 ;; Define element mode for each vector mode.
 (define_mode_attr VEL [(V8QI  "QI") (V16QI "QI")
@@ -965,12 +991,13 @@
 			 (V2SI "SI")    (V4SI  "V2SI")
 			 (V2DI "DI")    (V2SF  "SF")
 			 (V4SF "V2SF")  (V4HF "V2HF")
-			 (V8HF "V4HF")  (V2DF  "DF")])
+			 (V8HF "V4HF")  (V2DF  "DF")
+			 (V8BF "V4BF")])
 
 ;; Half modes of all vector modes, in lower-case.
 (define_mode_attr Vhalf [(V8QI "v4qi")  (V16QI "v8qi")
 			 (V4HI "v2hi")  (V8HI  "v4hi")
-			 (V8HF  "v4hf")
+			 (V8HF  "v4hf") (V8BF  "v4bf")
 			 (V2SI "si")    (V4SI  "v2si")
 			 (V2DI "di")    (V2SF  "sf")
 			 (V4SF "v2sf")  (V2DF  "df")])
@@ -1265,6 +1292,7 @@
 		     (V2SI "") (V4SI  "_q")
 		     (DI   "") (V2DI  "_q")
 		     (V4HF "") (V8HF "_q")
+		     (V4BF "") (V8BF "_q")
 		     (V2SF "") (V4SF  "_q")
 			       (V2DF  "_q")
 		     (QI "") (HI "") (SI "") (DI "") (HF "") (SF "") (DF "")])
diff --git a/gcc/testsuite/gcc.target/aarch64/bfloat16_scalar_compile_1.c b/gcc/testsuite/gcc.target/aarch64/bfloat16_scalar_compile_1.c
new file mode 100644
index 00000000000..5186d0e3d24
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/bfloat16_scalar_compile_1.c
@@ -0,0 +1,118 @@
+/* { dg-do assemble { target { aarch64*-*-* } } } */
+/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
+/* { dg-add-options arm_v8_2a_bf16_neon }  */
+/* { dg-additional-options "-O3 --save-temps -std=gnu90" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#include <arm_bf16.h>
+
+/*
+**stacktest1:
+**	...
+**	str	h0, \[sp, 14\]
+**	ldr	h0, \[sp, 14\]
+**	...
+**	ret
+*/
+bfloat16_t stacktest1 (bfloat16_t __a)
+{
+  volatile bfloat16_t b = __a;
+  return b;
+}
+
+/*
+**bfloat_mov_ww:
+**	...
+**	mov	v1.h\[0\], v2.h\[0\]
+**	...
+**	ret
+*/
+void bfloat_mov_ww (void)
+{
+  register bfloat16_t x asm ("h2");
+  register bfloat16_t y asm ("h1");
+  asm volatile ("#foo" : "=w" (x));
+  y = x;
+  asm volatile ("#foo" :: "w" (y));
+}
+
+/*
+**bfloat_mov_rw:
+**	...
+**	dup	v1.4h, w1
+**	...
+**	ret
+*/
+void bfloat_mov_rw (void)
+{
+  register bfloat16_t x asm ("w1");
+  register bfloat16_t y asm ("h1");
+  asm volatile ("#foo" : "=r" (x));
+  y = x;
+  asm volatile ("#foo" :: "w" (y));
+}
+
+/*
+**bfloat_mov_wr:
+**	...
+**	umov	w1, v1.h\[0\]
+**	...
+**	ret
+*/
+void bfloat_mov_wr (void)
+{
+  register bfloat16_t x asm ("h1");
+  register bfloat16_t y asm ("w1");
+  asm volatile ("#foo" : "=w" (x));
+  y = x;
+  asm volatile ("#foo" :: "r" (y));
+}
+
+/*
+**bfloat_mov_rr:
+**	...
+**	mov	w1, w2
+**	...
+**	ret
+*/
+void bfloat_mov_rr (void)
+{
+  register bfloat16_t x asm ("w2");
+  register bfloat16_t y asm ("w1");
+  asm volatile ("#foo" : "=r" (x));
+  y = x;
+  asm volatile ("#foo" :: "r" (y));
+}
+
+/*
+**bfloat_mov_rm:
+**	...
+**	strh	w2, \[sp, 14\]
+**	...
+**	ret
+*/
+void bfloat_mov_rm (void)
+{
+  register bfloat16_t x asm ("w2");
+  volatile bfloat16_t y;
+  asm volatile ("#foo" : "=r" (x));
+  y = x;
+  asm volatile ("#foo" : : : "memory");
+}
+
+/*
+**bfloat_mov_mr:
+**	...
+**	ldrh	w2, \[sp, 14\]
+**	...
+**	ret
+*/
+void bfloat_mov_mr (void)
+{
+  volatile bfloat16_t x;
+  register bfloat16_t y asm ("w2");
+  asm volatile ("#foo" : : : "memory");
+  y = x;
+  asm volatile ("#foo" :: "r" (y));
+}
+
diff --git a/gcc/testsuite/gcc.target/aarch64/bfloat16_scalar_compile_2.c b/gcc/testsuite/gcc.target/aarch64/bfloat16_scalar_compile_2.c
new file mode 100644
index 00000000000..02656d32f14
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/bfloat16_scalar_compile_2.c
@@ -0,0 +1,122 @@
+/* { dg-do assemble { target { aarch64*-*-* } } } */
+/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
+/* { dg-additional-options "-march=armv8.2-a -O3 --save-temps -std=gnu90" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#pragma GCC push_options
+#pragma GCC target ("+bf16")
+
+#include <arm_bf16.h>
+
+/*
+**stacktest1:
+**	...
+**	str	h0, \[sp, 14\]
+**	ldr	h0, \[sp, 14\]
+**	...
+**	ret
+*/
+bfloat16_t stacktest1 (bfloat16_t __a)
+{
+  volatile bfloat16_t b = __a;
+  return b;
+}
+
+/*
+**bfloat_mov_ww:
+**	...
+**	mov	v1.h\[0\], v2.h\[0\]
+**	...
+**	ret
+*/
+void bfloat_mov_ww (void)
+{
+  register bfloat16_t x asm ("h2");
+  register bfloat16_t y asm ("h1");
+  asm volatile ("#foo" : "=w" (x));
+  y = x;
+  asm volatile ("#foo" :: "w" (y));
+}
+
+/*
+**bfloat_mov_rw:
+**	...
+**	dup	v1.4h, w1
+**	...
+**	ret
+*/
+void bfloat_mov_rw (void)
+{
+  register bfloat16_t x asm ("w1");
+  register bfloat16_t y asm ("h1");
+  asm volatile ("#foo" : "=r" (x));
+  y = x;
+  asm volatile ("#foo" :: "w" (y));
+}
+
+/*
+**bfloat_mov_wr:
+**	...
+**	umov	w1, v1.h\[0\]
+**	...
+**	ret
+*/
+void bfloat_mov_wr (void)
+{
+  register bfloat16_t x asm ("h1");
+  register bfloat16_t y asm ("w1");
+  asm volatile ("#foo" : "=w" (x));
+  y = x;
+  asm volatile ("#foo" :: "r" (y));
+}
+
+/*
+**bfloat_mov_rr:
+**	...
+**	mov	w1, w2
+**	...
+**	ret
+*/
+void bfloat_mov_rr (void)
+{
+  register bfloat16_t x asm ("w2");
+  register bfloat16_t y asm ("w1");
+  asm volatile ("#foo" : "=r" (x));
+  y = x;
+  asm volatile ("#foo" :: "r" (y));
+}
+
+/*
+**bfloat_mov_rm:
+**	...
+**	strh	w2, \[sp, 14\]
+**	...
+**	ret
+*/
+void bfloat_mov_rm (void)
+{
+  register bfloat16_t x asm ("w2");
+  volatile bfloat16_t y;
+  asm volatile ("#foo" : "=r" (x));
+  y = x;
+  asm volatile ("#foo" : : : "memory");
+}
+
+/*
+**bfloat_mov_mr:
+**	...
+**	ldrh	w2, \[sp, 14\]
+**	...
+**	ret
+*/
+void bfloat_mov_mr (void)
+{
+  volatile bfloat16_t x;
+  register bfloat16_t y asm ("w2");
+  asm volatile ("#foo" : : : "memory");
+  y = x;
+  asm volatile ("#foo" :: "r" (y));
+}
+
+#pragma GCC pop_options
+
diff --git a/gcc/testsuite/gcc.target/aarch64/bfloat16_scalar_compile_3.c b/gcc/testsuite/gcc.target/aarch64/bfloat16_scalar_compile_3.c
new file mode 100644
index 00000000000..6170c85b196
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/bfloat16_scalar_compile_3.c
@@ -0,0 +1,116 @@
+/* { dg-do assemble { target { aarch64*-*-* } } } */
+/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
+/* { dg-additional-options "-march=armv8.2-a -O3 --save-temps -std=gnu90" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#include <arm_bf16.h>
+
+/*
+**stacktest1:
+**	...
+**	str	h0, \[sp, 14\]
+**	ldr	h0, \[sp, 14\]
+**	...
+**	ret
+*/
+bfloat16_t stacktest1 (bfloat16_t __a)
+{
+  volatile bfloat16_t b = __a;
+  return b;
+}
+
+/*
+**bfloat_mov_ww:
+**	...
+**	mov	v1.h\[0\], v2.h\[0\]
+**	...
+**	ret
+*/
+void bfloat_mov_ww (void)
+{
+  register bfloat16_t x asm ("h2");
+  register bfloat16_t y asm ("h1");
+  asm volatile ("#foo" : "=w" (x));
+  y = x;
+  asm volatile ("#foo" :: "w" (y));
+}
+
+/*
+**bfloat_mov_rw:
+**	...
+**	dup	v1.4h, w1
+**	...
+**	ret
+*/
+void bfloat_mov_rw (void)
+{
+  register bfloat16_t x asm ("w1");
+  register bfloat16_t y asm ("h1");
+  asm volatile ("#foo" : "=r" (x));
+  y = x;
+  asm volatile ("#foo" :: "w" (y));
+}
+
+/*
+**bfloat_mov_wr:
+**	...
+**	umov	w1, v1.h\[0\]
+**	...
+**	ret
+*/
+void bfloat_mov_wr (void)
+{
+  register bfloat16_t x asm ("h1");
+  register bfloat16_t y asm ("w1");
+  asm volatile ("#foo" : "=w" (x));
+  y = x;
+  asm volatile ("#foo" :: "r" (y));
+}
+
+/*
+**bfloat_mov_rr:
+**	...
+**	mov	w1, w2
+**	...
+**	ret
+*/
+void bfloat_mov_rr (void)
+{
+  register bfloat16_t x asm ("w2");
+  register bfloat16_t y asm ("w1");
+  asm volatile ("#foo" : "=r" (x));
+  y = x;
+  asm volatile ("#foo" :: "r" (y));
+}
+
+/*
+**bfloat_mov_rm:
+**	...
+**	strh	w2, \[sp, 14\]
+**	...
+**	ret
+*/
+void bfloat_mov_rm (void)
+{
+  register bfloat16_t x asm ("w2");
+  volatile bfloat16_t y;
+  asm volatile ("#foo" : "=r" (x));
+  y = x;
+  asm volatile ("#foo" : : : "memory");
+}
+
+/*
+**bfloat_mov_mr:
+**	...
+**	ldrh	w2, \[sp, 14\]
+**	...
+**	ret
+*/
+void bfloat_mov_mr (void)
+{
+  volatile bfloat16_t x;
+  register bfloat16_t y asm ("w2");
+  asm volatile ("#foo" : : : "memory");
+  y = x;
+  asm volatile ("#foo" :: "r" (y));
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/bfloat16_scalar_compile_4.c b/gcc/testsuite/gcc.target/aarch64/bfloat16_scalar_compile_4.c
new file mode 100644
index 00000000000..b812011c223
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/bfloat16_scalar_compile_4.c
@@ -0,0 +1,16 @@
+/* { dg-do assemble { target { aarch64*-*-* } } } */
+/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
+/* { dg-add-options arm_v8_2a_bf16_neon }  */
+/* { dg-additional-options "-std=c99 -pedantic-errors -O3 --save-temps" } */
+
+#include <arm_bf16.h>
+
+_Complex bfloat16_t stacktest1 (_Complex bfloat16_t __a)
+{
+  volatile _Complex bfloat16_t b = __a;
+  return b;
+}
+
+/* { dg-error {ISO C does not support plain 'complex' meaning 'double complex'} "" { target *-*-* } 8 } */
+/* { dg-error {expected '=', ',', ';', 'asm' or '__attribute__' before 'stacktest1'} "" { target *-*-* } 8 } */
+
diff --git a/gcc/testsuite/gcc.target/aarch64/bfloat16_simd_compile_1.c b/gcc/testsuite/gcc.target/aarch64/bfloat16_simd_compile_1.c
new file mode 100644
index 00000000000..1db85fb9ba0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/bfloat16_simd_compile_1.c
@@ -0,0 +1,93 @@
+/* { dg-do assemble { target { aarch64*-*-* } } } */
+/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
+/* { dg-add-options arm_v8_2a_bf16_neon }  */
+/* { dg-additional-options "-O3 --save-temps" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#include <arm_neon.h>
+
+/*
+**stacktest1:
+**	sub.*
+**	str	h0, \[sp, 14\]
+**	ldr	h0, \[sp, 14\]
+**	add.*
+**	ret
+*/
+bfloat16_t stacktest1 (bfloat16_t __a)
+{
+  volatile bfloat16_t b = __a;
+  return b;
+}
+
+/*
+**stacktest2:
+**	sub.*
+**	str	d0, \[sp, 8\]
+**	ldr	d0, \[sp, 8\]
+**	add.*
+**	ret
+*/
+bfloat16x4_t stacktest2 (bfloat16x4_t __a)
+{
+  volatile bfloat16x4_t b = __a;
+  return b;
+}
+
+/*
+**stacktest3:
+**	sub.*
+**	str	q0, \[sp\]
+**	ldr	q0, \[sp\]
+**	add.*
+**	ret
+*/
+bfloat16x8_t stacktest3 (bfloat16x8_t __a)
+{
+  volatile bfloat16x8_t b = __a;
+  return b;
+}
+
+/* Test compilation of __attribute__ vectors of 8, 16, 32, etc. BFloats.  */
+typedef bfloat16_t v8bf __attribute__((vector_size(16)));
+typedef bfloat16_t v16bf __attribute__((vector_size(32)));
+typedef bfloat16_t v32bf __attribute__((vector_size(64)));
+typedef bfloat16_t v64bf __attribute__((vector_size(128)));
+typedef bfloat16_t v128bf __attribute__((vector_size(256)));
+
+v8bf stacktest4 (v8bf __a)
+{
+  volatile v8bf b = __a;
+  return b;
+}
+
+v16bf stacktest5 (v16bf __a)
+{
+  volatile v16bf b = __a;
+  return b;
+}
+
+v32bf stacktest6 (v32bf __a)
+{
+  volatile v32bf b = __a;
+  return b;
+}
+
+v64bf stacktest7 (v64bf __a)
+{
+  volatile v64bf b = __a;
+  return b;
+}
+
+v128bf stacktest8 (v128bf __a)
+{
+  volatile v128bf b = __a;
+  return b;
+}
+
+/* Test use of constant values to assign values to vectors.  */
+
+typedef bfloat16_t v2bf __attribute__((vector_size(4)));
+v2bf c2 (void) { return (v2bf) 0x12345678; }
+
+bfloat16x4_t c3 (void) { return (bfloat16x4_t) 0x1234567812345678; }
diff --git a/gcc/testsuite/gcc.target/aarch64/bfloat16_simd_compile_2.c b/gcc/testsuite/gcc.target/aarch64/bfloat16_simd_compile_2.c
new file mode 100644
index 00000000000..660a02d0f03
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/bfloat16_simd_compile_2.c
@@ -0,0 +1,97 @@
+/* { dg-do assemble { target { aarch64*-*-* } } } */
+/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
+/* { dg-additional-options "-march=armv8.2-a -O3 --save-temps" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#include <arm_neon.h>
+
+#pragma GCC push_options
+#pragma GCC target ("+bf16")
+
+/*
+**stacktest1:
+**	sub.*
+**	str	h0, \[sp, 14\]
+**	ldr	h0, \[sp, 14\]
+**	add.*
+**	ret
+*/
+bfloat16_t stacktest1 (bfloat16_t __a)
+{
+  volatile bfloat16_t b = __a;
+  return b;
+}
+
+/*
+**stacktest2:
+**	sub.*
+**	str	d0, \[sp, 8\]
+**	ldr	d0, \[sp, 8\]
+**	add.*
+**	ret
+*/
+bfloat16x4_t stacktest2 (bfloat16x4_t __a)
+{
+  volatile bfloat16x4_t b = __a;
+  return b;
+}
+
+/*
+**stacktest3:
+**	sub.*
+**	str	q0, \[sp\]
+**	ldr	q0, \[sp\]
+**	add.*
+**	ret
+*/
+bfloat16x8_t stacktest3 (bfloat16x8_t __a)
+{
+  volatile bfloat16x8_t b = __a;
+  return b;
+}
+
+/*  Test compilation of __attribute__ vectors of 8, 16, 32, etc. BFloats.  */
+typedef bfloat16_t v8bf __attribute__((vector_size(16)));
+typedef bfloat16_t v16bf __attribute__((vector_size(32)));
+typedef bfloat16_t v32bf __attribute__((vector_size(64)));
+typedef bfloat16_t v64bf __attribute__((vector_size(128)));
+typedef bfloat16_t v128bf __attribute__((vector_size(256)));
+
+v8bf stacktest4 (v8bf __a)
+{
+  volatile v8bf b = __a;
+  return b;
+}
+
+v16bf stacktest5 (v16bf __a)
+{
+  volatile v16bf b = __a;
+  return b;
+}
+
+v32bf stacktest6 (v32bf __a)
+{
+  volatile v32bf b = __a;
+  return b;
+}
+
+v64bf stacktest7 (v64bf __a)
+{
+  volatile v64bf b = __a;
+  return b;
+}
+
+v128bf stacktest8 (v128bf __a)
+{
+  volatile v128bf b = __a;
+  return b;
+}
+
+/* Test use of constant values to assign values to vectors.  */
+
+typedef bfloat16_t v2bf __attribute__((vector_size(4)));
+v2bf c2 (void) { return (v2bf) 0x12345678; }
+
+bfloat16x4_t c3 (void) { return (bfloat16x4_t) 0x1234567812345678; }
+
+#pragma GCC pop_options
diff --git a/gcc/testsuite/gcc.target/aarch64/bfloat16_simd_compile_3.c b/gcc/testsuite/gcc.target/aarch64/bfloat16_simd_compile_3.c
new file mode 100644
index 00000000000..6b22bae59af
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/bfloat16_simd_compile_3.c
@@ -0,0 +1,92 @@
+/* { dg-do assemble { target { aarch64*-*-* } } } */
+/* { dg-require-effective-target arm_v8_2a_bf16_neon_ok } */
+/* { dg-additional-options "-march=armv8.2-a -O3 --save-temps" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#include <arm_neon.h>
+
+/*
+**stacktest1:
+**	sub.*
+**	str	h0, \[sp, 14\]
+**	ldr	h0, \[sp, 14\]
+**	add.*
+**	ret
+*/
+bfloat16_t stacktest1 (bfloat16_t __a)
+{
+  volatile bfloat16_t b = __a;
+  return b;
+}
+
+/*
+**stacktest2:
+**	sub.*
+**	str	d0, \[sp, 8\]
+**	ldr	d0, \[sp, 8\]
+**	add.*
+**	ret
+*/
+bfloat16x4_t stacktest2 (bfloat16x4_t __a)
+{
+  volatile bfloat16x4_t b = __a;
+  return b;
+}
+
+/*
+**stacktest3:
+**	sub.*
+**	str	q0, \[sp\]
+**	ldr	q0, \[sp\]
+**	add.*
+**	ret
+*/
+bfloat16x8_t stacktest3 (bfloat16x8_t __a)
+{
+  volatile bfloat16x8_t b = __a;
+  return b;
+}
+
+/*  Test compilation of __attribute__ vectors of 8, 16, 32, etc. BFloats.  */
+typedef bfloat16_t v8bf __attribute__((vector_size(16)));
+typedef bfloat16_t v16bf __attribute__((vector_size(32)));
+typedef bfloat16_t v32bf __attribute__((vector_size(64)));
+typedef bfloat16_t v64bf __attribute__((vector_size(128)));
+typedef bfloat16_t v128bf __attribute__((vector_size(256)));
+
+v8bf stacktest4 (v8bf __a)
+{
+  volatile v8bf b = __a;
+  return b;
+}
+
+v16bf stacktest5 (v16bf __a)
+{
+  volatile v16bf b = __a;
+  return b;
+}
+
+v32bf stacktest6 (v32bf __a)
+{
+  volatile v32bf b = __a;
+  return b;
+}
+
+v64bf stacktest7 (v64bf __a)
+{
+  volatile v64bf b = __a;
+  return b;
+}
+
+v128bf stacktest8 (v128bf __a)
+{
+  volatile v128bf b = __a;
+  return b;
+}
+
+/* Test use of constant values to assign values to vectors.  */
+
+typedef bfloat16_t v2bf __attribute__((vector_size(4)));
+v2bf c2 (void) { return (v2bf) 0x12345678; }
+
+bfloat16x4_t c3 (void) { return (bfloat16x4_t) 0x1234567812345678; }

Re: [GCC][PATCH][Aarch64] Add Bfloat16_t scalar type, vector types and machine modes to Aarch64 back-end [1/2]

Reply via email to