On Tue, May 27, 2014 at 10:24 AM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> Hi,
>
> To attract more eyes, I should have used a scarier subject like "GCC's
> vectorizer is heading in the wrong direction on big-endian targets".
>
> The idea came from a simple vectorization case I ran into and a
> discussion with Richard.  Given a simple case like:
>
> typedef signed short *__restrict__ pRSINT16;
> void check_vector_S16 (pRSINT16 a, pRSINT16 b);
> int foo (void)
> {
>   int i;
>   signed char input1_S8[16] =
>       { 127, 126, 125, 124, 123, 122, 121, 120,
>         119, 118, 117, 116, 115, 114, 113, 112};
>   signed char input2_S8[16] =
>       { 127, 125, 123, 121, 119, 117, 115, 113,
>         111, 109, 107, 105, 103, 101, 99,  97};
>   signed short output_S16[16];
>   signed short expected_S16[] =
>       { 16129, 15750, 15375, 15004, 14637, 14274, 13915, 13560,
>         13209, 12862, 12519, 12180, 11845, 11514, 11187, 10864 };
>
>   for (i=0; i<16; i++)
>     output_S16[i] = (signed short)input1_S8[i] * input2_S8[i];
>
>   check_vector_S16 (expected_S16, output_S16);
>   return 0;
> }
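>
> check_vector_S16 is defined elsewhere in the original test; a minimal
> sketch of what I assume it does, just to make the example
> self-contained, is:
>
>   #include <stdlib.h>
>
>   /* Hypothetical checker, not the original one: abort if the computed
>      results differ from the expected results element by element.  */
>   void check_vector_S16 (pRSINT16 expected, pRSINT16 actual)
>   {
>     int i;
>     for (i = 0; i < 16; i++)
>       if (expected[i] != actual[i])
>         abort ();
>   }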
>
>
> It is vectorized at optimization level -O3, and the optimized GIMPLE
> dump for aarch64 big-endian looks like:
>
> ;; Function foo (foo, funcdef_no=0, decl_uid=2564, symbol_order=0)
>
> foo ()
> {
>   vector(8) short int vect_patt_77.9;
>   vector(16) signed char vect__54.8;
>   vector(16) signed char vect__52.5;
>   short int expected_S16[16];
>   short int output_S16[16];
>   signed char input2_S8[16];
>   signed char input1_S8[16];
>
>   <bb 2>:
>   MEM[(signed char[16] *)&input1_S8] = { 127, 126, 125, 124, 123, 122,
> 121, 120, 119, 118, 117, 116, 115, 114, 113, 112 };
>   MEM[(signed char[16] *)&input2_S8] = { 127, 125, 123, 121, 119, 117,
> 115, 113, 111, 109, 107, 105, 103, 101, 99, 97 };
>   MEM[(short int *)&expected_S16] = { 16129, 15750, 15375, 15004,
> 14637, 14274, 13915, 13560 };
>   MEM[(short int *)&expected_S16 + 16B] = { 13209, 12862, 12519,
> 12180, 11845, 11514, 11187, 10864 };
>   vect__52.5_2 = MEM[(signed char[16] *)&input1_S8];
>   vect__54.8_73 = MEM[(signed char[16] *)&input2_S8];
>   vect_patt_77.9_72 = WIDEN_MULT_HI_EXPR <vect__52.5_2, vect__54.8_73>;
>   vect_patt_77.9_71 = WIDEN_MULT_LO_EXPR <vect__52.5_2, vect__54.8_73>;
>   MEM[(short int *)&output_S16] = vect_patt_77.9_72;
>   MEM[(short int *)&output_S16 + 16B] = vect_patt_77.9_71;
>   check_vector_S16 (&expected_S16, &output_S16);
>   input1_S8 ={v} {CLOBBER};
>   input2_S8 ={v} {CLOBBER};
>   output_S16 ={v} {CLOBBER};
>   expected_S16 ={v} {CLOBBER};
>   return 0;
>
> }
>
>
> while the dump for aarch64 little-endian looks like:
>
> ;; Function foo (foo, funcdef_no=0, decl_uid=2564, symbol_order=0)
>
> foo ()
> {
>   vector(8) short int vect_patt_77.9;
>   vector(16) signed char vect__54.8;
>   vector(16) signed char vect__52.5;
>   short int expected_S16[16];
>   short int output_S16[16];
>   signed char input2_S8[16];
>   signed char input1_S8[16];
>
>   <bb 2>:
>   MEM[(signed char[16] *)&input1_S8] = { 127, 126, 125, 124, 123, 122,
> 121, 120, 119, 118, 117, 116, 115, 114, 113, 112 };
>   MEM[(signed char[16] *)&input2_S8] = { 127, 125, 123, 121, 119, 117,
> 115, 113, 111, 109, 107, 105, 103, 101, 99, 97 };
>   MEM[(short int *)&expected_S16] = { 16129, 15750, 15375, 15004,
> 14637, 14274, 13915, 13560 };
>   MEM[(short int *)&expected_S16 + 16B] = { 13209, 12862, 12519,
> 12180, 11845, 11514, 11187, 10864 };
>   vect__52.5_2 = MEM[(signed char[16] *)&input1_S8];
>   vect__54.8_73 = MEM[(signed char[16] *)&input2_S8];
>   vect_patt_77.9_72 = WIDEN_MULT_LO_EXPR <vect__52.5_2, vect__54.8_73>;
>   vect_patt_77.9_71 = WIDEN_MULT_HI_EXPR <vect__52.5_2, vect__54.8_73>;
>   MEM[(short int *)&output_S16] = vect_patt_77.9_72;
>   MEM[(short int *)&output_S16 + 16B] = vect_patt_77.9_71;
>   check_vector_S16 (&expected_S16, &output_S16);
>   input1_S8 ={v} {CLOBBER};
>   input2_S8 ={v} {CLOBBER};
>   output_S16 ={v} {CLOBBER};
>   expected_S16 ={v} {CLOBBER};
>   return 0;
>
> }
>
>
> It's clear that WIDEN_MULT_LO_EXPR/WIDEN_MULT_HI_EXPR are swapped
> depending on big-endian/little-endian.  The code doing the shuffle is
> in tree-vect-stmts.c:supportable_widening_operation:
>
>   if (BYTES_BIG_ENDIAN && c1 != VEC_WIDEN_MULT_EVEN_EXPR)
>     {
>       enum tree_code ctmp = c1;
>       c1 = c2;
>       c2 = ctmp;
>     }
>
>
> There are other similar cases in the vectorizer and all of them look
> suspicious, since intuitively the vectorizer should neither care about
> target endianness nor do such a shuffle.  Anyway, this is how we do
> vectorization currently.
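>
> For reference, the little-endian dump above corresponds to the
> following scalar model, with elements numbered in array order (my
> reading of the dump, not a definition taken from the GCC
> documentation); on big-endian the snippet above swaps the two tree
> codes, so HI covers elements 0..7 and LO covers elements 8..15 instead:
>
>   /* Scalar model of the little-endian GIMPLE above, with elements
>      numbered in array order.  */
>   void widen_mult_model (const signed char *input1_S8,
>                          const signed char *input2_S8,
>                          short *output_S16)
>   {
>     int i;
>     for (i = 0; i < 8; i++)
>       {
>         output_S16[i]     = (short) input1_S8[i]     * (short) input2_S8[i];     /* WIDEN_MULT_LO_EXPR */
>         output_S16[i + 8] = (short) input1_S8[i + 8] * (short) input2_S8[i + 8]; /* WIDEN_MULT_HI_EXPR */
>       }
>   }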
>
> Sticking to the test case, the insns below are expanded for the
> WIDEN_MULT pair on big-endian:
>
> ;;WIDEN_MULT_HI_EXPR part
> (insn:TI 39 50 16 (set (reg:V8HI 36 v4 [orig:106 vect_patt_77.9 ] [106])
>         (mult:V8HI (sign_extend:V8HI (vec_select:V8QI (reg:V16QI 33 v1 [82])
>                     (parallel:V16QI [
>                             (const_int 8 [0x8])
>                             (const_int 9 [0x9])
>                             (const_int 10 [0xa])
>                             (const_int 11 [0xb])
>                             (const_int 12 [0xc])
>                             (const_int 13 [0xd])
>                             (const_int 14 [0xe])
>                             (const_int 15 [0xf])
>                         ])))
>             (sign_extend:V8HI (vec_select:V8QI (reg:V16QI 32 v0 [87])
>                     (parallel:V16QI [
>                             (const_int 8 [0x8])
>                             (const_int 9 [0x9])
>                             (const_int 10 [0xa])
>                             (const_int 11 [0xb])
>                             (const_int 12 [0xc])
>                             (const_int 13 [0xd])
>                             (const_int 14 [0xe])
>                             (const_int 15 [0xf])
>                         ]))))) bug.c:26 1069 {aarch64_simd_vec_smult_hi_v16qi}
>      (nil))
> ;;WIDEN_MULT_LO_EXPR part
> (insn:TI 45 22 40 (set (reg:V8HI 34 v2 [orig:111 vect_patt_77.9 ] [111])
>         (mult:V8HI (sign_extend:V8HI (vec_select:V8QI (reg:V16QI 33 v1 [82])
>                     (parallel:V16QI [
>                             (const_int 0 [0])
>                             (const_int 1 [0x1])
>                             (const_int 2 [0x2])
>                             (const_int 3 [0x3])
>                             (const_int 4 [0x4])
>                             (const_int 5 [0x5])
>                             (const_int 6 [0x6])
>                             (const_int 7 [0x7])
>                         ])))
>             (sign_extend:V8HI (vec_select:V8QI (reg:V16QI 32 v0 [87])
>                     (parallel:V16QI [
>                             (const_int 0 [0])
>                             (const_int 1 [0x1])
>                             (const_int 2 [0x2])
>                             (const_int 3 [0x3])
>                             (const_int 4 [0x4])
>                             (const_int 5 [0x5])
>                             (const_int 6 [0x6])
>                             (const_int 7 [0x7])
>                         ]))))) bug.c:26 1063 {aarch64_simd_vec_smult_lo_v16qi}
>      (nil))
>
> Semantically, insn 39 does a vector multiplication for the high
> elements of vectors v1 and v0, and insn 45 does the same for the low
> elements.  To cope with (or take advantage of) the shuffle in the
> vectorizer mentioned above, the aarch64 back-end for big-endian matches
> insn 39 to "aarch64_simd_vec_smult_hi_v16qi" using the predicate
> "vect_par_cnst_hi_half" and generates the assembly "smull2    v4.8h,
> v1.16b, v0.16b".  The point is that this assembly insn semantically
> multiplies the high bits of vectors v1 and v0, which actually map to
> the low elements of the vectors on aarch64 big-endian.  The interesting
> thing here is that the back-end generates a semantically wrong assembly
> insn in order to shuffle back and get the right results.
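>
> To spell out the lane mix-up (my understanding of aarch64 big-endian
> code generation, where vectors are loaded and stored as whole 128-bit
> units, so memory element i of a 16-byte vector sits in architectural
> lane 15 - i of the register), a rough model of what the smull2 above
> computes in terms of the in-memory element order is:
>
>   /* Not GCC code - just a model of smull2 on aarch64 big-endian under
>      the 128-bit load/store convention.  Architectural lane 8 + j holds
>      memory element 7 - j, and the product written to result lane j is
>      stored back at memory position 7 - j.  */
>   void smull2_bigendian_model (const signed char *a, const signed char *b,
>                                short *result)
>   {
>     int j;
>     for (j = 0; j < 8; j++)
>       {
>         int elem = 7 - j;
>         result[elem] = (short) a[elem] * (short) b[elem];
>       }
>   }
>
> So the "hi half" instruction actually produces the products of memory
> elements 0..7, even though the RTL pattern it was matched from selects
> elements 8..15 - which is fine only as long as nothing takes that
> vec_select literally.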
>
> Though a little bit ugly, this would still be acceptable if things
> stayed that way.  Unfortunately, GCC also wants to do as much
> optimization as possible on vectorized code.  Look at the dump before
> RTL cse1 (this time with my local patch, but that doesn't matter):
>
>     3: NOTE_INSN_BASIC_BLOCK 2
>     2: NOTE_INSN_FUNCTION_BEG
>     5: r78:V16QI=[`*.LC0']     ;;r78={127,126,125,...,114,113,112}
>     6: [sfp:DI-0x60]=r78:V16QI
>     7: r79:V16QI=[`*.LC1']     ;;r79={127,125,123,...,101, 99, 97}
>     8: [sfp:DI-0x50]=r79:V16QI
>     9: r80:V8HI=[`*.LC2']      ;;r80=low_part(r78) w* low_part(r79)
>    10: [sfp:DI-0x20]=r80:V8HI
>    11: r81:V8HI=[`*.LC3']      ;;r81=high_part(r78) w* high_part(r79)
>    12: [sfp:DI-0x10]=r81:V8HI
>    13: r73:V16QI=[sfp:DI-0x60]
>    14: r76:V16QI=[sfp:DI-0x50]
>    15: r82:V8HI=sign_extend(vec_select(r73:V16QI,parallel))*sign_extend(vec_select(r76:V16QI,parallel))  ;;WIDEN_MULT_HI_EXPR
>    16: [sfp:DI-0x40]=r82:V8HI
>    17: r83:V8HI=sign_extend(vec_select(r73:V16QI,parallel))*sign_extend(vec_select(r76:V16QI,parallel))  ;;WIDEN_MULT_LO_EXPR
>    18: [sfp:DI-0x30]=r83:V8HI
>    19: r84:DI=sfp:DI-0x40
>    20: r85:DI=sfp:DI-0x20
>    21: x1:DI=r84:DI
>    22: x0:DI=r85:DI
>    23: call [`check_vector_S16'] argc:0
>    24: r77:SI=0
>    28: x0:SI=r77:SI
>    29: use x0:SI
>
> Pass CSE1 knows that r76 equals r79 and r73 equals r78, thus:
>    insn 15
>       <=> r82:V8HI=sign_extend(vec_select(r78:V16QI,{8,9,10,11,12,13,14,15}))*sign_extend(vec_select(r79:V16QI,{8,9,10,11,12,13,14,15}))
>       <=> r82:V8HI=sign_extend(119,118,117,116,115,114,113,112)*sign_extend(111,109,107,105,103,101,99,97)
>       <=> r82:V8HI=r81:V8HI
>
> Similarly, insn 17 <=> r83:V8HI=r80:V8HI
>
> The logic is that simplify-rtx.c::simplify_binary_operation_1, called
> by CSE1, doesn't know or care that the vectorizer did the shuffle
> before.  It just selects the high elements of a vector for
> insn 15/WIDEN_MULT_HI_EXPR.  As a result, insn 15 is optimized into a
> direct store of the high elements (i.e., the low bits) of the
> multiplication result, which is the exact opposite of what the smull2
> instruction computes.  So with this useful optimization, wrong code is
> generated.
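>
> The effect of that folding can be modelled roughly like this (a sketch
> of the outcome only; the real code is in simplify_binary_operation_1
> and the surrounding vec_select handling):
>
>   /* Sketch: once r73/r76 are known constant vectors, vec_select simply
>      picks constant element sel[i] for each selector entry, with no
>      endianness adjustment, and the widening multiply is then folded
>      lane by lane.  */
>   void fold_vec_select_widen_mult (const signed char *in1,
>                                    const signed char *in2,
>                                    const unsigned char *sel, short *out)
>   {
>     int i;
>     for (i = 0; i < 8; i++)
>       out[i] = (short) in1[sel[i]] * (short) in2[sel[i]];
>   }
>
> With sel = {8,9,10,11,12,13,14,15} this reproduces exactly the constant
> already sitting in r81, so insn 15 collapses to a copy of r81 - the
> opposite half from what the smull2 it would otherwise have become
> computes on big-endian.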
>
> This isn't the only case.  PR61114 has the same root cause; it just
> shows up with BIT_FIELD_REF and the dom2 pass this time.
>
> One way out is to fix these issues one by one, introducing more and
> more BYTES_BIG_ENDIAN checks at the places where normal optimizations
> are done on auto-vectorized code.  Apparently this is not
> sustainable/maintainable/acceptable.  The right fix, I think, is to
> make the vectorizer (middle-end) endianness-independent and to rectify
> all back-ends that enable the vectorizer for big-endian.  Richard
> suggested that tree-vect*, optabs.c and the related back-ends should be
> fixed as the first step, while the RTL passes should be fixed as the
> next step if they break in the same way.
>
> Moreover, I am also interested in the practical issues, for example:
>     How much (back-end) work would this take?
>     Should we open a development branch so that people from each
> back-end can do the fixes?
>
> Any comment on this?

I'm taking a pragmatic position on this issue.  Go and fix the GIMPLE
middle-end and adjust the backend you care for.  The "other" big-endian
backend just decided to switch to little-endian - but it is also well maintained
and thus I expect the maintainers to quickly adapt to these changes.

Leave other ports that may be affected (who knows ...) broken.  We do
have testing coverage for the big-endian special cases (they were after
all introduced bit-by-bit for broken cases ...), so the autotesters should
come up with regressions on other platforms if there are any.  They
should be easily fixable.

So I think to get attention you have to post a patch (even if it only
touches one backend).  I'll happily approve the middle-end bits
conditional on "testing on one other big-endian platform".

Thanks,
Richard.

> Thanks,
> bin
>
> --
> Best Regards.
