Hi,

To attract more eyes, I should have used a scarier subject like "GCC's
vectorizer is heading in the wrong direction on big-endian targets".

The idea came from a simple vectorization case I ran into and a
discussion with Richard.  Given a simple case like:

typedef signed short *__restrict__ pRSINT16;
void check_vector_S16 (pRSINT16 a, pRSINT16 b);
int foo (void)
{
  int i;
  signed char input1_S8[16] =
      { 127, 126, 125, 124, 123, 122, 121, 120,
        119, 118, 117, 116, 115, 114, 113, 112};
  signed char input2_S8[16] =
      { 127, 125, 123, 121, 119, 117, 115, 113,
        111, 109, 107, 105, 103, 101, 99,  97};
  signed short output_S16[16];
  signed short expected_S16[] =
      { 16129, 15750, 15375, 15004, 14637, 14274, 13915, 13560,
        13209, 12862, 12519, 12180, 11845, 11514, 11187, 10864 };

  for (i=0; i<16; i++)
    output_S16[i] = (signed short)input1_S8[i] * input2_S8[i];

  check_vector_S16 (expected_S16, output_S16);
  return 0;
}


It is vectorized at optimization level -O3, and the GIMPLE optimized
dump for aarch64 big-endian looks like this:

;; Function foo (foo, funcdef_no=0, decl_uid=2564, symbol_order=0)

foo ()
{
  vector(8) short int vect_patt_77.9;
  vector(16) signed char vect__54.8;
  vector(16) signed char vect__52.5;
  short int expected_S16[16];
  short int output_S16[16];
  signed char input2_S8[16];
  signed char input1_S8[16];

  <bb 2>:
  MEM[(signed char[16] *)&input1_S8] = { 127, 126, 125, 124, 123, 122,
121, 120, 119, 118, 117, 116, 115, 114, 113, 112 };
  MEM[(signed char[16] *)&input2_S8] = { 127, 125, 123, 121, 119, 117,
115, 113, 111, 109, 107, 105, 103, 101, 99, 97 };
  MEM[(short int *)&expected_S16] = { 16129, 15750, 15375, 15004,
14637, 14274, 13915, 13560 };
  MEM[(short int *)&expected_S16 + 16B] = { 13209, 12862, 12519,
12180, 11845, 11514, 11187, 10864 };
  vect__52.5_2 = MEM[(signed char[16] *)&input1_S8];
  vect__54.8_73 = MEM[(signed char[16] *)&input2_S8];
  vect_patt_77.9_72 = WIDEN_MULT_HI_EXPR <vect__52.5_2, vect__54.8_73>;
  vect_patt_77.9_71 = WIDEN_MULT_LO_EXPR <vect__52.5_2, vect__54.8_73>;
  MEM[(short int *)&output_S16] = vect_patt_77.9_72;
  MEM[(short int *)&output_S16 + 16B] = vect_patt_77.9_71;
  check_vector_S16 (&expected_S16, &output_S16);
  input1_S8 ={v} {CLOBBER};
  input2_S8 ={v} {CLOBBER};
  output_S16 ={v} {CLOBBER};
  expected_S16 ={v} {CLOBBER};
  return 0;

}


while the dump for aarch64 little-endian looks like this:

;; Function foo (foo, funcdef_no=0, decl_uid=2564, symbol_order=0)

foo ()
{
  vector(8) short int vect_patt_77.9;
  vector(16) signed char vect__54.8;
  vector(16) signed char vect__52.5;
  short int expected_S16[16];
  short int output_S16[16];
  signed char input2_S8[16];
  signed char input1_S8[16];

  <bb 2>:
  MEM[(signed char[16] *)&input1_S8] = { 127, 126, 125, 124, 123, 122,
121, 120, 119, 118, 117, 116, 115, 114, 113, 112 };
  MEM[(signed char[16] *)&input2_S8] = { 127, 125, 123, 121, 119, 117,
115, 113, 111, 109, 107, 105, 103, 101, 99, 97 };
  MEM[(short int *)&expected_S16] = { 16129, 15750, 15375, 15004,
14637, 14274, 13915, 13560 };
  MEM[(short int *)&expected_S16 + 16B] = { 13209, 12862, 12519,
12180, 11845, 11514, 11187, 10864 };
  vect__52.5_2 = MEM[(signed char[16] *)&input1_S8];
  vect__54.8_73 = MEM[(signed char[16] *)&input2_S8];
  vect_patt_77.9_72 = WIDEN_MULT_LO_EXPR <vect__52.5_2, vect__54.8_73>;
  vect_patt_77.9_71 = WIDEN_MULT_HI_EXPR <vect__52.5_2, vect__54.8_73>;
  MEM[(short int *)&output_S16] = vect_patt_77.9_72;
  MEM[(short int *)&output_S16 + 16B] = vect_patt_77.9_71;
  check_vector_S16 (&expected_S16, &output_S16);
  input1_S8 ={v} {CLOBBER};
  input2_S8 ={v} {CLOBBER};
  output_S16 ={v} {CLOBBER};
  expected_S16 ={v} {CLOBBER};
  return 0;

}


It's clear that WIDEN_MULT_LO_EXPR/WIDEN_MULT_HI_EXPR are swapped
depending on whether the target is big-endian or little-endian.  The
code doing the swap is in
tree-vect-stmts.c:supportable_widening_operation:

  if (BYTES_BIG_ENDIAN && c1 != VEC_WIDEN_MULT_EVEN_EXPR)
    {
      enum tree_code ctmp = c1;
      c1 = c2;
      c2 = ctmp;
    }
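
To make what gets swapped concrete, here is a scalar sketch (my own
illustration, not GCC code) of what the two GIMPLE codes compute in
the dumps above, assuming WIDEN_MULT_LO_EXPR covers elements 0..7 and
WIDEN_MULT_HI_EXPR covers elements 8..15, matching the RTL expansion
shown further down:

/* Illustration only: scalar equivalents of the two widening codes,
   with element 0 being the first element of the V16QI operands.  */
void
widen_mult_lo (short out[8], const signed char a[16],
               const signed char b[16])
{
  int i;
  for (i = 0; i < 8; i++)
    out[i] = (short) a[i] * b[i];           /* elements 0..7  */
}

void
widen_mult_hi (short out[8], const signed char a[16],
               const signed char b[16])
{
  int i;
  for (i = 0; i < 8; i++)
    out[i] = (short) a[i + 8] * b[i + 8];   /* elements 8..15 */
}

With the swap, the big-endian dump stores the _HI result to the first
half of output_S16 and the _LO result to the second half, the reverse
of what the little-endian dump does.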


There are other similar cases in the vectorizer, and all of them look
suspicious: intuitively, the vectorizer should neither care about
target endianness nor do such a swap.  Anyway, this is how we do
vectorization currently.

Sticking to the test case, the following insns are expanded for the
WIDEN_MULT pair on big-endian:

;;WIDEN_MULT_HI_EXPR part
(insn:TI 39 50 16 (set (reg:V8HI 36 v4 [orig:106 vect_patt_77.9 ] [106])
        (mult:V8HI (sign_extend:V8HI (vec_select:V8QI (reg:V16QI 33 v1 [82])
                    (parallel:V16QI [
                            (const_int 8 [0x8])
                            (const_int 9 [0x9])
                            (const_int 10 [0xa])
                            (const_int 11 [0xb])
                            (const_int 12 [0xc])
                            (const_int 13 [0xd])
                            (const_int 14 [0xe])
                            (const_int 15 [0xf])
                        ])))
            (sign_extend:V8HI (vec_select:V8QI (reg:V16QI 32 v0 [87])
                    (parallel:V16QI [
                            (const_int 8 [0x8])
                            (const_int 9 [0x9])
                            (const_int 10 [0xa])
                            (const_int 11 [0xb])
                            (const_int 12 [0xc])
                            (const_int 13 [0xd])
                            (const_int 14 [0xe])
                            (const_int 15 [0xf])
                        ]))))) bug.c:26 1069 {aarch64_simd_vec_smult_hi_v16qi}
     (nil))
;;WIDEN_MULT_LO_EXPR part
(insn:TI 45 22 40 (set (reg:V8HI 34 v2 [orig:111 vect_patt_77.9 ] [111])
        (mult:V8HI (sign_extend:V8HI (vec_select:V8QI (reg:V16QI 33 v1 [82])
                    (parallel:V16QI [
                            (const_int 0 [0])
                            (const_int 1 [0x1])
                            (const_int 2 [0x2])
                            (const_int 3 [0x3])
                            (const_int 4 [0x4])
                            (const_int 5 [0x5])
                            (const_int 6 [0x6])
                            (const_int 7 [0x7])
                        ])))
            (sign_extend:V8HI (vec_select:V8QI (reg:V16QI 32 v0 [87])
                    (parallel:V16QI [
                            (const_int 0 [0])
                            (const_int 1 [0x1])
                            (const_int 2 [0x2])
                            (const_int 3 [0x3])
                            (const_int 4 [0x4])
                            (const_int 5 [0x5])
                            (const_int 6 [0x6])
                            (const_int 7 [0x7])
                        ]))))) bug.c:26 1063 {aarch64_simd_vec_smult_lo_v16qi}
     (nil))

Semantically, insn 39 does a widening multiplication of the high
elements of vectors v1 and v0, and insn 45 does the same for the low
elements.  To cope with (or take advantage of) the swap done in the
vectorizer, the aarch64 back-end for big-endian matches insn 39 to
"aarch64_simd_vec_smult_hi_v16qi" using the predicate
"vect_par_cnst_hi_half" and generates the assembly "smull2    v4.8h,
v1.16b, v0.16b".  The point is that this instruction semantically
multiplies the high bits of vectors v1 and v0, which actually map to
the low elements of v1 and v0 on aarch64 big-endian.  The interesting
thing here is that the back-end generates a semantically "wrong"
assembly instruction in order to shuffle back and get the right
results.
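
In other words, the two "wrong" halves cancel out.  Purely as an
illustration of the net effect described above (not of the actual
instruction semantics), the big-endian code ends up computing:

/* Illustration only: net effect on big-endian as described above.
   The insn matched as the "hi" pattern (smull2) ends up producing
   the products of array elements 0..7, stored to the first half of
   output_S16; the "lo" insn (smull) covers elements 8..15.  */
static void
net_effect_big_endian (short out[16], const signed char a[16],
                       const signed char b[16])
{
  int i;
  for (i = 0; i < 8; i++)
    out[i] = (short) a[i] * b[i];               /* smull2 result */
  for (i = 0; i < 8; i++)
    out[i + 8] = (short) a[i + 8] * b[i + 8];   /* smull result  */
}

So the final contents of output_S16 match expected_S16 even though
each individual instruction looks inverted.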

Though a little bit ugly, this would still be acceptable if things
stayed that way.  Unfortunately, GCC also wants to do as much
optimization as possible on vectorized code.  Look at the dump just
before RTL cse1 (this time with my local patch, but that doesn't
matter):

    3: NOTE_INSN_BASIC_BLOCK 2
    2: NOTE_INSN_FUNCTION_BEG
    5: r78:V16QI=[`*.LC0']     ;;r78={127,126,125,...,114,113,112}
    6: [sfp:DI-0x60]=r78:V16QI
    7: r79:V16QI=[`*.LC1']     ;;r79={127,125,123,...,101, 99, 97}
    8: [sfp:DI-0x50]=r79:V16QI
    9: r80:V8HI=[`*.LC2']      ;;r80=low_part(r78) w* low_part(r79)
   10: [sfp:DI-0x20]=r80:V8HI
   11: r81:V8HI=[`*.LC3']      ;;r81=high_part(r78) w* high_part(r79)
   12: [sfp:DI-0x10]=r81:V8HI
   13: r73:V16QI=[sfp:DI-0x60]
   14: r76:V16QI=[sfp:DI-0x50]
   15: r82:V8HI=sign_extend(vec_select(r73:V16QI,parallel))
               *sign_extend(vec_select(r76:V16QI,parallel))  ;;WIDEN_MULT_HI_EXPR
   16: [sfp:DI-0x40]=r82:V8HI
   17: r83:V8HI=sign_extend(vec_select(r73:V16QI,parallel))
               *sign_extend(vec_select(r76:V16QI,parallel))  ;;WIDEN_MULT_LO_EXPR
   18: [sfp:DI-0x30]=r83:V8HI
   19: r84:DI=sfp:DI-0x40
   20: r85:DI=sfp:DI-0x20
   21: x1:DI=r84:DI
   22: x0:DI=r85:DI
   23: call [`check_vector_S16'] argc:0
   24: r77:SI=0
   28: x0:SI=r77:SI
   29: use x0:SI

Pass cse1 knows that r76 equals r79 and r73 equals r78, thus:

   insn 15
       <=> r82:V8HI=sign_extend(vec_select(r78:V16QI,{8,9,10,11,12,13,14,15}))
                    *sign_extend(vec_select(r79:V16QI,{8,9,10,11,12,13,14,15}))
       <=> r82:V8HI=sign_extend({119,118,117,116,115,114,113,112})
                    *sign_extend({111,109,107,105,103,101,99,97})
       <=> r82:V8HI=r81:V8HI

Similarly, insn 17 <=> r83:V8HI=r80:V8HI
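
The arithmetic behind that equivalence is easy to check by hand; here
is a quick stand-alone check (mine, not part of the dump) for the
insn 15 case:

#include <stdio.h>

/* Quick check of the equivalence cse1 derives for insn 15: elements
   8..15 of .LC0 and .LC1, multiplied element-wise with widening,
   give exactly the contents of .LC3 (the second half of
   expected_S16), so r82 can indeed be replaced by r81.  */
int
main (void)
{
  signed char a[8] = { 119, 118, 117, 116, 115, 114, 113, 112 };
  signed char b[8] = { 111, 109, 107, 105, 103, 101,  99,  97 };
  short lc3[8] = { 13209, 12862, 12519, 12180, 11845, 11514, 11187, 10864 };
  int i;

  for (i = 0; i < 8; i++)
    if ((short) (a[i] * b[i]) != lc3[i])
      {
        printf ("mismatch at element %d\n", i);
        return 1;
      }
  printf ("r82 == r81, as cse1 concludes\n");
  return 0;
}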

The logic is that simplify-rtx.c:simplify_binary_operation_1, called
from cse1, doesn't know or care that the vectorizer did the swap
earlier.  It just selects the high elements of the vectors for
insn 15/WIDEN_MULT_HI_EXPR.  As a result, insn 15 is optimized into a
direct store of the high elements (i.e. the low bits) of the
multiplication result, which is the exact opposite of what the smull2
instruction does.  So with this otherwise useful optimization, wrong
code is generated.

This isn't the only case.  PR61114 has the same root cause; it just
happens with BIT_FIELD_REF and the dom2 pass instead.

One way out is to fix these issues one by one, introducing more and
more big-endian checks at the places where normal optimization is
done on auto-vectorized code.  Clearly that is not
sustainable/maintainable/acceptable.  The right fix, I think, is to
make the vectorizer (middle-end) endianness-independent and to
rectify all the back-ends that enable the vectorizer for big-endian.
Richard suggested that tree-vect*, optabs.c and the related back-ends
should be fixed as the first step, and that RTL be fixed as the next
step if it breaks in the same way.

Moreover, I am also interested in the practical issues, for example:
    How much (back-end) work would this take?
    Should we open a development branch so that people from each
back-end can do the fix?

Any comments on this?

Thanks,
bin

-- 
Best Regards.
