On Tue, May 27, 2014 at 10:24 AM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> Hi,
>
> To attract more eyes, I should have used a scarier subject like "GCC's
> vectorizer is heading in the wrong direction on big-endian targets".
>
> The idea came from a simple vectorization case I ran into and a
> discussion with Richard.  Given a simple case like:
>
> typedef signed short *__restrict__ pRSINT16;
> void check_vector_S16 (pRSINT16 a, pRSINT16 b);
> int foo (void)
> {
>   int i;
>   signed char input1_S8[16] =
>     { 127, 126, 125, 124, 123, 122, 121, 120,
>       119, 118, 117, 116, 115, 114, 113, 112 };
>   signed char input2_S8[16] =
>     { 127, 125, 123, 121, 119, 117, 115, 113,
>       111, 109, 107, 105, 103, 101, 99, 97 };
>   signed short output_S16[16];
>   signed short expected_S16[] =
>     { 16129, 15750, 15375, 15004, 14637, 14274, 13915, 13560,
>       13209, 12862, 12519, 12180, 11845, 11514, 11187, 10864 };
>
>   for (i = 0; i < 16; i++)
>     output_S16[i] = (signed short) input1_S8[i] * input2_S8[i];
>
>   check_vector_S16 (expected_S16, output_S16);
>   return 0;
> }
>
> It is vectorized at optimization level -O3, and the GIMPLE optimized
> dump for aarch64 big-endian looks like:
>
> ;; Function foo (foo, funcdef_no=0, decl_uid=2564, symbol_order=0)
>
> foo ()
> {
>   vector(8) short int vect_patt_77.9;
>   vector(16) signed char vect__54.8;
>   vector(16) signed char vect__52.5;
>   short int expected_S16[16];
>   short int output_S16[16];
>   signed char input2_S8[16];
>   signed char input1_S8[16];
>
>   <bb 2>:
>   MEM[(signed char[16] *)&input1_S8] = { 127, 126, 125, 124, 123, 122, 121, 120, 119, 118, 117, 116, 115, 114, 113, 112 };
>   MEM[(signed char[16] *)&input2_S8] = { 127, 125, 123, 121, 119, 117, 115, 113, 111, 109, 107, 105, 103, 101, 99, 97 };
>   MEM[(short int *)&expected_S16] = { 16129, 15750, 15375, 15004, 14637, 14274, 13915, 13560 };
>   MEM[(short int *)&expected_S16 + 16B] = { 13209, 12862, 12519, 12180, 11845, 11514, 11187, 10864 };
>   vect__52.5_2 = MEM[(signed char[16] *)&input1_S8];
>   vect__54.8_73 = MEM[(signed char[16] *)&input2_S8];
>   vect_patt_77.9_72 = WIDEN_MULT_HI_EXPR <vect__52.5_2, vect__54.8_73>;
>   vect_patt_77.9_71 = WIDEN_MULT_LO_EXPR <vect__52.5_2, vect__54.8_73>;
>   MEM[(short int *)&output_S16] = vect_patt_77.9_72;
>   MEM[(short int *)&output_S16 + 16B] = vect_patt_77.9_71;
>   check_vector_S16 (&expected_S16, &output_S16);
>   input1_S8 ={v} {CLOBBER};
>   input2_S8 ={v} {CLOBBER};
>   output_S16 ={v} {CLOBBER};
>   expected_S16 ={v} {CLOBBER};
>   return 0;
>
> }
>
> while the dump for aarch64 little-endian looks like:
>
> ;; Function foo (foo, funcdef_no=0, decl_uid=2564, symbol_order=0)
>
> foo ()
> {
>   vector(8) short int vect_patt_77.9;
>   vector(16) signed char vect__54.8;
>   vector(16) signed char vect__52.5;
>   short int expected_S16[16];
>   short int output_S16[16];
>   signed char input2_S8[16];
>   signed char input1_S8[16];
>
>   <bb 2>:
>   MEM[(signed char[16] *)&input1_S8] = { 127, 126, 125, 124, 123, 122, 121, 120, 119, 118, 117, 116, 115, 114, 113, 112 };
>   MEM[(signed char[16] *)&input2_S8] = { 127, 125, 123, 121, 119, 117, 115, 113, 111, 109, 107, 105, 103, 101, 99, 97 };
>   MEM[(short int *)&expected_S16] = { 16129, 15750, 15375, 15004, 14637, 14274, 13915, 13560 };
>   MEM[(short int *)&expected_S16 + 16B] = { 13209, 12862, 12519, 12180, 11845, 11514, 11187, 10864 };
>   vect__52.5_2 = MEM[(signed char[16] *)&input1_S8];
>   vect__54.8_73 = MEM[(signed char[16] *)&input2_S8];
>   vect_patt_77.9_72 = WIDEN_MULT_LO_EXPR <vect__52.5_2, vect__54.8_73>;
>   vect_patt_77.9_71 = WIDEN_MULT_HI_EXPR <vect__52.5_2, vect__54.8_73>;
>   MEM[(short int *)&output_S16] = vect_patt_77.9_72;
>   MEM[(short int *)&output_S16 + 16B] = vect_patt_77.9_71;
>   check_vector_S16 (&expected_S16, &output_S16);
>   input1_S8 ={v} {CLOBBER};
>   input2_S8 ={v} {CLOBBER};
>   output_S16 ={v} {CLOBBER};
>   expected_S16 ={v} {CLOBBER};
>   return 0;
>
> }
>
> It's clear that WIDEN_MULT_LO_EXPR/WIDEN_MULT_HI_EXPR are switched
> depending on big-endian/little-endian.  The corresponding code doing
> the shuffle is below, in tree-vect-stmts.c:supportable_widening_operation:
>
>   if (BYTES_BIG_ENDIAN && c1 != VEC_WIDEN_MULT_EVEN_EXPR)
>     {
>       enum tree_code ctmp = c1;
>       c1 = c2;
>       c2 = ctmp;
>     }
>
> There are some other similar cases in the vectorizer, and all of them
> look suspicious since, intuitively, the vectorizer should neither care
> about target endianness nor do such a shuffle.  Anyway, this is how we
> do vectorization currently.
>
> Sticking to the test case, the insns below are expanded for the
> WIDEN_MULT pair on big-endian:
>
> ;; WIDEN_MULT_HI_EXPR part
> (insn:TI 39 50 16 (set (reg:V8HI 36 v4 [orig:106 vect_patt_77.9 ] [106])
>         (mult:V8HI (sign_extend:V8HI (vec_select:V8QI (reg:V16QI 33 v1 [82])
>                     (parallel:V16QI [
>                             (const_int 8 [0x8])
>                             (const_int 9 [0x9])
>                             (const_int 10 [0xa])
>                             (const_int 11 [0xb])
>                             (const_int 12 [0xc])
>                             (const_int 13 [0xd])
>                             (const_int 14 [0xe])
>                             (const_int 15 [0xf])
>                         ])))
>             (sign_extend:V8HI (vec_select:V8QI (reg:V16QI 32 v0 [87])
>                     (parallel:V16QI [
>                             (const_int 8 [0x8])
>                             (const_int 9 [0x9])
>                             (const_int 10 [0xa])
>                             (const_int 11 [0xb])
>                             (const_int 12 [0xc])
>                             (const_int 13 [0xd])
>                             (const_int 14 [0xe])
>                             (const_int 15 [0xf])
>                         ]))))) bug.c:26 1069 {aarch64_simd_vec_smult_hi_v16qi}
>      (nil))
> ;; WIDEN_MULT_LO_EXPR part
> (insn:TI 45 22 40 (set (reg:V8HI 34 v2 [orig:111 vect_patt_77.9 ] [111])
>         (mult:V8HI (sign_extend:V8HI (vec_select:V8QI (reg:V16QI 33 v1 [82])
>                     (parallel:V16QI [
>                             (const_int 0 [0])
>                             (const_int 1 [0x1])
>                             (const_int 2 [0x2])
>                             (const_int 3 [0x3])
>                             (const_int 4 [0x4])
>                             (const_int 5 [0x5])
>                             (const_int 6 [0x6])
>                             (const_int 7 [0x7])
>                         ])))
>             (sign_extend:V8HI (vec_select:V8QI (reg:V16QI 32 v0 [87])
>                     (parallel:V16QI [
>                             (const_int 0 [0])
>                             (const_int 1 [0x1])
>                             (const_int 2 [0x2])
>                             (const_int 3 [0x3])
>                             (const_int 4 [0x4])
>                             (const_int 5 [0x5])
>                             (const_int 6 [0x6])
>                             (const_int 7 [0x7])
>                         ]))))) bug.c:26 1063 {aarch64_simd_vec_smult_lo_v16qi}
>      (nil))
>
> Semantically, insn 39 does the vector multiplication for the high
> elements of vectors v1 and v0, and insn 45 does it for the low elements
> of vectors v1 and v0.  To cope with (or take advantage of) the shuffle
> mentioned above, the aarch64 back-end for big-endian matches insn 39 to
> "aarch64_simd_vec_smult_hi_v16qi" using the predicate
> "vect_par_cnst_hi_half" and generates the assembly "smull2 v4.8h,
> v1.16b, v0.16b".  The point is that this assembly insn semantically
> multiplies the high halves of vectors v1 and v0, which actually map to
> the low elements of vectors v1 and v0 on aarch64 big-endian.  The
> interesting thing here is that the back-end generates the "wrong"
> assembly insn in order to shuffle back and get the right results.
>
> Though a little bit ugly, this would still be acceptable if things
> stayed that way.
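To make the semantics just described concrete, here is a minimal scalar model of what the WIDEN_MULT_LO/HI pair computes for this test case.  It is plain, self-contained C written for illustration only; the names widen_mult_lo/widen_mult_hi and the driver are invented and are not GCC internals.  The "lo" half covers elements 0..7 and the "hi" half elements 8..15, in the element order the little-endian dump uses:

  /* Scalar model of the two widening-multiply halves: each half takes
     two 16-element signed char inputs and produces 8 signed short
     products.  Illustrative only, not GCC code.  */
  #include <stdio.h>

  static void
  widen_mult_lo (const signed char *a, const signed char *b, short *out)
  {
    for (int i = 0; i < 8; i++)
      out[i] = (short) a[i] * b[i];          /* elements 0..7 */
  }

  static void
  widen_mult_hi (const signed char *a, const signed char *b, short *out)
  {
    for (int i = 0; i < 8; i++)
      out[i] = (short) a[8 + i] * b[8 + i];  /* elements 8..15 */
  }

  int
  main (void)
  {
    signed char a[16] = { 127, 126, 125, 124, 123, 122, 121, 120,
                          119, 118, 117, 116, 115, 114, 113, 112 };
    signed char b[16] = { 127, 125, 123, 121, 119, 117, 115, 113,
                          111, 109, 107, 105, 103, 101,  99,  97 };
    short out[16];

    /* Element order as in the little-endian dump: LO fills out[0..7],
       HI fills out[8..15].  */
    widen_mult_lo (a, b, out);
    widen_mult_hi (a, b, out + 8);

    for (int i = 0; i < 16; i++)
      printf ("%d ", out[i]);
    printf ("\n");   /* 16129 15750 ... 10864, matching expected_S16 */
    return 0;
  }

In the big-endian GIMPLE above the vectorizer swaps which statement feeds which output store, and the back-end pattern swaps again by matching the HI statement to smull2; the two swaps cancel, which is why the stores still receive the right halves - as long as nothing later looks at the statements literally.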
> Unfortunately, GCC also wants to do as much optimization as possible on
> vectorized code.  Look at the dump information before RTL cse1 (this
> time with my local patch applied, but that doesn't matter):
>
>     3: NOTE_INSN_BASIC_BLOCK 2
>     2: NOTE_INSN_FUNCTION_BEG
>     5: r78:V16QI=[`*.LC0']    ;; r78={127,126,125,...,114,113,112}
>     6: [sfp:DI-0x60]=r78:V16QI
>     7: r79:V16QI=[`*.LC1']    ;; r79={127,125,123,...,101, 99, 97}
>     8: [sfp:DI-0x50]=r79:V16QI
>     9: r80:V8HI=[`*.LC2']     ;; r80=low_part(r78) w* low_part(r79)
>    10: [sfp:DI-0x20]=r80:V8HI
>    11: r81:V8HI=[`*.LC3']     ;; r81=high_part(r78) w* high_part(r79)
>    12: [sfp:DI-0x10]=r81:V8HI
>    13: r73:V16QI=[sfp:DI-0x60]
>    14: r76:V16QI=[sfp:DI-0x50]
>    15: r82:V8HI=sign_extend(vec_select(r73:V16QI,parallel))*sign_extend(vec_select(r76:V16QI,parallel))
>        ;; WIDEN_MULT_HI_EXPR
>    16: [sfp:DI-0x40]=r82:V8HI
>    17: r83:V8HI=sign_extend(vec_select(r73:V16QI,parallel))*sign_extend(vec_select(r76:V16QI,parallel))
>        ;; WIDEN_MULT_LO_EXPR
>    18: [sfp:DI-0x30]=r83:V8HI
>    19: r84:DI=sfp:DI-0x40
>    20: r85:DI=sfp:DI-0x20
>    21: x1:DI=r84:DI
>    22: x0:DI=r85:DI
>    23: call [`check_vector_S16'] argc:0
>    24: r77:SI=0
>    28: x0:SI=r77:SI
>    29: use x0:SI
>
> Pass cse1 knows that r76 equals r79 and r73 equals r78, thus:
>
>   insn 15
>   <=> r82:V8HI=sign_extend(vec_select(r78:V16QI,{8,9,10,11,12,13,14,15}))*sign_extend(vec_select(r79:V16QI,{8,9,10,11,12,13,14,15}))
>   <=> r82:V8HI=sign_extend(119,118,117,116,115,114,113,112)*sign_extend(111,109,107,105,103,101,99,97)
>   <=> r82:V8HI=r81:V8HI
>
> Similarly, insn 17 <=> r83:V8HI=r80:V8HI.
>
> The logic is that simplify-rtx.c::simplify_binary_operation_1, called
> by cse1, doesn't know or care that the vectorizer did the shuffle
> earlier.  It just selects the high elements of a vector for
> insn 15/WIDEN_MULT_HI_EXPR.  As a result, insn 15 is optimized into a
> direct store of the high elements (i.e., the low bits) of the
> multiplication result, which is the exact opposite of what the smull2
> instruction computes.  So, with this otherwise useful optimization,
> wrong code is generated.
>
> This isn't the only case.  PR61114 is caused by the same issue, only
> this time it happens with BIT_FIELD_REF and the dom2 pass.
>
> One way out is to fix these issues one by one, introducing more and
> more big-endian checks at the places where normal optimization is done
> on auto-vectorized code.  Apparently, this is not
> sustainable/maintainable/acceptable.  The right fix, I think, is to
> make the vectorizer (middle-end) endianness independent and to rectify
> all back-ends that enable the vectorizer for big-endian.  Richard
> suggested that tree-vect*, optabs.c and the related back-ends should be
> fixed as the first step, with RTL fixed as a next step if it breaks in
> the same way.
>
> Moreover, I am also interested in the practical issues, for example:
> How much (back-end) work would this take?  Should we open a development
> branch so that people from each back-end can do the fix?
>
> Any comments on this?
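To illustrate the folding described in the quoted analysis, here is a small standalone model of what cse1 effectively computes for insn 15 once it knows the operands are the constant vectors r78 and r79.  Again this is plain C written for illustration; fold_widen_mult_sel and all other names are invented and are not simplify-rtx.c code:

  /* Models simplification of
       (mult (sign_extend (vec_select a lanes))
             (sign_extend (vec_select b lanes)))
     when a and b are known constant vectors: pick each lane by index and
     multiply.  Endianness never enters the picture.  */
  #include <stdio.h>

  static void
  fold_widen_mult_sel (const signed char *a, const signed char *b,
                       const int *lanes, short *out)
  {
    for (int i = 0; i < 8; i++)
      out[i] = (short) a[lanes[i]] * b[lanes[i]];
  }

  int
  main (void)
  {
    signed char r73[16] = { 127, 126, 125, 124, 123, 122, 121, 120,
                            119, 118, 117, 116, 115, 114, 113, 112 };
    signed char r76[16] = { 127, 125, 123, 121, 119, 117, 115, 113,
                            111, 109, 107, 105, 103, 101,  99,  97 };
    int hi_lanes[8] = { 8, 9, 10, 11, 12, 13, 14, 15 };
    short r82[8];

    /* insn 15 (the WIDEN_MULT_HI half) folds to the products of
       elements 8..15, i.e. 119*111 ... 112*97.  */
    fold_widen_mult_sel (r73, r76, hi_lanes, r82);

    for (int i = 0; i < 8; i++)
      printf ("%d ", r82[i]);
    printf ("\n");   /* 13209 12862 12519 12180 11845 11514 11187 10864 */
    return 0;
  }

Because the fold works purely on lane indices, its result is exactly the constant already loaded into r81, so cse1 replaces insn 15 with a copy of r81 - undoing the swap the back-end relied on and leaving the wrong half stored to the first part of output_S16.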
I'm taking a pragmatic position on this issue.  Go and fix the GIMPLE
middle-end and adjust the backend you care for.  The "other" big-endian
backend just decided to switch to little-endian - but it is also well
maintained and thus I expect the maintainers to quickly adapt to these
changes.  Leave other ports that may be affected (who knows ...) broken.

We do have testing coverage for the big-endian special cases (they were
after all introduced bit-by-bit for broken cases ...), so the
autotesters should come up with regressions on other platforms if there
are any.  They should be easily fixable.

So I think to get attention you have to post a patch (even if it only
touches one backend).  I'll happily approve the middle-end bits
conditional on "testing on one other big-endian platform".

Thanks,
Richard.

> Thanks,
> bin
>
> --
> Best Regards.