Hi,

To attract more eyes, I should have used a scarier subject like "GCC's vectorizer is heading in the wrong direction on big-endian targets".
The idea came from a simple vectorization case I ran into and a discussion with Richard. Given a simple case like:

typedef signed short *__restrict__ pRSINT16;
void check_vector_S16 (pRSINT16 a, pRSINT16 b);

int
foo (void)
{
  int i;
  signed char input1_S8[16] = { 127, 126, 125, 124, 123, 122, 121, 120,
                                119, 118, 117, 116, 115, 114, 113, 112 };
  signed char input2_S8[16] = { 127, 125, 123, 121, 119, 117, 115, 113,
                                111, 109, 107, 105, 103, 101, 99, 97 };
  signed short output_S16[16];
  signed short expected_S16[] = { 16129, 15750, 15375, 15004, 14637, 14274,
                                  13915, 13560, 13209, 12862, 12519, 12180,
                                  11845, 11514, 11187, 10864 };

  for (i = 0; i < 16; i++)
    output_S16[i] = (signed short) input1_S8[i] * input2_S8[i];

  check_vector_S16 (expected_S16, output_S16);
  return 0;
}

it is vectorized at optimization level -O3, and the GIMPLE optimized dump for aarch64 big-endian looks like:

;; Function foo (foo, funcdef_no=0, decl_uid=2564, symbol_order=0)

foo ()
{
  vector(8) short int vect_patt_77.9;
  vector(16) signed char vect__54.8;
  vector(16) signed char vect__52.5;
  short int expected_S16[16];
  short int output_S16[16];
  signed char input2_S8[16];
  signed char input1_S8[16];

  <bb 2>:
  MEM[(signed char[16] *)&input1_S8] = { 127, 126, 125, 124, 123, 122, 121, 120, 119, 118, 117, 116, 115, 114, 113, 112 };
  MEM[(signed char[16] *)&input2_S8] = { 127, 125, 123, 121, 119, 117, 115, 113, 111, 109, 107, 105, 103, 101, 99, 97 };
  MEM[(short int *)&expected_S16] = { 16129, 15750, 15375, 15004, 14637, 14274, 13915, 13560 };
  MEM[(short int *)&expected_S16 + 16B] = { 13209, 12862, 12519, 12180, 11845, 11514, 11187, 10864 };
  vect__52.5_2 = MEM[(signed char[16] *)&input1_S8];
  vect__54.8_73 = MEM[(signed char[16] *)&input2_S8];
  vect_patt_77.9_72 = WIDEN_MULT_HI_EXPR <vect__52.5_2, vect__54.8_73>;
  vect_patt_77.9_71 = WIDEN_MULT_LO_EXPR <vect__52.5_2, vect__54.8_73>;
  MEM[(short int *)&output_S16] = vect_patt_77.9_72;
  MEM[(short int *)&output_S16 + 16B] = vect_patt_77.9_71;
  check_vector_S16 (&expected_S16, &output_S16);
  input1_S8 ={v} {CLOBBER};
  input2_S8 ={v} {CLOBBER};
  output_S16 ={v} {CLOBBER};
  expected_S16 ={v} {CLOBBER};
  return 0;
}

while the dump for aarch64 little-endian looks like:

;; Function foo (foo, funcdef_no=0, decl_uid=2564, symbol_order=0)

foo ()
{
  vector(8) short int vect_patt_77.9;
  vector(16) signed char vect__54.8;
  vector(16) signed char vect__52.5;
  short int expected_S16[16];
  short int output_S16[16];
  signed char input2_S8[16];
  signed char input1_S8[16];

  <bb 2>:
  MEM[(signed char[16] *)&input1_S8] = { 127, 126, 125, 124, 123, 122, 121, 120, 119, 118, 117, 116, 115, 114, 113, 112 };
  MEM[(signed char[16] *)&input2_S8] = { 127, 125, 123, 121, 119, 117, 115, 113, 111, 109, 107, 105, 103, 101, 99, 97 };
  MEM[(short int *)&expected_S16] = { 16129, 15750, 15375, 15004, 14637, 14274, 13915, 13560 };
  MEM[(short int *)&expected_S16 + 16B] = { 13209, 12862, 12519, 12180, 11845, 11514, 11187, 10864 };
  vect__52.5_2 = MEM[(signed char[16] *)&input1_S8];
  vect__54.8_73 = MEM[(signed char[16] *)&input2_S8];
  vect_patt_77.9_72 = WIDEN_MULT_LO_EXPR <vect__52.5_2, vect__54.8_73>;
  vect_patt_77.9_71 = WIDEN_MULT_HI_EXPR <vect__52.5_2, vect__54.8_73>;
  MEM[(short int *)&output_S16] = vect_patt_77.9_72;
  MEM[(short int *)&output_S16 + 16B] = vect_patt_77.9_71;
  check_vector_S16 (&expected_S16, &output_S16);
  input1_S8 ={v} {CLOBBER};
  input2_S8 ={v} {CLOBBER};
  output_S16 ={v} {CLOBBER};
  expected_S16 ={v} {CLOBBER};
  return 0;
}

It is clear that WIDEN_MULT_LO_EXPR/WIDEN_MULT_HI_EXPR are switched depending on big-endian/little-endian.
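To pin down what those two statements are supposed to compute, here is a minimal scalar sketch (my own illustration, not GCC code). It assumes the little-endian convention in which the LO statement covers array elements 0..7 and the HI statement covers elements 8..15; on big-endian the vectorizer emits the pair swapped, as the dumps above show. The function name model_widen_mult_pair is made up for this example.

#include <stdio.h>

/* Minimal scalar model of the WIDEN_MULT_LO_EXPR/WIDEN_MULT_HI_EXPR pair,
   for illustration only.  Assumption: the little-endian convention, where
   LO covers array elements 0..7 and HI covers elements 8..15.  */
static void
model_widen_mult_pair (const signed char *a, const signed char *b,
                       signed short *lo, signed short *hi)
{
  int i;
  for (i = 0; i < 8; i++)
    lo[i] = (signed short) a[i] * b[i];          /* LO half: elements 0..7  */
  for (i = 0; i < 8; i++)
    hi[i] = (signed short) a[i + 8] * b[i + 8];  /* HI half: elements 8..15 */
}

int
main (void)
{
  signed char a[16] = { 127, 126, 125, 124, 123, 122, 121, 120,
                        119, 118, 117, 116, 115, 114, 113, 112 };
  signed char b[16] = { 127, 125, 123, 121, 119, 117, 115, 113,
                        111, 109, 107, 105, 103, 101, 99, 97 };
  signed short lo[8], hi[8];

  model_widen_mult_pair (a, b, lo, hi);
  /* lo[0] is 16129 and hi[0] is 13209, matching expected_S16[0] and
     expected_S16[8] in the test case above.  */
  printf ("lo[0]=%d hi[0]=%d\n", lo[0], hi[0]);
  return 0;
}

With the pair swapped on big-endian, getting the right values back relies entirely on how the back end interprets "hi" and "lo", which is where the trouble described below starts.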
The corresponding code doing the shuffle is like below, in tree-vect-stmts.c:supportable_widening_operation:

  if (BYTES_BIG_ENDIAN && c1 != VEC_WIDEN_MULT_EVEN_EXPR)
    {
      enum tree_code ctmp = c1;
      c1 = c2;
      c2 = ctmp;
    }

There are some other similar cases in the vectorizer, and all of them look suspicious, since intuitively the vectorizer should neither care about target endianness nor do such a shuffle. Anyway, this is how we do vectorization currently.

Sticking to the test case, the insns below are expanded for the WIDEN_MULT pair on big-endian:

;; WIDEN_MULT_HI_EXPR part
(insn:TI 39 50 16 (set (reg:V8HI 36 v4 [orig:106 vect_patt_77.9 ] [106])
        (mult:V8HI (sign_extend:V8HI (vec_select:V8QI (reg:V16QI 33 v1 [82])
                    (parallel:V16QI [
                            (const_int 8 [0x8])
                            (const_int 9 [0x9])
                            (const_int 10 [0xa])
                            (const_int 11 [0xb])
                            (const_int 12 [0xc])
                            (const_int 13 [0xd])
                            (const_int 14 [0xe])
                            (const_int 15 [0xf])
                        ])))
            (sign_extend:V8HI (vec_select:V8QI (reg:V16QI 32 v0 [87])
                    (parallel:V16QI [
                            (const_int 8 [0x8])
                            (const_int 9 [0x9])
                            (const_int 10 [0xa])
                            (const_int 11 [0xb])
                            (const_int 12 [0xc])
                            (const_int 13 [0xd])
                            (const_int 14 [0xe])
                            (const_int 15 [0xf])
                        ]))))) bug.c:26 1069 {aarch64_simd_vec_smult_hi_v16qi}
     (nil))

;; WIDEN_MULT_LO_EXPR part
(insn:TI 45 22 40 (set (reg:V8HI 34 v2 [orig:111 vect_patt_77.9 ] [111])
        (mult:V8HI (sign_extend:V8HI (vec_select:V8QI (reg:V16QI 33 v1 [82])
                    (parallel:V16QI [
                            (const_int 0 [0])
                            (const_int 1 [0x1])
                            (const_int 2 [0x2])
                            (const_int 3 [0x3])
                            (const_int 4 [0x4])
                            (const_int 5 [0x5])
                            (const_int 6 [0x6])
                            (const_int 7 [0x7])
                        ])))
            (sign_extend:V8HI (vec_select:V8QI (reg:V16QI 32 v0 [87])
                    (parallel:V16QI [
                            (const_int 0 [0])
                            (const_int 1 [0x1])
                            (const_int 2 [0x2])
                            (const_int 3 [0x3])
                            (const_int 4 [0x4])
                            (const_int 5 [0x5])
                            (const_int 6 [0x6])
                            (const_int 7 [0x7])
                        ]))))) bug.c:26 1063 {aarch64_simd_vec_smult_lo_v16qi}
     (nil))

Semantically, insn 39 does the widening multiplication for the high elements of vectors v1 and v0, and insn 45 does it for the low elements. To cope with (or take advantage of) the shuffle mentioned above, the aarch64 back end for big-endian matches insn 39 to "aarch64_simd_vec_smult_hi_v16qi" using the predicate "vect_par_cnst_hi_half" and generates the assembly instruction "smull2 v4.8h, v1.16b, v0.16b". The point is that this assembly instruction semantically multiplies the high halves of registers v1 and v0, which actually map to the low elements of the vectors on aarch64 big-endian. The interesting thing here is that the back end generates the "wrong" assembly instruction in order to shuffle back and get the right results. Though a little bit ugly, this would still be acceptable if things stayed that way. Unfortunately, GCC also wants to do as much optimization as possible on vectorized code.
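To make the "hi half is really the low elements" relationship concrete, here is a tiny sketch (again my own illustration, not GCC code; the helper name is made up). It assumes, purely for the purpose of the example, that a full-register vector load on aarch64 big-endian places array element i into register lane 15 - i, so the architecturally high lanes 8..15 that smull2 reads hold array elements 7..0.

/* Illustration only, not GCC code.  Assumed mapping for aarch64
   big-endian full-register loads: array element i lands in register
   lane (15 - i).  smull2 multiplies source lanes 8..15, which under
   that assumption are array elements 7..0, i.e. the *low* half of
   the data in memory order.  */
static void
model_smull2_bigendian (const signed char a_mem[16],
                        const signed char b_mem[16],
                        signed short result_by_lane[8])
{
  for (int lane = 8; lane < 16; lane++)
    {
      int element = 15 - lane;   /* lanes 8..15 -> array elements 7..0 */
      result_by_lane[lane - 8] =
        (signed short) a_mem[element] * b_mem[element];
    }
}

This is only a model of the observation above (the "hi" instruction ends up processing the low elements); the exact lane numbering GCC uses internally is part of what makes this area confusing.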
Look into the dump information before RTL pass cse1 (this time with my local patch applied, but that doesn't matter):

    3: NOTE_INSN_BASIC_BLOCK 2
    2: NOTE_INSN_FUNCTION_BEG
    5: r78:V16QI=[`*.LC0']        ;; r78={127,126,125,...,114,113,112}
    6: [sfp:DI-0x60]=r78:V16QI
    7: r79:V16QI=[`*.LC1']        ;; r79={127,125,123,...,101, 99, 97}
    8: [sfp:DI-0x50]=r79:V16QI
    9: r80:V8HI=[`*.LC2']         ;; r80=low_part(r78) w* low_part(r79)
   10: [sfp:DI-0x20]=r80:V8HI
   11: r81:V8HI=[`*.LC3']         ;; r81=high_part(r78) w* high_part(r79)
   12: [sfp:DI-0x10]=r81:V8HI
   13: r73:V16QI=[sfp:DI-0x60]
   14: r76:V16QI=[sfp:DI-0x50]
   15: r82:V8HI=sign_extend(vec_select(r73:V16QI,parallel))
               *sign_extend(vec_select(r76:V16QI,parallel))   ;; WIDEN_MULT_HI_EXPR
   16: [sfp:DI-0x40]=r82:V8HI
   17: r83:V8HI=sign_extend(vec_select(r73:V16QI,parallel))
               *sign_extend(vec_select(r76:V16QI,parallel))   ;; WIDEN_MULT_LO_EXPR
   18: [sfp:DI-0x30]=r83:V8HI
   19: r84:DI=sfp:DI-0x40
   20: r85:DI=sfp:DI-0x20
   21: x1:DI=r84:DI
   22: x0:DI=r85:DI
   23: call [`check_vector_S16'] argc:0
   24: r77:SI=0
   28: x0:SI=r77:SI
   29: use x0:SI

Pass cse1 knows that r76 equals r79 and r73 equals r78, thus:

  insn 15 <=> r82:V8HI=sign_extend(vec_select(r78:V16QI,{8,9,10,11,12,13,14,15}))
              *sign_extend(vec_select(r79:V16QI,{8,9,10,11,12,13,14,15}))
          <=> r82:V8HI=sign_extend(119,118,117,116,115,114,113,112)
              *sign_extend(111,109,107,105,103,101,99,97)
          <=> r82:V8HI=r81:V8HI

and similarly insn 17 <=> r83:V8HI=r80:V8HI.

The logic is that simplify-rtx.c::simplify_binary_operation_1, called by cse1, neither knows nor cares that the vectorizer did the shuffle earlier; it just selects the high elements of the vectors for insn 15 / WIDEN_MULT_HI_EXPR. As a result, insn 15 is optimized into a direct store of the high elements (i.e., the low bits) of the multiplication result, which is the exact opposite of what the smull2 instruction computes. So with this otherwise useful optimization, wrong code is generated.

This isn't the only case. PR61114 has the same cause, only there it shows up with BIT_FIELD_REF and the dom2 pass. One way out is to fix these issues one by one, introducing more and more BIG_ENDIAN checks at the places where ordinary optimizations are done on auto-vectorized code. Apparently, this is not sustainable/maintainable/acceptable. The right fix, I think, is to make the vectorizer (middle end) endianness independent and to rectify all back ends that enable the vectorizer for big-endian. Richard suggested that tree-vect*, optabs.c and the related back ends should be fixed as a first step, with RTL fixed as a next step if it breaks in the same way.

Moreover, I am also interested in the practical issues, for example: how much back-end work would this take? Should we open a development branch so that people from each back end can do the fixes?

Any comments on this?

Thanks,
bin

--
Best Regards.