https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117557

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
           Keywords|needs-reduction               |
             Status|NEW                           |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |tnfchris at gcc dot gnu.org

--- Comment #8 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Testcase:

#include <stdint.h>
#include <string.h>

#define N 8
#define L 8

void f(const uint8_t * restrict seq1,
       const uint8_t *idx, uint8_t *seq_out) {
  for (int i = 0; i < L; ++i) {
    uint8_t h = idx[i];
    memcpy((void *)&seq_out[i * N], (const void *)&seq1[h * N / 2], N / 2);
  }
}

compiled at -O3 -mcpu=neoverse-n1+sve

miscompiles to:

  vect_patt_26.9_89 = [vec_unpack_lo_expr] vect_patt_27.8_88;
  vect_patt_26.9_90 = [vec_unpack_hi_expr] vect_patt_27.8_88;
  vect_patt_25.10_94 = .MASK_GATHER_LOAD (_91, vect_patt_26.9_89, 1, { 0, ... }, loop_mask_92, { 0, ... });
  vect_patt_25.11_95 = .MASK_GATHER_LOAD (_91, vect_patt_26.9_90, 1, { 0, ... }, loop_mask_93, { 0, ... });
  .MASK_SCATTER_STORE (seq_out_15(D), { 0, 8, 16, ... }, 1, vect_patt_25.10_94, loop_mask_92);
  .MASK_SCATTER_STORE (seq_out_15(D), { 0, 8, 16, ... }, 1, vect_patt_25.11_95, loop_mask_92);

rather than

  vect_patt_26.9_90 = [vec_unpack_lo_expr] vect_patt_27.8_89;
  vect_patt_26.9_91 = [vec_unpack_hi_expr] vect_patt_27.8_89;
  vect_patt_25.10_95 = .MASK_GATHER_LOAD (_92, vect_patt_26.9_90, 1, { 0, ... }, loop_mask_93);
  vect_patt_25.11_96 = .MASK_GATHER_LOAD (_92, vect_patt_26.9_91, 1, { 0, ... }, loop_mask_94);
  .MASK_SCATTER_STORE (seq_out_15(D), { 0, 8, 16, ... }, 1, vect_patt_25.10_95, loop_mask_93);
  vectp_seq_out.12_100 = seq_out_15(D) + POLY_INT_CST [32, 32];
  .MASK_SCATTER_STORE (vectp_seq_out.12_100, { 0, 8, 16, ... }, 1, vect_patt_25.11_96, loop_mask_94);

This happens because the index passed to vect_get_loop_mask is wrong for SLP as
Richi suspected and dataref_ptr is wrong because it's being treated as a
constant inside the vec_num loop.

i.e. it thinks for SLP every store is to the same location. The
bump_vector_ptr call needs to be inside the inner loop as well or the inner
loop flattened into the outer one which then iterates over ncopies * vec_num.

Testing a patch. So mine.
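
To make the loop-structure problem concrete, here is a small standalone C model of the two shapes described above. It is only a sketch and not the vectorizer code: struct store, plan_buggy, plan_fixed, and the constants NCOPIES, VEC_NUM and STRIDE are all hypothetical names, and the fixed 32-byte stride is a simplification standing in for the POLY_INT_CST [32, 32] bump seen in the expected dump. The flattened index used in plan_fixed is likewise an assumption for illustration, not the contents of the patch being tested.

```c
#include <stddef.h>

/* Hypothetical model values: one copy, two vectors per copy, and a
   fixed 32-byte pointer bump standing in for POLY_INT_CST [32, 32].  */
enum { NCOPIES = 1, VEC_NUM = 2, STRIDE = 32 };

/* What each emitted scatter store would use in this model.  */
struct store {
    size_t mask_index;   /* which loop mask is selected */
    size_t base_offset;  /* byte offset added to seq_out */
};

/* Shape described as wrong: the mask index and the pointer bump depend
   only on the outer counter j, so they are constant in the inner loop.  */
void plan_buggy(struct store out[NCOPIES * VEC_NUM])
{
    size_t base = 0;
    for (size_t j = 0; j < NCOPIES; ++j) {
        if (j > 0)
            base += STRIDE;              /* bumped once per copy only */
        for (size_t i = 0; i < VEC_NUM; ++i) {
            out[j * VEC_NUM + i].mask_index = j;
            out[j * VEC_NUM + i].base_offset = base;
        }
    }
}

/* One of the two suggested shapes: a single flattened loop over
   ncopies * vec_num, bumping the pointer for every vector.  */
void plan_fixed(struct store out[NCOPIES * VEC_NUM])
{
    size_t base = 0;
    for (size_t k = 0; k < (size_t) NCOPIES * VEC_NUM; ++k) {
        if (k > 0)
            base += STRIDE;              /* bumped once per vector */
        out[k].mask_index = k;
        out[k].base_offset = base;
    }
}
```

In this model plan_buggy gives both stores the same mask index and the same base offset, which mirrors the two .MASK_SCATTER_STORE calls on seq_out_15(D) with loop_mask_92 in the miscompiled dump. plan_fixed gives the second store its own mask index and a base offset of 32, matching the shape of the expected dump.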