https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117031
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #3)
> (In reply to Richard Biener from comment #2)
> > (In reply to Tamar Christina from comment #0)
> > > GCC seems to miss that there is no gap between the group accesses and
> > > that stride == 1.
> > > test3 is vectorized linearly by GCC, so it seems this is a missed
> > > optimization in data ref analysis?
> > 
> > The load-lanes look fine, so it must be the code generation for the
> > HI to DI via SI conversions using unpacks you are complaining about?
> 
> No, that one I have a patch for.
> 
> > Using load-lanes is natural here.
> > 
> > This isn't about permutes due to VF or so, isn't it?
> 
> It is; the load-lanes are unnecessary because there is no permute during
> the loop: the group size is equal to the stride and the offsets are linear.
> 
> LOAD_LANES are really expensive, especially 4-register ones.
> 
> My complaint is that this loop does not have a permute.  While it may look
> like the entries are permuted, they are not.
> 
> Essentially test1 and test3 are the same.  The vectorizer picks VF=8, so it
> unrolls test1 into test3, but fails to see that the unrolled code is linear;
> when manually unrolled it does see it:
> 
> e.g.
> 
> void
> test3 (unsigned short *x, double *y, int n)
> {
>   for (int i = 0; i < n; i+=2)
>     {
>       unsigned short a1 = x[i * 4 + 0];
>       unsigned short b1 = x[i * 4 + 1];
>       unsigned short c1 = x[i * 4 + 2];
>       unsigned short d1 = x[i * 4 + 3];
>       y[i+0] = (double)a1 + (double)b1 + (double)c1 + (double)d1;
>       unsigned short a2 = x[(i + 1) * 4 + 0];
>       unsigned short b2 = x[(i + 1) * 4 + 1];
>       unsigned short c2 = x[(i + 1) * 4 + 2];
>       unsigned short d2 = x[(i + 1) * 4 + 3];
>       y[i+1] = (double)a2 + (double)b2 + (double)c2 + (double)d2;
>     }
> }
> 
> does not use LOAD_LANES.

It uses interleaving because there's no ld8, and when
vect_lower_load_permutations decides heuristically to use load-lanes it
tries to do so in a vector-size-agnostic way, so it doesn't consider using
ld4 twice.

There _are_ permutes, because four lanes are used to compute the single-lane
store of the reduction operation.  The vectorization of the unrolled loop,
which does not use load-lanes, shows them:

  vect_a1_53.10_234 = MEM <vector(8) short unsigned int> [(short unsigned int *)vectp_x.8_232];
  vectp_x.8_235 = vectp_x.8_232 + 16;
  vect_a1_53.11_236 = MEM <vector(8) short unsigned int> [(short unsigned int *)vectp_x.8_235];
  vectp_x.8_237 = vectp_x.8_232 + 32;
  vect_a1_53.12_238 = MEM <vector(8) short unsigned int> [(short unsigned int *)vectp_x.8_237];
  vectp_x.8_239 = vectp_x.8_232 + 48;
  vect_a1_53.13_240 = MEM <vector(8) short unsigned int> [(short unsigned int *)vectp_x.8_239];
  _254 = VEC_PERM_EXPR <vect_a1_53.10_234, vect_a1_53.11_236, { 1, 3, 5, 7, 9, 11, 13, 15 }>;
  _255 = VEC_PERM_EXPR <vect_a1_53.12_238, vect_a1_53.13_240, { 1, 3, 5, 7, 9, 11, 13, 15 }>;
  _286 = VEC_PERM_EXPR <_254, _255, { 1, 3, 5, 7, 9, 11, 13, 15 }>;
  ...

That's simply load-lanes open-coded.  If open-coding ld4 is better than
using ld4, just make it not available to the vectorizer?  Similar to ld2,
I suppose.
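
For reference, test1 itself is only mentioned above, not quoted; a minimal
sketch, assuming it is simply the non-unrolled form of test3 (one group of
four loads and one store per iteration) -- this is an editorial assumption,
not the exact testcase from comment #0:

void
test1 (unsigned short *x, double *y, int n)
{
  for (int i = 0; i < n; i++)
    {
      /* One group of four consecutive loads; group size == stride == 4,
         so there is no gap and no real permute within the group.  */
      unsigned short a = x[i * 4 + 0];
      unsigned short b = x[i * 4 + 1];
      unsigned short c = x[i * 4 + 2];
      unsigned short d = x[i * 4 + 3];
      y[i] = (double)a + (double)b + (double)c + (double)d;
    }
}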
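
To make "load-lanes open-coded" concrete, a sketch using AArch64 ACLE
intrinsics of two ways to de-interleave eight 4-element groups of shorts;
the helper names and the exact uzp1/uzp2 arrangement are illustrative
assumptions, not vectorizer output:

#include <arm_neon.h>
#include <stdint.h>

/* ld4: one structure load, de-interleaving done in hardware.
   .val[0] holds the a fields, .val[1] b, .val[2] c, .val[3] d.  */
static inline uint16x8x4_t
load_groups_ld4 (const uint16_t *x)
{
  return vld4q_u16 (x);
}

/* Open-coded: four contiguous loads plus an explicit de-interleave
   network, roughly what the VEC_PERM_EXPR chain above expands to.  */
static inline uint16x8x4_t
load_groups_open_coded (const uint16_t *x)
{
  uint16x8_t v0 = vld1q_u16 (x);        /* x[0..7]   */
  uint16x8_t v1 = vld1q_u16 (x + 8);    /* x[8..15]  */
  uint16x8_t v2 = vld1q_u16 (x + 16);   /* x[16..23] */
  uint16x8_t v3 = vld1q_u16 (x + 24);   /* x[24..31] */
  uint16x8_t e01 = vuzp1q_u16 (v0, v1); /* even lanes of x[0..15]  */
  uint16x8_t o01 = vuzp2q_u16 (v0, v1); /* odd lanes of x[0..15]   */
  uint16x8_t e23 = vuzp1q_u16 (v2, v3); /* even lanes of x[16..31] */
  uint16x8_t o23 = vuzp2q_u16 (v2, v3); /* odd lanes of x[16..31]  */
  uint16x8x4_t r;
  r.val[0] = vuzp1q_u16 (e01, e23);     /* x[0], x[4], ...  (a)    */
  r.val[1] = vuzp1q_u16 (o01, o23);     /* x[1], x[5], ...  (b)    */
  r.val[2] = vuzp2q_u16 (e01, e23);     /* x[2], x[6], ...  (c)    */
  r.val[3] = vuzp2q_u16 (o01, o23);     /* x[3], x[7], ...  (d)    */
  return r;
}

Whether the single ld4 or the four ld1 plus six uzp instructions is cheaper
is exactly the cost question raised above.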