On Wed, 5 Jun 2024, Richard Biener wrote:

> On Tue, 4 Jun 2024, Richard Sandiford wrote:
> 
> > Richard Biener <rguent...@suse.de> writes:
> > > The following emulates classical interleaving for SLP load permutes
> > > that we are unlikely handling natively.  This is to handle cases
> > > where interleaving (or load/store-lanes) is the optimal choice for
> > > vectorizing even when we are doing that within SLP.  An example
> > > would be
> > >
> > >   void foo (int * __restrict a, int * b)
> > >   {
> > >     for (int i = 0; i < 16; ++i)
> > >       {
> > >         a[4*i + 0] = b[4*i + 0] * 3;
> > >         a[4*i + 1] = b[4*i + 1] + 3;
> > >         a[4*i + 2] = (b[4*i + 2] * 3 + 3);
> > >         a[4*i + 3] = b[4*i + 3] * 3;
> > >       }
> > >   }
> > >
> > > where currently the SLP store is merging four single-lane SLP
> > > sub-graphs but none of the loads in it can be code-generated
> > > with V4SImode vectors and a VF of four as the permutes would need
> > > three vectors.
> > 
> > Nice!
> > 
> > > The patch introduces a lowering phase after SLP discovery but
> > > before SLP pattern recognition or permute optimization that
> > > analyzes all loads from the same dataref group and creates an
> > > interleaving scheme starting from an unpermuted load.
> > >
> > > What can be handled is quite restrictive, matching only a subset
> > > of the non-SLP interleaving cases (the power-of-two group size
> > > ones, in addition only cases without gaps).  The interleaving
> > > vectorization in addition can handle size 3 and 5 - but I am not
> > > sure if it's possible to do that in a VL agnostic way.  It
> > > should be still possible to set up the SLP graph in a way that
> > > a load-lane could be matched from SLP pattern recognition.
> > 
> > Yeah, I don't think it would be possible to decompose a 3- or
> > 5-lane grouped load into a series of VLA 2-input permutes.
> > But (as I think you're saying) it seems like a load-3-lanes would
> > just be a load with a LANE_PERMUTATION of N, N+3, N+6, N+9, ...
> > for lane N.
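[For readers outside the thread: the LANE_PERMUTATION mentioned above can be modeled in a few lines of Python.  This is purely illustrative list manipulation, not GCC code; `extract_lane` is a made-up helper name.]

```python
def extract_lane(linear, lane, lanes, nelts):
    """Lane `lane` of a load-<lanes>-lanes, expressed as a permutation of
    an unpermuted (linear) load: elements lane, lane+lanes, lane+2*lanes, ..."""
    return [linear[lane + lanes * i] for i in range(nelts)]

# A 3-lane group with 4 vector elements per lane: 12 scalars loaded linearly.
linear = list(range(12))
lane0 = extract_lane(linear, 0, 3, 4)   # indices 0, 3, 6, 9
lane1 = extract_lane(linear, 1, 3, 4)   # indices 1, 4, 7, 10
lane2 = extract_lane(linear, 2, 3, 4)   # indices 2, 5, 8, 11
```

For a VLA vector length this selector cannot be built from a fixed series of 2-input permutes, which is the difficulty discussed above.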
> > Is that right?
> 
> Yes, that's how it looks without this patch.  I think we'd need
> a load node loading N, N+1, N+2, ... and then permute nodes
> with N, N+3, ... and N+1, N+4, ... and N+2, N+5 ... so we generate
> one .LOAD_LANES from the load node and the permutes pick up the
> correct vector defs?  I'm not sure yet how classification and
> code generation would work for this.
> 
> The store side is already on trunk with the single SLP store node
> getting lanes via permutes.
> 
> It might be we want a load/store node with N inputs/outputs as the
> best representation and use lane_permutation to indicate the
> input (for stores) and output (for loads) "permute".
> 
> > > As said gaps are currently not handled - for SLP we have a
> > > representational issue that SLP_TREE_SCALAR_STMTS for "gap lanes"
> > > would need to be filled in some way (even if we just push NULL).
> > >
> > > The patch misses multi-level even/odd handling as well as CSEing
> > > intermediate generated permutes.  Both is quite straight-forward
> > > to add, but eventually there's a better or more general strategy
> > > for lowering?  The main goal of the patch is to avoid falling
> > > back to non-SLP for cases the interleaving code handles.
> > 
> > Does the multi-level thing include examples like:
> > 
> >   int a[2 * 16];
> >   int b[8 * 16];
> >   void f()
> >   {
> >     for (int i = 0; i < 16; ++i)
> >       {
> >         a[i * 2 + 0] += b[i * 8 + 0] + b[i * 8 + 1] + b[i * 8 + 2] + b[i * 8 + 3];
> >         a[i * 2 + 1] += b[i * 8 + 4] + b[i * 8 + 5] + b[i * 8 + 6] + b[i * 8 + 7];
> >       }
> >   }
> > 
> > ?
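[For readers outside the thread: the classical even/odd interleaving scheme the patch lowers to can be sketched as executable pseudocode.  Python lists stand in for V4SI vectors, and `even`/`odd` stand for the two-input VEC_PERM_EXPRs with selectors { 0, 2, 4, ... } and { 1, 3, 5, ... }; none of this is the patch's actual implementation.]

```python
def even(v1, v2):
    # VEC_PERM_EXPR <v1, v2, { 0, 2, 4, ... }> on the concatenation
    return (v1 + v2)[0::2]

def odd(v1, v2):
    # VEC_PERM_EXPR <v1, v2, { 1, 3, 5, ... }> on the concatenation
    return (v1 + v2)[1::2]

# Power-of-two group size 4, V4SI: four contiguous (unpermuted) vectors
# covering b[0..15].
b = [f"b{i}" for i in range(16)]
A, B, C, D = b[0:4], b[4:8], b[8:12], b[12:16]

# Level 1: split into even/odd interleaves.
e0, e1 = even(A, B), even(C, D)
o0, o1 = odd(A, B), odd(C, D)

# Level 2 (log2(group-size) levels in total) yields the per-lane vectors.
lane0 = even(e0, e1)   # b0, b4, b8,  b12
lane2 = odd(e0, e1)    # b2, b6, b10, b14
lane1 = even(o0, o1)   # b1, b5, b9,  b13
lane3 = odd(o0, o1)    # b3, b7, b11, b15
```

Because every step is a fixed 2-input even/odd permute, this works for VLA vectors too, which is why the power-of-two restriction falls out naturally.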
> > For that we generate:
> > 
> >   _45 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 1, 3, 5, 7 }>;
> >   _44 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 1, 3, 5, 7 }>;
> >   _43 = VEC_PERM_EXPR <_45, _44, { 1, 3, 5, 7 }>;
> >   _49 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 0, 2, 4, 6 }>;
> >   _48 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 0, 2, 4, 6 }>;
> >   _47 = VEC_PERM_EXPR <_49, _48, { 1, 3, 5, 7 }>;
> >   _53 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 0, 1, 4, 5 }>;
> >   _52 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 0, 1, 4, 5 }>;
> >   _51 = VEC_PERM_EXPR <_53, _52, { 1, 3, 5, 7 }>;
> >   _54 = VEC_PERM_EXPR <_49, _48, { 0, 2, 4, 6 }>;
> > 
> > (two even level 1, one even level 2, one odd level 1), whereas
> > preferring 2xeven + 2xodd would avoid the third set of first-level
> > permutes:
> > 
> >   _45 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 1, 3, 5, 7 }>;
> >   _44 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 1, 3, 5, 7 }>;
> >   _43 = VEC_PERM_EXPR <_45, _44, { 1, 3, 5, 7 }>;
> >   _49 = VEC_PERM_EXPR <vect__4.9_63, vect__4.10_61, { 0, 2, 4, 6 }>;
> >   _48 = VEC_PERM_EXPR <vect__4.11_59, vect__4.12_57, { 0, 2, 4, 6 }>;
> >   _47 = VEC_PERM_EXPR <_49, _48, { 1, 3, 5, 7 }>;
> >   _51 = VEC_PERM_EXPR <_45, _44, { 0, 2, 4, 6 }>;
> >   _54 = VEC_PERM_EXPR <_49, _48, { 0, 2, 4, 6 }>;
Btw, I just checked GCC 14, which uses interleaving for this (the SLP
path would need a three-vector permute), and that gets us double the
VF and overall 28 permutes, compared to 10 permutes for the above (and
8 for your optimal variant).  So it should already be an improvement
at least.

Richard.