https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640
--- Comment #15 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 26 Jun 2024, ams at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640
>
> --- Comment #14 from Andrew Stubbs <ams at gcc dot gnu.org> ---
> On 26/06/2024 13:34, rguenth at gcc dot gnu.org wrote:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640
> >
> > --- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
> > (In reply to Richard Biener from comment #12)
> >> (In reply to Andrew Stubbs from comment #10)
> >>> GFX10 has more limited permutation capabilities than GFX9 because it
> >>> only has 32-lane vectors natively, even though we're using the 64-lane
> >>> "compatibility" mode.
> >>>
> >>> However, in theory, the permutation capabilities on V32 and below should
> >>> be the same, and some permutations on V64 are allowed, so I don't know
> >>> why it doesn't use it.  It's possible I broke the logic in
> >>> gcn_vectorize_vec_perm_const:
> >>>
> >>>   /* RDNA devices can only do permutations within each group of 32-lanes.
> >>>      Reject permutations that cross the boundary.  */
> >>>   if (TARGET_RDNA2_PLUS)
> >>>     for (unsigned int i = 0; i < nelt; i++)
> >>>       if (i < 31 ? perm[i] > 31 : perm[i] < 32)
> >>>         return false;
> >>>
> >>> It looks right to me though?
> >>
> >> nelt == 32 so I think the last element has the wrong check applied?
> >>
> >> It should be
> >>
> >>>       if (i < 32 ? perm[i] > 31 : perm[i] < 32)
> >>
> >> I think.  With that the vectorization happens in a similar way but the
> >> failure still doesn't reproduce (without the patch, of course).
>
> Oops, I think you're right.
>
> > Btw, the above looks quite odd for nelt == 32 anyway - we are permuting
> > two vectors src0 and src1 into one 32 element dst vector (it's no longer
> > required that src0 and src1 line up with the dst vector size btw, they
> > might have different nelt).  So the loop would reject interleaving
> > the low parts of two 32 element vectors, a permute that would look like
> > { 0, 32, 1, 33, 2, 34 ... } so does "within each group of 32-lanes"
> > mean you can never mix the two vector inputs?  Or does GCN not have
> > a two-to-one vector permute instruction?
>
> GCN does not have two-to-one vector permute in hardware, so we do two
> permutes and a vec_merge to get the same effect.
>
> GFX9 can permute all the elements within a 64 lane vector arbitrarily.
>
> GFX10 and GFX11 can permute the low-32 and high-32 elements freely, but
> no value may cross the boundary. AFAIK there's no way to do that via any
> vector instruction (i.e. without writing to memory, or extracting values
> element-wise).

I see - so it cannot even swap low-32 and high-32?  I'm thinking of
what sub-part of permutes would be possible by extending the two-to-one
vec_merge trick.

OTOH we restrict GFX10/11 to 32 lane vectors so in practice this
restriction should be fine.
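
For reference, a minimal scalar sketch of the two things discussed above: the lane-boundary check and the "two permutes and a vec_merge" emulation of a two-to-one permute. This is not the code in gcn_vectorize_vec_perm_const; the function names, the split parameter and the plain-array representation are assumptions made for illustration, and selector values are assumed to be below 2 * nelt.

```c
#include <assert.h>
#include <stdbool.h>

/* Scalar model of the lane check quoted in the thread; not the GCC code.
   SPLIT is the first index treated as belonging to the high group:
   31 reproduces the original condition, 32 the proposed fix.  */
static bool
perm_stays_in_group (const unsigned char *perm, unsigned int nelt,
                     unsigned int split)
{
  assert (nelt <= 64);
  for (unsigned int i = 0; i < nelt; i++)
    if (i < split ? perm[i] > 31 : perm[i] < 32)
      return false;
  return true;
}

/* Scalar model of "two permutes and a vec_merge": permute each source
   with the selector reduced modulo NELT, then pick per lane.
   Assumes every sel[i] is below 2 * NELT.  */
static void
two_to_one_perm (int *dst, const int *src0, const int *src1,
                 const unsigned char *sel, unsigned int nelt)
{
  assert (nelt <= 64);
  int tmp0[64], tmp1[64];
  for (unsigned int i = 0; i < nelt; i++)
    {
      tmp0[i] = src0[sel[i] % nelt];   /* first one-input permute */
      tmp1[i] = src1[sel[i] % nelt];   /* second one-input permute */
    }
  for (unsigned int i = 0; i < nelt; i++)
    dst[i] = sel[i] < nelt ? tmp0[i] : tmp1[i];   /* vec_merge by mask */
}
```

With nelt == 32 and the identity selector, split == 31 rejects the permutation because lane 31 falls into the "high group" branch, while split == 32 accepts it. The interleave selector { 0, 32, 1, 33, ... } is rejected with either value, which is the case questioned in comment #13.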