https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640

--- Comment #16 from Andrew Stubbs <ams at gcc dot gnu.org> ---
On 26/06/2024 14:41, rguenther at suse dot de wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640
> 
> --- Comment #15 from rguenther at suse dot de <rguenther at suse dot de> ---
>>> Btw, the above looks quite odd for nelt == 32 anyway - we are permuting
>>> two vectors src0 and src1 into one 32 element dst vector (it's no longer
>>> required that src0 and src1 line up with the dst vector size btw, they
>>> might have different nelt).  So the loop would reject interleaving
>>> the low parts of two 32 element vectors, a permute that would look like
>>> { 0, 32, 1, 33, 2, 34 ... } so does "within each group of 32-lanes"
>>> mean you can never mix the two vector inputs?  Or does GCN not have
>>> a two-to-one vector permute instruction?
>>
>> GCN does not have two-to-one vector permute in hardware, so we do two
>> permutes and a vec_merge to get the same effect.
>>
>> GFX9 can permute all the elements within a 64 lane vector arbitrarily.
>>
>> GFX10 and GFX11 can permute the low-32 and high-32 elements freely, but
>> no value may cross the boundary. AFAIK there's no way to do that via any
>> vector instruction (i.e. without writing to memory, or extracting values
>> element-wise).
> 
> I see - so it cannot even swap low-32 and high-32?  I'm thinking of
> what sub-part of permutes would be possible by extending the two-to-one
> vec_merge trick.

No(?)

The 64-lane compatibility mode works, under the hood, by allocating 
double the number of 32-lane registers and then executing each 
instruction twice. Mostly this is invisible, but it gets exposed for 
permutations and the like. Logically, the microarchitecture could do a 
vec_merge to DTRT, but I've not found a way to express that.

It's possible I missed something when RTFM.

> OTOH we restrict GFX10/11 to 32 lane vectors so in practice this
> restriction should be fine.

Yes, with the "31" fixed it should work.

Andrew

Reply via email to