On 03/12/24, Richard Henderson wrote:
> On 12/3/24 12:08, Anton Johansson wrote:
> > On 22/11/24, Richard Henderson wrote:
> > > On 11/20/24 19:49, Anton Johansson wrote:
> > > > Adds new functions to the gvec API for truncating, sign- or
> > > > zero-extending vector elements.  Currently implemented as helper
> > > > functions, these may be mapped onto host vector instructions in
> > > > the future.
> > > >
> > > > For the time being, allows translation of more complicated vector
> > > > instructions by helper-to-tcg.
> > > >
> > > > Signed-off-by: Anton Johansson <a...@rev.ng>
> > > > ---
> > > >  accel/tcg/tcg-runtime-gvec.c     | 41 +++++++++++++++++
> > > >  accel/tcg/tcg-runtime.h          | 22 +++++++++
> > > >  include/tcg/tcg-op-gvec-common.h | 18 ++++++++
> > > >  tcg/tcg-op-gvec.c                | 78 ++++++++++++++++++++++++++++++++
> > > >  4 files changed, 159 insertions(+)
> > > >
> > > > diff --git a/accel/tcg/tcg-runtime-gvec.c b/accel/tcg/tcg-runtime-gvec.c
> > > > index afca89baa1..685c991e6a 100644
> > > > --- a/accel/tcg/tcg-runtime-gvec.c
> > > > +++ b/accel/tcg/tcg-runtime-gvec.c
> > > > @@ -1569,3 +1569,44 @@ void HELPER(gvec_bitsel)(void *d, void *a, void *b, void *c, uint32_t desc)
> > > >      }
> > > >      clear_high(d, oprsz, desc);
> > > >  }
> > > > +
> > > > +#define DO_SZ_OP1(NAME, DSTTY, SRCTY)                              \
> > > > +void HELPER(NAME)(void *d, void *a, uint32_t desc)                 \
> > > > +{                                                                  \
> > > > +    intptr_t oprsz = simd_oprsz(desc);                             \
> > > > +    intptr_t elsz = oprsz/sizeof(DSTTY);                           \
> > > > +    intptr_t i;                                                    \
> > > > +                                                                   \
> > > > +    for (i = 0; i < elsz; ++i) {                                   \
> > > > +        SRCTY aa = *((SRCTY *) a + i);                             \
> > > > +        *((DSTTY *) d + i) = aa;                                   \
> > > > +    }                                                              \
> > > > +    clear_high(d, oprsz, desc);                                    \
> > >
> > > This formulation is not valid.
> > >
> > > (1) Generic forms must *always* operate strictly on columns.  This
> > > formulation is either expanding a narrow vector to a wider vector or
> > > compressing a wider vector to a narrow vector.
> > >
> > > (2) This takes no care for byte ordering of the data between columns.
> > > This is where sticking strictly to columns helps, in that we can
> > > assume that data is host-endian *within the column*, but we cannot
> > > assume anything about the element indexing of ptr + i.
> >
> > Concerning (1) and (2), is this a limitation imposed on generic vector
> > ops. to simplify mapping to host vector instructions where
> > padding/alignment of elements might differ?  From my understanding, the
> > helper above should be fine since we can assume contiguous elements?
>
> This is a limitation imposed on generic vector ops, because different
> target/arch/ represent their vectors in different ways.
>
> For instance, Arm and RISC-V chunk the vector into host-endian uint64_t,
> with the chunks indexed little-endian.  But PPC puts the entire 128-bit
> vector in host-endian bit ordering, so the uint64_t chunks are host-endian.
>
> On a big-endian host, ptr+1 may be addressing element i-1 or i-7 instead
> of i+1.
Ah, I see the problem now, thanks for the explanation :)

> > I see, I don't think we can make this work for Hexagon vector ops.
> > As an example, consider V6_vadduwsat, which performs an unsigned
> > saturating add of 32-bit elements; currently we emit
> >
> > void emit_V6_vadduwsat(intptr_t vec2, intptr_t vec7, intptr_t vec6) {
> >     VectorMem mem = {0};
> >     intptr_t vec5 = temp_new_gvec(&mem, 256);
> >     tcg_gen_gvec_zext(MO_64, MO_32, vec5, vec7, 256, 128, 256);
> >
> >     intptr_t vec1 = temp_new_gvec(&mem, 256);
> >     tcg_gen_gvec_zext(MO_64, MO_32, vec1, vec6, 256, 128, 256);
> >
> >     tcg_gen_gvec_add(MO_64, vec1, vec1, vec5, 256, 256);
> >
> >     intptr_t vec3 = temp_new_gvec(&mem, 256);
> >     tcg_gen_gvec_dup_imm(MO_64, vec3, 256, 256, 4294967295ull);
> >
> >     tcg_gen_gvec_umin(MO_64, vec1, vec1, vec3, 256, 256);
> >
> >     tcg_gen_gvec_trunc(MO_32, MO_64, vec2, vec1, 128, 256, 128);
> > }
> >
> > so we really do rely on the size-changing property of zext here: the
> > input vectors are 128 bytes and we expand them to 256 bytes.  We could
> > expand vector operations within the instruction to the largest vector
> > size, but we would need to zext and trunc to destination and source
> > registers anyway.

> Yes, well, this is the output of llvm though, yes?

Yes

> Did you forget to describe TCG's native saturating operations to the
> compiler?  tcg_gen_gvec_usadd performs exactly this operation.
>
> And if you'd like to improve llvm, usadd(a, b) equals umin(a, ~b) + b.
> Fewer operations without having to change vector sizes.
> Similarly for unsigned saturating subtract: ussub(a, b) equals umax(a, b) - b.

In this case LLVM wasn't able to optimize the pattern to an llvm.uadd.sat
intrinsic; if it had, I believe we would have emitted tcg_gen_gvec_usadd.
We can manually optimize the above pattern to llvm.uadd.sat to avoid the
extra size changes.  This might be fixed in future LLVM versions, but
otherwise it seems like a reasonable change to push upstream.

The point is that we have a lot of Hexagon instructions where size changes
are probably unavoidable.  Another example is V6_vshuffh, which takes a
<16 x i16> vector and shuffles the upper <8 x i16> into the upper 16 bits
of an <8 x i32> vector:

void emit_V6_vshuffh(intptr_t vec3, intptr_t vec7) {
    VectorMem mem = {0};
    intptr_t vec2 = temp_new_gvec(&mem, 128);
    tcg_gen_gvec_zext(MO_32, MO_16, vec2, vec7, 128, 64, 128);

    intptr_t vec0 = temp_new_gvec(&mem, 128);
    tcg_gen_gvec_zext(MO_32, MO_16, vec0, (vec7 + 64ull), 128, 64, 128);

    intptr_t vec1 = temp_new_gvec(&mem, 128);
    tcg_gen_gvec_shli(MO_32, vec1, vec0, 16, 128, 128);

    tcg_gen_gvec_or(MO_32, vec3, vec1, vec2, 128, 128);
}

To avoid bloating the email too much with examples, you can see three more
here: https://pad.rev.ng/11IvAKhiRy2cPwC7MX9nXA

Maybe we rely on the target defining size-changing operations if they are
needed?

//Anton
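
P.S. To check that I've understood the column rule: a generic helper is
allowed to look like the sketch below, where every access covers a whole
uint64_t column.  The helper name and the operation are hypothetical, only
meant to illustrate the pattern I see in the existing
accel/tcg/tcg-runtime-gvec.c helpers:

    /*
     * Hypothetical column-strict helper: each iteration reads and writes
     * one full uint64_t column, so data is host-endian within the column
     * and nothing is assumed about element indexing across columns.
     */
    void HELPER(gvec_example_not)(void *d, void *a, uint32_t desc)
    {
        intptr_t oprsz = simd_oprsz(desc);
        intptr_t i;

        for (i = 0; i < oprsz; i += sizeof(uint64_t)) {
            *(uint64_t *)(d + i) = ~*(uint64_t *)(a + i);
        }
        clear_high(d, oprsz, desc);
    }

whereas DO_SZ_OP1 above indexes individual SRCTY/DSTTY elements across the
whole vector, which is exactly the cross-column indexing that cannot be
assumed.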