On 03/12/24, Richard Henderson wrote:
> On 12/3/24 12:08, Anton Johansson wrote:
> > On 22/11/24, Richard Henderson wrote:
> > > On 11/20/24 19:49, Anton Johansson wrote:
> > > > Adds new functions to the gvec API for truncating, sign- or
> > > > zero-extending vector elements.  Currently implemented as helper
> > > > functions, these may be mapped onto host vector instructions in
> > > > the future.
> > > >
> > > > For the time being, allows translation of more complicated vector
> > > > instructions by helper-to-tcg.
> > > >
> > > > Signed-off-by: Anton Johansson <a...@rev.ng>
> > > > ---
> > > >  accel/tcg/tcg-runtime-gvec.c     | 41 +++++++++++++++++
> > > >  accel/tcg/tcg-runtime.h          | 22 +++++++++
> > > >  include/tcg/tcg-op-gvec-common.h | 18 ++++++++
> > > >  tcg/tcg-op-gvec.c                | 78 ++++++++++++++++++++++++++++++++
> > > >  4 files changed, 159 insertions(+)
> > > >
> > > > diff --git a/accel/tcg/tcg-runtime-gvec.c b/accel/tcg/tcg-runtime-gvec.c
> > > > index afca89baa1..685c991e6a 100644
> > > > --- a/accel/tcg/tcg-runtime-gvec.c
> > > > +++ b/accel/tcg/tcg-runtime-gvec.c
> > > > @@ -1569,3 +1569,44 @@ void HELPER(gvec_bitsel)(void *d, void *a, void *b, void *c, uint32_t desc)
> > > >      }
> > > >      clear_high(d, oprsz, desc);
> > > >  }
> > > > +
> > > > +#define DO_SZ_OP1(NAME, DSTTY, SRCTY)                              \
> > > > +void HELPER(NAME)(void *d, void *a, uint32_t desc)                 \
> > > > +{                                                                  \
> > > > +    intptr_t oprsz = simd_oprsz(desc);                             \
> > > > +    intptr_t elsz = oprsz/sizeof(DSTTY);                           \
> > > > +    intptr_t i;                                                    \
> > > > +                                                                   \
> > > > +    for (i = 0; i < elsz; ++i) {                                   \
> > > > +        SRCTY aa = *((SRCTY *) a + i);                             \
> > > > +        *((DSTTY *) d + i) = aa;                                   \
> > > > +    }                                                              \
> > > > +    clear_high(d, oprsz, desc);                                    \
> > >
> > > This formulation is not valid.
> > >
> > > (1) Generic forms must *always* operate strictly on columns.  This
> > > formulation is either expanding a narrow vector to a wider vector or
> > > compressing a wider vector to a narrow vector.
> > >
> > > (2) This takes no care for byte ordering of the data between columns.
> > > This is where sticking strictly to columns helps, in that we can
> > > assume that data is host-endian *within the column*, but we cannot
> > > assume anything about the element indexing of ptr + i.
> >
> > Concerning (1) and (2), is this a limitation imposed on generic vector
> > ops. to simplify mapping to host vector instructions where
> > padding/alignment of elements might differ?  From my understanding, the
> > helper above should be fine since we can assume contiguous elements?
>
> This is a limitation imposed on generic vector ops, because different
> target/arch/ represent their vectors in different ways.
>
> For instance, Arm and RISC-V chunk the vector into host-endian uint64_t,
> with the chunks indexed little-endian.  But PPC puts the entire 128-bit
> vector in host-endian bit ordering, so the uint64_t chunks are host-endian.
>
> On a big-endian host, ptr+1 may be addressing element i-1 or i-7 instead
> of i+1.
Ah, I see the problem now, thanks for the explanation :)

> > I see, I don't think we can make this work for Hexagon vector ops.
> > As an example, consider V6_vadduwsat, which performs an unsigned
> > saturating add of 32-bit elements; currently we emit
> >
> > void emit_V6_vadduwsat(intptr_t vec2, intptr_t vec7, intptr_t vec6) {
> >     VectorMem mem = {0};
> >     intptr_t vec5 = temp_new_gvec(&mem, 256);
> >     tcg_gen_gvec_zext(MO_64, MO_32, vec5, vec7, 256, 128, 256);
> >
> >     intptr_t vec1 = temp_new_gvec(&mem, 256);
> >     tcg_gen_gvec_zext(MO_64, MO_32, vec1, vec6, 256, 128, 256);
> >
> >     tcg_gen_gvec_add(MO_64, vec1, vec1, vec5, 256, 256);
> >
> >     intptr_t vec3 = temp_new_gvec(&mem, 256);
> >     tcg_gen_gvec_dup_imm(MO_64, vec3, 256, 256, 4294967295ull);
> >
> >     tcg_gen_gvec_umin(MO_64, vec1, vec1, vec3, 256, 256);
> >
> >     tcg_gen_gvec_trunc(MO_32, MO_64, vec2, vec1, 128, 256, 128);
> > }
> >
> > so we really do rely on the size-changing property of zext here: the
> > input vectors are 128 bytes and we expand them to 256 bytes.  We could
> > expand vector operations within the instruction to the largest vector
> > size, but we would need to zext and trunc to destination and source
> > registers anyway.

> Yes, well, this is the output of llvm though, yes?

Yes

> Did you forget to describe TCG's native saturating operations to the
> compiler?  tcg_gen_gvec_usadd performs exactly this operation.
>
> And if you'd like to improve llvm, usadd(a, b) equals umin(a, ~b) + b.
> Fewer operations without having to change vector sizes.
> Similarly for unsigned saturating subtract: ussub(a, b) equals umax(a, b) - b.

In this case LLVM wasn't able to optimize the pattern to an llvm.uadd.sat
intrinsic; if it had, I believe we would have emitted tcg_gen_gvec_usadd.
We can manually optimize the above pattern to llvm.uadd.sat to avoid the
extra size changes.  This might be fixed in future LLVM versions, but
otherwise it seems like a reasonable change to push upstream.

The point is that we have a lot of Hexagon instructions where size changes
are probably unavoidable.  Another example is V6_vshuffh, which takes a
<16 x i16> vector and shuffles the upper <8 x i16> into the upper 16 bits
of an <8 x i32> vector:

void emit_V6_vshuffh(intptr_t vec3, intptr_t vec7) {
    VectorMem mem = {0};
    intptr_t vec2 = temp_new_gvec(&mem, 128);
    tcg_gen_gvec_zext(MO_32, MO_16, vec2, vec7, 128, 64, 128);

    intptr_t vec0 = temp_new_gvec(&mem, 128);
    tcg_gen_gvec_zext(MO_32, MO_16, vec0, (vec7 + 64ull), 128, 64, 128);

    intptr_t vec1 = temp_new_gvec(&mem, 128);
    tcg_gen_gvec_shli(MO_32, vec1, vec0, 16, 128, 128);

    tcg_gen_gvec_or(MO_32, vec3, vec1, vec2, 128, 128);
}

To avoid bloating the email too much with examples, you can see three more
here: https://pad.rev.ng/11IvAKhiRy2cPwC7MX9nXA

Maybe we rely on the target defining size-changing operations if they are
needed?

//Anton
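
P.S. To check that I've understood the column rule: a generic helper is
allowed to look like the sketch below, where every access covers a whole
uint64_t column.  The helper name and the operation are hypothetical, only
meant to illustrate the pattern I see in the existing
accel/tcg/tcg-runtime-gvec.c helpers:

    /*
     * Hypothetical column-strict helper: each iteration reads and writes
     * one full uint64_t column, so data is host-endian within the column
     * and nothing is assumed about element indexing across columns.
     */
    void HELPER(gvec_example_not)(void *d, void *a, uint32_t desc)
    {
        intptr_t oprsz = simd_oprsz(desc);
        intptr_t i;

        for (i = 0; i < oprsz; i += sizeof(uint64_t)) {
            *(uint64_t *)(d + i) = ~*(uint64_t *)(a + i);
        }
        clear_high(d, oprsz, desc);
    }

whereas DO_SZ_OP1 above indexes individual SRCTY/DSTTY elements across the
whole vector, which is exactly the cross-column indexing that cannot be
assumed.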