On Fri, Jul 9, 2021 at 1:54 PM Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> wrote: > > Kyrylo Tkachov <kyrylo.tkac...@arm.com> writes: > >> -----Original Message----- > >> From: Richard Sandiford <richard.sandif...@arm.com> > >> Sent: 09 July 2021 12:40 > >> To: Jonathan Wright <jonathan.wri...@arm.com> > >> Cc: gcc-patches@gcc.gnu.org; Kyrylo Tkachov <kyrylo.tkac...@arm.com> > >> Subject: Re: [PATCH] aarch64: Use unions for vector tables in vqtbl[234] > >> intrinsics > >> > >> Jonathan Wright <jonathan.wri...@arm.com> writes: > >> > Hi, > >> > > >> > As subject, this patch uses a union instead of constructing a new opaque > >> > vector structure for each of the vqtbl[234] Neon intrinsics in > >> > arm_neon.h. > >> > This simplifies the header file and also improves code generation - > >> > superfluous move instructions were emitted for every register > >> > extraction/set in this additional structure. > >> > > >> > This change is safe because the C-level vector structure types e.g. > >> > uint8x16x4_t already provide a tie for sequential register allocation > >> > - which is required by the TBL instructions. > >> > > >> > Regression tested and bootstrapped on aarch64-none-linux-gnu - no > >> > issues. > >> > > >> > Ok for master? > >> > >> Looks good, but I think we should have some tests to defend the > >> RA improvements. E.g. have things like: > >> > >> #include <arm_neon.h> > >> > >> … > >> > >> uint8x8_t > >> f2_u8 (uint8x16x2_t x, uint8x8_t y) > >> { > >> return vqtbl2_u8 (x, y); > >> } > >> > >> … > >> > >> and add a scan-assembler-not for moves. > >> > >> Union punning is UB for standard C++, but I think in practice we're > >> not going to be able to treat it as such for GCC. This would be > >> far from the only thing to rely on union punning for correctness. > > > > Could we use some reinterpret_cast or bit_cast (or the builtins the rely > > on) for C++? > > This may involve separate definitions for C and C++, which may not be worth > > it... No objections to the patch from me to be clear. > > The C++-correct way would be to do something like: > > __extension__ extern __inline uint8x8_t > __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) > vqtbl2_u8 (uint8x16x2_t __tab, uint8x8_t __idx) > { > __builtin_aarch64_simd_oi __o; > __builtin_memcpy (&__o, &__tab, sizeof (__tab)); > return (uint8x8_t)__builtin_aarch64_tbl3v8qi (__o, (int8x8_t)__idx); > } > > which does seem to produce the same code for the simple case above. > I've no idea how well it would work out in real code though. > Having to take the address of something is a bit unfortunate, > but maybe we fold that away early enough for it not to matter.
We should fold it very early (during gimplification or gimple lowering) in case the target is not strict-alignment or the copied memory is aligned which I think it is here since objects and not pointers are involved. > So the options would be: > > (1) keep it as-is > (2) keep it as-is for C and add the memcpy version for C++ > (3) use the memcpy version for C and C++ > > (3) is obviously better than (2) if we don't see any performance penalty. > > Thanks, > Richard