[Bug tree-optimization/114908] fails to optimize avx2 in-register permute written with std::experimental::simd

rguenther at suse dot de via Gcc-bugs Wed, 17 Jul 2024 02:08:34 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908

--- Comment #11 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 17 Jul 2024, mkretz at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908
> 
> --- Comment #10 from Matthias Kretz (Vir) <mkretz at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #9)
> > One issue with
> > 
> > V load3(const unsigned long* ptr)
> > {
> >   V ret = {};
> >   __builtin_memcpy(&ret, ptr, 3 * sizeof(unsigned long));
> > 
> > is that we cannot load a vector's worth of data from ptr because that
> > might trap
> 
> Unless the target has a masked load instruction (e.g. AVX512) or ptr is known
> to be aligned to at least 16 Bytes (in which case we know there cannot be a
> page boundary at ptr + 24 Bytes). No? In this specific example, ptr is
> pointing to a 32-Byte vector object.

Sure, but here we have no alignment info available (at most 8-byte
alignment from the pointer type).  I don't think introducing a .MASK_LOAD
for the purpose of eliding a memcpy is a good thing to do (locally,
just taking into account the memcpy on its own).
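
The page-boundary argument can be checked with a little arithmetic.  A
sketch, assuming 4096-byte pages (any page size that is a multiple of 16
behaves the same way); the names below are hypothetical and only model
addresses, they do not touch memory:

```cpp
#include <cstdint>

// Assumption: 4 KiB pages.
constexpr std::uintptr_t page_size = 4096;

// The memcpy guarantees bytes [addr, addr + 24) are readable.  A full
// 32-byte vector load also touches bytes [addr + 24, addr + 32).  The
// wider load is safe if its last byte lies on the same page as the last
// byte that is already known to be readable.
constexpr bool wide_load_stays_on_known_page(std::uintptr_t addr)
{
  return (addr + 23) / page_size == (addr + 31) / page_size;
}
```

For a 16-byte-aligned addr, bytes 16..31 share one aligned 16-byte block,
so the check always holds.  With only 8-byte alignment it can fail, e.g.
for addr = 4072, where byte 23 is the last byte of its page.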

> The library can do this and it makes a difference:
> 
>     if (__builtin_object_size(ptr, 0) >= 4 * sizeof(T))
>       __builtin_memcpy(&ret, ptr, 4 * sizeof(T));
>     else
>       __builtin_memcpy(&ret, ptr, 3 * sizeof(T));

I see, but that of course only helps after inlining.
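
For illustration, a minimal sketch of such a dispatch.  V is assumed to
be a 4-lane GCC vector type and load3_checked is a hypothetical name.
The extra comparison against (size_t)-1 is an addition here: with type 0,
__builtin_object_size returns (size_t)-1 when the size is unknown, and
that case must fall back to the 3-element copy.

```cpp
#include <cstddef>

// Assumed stand-in for V: four unsigned long lanes (GCC/Clang extension).
typedef unsigned long V __attribute__((vector_size(4 * sizeof(unsigned long))));

inline V load3_checked(const unsigned long* ptr)
{
  V ret = {};
  // Resolved at compile time; (size_t)-1 means "size not known".
  const std::size_t known = __builtin_object_size(ptr, 0);
  if (known != static_cast<std::size_t>(-1)
      && known >= 4 * sizeof(unsigned long))
    __builtin_memcpy(&ret, ptr, 4 * sizeof(unsigned long)); // whole vector
  else
    __builtin_memcpy(&ret, ptr, 3 * sizeof(unsigned long)); // only 3 lanes
  return ret;
}
```

Unless the call is inlined into a context where the object behind ptr is
visible, the size is unknown and only the 3-element branch remains.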

In my former C++ days I used template metaprogramming to implement
this as an unrolled element-by-element copy (emitting a loop would
be possible as well, of course).
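
Roughly along these lines; a sketch only, not the original code.  It
assumes C++17 (fold expressions) and a 4-lane GCC vector type for V, and
load_n / load_n_impl are hypothetical names:

```cpp
#include <cstddef>
#include <utility>

// Assumed stand-in for V: four unsigned long lanes (GCC/Clang extension).
typedef unsigned long V __attribute__((vector_size(4 * sizeof(unsigned long))));

template <std::size_t... Is>
inline V load_n_impl(const unsigned long* ptr, std::index_sequence<Is...>)
{
  V ret = {};
  // Comma fold: expands to ret[0] = ptr[0], ret[1] = ptr[1], ...
  // so the copy is unrolled at compile time and never reads past ptr[N-1].
  ((ret[Is] = ptr[Is]), ...);
  return ret;
}

// Copies the first N elements from ptr; remaining lanes stay zero.
template <std::size_t N>
inline V load_n(const unsigned long* ptr)
{
  static_assert(N <= 4, "V has only four lanes");
  return load_n_impl(ptr, std::make_index_sequence<N>{});
}
```

load_n<3>(ptr) then corresponds to the load3 from the quoted example,
expressed as three scalar element copies instead of a memcpy.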
