https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908
--- Comment #11 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 17 Jul 2024, mkretz at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114908
>
> --- Comment #10 from Matthias Kretz (Vir) <mkretz at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #9)
> > One issue with
> >
> > V load3(const unsigned long* ptr)
> > {
> >   V ret = {};
> >   __builtin_memcpy(&ret, ptr, 3 * sizeof(unsigned long));
> >
> > is that we cannot load a vector worth of data from ptr because that might
> > trap
>
> Unless the target has a masked load instruction (e.g. AVX512) or ptr is known
> to be aligned to at least 16 Bytes (in which case we know there cannot be a
> page boundary at ptr + 24 Bytes). No? In this specific example, ptr is
> pointing to a 32-Byte vector object.

Sure but here we have no alignment info available (at most 8 byte
alignment from the pointer type).  I don't think introducing a
.MASK_LOAD for the purpose of eliding a memcpy is a good thing to do
(locally, just taking into account the memcpy on its own).

> The library can do this and it makes a difference:
>
>   if (__builtin_object_size(ptr, 0) >= 4 * sizeof(T))
>     __builtin_memcpy(&ret, ptr, 4 * sizeof(T));
>   else
>     __builtin_memcpy(&ret, ptr, 3 * sizeof(T));

I see, but that's then of course after inlining.  In my former
C++ times I've used template metaprogramming to implement this
as an unrolled element-by-element copy (emitting a loop would be
possible as well, of course).
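
For illustration only, a minimal sketch of what an unrolled
element-by-element copy via template metaprogramming could look like.
This is not the code referred to above: the names load_n and
load_n_impl and the definition of V are assumptions made for the
example, and it assumes C++17 (fold expressions) plus the GCC/Clang
vector extension.

```cpp
#include <cstddef>
#include <utility>

// Assumed vector type for the example: 32 bytes of unsigned long
// (4 elements on LP64 targets), using the GCC/Clang vector extension.
typedef unsigned long V __attribute__((vector_size(32)));

// Hypothetical helper: copies exactly the elements named by Is...,
// one scalar assignment per index, with no loop and no memcpy.
template <std::size_t... Is>
inline V load_n_impl(const unsigned long* ptr, std::index_sequence<Is...>)
{
  V ret = {};  // elements that are not copied stay zero
  // Comma fold: expands to ret[0] = ptr[0], ret[1] = ptr[1], ...
  ((ret[Is] = ptr[Is]), ...);
  return ret;
}

// Hypothetical entry point: load the first N elements from ptr.
template <std::size_t N>
inline V load_n(const unsigned long* ptr)
{
  static_assert(N <= sizeof(V) / sizeof(unsigned long),
                "N exceeds the number of vector elements");
  return load_n_impl(ptr, std::make_index_sequence<N>{});
}

// Same signature as the load3 from the quoted example.
V load3(const unsigned long* ptr) { return load_n<3>(ptr); }
```

Only ptr[0] through ptr[2] are ever read here, so the sketch does not
touch memory past the third element; whether the compiler then
combines the scalar loads into something better is a separate
question.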