On Fri, Jan 17, 2025 at 3:40 AM Sam Russell <sam.h.russ...@gmail.com> wrote:
>
> We discussed this previously, we decided since AVX1 supports unaligned
> accesses we could not do an alignment check at the start of the function, but
> as you've discovered, this memcpy issue creates undefined behaviour.

I don't believe the memcpy is causing the problem. I believe it is
what Bruno or Paul showed:

    const __m128i *data = buf;

> Most performant would probably be an alignment check at the start and then
> manually processing the first N bytes. Another option could be to simply cast
> data to unsigned char* and then we can guarantee the compiler doesn't hit
> alignment issues?

Change:

    const __m128i *data = buf;

To this so the compiler cannot pick between MOVDQA and MOVDQU:

    const __m128i data = _mm_loadu_si128(buf);

> What are people's preferences here?

Jeff

> On Fri, 17 Jan 2025 at 08:11, Paul Eggert <egg...@cs.ucla.edu> wrote:
>>
>> On 2025-01-16 21:25, Jeffrey Walton wrote:
>> > On Fri, Jan 17, 2025 at 12:07 AM Bruno Haible via Gnulib discussion
>> > list <bug-gnulib@gnu.org> wrote:
>> >> Yes, the undefined behaviour really starts here, in line 35:
>> >>
>> >>    const __m128i *data = buf;
>> >>
>> >> 'buf' was not aligned, 'const __m128i *' is 16-byte aligned.
>> >
>> > Disassemble the code around that line. See which asm instruction is
>> > being used for the load. I suspect MOVDQA (aligned) is being used
>> > instead of MOVDQU (unaligned).
>>
>> The compiler is entitled to do that. Bruno's right, the behavior is
>> undefined once the code assigns the unaligned pointer to an __m128i *
>> variable; see C23 §6.3.2.3 ¶7. Since behavior is undefined, the compiler
>> can do whatever it likes.
>>
>> I installed the attached patch to work around the immediate issue of the
>> undefined behavior. This skips the pclmul speedup if the buffer is not
>> properly aligned. If that is a significant performance issue in
>> Gnulib-using code, I hope Sam or somebody can come up with a
>> higher-performance fix.
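
For reference, two of the options raised in this thread (an alignment
check before taking the fast path, or reading through an unsigned char
pointer so that no misaligned __m128i * is ever formed) can be sketched
roughly as follows. This is only an illustration under stated assumptions,
not the Gnulib code: is_aligned_16 and xor_blocks are hypothetical names,
the XOR loop is a stand-in for the real pclmul loop, and the load is done
with memcpy into a local __m128i rather than with _mm_loadu_si128.
Compilers commonly lower a fixed 16-byte memcpy to a single unaligned
load, but that is an expectation, not a guarantee.

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_setzero_si128, _mm_xor_si128 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero if P is 16-byte aligned, i.e. it could
   be stored in a '__m128i *' without the alignment problem.  */
static int
is_aligned_16 (const void *p)
{
  return (uintptr_t) p % 16 == 0;
}

/* Hypothetical stand-in for the real loop: XOR every complete 16-byte
   block of BUF into OUT.  BUF may have any alignment.  */
static void
xor_blocks (const void *buf, size_t len, unsigned char out[16])
{
  /* A byte pointer has no alignment requirement, so this assignment
     is fine for any BUF.  */
  const unsigned char *p = buf;
  __m128i acc = _mm_setzero_si128 ();

  for (; len >= 16; p += 16, len -= 16)
    {
      __m128i block;
      /* Copy 16 bytes into a properly aligned local; no '__m128i *'
         pointing at possibly misaligned storage is ever created.  */
      memcpy (&block, p, sizeof block);
      acc = _mm_xor_si128 (acc, block);
    }

  memcpy (out, &acc, sizeof acc);
}
```

A caller wanting the alignment-check variant would test
is_aligned_16 (buf) first and fall back to a scalar path (or process the
leading bytes by hand) when it returns zero.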