https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113025
--- Comment #4 from Xi Ruoyao <xry111 at gcc dot gnu.org> ---
(In reply to Xi Ruoyao from comment #3)
> (In reply to juki from comment #2)
> > Unfortunately alignment of the cast type was not causing this issue.
> >
> > I changed all calls that were defined in GCC headers to use __m128i_u or
> > __m128d_u types to use those types before unaligned intrinsic.
> >
> > For example LOAD_SI128 macro looks like the following:
> >
> > #define LOAD_SI128(ptr) \
> >         ( ((uintptr_t)(ptr) & 15) == 0 ) ? _mm_load_si128((__m128i*)(ptr)) :
> > _mm_loadu_si128((__m128i_u*)(ptr))
>
> This won't work if ptr is a __m128i *.  It is allowed to optimize
> (uintptr_t)(__m128i *)foo % 15 to 0 because the standard says (__m128i *)foo

I mean % 16, not % 15.

> invokes undefined behavior when foo is a pointer not aligned to 16-byte
> boundary (C23 section 6.3.2.3p6).
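
To illustrate the point about the alignment check, here is a minimal portable sketch. It is not code from the reporter's project: vec128, is_aligned_16 and load_vec128 are hypothetical names, with vec128 standing in for a 16-byte-aligned type such as __m128i. The idea is to test alignment while the pointer is still a const void *, so that no conversion to an over-aligned pointer type has happened yet, and to do the load itself with memcpy, which assumes no alignment.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 16-byte-aligned vector type such as __m128i. */
typedef struct {
    _Alignas(16) unsigned char bytes[16];
} vec128;

/* Check alignment on the untyped pointer. Because p has never been
 * converted to vec128 *, the compiler has no basis for assuming the
 * result is always true. */
static int is_aligned_16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* Load 16 bytes from an address of any alignment. memcpy makes no
 * alignment assumption about its source. */
static vec128 load_vec128(const void *p)
{
    vec128 v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}
```

With the real intrinsics the same approach would presumably mean keeping the argument as a void pointer up to the point of the check and only then casting it for the chosen load.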