https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93005
--- Comment #3 from Joel Holdsworth <joel at airwebreathe dot org.uk> ---
Interesting. Comparing the implementation of _mm_store_si128 to vst1q_s32:

emmintrin.h

extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_store_si128 (__m128i *__P, __m128i __B)
{
  *__P = __B;
}

arm_neon.h

__extension__ extern __inline void
__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
vst1q_s32 (int32_t * __a, int32x4_t __b)
{
  __builtin_neon_vst1v4si ((__builtin_neon_si *) __a, __b);
}

So why is one implemented with a built-in, and the other with a pointer dereference?

Is there a way of making the optimizer see through __builtin_neon_vst1v4si at the GIMPLE level? Where would that code be implemented? Where is it implemented for other architectures?
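To make the contrast concrete, here is a minimal sketch of a four-lane store written as an ordinary memory write instead of a target built-in. It is not taken from either header: the names v4si_sketch, store_v4si_sketch and first_lane_after_store are hypothetical, and it assumes a compiler that supports the GCC vector_size extension. memcpy is used rather than a cast-and-dereference on the assumption that the destination is a plain int32_t * with no 16-byte alignment guarantee.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 4 x int32 vector type built on the GCC vector extension. */
typedef int32_t v4si_sketch __attribute__ ((vector_size (16)));

/* Store all four lanes as a plain memory write.  Nothing here is opaque
   to the optimizer, unlike a call to a target-specific built-in.  */
static inline void
store_v4si_sketch (int32_t *dst, v4si_sketch v)
{
  assert (dst != NULL);
  memcpy (dst, &v, sizeof v);
}

/* Store to a local buffer and read one element back.  Because the store
   is ordinary memory traffic, the compiler is free to forward the value
   and drop the temporary.  */
static inline int32_t
first_lane_after_store (v4si_sketch v)
{
  int32_t tmp[4];
  store_v4si_sketch (tmp, v);
  return tmp[0];
}
```

With the store expressed this way, a round trip through a temporary buffer like the one in first_lane_after_store would normally be expected to fold away at -O2, which is the behaviour being asked about for the built-in form.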