https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89057
--- Comment #4 from Allan Jensen <linux at carewolf dot com> ---
While that change might have made things worse, the real problem is probably that the registers for those instructions are loaded and stored using intrinsics, so proper register allocation and combining can't be performed. On ARMv7, for instance, the same code can be optimized to have no moves, just a single vswp instruction between the ld3 and st4. MSVC and clang can do that, but GCC cannot.
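
For illustration, here is a minimal sketch of the kind of code in question: a de-interleaving 3-channel load, a channel swap, and an interleaving 4-channel store done through NEON intrinsics. The function name rgb_to_bgra, its signature, and the constant alpha channel are assumptions made for this example and are not taken from the report. The scalar loop handles the tail and keeps the sketch buildable on targets without NEON.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (illustrative only, not from the bug report):
   convert n packed RGB pixels to BGRA with alpha fixed at 0xff. */
void rgb_to_bgra(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Process 8 pixels per iteration. */
    for (; i + 8 <= n; i += 8) {
        /* De-interleaving load: val[0]=R, val[1]=G, val[2]=B. */
        uint8x8x3_t in = vld3_u8(src + 3 * i);
        uint8x8x4_t out;

        /* Swapping the first and third channels is the only data
           movement needed between the load and the store. */
        out.val[0] = in.val[2];
        out.val[1] = in.val[1];
        out.val[2] = in.val[0];
        out.val[3] = vdup_n_u8(0xff);

        /* Interleaving store of the four channels. */
        vst4_u8(dst + 4 * i, out);
    }
#endif

    /* Scalar tail, and the whole loop on non-NEON targets. */
    for (; i < n; i++) {
        dst[4 * i + 0] = src[3 * i + 2];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 0];
        dst[4 * i + 3] = 0xff;
    }
}
```

Ideally the three loaded registers would feed the store directly, with at most a swap of two of them; whether a compiler achieves that depends on how it allocates the multi-register tuples behind the intrinsics.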