https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110780
Bug ID: 110780
Summary: aarch64 NEON redundant displaced ld3
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: nate at thatsmathematics dot com
Target Milestone: ---

Compile the following with gcc 14.0.0 20230723 on aarch64 with -O3:

#include <stdint.h>

void CSI2toBE12(uint8_t* pCSI2, uint8_t* pBE, uint8_t* pCSI2LineEnd)
{
    while (pCSI2 < pCSI2LineEnd) {
        pBE[0] = pCSI2[0];
        pBE[1] = ((pCSI2[2] & 0xf) << 4) | (pCSI2[1] >> 4);
        pBE[2] = ((pCSI2[1] & 0xf) << 4) | (pCSI2[2] >> 4);
        pCSI2 += 3;
        pBE += 3;
    }
}

Godbolt: https://godbolt.org/z/WshTPKzY5

In the inner loop (.L5 of the godbolt asm) we have

        ld3     {v25.16b - v27.16b}, [x3]
        add     x6, x3, 1
        // no intervening stores
        ld3     {v25.16b - v27.16b}, [x6]

The second load is redundant. v25, v26 are the same as what was already in
v26, v27 respectively. The value loaded into v27 is new but it is not used in
the subsequent code.

This might also account for some extra later complexity, because it means that
the last 48 bytes of the input can't be handled by this loop (or else the
second load would be out of bounds by one byte) and so must be handled
specially.
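
The redundancy claim can be checked with a small scalar model of the load. The sketch below is illustrative only and is not taken from the generated code: model_ld3, displaced_load_is_redundant and displaced_load_last_offset are hypothetical helper names, and the model assumes only the documented de-interleaving behaviour of ld3 on three 16-byte registers (register k receives bytes k, k+3, k+6, ... from the base address).

```c
#include <stdint.h>
#include <string.h>

enum { LANES = 16 };  /* bytes per 128-bit vector register (.16b) */

/* Hypothetical scalar model of "ld3 {va.16b - vc.16b}, [p]":
 * reads 48 consecutive bytes and de-interleaves them into three registers. */
static void model_ld3(const uint8_t *p,
                      uint8_t a[LANES], uint8_t b[LANES], uint8_t c[LANES])
{
    for (int i = 0; i < LANES; i++) {
        a[i] = p[3 * i + 0];
        b[i] = p[3 * i + 1];
        c[i] = p[3 * i + 2];
    }
}

/* Returns 1 if the first two registers of the load at p + 1 equal the
 * last two registers of the load at p, as the report states. */
static int displaced_load_is_redundant(void)
{
    uint8_t buf[3 * LANES + 1];  /* 49 bytes: the displaced load needs one extra */
    for (int i = 0; i < (int)sizeof buf; i++)
        buf[i] = (uint8_t)(i * 7 + 3);  /* distinct values, so matches are meaningful */

    uint8_t a0[LANES], b0[LANES], c0[LANES];  /* first load:  [x3]     */
    uint8_t a1[LANES], b1[LANES], c1[LANES];  /* second load: [x3 + 1] */
    model_ld3(buf, a0, b0, c0);
    model_ld3(buf + 1, a1, b1, c1);
    (void)a0;
    (void)c1;  /* the only new data; unused by the loop per the report */

    return memcmp(a1, b0, LANES) == 0 && memcmp(b1, c0, LANES) == 0;
}

/* Highest byte offset, relative to the first load's base, that the
 * displaced load touches. The first load stops at offset 47. */
static int displaced_load_last_offset(void)
{
    return 1 + 3 * (LANES - 1) + 2;
}
```

In this model the displaced load touches offset 48, one byte past the 48-byte block covered by the first load, which is consistent with the out-of-bounds concern for the final iteration.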