https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110780

            Bug ID: 110780
           Summary: aarch64 NEON redundant displaced ld3
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nate at thatsmathematics dot com
  Target Milestone: ---

Compile the following with gcc 14.0.0 20230723 on aarch64 with -O3:

#include <stdint.h>
void CSI2toBE12(uint8_t* pCSI2, uint8_t* pBE, uint8_t* pCSI2LineEnd)
{
    while (pCSI2 < pCSI2LineEnd) {
        pBE[0] = pCSI2[0];
        pBE[1] = ((pCSI2[2] & 0xf) << 4) | (pCSI2[1] >> 4);
        pBE[2] = ((pCSI2[1] & 0xf) << 4) | (pCSI2[2] >> 4);
        pCSI2 += 3;
        pBE += 3;
    }
}

Godbolt: https://godbolt.org/z/WshTPKzY5

In the inner loop (.L5 of the godbolt asm) we have

        ld3     {v25.16b - v27.16b}, [x3]
        add     x6, x3, 1
        // no intervening stores
        ld3     {v25.16b - v27.16b}, [x6]

The second load is redundant.  v25, v26 are the same as what was already in
v26, v27 respectively.  The value loaded into v27 is new but it is not used in
the subsequent code.

This might also account for some extra later complexity, because it means that
the last 48 bytes of the input can't be handled by this loop (or else the
second load would be out of bounds by one byte) and so must be handled
specially.

Reply via email to