http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46453

           Summary: MIPS backend is not using special instructions for
                    __builtin_bswap32
           Product: gcc
           Version: 4.5.1
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: target
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: n...@chello.at
              Host: i686
            Target: mips-elf
             Build: ../configure --disable-libssp --prefix=/usr/local
                    --target=mips-elf


MIPS32 Release 2 introduced a special instruction called wsbh that can be used
for 32- and 16-bit byteswaps. However, GCC never produces this instruction.

All code snippets are produced with "-march=mips32r2 -O3" (which should enable
the wsbh, rotr and ins instructions and optimized code using them). The
assembly assumes a big-endian target.

*  __builtin_bswap32:
  e.g. "v0 = __builtin_bswap32(a0);" should result in:

  wsbh v0, a0
  rotr v0, v0, 16
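
To illustrate why this two-instruction sequence is a full 32-bit byteswap, here is a small C model of it. The helper names (wsbh_model, rotr_model, bswap32_model) are made up for this sketch and are not GCC or MIPS identifiers; the models only mirror the documented effect of wsbh (swap the bytes within each halfword) and rotr (rotate right).

```c
#include <stdint.h>

/* Hypothetical model of wsbh: swap the two bytes inside each 16-bit halfword. */
static uint32_t wsbh_model(uint32_t x)
{
    return ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);
}

/* Hypothetical model of rotr: rotate right by n bits, assuming 0 < n < 32. */
static uint32_t rotr_model(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32u - n));
}

/* wsbh followed by rotr 16, i.e. the sequence requested above. */
uint32_t bswap32_model(uint32_t x)
{
    return rotr_model(wsbh_model(x), 16);
}
```

For the input 0x11223344, wsbh_model gives 0x22114433, and rotating that by 16 gives 0x44332211, which is the byteswapped value.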


*  16-bit byteswaps:
  Similarly, "v0 = ((a0 >> 8) | (a0 << 8));" (with a0 and v0 being 16-bit
unsigned integers) should result in:

  wsbh v0, a0

As it is now, __builtin_bswap32 always results in a function call (already at
least 2 instructions) to an implementation that uses 9 instructions.
A 16-bit byteswap results in 4 instructions. So this would be a nice saving,
especially in the 32-bit case.
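
For reference, the 16-bit idiom quoted above can be written as a standalone C function like the one below. The function name bswap16_idiom is just a placeholder for this sketch; it shows the source pattern the backend would need to recognize, not anything GCC currently provides.

```c
#include <stdint.h>

/* The shift-and-or idiom for a 16-bit byteswap.
 * Integer promotion widens x to int, so the result is cast back to 16 bits. */
uint16_t bswap16_idiom(uint16_t x)
{
    return (uint16_t)((x >> 8) | (x << 8));
}
```

If the argument register is zero-extended, a single wsbh leaves the upper halfword zero and the swapped value in the lower halfword, so no extra masking should be needed.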

*  Unaligned loads:
  Less importantly, unaligned 16-bit loads could be optimized a bit as well if
the ins instruction is available:

-- Code sample (unaligned 16-bit load):

#pragma pack(push,1)
    union Unaligned {
        unsigned char c[2];
        unsigned short u16;
    };
#pragma pack(pop)

unsigned short readUnaligned16(const void *ptr) {
    return ((const union Unaligned *)(ptr))->u16;
}

-- Code sample

results in this sequence:

# a0 = ptr, v0 = return value
lbu    v0,0(a0)
lbu    v1,1(a0)
sll    v0,v0,0x8
or    v0,v1,v0


better would be:

# registers swapped so that the result ends up in v0, the return register
lbu    v1,0(a0)
lbu    v0,1(a0)
ins    v0,v1,8,8


Generating this sequence for unaligned 16-bit loads would be a nice start. But
there could be generic optimizations where sequences of left shift and or are
replaced with ins instructions, as long as it can be verified that the
registers have enough explicitly zeroed bits so they don't "overlap".
Similarly, right shift and masking could be replaced by ext instructions, e.g.
v0 = ((a0 >> 8) & 0xFF) is equivalent to ext v0,a0,8,8.
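
The equivalences claimed here can be checked with a small C model of the two instructions. This is only a sketch: ext_model, ins_model, combine_be16 and second_byte are hypothetical names invented for illustration, and the models assume 0 < size and pos + size <= 32.

```c
#include <stdint.h>

/* Hypothetical model of "ext rt,rs,pos,size":
 * extract size bits of rs starting at bit pos, zero-extended. */
static uint32_t ext_model(uint32_t rs, unsigned pos, unsigned size)
{
    uint32_t mask = (size == 32u) ? 0xFFFFFFFFu : ((1u << size) - 1u);
    return (rs >> pos) & mask;
}

/* Hypothetical model of "ins rt,rs,pos,size":
 * replace bits pos..pos+size-1 of rt with the low size bits of rs. */
static uint32_t ins_model(uint32_t rt, uint32_t rs, unsigned pos, unsigned size)
{
    uint32_t mask = (size == 32u) ? 0xFFFFFFFFu : ((1u << size) - 1u);
    return (rt & ~(mask << pos)) | ((rs & mask) << pos);
}

/* Two zero-extended byte loads merged by one ins with pos 8, size 8.
 * byte0 is the byte at the lower address (the high byte on big-endian). */
uint32_t combine_be16(uint8_t byte0, uint8_t byte1)
{
    return ins_model(byte1, byte0, 8, 8);
}

/* Same value as ((a0 >> 8) & 0xFF), expressed as a single ext. */
uint32_t second_byte(uint32_t a0)
{
    return ext_model(a0, 8, 8);
}
```

Because both bytes come from lbu, bits 8 and up of each register are already zero, which is exactly the "no overlap" condition that makes the shift-and-or form and the ins form produce the same value.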
