Hi there.  This is a follow-up to my email of the 24th of May.

The short version is: how can I track down why GCC chooses one of two
alternatives when implementing a function?  In a memcpy()-style loop
where Pmode == SImode, I get a near-ideal implementation.  If Pmode ==
PSImode (due to limitations of the pointer registers) I get something
much worse.

The difference appears early on.  In the .128r.expand dump with
Pmode == SImode I get:
;; MEM[base: to] = MEM[base: p];

With PSImode I get offset addressing instead:
;; MEM[base: pto + ivtmp.25] = MEM[base: pfrom + ivtmp.25];

This flows through into the actual code.
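
To make the difference concrete, here is a rough C-level sketch of the
two loop shapes as I read them from the dumps.  The function names and
the ivtmp variable are made up for illustration; this is only my
interpretation of the GIMPLE, not something the compiler emits.

```c
#include <assert.h>
#include <stddef.h>

/* Pointer-bumping form, roughly MEM[base: to] = MEM[base: p].
   Both pointers are the induction variables.  */
void copy_bump(const int *pfrom, int *pto, int count)
{
  while (count != 0)
    {
      *pto = *pfrom;
      pto++;
      pfrom++;
      count--;
    }
}

/* Base + offset form, roughly
   MEM[base: pto + ivtmp] = MEM[base: pfrom + ivtmp].
   The bases stay fixed and one byte offset (hypothetical name ivtmp)
   is the induction variable, so both addresses are recomputed on
   every iteration.  */
void copy_indexed(const int *pfrom, int *pto, int count)
{
  size_t ivtmp = 0;

  while (count != 0)
    {
      *(int *) ((char *) pto + ivtmp)
        = *(const int *) ((const char *) pfrom + ivtmp);
      ivtmp += sizeof (int);
      count--;
    }
}
```

Both functions copy the same data; the second one just needs an extra
add per pointer per iteration, which is what hurts on this machine.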

My guess is that GCC treats PSImode as different from SImode and that
the cost of converting between the two is enough to make offset
addressing look cheaper overall.

The m32c port is the only other one using PSImode, but it doesn't
generate offset addresses.  The same thing happens with and without
a basic TARGET_ADDRESS_COST and TARGET_RTX_COSTS.

I guess I want a way of telling the compiler that PSImode and SImode
are equivalent.

The longer version is:
The machine I'm working on has two special registers for memory access
that are backed by caches.  Any change to these registers can cause an
expensive cache load cycle, so while they're great for memory access
they're terrible for general use.

The problem is that with Pmode == SImode the register allocator will
now and again use these registers for general operations.  I've
implemented a partial integer mode, PSImode, as suggested by Michael
Meissner, and set Pmode to PSImode.  This correctly separates things,
but the compiler now generates significantly worse code.

The example is a simple memcpy():

void copy(int *pfrom, int *pto, int count)
{
  while (count != 0)
    {
      *pto = *pfrom;
      pto++;
      pfrom++;
      count--;
    }
}

If I have #define Pmode SImode then I get the near-best code:
copy:
        LOADACC, R12    ;# 133  loadaccsi_insn/1
        STOREACC, R13   ;# 134  storeaccsi_insn
        LOADLONG, #0    ;# 139  loadaccsi_insn/2
        XOR, R13        ;# 140  cmpccsi_insn/3
        LOADLONG, #.L4  ;# 43   *bCCeq
        SKIP_IF
        STOREACC, PC
        LOADACC, R11    ;# 121  loadaccsi_insn/1
        STOREACC, Y     ;# 122  storeaccsi_insn
        LOADACC, R10    ;# 127  loadaccsi_insn/1
        STOREACC, X     ;# 128  storeaccsi_insn
.L3:
        LOADACC, (X)    ;# 79   loadaccsi_insn/1
        STOREACC, (Y)   ;# 86   storeaccsi_insn
        LOADLONG, #4    ;# 149  loadaccsi_insn/2
        ADD, Y  ;# 150  addsi3_acc
        ADD, X  ;# 151  addsi3_acc
        LOADLONG, #-1   ;# 103  loadaccsi_insn/2
        ADD, R12        ;# 104  addsi3_acc
        LOADACC, R12    ;# 109  loadaccsi_insn/1
        STOREACC, R10   ;# 110  storeaccsi_insn
        LOADLONG, #0    ;# 115  loadaccsi_insn/2
        XOR, R10        ;# 116  cmpccsi_insn/3
        LOADLONG, #.L3  ;# 57   *bCCne
        STOREACC, PC_IF
.L4:
        POP     ;# 147  *expanded_return
        STOREACC, PC

Note the good
        LOADACC, (X)    ;# 79   loadaccsi_insn/1
        STOREACC, (Y)   ;# 86   storeaccsi_insn
        LOADLONG, #4    ;# 149  loadaccsi_insn/2
        ADD, Y  ;# 150  addsi3_acc
        ADD, X  ;# 151  addsi3_acc

in the middle.

If I instead have #define Pmode PSImode I get:
copy:
        LOADACC, R14    ;# 186  loadaccsi_insn/1
        PUSH    ;# 187  pushsi_acc
        LOADACC, R12    ;# 163  loadaccsi_insn/1
        STOREACC, R13   ;# 164  storeaccsi_insn
        LOADLONG, #0    ;# 169  loadaccsi_insn/2
        XOR, R13        ;# 170  cmpccsi_insn/3
        LOADLONG, #.L4  ;# 43   *bCCeq
        SKIP_IF
        STOREACC, PC
        LOADLONG, #0    ;# 157  loadaccsi_insn/2
        STOREACC, R13   ;# 158  storeaccsi_insn
.L3:
        LOADACC, R13    ;# 85   loadaccsi_insn/1
        STOREACC, X     ;# 86   storeaccsi_insn
        ; No-op truncate on X = X       ;# 47   truncsipsi2/1
        LOADACC, R11    ;# 91   loadaccpsi_insn/1
        STOREACC, Y     ;# 92   storeaccpsi_insn
        LOADACC, X      ;# 97   loadaccpsi_insn/1
        ADD, Y  ;# 98   addpsi3_acc
        LOADACC, R10    ;# 103  loadaccpsi_insn/1
        STOREACC, R14   ;# 104  storeaccpsi_insn
        LOADACC, X      ;# 109  loadaccpsi_insn/1
        ADD, R14        ;# 110  addpsi3_acc
        LOADACC, R14    ;# 115  loadaccpsi_insn/1
        STOREACC, X     ;# 116  storeaccpsi_insn
        LOADACC, (X)    ;# 121  loadaccsi_insn/1
        STOREACC, (Y)   ;# 128  storeaccsi_insn
        LOADLONG, #-1   ;# 133  loadaccsi_insn/2
        ADD, R12        ;# 134  addsi3_acc
        LOADLONG, #4    ;# 139  loadaccsi_insn/2
        ADD, R13        ;# 140  addsi3_acc
        LOADACC, R12    ;# 145  loadaccsi_insn/1
        STOREACC, X     ;# 146  storeaccsi_insn
        LOADLONG, #0    ;# 151  loadaccsi_insn/2
        XOR, X  ;# 152  cmpccsi_insn/3
        LOADLONG, #.L3  ;# 59   *bCCne
        STOREACC, PC_IF
.L4:
        POP     ;# 178  popsi_insn
        STOREACC, R14
        POP     ;# 179  *expanded_return
        STOREACC, PC

This is equivalent to:
 R13 = 0
L:
 X = R13
 X = truncate(X)
 Y = R11
 Y += X
 R14 = R10
 R14 += X
 X = R14
 (Y) = (X)
 R12 -= 1
 R13 += 4
 X = R12
 CMP X, 0
 BCCNE L

Thank you for your time,

-- Michael
