Am 30.08.24 um 15:31 schrieb Richard Biener:
On Fri, Aug 30, 2024 at 3:28 PM Georg-Johann Lay <a...@gjlay.de> wrote:

Am 30.08.24 um 14:46 schrieb Richard Biener:
On Fri, Aug 30, 2024 at 2:10 PM Georg-Johann Lay <a...@gjlay.de> wrote:

There are cases, where opportunities to use POST_INC addressing
only occur very late in the compilation process.  Take for example
the following function from AVR-LibC's qsort:

void swapfunc (char *a, char *b, int n)
{
       do
       {
           char t = *a;
           *a++ = *b;
           *b++ = t;
       } while (--n > 0);
}


which  -mmcu=avrtiny -S -Os -dp  compiles to:

swapfunc:
          push r28                 ;  72  [c=4 l=1]  pushqi1/0
          push r29                 ;  73  [c=4 l=1]  pushqi1/0
/* prologue: function */
/* frame size = 0 */
/* stack size = 2 */
          mov r26,r24      ;  66  [c=4 l=1]  movqi_insn/0
          mov r27,r25      ;  67  [c=4 l=1]  movqi_insn/0
          mov r30,r22      ;  68  [c=4 l=1]  movqi_insn/0
          mov r31,r23      ;  69  [c=4 l=1]  movqi_insn/0
          mov r22,r20      ;  70  [c=4 l=1]  movqi_insn/0
          mov r23,r21      ;  71  [c=4 l=1]  movqi_insn/0
.L2:
          ld r20,X                 ;  55  [c=4 l=1]  movqi_insn/3
          ld r21,Z                 ;  57  [c=4 l=1]  movqi_insn/3
          st X,r21                 ;  58  [c=4 l=1]  movqi_insn/2
          subi r26,-1      ;  59  [c=4 l=2]  *addhi3_clobber/0
          sbci r27,-1
          st Z,r20                 ;  61  [c=4 l=1]  movqi_insn/2
          subi r30,-1      ;  62  [c=4 l=2]  *addhi3_clobber/0
          sbci r31,-1
          subi r22, 1      ;  81  [c=4 l=2]  *add.for.cczn.hi/0
          sbci r23, 0
          breq .+2         ;  82  [c=8 l=2]  branch_ZN
          brpl .L2
/* epilogue start */
          pop r29          ;  76  [c=4 l=1]  popqi
          pop r28          ;  77  [c=4 l=1]  popqi
          ret              ;  78  [c=0 l=1]  return_from_epilogue

Insn 56+57 and insns 61+62 are post-inc stores.  They are not recognized
because the code prior to cprop_hardreg is a bit of a mess that moves
addresses back and forth (including Y).  Only after cprop_hardreg the
code is simple enough so post-inc can be detected by avr-fuse-add.

Hence this patch runs a 2nd instance of that pass late after
cprop_hardreg (the 1st instance runs prior to RTL peephole).

It renames avr_split_tiny_move to avr_split_fake_addressing_move
because that function also splits some insns on non-avrtiny.

The patch removes a define_split that's no more needed because
such splits are performed by avr_split_fake_addressing_move.

Passed without new regressions.  Ok for trunk?

Johann

p.s. post-inc etc. optimizations is basically non-existent in GCC.
One reasons seems to be SSA because each SSA variable gets its own
register, which results in a jungle of required registers which
even pass auto-inc-dec cannot penetrate...

Take for example the following code, that would require 12 instructions
as indicated by the comments:

void add4 (uint8_t *aa, const __flash uint8_t *bb, uint8_t nn)
{
       // Set Z (R30) to bb (1 MOVW or 2 MOVs)
       // Set X (R26) to aa (1 MOVW or 2 MOVs)
       do
       {
           uint8_t sum = 0;
           sum += *bb++; // 1 instruction: POST_INC load
           sum += *bb++; // 2 instructions: POST_INC load + add
           sum += *bb++; // 2 instructions: POST_INC load + add
           sum += *bb++; // 2 instructions: POST_INC load + add
           *aa++ = sum;  // 1 instruction: POST_INC store
       } while (--nn);   // 2 instructions: dec + branch
}

I think the reason here is weird behavior with __flash and how IVOPTS
tries to enable auto-incdec:

    _30 = bb_2 + 1;
    _10 = MEM[(const <address-space-1> uint8_t *)_30];
    _29 = bb_2 + 2;
    _12 = MEM[(const <address-space-1> uint8_t *)_29];
    _19 = _10 + _12;
    sum_13 = _19 + _9;
    bb_14 = bb_2 + 4;
    _28 = bb_14 + 65535;
    _15 = MEM[(const <address-space-1> uint8_t *)_28];

without __flash you get

    _10 = MEM[(const uint8_t *)_27 + 1B];
    sum_11 = _9 + _10;
    _12 = MEM[(const uint8_t *)_27 + 2B];
    sum_13 = sum_11 + _12;
    _15 = MEM[(const uint8_t *)_27 + 3B];

which of ocurse is a problem for pre/post-inc addressing as well
but one that I think is solved (well, you can hope...).

Can you open a bugreport with this testcase?

https://gcc.gnu.org/PR116542

Thanks.

Without __flash, the code uses Z, Z+1, Z+2 and Z+3 as addresses,
but still uses one register for access and one register to increment
the address.

Then there are these crazy computation like:

start_address - current_address + nn == 0

as condition for loop termination instead of just dec + branch
of nn.

That's likely due to costs as estimated by IVOPTs and the backend.

Simple code would use DEC which costs 1 arithmetic + 1 reg for
nn (in addition to the current address).

The current code uses start address, start nn + ADD + SUB (in
addition to the current address).  So doubles the register
pressure and doubles number or arithmetic compared to
simple code.

There are many cases where -fno-ivopts or -fno-tree-loop-optimize
give better code, but I doubt that switching off these passes
altogether is a good idea...

Johann

Richard.

Johann


But  -mmcu=avr4 -Os -S -dp  compiles this to madness that
requires more than a dozen registers and computes each
intermediate in its own 16-bit register, coming out with
code that requires 34 instructions instead of 12.

--

AVR: Run pass avr-fuse-add a second time after pass_cprop_hardreg.

gcc/
          * config/avr/avr-protos.h (avr_split_tiny_move): Rename to
          avr_split_fake_addressing_move.
          * config/avr/avr-passes.cc: Same.
          (avr_pass_data_fuse_add) <tv_id>: Set to TV_MACH_DEP.
          (avr_pass_fuse_add) <clone>: Override.
          * config/avr/avr-passes.def (avr_pass_fuse_add): Run again
          after pass_cprop_hardreg.
          * config/avr/avr.md (split-lpmx): Remove a define_split.  Such
          splits are performed by avr_split_fake_addressing_move.

Reply via email to