Okay, so I ran a test to see what would happen if the optimiser cycled
back to Pass 1 whenever Pass 2 made changes. Other than some edge cases
in some packages (a bug caused an infinite loop at this stage of the
make process), it makes no difference to code generation in the RTL.
The only notable difference is that the compiler is much slower!
So I think we can write off this idea. Granted, there are still some
situations where Pass 2 methods call Pass 1 methods. The most notable
one is a slightly convoluted optimisation in OptPass2MOV: if the
instruction that follows is a JMP, it calls OptPass2JMP on it right
there and then, rather than waiting for PeepHoleOptPass2Cpu to reach
the instruction. The reason for this is that many of OptPass2JMP's
optimisations either insert a MOV that can be optimised with the
original MOV, or insert a RET and turn the original MOV into a
dead store. However, this optimisation cannot be moved into Pass 1
without a drop in optimisation quality (notably, OptPass2JMP does a
worse job of converting blocks of MOVs into CMOVcc instructions).
Following on from comments from Florian in i38555, I'll see about
factoring out the specific MOV/MOV and MOV/RET optimisations from
OptPass1MOV at some point so they can be called separately. Not only
will this minimise the problems and design violations that come from
calling Pass 1 methods from Pass 2, but it will also provide a speed
gain in Pass 2 from not having to check everything that OptPass1MOV
has to offer.
Gareth aka. Kit
On 28/02/2021 04:15, J. Gareth Moreton via fpc-devel wrote:
Just as an example, when compiling the System unit on r48813, there
exists this block of disassembly:
.Lj4072:
...
leaq (%rsi,%r13),%rax
leaq -1(%rax),%r12
# Peephole Optimization: SubMov2LeaSub
subq $1,%rax
...
With my improvement over at i38555, the optimiser can remove the sub
instruction because %rax doesn't get used afterwards, hence:
.Lj4072:
...
jne .Lj4070
leaq (%rsi,%r13),%rax
# Peephole Optimization: SubMov2Lea
leaq -1(%rax),%r12
...
SubMov2Lea (and SubMov2LeaSub) is a Pass 2 optimisation because of the
potential to do deeper optimisations on the MOV instruction (which are
in Pass 1). After the optimisation is made, and with the knowledge
that %rax's value is discarded afterwards, careful observation will
reveal that the two LEA instructions can be merged:
.Lj4072:
...
jne .Lj4070
# Peephole Optimization: SubMov2Lea
leaq -1(%rsi,%r13),%r12
...
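The merge in the last step can be sketched as a small Python model. This
is only an illustration of the address arithmetic, not code from the
compiler: the Lea class and merge_leas function are hypothetical names,
and the caller is assumed to already know (from register tracking)
whether the first destination register is dead afterwards.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lea:
    """Hypothetical model of `leaq disp(base,index,scale),dest`."""
    disp: int
    base: Optional[str]
    index: Optional[str]
    scale: int
    dest: str

    def __str__(self) -> str:
        inner = self.base or ""
        if self.index is not None:
            inner += "," + self.index
            if self.scale != 1:
                inner += "," + str(self.scale)
        disp = str(self.disp) if self.disp else ""
        return f"leaq {disp}({inner}),{self.dest}"


def merge_leas(first: Lea, second: Lea, first_dest_dead: bool) -> Optional[Lea]:
    """Fold `first` into `second`, or return None if that is not safe.

    Only the simple case is handled: `second` uses first.dest as its base,
    has no index register, and first.dest is not needed afterwards.
    """
    if not first_dest_dead:
        return None
    if second.base != first.dest or second.index is not None:
        return None
    disp = first.disp + second.disp
    # The combined displacement must still fit in a signed 32-bit field.
    if not -2**31 <= disp < 2**31:
        return None
    return Lea(disp, first.base, first.index, first.scale, second.dest)


first = Lea(0, "%rsi", "%r13", 1, "%rax")
second = Lea(-1, "%rax", None, 1, "%r12")
print(merge_leas(first, second, first_dest_dead=True))
```

If %rax were still live after the second instruction, the model returns
None and both LEA instructions would have to stay.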
I've been working in a separate branch to improve the optimisations in
OptPass1LEA to detect this (it currently doesn't because the two
destination registers aren't identical), and this is why I call
OptPass1LEA from OptPass2SUB in the patch provided on i38555, although
as I originally described, this feels somewhat hacky and has a risk of
opening up more bugs. A safer and more thorough approach, although
slower, would be to call Pass 1 again, where the register tracking is
up to date. For example, when calling from OptPass2SUB, the first LEA
is the previous instruction, so the register tracking is ahead by one
instruction upon entering OptPass1LEA.
Gareth aka. Kit
On 28/02/2021 01:51, J. Gareth Moreton via fpc-devel wrote:
Hi everyone,
I'm currently developing some new optimisations for LEA instructions,
having spotted some potential ones after fixing i38527. That
aside though, sometimes these optimisations only become apparent
after Pass 2 has completed. I've tried to change the order of things
so the optimisation is made in Pass 1, but there's no easy
combination that ensures the best optimisations take place (i.e. I
make a change to improve one optimisation, and another one is made
worse at the same time).
I've taken to calling OptPass1XXX routines from OptPass2XXX routines
in places where this is likely to happen, and so far this produces
the best code - however, it feels hacky and problems may occur with
register tracking if OptPass1XXX is called on a different instruction
to the current one (e.g. one optimisation I've found requires calling
GetLastInstruction and then calling OptPass1LEA on the result if it's
a LEA instruction).
So to help clean up the code and provide the best output, I would
like to propose a cross-platform change to the peephole optimiser:
- Under -O3, if a change was made in Pass 2 (implied if any of the
OptPass2XXX routines return True), the peephole optimiser cycles back
to Pass 1 and tries again.
There are a few variants for this:
- After Pass 1 is called after Pass 2, it then goes to the
Post-peephole Pass regardless of whether anything was changed.
- It goes through the whole process again: after Pass 1 is
called again, Pass 2 is then called again, and if Pass 2 returns True
again, it goes back to Pass 1, repeating as many times as
needed (or until it hits an upper limit to prevent an infinite loop
due to a compiler bug). Only once Pass 2 returns False does it
go to the Post-peephole Pass.
- The third variant is that variant 1 is done for -O2 and variant 2
is done for -O3 (and no extra run of Pass 1 for -O1).
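The variants above can be sketched as a rough Python model of the
control flow. This is not compiler code and does not reflect how the
FPC sources are structured: run_peephole, pass1, pass2, post and
MAX_CYCLES are all made-up names, each pass is assumed to be a callable
that returns True if it changed anything, and Pass 1 is modelled as a
single call. The sketch follows the third variant.

```python
from typing import Callable, List

# Hypothetical: a pass rewrites the instruction list in place and
# returns True if it changed anything.
Pass = Callable[[List[str]], bool]

MAX_CYCLES = 8  # assumed upper limit, guarding against an infinite loop


def run_peephole(code: List[str], pass1: Pass, pass2: Pass,
                 post: Pass, opt_level: int) -> int:
    """Run the passes and return how many times Pass 1 was executed."""
    pass1(code)
    pass1_runs = 1
    changed = pass2(code)

    if opt_level == 2 and changed:
        # Variant 1: one extra run of Pass 1, then straight to the
        # post-peephole pass regardless of whether anything changed.
        pass1(code)
        pass1_runs += 1
    elif opt_level >= 3:
        # Variant 2: keep cycling until Pass 2 reports no change or
        # the upper limit is hit.
        cycles = 1
        while changed and cycles < MAX_CYCLES:
            pass1(code)
            pass1_runs += 1
            changed = pass2(code)
            cycles += 1

    post(code)
    return pass1_runs
```

The return value is only there to make the number of Pass 1 runs easy to
observe; with a Pass 2 that always reports a change, the -O3 path stops
after MAX_CYCLES runs instead of looping forever.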
The obvious side-effect is that it causes the compiler to run
slightly slower, but this could potentially be mitigated by merging
the Pre-Peephole Pass with Pass 1, thus eliminating a distinct pass,
while any missed optimisations that occur due to this are picked up
in the second call to Pass 1 (it will most likely be picked up in the
first call to Pass 1 due to PeepHoleOptPass1Cpu returning True and
signalling another iteration).
What are everyone's thoughts?
Gareth aka. Kit