[fpc-devel] More peephole optimisation questions

J. Gareth Moreton via fpc-devel Tue, 19 Apr 2022 12:03:28 -0700

Hi everyone,

So this is another question on peephole optimisation for x86_64. Occasionally you get situations where you write a load of constants to the stack - in this case it's part of an array parameter to a function call:


    movl    $23199763,32(%rsp)
    movl    $262149,36(%rsp)
    movl    $33816983,40(%rsp)
    movl    $36176315,44(%rsp)
    movl    $50660102,48(%rsp)
    movl    $65340390,52(%rsp)

x86_64 doesn't support writing an arbitrary 64-bit constant directly to memory (a store can only take a sign-extended 32-bit immediate), so you have to write it to a register first. With that in mind, is the following code faster?


    movq    $1125921404878867,%rax
    movq    %rax,32(%rsp)
    movq    $155376089848611223,%rax
    movq    %rax,40(%rsp)
    movq    $280634838208545542,%rax
    movq    %rax,48(%rsp)

I know there will be a pipeline stall between the first two instructions, but logic tells me that parallelisation, out-of-order execution and register renaming will ensure that loading the register with the next immediate can happen at the same time as its previous value is being written to memory. I know there are a lot of variables, like how smart the processor is and how many ALUs and AGUs are available, so that's why I'm after a second opinion before I start proposing an optimisation that's speculative at best. If necessary, I could even do this (if the registers are available):


    movq    $1125921404878867,%rax
    movq    $155376089848611223,%rcx
    movq    $280634838208545542,%rdx
    movq    %rax,32(%rsp)
    movq    %rcx,40(%rsp)
    movq    %rdx,48(%rsp)

At the very least I'm pretty sure it's not worth it to concatenate a single pair of 32-bit immediates. For example, if it was just the first two:


    movl    $23199763,32(%rsp)
    movl    $262149,36(%rsp)

... it would not be worth it to transmute them into:

    movq    $1125921404878867,%rax
    movq    %rax,32(%rsp)

Since in the former case, the two can be executed in parallel and the only barrier is memory latency (almost all modern Intel CPUs have at least 2 AGUs), while the latter case introduces a dependency.
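
To make the proposed rule concrete, here is a rough Python model of it. This is not aoptx86 code: the function merge_stores, the tuple layout and the min_pairs threshold are all hypothetical, and the threshold of 2 only encodes the guess above that a lone pair isn't worth merging, not anything measured:

```python
def merge_stores(stores, min_pairs=2):
    """Model of the idea: `stores` is a list of (imm32, offset) tuples sorted
    by offset. Returns (width_in_bytes, immediate, offset) tuples, merging
    contiguous 4-byte stores into 8-byte ones only when a run contains at
    least `min_pairs` pairs (hypothetical threshold)."""
    out = []
    i = 0
    while i < len(stores):
        # Extend j to the end of a run of back-to-back 4-byte stores.
        j = i
        while j + 1 < len(stores) and stores[j + 1][1] == stores[j][1] + 4:
            j += 1
        run = stores[i:j + 1]
        pairs = len(run) // 2
        if pairs >= min_pairs:
            for k in range(0, pairs * 2, 2):
                (low, offset), (high, _) = run[k], run[k + 1]
                out.append((8, (high << 32) | low, offset))
            # An odd store at the end of the run stays as a 4-byte write.
            out.extend((4, imm, off) for imm, off in run[pairs * 2:])
        else:
            out.extend((4, imm, off) for imm, off in run)
        i = j + 1
    return out

six = [(23199763, 32), (262149, 36), (33816983, 40),
       (36176315, 44), (50660102, 48), (65340390, 52)]
print(merge_stores(six))      # all three pairs merged
print(merge_stores(six[:2]))  # a single pair is left alone
```

Whether 2 is the right cut-off (or whether merging pays off at all) is exactly the question being asked here.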


Gareth aka. Kit

P.S. In this case, the assembly language is generated by this parameter in aoptx86: "[A_CMP, A_TEST, A_BSR, A_BSF, A_COMISS, A_COMISD, A_UCOMISS, A_UCOMISD, A_VCOMISS, A_VCOMISD, A_VUCOMISS, A_VUCOMISD]". This is part of the CMOV optimisations and is a load of instructions that are used for comparisons. If the opcode matches one of the above, the peephole optimizer will see if it's possible to position MOV instructions before the comparison instead of between the comparison and the conditional jump. This works better for macro-fusion and makes it possible to turn "mov $0,%reg" into "xor %reg,%reg", which cannot be done while the FLAGS register is in use (XOR overwrites the flags), so moving the MOV before the comparison eliminates that problem.



