Hi everyone,

This is another question on peephole optimisation for x86_64. Occasionally you end up writing a run of constants to the stack - in this case it's part of an array parameter to a function call:

    movl    $23199763,32(%rsp)
    movl    $262149,36(%rsp)
    movl    $33816983,40(%rsp)
    movl    $36176315,44(%rsp)
    movl    $50660102,48(%rsp)
    movl    $65340390,52(%rsp)

x86_64 has no instruction that stores a full 64-bit immediate directly to memory (a 64-bit MOV to memory only accepts a sign-extended 32-bit immediate), so you have to load it into a register first. With that in mind, is the following code faster?

    movq    $1125921404878867,%rax
    movq    %rax,32(%rsp)
    movq    $155376089848611223,%rax
    movq    %rax,40(%rsp)
    movq    $280634838208545542,%rax
    movq    %rax,48(%rsp)
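
For anyone who wants to check the 64-bit immediates above, here is a minimal Python sketch of how the pairs combine. It assumes the only rule needed is little-endian packing, i.e. the store at the lower address supplies the low 32 bits. The helper name merge_imm32_pairs and its tuple layout are made up for illustration and are not part of aoptx86.

```python
# Hypothetical helper (not part of aoptx86): pair up adjacent 32-bit
# immediate stores and compute the equivalent 64-bit immediates.
def merge_imm32_pairs(stores):
    """stores: list of (imm32, offset). Returns a list of (imm, offset, size)."""
    merged = []
    i = 0
    while i < len(stores):
        # Two stores can be fused if the second one starts 4 bytes after the first.
        if i + 1 < len(stores) and stores[i + 1][1] == stores[i][1] + 4:
            lo, offset = stores[i]
            hi, _ = stores[i + 1]
            # Little-endian: the lower address holds the low 32 bits.
            imm64 = ((hi & 0xFFFFFFFF) << 32) | (lo & 0xFFFFFFFF)
            merged.append((imm64, offset, 8))
            i += 2
        else:
            # No partner at offset + 4: leave it as a 4-byte store.
            merged.append((stores[i][0] & 0xFFFFFFFF, stores[i][1], 4))
            i += 1
    return merged


# The six stores from the original listing.
stores = [(23199763, 32), (262149, 36), (33816983, 40),
          (36176315, 44), (50660102, 48), (65340390, 52)]

for imm, offset, size in merge_imm32_pairs(stores):
    print(f"{size}-byte store of ${imm} at {offset}(%rsp)")
```

Run on the six stores from the first listing, this reproduces the three immediates used in the 64-bit version. A store without a neighbour exactly 4 bytes further on is left as a 4-byte store.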

I know there is a dependency between the first two instructions (the store has to wait for the register load, so there may be a stall), but logic tells me that parallelisation, out-of-order execution and register renaming will ensure that loading %rax with the next immediate can happen at the same time as its previous value is being written to memory. I know there are a lot of variables, like how smart the processor is and how many ALUs and AGUs are available, so that's why I'm after a second opinion before I start proposing an optimisation that's speculative at best. If necessary, I could even do this (if the registers are available):

    movq    $1125921404878867,%rax
    movq    $155376089848611223,%rcx
    movq    $280634838208545542,%rdx
    movq    %rax,32(%rsp)
    movq    %rcx,40(%rsp)
    movq    %rdx,48(%rsp)

At the very least, I'm pretty sure it's not worth concatenating a single pair of 32-bit immediates. For example, if it were just the first two:

    movl    $23199763,32(%rsp)
    movl    $262149,36(%rsp)

... it would not be worth it to transmute them into:

    movq    $1125921404878867,%rax
    movq    %rax,32(%rsp)

In the former case, the two stores can be executed in parallel and the only barrier is memory latency (almost all modern Intel CPUs have at least 2 AGUs), while the latter case introduces a dependency between the register load and the store.

Gareth aka. Kit

P.S. In this case, the assembly language is generated by this parameter in aoptx86: "[A_CMP, A_TEST, A_BSR, A_BSF, A_COMISS, A_COMISD, A_UCOMISS, A_UCOMISD, A_VCOMISS, A_VCOMISD, A_VUCOMISS, A_VUCOMISD]". This is part of the CMOV optimisations and is a set of instructions that are used for comparisons. If the opcode matches one of the above, the peephole optimizer will see if it's possible to position MOV instructions before the comparison instead of between the comparison and the conditional jump. This works better for macro-fusion, and it also makes it possible to turn "mov $0,%reg" into "xor %reg,%reg", which cannot be done while the FLAGS register is in use (XOR overwrites the flags). Moving the MOV before the comparison eliminates that problem.



_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
