[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

ubizjak at gmail dot com via Gcc-bugs Fri, 05 Mar 2021 00:29:08 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856


--- Comment #24 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Richard Biener from comment #22)

> That works to avoid the vpinsrq.  I guess the case of a mem operand
> behaves similar to a gpr (plus the load uop), at least I don't have any
> contrary evidence (but I didn't do any microbenchmarks either).
> 
> I'm not sure IRA/LRA will optimally handle the situation with register
> pressure causing spilling in case it needs to reload both gpr operands.
> At least for
> 
> typedef long v2di __attribute__((vector_size(16)));
> 
> v2di foo (long a, long b)
> {
>   return (v2di){a, b};
> }
> 
> with -msse4.1 -O3 -ffixed-xmm1 -ffixed-xmm2 -ffixed-xmm3 -ffixed-xmm4
> -ffixed-xmm5 -ffixed-xmm6 -ffixed-xmm7 -ffixed-xmm8 -ffixed-xmm9
> -ffixed-xmm10 -ffixed-xmm11 -ffixed-xmm12 -ffixed-xmm13 -ffixed-xmm14
> -ffixed-xmm15 I get with the
> patch
> 
> foo:
> .LFB0:
>         .cfi_startproc
>         movq    %rsi, -16(%rsp)
>         movq    %rdi, %xmm0
>         pinsrq  $1, -16(%rsp), %xmm0
>         ret
> 
> while without it's
> 
>         movq    %rdi, %xmm0
>         pinsrq  $1, %rsi, %xmm0

This is expected; my patch is based on the assumption that punpcklqdq is cheap
compared to pinsrq, and that interunit moves are cheap. This way, IRA will
reload the GP register to an XMM register and use the cheaper instruction.
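
For illustration only, the sequence the patch aims for (two GPR->XMM moves
followed by punpcklqdq) can be sketched at the source level with SSE2
intrinsics. This is not part of the patch; the function names are made up for
this example, the element type is long long rather than long so that it matches
the intrinsic signatures, and the compiler remains free to pick different
instructions than the intrinsics suggest.

```c
typedef long long v2di __attribute__((vector_size(16)));

#if defined(__x86_64__) && defined(__SSE2__)
#include <emmintrin.h>

/* Hypothetical helper: move each 64-bit GPR value into its own XMM
   register (movq), then merge the two low quadwords (punpcklqdq).  */
static v2di
concat_via_unpack (long long a, long long b)
{
  __m128i lo = _mm_cvtsi64_si128 (a);
  __m128i hi = _mm_cvtsi64_si128 (b);
  return (v2di) _mm_unpacklo_epi64 (lo, hi);
}
#else
/* Fallback so the sketch still compiles on other targets.  */
static v2di
concat_via_unpack (long long a, long long b)
{
  return (v2di){a, b};
}
#endif

/* Reference: the testcase from comment #22, where the compiler
   chooses the instruction sequence itself.  */
static v2di
concat_reference (long long a, long long b)
{
  return (v2di){a, b};
}
```

Both functions return the same vector; only the instruction selection that the
source hints at differs.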

> as far as I understand LRA dumps the new attribute is a hard one, even
> applying when other alternatives are worse.  In this case we choose
> alt 7.  Covering also alts 7 and 8 with the optimize-for-speed attribute
> causes reload fails - which is expected if there's no way for LRA to
> choose alt 1.  The following seems to work for the small testcase above
> but not for the important case in the benchmark (meh).
> 
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index db5be59f5b7..e393a0d823b 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -15992,7 +15992,7 @@
>           (match_operand:DI 1 "register_operand"
>           "  0, 0,x ,Yv,0,Yv,0,0,v")
>           (match_operand:DI 2 "nonimmediate_operand"
> -         " rm,rm,rm,rm,x,Yv,x,m,m")))]
> +         " !rm,!rm,!rm,!rm,x,Yv,x,!m,!m")))]
>    "TARGET_SSE"
>    "@
>     pinsrq\t{$1, %2, %0|%0, %2, 1}

The above means that the GP register will still be used, since it fits without
reloading.

> I guess the idea of this insn setup was exactly to get IRA/LRA choose
> the optimal instruction sequence - otherwise exposing the reload so
> late is probably suboptimal.

There is one more tool in the toolbox. A peephole2 pattern can be
conditionalized on an available XMM register. If an XMM register is available,
the GPR->XMM move can be emitted in front of the insn. So, under XMM register
pressure, pinsrq will be used, but if an XMM register is available, it will be
used to emit punpcklqdq.

The peephole2 pattern can also be conditionalized for targets where GPR->XMM
moves are fast.
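
As a rough sketch of that decision, and not of GCC's actual peephole2 machinery,
the two conditions could be modelled as below. All names here are hypothetical
and exist only for this example.

```c
#include <stdbool.h>

/* Hypothetical labels for the two instruction sequences discussed above.  */
enum concat_seq
{
  SEQ_PINSRQ,            /* pinsrq $1, %gpr, %xmm */
  SEQ_MOVQ_PUNPCKLQDQ    /* movq %gpr, %xmm_scratch; punpcklqdq */
};

/* Toy model: use the movq + punpcklqdq form only when a scratch XMM
   register is free and the target has fast GPR->XMM moves; otherwise
   keep pinsrq.  */
static enum concat_seq
choose_concat_seq (bool scratch_xmm_available, bool fast_gpr_to_xmm_moves)
{
  if (scratch_xmm_available && fast_gpr_to_xmm_moves)
    return SEQ_MOVQ_PUNPCKLQDQ;
  return SEQ_PINSRQ;
}
```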

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
