On Wed, Jun 22, 2022 at 1:39 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:
>
>
> This patch addresses PR target/105930 which is an ia32 stack frame size
> regression in high-register pressure XOR-rich cryptography functions
> reported by Linus Torvalds.  The underlying problem is once the limited
> number of registers on the x86 are exhausted, the register allocator
> has to decide which to spill, where some eviction choices lead to much
> poorer code, but these consequences are difficult to predict in advance.
>
> The patch below, which splits xordi3_doubleword and iordi3_doubleword
> after reload (instead of before), significantly reduces the amount of
> spill code and stack frame size, in what might appear to be an arbitrary
> choice.
>
> My explanation of this behaviour is that the mixing of pre-reload split
> SImode instructions and post-reload split DImode instructions is
> confusing some of the heuristics used by reload.  One might think
> that splitting early gives the register allocator more freedom to
> use available registers, but in practice the constraint that double
> word values occupy consecutive registers (when ultimately used as a
> DImode value) is the greater constraint.  Instead, I believe in this
> case, the pseudo registers used in memory addressing, appear to be
> double counted for split SImode instructions compared to unsplit
> DImode instructions.  For the reduced test case in comment #13, this
> leads to %eax being used to hold the long-lived argument pointer "v",
> blocking the use of the ax:dx pair for processing double word values.
> The important lines are at the very top of the assembly output:
>
> GCC 11  [use %ecx to address memory, require a 24-byte stack frame]
>         sub     esp, 24
>         mov     ecx, DWORD PTR [esp+40]
>
> GCC 12 [use %eax to address memory, require a 44-byte stack frame]
>         sub     esp, 44
>         mov     eax, DWORD PTR [esp+64]

The pre-reload vs. post-reload approach is a constant source of
various regressions, since one or the other has its benefits and
drawbacks. Please note that arithmetic and shift instructions remain
on a post-reload approach, since there were various regressions w.r.t.
register allocator observed.

I think that the best approach for now would be to move *all*
double-word splitters to the post-reload approach. As you exposed
above, mixing pre-reload and post-reload approaches confuses reload
too much due to the blocking of the part of a register pair.

> Jakub's alternative proposed patch in comment #17 is to improve
> consistency by splitting more instructions (rotates) before reload,
> which shows a small improvement, but not to GCC v11 levels.

So, I guess the experiment shows that we should revert to post-reload
splitters, while keeping the improvements splitters gain from
post-reload implementation.

> I have some follow-up patches (based on other approaches I've tried),
> including splitting rotations by 32 after reload, and supporting
> TImode operations via <dwi>, notice this same pattern is mentioned in
> https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596201.html
> but this patch below is the core minimal fix that's hopefully
> suitable for benchmarking and possibly backporting to the 12 branch.
> I believe that changes to the register allocator itself, to tweak how
> stack slots are assigned and which values can be cheaply materialized
> are out-of-scope for the release branches.

I assume that and/andnot should also be moved to a post-reload split approach.
>
>
> I'm curious what folks (especially Uros and Jakub) think?
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32},
> with no new failures.  CSiBE also shows a (minor) code size reduction.
> It would be great if someone could benchmark this patch, or alternatively
> it can be baked on mainline to let the automatic benchmarking evaluate it,
> then revert the patch if there are any observed performance issues.
>
> Thoughts?
>
>
> 2022-06-22  Roger Sayle  <ro...@nextmovesoftware.com>
>
> gcc/ChangeLog
>         PR target/105930
>         * config/i386/i386.md (*<any_or>di3_doubleword): Split after
>         reload.  Use rtx_equal_p to avoid creating memory-to-memory moves,
>         and emit NOTE_INSN_DELETED if operand[2] is zero (i.e. with -O0).
>         (define_insn <any_or><mode>_1): Renamed from *<any_or>mode_1
>         to provide gen_<any_or>si_1 functions.

Is there a reason to not use ix86_expand_{binary|unary}_operator? It
fits perfectly here.

Otherwise, the patch looks OK to me.

Uros.

Reply via email to