On Wed, Jun 22, 2022 at 1:39 PM Roger Sayle <ro...@nextmovesoftware.com> wrote: > > > This patch addresses PR target/105930 which is an ia32 stack frame size > regression in high-register pressure XOR-rich cryptography functions > reported by Linus Torvalds. The underlying problem is once the limited > number of registers on the x86 are exhausted, the register allocator > has to decide which to spill, where some eviction choices lead to much > poorer code, but these consequences are difficult to predict in advance. > > The patch below, which splits xordi3_doubleword and iordi3_doubleword > after reload (instead of before), significantly reduces the amount of > spill code and stack frame size, in what might appear to be an arbitrary > choice. > > My explanation of this behaviour is that the mixing of pre-reload split > SImode instructions and post-reload split DImode instructions is > confusing some of the heuristics used by reload. One might think > that splitting early gives the register allocator more freedom to > use available registers, but in practice the constraint that double > word values occupy consecutive registers (when ultimately used as a > DImode value) is the greater constraint. Instead, I believe in this > case, the pseudo registers used in memory addressing, appear to be > double counted for split SImode instructions compared to unsplit > DImode instructions. For the reduced test case in comment #13, this > leads to %eax being used to hold the long-lived argument pointer "v", > blocking the use of the ax:dx pair for processing double word values. > The important lines are at the very top of the assembly output: > > GCC 11 [use %ecx to address memory, require a 24-byte stack frame] > sub esp, 24 > mov ecx, DWORD PTR [esp+40] > > GCC 12 [use %eax to address memory, require a 44-byte stack frame] > sub esp, 44 > mov eax, DWORD PTR [esp+64]
The pre-reload vs. post-reload approach is a constant source of various regressions, since one or the other has its benefits and drawbacks. Please note that arithmetic and shift instructions remain on a post-reload approach, since there were various regressions w.r.t. register allocator observed. I think that the best approach for now would be to move *all* double-word splitters to the post-reload approach. As you exposed above, mixing pre-reload and post-reload approaches confuses reload too much due to the blocking of the part of a register pair. > Jakub's alternative proposed patch in comment #17 is to improve > consistency by splitting more instructions (rotates) before reload, > which shows a small improvement, but not to GCC v11 levels. So, I guess the experiment shows that we should revert to post-reload splitters, while keeping the improvements splitters gain from post-reload implementation. > I have some follow-up patches (based on other approaches I've tried), > including splitting rotations by 32 after reload, and supporting > TImode operations via <dwi>, notice this same pattern is mentioned in > https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596201.html > but this patch below is the core minimal fix that's hopefully > suitable for benchmarking and possibly backporting to the 12 branch. > I believe that changes to the register allocator itself, to tweak how > stack slots are assigned and which values can be cheaply materialized > are out-of-scope for the release branches. I assume that and/andnot should also be moved to a post-reload split approach. > > > I'm curious what folks (especially Uros and Jakub) think? > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > and make -k check, both with and without --target_board=unix{-m32}, > with no new failures. CSiBE also shows a (minor) code size reduction. > It would be great if someone could benchmark this patch, or alternatively > it can be baked on mainline to let the automatic benchmarking evaluate it, > then revert the patch if there are any observed performance issues. > > Thoughts? > > > 2022-06-22 Roger Sayle <ro...@nextmovesoftware.com> > > gcc/ChangeLog > PR target/105930 > * config/i386/i386.md (*<any_or>di3_doubleword): Split after > reload. Use rtx_equal_p to avoid creating memory-to-memory moves, > and emit NOTE_INSN_DELETED if operand[2] is zero (i.e. with -O0). > (define_insn <any_or><mode>_1): Renamed from *<any_or>mode_1 > to provide gen_<any_or>si_1 functions. Is there a reason to not use ix86_expand_{binary|unary}_operator? It fits perfectly here. Otherwise, the patch looks OK to me. Uros.