> On Thu, Jul 6, 2023 at 2:04 PM Roger Sayle <ro...@nextmovesoftware.com>
> wrote:
> >
> >
> > Passing 128-bit integer (TImode) parameters on x86_64 can sometimes
> > result in surprising code.  Consider the example below (from PR 43644):
> >
> > __uint128 foo(__uint128 x, unsigned long long y) {
> >   return x+y;
> > }
> >
> > which currently results in 6 consecutive movq instructions:
> >
> > foo:    movq    %rsi, %rax
> >         movq    %rdi, %rsi
> >         movq    %rdx, %rcx
> >         movq    %rax, %rdi
> >         movq    %rsi, %rax
> >         movq    %rdi, %rdx
> >         addq    %rcx, %rax
> >         adcq    $0, %rdx
> >         ret
> >
> > The underlying issue is that during RTL expansion, we generate the
> > following initial RTL for the x argument:
> >
> > (insn 4 3 5 2 (set (reg:TI 85)
> >         (subreg:TI (reg:DI 86) 0)) "pr43644-2.c":5:1 -1
> >      (nil))
> > (insn 5 4 6 2 (set (subreg:DI (reg:TI 85) 8)
> >         (reg:DI 87)) "pr43644-2.c":5:1 -1
> >      (nil))
> > (insn 6 5 7 2 (set (reg/v:TI 84 [ x ])
> >         (reg:TI 85)) "pr43644-2.c":5:1 -1
> >      (nil))
> >
> > which by combine/reload becomes
> >
> > (insn 25 3 22 2 (set (reg/v:TI 84 [ x ])
> >         (const_int 0 [0])) "pr43644-2.c":5:1 -1
> >      (nil))
> > (insn 22 25 23 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 0)
> >         (reg:DI 93)) "pr43644-2.c":5:1 90 {*movdi_internal}
> >      (expr_list:REG_DEAD (reg:DI 93)
> >         (nil)))
> > (insn 23 22 28 2 (set (subreg:DI (reg/v:TI 84 [ x ]) 8)
> >         (reg:DI 94)) "pr43644-2.c":5:1 90 {*movdi_internal}
> >      (expr_list:REG_DEAD (reg:DI 94)
> >         (nil)))
> >
> > where the heavy use of SUBREG SET_DESTs creates challenges for both
> > combine and register allocation.
> >
> > The improvement proposed here is to avoid these problematic SUBREGs by
> > adding (two) special cases to ix86_expand_move.  For insn 4, which
> > sets a TImode destination from a paradoxical SUBREG, to assign the
> > lowpart, we can use an explicit zero extension (zero_extendditi2 was
> > added in July 2022), and for insn 5, which sets the highpart of a
> > TImode register we can use the *insvti_highpart_1 instruction (that
> > was added in May 2023, after being approved for stage1 in January).
> > This allows combine to work its magic, merging these insns into a
> > *concatditi3 and from there into other optimized forms.
> 
> How about we introduce *insvti_lowpart_1, similar to *insvti_highpart_1, in 
> the
> hope that combine is smart enough to also combine these two instructions? IMO,
> faking insert to lowpart of the register with zero_extend is a bit overkill, 
> and could
> hinder some other optimization opportunities (as perhaps hinted by failing
> testcases).

The use of ZERO_EXTEND serves two purposes, both the setting of the lowpart
and of informing the RTL passes that the highpart is dead.  Notice in the 
original
RTL stream, i.e. current GCC, insn 25 is inserted by the .286r.init-regs pass, 
clearing
the entirety of the TImode register (like a clobber), and preventing TI:84 from
occupying the same registers as DI:93 and DI:94.

If the middle-end had asked the backend to generate a SET to STRICT_LOWPART
then our hands would be tied, but a paradoxical SUBREG allows us the freedom
to set the highpart bits to a defined value (we could have used sign extension 
if
that was cheap), which then simplifies data-flow and liveness analysis.  
Allowing the
highpart to contain undefined or untouched data is exactly the sort of security
side-channel leakage that the clear regs pass attempts to address.

I can investigate an *insvti_lowpart_1, but I don't think it will help with this
issue, i.e. it won't prevent init-regs from clobbering/clearing TImode 
parameters.

> > So for the test case above, we now generate only a single movq:
> >
> > foo:    movq    %rdx, %rax
> >         xorl    %edx, %edx
> >         addq    %rdi, %rax
> >         adcq    %rsi, %rdx
> >         ret
> >
> > But there is a little bad news.  This patch causes two (minor) missed
> > optimization regressions on x86_64; gcc.target/i386/pr82580.c and
> > gcc.target/i386/pr91681-1.c.  As shown in the test case above, we're
> > no longer generating adcq $0, but instead using xorl.  For the other
> > FAIL, register allocation now has more freedom and is (arbitrarily)
> > choosing a register assignment that doesn't match what the test is
> > expecting.  These issues are easier to explain and fix once this patch
> > is in the tree.
> >
> > The good news is that this approach fixes a number of long standing
> > issues, that need to checked in bugzilla, including PR target/110533
> > which was just opened/reported earlier this week.
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32}
> > with only the two new FAILs described above.  Ok for mainline?
> >
> > 2023-07-06  Roger Sayle  <ro...@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >         PR target/43644
> >         PR target/110533
> >         * config/i386/i386-expand.cc (ix86_expand_move): Convert SETs of
> >         TImode destinations from paradoxical SUBREGs (setting the lowpart)
> >         into explicit zero extensions.  Use *insvti_highpart_1 instruction
> >         to set the highpart of a TImode destination.
> >
> > gcc/testsuite/ChangeLog
> >         PR target/43644
> >         PR target/110533
> >         * gcc.target/i386/pr110533.c: New test case.
> >         * gcc.target/i386/pr43644-2.c: Likewise.
> >
> > Thanks in advance,
> > Roger
> > --
> >

Reply via email to