nvptx vs. [PATCH] Add a late-combine pass [PR106594]

Thomas Schwinge Thu, 27 Jun 2024 09:49:47 -0700

Hi!

On 2023-10-24T19:49:10+0100, Richard Sandiford <richard.sandif...@arm.com> 
wrote:
> This patch adds a combine pass that runs late in the pipeline.


Great!

In context of <https://gcc.gnu.org/PR115682>
'nvptx vs. "fwprop: invoke change_is_worthwhile to judge if a replacement is 
worthwhile"',
I've beek looking a bit through recent nvptx target code generation
changes for GCC target libraries, and thought I'd also share here my
findings for the "late-combine" changes in isolation, for nvptx target.

First the unexpected thing: there are a few cases where we now see unused
registers get declared, for example (random) in
'nvptx-none/newlib/libc/libm_a-s_modf.o:modf' (full 'diff' before vs.
after):

    [...]
     .visible .func (.param .f64 %value_out) modf (.param .f64 %in_ar0, .param 
.u64 %in_ar1)
     {
     .reg .f64 %value;
     .reg .f64 %ar0;
     ld.param.f64 %ar0,[%in_ar0];
     .reg .u64 %ar1;
     ld.param.u64 %ar1,[%in_ar1];
     .reg .u32 %r23;
     .reg .f64 %r32;
     .reg .u32 %r41;
     .reg .u32 %r42;
     .reg .f64 %r43;
     .reg .u32 %r46;
    +.reg .u64 %r48;
    +.reg .u64 %r49;
    +.reg .u64 %r50;
    +.reg .u64 %r51;
    +.reg .u64 %r52;
     .reg .f64 %r53;
     .reg .f64 %r54;
     .reg .u64 %r55;
    [...]

That is, five additional registers declared, without any use.  I suppose
that's some 'gen_reg_rtx' that needs to be "confined" (in whichever way;
to be looked into).  I suppose that's not actually a problem: the PTX JIT
should clean these out again, but it's still noise when reading
GCC-emitted PTX code.


Other than that, I've only got good things to report :-) -- a few
examples in the following, if anyone's interested, without much
commentary.  I haven't looked of how much these "visible" PTX code
generation changes translate into actual GPU SASS code improvements after
the PTX JIT has done its thing, but it's certainly easier to read/less
state to keep during human reading (due to less live registers,
primarily).

'nvptx-none/libatomic/cas_16_.o':

    [...]
    -.reg .u32 %r24;
    [...]
     setp.eq.u64 %r45,%r40,0;
    -selp.u32 %r24,1,0,%r45;
     .loc 3 113 6
     setp.ne.u64 %r48,%r58,%r60;
     @ %r48 bra $L2;
    @@ -86,7 +84,7 @@
     call libat_unlock_1,(%out_arg1);
     }
     .loc 3 122 1
    -mov.u32 %value,%r24;
    +selp.u32 %value,1,0,%r45;
     st.param.u32 [%value_out],%value;
     ret;
     }

A lot more instances of similar patterns.

'nvptx-none/libbacktrace/dwarf.o':

    [...]
    -.reg .u64 %r695;
    [...]
     setp.ne.u32 %r679,%r678,0;
    -@ ! %r679 bra $L1149;
    -add.u64 %r695,%r212,32;
    -bra $L1090;
    -$L1149:
    +@ %r679 bra $L1090;
    [...]
     $L1090:
     .loc 2 4046 7
    -mov.u64 %r28,%r695;
    +add.u64 %r28,%r212,32;
     $L1096:
    [...]

'nvptx-none/libbacktrace/dwarf.o':

    [...]
    -.reg .u64 %r41;
    [...]
    -add.u64 %r41,%frame,8;
     mov.u64 %r43,0;
    -st.u64 [%r41],%r43;
    -st.u64 [%r41+8],%r43;
    -st.u64 [%r41+16],%r43;
    +st.u64 [%frame+8],%r43;
    +st.u64 [%frame+16],%r43;
    +st.u64 [%frame+24],%r43;
    [...]

'nvptx-none/libgfortran/generated/cshift1_8.o':

    [...]
    -.reg .u64 %r293;
    [...]
    -add.u64 %r293,%r292,120;
    -st.u64 [%r293],%r409;
    +st.u64 [%r292+120],%r409;
    [...]

'nvptx-none/newlib/libc/stdio/libc_a-siscanf.o', have 'st' do 'u32'
truncation instead of explicit 'cvt':

    [...]
    -.reg .u32 %r23;
    [...]
    -cvt.u32.u64 %r23,%r32;
    -st.u32 [%frame+8],%r23;
    +st.u32 [%frame+8],%r32;
    [...]


'nvptx-none/newlib/libc/string/libc_a-memccpy.o', simplify (supposedly)
char -> int -> short into char -> short:

    [...]
    -.reg .u32 %r37;
    [...]
    -cvt.u32.u8 %r37,%r49;
    [...]
    -cvt.u16.u32 %r67,%r37;
    +cvt.u16.u8 %r67,%r49;
    [...]

'nvptx-none/libgcc/crt0.o':

    [...]
    -.reg .u64 %r30;
    [...]
     cvta.global.u64 %r29,stack$0+131072;
     st.shared.u64 [__nvptx_stacks],%r29;
    -cvta.shared.u64 %r30,__nvptx_uni;
     mov.u32 %r31,0;
    -st.u32 [%r30],%r31;
    +st.shared.u32 [__nvptx_uni],%r31;
    [...]

There are a lot more instances of getting rid of a register in favor of
using more complex instructions.

'nvptx-none/libgfortran/generated/findloc2_s4.o':

    [...]
    -.reg .u64 %r33;
    [...]
    -neg.s64 %r33,%r35;
    [...]
    -add.u64 %r39,%r39,%r33;
    +sub.u64 %r39,%r39,%r35;
    [...]

..., but in turn also a multi-use variant of the 'neg' (is this still
beneficial?), 'nvptx-none/newlib/libc/search/libc_a-bsd_qsort_r.o':

    [...]
    -.reg .u64 %r34;
    [...]
    -neg.s64 %r34,%r82;
    -.loc 2 213 9
    -add.u64 %r35,%r76,%r34;
    +sub.u64 %r35,%r76,%r82;
    [...]
    -add.u64 %r37,%r71,%r34;
    -add.u64 %r38,%r37,%r34;
    +sub.u64 %r37,%r71,%r82;
    +sub.u64 %r38,%r37,%r82;
    [...]

'nvptx-none/newlib/libm/complex/libm_a-cephes_subr.o':

    [...]
    -.reg .f64 %r35;
    [...]
    -neg.f64 %r35,%r27;
    -fma.rn.f64 %r22,%r35,0d400921fb54000000,%r32;
    +fma.rn.f64 %r22,%r27,0dc00921fb54000000,%r32;
     .loc 2 80 21
    -fma.rn.f64 %r23,%r35,0d3e210b4610000000,%r22;
    +fma.rn.f64 %r23,%r27,0dbe210b4610000000,%r22;
     .loc 2 80 4
    -fma.rn.f64 %value,%r35,0d3c6a62633145c06e,%r23;
    +fma.rn.f64 %value,%r27,0dbc6a62633145c06e,%r23;
    [...]

'nvptx-none/libgcc/_gcov_info_to_gcda.o':

    [...]
    -.reg .u32 %r186;
    [...]
    -add.u32 %r186,%r59,%r59;
    -add.u32 %r187,%r186,%r59;
    +vadd.u32.u32.u32.add %r187,%r59,%r59,%r59;
    [...]


Grüße
 Thomas


> There are two instances: one between combine and split1, and one
> after postreload.
>
> The pass currently has a single objective: remove definitions by
> substituting into all uses.  The pre-RA version tries to restrict
> itself to cases that are likely to have a neutral or beneficial
> effect on register pressure.
>
> The patch fixes PR106594.  It also fixes a few FAILs and XFAILs
> in the aarch64 test results, mostly due to making proper use of
> MOVPRFX in cases where we didn't previously.  I hope it would
> also help with Robin's vec_duplicate testcase, although the
> pressure heuristic might need tweaking for that case.
>
> This is just a first step..  I'm hoping that the pass could be
> used for other combine-related optimisations in future.  In particular,
> the post-RA version doesn't need to restrict itself to cases where all
> uses are substitutitable, since it doesn't have to worry about register
> pressure.  If we did that, and if we extended it to handle multi-register
> REGs, the pass might be a viable replacement for regcprop, which in
> turn might reduce the cost of having a post-RA instance of the new pass.
>
> I've run an assembly comparison with one target per CPU directory,
> and it seems to be a win for all targets except nvptx (which is hard
> to measure, being a higher-level asm).  The biggest winner seemed
> to be AVR.
>
> I'd originally hoped to enable the pass by default at -O2 and above
> on all targets.  But in the end, I don't think that's possible,
> because it interacts badly with x86's STV and partial register
> dependency passes.
>
> For example, gcc.target/i386/minmax-6.c tests whether the code
> compiles without any spilling.  The RTL created by STV contains:
>
> (insn 33 31 3 2 (set (subreg:V4SI (reg:SI 120) 0)
>         (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 116))
>             (const_vector:V4SI [
>                     (const_int 0 [0]) repeated x4
>                 ])
>             (const_int 1 [0x1]))) -1
>      (nil))
> (insn 3 33 34 2 (set (subreg:V4SI (reg:SI 118) 0)
>         (subreg:V4SI (reg:SI 120) 0)) {movv4si_internal}
>      (expr_list:REG_DEAD (reg:SI 120)
>         (nil)))
> (insn 34 3 32 2 (set (reg/v:SI 108 [ y ])
>         (reg:SI 118)) -1
>      (nil))
>
> and it's crucial for the test that reg 108 is kept, rather than
> propagated into uses.  As things stand, 118 can be allocated
> a vector register and 108 a scalar register.  If 108 is propagated,
> there will be scalar and vector uses of 118, and so it will be
> spilled to memory.
>
> That one could be solved by running STV2 later.  But RPAD is
> a bigger problem.  In gcc.target/i386/pr87007-5.c, RPAD converts:
>
> (insn 27 26 28 6 (set (reg:DF 100 [ _15 ])
>         (sqrt:DF (mem/c:DF (symbol_ref:DI ("d2"))))) {*sqrtdf2_sse}
>      (nil))
>
> into:
>
> (insn 45 26 44 6 (set (reg:V4SF 108)
>         (const_vector:V4SF [
>                 (const_double:SF 0.0 [0x0.0p+0]) repeated x4
>             ])) -1
>      (nil))
> (insn 44 45 27 6 (set (reg:V2DF 109)
>         (vec_merge:V2DF (vec_duplicate:V2DF (sqrt:DF (mem/c:DF (symbol_ref:DI 
> ("d2")))))
>             (subreg:V2DF (reg:V4SF 108) 0)
>             (const_int 1 [0x1]))) -1
>      (nil))
> (insn 27 44 28 6 (set (reg:DF 100 [ _15 ])
>         (subreg:DF (reg:V2DF 109) 0)) {*movdf_internal}
>      (nil))
>
> But both the pre-RA and post-RA passes are able to combine these
> instructions back to the original form.
>
> The patch therefore enables the pass by default only on AArch64.
> However, I did test the patch with it enabled on x86_64-linux-gnu
> as well, which was useful for debugging.
>
> Bootstrapped & regression-tested on aarch64-linux-gnu and
> x86_64-linux-gnu (as posted, with no regressions, and with the
> pass enabled by default, with some gcc.target/i386 regressions).
> OK to install?
>
> Richard

nvptx vs. [PATCH] Add a late-combine pass [PR106594]

Reply via email to