Hi Thomas,

There are two things I think I can contribute to this discussion.  The first is 
that I have
a patch (from a year or two ago) for adding rtx_costs to the nvptx backend that 
I will
respin, which will provide more backend control over combine-like pass 
decisions.

The second is in response to your comment below; nvptx is currently GCC's only
target.no_register_allocation target, and I think this provides an opportunity 
to
tweak/change these semantics.  Currently "no_register_allocation" turns off a 
whole bunch of post-reload passes (including things like peephole2 and 
predication
that would be extremely useful on nvptx).  My thoughts were that it would be 
nice,
if no_register_allocation turned off just/only the ira/reload passes, where the 
gate
function would instead simply set reload_completed.

One major advantage of this approach is that it would minimize the divergence
in code paths, between nvptx and all of GCC's other backends, which would 
hopefully
help with things like "late-combine" that may implicitly (and perhaps 
unintentionally)
require/expect certain passes to be run later.

Again I've investigated a patch/change or two towards the above approach, but 
its 
slightly more complicated than I'd anticipated, due to things like late 
nvptx-specific
passes assuming they can call gen_reg_rtx, after other targets have 
no_new_pseudos.
There's nothing that's impossible, but it would take a concerted effort by 
target and
middle-end maintainers to align a shallower "no_register_allocation".

I'm curious what folks think.

Best regards,
Roger
--

> -----Original Message-----
> From: Thomas Schwinge <tschwi...@baylibre.com>
> Sent: 27 June 2024 22:20
> To: Richard Sandiford <richard.sandif...@arm.com>
> Cc: j...@ventanamicro.com; rdapp....@gmail.com; gcc-patches@gcc.gnu.org;
> Tom de Vries <tdevr...@suse.de>; Roger Sayle <ro...@nextmovesoftware.com>
> Subject: Re: nvptx vs. [PATCH] Add a late-combine pass [PR106594]
> 
> Hi!
> 
> On 2024-06-27T22:27:21+0200, I wrote:
> > On 2024-06-27T18:49:17+0200, I wrote:
> >> On 2023-10-24T19:49:10+0100, Richard Sandiford
> <richard.sandif...@arm.com> wrote:
> >>> This patch adds a combine pass that runs late in the pipeline.
> >
> > [After sending, I realized I replied to a previous thread of this
> > work.]
> >
> >> I've beek looking a bit through recent nvptx target code generation
> >> changes for GCC target libraries, and thought I'd also share here my
> >> findings for the "late-combine" changes in isolation, for nvptx target.
> >>
> >> First the unexpected thing:
> >
> > So much for "unexpected thing" -- next level of unexpected here...
> > Appreciated if anyone feels like helping me find my way through this,
> > but I totally understand if you've got other things to do.
> 
> OK, I found something already.  (Unexpectedly quickly...)  ;-)
> 
> >> there are a few cases where we now see unused registers get declared,
> >> for example (random) in 'nvptx-none/newlib/libc/libm_a-s_modf.o:modf'
> 
> I've now looked into the former one ('tmp-libm_a-s_modf.i.xz' is attached), to
> avoid...
> 
> > I first looked into a simpler case: newlib 'libc/locale/lnumeric.c'.
> 
> >     ../../../source-gcc/newlib/libc/locale/lnumeric.c:88:10: warning: ‘ret’ 
> > is used
> uninitialized [-Wuninitialized]
> >        88 |   return ret;
> >           |          ^~~
> >     ../../../source-gcc/newlib/libc/locale/lnumeric.c:48:7: note: ‘ret’ was
> declared here
> >        48 |   int ret;
> >           |       ^~~
> >
> > Uh.  Given nothing else is going on in that function, I suppose '%r22'
> > relates to the uninitialized 'ret' -- and given undefined behavior,
> > GCC of course is fine to emit an unused 'reg' in that case...
> 
> ... the undefined behavior here.
> 
> But in fact, for both cases, the unexpected difference goes away if after
> 'pass_late_combine' I inject a 'pass_fast_rtl_dce'.  That's normally run as 
> part of
> 'PUSH_INSERT_PASSES_WITHIN (pass_postreload)' -- but that's all not active for
> nvptx target given '!reload_completed', given nvptx is
> 'targetm.no_register_allocation'.  Maybe we need to enable a few more passes,
> or is there anything in 'pass_late_combine' to change, so that we don't run 
> into
> this?  Does it inadvertently mark registers live or something like that?
> 
> The following makes these two cases work, but evidently needs a lot more
> analysis: a lot of other passes are enabled that may be anything between
> beneficial and harmful for 'targetm.no_register_allocation'/nvptx.
> 
>     --- gcc/passes.cc
>     +++ gcc/passes.cc
>     @@ -676,17 +676,17 @@ const pass_data pass_data_postreload =
>      class pass_postreload : public rtl_opt_pass
>      {
>      public:
>        pass_postreload (gcc::context *ctxt)
>          : rtl_opt_pass (pass_data_postreload, ctxt)
>        {}
> 
>        /* opt_pass methods: */
>     -  bool gate (function *) final override { return reload_completed; }
>     +  bool gate (function *) final override { return reload_completed ||
> targetm.no_register_allocation; }
>     --- gcc/regcprop.cc
>     +++ gcc/regcprop.cc
>     @@ -1305,17 +1305,17 @@ class pass_cprop_hardreg : public rtl_opt_pass
>      public:
>        pass_cprop_hardreg (gcc::context *ctxt)
>          : rtl_opt_pass (pass_data_cprop_hardreg, ctxt)
>        {}
> 
>        /* opt_pass methods: */
>        bool gate (function *) final override
>          {
>     -      return (optimize > 0 && (flag_cprop_registers));
>     +      return (optimize > 0 && flag_cprop_registers &&
> !targetm.no_register_allocation);
>          }
> 
> 
> Grüße
>  Thomas
> 
> 
> > But: should we expect '-fno-late-combine-instructions' vs.
> > '-flate-combine-instructions' to behave in the same way?  (After all,
> > '%r22' remains unused also with '-flate-combine-instructions', and
> > doesn't need to be emitted.)  This could, of course, also be a nvptx
> > back end issue?
> >
> > I'm happy to supply any dump files etc.  Also, 'tmp-libc_a-lnumeric.i.xz'
> > is attached if you'd like to reproduce this with your own nvptx target
> > 'cc1':
> >
> >     $ [...]/configure --target=nvptx-none --enable-languages=c
> >     $ make -j12 all-gcc
> >     $ gcc/cc1 -fpreprocessed tmp-libc_a-lnumeric.i -quiet -dumpbase
> > tmp-libc_a-lnumeric.c -dumpbase-ext .c -misa=sm_30 -g -O2 -fno-builtin
> > -o tmp-libc_a-lnumeric.s -fdump-rtl-all #
> > -fno-late-combine-instructions
> >
> >
> > Grüße
> >  Thomas
> 


Reply via email to