Hi!
On 2024-06-27T23:20:18+0200, I wrote:
> On 2024-06-27T22:27:21+0200, I wrote:
>> On 2024-06-27T18:49:17+0200, I wrote:
>>> On 2023-10-24T19:49:10+0100, Richard Sandiford <[email protected]>
>>> wrote:
>>>> This patch adds a combine pass that runs late in the pipeline.
>>
>> [After sending, I realized I replied to a previous thread of this work.]
>>
>>> I've been looking a bit through recent nvptx target code generation
>>> changes for GCC target libraries, and thought I'd also share here my
>>> findings for the "late-combine" changes in isolation, for nvptx target.
>>>
>>> First the unexpected thing:
>>
>> So much for "unexpected thing" -- next level of unexpected here...
>> Appreciated if anyone feels like helping me find my way through this, but
>> I totally understand if you've got other things to do.
>
> OK, I found something already. (Unexpectedly quickly...) ;-)
>
>>> there are a few cases where we now see unused
>>> registers get declared
> But in fact, for both cases
Now tested: 's%both%all'. :-)
> the unexpected difference goes away if after
> 'pass_late_combine' I inject a 'pass_fast_rtl_dce'. That's normally run
> as part of 'PUSH_INSERT_PASSES_WITHIN (pass_postreload)' -- but that's
> all not active for nvptx target given '!reload_completed', given nvptx is
> 'targetm.no_register_allocation'. Maybe we need to enable a few more
> passes, or is there anything in 'pass_late_combine' to change, so that we
> don't run into this? Does it inadvertently mark registers live or
> something like that?
Basically, is 'pass_late_combine' potentially doing things that depend
on later clean-up? (..., or shouldn't it be doing these things in the
first place?)
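
To make the question concrete, here is a toy sketch (plain C++, not GCC code; 'ToyInsn', 'toy_dce', 'declared_regs' and the register names are all made up for illustration). It assumes, without my having confirmed this for 'pass_late_combine', that a definition is left behind whose result nobody reads anymore. In that model the register still gets declared until a DCE-like clean-up removes the dead definition:

```cpp
#include <set>
#include <string>
#include <vector>

// Toy instruction 'dest = op (srcs...)'.  This is NOT GCC RTL.
struct ToyInsn
{
  std::string dest;              // register written; empty if none
  std::vector<std::string> srcs; // registers read
  bool side_effects;             // e.g. a store or call; never deleted
};

// Delete insns whose result is never read; repeat until nothing changes,
// because deleting one insn may make its inputs dead, too.
std::vector<ToyInsn>
toy_dce (std::vector<ToyInsn> insns)
{
  bool changed = true;
  while (changed)
    {
      changed = false;
      std::set<std::string> used;
      for (const ToyInsn &i : insns)
        used.insert (i.srcs.begin (), i.srcs.end ());

      std::vector<ToyInsn> kept;
      for (const ToyInsn &i : insns)
        if (i.side_effects || used.count (i.dest))
          kept.push_back (i);
        else
          changed = true;
      insns = kept;
    }
  return insns;
}

// Registers a (hypothetical) back end would declare: all that are mentioned.
std::set<std::string>
declared_regs (const std::vector<ToyInsn> &insns)
{
  std::set<std::string> regs;
  for (const ToyInsn &i : insns)
    {
      if (!i.dest.empty ())
        regs.insert (i.dest);
      regs.insert (i.srcs.begin (), i.srcs.end ());
    }
  return regs;
}
```

With a dead definition of a made-up '%r100' in the input, 'declared_regs' reports '%r100' before 'toy_dce' and no longer after it. Whether that is what actually happens in the nvptx case is exactly the open question.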
> The following makes these two cases work, but evidently needs a lot more
> analysis: a lot of other passes are enabled that may be anything between
> beneficial and harmful for 'targetm.no_register_allocation'/nvptx.
>
> --- gcc/passes.cc
> +++ gcc/passes.cc
> @@ -676,17 +676,17 @@ const pass_data pass_data_postreload =
> class pass_postreload : public rtl_opt_pass
> {
> public:
> pass_postreload (gcc::context *ctxt)
> : rtl_opt_pass (pass_data_postreload, ctxt)
> {}
>
> /* opt_pass methods: */
> - bool gate (function *) final override { return reload_completed; }
> + bool gate (function *) final override { return reload_completed || targetm.no_register_allocation; }
> --- gcc/regcprop.cc
> +++ gcc/regcprop.cc
> @@ -1305,17 +1305,17 @@ class pass_cprop_hardreg : public rtl_opt_pass
> public:
> pass_cprop_hardreg (gcc::context *ctxt)
> : rtl_opt_pass (pass_data_cprop_hardreg, ctxt)
> {}
>
> /* opt_pass methods: */
> bool gate (function *) final override
> {
> - return (optimize > 0 && (flag_cprop_registers));
> + return (optimize > 0 && flag_cprop_registers && !targetm.no_register_allocation);
> }
Also, that quickly ICEs; at the least, more '[...] && !targetm.no_register_allocation'
gates are needed elsewhere.
The following simpler thing, however, does work; move 'pass_fast_rtl_dce'
out of 'pass_postreload':
--- gcc/passes.cc
+++ gcc/passes.cc
@@ -677,14 +677,15 @@ class pass_postreload : public rtl_opt_pass
{
public:
pass_postreload (gcc::context *ctxt)
: rtl_opt_pass (pass_data_postreload, ctxt)
{}
/* opt_pass methods: */
+ opt_pass * clone () final override { return new pass_postreload (m_ctxt); }
bool gate (function *) final override { return reload_completed; }
}; // class pass_postreload
--- gcc/passes.def
+++ gcc/passes.def
@@ -529,7 +529,10 @@ along with GCC; see the file COPYING3. If not see
NEXT_PASS (pass_regrename);
NEXT_PASS (pass_fold_mem_offsets);
NEXT_PASS (pass_cprop_hardreg);
- NEXT_PASS (pass_fast_rtl_dce);
+ POP_INSERT_PASSES ()
+ NEXT_PASS (pass_fast_rtl_dce);
+ NEXT_PASS (pass_postreload);
+ PUSH_INSERT_PASSES_WITHIN (pass_postreload)
NEXT_PASS (pass_reorder_blocks);
NEXT_PASS (pass_leaf_regs);
NEXT_PASS (pass_split_before_sched2);
This (only) cleans up "the mess that 'pass_late_combine' created"; it
causes no further code generation changes in GCC target libraries for
nvptx. (For avoidance of doubt: "mess" is a great exaggeration here.)
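
Why moving the pass out of the group matters can be shown with a toy model (plain C++17, not GCC's pass manager; 'ToyPass', 'run_passes' and 'dce_runs' are made-up names, and the pass list is heavily abbreviated). The only assumption is the one stated above: sub-passes of a gated pass run only if the gate returns true, and 'reload_completed' stays false for a 'targetm.no_register_allocation' target. The two "postreload" entries in the changed layout stand in for the two instances that the 'clone ()' method makes possible:

```cpp
#include <functional>
#include <string>
#include <vector>

// Toy model only; these are NOT GCC's pass-manager types.
struct ToyPass
{
  std::string name;
  std::function<bool ()> gate;     // false: skip this pass and its sub-passes
  std::vector<ToyPass> sub_passes; // run only if gate () returned true
};

// Append the names of all passes that would execute, in order.
void
run_passes (const std::vector<ToyPass> &passes, std::vector<std::string> &ran)
{
  for (const ToyPass &p : passes)
    {
      if (!p.gate ())
        continue;
      ran.push_back (p.name);
      run_passes (p.sub_passes, ran);
    }
}

// Does 'fast_rtl_dce' run?  'dce_hoisted' selects the changed layout.
bool
dce_runs (bool reload_completed, bool dce_hoisted)
{
  auto always = [] { return true; };
  auto after_reload = [reload_completed] { return reload_completed; };

  ToyPass dce {"fast_rtl_dce", always, {}};
  ToyPass cprop {"cprop_hardreg", always, {}};
  ToyPass reorder {"reorder_blocks", always, {}};

  std::vector<ToyPass> pipeline;
  pipeline.push_back ({"late_combine", always, {}});
  if (dce_hoisted)
    {
      // Changed layout: close the group, run DCE ungated, reopen the group.
      pipeline.push_back ({"postreload", after_reload, {cprop}});
      pipeline.push_back (dce);
      pipeline.push_back ({"postreload", after_reload, {reorder}});
    }
  else
    {
      // Original layout: DCE is nested inside the gated group.
      pipeline.push_back ({"postreload", after_reload, {cprop, dce, reorder}});
    }

  std::vector<std::string> ran;
  run_passes (pipeline, ran);
  for (const std::string &name : ran)
    if (name == "fast_rtl_dce")
      return true;
  return false;
}
```

In this model, 'dce_runs (false, false)' is false (the nvptx-like situation before the change), whereas 'dce_runs (false, true)' is true; with 'reload_completed' set, DCE runs in either layout.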
Grüße
Thomas
>> But: should we expect '-fno-late-combine-instructions' vs.
>> '-flate-combine-instructions' to behave in the same way? (After all,
>> '%r22' remains unused also with '-flate-combine-instructions', and
>> doesn't need to be emitted.) This could, of course, also be a nvptx back
>> end issue?
>>
>> I'm happy to supply any dump files etc. Also, 'tmp-libc_a-lnumeric.i.xz'
>> is attached if you'd like to reproduce this with your own nvptx target
>> 'cc1':
>>
>> $ [...]/configure --target=nvptx-none --enable-languages=c
>> $ make -j12 all-gcc
>> $ gcc/cc1 -fpreprocessed tmp-libc_a-lnumeric.i -quiet -dumpbase
>> tmp-libc_a-lnumeric.c -dumpbase-ext .c -misa=sm_30 -g -O2 -fno-builtin -o
>> tmp-libc_a-lnumeric.s -fdump-rtl-all # -fno-late-combine-instructions
>>
>>
>> Grüße
>> Thomas