Hi Segher,
Thanks a lot for your review! Segher Boessenkool <seg...@kernel.crashing.org> writes: > Hi! > > On Tue, Jun 11, 2024 at 04:37:25PM +0800, Jiufu Guo wrote: >> Sometimes, a complicated constant is built via 3(or more) >> instructions. Generally speaking, it would not be as fast >> as loading it from the constant pool (as the discussions in >> PR63281): >> "ld" is one instruction. If consider "address/toc" adjust, >> we may count it as 2 instructions. And "pld" may need fewer >> cycles. > > Yeah, three insns to build up a constant always was about as fast/slow Yeah. > as loading a constant from memory. When you need two related constants > loading from memory is slower, so we preferred building them up. You may talking about the functionality "const_anchor" which is enabled in cse1 for two related constants. Like the test case: const_anchors.c, the first one would be loaded from memory if it is complicated, and the second would be computed based on the first one. > >> As testing(SPEC2017), it could get better/stable runtime >> if set the threshold as "> 2" (compare with "> 3"). > > Interesting! About how much speedup did you see? 0.1%? I did not see the proven speedup from SPEC2017. On some benchs(including perlbench), I saw some runs with speedup >1%. But with deeper checking, it would not directly benefit of this functionality, since it was not hit by the hot functions; and the changed constants do not contribute the runtime visibly. Saying "better/stable", because with different bases(interval weeks/months), ">2" introduce less waving; and with different options (-O2, -O3, -Ofast), ">2" seems more stable. > >> As known, because the constant is load from memory by this >> patch, so this functionality may affect the cache missing. >> Also there may be possible side effects on icach. While, >> IMHO, this patch would be still do the right thing. > > Yeah, but almost every codegen patch has *some* icache effect. Your > benchmarks show a net positive effect, that is all that matters in the > end. Yes. :) As using the small bench (like the cases mentioned in comment20 of PR63281), the result shows the prefer choise would be buiding by "2 insns" > "loading from rodata/toc" > "3 insns", and building by "4/5insn" would be the last choise. I mentioned this here, because in one recent run, for one benchmark, I saw 'caching missing' raising, and 'cache missing' was waving with size of the code (not directly related to this patch.) > >> gcc/ChangeLog: >> >> * config/rs6000/rs6000.cc (rs6000_emit_set_const): Update to split >> complicate constant to memory. > > (complicated) > > And don't write "update", every patch is an update :-) > > There are more conditions here, mention those as well? Thanks! This patch only split complicated constants(>2insns) to memory for -m64 under TARGET_ELF. "Split complicate constant to memory for -m64." > > But, why does base_reg_operand matter? And, why for TARGET_ELF only? > That is the only place you tested I'm sure, but make an educated guess > for the rest, why would it not be useful for other binary formats? Thanks for these very insightful questions! "base_reg_operand" is used to avoid to instruction like "r0=[OFF(r0)]" for the case like "r0=0xXXXXX" (I would have a recheck to see if this is really needed before RA.) Yes, I only tested with ELF. I add ELF checking, just because not sure how profitable it is on other targets. In theory, this functionality would be useful for other targets, at least for the 4/5 insns (witout the aspect of parallel execution in the binary). There is another heuristic code in rs6000_emit_move, it is using: ```(TARGET_CMODEL != CMODEL_SMALL ? 3 : 2)``` This does not check ELF. Levearging this code, it would be ok to accept all targets. This heuristic exist a long time ago, I guess it would be before ppc64le (even more history), it may more for -m32. For -m64, as I tested, ">2" would be suitable. > >> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc >> index e4dc629ddcc..f448df289a0 100644 >> --- a/gcc/config/rs6000/rs6000.cc >> +++ b/gcc/config/rs6000/rs6000.cc >> @@ -10240,6 +10240,21 @@ rs6000_emit_set_const (rtx dest, rtx source) >> c = sext_hwi (c, 32); >> emit_move_insn (lo, GEN_INT (c)); >> } >> + >> + else if (base_reg_operand (dest, mode) && TARGET_64BIT >> + && TARGET_ELF && num_insns_constant (source, mode) > 2) >> + { >> + rtx sym = force_const_mem (mode, source); >> + if (TARGET_TOC && SYMBOL_REF_P (XEXP (sym, 0)) >> + && use_toc_relative_ref (XEXP (sym, 0), mode)) >> + { >> + rtx toc = create_TOC_reference (XEXP (sym, 0), copy_rtx (dest)); >> + sym = gen_const_mem (mode, toc); >> + set_mem_alias_set (sym, get_TOC_alias_set ()); >> + } >> + >> + emit_move_insn (dest, sym); >> + } >> else >> rs6000_emit_set_long_const (dest, c); >> break; > > Is there no utility function to put constants like this in memory? > Like, "output_toc" for example? I guess you mean "force_const_mem" which puts the constant value to constant pool and create "LC" label for it. With "emit_move_insn (dest, force_const_mem (mode, source));" could implement the part of this functionality. While adding code "if (TARGET_TOC && SYMBOL_REF_P (XEXP (sym, 0))..." would save one more instructions. So, the code is drafted like this patch. > >> diff --git a/gcc/testsuite/gcc.target/powerpc/const_anchors.c >> b/gcc/testsuite/gcc.target/powerpc/const_anchors.c >> index 542e2674b12..f33c9a83f5e 100644 >> --- a/gcc/testsuite/gcc.target/powerpc/const_anchors.c >> +++ b/gcc/testsuite/gcc.target/powerpc/const_anchors.c >> @@ -1,5 +1,5 @@ >> /* { dg-do compile { target has_arch_ppc64 } } */ >> -/* { dg-options "-O2" } */ >> +/* { dg-options "-O2 -fdump-rtl-final" } */ >> >> #define C1 0x2351847027482577ULL >> #define C2 0x2351847027482578ULL >> @@ -17,4 +17,5 @@ void __attribute__ ((noinline)) foo1 (long long *a, long >> long b) >> *a++ = C2; >> } >> >> -/* { dg-final { scan-assembler-times {\maddi\M} 2 } } */ >> +/* Checking "final" instead checking "asm" output to avoid noise. */ >> +/* { dg-final { scan-rtl-dump-times {\madddi3\M} 2 "final" } } */ > > So, in the RTL dump it finds a named pattern adddi3, while in the asm > you find random other addi insns? Hardly worth a comment in the Yes, other "addi"s for memory address occur in asm. Thanks for your kind review. I would remove this no-go comment. > testcase itself, but heh. The point is that you check for the RTL > pattern name instead of the machine insn. The compiler pass you check > in isn't changed even! > >> --- a/gcc/testsuite/gcc.target/powerpc/parall_5insn_const.c >> +++ b/gcc/testsuite/gcc.target/powerpc/parall_5insn_const.c >> @@ -6,11 +6,18 @@ >> /* { dg-final { scan-assembler-times {\mori\M} 4 } } */ >> /* { dg-final { scan-assembler-times {\mrldimi\M} 2 } } */ >> >> +/* The below macro helps to avoid loading constant from memory. */ >> +#define CONST_AVOID_BASE_REG(DEST, CST) >> \ >> + { >> \ >> + register long long d asm ("r0") = CST; >> \ >> + asm volatile ("std %1, %0" : : "m"(DEST), "r"(d)); >> \ >> + } > > This needs an actual changelog, not just "update". Something as simple > as "New macro to force constant into a reg" is fine, but more than > "update" :-) Thanks a lot for point this out! > >> +/* The below macro helps to avoid loading constant from memory. */ >> +#define CONST_AVOID_BASE_REG(DEST, CST) >> \ >> + { >> \ >> + register long long d asm ("r0") = CST; >> \ >> + asm volatile ("std %1, %0" : : "m"(DEST), "r"(d)); >> \ >> + } > > "Forces the constant into a register", yes. OK, thanks. > >> --- a/gcc/testsuite/gcc.target/powerpc/pr106550_1.c >> +++ b/gcc/testsuite/gcc.target/powerpc/pr106550_1.c >> @@ -4,17 +4,19 @@ >> /* { dg-options "-O2 -mdejagnu-cpu=power10 -fdisable-rtl-split1" } */ >> /* force the constant splitter run after RA: -fdisable-rtl-split1. */ >> >> +/* The below marco helps to avoid using paddi and avoid loading from >> memory. */ > > (macro) OK. > >> --- /dev/null >> +++ b/gcc/testsuite/gcc.target/powerpc/pr63281.c >> @@ -0,0 +1,11 @@ >> +/* Check loading constant from memory pool. */ >> +/* { dg-options "-O2 -mpowerpc64" } */ >> + >> +void >> +foo (unsigned long long *a) >> +{ >> + *a++ = 0x2351847027482577ULL; >> +} >> + >> +/* { dg-final { scan-assembler-times {\mp?ld\M} 1 { target { lp64 } } } } */ > > Why target lp64 only? You have -mpowerpc64 already, does that not make > us use ld here always? For -m32, we need other code to support it. And the patch for "-m32 -mpowerpc" is the second patch of this series. Here 'lp64' is safe guard for "-m64", in case there is concerns for "-m32" patch. Thanks again for your review!! BR, Jeff(Jiufu) Guo. > > > Segher