Hi Segher,

Thanks a lot for your review!

Segher Boessenkool <seg...@kernel.crashing.org> writes:

> Hi!
>
> On Tue, Jun 11, 2024 at 04:37:25PM +0800, Jiufu Guo wrote:
>> Sometimes, a complicated constant is built via 3(or more)
>> instructions.  Generally speaking, it would not be as fast
>> as loading it from the constant pool (as the discussions in
>> PR63281):
>> "ld" is one instruction.  If consider "address/toc" adjust,
>> we may count it as 2 instructions. And "pld" may need fewer
>> cycles.
>
> Yeah, three insns to build up a constant always was about as fast/slow
Yeah.
> as loading a constant from memory.  When you need two related constants
> loading from memory is slower, so we preferred building them up.
You may talking about the functionality "const_anchor" which
is enabled in cse1 for two related constants.
Like the test case: const_anchors.c, the first one would be
loaded from memory if it is complicated, and the second would
be computed based on the first one. 

>
>> As testing(SPEC2017), it could get better/stable runtime
>> if set the threshold as "> 2" (compare with "> 3").
>
> Interesting!  About how much speedup did you see?  0.1%?
I did not see the proven speedup from SPEC2017.
On some benchs(including perlbench), I saw some runs with
speedup >1%.  But with deeper checking, it would not directly
benefit of this functionality, since it was not hit by the hot
functions; and the changed constants do not contribute the
runtime visibly.

Saying "better/stable", because with different bases(interval
weeks/months), ">2" introduce less waving; and with different
options (-O2, -O3, -Ofast), ">2" seems more stable.

>
>> As known, because the constant is load from memory by this
>> patch,  so this functionality may affect the cache missing.
>> Also there may be possible side effects on icach. While,
>> IMHO, this patch would be still do the right thing.
>
> Yeah, but almost every codegen patch has *some* icache effect.  Your
> benchmarks show a net positive effect, that is all that matters in the
> end.
Yes. :)
As using the small bench (like the cases mentioned in comment20
of PR63281), the result shows the prefer choise would be
buiding by "2 insns" > "loading from rodata/toc" > "3 insns",
and building by "4/5insn" would be the last choise.

I mentioned this here, because in one recent run, for one benchmark,
I saw 'caching missing' raising, and 'cache missing' was waving
with size of the code (not directly related to this patch.)

>
>> gcc/ChangeLog:
>> 
>>      * config/rs6000/rs6000.cc (rs6000_emit_set_const): Update to split
>>      complicate constant to memory.
>
> (complicated)
>
> And don't write "update", every patch is an update :-)
>
> There are more conditions here, mention those as well?

Thanks! This patch only split complicated constants(>2insns)
to memory for -m64 under TARGET_ELF.
"Split complicate constant to memory for -m64."

>
> But, why does base_reg_operand matter?  And, why for TARGET_ELF only?
> That is the only place you tested I'm sure, but  make an educated guess
> for the rest, why would it not be useful for other binary formats?
Thanks for these very insightful questions!

"base_reg_operand" is used to avoid to instruction like "r0=[OFF(r0)]"
for the case like "r0=0xXXXXX" (I would have a recheck to see if
this is really needed before RA.)

Yes, I only tested with ELF. I add ELF checking, just because not
sure how profitable it is on other targets.
In theory, this functionality would be useful for other targets,
at least for the 4/5 insns (witout the aspect of parallel
execution in the binary).

There is another heuristic code in rs6000_emit_move, it is using:
```(TARGET_CMODEL != CMODEL_SMALL ? 3 : 2)```
This does not check ELF. Levearging this code, it would be ok to
accept all targets.

This heuristic exist a long time ago, I guess it would be before
ppc64le (even more history), it may more for -m32. For -m64, as
I tested, ">2" would be suitable.

>
>> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
>> index e4dc629ddcc..f448df289a0 100644
>> --- a/gcc/config/rs6000/rs6000.cc
>> +++ b/gcc/config/rs6000/rs6000.cc
>> @@ -10240,6 +10240,21 @@ rs6000_emit_set_const (rtx dest, rtx source)
>>        c = sext_hwi (c, 32);
>>        emit_move_insn (lo, GEN_INT (c));
>>      }
>> +
>> +      else if (base_reg_operand (dest, mode) && TARGET_64BIT
>> +           && TARGET_ELF && num_insns_constant (source, mode) > 2)
>> +    {
>> +      rtx sym = force_const_mem (mode, source);
>> +      if (TARGET_TOC && SYMBOL_REF_P (XEXP (sym, 0))
>> +          && use_toc_relative_ref (XEXP (sym, 0), mode))
>> +        {
>> +          rtx toc = create_TOC_reference (XEXP (sym, 0), copy_rtx (dest));
>> +          sym = gen_const_mem (mode, toc);
>> +          set_mem_alias_set (sym, get_TOC_alias_set ());
>> +        }
>> +
>> +      emit_move_insn (dest, sym);
>> +    }
>>        else
>>      rs6000_emit_set_long_const (dest, c);
>>        break;
>
> Is there no utility function to put constants like this in memory?
> Like, "output_toc" for example?
I guess you mean "force_const_mem" which puts the constant value to
constant pool and create "LC" label for it.

With "emit_move_insn (dest, force_const_mem (mode, source));" could
implement the part of this functionality.   While adding code
"if (TARGET_TOC && SYMBOL_REF_P (XEXP (sym, 0))..." would save one
more instructions.  So, the code is drafted like this patch.

>
>> diff --git a/gcc/testsuite/gcc.target/powerpc/const_anchors.c 
>> b/gcc/testsuite/gcc.target/powerpc/const_anchors.c
>> index 542e2674b12..f33c9a83f5e 100644
>> --- a/gcc/testsuite/gcc.target/powerpc/const_anchors.c
>> +++ b/gcc/testsuite/gcc.target/powerpc/const_anchors.c
>> @@ -1,5 +1,5 @@
>>  /* { dg-do compile { target has_arch_ppc64 } } */
>> -/* { dg-options "-O2" } */
>> +/* { dg-options "-O2 -fdump-rtl-final" } */
>>  
>>  #define C1 0x2351847027482577ULL
>>  #define C2 0x2351847027482578ULL
>> @@ -17,4 +17,5 @@ void __attribute__ ((noinline)) foo1 (long long *a, long 
>> long b)
>>      *a++ = C2;
>>  }
>>  
>> -/* { dg-final { scan-assembler-times {\maddi\M} 2 } } */
>> +/* Checking "final" instead checking "asm" output to avoid noise.  */
>> +/* { dg-final { scan-rtl-dump-times {\madddi3\M} 2 "final" } } */
>
> So, in the RTL dump it finds a named pattern adddi3, while in the asm
> you find random other addi insns?  Hardly worth a comment in the
Yes, other "addi"s for memory address occur in asm.
Thanks for your kind review.  I would remove this no-go comment.

> testcase itself, but heh.  The point is that you check for the RTL
> pattern name instead of the machine insn.  The compiler pass you check
> in isn't changed even!
>
>> --- a/gcc/testsuite/gcc.target/powerpc/parall_5insn_const.c
>> +++ b/gcc/testsuite/gcc.target/powerpc/parall_5insn_const.c
>> @@ -6,11 +6,18 @@
>>  /* { dg-final { scan-assembler-times {\mori\M} 4 } } */
>>  /* { dg-final { scan-assembler-times {\mrldimi\M} 2 } } */
>>  
>> +/* The below macro helps to avoid loading constant from memory.  */
>> +#define CONST_AVOID_BASE_REG(DEST, CST)                                     
>>    \
>> +  {                                                                         
>>    \
>> +    register long long d asm ("r0") = CST;                                  
>>    \
>> +    asm volatile ("std %1, %0" : : "m"(DEST), "r"(d));                      
>>    \
>> +  }
>
> This needs an actual changelog, not just "update".  Something as simple
> as "New macro to force constant into a reg" is fine, but more than
> "update" :-)
Thanks a lot for point this out!
>
>> +/* The below macro helps to avoid loading constant from memory.  */
>> +#define CONST_AVOID_BASE_REG(DEST, CST)                                     
>>    \
>> +  {                                                                         
>>    \
>> +    register long long d asm ("r0") = CST;                                  
>>    \
>> +    asm volatile ("std %1, %0" : : "m"(DEST), "r"(d));                      
>>    \
>> +  }
>
> "Forces the constant into a register", yes.
OK, thanks.
>
>> --- a/gcc/testsuite/gcc.target/powerpc/pr106550_1.c
>> +++ b/gcc/testsuite/gcc.target/powerpc/pr106550_1.c
>> @@ -4,17 +4,19 @@
>>  /* { dg-options "-O2 -mdejagnu-cpu=power10 -fdisable-rtl-split1" } */
>>  /* force the constant splitter run after RA: -fdisable-rtl-split1.  */
>>  
>> +/* The below marco helps to avoid using paddi and avoid loading from 
>> memory. */
>
> (macro)
OK.
>
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/powerpc/pr63281.c
>> @@ -0,0 +1,11 @@
>> +/* Check loading constant from memory pool.  */
>> +/* { dg-options "-O2 -mpowerpc64" } */
>> +
>> +void
>> +foo (unsigned long long *a)
>> +{
>> +  *a++ = 0x2351847027482577ULL;
>> +}
>> +
>> +/* { dg-final { scan-assembler-times {\mp?ld\M} 1 { target { lp64 } } } } */
>
> Why target lp64 only?  You have -mpowerpc64 already, does that not make
> us use ld here always?
For -m32, we need other code to support it.  And the patch for
"-m32 -mpowerpc" is the second patch of this series.  Here 'lp64'
is safe guard for "-m64", in case there is concerns for "-m32" patch.

Thanks again for your review!!

BR,
Jeff(Jiufu) Guo.

>
>
> Segher

Reply via email to