Jiufu Guo <guoji...@linux.ibm.com> writes:

> Hi,
>
> Richard Biener <richard.guent...@gmail.com> writes:
>
>> On Mon, Aug 15, 2022 at 7:26 AM Jiufu Guo via Gcc-patches
>> <gcc-patches@gcc.gnu.org> wrote:
>>>
>>> Hi,
>>>
>>> This patch tries to put the constant into the constant pool if building
>>> the constant requires 3 or more instructions.
>>>
>>> But there is a concern: I'm wondering if this patch is really profitable.
>>>
>>> Because, as I tested, 1. for simple cases, if instructions are not run
>>> in parallel, loading the constant from memory may be faster; but 2. if
>>> some instructions can run in parallel, loading the constant from memory
>>> is not a win compared with building the constant.  See the examples below.
>>>
>>> For f1.c and f3.c, 'loading' the constant would be acceptable at runtime;
>>> for f2.c and f4.c, 'loading' the constant is visibly slower.
>>>
>>> For real-world cases, both kinds of code sequences exist.
>>>
>>> So, I'm not sure if we need to push this patch.
>>>
>>> Run the functions below many times (1000000000) to check runtime.
>>> f1.c:
>>> long foo (long *arg, long*, long *)
>>> {
>>>   *arg = 0x1234567800000000;
>>> }
>>> asm building constant:
>>>     lis 10,0x1234
>>>     ori 10,10,0x5678
>>>     sldi 10,10,32
>>> vs. asm loading
>>>     addis 10,2,.LC0@toc@ha
>>>     ld 10,.LC0@toc@l(10)
>>> The runtimes of 'building' and 'loading' are similar: sometimes
>>> 'building' is faster; sometimes 'loading' is faster.  And the difference
>>> is slight.
>>
>> I wonder if it is possible to decide this during scheduling - choose the
>> variant that, when the result is needed, is cheaper?  Post-RA might
>> be a bit difficult (I see the load from memory needs the TOC, but then
>> when the TOC is not available we could just always emit the build form),
>> and pre-reload precision might not be good enough to make this worth
>> the experiment?
> Thanks a lot for your comments!
>
> Yes, Post-RA may not handle all cases.
> If there is no TOC available, we are not able to load the const through
> the TOC.  As Segher pointed out, crtl->uses_const_pool may be usable as
> an approximation.
> The sched2 pass could optimize some cases (e.g. for f2.c and f4.c), but
> for some cases, it may not distribute those 'building' instructions.
>
> So, maybe we could add a peephole after sched2.  If the five instructions
> that build the constant are still consecutive, then replace them with a
> 'load' (need to check that the TOC is available).
> But I'm not sure if it is worthwhile.

Oh, after checking the object files (from GCC bootstrap and SPEC), it is
rare that the five instructions are consecutive.  Often 1 (or 2) insns are
scheduled apart, and the other 4 (or 3) instructions are consecutive.
So, using a peephole may not be very helpful.

BR,
Jeff(Jiufu)

>
>>
>> Of course the scheduler might lack on the technical side as well.
>
>
> BR,
> Jeff(Jiufu)
>
>>
>>>
>>> f2.c
>>> long foo (long *arg, long *arg2, long *arg3)
>>> {
>>>   *arg = 0x1234567800000000;
>>>   *arg2 = 0x7965234700000000;
>>>   *arg3 = 0x4689123700000000;
>>> }
>>> asm building constant:
>>>     lis 7,0x1234
>>>     lis 10,0x7965
>>>     lis 9,0x4689
>>>     ori 7,7,0x5678
>>>     ori 10,10,0x2347
>>>     ori 9,9,0x1237
>>>     sldi 7,7,32
>>>     sldi 10,10,32
>>>     sldi 9,9,32
>>> vs. loading
>>>     addis 7,2,.LC0@toc@ha
>>>     addis 10,2,.LC1@toc@ha
>>>     addis 9,2,.LC2@toc@ha
>>>     ld 7,.LC0@toc@l(7)
>>>     ld 10,.LC1@toc@l(10)
>>>     ld 9,.LC2@toc@l(9)
>>> For this case, 'loading' is always slower than 'building' (>15%).
>>>
>>> f3.c
>>> long foo (long *arg, long *, long *)
>>> {
>>>   *arg = 384307168202282325;
>>> }
>>>     lis 10,0x555
>>>     ori 10,10,0x5555
>>>     sldi 10,10,32
>>>     oris 10,10,0x5555
>>>     ori 10,10,0x5555
>>> For this case, 'building' (through 5 instructions) is slower, and
>>> 'loading' is faster by ~5%.
>>>
>>> f4.c
>>> long foo (long *arg, long *arg2, long *arg3)
>>> {
>>>   *arg = 384307168202282325;
>>>   *arg2 = -6148914691236517205;
>>>   *arg3 = 768614336404564651;
>>> }
>>>     lis 7,0x555
>>>     lis 10,0xaaaa
>>>     lis 9,0xaaa
>>>     ori 7,7,0x5555
>>>     ori 10,10,0xaaaa
>>>     ori 9,9,0xaaaa
>>>     sldi 7,7,32
>>>     sldi 10,10,32
>>>     sldi 9,9,32
>>>     oris 7,7,0x5555
>>>     oris 10,10,0xaaaa
>>>     oris 9,9,0xaaaa
>>>     ori 7,7,0x5555
>>>     ori 10,10,0xaaab
>>>     ori 9,9,0xaaab
>>> For this case, since the 'building' instructions run in parallel,
>>> 'loading' is slower: ~8%.  On p10, 'loading' (through 'pld') is also
>>> slower, by >4%.
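
As a cross-check on the f1.c to f4.c sequences quoted above, here is a
minimal C sketch that models the lis/ori/sldi/oris steps as plain 64-bit
arithmetic and confirms that each quoted sequence produces the constant
stored in the corresponding test.  The helper names (lis, ori, oris, sldi,
build3, build5, check_thread_constants) are made up for illustration and
are not part of GCC; the sketch assumes the usual 64-bit behaviour in which
lis sign-extends its shifted immediate.  It says nothing about timing.

```c
#include <assert.h>
#include <stdint.h>

/* lis rD,imm: place imm in bits 16..31 and sign-extend to 64 bits
   (assumed 64-bit mode semantics).  */
static uint64_t lis(uint16_t imm)
{
    return (uint64_t)(int64_t)(int32_t)((uint32_t)imm << 16);
}

/* ori rD,rS,imm: OR the immediate into the low 16 bits.  */
static uint64_t ori(uint64_t r, uint16_t imm) { return r | imm; }

/* oris rD,rS,imm: OR the immediate into bits 16..31.  */
static uint64_t oris(uint64_t r, uint16_t imm) { return r | ((uint64_t)imm << 16); }

/* sldi rD,rS,n: shift left by n bits, discarding the high bits.  */
static uint64_t sldi(uint64_t r, unsigned n) { return r << n; }

/* Three-instruction form used in f1.c and f2.c: lis; ori; sldi 32.  */
static uint64_t build3(uint16_t hh, uint16_t hl)
{
    return sldi(ori(lis(hh), hl), 32);
}

/* Five-instruction form used in f3.c and f4.c: lis; ori; sldi 32; oris; ori.  */
static uint64_t build5(uint16_t hh, uint16_t hl, uint16_t lh, uint16_t ll)
{
    return ori(oris(build3(hh, hl), lh), ll);
}

/* Compare every sequence from the thread with its C-level constant.  */
static void check_thread_constants(void)
{
    /* f1.c and f2.c */
    assert(build3(0x1234, 0x5678) == 0x1234567800000000ULL);
    assert(build3(0x7965, 0x2347) == 0x7965234700000000ULL);
    assert(build3(0x4689, 0x1237) == 0x4689123700000000ULL);
    /* f3.c and f4.c */
    assert(build5(0x0555, 0x5555, 0x5555, 0x5555) == (uint64_t)384307168202282325LL);
    assert(build5(0xaaaa, 0xaaaa, 0xaaaa, 0xaaab) == (uint64_t)-6148914691236517205LL);
    assert(build5(0x0aaa, 0xaaaa, 0xaaaa, 0xaaab) == (uint64_t)768614336404564651LL);
}
```

Calling check_thread_constants() from a main function aborts on any
mismatch.  Note that every step of build5 consumes the result of the
previous one, so a single constant is a five-deep dependency chain, while
the three chains in f4.c are independent of each other; that is consistent
with the observation above that 'building' does better when several
constants can be built in parallel.
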
>>>
>>>
>>> BR,
>>> Jeff(Jiufu)
>>>
>>> ---
>>>  gcc/config/rs6000/rs6000.cc                | 14 ++++++++++++++
>>>  gcc/testsuite/gcc.target/powerpc/pr63281.c | 11 +++++++++++
>>>  2 files changed, 25 insertions(+)
>>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr63281.c
>>>
>>> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
>>> index 4b727d2a500..3798e11bdbc 100644
>>> --- a/gcc/config/rs6000/rs6000.cc
>>> +++ b/gcc/config/rs6000/rs6000.cc
>>> @@ -10098,6 +10098,20 @@ rs6000_emit_set_const (rtx dest, rtx source)
>>>            c = ((c & 0xffffffff) ^ 0x80000000) - 0x80000000;
>>>            emit_move_insn (lo, GEN_INT (c));
>>>          }
>>> +      else if (base_reg_operand (dest, mode)
>>> +               && num_insns_constant (source, mode) > 2)
>>> +        {
>>> +          rtx sym = force_const_mem (mode, source);
>>> +          if (TARGET_TOC && SYMBOL_REF_P (XEXP (sym, 0))
>>> +              && use_toc_relative_ref (XEXP (sym, 0), mode))
>>> +            {
>>> +              rtx toc = create_TOC_reference (XEXP (sym, 0), copy_rtx (dest));
>>> +              sym = gen_const_mem (mode, toc);
>>> +              set_mem_alias_set (sym, get_TOC_alias_set ());
>>> +            }
>>> +
>>> +          emit_insn (gen_rtx_SET (dest, sym));
>>> +        }
>>>        else
>>>          rs6000_emit_set_long_const (dest, c);
>>>        break;
>>> diff --git a/gcc/testsuite/gcc.target/powerpc/pr63281.c b/gcc/testsuite/gcc.target/powerpc/pr63281.c
>>> new file mode 100644
>>> index 00000000000..469a8f64400
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/powerpc/pr63281.c
>>> @@ -0,0 +1,11 @@
>>> +/* PR target/63281 */
>>> +/* { dg-do compile { target lp64 } } */
>>> +/* { dg-options "-O2 -std=c99" } */
>>> +
>>> +void
>>> +foo (unsigned long long *a)
>>> +{
>>> +  *a = 0x020805006106003;
>>> +}
>>> +
>>> +/* { dg-final { scan-assembler-times {\mp?ld\M} 1 } } */
>>> --
>>> 2.17.1
>>>