Hi Richard, I have incorporated your comments.
> Yes, call dump_mem_ref then, instead of repeating parts of its body. Reference object is not yet created at the place we check for invariance. It is still a tree expression. I created a common function and used at all places to dump the "step", "base" and "delta" values of memory reference being analyzed. Please find the modified patch attached. GCC regression "make check -k" passes with x86_64-unknown-linux-gnu. Regards, Venkat. -----Original Message----- From: Richard Guenther [mailto:richard.guent...@gmail.com] Sent: Thursday, October 04, 2012 6:26 PM To: Kumar, Venkataramanan Cc: Richard Guenther; gcc-patches@gcc.gnu.org Subject: Re: [Patch] Fix PR53397 On Tue, Oct 2, 2012 at 6:40 PM, Kumar, Venkataramanan <venkataramanan.ku...@amd.com> wrote: > Hi Richi, > > (Snip) >> + (!cst_and_fits_in_hwi (step)) >> + { >> + if( loop->inner != NULL) >> + { >> + if (dump_file && (dump_flags & TDF_DETAILS)) >> + { >> + fprintf (dump_file, "Reference %p:\n", (void *) ref); >> + fprintf (dump_file, "(base " ); >> + print_generic_expr (dump_file, base, TDF_SLIM); >> + fprintf (dump_file, ", step "); >> + print_generic_expr (dump_file, step, TDF_TREE); >> + fprintf (dump_file, ")\n"); > > No need to repeat this - all references are dumped when we gather them. > (Snip) > > The dumping happens at "record_ref" which is called after these statements to > record these references. > > When the step is invariant we return from the function without recording the > references. > > so I thought of dumping the references here. > > Is there a cleaner way to dump the references at one place? Yes, call dump_mem_ref then, instead of repeating parts of its body. Richard. > Regards, > Venkat. > > > > -----Original Message----- > From: Richard Guenther [mailto:rguent...@suse.de] > Sent: Tuesday, October 02, 2012 5:42 PM > To: Kumar, Venkataramanan > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [Patch] Fix PR53397 > > On Mon, 1 Oct 2012, venkataramanan.ku...@amd.com wrote: > >> Hi, >> >> The below patch fixes the FFT/Scimark regression caused by useless >> prefetch generation. >> >> This fix tries to make prefetch less aggressive by prefetching arrays >> in the inner loop, when the step is invariant in the entire loop nest. >> >> GCC currently tries to prefetch invariant steps when they are in the >> inner loop. But does not check if the step is variant in outer loops. >> >> In the scimark FFT case, the trip count of the inner loop varies by a >> non constant step, which is invariant in the inner loop. >> But the step variable is varying in outer loop. This makes inner loop >> trip count small (at run time varies sometimes as small as 1 >> iteration) >> >> Prefetching ahead x iteration when the inner loop trip count is >> smaller than x leads to useless prefetches. >> >> Flag used: -O3 -march=amdfam10 >> >> Before >> ** ** >> ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** >> ** for details. (Results can be submitted to p...@nist.gov) ** >> ** ** >> Using 2.00 seconds min time per kenel. >> Composite Score: 550.50 >> FFT Mflops: 38.66 (N=1024) >> SOR Mflops: 617.61 (100 x 100) >> MonteCarlo: Mflops: 173.74 >> Sparse matmult Mflops: 675.63 (N=1000, nz=5000) >> LU Mflops: 1246.88 (M=100, N=100) >> >> >> After >> ** ** >> ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** >> ** for details. (Results can be submitted to p...@nist.gov) ** >> ** ** >> Using 2.00 seconds min time per kenel. >> Composite Score: 639.20 >> FFT Mflops: 479.19 (N=1024) >> SOR Mflops: 617.61 (100 x 100) >> MonteCarlo: Mflops: 173.18 >> Sparse matmult Mflops: 679.13 (N=1000, nz=5000) >> LU Mflops: 1246.88 (M=100, N=100) >> >> GCC regression "make check -k" passes with x86_64-unknown-linux-gnu >> New tests that PASS: >> >> gcc.dg/pr53397-1.c scan-assembler prefetcht0 gcc.dg/pr53397-1.c >> scan-tree-dump aprefetch "Issued prefetch" >> gcc.dg/pr53397-1.c (test for excess errors) gcc.dg/pr53397-2.c >> scan-tree-dump aprefetch "loop variant step" >> gcc.dg/pr53397-2.c scan-tree-dump aprefetch "Not prefetching" >> gcc.dg/pr53397-2.c (test for excess errors) >> >> >> Checked CPU2006 and polyhedron on latest AMD processor, no regressions noted. >> >> Ok to commit in trunk? >> >> regards, >> Venkat >> >> gcc/ChangeLog >> +2012-10-01 Venkataramanan Kumar <venkataramanan.ku...@amd.com> >> + >> + * tree-ssa-loop-prefetch.c (gather_memory_references_ref):$ >> + Perform non constant step prefetching in inner loop, only $ >> + when it is invariant in the entire loop nest. $ >> + * testsuite/gcc.dg/pr53397-1.c: New test case $ >> + Checks we are prefecthing for loop invariant steps$ >> + * testsuite/gcc.dg/pr53397-2.c: New test case$ >> + Checks we are not prefecthing for loop variant steps >> + >> >> >> Index: gcc/testsuite/gcc.dg/pr53397-1.c >> =================================================================== >> --- gcc/testsuite/gcc.dg/pr53397-1.c (revision 0) >> +++ gcc/testsuite/gcc.dg/pr53397-1.c (revision 0) >> @@ -0,0 +1,28 @@ >> +/* Prefetching when the step is loop invariant. */ >> + >> +/* { dg-do compile } */ >> +/* { dg-options "-O3 -fprefetch-loop-arrays >> +-fdump-tree-aprefetch-details --param min-insn-to-prefetch-ratio=3 >> +--param simultaneous-prefetches=10 -fdump-tree-aprefetch-details" } >> +*/ >> + >> + >> +double data[16384]; >> +void prefetch_when_non_constant_step_is_invariant(int step, int n) { >> + int a; >> + int b; >> + for (a = 1; a < step; a++) { >> + for (b = 0; b < n; b += 2 * step) { >> + >> + int i = 2*(b + a); >> + int j = 2*(b + a + step); >> + >> + >> + data[j] = data[i]; >> + data[j+1] = data[i+1]; >> + } >> + } >> +} >> + >> +/* { dg-final { scan-tree-dump "Issued prefetch" "aprefetch" } } */ >> +/* { dg-final { scan-assembler "prefetcht0" } } */ > > This (and the case below) needs to be adjusted to only run on the appropriate > hardware. See for example gcc.dg/tree-ssa/prefetch-8.c for how to do this. > >> +/* { dg-final { cleanup-tree-dump "aprefetch" } } */ >> Index: gcc/testsuite/gcc.dg/pr53397-2.c >> =================================================================== >> --- gcc/testsuite/gcc.dg/pr53397-2.c (revision 0) >> +++ gcc/testsuite/gcc.dg/pr53397-2.c (revision 0) >> @@ -0,0 +1,29 @@ >> +/* Not prefetching when the step is loop variant. */ >> + >> +/* { dg-do compile } */ >> +/* { dg-options "-O3 -fprefetch-loop-arrays >> +-fdump-tree-aprefetch-details --param min-insn-to-prefetch-ratio=3 >> +--param simultaneous-prefetches=10 -fdump-tree-aprefetch-details" } >> +*/ >> + >> + >> +double data[16384]; >> +void donot_prefetch_when_non_constant_step_is_variant(int step, int >> +n) { >> + int a; >> + int b; >> + for (a = 1; a < step; a++,step*=2) { >> + for (b = 0; b < n; b += 2 * step) { >> + >> + int i = 2*(b + a); >> + int j = 2*(b + a + step); >> + >> + >> + data[j] = data[i]; >> + data[j+1] = data[i+1]; >> + } >> + } >> +} >> + >> +/* { dg-final { scan-tree-dump "Not prefetching" "aprefetch" } } */ >> +/* { dg-final { scan-tree-dump "loop variant step" "aprefetch" } } >> +*/ >> + >> +/* { dg-final { cleanup-tree-dump "aprefetch" } } */ >> + >> Index: gcc/tree-ssa-loop-prefetch.c >> =================================================================== >> --- gcc/tree-ssa-loop-prefetch.c (revision 191642) >> +++ gcc/tree-ssa-loop-prefetch.c (working copy) >> @@ -523,6 +523,7 @@ >> tree base, step; >> HOST_WIDE_INT delta; >> struct mem_ref_group *agrp; >> + loop_p ploop; >> >> if (get_base_address (ref) == NULL) >> return false; >> @@ -537,10 +538,50 @@ >> if (may_be_nonaddressable_p (base)) >> return false; >> >> - /* Limit non-constant step prefetching only to the innermost loops. >> */ >> - if (!cst_and_fits_in_hwi (step) && loop->inner != NULL) >> - return false; >> + /* Limit non-constant step prefetching only to the innermost loops and >> + only when the step is invariant in the entire loop nest. */ if >> + (!cst_and_fits_in_hwi (step)) >> + { >> + if( loop->inner != NULL) >> + { >> + if (dump_file && (dump_flags & TDF_DETAILS)) >> + { >> + fprintf (dump_file, "Reference %p:\n", (void *) ref); >> + fprintf (dump_file, "(base " ); >> + print_generic_expr (dump_file, base, TDF_SLIM); >> + fprintf (dump_file, ", step "); >> + print_generic_expr (dump_file, step, TDF_TREE); >> + fprintf (dump_file, ")\n"); > > No need to repeat this - all references are dumped when we gather them. > >> + fprintf (dump_file, "Ignoring %p, non-constant step >> prefetching\ >> + is limited to inner most loops \n",(void *) ref); >> + } >> + return false; >> + } >> + else >> + { >> + ploop = loop; >> >> + while (loop_outer (ploop)) >> + { >> + if (!expr_invariant_in_loop_p (ploop , step)) >> + { >> + if (dump_file && (dump_flags & TDF_DETAILS)) >> + { >> + fprintf (dump_file, "Reference %p:\n", (void *) ref); >> + fprintf (dump_file, "(base " ); >> + print_generic_expr (dump_file, base, TDF_SLIM); >> + fprintf (dump_file, ", step "); >> + print_generic_expr (dump_file, step, TDF_TREE); >> + fprintf (dump_file, ")\n"); > > Likewise. > >> + fprintf (dump_file, "Not prefetching, ignoring %p due >> to loop variant step\n",(void *) ref); >> + } >> + return false; >> + } >> + ploop = loop_outer (ploop); >> + } > > Instead of the loop checking expr_invariant_in_loop_p for each sub-loop you > can query expr_invariant_in_loop_p once, for the outermost loop. > > Please add a new function to cfgloop.h to access this outermost loop: > > /* Returns the outermost loop of the loop nest that contains LOOP. */ > > static inline struct loop * > loop_outermost (const struct loop *loop) { > unsigned n = VEC_length (loop_p, loop->superloops); > > if (n <= 1) > return loop; > > return VEC_index (loop_p, loop->superloops, 1); } > > Thanks, > Richard. > >
Index: gcc/testsuite/gcc.dg/pr53397-1.c =================================================================== --- gcc/testsuite/gcc.dg/pr53397-1.c (revision 0) +++ gcc/testsuite/gcc.dg/pr53397-1.c (revision 0) @@ -0,0 +1,28 @@ +/* Prefetching when the step is loop invariant. */ +/* { dg-do compile { target { i?86-*-* x86_64-*-* } } } */ +/* { dg-require-effective-target sse2 } */ +/* { dg-options "-O3 -fprefetch-loop-arrays -fdump-tree-aprefetch-details --param min-insn-to-prefetch-ratio=3 --param simultaneous-prefetches=10 -fdump-tree-aprefetch-details" } */ + + +double data[16384]; +void prefetch_when_non_constant_step_is_invariant(int step, int n) +{ + int a; + int b; + for (a = 1; a < step; a++) { + for (b = 0; b < n; b += 2 * step) { + + int i = 2*(b + a); + int j = 2*(b + a + step); + + + data[j] = data[i]; + data[j+1] = data[i+1]; + } + } +} + +/* { dg-final { scan-tree-dump "Issued prefetch" "aprefetch" } } */ +/* { dg-final { scan-assembler "prefetcht0" } } */ + +/* { dg-final { cleanup-tree-dump "aprefetch" } } */ Index: gcc/testsuite/gcc.dg/pr53397-2.c =================================================================== --- gcc/testsuite/gcc.dg/pr53397-2.c (revision 0) +++ gcc/testsuite/gcc.dg/pr53397-2.c (revision 0) @@ -0,0 +1,28 @@ +/* Not prefetching when the step is loop variant. */ +/* { dg-do compile { target { i?86-*-* x86_64-*-* } } } */ +/* { dg-require-effective-target sse2 } */ +/* { dg-options "-O3 -fprefetch-loop-arrays -fdump-tree-aprefetch-details --param min-insn-to-prefetch-ratio=3 --param simultaneous-prefetches=10 -fdump-tree-aprefetch-details" } */ + +double data[16384]; +void donot_prefetch_when_non_constant_step_is_variant(int step, int n) +{ + int a; + int b; + for (a = 1; a < step; a++,step*=2) { + for (b = 0; b < n; b += 2 * step) { + + int i = 2*(b + a); + int j = 2*(b + a + step); + + + data[j] = data[i]; + data[j+1] = data[i+1]; + } + } +} + +/* { dg-final { scan-tree-dump "Not prefetching" "aprefetch" } } */ +/* { dg-final { scan-tree-dump "loop variant step" "aprefetch" } } */ + +/* { dg-final { cleanup-tree-dump "aprefetch" } } */ + Index: gcc/tree-ssa-loop-prefetch.c =================================================================== --- gcc/tree-ssa-loop-prefetch.c (revision 191642) +++ gcc/tree-ssa-loop-prefetch.c (working copy) @@ -278,29 +278,37 @@ nontemporal one. */ }; -/* Dumps information about reference REF to FILE. */ - +/* Dumps information about memory reference */ static void -dump_mem_ref (FILE *file, struct mem_ref *ref) +dump_mem_details (FILE *file, tree base, tree step, + HOST_WIDE_INT delta, bool write_p) { - fprintf (file, "Reference %p:\n", (void *) ref); - - fprintf (file, " group %p (base ", (void *) ref->group); - print_generic_expr (file, ref->group->base, TDF_SLIM); + fprintf (file, "(base "); + print_generic_expr (file, base, TDF_SLIM); fprintf (file, ", step "); - if (cst_and_fits_in_hwi (ref->group->step)) - fprintf (file, HOST_WIDE_INT_PRINT_DEC, int_cst_value (ref->group->step)); + if (cst_and_fits_in_hwi (step)) + fprintf (file, HOST_WIDE_INT_PRINT_DEC, int_cst_value (step)); else - print_generic_expr (file, ref->group->step, TDF_TREE); + print_generic_expr (file, step, TDF_TREE); fprintf (file, ")\n"); - fprintf (file, " delta "); - fprintf (file, HOST_WIDE_INT_PRINT_DEC, ref->delta); + fprintf (file, HOST_WIDE_INT_PRINT_DEC, delta); fprintf (file, "\n"); + fprintf (file, " %s\n", write_p ? "write" : "read"); + fprintf (file, "\n"); +} - fprintf (file, " %s\n", ref->write_p ? "write" : "read"); +/* Dumps information about reference REF to FILE. */ - fprintf (file, "\n"); +static void +dump_mem_ref (FILE *file, struct mem_ref *ref) +{ + fprintf (file, "Reference %p:\n", (void *) ref); + + fprintf (file, " group %p ", (void *) ref->group); + + dump_mem_details (file, ref->group->base, ref->group->step, ref->delta, + ref->write_p); } /* Finds a group with BASE and STEP in GROUPS, or creates one if it does not @@ -537,9 +545,44 @@ if (may_be_nonaddressable_p (base)) return false; - /* Limit non-constant step prefetching only to the innermost loops. */ - if (!cst_and_fits_in_hwi (step) && loop->inner != NULL) - return false; + /* Limit non-constant step prefetching only to the innermost loops and + only when the step is loop invariant in the entire loop nest. */ + if (!cst_and_fits_in_hwi (step)) + { + if (loop->inner != NULL) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + { + fprintf (dump_file, "Memory expression %p\n",(void *) ref ); + print_generic_expr (dump_file, ref, TDF_TREE); + fprintf (dump_file,":"); + dump_mem_details( dump_file, base, step, delta, write_p); + fprintf (dump_file, + "Ignoring %p, non-constant step prefetching is " + "limited to inner most loops \n", + (void *) ref); + } + return false; + } + else + { + if (!expr_invariant_in_loop_p (loop_outermost (loop), step)) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + { + fprintf (dump_file, "Memory expression %p\n",(void *) ref ); + print_generic_expr (dump_file, ref, TDF_TREE); + fprintf (dump_file,":"); + dump_mem_details(dump_file, base, step, delta, write_p); + fprintf (dump_file, + "Not prefetching, ignoring %p due to " + "loop variant step\n", + (void *) ref); + } + return false; + } + } + } /* Now we know that REF = &BASE + STEP * iter + DELTA, where DELTA and STEP are integer constants. */ Index: gcc/cfgloop.h =================================================================== --- gcc/cfgloop.h (revision 191642) +++ gcc/cfgloop.h (working copy) @@ -714,4 +714,18 @@ extern void move_loop_invariants (void); extern bool finite_loop_p (struct loop *); +/* Returns the outermost loop of the loop nest that contains LOOP.*/ +static inline struct loop * +loop_outermost (struct loop *loop) +{ + + unsigned n = VEC_length (loop_p, loop->superloops); + + if (n <= 1) + return loop; + + return VEC_index (loop_p, loop->superloops, 1); +} + + #endif /* GCC_CFGLOOP_H */