On Tue, Oct 2, 2012 at 6:40 PM, Kumar, Venkataramanan <venkataramanan.ku...@amd.com> wrote: > Hi Richi, > > (Snip) >> + (!cst_and_fits_in_hwi (step)) >> + { >> + if( loop->inner != NULL) >> + { >> + if (dump_file && (dump_flags & TDF_DETAILS)) >> + { >> + fprintf (dump_file, "Reference %p:\n", (void *) ref); >> + fprintf (dump_file, "(base " ); >> + print_generic_expr (dump_file, base, TDF_SLIM); >> + fprintf (dump_file, ", step "); >> + print_generic_expr (dump_file, step, TDF_TREE); >> + fprintf (dump_file, ")\n"); > > No need to repeat this - all references are dumped when we gather them. > (Snip) > > The dumping happens at "record_ref" which is called after these statements to > record these references. > > When the step is invariant we return from the function without recording the > references. > > so I thought of dumping the references here. > > Is there a cleaner way to dump the references at one place?
Yes, call dump_mem_ref then, instead of repeating parts of its body. Richard. > Regards, > Venkat. > > > > -----Original Message----- > From: Richard Guenther [mailto:rguent...@suse.de] > Sent: Tuesday, October 02, 2012 5:42 PM > To: Kumar, Venkataramanan > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [Patch] Fix PR53397 > > On Mon, 1 Oct 2012, venkataramanan.ku...@amd.com wrote: > >> Hi, >> >> The below patch fixes the FFT/Scimark regression caused by useless >> prefetch generation. >> >> This fix tries to make prefetch less aggressive by prefetching arrays >> in the inner loop, when the step is invariant in the entire loop nest. >> >> GCC currently tries to prefetch invariant steps when they are in the >> inner loop. But does not check if the step is variant in outer loops. >> >> In the scimark FFT case, the trip count of the inner loop varies by a >> non constant step, which is invariant in the inner loop. >> But the step variable is varying in outer loop. This makes inner loop >> trip count small (at run time varies sometimes as small as 1 >> iteration) >> >> Prefetching ahead x iteration when the inner loop trip count is >> smaller than x leads to useless prefetches. >> >> Flag used: -O3 -march=amdfam10 >> >> Before >> ** ** >> ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** >> ** for details. (Results can be submitted to p...@nist.gov) ** >> ** ** >> Using 2.00 seconds min time per kenel. >> Composite Score: 550.50 >> FFT Mflops: 38.66 (N=1024) >> SOR Mflops: 617.61 (100 x 100) >> MonteCarlo: Mflops: 173.74 >> Sparse matmult Mflops: 675.63 (N=1000, nz=5000) >> LU Mflops: 1246.88 (M=100, N=100) >> >> >> After >> ** ** >> ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** >> ** for details. (Results can be submitted to p...@nist.gov) ** >> ** ** >> Using 2.00 seconds min time per kenel. >> Composite Score: 639.20 >> FFT Mflops: 479.19 (N=1024) >> SOR Mflops: 617.61 (100 x 100) >> MonteCarlo: Mflops: 173.18 >> Sparse matmult Mflops: 679.13 (N=1000, nz=5000) >> LU Mflops: 1246.88 (M=100, N=100) >> >> GCC regression "make check -k" passes with x86_64-unknown-linux-gnu >> New tests that PASS: >> >> gcc.dg/pr53397-1.c scan-assembler prefetcht0 gcc.dg/pr53397-1.c >> scan-tree-dump aprefetch "Issued prefetch" >> gcc.dg/pr53397-1.c (test for excess errors) gcc.dg/pr53397-2.c >> scan-tree-dump aprefetch "loop variant step" >> gcc.dg/pr53397-2.c scan-tree-dump aprefetch "Not prefetching" >> gcc.dg/pr53397-2.c (test for excess errors) >> >> >> Checked CPU2006 and polyhedron on latest AMD processor, no regressions noted. >> >> Ok to commit in trunk? >> >> regards, >> Venkat >> >> gcc/ChangeLog >> +2012-10-01 Venkataramanan Kumar <venkataramanan.ku...@amd.com> >> + >> + * tree-ssa-loop-prefetch.c (gather_memory_references_ref):$ >> + Perform non constant step prefetching in inner loop, only $ >> + when it is invariant in the entire loop nest. $ >> + * testsuite/gcc.dg/pr53397-1.c: New test case $ >> + Checks we are prefecthing for loop invariant steps$ >> + * testsuite/gcc.dg/pr53397-2.c: New test case$ >> + Checks we are not prefecthing for loop variant steps >> + >> >> >> Index: gcc/testsuite/gcc.dg/pr53397-1.c >> =================================================================== >> --- gcc/testsuite/gcc.dg/pr53397-1.c (revision 0) >> +++ gcc/testsuite/gcc.dg/pr53397-1.c (revision 0) >> @@ -0,0 +1,28 @@ >> +/* Prefetching when the step is loop invariant. */ >> + >> +/* { dg-do compile } */ >> +/* { dg-options "-O3 -fprefetch-loop-arrays >> +-fdump-tree-aprefetch-details --param min-insn-to-prefetch-ratio=3 >> +--param simultaneous-prefetches=10 -fdump-tree-aprefetch-details" } >> +*/ >> + >> + >> +double data[16384]; >> +void prefetch_when_non_constant_step_is_invariant(int step, int n) { >> + int a; >> + int b; >> + for (a = 1; a < step; a++) { >> + for (b = 0; b < n; b += 2 * step) { >> + >> + int i = 2*(b + a); >> + int j = 2*(b + a + step); >> + >> + >> + data[j] = data[i]; >> + data[j+1] = data[i+1]; >> + } >> + } >> +} >> + >> +/* { dg-final { scan-tree-dump "Issued prefetch" "aprefetch" } } */ >> +/* { dg-final { scan-assembler "prefetcht0" } } */ > > This (and the case below) needs to be adjusted to only run on the appropriate > hardware. See for example gcc.dg/tree-ssa/prefetch-8.c for how to do this. > >> +/* { dg-final { cleanup-tree-dump "aprefetch" } } */ >> Index: gcc/testsuite/gcc.dg/pr53397-2.c >> =================================================================== >> --- gcc/testsuite/gcc.dg/pr53397-2.c (revision 0) >> +++ gcc/testsuite/gcc.dg/pr53397-2.c (revision 0) >> @@ -0,0 +1,29 @@ >> +/* Not prefetching when the step is loop variant. */ >> + >> +/* { dg-do compile } */ >> +/* { dg-options "-O3 -fprefetch-loop-arrays >> +-fdump-tree-aprefetch-details --param min-insn-to-prefetch-ratio=3 >> +--param simultaneous-prefetches=10 -fdump-tree-aprefetch-details" } >> +*/ >> + >> + >> +double data[16384]; >> +void donot_prefetch_when_non_constant_step_is_variant(int step, int >> +n) { >> + int a; >> + int b; >> + for (a = 1; a < step; a++,step*=2) { >> + for (b = 0; b < n; b += 2 * step) { >> + >> + int i = 2*(b + a); >> + int j = 2*(b + a + step); >> + >> + >> + data[j] = data[i]; >> + data[j+1] = data[i+1]; >> + } >> + } >> +} >> + >> +/* { dg-final { scan-tree-dump "Not prefetching" "aprefetch" } } */ >> +/* { dg-final { scan-tree-dump "loop variant step" "aprefetch" } } */ >> + >> +/* { dg-final { cleanup-tree-dump "aprefetch" } } */ >> + >> Index: gcc/tree-ssa-loop-prefetch.c >> =================================================================== >> --- gcc/tree-ssa-loop-prefetch.c (revision 191642) >> +++ gcc/tree-ssa-loop-prefetch.c (working copy) >> @@ -523,6 +523,7 @@ >> tree base, step; >> HOST_WIDE_INT delta; >> struct mem_ref_group *agrp; >> + loop_p ploop; >> >> if (get_base_address (ref) == NULL) >> return false; >> @@ -537,10 +538,50 @@ >> if (may_be_nonaddressable_p (base)) >> return false; >> >> - /* Limit non-constant step prefetching only to the innermost loops. >> */ >> - if (!cst_and_fits_in_hwi (step) && loop->inner != NULL) >> - return false; >> + /* Limit non-constant step prefetching only to the innermost loops and >> + only when the step is invariant in the entire loop nest. */ if >> + (!cst_and_fits_in_hwi (step)) >> + { >> + if( loop->inner != NULL) >> + { >> + if (dump_file && (dump_flags & TDF_DETAILS)) >> + { >> + fprintf (dump_file, "Reference %p:\n", (void *) ref); >> + fprintf (dump_file, "(base " ); >> + print_generic_expr (dump_file, base, TDF_SLIM); >> + fprintf (dump_file, ", step "); >> + print_generic_expr (dump_file, step, TDF_TREE); >> + fprintf (dump_file, ")\n"); > > No need to repeat this - all references are dumped when we gather them. > >> + fprintf (dump_file, "Ignoring %p, non-constant step >> prefetching\ >> + is limited to inner most loops \n",(void *) ref); >> + } >> + return false; >> + } >> + else >> + { >> + ploop = loop; >> >> + while (loop_outer (ploop)) >> + { >> + if (!expr_invariant_in_loop_p (ploop , step)) >> + { >> + if (dump_file && (dump_flags & TDF_DETAILS)) >> + { >> + fprintf (dump_file, "Reference %p:\n", (void *) ref); >> + fprintf (dump_file, "(base " ); >> + print_generic_expr (dump_file, base, TDF_SLIM); >> + fprintf (dump_file, ", step "); >> + print_generic_expr (dump_file, step, TDF_TREE); >> + fprintf (dump_file, ")\n"); > > Likewise. > >> + fprintf (dump_file, "Not prefetching, ignoring %p due >> to loop variant step\n",(void *) ref); >> + } >> + return false; >> + } >> + ploop = loop_outer (ploop); >> + } > > Instead of the loop checking expr_invariant_in_loop_p for each sub-loop you > can query expr_invariant_in_loop_p once, for the outermost loop. > > Please add a new function to cfgloop.h to access this outermost loop: > > /* Returns the outermost loop of the loop nest that contains LOOP. */ > > static inline struct loop * > loop_outermost (const struct loop *loop) { > unsigned n = VEC_length (loop_p, loop->superloops); > > if (n <= 1) > return loop; > > return VEC_index (loop_p, loop->superloops, 1); } > > Thanks, > Richard. > >