On Mon, Oct 8, 2012 at 12:01 PM, Kumar, Venkataramanan <venkataramanan.ku...@amd.com> wrote: > Hi Richard, > > I have incorporated your comments. > >> Yes, call dump_mem_ref then, instead of repeating parts of its body. > > Reference object is not yet created at the place we check for invariance. It > is still a tree expression. I created a common function and used at all > places to dump the "step", "base" and "delta" values of memory reference > being analyzed. > > Please find the modified patch attached. > > GCC regression "make check -k" passes with x86_64-unknown-linux-gnu.
I presume also bootstrapped. Ok. Thanks, Richard. > Regards, > Venkat. > > -----Original Message----- > From: Richard Guenther [mailto:richard.guent...@gmail.com] > Sent: Thursday, October 04, 2012 6:26 PM > To: Kumar, Venkataramanan > Cc: Richard Guenther; gcc-patches@gcc.gnu.org > Subject: Re: [Patch] Fix PR53397 > > On Tue, Oct 2, 2012 at 6:40 PM, Kumar, Venkataramanan > <venkataramanan.ku...@amd.com> wrote: >> Hi Richi, >> >> (Snip) >>> + (!cst_and_fits_in_hwi (step)) >>> + { >>> + if( loop->inner != NULL) >>> + { >>> + if (dump_file && (dump_flags & TDF_DETAILS)) >>> + { >>> + fprintf (dump_file, "Reference %p:\n", (void *) ref); >>> + fprintf (dump_file, "(base " ); >>> + print_generic_expr (dump_file, base, TDF_SLIM); >>> + fprintf (dump_file, ", step "); >>> + print_generic_expr (dump_file, step, TDF_TREE); >>> + fprintf (dump_file, ")\n"); >> >> No need to repeat this - all references are dumped when we gather them. >> (Snip) >> >> The dumping happens at "record_ref" which is called after these statements >> to record these references. >> >> When the step is invariant we return from the function without recording >> the references. >> >> so I thought of dumping the references here. >> >> Is there a cleaner way to dump the references at one place? > > Yes, call dump_mem_ref then, instead of repeating parts of its body. > > Richard. > >> Regards, >> Venkat. >> >> >> >> -----Original Message----- >> From: Richard Guenther [mailto:rguent...@suse.de] >> Sent: Tuesday, October 02, 2012 5:42 PM >> To: Kumar, Venkataramanan >> Cc: gcc-patches@gcc.gnu.org >> Subject: Re: [Patch] Fix PR53397 >> >> On Mon, 1 Oct 2012, venkataramanan.ku...@amd.com wrote: >> >>> Hi, >>> >>> The below patch fixes the FFT/Scimark regression caused by useless >>> prefetch generation. >>> >>> This fix tries to make prefetch less aggressive by prefetching arrays >>> in the inner loop, when the step is invariant in the entire loop nest. >>> >>> GCC currently tries to prefetch invariant steps when they are in the >>> inner loop. But does not check if the step is variant in outer loops. >>> >>> In the scimark FFT case, the trip count of the inner loop varies by a >>> non constant step, which is invariant in the inner loop. >>> But the step variable is varying in outer loop. This makes inner loop >>> trip count small (at run time varies sometimes as small as 1 >>> iteration) >>> >>> Prefetching ahead x iteration when the inner loop trip count is >>> smaller than x leads to useless prefetches. >>> >>> Flag used: -O3 -march=amdfam10 >>> >>> Before >>> ** ** >>> ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** >>> ** for details. (Results can be submitted to p...@nist.gov) ** >>> ** ** >>> Using 2.00 seconds min time per kenel. >>> Composite Score: 550.50 >>> FFT Mflops: 38.66 (N=1024) >>> SOR Mflops: 617.61 (100 x 100) >>> MonteCarlo: Mflops: 173.74 >>> Sparse matmult Mflops: 675.63 (N=1000, nz=5000) >>> LU Mflops: 1246.88 (M=100, N=100) >>> >>> >>> After >>> ** ** >>> ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** >>> ** for details. (Results can be submitted to p...@nist.gov) ** >>> ** ** >>> Using 2.00 seconds min time per kenel. >>> Composite Score: 639.20 >>> FFT Mflops: 479.19 (N=1024) >>> SOR Mflops: 617.61 (100 x 100) >>> MonteCarlo: Mflops: 173.18 >>> Sparse matmult Mflops: 679.13 (N=1000, nz=5000) >>> LU Mflops: 1246.88 (M=100, N=100) >>> >>> GCC regression "make check -k" passes with x86_64-unknown-linux-gnu >>> New tests that PASS: >>> >>> gcc.dg/pr53397-1.c scan-assembler prefetcht0 gcc.dg/pr53397-1.c >>> scan-tree-dump aprefetch "Issued prefetch" >>> gcc.dg/pr53397-1.c (test for excess errors) gcc.dg/pr53397-2.c >>> scan-tree-dump aprefetch "loop variant step" >>> gcc.dg/pr53397-2.c scan-tree-dump aprefetch "Not prefetching" >>> gcc.dg/pr53397-2.c (test for excess errors) >>> >>> >>> Checked CPU2006 and polyhedron on latest AMD processor, no regressions >>> noted. >>> >>> Ok to commit in trunk? >>> >>> regards, >>> Venkat >>> >>> gcc/ChangeLog >>> +2012-10-01 Venkataramanan Kumar <venkataramanan.ku...@amd.com> >>> + >>> + * tree-ssa-loop-prefetch.c (gather_memory_references_ref):$ >>> + Perform non constant step prefetching in inner loop, only $ >>> + when it is invariant in the entire loop nest. $ >>> + * testsuite/gcc.dg/pr53397-1.c: New test case $ >>> + Checks we are prefecthing for loop invariant steps$ >>> + * testsuite/gcc.dg/pr53397-2.c: New test case$ >>> + Checks we are not prefecthing for loop variant steps >>> + >>> >>> >>> Index: gcc/testsuite/gcc.dg/pr53397-1.c >>> =================================================================== >>> --- gcc/testsuite/gcc.dg/pr53397-1.c (revision 0) >>> +++ gcc/testsuite/gcc.dg/pr53397-1.c (revision 0) >>> @@ -0,0 +1,28 @@ >>> +/* Prefetching when the step is loop invariant. */ >>> + >>> +/* { dg-do compile } */ >>> +/* { dg-options "-O3 -fprefetch-loop-arrays >>> +-fdump-tree-aprefetch-details --param min-insn-to-prefetch-ratio=3 >>> +--param simultaneous-prefetches=10 -fdump-tree-aprefetch-details" } >>> +*/ >>> + >>> + >>> +double data[16384]; >>> +void prefetch_when_non_constant_step_is_invariant(int step, int n) { >>> + int a; >>> + int b; >>> + for (a = 1; a < step; a++) { >>> + for (b = 0; b < n; b += 2 * step) { >>> + >>> + int i = 2*(b + a); >>> + int j = 2*(b + a + step); >>> + >>> + >>> + data[j] = data[i]; >>> + data[j+1] = data[i+1]; >>> + } >>> + } >>> +} >>> + >>> +/* { dg-final { scan-tree-dump "Issued prefetch" "aprefetch" } } */ >>> +/* { dg-final { scan-assembler "prefetcht0" } } */ >> >> This (and the case below) needs to be adjusted to only run on the >> appropriate hardware. See for example gcc.dg/tree-ssa/prefetch-8.c for how >> to do this. >> >>> +/* { dg-final { cleanup-tree-dump "aprefetch" } } */ >>> Index: gcc/testsuite/gcc.dg/pr53397-2.c >>> =================================================================== >>> --- gcc/testsuite/gcc.dg/pr53397-2.c (revision 0) >>> +++ gcc/testsuite/gcc.dg/pr53397-2.c (revision 0) >>> @@ -0,0 +1,29 @@ >>> +/* Not prefetching when the step is loop variant. */ >>> + >>> +/* { dg-do compile } */ >>> +/* { dg-options "-O3 -fprefetch-loop-arrays >>> +-fdump-tree-aprefetch-details --param min-insn-to-prefetch-ratio=3 >>> +--param simultaneous-prefetches=10 -fdump-tree-aprefetch-details" } >>> +*/ >>> + >>> + >>> +double data[16384]; >>> +void donot_prefetch_when_non_constant_step_is_variant(int step, int >>> +n) { >>> + int a; >>> + int b; >>> + for (a = 1; a < step; a++,step*=2) { >>> + for (b = 0; b < n; b += 2 * step) { >>> + >>> + int i = 2*(b + a); >>> + int j = 2*(b + a + step); >>> + >>> + >>> + data[j] = data[i]; >>> + data[j+1] = data[i+1]; >>> + } >>> + } >>> +} >>> + >>> +/* { dg-final { scan-tree-dump "Not prefetching" "aprefetch" } } */ >>> +/* { dg-final { scan-tree-dump "loop variant step" "aprefetch" } } >>> +*/ >>> + >>> +/* { dg-final { cleanup-tree-dump "aprefetch" } } */ >>> + >>> Index: gcc/tree-ssa-loop-prefetch.c >>> =================================================================== >>> --- gcc/tree-ssa-loop-prefetch.c (revision 191642) >>> +++ gcc/tree-ssa-loop-prefetch.c (working copy) >>> @@ -523,6 +523,7 @@ >>> tree base, step; >>> HOST_WIDE_INT delta; >>> struct mem_ref_group *agrp; >>> + loop_p ploop; >>> >>> if (get_base_address (ref) == NULL) >>> return false; >>> @@ -537,10 +538,50 @@ >>> if (may_be_nonaddressable_p (base)) >>> return false; >>> >>> - /* Limit non-constant step prefetching only to the innermost loops. >>> */ >>> - if (!cst_and_fits_in_hwi (step) && loop->inner != NULL) >>> - return false; >>> + /* Limit non-constant step prefetching only to the innermost loops and >>> + only when the step is invariant in the entire loop nest. */ if >>> + (!cst_and_fits_in_hwi (step)) >>> + { >>> + if( loop->inner != NULL) >>> + { >>> + if (dump_file && (dump_flags & TDF_DETAILS)) >>> + { >>> + fprintf (dump_file, "Reference %p:\n", (void *) ref); >>> + fprintf (dump_file, "(base " ); >>> + print_generic_expr (dump_file, base, TDF_SLIM); >>> + fprintf (dump_file, ", step "); >>> + print_generic_expr (dump_file, step, TDF_TREE); >>> + fprintf (dump_file, ")\n"); >> >> No need to repeat this - all references are dumped when we gather them. >> >>> + fprintf (dump_file, "Ignoring %p, non-constant step >>> prefetching\ >>> + is limited to inner most loops \n",(void *) ref); >>> + } >>> + return false; >>> + } >>> + else >>> + { >>> + ploop = loop; >>> >>> + while (loop_outer (ploop)) >>> + { >>> + if (!expr_invariant_in_loop_p (ploop , step)) >>> + { >>> + if (dump_file && (dump_flags & TDF_DETAILS)) >>> + { >>> + fprintf (dump_file, "Reference %p:\n", (void *) ref); >>> + fprintf (dump_file, "(base " ); >>> + print_generic_expr (dump_file, base, TDF_SLIM); >>> + fprintf (dump_file, ", step "); >>> + print_generic_expr (dump_file, step, TDF_TREE); >>> + fprintf (dump_file, ")\n"); >> >> Likewise. >> >>> + fprintf (dump_file, "Not prefetching, ignoring %p >>> due to loop variant step\n",(void *) ref); >>> + } >>> + return false; >>> + } >>> + ploop = loop_outer (ploop); >>> + } >> >> Instead of the loop checking expr_invariant_in_loop_p for each sub-loop you >> can query expr_invariant_in_loop_p once, for the outermost loop. >> >> Please add a new function to cfgloop.h to access this outermost loop: >> >> /* Returns the outermost loop of the loop nest that contains LOOP. */ >> >> static inline struct loop * >> loop_outermost (const struct loop *loop) { >> unsigned n = VEC_length (loop_p, loop->superloops); >> >> if (n <= 1) >> return loop; >> >> return VEC_index (loop_p, loop->superloops, 1); } >> >> Thanks, >> Richard. >> >> >