On Thu, 2021-01-21 at 06:27 +0000, Feng Xue OS via Gcc-patches wrote:
> This patch implements a new loop optimization according to the proposal
> in RFC given at
> https://gcc.gnu.org/pipermail/gcc/2021-January/234682.html.
> So do not repeat the idea in this mail. Hope your comments on it.
With the caveat that I'm not an optimization expert (but no one else
seems to have replied), here are some thoughts.

[...snip...]

> Subject: [PATCH 1/3] mgo: Add a new memory gathering optimization for
> loop [PR98598]

BTW, did you mean to also post patches 2 and 3?

> In nested loops, if scattered memory accesses inside inner loop remain
> unchanged in outer loop, we can sequentialize these loads by caching
> their values into a temporary memory region at the first time, and
> reuse the caching data in following iterations. This way can improve
> efficiency of cpu cache subsystem by reducing its unpredictable
> activities.

I don't think you've cited any performance numbers so far.  Does the
optimization show a measurable gain on some benchmark(s)?  e.g. is this
ready to run SPEC yet, and how does it do?

> To illustrate what the optimization will do, two pieces of pseudo code,
> before and after transformation, are given. Suppose all loads and
> "iter_count" are invariant in outer loop.
>
> From:
>
>   outer-loop ()
>     {
>       inner-loop (iter, iter_count)
>         {
>           Type1 v1 = LOAD (iter);
>           Type2 v2 = LOAD (v1);
>           Type3 v3 = LOAD (v2);
>           ...
>           iter = NEXT (iter);
>         }
>     }
>
> To:
>
>   typedef struct cache_elem
>     {
>       bool  init;
>       Type1 c_v1;
>       Type2 c_v2;
>       Type3 c_v3;

Putting the "bool init;" at the front made me think "what about
packing?", but presumably the idea is that every element is accessed in
order, so it presumably benefits speed to have "init" at the top of the
element, right?

>     } cache_elem;
>
>   cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));

What if the allocation fails at runtime?  Do we keep an unoptimized
copy of the nested loops around as a fallback and have an unlikely
branch to that copy?

I notice that you're using calloc, presumably to clear all of the
"init" flags (and the whole buffer).
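To make the fallback question concrete, here's a hand-written C sketch
of the shape I have in mind (not from the patch; all the names here are
made up, and I've shrunk the cached element to one field):

```c
#include <stdlib.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct cache_elem
{
  bool init;
  int c_v1;
} cache_elem;

/* Hypothetical transformed function: sum the same row of data on every
   outer iteration, caching each load on first touch.  If calloc fails,
   take an unlikely branch to an untransformed copy of the loop nest.  */
int
sum_rows (const int *data, size_t iter_count, size_t outer)
{
  int total = 0;
  cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));

  if (__builtin_expect (cache_arr == NULL, 0))
    {
      /* Fallback: the original, unoptimized loop nest.  */
      for (size_t o = 0; o < outer; o++)
        for (size_t i = 0; i < iter_count; i++)
          total += data[i];
      return total;
    }

  for (size_t o = 0; o < outer; o++)
    for (size_t i = 0; i < iter_count; i++)
      {
        if (!cache_arr[i].init)
          {
            /* First touch: do the real load, remember it.  */
            cache_arr[i].init = true;
            cache_arr[i].c_v1 = data[i];
          }
        total += cache_arr[i].c_v1;
      }

  free (cache_arr);
  return total;
}
```

Both paths compute the same result, so the fallback only costs an extra
copy of the loop nest plus one well-predicted branch.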
FWIW, this feels like a case where it would be nice to have a thread-
local heap allocation, perhaps something like an obstack implemented in
the standard library - but that's obviously scope creep for this.

Could it make sense to use alloca for small allocations?  (or is that
scope creep?)

>   outer-loop ()
>     {
>       size_t cache_idx = 0;
>
>       inner-loop (iter, iter_count)
>         {
>           if (!cache_arr[cache_idx].init)
>             {
>               v1 = LOAD (iter);
>               v2 = LOAD (v1);
>               v3 = LOAD (v2);
>
>               cache_arr[cache_idx].init = true;
>               cache_arr[cache_idx].c_v1 = v1;
>               cache_arr[cache_idx].c_v2 = v2;
>               cache_arr[cache_idx].c_v3 = v3;
>             }
>           else
>             {
>               v1 = cache_arr[cache_idx].c_v1;
>               v2 = cache_arr[cache_idx].c_v2;
>               v3 = cache_arr[cache_idx].c_v3;
>             }
>           ...
>           cache_idx++;
>           iter = NEXT (iter);
>         }
>     }
>
>   free (cache_arr);

I see that the pass puts a "free" on every CFG exit from the loop.
What about "longjmp" and C++ exceptions?  Are we guaranteed that "free"
will happen for those ways of exiting the loop?  (should have test
cases for this, I think)

[...]

> 2020-12-25  Feng Xue  <f...@os.amperecomputing.com>
>
> gcc/
>         PR tree-optimization/98598
>         * Makefile.in (OBJS): Add tree-ssa-loop-mgo.o.
>         * common.opt (-ftree-loop-mgo): New option.
>         * opts.c (default_options_table): Enable -ftree-loop-mgo at -O3+.
>         * params.opt (mgo-max-dep-load-level): New parameter.
>         (mgo-max-cache-elem-size): Likewise.
>         (mgo-min-cache-array-length): Likewise.
>         (mgo-max-cache-array-length): Likewise.
>         * doc/invoke.texi (mgo-max-dep-load-level): Document new parameter.
>         (mgo-max-cache-elem-size): Likewise.
>         (mgo-min-cache-array-length): Likewise.
>         (mgo-max-cache-array-length): Likewise.

Nit: also need to document "-ftree-loop-mgo".

>         * passes.def: Add new pass_loop_mgo pass.
>         * timevar.def (TV_LOOP_MGO): New timevar.
>         * tree-pass.h (make_pass_loop_mgo): New declaration.
>         * tree-ssa-loop-mgo.c: New file.
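To show the longjmp concern stands on its own, here's a small C program
(hand-written; not from the patch) in the shape of the transformed loop:
a non-local exit from the inner loop bypasses the free placed on the
loop's normal CFG exits, so the cache buffer leaks:

```c
#include <setjmp.h>
#include <stdlib.h>
#include <stddef.h>

static jmp_buf env;
static void *leaked;  /* tracks whether the cache buffer was freed */

/* Stand-in for a loop body that exits non-locally partway through.  */
static void
inner_body (size_t i)
{
  if (i == 2)
    longjmp (env, 1);  /* bypasses the free () below */
}

/* Returns 1 if the cache buffer leaked, 0 if it was freed normally.  */
static int
run (size_t iter_count)
{
  void *cache_arr = calloc (iter_count, sizeof (int));
  leaked = cache_arr;

  if (setjmp (env) == 0)
    {
      for (size_t i = 0; i < iter_count; i++)
        inner_body (i);
      free (cache_arr);  /* only reached via the normal loop exit */
      leaked = NULL;
    }
  return leaked != NULL;
}
```

With iter_count > 2 the longjmp fires and the free is never reached;
with iter_count <= 2 the loop exits normally and the buffer is freed.
The same hazard applies to a C++ exception thrown from the loop body,
unless the pass also registers an EH cleanup.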
Nit: new C++ source files should have a .cc extension, rather than .c

> gcc/testsuite/
>         PR tree-optimization/98598
>         * gcc.dg/tree-ssa/mgo/mgo.exp: New file.
>         * gcc.dg/tree-ssa/mgo/mgo-common.h: New test header.
>         * gcc.dg/tree-ssa/mgo/list/list-1.c: New test.
>         * gcc.dg/tree-ssa/mgo/array/simple-1.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/simple-2.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/simple-3.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/param-1.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/param-2.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/param-3.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-1.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-2.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-3.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-4.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-5.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-6.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-7.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-8.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-9.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/dep-load-10.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/indx-iter-1.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/indx-iter-2.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/indx-iter-3.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/indx-iter-4.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/indx-iter-5.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/indx-iter-6.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/outer-loop-1.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/outer-loop-2.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/outer-loop-3.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/outer-loop-4.c: Likewise.
>         * gcc.dg/tree-ssa/mgo/array/outer-loop-5.c: Likewise.

[...]

Presumably the optimization can't happen if the underlying data gets
modified during the iteration.  I briefly looked through the test
cases, but I didn't see any/(much?) test coverage for loops that write
back to the data, or call functions that might do that.
Sorry if I missed some, but clearly we need to test that we don't
incorrectly optimize those cases.

What happens in a multithreaded program in which another thread might
be writing to the data?  What are we allowed to optimize?  Are there
some C++ examples?  Fortran?

It would be great to be able to inject calloc failures when testing
this, but obviously that's out of scope for such a patch.

Hope this is constructive.
Dave