On Fri, Apr 30, 2021 at 1:20 PM Feng Xue OS via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> >> This patch implements a new loop optimization according to the proposal
> >> in the RFC given at
> >> https://gcc.gnu.org/pipermail/gcc/2021-January/234682.html,
> >> so the idea is not repeated in this mail. Hope for your comments on it.
> >
> > With the caveat that I'm not an optimization expert (but no one else
> > seems to have replied), here are some thoughts.
> >
> > [...snip...]
> >
> >> Subject: [PATCH 1/3] mgo: Add a new memory gathering optimization for loop
> >> [PR98598]
> >
> > BTW, did you mean to also post patches 2 and 3?
>
> Not yet, but they are ready. Since this is a special kind of optimization
> that uses the heap as temporary storage, not a common approach in GCC, we
> did not know the community's basic attitude towards it. So only the first
> patch was sent out for initial comments, in that it implements a generic
> MGO framework and is complete and self-contained. The other two patches
> add enhancements for specific code patterns and a dynamic alias check.
> If this proposal is accepted in principle, we will submit the other two
> for review.
>
> >> In nested loops, if scattered memory accesses inside the inner loop
> >> remain unchanged in the outer loop, we can sequentialize these loads by
> >> caching their values into a temporary memory region the first time, and
> >> reuse the cached data in following iterations. This can improve the
> >> efficiency of the CPU cache subsystem by reducing its unpredictable
> >> activities.
> >
> > I don't think you've cited any performance numbers so far. Does the
> > optimization show a measurable gain on some benchmark(s)? e.g. is this
> > ready to run SPEC yet, and how does it do?
>
> Yes, we have done that. A minor improvement of a few percentage points can
> be gained for some real applications. To be specific, we also see a major
> improvement of more than 30% for a certain benchmark in SPEC2017.
Hi Feng Xue,

Could you help point out which benchmark it is? I failed to observe any
improvement in SPEC2017 on a local x86 machine, running at the -O3 level.
Thanks,
bin

> >
> >> To illustrate what the optimization will do, two pieces of pseudo code,
> >> before and after the transformation, are given. Suppose all loads and
> >> "iter_count" are invariant in the outer loop.
> >>
> >> From:
> >>
> >>   outer-loop ()
> >>     {
> >>       inner-loop (iter, iter_count)
> >>         {
> >>           Type1 v1 = LOAD (iter);
> >>           Type2 v2 = LOAD (v1);
> >>           Type3 v3 = LOAD (v2);
> >>           ...
> >>           iter = NEXT (iter);
> >>         }
> >>     }
> >>
> >> To:
> >>
> >>   typedef struct cache_elem
> >>     {
> >>       bool  init;
> >>       Type1 c_v1;
> >>       Type2 c_v2;
> >>       Type3 c_v3;
> >
> > Putting the "bool init;" at the front made me think "what about
> > packing?" but presumably the idea is that every element is accessed in
> > order, so it presumably benefits speed to have "init" at the top of the
> > element, right?
>
> Yes, the layout of the struct could be optimized for size by several
> means, such as:
>  o. packing "init" into a padding hole after a certain field
>  o. if a certain field is a pointer type, that field can take the role of
>     "init" (non-NULL implies "initialized")
> For now this simple scheme is straightforward, and it will be enhanced in
> various aspects later.
>
> >>     } cache_elem;
> >>
> >>   cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));
> >
> > What if the allocation fails at runtime? Do we keep an unoptimized
> > copy of the nested loops around as a fallback and have an unlikely
> > branch to that copy?
>
> Yes, we should. But we do it in a different way: a flag is added to the
> original nested loop to control a runtime switch between optimized and
> unoptimized execution. This definitely incurs runtime cost, but avoids
> possible code-size bloat. A better handling, as a TODO, is to apply a
> dynamic switch for large loops and loop cloning for small ones.
>
> > I notice that you're using calloc, presumably to clear all of the
> > "init" flags (and the whole buffer).
> >
> > FWIW, this feels like a case where it would be nice to have a thread-
> > local heap allocation, perhaps something like an obstack implemented in
> > the standard library - but that's obviously scope creep for this.
>
> Yes, that would be good, especially for many-threaded applications.
>
> > Could it make sense to use alloca for small allocations? (or is that
> > scope creep?)
>
> We did consider using alloca as you said. But if we cannot determine an
> upper limit for a non-constant size, we have to place the alloca inside a
> loop that encloses the nested loop. Without a corresponding free
> operation, this kind of alloca-in-loop might cause stack overflow. So it
> becomes another TODO.
>
> >>   outer-loop ()
> >>     {
> >>       size_t cache_idx = 0;
> >>
> >>       inner-loop (iter, iter_count)
> >>         {
> >>           if (!cache_arr[cache_idx].init)
> >>             {
> >>               v1 = LOAD (iter);
> >>               v2 = LOAD (v1);
> >>               v3 = LOAD (v2);
> >>
> >>               cache_arr[cache_idx].init = true;
> >>               cache_arr[cache_idx].c_v1 = v1;
> >>               cache_arr[cache_idx].c_v2 = v2;
> >>               cache_arr[cache_idx].c_v3 = v3;
> >>             }
> >>           else
> >>             {
> >>               v1 = cache_arr[cache_idx].c_v1;
> >>               v2 = cache_arr[cache_idx].c_v2;
> >>               v3 = cache_arr[cache_idx].c_v3;
> >>             }
> >>           ...
> >>           cache_idx++;
> >>           iter = NEXT (iter);
> >>         }
> >>     }
> >>
> >>   free (cache_arr);
> >
> > I see that the pass puts a "free" on every CFG exit from the loop.
> >
> > What about "longjmp" and C++ exceptions? Are we guaranteed that "free"
> > will happen for those ways of exiting the loop? (should have test
> > cases for this, I think)
>
> We simply disable the optimization if a loop contains an abnormal or
> EH exit.
>
> > [...]
> >
> >> 2020-12-25  Feng Xue  <f...@os.amperecomputing.com>
> >>
> >> gcc/
> >>         PR tree-optimization/98598
> >>         * Makefile.in (OBJS): Add tree-ssa-loop-mgo.o.
> >>         * common.opt (-ftree-loop-mgo): New option.
> >>         * opts.c (default_options_table): Enable -ftree-loop-mgo at -O3+.
> >>         * params.opt (mgo-max-dep-load-level): New parameter.
> >>         (mgo-max-cache-elem-size): Likewise.
> >>         (mgo-min-cache-array-length): Likewise.
> >>         (mgo-max-cache-array-length): Likewise.
> >>         * doc/invoke.texi (mgo-max-dep-load-level): Document new parameter.
> >>         (mgo-max-cache-elem-size): Likewise.
> >>         (mgo-min-cache-array-length): Likewise.
> >>         (mgo-max-cache-array-length): Likewise.
> >
> > Nit: also need to document "-ftree-loop-mgo".
>
> OK.
>
> >>         * passes.def: Add new pass_loop_mgo pass.
> >>         * timevar.def (TV_LOOP_MGO): New timevar.
> >>         * tree-pass.h (make_pass_loop_mgo): New declaration.
> >>         * tree-ssa-loop-mgo.c: New file.
> >
> > Nit: new C++ source files should have a .cc extension, rather than .c
>
> OK.
>
> >> gcc/testsuite/
> >>         PR tree-optimization/98598
> >>         * gcc.dg/tree-ssa/mgo/mgo.exp: New file.
> >>         * gcc.dg/tree-ssa/mgo/mgo-common.h: New test header.
> >>         * gcc.dg/tree-ssa/mgo/list/list-1.c: New test.
> >>         * gcc.dg/tree-ssa/mgo/array/simple-1.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/simple-2.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/simple-3.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/param-1.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/param-2.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/param-3.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-1.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-2.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-3.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-4.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-5.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-6.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-7.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-8.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-9.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/dep-load-10.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/indx-iter-1.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/indx-iter-2.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/indx-iter-3.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/indx-iter-4.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/indx-iter-5.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/indx-iter-6.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/outer-loop-1.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/outer-loop-2.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/outer-loop-3.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/outer-loop-4.c: Likewise.
> >>         * gcc.dg/tree-ssa/mgo/array/outer-loop-5.c: Likewise.
> >
> > [...]
> >
> > Presumably the optimization can't happen if the underlying data gets
> > modified during the iteration.
>
> Yes. Any data to be gathered should be read-only in the loop.
>
> > I briefly looked through the test cases, but I didn't see any/(much?)
> > test coverage for loops that write back to the data, or call functions
> > that might do that. Sorry if I missed some, but clearly we need to
> > test that we don't incorrectly optimize those cases.
>
> If we cannot statically deduce whether the data might be written, we add
> a dynamic alias check to detect write hazards; this has been implemented
> in another patch. Test cases covering data writes and function calls are
> also included in it.
>
> > What happens in a multithreaded program in which another thread might
> > be writing to the data? What are we allowed to optimize?
>
> For data accessed by multiple threads, we assume the programmer knows how
> to avoid data races, using synchronization primitives or the "volatile"
> specifier to annotate the related data. If a non-pure call or a "volatile"
> memory access is found, the optimization is suppressed.
>
> > Are there some C++ examples? Fortran?
>
> OK. Will add some.
>
> > It would be great to be able to inject calloc failures when testing
> > this, but obviously that's out of scope for such a patch.
>
> OK. This is a TODO.
>
> > Hope this is constructive
>
> It definitely is, thanks for your comments.
>
> Feng