On Fri, May 7, 2021 at 6:29 AM Feng Xue OS <f...@os.amperecomputing.com> wrote:
>
> >> gcc/
> >>         PR tree-optimization/98598
> >>         * Makefile.in (OBJS): Add tree-ssa-loop-mgo.o.
> >>         * common.opt (-ftree-loop-mgo): New option.
> >
> > Just a quick comment - -ftree-loop-mgo is user-facing and it isn't
> > really a good name.  -floop-mgo would be better, but still I'd have
> > no idea what this would do.
> >
> > I don't have a good suggestion here other than to expand it to
> > -floop-gather-memory (?!).
>
> OK. Better than "mgo"; this abbreviation is only a term for development use.
>
> > The option documentation isn't informative either.
> >
> > From:
> >
> >   outer-loop ()
> >     {
> >       inner-loop (iter, iter_count)
> >         {
> >           Type1 v1 = LOAD (iter);
> >           Type2 v2 = LOAD (v1);
> >           Type3 v3 = LOAD (v2);
> >           ...
> >           iter = NEXT (iter);
> >         }
> >     }
> >
> > To:
> >
> >   typedef struct cache_elem
> >     {
> >       bool  init;
> >       Type1 c_v1;
> >       Type2 c_v2;
> >       Type3 c_v3;
> >     } cache_elem;
> >
> >   cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));
> >
> >   outer-loop ()
> >     {
> >       size_t cache_idx = 0;
> >
> >       inner-loop (iter, iter_count)
> >         {
> >           if (!cache_arr[cache_idx].init)
> >             {
> >               v1 = LOAD (iter);
> >               v2 = LOAD (v1);
> >               v3 = LOAD (v2);
> >
> >               cache_arr[cache_idx].init = true;
> >               cache_arr[cache_idx].c_v1 = v1;
> >               cache_arr[cache_idx].c_v2 = v2;
> >               cache_arr[cache_idx].c_v3 = v3;
> >             }
> >           else
> >             {
> >               v1 = cache_arr[cache_idx].c_v1;
> >               v2 = cache_arr[cache_idx].c_v2;
> >               v3 = cache_arr[cache_idx].c_v3;
> >             }
> >           ...
> >           cache_idx++;
> >           iter = NEXT (iter);
> >         }
> >     }
> >
> >   free (cache_arr);
> >
> > This is a _very_ special transform.  What it seems to do is
> > optimize the dependent loads for outer loop iteration n > 1
> > by caching the result(s).  If that's possible, then you should
> > be able to distribute the outer loop into one doing the caching
> > and one using the cache.
> > Then this transform would be more like a traditional array
> > expansion of scalars?  In some cases loop interchange could also
> > remove the need for the caching.
> >
> > Doing MGO as the very first loop pass thus looks bad; I think
> > MGO should run much later, for example after interchange.
> > I also think that MGO should work in concert with loop
> > distribution (which could have an improved cost model)
> > rather than being a separate pass.
> >
> > Your analysis phase looks quite expensive, building something
> > like an on-the-side representation very closely matching SSA.
> > It seems to work from PHI defs to uses, which looks backwards.
>
> I did not catch this point very clearly. Would you please detail it more?
I don't remember exactly, but you are building a lot of data structures
that resemble ones readily available when you do find_dep_loads.
You're looking at each and every stmt, searching for operands matching
up something, and then you're walking SSA uses again.  And the
searching uses linear vector walks (vec::contains).

> > You seem to roll your own dependence analysis code :/  Please
> > have a look at loop distribution.
> >
> > Also you build an actual structure type, for reasons that escape
> > me, rather than simply accessing the allocated storage at
> > appropriate offsets.
> >
> > I think simply calling 'calloc' isn't OK because you might need
> > aligned storage and because calloc might not be available.
> > Please at least use 'malloc' and make sure MALLOC_ABI_ALIGNMENT
> > is large enough for the data you want to place (or perform
> > dynamic re-alignment yourself).  We probably want some generic
> > middle-end utility to obtain aligned allocated storage at some
> > point.
> >
> > As said above, I think you want to re-do this transform as
> > a loop distribution transform.  I think if caching works then
> > the loads should be distributable, and the loop distribution
> > transform should be enhanced to expand the scalars to arrays.
>
> I checked the code of loop distribution, and its trigger strategy
> seems to be very conservative: it currently targets only simple,
> regular index-based loops and cannot handle linked-list traversal,
> which consists of a series of discrete memory accesses where MGO
> would matter a lot.  Additionally, for some complicated cases we
> cannot completely decompose MGO into two separate loops for
> "do caching" and "use caching" respectively.  An example:
>
>   for (i = 0; i < N; i++)
>     {
>       for (j = 0; j < i; j++)
>         {
>           Type1 v1 = LOAD_FN1 (j);
>           Type2 v2 = LOAD_FN2 (v1);
>           Type3 v3 = LOAD_FN3 (v2);
>
>           ...
>
>           condition = ...
>         }
>
>       if (condition)
>         break;
>     }
>
> We should not cache all loads (N in total) in one step, since some
> of them might be invalid after "condition" breaks the loop.  We have
> to mix up "do caching" and "use caching", and let them be switched
> dynamically against the "init" flag.

Maybe, but then classical array expansion is useful and runs into the
same dynamic allocation problem.  I suppose loop distribution could be
taught the dependent-loads case - it's not unlike gathers.

> But loop distribution does have some overlap in analysis and
> transformation with MGO; we will try to see if there is a way to
> unify them.

Thanks.
Richard.

> Thanks,
> Feng