On Fri, May 7, 2021 at 6:29 AM Feng Xue OS <f...@os.amperecomputing.com> wrote:
>
> >> gcc/
> >>         PR tree-optimization/98598
> >>         * Makefile.in (OBJS): Add tree-ssa-loop-mgo.o.
> >>         * common.opt (-ftree-loop-mgo): New option.
> >
> > Just a quick comment - -ftree-loop-mgo is user-facing and it isn't
> > really a good name.  -floop-mgo would be better, but still I'd have
> > no idea what this would do.
> >
> > I don't have a good suggestion here other than to expand it to
> > -floop-gather-memory (?!).
>
> OK. Better than "mgo"; this abbreviation is only a term for development use.
>
> > The option documentation isn't informative either.
> >
> > From:
> >
> >   outer-loop ()
> >     {
> >       inner-loop (iter, iter_count)
> >         {
> >           Type1 v1 = LOAD (iter);
> >           Type2 v2 = LOAD (v1);
> >           Type3 v3 = LOAD (v2);
> >           ...
> >           iter = NEXT (iter);
> >         }
> >     }
> >
> > To:
> >
> >   typedef struct cache_elem
> >     {
> >       bool  init;
> >       Type1 c_v1;
> >       Type2 c_v2;
> >       Type3 c_v3;
> >     } cache_elem;
> >
> >   cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));
> >
> >   outer-loop ()
> >     {
> >       size_t cache_idx = 0;
> >
> >       inner-loop (iter, iter_count)
> >         {
> >           if (!cache_arr[cache_idx].init)
> >             {
> >               v1 = LOAD (iter);
> >               v2 = LOAD (v1);
> >               v3 = LOAD (v2);
> >
> >               cache_arr[cache_idx].init = true;
> >               cache_arr[cache_idx].c_v1 = v1;
> >               cache_arr[cache_idx].c_v2 = v2;
> >               cache_arr[cache_idx].c_v3 = v3;
> >             }
> >           else
> >             {
> >               v1 = cache_arr[cache_idx].c_v1;
> >               v2 = cache_arr[cache_idx].c_v2;
> >               v3 = cache_arr[cache_idx].c_v3;
> >             }
> >           ...
> >           cache_idx++;
> >           iter = NEXT (iter);
> >         }
> >     }
> >
> >   free (cache_arr);
> >
> > This is a _very_ special transform.  What it seems to do is
> > optimize the dependent loads for outer loop iteration n > 1
> > by caching the result(s).  If that's possible, then you should
> > be able to distribute the outer loop into one doing the caching
> > and one using the cache.
> > Then this transform would be more like a traditional array
> > expansion of scalars?  In some cases loop interchange could also
> > remove the need for the caching.
> >
> > Doing MGO as the very first loop pass thus looks bad; I think
> > MGO should run much later, for example after interchange.
> > I also think that MGO should work in concert with loop
> > distribution (which could have an improved cost model)
> > rather than being a separate pass.
> >
> > Your analysis phase looks quite expensive, building something
> > like an on-the-side representation very closely matching SSA.
> > It seems to work from PHI defs to uses, which looks backwards.
>
> I did not catch this point very clearly. Would you please detail it more?
I don't remember exactly, but you are building a lot of data structures
that resemble ones readily available when you do find_dep_loads.
You're looking at each and every stmt, searching for operands matching
up something, and then you're walking SSA uses again.  And the
searching uses linear vector walks (vec::contains).

> > You seem to roll your own dependence analysis code :/  Please
> > have a look at loop distribution.
> >
> > Also you build an actual structure type, for reasons that escape
> > me, rather than simply accessing the allocated storage at
> > appropriate offsets.
> >
> > I think simply calling 'calloc' isn't OK because you might need
> > aligned storage and because calloc might not be available.
> > Please at least use 'malloc' and make sure MALLOC_ABI_ALIGNMENT
> > is large enough for the data you want to place (or perform
> > dynamic re-alignment yourself).  We probably want some generic
> > middle-end utility to obtain aligned allocated storage at some
> > point.
> >
> > As said above, I think you want to re-do this transform as
> > a loop distribution transform.  I think if caching works then
> > the loads should be distributable, and the loop distribution
> > transform should be enhanced to expand the scalars to arrays.
>
> I checked the code of loop distribution, and its trigger strategy
> seems to be very conservative: it currently targets only simple,
> regular index-based loops and cannot handle linked-list traversal,
> which consists of a series of discrete memory accesses where MGO
> would matter a lot.  Additionally, for some complicated cases we
> cannot completely decompose MGO into two separate loops for
> "do caching" and "use caching" respectively.  An example:
>
>   for (i = 0; i < N; i++)
>     {
>       for (j = 0; j < i; j++)
>         {
>           Type1 v1 = LOAD_FN1 (j);
>           Type2 v2 = LOAD_FN2 (v1);
>           Type3 v3 = LOAD_FN3 (v2);
>
>           ...
>
>           condition = ...
>         }
>
>       if (condition)
>         break;
>     }
>
> We should not cache all loads (N in total) in one step, since some
> of them might be invalid after "condition" breaks the loop.  We have
> to mix up "do caching" and "use caching", and let them be switched
> dynamically against the "init" flag.

Maybe, but then classical array expansion is useful and runs into the
same dynamic allocation problem.  I suppose loop distribution could be
taught the dependent-loads case - it's not unlike gathers.

> But loop distribution does have some overlap in analysis and
> transformation with MGO; we will try to see if there is a way to
> unify them.

Thanks.
Richard.

> Thanks,
> Feng