Re: [RFC] A memory gathering optimization for loop

Feng Xue OS via Gcc Thu, 06 May 2021 20:13:03 -0700

>> To simplify explanation of this memory gathering optimization, pseudo
>> code on original nested loops is given as:
>>
>>   outer-loop ( ) { /* data in inner loop are also invariant in outer loop.  
>> */
>>       ...
>>       inner-loop (iter, iter_count) {  /* inner loop to apply MGO */
>>
>>           Type1 v1 = LOAD_FN1 (iter);
>>           Type2 v2 = LOAD_FN2 (v1);
>>           Type3 v3 = LOAD_FN3 (v2);
>>           ...
>>           iter = NEXT_FN (iter);
>>       }
>>
>>      ...
>>   }
>>
>> "LOAD_FNx()" is a conceptual function that translates its argument to a
>> result, and is composed of a sequence of operations in which there is
>> only one memory load. It is capable of representing simple memory
>> dereference like "iter->field", or more complicated expression like
>> "array[iter * 2 + cst].field".
>>
>> Likewise, "NEXT_FN()" is also a conceptual function which defines how an
>> iterator is advanced. It is able to describe simple integer iterator,
>> list iterator, or even pure-function-based iterator on advanced stuff
>> like hashset/tree.
>>
>> Then the code will be transformed to:
>>
>>   typedef struct cache_elem {
>>       bool   init;
>>       Type1  c_v1;
>>       Type2  c_v2;
>>       Type3  c_v3;
>>   } cache_elem;
>>
>>   cache_elem *cache_arr = calloc (iter_count, sizeof(cache_elem));
>>
>>   outer-loop () {
>>       ...
>>       size_t cache_idx = 0;
>>
>>       inner-loop (iter, iter_count) {
>>
>>           if (!cache_arr[cache_idx]->init) {
>>               v1 = LOAD_FN1 (iter);
>>               v2 = LOAD_FN2 (v1);
>>               v3 = LOAD_FN3 (v2);
>>
>>               cache_arr[cache_idx]->init = true;
>>               cache_arr[cache_idx]->c_v1 = v1;
>>               cache_arr[cache_idx]->c_v2 = v2;
>>               cache_arr[cache_idx]->c_v3 = v3;
>>           }
>>           else {
>>               v1 = cache_arr[cache_idx]->c_v1;
>>               v2 = cache_arr[cache_idx]->c_v2;
>>               v3 = cache_arr[cache_idx]->c_v3;
>>           }
>>           ...
>>           cache_idx++;
>>           iter = NEXT_FN (iter);
>>       }
>>       ...
>>   }
>>
>>   free (cache_arr);
> 
> Why do you need the init field?  Isn't the array initialized
> sequentially?


This is meant to support more complicate pattern, in which
actual iteration count of inner_loop depends on outer_loop,
and "iter_count" only represents an up-bound of the count.

for (i = 0; i < N; i++)
  {
    for (j = 0; j < i; j++) 
       {
           Type1 v1 = LOAD_FN1 (j);
           Type2 v2 = LOAD_FN2 (v1);
           Type3 v3 = LOAD_FN3 (v2);

           ...

           condition = ...
       }

    if (condition)
      break;
  }

In above case, "iter_count" is N, and is not exact count of inner_loop.
Before going through all loads, loop execution might end up with
certain condition. We should not cache all of them in one step since
some loads might be invalid if condition is not met. So we need a
"init" field to track whether a caching element got properly filled,
and allows runtime switch between "do caching" and "use caching".

> You need to keep the original loop and use it if calloc fails.  The
> transformation is not valid in general because calloc is not an
> async-signal-safe function, and GCC shouldn't introduce such calls if
> they are not present in the original source code.  For
> -fnon-call-exceptions, you need to introduce unwinding code that calls
> free.

This is an issue that we missed. The only way left is to use alloca(), but may
be not suitable for large data caching.

> Can you change this optimization so that it can use a fixed-size buffer?
> This would avoid all issues around calling calloc.  If iter_count can be
> very large, allocating that much extra memory might not be beneficial
> anyway.

And need additional code to synchronize access to this buffer.

Thanks,
Feng

Re: [RFC] A memory gathering optimization for loop

Reply via email to