On January 17, 2021 2:23:55 AM GMT+01:00, JiangNing OS via Gcc 
<gcc@gcc.gnu.org> wrote:
>> -----Original Message-----
>> From: Feng Xue OS <f...@os.amperecomputing.com>
>> Sent: Thursday, January 14, 2021 12:28 PM
>> To: gcc@gcc.gnu.org
>> Cc: JiangNing OS <jiangn...@os.amperecomputing.com>; Hao Liu OS
>> <h...@os.amperecomputing.com>
>> Subject: [RFC] A memory gathering optimization for loop
>> 
>> 1. Background
>> 
>> In a loop, it is optimal if only one memory stream is activated, that
>> is, all memory operations sequentially access one data region. But that
>> is not always the case, for example when traversing a linked list or
>> manipulating discrete arrays. In such scenarios, the loop contains
>> multiple scattered memory accesses, which can trigger inefficient
>> multi-way hardware prefetching and reduce the CPU cache hit rate. The
>> tracker pr98598 (https://gcc.gnu.org/bugzilla//show_bug.cgi?id=98598)
>> describes one related problem that we intend to target.
>> 
>> 2. Memory gathering optimization (MGO)
>> 
>> For nested loops, if scattered memory accesses inside the inner loop
>> remain unchanged in each iteration of the outer loop, we can dynamically
>> gather the result data into a sequential memory region when the first
>> access in the inner loop takes place. This way hardware prefetching
>> pressure is reduced, and the CPU cache hit rate can be improved. We name
>> this optimization MGO (memory gathering optimization). An example is
>> given below; it is slightly different from pr98598, and cannot be
>> optimized by loop distribution.
>> 
>>   struct A { int **p1; };
>> 
>>   int foo (int n, struct A *array)
>>   {
>>       int sum = 0;
>> 
>>       for (int i = 0; i < n; i++) {
>>           for (int j = 0; j < i; j++) {  /* iteration count is i */
>>               int **p1 = array[j].p1;
>>               int  *p2 = *p1;
>> 
>>               sum += *p2;
>>           }
>>       }
>> 
>>       return sum;
>>   }
>> 

The number of memory streams in this loop is limited (practically 1), how
would the transform be affected when issuing prefetches for array at a
well-placed spot?

That is, what's the hazard we avoid (besides the obvious data-dependence
shortening in case the load from array is in cache)?

Richard. 

>> Although this example may seem artificial, this kind of data reuse in
>> nested loops is often seen in real applications, especially those
>> written in Fortran or using the C++ STL.
>> 
>> To simplify the explanation of this memory gathering optimization,
>> pseudo code for the original nested loops is given as:
>> 
>>   outer-loop ( ) {  /* data in inner loop are also invariant in outer loop.  */
>>       ...
>>       inner-loop (iter, iter_count) {  /* inner loop to apply MGO */
>> 
>>           Type1 v1 = LOAD_FN1 (iter);
>>           Type2 v2 = LOAD_FN2 (v1);
>>           Type3 v3 = LOAD_FN3 (v2);
>>           ...
>>           iter = NEXT_FN (iter);
>>       }
>> 
>>      ...
>>   }
>> 
>> "LOAD_FNx()" is a conceptual function that translates its argument to a
>> result, and is composed of a sequence of operations in which there is
>> only one memory load. It is capable of representing a simple memory
>> dereference like "iter->field", or a more complicated expression like
>> "array[iter * 2 + cst].field".
>> 
>> Likewise, "NEXT_FN()" is also a conceptual function, which defines how
>> an iterator is advanced. It is able to describe a simple integer
>> iterator, a list iterator, or even a pure-function-based iterator over
>> advanced structures like a hashset/tree.
>> 
>> Then the code will be transformed to:
>> 
>>   typedef struct cache_elem {
>>       bool   init;
>>       Type1  c_v1;
>>       Type2  c_v2;
>>       Type3  c_v3;
>>   } cache_elem;
>> 
>>   cache_elem *cache_arr = calloc (iter_count, sizeof(cache_elem));
>> 
>>   outer-loop () {
>>       ...
>>       size_t cache_idx = 0;
>> 
>>       inner-loop (iter, iter_count) {
>> 
>>           if (!cache_arr[cache_idx].init) {
>>               v1 = LOAD_FN1 (iter);
>>               v2 = LOAD_FN2 (v1);
>>               v3 = LOAD_FN3 (v2);
>> 
>>               cache_arr[cache_idx].init = true;
>>               cache_arr[cache_idx].c_v1 = v1;
>>               cache_arr[cache_idx].c_v2 = v2;
>>               cache_arr[cache_idx].c_v3 = v3;
>>           }
>>           else {
>>               v1 = cache_arr[cache_idx].c_v1;
>>               v2 = cache_arr[cache_idx].c_v2;
>>               v3 = cache_arr[cache_idx].c_v3;
>>           }
>>           ...
>>           cache_idx++;
>>           iter = NEXT_FN (iter);
>>       }
>>       ...
>>   }
>> 
>>   free (cache_arr);
>
>The example in PR#98598 preloads the data before the outer loop, while
>the transformation in this RFC lazily caches the data in the inner loop,
>which is to avoid introducing unnecessary data RW hazards.
>
>> 
>> In the transformation, the cache space belongs to compiler-generated
>> temporary storage, but it comes from the user heap. We find this trick
>> is very uncommon in gcc, and we'd like to make sure whether there are
>> any special considerations or restrictions on it. Anyway, using alloca()
>> to get space from the stack is an alternative, but we should be cautious
>> to avoid incurring stack overflow from non-user logic.
>> 
>> "iter_count" stands for an upper bound on the inner-loop iteration
>> count, which must be an outer-loop invariant expression, since it
>> determines the data count in the cache space, and the space is allocated
>> before the outer loop.
>> 
>> In the example below, we can simply get such a bound for inner-loop 1
>> from its niter_desc, while for inner-loop 2 it is not that easy; but it
>> is possible to infer that the loop's iteration count never exceeds "n"
>> via comparison predicate information.
>> 
>>   foo (int n, int m)
>>   {
>>       for (int i = 0; i < n; i++ ) {      /* outer-loop 1 */
>>           ...
>>           for (int j = 0; j < m; j++) {   /* inner-loop 1 */
>>               ...
>>           }
>>           ...
>>       }
>> 
>>       for (int i = 0; i < n; i ++) {     /* outer-loop 2 */
>>           ...
>>           for (int j = 0; j < i; j++) {  /* inner-loop 2 */
>>               ...
>>           }
>>           ...
>>       }
>>   }
>> 
>> 3. Cache count threshold
>> 
>> We expect the cache space to behave like a normal stack temporary and
>> not impact execution of the program too much, especially when the heap
>> is heavily used in user code; at the very least it should not exhaust
>> memory and break the program. For this, a maximum cache count threshold
>> needs to be introduced.
>> 
>> Since memory accesses are dynamically gathered at runtime, there is a
>> cost, which lies in cache data filling, status checking and space
>> allocation. The runtime overhead of MGO could outweigh its performance
>> benefit if the cache data count is too small, so we also need a minimum
>> cache count threshold.
>> 
>> With a new runtime check against the two cache data count thresholds,
>> the MGO transformation will be:
>> 
>>   trans_ok = iter_count >= min_cache_data_count &&
>>              iter_count <= max_cache_data_count;
>> 
>>   if (trans_ok) {
>>       cache_arr = calloc (iter_count, sizeof(cache_elem));
>>   }
>> 
>>   outer-loop () {
>>       ...
>>       if (trans_ok)  {
>>           /* transformed inner-loop */
>>       }
>>       else {
>>           /* original inner-loop */
>>       }
>>   }
>> 
>>   if (trans_ok) {
>>       free (cache_arr);
>>   }
>> 
>> 4. Runtime alias check
>> 
>> In baseline MGO, memory disambiguation for a candidate load is based on
>> static alias analysis, which always gives a very conservative result.
>> A more aggressive option is to use a runtime alias check, which can
>> easily be integrated into the data-caching code. In the following pseudo
>> code, a store outside the inner loop is aliased with the candidate load,
>> and it prevents the load and its dependent loads from being optimized if
>> we only apply the optimization via static alias analysis.
>> 
>>   Type1 v0;
>> 
>>   outer-loop () {
>> 
>>       v0->field = value;  /* Aliased with v1->field */
>> 
>>       inner-loop (iter) {
>> 
>>           Type1 v1 = LOAD_FN1 (iter);
>>           Type2 v2 = v1->field;
>>           ...
>>       }
>>      ...
>>   }
>> 
>> 
>> When the runtime alias check is enabled, MGO will generate code to
>> incrementally compute the address range of the candidate load, and to
>> check at runtime whether any may-alias store overlaps that range. Once a
>> conflict is detected, execution switches to the original inner loop.
>> With the runtime alias check added, the transformation will be:
>> 
>>   Type1 v0;
>>   std::pair<void *, void *> v1_addr_range;
>> 
>>   outer-loop () {
>> 
>>       v0->field = value;
>>       ...
>>       /* Check whether v0 is covered by range.  */
>>       trans_ok &= !IN_RANGE (&v1_addr_range, v0);
>> 
>>       if (trans_ok) {
>> 
>>           inner-loop (iter) {
>> 
>>               if (!cache_arr[cache_idx].init) {
>>                  Type1 v1 = LOAD_FN1 (iter);
>>                  Type2 v2 = v1->field;
>> 
>>                  /* Extend range to cover v1. */
>>                  UPDATE_RANGE (&v1_addr_range, v1);
>>                  ...
>>               }
>>               ...
>>           }
>>       }
>>       else {
>>          /* original inner-loop */
>>       }
>>       ...
>>   }
>> 
>> Currently, we are working on an implementation of this proposal. It
>> will surely have limitations, so we hope for your comments on it.
>> Thanks.
>> 
>> Best Regards,
>> 
>> Feng
