On January 17, 2021 2:23:55 AM GMT+01:00, JiangNing OS via Gcc <gcc@gcc.gnu.org> wrote:
>> -----Original Message-----
>> From: Feng Xue OS <f...@os.amperecomputing.com>
>> Sent: Thursday, January 14, 2021 12:28 PM
>> To: gcc@gcc.gnu.org
>> Cc: JiangNing OS <jiangn...@os.amperecomputing.com>; Hao Liu OS
>> <h...@os.amperecomputing.com>
>> Subject: [RFC] A memory gathering optimization for loop
>>
>> 1. Background
>>
>> In a loop, it is optimal if only one memory stream is activated, that
>> is, all memory operations sequentially access one data region. But that
>> is not always the case, for example when traversing a linked list or
>> manipulating discrete arrays. In such scenarios, the loop contains
>> multiple scattered memory accesses, which can trigger inefficient
>> multi-way hardware prefetching and lower the cpu cache hit rate. The
>> tracker pr98598 (https://gcc.gnu.org/bugzilla//show_bug.cgi?id=98598)
>> describes a related problem that we intend to target.
>>
>> 2. Memory gathering optimization (MGO)
>>
>> For nested loops, if scattered memory accesses inside the inner loop
>> remain unchanged in each iteration of the outer loop, we can dynamically
>> gather the loaded data into a sequential memory region when the first
>> access in the inner loop takes place. This way hardware prefetching
>> pressure can be reduced, and the cpu cache hit rate can be improved. We
>> name this optimization MGO (memory gathering optimization). An example
>> is given below; it is a little different from pr98598, and could not be
>> optimized by loop distribution.
>>
>> struct A { int **p1; };
>>
>> int foo (int n, struct A *array)
>> {
>>   int sum = 0;
>>
>>   for (int i = 0; i < n; i++) {
>>     for (int j = 0; j < i; j++) {  /* iteration count is i */
>>       int **p1 = array[j].p1;
>>       int *p2 = *p1;
>>
>>       sum += *p2;
>>     }
>>   }
>>
>>   return sum;
>> }
>>
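To make the proposed transform concrete, here is a hand-written C sketch of what MGO could produce for the foo() example above. It lazily caches the final dereferenced pointer on the first inner-loop pass; the cache_elem layout, the n - 1 iteration bound, and the abort-on-allocation-failure policy are choices of this sketch, not part of the RFC, and real compiler output would additionally apply the thresholds and runtime checks the RFC discusses later.

```c
#include <stdbool.h>
#include <stdlib.h>

struct A { int **p1; };

/* Per-slot cache: one entry per inner-loop iteration index.  */
typedef struct cache_elem {
  bool init;
  int *c_p2;   /* cached result of *array[j].p1 */
} cache_elem;

int foo_mgo (int n, struct A *array)
{
  int sum = 0;
  /* The inner loop runs at most n - 1 iterations (j < i < n).  */
  size_t count = n > 1 ? (size_t) (n - 1) : 1;
  cache_elem *cache_arr = calloc (count, sizeof (cache_elem));
  if (!cache_arr)
    abort ();  /* sketch-only policy for allocation failure */

  for (int i = 0; i < n; i++) {
    for (int j = 0; j < i; j++) {
      cache_elem *ce = &cache_arr[j];
      if (!ce->init) {
        ce->c_p2 = *array[j].p1;  /* first access: fill the cache */
        ce->init = true;
      }
      sum += *ce->c_p2;  /* later iterations read sequential memory */
    }
  }

  free (cache_arr);
  return sum;
}
```

Note that the scattered loads through array[j].p1 are performed only once per index j; every repeated visit in later outer-loop iterations reads from the compact cache_arr instead.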
The number of memory streams in this loop is limited (practically 1).
How would the transform be affected by issuing prefetches for array at a
well-placed spot? That is, what is the hazard we avoid (besides the
obvious data dependence shortening in case the load from array is in
cache)?

Richard.

>> Although this example may seem artificial, this kind of data reuse in
>> nested loops is often seen in real applications, especially ones
>> written in Fortran or with the C++ STL.
>>
>> To simplify the explanation of this memory gathering optimization,
>> pseudo code for the original nested loops is given as:
>>
>> outer-loop ( ) {  /* data in inner loop are also invariant in outer loop. */
>>   ...
>>   inner-loop (iter, iter_count) {  /* inner loop to apply MGO */
>>
>>     Type1 v1 = LOAD_FN1 (iter);
>>     Type2 v2 = LOAD_FN2 (v1);
>>     Type3 v3 = LOAD_FN3 (v2);
>>     ...
>>     iter = NEXT_FN (iter);
>>   }
>>
>>   ...
>> }
>>
>> "LOAD_FNx()" is a conceptual function that translates its argument to
>> a result, and is composed of a sequence of operations in which there is
>> only one memory load. It is capable of representing a simple memory
>> dereference like "iter->field", or a more complicated expression like
>> "array[iter * 2 + cst].field".
>>
>> Likewise, "NEXT_FN()" is also a conceptual function which defines how
>> an iterator is advanced. It is able to describe a simple integer
>> iterator, a list iterator, or even a pure-function-based iterator on
>> advanced data structures like a hashset or tree.
>>
>> Then the code will be transformed to:
>>
>> typedef struct cache_elem {
>>   bool  init;
>>   Type1 c_v1;
>>   Type2 c_v2;
>>   Type3 c_v3;
>> } cache_elem;
>>
>> cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));
>>
>> outer-loop () {
>>   ...
>>   size_t cache_idx = 0;
>>
>>   inner-loop (iter, iter_count) {
>>
>>     if (!cache_arr[cache_idx].init) {
>>       v1 = LOAD_FN1 (iter);
>>       v2 = LOAD_FN2 (v1);
>>       v3 = LOAD_FN3 (v2);
>>
>>       cache_arr[cache_idx].init = true;
>>       cache_arr[cache_idx].c_v1 = v1;
>>       cache_arr[cache_idx].c_v2 = v2;
>>       cache_arr[cache_idx].c_v3 = v3;
>>     }
>>     else {
>>       v1 = cache_arr[cache_idx].c_v1;
>>       v2 = cache_arr[cache_idx].c_v2;
>>       v3 = cache_arr[cache_idx].c_v3;
>>     }
>>     ...
>>     cache_idx++;
>>     iter = NEXT_FN (iter);
>>   }
>>   ...
>> }
>>
>> free (cache_arr);
>
> The example in PR#98598 preloads the data before the outer loop, while
> the transformation in this RFC lazily caches the data in the inner
> loop, which is to avoid introducing unnecessary data RW hazards.
>
>> In the transformation, the cache space belongs to compiler-generated
>> temporary storage, but it comes from the user heap. We find this trick
>> is very uncommon in gcc, so we would like to make sure whether there
>> are any special considerations or restrictions on it. Using alloca()
>> to get space from the stack is an alternative, but we should be
>> cautious to avoid incurring stack overflow from non-user logic.
>>
>> "iter_count" stands for an upper bound on the inner-loop iteration
>> count. It must be an outer-loop invariant expression, since it
>> determines the data count in the cache space, and the space is
>> allocated before the outer loop.
>>
>> In the example below, we can simply get such a bound for inner-loop 1
>> from its niter_desc, while for inner-loop 2 it is not that easy, but
>> it is possible to infer that the loop's iteration count never exceeds
>> "n" via comparison predicate information.
>>
>> foo (int n, int m)
>> {
>>   for (int i = 0; i < n; i++) {  /* outer-loop 1 */
>>     ...
>>     for (int j = 0; j < m; j++) {  /* inner-loop 1 */
>>       ...
>>     }
>>     ...
>>   }
>>
>>   for (int i = 0; i < n; i++) {  /* outer-loop 2 */
>>     ...
>>     for (int j = 0; j < i; j++) {  /* inner-loop 2 */
>>       ...
>>     }
>>     ...
>>   }
>> }
>>
>> 3. Cache count threshold
>>
>> We expect the cache space to work like a normal stack temporary and
>> not impact execution of the program too much, especially when the heap
>> is heavily used by user code; at the very least it should not exhaust
>> memory and break the program. For this, a maximum cache count
>> threshold needs to be introduced.
>>
>> Since memory accesses are dynamically gathered at runtime, there is a
>> cost, which lies in cache data filling, status checks and space
>> allocation. The runtime overhead of MGO could outweigh its performance
>> benefit if the cache data count is too small, so we also need a
>> minimum cache count threshold.
>>
>> With a new runtime check against the two cache data count thresholds,
>> the MGO transformation will be:
>>
>> trans_ok = iter_count >= min_cache_data_count &&
>>            iter_count <= max_cache_data_count;
>>
>> if (trans_ok) {
>>   cache_arr = calloc (iter_count, sizeof (cache_elem));
>> }
>>
>> outer-loop () {
>>   ...
>>   if (trans_ok) {
>>     /* transformed inner-loop */
>>   }
>>   else {
>>     /* original inner-loop */
>>   }
>> }
>>
>> if (trans_ok) {
>>   free (cache_arr);
>> }
>>
>> 4. Runtime alias check
>>
>> In baseline MGO, memory disambiguation for a candidate load is based
>> on static alias analysis, which always provides a very conservative
>> result. A more aggressive option is to use a runtime alias check,
>> which can be easily integrated into the data-caching code. In the
>> following pseudo code, a store outside the inner loop is aliased with
>> the candidate load, and it prevents the load and its dependent loads
>> from being optimized if we only apply the optimization via static
>> alias analysis.
>>
>> Type1 v0;
>>
>> outer-loop () {
>>
>>   v0->field = value;  /* Aliased with v1->field */
>>
>>   inner-loop (iter) {
>>
>>     Type1 v1 = LOAD_FN1 (iter);
>>     Type2 v2 = v1->field;
>>     ...
>>   }
>>   ...
>> }
>>
>> When the runtime alias check is enabled, MGO will generate code to
>> incrementally compute the address range of the candidate load, and to
>> check at runtime whether any may-alias store overlaps that range. Once
>> a conflict is detected, execution switches to the original inner loop.
>> With the runtime alias check added, the transformation will be:
>>
>> Type1 v0;
>> std::pair<void *, void *> v1_addr_range;
>>
>> outer-loop () {
>>
>>   v0->field = value;
>>   ...
>>   /* Check whether v0 is covered by the range. */
>>   trans_ok &= !IN_RANGE (&v1_addr_range, v0);
>>
>>   if (trans_ok) {
>>
>>     inner-loop (iter) {
>>
>>       if (!cache_arr[cache_idx].init) {
>>         Type1 v1 = LOAD_FN1 (iter);
>>         Type2 v2 = v1->field;
>>
>>         /* Extend the range to cover v1. */
>>         UPDATE_RANGE (&v1_addr_range, v1);
>>         ...
>>       }
>>       ...
>>     }
>>   }
>>   else {
>>     /* original inner-loop */
>>   }
>>   ...
>> }
>>
>> Currently, we are working on an implementation of this proposal. It
>> surely has limitations, so we hope for your comments on it. Thanks.
>>
>> Best Regards,
>>
>> Feng
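For reference, the IN_RANGE and UPDATE_RANGE primitives used in the pseudo code above could be realized along the following lines. This is a minimal sketch, not part of the RFC: the struct layout, function names, the size parameter, and the use of uintptr_t for address arithmetic are all assumptions of this illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Conservative address range covering every byte the candidate load
   has touched so far.  The range is empty when lo == hi == 0.  */
typedef struct addr_range {
  uintptr_t lo;  /* lowest address cached from */
  uintptr_t hi;  /* one past the highest address cached from */
} addr_range;

/* Extend RANGE to cover SIZE bytes starting at ADDR.  */
static void
update_range (addr_range *range, const void *addr, size_t size)
{
  uintptr_t lo = (uintptr_t) addr;
  uintptr_t hi = lo + size;
  if (range->lo == 0 || lo < range->lo)
    range->lo = lo;
  if (hi > range->hi)
    range->hi = hi;
}

/* Return true if any of the SIZE bytes at ADDR overlap RANGE, i.e. a
   store to ADDR may alias a cached load and the transform must bail
   out to the original inner loop.  */
static bool
in_range (const addr_range *range, const void *addr, size_t size)
{
  uintptr_t lo = (uintptr_t) addr;
  return range->lo != 0 && lo < range->hi && lo + size > range->lo;
}
```

Because update_range only widens [lo, hi), the check is conservative: it may report a conflict for an address that lies between two cached loads, which costs performance but never correctness.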