<!--===- docs/DoConcurrentMappingToOpenMP.md

  Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
  See https://llvm.org/LICENSE.txt for license information.
  SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

-->

# `DO CONCURRENT` mapping to OpenMP

```{contents}
---
local:
---
```

This document describes the effort to parallelize `do concurrent` loops by
mapping them to OpenMP worksharing constructs. The goals of this document are:
* Describing how to instruct `flang` to map `DO CONCURRENT` loops to OpenMP
  constructs.
* Tracking the current status of such mapping.
* Describing the limitations of the current implementation.
* Describing next steps.
* Tracking the current upstreaming status (from the AMD ROCm fork).

## Usage

To enable `do concurrent` to OpenMP mapping, `flang` adds a new compiler flag:
`-fdo-concurrent-to-openmp`. This flag has 3 possible values:
1. `host`: maps `do concurrent` loops to run in parallel on the host CPU. Such
   loops are mapped to the equivalent of `omp parallel do`.
2. `device`: maps `do concurrent` loops to run in parallel on a target device.
   Such loops are mapped to the equivalent of
   `omp target teams distribute parallel do`.
3. `none`: disables `do concurrent` mapping altogether. In that case, such
   loops are emitted as sequential loops.

The above compiler switch is currently available only when OpenMP is also
enabled, so you need to provide the following options to `flang` in order to
enable it:
```
flang ... -fopenmp -fdo-concurrent-to-openmp=[host|device|none] ...
```

## Current status

Under the hood, `do concurrent` mapping is implemented in the
`DoConcurrentConversionPass`. This is still an experimental pass, which means
that:
* It has been tested in a very limited way so far.
* It has been tested mostly on simple synthetic inputs.

To describe the current status in more detail, the following sub-sections
describe how the pass currently behaves for single-range loops and then for
multi-range loops. They reflect the status of the downstream implementation in
AMD's ROCm fork[^1]. We are gradually upstreaming the downstream
implementation, and this document will be updated to reflect that process.
Example LIT tests referenced below might also only be available in the ROCm
fork and will be upstreamed together with the relevant parts of the code.

[^1]: https://github.com/ROCm/llvm-project/blob/amd-staging/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp

### Single-range loops

Given the following loop:
```fortran
  do concurrent(i=1:n)
    a(i) = i * i
  end do
```

#### Mapping to `host`

Mapping this loop to the `host` generates MLIR operations of the following
structure:

```
%4 = fir.address_of(@_QFEa) ...
%6:2 = hlfir.declare %4 ...

omp.parallel {
  // Allocate private copy for `i`.
  // TODO Use delayed privatization.
  %19 = fir.alloca i32 {bindc_name = "i"}
  %20:2 = hlfir.declare %19 {uniq_name = "_QFEi"} ...

  omp.wsloop {
    omp.loop_nest (%arg0) : index = (%21) to (%22) inclusive step (%c1_2) {
      %23 = fir.convert %arg0 : (index) -> i32
      // Use the privatized version of `i`.
      fir.store %23 to %20#1 : !fir.ref<i32>
      ...

      // Use the "shared" SSA value of `a`.
      %42 = hlfir.designate %6#0
      hlfir.assign %35 to %42
      ...
      omp.yield
    }
    omp.terminator
  }
  omp.terminator
}
```
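
For reference, the following is a complete program that exercises the `host`
mapping end-to-end (a minimal sketch; the program name, array size, and final
`print` are illustrative and not taken from the pass's tests):

```fortran
program do_concurrent_host
  implicit none
  integer, parameter :: n = 1024
  integer :: a(n)
  integer :: i

  ! With `-fopenmp -fdo-concurrent-to-openmp=host`, this loop should be
  ! mapped to the equivalent of `omp parallel do`, with `i` privatized as
  ! shown in the MLIR structure above.
  do concurrent(i=1:n)
    a(i) = i * i
  end do

  print *, a(n)
end program do_concurrent_host
```

Compiling this file with `flang -fopenmp -fdo-concurrent-to-openmp=host`
exercises the `host` mapping; substituting `device` for `host` exercises the
`target`-based mapping described next.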

#### Mapping to `device`

Mapping the same loop to the `device` generates MLIR operations of the
following structure:

```
// Map `a` to the `target` region. The pass automatically detects memory blocks
// and maps them to the device. The detection logic is currently still limited,
// and a lot of work is going into making it more capable.
%29 = omp.map.info ... {name = "_QFEa"}
omp.target ... map_entries(..., %29 -> %arg4 ...) {
  ...
  %51:2 = hlfir.declare %arg4
  ...
  omp.teams {
    // Allocate private copy for `i`.
    // TODO Use delayed privatization.
    %52 = fir.alloca i32 {bindc_name = "i"}
    %53:2 = hlfir.declare %52
    ...

    omp.parallel {
      omp.distribute {
        omp.wsloop {
          omp.loop_nest (%arg5) : index = (%54) to (%55) inclusive step (%c1_9) {
            // Use the privatized version of `i`.
            %56 = fir.convert %arg5 : (index) -> i32
            fir.store %56 to %53#1
            ...
            // Use the mapped version of `a`.
            ... = hlfir.designate %51#0
            ...
          }
          omp.terminator
        }
        omp.terminator
      }
      omp.terminator
    }
    omp.terminator
  }
  omp.terminator
}
```

### Multi-range loops

The pass supports multi-range loops as well. Given the following example:

```fortran
  do concurrent(i=1:n, j=1:m)
    a(i,j) = i * j
  end do
```

The generated `omp.loop_nest` operation looks like:

```
omp.loop_nest (%arg0, %arg1)
    : index = (%17, %19) to (%18, %20)
    inclusive step (%c1_2, %c1_4) {
  fir.store %arg0 to %private_i#1 : !fir.ref<i32>
  fir.store %arg1 to %private_j#1 : !fir.ref<i32>
  ...
  omp.yield
}
```

It is worth noting that both iteration variables, `i` and `j`, are privatized:
they are locally allocated inside the parallel/target OpenMP region, similar
to what the single-range example in the previous section shows.

#### Multi-range and perfectly-nested loops

Currently, on the `FIR` dialect level, the following loop:
```fortran
do concurrent(i=1:n, j=1:m)
  a(i,j) = i * j
end do
```
is modelled as a nest of `fir.do_loop` ops such that the outer loop's region
contains:
  1. The operations needed to assign/update the outer loop's induction
     variable.
  2. The inner loop itself.

So the MLIR structure looks similar to the following:
```
fir.do_loop %arg0 = %11 to %12 step %c1 unordered {
  ...
  fir.do_loop %arg1 = %14 to %15 step %c1_1 unordered {
    ...
  }
}
```
This applies to multi-range loops in general; they are represented in the IR
as a nest of `fir.do_loop` ops with the above nesting structure.

Therefore, the pass detects such "perfectly" nested loop ops to identify
multi-range loops and map them as "collapsed" loops in OpenMP.

#### Further info regarding loop nest detection

Loop-nest detection is currently limited to the scenario described in the
previous section. This is quite restrictive and can be extended in the future
to cover more cases. For example, in the following loop nest, even though both
loops are perfectly nested, only the outer loop is parallelized at the moment:
```fortran
do concurrent(i=1:n)
  do concurrent(j=1:m)
    a(i,j) = i * j
  end do
end do
```

Similarly, in the following loop nest, even though the intervening statement
`x = 41` does not have any memory effects that would affect parallelization,
the nest is also not fully parallelized (only the outer loop is):

```fortran
do concurrent(i=1:n)
  x = 41
  do concurrent(j=1:m)
    a(i,j) = i * j
  end do
end do
```
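
Until the detection logic is extended, such nests can be rewritten by hand as
a single multi-range `do concurrent`, which the pass already detects and maps
as a collapsed loop (a sketch building on the multi-range support described
above, not a transformation the pass performs automatically):

```fortran
! Only the outer loop of this nest is currently parallelized:
do concurrent(i=1:n)
  do concurrent(j=1:m)
    a(i,j) = i * j
  end do
end do

! The equivalent multi-range form is detected as a single multi-range loop
! and mapped as a collapsed OpenMP loop nest:
do concurrent(i=1:n, j=1:m)
  a(i,j) = i * j
end do
```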

----------------
ergawy wrote:

We can re-open the discussion when the relevant part is upstreamed. This part
of the doc was removed until later in any case.

https://github.com/llvm/llvm-project/pull/126026