llvmbot wrote:



Author: Kareem Ergawy (ergawy)


Upstreams the next part of `do concurrent` to OpenMP mapping pass (from
AMD's ROCm implementation). See 
https://github.com/llvm/llvm-project/pull/126026 for more context.

This PR adds loop nest detection logic. This enables us to discover
multi-range `do concurrent` loops and then map them as "collapsed" loop
nests to OpenMP.

This is a follow-up to #126026; only the latest commit is relevant.
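
For readers who want a feel for the detection rule before opening the diff, here is a minimal sketch. It is not the code in `DoConcurrentConversion.cpp` and does not use the MLIR API: `ToyOp`, `isIVUpdate`, `perfectlyNestedInner`, and `perfectNestDepth` are hypothetical names made up for this illustration. It assumes a loop counts as perfectly nested when its body holds only induction-variable bookkeeping (`fir.convert`, `fir.store`) followed by exactly one inner `fir.do_loop`.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an IR operation. This is NOT the MLIR API.
struct ToyOp {
  std::string name;        // e.g. "fir.do_loop", "fir.convert", "fir.store"
  std::vector<ToyOp> body; // nested region; only used by loop ops
};

bool isLoop(const ToyOp &op) { return op.name == "fir.do_loop"; }

// Assumption: only these two ops count as induction-variable bookkeeping.
bool isIVUpdate(const ToyOp &op) {
  return op.name == "fir.convert" || op.name == "fir.store";
}

// Returns the inner loop if `loop` contains only IV bookkeeping followed by
// exactly one nested loop as its last op; returns nullptr otherwise.
const ToyOp *perfectlyNestedInner(const ToyOp &loop) {
  if (loop.body.empty() || !isLoop(loop.body.back()))
    return nullptr;
  for (std::size_t i = 0; i + 1 < loop.body.size(); ++i)
    if (!isIVUpdate(loop.body[i]))
      return nullptr;
  return &loop.body.back();
}

// Counts how many loops, starting at `outer`, form a perfect nest.
std::size_t perfectNestDepth(const ToyOp &outer) {
  std::size_t depth = 1;
  for (const ToyOp *inner = perfectlyNestedInner(outer); inner != nullptr;
       inner = perfectlyNestedInner(*inner))
    ++depth;
  return depth;
}
```

Under these assumptions, a three-range loop such as `do concurrent(i=1:n, j=1:m, k=1:o)` yields a depth of 3, which is the kind of count a "collapsed" OpenMP nest would use. Any other op between two loop headers stops the walk at the outer loop.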


Patch is 39.64 KiB, truncated to 20.00 KiB below, full version: 

19 Files Affected:

- (modified) clang/include/clang/Driver/Options.td (+4) 
- (modified) clang/lib/Driver/ToolChains/Flang.cpp (+2-1) 
- (added) flang/docs/DoConcurrentConversionToOpenMP.md (+229) 
- (modified) flang/docs/index.md (+1) 
- (modified) flang/include/flang/Frontend/CodeGenOptions.def (+2) 
- (modified) flang/include/flang/Frontend/CodeGenOptions.h (+5) 
- (modified) flang/include/flang/Optimizer/OpenMP/Passes.h (+2) 
- (modified) flang/include/flang/Optimizer/OpenMP/Passes.td (+30) 
- (added) flang/include/flang/Optimizer/OpenMP/Utils.h (+26) 
- (modified) flang/include/flang/Optimizer/Passes/Pipelines.h (+15-3) 
- (modified) flang/lib/Frontend/CompilerInvocation.cpp (+28) 
- (modified) flang/lib/Frontend/FrontendActions.cpp (+27-5) 
- (modified) flang/lib/Optimizer/OpenMP/CMakeLists.txt (+1) 
- (added) flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp (+209) 
- (modified) flang/lib/Optimizer/Passes/Pipelines.cpp (+10-2) 
- (added) flang/test/Driver/do_concurrent_to_omp_cli.f90 (+20) 
- (added) flang/test/Transforms/DoConcurrent/basic_host.f90 (+53) 
- (added) flang/test/Transforms/DoConcurrent/loop_nest_test.f90 (+89) 
- (modified) flang/tools/bbc/bbc.cpp (+19-1) 

diff --git a/clang/include/clang/Driver/Options.td 
index 5ad187926e710..0cd3dfd3fb29d 100644
--- a/clang/include/clang/Driver/Options.td
+++ b/clang/include/clang/Driver/Options.td
@@ -6927,6 +6927,10 @@ defm loop_versioning : BoolOptionWithoutMarshalling<"f", 
 def fhermetic_module_files : Flag<["-"], "fhermetic-module-files">, 
   HelpText<"Emit hermetic module files (no nested USE association)">;
+def fdo_concurrent_to_openmp_EQ : Joined<["-"], "fdo-concurrent-to-openmp=">,
+  HelpText<"Try to map `do concurrent` loops to OpenMP [none|host|device]">,
+      Values<"none, host, device">;
 } // let Visibility = [FC1Option, FlangOption]
 def J : JoinedOrSeparate<["-"], "J">,
diff --git a/clang/lib/Driver/ToolChains/Flang.cpp 
index 9ad795edd724d..cb0b00a2fd699 100644
--- a/clang/lib/Driver/ToolChains/Flang.cpp
+++ b/clang/lib/Driver/ToolChains/Flang.cpp
@@ -153,7 +153,8 @@ void Flang::addCodegenOptions(const ArgList &Args,
-                  {options::OPT_flang_experimental_hlfir,
+                  {options::OPT_fdo_concurrent_to_openmp_EQ,
+                   options::OPT_flang_experimental_hlfir,
diff --git a/flang/docs/DoConcurrentConversionToOpenMP.md 
new file mode 100644
index 0000000000000..de2525dd8b57d
--- /dev/null
+++ b/flang/docs/DoConcurrentConversionToOpenMP.md
@@ -0,0 +1,229 @@
+<!--===- docs/DoConcurrentMappingToOpenMP.md
+   Part of the LLVM Project, under the Apache License v2.0 with LLVM 
+   See https://llvm.org/LICENSE.txt for license information.
+   SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+# `DO CONCURRENT` mapping to OpenMP
+This document seeks to describe the effort to parallelize `do concurrent` loops
+by mapping them to OpenMP worksharing constructs. The goals of this document are:
+* Describing how to instruct `flang` to map `DO CONCURRENT` loops to OpenMP
+  constructs.
+* Tracking the current status of such mapping.
+* Describing the limitations of the current implementation.
+* Describing next steps.
+* Tracking the current upstreaming status (from the AMD ROCm fork).
+## Usage
+In order to enable `do concurrent` to OpenMP mapping, `flang` adds a new
+compiler flag: `-fdo-concurrent-to-openmp`. This flag has 3 possible values:
+1. `host`: this maps `do concurrent` loops to run in parallel on the host CPU.
+   This maps such loops to the equivalent of `omp parallel do`.
+2. `device`: this maps `do concurrent` loops to run in parallel on a target device.
+   This maps such loops to the equivalent of
+   `omp target teams distribute parallel do`.
+3. `none`: this disables `do concurrent` mapping altogether. In that case, such
+   loops are emitted as sequential loops.
+The `-fdo-concurrent-to-openmp` compiler switch is currently available only when
+OpenMP is also enabled. So you need to provide the following options to flang in
+order to enable it:
+flang ... -fopenmp -fdo-concurrent-to-openmp=[host|device|none] ...
+For mapping to device, the target device architecture must be specified as well.
+See `-fopenmp-targets` and `--offload-arch` for more info.
+## Current status
+Under the hood, `do concurrent` mapping is implemented in the
+`DoConcurrentConversionPass`. This is still an experimental pass which means that:
+* It has been tested in a very limited way so far.
+* It has been tested mostly on simple synthetic inputs.
+### Loop nest detection
+On the `FIR` dialect level, the following loop:
+  do concurrent(i=1:n, j=1:m, k=1:o)
+    a(i,j,k) = i + j + k
+  end do
+is modelled as a nest of `fir.do_loop` ops such that an outer loop's region
+contains **only** the following:
+  1. The operations needed to assign/update the outer loop's induction variable.
+  1. The inner loop itself.
+So the MLIR structure for the above example looks similar to the following:
+  fir.do_loop %i_idx = %34 to %36 step %c1 unordered {
+    %i_idx_2 = fir.convert %i_idx : (index) -> i32
+    fir.store %i_idx_2 to %i_iv#1 : !fir.ref<i32>
+    fir.do_loop %j_idx = %37 to %39 step %c1_3 unordered {
+      %j_idx_2 = fir.convert %j_idx : (index) -> i32
+      fir.store %j_idx_2 to %j_iv#1 : !fir.ref<i32>
+      fir.do_loop %k_idx = %40 to %42 step %c1_5 unordered {
+        %k_idx_2 = fir.convert %k_idx : (index) -> i32
+        fir.store %k_idx_2 to %k_iv#1 : !fir.ref<i32>
+        ... loop nest body goes here ...
+      }
+    }
+  }
+This applies to multi-range loops in general; they are represented in the IR as
+a nest of `fir.do_loop` ops with the above nesting structure.
+Therefore, the pass detects such "perfectly" nested loop ops to identify multi-range
+loops and map them as "collapsed" loops in OpenMP.
+#### Further info regarding loop nest detection
+Loop nest detection is currently limited to the scenario described in the previous
+section. However, this is quite limited and can be extended in the future to cover
+more cases. For example, for the following loop nest, even though both loops are
+perfectly nested, at the moment only the outer loop is parallelized:
+do concurrent(i=1:n)
+  do concurrent(j=1:m)
+    a(i,j) = i * j
+  end do
+end do
+Similarly, for the following loop nest, even though the intervening statement `x = 41`
+does not have any memory effects that would affect parallelization, this nest is
+not parallelized either (only the outer loop is).
+do concurrent(i=1:n)
+  x = 41
+  do concurrent(j=1:m)
+    a(i,j) = i * j
+  end do
+end do
+The above also has the consequence that the `j` variable will **not** be
+privatized in the OpenMP parallel/target region. In other words, it will be
+treated as if it was a `shared` variable. For more details about privatization,
+see the "Data environment" section below.
+See `flang/test/Transforms/DoConcurrent/loop_nest_test.f90` for more examples
+of what is and is not detected as a perfect loop nest.
+More details about current status will be added along with relevant parts of the
+implementation in later upstreaming patches.
+## Next steps
+This section describes some of the open questions/issues that are not tackled 
+even in the downstream implementation.
+### Delayed privatization
+So far, we emit the privatization logic for IVs inline in the parallel/target
+region. This is enough for our purposes right now since we don't
+localize/privatize any sophisticated types of variables yet. Once we have need
+for more advanced localization through `do concurrent`'s locality specifiers
+(see below), delayed privatization will enable us to have a much cleaner IR.
+Once delayed privatization's implementation upstream is supported for the
+required constructs by the pass, we will move to it rather than inlined/early
+privatization.
+### Locality specifiers for `do concurrent`
+Locality specifiers will enable the user to control the data environment of the
+loop nest in a more fine-grained way. Implementing these specifiers on the
+`FIR` dialect level is needed in order to support this in the pass.
+Such specifiers will also unlock a potential solution to the
+non-perfectly-nested loops' IVs issue described above. In particular, for a
+non-perfectly nested loop, one middle-ground proposal/solution would be to:
+* Emit the loop's IV as shared/mapped just like we do currently.
+* Emit a warning that the IV of the loop is emitted as shared/mapped.
+* Given support for `LOCAL`, we can recommend the user to explicitly
+  localize/privatize the loop's IV if they choose to.
+#### Sharing TableGen clause records from the OpenMP dialect
+At the moment, the FIR dialect does not have a way to model locality specifiers
+on the IR level. Instead, something similar to early/eager privatization in OpenMP
+is done for the locality specifiers in `fir.do_loop` ops. Having locality specifiers
+modelled in a way similar to delayed privatization (i.e. the `omp.private` op) and
+reductions (i.e. the `omp.declare_reduction` op) can make mapping `do concurrent`
+to OpenMP (and other parallel programming models) much easier.
+Therefore, one way to approach this problem is to extract the TableGen records
+for relevant OpenMP clauses in a shared dialect for "data environment 
+and use these shared records for OpenMP, `do concurrent`, and possibly OpenACC
+as well.
+#### Supporting reductions
+Similar to locality specifiers, mapping reductions from `do concurrent` to OpenMP
+is also still an open TODO. We can potentially extend the MLIR infrastructure
+proposed in the previous section to share reduction records among the 
+relevant dialects as well.
+### More advanced detection of loop nests
+As pointed out earlier, any intervening code between the headers of 2 nested
+`do concurrent` loops prevents us from detecting this as a loop nest. In some
+cases this is overly conservative. Therefore, a more flexible detection logic
+of loop nests needs to be implemented.
+### Data-dependence analysis
+Right now, we map loop nests without analysing whether such mapping is safe to
+do or not. We probably need to at least warn the user of unsafe loop nests due
+to loop-carried dependencies.
+### Non-rectangular loop nests
+So far, we did not need to use the pass for non-rectangular loop nests. For
+example:
+do concurrent(i=1:n)
+  do concurrent(j=i:n)
+    ...
+  end do
+end do
+We defer this to the (hopefully) near future when we get the conversion in a
+good shape for the samples/projects at hand.
+### Generalizing the pass to other parallel programming models
+Once we have a stable and capable `do concurrent` to OpenMP mapping, we can take
+this in a more generalized direction and allow the pass to target other models;
+e.g. OpenACC. This goal should be kept in mind from the get-go even while only
+targeting OpenMP.
+## Upstreaming status
+- [x] Command line options for `flang` and `bbc`.
+- [x] Conversion pass skeleton (no transformations happen yet).
+- [x] Status description and tracking document (this document).
+- [x] Loop nest detection to identify multi-range loops.
+- [ ] Basic host/CPU mapping support.
+- [ ] Basic device/GPU mapping support.
+- [ ] More advanced host and device support (expanded to multiple items as needed).
diff --git a/flang/docs/index.md b/flang/docs/index.md
index c35f634746e68..913e53d4cfed9 100644
--- a/flang/docs/index.md
+++ b/flang/docs/index.md
@@ -50,6 +50,7 @@ on how to get in touch with us and to learn more about the current status.
+   DoConcurrentConversionToOpenMP
diff --git a/flang/include/flang/Frontend/CodeGenOptions.def 
index deb8d1aede518..13cda965600b5 100644
--- a/flang/include/flang/Frontend/CodeGenOptions.def
+++ b/flang/include/flang/Frontend/CodeGenOptions.def
@@ -41,5 +41,7 @@ ENUM_CODEGENOPT(DebugInfo, llvm::codegenoptions::DebugInfoKind, 4,  llvm::codeg
 ENUM_CODEGENOPT(VecLib, llvm::driver::VectorLibrary, 3, llvm::driver::VectorLibrary::NoLibrary) ///< Vector functions library to use
 ENUM_CODEGENOPT(FramePointer, llvm::FramePointerKind, 2, llvm::FramePointerKind::None) ///< Enable the usage of frame pointers
+ENUM_CODEGENOPT(DoConcurrentMapping, DoConcurrentMappingKind, 2, DoConcurrentMappingKind::DCMK_None) ///< Map `do concurrent` to OpenMP
diff --git a/flang/include/flang/Frontend/CodeGenOptions.h 
index f19943335737b..23d99e1f0897a 100644
--- a/flang/include/flang/Frontend/CodeGenOptions.h
+++ b/flang/include/flang/Frontend/CodeGenOptions.h
@@ -15,6 +15,7 @@
+#include "flang/Optimizer/OpenMP/Utils.h"
 #include "llvm/Frontend/Debug/Options.h"
 #include "llvm/Frontend/Driver/CodeGenOptions.h"
 #include "llvm/Support/CodeGen.h"
@@ -143,6 +144,10 @@ class CodeGenOptions : public CodeGenOptionsBase {
   /// (-mlarge-data-threshold).
   uint64_t LargeDataThreshold;
+  /// Optionally map `do concurrent` loops to OpenMP. This is only valid if
+  /// OpenMP is enabled.
+  using DoConcurrentMappingKind = flangomp::DoConcurrentMappingKind;
   // Define accessors/mutators for code generation options of enumeration type.
 #define CODEGENOPT(Name, Bits, Default)
 #define ENUM_CODEGENOPT(Name, Type, Bits, Default)                             
diff --git a/flang/include/flang/Optimizer/OpenMP/Passes.h 
index feb395f1a12db..c67bddbcd2704 100644
--- a/flang/include/flang/Optimizer/OpenMP/Passes.h
+++ b/flang/include/flang/Optimizer/OpenMP/Passes.h
@@ -13,6 +13,7 @@
+#include "flang/Optimizer/OpenMP/Utils.h"
 #include "mlir/Dialect/Func/IR/FuncOps.h"
 #include "mlir/IR/BuiltinOps.h"
 #include "mlir/Pass/Pass.h"
@@ -30,6 +31,7 @@ namespace flangomp {
 /// divided into units of work.
 bool shouldUseWorkshareLowering(mlir::Operation *op);
+std::unique_ptr<mlir::Pass> createDoConcurrentConversionPass(bool mapToDevice);
 } // namespace flangomp
diff --git a/flang/include/flang/Optimizer/OpenMP/Passes.td 
index 3add0c560f88d..fcc7a4ca31fef 100644
--- a/flang/include/flang/Optimizer/OpenMP/Passes.td
+++ b/flang/include/flang/Optimizer/OpenMP/Passes.td
@@ -50,6 +50,36 @@ def FunctionFilteringPass : Pass<"omp-function-filtering"> {
+def DoConcurrentConversionPass : Pass<"omp-do-concurrent-conversion", "mlir::func::FuncOp"> {
+  let summary = "Map `DO CONCURRENT` loops to OpenMP worksharing loops.";
+  let description = [{ This is an experimental pass to map `DO CONCURRENT` loops
+     to their corresponding equivalent OpenMP worksharing constructs.
+     For now the following is supported:
+       - Mapping simple loops to `parallel do`.
+     Still TODO:
+       - More extensive testing.
+  }];
+  let dependentDialects = ["mlir::omp::OpenMPDialect"];
+  let options = [
+    Option<"mapTo", "map-to",
+           "flangomp::DoConcurrentMappingKind",
+           /*default=*/"flangomp::DoConcurrentMappingKind::DCMK_None",
+           "Try to map `do concurrent` loops to OpenMP [none|host|device]",
+           [{::llvm::cl::values(
+               clEnumValN(flangomp::DoConcurrentMappingKind::DCMK_None,
+                          "none", "Do not lower `do concurrent` to OpenMP"),
+               clEnumValN(flangomp::DoConcurrentMappingKind::DCMK_Host,
+                          "host", "Lower to run in parallel on the CPU"),
+               clEnumValN(flangomp::DoConcurrentMappingKind::DCMK_Device,
+                          "device", "Lower to run in parallel on the GPU")
+           )}]>,
+  ];
 // Needs to be scheduled on Module as we create functions in it
 def LowerWorkshare : Pass<"lower-workshare", "::mlir::ModuleOp"> {
diff --git a/flang/include/flang/Optimizer/OpenMP/Utils.h 
new file mode 100644
index 0000000000000..636c768b016b7
--- /dev/null
+++ b/flang/include/flang/Optimizer/OpenMP/Utils.h
@@ -0,0 +1,26 @@
+//===-- Optimizer/OpenMP/Utils.h --------------------------------*- C++ 
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM 
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+// Coding style: https://mlir.llvm.org/getting_started/DeveloperGuide/
+namespace flangomp {
+enum class DoConcurrentMappingKind {
+  DCMK_None,  ///< Do not lower `do concurrent` to OpenMP.
+  DCMK_Host,  ///< Lower to run in parallel on the CPU.
+  DCMK_Device ///< Lower to run in parallel on the GPU.
+} // namespace flangomp
diff --git a/flang/include/flang/Optimizer/Passes/Pipelines.h 
index ef5d44ded706c..a3f59ee8dd013 100644
--- a/flang/include/flang/Optimizer/Passes/Pipelines.h
+++ b/flang/include/flang/Optimizer/Passes/Pipelines.h
@@ -128,6 +128,17 @@ void createHLFIRToFIRPassPipeline(
     mlir::PassManager &pm, bool enableOpenMP,
     llvm::OptimizationLevel optLevel = defaultOptLevel);
+struct OpenMPFIRPassPipelineOpts {
+  /// Whether code is being generated for a target device rather than the host
+  /// device
+  bool isTargetDevice;
+  /// Controls how to map `do concurrent` loops; to device, host, or none at
+  /// all.
+  Fortran::frontend::CodeGenOptions::DoConcurrentMappingKind
+      doConcurrentMappingKind;
 /// Create a pass pipeline for handling certain OpenMP transformations needed
 /// prior to FIR lowering.
@@ -135,9 +146,10 @@ void createHLFIRToFIRPassPipeline(
 /// that the FIR is correct with respect to OpenMP operations/attributes.
 /// \param pm - MLIR pass manager that will hold the pipeline definition.
-/// \param isTargetDevice - Whether code is being generated for a target device
-/// rather than the host device.
-void createOpenMPFIRPassPipeline(mlir::PassManager &pm, bool isTargetDevice);
+/// \param opts - options to control OpenMP code-gen; see struct docs for more
+/// details.
+void createOpenMPFIRPassPipeline(mlir::PassManager &pm,
+                                 OpenMPFIRPassPipelineOpts opts);
 void createDebugPasses(mlir::PassManager &pm,
diff --git a/flang/lib/Frontend/CompilerInvocation.cpp 
index f3d9432c62d3b..809e423f5aae9 100644
--- a/flang/lib/Frontend/CompilerInvocation.cpp
+++ b/flang/lib/Frontend/CompilerInvocation.cpp
@@ -157,6 +157,32 @@ static bool parseDebugArgs(Fortran::frontend::CodeGenOptions &opts,
   return true;
+static void parseDoConcurrentMapping(Fortran::frontend::CodeGenOptions &opts,
+                                     llvm::opt::ArgList &args,
+                                     clang::DiagnosticsEngine &diags) {
+  llvm::opt::Arg *arg =
+      args.getLastArg(clang::driver::options::OPT_fdo_concurrent_to_openmp_EQ);
+  if (!arg)
+    return;
+  using DoConcurrentMappingKind =
+      Fortran::frontend::CodeGenOptions::DoConcurrentMappingKind;
+  std::optional<DoConcurrentMappingKind> val =
+      llvm::StringSwitch<std::optional<DoConcurrentMappingKind>>(
+          arg->getValue())
+          .Case("none", DoConcurrentMappingKind::DCMK_None)
+          .Case("host", DoConcurrentMappingKind::DCMK_Host)
+          .Case("device", DoConcurrentMappingKind::DCMK_Device)
+          .Default(std::nullopt);
+  if (!val.has_value()) {
+    diags.Report(clang::diag::err_drv_invalid_value)
+        << arg->getAsString(args) << arg->getValue();
+    return;
+  }
+  opts.setDoConcurrentMapping(val.value());
 static bool parseVectorLibArg(Fortran::frontend::CodeGenOptions &opts,
                               llvm::opt::ArgList &args,
                               clang::DiagnosticsEngine &diags) {
@@ -426,6 +452,8 @@ static void parseCodeGenArgs(Fortran::frontend::CodeGenOptions &opts,
                    clang::driver::options::OPT_funderscoring, false)) {
     opts.Underscoring = 0;
+  parseDoConcurrentMapping(opts, args, diags);
 /// Parses all target input arguments and populates the target
diff --git a/flang/lib/Frontend/FrontendActions.cpp 
index 763c810ace0eb..ccc8c7d96135f 100644
--- a/flang/lib/Frontend/FrontendActions.cpp
+++ b/flang/lib/Frontend/FrontendActions.cp...



cfe-commits mailing list
