================
@@ -1614,6 +1650,50 @@ OpenMPIRBuilder::InsertPointOrErrorTy 
OpenMPIRBuilder::createParallel(
                              IfCondition, NumThreads, PrivTID, PrivTIDAddr,
                              ThreadID, ToBeDeletedVec);
     };
+
+    std::optional<omp::OMPTgtExecModeFlags> ExecMode =
+        getTargetKernelExecMode(*OuterFn);
+
+    // If OuterFn is not a Generic kernel, skip custom allocation. This causes
+    // the CodeExtractor to follow its default behavior. Otherwise, we need to
+    // use device shared memory to allocate argument structures.
+    if (ExecMode && *ExecMode & OMP_TGT_EXEC_MODE_GENERIC) {
+      OI.CustomArgAllocatorCB = [this,
+                                 EntryBB](BasicBlock *, BasicBlock::iterator,
+                                          Type *ArgTy, const Twine &Name) {
+        // Instead of using the insertion point provided by the CodeExtractor,
+        // here we need to use the block that eventually calls the outlined
+        // function for the `parallel` construct.
+        //
+        // The reason is that the explicit deallocation call will be inserted
+        // within the outlined function, whereas the alloca insertion point
+        // might actually be located somewhere else in the caller. This becomes
+        // a problem when e.g. `parallel` is inside of a `distribute` 
construct,
+        // because the deallocation would be executed multiple times and the
+        // allocation just once (outside of the loop).
+        //
+        // TODO: Ideally, we'd want to do the allocation and deallocation
+        // outside of the `parallel` outlined function, hence using here the
+        // insertion point provided by the CodeExtractor. We can't do this at
+        // the moment because there is currently no way of passing an eligible
+        // insertion point for the explicit deallocation to the CodeExtractor,
+        // as that block is created (at least when nested inside of
+        // `distribute`) sometime after createParallel() completed, so it can't
+        // be stored in the OutlineInfo structure here.
----------------
skatrak wrote:

Did you have any place in mind as to where a pointer to that block could be 
stored? The problem I see is that we'd have to link it to the exit block that's 
paired to the alloca block, which may have been created by the translation of 
any operation somewhere N levels up the call stack.

I spent about a week trying to get a proper deallocation block to be paired to 
the `OI.OuterAllocaBB`, and here's a summary of what I found:
- the insertion point used to figure out what the allocation block should be is 
passed to the `createParallel` call;
- this is obtained in `convertOmpParallel`, during MLIR to LLVM IR translation, 
by looking at the nearest `OpenMPAllocaStackFrame` or by using the function's 
entry block if no eligible stack frames are found;
- `OpenMPAllocaStackFrame`s are created right before translating the body of 
some operations, which is e.g. the case of `omp.distribute`, and they will 
point to a block that has been expressly created for child ops to perform 
allocations;
- other entry/exit blocks might be introduced in between, since not all 
operations push new `OpenMPAllocaStackFrame`s, so by the time we get to 
`OpenMPIRBuilder::createParallel` we have no way of knowing what block is 
supposed to be the exit for the given outer alloca block; and
- a pointer to the eligible deallocation block associated to the alloca block 
is only currently available at the point where both are created (see e.g. 
`OpenMPIRBuilder::createDistribute`).

So, after this previous investigation and looking at it again with fresh eyes, 
I think that one way we could make this work would be to:
1. Extend `OpenMPAllocaStackFrame` to hold alloc and dealloc insertion points.
2. Add to `OpenMPIRBuilder::BodyGenCallbackTy` (and potentially to other 
similar callback types) an additional insert point argument for deallocations.
3. Update body callbacks creating an `OpenMPAllocaStackFrame` in MLIR to LLVM 
IR to store the new deallocation insertion point argument into the stack frame.
4. Modify OMPIRBuilder translation functions taking the aforementioned 
callback(s) to pass their exit blocks as deallocation points when generating 
their body.
5. Add a new `OuterDeallocBB` to `OpenMPIRBuilder::OutlineInfo` and pass it as 
the deallocation block to the `CodeExtractor` constructor in 
`OpenMPIRBuilder::finalize`.
6. Update `OpenMPIRBuilder::createParallel` to take the new `deallocIP` 
returned by `findAllocaInsertPoint` and store it as the `OI.OuterDeallocBB`.

I think that should work for Flang and let us use the `CodeExtractor`-provided 
insert point for both allocations and deallocations, while also doing this in 
the right spot (i.e. not inside of a loop whenever possible). However, there's 
also Clang uses of the OMPIRBuilder to contend with, and a quick look tells me 
we won't have easy access to an exit block as we do for Flang, though I'm not 
too familiar with it. As it stands, Clang won't run this one code path where we 
actually use that deallocation insertion point, so passing `null` around and 
having a sane default behavior will work, but eventually we'd have to deal with 
that issue.

Do you have an opinion on this? I can work on those changes as another patch in 
the stack to hopefully make things a bit better or I could make the changes 
here, though I think such fundamental changes are better done as an independent 
PR. Or maybe there's a better way I haven't thought about, let me know.

https://github.com/llvm/llvm-project/pull/150925
_______________________________________________
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits

Reply via email to