================
@@ -219,6 +220,54 @@ bool DivergenceLoweringHelper::lowerTemporalDivergence() {
   return false;
 }
 
+bool DivergenceLoweringHelper::lowerTemporalDivergenceI1() {
+  MachineRegisterInfo::VRegAttrs BoolS1 = {ST->getBoolRC(), LLT::scalar(1)};
+  initializeLaneMaskRegisterAttributes(BoolS1);
+
+  for (auto [Inst, UseInst, Cycle] : MUI->getTemporalDivergenceList()) {
+    Register Reg = Inst->getOperand(0).getReg();
+    if (MRI->getType(Reg) != LLT::scalar(1))
+      continue;
+
+    Register MergedMask = MRI->createVirtualRegister(BoolS1);
+    Register PrevIterMask = MRI->createVirtualRegister(BoolS1);
+
+    MachineBasicBlock *CycleHeaderMBB = Cycle->getHeader();
+    SmallVector<MachineBasicBlock *, 1> ExitingBlocks;
+    Cycle->getExitingBlocks(ExitingBlocks);
+    assert(ExitingBlocks.size() == 1);
+    MachineBasicBlock *CycleExitingMBB = ExitingBlocks[0];
+
+    B.setInsertPt(*CycleHeaderMBB, CycleHeaderMBB->begin());
+    auto CrossIterPHI = B.buildInstr(AMDGPU::PHI).addDef(PrevIterMask);
+
+    // We only care about cycle iterration path - merge Reg with previous
+    // iteration. For other incomings use implicit def.
+    // Predecessors should be CyclePredecessor and CycleExitingMBB.
+    // In older versions of irreducible control flow lowering there could be
+    // cases with more predecessors. To keep this lowering as generic as
+    // possible also handle those cases.
+    for (auto MBB : CycleHeaderMBB->predecessors()) {
+      if (MBB == CycleExitingMBB) {
+        CrossIterPHI.addReg(MergedMask);
+      } else {
+        B.setInsertPt(*MBB, MBB->getFirstTerminator());
+        auto ImplDef = B.buildInstr(AMDGPU::IMPLICIT_DEF, {BoolS1}, {});
+        CrossIterPHI.addReg(ImplDef.getReg(0));
+      }
+      CrossIterPHI.addMBB(MBB);
+    }
----------------
nhaehnle wrote:

This still feels vaguely off to me.

First of all, can you please add a `Cycle->isReducible()` assert here just in 
case, since you're relying on there only being one header? (Actually, that 
assert should probably be in `GenericCycle::getHeader`.)

Second, I'm worried about that case distinction between different kinds of 
predecessors. It really doesn't feel properly generic to me. Latches and 
exiting blocks aren't necessarily the same thing. (The current structurizer 
tends to make it that way, but even with the structurizer we probably don't 
have a guarantee, let alone with a future different control flow lowering.)

Let's think through some cases:

* If the predecessor is outside the cycle, it should be an `IMPLICIT_DEF`
* If it is inside the cycle, i.e. a latch, there are two cases:
  * If the predecessor is dominated by `Inst`, then clearly use `MergedMask`
  * Otherwise, it's actually not clear: could it be possible that control flow 
leaves the dominance region of `Inst` and goes around the loop again?

On that last point, you could perhaps make a case that because the loop is 
temporally divergent, having the CFG in wave CFG form requires a certain 
structure. But I'm not convinced. In fact, consider:
```mermaid
flowchart TD
    Header --> B
    Header --> Inst
    Inst --> B
    B --> Header
    Inst --> C
    C --> Header
    C --> Exit
```
Let's say the branches of Header and C are uniform; only the branch of the 
basic block containing Inst is divergent.

Then we have temporal divergence in Inst, and need to handle the `i1`. But the 
CFG is already *reconvergent*: the divergent branch is post-dominated by one of 
its immediate successors. This means it is possible to insert EXEC masking code 
for this CFG correctly without structuring it. Therefore, the code here should 
really handle this case correctly.

I think this calls for using MachineSSAUpdater for a robust solution:

* Initialize the updater with IMPLICIT_DEF in every predecessor of the header 
that is outside the cycle.
* Insert the MergedMask into the updater for the `Inst->getParent()`
* Then build the lane mask merging, querying the updater for the previous mask 
(i.e., let the updater create that register) 
* The MergedMask can still be used by UseInst due to dominance

Except now there's an additional issue with caching the MergedMask that doesn't 
exist for the non-i1 case: Different `UseInst`s could appear with respect to 
different `Cycle`s. The non-i1 case doesn't actually care about the cycle 
structure, but the code here does.

In particular, with my suggestions, it will use the cycle to determine how to 
initialize the updater.

We can still get away with a single `MergedMask` per `Inst`:

* The lazy way to do it would be to not initialize the updater with 
IMPLICIT_DEFs and just let it figure them out by itself. This would generate 
correct code.
  * However, this has a compile-time cost and could end up being too 
pessimistic in terms of the generated code.
  * The Inst and all its uses could be contained in a cycle. This naive way 
would lead to a phi node at the header of this cycle, and therefore needlessly 
extending the live range of the value to the entire outer cycle. In this 
example, the values we create only need to live from InnerHeader to UseInst, 
but the naive use of the updater will extend them from OuterHeader all the way 
through OuterLatch:
```mermaid
flowchart TD
    Start --> OuterHeader --> InnerHeader --> Inst --> InnerHeader
    Inst --> UseInst --> OuterLatch --> OuterHeader
    OuterLatch --> Exit
```
* The smarter way is to initialize the updater in the way I described, using 
the *largest* relevant cycle.

So instead of merely caching the `MergedMask`, we should collect all `UseInst`s 
and determine the largest relevant cycle (as returned by the uniformity 
analysis).

https://github.com/llvm/llvm-project/pull/124299
_______________________________________________
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits

Reply via email to