[PATCH] D99675: RFC [llvm][clang] Create new intrinsic llvm.arith.fence to control FP optimization at expression level

Melanie Blower via Phabricator via cfe-commits Wed, 31 Mar 2021 11:27:11 -0700

mibintc created this revision.
mibintc added reviewers: andrew.w.kaylor, pengfei, kbsmith1.
Herald added subscribers: dexonsmith, jfb, hiraditya.
mibintc requested review of this revision.
Herald added a subscriber: jdoerfert.
Herald added a project: LLVM.


This is a proposal to add a new llvm intrinsic, llvm.arith.fence.  The purpose 
is to provide fine control, at the expression level, over floating point 
optimization when -ffast-math (-ffp-model=fast) is enabled.  We are also 
proposing a new clang builtin that provides access to this intrinsic, as well 
as a new clang command line option `-fprotect-parens` that will be implemented 
using this intrinsic.

This patch is authored by @pengfei

Rationale
---------

Some expression transformations that are mathematically correct, such as 
reassociation and distribution, may be incorrect when dealing with finite 
precision floating point.  For example, these two expressions,

  (a + b) + c
  a + (b + c)

are mathematically equivalent, and equivalent in integer arithmetic, but may 
produce different results in floating point because each intermediate result is 
rounded.  In some floating point (FP) models, the compiler is allowed to make 
these value-unsafe transformations for performance reasons, even when the 
programmer uses parentheses explicitly.  But the compiler must always honor the 
parentheses implied by llvm.arith.fence, regardless of the FP model settings.
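
As a concrete illustration of why reassociation is value-unsafe, the following 
C sketch evaluates both groupings with operands chosen so that the intermediate 
rounding differs.  The function names `sum_left` and `sum_right` are made up 
for this illustration and are not part of the proposal; the stated results 
assume the code is compiled without fast-math, so the written grouping is kept.

```c
/* Evaluates (a + b) + c with the grouping exactly as written. */
double sum_left(double a, double b, double c) {
    double t = a + b;   /* the intermediate sum is rounded to double here */
    return t + c;
}

/* Evaluates a + (b + c) with the grouping exactly as written. */
double sum_right(double a, double b, double c) {
    double t = b + c;   /* a small c can be absorbed entirely by a large b */
    return a + t;
}
```

With a = 1e30, b = -1e30 and c = 1.0, `sum_left` yields 1.0 because a + b 
cancels exactly, while `sum_right` yields 0.0 because 1.0 is lost when it is 
added to -1e30.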

Under `-ffp-model=fast`, llvm.arith.fence provides a way to partially enforce 
ordering in an FP expression.

| Original expression           | Transformed expression | Permitted? |
| ----------------------------- | ---------------------- | ---------- |
| (a + b) + c                   | a + (b + c)            | Yes!       |
| llvm.arith.fence((a + b) + c) | a + (b + c)            | No!        |



NOTE: The llvm.arith.fence serves no purpose in value-safe FP modes like 
`-ffp-model=precise`:  FP expressions are already strictly ordered.

The new llvm intrinsic also enables the implementation of the option 
`-fprotect-parens` which is available in gfortran as well as the Intel C++ and 
Fortran compilers: icc and ifort.

Proposed llvm IR changes
------------------------

Requirements for llvm.arith.fence:

- There is one operand. The input to the intrinsic is an llvm::Value and must 
be scalar floating point or vector floating point.
- The return type is the same as the operand type.
- The return value is equivalent to the operand.

Optimizing llvm.arith.fence
---------------------------

- Constant folding may substitute the constant value of the llvm.arith.fence 
operand for the fence itself when the operand is constant.
- CSE Detection: No special changes needed: if E1 and E2 are CSE, then 
llvm.arith.fence(E1) and llvm.arith.fence(E2) are CSE.
- FMA transformation should be enabled, at least in the -ffp-model=fast case.
  - The expression “llvm.arith.fence(a * b) + c” means that “a * b” must happen 
before “+ c” and FMA guarantees that, but to prevent later optimizations from 
unpacking the FMA the correct transformation needs to be:

  llvm.arith.fence(a * b) + c  →  llvm.arith.fence(FMA(a, b, c)) 



- In the -ffp-model=fast case, FMA formation doesn’t happen until ISel, so we 
just need to add the llvm.arith.fence cases to ISel pattern matching.
- There are some choices around the FMA optimization. For this example:

  %t1 = fmul double %x, %y
  %t2 = call double @llvm.arith.fence.f64(double %t1)
  %t3 = fadd contract double %t2, %z

  1. FMA is allowed across an arith.fence if and only if the FMF `contract` 
flag is set for the llvm.arith.fence operand. (We are recommending this 
choice.)
  2. FMA is not allowed across a fence
  3. The FMF `contract` flag should be set on the llvm.arith.fence intrinsic 
call if contraction should be enabled
- Fast Math Optimization:
  - The result of a llvm.arith.fence can participate in fast math 
optimizations.  For example:

  // This transformation is legal:
  w + llvm.arith.fence(x + y) + z   →   w + z + llvm.arith.fence(x + y)

  - The operand of a llvm.arith.fence can participate in fast math 
optimizations.  For example:

  // This transformation is legal:
  llvm.arith.fence((x + y) + z)   →   llvm.arith.fence(x + (y + z))



NOTE: We want fast-math optimization within the fence, but not across the fence.



- MIR Optimization:
  - The use of a pseudo-operation in the MIR serves the same purpose as the 
intrinsic in the IR, since all the optimizations are based on pattern matching 
against known DAGs/MIs.
  - The backend simply respects the llvm.arith.fence intrinsic: it builds an 
llvm.arith.fence node during DAG/ISel and then emits a pseudo arithmetic_fence 
MI for it.
  - The pseudo arithmetic_fence MI turns into a comment when emitting assembly.

Other llvm changes needed -- utility functions
----------------------------------------------

The ValueTracking utilities will need to be taught to handle the new intrinsic. 
For example, there are utility functions like `isKnownNeverNaN()` and 
`CannotBeOrderedLessThanZero()` that will need to “look through” the intrinsic.

A simple example
----------------

  // llvm IR, llvm.arith.fence over addition.
   %4 = load double, double* %A, align 8
   %5 = load double, double* %B, align 8
   %add1 = fadd double %4, %5
   %6 = call double @llvm.arith.fence.f64(double %add1)
   %7 = load double, double* %C, align 8
   %mul = fmul double %6, %7
   store double %mul, double* %A, align 8



Example, llvm.arith.fence over memory operand
---------------------------------------------

Consider this similar example, which illustrates how the operand ‘x’ can still 
be optimized while ‘z’ is fenced.  Without the fence, the expression would be 
simplified to ‘b’ (q = a + b - a -> q = b), but ‘z’ isn’t simplified because of 
the fence.

  // llvm IR
    define dso_local float @f(float %a, float %b) local_unnamed_addr #0 {
    %x = fadd fast float %b, %a
    %tmp = call fast float @llvm.arith.fence.f32(float %x)
    %z = fsub fast float %tmp, %a
    %result = call fast float @llvm.maxnum.f32(float %z, float %b)
    ret float %result
    }
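
For readers who prefer source-level code, the fenced subtraction from this 
example can be sketched in C.  This is only an illustration: it assumes the 
proposed `__arithmetic_fence` builtin is available and can be detected with 
`__has_builtin`, and the `ARITH_FENCE` macro and `fenced_z` function are 
hypothetical names.  Where the builtin is missing, the macro falls back to the 
identity, which protects nothing under fast-math.

```c
/* Use the proposed builtin when the compiler reports it; otherwise fall back
 * to the identity (an assumption made so the sketch compiles everywhere). */
#if defined(__has_builtin)
#  if __has_builtin(__arithmetic_fence)
#    define ARITH_FENCE(x) __arithmetic_fence(x)
#  endif
#endif
#ifndef ARITH_FENCE
#  define ARITH_FENCE(x) (x)
#endif

/* Mirrors the IR: x = b + a; tmp = fence(x); z = tmp - a. */
float fenced_z(float a, float b) {
    float tmp = ARITH_FENCE(b + a);  /* the rounded sum must be materialized */
    return tmp - a;                  /* must not be folded to just b */
}
```

For a = 1e30f and b = 1.0f the rounded sum equals a, so z is 0.0f rather than 
b.  Folding z to b would change the result, which is what the fence is meant 
to prevent.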



Clang changes to take advantage of this intrinsic
-------------------------------------------------

- Add new clang builtin __arithmetic_fence
  - Add builtin definition
    - There is one operand. Any kind of expression, including memory operand.
    - The return type is the same as the operand type. The result of the 
intrinsic is the value of its rvalue operand.
    - The operand type can be any scalar floating point type, complex, or 
vector with float or complex element type.
    - The invocation of __arithmetic_fence is not a C/C++ constant expression, 
even if the operands are constant.

  - Add semantic checks and test cases
  - Modify clang/codegen to generate the llvm.arith.fence intrinsic
- Add support for a new command-line option `-fprotect-parens`, which honors 
parentheses within a floating point expression. The default is 
`-fno-protect-parens`. For example,

  // Compile with -ffast-math
  double A,B,C;
  A = __arithmetic_fence(A+B)*C;
  
  // llvm IR
   %4 = load double, double* %A, align 8
   %5 = load double, double* %B, align 8
   %add1 = fadd double %4, %5
   %6 = call double @llvm.arith.fence.f64(double %add1)
   %7 = load double, double* %C, align 8
   %mul = fmul double %6, %7
   store double %mul, double* %A, align 8



- Motivation: the new clang builtin provides clang compatibility with the Intel 
C++ compiler builtin `__fence` which has similar semantics, and likewise 
enables implementation of the option `-fprotect-parens`.  The new builtin 
provides the clang programmer control over floating point optimizations at the 
expression level.
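
One kind of code that could benefit from this expression-level control is an 
error-free transformation such as Knuth's TwoSum, whose intermediate 
subtractions cancel algebraically and are therefore candidates for removal 
under fast-math.  The sketch below assumes the proposed `__arithmetic_fence` 
builtin is available and detectable via `__has_builtin`; `ARITH_FENCE` and 
`two_sum_error` are hypothetical names, and the identity fallback is only 
there so the code compiles with compilers that lack the builtin.

```c
/* Use the proposed builtin when the compiler reports it; otherwise fall back
 * to the identity (an assumption; the fallback protects nothing). */
#if defined(__has_builtin)
#  if __has_builtin(__arithmetic_fence)
#    define ARITH_FENCE(x) __arithmetic_fence(x)
#  endif
#endif
#ifndef ARITH_FENCE
#  define ARITH_FENCE(x) (x)
#endif

/* Knuth's TwoSum: returns the rounding error e of the floating-point sum
 * s = a + b, so that s + e equals the exact sum of a and b. */
double two_sum_error(double a, double b) {
    double s  = ARITH_FENCE(a + b);                    /* rounded sum */
    double bb = ARITH_FENCE(s - a);                    /* part of b that reached s */
    double da = ARITH_FENCE(a - ARITH_FENCE(s - bb));  /* error attributed to a */
    double db = ARITH_FENCE(b - bb);                   /* error attributed to b */
    return da + db;
}
```

For example, `two_sum_error(1.0, 1e-16)` returns 1e-16 when the grouping is 
preserved, because 1e-16 is entirely lost in the rounded sum 1.0.  Under the 
semantics described in this proposal, fencing each intermediate is intended to 
keep that behavior when fast-math is enabled.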



Pros & Cons
-----------

  1. Pros
    - Increases expressiveness and precise control over floating point 
calculations.
    - Provides a desirable compatibility feature from industrial compilers.
  2. Cons
    - Intrinsic bloat.
    - Some of LLVM's optimizations need to understand the llvm.arith.fence 
semantics in order to retain optimization capabilities. This will require at 
least some engineering effort.
    - Any target that wants to support this has to make modifications to their 
back-end.




Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D99675

Files:
  llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
  llvm/include/llvm/CodeGen/BasicTTIImpl.h
  llvm/include/llvm/CodeGen/ISDOpcodes.h
  llvm/include/llvm/IR/Intrinsics.td
  llvm/include/llvm/Support/TargetOpcodes.def
  llvm/include/llvm/Target/Target.td
  llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
  llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp

Index: llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
===================================================================
--- llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -7210,6 +7210,14 @@
     }
     break;
   }
+  case Intrinsic::arithmetic_fence: {
+    SDValue Val = getValue(FPI.getArgOperand(0));
+    SDValue N(DAG.getMachineNode(TargetOpcode::ARITH_FENCE, getCurSDLoc(),
+                                 Val.getValueType(), Val),
+              0);
+    setValue(&FPI, N);
+    return;
+  }
   }
 
   // A few strict DAG nodes carry additional operands that are not
Index: llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
===================================================================
--- llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
+++ llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp
@@ -1265,6 +1265,9 @@
       case TargetOpcode::PSEUDO_PROBE:
         emitPseudoProbe(MI);
         break;
+      case TargetOpcode::ARITH_FENCE:
+        OutStreamer->emitRawComment("ARITH_FENCE");
+        break;
       default:
         emitInstruction(&MI);
         if (CanDoExtraAnalysis) {
Index: llvm/include/llvm/Target/Target.td
===================================================================
--- llvm/include/llvm/Target/Target.td
+++ llvm/include/llvm/Target/Target.td
@@ -1172,6 +1172,12 @@
   let AsmString = "PSEUDO_PROBE";
   let hasSideEffects = 1;
 }
+def ARITH_FENCE : StandardPseudoInstruction {
+  let OutOperandList = (outs unknown:$dst);
+  let InOperandList = (ins unknown:$src);
+  let AsmString = "";
+  let hasSideEffects = false;
+}
 
 def STACKMAP : StandardPseudoInstruction {
   let OutOperandList = (outs);
Index: llvm/include/llvm/Support/TargetOpcodes.def
===================================================================
--- llvm/include/llvm/Support/TargetOpcodes.def
+++ llvm/include/llvm/Support/TargetOpcodes.def
@@ -117,6 +117,9 @@
 /// Pseudo probe
 HANDLE_TARGET_OPCODE(PSEUDO_PROBE)
 
+/// Arithmetic fence.
+HANDLE_TARGET_OPCODE(ARITH_FENCE)
+
 /// A Stackmap instruction captures the location of live variables at its
 /// position in the instruction stream. It is followed by a shadow of bytes
 /// that must lie within the function and not contain another stackmap.
Index: llvm/include/llvm/IR/Intrinsics.td
===================================================================
--- llvm/include/llvm/IR/Intrinsics.td
+++ llvm/include/llvm/IR/Intrinsics.td
@@ -1311,6 +1311,9 @@
 def int_pseudoprobe : Intrinsic<[], [llvm_i64_ty, llvm_i64_ty, llvm_i32_ty, llvm_i64_ty],
                                     [IntrInaccessibleMemOnly, IntrWillReturn]>;
 
+// Arithmetic fence intrinsic.
+def int_arithmetic_fence : Intrinsic<[llvm_anyfloat_ty], [LLVMMatchType<0>], [IntrNoMem]>;
+
 // Intrinsics to support half precision floating point format
 let IntrProperties = [IntrNoMem, IntrWillReturn] in {
 def int_convert_to_fp16   : DefaultAttrsIntrinsic<[llvm_i16_ty], [llvm_anyfloat_ty]>;
Index: llvm/include/llvm/CodeGen/ISDOpcodes.h
===================================================================
--- llvm/include/llvm/CodeGen/ISDOpcodes.h
+++ llvm/include/llvm/CodeGen/ISDOpcodes.h
@@ -1085,6 +1085,10 @@
   /// specifier.
   PREFETCH,
 
+  /// ARITH_FENCE - This corresponds to an arithmetic fence intrinsic. Both its
+  /// operand and output are the same floating type.
+  ARITH_FENCE,
+
   /// OUTCHAIN = ATOMIC_FENCE(INCHAIN, ordering, scope)
   /// This corresponds to the fence instruction. It takes an input chain, and
   /// two integer constants: an AtomicOrdering and a SynchronizationScope.
Index: llvm/include/llvm/CodeGen/BasicTTIImpl.h
===================================================================
--- llvm/include/llvm/CodeGen/BasicTTIImpl.h
+++ llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -1515,6 +1515,7 @@
     case Intrinsic::lifetime_end:
     case Intrinsic::sideeffect:
     case Intrinsic::pseudoprobe:
+    case Intrinsic::arithmetic_fence:
       return 0;
     case Intrinsic::masked_store: {
       Type *Ty = Tys[0];
Index: llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
===================================================================
--- llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -567,6 +567,7 @@
     case Intrinsic::assume:
     case Intrinsic::sideeffect:
     case Intrinsic::pseudoprobe:
+    case Intrinsic::arithmetic_fence:
     case Intrinsic::dbg_declare:
     case Intrinsic::dbg_value:
     case Intrinsic::dbg_label:

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
