andrew.w.kaylor added inline comments.

================
Comment at: clang/test/CodeGen/X86/fexcess-precision.c:89
+// CHECK-EXT-NEXT:    [[EXT1:%.*]] = fpext half [[TMP1]] to float
+// CHECK-EXT-NEXT:    [[MUL:%.*]] = fmul float [[EXT]], [[EXT1]]
+// CHECK-EXT-NEXT:    [[TMP2:%.*]] = load half, ptr [[C_ADDR]]
----------------
I apologize for taking so long to get back to this. I'm not sure how to get 
this comment to show up as a response to the questions @rjmccall asked about my 
earlier remarks here. Apologies if this ends up being redundant.

I said:

> This is not what I'd expect to see. I'd expect the operations to use the half 
> type with explicit truncation inserted where needed.


@rjmccall said:

> Are you suggesting that the frontend emit half operations normally, with some 
> intrinsic to force half precision on casts and assignments, and that a 
> backend pass would aggressively promote operations between those intrinsics? 
> I think that would be a pretty error-prone representation, both in terms of 
> guaranteeing the use of excess precision in some situations (and thus getting 
> stable behavior across compiler releases) and guaranteeing truncation in 
> others (and thus preserving correctness). The frontend would have to 
> carefully emit intrinsics in a bunch of places or else default to introducing 
> a bug.


Maybe I'm asking that -fexcess-precision=fast not be an alias of 
-fexcess-precision=standard. The idea behind 'fast' as I understand it is to 
allow the compiler to generate whatever instructions are most efficient for the 
target. If the target supports native fp16 operations, the fastest code will 
use those instructions and not use excess precision. If the target does not 
support fp16, the fastest code would be to perform the calculations at 32-bit 
precision and only truncate on casts and assignments (or, even faster, not even 
then). You can see an analogy for what I mean by looking at what gcc does with 
32-bit floats when SSE is disabled: https://godbolt.org/z/xhEGbjG4G

With -fexcess-precision=fast, gcc takes liberties to make the code run as fast 
as possible. With -fexcess-precision=standard, it truncates to the source value 
on assignments, casts, or return.

If you generate code using the half type here, the backend will legalize it and 
**should** make it as fast as possible. In fact, it looks like currently the 
X86 backend will insert calls to extend and truncate the values to maintain the 
semantics of the IR (https://godbolt.org/z/YGnj4cqvv). That's sensible, but 
it's not what the X86 backend does in the 32-bit float + no SSE case.

The main thing I'd want to say here is that the front end has no business 
trying to decide what will be fastest for the target and encoding that. At 
most, the front end should be encoding the semantics in the source and setting 
some flag in the IR to indicate the excess precision handling that's allowed. 
This feels to me like it requires a new fast-math flag, maybe 'axp' = "allow 
excess precision".

The case where the front end is encoding excess precision into the IR feels 
much more like -ffp-eval-method. When the front end encodes an fpext to float 
and does a calculation, the optimizer is forced to respect that unless it can 
prove that doing the calculation at lower precision won't change the result 
(and it rarely does that). For targets that have native FP16 support, this will 
mean that -fexcess-precision=fast results in slower code.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D136176/new/

https://reviews.llvm.org/D136176

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to