pengfei added inline comments.

================
Comment at: clang/test/CodeGen/X86/fexcess-precision.c:89
+// CHECK-EXT-NEXT:    [[EXT1:%.*]] = fpext half [[TMP1]] to float
+// CHECK-EXT-NEXT:    [[MUL:%.*]] = fmul float [[EXT]], [[EXT1]]
+// CHECK-EXT-NEXT:    [[TMP2:%.*]] = load half, ptr [[C_ADDR]]
----------------
andrew.w.kaylor wrote:
> I apologize for taking so long to get back to this. I'm not sure how to get 
> this comment to show up as a response to the questions @rjmccall asked about 
> my earlier remarks here. Apologies if this ends up being redundant.
> 
> I said:
> 
> > This is not what I'd expect to see. I'd expect the operations to use the 
> > half type with explicit truncation inserted where needed.
> 
> 
> @rjmccall said:
> 
> > Are you suggesting that the frontend emit half operations normally, with 
> > some intrinsic to force half precision on casts and assignments, and that a 
> > backend pass would aggressively promote operations between those 
> > intrinsics? I think that would be a pretty error-prone representation, both 
> > in terms of guaranteeing the use of excess precision in some situations 
> > (and thus getting stable behavior across compiler releases) and 
> > guaranteeing truncation in others (and thus preserving correctness). The 
> > frontend would have to carefully emit intrinsics in a bunch of places or 
> > else default to introducing a bug.
> 
> 
> Maybe I'm asking that -fexcess-precision=fast not be an alias of 
> -fexcess-precision=standard. The idea behind 'fast' as I understand it is to 
> allow the compiler to generate whatever instructions are most efficient for 
> the target. If the target supports native fp16 operations, the fastest code 
> will use those instructions and not use excess precision. If the target does 
> not support fp16, the fastest code would be to perform the calculations at 
> 32-bit precision and only truncate on casts and assignments (or, even faster, 
> not even then). You can see an analogy for what I mean by looking at what gcc 
> does with 32-bit floats when SSE is disabled: https://godbolt.org/z/xhEGbjG4G
> 
> With -fexcess-precision=fast, gcc takes liberties to make the code run as 
> fast as possible. With -fexcess-precision=standard, it truncates to the 
> source value on assignments, casts, or return.
> 
> If you generate code using the half type here, the backend will legalize it 
> and **should** make it as fast as possible. In fact, it looks like currently 
> the X86 backend will insert calls to extend and truncate the values to 
> maintain the semantics of the IR (https://godbolt.org/z/YGnj4cqvv). That's 
> sensible, but it's not what the X86 backend does in the 32-bit float + no SSE 
> case.
> 
> The main thing I'd want to say here is that the front end has no business 
> trying to decide what will be fastest for the target and encoding that. At 
> most, the front end should be encoding the semantics in the source and 
> setting some flag in the IR to indicate the excess precision handling that's 
> allowed. This feels to me like it requires a new fast-math flag, maybe 'axp' 
> = "allow excess precision".
> 
> The case where the front end is encoding excess precision into the IR feels 
> much more like -ffp-eval-method. When the front end encodes an fpext to float 
> and does a calculation, the optimizer is forced to respect that unless it can 
> prove that doing the calculation at lower precision won't change the result 
> (and it rarely does that). For targets that have native FP16 support, this 
> will mean that -fexcess-precision=fast results in slower code.
> For targets that have native FP16 support, this will mean that 
> -fexcess-precision=fast results in slower code.

This won't happen: we check `TI.hasLegalHalfType()` before applying excess 
precision, so targets with native fp16 support emit half operations directly. 
It may be better to add a RUN line for the `avx512fp16` case to address the 
concern.

> At most, the front end should be encoding the semantics in the source and 
> setting some flag in the IR to indicate the excess precision handling that's 
> allowed.

I think encoding excess precision directly into the IR is the simplest way to 
convey where the casts and assignments are. Otherwise, we would have to use 
intrinsics to communicate that information to the backend, which, as @rjmccall 
explained, is more error-prone.

> If you generate code using the half type here, the backend will legalize it 
> and should make it as fast as possible.

I think the current backend implementations in both GCC and LLVM already try 
to make it as fast as possible without making the code too complicated.
The behavior in the 32-bit float + no SSE case benefits from the x87 design, 
which always computes with excess precision; with respect to the semantics of 
the IR, it is effectively treating a bug as a feature.
That said, it would be feasible to mimic the same x87-style fast behavior for 
fp16, but it would require extra backend work to globally promote all half 
types to float just to reproduce that "bug" behavior. I don't think GCC or we 
have any interest in that.

In conclusion, I think the only thing the 32-bit float + no SSE case and fp16 
have in common is that codegen (whether FE or BE) needs to respect the 
boundaries of casts and assignments under some conditions. I think the current 
FE implementation is the best way to solve the problem (without extra BE work).


================
Comment at: precision-12.5.patch:1
+commit 6200c92ab01e7f78128875c089d7b001cb387d4d
+Author: Zahira Ammarguellat <zahira.ammarguel...@intel.com>
----------------
It seems you submitted this file by mistake :)


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D136176/new/

https://reviews.llvm.org/D136176

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
