andrew.w.kaylor added inline comments.

================
Comment at: clang/test/CodeGen/X86/fexcess-precision.c:89
+// CHECK-EXT-NEXT:    [[EXT1:%.*]] = fpext half [[TMP1]] to float
+// CHECK-EXT-NEXT:    [[MUL:%.*]] = fmul float [[EXT]], [[EXT1]]
+// CHECK-EXT-NEXT:    [[TMP2:%.*]] = load half, ptr [[C_ADDR]]
----------------
pengfei wrote:
> andrew.w.kaylor wrote:
> > I apologize for taking so long to get back to this. I'm not sure how to get 
> > this comment to show up as a response to the questions @rjmccall asked 
> > about my earlier remarks here. Apologies if this ends up being redundant.
> > 
> > I said:
> > 
> > > This is not what I'd expect to see. I'd expect the operations to use the 
> > > half type with explicit truncation inserted where needed.
> > 
> > 
> > @rjmccall said:
> > 
> > > Are you suggesting that the frontend emit half operations normally, with 
> > > some intrinsic to force half precision on casts and assignments, and that 
> > > a backend pass would aggressively promote operations between those 
> > > intrinsics? I think that would be a pretty error-prone representation, 
> > > both in terms of guaranteeing the use of excess precision in some 
> > > situations (and thus getting stable behavior across compiler releases) 
> > > and guaranteeing truncation in others (and thus preserving correctness). 
> > > The frontend would have to carefully emit intrinsics in a bunch of places 
> > > or else default to introducing a bug.
> > 
> > 
> > Maybe I'm asking that -fexcess-precision=fast not be an alias of 
> > -fexcess-precision=standard. The idea behind 'fast' as I understand it is 
> > to allow the compiler to generate whatever instructions are most efficient 
> > for the target. If the target supports native fp16 operations, the fastest 
> > code will use those instructions and not use excess precision. If the 
> > target does not support fp16, the fastest code would be to perform the 
> > calculations at 32-bit precision and only truncate on casts and assignments 
> > (or, even faster, not even then). You can see an analogy for what I mean by 
> > looking at what gcc does with 32-bit floats when SSE is disabled: 
> > https://godbolt.org/z/xhEGbjG4G
> > 
> > With -fexcess-precision=fast, gcc takes liberties to make the code run as 
> > fast as possible. With -fexcess-precision=standard, it truncates to the 
> > source value on assignments, casts, or return.
> > 
> > If you generate code using the half type here, the backend will legalize it 
> > and **should** make it as fast as possible. In fact, it looks like 
> > currently the X86 backend will insert calls to extend and truncate the 
> > values to maintain the semantics of the IR 
> > (https://godbolt.org/z/YGnj4cqvv). That's sensible, but it's not what the 
> > X86 backend does in the 32-bit float + no SSE case.
> > 
> > The main thing I'd want to say here is that the front end has no business 
> > trying to decide what will be fastest for the target and encoding that. At 
> > most, the front end should be encoding the semantics in the source and 
> > setting some flag in the IR to indicate the excess precision handling 
> > that's allowed. This feels to me like it requires a new fast-math flag, 
> > maybe 'axp' = "allow excess precision".
> > 
> > The case where the front end is encoding excess precision into the IR feels 
> > much more like -ffp-eval-method. When the front end encodes an fpext to 
> > float and does a calculation, the optimizer is forced to respect that 
> > unless it can prove that doing the calculation at lower precision won't 
> > change the result (and it rarely does that). For targets that have native 
> > FP16 support, this will mean that -fexcess-precision=fast results in slower 
> > code.
> > For targets that have native FP16 support, this will mean that 
> > -fexcess-precision=fast results in slower code.
> 
> This won't be true. We checked `TI.hasLegalHalfType()` before doing excess 
> precision. Maybe better to add a RUN for `avx512fp16` case to address the 
> concern.
> 
> > At most, the front end should be encoding the semantics in the source and 
> > setting some flag in the IR to indicate the excess precision handling 
> > that's allowed.
> 
> I think encoding excess precision into the IR is the easy way to encode the 
> information of the casts and assignments. Otherwise, we have to use 
> intrinsics to tell such information to the backend. As @rjmccall explained, 
> it's more error-prone.
> 
> > If you generate code using the half type here, the backend will legalize it 
> > and should make it as fast as possible.
> 
> I think the current backend implementations in both GCC and LLVM have already 
> tried to make it as fast as possible in the condition of not to make the code 
> too complicated.
> The behavior in 32-bit float + no SSE case is benefited from x87 feature that 
> always does excess precision. It is somehow taking bugs as feature from the 
> semantics of the IR.
> That says, it is feasible to mock the same behavior of x87 special fast 
> behavior for fp16. It needs extra work in the backend to globally promote all 
> half types to float to just get some "bug" behavior. I don't think GCC or us 
> have interest with it.
> 
> In conclusion, I think the only common thing in the 32-bit float + no SSE 
> case and fp16 is codegen (either FE or BE) needs to respect the boundary of 
> the casts and assignments in some condition. I think the current FE 
> implemetation is the best way to solve the problem (without extra BE work).
> This won't be true. We checked TI.hasLegalHalfType() before doing excess 
> precision. Maybe better to add a RUN for avx512fp16 case to address the 
> concern.

Oh, I see. I had missed that because I was focusing on the test to understand 
the expected output. If the front end won't be generating the extend and 
truncate instructions when the target has a legal half type that takes care of 
my primary objection. I'm still not entirely convinced that the front end 
should be making this decision, but it is a reasonable implementation.



> The behavior in 32-bit float + no SSE case is benefited from x87 feature that 
> always does excess precision. It is somehow taking bugs as feature from the 
> semantics of the IR.
> That says, it is feasible to mock the same behavior of x87 special fast 
> behavior for fp16. It needs extra work in the backend to globally promote all 
> half types to float to just get some "bug" behavior. I don't think GCC or us 
> have interest with it.

I'm not quite sure what you're saying here. I don't consider the fact that 
we're generating x87 instructions in the 32-bit float + no SSE case to be a 
bug. That's the behavior most people want. If you're saying that ignoring the 
IR semantics is a bug even though the result is what we intend, I would agree 
with that.

I don't understand why gcc doesn't maintain the same distinction between 
-fexcess-precision=fast and -fexcess-precision=standard for 16-bit types that 
they do for 32-bit types, but I don't suppose that's critical.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D136176/new/

https://reviews.llvm.org/D136176

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to