andrew.w.kaylor added inline comments.
================ Comment at: clang/test/CodeGen/X86/fexcess-precision.c:89 +// CHECK-EXT-NEXT: [[EXT1:%.*]] = fpext half [[TMP1]] to float +// CHECK-EXT-NEXT: [[MUL:%.*]] = fmul float [[EXT]], [[EXT1]] +// CHECK-EXT-NEXT: [[TMP2:%.*]] = load half, ptr [[C_ADDR]] ---------------- pengfei wrote: > andrew.w.kaylor wrote: > > I apologize for taking so long to get back to this. I'm not sure how to get > > this comment to show up as a response to the questions @rjmccall asked > > about my earlier remarks here. Apologies if this ends up being redundant. > > > > I said: > > > > > This is not what I'd expect to see. I'd expect the operations to use the > > > half type with explicit truncation inserted where needed. > > > > > > @rjmccall said: > > > > > Are you suggesting that the frontend emit half operations normally, with > > > some intrinsic to force half precision on casts and assignments, and that > > > a backend pass would aggressively promote operations between those > > > intrinsics? I think that would be a pretty error-prone representation, > > > both in terms of guaranteeing the use of excess precision in some > > > situations (and thus getting stable behavior across compiler releases) > > > and guaranteeing truncation in others (and thus preserving correctness). > > > The frontend would have to carefully emit intrinsics in a bunch of places > > > or else default to introducing a bug. > > > > > > Maybe I'm asking that -fexcess-precision=fast not be an alias of > > -fexcess-precision=standard. The idea behind 'fast' as I understand it is > > to allow the compiler to generate whatever instructions are most efficient > > for the target. If the target supports native fp16 operations, the fastest > > code will use those instructions and not use excess precision. If the > > target does not support fp16, the fastest code would be to perform the > > calculations at 32-bit precision and only truncate on casts and assignments > > (or, even faster, not even then). You can see an analogy for what I mean by > > looking at what gcc does with 32-bit floats when SSE is disabled: > > https://godbolt.org/z/xhEGbjG4G > > > > With -fexcess-precision=fast, gcc takes liberties to make the code run as > > fast as possible. With -fexcess-precision=standard, it truncates to the > > source value on assignments, casts, or return. > > > > If you generate code using the half type here, the backend will legalize it > > and **should** make it as fast as possible. In fact, it looks like > > currently the X86 backend will insert calls to extend and truncate the > > values to maintain the semantics of the IR > > (https://godbolt.org/z/YGnj4cqvv). That's sensible, but it's not what the > > X86 backend does in the 32-bit float + no SSE case. > > > > The main thing I'd want to say here is that the front end has no business > > trying to decide what will be fastest for the target and encoding that. At > > most, the front end should be encoding the semantics in the source and > > setting some flag in the IR to indicate the excess precision handling > > that's allowed. This feels to me like it requires a new fast-math flag, > > maybe 'axp' = "allow excess precision". > > > > The case where the front end is encoding excess precision into the IR feels > > much more like -ffp-eval-method. When the front end encodes an fpext to > > float and does a calculation, the optimizer is forced to respect that > > unless it can prove that doing the calculation at lower precision won't > > change the result (and it rarely does that). For targets that have native > > FP16 support, this will mean that -fexcess-precision=fast results in slower > > code. > > For targets that have native FP16 support, this will mean that > > -fexcess-precision=fast results in slower code. > > This won't be true. We checked `TI.hasLegalHalfType()` before doing excess > precision. Maybe better to add a RUN for `avx512fp16` case to address the > concern. > > > At most, the front end should be encoding the semantics in the source and > > setting some flag in the IR to indicate the excess precision handling > > that's allowed. > > I think encoding excess precision into the IR is the easy way to encode the > information of the casts and assignments. Otherwise, we have to use > intrinsics to tell such information to the backend. As @rjmccall explained, > it's more error-prone. > > > If you generate code using the half type here, the backend will legalize it > > and should make it as fast as possible. > > I think the current backend implementations in both GCC and LLVM have already > tried to make it as fast as possible in the condition of not to make the code > too complicated. > The behavior in 32-bit float + no SSE case is benefited from x87 feature that > always does excess precision. It is somehow taking bugs as feature from the > semantics of the IR. > That says, it is feasible to mock the same behavior of x87 special fast > behavior for fp16. It needs extra work in the backend to globally promote all > half types to float to just get some "bug" behavior. I don't think GCC or us > have interest with it. > > In conclusion, I think the only common thing in the 32-bit float + no SSE > case and fp16 is codegen (either FE or BE) needs to respect the boundary of > the casts and assignments in some condition. I think the current FE > implemetation is the best way to solve the problem (without extra BE work). > This won't be true. We checked TI.hasLegalHalfType() before doing excess > precision. Maybe better to add a RUN for avx512fp16 case to address the > concern. Oh, I see. I had missed that because I was focusing on the test to understand the expected output. If the front end won't be generating the extend and truncate instructions when the target has a legal half type that takes care of my primary objection. I'm still not entirely convinced that the front end should be making this decision, but it is a reasonable implementation. > The behavior in 32-bit float + no SSE case is benefited from x87 feature that > always does excess precision. It is somehow taking bugs as feature from the > semantics of the IR. > That says, it is feasible to mock the same behavior of x87 special fast > behavior for fp16. It needs extra work in the backend to globally promote all > half types to float to just get some "bug" behavior. I don't think GCC or us > have interest with it. I'm not quite sure what you're saying here. I don't consider the fact that we're generating x87 instructions in the 32-bit float + no SSE case to be a bug. That's the behavior most people want. If you're saying that ignoring the IR semantics is a bug even though the result is what we intend, I would agree with that. I don't understand why gcc doesn't maintain the same distinction between -fexcess-precision=fast and -fexcess-precision=standard for 16-bit types that they do for 32-bit types, but I don't suppose that's critical. CHANGES SINCE LAST ACTION https://reviews.llvm.org/D136176/new/ https://reviews.llvm.org/D136176 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits