On Tue, 2018-11-06 at 11:04 +0200, Pohjolainen, Topi wrote:
> On Tue, Nov 06, 2018 at 09:40:17AM +0100, Iago Toral wrote:
> > On Tue, 2018-11-06 at 08:30 +0200, Topi Pohjolainen wrote:
> > > Here is a version 2 of adding support for 16-bit float instructions
> > > in the shader compiler. Unlike the first version which did all the
> > > analysis at glsl level here one adds the notion of precision to NIR
> > > variables and does the analysis and precision lowering in NIR level.
> > >
> > > This lives in: gitlab.freedesktop.org:tpohjola/mesa and branch fp16.
> > >
> > > This is now mature enough to be able to use 16-bit precision for all
> > > instructions except a few special cases for gfxbench trex and alu2.
> > > (Unfortunately I'm not seeing any performance benefit.
> >
> > That is not too surprising. The backend optimizer has been implemented
> > in terms of 32-bit and you are probably losing a lot of optimizations
> > in the generated code for 16-bit paths. I have hit some of that as well
> > while working on the backend aspects of enabling 16-bit. For example,
> > SIMD8 executions (which is all of the geometry pipeline) will not
> > benefit from copy-propagation because is_partial_write() is always true
> > for SIMD8 16-bit instructions with its current semantics. There are
> > other optimization passes that have hard-coded 32-bit conditions, etc.
> > I have addressed a small part of this and have some code available that
> > I expect to send for review soon, but there is clearly work to be done
> > in the backend to optimize things for 16-bit paths, which I hope to
> > work on in the near future.
>
> I have added the concept of padding registers in virtual space in order
> to keep all the optimizations functioning. Comparing side-by-side the
> brw_fs instructions between 32-bit and 16-bit versions (of t-rex at
> least) tells me that 16-bit version is equivalent to 32-bit.
> Only extra things are conversions from 32-bit input varyings to 16-bits
> and conversions from 16-bits to 32-bits texture coordinates. Both of
> which I'm working on.
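
To make the is_partial_write() point above concrete, here is a minimal sketch in Python of the size arithmetic only. It is not the Mesa implementation: the function and parameter names are hypothetical, the 32-byte register size is an assumption that holds for the hardware generations discussed in this thread, and the other conditions a real check would include (predication, contiguity, register offset) are left out. The `padded` flag is one assumed way to model the padding idea, where a 16-bit destination is treated as owning the whole register.

```python
# Simplified model of the "does this write cover a full register" check.
# All names are hypothetical; this is not the Mesa code.

REG_SIZE = 32  # bytes per register (assumption for the hardware discussed here)


def bytes_written(exec_size: int, type_size: int) -> int:
    """Bytes touched by one instruction: channels times bytes per channel."""
    return exec_size * type_size


def is_partial_write(exec_size: int, type_size: int, padded: bool = False) -> bool:
    """Return True if the write covers less than one full register.

    padded=True models the assumption that the destination was allocated
    with padding, so the rest of the register holds no other live data.
    """
    if padded:
        return False
    return bytes_written(exec_size, type_size) < REG_SIZE


if __name__ == "__main__":
    for exec_size, type_size in [(8, 4), (8, 2), (16, 2)]:
        print(exec_size, type_size, is_partial_write(exec_size, type_size))
```

Under these assumptions a SIMD8 32-bit write fills the register (8 * 4 = 32 bytes), while a SIMD8 16-bit write only fills half of it (8 * 2 = 16 bytes) and is therefore reported as partial unless padding is taken into account.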

That's surprising, because there're optimization paths in the backend
that are clearly hardcoded for 32-bit register types. It could be that
t-rex isn't affected by any of this though. Anyway, if the code for
t-rex is similarly optimized (modulo input/output conversions) it is
very disappointing that we're not seeing any advantage :-/

> > > This is not
> > > that surprising as I got to the same point with the glsl-based
> > > solution and was able to measure the performance already back then).
> > > Hence I thought it is time to share it.
> > >
> > > While this is still work-in-progress I didn't want to flood the list
> > > with the full set of patches but instead included the very last where
> > > I try to outline the logic and its current shortcomings. There is also
> > > a short list of TODO items.
> > >
> > > In addition to those I need to examine a couple of Intel specific
> > > misrenderings. I haven't gotten that deep yet but it looks like I'm
> > > missing something with 16-bit inot and mad/mac lowered interpolation.
> > > Unfortunately I get corrupted rendering only with hardware while
> > > simulator is happy.
> >
> > Are you implementing interpolation of 16-bit fragment shader inputs? I
> > have discussed that with Jason in the past and based on my
> > experimentation, I think the hardware doesn't support this natively:
> > the interpolator seems to produce 32-bit deltas only and assumes 32-bit
> > inputs only as well.
>
> Matt has added lowering of pln() to mad/mac for gen11+. In addition,
> hardware allows mixed mode operations on mad() and hence we can
> interpolate mixing with 16-bit and 32-bit operands producing 16-bit
> results. Therefore leveraging both these I can keep the incoming sources
> to the shader as 32-bits but interpolate with 16-bit precision.
> Looking like this in practise:
>
> mad(8) g11<1>HF g4.3<0,1,0>F g2<4,4,1>F g4.0<0,1,0>F { align16 1Q };
> mad(8) g11<1>HF g11<4,4,1>HF g3<4,4,1>F g4.1<0,1,0>F { align16 1Q };
> mad(8) g12<1>HF g4.7<0,1,0>F g2<4,4,1>F g4.4<0,1,0>F { align16 1Q };
> mad(8) g12<1>HF g12<4,4,1>HF g3<4,4,1>F g4.5<0,1,0>F { align16 1Q };

Ok, so you are using 32-bit inputs in the FS, that should be fine then.

> Although there is something amiss with this, I'm getting partially
> corrupted data with hardware while simulator is happy. Working on
> it...
>
> > > Mostly I'm afraid how to test all of this properly. I haven't written
> > > any unit tests but that is high on my list. This is mostly because
> > > I've been uncertain about my design choices. So far I've used shader
> > > runner tests that I've written for specific cases. These are useful
> > > for development purposes but don't bring much value for regression
> > > testing.
> >
> > Have you tried dEQP / GLES CTS yet? I figure there should be a lot of
> > mediump shaders there.
>
> I need to take a look. What I'm afraid of is checking for precision of
> results, i.e., checking that compiler doesn't lower too much, I hope
> those tests are addressing this.

Yes, also, I don't think that requirements for mediump precision are
very well defined in the GL ES specs either...

> > Another note on 16-bit booleans, since I see you've been working on
> > that, I don't know if you're aware that Jason has posted relevant
> > patches here:
> >
> > https://lists.freedesktop.org/archives/mesa-dev/2018-October/207458.html
> >
> > This basically introduced the notion of bit-sized booleans in NIR, and
> > it leaves it up to the backend to lower booleans to the bit-size they
> > need before translating to a backend IR. I have been working on that
> > lowering and have a prototype working for 16-bit booleans built on top
> > of Jason's series and my backend work for half-float.
> > Let me know if you are interested and I'll point you to the code.
>
> I'll have a look. I had to address that as well, and if you want to
> take a look:
>
> nir: Add lowering pass setting 16-bit boolean destinations
> nir: Add lowering pass turning b2f(i2i32(x)) into b2f(x)
> Revert "intel/compiler: fix 16-bit comparisons"
>
> I haven't yet tried to rebase on top of Jason's work. In case of GLSL
> I can't just lower in the backend. During the analysis one needs to know
> if sources are produced in 16-bit precision and mark this somehow down.
> I've chosen to sprinkle artificial i2i16 and i2i32 operations for that
> and to remove them once validation is done.
>
> > Iago
> >
> > > Alejandro Piñeiro (1):
> > >   intel/compiler/fs: Use half_precision data_format on 16-bit fb writes
> > >
> > > Jose Maria Casanova Crespo (2):
> > >   intel/compiler/fs: Include support for RT data_format bit
> > >   intel/compiler/disasm: Show half-precision data_format on rt_writes
> > >
> > > Topi Pohjolainen (58):
> > >   intel/compiler/fs: Set 16-bit sampler return format
> > >   intel/compiler/disasm: Show half-precision for sampler messages
> > >   intel/compiler/fs: Skip tex-inst early in conversion lowering
> > >   intel/compiler/fs: Support for dumping 16-bit IMM values
> > >   intel/compiler: Allow 16-bit math
> > >   intel/compiler/fs: Add helpers for 16-bit null regs
> > >   intel/compiler/fs: Use two SIMD8 instructions for 16-bit math
> > >   intel/compiler/fs: Use 16-bit null dest with 16-bit math
> > >   intel/compiler/fs: Use 16-bit null dest with 16-bit compare
> > >   intel/compiler/fs: Add 16-bit type support for nir_if
> > >   intel/compiler/eu: Prepare 3-src-op for 16-bit sources
> > >   intel/compiler/eu: Prepare 3-src-op for 16-bit dst
> > >   intel/compiler/eu: Allow 3-src-op with mixed precision (HF/F) sources
> > >   intel/compiler/disasm: Print mixed precision 3-src types correctly
> > >   intel/compiler/disasm: Print 16-bit IMM values
> > >   intel/compiler/fs: Support for combining 16-bit immediates
> > >   intel/compiler/fs: Set tex type for generator to flag fp16
> > >   intel/compiler/fs: Use component_size() instead of open coded
> > >   intel/compiler/fs: Add register padding support
> > >   intel/compiler/fs: Pad 16-bit texture return payloads
> > >   intel/compiler/fs: Pad 16-bit output (store/fb write) payloads
> > >   intel/compiler/fs: Pad 16-bit nir vec* components into full reg
> > >   intel/compiler/fs: Pad 16-bit nir intrinsic dest into full reg
> > >   intel/compiler/fs: Pad 16-bit const loads into full regs
> > >   intel/compiler/fs: Pad 16-bit load payload lowering
> > >   nir: Lower also 16-bit lrp() if needed
> > >   intel/compiler: Lower 16-bit lrp()
> > >   nir: Recognize f232(f216(x)) as x
> > >   nir: Recognize f216(f232(x)) as x
> > >   nir: Store variable precision when translating from glsl
> > >   glsl: Set default precision for builtin variables
> > >   i965: Prepare uniform mapping for 16-bit values
> > >   i965: Support for uploading 16-bit uniforms from 32-bit store
> > >   intel/compiler/fs: WIP: Use 32-bit slots for 16-bit uniforms
> > >   intel/compiler: Tell compiler if lower precision is supported
> > >   nir: Add lowering pass for variables marked mediump
> > >   nir: Add pass for deref precision lowering
> > >   nir: Add pass for alu precision lowering
> > >   nir: Add precision conversion for load/store_deref
> > >   nir: Add precision conversion for sources of texturing ops
> > >   nir: Don't set destination size 16 for booleans
> > >   nir: Add precision lowering for texture samples
> > >   nir: Add support for non-fixed precision
> > >   nir: Don't try to alter precision of boolean sources
> > >   nir: Add support for variable sized booleans
> > >   nir: Add support for lowering phi precision
> > >   intel/compiler/fs: Prepare alu dest type for 16-bit booleans
> > >   nir: Add lowering pass setting 16-bit boolean destinations
> > >   nir: Add lowering pass turning b2f(i2i32(x)) into b2f(x)
> > >   nir: Adjust integer precision for alus operating with 16-bit srcs
> > >   nir: Replace b2f(x) with b2f(i2i32(x)) for 16-bit x
> > >   nir: Adjust precision for discard_if
> > >   nir: Allow input varyings to be converted to lower precision
> > >   nir: Replace 16-bit src[0] for bcsel i2i32(src[0])
> > >   nir: Replace 16-bit nir_if condition with i2i32(condition)
> > >   Revert "intel/compiler: fix 16-bit comparisons"
> > >   intel/compiler: Hook in precision lowering pass
> > >   nir: Document precision lowering pass
> > >
> > >  src/compiler/Makefile.sources                 |   2 +
> > >  src/compiler/glsl/glsl_symbol_table.cpp       |  20 +
> > >  src/compiler/glsl/glsl_symbol_table.h         |   7 +
> > >  src/compiler/glsl/glsl_to_nir.cpp             |   1 +
> > >  src/compiler/nir/meson.build                  |   2 +
> > >  src/compiler/nir/nir.h                        |  18 +
> > >  src/compiler/nir/nir_lower_bool_size.c        | 120 +++
> > >  src/compiler/nir/nir_lower_precision.cpp      | 820 ++++++++++++++++++
> > >  src/compiler/nir/nir_opt_algebraic.py         |   5 +
> > >  src/intel/blorp/blorp.c                       |   4 +-
> > >  src/intel/compiler/brw_compiler.c             |   1 +
> > >  src/intel/compiler/brw_disasm.c               |  28 +-
> > >  src/intel/compiler/brw_eu.h                   |   3 +-
> > >  src/intel/compiler/brw_eu_emit.c              |  83 +-
> > >  src/intel/compiler/brw_fs.cpp                 |  68 +-
> > >  src/intel/compiler/brw_fs.h                   |   4 +-
> > >  src/intel/compiler/brw_fs_builder.h           |  37 +-
> > >  .../compiler/brw_fs_combine_constants.cpp     |  84 +-
> > >  .../compiler/brw_fs_copy_propagation.cpp      |   7 +-
> > >  src/intel/compiler/brw_fs_generator.cpp       |  13 +-
> > >  .../compiler/brw_fs_lower_conversions.cpp     |  42 +
> > >  src/intel/compiler/brw_fs_nir.cpp             | 197 +++--
> > >  src/intel/compiler/brw_fs_surface_builder.cpp |   3 +-
> > >  src/intel/compiler/brw_fs_visitor.cpp         |   6 +
> > >  src/intel/compiler/brw_inst.h                 |   5 +
> > >  src/intel/compiler/brw_ir_fs.h                |  16 +
> > >  src/intel/compiler/brw_nir.c                  |  22 +-
> > >  src/intel/compiler/brw_nir.h                  |   4 +-
> > >  src/intel/compiler/brw_reg_type.c             |   2 +
> > >  src/intel/compiler/brw_shader.h               |   7 +
> > >  src/intel/vulkan/anv_pipeline.c               |   2 +-
> > >  .../drivers/dri/i965/brw_nir_uniforms.cpp     |   8 +-
> > >  src/mesa/drivers/dri/i965/brw_program.c       |  10 +-
> > >  src/mesa/drivers/dri/i965/brw_program.h       |   6 +-
> > >  src/mesa/drivers/dri/i965/brw_tcs.c           |   2 +-
> > >  .../drivers/dri/i965/gen6_constant_state.c    |  14 +-
> > >  36 files changed, 1548 insertions(+), 125 deletions(-)
> > >  create mode 100644 src/compiler/nir/nir_lower_bool_size.c
> > >  create mode 100644 src/compiler/nir/nir_lower_precision.cpp
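
As an illustration of the two conversion round-trips named in the patch list ("Recognize f216(f232(x)) as x" and "Recognize f232(f216(x)) as x"), here is a small Python sketch using only the standard `struct` module. It reads f216/f232 as float-to-half and float-to-single conversions, which is an assumption about the shorthand, and the helper names are hypothetical. It shows that widening a 16-bit value and narrowing it again returns the same bits, while narrowing a 32-bit value and widening it again generally does not, so the second rewrite presumably relies on the precision analysis having decided that x may live in 16 bits.

```python
import struct


def to_f16_bits(value: float) -> int:
    """Round a Python float to half precision and return its 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def from_f16_bits(bits: int) -> float:
    """Interpret a 16-bit pattern as a half-precision float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def round_to_f32(value: float) -> float:
    """Round a Python float (a double) to single precision."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def round_to_f16(value: float) -> float:
    """Round a Python float to half precision."""
    return from_f16_bits(to_f16_bits(value))


def widen_then_narrow_is_exact(bits: int) -> bool:
    """Check that half -> single -> half returns the original bit pattern."""
    widened = round_to_f32(from_f16_bits(bits))
    return to_f16_bits(widened) == bits


if __name__ == "__main__":
    x = round_to_f32(0.1)  # a value that exists in 32 bits but not in 16
    print(widen_then_narrow_is_exact(to_f16_bits(1.5)))
    print(round_to_f32(round_to_f16(x)) == x)
```

NaN bit patterns are best excluded when checking the first property exhaustively, because packing a NaN does not have to preserve its payload.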