Changes in v3:
- stylistic changes
- simplify createMulMethod2()
- update shader-db statistics
- use util_bitcount64 and util_next_power_of_two64 instead of
  reimplementing them

Changes in v2:
- rebase
- bring back constant folding for multiplication by power-of-twos for nv50
- remove TODO in nv50_ir_target_gm107.cpp
- document XMAD's flags
- change how XMAD's per-operand flags are represented
- move util/bitscan.h stuff into a new patch
- stylistic changes

This series improves the performance of integer multiplication by removing
much of the usage of the very slow IMAD and IMUL on Maxwell+ and by
improving multiplication by immediates on Fermi+.

The first and second patches add support for the XMAD instruction in
codegen.

The third patch replaces most IMADs and IMULs with a sequence of XMADs on
Maxwell+. This is far faster but increases the total instructions in the
shader-db by 0.90%, gpr count by 0.10% and local memory by 0.46%.

The fourth patch significantly reduces that increase. It replaces many
multiplications by immediates with instructions that should be as fast as
or faster than the XMAD approach. They are also typically smaller and less
register heavy, so they decrease the total instruction count by 0.65% and
bring the gpr count and local memory back to normal.

This series gives a roughly 50% speedup in fragment-heavy scenarios with
Dolphin 5.0 on my GTX 1060. All timings were made with interesting-looking
fifos from Dolphin's bugtracker:
Wind Waker: 18 FPS -> 26 FPS at 3x internal resolution
Wind Waker: 8 FPS -> 11 FPS at 5x internal resolution
Paper Mario?: 26 FPS -> 42 FPS at 5x internal resolution
SpongeBob Movie: 19 FPS -> 30 FPS at 5x internal resolution

Unigine Heaven and Unigine Valley seem to run the same at low quality with
no anti-aliasing and no tessellation. SuperTuxKart and 0 A.D. also show no
change.

It's possible these patches may break something. Piglit shows no
functional regressions, though the patches should probably be tested for
improvements or breakage with actual applications.

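For readers unfamiliar with the approach, the arithmetic behind the third
and fourth patches can be sketched as below. This is only an illustrative
model, not the code in the series: it assumes XMAD behaves like a 16-bit by
16-bit multiply with a 32-bit addend and ignores its flags, and the helper
names (mad16, mad32_from_16, mul_by_imm) are hypothetical. The immediate
cases shown (2^n, 2^n + 1, 2^n - 1) are generic strength-reduction examples
and may not match the cases the patch actually handles.

```cpp
#include <cstdint>

// Hypothetical model of one 16x16+32 multiply-add step. Assumption: the
// real XMAD has mode and per-operand flags that are not modelled here.
static uint32_t mad16(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & 0xffffu) * (b & 0xffffu) + c;
}

// Low 32 bits of a * b + c, built only from 16-bit partial products:
//   a * b = a.lo*b.lo + ((a.lo*b.hi + a.hi*b.lo) << 16)   (mod 2^32)
// The a.hi*b.hi term is shifted out of the low 32 bits entirely.
static uint32_t mad32_from_16(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t lo = mad16(a, b, c);            // a.lo * b.lo + c
    uint32_t cross = mad16(a, b >> 16, 0);   // a.lo * b.hi
    cross = mad16(a >> 16, b, cross);        // + a.hi * b.lo
    return lo + (cross << 16);
}

static bool is_pow2(uint32_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

// log2 of a power of two (position of its single set bit).
static unsigned log2_exact(uint32_t v)
{
    unsigned n = 0;
    while (v >>= 1)
        ++n;
    return n;
}

// x * imm (mod 2^32) using shifts and adds for a few immediate shapes,
// falling back to the 16-bit partial-product sequence otherwise.
static uint32_t mul_by_imm(uint32_t x, uint32_t imm)
{
    if (imm == 0)
        return 0;
    if (is_pow2(imm))                        // x * 2^n
        return x << log2_exact(imm);
    if (is_pow2(imm - 1))                    // x * (2^n + 1)
        return (x << log2_exact(imm - 1)) + x;
    if (is_pow2(imm + 1))                    // x * (2^n - 1)
        return (x << log2_exact(imm + 1)) - x;
    return mad32_from_16(x, imm, 0);
}
```

In this model a general multiply costs three multiply-add steps plus a
shift and add, while the immediate cases need at most a shift and one
add or subtract, which is the kind of saving in instruction count and
register use that the fourth patch is after.
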
These patches can also be found on my github:
https://github.com/pendingchaos/mesa/tree/nv-xmad-v3

The final changes in shader-db are as follows:
total instructions in shared programs : 5787704 -> 5801926 (0.25%)
total gprs used in shared programs    : 669878 -> 669853 (-0.00%)
total shared used in shared programs  : 548832 -> 548832 (0.00%)
total local used in shared programs   : 21068 -> 21068 (0.00%)

                local     shared        gpr       inst      bytes
    helped          0          0        280        717        717
      hurt          0          0        298       2171       2171

Rhys Perry (4):
  nv50/ir: add preliminary support for OP_XMAD
  gm107/ir: add support for OP_XMAD on GM107+
  nv50/ir: optimize imul/imad to xmads
  nv50/ir: further optimize multiplication by immediates

 src/gallium/drivers/nouveau/codegen/nv50_ir.h      |  26 +++
 .../drivers/nouveau/codegen/nv50_ir_emit_gm107.cpp |  65 +++++++
 .../drivers/nouveau/codegen/nv50_ir_peephole.cpp   | 200 +++++++++++++++++++--
 .../drivers/nouveau/codegen/nv50_ir_print.cpp      |  19 ++
 .../drivers/nouveau/codegen/nv50_ir_target.cpp     |   7 +-
 .../nouveau/codegen/nv50_ir_target_gm107.cpp       |   6 +-
 .../nouveau/codegen/nv50_ir_target_nv50.cpp        |   1 +
 .../nouveau/codegen/nv50_ir_target_nvc0.cpp        |  16 ++
 8 files changed, 320 insertions(+), 20 deletions(-)

-- 
2.14.4

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev