Hello,

This series uses the scheduling control codes to improve instruction
pipelining on Maxwell GPUs.

Scheduling control codes appeared with the Kepler architecture, where a
control instruction has to be inserted every 7 instructions. Maxwell
added more control codes, and a control instruction now has to be
inserted every 3 instructions.

Maxwell control codes are really powerful and well documented [1]. By
the way, I would like to thank Scott Gray, who did awesome reverse
engineering work, although I had to figure out the missing parts myself.

On Maxwell, control codes are mainly used for setting stall counts and
for producing/consuming dependency barriers in order to avoid hazards.
I'm not going to explain in detail how they work because the
documentation is quite good and because I added explanations here and
there in the source code. But the main thing to understand is that the
control code previously used by default (i.e. st 0x0) means "wait for
all dependencies and stall the pipeline for 15 cycles, which is the
maximum". Which is quite bad...

Now, let's have a look at the (impressive) performance improvements. :-)

I measured on a GeForce GTX 750 Ti (GM107) reclocked to the highest perf
level, with and without the control codes (NV50_PROG_SCHED=0/1).

app: number of FPS without -> number of FPS with (+gain%)

FurMark: 13 -> 42 (+223%)
Pixmark Piano: 2 -> 7 (+250%)
Pixmark Volplosion: 6 -> 20 (+233%)
Julia F32: 61 -> 219 (+259%)
LightMarks: 352 -> 685 (+94%)
Heaven (low): 51 -> 102 (+100%)
Heaven (ultra): 14 -> 27 (+93%)
Valley (low): 30 -> 68 (+126%)
Valley (ultra): 18 -> 39 (+100%)
Talos (low): 32 -> 50 (+56%)
Talos (ultra): 7 -> 14 (+100%)
Shadow of Mordor (lowest): 13 -> 20 (+53%)

That's it! I think it's enough to understand the power of Maxwell
control codes. We may get additional numbers from Phoronix (wink, wink,
Michael). As I said in the main patch, the control codes can be disabled
with 'export NV50_PROG_SCHED=0'.

Now, let's have a look at how nouveau performs compared to NVIDIA's
blob.

app: number of FPS with nouveau -> number of FPS with the blob (+gain%)

FurMark: 42 -> 59 (+40%)
Pixmark Piano: 7 -> 13 (+85%)
Pixmark Volplosion: 20 -> 42 (+110%)
Julia F32: 219 -> 351 (+60%)
LightMarks: 685 -> 1192 (+74%)
Heaven (low): 102 -> 144 (+41%)
Heaven (ultra): 27 -> 46 (+70%)
Valley (low): 68 -> 94 (+38%)
Valley (ultra): 39 -> 60 (+53%)
Talos (low): 50 -> 128 (+156%)
Talos (ultra): 14 -> 30 (+114%)
Shadow of Mordor (lowest): 20 -> 77 (+285%)

Nouveau is still far away from the blob, but now I think Maxwell is
actually in roughly the same shape as Kepler in terms of performance and
features. Speaking of which, I will enable OpenGL 4.3 on Maxwell in a
separate patch, later on.

The compile-time overhead added by this series is rather small. A full
shader-db run with my private repository of shaders takes approximately
208s to compile 25k shaders before the series and approximately 211s
after. That is less than 2% of overhead, and it's comparable to a full
shader-db run on Kepler.

No regressions with either piglit or dEQP (tested multiple times), and
all benchmarks/games I have tried render fine and seem to be quite
stable.

Due to a lack of time, some parts are still left to do and some others
could be improved. With the following ideas implemented, I'm pretty sure
we can improve performance significantly.

* Add support for the yield flag. This seems to be a hint to the
  hardware for improving how the work is balanced between the warps. I
  didn't figure out how and where to use it without breaking a bunch of
  things. Need time and patience.

* Add support for dual-issue. The rules are pretty different from
  Kepler, especially because of the dependency barriers. Note that the
  yield flag has to be set, otherwise the hardware won't dual-issue and
  in fact it will wait for all dependencies (i.e. st 0x0), which is
  really different from what you are looking for.

* Reduce stall counts. A bunch of instructions have a read latency,
  which is the number of cycles before they can actually read the
  sources.
  This should be fairly easy to implement but will require some reverse
  engineering to completely understand the idea.

This is my last contribution for the Nouveau driver for a while because
I have been hired by Valve to work on radeonsi. Do not expect such perf
improvements with radeonsi because it already performs really well,
unlike Nouveau. But with time and patience we can do better. :-)

This series is also available from my fdo account:
https://cgit.freedesktop.org/~hakzsam/mesa/log/?h=gm107_scheduler

Please, review!

Thanks.

[1] https://github.com/NervanaSystems/maxas/wiki/Control-Codes

Samuel Pitoiset (5):
  nv50/ir: do not insert texture barriers on gm107
  nv50/ir: improve instruction pipelining on gm107
  nv50/ir: use sched control codes for gm107 builtins
  nvc0: use sched control codes for gm107 blitter shader
  nvc0: use sched control codes for gm107 MP counters code

 src/gallium/drivers/nouveau/codegen/lib/gm107.asm  |  40 +-
 .../drivers/nouveau/codegen/lib/gm107.asm.h        |  40 +-
 .../drivers/nouveau/codegen/nv50_ir_emit_gm107.cpp | 771 ++++++++++++++++++++-
 .../nouveau/codegen/nv50_ir_lowering_nvc0.cpp      |   3 +-
 .../nouveau/codegen/nv50_ir_target_gm107.cpp       | 253 +++++++
 .../drivers/nouveau/codegen/nv50_ir_target_gm107.h |   7 +
 .../drivers/nouveau/nvc0/nvc0_query_hw_sm.c        |  88 +--
 src/gallium/drivers/nouveau/nvc0/nvc0_surface.c    |  20 +-
 8 files changed, 1127 insertions(+), 95 deletions(-)

-- 
2.11.0
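
For anyone who wants to re-check the numbers quoted in this letter: the
gain columns appear to be computed as (after - before) / before,
truncated to a whole percent in most rows, and the compile-time overhead
is the same ratio applied to the shader-db timings. The sketch below is
only an illustration under that assumption; the function names are
hypothetical and are not part of the series.

```python
def gain_percent(fps_before: int, fps_after: int) -> int:
    """Relative FPS gain, truncated to a whole percent (assumed convention)."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return (fps_after - fps_before) * 100 // fps_before


def overhead_percent(seconds_before: float, seconds_after: float) -> float:
    """Relative compile-time overhead, in percent."""
    return (seconds_after - seconds_before) / seconds_before * 100


if __name__ == "__main__":
    # FurMark without -> with control codes (13 -> 42 FPS).
    print(gain_percent(13, 42))
    # Shadow of Mordor, nouveau -> blob (20 -> 77 FPS).
    print(gain_percent(20, 77))
    # shader-db run: 208s before the series, 211s after.
    print(round(overhead_percent(208, 211), 2))
```

With the FurMark figures this gives 223, matching the +223% quoted
above, and the shader-db timings give roughly 1.44%, consistent with
"less than 2%".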