Alejandro Piñeiro <apinhe...@igalia.com> writes: > On 24/08/17 21:07, Connor Abbott wrote: >> >> Hi Alejandro, > > Hi Connor, > >> >> This seems really suspicious. If the live ranges are really >> independent, then the register allocator should be able to assign the >> two virtual registers to the same physical register if it needs to. > > Yes, it is true, the register allocator should be able to assign two > virtual registers to the same physical register. But that is done at the > end (or really near the end), so late for the problem this optimization > is trying to fix. We are also reducing the amount of instructions used. > > Probably not really clear on the commit message. When I say "reduce the > pressure of the register allocator" I mean having a code that the > register allocator would be able to handle without using too much time. > The problem this optimization tries to solve is that for some 16 bit CTS > tests (some with matrices and geometry shaders), the amount of virtual > registers used and instructions was really big. For the record, > initially, some tests needed 24 min just to compile. Right now, thanks > to other optimizations, the slower test without this optimization needs > 1min 30 seconds. Adding some hacky timestamps, the time used at > fs_visitor::allocate_registers (brw_fs.cpp:6096) is: > > * While trying to schedule using the three available pre mode > heuristics: 7 seconds > * Allocation with spilling: 63 seconds > * Final schedule using SCHEDULE_POST: 19 seconds > > With this optimization, the total time goes down to 14 seconds (10 + 0 + > 3 on the previous bullet point list). > > One could argue that 1min 30 seconds is okish. But taking into account > that it goes down to 14 seconds, even with some caveats (see below), I > still think that it is worth to use the optimization. > > And a final comment. For that same test, this is the final stats (using > INTEL_DEBUG): > > * With the optimization: SIMD8 shader: 4610 instructions. 0 loops. > 130320 cycles. 15:9 spills:fills. > * Without the optimization: SIMD8 shader: 12312 instructions. 0 loops. > 174816 cycles. 751:1851 spills:fills. > >> This change forces the two to be the same, which constrains the >> register allocator unecessarily and should make it worse, so I'm >> confused as to why this would help at all. > > I didn't check that issue specifically, but I recently found that this > optimization affects copy propagation/dead code eliminate. So there are > still some room for improvement. But in any case, take into account that > this custom optimization is only used if there is a 32 to 16 bit > conversion, so only affects shaders with this specific feature. > >> >> IIRC there were some issues where we unnecessarily made the sources >> and destination of an instruction interefere with each other, but if >> that's what's causing this, then we should fix that underlying issue. >> >> (From what I remember, a lot of SIMD16 were expanded to SIMD8 in the >> generator, in which case the second half of the source is read after >> the first half of the destination is written, and we falsely thought >> that the HW did that too, so we had some code to add a fake >> interference between them, but a while ago Curro moved the expansion >> to happen before register allocation. I don't have the code in front >> of me, but I think we still have this useless code lying around, and I >> would guess this is the source of the problem.) > > Taking into account what I explained before, I don't think that the > problem is the interference or this code you mention (although perhaps > Im wrong). >
I agree with Connor's feed-back on this change, this really smells like a hack working around register allocator brokenness. If the register allocator is failing to assign two variables with disjoint live ranges to the same register it has a bug. If you forcefully merge the live ranges of source and destination it might turn out that that wasn't the optimal decision to take after all register pressure-wise (because it's frequently harder to find room in the GRF for a variable with 2x the live range than for two independent variables), so you will be pessimizing register usage in some cases -- The register allocator should know better than you. >> hhhh >> Connor >> >> On Aug 24, 2017 2:59 PM, "Alejandro Piñeiro" <apinhe...@igalia.com >> <mailto:apinhe...@igalia.com>> wrote: >> >> When dealing with HF/U/UW, it is usual having a register with a >> F/D/UD, and then convert it to HF/U/UW, and not use again the F/D/UD >> value. In those cases it would be possible to reuse the register where >> the F value is initially stored instead of having two. Take also into >> account that when operating with HF/U/UW, you would need to use the >> full register (so stride 2). Packs/unpacks would be only useful when >> loading/storing several HF/W/UW. >> >> Note that no instruction is removed. The main benefict is reducing the >> amoung of registers used, so the pressure on the register allocator is >> decreased with big shaders. >> >> Possibly this could be integrated into an existing optimization, at it >> is even already done by the register allocator, but was far easier to >> write and cleaner to read as a separate optimization. >> >> We found this issue when dealing with some Vulkan CTS tests that >> needed several minutes to compile. Most of the time was spent on the >> register allocator. >> >> Right now the optimization only handles 32 to 16 bit conversion. It >> could be possible to do the equivalent for 16 to 32 bit too, but in >> practice, we didn't need it. >> --- >> src/intel/compiler/brw_fs.cpp | 77 >> +++++++++++++++++++++++++++++++++++++++++++ >> src/intel/compiler/brw_fs.h | 1 + >> 2 files changed, 78 insertions(+) >> >> diff --git a/src/intel/compiler/brw_fs.cpp >> b/src/intel/compiler/brw_fs.cpp >> index b6013a5ce85..1342150b44e 100644 >> --- a/src/intel/compiler/brw_fs.cpp >> +++ b/src/intel/compiler/brw_fs.cpp >> @@ -39,6 +39,7 @@ >> #include "compiler/glsl_types.h" >> #include "compiler/nir/nir_builder.h" >> #include "program/prog_parameter.h" >> +#include "brw_fs_live_variables.h" >> >> using namespace brw; >> >> @@ -3133,6 +3134,81 @@ fs_visitor::remove_extra_rounding_modes() >> return progress; >> } >> >> +/** >> + * When dealing with HF/W/UW, it is usual having a register with >> a F/D/UD, and >> + * then convert it to HF/W/UW, and not use again the F/D/UD >> value. In those >> + * cases it would be possible to reuse the register where the F >> value is >> + * initially stored instead of having two. Take also into account >> that when >> + * operating with HF/W/UW, you would need to use the full >> register (so stride >> + * 2). Packs/unpacks would be only useful when loading/storing >> several >> + * HF/W/UWs. >> + * >> + * So something like this: >> + * mov(8) vgrf14<2>:HF, vgrf39:F >> + * >> + * Became: >> + * mov(8) vgrf39<2>:HF, vgrf39:F >> + * >> + * Note that no instruction is removed. The main benefict is >> reducing the >> + * amoung of registers used, so the pressure on the register >> allocator is >> + * decreased with big shaders. >> + */ >> +bool >> +fs_visitor::reuse_16bit_conversions_vgrf() >> +{ >> + bool progress = false; >> + int ip = -1; >> + >> + calculate_live_intervals(); >> + >> + foreach_block_and_inst_safe (block, fs_inst, inst, cfg) { >> + ip++; >> + >> + if (inst->dst.file != VGRF || inst->src[0].file != VGRF) >> + continue; >> + >> + if (inst->opcode != BRW_OPCODE_MOV) >> + continue; >> + >> + if (type_sz(inst->dst.type) != 2 || inst->dst.stride != 2 || >> + type_sz(inst->src[0].type) != 4 || inst->src[0].stride >> != 1) { >> + continue; >> + } >> + >> + int src_reg = inst->src[0].nr; >> + int src_offset = inst->src[0].offset; >> + unsigned src_var = live_intervals->var_from_vgrf[src_reg]; >> + int src_end = live_intervals->end[src_var]; >> + int dst_reg = inst->dst.nr <http://dst.nr>; >> + >> + if (src_end > ip) >> + continue; >> + >> + foreach_block_and_inst(block, fs_inst, scan_inst, cfg) { >> + if (scan_inst->dst.file == VGRF && >> + scan_inst->dst.nr <http://dst.nr> == dst_reg) { >> + scan_inst->dst.nr <http://dst.nr> = src_reg; >> + scan_inst->dst.offset = src_offset; >> + progress = true; >> + } >> + >> + for (int i = 0; i < scan_inst->sources; i++) { >> + if (scan_inst->src[i].file == VGRF && >> + scan_inst->src[i].nr == dst_reg) { >> + scan_inst->src[i].nr = src_reg; >> + scan_inst->src[i].offset = src_offset; >> + progress = true; >> + } >> + } >> + } >> + } >> + >> + if (progress) >> + invalidate_live_intervals(); >> + >> + return progress; >> +} >> + >> >> static void >> clear_deps_for_inst_src(fs_inst *inst, bool *deps, int first_grf, >> int grf_len) >> @@ -5829,6 +5905,7 @@ fs_visitor::optimize() >> >> OPT(opt_drop_redundant_mov_to_flags); >> OPT(remove_extra_rounding_modes); >> + OPT(reuse_16bit_conversions_vgrf); >> >> do { >> progress = false; >> diff --git a/src/intel/compiler/brw_fs.h b/src/intel/compiler/brw_fs.h >> index b9476e69edb..2685f748b87 100644 >> --- a/src/intel/compiler/brw_fs.h >> +++ b/src/intel/compiler/brw_fs.h >> @@ -151,6 +151,7 @@ public: >> bool dead_code_eliminate(); >> bool remove_duplicate_mrf_writes(); >> bool remove_extra_rounding_modes(); >> + bool reuse_16bit_conversions_vgrf(); >> >> bool opt_sampler_eot(); >> bool virtual_grf_interferes(int a, int b); >> -- >> 2.11.0 >> >> _______________________________________________ >> mesa-dev mailing list >> mesa-dev@lists.freedesktop.org <mailto:mesa-dev@lists.freedesktop.org> >> https://lists.freedesktop.org/mailman/listinfo/mesa-dev >> <https://lists.freedesktop.org/mailman/listinfo/mesa-dev> >> >> >> > > > _______________________________________________ > mesa-dev mailing list > mesa-dev@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
signature.asc
Description: PGP signature
_______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev