On 06/06/2018 04:47 PM, Rob Clark wrote:
> On Wed, Jun 6, 2018 at 6:51 PM, Ian Romanick <i...@freedesktop.org> wrote:
>> On 06/06/2018 07:43 AM, Rob Clark wrote:
>>> Run this pass late (after opt loop) to move load_const instructions back
>>> into the basic blocks which use the result, in cases where a load_const
>>> is only consumed in a single block.
>>
>> If the load_const is used in more than one block, you could move it to
>> the block that dominates all of the blocks that contain a use.  I
>> suspect in many cases that will just be the first block, but it's
>> probably worth trying.
>
> hmm, good point.. I was mostly going for the super-obvious wins with
> no downside cases (ie. low hanging fruit), but I guess the block that
> dominates all the uses isn't going to be worse than the first block..
>
> I can play around w/ the common denominator case.. although in at
> least some cases I think it might be better to split the load_const.
> Idk, the low hanging fruit approach cuts register usage by 1/3rd in
> some of these edge case shaders, which by itself seems like a big
> enough win, but I guess nir_shader_compiler_options config params is
> an option for more aggressive approaches that may or may not be a win
> for different drivers.
True.  We could land this and incrementally improve it later.  If we want
to do that, I'd suggest:

 - Put all of our ideas for improving the pass, including the one already
   documented in the code, in the file header comment.

 - Remove the suggestions for future work from the "if consumed in more
   than one block, ignore" comment.

>>> This helps reduce register usage in cases where the backend driver
>>> cannot lower the load_const to a uniform.
>>
>> The trade off is that now you might not be able to hide the latency of
>> the load.  I tried this on i965, and the results (below) support this
>> idea a bit.  If you have a structured backend IR, this seems like a
>> thing better done there... like, if you can't register allocate without
>> spilling, try moving the load.  But that's really just a scheduling
>> heuristic.  Hmm...
>
> iirc, i965 had some clever load_const lowering passes so all the threads
> in a warp could share the same load_const (iirc? I might not have that
> description quite right).. so I wasn't expecting this to be useful to
> i965 (but if it is, then, bonus!).  For me I can *almost* always lower
> load_const to uniform, so in the common cases it hasn't been a problem.
> The exceptions are cases where an alu instruction consumes >1 const
> (but those get optimized away) or a const plus uniform, or things like
> ssbo/image store.  (I noticed the register usage blowing up thing
> looking at some deqp tests like
> dEQP-GLES31.functional.ssbo.layout.2_level_array.shared.vec4 ;-))

There are some optimizations for loading constants in the vec4 backend.
There is some code to lower arrays of constants to uniforms.  Uniforms
are already shared within the SIMD group.  I don't know of anything
other than that, but that doesn't mean such a thing doesn't exist.

Also... at what point do you call this pass?  Here's what I did for
i965...
basically called it as late as I thought was possible:

diff --git a/src/intel/compiler/brw_nir.c b/src/intel/compiler/brw_nir.c
index dfeea73b06a..2209d6142b3 100644
--- a/src/intel/compiler/brw_nir.c
+++ b/src/intel/compiler/brw_nir.c
@@ -795,6 +795,8 @@ brw_postprocess_nir(nir_shader *nir, const struct brw_compiler *compiler,
    if (devinfo->gen <= 5)
       brw_nir_analyze_boolean_resolves(nir);

+   nir_move_load_const(nir);
+
    nir_sweep(nir);

    if (unlikely(debug_enabled)) {

Now that the full shader-db run is done, it seems to have some odd
effects on vec4 shaders, so I'll probably tweak this a bit.

> If latency of the load is an issue, perhaps a configuration option
> about the minimum size of destination use_block, with the rough theory
> that if the block is bigger than some threshold the instruction
> scheduling could probably hide the latency?

I thought about this a bit more...  I think the driver's scheduler
should fix this case.  If it can't hide the latency of a load immediate,
the scheduler should move it to the earlier block.  That might even help
hide the latency of a compare.  Something like

cmp.g.f0(16)    null    g2<8,8,1>F   -g9.4<0,1,0>F
(+f0) if(16)    JIP: 40  UIP: 40
   END B0 ->B1 ->B2
START B1 <-B0 (40 cycles)
mov(16)         g77<1>F  1F

should become

cmp.g.f0(16)    null    g2<8,8,1>F   -g9.4<0,1,0>F
mov(16)         g77<1>F  1F
(+f0) if(16)    JIP: 40  UIP: 40
   END B0 ->B1 ->B2
START B1 <-B0 (40 cycles)

> BR,
> -R
>
>
>> total instructions in shared programs: 14398410 -> 14397918 (<.01%)
>> instructions in affected programs: 134856 -> 134364 (-0.36%)
>> helped: 53
>> HURT: 10
>> helped stats (abs) min: 1 max: 30 x̄: 9.47 x̃: 7
>> helped stats (rel) min: 0.16% max: 2.74% x̄: 0.64% x̃: 0.40%
>> HURT stats (abs)   min: 1 max: 1 x̄: 1.00 x̃: 1
>> HURT stats (rel)   min: 0.03% max: 0.21% x̄: 0.05% x̃: 0.03%
>> 95% mean confidence interval for instructions value: -9.61 -6.01
>> 95% mean confidence interval for instructions %-change: -0.68% -0.38%
>> Instructions are helped.
>>
>> total cycles in shared programs: 532949823 -> 532851352 (-0.02%)
>> cycles in affected programs: 260693023 -> 260594552 (-0.04%)
>> helped: 122
>> HURT: 88
>> helped stats (abs) min: 1 max: 24933 x̄: 1080.30 x̃: 20
>> helped stats (rel) min: <.01% max: 34.94% x̄: 2.20% x̃: 0.32%
>> HURT stats (abs)   min: 1 max: 1884 x̄: 378.69 x̃: 29
>> HURT stats (rel)   min: <.01% max: 10.23% x̄: 0.64% x̃: 0.09%
>> 95% mean confidence interval for cycles value: -943.71 5.89
>> 95% mean confidence interval for cycles %-change: -1.71% -0.32%
>> Inconclusive result (value mean confidence interval includes 0).
>>
>> total spills in shared programs: 8044 -> 7975 (-0.86%)
>> spills in affected programs: 2391 -> 2322 (-2.89%)
>> helped: 42
>> HURT: 0
>>
>> total fills in shared programs: 10950 -> 10884 (-0.60%)
>> fills in affected programs: 2999 -> 2933 (-2.20%)
>> helped: 42
>> HURT: 0
>>
>> LOST:   4
>> GAINED: 7
>>
>>> Signed-off-by: Rob Clark <robdcl...@gmail.com>
>>> ---
>>>  src/compiler/Makefile.sources          |   1 +
>>>  src/compiler/nir/meson.build           |   1 +
>>>  src/compiler/nir/nir.h                 |   1 +
>>>  src/compiler/nir/nir_move_load_const.c | 131 +++++++++++++++++++++++++
>>>  4 files changed, 134 insertions(+)
>>>  create mode 100644 src/compiler/nir/nir_move_load_const.c
>>>
>>> diff --git a/src/compiler/Makefile.sources b/src/compiler/Makefile.sources
>>> index 3daa2c51334..cb8a2875a84 100644
>>> --- a/src/compiler/Makefile.sources
>>> +++ b/src/compiler/Makefile.sources
>>> @@ -253,6 +253,7 @@ NIR_FILES = \
>>>          nir/nir_lower_wpos_center.c \
>>>          nir/nir_lower_wpos_ytransform.c \
>>>          nir/nir_metadata.c \
>>> +        nir/nir_move_load_const.c \
>>>          nir/nir_move_vec_src_uses_to_dest.c \
>>>          nir/nir_normalize_cubemap_coords.c \
>>>          nir/nir_opt_conditional_discard.c \
>>> diff --git a/src/compiler/nir/meson.build b/src/compiler/nir/meson.build
>>> index 3fec363691d..373a47fd155 100644
>>> --- a/src/compiler/nir/meson.build
>>> +++ b/src/compiler/nir/meson.build
>>> @@ -144,6 +144,7 @@ files_libnir = files(
>>>    'nir_lower_wpos_ytransform.c',
>>>    'nir_lower_bit_size.c',
>>>    'nir_metadata.c',
>>> +  'nir_move_load_const.c',
>>>    'nir_move_vec_src_uses_to_dest.c',
>>>    'nir_normalize_cubemap_coords.c',
>>>    'nir_opt_conditional_discard.c',
>>> diff --git a/src/compiler/nir/nir.h b/src/compiler/nir/nir.h
>>> index 5a1f79515ad..073ab4e82ea 100644
>>> --- a/src/compiler/nir/nir.h
>>> +++ b/src/compiler/nir/nir.h
>>> @@ -2612,6 +2612,7 @@ bool nir_remove_dead_variables(nir_shader *shader, nir_variable_mode modes);
>>>  bool nir_lower_constant_initializers(nir_shader *shader,
>>>                                       nir_variable_mode modes);
>>>
>>> +bool nir_move_load_const(nir_shader *shader);
>>>  bool nir_move_vec_src_uses_to_dest(nir_shader *shader);
>>>  bool nir_lower_vec_to_movs(nir_shader *shader);
>>>  void nir_lower_alpha_test(nir_shader *shader, enum compare_func func,
>>> diff --git a/src/compiler/nir/nir_move_load_const.c b/src/compiler/nir/nir_move_load_const.c
>>> new file mode 100644
>>> index 00000000000..c0750035fbb
>>> --- /dev/null
>>> +++ b/src/compiler/nir/nir_move_load_const.c
>>> @@ -0,0 +1,131 @@
>>> +/*
>>> + * Copyright © 2018 Red Hat
>>> + *
>>> + * Permission is hereby granted, free of charge, to any person obtaining a
>>> + * copy of this software and associated documentation files (the "Software"),
>>> + * to deal in the Software without restriction, including without limitation
>>> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
>>> + * and/or sell copies of the Software, and to permit persons to whom the
>>> + * Software is furnished to do so, subject to the following conditions:
>>> + *
>>> + * The above copyright notice and this permission notice (including the next
>>> + * paragraph) shall be included in all copies or substantial portions of the
>>> + * Software.
>>> + *
>>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
>>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
>>> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
>>> + * IN THE SOFTWARE.
>>> + *
>>> + * Authors:
>>> + *    Rob Clark <robdcl...@gmail.com>
>>> + *
>>> + */
>>> +
>>> +#include "nir.h"
>>> +
>>> +
>>> +/*
>>> + * A simple pass that moves load_const's into consuming block if
>>> + * they are only consumed in a single block, to try to counter-
>>> + * act nir's tendency to move all load_const to the top of the
>>> + * first block.
>>> + */
>>> +
>>> +/* iterate a ssa def's uses and if all uses are within a single
>>> + * block, return it, otherwise return null.
>>> + */
>>> +static nir_block *
>>> +get_single_block(nir_ssa_def *def)
>>> +{
>>> +   nir_block *b = NULL;
>>> +
>>> +   /* hmm, probably ignore if-uses: */
>>> +   if (!list_empty(&def->if_uses))
>>> +      return NULL;
>>> +
>>> +   nir_foreach_use(use, def) {
>>> +      nir_instr *instr = use->parent_instr;
>>> +
>>> +      /*
>>> +       * Kind of an ugly special-case, but phi instructions
>>> +       * need to appear first in the block, so by definition
>>> +       * we can't move a load_immed into a block where it is
>>> +       * consumed by a phi instruction.  We could conceivably
>>> +       * move it into a dominator block.
>>> +       */
>>> +      if (instr->type == nir_instr_type_phi)
>>> +         return NULL;
>>> +
>>> +      nir_block *use_block = instr->block;
>>> +
>>> +      if (!b) {
>>> +         b = use_block;
>>> +      } else if (b != use_block) {
>>> +         return NULL;
>>> +      }
>>> +   }
>>> +
>>> +   return b;
>>> +}
>>> +
>>> +bool
>>> +nir_move_load_const(nir_shader *shader)
>>> +{
>>> +   bool progress = false;
>>> +
>>> +   nir_foreach_function(function, shader) {
>>> +      if (!function->impl)
>>> +         continue;
>>> +
>>> +      nir_foreach_block(block, function->impl) {
>>> +         nir_foreach_instr_safe(instr, block) {
>>> +            if (instr->type != nir_instr_type_load_const)
>>> +               continue;
>>> +
>>> +            nir_load_const_instr *load =
>>> +               nir_instr_as_load_const(instr);
>>> +            nir_block *use_block =
>>> +               get_single_block(&load->def);
>>> +
>>> +            /* if consumed in more than one block, ignore.  Possibly
>>> +             * there should be a threshold, ie. if used in a small #
>>> +             * of blocks in a large shader, it could be better to
>>> +             * split the load_const.  For now ignore that and focus
>>> +             * on the clear wins:
>>> +             */
>>> +            if (!use_block)
>>> +               continue;
>>> +
>>> +            if (use_block == load->instr.block)
>>> +               continue;
>>> +
>>> +            exec_node_remove(&load->instr.node);
>>> +
>>> +            /* insert before first non-phi instruction: */
>>> +            nir_foreach_instr(instr2, use_block) {
>>> +               if (instr2->type == nir_instr_type_phi)
>>> +                  continue;
>>> +
>>> +               exec_node_insert_node_before(&instr2->node,
>>> +                                            &load->instr.node);
>>> +
>>> +               break;
>>> +            }
>>> +
>>> +            load->instr.block = use_block;
>>> +
>>> +            progress = true;
>>> +         }
>>> +      }
>>> +
>>> +      nir_metadata_preserve(function->impl,
>>> +                            nir_metadata_block_index | nir_metadata_dominance);
>>> +   }
>>> +
>>> +
>>> +   return progress;
>>> +}

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev