On Thu, Jun 22, 2017 at 12:20 PM, Francisco Jerez <curroje...@riseup.net> wrote: > Unnecessary GRF bank conflicts increase the issue time of ternary > instructions (the overwhelmingly most common of which is MAD) by > roughly 50%, leading to reduced ALU throughput. This pass attempts to > minimize the number of bank conflicts by rearranging the layout of the > GRF space post-register allocation. It's in general not possible to > eliminate all of them without introducing extra copies, which are > typically more expensive than the bank conflict itself. > > In a shader-db run on SKL this helps roughly 46k shaders: > > total conflicts in shared programs: 1008981 -> 600461 (-40.49%) > conflicts in affected programs: 816222 -> 407702 (-50.05%) > helped: 46234 > HURT: 72 > > The running time of shader-db itself on SKL seems to be increased by > roughly 2.52%±1.13% with n=20 due to the additional work done by the > compiler back-end. > > On earlier generations the pass is somewhat less effective in relative > terms because the hardware incurs a bank conflict anytime the last two > sources of the instruction are duplicate (e.g. while trying to square > a value using MAD), which is impossible to avoid without introducing > copies. E.g. for a shader-db run on SNB: > > total conflicts in shared programs: 944636 -> 623185 (-34.03%) > conflicts in affected programs: 853258 -> 531807 (-37.67%) > helped: 31052 > HURT: 19 > > And on BDW: > > total conflicts in shared programs: 1418393 -> 987539 (-30.38%) > conflicts in affected programs: 1179787 -> 748933 (-36.52%) > helped: 47592 > HURT: 70 > > On SKL GT4e this improves performance of GpuTest Volplosion by 3.64% > ±0.33% with n=16. > > NOTE: This patch intentionally disregards some i965 coding conventions > for the sake of reviewability. This is addressed by the next > squash patch which introduces an amount of (for the most part > boring) boilerplate that might distract reviewers from the > non-trivial algorithmic details of the pass. > --- > src/intel/Makefile.sources | 1 + > src/intel/compiler/brw_fs.cpp | 2 + > src/intel/compiler/brw_fs.h | 1 + > src/intel/compiler/brw_fs_bank_conflicts.cpp | 791 > +++++++++++++++++++++++++++ > 4 files changed, 795 insertions(+) > create mode 100644 src/intel/compiler/brw_fs_bank_conflicts.cpp > > diff --git a/src/intel/Makefile.sources b/src/intel/Makefile.sources > index a877ff2..1b9799a 100644 > --- a/src/intel/Makefile.sources > +++ b/src/intel/Makefile.sources > @@ -44,6 +44,7 @@ COMPILER_FILES = \ > compiler/brw_eu_util.c \ > compiler/brw_eu_validate.c \ > compiler/brw_fs_builder.h \ > + compiler/brw_fs_bank_conflicts.cpp \
Use tabs in Makefiles > compiler/brw_fs_cmod_propagation.cpp \ > compiler/brw_fs_combine_constants.cpp \ > compiler/brw_fs_copy_propagation.cpp \ > diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp > index 43b6e34..0a85c0c 100644 > --- a/src/intel/compiler/brw_fs.cpp > +++ b/src/intel/compiler/brw_fs.cpp > @@ -5858,6 +5858,8 @@ fs_visitor::allocate_registers(bool allow_spilling) > if (failed) > return; > > + opt_bank_conflicts(); > + > schedule_instructions(SCHEDULE_POST); > > if (last_scratch > 0) { > diff --git a/src/intel/compiler/brw_fs.h b/src/intel/compiler/brw_fs.h > index 6c8c027..b1fc7b3 100644 > --- a/src/intel/compiler/brw_fs.h > +++ b/src/intel/compiler/brw_fs.h > @@ -141,6 +141,7 @@ public: > exec_list *acp); > bool opt_drop_redundant_mov_to_flags(); > bool opt_register_renaming(); > + bool opt_bank_conflicts(); > bool register_coalesce(); > bool compute_to_mrf(); > bool eliminate_find_live_channel(); > diff --git a/src/intel/compiler/brw_fs_bank_conflicts.cpp > b/src/intel/compiler/brw_fs_bank_conflicts.cpp > new file mode 100644 > index 0000000..0225c70 > --- /dev/null > +++ b/src/intel/compiler/brw_fs_bank_conflicts.cpp > @@ -0,0 +1,791 @@ > +/* > + * Copyright © 2017 Intel Corporation > + * > + * Permission is hereby granted, free of charge, to any person obtaining a > + * copy of this software and associated documentation files (the "Software"), > + * to deal in the Software without restriction, including without limitation > + * the rights to use, copy, modify, merge, publish, distribute, sublicense, > + * and/or sell copies of the Software, and to permit persons to whom the > + * Software is furnished to do so, subject to the following conditions: > + * > + * The above copyright notice and this permission notice (including the next > + * paragraph) shall be included in all copies or substantial portions of the > + * Software. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING > + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER > DEALINGS > + * IN THE SOFTWARE. > + */ > + > +/** @file brw_fs_bank_conflicts.cpp > + * > + * This file contains a GRF bank conflict mitigation pass. The pass is > + * intended to be run after register allocation and works by rearranging the > + * layout of the GRF space (hopefully without altering the semantics of the > + * program) in a way that minimizes the number of GRF bank conflicts incurred I'd remove the parenthetical. The pass better not change the semantics of the program! > + * by ternary instructions. > + * > + * Unfortunately there is close to no information about bank conflicts in the > + * hardware spec, but experimentally on Gen7-Gen9 ternary instructions seem > to > + * incur an average bank conflict penalty of one cycle per SIMD8 op whenever > + * the second and third source are stored in the same GRF bank (\sa bank_of() > + * for the exact bank layout) which cannot be fetched during the same cycle > by > + * the EU, unless the EU logic manages to optimize out the read cycle of a > + * duplicate source register (\sa is_conflict_optimized_out()). > + * > + * The asymptotic run-time of the algorithm is dominated by the > + * shader_conflict_weight_matrix() computation below, which is O(n) on the > + * number of instructions in the program, however for small and medium-sized > + * programs the run-time is likely to be dominated by > + * optimize_reg_permutation() which is O(m^3) on the number of GRF atoms of > + * the program (\sa partitioning), which is bounded (since the program uses a > + * bounded number of registers post-regalloc) and of the order of 100. For > + * that reason optimize_reg_permutation() is vectorized in order to keep the > + * cubic term within reasonable bounds for m close to its theoretical > maximum. > + */ > + > +#include "brw_fs.h" > +#include "brw_cfg.h" > + > +#include <vector> > +#include <array> > + > +#ifdef __SSE2__ > + > +#include <emmintrin.h> > + > +/** > + * Thin layer around vector intrinsics so they can be easily replaced with > + * e.g. the fall-back scalar path, an implementation with different vector > + * width or using different SIMD architectures (AVX-512?!). > + * > + * This implementation operates on pairs of independent SSE2 integer vectors > à > + * la SIMD16 for somewhat improved throughput. SSE2 is supported by > virtually > + * all platforms that care about bank conflicts, so this path should almost SSE2 is available on all processors that can be paired with a graphics core supported by i965. We build i965 with -msse2 now (this wasn't the case when you sent the patch initially). I'm fine with leaving the #ifdef __SSE2__ in the code if the other path might help with debugging someday. _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev