This series (which is ready for production and improves the cycle count of over 46k shaders) has been sitting here for nearly half a year. I'm planning to self-review it and land it (along with the PATCH 3/2 I just sent, meant to keep regressions under control) if nobody else reviews it in the next two weeks.
Francisco Jerez <curroje...@riseup.net> writes:

> Unnecessary GRF bank conflicts increase the issue time of ternary
> instructions (the overwhelmingly most common of which is MAD) by
> roughly 50%, leading to reduced ALU throughput.  This pass attempts to
> minimize the number of bank conflicts by rearranging the layout of the
> GRF space post-register allocation.  It's in general not possible to
> eliminate all of them without introducing extra copies, which are
> typically more expensive than the bank conflict itself.
>
> In a shader-db run on SKL this helps roughly 46k shaders:
>
>    total conflicts in shared programs: 1008981 -> 600461 (-40.49%)
>    conflicts in affected programs: 816222 -> 407702 (-50.05%)
>    helped: 46234
>    HURT: 72
>
> The running time of shader-db itself on SKL seems to be increased by
> roughly 2.52%±1.13% with n=20 due to the additional work done by the
> compiler back-end.
>
> On earlier generations the pass is somewhat less effective in relative
> terms because the hardware incurs a bank conflict anytime the last two
> sources of the instruction are duplicates (e.g. while trying to square
> a value using MAD), which is impossible to avoid without introducing
> copies.  E.g. for a shader-db run on SNB:
>
>    total conflicts in shared programs: 944636 -> 623185 (-34.03%)
>    conflicts in affected programs: 853258 -> 531807 (-37.67%)
>    helped: 31052
>    HURT: 19
>
> And on BDW:
>
>    total conflicts in shared programs: 1418393 -> 987539 (-30.38%)
>    conflicts in affected programs: 1179787 -> 748933 (-36.52%)
>    helped: 47592
>    HURT: 70
>
> On SKL GT4e this improves performance of GpuTest Volplosion by 3.64%
> ±0.33% with n=16.
>
> NOTE: This patch intentionally disregards some i965 coding conventions
>       for the sake of reviewability.  This is addressed by the next
>       squash patch which introduces an amount of (for the most part
>       boring) boilerplate that might distract reviewers from the
>       non-trivial algorithmic details of the pass.
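Self-review note: to make the conflict condition concrete for other readers, here's a made-up example of the situation being optimized. This is not part of the patch; bank_of() is copied verbatim from further down in the file and the register numbers are arbitrary. Think of "mad(8) g120, g10, g11, g13", whose second and third sources both live in bank 1:

   #include <cassert>

   /* Copied from the pass below for a standalone sanity check. */
   static unsigned bank_of(unsigned reg) { return (reg & 0x40) >> 5 | (reg & 1); }

   int main()
   {
      assert(bank_of(11) == bank_of(13));  /* src1/src2 conflict: ~1 cycle per SIMD8 op */
      assert(bank_of(11) != bank_of(14));  /* relocating src2 to g14 avoids it          */
      return 0;
   }

The pass achieves the second layout by permuting whole allocation units ("atoms") after regalloc instead of inserting copies.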
> ---
>  src/intel/Makefile.sources                   |   1 +
>  src/intel/compiler/brw_fs.cpp                |   2 +
>  src/intel/compiler/brw_fs.h                  |   1 +
>  src/intel/compiler/brw_fs_bank_conflicts.cpp | 791 +++++++++++++++++++++++++++
>  4 files changed, 795 insertions(+)
>  create mode 100644 src/intel/compiler/brw_fs_bank_conflicts.cpp
>
> diff --git a/src/intel/Makefile.sources b/src/intel/Makefile.sources
> index a877ff2..1b9799a 100644
> --- a/src/intel/Makefile.sources
> +++ b/src/intel/Makefile.sources
> @@ -44,6 +44,7 @@ COMPILER_FILES = \
>  	compiler/brw_eu_util.c \
>  	compiler/brw_eu_validate.c \
>  	compiler/brw_fs_builder.h \
> +	compiler/brw_fs_bank_conflicts.cpp \
>  	compiler/brw_fs_cmod_propagation.cpp \
>  	compiler/brw_fs_combine_constants.cpp \
>  	compiler/brw_fs_copy_propagation.cpp \
> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
> index 43b6e34..0a85c0c 100644
> --- a/src/intel/compiler/brw_fs.cpp
> +++ b/src/intel/compiler/brw_fs.cpp
> @@ -5858,6 +5858,8 @@ fs_visitor::allocate_registers(bool allow_spilling)
>     if (failed)
>        return;
>
> +   opt_bank_conflicts();
> +
>     schedule_instructions(SCHEDULE_POST);
>
>     if (last_scratch > 0) {
> diff --git a/src/intel/compiler/brw_fs.h b/src/intel/compiler/brw_fs.h
> index 6c8c027..b1fc7b3 100644
> --- a/src/intel/compiler/brw_fs.h
> +++ b/src/intel/compiler/brw_fs.h
> @@ -141,6 +141,7 @@ public:
>                                  exec_list *acp);
>     bool opt_drop_redundant_mov_to_flags();
>     bool opt_register_renaming();
> +   bool opt_bank_conflicts();
>     bool register_coalesce();
>     bool compute_to_mrf();
>     bool eliminate_find_live_channel();
> diff --git a/src/intel/compiler/brw_fs_bank_conflicts.cpp b/src/intel/compiler/brw_fs_bank_conflicts.cpp
> new file mode 100644
> index 0000000..0225c70
> --- /dev/null
> +++ b/src/intel/compiler/brw_fs_bank_conflicts.cpp
> @@ -0,0 +1,791 @@
> +/*
> + * Copyright © 2017 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + */
> +
> +/** @file brw_fs_bank_conflicts.cpp
> + *
> + * This file contains a GRF bank conflict mitigation pass.  The pass is
> + * intended to be run after register allocation and works by rearranging the
> + * layout of the GRF space (hopefully without altering the semantics of the
> + * program) in a way that minimizes the number of GRF bank conflicts incurred
> + * by ternary instructions.
> + *
> + * Unfortunately there is close to no information about bank conflicts in the
> + * hardware spec, but experimentally on Gen7-Gen9 ternary instructions seem to
> + * incur an average bank conflict penalty of one cycle per SIMD8 op whenever
> + * the second and third source are stored in the same GRF bank (\sa bank_of()
> + * for the exact bank layout), which cannot be fetched during the same cycle
> + * by the EU, unless the EU logic manages to optimize out the read cycle of a
> + * duplicate source register (\sa is_conflict_optimized_out()).
> + *
> + * The asymptotic run-time of the algorithm is dominated by the
> + * shader_conflict_weight_matrix() computation below, which is O(n) in the
> + * number of instructions of the program; however, for small and medium-sized
> + * programs the run-time is likely to be dominated by
> + * optimize_reg_permutation(), which is O(m^3) in the number of GRF atoms of
> + * the program (\sa partitioning), which is bounded (since the program uses a
> + * bounded number of registers post-regalloc) and of the order of 100.  For
> + * that reason optimize_reg_permutation() is vectorized in order to keep the
> + * cubic term within reasonable bounds for m close to its theoretical maximum.
> + */
> +
> +#include "brw_fs.h"
> +#include "brw_cfg.h"
> +
> +#include <vector>
> +#include <array>
> +
> +#ifdef __SSE2__
> +
> +#include <emmintrin.h>
> +
> +/**
> + * Thin layer around vector intrinsics so they can be easily replaced with
> + * e.g. the fall-back scalar path, an implementation with different vector
> + * width or using different SIMD architectures (AVX-512?!).
> + *
> + * This implementation operates on pairs of independent SSE2 integer vectors
> + * à la SIMD16 for somewhat improved throughput.  SSE2 is supported by
> + * virtually all platforms that care about bank conflicts, so this path
> + * should almost always be available in practice.
> + */
> +namespace {
> +   /**
> +    * SIMD integer vector data type.
> +    */
> +   typedef std::array<__m128i, 2> vector_type;
> +
> +   /**
> +    * Scalar data type matching the representation of a single component of
> +    * \p vector_type.
> +    */
> +   typedef int16_t scalar_type;
> +
> +   /**
> +    * Maximum integer value representable as a \p scalar_type.
> +    */
> +   const scalar_type max_scalar = INT16_MAX;
> +
> +   /**
> +    * Number of components of a \p vector_type.
> +    */
> +   const unsigned vector_width = 2 * sizeof(vector_type::value_type) /
> +                                 sizeof(scalar_type);
> +
> +   /**
> +    * Set the i-th component of vector \p v to \p x.
> +    */
> +   void
> +   set(vector_type &v, unsigned i, scalar_type x)
> +   {
> +      assert(i < vector_width);
> +      memcpy((char *)v.data() + i * sizeof(x), &x, sizeof(x));
> +   }
> +
> +   /**
> +    * Get the i-th component of vector \p v.
> +    */
> +   scalar_type
> +   get(const vector_type &v, unsigned i)
> +   {
> +      assert(i < vector_width);
> +      scalar_type x;
> +      memcpy(&x, (char *)v.data() + i * sizeof(x), sizeof(x));
> +      return x;
> +   }
> +
> +   /**
> +    * Add two vectors with saturation.
> +    */
> +   vector_type
> +   adds(const vector_type &v, const vector_type &w)
> +   {
> +      const vector_type u = {
> +         _mm_adds_epi16(v[0], w[0]),
> +         _mm_adds_epi16(v[1], w[1])
> +      };
> +      return u;
> +   }
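In case the saturating intrinsics are unfamiliar: the point of using _mm_adds_epi16 rather than plain addition is that accumulated conflict weights clamp at max_scalar instead of wrapping around and turning into large negative "benefits". A quick standalone check, not part of the patch, assuming an SSE2-capable host:

   #include <emmintrin.h>
   #include <cassert>
   #include <cstdint>

   int main()
   {
      const __m128i a = _mm_set1_epi16(INT16_MAX);
      const __m128i b = _mm_set1_epi16(1);
      /* Saturating add clamps at INT16_MAX instead of wrapping to
       * INT16_MIN, which keeps the cost estimates monotonic. */
      const __m128i c = _mm_adds_epi16(a, b);
      assert(_mm_extract_epi16(c, 0) == INT16_MAX);
      return 0;
   }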
> +
> +   /**
> +    * Subtract two vectors with saturation.
> +    */
> +   vector_type
> +   subs(const vector_type &v, const vector_type &w)
> +   {
> +      const vector_type u = {
> +         _mm_subs_epi16(v[0], w[0]),
> +         _mm_subs_epi16(v[1], w[1])
> +      };
> +      return u;
> +   }
> +
> +   /**
> +    * Compute the bitwise conjunction of two vectors.
> +    */
> +   vector_type
> +   mask(const vector_type &v, const vector_type &w)
> +   {
> +      const vector_type u = {
> +         _mm_and_si128(v[0], w[0]),
> +         _mm_and_si128(v[1], w[1])
> +      };
> +      return u;
> +   }
> +
> +   /**
> +    * Reduce the components of a vector using saturating addition.
> +    */
> +   scalar_type
> +   sums(const vector_type &v)
> +   {
> +      const __m128i v8 = _mm_adds_epi16(v[0], v[1]);
> +      const __m128i v4 = _mm_adds_epi16(v8, _mm_shuffle_epi32(v8, 0x4e));
> +      const __m128i v2 = _mm_adds_epi16(v4, _mm_shuffle_epi32(v4, 0xb1));
> +      const __m128i v1 = _mm_adds_epi16(v2, _mm_shufflelo_epi16(v2, 0xb1));
> +      return _mm_extract_epi16(v1, 0);
> +   }
> +}
> +
> +#else
> +
> +/**
> + * Thin layer around vector intrinsics so they can be easily replaced with
> + * e.g. the fall-back scalar path, an implementation with different vector
> + * width or using different SIMD architectures (AVX-512?!).
> + *
> + * This implementation operates on scalar values and doesn't rely on
> + * any vector extensions.  This is mainly intended for debugging and
> + * to keep this file building on exotic platforms.
> + */
> +namespace {
> +   /**
> +    * SIMD integer vector data type.
> +    */
> +   typedef int16_t vector_type;
> +
> +   /**
> +    * Scalar data type matching the representation of a single component of
> +    * \p vector_type.
> +    */
> +   typedef int16_t scalar_type;
> +
> +   /**
> +    * Maximum integer value representable as a \p scalar_type.
> +    */
> +   const scalar_type max_scalar = INT16_MAX;
> +
> +   /**
> +    * Number of components of a \p vector_type.
> +    */
> +   const unsigned vector_width = 1;
> +
> +   /**
> +    * Set the i-th component of vector \p v to \p x.
> +    */
> +   void
> +   set(vector_type &v, unsigned i, scalar_type x)
> +   {
> +      assert(i < vector_width);
> +      v = x;
> +   }
> +
> +   /**
> +    * Get the i-th component of vector \p v.
> +    */
> +   scalar_type
> +   get(const vector_type &v, unsigned i)
> +   {
> +      assert(i < vector_width);
> +      return v;
> +   }
> +
> +   /**
> +    * Add two vectors with saturation.
> +    */
> +   vector_type
> +   adds(vector_type v, vector_type w)
> +   {
> +      return std::max(INT16_MIN, std::min(INT16_MAX, int(v) + w));
> +   }
> +
> +   /**
> +    * Subtract two vectors with saturation.
> +    */
> +   vector_type
> +   subs(vector_type v, vector_type w)
> +   {
> +      return std::max(INT16_MIN, std::min(INT16_MAX, int(v) - w));
> +   }
> +
> +   /**
> +    * Compute the bitwise conjunction of two vectors.
> +    */
> +   vector_type
> +   mask(vector_type v, vector_type w)
> +   {
> +      return v & w;
> +   }
> +
> +   /**
> +    * Reduce the components of a vector using saturating addition.
> +    */
> +   scalar_type
> +   sums(vector_type v)
> +   {
> +      return v;
> +   }
> +}
> +
> +#endif
> +
> +namespace {
> +   /**
> +    * Variable-length vector type intended to represent cycle-count costs
> +    * for arbitrary atom-to-bank assignments.  It's indexed by a pair of
> +    * integers (i, p), where i is an atom index and p in {0, 1} indicates
> +    * the parity of the conflict (respectively, whether the cost is incurred
> +    * whenever the atoms are assigned the same bank b or opposite-parity
> +    * banks b and b^1).  \sa shader_conflict_weight_matrix()
> +    */
> +   typedef std::vector<vector_type> weight_vector_type;
> +
> +   /**
> +    * Set the (i, p)-th component of weight vector \p v to \p x.
> +    */
> +   void
> +   set(weight_vector_type &v, unsigned i, unsigned p, scalar_type x)
> +   {
> +      set(v[(2 * i + p) / vector_width], (2 * i + p) % vector_width, x);
> +   }
> +
> +   /**
> +    * Get the (i, p)-th component of weight vector \p v.
> +    */
> +   scalar_type
> +   get(const weight_vector_type &v, unsigned i, unsigned p)
> +   {
> +      return get(v[(2 * i + p) / vector_width], (2 * i + p) % vector_width);
> +   }
> +
> +   /**
> +    * Swap the (i, p)-th and (j, q)-th components of weight vector \p v.
> +    */
> +   void
> +   swap(weight_vector_type &v,
> +        unsigned i, unsigned p,
> +        unsigned j, unsigned q)
> +   {
> +      const scalar_type tmp = get(v, i, p);
> +      set(v, i, p, get(v, j, q));
> +      set(v, j, q, tmp);
> +   }
> +}
> +
> +namespace {
> +   /**
> +    * Object that represents the partitioning of an arbitrary register
> +    * space into indivisible units (referred to as atoms below) that can
> +    * potentially be rearranged independently from other registers.  The
> +    * partitioning is inferred from a number of contiguity requirements
> +    * specified using require_contiguous().  This allows efficient look-up
> +    * of the atom index a given register address belongs to, or conversely
> +    * the range of register addresses that belong to a given atom.
> +    */
> +   struct partitioning {
> +      /**
> +       * Create a (for the moment unrestricted) partitioning of a register
> +       * file of size \p n.  The units are arbitrary.
> +       */
> +      partitioning(unsigned n)
> +      {
> +         for (unsigned i = 0; i < n + num_terminator_atoms; i++) {
> +            offsets.push_back(i);
> +            atoms.push_back(i);
> +         }
> +      }
> +
> +      /**
> +       * Require register range [reg, reg + n[ to be considered part of the
> +       * same atom.
> +       */
> +      void
> +      require_contiguous(unsigned reg, unsigned n)
> +      {
> +         unsigned r = atoms[reg];
> +
> +         /* Renumber atoms[reg...] = { r... } and their offsets[r...] for
> +          * the case that the specified contiguity requirement leads to the
> +          * fusion (yay) of one or more existing atoms.
> +          */
> +         for (unsigned reg1 = reg + 1; reg1 < atoms.size(); reg1++) {
> +            if (offsets[atoms[reg1]] < reg + n) {
> +               atoms[reg1] = r;
> +            } else {
> +               if (offsets[atoms[reg1 - 1]] != offsets[atoms[reg1]])
> +                  r++;
> +
> +               offsets[r] = offsets[atoms[reg1]];
> +               atoms[reg1] = r;
> +            }
> +         }
> +
> +         /* Clean up the scraps if we ended up with fewer atoms than we
> +          * started with.
> +          */
> +         offsets.erase(offsets.begin() + r + 1, offsets.end());
> +      }
> +
> +      /**
> +       * Get the index of the atom register address \p reg belongs to.
> +       */
> +      unsigned
> +      atom_of_reg(unsigned reg) const
> +      {
> +         return atoms[reg];
> +      }
> +
> +      /**
> +       * Get the base register address that belongs to atom \p r.
> +       */
> +      unsigned
> +      reg_of_atom(unsigned r) const
> +      {
> +         return offsets[r];
> +      }
> +
> +      /**
> +       * Get the size of atom \p r in register address units.
> +       */
> +      unsigned
> +      size_of_atom(unsigned r) const
> +      {
> +         assert(r < num_atoms());
> +         return reg_of_atom(r + 1) - reg_of_atom(r);
> +      }
> +
> +      /**
> +       * Get the number of atoms the whole register space is partitioned
> +       * into.
> +       */
> +      unsigned
> +      num_atoms() const
> +      {
> +         return offsets.size() - num_terminator_atoms;
> +      }
> +
> +   private:
> +      /**
> +       * Number of trailing atoms inserted for convenience so among other
> +       * things we don't need to special-case the last element in
> +       * size_of_atom().
> +       */
> +      static const unsigned num_terminator_atoms = 1;
> +
> +      std::vector<unsigned> offsets;
> +      std::vector<unsigned> atoms;
> +   };
> +
> +   /**
> +    * Only GRF sources (whether they have been register-allocated or not)
> +    * can possibly incur bank conflicts.
> +    */
> +   bool
> +   is_grf(const fs_reg &r)
> +   {
> +      return r.file == VGRF || r.file == FIXED_GRF;
> +   }
> +
> +   /**
> +    * Register offset of \p r in GRF units.  Useful because the
> +    * representation of GRFs post-register allocation is somewhat
> +    * inconsistent and depends on whether the register already had a fixed
> +    * GRF offset prior to register allocation or whether it was part of a
> +    * VGRF allocation.
> +    */
> +   unsigned
> +   reg_of(const fs_reg &r)
> +   {
> +      assert(is_grf(r));
> +      if (r.file == VGRF)
> +         return r.nr + r.offset / REG_SIZE;
> +      else
> +         return reg_offset(r) / REG_SIZE;
> +   }
> +
> +   /**
> +    * Calculate the finest partitioning of the GRF space compatible with
> +    * the register contiguity requirements derived from all the
> +    * instructions in the program.
> +    */
> +   partitioning
> +   shader_reg_partitioning(const fs_visitor *v)
> +   {
> +      partitioning p(BRW_MAX_GRF);
> +
> +      foreach_block_and_inst(block, fs_inst, inst, v->cfg) {
> +         if (is_grf(inst->dst))
> +            p.require_contiguous(reg_of(inst->dst), regs_written(inst));
> +
> +         for (int i = 0; i < inst->sources; i++) {
> +            if (is_grf(inst->src[i]))
> +               p.require_contiguous(reg_of(inst->src[i]), regs_read(inst, i));
> +         }
> +      }
> +
> +      return p;
> +   }
> +
> +   /**
> +    * Return the set of GRF atoms that should be left untouched at their
> +    * original location to avoid violating hardware or software assumptions.
> +    */
> +   std::vector<bool>
> +   shader_reg_constraints(const fs_visitor *v, const partitioning &p)
> +   {
> +      std::vector<bool> constrained(p.num_atoms());
> +
> +      /* These are read implicitly by some send-message instructions without
> +       * any indication at the IR level.  Assume they are unsafe to move
> +       * around.
> +       */
> +      for (unsigned reg = 0; reg < 2; reg++)
> +         constrained[p.atom_of_reg(reg)] = true;
> +
> +      /* Assume that anything referenced via fixed GRFs is baked into the
> +       * hardware's fixed-function logic and may be unsafe to move around.
> +       * Also take into account the source GRF restrictions of EOT
> +       * send-message instructions.
> +       */
> +      foreach_block_and_inst(block, fs_inst, inst, v->cfg) {
> +         if (inst->dst.file == FIXED_GRF)
> +            constrained[p.atom_of_reg(reg_of(inst->dst))] = true;
> +
> +         for (int i = 0; i < inst->sources; i++) {
> +            if (inst->src[i].file == FIXED_GRF ||
> +                (is_grf(inst->src[i]) && inst->eot))
> +               constrained[p.atom_of_reg(reg_of(inst->src[i]))] = true;
> +         }
> +      }
> +
> +      return constrained;
> +   }
> +
> +   /**
> +    * Return whether the hardware will be able to prevent a bank conflict
> +    * by optimizing out the read cycle of a source register.  The formula
> +    * was found experimentally.
> +    */
> +   bool
> +   is_conflict_optimized_out(const gen_device_info *devinfo,
> +                             const fs_inst *inst)
> +   {
> +      return devinfo->gen >= 9 &&
> +             ((is_grf(inst->src[0]) &&
> +               (reg_of(inst->src[0]) == reg_of(inst->src[1]) ||
> +                reg_of(inst->src[0]) == reg_of(inst->src[2]))) ||
> +              reg_of(inst->src[1]) == reg_of(inst->src[2]));
> +   }
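The typical case caught by is_conflict_optimized_out() is squaring a value, e.g. "mad g40, g10, g20, g20". My reading of the formula as a toy predicate over plain GRF numbers, not part of the patch (the real function takes an fs_inst and additionally guards the src0 comparisons with is_grf()):

   #include <cassert>

   static bool
   optimized_out(int gen, unsigned src0, unsigned src1, unsigned src2)
   {
      return gen >= 9 && (src0 == src1 || src0 == src2 || src1 == src2);
   }

   int main()
   {
      assert(optimized_out(9, 10, 20, 20));    /* SKL+: duplicate read folded  */
      assert(!optimized_out(8, 10, 20, 20));   /* BDW and earlier still pay it */
      return 0;
   }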
> +
> +   /**
> +    * Return a matrix that allows reasonably efficient computation of the
> +    * cycle-count cost of bank conflicts incurred throughout the whole
> +    * program for any given atom-to-bank assignment.
> +    *
> +    * More precisely, if C_r_s_p is the result of this function, the total
> +    * cost of all bank conflicts involving any given atom r can be readily
> +    * recovered as follows:
> +    *
> +    *    S(B) = Sum_s_p(d_(p^B_r)_(B_s) * C_r_s_p)
> +    *
> +    * where d_i_j is the Kronecker delta, and B_r indicates the bank
> +    * assignment of r.  \sa delta_conflicts() for a vectorized
> +    * implementation of the expression above.
> +    *
> +    * FINISHME: Teach this about the Gen10+ bank conflict rules, which are
> +    *           somewhat more relaxed than on previous generations.  In the
> +    *           meantime optimizing based on Gen9 weights is likely to be
> +    *           more helpful than not optimizing at all.
> +    */
> +   std::vector<weight_vector_type>
> +   shader_conflict_weight_matrix(const fs_visitor *v, const partitioning &p)
> +   {
> +      std::vector<weight_vector_type> conflicts(p.num_atoms(),
> +         weight_vector_type(DIV_ROUND_UP(2 * p.num_atoms(), vector_width)));
> +
> +      /* Crude approximation of the number of times the current basic block
> +       * will be executed at run-time.
> +       */
> +      unsigned block_scale = 1;
> +
> +      foreach_block_and_inst(block, fs_inst, inst, v->cfg) {
> +         if (inst->opcode == BRW_OPCODE_DO) {
> +            block_scale *= 10;
> +
> +         } else if (inst->opcode == BRW_OPCODE_WHILE) {
> +            block_scale /= 10;
> +
> +         } else if (inst->is_3src(v->devinfo) &&
> +                    is_grf(inst->src[1]) && is_grf(inst->src[2])) {
> +            const unsigned r = p.atom_of_reg(reg_of(inst->src[1]));
> +            const unsigned s = p.atom_of_reg(reg_of(inst->src[2]));
> +
> +            /* Estimate of the cycle-count cost of incurring a bank conflict
> +             * for this instruction.  This is only true on the average, for
> +             * a sequence of back-to-back ternary instructions, since the EU
> +             * front-end only seems to be able to issue a new instruction at
> +             * an even cycle.  The cost of a bank conflict incurred by an
> +             * isolated ternary instruction may be higher.
> +             */
> +            const unsigned exec_size = inst->dst.component_size(inst->exec_size);
> +            const unsigned cycle_scale = block_scale *
> +                                         DIV_ROUND_UP(exec_size, REG_SIZE);
> +
> +            /* Neglect same-atom conflicts (since they're either trivial or
> +             * impossible to avoid without splitting the atom), and
> +             * conflicts known to be optimized out by the hardware.
> +             */
> +            if (r != s && !is_conflict_optimized_out(v->devinfo, inst)) {
> +               /* Calculate the parity of the sources relative to the start
> +                * of their respective atoms.  If their parity is the same
> +                * (and none of the atoms straddle the 2KB mark), the
> +                * instruction will incur a conflict iff both atoms are
> +                * assigned the same bank b.  If their parity is opposite,
> +                * the instruction will incur a conflict iff they are
> +                * assigned opposite banks (b and b^1).
> +                */
> +               const bool p_r = 1 & (reg_of(inst->src[1]) - p.reg_of_atom(r));
> +               const bool p_s = 1 & (reg_of(inst->src[2]) - p.reg_of_atom(s));
> +               const unsigned p = p_r ^ p_s;
> +
> +               /* Calculate the updated cost of a hypothetical conflict
> +                * between atoms r and s.  Note that the weight matrix is
> +                * symmetric with respect to indices r and s by construction.
> +                */
> +               const scalar_type w = std::min(unsigned(max_scalar),
> +                                              get(conflicts[r], s, p) +
> +                                              cycle_scale);
> +               set(conflicts[r], s, p, w);
> +               set(conflicts[s], r, p, w);
> +            }
> +         }
> +      }
> +
> +      return conflicts;
> +   }
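To trace the bookkeeping above once by hand (the numbers are mine, not from the patch): take a SIMD16 float MAD outside any loop (block_scale == 1), with src1 = g11 in an atom starting at g10 and src2 = g20 in an atom starting at g20:

   #include <cassert>

   int main()
   {
      const bool p_r = (11 - 10) & 1;   /* src1 at odd atom offset  -> 1 */
      const bool p_s = (20 - 20) & 1;   /* src2 at even atom offset -> 0 */
      assert((p_r ^ p_s) == 1);         /* conflict iff opposite banks   */
      /* SIMD16 float dst is 64 bytes, i.e. DIV_ROUND_UP(64, REG_SIZE)
       * with REG_SIZE == 32, so both conflicts[r](s, 1) and
       * conflicts[s](r, 1) are bumped by 2 cycles. */
      const unsigned cycle_scale = (64 + 31) / 32;
      assert(cycle_scale == 2);
      return 0;
   }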
> +
> +   /**
> +    * Return the set of GRF atoms that could potentially lead to bank
> +    * conflicts if laid out unfavorably in the GRF space according to
> +    * the specified \p conflicts matrix (\sa
> +    * shader_conflict_weight_matrix()).
> +    */
> +   std::vector<bool>
> +   have_any_conflicts(const std::vector<weight_vector_type> &conflicts)
> +   {
> +      std::vector<bool> any_conflicts(conflicts.size());
> +
> +      for (unsigned r = 0; r < conflicts.size(); r++) {
> +         for (unsigned s = 0; s < conflicts[r].size(); s++)
> +            any_conflicts[r] = any_conflicts[r] || sums(conflicts[r][s]);
> +      }
> +
> +      return any_conflicts;
> +   }
> +
> +   /**
> +    * Calculate the difference between two S(B) cost estimates as defined
> +    * above (\sa shader_conflict_weight_matrix()).  This represents the
> +    * (partial) cycle-count benefit from moving an atom r from bank p to n.
> +    * The respective bank assignments Bp and Bn are encoded as the \p
> +    * bank_mask_p and \p bank_mask_n bitmasks for efficient computation,
> +    * according to the formula:
> +    *
> +    *    bank_mask(B)_s_p = -d_(p^B_r)_(B_s)
> +    *
> +    * Notice the similarity with the delta function in the S(B) expression
> +    * above, and how bank_mask(B) can be precomputed for every possible
> +    * selection of r since bank_mask(B) only depends on it via B_r that may
> +    * only assume one of four different values, so the caller can keep every
> +    * possible bank_mask(B) vector in memory without much hassle (\sa
> +    * bank_characteristics()).
> +    */
> +   int
> +   delta_conflicts(const weight_vector_type &bank_mask_p,
> +                   const weight_vector_type &bank_mask_n,
> +                   const weight_vector_type &conflicts)
> +   {
> +      vector_type s_p = {}, s_n = {};
> +
> +      for (unsigned r = 0; r < conflicts.size(); r++) {
> +         s_p = adds(s_p, mask(bank_mask_p[r], conflicts[r]));
> +         s_n = adds(s_n, mask(bank_mask_n[r], conflicts[r]));
> +      }
> +
> +      return sums(subs(s_p, s_n));
> +   }
> +
> +   /**
> +    * Return an identity permutation of GRF atoms, represented as the start
> +    * GRF offset each atom is mapped into.
> +    */
> +   std::vector<unsigned>
> +   identity_reg_permutation(const partitioning &p)
> +   {
> +      std::vector<unsigned> map(p.num_atoms());
> +
> +      for (unsigned r = 0; r < map.size(); r++)
> +         map[r] = p.reg_of_atom(r);
> +
> +      return map;
> +   }
> +
> +   /**
> +    * Return the bank index of GRF address \p reg, numbered according to
> +    * the table:
> +    *          Even    Odd
> +    *    Lo    0       1
> +    *    Hi    2       3
> +    */
> +   unsigned
> +   bank_of(unsigned reg)
> +   {
> +      return (reg & 0x40) >> 5 | (reg & 1);
> +   }
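The mask encoding is the densest part of the pass, so for the record here is the scalar computation delta_conflicts() performs, spelled out without the SIMD plumbing. This is my own restatement, not part of the patch:

   #include <array>
   #include <cstdint>
   #include <vector>

   /* Benefit of moving atom r from bank b_p to bank b_n, given its row
    * c of the conflict weight matrix (c[s][q] == C_r_s_q) and the bank
    * assignment of every atom.  Positive means the move removes cycles. */
   static int
   delta_conflicts_scalar(unsigned b_p, unsigned b_n,
                          const std::vector<std::array<int16_t, 2>> &c,
                          const std::vector<unsigned> &bank)
   {
      int s_p = 0, s_n = 0;

      for (unsigned s = 0; s < c.size(); s++) {
         for (unsigned q = 0; q < 2; q++) {
            s_p += (q ^ b_p) == bank[s] ? c[s][q] : 0;   /* S(Bp) terms */
            s_n += (q ^ b_n) == bank[s] ? c[s][q] : 0;   /* S(Bn) terms */
         }
      }

      return s_p - s_n;
   }

The vectorized version gets the same result by precomputing -1/0 masks per bank (bank_characteristics() below) and combining whole weight vectors at once with mask()/adds()/subs()/sums().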
> +
> +   /**
> +    * Return bitmasks suitable for use as bank mask arguments for the
> +    * delta_conflicts() computation.  Note that this is just the (negative)
> +    * characteristic function of each bank, if you regard it as a set
> +    * containing all atoms assigned to it according to the \p map array.
> +    */
> +   std::array<weight_vector_type, 4>
> +   bank_characteristics(const std::vector<unsigned> &map)
> +   {
> +      std::array<weight_vector_type, 4> banks;
> +
> +      for (unsigned b = 0; b < banks.size(); b++) {
> +         banks[b].resize(DIV_ROUND_UP(2 * map.size(), vector_width));
> +
> +         for (unsigned j = 0; j < map.size(); j++) {
> +            for (unsigned p = 0; p < 2; p++)
> +               set(banks[b], j, p,
> +                   (b ^ p) == bank_of(map[j]) ? -1 : 0);
> +         }
> +      }
> +
> +      return banks;
> +   }
> +
> +   /**
> +    * Return an improved permutation of GRF atoms based on \p map attempting
> +    * to reduce the total cycle-count cost of bank conflicts greedily.
> +    *
> +    * Note that this doesn't attempt to merge multiple atoms into one, which
> +    * may allow it to do a better job in some cases -- It simply reorders
> +    * existing atoms in the GRF space without affecting their identity.
> +    */
> +   std::vector<unsigned>
> +   optimize_reg_permutation(const partitioning &p,
> +                            const std::vector<bool> &constrained,
> +                            const std::vector<weight_vector_type> &conflicts,
> +                            std::vector<unsigned> map)
> +   {
> +      const std::vector<bool> any_conflicts = have_any_conflicts(conflicts);
> +      std::array<weight_vector_type, 4> banks = bank_characteristics(map);
> +
> +      for (unsigned r = 0; r < map.size(); r++) {
> +         const unsigned bank_r = bank_of(map[r]);
> +
> +         if (!constrained[r]) {
> +            unsigned best_s = r;
> +            int best_benefit = 0;
> +
> +            for (unsigned s = 0; s < map.size(); s++) {
> +               const unsigned bank_s = bank_of(map[s]);
> +
> +               if (bank_r != bank_s && !constrained[s] &&
> +                   p.size_of_atom(r) == p.size_of_atom(s) &&
> +                   (any_conflicts[r] || any_conflicts[s])) {
> +                  const int benefit =
> +                     delta_conflicts(banks[bank_r], banks[bank_s],
> +                                     conflicts[r]) +
> +                     delta_conflicts(banks[bank_s], banks[bank_r],
> +                                     conflicts[s]);
> +
> +                  if (benefit > best_benefit) {
> +                     best_s = s;
> +                     best_benefit = benefit;
> +                  }
> +               }
> +            }
> +
> +            if (best_s != r) {
> +               for (unsigned b = 0; b < banks.size(); b++) {
> +                  for (unsigned p = 0; p < 2; p++)
> +                     swap(banks[b], r, p, best_s, p);
> +               }
> +
> +               std::swap(map[r], map[best_s]);
> +            }
> +         }
> +      }
> +
> +      return map;
> +   }
> +
> +   /**
> +    * Apply the GRF atom permutation given by \p map to register \p r and
> +    * return the result.
> +    */
> +   fs_reg
> +   transform(const partitioning &p, const std::vector<unsigned> &map,
> +             fs_reg r)
> +   {
> +      if (r.file == VGRF) {
> +         const unsigned reg = reg_of(r);
> +         const unsigned s = p.atom_of_reg(reg);
> +         r.nr = map[s] + reg - p.reg_of_atom(s);
> +         r.offset = r.offset % REG_SIZE;
> +      }
> +
> +      return r;
> +   }
> +}
> +
> +bool
> +fs_visitor::opt_bank_conflicts()
> +{
> +   assert(grf_used || !"Must be called after register allocation");
> +
> +   /* No ternary instructions -- No bank conflicts. */
> +   if (devinfo->gen < 6)
> +      return false;
> +
> +   const partitioning p = shader_reg_partitioning(this);
> +   const std::vector<bool> constrained = shader_reg_constraints(this, p);
> +   const std::vector<weight_vector_type> conflicts =
> +      shader_conflict_weight_matrix(this, p);
> +   const std::vector<unsigned> map =
> +      optimize_reg_permutation(p, constrained, conflicts,
> +                               identity_reg_permutation(p));
> +
> +   foreach_block_and_inst(block, fs_inst, inst, cfg) {
> +      inst->dst = transform(p, map, inst->dst);
> +
> +      for (int i = 0; i < inst->sources; i++)
> +         inst->src[i] = transform(p, map, inst->src[i]);
> +   }
> +
> +   return true;
> +}
> --
> 2.10.2
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
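One closing note on a subtlety in transform(): reg_of() folds the whole-GRF part of a VGRF's byte offset into the register number, so after relocation the offset has to be reduced modulo REG_SIZE to avoid counting it twice. A toy version of that arithmetic (mine, with made-up numbers):

   #include <cassert>

   int main()
   {
      const unsigned REG_SIZE = 32;                   /* bytes per GRF */
      const unsigned atom_base = 10, mapped_base = 20;
      const unsigned nr = 10, offset = REG_SIZE + 16; /* resolves to g11, byte 16 */
      const unsigned reg = nr + offset / REG_SIZE;    /* reg_of() -> 11 */
      const unsigned new_nr = mapped_base + reg - atom_base;
      const unsigned new_offset = offset % REG_SIZE;
      assert(new_nr == 21 && new_offset == 16);       /* g21 + 16 bytes */
      return 0;
   }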