Iago Toral <ito...@igalia.com> writes: > Hi Curro, > > On Tue, 2015-10-20 at 14:18 +0300, Francisco Jerez wrote: >> Iago Toral <ito...@igalia.com> writes: >> >> > On Tue, 2015-10-20 at 13:22 +0300, Francisco Jerez wrote: >> >> Iago Toral Quiroga <ito...@igalia.com> writes: >> >> >> >> > This allows us to re-use the results of previous ssbo loads in >> >> > situations >> >> > that are safe (i.e. when there are no stores, atomic operations or >> >> > memory barriers in between). >> >> > >> >> > This is particularly useful for things like matrix multiplications, >> >> > where >> >> > for a mat4 buffer variable we cut the number of loads from 16 (4 reads >> >> > of >> >> > each column) down to 4 (1 read of each column). >> >> > >> >> > The pass can only cache ssbo loads that involve constant blocks and >> >> > offsets, but could be extended to compare sub-expressions for these >> >> > as well, similar to a CSE pass. >> >> > >> >> > The way the cache works is simple: ssbo loads with constant block/offset >> >> > are included in a cache as they are seen. Stores invalidate cache >> >> > entries. >> >> > Stores with non-constant offset invalidate all cached loads for the >> >> > block >> >> > and stores with non-constant block invalidate all cache entries. There >> >> > is >> >> > room to improve this by using the actual variable name we are accessing >> >> > to >> >> > limit the entries that should be invalidated. We also need to invalidate >> >> > cache entries when we exit the block in which they have been defined >> >> > (i.e. inside if/else blocks or loops). >> >> > >> >> > The cache optimization is built as a separate pass, instead of merging >> >> > it >> >> > inside the lower_ubo_reference pass for a number of reasons: >> >> > >> >> > 1) The way we process assignments in visitors is that the LHS is >> >> > processed before the RHS. This creates a problem for an optimization >> >> > such as this when we do things like a = a + 1, since we would see the >> >> > store before the read when the actual execution order is reversed. >> >> > This could be fixed by re-implementing the logic in the visit_enter >> >> > method for ir_assignment in lower_ubo_reference and then returning >> >> > visit_continue_with_parent. >> >> > >> >> > 2) Some writes/reads need to be split into multiple smaller >> >> > writes/reads, and we need to handle caching for each one. This happens >> >> > deep inside the code that handles the lowering and some >> >> > of the information we need to do this is not available. This could also >> >> > be fixed by passing more data into the corresponding functions or by >> >> > making this data available as class members, but the current >> >> > implementation >> >> > is already complex enough and this would only contribute to the >> >> > complexity. >> >> > >> >> > 3) We can have ssbo loads in the LHS too (i.e. a[a[0]] = ..). In these >> >> > cases >> >> > the current code in lower_ubo_reference would see the store before the >> >> > read. >> >> > Probably fixable, but again would add more complexity to the lowering. >> >> > >> >> > On the other hand, a separate pass that runs after the lowering sees >> >> > all the individual loads and stores in the correct order (so we don't >> >> > need >> >> > to do any tricks) and it allows us to separate the lowering logic >> >> > (which >> >> > is already complex) from the caching logic. 
It also gives us a chance to >> >> > run it after other optimization passes have run and turned constant >> >> > expressions for block/offset into constants, enabling more opportunities >> >> > for caching. >> >> >> >> Seems like a restricted form of CSE that only handles SSBO loads, and >> >> only the ones with constant arguments. Why don't we CSE these? (and >> >> other memory access operations like image loads) >> > >> > There is not a CSE pass in GLSL IR any more so we would have to do it in >> > NIR and some drivers would lose the optimization. Doing it at GLSL IR >> > level seemed like a win from this perspective. >> > >> > Then there is the fact that we cannot just CSE these. We need to make >> > sure that we only CSE them when it is safe to do so (i.e. no >> > stores/atomics to the same offsets in between, no memory barriers, etc). >> > The current CSE pass in NIR does not support this as far as I can see. I >> > suppose that we could look into changing the pass to accommodate >> > restrictions such as this if we think it is worth it. >> > >> Not really sure if the NIR CSE pass would be adequate, but at least at >> the i965 back-end level this could be handled easily in the CSE pass >> (for typed and untyped surface read opcodes in general) in roughly the >> same way that source variable interference is handled -- Just kill >> potentially overlapping entries from the AEB whenever an atomic or write >> instruction for the same surface is seen. > > I've been having a quick look at this option but I don't think this is > going to work for us. The first thing I noticed is that the CSE pass in > i965 was not currently acting on ssbo loads at all: that is, it would > add them to the AEB but immediately remove them because of this: > > /* Kill any AEB entries using registers that don't get reused any > * more -- a sure sign they'll fail operands_match(). > */ > > The problem here is that operands_match checks that all operands in the > untyped reads match, but of course it won't use the values to check for > this, it'll check that they are actually the same registers. This is not > what we want to do; we want to identify identical ssbo loads by > inspecting the block index and offset (i.e. the values inside the > registers, which we can't even check because they are GRF registers at > this point). Because of this, the chances of the CSE pass identifying > matching loads are really slim, which explains why we never saw this > producing problems even when we never disabled it for SSBO loads or > images, at least until recently (Jordan found one example where CSE > kicked in and created a problem). > That sounds fine as long as the surface index and offset are eventually assigned to the same register if they are the same value -- If either of them is a copy they should be copy or constant-propagated, otherwise copy propagation has a bug (Reminds me of some earlier discussion with Kristian. Sure it's not useful?). If the payload spans several registers (e.g. header and several coordinate components) the LOAD_PAYLOAD instruction that initializes it should be CSE'ed recursively if its sources can also be determined to be the same, otherwise CSE has a bug.
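For illustration, here is a minimal, self-contained sketch of the invalidation rule suggested above for a back-end CSE pass. The types and names below (inst, aeb_entry, the opcode values) are made up for this example and are not the actual i965 data structures; the point is only that AEB entries generated by surface reads get killed whenever a write or atomic to a possibly-aliasing surface is seen, much the same way source variable interference is handled:

#include <list>

enum opcode { SURFACE_READ, SURFACE_WRITE, SURFACE_ATOMIC, OTHER };

struct inst {
   opcode op;
   int surface;            /* surface index; -1 when it is not a known constant */
};

struct aeb_entry {
   const inst *generator;  /* earlier instruction whose result would be reused */
};

static bool
surfaces_may_alias(int a, int b)
{
   /* A non-constant surface index conservatively aliases everything. */
   return a < 0 || b < 0 || a == b;
}

static void
kill_aliasing_surface_entries(std::list<aeb_entry> &aeb, const inst &ir)
{
   if (ir.op != SURFACE_WRITE && ir.op != SURFACE_ATOMIC)
      return;

   for (auto it = aeb.begin(); it != aeb.end();) {
      if (it->generator->op == SURFACE_READ &&
          surfaces_may_alias(it->generator->surface, ir.surface))
         it = aeb.erase(it);   /* never reuse a load across the write */
      else
         ++it;
   }
}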
> Iago > >> > Iago >> > >> >> >> >> > --- >> >> > src/glsl/Makefile.sources | 1 + >> >> > src/glsl/ir_optimization.h | 1 + >> >> > src/glsl/opt_ssbo_load.cpp | 338 >> >> > +++++++++++++++++++++++++++++++++++++++++++++ >> >> > 3 files changed, 340 insertions(+) >> >> > create mode 100644 src/glsl/opt_ssbo_load.cpp >> >> > >> >> > diff --git a/src/glsl/Makefile.sources b/src/glsl/Makefile.sources >> >> > index ca87036..73c7514 100644 >> >> > --- a/src/glsl/Makefile.sources >> >> > +++ b/src/glsl/Makefile.sources >> >> > @@ -201,6 +201,7 @@ LIBGLSL_FILES = \ >> >> > opt_noop_swizzle.cpp \ >> >> > opt_rebalance_tree.cpp \ >> >> > opt_redundant_jumps.cpp \ >> >> > + opt_ssbo_load.cpp \ >> >> > opt_structure_splitting.cpp \ >> >> > opt_swizzle_swizzle.cpp \ >> >> > opt_tree_grafting.cpp \ >> >> > diff --git a/src/glsl/ir_optimization.h b/src/glsl/ir_optimization.h >> >> > index ce5c492..26677d7 100644 >> >> > --- a/src/glsl/ir_optimization.h >> >> > +++ b/src/glsl/ir_optimization.h >> >> > @@ -125,6 +125,7 @@ bool lower_clip_distance(gl_shader *shader); >> >> > void lower_output_reads(unsigned stage, exec_list *instructions); >> >> > bool lower_packing_builtins(exec_list *instructions, int op_mask); >> >> > void lower_ubo_reference(struct gl_shader *shader, exec_list >> >> > *instructions); >> >> > +bool opt_ssbo_loads(struct gl_shader *shader, exec_list *instructions); >> >> > void lower_packed_varyings(void *mem_ctx, >> >> > unsigned locations_used, ir_variable_mode >> >> > mode, >> >> > unsigned gs_input_vertices, gl_shader >> >> > *shader); >> >> > diff --git a/src/glsl/opt_ssbo_load.cpp b/src/glsl/opt_ssbo_load.cpp >> >> > new file mode 100644 >> >> > index 0000000..5404907 >> >> > --- /dev/null >> >> > +++ b/src/glsl/opt_ssbo_load.cpp >> >> > @@ -0,0 +1,338 @@ >> >> > +/* >> >> > + * Copyright © 2015 Intel Corporation >> >> > + * >> >> > + * Permission is hereby granted, free of charge, to any person >> >> > obtaining a >> >> > + * copy of this software and associated documentation files (the >> >> > "Software"), >> >> > + * to deal in the Software without restriction, including without >> >> > limitation >> >> > + * the rights to use, copy, modify, merge, publish, distribute, >> >> > sublicense, >> >> > + * and/or sell copies of the Software, and to permit persons to whom >> >> > the >> >> > + * Software is furnished to do so, subject to the following conditions: >> >> > + * >> >> > + * The above copyright notice and this permission notice (including >> >> > the next >> >> > + * paragraph) shall be included in all copies or substantial portions >> >> > of the >> >> > + * Software. >> >> > + * >> >> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, >> >> > EXPRESS OR >> >> > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF >> >> > MERCHANTABILITY, >> >> > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT >> >> > SHALL >> >> > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES >> >> > OR OTHER >> >> > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, >> >> > ARISING >> >> > + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER >> >> > + * DEALINGS IN THE SOFTWARE. >> >> > + */ >> >> > + >> >> > +/** >> >> > + * \file opt_ssbo_load.cpp >> >> > + * >> >> > + * IR optimization pass to reduce the number of SSBO loads by re-using >> >> > previous >> >> > + * loads when it is safe to do so (i.e. 
no stores have invalidated the >> >> > load, >> >> > + * no memory barriers in between, etc) >> >> > + */ >> >> > + >> >> > +#include "ir.h" >> >> > +#include "ir_builder.h" >> >> > +#include "program/hash_table.h" >> >> > + >> >> > +using namespace ir_builder; >> >> > + >> >> > +namespace { >> >> > +struct ssbo_load_cache_remove_closure { >> >> > + struct hash_table *ht; >> >> > + const char *key_prefix; >> >> > + unsigned depth; >> >> > +}; >> >> > + >> >> > +struct ssbo_load_cache_entry { >> >> > + ir_dereference *load; >> >> > + unsigned depth; >> >> > +}; >> >> > + >> >> > +class ssbo_load_cache_visitor : public ir_hierarchical_visitor { >> >> > +public: >> >> > + ssbo_load_cache_visitor(gl_shader *shader) >> >> > + : current_depth(0), progress(false) >> >> > + { >> >> > + mem_ctx = ralloc_parent(shader->ir); >> >> > + ssbo_load_cache = >> >> > + hash_table_ctor(0, hash_table_string_hash, >> >> > hash_table_string_compare); >> >> > + } >> >> > + >> >> > + ~ssbo_load_cache_visitor() >> >> > + { >> >> > + hash_table_dtor(ssbo_load_cache); >> >> > + } >> >> > + >> >> > + virtual ir_visitor_status visit_enter(ir_call *); >> >> > + >> >> > + char *build_ssbo_load_cache_key(unsigned block_index, >> >> > + unsigned offset); >> >> > + ir_dereference *ssbo_load_cache_find(ir_rvalue *block_index, >> >> > + ir_rvalue *offset); >> >> > + void ssbo_load_cache_add(ir_dereference *deref, >> >> > + ir_rvalue *block_index, >> >> > + ir_rvalue *offset, >> >> > + unsigned depth); >> >> > + void ssbo_load_cache_remove(ir_rvalue *block_index, >> >> > + ir_rvalue *offset); >> >> > + void ssbo_load_cache_remove_by_key_prefix(const char *prefix); >> >> > + void ssbo_load_cache_remove_all(); >> >> > + void ssbo_load_cache_remove_depth(unsigned depth); >> >> > + >> >> > + ir_visitor_status visit_enter(ir_if *ir); >> >> > + ir_visitor_status visit_enter(ir_loop *ir); >> >> > + ir_visitor_status visit_leave(ir_loop *ir); >> >> > + >> >> > + bool get_progress() { return this->progress; } >> >> > + >> >> > +private: >> >> > + void *mem_ctx; >> >> > + unsigned current_depth; >> >> > + struct hash_table *ssbo_load_cache; >> >> > + bool progress; >> >> > +}; >> >> > + >> >> > +ir_visitor_status >> >> > +ssbo_load_cache_visitor::visit_enter(ir_call *ir) >> >> > +{ >> >> > + if (!ir->callee->is_intrinsic) >> >> > + return visit_continue_with_parent; >> >> > + >> >> > + if (!strcmp(ir->callee_name(), "__intrinsic_load_ssbo")) { >> >> > + exec_node *param = ir->actual_parameters.get_head(); >> >> > + ir_rvalue *block = ((ir_instruction *)param)->as_rvalue(); >> >> > + >> >> > + param = param->get_next(); >> >> > + ir_rvalue *offset = ((ir_instruction *)param)->as_rvalue(); >> >> > + >> >> > + ir_dereference *cached_load = ssbo_load_cache_find(block, >> >> > offset); >> >> > + if (cached_load) { >> >> > + ir_variable *var = ir->return_deref->variable_referenced(); >> >> > + base_ir->insert_before(assign(var, >> >> > cached_load->clone(mem_ctx, NULL))); >> >> > + ir->remove(); >> >> > + this->progress = true; >> >> > + } else { >> >> > + ssbo_load_cache_add(ir->return_deref, block, offset, >> >> > + this->current_depth); >> >> > + } >> >> > + } else if (!strcmp(ir->callee_name(), "__intrinsic_store_ssbo")) { >> >> > + exec_node *param = ir->actual_parameters.get_head(); >> >> > + ir_rvalue *block = ((ir_instruction *)param)->as_rvalue(); >> >> > + >> >> > + param = param->get_next(); >> >> > + ir_rvalue *offset = ((ir_instruction *)param)->as_rvalue(); >> >> > + >> >> > + ssbo_load_cache_remove(block, offset); >> >> > + } 
else if (strstr(ir->callee_name(), "__intrinsic_ssbo_atomic") == >> >> > + ir->callee_name()) { >> >> > + exec_node *param = ir->actual_parameters.get_head(); >> >> > + ir_rvalue *block = ((ir_instruction *)param)->as_rvalue(); >> >> > + >> >> > + param = param->get_next(); >> >> > + ir_rvalue *offset = ((ir_instruction *)param)->as_rvalue(); >> >> > + >> >> > + ssbo_load_cache_remove(block, offset); >> >> > + } else if (!strcmp(ir->callee_name(), >> >> > "__intrinsic_memory_barrier")) { >> >> > + ssbo_load_cache_remove_all(); >> >> > + } >> >> > + >> >> > + return visit_continue_with_parent; >> >> > +} >> >> > + >> >> > +ir_visitor_status >> >> > +ssbo_load_cache_visitor::visit_enter(ir_if *ir) >> >> > +{ >> >> > + ir->condition->accept(this); >> >> > + >> >> > + this->current_depth++; >> >> > + >> >> > + if (!ir->then_instructions.is_empty()) { >> >> > + visit_list_elements(this, &ir->then_instructions); >> >> > + ssbo_load_cache_remove_depth(this->current_depth); >> >> > + } >> >> > + >> >> > + if (!ir->else_instructions.is_empty()) { >> >> > + visit_list_elements(this, &ir->else_instructions); >> >> > + ssbo_load_cache_remove_depth(this->current_depth); >> >> > + } >> >> > + >> >> > + this->current_depth--; >> >> > + >> >> > + return visit_continue_with_parent; >> >> > +} >> >> > + >> >> > +ir_visitor_status >> >> > +ssbo_load_cache_visitor::visit_enter(ir_loop *ir) >> >> > +{ >> >> > + this->current_depth++; >> >> > + return visit_continue; >> >> > +} >> >> > + >> >> > +ir_visitor_status >> >> > +ssbo_load_cache_visitor::visit_leave(ir_loop *ir) >> >> > +{ >> >> > + ssbo_load_cache_remove_depth(this->current_depth); >> >> > + this->current_depth--; >> >> > + return visit_continue; >> >> > +} >> >> > + >> >> > +char * >> >> > +ssbo_load_cache_visitor::build_ssbo_load_cache_key(unsigned >> >> > block_index, >> >> > + unsigned offset) >> >> > +{ >> >> > + return ralloc_asprintf(mem_ctx, "%u-%u", block_index, offset); >> >> > +} >> >> > + >> >> > +ir_dereference * >> >> > +ssbo_load_cache_visitor::ssbo_load_cache_find(ir_rvalue *block_index, >> >> > + ir_rvalue *offset) >> >> > +{ >> >> > + ir_constant *const_block_index = block_index->as_constant(); >> >> > + if (!const_block_index) >> >> > + return NULL; >> >> > + >> >> > + ir_constant *const_offset = offset->as_constant(); >> >> > + if (!const_offset) >> >> > + return NULL; >> >> > + >> >> > + char *cache_key = >> >> > + build_ssbo_load_cache_key(const_block_index->value.u[0], >> >> > + const_offset->value.u[0]); >> >> > + >> >> > + struct ssbo_load_cache_entry *entry = (struct ssbo_load_cache_entry >> >> > *) >> >> > + hash_table_find(ssbo_load_cache, cache_key); >> >> > + return entry ? 
entry->load : NULL; >> >> > +} >> >> > + >> >> > +void >> >> > +ssbo_load_cache_visitor::ssbo_load_cache_add(ir_dereference *deref, >> >> > + ir_rvalue *block_index, >> >> > + ir_rvalue *offset, >> >> > + unsigned depth) >> >> > +{ >> >> > + /* We only support caching SSBO loads with constant block and >> >> > offset */ >> >> > + ir_constant *const_block_index = block_index->as_constant(); >> >> > + if (!const_block_index) >> >> > + return; >> >> > + >> >> > + ir_constant *const_offset = offset->as_constant(); >> >> > + if (!const_offset) >> >> > + return; >> >> > + >> >> > + char *cache_key = >> >> > + build_ssbo_load_cache_key(const_block_index->value.u[0], >> >> > + const_offset->value.u[0]); >> >> > + >> >> > + struct ssbo_load_cache_entry *entry = (struct ssbo_load_cache_entry >> >> > *) >> >> > + ralloc_size(this->mem_ctx, sizeof(struct ssbo_load_cache_entry)); >> >> > + entry->load = deref; >> >> > + entry->depth = depth; >> >> > + >> >> > + hash_table_replace(ssbo_load_cache, entry, cache_key); >> >> > +} >> >> > + >> >> > +static void >> >> > +ssbo_load_cache_remove_callback(const void *key, void *data, void >> >> > *closure) >> >> > +{ >> >> > + struct ssbo_load_cache_remove_closure *c = >> >> > + (struct ssbo_load_cache_remove_closure *) closure; >> >> > + >> >> > + /* If we have a key_prefix, then we only want to delete entries >> >> > + * with that key prefix. Otherwise, if we have depth > 0 we want >> >> > + * to delete keys with that depth (or larger). Otheriwse (no prefix >> >> > + * and depth == 0) we want to delete all keys. >> >> > + */ >> >> > + assert((c->key_prefix && c->depth == 0) || >> >> > + (!c->key_prefix && c->depth > 0) || >> >> > + (!c->key_prefix && c->depth == 0)); >> >> > + >> >> > + if (c->key_prefix && strstr((char *)key, (char *)c->key_prefix) != >> >> > key) >> >> > + return; >> >> > + >> >> > + if (c->depth > 0) { >> >> > + struct ssbo_load_cache_entry *entry = >> >> > + (struct ssbo_load_cache_entry *) data; >> >> > + if (entry->depth < c->depth) >> >> > + return; >> >> > + } >> >> > + >> >> > + hash_table_remove(c->ht, key); >> >> > +} >> >> > + >> >> > +void >> >> > +ssbo_load_cache_visitor::ssbo_load_cache_remove_by_key_prefix(const >> >> > char *prefix) >> >> > +{ >> >> > + struct ssbo_load_cache_remove_closure c; >> >> > + c.ht = ssbo_load_cache; >> >> > + c.key_prefix = prefix; >> >> > + c.depth = 0; >> >> > + >> >> > + hash_table_call_foreach(ssbo_load_cache, >> >> > + ssbo_load_cache_remove_callback, &c); >> >> > +} >> >> > + >> >> > +void >> >> > +ssbo_load_cache_visitor::ssbo_load_cache_remove_all() >> >> > +{ >> >> > + struct ssbo_load_cache_remove_closure c; >> >> > + c.ht = ssbo_load_cache; >> >> > + c.key_prefix = NULL; >> >> > + c.depth = 0; >> >> > + >> >> > + hash_table_call_foreach(ssbo_load_cache, >> >> > + ssbo_load_cache_remove_callback, &c); >> >> > +} >> >> > + >> >> > +void >> >> > +ssbo_load_cache_visitor::ssbo_load_cache_remove_depth(unsigned depth) >> >> > +{ >> >> > + struct ssbo_load_cache_remove_closure c; >> >> > + c.ht = ssbo_load_cache; >> >> > + c.key_prefix = NULL; >> >> > + c.depth = depth; >> >> > + >> >> > + hash_table_call_foreach(ssbo_load_cache, >> >> > + ssbo_load_cache_remove_callback, &c); >> >> > +} >> >> > + >> >> > +void >> >> > +ssbo_load_cache_visitor::ssbo_load_cache_remove(ir_rvalue *block_index, >> >> > + ir_rvalue *offset) >> >> > +{ >> >> > + ir_constant *const_block_index = block_index->as_constant(); >> >> > + if (!const_block_index) { >> >> > + /* If we don't know the block index, then 
invalidate the entire >> >> > cache. >> >> > + * We could try to do better, for example, considering the actual >> >> > + * field name we are accessing in the SSBO in the keys so we only >> >> > + * invalidate those. This requires some work though. >> >> > + */ >> >> > + ssbo_load_cache_remove_all(); >> >> > + return; >> >> > + } >> >> > + >> >> > + ir_constant *const_offset = offset->as_constant(); >> >> > + if (!const_offset) { >> >> > + /* We know the block but not the offset, so invalidate all >> >> > entries >> >> > + * for the given block >> >> > + */ >> >> > + ssbo_load_cache_remove_by_key_prefix( >> >> > + ralloc_asprintf(mem_ctx, "%u-", >> >> > const_block_index->value.u[0])); >> >> > + return; >> >> > + } >> >> > + >> >> > + /* We know block and offset, so invalidate that particular load >> >> > only */ >> >> > + char *cache_key = >> >> > + build_ssbo_load_cache_key(const_block_index->value.u[0], >> >> > + const_offset->value.u[0]); >> >> > + >> >> > + hash_table_remove(ssbo_load_cache, cache_key); >> >> > +} >> >> > + >> >> > +} /* Unnamed namespace */ >> >> > + >> >> > +bool >> >> > +opt_ssbo_loads(struct gl_shader *shader, exec_list *instructions) >> >> > +{ >> >> > + ssbo_load_cache_visitor v(shader); >> >> > + visit_list_elements(&v, instructions); >> >> > + return v.get_progress(); >> >> > +} >> >> > -- >> >> > 1.9.1 >> >> > >> >> > _______________________________________________ >> >> > mesa-dev mailing list >> >> > mesa-dev@lists.freedesktop.org >> >> > http://lists.freedesktop.org/mailman/listinfo/mesa-dev
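As a usage note, the patch only adds the opt_ssbo_loads() entry point; it does not show where it gets called. A plausible hook-up (the call site and the surrounding loop are assumptions; only the two prototypes come from the patch and ir_optimization.h) would be to run it after lower_ubo_reference and iterate it together with the other GLSL IR passes, so that constant folding can first turn block/offset expressions into ir_constants and make the loads cacheable:

   /* Hypothetical call site -- not part of the patch. */
   lower_ubo_reference(shader, shader->ir);

   bool progress;
   do {
      progress = opt_ssbo_loads(shader, shader->ir);
      /* ... the rest of the GLSL IR optimization loop
       * (do_common_optimization() etc.) would run here ...
       */
   } while (progress);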