Re: Some questions on hash_set/hash_map traits
On Wed, Sep 8, 2021 at 10:50 PM Erick Ochoa via Gcc wrote:
>
> Hello,
>
> I am refactoring some old code that runs as an IPA_PASS. This code is
> intended to run at link-time using the LTO framework, and so I am
> always testing it that way. At the moment, I am trying to use
> hash-maps but am having some difficulties. I am adding some patches
> that will help illustrate along the way.
>
> The first patch simply adds a link-time optimization pass that will be
> used to demo the problem that I am having. One can run it with -flto
> -fipa-hash. In the first patch the pass does not do any meaningful
> work; it just helps to set the stage.
>
> For the second patch I create a wrapper around a symtab_node. This
> wrapper is intended to add some other functionality to elements that I
> might want to insert in hash-sets or hash-maps.
>
> class symtab_wrapper
> {
> private:
>   symtab_node *_symtab_node;
>
> public:
>   friend symtab_wrapper_traits;
>
>   symtab_wrapper ()
>     : _symtab_node (NULL)
>   {}
>
>   symtab_wrapper (symtab_node *s)
>     : _symtab_node (s)
>   { gcc_assert (s); }
>
>   void print (void)
>   {
>     if (!dump_file)
>       return;
>     gcc_assert (_symtab_node);
>     fprintf (dump_file, "%s\n", _symtab_node->name ());
>   }
> };
>
> I also have a wrapper around a hash-set to add some things that might
> be of interest to me in the future:
>
> template <typename KeyId, typename Traits = default_hash_traits <KeyId>>
> class my_set_t
> {
> private:
>   hash_set <KeyId, false, Traits> _impl;
>
> public:
>   my_set_t () {}
>
>   void insert (const KeyId &k)
>   {
>     _impl.add (k);
>   }
>
>   bool contains (const KeyId &k)
>   {
>     return _impl.contains (k);
>   }
>
>   void print (void)
>   {
>     if (!dump_file)
>       return;
>     for (auto i = _impl.begin (), e = _impl.end (); i != e; ++i)
>       (*i).print ();
>   }
> };
>
> I also created a symtab_wrapper_traits because I want to store the
> object in the hash-map itself, and so I need to specify how to hash
> it.
>
> class symtab_wrapper_traits
> {
> public:
>   typedef symtab_wrapper value_type;
>   typedef symtab_wrapper compare_type;
>   static const bool empty_zero_p = true;
>
>   static inline hashval_t hash (const value_type &v)
>   {
>     return pointer_hash <symtab_node>::hash (v._symtab_node);
>   }
>
>   static bool equal (const value_type &l, const value_type &r)
>   {
>     return l._symtab_node == r._symtab_node;
>   }
>
>   static void mark_deleted (value_type &v)
>   {
>     v._symtab_node = NULL;
>   }
>
>   static void mark_empty (value_type &v)
>   {
>     v._symtab_node = NULL;
>   }
>
>   static void remove (value_type &v)
>   {
>     v._symtab_node = NULL;
>   }
>
>   static bool is_deleted (const value_type &v)
>   {
>     return !v._symtab_node;
>   }
>
>   static bool is_empty (const value_type &v)
>   {
>     return !v._symtab_node;
>   }
> };
>
> I then create a global set with my traits, populate it with some
> information, and later print it. It seems to be working:
>
> my_set_t <symtab_wrapper, symtab_wrapper_traits> my_set;
>
> static void
> ipa_hash_generate_summary (void)
> {
>   cgraph_node *cnode = NULL;
>   FOR_EACH_FUNCTION (cnode)
>   {
>     symtab_wrapper w (cnode);
>     my_set.insert (w);
>   }
>
>   my_set.print ();
>
>   return;
> }
>
> Here I already have some questions, but this is not the biggest issue
> that I am having: I believe that the hash_set implementation manages
> its own memory, but I am unclear about the semantics of "remove".
>
> * Is "remove" supposed to, for example, call the destructor? Since,
> in this particular case, it is only a wrapper and cgraph_node is not
> intended to be destroyed, I just assign the pointer to NULL. Not sure
> if that's the intention.
> * I don't understand the semantics of empty_zero_p; I think it is used
> to zero out entries if I mark something as empty?
> * I am assuming that its memory will be re-used; is this the case?
> * I see that there is a typed_delete_remove function that is sometimes
> used in traits, but I think that it doesn't apply to this case because
> I am storing the whole object. Is that the case?
>
> I later did some refactoring (patch 3), which did not seem to impact
> the code now, but will impact hash_map later. The refactoring is as
> follows: I create an abstract "printable" class
>
> class printable
> {
> public:
>   virtual char *str () = 0;
>   void print (void);
> };
>
> void
> printable::print (void)
> {
>   if (!dump_file)
>     return;
>   char *_str = str ();
>   fprintf (dump_file, "%s\n", _str);
>   free (_str);
> }
>
> make symtab_wrapper a publicly derived class
>
> -class symtab_wrapper
> +class symtab_wrapper : public printable
>
> and implement the str function in symtab_wrapper:
>
>   virtual char *str (void)
>   {
>     if (!dump_file)
>       return NULL;
>     gcc_assert (_symtab_node);
>     char *retval = NULL;
>     asprintf (&retval, "%s", _symtab_node->name ());
>     return retval;
>   }
>
> Again, this seems to be working correctly.
>
> What is more interesting is moving from a hash-set to a hash-map
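(A side note on the traits above: a GCC hash table must be able to tell
"empty" slots apart from "deleted" ones, so mapping both states to a NULL
pointer, as the class above does, is likely to confuse lookups once
entries are ever removed. A minimal fixed sketch, not from the original
mail, could keep the all-zero state for "empty", matching empty_zero_p,
and use a distinct sentinel for "deleted":)

  /* Sketch (not from the original mail): keep "empty" as the all-zero
     state and use a distinct sentinel for "deleted" so the two slot
     states remain distinguishable.  */
  class symtab_wrapper_traits
  {
  public:
    typedef symtab_wrapper value_type;
    typedef symtab_wrapper compare_type;
    static const bool empty_zero_p = true;

    static inline hashval_t hash (const value_type &v)
    { return pointer_hash <symtab_node>::hash (v._symtab_node); }

    static bool equal (const value_type &l, const value_type &r)
    { return l._symtab_node == r._symtab_node; }

    static void mark_empty (value_type &v)
    { v._symtab_node = NULL; }

    static bool is_empty (const value_type &v)
    { return v._symtab_node == NULL; }

    /* Deleted slots get a sentinel that can never be a real node.  */
    static void mark_deleted (value_type &v)
    { v._symtab_node = reinterpret_cast <symtab_node *> (1); }

    static bool is_deleted (const value_type &v)
    { return v._symtab_node == reinterpret_cast <symtab_node *> (1); }

    /* The wrapper owns nothing, so there is nothing to free here.  */
    static void remove (value_type &) {}
  };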
Re: More aggressive threading causing loop-interchange-9.c regression
On 9/9/21 8:57 AM, Richard Biener wrote:
> On Wed, Sep 8, 2021 at 8:13 PM Michael Matz wrote:
>>
>> Hello,
>>
>> [lame answer to self]
>>
>> On Wed, 8 Sep 2021, Michael Matz wrote:
>>
>>>>> The forward threader guards against this by simply disallowing
>>>>> threadings that involve different loops. As I see
>>>>
>>>> The thread in question (5->9->3) is all within the same outer loop,
>>>> though. BTW, the backward threader also disallows threading across
>>>> different loops (see path_crosses_loops variable).
>> ...
>>> Maybe it's possible to not disable threading over latches altogether
>>> in the backward threader (like it's tried now), but I haven't looked
>>> at the specific situation here in depth, so take my view only as
>>> opinion from a large distance :-)
>>
>> I've now looked at the concrete situation. So yeah, the whole path is
>> in the same loop, crosses the latch, _and there's code following the
>> latch on that path_. (I.e. the latch isn't the last block in the
>> path.) In particular, after loop_optimizer_init() (before any
>> threading) we have:
>>
>>   <bb 3> [local count: 118111600]:
>>   # j_19 = PHI <...>
>>   sum_11 = c[j_19];
>>   if (n_10(D) > 0)
>>     goto <bb 8>; [89.00%]
>>   else
>>     goto <bb 5>; [11.00%]
>>
>>   <bb 8> [local count: 105119324]:
>>   ...
>>
>>   <bb 5> [local count: 118111600]:
>>   # sum_21 = PHI <...>
>>   c[j_19] = sum_21;
>>   j_13 = j_19 + 1;
>>   if (n_10(D) > j_13)
>>     goto <bb 9>; [89.00%]
>>   else
>>     goto <...>; [11.00%]
>>
>>   <bb 9> [local count: 105119324]:
>>   goto <bb 3>; [100.00%]
>>
>> With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the
>> pre-header of the inner loop, but more importantly something that's
>> not at the start of the outer loop.
>>
>> Now, any thread that includes the backedge 9->3 _including_ its
>> destination (i.e. where the backedge isn't the last to-be-redirected
>> edge) necessarily duplicates all code from that destination onto the
>> back edge. Here it's the load from c[j] into sum_11.
>>
>> The important part is that the code is emitted onto the back edge,
>> conceptually; in reality it's simply included into the (new) latch
>> block (the duplicate of bb9, which is bb12 intermediately, then named
>> bb7 after cfg_cleanup).
>>
>> That's what we can't have for some of our structural loop optimizers:
>> there must be no code executed after the exit test (e.g. in the latch
>> block). (This requirement makes reasoning about which code is or
>> isn't executed completely for an iteration trivial; simply everything
>> in the body is always executed. E.g. loop interchange uses this to
>> check that there are no memory references after the exit test,
>> because those would then be only conditional and hence make loop
>> interchange very awkward.)
>>
>> Note that this situation can't be rectified later: the duplicated
>> instructions (because they are memory refs) must remain after the
>> exit test. Only by rerolling/unrotating the loop (i.e. noticing that
>> the memory refs on the loop-entry path and on the back edge are
>> equivalent) would that be possible, but that's something we aren't
>> capable of. Even if we were, that would simply revert the whole work
>> that the threader did, so it's better not to even do that to start
>> with.
>>
>> I believe something like below would be appropriate; it disables
>> threading if the path contains a latch at a non-last position (due to
>> being backwards, at a non-first position in the array). I.e. it
>> disables rotating the loop if there's danger of polluting the back
>> edge. It might be improved if the blocks following (preceding!) the
>> latch are themselves empty, because then no code is duplicated. It
>> might also be improved if the latch is already non-empty. That code
>> should probably only be active before the loop optimizers, but
>> currently the backward threader isn't differentiating between
>> before/after loop-optims.
>>
>> I haven't tested this patch at all, except that it fixes the testcase :)
>
> Lame comment at the current end of the thread - it's not threading
> through the

I don't know why y'all keep using the word "lame". On the contrary,
these are incredibly useful explanations. Thanks.

> latch but threading through the loop header that's problematic, at
> least if the end of the threading path ends within the loop (threading
> through the header to the loop exit is fine). Because in that
> situation you effectively created an alternate loop entry. Threading
> through the latch into the loop header is fine, but with simple
> latches that likely will never happen (if there are no simple latches,
> then the latch can contain the loop exit test).
>
> See tree-ssa-threadupdate.c:thread_block_1
>
>   e2 = path->last ()->e;
>   if (!e2 || noloop_only)
>     {
>       /* If NOLOOP_ONLY is true, we only allow threading through the
>          header of a loop to exit edges.  */
>
>       /* One case occurs when there was loop header buried in a jump
>          threading path that crosses loop boundaries.  We do not try
>          and thread this elsewhere, so just cancel the jump threading
>          request by clearing the AUX field now.  */
>       if (bb->loop_father != e2->src->loop_father
>           && (!loop_exit_edge_p (e2->src->loop_father, e2)
>               || flow_loop_nested_p (bb->loop_father,
>                                      e2->dest->loop_father)))
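(For orientation: the CFG above comes from a doubly-nested reduction loop
of roughly the following shape. This is a hypothetical reconstruction
from the dump; the array names and the inner-loop body are assumed, not
taken verbatim from loop-interchange-9.c.)

  extern int a[128][128];
  extern int c[128];

  void
  f (int n, int m)
  {
    for (int j = 0; j < m; j++)     /* bb3: outer header, loads c[j] */
      {
        int sum = c[j];             /* the load the thread would copy */
        for (int i = 0; i < n; i++) /* bb8...: inner loop, guarded by n > 0 */
          sum += a[i][j];
        c[j] = sum;                 /* bb5: store, j++, outer exit test */
      }                             /* bb9: empty latch, backedge to bb3 */
  }

Threading 5->9->3 copies the sum = c[j] load onto the back edge, which is
precisely a memory reference after the exit test, the thing loop
interchange cannot tolerate.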
Re: More aggressive threading causing loop-interchange-9.c regression
On 9/8/21 8:13 PM, Michael Matz wrote:
> Hello,
>
> [lame answer to self]
>
> [...]
>
> I believe something like below would be appropriate; it disables
> threading if the path contains a latch at a non-last position (due to
> being backwards, at a non-first position in the array). I.e. it
> disables rotating the loop if there's danger of polluting the back
> edge. It might be improved if the blocks following (preceding!) the
> latch are themselves empty, because then no code is duplicated. It
> might also be improved if the latch is already non-empty. That code
> should probably only be active before the loop optimizers, but
> currently the backward threader isn't differentiating between
> before/after loop-optims.
>
> I haven't tested this patch at all, except that it fixes the testcase :)

Thanks for looking at this.

I think you're onto something with this approach. Perhaps in addition
to the loop header threading Richard mentions.

Your patch causes some regressions, but I think most are noise from FSM
tests that must be adjusted. However, there are some other ones that
are curious:

> FAIL: gcc.dg/tree-ssa/ldist-22.c scan-tree-dump ldist "generated memset zero"
> FAIL: gcc.dg/tree-ssa/pr66752-3.c scan-tree-dump-not dce2 "if .flag"
< XFAIL: gcc.dg/shrink-wrap-loop.c scan-rtl-dump pro_and_epilogue "Performing shrink-wrapping"
< XFAIL: gcc.dg/Warray-bounds-87.c pr101671 (test for bogus messages, line 36)
> FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times graphite "1 loops carried no dependency" 1
> FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times optimized "loopfn.1" 4
> FAIL: libgomp.graphite/force-parallel-8.c scan-tree-dump-times graphite "5 loops carried no dependency" 1

Interestingly, your patch fixes shrink-wrap-loop.c and Warray-bounds-87,
both of whose failures were introduced by the backward threader rewrite.
At least the Warray-bounds one was the threader peeling off an iteration
that caused a bogus warning.

The ldist-22 regression is interesting though:

void foo ()
{
  int i;

  <bb 2> :
  goto <bb 6>; [INV]

  <bb 3> :
  a[i_1] = 0;
  if (i_1 > 100)
Re: More aggressive threading causing loop-interchange-9.c regression
On 9/9/21 12:15 PM, Richard Biener wrote:
> On Thu, Sep 9, 2021 at 11:21 AM Aldy Hernandez wrote:
>> On 9/9/21 10:58 AM, Richard Biener wrote:
>> [...]
>>
>> I ran some experiments a while back, and my current work on the
>> enhanced solver/threader can fold virtually everything the
>> DOM/threader gets (even with its use of const_and_copies, avail_exprs,
>> and evrp_range_analyzer), while getting 5% more DOM threads and 1%
>> more overall threads. That is, I've been testing whether the path
>> solver can solve everything the DOM threader needs (the hybrid
>> approach I mentioned).
>>
>> Unfortunately, replacing the forward threader right now is not
>> feasible for a few reasons:
>>
>> a) The const_and_copies/avail_exprs relation framework can do floats,
>> and that's next year's ranger work.
>
> Hmm, it sounds like relying fully on ranger is a bit premature then.

For floats, most definitely. Ranger doesn't do them. They're for the
next release. However, const_and_copies/avail_exprs for integers we can
do just fine; that's why evrp/VRP are much easier targets.

>> b) Even though we can seemingly fold everything DOM/threader does, in
>> order to replace it with a backward threader instance we'd have to
>> merge the cost/profitability code scattered throughout the forward
>> threader, as well as the EDGE_FSM* / EDGE_COPY* business.
>>
>> c) DOM changes the IL as it goes. Though we could conceivably divorce
>> the two and do the threading after DOM is done.
>
> Yeah, it does not actually process/simplify the blocks copied by
> threading. In fact I think it only registers jump threading
> opportunities during the DOM walk and commits them only later. But it
> of course uses its avail / copies stack to find them - that part you
> cannot easily divorce.

Yes, all the threaders register paths ahead of time and only do the
actual CFG shuffling at the end with thread_through_all_blocks().

> DOM is also yet another value-numbering framework - one could think of
> stripping it down from that side, keeping the threading bits only (but
> you'd still have the avail / copies bits).

Ughh, yeah. I've seen some random missed cases because of the
value-numbering it does :-/. It's very frustrating that we have various
ad-hoc VN methods.

> That said, it has one nice property it can leverage due to its
> incredibly simple memory redundancy handling, in that it usually
> performs far fewer alias queries than FRE (even when you run the
> latter in non-iterative mode). But the same way DOM can register jump
> threading opportunities, FRE could as well.

My plan for DOM/threading for this release is to replace its evrp
analyzer with a path solver (path_range_query). We can still use the
avail/copies, and everything else should just work. I'm hand-waving on
a few missed cases due to the IL changing, but last time I checked we
got way more in return. Andrew even thought we should get most things
even in the presence of changing IL, but I haven't distilled testcases
for him.

For VRP/threading, the same hybrid thing, but replacing the ASSERT_EXPRs
and the avail/copies with a path solver. There are no floats here.
Things are looking good in my tests.

Once we do floats, if we could merge the EDGE_*_THREAD and the
cost/profit bits between both threaders, we should be able to replace
the entire forward threader with a backward client. Hopefully I don't
run out of steam by then ;-).

Thanks for your insight on DOM. It's very helpful.

Aldy
Re: More aggressive threading causing loop-interchange-9.c regression
Hello,

On Thu, 9 Sep 2021, Richard Biener wrote:

>> I believe something like below would be appropriate; it disables
>> threading if the path contains a latch at a non-last position (due
>> to being backwards, at a non-first position in the array). I.e. it
>> disables rotating the loop if there's danger of polluting the back
>> edge. It might be improved if the blocks following (preceding!) the
>> latch are themselves empty, because then no code is duplicated. It
>> might also be improved if the latch is already non-empty. That code
>> should probably only be active before the loop optimizers, but
>> currently the backward threader isn't differentiating between
>> before/after loop-optims.
>>
>> I haven't tested this patch at all, except that it fixes the testcase
>> :)
>
> Lame comment at the current end of the thread - it's not threading
> through the latch but threading through the loop header that's
> problematic,

I beg to differ, but that's probably because of the ambiguity of the
word "through" (does it or does it not include the ends of the path :)).

If you thread through the loop header from the entry block (i.e.
duplicate code from the header to the entry), all would be fine (or not,
in case you created multiple entries from outside).

If you thread through the latch, then through an empty header, and then
through another non-empty basic block within the loop, you still
wouldn't be fine: you've just created code on the back edge (and hence
into the new latch block).

If you thread through the latch and through an empty header (and stop
there), you also are fine. Also note that in this situation you do
_not_ create new entries into the loop, not even intermediately. The
old back edge is the one that goes away due to the threading, the old
entry edge is moved completely out of the loop, the edge
header->thread-dest becomes the new entry edge, and the edge
new-latch->thread-dest becomes the back edge. No additional entries.

So, no, it's not the threading through the loop header that is
problematic; it's creating a situation that fills the (new) latch with
code, and that can only happen if the candidate thread contains the
latch block. (Of course it's somewhat related to the header block as
well, because that is simply the only destination the latch has, and
hence is usually included in any thread that also includes the latch;
but it's not the header that indicates the situation.)

> See tree-ssa-threadupdate.c:thread_block_1
>
>   e2 = path->last ()->e;
>   if (!e2 || noloop_only)
>     {
>       /* If NOLOOP_ONLY is true, we only allow threading through the
>          header of a loop to exit edges.  */
>
>       /* One case occurs when there was loop header buried in a jump
>          threading path that crosses loop boundaries.  We do not try
>          and thread this elsewhere, so just cancel the jump threading
>          request by clearing the AUX field now.  */
>       if (bb->loop_father != e2->src->loop_father
>           && (!loop_exit_edge_p (e2->src->loop_father, e2)
>               || flow_loop_nested_p (bb->loop_father,
>                                      e2->dest->loop_father)))
>         {
>           /* Since this case is not handled by our special code
>              to thread through a loop header, we must explicitly
>              cancel the threading request here.  */
>           delete_jump_thread_path (path);
>           e->aux = NULL;
>           continue;
>         }

Yeah, sure, but I'm not sure if the above check is _really_ testing what
it wants to test, or if the effects it achieves are side effects.

Like in my proposed patch: I could also test for the existence of the
loop header in the thread and reject that; it would work as well, except
that it works because any useful thread including a latch (which is the
problematic one) also includes the header. I'm not sure if the above
check is along the same lines, or tests for still another situation.

Ciao,
Michael.
Re: More aggressive threading causing loop-interchange-9.c regression
Hello,

On Thu, 9 Sep 2021, Aldy Hernandez wrote:

> The ldist-22 regression is interesting though:
>
> void foo ()
> {
>   int i;
>
>   <bb 2> :
>   goto <bb 6>; [INV]
>
>   <bb 3> :
>   a[i_1] = 0;
>   if (i_1 > 100)
>     goto <bb 5>; [INV]
>   else
>     goto <bb 4>; [INV]
>
>   <bb 4> :
>   b[i_1] = i_1;
>
>   <bb 5> :
>   i_8 = i_1 + 1;
>
>   <bb 6> :
>   # i_1 = PHI <0(2), i_8(5)>
>   if (i_1 <= 1023)
>     goto <bb 3>; [INV]
>   else
>     goto <...>; [INV]

Here there's no simple latch block to start with (the backedge comes
directly out of the loop exit block). So my suggested improvement
(testing if the latch was empty, and only then rejecting the thread)
would solve this.

> Would it be crazy to suggest that we disable threading through latches
> altogether,

I think it wouldn't be crazy, but we can do a bit better, as suggested
above (only reject empty latches, and reject them only for the threaders
running before the loop optimizers).

Ciao,
Michael.
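(The dump corresponds to a loop of roughly this shape. This is a sketch
inferred from the GIMPLE above, not the verbatim ldist-22 source; loop
distribution wants to split the a[i] = 0 stores off into a memset, hence
the scan for "generated memset zero".)

  extern int a[1024];
  extern int b[1024];

  void
  foo (void)
  {
    for (int i = 0; i <= 1023; i++) /* exit test in bb6; i++ in bb5, the latch */
      {
        a[i] = 0;                   /* bb3: ldist should turn this into memset */
        if (i <= 100)
          b[i] = i;                 /* bb4: only for the first iterations */
      }
  }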
Re: More aggressive threading causing loop-interchange-9.c regression
On 9/9/21 2:52 PM, Michael Matz wrote:
> Hello,
>
> On Thu, 9 Sep 2021, Aldy Hernandez wrote:
>
>> The ldist-22 regression is interesting though:
>> [...]
>
> Here there's no simple latch block to start with (the backedge comes
> directly out of the loop exit block). So my suggested improvement
> (testing if the latch was empty, and only then rejecting the thread)
> would solve this.

Well, there's the thing. Loop discovery is marking BB5 as the latch, so
it's not getting threaded:

  Checking profitability of path (backwards):  bb:6 (2 insns) bb:5 (latch)

>> Would it be crazy to suggest that we disable threading through latches
>> altogether,
>
> I think it wouldn't be crazy, but we can do a bit better, as suggested
> above (only reject empty latches, and reject them only for the
> threaders running before the loop optimizers).

BTW, I'm not sure your check for the non-last position makes a
difference:

> diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c
> index 449232c7715..528a753b886 100644
> --- a/gcc/tree-ssa-threadbackward.c
> +++ b/gcc/tree-ssa-threadbackward.c
> @@ -600,6 +600,7 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
>    loop_p loop = m_path[0]->loop_father;
>    bool path_crosses_loops = false;
>    bool threaded_through_latch = false;
> +  bool latch_within_path = false;
>    bool multiway_branch_in_path = false;
>    bool threaded_multiway_branch = false;
>    bool contains_hot_bb = false;
> @@ -725,7 +726,13 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
>           the last entry in the array when determining if we thread
>           through the loop latch.  */
>        if (loop->latch == bb)
> -        threaded_through_latch = true;
> +        {
> +          threaded_through_latch = true;
> +          if (j != 0)
> +            latch_within_path = true;
> +          if (dump_file && (dump_flags & TDF_DETAILS))
> +            fprintf (dump_file, " (latch)");
> +        }
>      }

If the last position being considered is a simple latch, it only has a
simple outgoing jump. There's no need to thread that. You need a block
with >= 2 succ edges to thread anything.

Aldy
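(The hunk above only sets the flag; presumably it would be consumed by a
gate on the profitability verdict along these lines. This is a
hypothetical follow-on, not part of the posted diff:)

  /* Hypothetical use of the flag above: give up when the latch sits in
     the middle of the path, i.e. when code after the exit test would
     be copied onto the back edge.  */
  if (latch_within_path)
    {
      if (dump_file && (dump_flags & TDF_DETAILS))
        fprintf (dump_file, "  FAIL: latch is not the final block in path\n");
      return false;
    }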
Re: More aggressive threading causing loop-interchange-9.c regression
Hello,

On Thu, 9 Sep 2021, Aldy Hernandez wrote:

>> Here there's no simple latch block to start with (the backedge comes
>> directly out of the loop exit block). So my suggested improvement
>> (testing if the latch was empty, and only then rejecting the thread)
>> would solve this.
>
> Well, there's the thing. Loop discovery is marking BB5 as the latch,
> so it's not getting threaded:

Yes, it's a latch, but not an empty one. So the thread would make it
just even more non-empty, which might not be a problem anymore. So
amending my patch somewhere with a strategic

  && empty_block_p (latch)

and only then rejecting the thread should make this testcase work again.

(There's still a catch, though: if this non-empty latch, which is also
the exit test block, is threaded through and is followed by actual code,
then that code will be inserted onto the back edge, not into the latch
block before the exit test, and so also create a (new) non-empty latch.
That might or might not create problems downstream, but not as many as
transforming an empty latch into a non-empty one would; it could, for
instance, create a second back edge (not in this testcase) and suchlike.)

> BTW, I'm not sure your check for the non-last position makes a
> difference:
>
> [...]
>
> If the last position being considered is a simple latch, it only has a
> simple outgoing jump. There's no need to thread that. You need a block
> with >= 2 succ edges to thread anything.

So, you are saying that any candidate thread path wouldn't have the
latch in the last position if it were just an empty forwarder? I simply
wasn't sure about that, so I was conservative and only wanted to reject
things I knew were positively bad (namely code in the path following the
latch that is in danger of being moved into the latch).

Ciao,
Michael.
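(Folding Michael's "&& empty_block_p (latch)" amendment into such a gate
would look roughly like this. Again a sketch building on the
hypothetical flag from the earlier diff, not a posted patch;
empty_block_p and loop->latch are existing GCC interfaces.)

  /* Sketch: only reject the path when it would copy code onto the back
     edge of a currently empty latch; threading that merely grows an
     already non-empty latch is tolerated.  */
  if (latch_within_path
      && empty_block_p (loop->latch))
    {
      if (dump_file && (dump_flags & TDF_DETAILS))
        fprintf (dump_file,
                 "  FAIL: path would make the empty latch non-empty\n");
      return false;
    }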
Re: More aggressive threading causing loop-interchange-9.c regression
On 9/9/21 4:44 PM, Michael Matz wrote:
> Hello,
>
> On Thu, 9 Sep 2021, Aldy Hernandez wrote:
>
>>> Here there's no simple latch block to start with (the backedge comes
>>> directly out of the loop exit block). So my suggested improvement
>>> (testing if the latch was empty, and only then rejecting the thread)
>>> would solve this.
>>
>> Well, there's the thing. Loop discovery is marking BB5 as the latch,
>> so it's not getting threaded:
>
> Yes, it's a latch, but not an empty one. So the thread would make it
> just even more non-empty, which might not be a problem anymore. So
> amending my patch somewhere with a strategic && empty_block_p (latch)
> and only then rejecting the thread should make this testcase work again.

Sweet.

> (There's still a catch, though: if this non-empty latch, which is also
> the exit test block, is threaded through and is followed by actual
> code, then that code will be inserted onto the back edge, not into the
> latch block before the exit test, and so also create a (new) non-empty
> latch. That might or might not create problems downstream, but not as
> many as transforming an empty latch into a non-empty one would; it
> could, for instance, create a second back edge (not in this testcase)
> and suchlike.)
>
> [...]
>
> So, you are saying that any candidate thread path wouldn't have the
> latch in the last position if it were just an empty forwarder? I
> simply wasn't sure about that, so I was conservative and only wanted
> to reject things I knew were positively bad (namely code in the path
> following the latch that is in danger of being moved into the latch).

Correct. In the backwards threader, we don't start the party if the
last block has < 2 succs:

  FOR_EACH_BB_FN (bb, fun)
    {
      if (EDGE_COUNT (bb->succs) > 1)
        threader.maybe_thread_block (bb);
    }

Thanks.
Aldy
Re: More aggressive threading causing loop-interchange-9.c regression
On 9/9/2021 2:14 AM, Aldy Hernandez wrote:
> On 9/8/21 8:13 PM, Michael Matz wrote:
> > Hello,
> >
> > [lame answer to self]
> >
> > On Wed, 8 Sep 2021, Michael Matz wrote:
> >
> > > > The forward threader guards against this by simply disallowing
> > > > threadings that involve different loops.  As I see
> > >
> > > The thread in question (5->9->3) is all within the same outer loop,
> > > though.  BTW, the backward threader also disallows threading across
> > > different loops (see path_crosses_loops variable).
> >
> > ...
> >
> > > Maybe it's possible to not disable threading over latches altogether
> > > in the backward threader (like it's tried now), but I haven't looked
> > > at the specific situation here in depth, so take my view only as
> > > opinion from a large distance :-)
> >
> > I've now looked at the concrete situation.  So yeah, the whole path is
> > in the same loop, crosses the latch, _and there's code following the
> > latch on that path_.  (I.e. the latch isn't the last block in the
> > path.)  In particular, after loop_optimizer_init() (before any
> > threading) we have:
> >
> >      [local count: 118111600]:
> >     # j_19 = PHI
> >     sum_11 = c[j_19];
> >     if (n_10(D) > 0)
> >       goto ; [89.00%]
> >     else
> >       goto ; [11.00%]
> >
> >      [local count: 105119324]:
> >     ...
> >
> >      [local count: 118111600]:
> >     # sum_21 = PHI
> >     c[j_19] = sum_21;
> >     j_13 = j_19 + 1;
> >     if (n_10(D) > j_13)
> >       goto ; [89.00%]
> >     else
> >       goto ; [11.00%]
> >
> >      [local count: 105119324]:
> >     goto ; [100.00%]
> >
> > With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the
> > pre-header of the inner loop, but more importantly something that's
> > not at the start of the outer loop.
> >
> > Now, any thread that includes the backedge 9->3 _including_ its
> > destination (i.e. where the backedge isn't the last to-be-redirected
> > edge) necessarily duplicates all code from that destination onto the
> > back edge.  Here it's the load from c[j] into sum_11.
> >
> > The important part is that the code is emitted onto the back edge,
> > conceptually; in reality it's simply included into the (new) latch
> > block (the duplicate of bb9, which is bb12 intermediately, then named
> > bb7 after cfg_cleanup).
> >
> > That's what we can't have for some of our structural loop optimizers:
> > there must be no code executed after the exit test (e.g. in the latch
> > block).  (This requirement makes reasoning about which code is or
> > isn't executed completely for an iteration trivial: simply everything
> > in the body is always executed.  E.g. loop interchange uses this to
> > check that there are no memory references after the exit test, because
> > those would then be only conditional and hence make loop interchange
> > very awkward.)
> >
> > Note that this situation can't be rectified later anymore: the
> > duplicated instructions (because they are memory refs) must remain
> > after the exit test.  Only by rerolling/unrotating the loop (i.e.
> > noticing that the memory refs on the loop-entry path and on the back
> > edge are equivalent) would that be possible, but that's something we
> > aren't capable of.  Even if we were, it would simply revert the whole
> > work the threader did, so it's better to not even do that to start
> > with.
> >
> > I believe something like the below would be appropriate; it disables
> > threading if the path contains a latch at a non-last position (since
> > the path is stored backwards, that's a non-first position in the
> > array).  I.e. it disables rotating the loop if there's danger of
> > polluting the back edge.  It might be improved if the blocks following
> > (preceding!) the latch are themselves empty, because then no code is
> > duplicated.  It might also be improved if the latch is already
> > non-empty.  That code should probably only be active before the loop
> > optimizers, but currently the backward threader isn't differentiating
> > between before/after loop-optims.
> >
> > I haven't tested this patch at all, except that it fixes the
> > testcase :)
>
> Thanks for looking at this.  I think you're onto something with this
> approach.  Perhaps in addition to the loop header threading Richard
> mentions.
>
> Your patch causes some regressions, but I think most are noise from FSM
> tests that must be adjusted.  However, there are some other ones that
> are curious:
>
>   > FAIL: gcc.dg/tree-ssa/ldist-22.c scan-tree-dump ldist "generated memset zero"
>   > FAIL: gcc.dg/tree-ssa/pr66752-3.c scan-tree-dump-not dce2 "if .flag"
>   < XFAIL: gcc.dg/shrink-wrap-loop.c scan-rtl-dump pro_and_epilogue "Performing shrink-wrapping"
>   < XFAIL: gcc.dg/Warray-bounds-87.c pr101671 (test for bogus messages, line 36)
>   > FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times graphite "1 loops carried no dependency" 1
>   > FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times optimized "loopfn.1" 4
>   > FAIL: libgomp.graphite/force-parallel-8.c scan-tree-dump-times graphite "5 loops carried no dependency" 1
>
> Interestingly, your patch is fixing shrink-wrap-loop.c and
> Warray-bounds-87, both of which were introduced by the backward
> threader rewrite.  At least the Warray-bounds one was the threader
> peeling off an iteration that caused a bogus warning.
>
> The ldist-22 regression is interesting though:
>
>   void foo ()
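[The ldist-22.c body is truncated above.  For context:
-ftree-loop-distribute-patterns recognizes zero-initialization loops and
replaces them with a memset, which produces the "generated memset zero"
line the test scans for in the ldist dump.  A minimal, hypothetical
example of the shape involved, not the actual testcase:]

  void
  zero_array (int *a, int n)
  {
    /* Loop distribution rewrites this loop into
       memset (a, 0, n * sizeof (int)) when n > 0.  */
    for (int i = 0; i < n; i++)
      a[i] = 0;
  }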
Re: More aggressive threading causing loop-interchange-9.c regression
On 9/9/2021 2:58 AM, Richard Biener wrote:
> On Thu, Sep 9, 2021 at 10:36 AM Aldy Hernandez wrote:
> > On 9/9/21 9:45 AM, Richard Biener wrote:
> > > On Thu, Sep 9, 2021 at 9:37 AM Aldy Hernandez wrote:
> > > > On 9/9/21 8:57 AM, Richard Biener wrote:
> > > > > On Wed, Sep 8, 2021 at 8:13 PM Michael Matz wrote:
> > > > > > Hello,
> > > > > >
> > > > > > [lame answer to self]
> > > > > >
> > > > > > [... Michael's analysis, quoted in full in the previous
> > > > > > message, elided here ...]
> > > > > >
> > > > > > I haven't tested this patch at all, except that it fixes the
> > > > > > testcase :)
>
> Lame comment at the current end of the thread - it's not threading
> through the

I don't know why y'all keep using the word "lame".  On the contrary,
these are incredibly useful explanations.  Thanks.

> latch but threading through the loop header that's problematic, at
> least if the end of the threading path ends within the loop (threading
> through the header to the loop exit is fine).  Because in that
> situation you effectively created an alternate loop entry.  Threading
> through the latch into the loop header is fine, but with simple latches
> that likely will never happen (if there are no simple latches, then the
> latch can contain the loop exit test).
>
> See tree-ssa-threadupdate.c:thread_block_1:
>
>   e2 = path->last ()->e;
>   if (!e2 || noloop_only)
>     {
>       /* If NOLOOP_ONLY is true, we only allow threading through the
>          header of a loop to exit edges.  */
>
>       /* One case occurs when there was loop header buried in a jump
>          threading path that crosses
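[The thread_block_1 quote above is cut off.  As a paraphrase of the
invariant Richard describes, not the actual GCC code: a path through a
loop header is acceptable only when its final edge leaves the loop,
because otherwise the duplicated header becomes a second entry point.
loop_exit_edge_p () and bb->loop_father are existing interfaces;
cancel_thread_path () is a hypothetical stand-in:]

  edge e2 = path->last ()->e;
  class loop *loop = bb->loop_father;
  /* Threading through the header is fine only if the path exits the
     loop; a path ending inside the loop would create an alternate
     loop entry, so cancel it.  */
  if (bb == loop->header
      && e2
      && !loop_exit_edge_p (loop, e2))
    cancel_thread_path (path);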
gcc-9-20210909 is now available
Snapshot gcc-9-20210909 is now available on
  https://gcc.gnu.org/pub/gcc/snapshots/9-20210909/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 9 git branch
with the following options: git://gcc.gnu.org/git/gcc.git branch releases/gcc-9 revision 2821499e39653d6d41136af65af10fddad7423cb

You'll find:

  gcc-9-20210909.tar.xz    Complete GCC

    SHA256=da07caf6c1423156feff2321f1ddbf17a9cf495d4d98575b9dbf34eef066e5ad
    SHA1=8872a0e0560723ce2dbf29d9b216e79fc1920f5d

Diffs from 9-20210902 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-9
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.