Re: More aggressive threading causing loop-interchange-9.c regression
First of all, thanks so much for this incredibly useful explanation.  It helps a lot.

I'm still trying to wrap my head around this, so please bear with me.

On 9/7/21 4:45 PM, Michael Matz wrote:
> Hello,
>
> On Tue, 7 Sep 2021, Aldy Hernandez via Gcc wrote:
>
>> The regression comes from the simple_reduc_1() function in
>> tree-ssa/loop-interchange-9.c, and it involves the following path:
>>
>> === BB 5
>> Imports: n_10(D)  j_19
>> Exports: n_10(D)  j_13  j_19
>>          j_13 : j_19(I)
>> n_10(D)  int [1, 257]
>> j_19     int [0, 256]
>> Relational : (j_13 > j_19)
>>     [local count: 118111600]:
>>     # sum_21 = PHI
>>     c[j_19] = sum_21;
>>     j_13 = j_19 + 1;
>>     if (n_10(D) > j_13)
>>       goto ; [89.00%]
>>     else
>>       goto ; [11.00%]
>
> So, this is the outer loops exit block ...
>
>> === BB 9
>> n_10(D)  int [2, 257]
>> j_13     int [1, 256]
>> Relational : (n_10(D) > j_19)
>> Relational : (n_10(D) > j_13)
>>     [local count: 105119324]:
>>     goto ; [100.00%]
>
> ... this the outer loops latch block ...
>
>> === BB 3
>> Imports: n_10(D)
>> Exports: n_10(D)
>> n_10(D)  int [1, +INF]
>>     [local count: 118111600]:
>>     # j_19 = PHI
>>     sum_11 = c[j_19];
>>     if (n_10(D) > 0)
>>       goto ; [89.00%]
>>     else
>>       goto ; [11.00%]
>
> ... and this the outer loops header, as well as inner loops pre-header.

Actually, the inner loop's pre-header seems to be BB8, which is empty, and leads into BB4, which is the header of the inner loop.

> The attempted thread hence involves a back-edge (of the outer loop) and a
> loop-entry edge (bb3->bb8) of the inner loop.  Doing that threading would
> destroy some properties that our loop optimizers rely on, e.g. that the
> loop header of a natural loop is only reached by two edges: entry edge and
> back edge, and that the latch blocks are empty and that there's only a
> single latch.  (What exactly we require depends on several flags in
> loops_state_satisfies_p).

But threading 3->8 doesn't involve a loop-entry edge.  It involves a new edge into the *pre-header* of the inner loop.

I am attaching the IL right after threading and before we clean up the CFG (il-after-threading.txt).

After threading we have rewritten the 5->9 edge into 5->11:

    [local count: 118111600]:
    # sum_21 = PHI
    c[j_19] = sum_21;
    j_13 = j_19 + 1;
    if (n_10(D) > j_13)
      goto ; [89.00%]
    else
      goto ; [11.00%]

    [local count: 105119324]:

    [local count: 105119324]:
    # j_7 = PHI
    sum_6 = c[j_19];
    goto ; [100.00%]

    [local count: 0]:   ;; This becomes unreachable.
    goto ; [100.00%]

Notice BB9 becomes unreachable, so we're basically eliding the latch in BB9.

Question: could BB12 not be considered the loop latch now, and BB8 the outer loop's entry?  This would also mean that BB3 is now outside of the outer loop, which means the threader peeled off the first iteration of the loop.  Or is it a requirement that the latch be empty?

>> With knowledge from previous passes (SSA_NAME_RANGE_INFO), we know that
>> the loop cannot legally iterate outside the size of c[256].  So, j_13
>> lies within [1, 257] and n_10 is [2, +INF] at the end of the path.
>> This allows us to thread BB3 to BB8.
>
> So, IIUC doing this threading would create a new entry to BB8: it would
> then be entered by its natural entry (bb3), by its natural back edge
> (whatever bb that is now) and the new entry from the threaded outer back
> edge (bb9 or bb5?).

As I mentioned, BB8 looks like the pre-header to the inner loop.  But yes, it now has multiple entries: the edge from BB3, and the back edge from BB12 (which seems like it could be a new latch, since it's the only back edge).

> The inner loop wouldn't then be recognized as natural anymore and the
> whole nest not as perfect, and hence loop interchange can't easily happen
> anymore.
> Most other structural loop optimizers of us would have problems with that
> situation as well.

If I clean up the CFG and call loop_optimizer_init() again at this point, what is destroyed is the outer loop, not the inner loop.  If you look at the IL at this point (il-after-cleanup-and-rerunning-loop-init.txt), we have a latch in BB7 going back up to BB4, though the conditional is now in BB6 (eeech):

    ...
    ...
    ...

    [local count: 118111600]:
    # sum_21 = PHI
    # j_18 = PHI
    c[j_18] = sum_21;
    j_13 = j_18 + 1;
    if (n_10(D) > j_13)
      goto ; [89.00%]
    else
      goto ; [11.00%]

    [local count: 105119324]:
    # j_7 = PHI
    sum_6 = c[j_7];
    goto ; [100.00%]

Perhaps this looks sufficiently different that the loop optimizer can't recognize it as a loop?

All the blocks lie within the same loop, and the path passes all the tests in path_profitable_p().  Is there something about the shape of this path that should make it invalid in the backward threader, or should the test be marked with -fdisable-tree-threadN (or something else entirely)?  This is a functionality test checking that the very necessary interchange in this test does happen with default plus -floop-interchange.
Re: More aggressive threading causing loop-interchange-9.c regression
On Wed, Sep 8, 2021 at 12:44 PM Aldy Hernandez wrote: > > First of all. Thanks so much for this incredibly useful explanation. > It helps a lot. > > I'm still trying to wrap my head around this, so please bear with me. > > On 9/7/21 4:45 PM, Michael Matz wrote: > > Hello, > > > > On Tue, 7 Sep 2021, Aldy Hernandez via Gcc wrote: > > > >> The regression comes from the simple_reduc_1() function in > >> tree-ssa/loop-interchange-9.c, and it involves the following path: > >> > >> === BB 5 > >> Imports: n_10(D) j_19 > >> Exports: n_10(D) j_13 j_19 > >> j_13 : j_19(I) > >> n_10(D) int [1, 257] > >> j_19 int [0, 256] > >> Relational : (j_13 > j_19) > >> [local count: 118111600]: > >> # sum_21 = PHI > >> c[j_19] = sum_21; > >> j_13 = j_19 + 1; > >> if (n_10(D) > j_13) > >>goto ; [89.00%] > >> else > >>goto ; [11.00%] > > > > So, this is the outer loops exit block ... > > > >> === BB 9 > >> n_10(D) int [2, 257] > >> j_13 int [1, 256] > >> Relational : (n_10(D) > j_19) > >> Relational : (n_10(D) > j_13) > >> [local count: 105119324]: > >> goto ; [100.00%] > > > > ... this the outer loops latch block ... > > > >> === BB 3 > >> Imports: n_10(D) > >> Exports: n_10(D) > >> n_10(D) int [1, +INF] > >> [local count: 118111600]: > >> # j_19 = PHI > >> sum_11 = c[j_19]; > >> if (n_10(D) > 0) > >>goto ; [89.00%] > >> else > >>goto ; [11.00%] > > > > ... and this the outer loops header, as well as inner loops pre-header. > > Actually, the inner loop's pre-header seems to be BB8, which is empty, > and leads into the BB4 which is the header of the inner loop. > > > The attempted thread hence involves a back-edge (of the outer loop) and a > > loop-entry edge (bb3->bb8) of the inner loop. Doing that threading would > > destroy some properties that our loop optimizers rely on, e.g. that the > > loop header of a natural loop is only reached by two edges: entry edge and > > back edge, and that the latch blocks are empty and that there's only a > > single latch. (What exactly we require depends on several flags in > > loops_state_satisfies_p). > > But threading 3->8 doesn't involve a loop-entry edge. It involves a new > edge into the *pre-header* of the inner loop. > > I am attaching the IL right after threading and before we clean up the > CFG (il-after-threading.txt). > > After threading we have have rewritten the 5->9 edge into 5->11: > > [local count: 118111600]: ># sum_21 = PHI >c[j_19] = sum_21; >j_13 = j_19 + 1; >if (n_10(D) > j_13) > goto ; [89.00%] >else > goto ; [11.00%] > > [local count: 105119324]: > > [local count: 105119324]: ># j_7 = PHI >sum_6 = c[j_19]; >goto ; [100.00%] > > [local count: 0]: ;; This becomes unreachable. >goto ; [100.00%] > > Notice BB9 becomes unreachable, so we're basically eliding the latch in BB9. > > Question: could BB12 not be considered the loop latch now and BB8 be the > outer loop's entry? This would also mean, that BB3 is now outside of > the outer loop, which means the threader peeled off the first iteration > of the loop. Or is it a requirement that the latch be empty? > > > > >> With knowledge from previous passes (SSA_NAME_RANGE_INFO), we know that > >> the loop cannot legally iterate outside the size of c[256]. So, j_13 > >> lies within [1, 257] and n_10 is [2, +INF] at the end of the path. > >> This allows us to thread BB3 to BB8. 
> > > > So, IIUC doing this threading would create a new entry to BB8: it would > > then be entered by its natural entry (bb3), by its natural back edge > > (whatever bb that is now) and the new entry from the threaded outer back > > edge (bb9 or bb5?). > > As I mentioned, BB8 looks like the pre-header to the inner loop. But > yes, it now has multiple entries: the edge from BB3, and the back edge > from BB12 (which seems like it could be a new latch, since it's the only > back edge). It would be helpful to have the patch causing the issue to look at the IL. But as Micha said, there needs to be a perfect loop nest for interchange to work. Richard. > > > > The inner loop wouldn't then be recognized as natural anymore and the > > whole nest not as perfect, and hence loop interchange can't easily happen > > anymore. Most other structural loop optimizers of us would have problems > > with that situation as well. > > If I clean up the CFG and call loop_optimizer_init() again at this > point, what is destroyed is the outer loop, not the inner loop. If you > look at the IL at this point > (il-after-cleanup-and-rerunning-loop-init.txt), we have a latch in BB7 > going back up to BB4, though the conditional is now in BB6 (eeech): > > > ... > ... > ... > > > [local count: 118111600]: ># sum_21 = PHI ># j_18 = PHI >c[j_18] = sum_21; >j_13 = j_18 + 1; >if (n_10(D) > j_13) >
Re: More aggressive threading causing loop-interchange-9.c regression
> It would be helpful to have the patch causing the issue to look at the IL.
> But as Micha said, there needs to be a perfect loop nest for interchange
> to work.
>
> Richard.

Absolutely!  I'm attaching the reduced testcase, as well as the patch.

The problematic thread shows up in the thread2 dump:

  Checking profitability of path (backwards):  bb:3 (4 insns) bb:9 (0 insns) bb:5
    Control statement insns: 2
    Overall: 2 insns
    Registering FSM jump thread: (5, 9) incoming edge;  (9, 3) (3, 8) nocopy; (3, 8)

Thanks.
Aldy

commit 1bf3f76a5ff075396b5b9f5f88d6b18649dac2ce
Author: Aldy Hernandez
Date:   Sun Sep 5 16:54:00 2021 +0200

    Take into account global ranges when folding statements in path solver.

    The path solver used by the backwards threader can refine a folded
    range by taking into account any global range known.  This patch
    intersects the folded range with whatever global information we have.

    gcc/ChangeLog:

            * gimple-range-path.cc (path_range_query::internal_range_of_expr):
            Intersect with global ranges.

    > FAIL: gcc.dg/tree-ssa/loop-interchange-9.c scan-tree-dump-times linterchange "Loop_pair is interchanged" 1
    > FAIL: gcc.dg/tree-ssa/cunroll-15.c scan-tree-dump optimized "return 1;"

diff --git a/gcc/gimple-range-path.cc b/gcc/gimple-range-path.cc
index a4fa3b296ff..c616b65756f 100644
--- a/gcc/gimple-range-path.cc
+++ b/gcc/gimple-range-path.cc
@@ -127,6 +127,9 @@ path_range_query::internal_range_of_expr (irange &r, tree name, gimple *stmt)
   basic_block bb = stmt ? gimple_bb (stmt) : exit_bb ();
   if (stmt && range_defined_in_block (r, name, bb))
     {
+      if (TREE_CODE (name) == SSA_NAME)
+        r.intersect (gimple_range_global (name));
+
       set_cache (r, name);
       return true;
     }

/* { dg-do run } */
/* { dg-options "-O2 -floop-interchange -fdump-tree-linterchange-details" } */
/* { dg-require-effective-target size20plus } */
/* { dg-skip-if "too big data segment" { visium-*-* } } */

#define M 256
int a[M][M], b[M][M], c[M], d[M];
void __attribute__((noinline))
simple_reduc_1 (int n)
{
  for (int j = 0; j < n; j++)
    {
      int sum = c[j];
      for (int i = 0; i < n; i++)
        sum = sum + a[i][j]*b[i][j];
      c[j] = sum;
    }
}
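For context, the dump string the test scans for ("Loop_pair is interchanged") corresponds to swapping the two loops in simple_reduc_1 so that the i loop becomes the outer one.  The following is only a hand-written sketch of the expected shape, not the pass's actual output; it reuses the globals a, b, c declared in the testcase above, and the reduction is carried in c[j] across the outer iterations:

  /* Hand-written sketch of the interchanged form, for illustration only.  */
  void
  simple_reduc_1_interchanged (int n)
  {
    for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++)
        c[j] = c[j] + a[i][j] * b[i][j];
  }

This is also why the pass insists on a perfect nest: only when nothing executes between the two loop headers and exit tests can the loops be swapped without changing which statements run per iteration.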
Re: More aggressive threading causing loop-interchange-9.c regression
On Wed, Sep 8, 2021 at 3:25 PM Aldy Hernandez wrote: > > > It would be helpful to have the patch causing the issue to look at the IL. > > But as Micha said, there needs to be a perfect loop nest for interchange > > to work. > > > > Richard. > > Absolutely! I'm attaching the reduced testcase, as well as the patch. > > The problematic thread shows up in the thread2 dump: > > Checking profitability of path (backwards): bb:3 (4 insns) bb:9 (0 > insns) bb:5 >Control statement insns: 2 >Overall: 2 insns >Registering FSM jump thread: (5, 9) incoming edge; (9, 3) (3, 8) > nocopy; (3, 8) Well, so as Micha said, the threading destroys the outer loop by making it have two entries - we actually _do_ recover from this but it results in a non-empty latch of the outer loop and thus the loop is no longer a perfect nest. The interchange pass has special-sauce to recover from the loop store motion applied but not to handle the kind of loop rotation the jump threading caused. The forward threader guards against this by simply disallowing threadings that involve different loops. As I see the threading done here should be 7->3 (outer loop entry) to bb 8 rather than one involving the backedge. Also note the condition is invariant in the loop and in fact subsumed by the condition outside of the loop and it should have been simplified by VRP after pass_ch but I see there's a jump threading pass inbetween pass_ch and the next VRP which is likely the problem. Why does jump threading not try to simplify a condition before attempting to thread through it? After all ranger should be around? Richard. > Thanks. > Aldy
[RISCV] RISC-V GNU Toolchain Biweekly Sync-up call (Sept 9, 2021)
Hi all,

There is an agenda for tomorrow's meeting.  If you have topics to discuss or share, please let me know and I can add them to the agenda.

Agenda:
1. Progress of RISC-V sub-extensions toolchain development
2. Open Discussion

Topic: RISC-V GNU Toolchain Biweekly Sync-up
Join Zoom Meeting: https://zoom.com.cn/j/89393600951?pwd=ZFpWMkZ6Tm1TbUFXT1hZZjZZMHhRQT09
Meeting ID: 893 9360 0951
Passcode: 899662

Meeting times:
- BEIJING, China: 11:00pm Thu, Sep 9 2021 - 12:00am Fri, Sep 10 2021
- PDT/PST, Pacific Daylight Time (US): 8:00am Thu, Sep 9 2021 - 9:00am Thu, Sep 9 2021
- PHILADELPHIA, United States, Pennsylvania: 11:00am Thu, Sep 9 2021 - 12:00pm Thu, Sep 9 2021
- PARIS, France: 5:00pm Thu, Sep 9 2021 - 6:00pm Thu, Sep 9 2021

--
Best wishes,
Wei Wu (吴伟)
Re: More aggressive threading causing loop-interchange-9.c regression
On 9/8/21 3:49 PM, Richard Biener wrote:
> On Wed, Sep 8, 2021 at 3:25 PM Aldy Hernandez wrote:
>>> It would be helpful to have the patch causing the issue to look at the IL.
>>> But as Micha said, there needs to be a perfect loop nest for interchange
>>> to work.
>>>
>>> Richard.
>>
>> Absolutely!  I'm attaching the reduced testcase, as well as the patch.
>>
>> The problematic thread shows up in the thread2 dump:
>>
>>   Checking profitability of path (backwards):  bb:3 (4 insns) bb:9 (0 insns) bb:5
>>     Control statement insns: 2
>>     Overall: 2 insns
>>     Registering FSM jump thread: (5, 9) incoming edge;  (9, 3) (3, 8) nocopy; (3, 8)
>
> Well, so as Micha said, the threading destroys the outer loop by making
> it have two entries - we actually _do_ recover from this but it results
> in a non-empty latch of the outer loop and thus the loop is no longer a
> perfect nest.  The interchange pass has special-sauce to recover from
> the loop store motion applied but not to handle the kind of loop
> rotation the jump threading caused.

Understood.  The backedge into BB8 and the non-empty latch seem to cause problems.

> The forward threader guards against this by simply disallowing
> threadings that involve different loops.  As I see

The thread in question (5->9->3) is all within the same outer loop, though.  BTW, the backward threader also disallows threading across different loops (see path_crosses_loops variable).

> the threading done here should be 7->3 (outer loop entry) to bb 8
> rather than one involving the backedge.  Also note the condition is
> invariant in the loop and in fact subsumed by the condition outside of
> the loop and it should have been simplified by VRP after pass_ch but I
> see there's a jump threading pass inbetween pass_ch and the next VRP
> which is likely the problem.

A 7->3->8 thread would cross loops though, because 7 is outside the outer loop.

> Why does jump threading not try to simplify a condition before
> attempting to thread through it?  After all ranger should be around?

That's a very good question ;-).  The current code does not use the ranger to solve unknowns.  Anything further back than the first block in a path will return varying.  The plan was to add this ranger solving functionality as a follow-up.  I have a whole slew of pending patches adding precisely this functionality.  We should be able to solve ranges outside the path with ranger, as well as relationals occurring in a path.

However, even if there are alternate ways of threading this IL, something like 5->9->3 could still happen.  We need a way to disallow this.  I'm having a hard time determining the hammer for this.  I would vote for disabling threading through latches, but it seems the backward threader is aware of this scenario and allows it anyhow (see threaded_through_latch variable).  Ughh.

Thanks for looking into this.
Aldy
Re: More aggressive threading causing loop-interchange-9.c regression
Hello, On Wed, 8 Sep 2021, Aldy Hernandez wrote: > > The forward threader guards against this by simply disallowing > > threadings that involve different loops. As I see > > The thread in question (5->9->3) is all within the same outer loop, > though. BTW, the backward threader also disallows threading across > different loops (see path_crosses_loops variable). > > > the threading done here should be 7->3 (outer loop entry) to bb 8 > > rather than one involving the backedge. Also note the condition is > > invariant in the loop and in fact subsumed by the condition outside of > > the loop and it should have been simplified by VRP after pass_ch but I > > see there's a jump threading pass inbetween pass_ch and the next VRP > > which is likely the problem. > > A 7->3->8 thread would cross loops though, because 7 is outside the > outer loop. ... > However, even if there are alternate ways of threading this IL, > something like 5->9->3 could still happen. We need a way to disallow > this. I'm having a hard time determining the hammer for this. I would > vote for disabling threading through latches, but it seems the backward > threader is aware of this scenario and allows it anyhow (see > threaded_through_latch variable). Ughh. The backward threader seems to want to be careful with latches, but still allow it in some situations, in particular when doing so doesn't create a loop with non-simple latches (which is basically a single and empty latch block). If this improvement under discussion leads to creating a non-empty latch then those checks aren't restrictive enough (anymore). I think threading through a latch is always dubious regarding the loop structure, it's basically either loop rotation or iteration peeling, even if it doesn't cause non-simple latches. Those transformations should probably be left to a loop optimizer, or be only done when destroying loop structure is fine (i.e. late). Maybe it's possible to not disable threading over latches alltogether in the backward threader (like it's tried now), but I haven't looked at the specific situation here in depth, so take my view only as opinion from a large distance :-) Does anything break if you brutally disable any backward threading when any of the involved blocks is a latch when current_loops is set? (I guess for that latter test to be effective you want to disable the loop_optimizer_init() for the "late" jump thread passes) Ciao, Michael.
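For concreteness, the "brutal" version of that suggestion would amount to a check along the following lines in the backward threader's path vetting.  This is only a sketch of the idea; the helper name is invented, the real profitability code lives in back_threader_profitability::profitable_path_p, and none of this has been tested:

  /* Sketch: reject a candidate path outright if any block on it is a
     loop latch while loop structures (current_loops) are available.  */
  static bool
  path_touches_latch_p (const vec<basic_block> &path)
  {
    if (!current_loops)
      return false;
    for (unsigned i = 0; i < path.length (); ++i)
      {
        basic_block bb = path[i];
        if (bb->loop_father && bb->loop_father->latch == bb)
          return true;
      }
    return false;
  }

  /* ... and the caller would simply refuse to register the thread:
       if (path_touches_latch_p (m_path))
         return false;  */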
Re: More aggressive threading causing loop-interchange-9.c regression
Hello,

[lame answer to self]

On Wed, 8 Sep 2021, Michael Matz wrote:

> > > The forward threader guards against this by simply disallowing
> > > threadings that involve different loops.  As I see
> >
> > The thread in question (5->9->3) is all within the same outer loop,
> > though.  BTW, the backward threader also disallows threading across
> > different loops (see path_crosses_loops variable).
...
> Maybe it's possible to not disable threading over latches alltogether in
> the backward threader (like it's tried now), but I haven't looked at the
> specific situation here in depth, so take my view only as opinion from a
> large distance :-)

I've now looked at the concrete situation.  So yeah, the whole path is in the same loop, crosses the latch, _and there's code following the latch on that path_.  (I.e. the latch isn't the last block in the path.)  In particular, after loop_optimizer_init() (before any threading) we have:

    [local count: 118111600]:
    # j_19 = PHI
    sum_11 = c[j_19];
    if (n_10(D) > 0)
      goto ; [89.00%]
    else
      goto ; [11.00%]

    [local count: 105119324]:
    ...

    [local count: 118111600]:
    # sum_21 = PHI
    c[j_19] = sum_21;
    j_13 = j_19 + 1;
    if (n_10(D) > j_13)
      goto ; [89.00%]
    else
      goto ; [11.00%]

    [local count: 105119324]:
    goto ; [100.00%]

With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the pre-header of the inner loop, but more importantly something that's not at the start of the outer loop.

Now, any thread that includes the backedge 9->3 _including_ its destination (i.e. where the backedge isn't the last to-be-redirected edge) necessarily duplicates all code from that destination onto the back edge.  Here it's the load from c[j] into sum_11.

The important part is the code is emitted onto the back edge, conceptually; in reality it's simply included into the (new) latch block (the duplicate of bb9, which is bb12 intermediately, then named bb7 after cfg_cleanup).

That's what we can't have for some of our structural loop optimizers: there must be no code executed after the exit test (e.g. in the latch block).  (This requirement makes reasoning about which code is or isn't executed completely for an iteration trivial; simply everything in the body is always executed; e.g. loop interchange uses this to check that there are no memory references after the exit test, because those would then be only conditional and hence make loop interchange very awkward.)

Note that this situation can't be later rectified anymore: the duplicated instructions (because they are memory refs) must remain after the exit test.  Only by rerolling/unrotating the loop (i.e. noticing that the memory refs on the loop-entry path and on the back edge are equivalent) would that be possible, but that's something we aren't capable of.  Even if we were, that would simply just revert the whole work that the threader did, so it's better to not even do that to start with.

I believe something like below would be appropriate; it disables threading if the path contains a latch at the non-last position (due to being backwards, on the non-first position in the array).  I.e. it disables rotating the loop if there's danger of polluting the back edge.  It might be improved if the blocks following (preceding!) the latch are themselves empty, because then no code is duplicated.  It might also be improved if the latch is already non-empty.  That code should probably only be active before the loop optimizers, but currently the backward threader isn't differentiating between before/after loop-optims.
I haven't tested this patch at all, except that it fixes the testcase :)


Ciao,
Michael.

diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c
index 449232c7715..528a753b886 100644
--- a/gcc/tree-ssa-threadbackward.c
+++ b/gcc/tree-ssa-threadbackward.c
@@ -600,6 +600,7 @@ back_threader_profitability::profitable_path_p (const vec &m_path,
   loop_p loop = m_path[0]->loop_father;
   bool path_crosses_loops = false;
   bool threaded_through_latch = false;
+  bool latch_within_path = false;
   bool multiway_branch_in_path = false;
   bool threaded_multiway_branch = false;
   bool contains_hot_bb = false;
@@ -725,7 +726,13 @@ back_threader_profitability::profitable_path_p (const vec &m_path,
	 the last entry in the array when determining if we thread through
	 the loop latch.  */
       if (loop->latch == bb)
-	threaded_through_latch = true;
+	{
+	  threaded_through_latch = true;
+	  if (j != 0)
+	    latch_within_path = true;
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file, " (latch)");
+	}
     }
 
   gimple *stmt = get_gimple_control_stmt (m_path[0]);
@@ -845,6 +852,15 @@ back_threader_profitability::profitable_path_p (const vec &m_path,
			     "a multiway branch.\n");
	  return false;
	}
+
+  if (latch_within_path)
+{
+  if
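To see the effect being described here at the source level: with the testcase from earlier in the thread, the thread through the latch effectively rotates/peels the outer loop so that the next iteration's load of c[j] happens after the exit test.  The following is only a hand-written illustration of that shape (function names invented, not compiler output); it reuses the testcase's globals a, b, c:

  /* Outer loop as written: the c[j] load sits in the header, before the
     exit test, and the latch is empty.  */
  void outer_as_written (int n)
  {
    for (int j = 0; j < n; j++)
      {
        int sum = c[j];
        for (int i = 0; i < n; i++)
          sum = sum + a[i][j] * b[i][j];
        c[j] = sum;
      }
  }

  /* After threading 5->9->3, the first header is peeled and a copy of
     the c[j] load lands on the back edge, i.e. in a latch that is no
     longer empty -- a memory reference after the exit test, which is
     exactly what loop interchange cannot tolerate.  */
  void outer_after_thread (int n)
  {
    if (n <= 0)
      return;
    int j = 0;
    int sum = c[j];
    for (;;)
      {
        for (int i = 0; i < n; i++)
          sum = sum + a[i][j] * b[i][j];
        c[j] = sum;
        j = j + 1;
        if (!(n > j))
          break;
        sum = c[j];   /* duplicated load, now after the exit test */
      }
  }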
Some questions on hash_set/hash_map traits
Hello,

I am refactoring some old code that runs as an IPA_PASS.  This code is intended to run at link-time using the LTO framework and so I am always testing it that way.  At the moment, I am trying to use hash-maps but having some difficulties.  I am adding some patches that will help illustrate along the way.

The first patch simply adds a link-time optimization pass that will be used to demo the problem that I am having.  One can run it with -flto -fipa-hash.  During the first patch, the pass does not do any meaningful work, but it just helps to set up the stage.

For the second patch I create a wrapper around a symtab_node.  This wrapper is intended to add some other functionality to elements that I might want to insert in hash-sets or hash-maps.

  class symtab_wrapper
  {
  private:
    symtab_node *_symtab_node;

  public:
    friend symtab_wrapper_traits;

    symtab_wrapper () : _symtab_node (NULL) {}
    symtab_wrapper (symtab_node *s) : _symtab_node (s) { gcc_assert (s); }

    void print (void)
    {
      if (!dump_file) return;
      gcc_assert (_symtab_node);
      fprintf (dump_file, "%s\n", _symtab_node->name ());
    }
  };

I also have a wrapper around a hash-set to add some things that might be of interest to me in the future:

  template >
  class my_set_t
  {
  private:
    hash_set _impl;

  public:
    my_set_t () {}

    void insert (const KeyId &k) { _impl.add (k); }
    bool contains (const KeyId &k) { return _impl.contains (k); }

    void print (void)
    {
      if (!dump_file) return;
      for (auto i = _impl.begin (), e = _impl.end (); i != e; ++i)
        {
          (*i).print ();
        }
    }
  };

I also created a symtab_wrapper_traits because I want to store the object in the hash-map itself, and so I need to specify how to hash it.

  class symtab_wrapper_traits
  {
  public:
    typedef symtab_wrapper value_type;
    typedef symtab_wrapper compare_type;
    static const bool empty_zero_p = true;

    static inline hashval_t hash (const value_type &v)
    { return pointer_hash ::hash (v._symtab_node); }

    static bool equal (const value_type &l, const value_type &r)
    { return l._symtab_node == r._symtab_node; }

    static void mark_deleted (value_type &v) { v._symtab_node = NULL; }
    static void mark_empty (value_type &v) { v._symtab_node = NULL; }
    static void remove (value_type &v) { v._symtab_node = NULL; }
    static bool is_deleted (const value_type &v) { return !v._symtab_node; }
    static bool is_empty (const value_type &v) { return !v._symtab_node; }
  };

I then create a global set with my traits and populate it with some information and later print it.  It seems to be working:

  my_set_t my_set;

  static void
  ipa_hash_generate_summary (void)
  {
    cgraph_node *cnode = NULL;
    FOR_EACH_FUNCTION (cnode)
      {
        symtab_wrapper w (cnode);
        my_set.insert (w);
      }
    my_set.print ();
    return;
  }

Here I already have some questions, but this is not the biggest issue that I am having: I believe that the hash_set implementation manages its own memory, but I am unclear about the semantics of "removed".

* Is "removed" supposed to, for example, call the destructor?  Since, in this particular case, it is only a wrapper and cgraph_node is not intended to be destroyed, I just assign the pointer to NULL.  Not sure if that's the intention.
* I don't understand the semantics of empty_zero_p.  I think it is used to zero out deleted entries?
* If I mark something as empty, I am assuming that its memory will be re-used; is this the case?
* I see that there is a typed_delete_remove function that is sometimes used in traits, but I think that it doesn't apply to this case because I am storing the whole object.  Is that the case?
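For comparison, GCC's own pointer-based traits (pointer_hash in hash-traits.h) keep "empty" and "deleted" as two distinct sentinel values rather than mapping both to NULL, so the table can tell a never-used slot from one whose entry was removed.  A sketch of symtab_wrapper traits written in that style is shown below, purely for illustration and untested; whether this distinction is actually required here is part of what the questions above are asking:

  /* Sketch only: distinct empty (NULL) and deleted ((symtab_node *) 1)
     markers, and a remove() that frees nothing because the wrapper does
     not own the node.  */
  class symtab_wrapper_traits_sketch
  {
  public:
    typedef symtab_wrapper value_type;
    typedef symtab_wrapper compare_type;
    static const bool empty_zero_p = true;  /* all-zero bits count as empty  */

    static inline hashval_t hash (const value_type &v)
    { return pointer_hash <symtab_node>::hash (v._symtab_node); }

    static bool equal (const value_type &l, const value_type &r)
    { return l._symtab_node == r._symtab_node; }

    static void mark_empty (value_type &v) { v._symtab_node = NULL; }
    static bool is_empty (const value_type &v)
    { return v._symtab_node == NULL; }

    static void mark_deleted (value_type &v)
    { v._symtab_node = (symtab_node *) 1; }
    static bool is_deleted (const value_type &v)
    { return v._symtab_node == (symtab_node *) 1; }

    static void remove (value_type &) {}
  };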
I later did some refactoring (patch 3), which did not seem to impact the code now, but will impact hash_map later.  The refactoring is as follows: I create an abstract "printable" class

  class printable
  {
  public:
    virtual char* str () = 0;
    void print (void);
  };

  void
  printable::print (void)
  {
    if (!dump_file) return;
    char* _str = str ();
    fprintf (dump_file, "%s\n", _str);
    free (_str);
  }

make symtab_wrapper a publicly derived class

  -class symtab_wrapper
  +class symtab_wrapper : public printable

and implement the str function in symtab_wrapper:

  virtual char* str (void)
  {
    if (!dump_file) return;
    gcc_assert (_symtab_node);
    char *retval = NULL;
    asprintf (&retval, "%s", _symtab_node->name ());
    return retval;
  }

Again, this seems to be working correctly.

What is more interesting is moving from a hash-set to a hash-map.  In the fourth patch, I implemented the hash_map equivalent to the hash_set that I am describing above:

  template
  class my_map_t
  {
  private:
    hash_map _impl;

  public:
    my_map_t () {}

    void insert (const KeyId &k, const Value &v) { _impl.put (k, v); }
    Value *select (const KeyId &k) { return _impl.get (k); }

    void print (void)
    {
      if (!dump_file) return;
      for (auto i = _impl.begin (), e =
Re: More aggressive threading causing loop-interchange-9.c regression
On Wed, Sep 8, 2021 at 8:13 PM Michael Matz wrote: > > Hello, > > [lame answer to self] > > On Wed, 8 Sep 2021, Michael Matz wrote: > > > > > The forward threader guards against this by simply disallowing > > > > threadings that involve different loops. As I see > > > > > > The thread in question (5->9->3) is all within the same outer loop, > > > though. BTW, the backward threader also disallows threading across > > > different loops (see path_crosses_loops variable). > ... > > Maybe it's possible to not disable threading over latches alltogether in > > the backward threader (like it's tried now), but I haven't looked at the > > specific situation here in depth, so take my view only as opinion from a > > large distance :-) > > I've now looked at the concrete situation. So yeah, the whole path is in > the same loop, crosses the latch, _and there's code following the latch > on that path_. (I.e. the latch isn't the last block in the path). In > particular, after loop_optimizer_init() (before any threading) we have: > >[local count: 118111600]: > # j_19 = PHI > sum_11 = c[j_19]; > if (n_10(D) > 0) > goto ; [89.00%] > else > goto ; [11.00%] > > [local count: 105119324]: > ... > >[local count: 118111600]: > # sum_21 = PHI > c[j_19] = sum_21; > j_13 = j_19 + 1; > if (n_10(D) > j_13) > goto ; [89.00%] > else > goto ; [11.00%] > >[local count: 105119324]: > goto ; [100.00%] > > With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the > pre-header of inner loop, but more importantly something that's not at the > start of the outer loop. > > Now, any thread that includes the backedge 9->3 _including_ its > destination (i.e. where the backedge isn't the last to-be-redirected edge) > necessarily duplicates all code from that destination onto the back edge. > Here it's the load from c[j] into sum_11. > > The important part is the code is emitted onto the back edge, > conceptually; in reality it's simply included into the (new) latch block > (the duplicate of bb9, which is bb12 intermediately, then named bb7 after > cfg_cleanup). > > That's what we can't have for some of our structural loop optimizers: > there must be no code executed after the exit test (e.g. in the latch > block). (This requirement makes reasoning about which code is or isn't > executed completely for an iteration trivial; simply everything in the > body is always executed; e.g. loop interchange uses this to check that > there are no memory references after the exit test, because those would > then be only conditional and hence make loop interchange very awkward). > > Note that this situation can't be later rectified anymore: the duplicated > instructions (because they are memory refs) must remain after the exit > test. Only by rerolling/unrotating the loop (i.e. noticing that the > memory refs on the loop-entry path and on the back edge are equivalent) > would that be possible, but that's something we aren't capable of. Even > if we were that would simply just revert the whole work that the threader > did, so it's better to not even do that to start with. > > I believe something like below would be appropriate, it disables threading > if the path contains a latch at the non-last position (due to being > backwards on the non-first position in the array). I.e. it disables > rotating the loop if there's danger of polluting the back edge. It might > be improved if the blocks following (preceding!) the latch are themself > empty because then no code is duplicated. It might also be improved if > the latch is already non-empty. 
> That code should probably only be active
> before the loop optimizers, but currently the backward threader isn't
> differentiating between before/after loop-optims.
>
> I haven't tested this patch at all, except that it fixes the testcase :)

Lame comment at the current end of the thread - it's not threading through the latch but threading through the loop header that's problematic, at least if the end of the threading path ends within the loop (threading through the header to the loop exit is fine).  Because in that situation you effectively created an alternate loop entry.  Threading through the latch into the loop header is fine but with simple latches that likely will never happen (if there are no simple latches then the latch can contain the loop exit test).

See tree-ssa-threadupdate.c:thread_block_1

  e2 = path->last ()->e;
  if (!e2 || noloop_only)
    {
      /* If NOLOOP_ONLY is true, we only allow threading through the
	 header of a loop to exit edges.  */

      /* One case occurs when there was loop header buried in a jump
	 threading path that crosses loop boundaries.  We do not try
	 and thread this elsewhere, so just cancel the jump threading
	 request by clearing the AUX field now.  */
      if (bb->loop_father != e2->src->loop_father
	  && (!loop_exit_edge_p (