Re: Some questions on hash_set/hash_map traits

2021-09-09 Thread Richard Biener via Gcc
On Wed, Sep 8, 2021 at 10:50 PM Erick Ochoa via Gcc  wrote:
>
> Hello,
>
> I am refactoring some old code that runs as an IPA_PASS. This code is
> intended to run at link-time using the LTO framework and so I am
> always testing it that way. At the moment, I am trying to use
> hash-maps but having some difficulties. I am adding some patches that
> will help illustrate along the way.
>
> The first patch simply adds a link-time optimization pass that will be
> used to demo the problem that I am having. One can run with -flto
> -fipa-hash. In the first patch, the pass does not do any
> meaningful work; it just helps to set the stage.
>
> For the second patch I create a wrapper around a symtab_node. This
> wrapper is intended to add some other functionality to elements that I
> might want to insert in hash-sets or hash-maps.
>
> class symtab_wrapper
> {
> private:
>   symtab_node *_symtab_node;
> public:
>   friend class symtab_wrapper_traits;
>   symtab_wrapper ()
> : _symtab_node (NULL)
>   {}
>   symtab_wrapper (symtab_node *s)
> : _symtab_node (s)
>   { gcc_assert (s); }
>
>   void print (void)
>   {
> if (!dump_file) return;
> gcc_assert (_symtab_node);
> fprintf (dump_file, "%s\n", _symtab_node->name ());
>   }
> };
>
> I also have a wrapper around a hash-set to add some things that might
> be of interest to me in the future:
>
> template <typename KeyId, typename Traits = default_hash_traits <KeyId> >
> class my_set_t
> {
> private:
>   hash_set <KeyId, false, Traits> _impl;
> public:
>   my_set_t () {}
>
>   void insert (const KeyId &k)
>   {
> _impl.add (k);
>   }
>
>   bool contains (const KeyId &k)
>   {
> return _impl.contains (k);
>   }
>
>   void print (void)
>   {
> if (!dump_file) return;
> for (auto i = _impl.begin (), e = _impl.end (); i != e; ++i)
> {
>   (*i).print ();
> }
>   }
> };
>
> I also created a symtab_wrapper_traits because I want to store the
> object in the hash-map itself, and so I need to specify how to hash
> it.
>
> class symtab_wrapper_traits
> {
> public:
>   typedef symtab_wrapper value_type;
>   typedef symtab_wrapper compare_type;
>   static const bool empty_zero_p = true;
>
>   static inline hashval_t hash (const value_type &v)
>   {
>     return pointer_hash <symtab_node>::hash (v._symtab_node);
>   }
>
>   static bool equal (const value_type &l, const value_type &r)
>   {
> return l._symtab_node == r._symtab_node;
>   }
>
>   static void mark_deleted (value_type &v)
>   {
> v._symtab_node = NULL;
>   }
>
>   static void mark_empty (value_type &v)
>   {
> v._symtab_node = NULL;
>   }
>
>   static void remove (value_type &v)
>   {
> v._symtab_node = NULL;
>   }
>
>   static bool is_deleted (const value_type &v)
>   {
>  return !v._symtab_node;
>   }
>
>   static bool is_empty (const value_type &v)
>   {
>  return !v._symtab_node;
>   }
> };
>
> I then create a global set with my traits and populate it with some
> information and later print. It seems to be working:
>
> my_set_t <symtab_wrapper, symtab_wrapper_traits> my_set;
>
>  static void
>  ipa_hash_generate_summary (void)
>  {
>   cgraph_node *cnode = NULL;
>   FOR_EACH_FUNCTION (cnode)
>   {
> symtab_wrapper w (cnode);
> my_set.insert (w);
>   }
>
>   my_set.print ();
>
>   return;
>  }
>
> Here, I already have some questions, but this is not the biggest issue
> that I am having: I believe that the hash_set implementation manages
> its own memory, but I am unclear about the semantics of "removed".
>
> * Is "removed" supposed to, for example, call the destructor? Since,
> in this particular case, it is only a wrapper and cgraph_node is not
> intended to be destroyed, I just assign the pointer to NULL. Not sure
> if that's the intention.
> * I don't understand the semantics of empty_zero_p; I think it is used
> to zero out entries when I mark them as empty?
> * I am assuming that its memory will be re-used, is this the case?
> * I see that there is a typed_delete_remove function that is sometimes
> used in traits, but I think that it doesn't apply to this case because
> I am storing the whole object. Is that the case?
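>
> (On the empty/deleted question: a minimal sketch of traits that keep
> the two states distinct, mirroring what GCC's own pointer_hash does --
> NULL for empty, the pointer value 1 for deleted.  The _v2 name is
> hypothetical; hash, equal, and remove would stay as above:
>
> class symtab_wrapper_traits_v2
> {
> public:
>   typedef symtab_wrapper value_type;
>   typedef symtab_wrapper compare_type;
>   static const bool empty_zero_p = true;
>
>   /* Empty = never-used slot; zero-initialized storage counts as
>      empty because empty_zero_p is true.  */
>   static void mark_empty (value_type &v)
>   { v._symtab_node = NULL; }
>   static bool is_empty (const value_type &v)
>   { return v._symtab_node == NULL; }
>
>   /* Deleted = slot that was occupied and then removed; a distinct
>      sentinel lets lookups keep probing past it.  */
>   static void mark_deleted (value_type &v)
>   { v._symtab_node = reinterpret_cast<symtab_node *> (1); }
>   static bool is_deleted (const value_type &v)
>   { return v._symtab_node == reinterpret_cast<symtab_node *> (1); }
> };
>
> With both marked as NULL, as above, the table cannot tell a deleted
> slot from a never-used one.)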
>
> I later did some refactoring (patch 3), which did not seem to impact
> the code now, but will impact hash_map later. The refactoring is as
> follows: I create an abstract "printable" class
>
> class printable
> {
> public:
>   virtual char* str () = 0;
>   void print (void);
> };
>
> void
> printable::print (void)
> {
>   if (!dump_file) return;
>   char* _str = str ();
>   fprintf (dump_file, "%s\n", _str);
>   free (_str);
> }
>
> Make symtab_wrapper a publicly derived class
>
> -class symtab_wrapper
> +class symtab_wrapper : public printable
>
> and implemented the str function in symtab_wrapper:
>
>   virtual char* str (void)
>   {
>     if (!dump_file) return NULL;
>     gcc_assert (_symtab_node);
>     char *retval = NULL;
>     asprintf (&retval, "%s", _symtab_node->name ());
>     return retval;
>   }
>
> Again, this seems to be working correctly.
>
> What is more interesting is moving from a hash-set to a has

Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-09 Thread Aldy Hernandez via Gcc




On 9/9/21 8:57 AM, Richard Biener wrote:

On Wed, Sep 8, 2021 at 8:13 PM Michael Matz  wrote:


Hello,

[lame answer to self]

On Wed, 8 Sep 2021, Michael Matz wrote:


The forward threader guards against this by simply disallowing
threadings that involve different loops.  As I see


The thread in question (5->9->3) is all within the same outer loop,
though. BTW, the backward threader also disallows threading across
different loops (see path_crosses_loops variable).

...

Maybe it's possible to not disable threading over latches altogether in
the backward threader (like it's tried now), but I haven't looked at the
specific situation here in depth, so take my view only as opinion from a
large distance :-)


I've now looked at the concrete situation.  So yeah, the whole path is in
the same loop, crosses the latch, _and there's code following the latch
on that path_.  (I.e. the latch isn't the last block in the path).  In
particular, after loop_optimizer_init() (before any threading) we have:

<bb 3> [local count: 118111600]:
  # j_19 = PHI <j_13(9), ...>
  sum_11 = c[j_19];
  if (n_10(D) > 0)
    goto <bb 8>; [89.00%]
  else
    goto <bb 5>; [11.00%]

<bb 8> [local count: 105119324]:
...

<bb 5> [local count: 118111600]:
  # sum_21 = PHI <sum_11(3), ...>
  c[j_19] = sum_21;
  j_13 = j_19 + 1;
  if (n_10(D) > j_13)
    goto <bb 9>; [89.00%]
  else
    goto <bb ...>; [11.00%]

<bb 9> [local count: 105119324]:
  goto <bb 3>; [100.00%]

With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the
pre-header of inner loop, but more importantly something that's not at the
start of the outer loop.

Now, any thread that includes the backedge 9->3 _including_ its
destination (i.e. where the backedge isn't the last to-be-redirected edge)
necessarily duplicates all code from that destination onto the back edge.
Here it's the load from c[j] into sum_11.

The important part is the code is emitted onto the back edge,
conceptually; in reality it's simply included into the (new) latch block
(the duplicate of bb9, which is bb12 intermediately, then named bb7 after
cfg_cleanup).

That's what we can't have for some of our structural loop optimizers:
there must be no code executed after the exit test (e.g. in the latch
block).  (This requirement makes reasoning about which code is or isn't
executed completely for an iteration trivial; simply everything in the
body is always executed; e.g. loop interchange uses this to check that
there are no memory references after the exit test, because those would
then be only conditional and hence make loop interchange very awkward).

Note that this situation can't be later rectified anymore: the duplicated
instructions (because they are memory refs) must remain after the exit
test.  Only by rerolling/unrotating the loop (i.e. noticing that the
memory refs on the loop-entry path and on the back edge are equivalent)
would that be possible, but that's something we aren't capable of.  Even
if we were that would simply just revert the whole work that the threader
did, so it's better to not even do that to start with.
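
(A distilled C sketch of the effect -- hypothetical code based on the
dump above rather than the actual testcase, and assuming the outer loop
is known to run at least once:

  /* Before threading: the load from c[j] is the first statement of
     the body, so nothing executes after the exit test.  */
  for (j = 0; j < n; j++)
    {
      sum = c[j];            /* header, bb3 */
      /* ... inner loop ... */
      c[j] = sum;            /* store, then exit test, bb5 */
    }

  /* After threading 5->9->3 the loop is effectively rotated and the
     duplicated load lands on the back edge, i.e. in the new latch:  */
  j = 0;
  sum = c[0];                /* peeled copy of the load */
  for (;;)
    {
      /* ... inner loop ... */
      c[j] = sum;
      j++;
      if (!(n > j))
        break;
      sum = c[j];            /* memory ref after the exit test: the
                                non-empty latch that trips up loop
                                interchange */
    }
)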

I believe something like below would be appropriate, it disables threading
if the path contains a latch at the non-last position (due to being
backwards on the non-first position in the array).  I.e. it disables
rotating the loop if there's danger of polluting the back edge.  It might
be improved if the blocks following (preceding!) the latch are themselves
empty because then no code is duplicated.  It might also be improved if
the latch is already non-empty.  That code should probably only be active
before the loop optimizers, but currently the backward threader isn't
differentiating between before/after loop-optims.

I haven't tested this patch at all, except that it fixes the testcase :)


Lame comment at the current end of the thread - it's not threading through the


I don't know why y'all keep using the word "lame".  On the contrary, 
these are incredibly useful explanations.  Thanks.



latch but threading through the loop header that's problematic, at least if the
end of the threading path ends within the loop (threading through the header
to the loop exit is fine).  Because in that situation you effectively created an
alternate loop entry.  Threading through the latch into the loop header is
fine but with simple latches that likely will never happen (if there are no
simple latches then the latch can contain the loop exit test).

See tree-ssa-threadupdate.c:thread_block_1

  e2 = path->last ()->e;
  if (!e2 || noloop_only)
    {
      /* If NOLOOP_ONLY is true, we only allow threading through the
	 header of a loop to exit edges.  */

      /* One case occurs when there was loop header buried in a jump
	 threading path that crosses loop boundaries.  We do not try
	 and thread this elsewhere, so just cancel the jump threading
	 request by clearing the AUX field now.  */
      if (bb->loop_father != e2->src->loop_father
	  && (!loop_

Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-09 Thread Richard Biener via Gcc
On Thu, Sep 9, 2021 at 9:37 AM Aldy Hernandez  wrote:
> [...]

Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-09 Thread Aldy Hernandez via Gcc




On 9/8/21 8:13 PM, Michael Matz wrote:

[...]

I haven't tested this patch at all, except that it fixes the testcase :)


Thanks for looking at this.

I think you're onto something with this approach.  Perhaps in addition 
to the loop header threading Richard mentions.


Your patch causes some regressions, but I think most are noise from FSM 
tests that must be adjusted.  However, there are some other ones that 
are curious:


> FAIL: gcc.dg/tree-ssa/ldist-22.c scan-tree-dump ldist "generated memset zero"
> FAIL: gcc.dg/tree-ssa/pr66752-3.c scan-tree-dump-not dce2 "if .flag"
< XFAIL: gcc.dg/shrink-wrap-loop.c scan-rtl-dump pro_and_epilogue "Performing shrink-wrapping"
< XFAIL: gcc.dg/Warray-bounds-87.c pr101671 (test for bogus messages, line 36)
> FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times graphite "1 loops carried no dependency" 1
> FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times optimized "loopfn.1" 4
> FAIL: libgomp.graphite/force-parallel-8.c scan-tree-dump-times graphite "5 loops carried no dependency" 1


Interestingly, your patch fixes shrink-wrap-loop.c and Warray-bounds-87, 
both of which are failures introduced by the backward threader rewrite.  
At least the Warray-bounds failure was the threader peeling off an 
iteration, which caused a bogus warning.


The ldist-22 regression is interesting though:

void foo ()
{
  int i;

   :
  goto ; [INV]

   :
  a[i_1] = 0;
  if (i_1 > 

Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-09 Thread Richard Biener via Gcc
On Thu, Sep 9, 2021 at 10:14 AM Aldy Hernandez  wrote:
> [...]

Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-09 Thread Aldy Hernandez via Gcc




On 9/9/21 9:45 AM, Richard Biener wrote:

[...]

Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-09 Thread Richard Biener via Gcc
On Thu, Sep 9, 2021 at 10:36 AM Aldy Hernandez  wrote:
> [...]
Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-09 Thread Aldy Hernandez via Gcc




On 9/9/21 10:58 AM, Richard Biener wrote:

[...]

Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-09 Thread Richard Biener via Gcc
On Thu, Sep 9, 2021 at 11:21 AM Aldy Hernandez  wrote:
> [...]

Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-09 Thread Aldy Hernandez via Gcc

On 9/9/21 12:15 PM, Richard Biener wrote:

On Thu, Sep 9, 2021 at 11:21 AM Aldy Hernandez  wrote:




On 9/9/21 10:58 AM, Richard Biener wrote:



I ran some experiments a while back, and my current work on the enhanced
solver/threader can fold virtually everything the DOM/threader gets
(even with its use of const_and_copies, avail_exprs, and
evrp_range_analyzer), while getting 5% more DOM threads and 1% more
overall threads.  That is, I've been testing if the path solver can
solve everything the DOM threader needs (the hybrid approach I mentioned).

Unfortunately, replacing the forward threader right now is not feasible
for a few reasons:

a) The const_and_copies/avail_exprs relation framework can do floats,
and that's next year's ranger work.


Hmm, it sounds like relying fully on ranger is a bit premature then.


For floats, most definitely.  Ranger doesn't do them.  They're for the 
next release.  However, const_and_copies/avail_exprs for integers we can 
do just fine, that's why evrp/VRP are much easier targets.





b) Even though we can seemingly fold everything DOM/threader does, in
order to replace it with a backward threader instance we'd have to merge
the cost/profitability code scattered throughout the forward threader,
as well as the EDGE_FSM* / EDGE_COPY* business.

c) DOM changes the IL as it goes.  Though we could conceivably divorce
the two and do the threading after DOM is done.


Yeah, it does not actually process/simplify the blocks copied by threading.
In fact I think it only registers jump threading opportunities during the DOM
walk and commits them only later.  But it of course uses its avail / copies
stack to find them - that part you cannot easily divorce.


Yes, all the threaders register paths ahead of time and only do the 
actual CFG shuffling at the end with thread_through_all_blocks().
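
(Roughly, the two-phase flow -- a sketch only: the driver scaffolding
and names here are assumed, but maybe_thread_block and
thread_through_all_blocks are the entry points mentioned in this thread:

  void
  thread_all_blocks (function *fun, back_threader &threader)
  {
    /* Phase 1: analysis only.  Candidate paths are registered;
       the CFG is not touched yet.  */
    basic_block bb;
    FOR_EACH_BB_FN (bb, fun)
      if (EDGE_COUNT (bb->succs) > 1)
        threader.maybe_thread_block (bb);

    /* Phase 2: commit.  All registered paths are applied to the
       CFG in one batch.  */
    threader.thread_through_all_blocks ();
  }
)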




DOM is also yet another value-numbering framework - one could think
of stripping it down from that side, keeping the threading bits only
(but you'd still have the avail / copies bits).


Ughh, yeah.  I've seen some random missed cases because of the 
value-numbering it does :-/.  It's very frustrating that we have various 
ad-hoc VN methods.




That said, it has one nice property it can leverage due to its incredibly
simple memory redundancy handling, in that it usually performs way less
alias queries than FRE (even when you run the latter in non-iterative mode).

But the same way DOM can register jump threading opportunities FRE
could do as well.


My plan for DOM/threading for this release is to replace its evrp 
analyzer with a path solver (path_range_query).  We can still use the 
avail/copies, and everything else should just work.  I'm hand waving on 
a few missed cases due to IL changing, but last time I checked we got 
way more in return.  Andrew even thought we should get most things even 
in the presence of changing IL, but I haven't distilled testcases for him.


For VRP/threading, the same hybrid thing, but replacing the 
ASSERT_EXPRs, and the avail/copies with a path solver.  There are no 
floats here.  Things are looking good in my tests.


Once we do floats, if we could merge the EDGE_*_THREAD and the 
cost/profit bits between both threaders, we should be able to replace 
the entire forward threader with a backward client.  Hopefully I don't 
run out of steam by then ;-).


Thanks for your insight on DOM.  It's very helpful.

Aldy



Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-09 Thread Michael Matz via Gcc
Hello,

On Thu, 9 Sep 2021, Richard Biener wrote:

> > I believe something like below would be appropriate, it disables 
> > threading if the path contains a latch at the non-last position (due 
> > to being backwards on the non-first position in the array).  I.e. it 
> > disables rotating the loop if there's danger of polluting the back 
> > edge.  It might be improved if the blocks following (preceding!) the 
> > latch are themself empty because then no code is duplicated.  It might 
> > also be improved if the latch is already non-empty.  That code should 
> > probably only be active before the loop optimizers, but currently the 
> > backward threader isn't differentiating between before/after 
> > loop-optims.
> >
> > I haven't tested this patch at all, except that it fixes the testcase 
> > :)
> 
> Lame comment at the current end of the thread - it's not threading 
> through the latch but threading through the loop header that's 
> problematic,

I beg to differ, but that's probably because of the ambiguity of the word 
"through" (does it or does it not include the ends of the path :) ).  If 
you thread through the loop header from the entry block (i.e. duplicate 
code from header to entry) all would be fine (or not, in case you created 
multiple entries from outside).  If you thread through the latch, then 
through an empty header and then through another non-empty basic block 
within the loop, you still wouldn't be fine: you've just created code on 
the back edge (and hence into the new latch block).  If you thread through 
the latch and through an empty header (and stop there), you also are fine.

Also note that in this situation you do _not_ create new entries into the 
loop, not even intermediately.  The old back edge is the one that goes 
away due to the threading, the old entry edge is moved completely out of 
the loop, the edge header->thread-dest becomes the new entry edge, and the 
edge new-latch->thread-dest becomes the back edge.  No additional entries.

So, no, it's not the threading through the loop header that is problematic 
but creating a situation that fills the (new) latch with code, and that 
can only happen if the candidate thread contains the latch block.

(Of course it's somewhat related to the header block as well, because that 
is simply the only destination the latch has, and hence is usually 
included in any thread that also include the latch; but it's not the 
header that indicates the situation).

> See tree-ssa-threadupdate.c:thread_block_1
> 
>   e2 = path->last ()->e;
>   if (!e2 || noloop_only)
> {
>   /* If NOLOOP_ONLY is true, we only allow threading through the
>  header of a loop to exit edges.  */
> 
>   /* One case occurs when there was loop header buried in a jump
>  threading path that crosses loop boundaries.  We do not try
>  and thread this elsewhere, so just cancel the jump threading
>  request by clearing the AUX field now.  */
>   if (bb->loop_father != e2->src->loop_father
>   && (!loop_exit_edge_p (e2->src->loop_father, e2)
>   || flow_loop_nested_p (bb->loop_father,
>  e2->dest->loop_father)))
> {
>   /* Since this case is not handled by our special code
>  to thread through a loop header, we must explicitly
>  cancel the threading request here.  */
>   delete_jump_thread_path (path);
>   e->aux = NULL;
>   continue;
> }

Yeah, sure, but I'm not sure if the above check is _really_ testing what it 
wants to test or if the effects it achieves are side effects.  Like in my 
proposed patch: I could also test for existence of loop header in the 
thread and reject that; it would work as well, except that it works 
because any useful thread including a latch (which is the problematic one) 
also includes the header.  I'm not sure if the above check is in the same 
line, or tests for some still another situation.


Ciao,
Michael.


Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-09 Thread Michael Matz via Gcc
Hello,

On Thu, 9 Sep 2021, Aldy Hernandez wrote:

> The ldist-22 regression is interesting though:
> 
> void foo ()
> {
>   int i;
> 
> <bb 2>:
>   goto <bb 6>; [INV]
> 
> <bb 3>:
>   a[i_1] = 0;
>   if (i_1 > 100)
>     goto <bb 4>; [INV]
>   else
>     goto <bb 5>; [INV]
> 
> <bb 4>:
>   b[i_1] = i_1;
> 
> <bb 5>:
>   i_8 = i_1 + 1;
> 
> <bb 6>:
>   # i_1 = PHI <0(2), i_8(5)>
>   if (i_1 <= 1023)
>     goto <bb 3>; [INV]
>   else
>     goto <bb ...>; [INV]

Here there's no simple latch block to start with (the backedge comes 
directly out of the loop exit block).  So my suggested improvement 
(testing if the latch was empty and only then reject the thread), would 
solve this.

> Would it be crazy to suggest that we disable threading through latches 
> altogether,

I think it wouldn't be crazy, but we can do a bit better as suggested 
above (only reject empty latches, and reject it only for the threaders 
coming before the loop optims).


Ciao,
Michael.


Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-09 Thread Aldy Hernandez via Gcc




On 9/9/21 2:52 PM, Michael Matz wrote:

Hello,

On Thu, 9 Sep 2021, Aldy Hernandez wrote:


[...]


Here there's no simple latch block to start with (the backedge comes
directly out of the loop exit block).  So my suggested improvement
(testing if the latch was empty and only then reject the thread), would
solve this.


Well, there's the thing.  Loop discovery is marking BB5 as the latch, so 
it's not getting threaded:


Checking profitability of path (backwards):  bb:6 (2 insns) bb:5 (latch)




Would it be crazy to suggest that we disable threading through latches
altogether,


I think it wouldn't be crazy, but we can do a bit better as suggested
above (only reject empty latches, and reject it only for the threaders
coming before the loop optims).


BTW, I'm not sure your check for the non-last position makes a difference:


diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c
index 449232c7715..528a753b886 100644
--- a/gcc/tree-ssa-threadbackward.c
+++ b/gcc/tree-ssa-threadbackward.c
@@ -600,6 +600,7 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
   loop_p loop = m_path[0]->loop_father;
   bool path_crosses_loops = false;
   bool threaded_through_latch = false;
+  bool latch_within_path = false;
   bool multiway_branch_in_path = false;
   bool threaded_multiway_branch = false;
   bool contains_hot_bb = false;
@@ -725,7 +726,13 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path,
 	 the last entry in the array when determining if we thread
 	 through the loop latch.  */
       if (loop->latch == bb)
-	threaded_through_latch = true;
+	{
+	  threaded_through_latch = true;
+	  if (j != 0)
+	    latch_within_path = true;
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file, " (latch)");
+	}
     }


If the last position being considered is a simple latch, it only has a 
simple outgoing jump.  There's no need to thread that.  You need a block 
with >= 2 succ edges to thread anything.


Aldy



Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-09 Thread Michael Matz via Gcc
Hello,

On Thu, 9 Sep 2021, Aldy Hernandez wrote:

> > Here there's no simple latch block to start with (the backedge comes
> > directly out of the loop exit block).  So my suggested improvement
> > (testing if the latch was empty and only then reject the thread), would
> > solve this.
> 
> Well, there's the thing.  Loop discovery is marking BB5 as the latch, so 
> it's not getting threaded:

Yes, it's a latch, but not an empty one.  So the thread would make it just 
even more non-empty, which might not be a problem anymore.  So amending my 
patch somewhere with a strategic

  && empty_block_p (latch)

and only then rejecting the thread should make this testcase work again.
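
(As a sketch, reusing the names from the earlier patch; the exact
placement inside profitable_path_p is assumed:

  /* Copying code onto the back edge would turn an *empty* latch into
     a non-empty one, so reject only that case; a latch that is
     already non-empty is left alone.  */
  if (threaded_through_latch
      && empty_block_p (loop->latch))
    return false;
)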

(There's still a catch, though: if this non-empty latch, which is also the 
exit test block, is threaded through and is followed by actual code, then 
that code will be inserted onto the back edge, not into the latch block 
before the exit test, and so also create a (new) non-empty latch.  That 
might or might not create problems downstream, but not as many as 
transforming an empty into a non-empty latch would do; but it could 
create for instance a second back edge (not in this testcase) and 
suchlike.)

> BTW, I'm not sure your check for the non-last position makes a difference:
> 
> > diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c
> > index 449232c7715..528a753b886 100644
> > --- a/gcc/tree-ssa-threadbackward.c
> > +++ b/gcc/tree-ssa-threadbackward.c
> > -   threaded_through_latch = true;
> > +   {
> > + threaded_through_latch = true;
> > + if (j != 0)
> > +   latch_within_path = true;
> > + if (dump_file && (dump_flags & TDF_DETAILS))
> > +   fprintf (dump_file, " (latch)");
> > +   }
> >  }
> 
> If the last position being considered is a simple latch, it only has a simple
> outgoing jump.  There's no need to thread that.  You need a block with >= 2
> succ edges to thread anything.

So, you are saying that any candidate thread path wouldn't have the latch 
in the last position if it were just an empty forwarder?  I simply wasn't 
sure about that, so was conservative and only wanted to reject things I 
knew were positively bad (namely code in the path following the latch 
that is in danger of being moved into the latch).


Ciao,
Michael.


Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-09 Thread Aldy Hernandez via Gcc




On 9/9/21 4:44 PM, Michael Matz wrote:

Hello,

On Thu, 9 Sep 2021, Aldy Hernandez wrote:


Here there's no simple latch block to start with (the backedge comes
directly out of the loop exit block).  So my suggested improvement
(testing if the latch was empty and only then reject the thread), would
solve this.


Well, there's the thing.  Loop discovery is marking BB5 as the latch, so
it's not getting threaded:


Yes, it's a latch, but not an empty one.  So the thread would make it just
even more non-empty, which might not be a problem anymore.  So amending my
patch somewhere with a strategic

   && empty_block_p (latch)

and only then rejecting the thread should make this testcase work again.


Sweet.



(There's still a catch, though: if this non-empty latch, which is also the
exit test block, is threaded through and is followed by actual code, then
that code will be inserted onto the back edge, not into the latch block
before the exit test, and so will also create a (new) non-empty latch.  That
might or might not create problems downstream, but not as many as
transforming an empty latch into a non-empty one would do; but it could
create for instance a second back edge (not in this testcase) and
suchlike.)


BTW, I'm not sure your check for the non-last position makes a difference:


diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c
index 449232c7715..528a753b886 100644
--- a/gcc/tree-ssa-threadbackward.c
+++ b/gcc/tree-ssa-threadbackward.c
-   threaded_through_latch = true;
+   {
+ threaded_through_latch = true;
+ if (j != 0)
+   latch_within_path = true;
+ if (dump_file && (dump_flags & TDF_DETAILS))
+   fprintf (dump_file, " (latch)");
+   }
  }


If the last position being considered is a simple latch, it only has a simple
outgoing jump.  There's no need to thread that.  You need a block with >= 2
succ edges to thread anything.


So, you are saying that any candidate thread path wouldn't have the latch
in the last position if it were just an empty forwarder?  I simply wasn't
sure about that, so was conservative and only wanted to reject things I
knew were positively bad (namely code in the path following the latch
that is in danger of being moved into the latch).


Correct.  In the backwards threader, we don't start the party if the 
last block has < 2 succs:


  FOR_EACH_BB_FN (bb, fun)
    {
      if (EDGE_COUNT (bb->succs) > 1)
	threader.maybe_thread_block (bb);
    }

Thanks.
Aldy



Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-09 Thread Jeff Law via Gcc




On 9/9/2021 2:14 AM, Aldy Hernandez wrote:



On 9/8/21 8:13 PM, Michael Matz wrote:

Hello,

[lame answer to self]

On Wed, 8 Sep 2021, Michael Matz wrote:


The forward threader guards against this by simply disallowing
threadings that involve different loops.  As I see


The thread in question (5->9->3) is all within the same outer loop,
though. BTW, the backward threader also disallows threading across
different loops (see path_crosses_loops variable).

...
Maybe it's possible to not disable threading over latches altogether in
the backward threader (like it's tried now), but I haven't looked at the
specific situation here in depth, so take my view only as opinion from a
large distance :-)


I've now looked at the concrete situation.  So yeah, the whole path is in
the same loop, crosses the latch, _and there's code following the latch
on that path_.  (I.e. the latch isn't the last block in the path).  In
particular, after loop_optimizer_init() (before any threading) we have:

    [local count: 118111600]:
   # j_19 = PHI 
   sum_11 = c[j_19];
   if (n_10(D) > 0)
 goto ; [89.00%]
   else
 goto ; [11.00%]

   [local count: 105119324]:
...

    [local count: 118111600]:
   # sum_21 = PHI 
   c[j_19] = sum_21;
   j_13 = j_19 + 1;
   if (n_10(D) > j_13)
 goto ; [89.00%]
   else
 goto ; [11.00%]

    [local count: 105119324]:
   goto ; [100.00%]

With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the
pre-header of inner loop, but more importantly something that's not at the
start of the outer loop.

Now, any thread that includes the backedge 9->3 _including_ its
destination (i.e. where the backedge isn't the last to-be-redirected edge)
necessarily duplicates all code from that destination onto the back edge.
Here it's the load from c[j] into sum_11.

The important part is the code is emitted onto the back edge,
conceptually; in reality it's simply included into the (new) latch block
(the duplicate of bb9, which is bb12 intermediately, then named bb7 after
cfg_cleanup).

That's what we can't have for some of our structural loop optimizers:
there must be no code executed after the exit test (e.g. in the latch
block).  (This requirement makes reasoning about which code is or isn't
executed completely for an iteration trivial; simply everything in the
body is always executed; e.g. loop interchange uses this to check that
there are no memory references after the exit test, because those would
then be only conditional and hence make loop interchange very awkward).
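
A made-up C-level illustration of that requirement (not the testcase
itself; the function names are invented):

/* Before rotation: every memory reference executes unconditionally
   on each iteration, which is what loop interchange wants to see.  */
void incr (int *c, int n)
{
  for (int j = 0; j < n; j++)
    c[j] = c[j] + 1;
}

/* After a rotation that copies the load onto the back edge, the loop
   conceptually becomes the following: the load of c[j] now sits after
   the exit test and executes only when another iteration follows.  */
void incr_rotated (int *c, int n)
{
  if (n > 0)
    {
      int j = 0, t = c[0];
      for (;;)
	{
	  c[j] = t + 1;
	  j++;
	  if (j >= n)
	    break;
	  t = c[j];  /* memory ref after the exit test: conditional  */
	}
    }
}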

Note that this situation can't be later rectified anymore: the duplicated
instructions (because they are memory refs) must remain after the exit
test.  Only by rerolling/unrotating the loop (i.e. noticing that the
memory refs on the loop-entry path and on the back edge are equivalent)
would that be possible, but that's something we aren't capable of.  Even
if we were that would simply just revert the whole work that the threader
did, so it's better to not even do that to start with.

I believe something like below would be appropriate, it disables threading
if the path contains a latch at the non-last position (due to being
backwards on the non-first position in the array).  I.e. it disables
rotating the loop if there's danger of polluting the back edge.  It might
be improved if the blocks following (preceding!) the latch are themselves
empty because then no code is duplicated.  It might also be improved if
the latch is already non-empty.  That code should probably only be active
before the loop optimizers, but currently the backward threader isn't
differentiating between before/after loop-optims.

I haven't tested this patch at all, except that it fixes the testcase :)


Thanks for looking at this.

I think you're onto something with this approach.  Perhaps in addition 
to the loop header threading Richard mentions.


Your patch causes some regressions, but I think most are noise from 
FSM tests that must be adjusted.  However, there are some other ones 
that are curious:


> FAIL: gcc.dg/tree-ssa/ldist-22.c scan-tree-dump ldist "generated memset zero"
> FAIL: gcc.dg/tree-ssa/pr66752-3.c scan-tree-dump-not dce2 "if .flag"
< XFAIL: gcc.dg/shrink-wrap-loop.c scan-rtl-dump pro_and_epilogue "Performing shrink-wrapping"
< XFAIL: gcc.dg/Warray-bounds-87.c pr101671 (test for bogus messages, line 36)
> FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times graphite "1 loops carried no dependency" 1
> FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times optimized "loopfn.1" 4
> FAIL: libgomp.graphite/force-parallel-8.c scan-tree-dump-times graphite "5 loops carried no dependency" 1


Interestingly your patch is fixing shrink-wrap-loop.c and 
Warray-bounds-87, both of which were introduced by the backward 
threader rewrite.  At least the Warray-bounds was the threader peeling 
off an iteration that caused a bogus warning.


The ldist-22 regression is interesting though:

void foo ()

Re: More aggressive threading causing loop-interchange-9.c regression

2021-09-09 Thread Jeff Law via Gcc




On 9/9/2021 2:58 AM, Richard Biener wrote:

On Thu, Sep 9, 2021 at 10:36 AM Aldy Hernandez  wrote:



On 9/9/21 9:45 AM, Richard Biener wrote:

On Thu, Sep 9, 2021 at 9:37 AM Aldy Hernandez  wrote:



On 9/9/21 8:57 AM, Richard Biener wrote:

On Wed, Sep 8, 2021 at 8:13 PM Michael Matz  wrote:

Hello,

[lame answer to self]

On Wed, 8 Sep 2021, Michael Matz wrote:


The forward threader guards against this by simply disallowing
threadings that involve different loops.  As I see

The thread in question (5->9->3) is all within the same outer loop,
though. BTW, the backward threader also disallows threading across
different loops (see path_crosses_loops variable).

...

Maybe it's possible to not disable threading over latches altogether in
the backward threader (like it's tried now), but I haven't looked at the
specific situation here in depth, so take my view only as opinion from a
large distance :-)

I've now looked at the concrete situation.  So yeah, the whole path is in
the same loop, crosses the latch, _and there's code following the latch
on that path_.  (I.e. the latch isn't the last block in the path).  In
particular, after loop_optimizer_init() (before any threading) we have:

  [local count: 118111600]:
 # j_19 = PHI 
 sum_11 = c[j_19];
 if (n_10(D) > 0)
   goto ; [89.00%]
 else
   goto ; [11.00%]

 [local count: 105119324]:
...

  [local count: 118111600]:
 # sum_21 = PHI 
 c[j_19] = sum_21;
 j_13 = j_19 + 1;
 if (n_10(D) > j_13)
   goto ; [89.00%]
 else
   goto ; [11.00%]

  [local count: 105119324]:
 goto ; [100.00%]

With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the
pre-header of inner loop, but more importantly something that's not at the
start of the outer loop.

Now, any thread that includes the backedge 9->3 _including_ its
destination (i.e. where the backedge isn't the last to-be-redirected edge)
necessarily duplicates all code from that destination onto the back edge.
Here it's the load from c[j] into sum_11.

The important part is the code is emitted onto the back edge,
conceptually; in reality it's simply included into the (new) latch block
(the duplicate of bb9, which is bb12 intermediately, then named bb7 after
cfg_cleanup).

That's what we can't have for some of our structural loop optimizers:
there must be no code executed after the exit test (e.g. in the latch
block).  (This requirement makes reasoning about which code is or isn't
executed completely for an iteration trivial; simply everything in the
body is always executed; e.g. loop interchange uses this to check that
there are no memory references after the exit test, because those would
then be only conditional and hence make loop interchange very awkward).

Note that this situation can't be later rectified anymore: the duplicated
instructions (because they are memory refs) must remain after the exit
test.  Only by rerolling/unrotating the loop (i.e. noticing that the
memory refs on the loop-entry path and on the back edge are equivalent)
would that be possible, but that's something we aren't capable of.  Even
if we were that would simply just revert the whole work that the threader
did, so it's better to not even do that to start with.

I believe something like below would be appropriate, it disables threading
if the path contains a latch at the non-last position (due to being
backwards on the non-first position in the array).  I.e. it disables
rotating the loop if there's danger of polluting the back edge.  It might
be improved if the blocks following (preceding!) the latch are themself
empty because then no code is duplicated.  It might also be improved if
the latch is already non-empty.  That code should probably only be active
before the loop optimizers, but currently the backward threader isn't
differentiating between before/after loop-optims.

I haven't tested this patch at all, except that it fixes the testcase :)

Lame comment at the current end of the thread - it's not threading through the

I don't know why y'all keep using the word "lame".  On the contrary,
these are incredibly useful explanations.  Thanks.


latch but threading through the loop header that's problematic, at least if the
end of the threading path ends within the loop (threading through the header
to the loop exit is fine).  Because in that situation you effectively created an
alternate loop entry.  Threading through the latch into the loop header is
fine but with simple latches that likely will never happen (if there are no
simple latches then the latch can contain the loop exit test).
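
(A rough sketch of the alternate-entry problem, with made-up block names:
take a loop entered from two places,

   pre1 -> header    pre2 -> header
   header -> body -> latch -> header

If we thread pre2 -> header -> body, the threaded path gets its own copy
of the header ending in a jump straight to body.  Afterwards the loop's
blocks are reachable from outside both at header (via pre1) and at body
(via the duplicated header): two entries, so no longer a natural loop.)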

See tree-ssa-threadupdate.c:thread_block_1

      e2 = path->last ()->e;
      if (!e2 || noloop_only)
	{
	  /* If NOLOOP_ONLY is true, we only allow threading through the
	     header of a loop to exit edges.  */

	  /* One case occurs when there was loop header buried in a jump
	     threading path that crosses 

gcc-9-20210909 is now available

2021-09-09 Thread GCC Administrator via Gcc
Snapshot gcc-9-20210909 is now available on
  https://gcc.gnu.org/pub/gcc/snapshots/9-20210909/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 9 git branch
with the following options: git://gcc.gnu.org/git/gcc.git branch releases/gcc-9 
revision 2821499e39653d6d41136af65af10fddad7423cb

You'll find:

 gcc-9-20210909.tar.xz   Complete GCC

  SHA256=da07caf6c1423156feff2321f1ddbf17a9cf495d4d98575b9dbf34eef066e5ad
  SHA1=8872a0e0560723ce2dbf29d9b216e79fc1920f5d

Diffs from 9-20210902 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-9
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.