On Fri, 21 Feb 2014, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > On Fri, 21 Feb 2014, Richard Biener wrote:
> > 
> >> On Fri, 21 Feb 2014, Richard Biener wrote:
> >> 
> >> > On Fri, 21 Feb 2014, Richard Biener wrote:
> >> > 
> >> > > 
> >> > > This fixes the slowness of RTL expansion in PR60291, which is caused
> >> > > by excessive collisions in mem-attr sharing.  The issue here is
> >> > > that sharing attempts happen across functions, and we have a _lot_
> >> > > of functions in this testcase referencing the same lexically
> >> > > equivalent memory, for example MEM[(StgWord *)_5 + -64B].  That
> >> > > means those get the same hash value.  But they don't compare
> >> > > equal, because an SSA name _5 from function A is of course not equal
> >> > > to one from function B.
> >> > > 
> >> > > The following fixes that by not doing mem-attr sharing across functions,
> >> > > by clearing the mem-attrs hashtable in rest_of_clean_state.
> >> > > 
> >> > > Another fix may be to do what the comment in iterative_hash_expr
> >> > > says for SSA names:
> >> > > 
> >> > >     case SSA_NAME:
> >> > >       /* We can just compare by pointer.  */
> >> > >       return iterative_hash_host_wide_int (SSA_NAME_VERSION (t), val);
> >> > > 
> >> > > (probably blame me for changing that to hashing the SSA version).
> >> > 
> >> > It was lxo.
> >> > 
> >> > > But I'm not sure that doesn't uncover issues with various hashtables
> >> > > and walking them, generating code dependent on the order.  It's IMHO
> >> > > just not expected that you compare function-local expressions from
> >> > > different functions.
> >> > 
> >> > The same speedup results from
> >> > 
> >> > Index: gcc/tree.c
> >> > ===================================================================
> >> > --- gcc/tree.c      (revision 207960)
> >> > +++ gcc/tree.c      (working copy)
> >> > @@ -7428,7 +7428,7 @@ iterative_hash_expr (const_tree t, hashv
> >> >        }
> >> >      case SSA_NAME:
> >> >        /* We can just compare by pointer.  */
> >> > -      return iterative_hash_host_wide_int (SSA_NAME_VERSION (t), val);
> >> > +      return iterative_hash_hashval_t ((uintptr_t)t>>3, val);
> >> >      case PLACEHOLDER_EXPR:
> >> >        /* The node itself doesn't matter.  */
> >> >        return val;
> >> > 
> >> > and from
> >> > 
> >> > Index: gcc/tree.c
> >> > ===================================================================
> >> > --- gcc/tree.c      (revision 207960)
> >> > +++ gcc/tree.c      (working copy)
> >> > @@ -7428,7 +7428,9 @@ iterative_hash_expr (const_tree t, hashv
> >> >        }
> >> >      case SSA_NAME:
> >> >        /* We can just compare by pointer.  */
> >> > -      return iterative_hash_host_wide_int (SSA_NAME_VERSION (t), val);
> >> > +      return iterative_hash_host_wide_int
> >> > +               (DECL_UID (cfun->decl),
> >> > +                iterative_hash_host_wide_int (SSA_NAME_VERSION (t), val));
> >> >      case PLACEHOLDER_EXPR:
> >> >        /* The node itself doesn't matter.  */
> >> >        return val;
> >> > 
> >> > which is better than hashing pointers, but requiring cfun != NULL in this
> >> > function isn't good either.
> >> > 
> >> > At this point I'm more comfortable with clearing the hashtable
> >> > than with changing iterative_hash_expr in any way.  It's also
> >> > a step along the way to getting rid of the hash completely.
> >> > 
> >> > Oh, btw, the speedup is going from
> >> > 
> >> >  expand                  : 481.98 (94%) usr   1.15 (17%) sys 481.94 (93%) wall  293891 kB (15%) ggc
> >> > 
> >> > to
> >> > 
> >> >  expand                  :   2.66 ( 7%) usr   0.13 ( 2%) sys   2.64 ( 6%) wall  262544 kB (13%) ggc
> >> > 
> >> > at -O0 (the slowness is less dramatic at -On).
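As a minimal, self-contained illustration of the failure mode described above
(hypothetical C with simplified stand-in types and numbers, not the GCC
sources): hashing only SSA_NAME_VERSION puts the "_5" of every function into
the same bucket, while pointer-identity equality rejects every match, so each
insertion into a chained hash table walks an ever-growing collision chain.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Stand-in for a hash table entry keyed by an SSA_NAME.  */
struct entry
{
  const void *ssa_name;   /* the SSA_NAME tree node (compared by pointer) */
  struct entry *next;     /* collision chain */
};

#define NBUCKETS 4096
static struct entry *buckets[NBUCKETS];

int
main (void)
{
  const long nfunctions = 20000;
  long comparisons = 0;

  for (long fn = 0; fn < nfunctions; fn++)
    {
      /* Every function has its own SSA_NAME "_5"; fake that with a
         distinct pointer per function.  */
      const void *ssa_name_5 = (const void *) (uintptr_t) (0x1000 + fn * 8);

      /* Hash as iterative_hash_expr currently does: version only, so
         every function lands in the same bucket.  */
      unsigned h = 5 % NBUCKETS;

      struct entry *e;
      for (e = buckets[h]; e; e = e->next)
        {
          comparisons++;
          if (e->ssa_name == ssa_name_5)   /* equality: pointer identity */
            break;
        }
      if (!e)   /* never equal -> always insert a new entry */
        {
          e = (struct entry *) malloc (sizeof *e);
          e->ssa_name = ssa_name_5;
          e->next = buckets[h];
          buckets[h] = e;
        }
    }

  /* Roughly n*(n-1)/2 comparisons: the quadratic behaviour behind the
     481s expand time reported above.  */
  printf ("%ld comparisons for %ld insertions\n", comparisons, nfunctions);
  return 0;
}

With 20000 such insertions that is already about 2*10^8 pointer comparisons,
which is the kind of quadratic blow-up the expand timings show.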
> >> > 
> >> > > The other thing would be to discard mem-attr sharing altogether,
> >> > > but that doesn't seem appropriate at this stage (but it would
> >> > > also simplify quite some code).  With only one function in RTL
> >> > > at a time that shouldn't be too bad (see several suggestions
> >> > > along that line, even with statistics).
> >> 
> >> Last statistics: http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01784.html
> > 
> > With the patch below to get some statistics we see that one important
> > piece of sharing not covered by the above measurements is RTX copying(?).
> > 
> > On the testcase for this PR I get, at -O1 and without the patch
> > to clear the hashtable after each function,
> > 
> > 142489 mem_attrs created (142439 for new, 50 for modification)
> > 1983225 mem_attrs shared (4044 for new, 820241 for modification, 1158940 by rtx copying)
> > 0 mem_attrs dropped
> > 
> > and with the patch to clear after each function
> > 
> > 364411 mem_attrs created (144478 for new, 219933 for modification)
> > 1761303 mem_attrs shared (2005 for new, 600358 for modification, 1158940 by rtx copying)
> > 0 mem_attrs dropped
> > 
> > while for dwarf2out.c I see, without the clearing,
> > 
> > 24399 mem_attrs created (6929 for new, 17470 for modification)
> > 102676 mem_attrs shared (10878 for new, 29265 for modification, 62533 by rtx copying)
> > 16 mem_attrs dropped
> > 
> > which means that completely dropping the sharing would result
> > in the creation of 6929 + 17807 + 62533(!) vs. 24399 mem-attrs.
> > That's still not a lot of overhead given that mem-attrs take 40 bytes
> > (3MB vs. 950kB).  There is also always the possibility to
> > explicitly ref-count mem-attrs to handle the sharing introduced by rtx
> > copying (at least cse, fwprop, combine, ira and reload copy MEMs,
> > probably some for no good reason because MEMs are not shared),
> > thus making mem-attrs unshare-on-modify.
> 
> In a thread a few years ago you talked about the possibility of going
> further and folding the attributes into the MEM itself, so avoiding
> the indirection and separate allocation:
> 
>   http://thread.gmane.org/gmane.comp.gcc.patches/244464/focus=244538
> 
> (and earlier posts in the thread).  Would that still be OK?
> I might have a go if so.
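(The 3MB vs. 950kB figures above are just those counts times 40 bytes:
87269 * 40 is roughly 3.3MB, 24399 * 40 is roughly 950kB.)  For reference, a
rough sketch of the trade-off being asked about, in illustrative C only, with
simplified fields rather than the actual rtx or mem_attrs definitions:

#include <stdio.h>

/* Simplified stand-in for mem_attrs (the real record is ~40 bytes).  */
struct mem_attrs_like
{
  long alias;          /* alias set */
  long offset;         /* known offset, if any */
  long size;           /* known size, if any */
  unsigned int align;  /* alignment in bits */
};

/* Roughly the current scheme: the MEM carries a pointer to a record
   that may be shared through mem_attrs_htab and by copy_rtx.  */
struct mem_with_attr_pointer
{
  /* ... common rtx fields ... */
  struct mem_attrs_like *attrs;
};

/* The folded scheme asked about above: the record is embedded, so there
   is no hash table and no extra indirection, but every MEM (and every
   copy of it) carries the full record.  */
struct mem_with_folded_attrs
{
  /* ... common rtx fields ... */
  struct mem_attrs_like attrs;
};

int
main (void)
{
  printf ("pointer scheme MEM: %zu bytes + %zu shared\n",
          sizeof (struct mem_with_attr_pointer),
          sizeof (struct mem_attrs_like));
  printf ("folded scheme MEM:  %zu bytes per MEM\n",
          sizeof (struct mem_with_folded_attrs));
  return 0;
}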
It would work for me.  Micha just brought up the easiest incremental
change though, which is

Index: gcc/emit-rtl.c
===================================================================
--- gcc/emit-rtl.c      (revision 207960)
+++ gcc/emit-rtl.c      (working copy)
@@ -304,14 +304,12 @@ set_mem_attrs (rtx mem, mem_attrs *attrs
       return;
     }
 
-  slot = htab_find_slot (mem_attrs_htab, attrs, INSERT);
-  if (*slot == 0)
+  if (!MEM_ATTRS (mem)
+      || memcmp (MEM_ATTRS (mem), attrs, sizeof (mem_attrs)) != 0)
     {
-      *slot = ggc_alloc_mem_attrs ();
-      memcpy (*slot, attrs, sizeof (mem_attrs));
+      MEM_ATTRS (mem) = ggc_alloc_mem_attrs ();
+      memcpy (MEM_ATTRS (mem), attrs, sizeof (mem_attrs));
     }
-
-  MEM_ATTRS (mem) = (mem_attrs *) *slot;
 }
 
 /* Returns a hash code for X (which is a really a reg_attrs *).  */

which would be equivalent to folding it into the MEM, but 1) not saving
the pointer, and 2) retaining the important copy_rtx sharing.

Richard.
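A standalone sketch (simplified types and a hypothetical helper name, not the
GCC sources) of the behaviour the patch above gives set_mem_attrs: a MEM only
gets a fresh mem_attrs allocation when its attributes actually change, and a
copy_rtx-style shallow copy of the MEM keeps sharing the existing record.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins for the real rtx and mem_attrs types.  */
struct mem_attrs { long alias, offset, size; };
struct mem_rtx   { struct mem_attrs *attrs; };

/* Hypothetical helper mirroring the patched set_mem_attrs: only allocate
   a fresh record when the attributes differ from what the MEM has.  */
static void
set_mem_attrs_sketch (struct mem_rtx *mem, const struct mem_attrs *attrs)
{
  if (!mem->attrs
      || memcmp (mem->attrs, attrs, sizeof (struct mem_attrs)) != 0)
    {
      mem->attrs = (struct mem_attrs *) malloc (sizeof (struct mem_attrs));
      memcpy (mem->attrs, attrs, sizeof (struct mem_attrs));
    }
}

int
main (void)
{
  struct mem_attrs a = { 1, 0, 8 };
  struct mem_rtx m1 = { NULL };

  set_mem_attrs_sketch (&m1, &a);     /* first set: allocates */

  struct mem_rtx m2 = m1;             /* copy_rtx-style shallow copy */
  printf ("shared after copy:  %d\n", m1.attrs == m2.attrs);  /* 1 */

  set_mem_attrs_sketch (&m2, &a);     /* identical attrs: nothing changes */
  printf ("still shared:       %d\n", m1.attrs == m2.attrs);  /* 1 */

  a.offset = 16;
  set_mem_attrs_sketch (&m2, &a);     /* changed attrs: m2 unshares */
  printf ("unshared on modify: %d\n", m1.attrs != m2.attrs);  /* 1 */
  return 0;
}

That is the sense in which the copy_rtx sharing is retained: a copied MEM
reuses the old record until different attributes are actually set on one of
the copies, at which point only that copy gets a new allocation.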