On Wed, Oct 2, 2024 at 9:54 AM Richard Biener
<richard.guent...@gmail.com> wrote:
>
> On Wed, Oct 2, 2024 at 9:13 AM Richard Biener
> <richard.guent...@gmail.com> wrote:
> >
> > On Tue, Oct 1, 2024 at 6:06 PM Richard Biener
> > <richard.guent...@gmail.com> wrote:
> > >
> > >
> > >
> > > > Am 01.10.2024 um 17:11 schrieb Matthias Kretz via Gcc <gcc@gcc.gnu.org>:
> > > >
> > > > Hi,
> > > >
> > > > the <experimental/simd> unit tests are my long-standing pain point of
> > > > excessive compiler memory usage and compile times. I've always worked 
> > > > around
> > > > the memory usage problem by splitting the test matrix into multiple
> > > > translations (with different -D flags) of the same source file. I.e. 
> > > > pay with
> > > > a huge number of compiler invocations to be able to compile at all. OOM 
> > > > kills
> > > > / thrashing isn't fun.
> > > >
> > > > Recently, the GNU Radio 4 implementation hit a similar issue of 
> > > > excessive
> > > > compiler memory usage and compile times. Worst case example I have 
> > > > tested (a
> > > > single TU on a Xeon @ 4.50 GHz, 64 GB RAM (no swapping while 
> > > > compiling)):
> > > >
> > > > GCC 15: 13m03s, 30.413 GB (checking enabled)
> > > > GCC 14: 12m03s, 15.248 GB
> > > > GCC 13: 11m40s, 14.862 GB
> > > > Clang 18: 8m10s, 10.811 GB
> > > >
> > > > That's supposed to be a unit test. But it's nothing one can use for 
> > > > test-
> > > > driven development, obviously. But how do mere mortals optimize code for
> > > > better compile times? -ftime-report is interesting but not really 
> > > > helpful. -Q
> > > > has interesting information, but the output format is unusable for C++ 
> > > > and
> > > > it's really hard to post-process.
> > > >
> > > > When compiler memory usage goes through the roof it's fairly obvious 
> > > > that
> > > > compile times have to suffer. So I was wondering whether there are any 
> > > > low-
> > > > hanging fruit to pick. I've managed to come up with a small torture 
> > > > test that
> > > > shows interesting behavior. I put it at 
> > > > https://github.com/mattkretz/template-torture-test. Simply do
> > > >
> > > > git clone https://github.com/mattkretz/template-torture-test
> > > > cd template-torture-test
> > > > make STRESS=7
> > > > make TORTURE=1 STRESS=5
> > > >
> > > > These numbers can already "kill" smaller machines. Be prepared to kill 
> > > > cc1plus
> > > > before things get out of hand.
> > > >
> > > > The bit I find interesting in this test is switched with the -D GO_FAST 
> > > > macro
> > > > (the 'all' target always compiles with and without GO_FAST). With the 
> > > > macro,
> > > > template arguments to 'Operand<typename...>' are tree-like and the 
> > > > resulting
> > > > type name is *longer*. But GGC usage is only at 442M. Without GO_FAST,
> > > > template arguments to 'Operand<typename...>' are a flat list. But GGC 
> > > > usage is
> > > > at 22890M. The latter variant needs 24x longer to compile.
> > > >
> > > > Are long flat template argument/parameter lists a special problem? Why 
> > > > does it
> > > > make overload resolution *so much more* expensive?
> > > >
> > > > Beyond that torture test (should I turn it into a PR?), what can I do 
> > > > to help?
> > >
> > > Analyze where the compile time is spent and where memory is spent.  
> > > Identify unfitting data structures and algorithms causing the issue.  
> > > Replace with better ones.  That’s what I do for these kind of issues in 
> > > the middle end.
> >
> > So seeing
> >
> >  overload resolution               :  42.89 ( 67%)   1.41 ( 44%)
> > 44.31 ( 66%) 18278M ( 80%)
> >  template instantiation             :  47.25 ( 73%)   1.66 ( 51%)
> > 48.95 ( 72%) 22326M ( 97%)
> >
> > it seems obvious that you are using an excessive number of template
> > instantiations and
> > compilers are not prepared to make those "lean".  perf shows (GCC 14.2
> > release build)
> >
> > Samples: 261K of event 'cycles:Pu', Event count (approx.):
> > 315948118358
> > Overhead       Samples  Command  Shared Object
> >  Symbol
> >   26.96%         69216  cc1plus  cc1plus
> >  [.] iterative_hash
> >    7.66%         19389  cc1plus  cc1plus
> >  [.] _Z12ggc_set_markPKv
> >    5.34%         13719  cc1plus  cc1plus
> >  [.] _Z27iterative_hash_template_argP9tree_nodej
> >    5.11%         13205  cc1plus  cc1plus
> >  [.] _Z24variably_modified_type_pP9tree_nodeS0_
> >    4.60%         11901  cc1plus  cc1plus
> >  [.] _Z13cp_type_qualsPK9tree_node
> >    4.14%         10733  cc1plus  cc1plus
> >  [.] _ZL5unifyP9tree_nodeS0_S0_S0_ib
> >
> > where the excessive use of iterative_hash_object makes it slower than
> > necessary.  I can only guess but
> > replacing
> >
> >   val = iterative_hash_object (code, val);
> >
> > with using iterative_hash_hashval_t or iterative_hash_host_wide_int
> > might help a lot.  Likewise:
> >
> >     case IDENTIFIER_NODE:
> >       return iterative_hash_object (IDENTIFIER_HASH_VALUE (arg), val);
> >
> > with iterative_hash_hashval_t.  Using inchash for the whole API might
> > help as well.
>
> Fixing the above results in the following, I'll test & submit a patch.
>
> Samples: 283K of event 'cycles:Pu', Event count (approx.):
> 318742588396
> Overhead       Samples  Command  Shared Object         Symbol
>   13.92%         39577  cc1plus  cc1plus               [.]
> _Z27iterative_hash_template_argP9tree_nodej
>   10.73%         29883  cc1plus  cc1plus               [.] _Z12ggc_set_markPKv
>   10.11%         28811  cc1plus  cc1plus               [.] iterative_hash
>    5.33%         15254  cc1plus  cc1plus               [.]
> _Z13cp_type_qualsPK9tree_node

I noticed that almost half of all template hashing is from expansion of the
hash tables.  Enlarging spec_entry to store the hash value reduces this
to half, but it seems module support is somewhat inter-twisted with this,
so I have no good patch but only a prototype here.

The coerce_template_params function is indeed the biggest offender
as to make_tree_vec calls (which in turn is called by
lookup_template_class most).

Richard.

Reply via email to