On Wed, Oct 2, 2024 at 9:54 AM Richard Biener <richard.guent...@gmail.com> wrote:
>
> On Wed, Oct 2, 2024 at 9:13 AM Richard Biener
> <richard.guent...@gmail.com> wrote:
> >
> > On Tue, Oct 1, 2024 at 6:06 PM Richard Biener
> > <richard.guent...@gmail.com> wrote:
> > >
> > > On 01.10.2024 at 17:11, Matthias Kretz via Gcc <gcc@gcc.gnu.org> wrote:
> > > >
> > > > Hi,
> > > >
> > > > the <experimental/simd> unit tests are my long-standing pain point of
> > > > excessive compiler memory usage and compile times. I've always worked
> > > > around the memory usage problem by splitting the test matrix into
> > > > multiple translation units (with different -D flags) built from the
> > > > same source file, i.e. paying with a huge number of compiler
> > > > invocations to be able to compile at all. OOM kills / thrashing
> > > > aren't fun.
> > > >
> > > > Recently, the GNU Radio 4 implementation hit a similar issue of
> > > > excessive compiler memory usage and compile times. The worst-case
> > > > example I have tested (a single TU on a Xeon @ 4.50 GHz with 64 GB
> > > > RAM; no swapping while compiling):
> > > >
> > > > GCC 15:   13m03s, 30.413 GB (checking enabled)
> > > > GCC 14:   12m03s, 15.248 GB
> > > > GCC 13:   11m40s, 14.862 GB
> > > > Clang 18:  8m10s, 10.811 GB
> > > >
> > > > That's supposed to be a unit test, but it's obviously nothing one can
> > > > use for test-driven development. How do mere mortals optimize code
> > > > for better compile times? -ftime-report is interesting but not really
> > > > helpful. -Q has interesting information, but its output format is
> > > > unusable for C++ and really hard to post-process.
> > > >
> > > > When compiler memory usage goes through the roof, it's fairly obvious
> > > > that compile times have to suffer. So I was wondering whether there
> > > > is any low-hanging fruit to pick. I've managed to come up with a
> > > > small torture test that shows interesting behavior. I put it at
> > > > https://github.com/mattkretz/template-torture-test. Simply do
> > > >
> > > >   git clone https://github.com/mattkretz/template-torture-test
> > > >   cd template-torture-test
> > > >   make STRESS=7
> > > >   make TORTURE=1 STRESS=5
> > > >
> > > > These numbers can already "kill" smaller machines. Be prepared to
> > > > kill cc1plus before things get out of hand.
> > > >
> > > > The bit I find interesting in this test is switched with the -D
> > > > GO_FAST macro (the 'all' target always compiles with and without
> > > > GO_FAST). With the macro, template arguments to
> > > > 'Operand<typename...>' are tree-like and the resulting type name is
> > > > *longer*, yet GGC usage is only at 442M. Without GO_FAST, template
> > > > arguments to 'Operand<typename...>' are a flat list, and GGC usage is
> > > > at 22890M. The latter variant needs 24x longer to compile.
> > > >
> > > > Are long flat template argument/parameter lists a special problem?
> > > > Why do they make overload resolution *so much more* expensive?
> > > >
> > > > Beyond that torture test (should I turn it into a PR?), what can I do
> > > > to help?
> > >
> > > Analyze where the compile time is spent and where memory is spent.
> > > Identify unfitting data structures and algorithms causing the issue,
> > > and replace them with better ones. That's what I do for this kind of
> > > issue in the middle end.
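To make the flat-vs-tree distinction concrete, here is a minimal sketch
of the two argument-list shapes (made-up types; the actual test code is
in the repository linked above):

  // Sketch only -- 'Operand' stands in for the torture test's template.
  template <typename... Ts> struct Operand {};

  // Without GO_FAST: one long, flat template argument list.
  using Flat = Operand<int, int, int, int, int, int, int, int>;

  // With GO_FAST: nested Operands.  The printed type name is longer,
  // but each individual template argument list stays short.
  using Tree = Operand<Operand<Operand<int, int>, Operand<int, int>>,
                       Operand<Operand<int, int>, Operand<int, int>>>;

Every lookup of such a specialization has to hash and compare its full
argument vector, which is plausibly why the flat shape shows up so
heavily in the hashing profile below.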
> >
> > So seeing (-ftime-report; the columns are usr, sys, wall and ggc)
> >
> >   overload resolution    : 42.89 ( 67%)  1.41 ( 44%)  44.31 ( 66%)  18278M ( 80%)
> >   template instantiation : 47.25 ( 73%)  1.66 ( 51%)  48.95 ( 72%)  22326M ( 97%)
> >
> > it seems obvious that you are using an excessive number of template
> > instantiations and compilers are not prepared to make those "lean".
> > perf shows (GCC 14.2 release build)
> >
> >   Samples: 261K of event 'cycles:Pu', Event count (approx.): 315948118358
> >   Overhead  Samples  Command  Shared Object  Symbol
> >     26.96%    69216  cc1plus  cc1plus  [.] iterative_hash
> >      7.66%    19389  cc1plus  cc1plus  [.] _Z12ggc_set_markPKv
> >      5.34%    13719  cc1plus  cc1plus  [.] _Z27iterative_hash_template_argP9tree_nodej
> >      5.11%    13205  cc1plus  cc1plus  [.] _Z24variably_modified_type_pP9tree_nodeS0_
> >      4.60%    11901  cc1plus  cc1plus  [.] _Z13cp_type_qualsPK9tree_node
> >      4.14%    10733  cc1plus  cc1plus  [.] _ZL5unifyP9tree_nodeS0_S0_S0_ib
> >
> > where the excessive use of iterative_hash_object makes it slower than
> > necessary. I can only guess, but replacing
> >
> >   val = iterative_hash_object (code, val);
> >
> > with iterative_hash_hashval_t or iterative_hash_host_wide_int might
> > help a lot. Likewise:
> >
> >   case IDENTIFIER_NODE:
> >     return iterative_hash_object (IDENTIFIER_HASH_VALUE (arg), val);
> >
> > with iterative_hash_hashval_t. Using inchash for the whole API might
> > help as well.
>
> Fixing the above results in the following; I'll test & submit a patch.
>
>   Samples: 283K of event 'cycles:Pu', Event count (approx.): 318742588396
>   Overhead  Samples  Command  Shared Object  Symbol
>     13.92%    39577  cc1plus  cc1plus  [.] _Z27iterative_hash_template_argP9tree_nodej
>     10.73%    29883  cc1plus  cc1plus  [.] _Z12ggc_set_markPKv
>     10.11%    28811  cc1plus  cc1plus  [.] iterative_hash
>      5.33%    15254  cc1plus  cc1plus  [.] _Z13cp_type_qualsPK9tree_node
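Concretely, the suggestion amounts to something like the following
sketch against iterative_hash_template_arg in gcc/cp/pt.cc; the actual
patch may differ. iterative_hash_object expands to a byte-wise
iterative_hash call over the object, while iterative_hash_hashval_t
from tree.h is a cheap inline mix:

  -  val = iterative_hash_object (code, val);
  +  /* CODE is a small enum value; mix it in directly instead of
  +     hashing its bytes through the generic Jenkins hash.  */
  +  val = iterative_hash_hashval_t (code, val);

     case IDENTIFIER_NODE:
  -    return iterative_hash_object (IDENTIFIER_HASH_VALUE (arg), val);
  +    /* The identifier's hash is already computed; just mix it in.  */
  +    return iterative_hash_hashval_t (IDENTIFIER_HASH_VALUE (arg), val);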
I noticed that almost half of all template hashing is from expansion of
the hash tables. Enlarging spec_entry to store the hash value cuts this
in half, but module support is somewhat intertwined with this, so I
have no good patch here, only a prototype. The coerce_template_parms
function is indeed the biggest source of make_tree_vec calls (and it is
in turn called mostly from lookup_template_class).

Richard.
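P.S.: For concreteness, the direction of that prototype is roughly the
following sketch. The field name and details here are guesses, not the
actual prototype, and as said above the real change has to cope with
how modules handle these entries:

  /* gcc/cp/pt.cc -- sketch, not the actual prototype.  */
  struct GTY((for_user)) spec_entry
  {
    tree tmpl;
    tree args;
    tree spec;
    hashval_t hash;  /* hash of (tmpl, args), computed once on insert */
  };

  hashval_t
  spec_hasher::hash (spec_entry *e)
  {
    /* Return the cached value so that hash-table expansion does not
       rehash every entry's template argument list.  */
    return e->hash;
  }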