On Tue, Oct 1, 2024 at 6:06 PM Richard Biener <richard.guent...@gmail.com> wrote:
>
>
> > Am 01.10.2024 um 17:11 schrieb Matthias Kretz via Gcc <gcc@gcc.gnu.org>:
> >
> > Hi,
> >
> > the <experimental/simd> unit tests are my long-standing pain point of
> > excessive compiler memory usage and compile times. I've always worked
> > around the memory usage problem by splitting the test matrix into
> > multiple translations (with different -D flags) of the same source
> > file, i.e. paying with a huge number of compiler invocations to be
> > able to compile at all. OOM kills / thrashing aren't fun.
> >
> > Recently, the GNU Radio 4 implementation hit a similar issue of
> > excessive compiler memory usage and compile times. The worst case I
> > have tested (a single TU on a Xeon @ 4.50 GHz, 64 GB RAM, no swapping
> > while compiling):
> >
> >   GCC 15:    13m03s, 30.413 GB (checking enabled)
> >   GCC 14:    12m03s, 15.248 GB
> >   GCC 13:    11m40s, 14.862 GB
> >   Clang 18:   8m10s, 10.811 GB
> >
> > That's supposed to be a unit test. It's nothing one can use for
> > test-driven development, obviously. But how do mere mortals optimize
> > code for better compile times? -ftime-report is interesting but not
> > really helpful. -Q has interesting information, but the output format
> > is unusable for C++ and really hard to post-process.
> >
> > When compiler memory usage goes through the roof, it's fairly obvious
> > that compile times have to suffer. So I was wondering whether there is
> > any low-hanging fruit to pick. I've managed to come up with a small
> > torture test that shows interesting behavior. I put it at
> > https://github.com/mattkretz/template-torture-test. Simply do
> >
> >   git clone https://github.com/mattkretz/template-torture-test
> >   cd template-torture-test
> >   make STRESS=7
> >   make TORTURE=1 STRESS=5
> >
> > These numbers can already "kill" smaller machines. Be prepared to kill
> > cc1plus before things get out of hand.
> >
> > The bit I find interesting in this test is switched with the -D GO_FAST
> > macro (the 'all' target always compiles with and without GO_FAST).
> > With the macro, template arguments to 'Operand<typename...>' are
> > tree-like and the resulting type name is *longer*, yet GGC usage is
> > only 442M. Without GO_FAST, template arguments to
> > 'Operand<typename...>' are a flat list, and GGC usage is 22890M. The
> > latter variant also needs 24x longer to compile.
> >
> > Are long, flat template argument/parameter lists a special problem?
> > Why do they make overload resolution *so much more* expensive?
> >
> > Beyond that torture test (should I turn it into a PR?), what can I do
> > to help?
>
> Analyze where the compile time is spent and where memory is spent.
> Identify unfitting data structures and algorithms causing the issue.
> Replace them with better ones. That's what I do for these kinds of
> issues in the middle end.
So seeing

 overload resolution           :  42.89 ( 67%)  1.41 ( 44%)  44.31 ( 66%)  18278M ( 80%)
 template instantiation        :  47.25 ( 73%)  1.66 ( 51%)  48.95 ( 72%)  22326M ( 97%)

it seems obvious that you are using an excessive number of template
instantiations and that compilers are not prepared to make those "lean".

perf shows (GCC 14.2 release build):

Samples: 261K of event 'cycles:Pu', Event count (approx.): 315948118358
Overhead       Samples  Command  Shared Object  Symbol
  26.96%         69216  cc1plus  cc1plus        [.] iterative_hash
   7.66%         19389  cc1plus  cc1plus        [.] _Z12ggc_set_markPKv
   5.34%         13719  cc1plus  cc1plus        [.] _Z27iterative_hash_template_argP9tree_nodej
   5.11%         13205  cc1plus  cc1plus        [.] _Z24variably_modified_type_pP9tree_nodeS0_
   4.60%         11901  cc1plus  cc1plus        [.] _Z13cp_type_qualsPK9tree_node
   4.14%         10733  cc1plus  cc1plus        [.] _ZL5unifyP9tree_nodeS0_S0_S0_ib

where the excessive use of iterative_hash_object makes it slower than
necessary. I can only guess, but replacing

  val = iterative_hash_object (code, val);

with iterative_hash_hashval_t or iterative_hash_host_wide_int might help a
lot. Likewise replacing

      case IDENTIFIER_NODE:
        return iterative_hash_object (IDENTIFIER_HASH_VALUE (arg), val);

with iterative_hash_hashval_t. Using inchash for the whole API might help
as well.

This won't improve memory use, of course; making template instantiations
"leaner" likely would help (maybe somehow allow on-demand copying of
sub-structures?).

Richard.

> Richard
>
> > Thanks,
> > Matthias
> >
> > --
> > ──────────────────────────────────────────────────────────────────────────
> > Dr. Matthias Kretz                           https://mattkretz.github.io
> > GSI Helmholtz Center for Heavy Ion Research               https://gsi.de
> > std::simd
> > ──────────────────────────────────────────────────────────────────────────
> > <signature.asc>