On Sunday March 17 2024 03:44:00 Sergey Fedorov wrote:
> but if libc++ 5 was maybe still a bit faster overall than libstdc++ the
> situation is now rather reversed though differences remain small
I take that back, the differences aren't always small! I realised that the so-called "native" benchmark from the libcxx source tree could be used with the libstdc++ from (currently) port:libgcc13. It took some time to figure out how to inject the `-stdlib=macports-libstdc++` argument properly, but once I got that working on Linux it transferred without further ado to Mac (a sketch of what the injection boils down to is appended at the end of this message).

These results just in. libc++ and all the benchmarking code were built with `clang++-mp-12 -O3 -march=native -flto`. libstdc++ is indeed consistently faster, usually not by much (though the differences in kernel CPU time can be relatively important):

```
> build/libcxx/benchmarks/algorithms.partition_point.libcxx.out
Run on (4 X 2700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 1.13, 1.27, 1.34
<snip>
89.198 user_cpu 0.907 kernel_cpu 1:30.11 total_time 99.9%CPU {93360128M 0F 226542R 0I 0O 0k 0w 445c}

> build/libcxx/benchmarks/algorithms.partition_point.native.out
Run on (4 X 2700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 1.42, 1.35, 1.34
<snip>
75.612 user_cpu 0.911 kernel_cpu 1:16.53 total_time 99.9%CPU {102424576M 0F 229228R 0I 0O 0k 0w 504c}
```

```
> build/libcxx/benchmarks/algorithms.libcxx.out --benchmark_repetitions=1 --benchmark_filter='_262144$'
Run on (4 X 2700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 1.63, 1.75, 1.70
<snip>
220.386 user_cpu 2.961 kernel_cpu 3:43.50 total_time 99.9%CPU {79626240M 0F 542726R 0I 9O 0k 154w 3502c}

> build/libcxx/benchmarks/algorithms.native.out --benchmark_repetitions=1 --benchmark_filter='_262144$'
Run on (4 X 2700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 1.66, 1.79, 1.67
<snip>
190.844 user_cpu 2.615 kernel_cpu 3:13.50 total_time 99.9%CPU {89800704M 0F 504812R 0I 9O 0k 149w 1942c}
```

But observe this "small" variant of the above benchmark (as far as I understand it: the same algorithms, run on single-element inputs), which seems to highlight a huge overhead in libc++:

```
> build/libcxx/benchmarks/algorithms.libcxx.out --benchmark_repetitions=1 --benchmark_filter='_1$'
Run on (4 X 2700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 1.61, 1.37, 1.36
<snip>
1947.681 user_cpu 221.754 kernel_cpu 36:10.22 total_time 99.9%CPU {78262272M 0F 1503675R 0I 9O 0k 152w 22829c}

> build/libcxx/benchmarks/algorithms.native.out --benchmark_repetitions=1 --benchmark_filter='_1$'
Run on (4 X 2700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 1.63, 1.42, 1.36
<snip>
1056.593 user_cpu 8.435 kernel_cpu 17:45.51 total_time 99.9%CPU {78917632M 0F 1458187R 0I 9O 0k 154w 12805c}
```

Here the library from the "bloated" GCC is twice as fast overall, and uses almost 30x less kernel CPU time! I see the same on Linux.
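In case the filters above look cryptic: the trailing number in each google-benchmark name is the input size, so `'_262144$'` selects only the 256k-element cases and `'_1$'` only the single-element ones. The stock google-benchmark flags can be used to double-check which cases a filter picks up, without running anything:

```
# --benchmark_list_tests prints the names a filter matches instead of
# running them (a standard google-benchmark flag).
> build/libcxx/benchmarks/algorithms.libcxx.out --benchmark_list_tests --benchmark_filter='_1$'
```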
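For reference, the flag injection mentioned at the top boils down to something like the command below. Take it as a sketch with placeholder paths rather than the exact command the build system runs (IIRC the CMake knob that enables the "native" benchmark flavour is LIBCXX_BENCHMARK_NATIVE_STDLIB, but I'd have to double-check):

```
# Sketch only -- the google-benchmark paths are placeholders.
# MacPorts clang accepts -stdlib=macports-libstdc++, which compiles and
# links against port:libgcc13's libstdc++ instead of libc++.
clang++-mp-12 -O3 -march=native -flto \
    -stdlib=macports-libstdc++ \
    -isystem /path/to/google-benchmark/include \
    libcxx/benchmarks/algorithms.partition_point.bench.cpp \
    -L/path/to/google-benchmark/lib -lbenchmark \
    -o build/libcxx/benchmarks/algorithms.partition_point.native.out
```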
All this makes me wonder whether I shouldn't try building llvm+clang against macports-libstdc++. I have already managed to do so with lld-17 (it only depends on libc++ via libxml2, and turns out to be "safe to mingle"). Newer clang versions build against their own libc++ even on Linux (when building with clang), which suggests the code has been designed to keep the possibly two C++ runtimes that get linked separate.

It would probably be impossible to use the resulting libLLVM or libclang in dependent ports, but the performance increase might be worth it.

R