On Sunday March 17 2024 03:44:00 Sergey Fedorov wrote:

> > but if libc++ 5 was maybe still a bit faster overall than libstdc++ the
> > situation is now rather reversed though differences remain small

I take that back, the differences aren't always small!

I realised that the so-called "native" benchmark from the libcxx source tree 
could be used with the libstdc++ from (currently) port:libgcc13. It took some 
time to figure out how to inject the `-stdlib=macports-libstdc++` argument 
properly, but once I got that working on Linux it transferred without further 
ado to the Mac.
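For illustration, a hypothetical sketch of what building one benchmark twice, once per runtime, could look like (this is NOT my exact recipe; the source file name comes from the libcxx tree, the library name assumes a system-installed Google Benchmark, and the plumbing into the libcxx CMake build is left out):

```shell
# libc++ variant: clang's default runtime
clang++-mp-12 -O3 -march=native -flto \
    algorithms.partition_point.bench.cpp -lbenchmark \
    -o algorithms.partition_point.libcxx.out
# "native" variant: redirected to MacPorts' libstdc++
clang++-mp-12 -O3 -march=native -flto -stdlib=macports-libstdc++ \
    algorithms.partition_point.bench.cpp -lbenchmark \
    -o algorithms.partition_point.native.out
```

The point is simply that both binaries are built from the same translation unit with identical optimisation flags, so only the standard library implementation differs.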

These results just in. Libc++ and all benchmarking code built with 
`clang++-mp-12 -O3 -march=native -flto`.

libstdc++ is indeed consistently faster. Usually not by much (though the 
differences in kernel CPU time can be significant):

```
> build/libcxx/benchmarks/algorithms.partition_point.libcxx.out
Run on (4 X 2700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 1.13, 1.27, 1.34
<snip>
89.198 user_cpu 0.907 kernel_cpu 1:30.11 total_time 99.9%CPU {93360128M 0F 
226542R 0I 0O 0k 0w 445c}
> /build/libcxx/benchmarks/algorithms.partition_point.native.out
Run on (4 X 2700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 1.42, 1.35, 1.34
<snip>
75.612 user_cpu 0.911 kernel_cpu 1:16.53 total_time 99.9%CPU {102424576M 0F 
229228R 0I 0O 0k 0w 504c}
```

```
> build/libcxx/benchmarks/algorithms.libcxx.out --benchmark_repetitions=1 
> --benchmark_filter='_262144$'
Run on (4 X 2700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 1.63, 1.75, 1.70
<snip>
220.386 user_cpu 2.961 kernel_cpu 3:43.50 total_time 99.9%CPU {79626240M 0F 
542726R 0I 9O 0k 154w 3502c}
> build/libcxx/benchmarks/algorithms.native.out --benchmark_repetitions=1 
> --benchmark_filter='_262144$'
Run on (4 X 2700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 1.66, 1.79, 1.67
<snip>
190.844 user_cpu 2.615 kernel_cpu 3:13.50 total_time 99.9%CPU {89800704M 0F 
504812R 0I 9O 0k 149w 1942c}
```

But observe this: what I understand to be a "small"-input version of the above 
benchmark, which seems to highlight a huge per-call overhead in libc++:

```
> build/libcxx/benchmarks/algorithms.libcxx.out --benchmark_repetitions=1 
> --benchmark_filter='_1$'
Run on (4 X 2700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 1.61, 1.37, 1.36
<snip>
1947.681 user_cpu 221.754 kernel_cpu 36:10.22 total_time 99.9%CPU {78262272M 0F 
1503675R 0I 9O 0k 152w 22829c}
> build/libcxx/benchmarks/algorithms.native.out --benchmark_repetitions=1 
> --benchmark_filter='_1$'
Run on (4 X 2700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 1.63, 1.42, 1.36
<snip>
1056.593 user_cpu 8.435 kernel_cpu 17:45.51 total_time 99.9%CPU {78917632M 0F 
1458187R 0I 9O 0k 154w 12805c}
```

Here the library from the "bloated" GCC is twice as fast overall, and uses 
about 26x less kernel CPU time!

I see the same on Linux.

This makes me wonder if I shouldn't try building llvm+clang against 
macports-libstdc++. I have already managed to do so with lld-17 (which depends 
on libc++ only via libxml2, and turns out to be "safe to mingle"). Newer clang 
versions build against their own libc++ even on Linux (when built with clang), 
which suggests the code has been designed to keep the two C++ runtimes that 
may end up linked in separate. It would probably be impossible to use the 
resulting libLLVM or libclang in dependent ports, but the performance 
increase might be worth it.

R
