https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #5 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
You raise valid points (i.e. it would be good to understand why preallocation
is not beneficial, or what's causing the performance gap w.r.t. malloc), but
looking at the cache-misses counter does not make sense here (perf is not
explicit about that, but it counts misses in L3, and as you see the count is
three orders of magnitude lower than that of cycles and instructions, so it's
not the main factor in the overall performance picture).

As for the comparison against Rust, it spreads more work over available cores:
you can see that its "user time" is higher, though "wall-clock time" is the
same or lower. In other words, the C++ variant does not achieve good multicore
scaling.

The main gotcha here is that m_b_r does not allocate on construction, but
rather allocates 2x the preallocation size on the first call to 'allocate',
and then deallocates when 'release' is called. So it repeatedly calls
malloc/free in the inner benchmark loop, whereas your custom allocator
allocates on construction and deallocates on destruction, avoiding repeated
malloc/free calls in the loop and the associated lock contention when
multithreaded.

(Also, obviously, it simply does more work in 'allocate', which costs extra
cycles.)