https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942

Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #5 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
You raise valid points (i.e. it would be good to understand why preallocation
is not beneficial, or what's causing the performance gap w.r.t. malloc), but
looking at the cache-misses counter does not make sense here (perf is not
explicit about that, but it counts misses in L3, and as you can see the count
is three orders of magnitude lower than that of cycles and instructions, so
it's not the main factor in the overall performance picture).

As for the comparison against Rust, it spreads more work over available cores:
you can see that its "user time" is higher, though "wall-clock time" is the
same or lower. In other words, the C++ variant does not achieve good multicore
scaling.

The main gotcha here is that m_b_r does not allocate on construction, but
rather allocates 2x of the preallocation size on the first call to 'allocate',
and then deallocates when 'release' is called. So it repeatedly calls
malloc/free in the inner benchmark loop, whereas your custom allocator
allocates on construction and deallocates on destruction, avoiding repeated
malloc/free calls in the loop and the associated lock contention when
multithreaded.

(also obviously it simply does more work in 'allocate', which costs extra
cycles)
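
To make the allocate/release pattern visible, here is a minimal sketch. It
assumes m_b_r means std::pmr::monotonic_buffer_resource; counting_resource and
both helper functions are hypothetical names made up for illustration, and the
sizes (1024, 64) are arbitrary rather than taken from the benchmark. It counts
how often the upstream resource is hit when the pool is built from a size hint
only, versus when it is handed a caller-owned buffer up front.

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>

// Hypothetical helper: forwards to new_delete_resource() and counts calls.
class counting_resource : public std::pmr::memory_resource {
public:
    std::size_t allocations = 0;
    std::size_t deallocations = 0;

private:
    void* do_allocate(std::size_t bytes, std::size_t align) override {
        ++allocations;
        return std::pmr::new_delete_resource()->allocate(bytes, align);
    }
    void do_deallocate(void* p, std::size_t bytes, std::size_t align) override {
        ++deallocations;
        std::pmr::new_delete_resource()->deallocate(p, bytes, align);
    }
    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
        return this == &other;
    }
};

// Size hint only: the pool owns no memory until the first allocate(), and
// release() hands everything back upstream, so each round goes upstream again.
std::size_t upstream_allocs_with_size_hint(int iterations) {
    counting_resource upstream;
    std::pmr::monotonic_buffer_resource pool(1024, &upstream);
    assert(upstream.allocations == 0);  // nothing allocated on construction
    for (int i = 0; i < iterations; ++i) {
        void* p = pool.allocate(64);
        (void)p;
        pool.release();
    }
    assert(upstream.allocations == upstream.deallocations);
    return upstream.allocations;
}

// Caller-owned buffer: release() rewinds to the initial buffer, so small
// allocations in the loop never reach the upstream resource.
std::size_t upstream_allocs_with_buffer(int iterations) {
    counting_resource upstream;
    alignas(std::max_align_t) std::byte buffer[1024];
    std::pmr::monotonic_buffer_resource pool(buffer, sizeof buffer, &upstream);
    for (int i = 0; i < iterations; ++i) {
        void* p = pool.allocate(64);
        (void)p;
        pool.release();
    }
    return upstream.allocations;
}
```

Calling both helpers with the same iteration count and comparing the returned
values should show the first variant going upstream once per iteration and the
second not at all, which is roughly the difference between m_b_r and an
allocator that grabs its memory once on construction.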
