On Mon, Aug 28, 2017 at 3:05 PM, Emilio G. Cota <c...@braap.org> wrote:
> On Sun, Aug 27, 2017 at 23:53:25 -0400, Pranith Kumar wrote:
>> Using heaptrack, I found that quite a few of our temporary allocations
>> are coming from allocating work items. Instead of doing this
>> continuously, we can cache the allocated items and reuse them instead
>> of freeing them.
>>
>> This reduces the number of allocations by 25% (200000 -> 150000 for
>> the ARM64 boot+shutdown test).
>>
>
> But what is the perf difference, if any?
>
> Adding a lock (or a cmpxchg) here is not a great idea. However, this is
> not yet immediately obvious because of other scalability bottlenecks.
> (If you boot many arm64 cores you'll see most of the time is spent
> idling on the BQL, see
> https://lists.gnu.org/archive/html/qemu-devel/2017-08/msg05207.html )
>
> You're most likely better off using glib's slices, see
> https://developer.gnome.org/glib/stable/glib-Memory-Slices.html
> These slices use per-thread lists, so scalability should be OK.
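For reference, here is a minimal sketch of what allocating the work items
from GLib slices could look like. The WorkItem struct and the
work_item_new()/work_item_free() helpers below are made-up stand-ins for
illustration, not QEMU's actual struct qemu_work_item code:

/*
 * Rough sketch only: a simplified work item allocated from GLib slices
 * instead of g_malloc()/g_free().  GSlice keeps per-thread magazines,
 * so the common alloc/free path needs no global lock or cmpxchg.
 */
#include <glib.h>

typedef struct WorkItem {
    void (*func)(void *data);   /* callback to run on the target vCPU */
    void *data;                 /* opaque argument for the callback   */
    gboolean done;              /* set once the callback has run      */
} WorkItem;

static WorkItem *work_item_new(void (*func)(void *), void *data)
{
    WorkItem *wi = g_slice_new0(WorkItem);  /* zero-initialized slice */

    wi->func = func;
    wi->data = data;
    return wi;
}

static void work_item_free(WorkItem *wi)
{
    /* Returns the block to the calling thread's slice magazine. */
    g_slice_free(WorkItem, wi);
}
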
I think we should modify our g_malloc() to internally use this. Seems
like an idea worth trying out.

> I also suggest profiling with either or both of jemalloc/tcmalloc
> (build with --enable-jemalloc/tcmalloc) in addition to using glibc's
> allocator, and then based on perf numbers decide whether this is
> something worth optimizing.

OK, I will try to get some perf numbers.

--
Pranith