Hi,

I just tried to use the slab allocator for a case where aset.c was bloating memory usage substantially. First: it worked wonders for memory usage, nearly eliminating overhead.
But it turned out to cause a *substantial* slowdown. With aset the allocator is barely in the profile; with slab the profile is dominated by allocator performance.

slab:

NOTICE:  00000: 100000000 ordered insertions in 5.216287 seconds, 19170724/sec
LOCATION:  bfm_test_insert_bulk, radix.c:2880

  Overhead  Command   Shared Object     Symbol
+   28.27%  postgres  postgres          [.] SlabAlloc
+    9.64%  postgres  bdbench.so        [.] bfm_delete
+    9.03%  postgres  bdbench.so        [.] bfm_set
+    8.39%  postgres  bdbench.so        [.] bfm_lookup
+    8.36%  postgres  bdbench.so        [.] bfm_set_leaf.constprop.0
+    8.16%  postgres  libc-2.31.so      [.] __memmove_avx_unaligned_erms
+    6.88%  postgres  bdbench.so        [.] bfm_delete_leaf
+    3.24%  postgres  libc-2.31.so      [.] _int_malloc
+    2.58%  postgres  bdbench.so        [.] bfm_tests
+    2.33%  postgres  postgres          [.] SlabFree
+    1.29%  postgres  libc-2.31.so      [.] _int_free
+    1.09%  postgres  libc-2.31.so      [.] unlink_chunk.constprop.0

aset:

NOTICE:  00000: 100000000 ordered insertions in 2.082602 seconds, 48016848/sec
LOCATION:  bfm_test_insert_bulk, radix.c:2880

+   16.43%  postgres  bdbench.so        [.] bfm_lookup
+   15.38%  postgres  bdbench.so        [.] bfm_delete
+   12.82%  postgres  libc-2.31.so      [.] __memmove_avx_unaligned_erms
+   12.65%  postgres  bdbench.so        [.] bfm_set
+   12.15%  postgres  bdbench.so        [.] bfm_set_leaf.constprop.0
+   10.57%  postgres  bdbench.so        [.] bfm_delete_leaf
+    4.05%  postgres  bdbench.so        [.] bfm_tests
+    2.93%  postgres  [kernel.vmlinux]  [k] clear_page_erms
+    1.59%  postgres  postgres          [.] AllocSetAlloc
+    1.15%  postgres  bdbench.so        [.] memmove@plt
+    1.06%  postgres  bdbench.so        [.] bfm_grow_leaf_16

OS:

NOTICE:  00000: 100000000 ordered insertions in 2.089790 seconds, 47851690/sec
LOCATION:  bfm_test_insert_bulk, radix.c:2880

That is somewhat surprising - part of the promise of a slab allocator is that it's fast...

This is caused by multiple issues, I think. Some of them seem fairly easy to fix.

1) If allocations are short-lived, slab.c can end up constantly freeing and re-initializing blocks. Initializing a block requires fairly expensively iterating over all potential chunks in the block to set them up, just to then free that memory again after a small number of allocations. The extreme case of this is when there are phases of alloc/free of a single allocation.

I "fixed" this by adding a few "&& slab->nblocks > 1" checks in SlabFree(), and the problem was vastly reduced: instead of a 0.4x slowdown it's 0.88x. Of course that only helps when the churn involves the only block, so it's not a great approach. Perhaps just keeping the last allocated block around would work?

2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division, taking up ~50% of the cycles in SlabFree(). A 64bit div, according to [1], has a latency of 35-88 cycles on skylake-x (and a reciprocal throughput of 21-83, i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, it's still slow enough there to be very worrisome.

I don't see a way to get around the division while keeping the freelist structure as is. But ISTM that we only need the index because of the free-chunk list, right? Why don't we make the chunk list use actual pointers? Is the concern that that'd increase the minimum allocation size? If so, I see two ways around that: First, we could make the index just the offset from the start of the block, which is much cheaper to calculate. Second, we could store the next pointer in SlabChunk->slab/block instead (making it a union) - while a chunk is on the freelist we don't need to dereference those, right? I suspect both would also make the block initialization a bit cheaper.
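The union variant would boil down to something like the following (rough, untested sketch; SlabBlockPushFree()/SlabBlockPopFree() and the "freehead" field replacing firstFreeChunk are made-up names, just to illustrate the point):

/*
 * While a chunk sits on a block's freelist its block/slab back-pointers
 * aren't needed, so reuse that space for an intrusive "next free" pointer.
 */
typedef struct SlabChunk
{
	union
	{
		struct
		{
			SlabBlock  *block;	/* owning block, valid while allocated */
			SlabContext *slab;	/* owning context, valid while allocated */
		}			live;
		struct SlabChunk *nextfree; /* next free chunk, valid while free */
	}			u;
} SlabChunk;

/* freeing: push the chunk onto the block's freelist, no division needed */
static inline void
SlabBlockPushFree(SlabBlock *block, SlabChunk *chunk)
{
	chunk->u.nextfree = block->freehead;
	block->freehead = chunk;
	block->nfree++;
}

/* allocating: pop the head of the block's freelist */
static inline SlabChunk *
SlabBlockPopFree(SlabBlock *block)
{
	SlabChunk  *chunk = block->freehead;

	Assert(chunk != NULL);
	block->freehead = chunk->u.nextfree;
	block->nfree--;
	return chunk;
}

Block initialization would then just need to chain the chunks together via nextfree, instead of writing index values into every chunk's data area.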
Either variant should also accelerate SlabBlockGetChunk(), which currently shows up as an imul, which isn't exactly fast either (and uses a lot of execution ports).

3) Every free/alloc needs to unlink the block from slab->freelist[i] and then relink it. That shows up prominently in profiles - it roughly triples the number of cachelines touched in the happy path, with unpredictable accesses to boot.

Perhaps we could reduce the precision of slab->freelist indexing to amortize that cost? I.e. not have one slab->freelist entry for each possible nfree, but instead have an upper limit on the number of freelists? (Rough sketch at the end of this mail.)

4) Less of a performance issue, and more of a usability one: the constant block size strikes me as problematic. Most users of an allocator sometimes deal with a small amount of data and sometimes with a large amount, and a single fixed block size can't fit both well.

Greetings,

Andres Freund

[1] https://www.agner.org/optimize/instruction_tables.pdf
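The sketch mentioned under 3) - again untested, SLAB_FREELIST_COUNT, SlabFreelistIndex() and SlabBlockMarkChunkFree() are made-up names: map nfree onto a small fixed number of buckets, and only relink the block when its bucket actually changes.

/* slab->freelist[] would have SLAB_FREELIST_COUNT entries instead of
 * chunksPerBlock + 1 */
#define SLAB_FREELIST_COUNT 8

static inline int
SlabFreelistIndex(SlabContext *slab, int nfree)
{
	/* keep completely full blocks (nfree == 0) in their own bucket */
	if (nfree == 0)
		return 0;
	return 1 + (nfree - 1) * (SLAB_FREELIST_COUNT - 1) / slab->chunksPerBlock;
}

/* called from SlabFree(): relink the block only on a bucket change */
static inline void
SlabBlockMarkChunkFree(SlabContext *slab, SlabBlock *block)
{
	int			oldidx = SlabFreelistIndex(slab, block->nfree);
	int			newidx = SlabFreelistIndex(slab, block->nfree + 1);

	block->nfree++;

	if (oldidx != newidx)
	{
		dlist_delete(&block->node);
		dlist_push_head(&slab->freelist[newidx], &block->node);
	}
}

With 8 buckets most alloc/free pairs would stay within the same bucket, so the freelist cachelines wouldn't need to be touched at all in the common case.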