On Wed, Jan 29, 2025 at 11:49 PM Matthias van de Meent <boekewurm+postg...@gmail.com> wrote:
>
> Hi,
>
> Some time ago I noticed that every buffer table entry is quite large at 40
> bytes (+8): 16 bytes of HASHELEMENT header (of which the last 4 bytes are
> padding), 20 bytes of BufferTag, and 4 bytes for the offset into the shared
> buffers array, with generally 8 more bytes used for the bucket pointers.
> (32-bit systems: 32 (+4) bytes)
>
> Does anyone know why we must have the buffer tag in the buffer table?
> It seems to me we can follow the offset pointer into the shared BufferDesc
> array whenever we find out we need to compare the tags (as opposed to just
> the hash, which is stored and present in HASHELEMENT). If we decide to just
> follow the pointer, we can immediately shave 16 bytes (40%) off the lookup
> table's per-element size, or 24 if we pack the 4-byte shared buffer offset
> into the unused bytes in HASHELEMENT, reducing the memory usage of that hash
> table by ~50%: ...
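For concreteness, here is roughly where those 40 bytes come from, as I
understand the current structs (paraphrased from hsearch.h,
buf_internals.h, and buf_table.c; the byte annotations are mine and
assume a 64-bit build):

/* dynahash element header (hsearch.h) */
typedef struct HASHELEMENT
{
    struct HASHELEMENT *link;       /*  8 bytes: next entry in bucket */
    uint32      hashvalue;          /*  4 bytes: cached hash value    */
                                    /*  4 bytes of trailing padding   */
} HASHELEMENT;                      /* 16 bytes                       */

/* buffer tag (buf_internals.h) */
typedef struct buftag
{
    Oid         spcOid;             /*  4 bytes: tablespace           */
    Oid         dbOid;              /*  4 bytes: database             */
    RelFileNumber relNumber;        /*  4 bytes: relation file        */
    ForkNumber  forkNum;            /*  4 bytes: fork                 */
    BlockNumber blockNum;           /*  4 bytes: block                */
} BufferTag;                        /* 20 bytes                       */

/* buffer lookup table entry (buf_table.c) */
typedef struct
{
    BufferTag   key;                /* 20 bytes: tag of the disk page */
    int         id;                 /*  4 bytes: associated buffer ID */
} BufferLookupEnt;                  /* 24 bytes + 16-byte HASHELEMENT
                                     * header = 40 bytes per entry    */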
So every buffer table entry is 40 bytes, and every buffer itself is 8 KB.
Also, every BufferDesc is padded to 64 bytes (on 64-bit systems). So a
buffer table entry is < 0.5% of the memory used per buffer. I assume this
means the benefit of your patch doesn't come from the memory savings, but
maybe from spatial locality? That is, saving 0.5% of memory doesn't seem
like a huge win by itself, but maybe the hash table will see fewer cache
misses, since the entries are smaller and therefore packed closer
together: you can now fit about 2.7 hash entries on a 64-byte cache line
instead of 1.6 (64 / 24 vs. 64 / 40).

Except, to check the buffer tag itself, you now have to dereference the
offset pointer into the shared BufferDesc array. And every BufferDesc is
padded to 64 bytes (on 64-bit systems), a full cache line. So instead of
having the buffer tag adjacent to the hash entry, you get a memory stall.
If the buffer tags match (i.e., no collision), this is still a possible
win, because you'd have to load that BufferDesc anyway (it's the buffer
you're looking for); if they don't match, it's an extra cache miss that
the current layout avoids.

> Does anyone have an idea on how to best benchmark this kind of patch, apart
> from "running pgbench"? Other ideas on how to improve this? Specific concerns?

What's the expected advantage of reducing the buffer table's entry size?
Better CPU cache usage when there's no collision, but worse in the
less-likely case where there is a collision? What I mean is, regarding
benchmarks: what's the best-case scenario for this kind of patch, and what
sort of performance difference would you expect to see?

Thanks,
James
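PS: to make the locality trade-off concrete, here is a small,
self-contained toy program (the names and layouts are made up for
illustration and are not PostgreSQL code) showing where the tag
comparison moves under the proposed layout: out of the hash entry itself
and into the 64-byte descriptor that the stored offset points at.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the 20-byte BufferTag (five 4-byte fields, no internal
 * padding, so memcmp() is a fair comparison). */
typedef struct ToyTag
{
    uint32_t    spcOid;
    uint32_t    dbOid;
    uint32_t    relNumber;
    uint32_t    forkNum;
    uint32_t    blockNum;
} ToyTag;

/* Stand-in for a BufferDesc padded out to a full 64-byte cache line. */
typedef struct ToyDesc
{
    ToyTag      tag;
    char        pad[64 - sizeof(ToyTag)];
} ToyDesc;

/* Current layout: the tag is stored right in the hash entry. */
typedef struct FatEntry
{
    ToyTag      key;
    int         id;
} FatEntry;

/* Proposed layout: the entry keeps only the buffer offset. */
typedef struct SlimEntry
{
    int         id;
} SlimEntry;

/* Today, the tag comparison touches only the hash entry itself. */
static int
lookup_current(const FatEntry *ent, const ToyTag *tag)
{
    return memcmp(&ent->key, tag, sizeof(ToyTag)) == 0 ? ent->id : -1;
}

/* With the tag removed, the comparison first follows ent->id into the
 * descriptor array, i.e. a separate (and probably cold) cache line. */
static int
lookup_slim(const SlimEntry *ent, const ToyTag *tag, const ToyDesc *descs)
{
    const ToyDesc *desc = &descs[ent->id];      /* the extra dereference */

    return memcmp(&desc->tag, tag, sizeof(ToyTag)) == 0 ? ent->id : -1;
}

int
main(void)
{
    ToyTag      tag = {1, 2, 3, 0, 42};
    ToyDesc     descs[4];
    FatEntry    fat = {tag, 3};
    SlimEntry   slim = {3};

    memset(descs, 0, sizeof(descs));
    descs[3].tag = tag;

    printf("current lookup: %d, slim lookup: %d\n",
           lookup_current(&fat, &tag), lookup_slim(&slim, &tag, descs));
    printf("sizeof(FatEntry) = %zu, sizeof(SlimEntry) = %zu\n",
           sizeof(FatEntry), sizeof(SlimEntry));
    return 0;
}

Of course, whether the extra dereference actually hurts presumably
depends on how often a lookup lands on an entry whose tag doesn't match,
which is the collision question above.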