On Tue, 3 Sept 2024 at 18:20, Tomas Vondra <to...@vondra.me> wrote:
> FWIW the actual cost is somewhat higher, because we seem to need ~400B
> for every lock (not just the 150B for the LOCK struct).
We do indeed allocate two PROCLOCKs for every LOCK, and we allocate those inside dynahash tables. That amounts to (152 + 2*64 + 3*16 =) 328 bytes in dynahash elements, plus (3 * 8-16 =) 24-48 bytes for the dynahash buckets/segments, resulting in 352-376 bytes * NLOCKENTS() being used [^1]. Does that align with your usage numbers, or are they significantly larger?

> At least based on a quick experiment. (Seems a bit high, right?)

Yeah, that does seem high; thanks for nerd-sniping me.

The 152 bytes of LOCK are mostly due to two MAX_LOCKMODES-sized int arrays that track the number of requested and granted locks of each mode. As MAX_LOCKMODES = 10, these arrays use a total of 2*4*10 = 80 bytes, with the remaining 72 spent on everything else. MAX_BACKENDS sadly doesn't fit in int16, so we'll have to keep using int arrays, but that doesn't mean we can't reduce their size:

ISTM that MAX_LOCKMODES is 2 larger than it has to be: LOCKMODE 0 is NoLock, which is never used or counted in these shared structures, and the highest lock mode supported by any of the supported lock methods is AccessExclusiveLock (8). We can thus reduce MAX_LOCKMODES to 8, shrinking the LOCK struct by 16 bytes.

If some struct and field packing is OK, we could shave off a further 8 bytes by narrowing the LOCKMASK type from int to int16 (we only use the first MaxLockMode (8) + 1 bits) and storing the grantMask/waitMask fields (now 4 bytes total) in the padding at the end of the waitProcs struct. This depends on dclist never writing into its padding space; I couldn't find any user that does, and most critically dclist_init doesn't scribble in the padding with memset.

With both implemented, we'd save 24 bytes, reducing the struct to 128 bytes.
That's exactly two cache lines. :) [^2]

I also checked PROCLOCK: if it's worth packing that struct further, we should probably look at replacing the PGPROC*-typed fields with ProcNumber-based ones, potentially in both PROCLOCK and PROCLOCKTAG. When combined with int16-typed LOCKMASKs, replacing either one of those fields with a ProcNumber would reduce the struct by one MAXALIGN quantum, to 56 bytes, the smallest I could get it without ignoring type alignments.

Further shmem savings can be had by reducing the "10% safety margin" added at the end of LockShmemSize: I'm fairly sure the memory used in the shared hash tables doesn't exceed the estimated amount, and if it did, we should probably fix that rather than request (up to) 10% overhead here.

Altogether that'd save 40 bytes/lock entry in struct sizes and ~35 bytes/lock in "safety margin", for a saving of (up to) 19% of our current allocation. I'm not sure whether these tricks would help performance or even hurt it, beyond smaller structs usually fitting better in CPU caches.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[^1] NLOCKENTS() benefits from being a power of 2, or slightly below one, as it's rounded up to a power of 2 when dynahash decides how many buckets to allocate.

[^2] Sadly, this 2-cacheline alignment is lost due to dynahash's HASHELEMENT prefix on each element. :(