On Sat, 21 Sept 2024 at 21:33, Tomas Vondra <to...@vondra.me> wrote:
> I've finally pushed this, after many rounds of careful testing to ensure
> no regressions, and polishing. All changes since the version shared on
> September 13 are only cosmetic - renaming a macro to keep it consistent
> with the other ones, clarifying a couple comments etc. Nothing major.
Great work on this! I have seen multiple customers hitting LockManager contention caused by running out of fast-path capacity, and they will certainly be glad to have a fix available when they eventually upgrade. Regrettably I did not find the time to participate in this discussion during development, but I did have some thoughts that I wanted to unload to the list - not as criticism, but in case it turns out that follow-up work is needed.

Driving the array sizing from max_locks_per_transaction seems like a good idea. The one major difference from the lock table is that while the lock table is partitioned dynamically between backends, the fast-path array has a static size per backend. One case where that distinction matters is when only a fraction of backends try to lock large numbers of relations: that fraction will still fall back to the main lock table, but at least the contention should be limited by virtue of there not being too many such backends. The other case is when max_connections is much higher than the number of backends actually in use: individual backends may then hold well over max_locks_per_transaction locks without running into lock table capacity issues, yet still overflow their fast-path arrays. In both cases users have the simple workaround of increasing the max_locks_per_transaction setting, but I'm sure they would be happier if things just worked without any tuning.

So I tried to figure out some scheme to get dynamic allocation of fast-path locks. The best data structure I came up with was a shared fast-path lock array - still organized as a 16-way associative cache, but indexed by hash(BackendId, RelationId). fpLockBits can be stuffed into the high byte of BackendId thanks to MAX_BACKENDS. Locking could be handled by one lock per way, or, at least on a cursory glance, it shouldn't be too difficult to make the whole fast-path acquisition lock-free. (A rough sketch of the layout I have in mind is in the P.S. below.)

Either way, it feels like structuring the array this way could result in a large amount of false sharing of cache lines. The current static allocation means that each process needs to touch only a small set of cache lines referenced by nobody else, so those lines quite probably stay in the CPU-local L2 in exclusive mode. In a shared array a larger number of cache lines is needed, and they will be concurrently written to by other backends - lots of invalidation messages and cache line bouncing. I don't know how large this effect would be without building a prototype and running it on a large machine with high core-to-core latencies.

It would be possible to create a hybrid approach: a small local fast-path array servicing the majority of acquisitions, with a larger shared victim cache for the exceptional cases. But it doesn't feel like that is worth the complexity, at least not without seeing some example workloads where it would help. And even then, maybe using hierarchical locking to do less work is the better approach. Being optimistic, perhaps the current patch is already enough to resolve the issue.

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com
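
P.S. To make the shared-array idea above a bit more concrete, here is a rough, untested sketch of the layout. All names in it (SharedFastPathSlot, SharedFastPathGroup, fp_shared_group, FP_ASSOCIATIVITY, ...) are invented for illustration and are not existing symbols; the only things I'm assuming from the tree are the hash helpers in common/hashfn.h, the usual atomics and LWLock types, and that MAX_BACKENDS keeps fitting in 18 bits so the owning backend and the per-relation mode bits can share one 32-bit word.

#include "postgres.h"

#include "common/hashfn.h"
#include "port/atomics.h"
#include "storage/lwlock.h"

/* 16 slots per group, mirroring the 16-way associativity described above */
#define FP_ASSOCIATIVITY    16
#define FP_OWNER_MASK       0x0003FFFF  /* assumes MAX_BACKENDS <= 0x3FFFF */
#define FP_MODES_SHIFT      24          /* lock mode bits go in the high byte */

/* One slot: which backend holds which relation, and in which weak modes. */
typedef struct SharedFastPathSlot
{
    pg_atomic_uint32 owner_and_modes;  /* backend id | (fpLockBits << FP_MODES_SHIFT) */
    Oid              relid;            /* InvalidOid means the slot is free */
} SharedFastPathSlot;

/* One group of slots that a (backend, relation) pair hashes to. */
typedef struct SharedFastPathGroup
{
    LWLock             lock;           /* the "one lock per way" variant */
    SharedFastPathSlot slots[FP_ASSOCIATIVITY];
} SharedFastPathGroup;

/* Pack a backend id and its lock mode bits into one atomic word. */
static inline uint32
fp_pack_owner(int backend_id, uint32 mode_bits)
{
    return ((uint32) backend_id & FP_OWNER_MASK) | (mode_bits << FP_MODES_SHIFT);
}

/* Pick the group whose slots may cache this backend's lock on this relation. */
static inline uint32
fp_shared_group(int backend_id, Oid relid, uint32 ngroups)
{
    uint32 h = hash_combine(murmurhash32((uint32) backend_id),
                            murmurhash32((uint32) relid));

    return h % ngroups;
}

The groups would presumably be padded to cache line size, but as said above, the real question is how badly the lines bounce once several backends hash into the same group - that I can't judge without a prototype.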