On Sat, 21 Sept 2024 at 21:33, Tomas Vondra <to...@vondra.me> wrote:
> I've finally pushed this, after many rounds of careful testing to ensure
> no regressions, and polishing. All changes since the version shared on
> September 13 are only cosmetic - renaming a macro to keep it consistent
> with the other ones, clarifying a couple comments etc. Nothing major.
Great work on this! I have seen multiple customers hitting LockManager contention caused by running out of fast-path capacity, and they will certainly be glad to have a fix available when they eventually upgrade. Regrettably I did not find the time to participate in this discussion during development, but I did have some thoughts that I wanted to unload to the list - not as criticism, but in case it turns out that follow-up work is needed.

Driving the array sizing from max_locks_per_transaction seems like a good idea. The one major difference from the lock table is that while the lock table is partitioned dynamically between backends, the fast-path array has a static size per backend. One case where that distinction matters is when only a fraction of backends try to lock large numbers of relations: that fraction will still fall back to the main lock table, but at least the contention should be limited by virtue of there not being too many such backends. The other case is when max_connections is much higher than the number of backends actually in use: individual backends may then hold well over max_locks_per_transaction locks without running into lock table capacity issues, yet still overflow their fast-path arrays. In both cases users have the simple workaround of increasing the max_locks_per_transaction setting, but I'm sure they would be happier if things just worked without any tuning.

So I tried to figure out some scheme to get dynamic allocation of fast-path locks. The best data structure I came up with was a shared fast-path lock array - still organized as a 16-way associative cache, but indexed by hash(BackendId, RelationId). fpLockBits can be stuffed into the high byte of BackendId thanks to MAX_BACKENDS. Locking could be handled by one lock per way, or, at least on a cursory glance, it shouldn't be too difficult to make the whole fast-path acquisition lock-free. (A rough sketch of the layout I have in mind is in the P.S. below.)

Either way, it feels like structuring the array this way could result in a large amount of false sharing of cache lines. The current static allocation means that each process needs to touch only a small set of cache lines referenced by nobody else, so those lines quite probably stay in the CPU-local L2 in exclusive mode. In a shared array a larger number of cache lines is needed, and they will be concurrently written to by other backends - lots of invalidation messages and cache line bouncing. I don't know how large this effect would be without building a prototype and running it on a large machine with high core-to-core latencies.

It would be possible to create a hybrid approach: a small local fast-path array servicing the majority of acquisitions, with a larger shared victim cache for the exceptional cases. But it doesn't feel like that is worth the complexity, at least not without seeing some example workloads where it would help. And even then, maybe using hierarchical locking to do less work is the better approach. Being optimistic, perhaps the current patch is already enough to resolve the issue.

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com
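
P.S. To make the shared-array idea above a bit more concrete, here is a rough, untested sketch of the layout. All names in it (SharedFastPathSlot, SharedFastPathGroup, fp_shared_group, FP_ASSOCIATIVITY, ...) are invented for illustration and are not existing symbols; the only things I'm assuming from the tree are the hash helpers in common/hashfn.h, the usual atomics and LWLock types, and that MAX_BACKENDS keeps fitting in 18 bits so the owning backend and the per-relation mode bits can share one 32-bit word.

#include "postgres.h"

#include "common/hashfn.h"
#include "port/atomics.h"
#include "storage/lwlock.h"

/* 16 slots per group, mirroring the 16-way associativity described above */
#define FP_ASSOCIATIVITY    16
#define FP_OWNER_MASK       0x0003FFFF  /* assumes MAX_BACKENDS <= 0x3FFFF */
#define FP_MODES_SHIFT      24          /* lock mode bits go in the high byte */

/* One slot: which backend holds which relation, and in which weak modes. */
typedef struct SharedFastPathSlot
{
    pg_atomic_uint32 owner_and_modes;  /* backend id | (fpLockBits << FP_MODES_SHIFT) */
    Oid              relid;            /* InvalidOid means the slot is free */
} SharedFastPathSlot;

/* One group of slots that a (backend, relation) pair hashes to. */
typedef struct SharedFastPathGroup
{
    LWLock             lock;           /* the "one lock per way" variant */
    SharedFastPathSlot slots[FP_ASSOCIATIVITY];
} SharedFastPathGroup;

/* Pack a backend id and its lock mode bits into one atomic word. */
static inline uint32
fp_pack_owner(int backend_id, uint32 mode_bits)
{
    return ((uint32) backend_id & FP_OWNER_MASK) | (mode_bits << FP_MODES_SHIFT);
}

/* Pick the group whose slots may cache this backend's lock on this relation. */
static inline uint32
fp_shared_group(int backend_id, Oid relid, uint32 ngroups)
{
    uint32 h = hash_combine(murmurhash32((uint32) backend_id),
                            murmurhash32((uint32) relid));

    return h % ngroups;
}

The groups would presumably be padded to cache line size, but as said above, the real question is how badly the lines bounce once several backends hash into the same group - that I can't judge without a prototype.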