Hi,

On 2023-08-02 16:51:29 -0700, Matt Smiley wrote:
> I thought it might be helpful to share some more details from one of the
> case studies behind Nik's suggestion.
>
> Bursty contention on lock_manager lwlocks recently became a recurring cause
> of query throughput drops for GitLab.com, and we got to study the behavior
> via USDT and uprobe instrumentation along with more conventional
> observations (see
> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301). This
> turned up some interesting finds, and I thought sharing some of that
> research might be helpful.
Hm, I'm curious whether you have a way to trigger the issue outside of your
prod environment. Mainly because I'm wondering if you're potentially hitting
the issue fixed in a4adc31f690 - we ended up not backpatching that fix, so
you'd not see the benefit unless you reproduced the load in 16+.

I'm also wondering whether the throughput drops might be correlated with
heavyweight lock contention or higher-frequency access to the pg_locks view.
Deadlock checking and the locks view acquire locks on all lock manager
partitions... So if there's a bout of real lock contention (for longer than
deadlock_timeout)...

Given that most of your lock manager traffic comes from query planning -
have you evaluated using prepared statements more heavily?

Greetings,

Andres Freund
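[For readers following along: a minimal sketch of the prepared-statement
suggestion above. The table and statement names here are hypothetical, not
from GitLab's schema. The idea is that planning a query takes locks on every
relation and index it touches, all routed through the lock manager; reusing a
prepared statement lets repeated executions skip that per-query planning work
once a generic plan is in use.]

```sql
-- Hypothetical example: "issues" table and "fetch_issue" name are made up.
-- Planning this query locks the table and its indexes via the lock manager;
-- EXECUTE on an already-prepared statement can reuse a cached (generic)
-- plan instead of replanning on every call.
PREPARE fetch_issue (bigint) AS
    SELECT * FROM issues WHERE id = $1;

EXECUTE fetch_issue (42);
```

Client drivers typically expose this via their prepared-statement / extended
query protocol support rather than explicit PREPARE/EXECUTE.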