Hi,

On 2025-04-01 09:07:27 -0700, Noah Misch wrote:
> On Tue, Apr 01, 2025 at 11:55:20AM -0400, Andres Freund wrote:
> > WRT the locking issues, I've been wondering whether we could make
> > LWLockWaitForVar() work for that purpose, but I doubt it's the right
> > approach. Probably better to get rid of the LWLock*Var functions and go
> > for the approach I had in v1, namely a version of LWLockAcquire() with a
> > callback that gets called between LWLockQueueSelf() and
> > PGSemaphoreLock(), which can cause the lock acquisition to abort.
>
> What are the best thing(s) to read to understand the locking issues?
Unfortunately I think it's our discussion from a few days/weeks ago.

The problem basically is that functions like LockBuffer(EXCLUSIVE) need to be
able to non-racily
a) wait for in-flight IOs
b) acquire the content lock

If you just do it naively like this:

    else if (mode == BUFFER_LOCK_EXCLUSIVE)
    {
        if (pg_atomic_read_u32(&buf->state) & BM_IO_IN_PROGRESS)
            WaitIO(buf);
        LWLockAcquire(content_lock, LW_EXCLUSIVE);
    }

you obviously could have another backend start new IO between the WaitIO() and
the LWLockAcquire(). If that other backend then doesn't consume the completion
of that IO, the current backend could end up endlessly waiting for the IO.

I don't see a way to avoid that with narrow changes just to LockBuffer(). We
need some infrastructure that allows us to avoid the issue.

One approach could be to integrate more tightly with lwlock.c. If

1) anyone starting IO were to wake up all waiters for the LWLock

2) the waiting side checked that there is no IO in progress *after*
   LWLockQueueSelf(), but before PGSemaphoreLock()

the backend doing LockBuffer() would be guaranteed to get the chance to wait
for the IO, rather than the lwlock.

But there might be better approaches. I'm not really convinced that using
generic lwlocks for buffer locking is the best idea. There are just too many
special things about buffers. E.g. we have rather massive NUMA scalability
issues due to the amount of lock traffic from buffer header and content lock
atomic operations, particularly on things like the uppermost levels of a
btree. I've played with ideas like super-pinning and locking btree root pages,
which move all the overhead to the side that wants to exclusively lock such a
page - but that doesn't really make sense for lwlocks in general.

Greetings,

Andres Freund
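
To make the proposed ordering concrete, here is a minimal sketch of what the
waiting side could look like under that scheme. LWLockAcquireOrCheck() and its
callback are hypothetical names invented for illustration, not existing
lwlock.c API, and the waking-side change (anyone starting IO waking all
waiters on the content lock) is not shown:

    /*
     * Hypothetical callback, run by lwlock.c after LWLockQueueSelf() but
     * before PGSemaphoreLock(); returning true aborts the lock acquisition.
     */
    static bool
    BufferIOInProgress(void *arg)
    {
        BufferDesc *buf = (BufferDesc *) arg;

        return (pg_atomic_read_u32(&buf->state) & BM_IO_IN_PROGRESS) != 0;
    }

    ...

    for (;;)
    {
        if (!(pg_atomic_read_u32(&buf->state) & BM_IO_IN_PROGRESS))
        {
            /*
             * Hypothetical variant of LWLockAcquire(): returns false if the
             * callback fired while we were queued, i.e. IO was started
             * between queueing and sleeping on the semaphore.
             */
            if (LWLockAcquireOrCheck(content_lock, LW_EXCLUSIVE,
                                     BufferIOInProgress, buf))
                break;          /* got the content lock */
        }

        /* IO (still) in progress - wait for its completion, then retry */
        WaitIO(buf);
    }
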