Hi, On 2019-12-07 11:07:04 +0530, Amit Kapila wrote: > On Sat, Dec 7, 2019 at 5:42 AM Andres Freund <and...@anarazel.de> wrote: > > > > Tom, I seem to recall a recent thread of yours discussing a different > > approach to truncation. I wonder if there's some intersection with > > that. But unfortunately my search somehow has come up with nothing so > > far - do you remember enough to find the thread? > > > > IIUC, a similar problem is discussed in the thread [1] where Tom > proposed a few solutions which are close to what you are proposing. > > [1] - https://www.postgresql.org/message-id/1880.1281020817%40sss.pgh.pa.us
It does indeed look like basically the same problem. I was actually remembering a different thread, that was more about truncating with not quite as heavyweight locking. Having pondered this and some related problems (truncation, flushing multiple buffers at once using asynchronous IO, PrefetchBuffer() directly into buffers, cleanup lock implementation), I think we basically need the ability to set something like BM_IO_IN_PROGRESS on multiple buffers (perhaps signalling different forms of IO with different flags, but I'm not sure it's needed). I think we basically ought to replace the current IO buffer locking with condition variables (as Robert has suggested at [1]). Instead of having an array of LWLocks (BufferIOLWLockArray), we'd allocate one condition variable for each buffer. I think we have enough space directly in BufferDesc these days, due to the spinlock removal, but that seems like a detail. For truncation, we'd first iterate over all buffers once to mark them as BM_IO_IN_PROGRESS, then we would truncate the underlying file. If the truncation succeeds, we can use a local palloc'd array of IO_IN_PROGRESS buffers to actually evict them. If the truncation fails, the same array would be used to reset IO_IN_PROGRESS (basically AbortBufferIO, except not setting BM_IO_ERROR for the truncation case). This would solve the problem of truncations leading to diverging fs/buffer state, would not require a PANIC, and would allow to have concurrent buffer eviction to efficiently wait for IO to finish. This would pretty directly replace the current cleanup locks, which would just need the existing flagbit to indicate that refcount=0 should cause a condition variable wakeup. Does somebody see a fundamental hole in this approach? Obviously there's lots of different details to fill in, but most of them seem likely to only be found by actually writing the patch... Greetings, Andres Freund [1] https://postgr.es/m/CA%2BTgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr%3DC56Xng%40mail.gmail.com