We got another report today [1] of what seems to be the same problem we've seen before, namely a failed vacuum truncation leaving corrupt state on disk [2].

Reflecting on that some more, it seems to me that we're never going to get to a solution that everybody finds acceptable without some rather significant restructuring at the buffer-access level. Since looking for a back-patchable solution has yielded no progress in eight years, what if we just accept that we will only fix this in HEAD, and think outside the box about how we could fix it if we're willing to change internal APIs as much as necessary?
After doodling for a while with that in mind, it seems like we might be able to fix it by introducing a new buffer state "truncation pending" that we'd apply to empty-but-dirty buffers that we don't want to write out. Aside from fixing the data corruption issue, this sketch has the significant benefit that we don't need to take AccessExclusiveLock anymore to do a vacuum truncation: it seems sufficient to lock out would-be writers of the table.

VACUUM's truncation logic would go like this (there's a rough code sketch of the whole sequence below):

1. Take ShareLock on the table to lock out writers but not readers. (We might need ShareRowExclusiveLock, not sure.) As with the existing code that takes AccessExclusiveLock, this is a lock upgrade, so it could fail; but that's OK, we just don't truncate.

2. Scan backwards from relation EOF to determine the last page to be deleted (same as existing logic).

3. Truncate the FSM and VM. (This can be done outside the critical section because it's harmless if we fail later; the lost state can be reconstructed, and anyway we know all the forgotten-about pages are empty.)

4. Issue a WAL truncation record, and make sure it's flushed.

5. Begin a critical section, so that any hard ERROR below becomes a PANIC. (We don't really expect any error, but it's not OK for the vacuum process to disappear without having undone any truncation-pending marks.)

6. Re-scan buffers from the first to the last page to be deleted, using a fetch mode that doesn't pull in pages not already present in the buffer pool. (That way, we don't issue any reads during this phase, reducing the risk of unwanted ERROR/PANIC.) As we examine each buffer:
   * If not empty of live tuples, panic :-(
   * If clean, delete the buffer.
   * If dirty, mark it as truncation pending.
   Remember the first and last page numbers that got marked as trunc pending.

7. Issue ftruncate(s), working backwards if the truncation spans multiple segment files. Don't error out on syscall failure, just stop truncating and note the boundary of successful truncation.

8. Re-scan buffers from the first to the last trunc-pending page, again skipping anything not found in the buffer pool. Anything found should be trunc pending, or possibly empty if a reader pulled it in concurrently. Either delete it if it's above what we successfully truncated to, or drop the trunc-pending bit (reverting it to dirty state) if not.

9. If the actual truncation boundary was different from the plan, issue another WAL record saying "oh, we only managed to truncate to here, not there".

10. End the critical section.

11. Release the table ShareLock.

Now, what does the "truncation pending" buffer flag do exactly?

* The buffer manager is not allowed to drop such a page from the pool, nor to attempt to write it out; it's just in limbo till vacuum's critical section completes.

* Readers: assume the page is empty, move on. (A seqscan could actually stop, knowing that all later pages must be empty too.) Or, since the page must be empty of live tuples, readers could just process it normally, but I'd rather they didn't.

* Writers: error, should not be able to see this state.

* Background writer: ignore and move on (must *not* try to write the dirty page, since the truncate might've already happened).

* Checkpointer: I think the checkpointer has to wait for the flag to go away :-(. We can't mark a checkpoint complete if there are dirty trunc-pending pages. This aspect could use more thought, perhaps.
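To make that a bit more concrete, here is a very rough sketch of the vacuum-side sequence in C. To be clear, this is only a sketch under the above assumptions: BM_TRUNC_PENDING, RBM_IF_CACHED, PageIsEmptyOfLiveTuples(), BufferIsDirty(), MarkBufferTruncPending(), ClearBufferTruncPending(), DropEmptyBuffer(), log_heap_truncate(), log_heap_truncate_adjust(), and smgrtruncate_no_error() are all invented names for pieces that don't exist yet; only the lock-manager, critical-section, FSM/VM, and buffer-read calls are existing APIs, and buffer pin/content-lock discipline is glossed over.

/*
 * Rough sketch of the proposed vacuum-side truncation sequence.
 * Hypothetical pieces (new buffer flag, new read mode, new helpers)
 * are marked as such; error handling, pin/lock discipline, and
 * segment-by-segment details are glossed over.
 */
static void
lazy_truncate_heap_sketch(Relation onerel, BlockNumber old_rel_pages,
                          BlockNumber new_rel_pages)
{
    BlockNumber first_pending = InvalidBlockNumber;
    BlockNumber last_pending = InvalidBlockNumber;
    BlockNumber achieved;
    BlockNumber blkno;

    /* 1. Lock out writers but not readers; if we can't, just give up. */
    if (!ConditionalLockRelation(onerel, ShareLock))
        return;

    /* 2. new_rel_pages is assumed to have been determined already by
     * scanning backwards from the old EOF for the last nonempty page. */

    /* 3. Truncate the FSM and VM; harmless if we fail after this point. */
    FreeSpaceMapTruncateRel(onerel, new_rel_pages);
    visibilitymap_truncate(onerel, new_rel_pages);

    /* 4. WAL-log the intended truncation and make sure it's flushed.
     * log_heap_truncate() is a hypothetical helper for the new record. */
    XLogFlush(log_heap_truncate(onerel, new_rel_pages));

    /* 5. From here on, any ERROR is promoted to PANIC. */
    START_CRIT_SECTION();

    /*
     * 6. Walk the to-be-deleted range without reading anything in from
     * disk.  RBM_IF_CACHED is a hypothetical ReadBufferMode that returns
     * InvalidBuffer when the page isn't already in shared buffers.
     */
    for (blkno = new_rel_pages; blkno < old_rel_pages; blkno++)
    {
        Buffer      buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
                                             RBM_IF_CACHED, NULL);

        if (!BufferIsValid(buf))
            continue;
        if (!PageIsEmptyOfLiveTuples(BufferGetPage(buf)))   /* hypothetical */
            elog(PANIC, "page %u still contains live tuples", blkno);
        if (BufferIsDirty(buf))                             /* hypothetical */
        {
            MarkBufferTruncPending(buf);    /* sets hypothetical BM_TRUNC_PENDING */
            if (first_pending == InvalidBlockNumber)
                first_pending = blkno;
            last_pending = blkno;
        }
        else
            DropEmptyBuffer(buf);           /* hypothetical: discard clean buffer */
    }

    /* 7. Do the ftruncate(s), last segment first; don't ERROR on syscall
     * failure, just report how many pages are actually left afterwards
     * (hypothetical helper). */
    achieved = smgrtruncate_no_error(onerel, new_rel_pages);

    /* 8. Resolve the trunc-pending buffers against the achieved boundary. */
    if (first_pending != InvalidBlockNumber)
    {
        for (blkno = first_pending; blkno <= last_pending; blkno++)
        {
            Buffer      buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
                                                 RBM_IF_CACHED, NULL);

            if (!BufferIsValid(buf))
                continue;
            if (blkno >= achieved)
                DropEmptyBuffer(buf);           /* page is gone from disk */
            else
                ClearBufferTruncPending(buf);   /* revert to plain dirty */
        }
    }

    /* 9. If we fell short of the plan, WAL-log where we actually got to. */
    if (achieved != new_rel_pages)
        log_heap_truncate_adjust(onerel, achieved);         /* hypothetical */

    /* 10 and 11. Done. */
    END_CRIT_SECTION();
    UnlockRelation(onerel, ShareLock);
}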
* "Only managed to truncate to here" record: write out empty heap pages to fill the space from original truncation target to actual. This restores the on-disk situation to be equivalent to what it was in master, assuming all the dirty pages eventually got written. It's slightly annoying to write the second truncation record inside the critical section, but I think we need to because if we don't, there's a window where we don't write the second record at all and so the post-replay situation is different from what the master had on disk. Note that if we do crash in the critical section, the post-replay situation will be that we truncate to the original target, which seems fine since nothing else could've affected the table state. (Maybe we should issue the first WAL record inside the critical section too? Not sure.) One issue with not holding AEL is that there are race conditions wherein readers might attempt to fetch pages beyond the file EOF (for example, a seqscan that started before the truncation began would attempt to read up to the old EOF, or a very slow indexscan might try to follow a pointer from a since-deleted index entry). So we would have to change things to regard that as a non-error condition. That might be fine, but it does seem like it's giving up some error detection capability. If anyone's sufficiently worried about that, we could keep the lock level at AEL; but personally I think reducing the lock level is worth enough to be willing to make that compromise. Another area that's possibly worthy of concern is whether a reader could attempt to re-set bits in the FSM or VM for to-be-deleted pages after the truncation of those files. We might need some interlock to prevent that. Or, perhaps, just re-truncate them after the main truncation? Or maybe it doesn't matter if there's bogus data in the maps. Also, I'm not entirely sure whether there's anything in our various replication logic that's dependent on vacuum truncation taking AEL. Offhand I'd expect the reduced use of AEL to be a plus, but maybe I'm missing something. Thoughts? Are there obvious holes in this plan? regards, tom lane [1] https://www.postgresql.org/message-id/28278.1544463...@sss.pgh.pa.us [2] https://www.postgresql.org/message-id/flat/5BBC590AE8DF4ED1A170E4D48F1B53AC@tunaPC