Hi, [1] made me notice these issues. The issues here are mostly independent, but it's still worthwhile to read that thread - in particular because my proposed solution for the problem is possibly somewhat related to this issue. And, if we were to go for my more extreme proposal below, the fix for this issue would probably also fix [1].
There's several smgr operations (e.g. smgrtruncate(), smgrdounlinkall()) that first do a a DropRelFileNodesAllBuffers() and then perform the filesystem operation. Without a critical section. As far as I can tell that's not acceptable. Using smgrtruncate() as an example (because it's executed from backends / autovacuum rather than e.g. checkpointer, and because it's targeted at relations that live past the current transaction): DropRelFileNodeBuffers() throws away valid buffer contents. Therefore, we'll be in a corrupt state should smgr_truncate fail. The on-disk state will be the old file contents, but the in-memory state will not reflect that. Consider e.g. the case that autovacuum pruned away all tuples from a relation. That will result in pg_class.relfrozenxid being updated, reflecting the new horizon. If we then do a DropRelFileNodeBuffers() throwing away the modified buffer contents, but subsequently fail to truncate the underlying file, we'll potentially have a lot of pages full of tuples referencing xids that are older than relfrozenxid. Which scans will try to return, as the underlying file is the original size. As far as I can tell it, as things stand, is simply never ok to do DropRelFileNodeBuffers() - when there potentially are dirty pages - outside of a critical section. RelationTruncate() notes: /* * We WAL-log the truncation before actually truncating, which means * trouble if the truncation fails. If we then crash, the WAL replay * likely isn't going to succeed in the truncation either, and cause a * PANIC. It's tempting to put a critical section here, but that cure * would be worse than the disease. It would turn a usually harmless * failure to truncate, that might spell trouble at WAL replay, into a * certain PANIC. */ but I think this analysis is quite insufficient. As far as I can tell it doesn't consider the issue outlined above, nor does it consider what will happen if standbys/PITR will replay the WAL records, but the primary considers the relation to be of a different length. It seems to me that we either need to a) write out all dirty buffers during DropRelFileNodeBuffers(), thereby preventing the issue of old page contents "coming back to live" after a failed truncation. b) accept using a critical section, with the obvious consequence that failing to truncate would lead to a PANIC c) use a more complex protocol to invalidate buffers, ensuring there's no inconsistency between fs and shared_buffers. c) could e.g. be something like 1) mark all buffers as about-to-be-dropped 2) CacheInvalidateSmgr() 3) truncate on filesystem level 4a) if that fails, remove the about-to-be-dropped flags, in a PG_CATCH block 4b) if that succeeds, fully remove the buffers to be dropped When another backend wants to use a buffer marked as about-to-be-dropped it would need to wait for that operation to finish (this afaict could neither work the way StartBufferIO() does, nor the way LockBufferForCleanup() does). Such a scheme would have the advantage that truncation etc would behave a lot more like normal buffer modifications. There'd e.g. afterwards be an interlock between truncations and BufferSync() / checkpoints, which imo is quite attractive. On the other hand, the cost of DropRelFileNodeBuffers() needing to perform an exhaustive search of shared_buffers, would be quite painful. The single DropRelFileNodeBuffers() call is already bad enough. I guess we could amortize that slightly, by caching the buffer locations for a limited number of buffers - in many cases where the search cost are problematic there won't be too many pages for individual relations present. For me alternative a) is prohibitively expensive, and not worth discussing much. I think I can live with b), but I know that I'm much less concerned with PANICs in these types of situations than others. c) seems worth investigating, but presumably would end up being too complicated to backpatch. I think several DropRelFileNodeBuffers() callers besides vacuum are a bit less concerning. E.g. when dropping a relation we do so as part of a checkpoint, which'll trigger a PANIC IIRC. And the in-place truncation case for TRUNCATE IIRC only applies to relations created in the same subtransaction, which makes the failure scenario above largely moot IIRC. Tom, I seem to recall a recent thread of yours discussing a different approach to truncation. I wonder if there's some intersection with that. But unfortunately my search somehow has come up with nothing so far - do you remember enough to find the thread? Comments? Andres Freund [1] https://www.postgresql.org/message-id/20191206230640.2dvdjpcgn46q3ks2%40alap3.anarazel.de