On Mon, Oct 17, 2016 at 4:34 PM, Heikki Linnakangas <hlinn...@iki.fi> wrote:
> >> > visibilitymap_truncate is actually also wrong, in a different way. The > truncation WAL record is written only after the VM (and FSM) are truncated. > But visibilitymap_truncate() has already modified and dirtied the page. If > the VM page change is flushed to disk before the WAL record, and you crash, > you might have a torn VM page and a checksum failure. > > Right, I missed the problem with checksums. > Simply replacing the MarkBufferDirtyHint() call with MarkBufferDirty() in > FreeSpaceMapTruncateRel would have the same issue. If you call > MarkBufferDirty(), you must WAL-log the change, and also set the page's LSN > to make sure the WAL record is flushed first. > > I'm not sure I fully understand this. If the page is flushed before the WAL record, we might have a truncated FSM where as the relation truncation is not replayed. But that's not the same problem, right? I mean, you might have an FSM which is not accurate, but it won't at the least return the blocks outside the range. Having said that, I agree your proposed changes are more robust. BTW any thoughts on race-condition on the primary? Comments at MarkBufferDirtyHint() seems to suggest that a race condition is possible which might leave the buffer without the DIRTY flag, but I'm not sure if that can only happen when the page is locked in shared mode. While most of the reports so far involved standbys, and the bug can also hit a standalone master during crash recovery, I wonder if a race can occur even on a live system. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services