On Mon, May 26, 2014 at 6:52 AM, Heikki Linnakangas
<hlinnakan...@vmware.com> wrote:
> Here's an idea I tried to explain to Andres and Simon at the pub last night,
> on how to reduce the spikes in the amount of WAL written at the beginning of
> a checkpoint that full-page writes cause. I'm just writing this down for the
> sake of the archives; I'm not planning to work on this myself.
>
>
> When you are replaying a WAL record that lies between the Redo-pointer of a
> checkpoint and the checkpoint record itself, there are two possibilities:
>
> a) You started WAL replay at that checkpoint's Redo-pointer.
>
> b) You started WAL replay at some earlier checkpoint, and are already in a
> consistent state.
>
> In case b), you wouldn't need to replay any full-page images; normal
> differential WAL records would be enough. In case a), you do, and you won't
> be consistent until replaying all the WAL up to the checkpoint record.
>
> We can exploit those properties to spread out the spike. When you modify a
> page and you're about to write a WAL record, check if the page has the
> BM_CHECKPOINT_NEEDED flag set. If it does, compare the LSN of the page
> against the *previous* checkpoint's redo-pointer, instead of the one that's
> currently in progress. If no full-page image is required based on that
> comparison, IOW if the page was modified and a full-page image was already
> written after the earlier checkpoint, write a normal WAL record without
> full-page image and set a new flag in the buffer header (BM_NEEDS_FPW). Also
> set a new flag on the WAL record, XLR_FPW_SKIPPED.
>
> When checkpointer (or any other backend that needs to evict a buffer) is
> about to flush a page from the buffer cache that has the BM_NEEDS_FPW flag
> set, write a new WAL record, containing a full-page-image of the page,
> before flushing the page.
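
As a rough illustration of the write-side rule quoted above, here is a minimal Python sketch. It is a toy model, not PostgreSQL code: the `Buffer` class, `plan_wal_record`, `flush_buffer`, and the `"standalone-fpi"` record name are all made up for illustration, LSNs are plain integers, and the "page LSN <= redo pointer means a full-page image is needed" test is assumed to be the usual comparison. The branch that still writes a full-page image when the page is older than the previous redo pointer is also an assumption; the proposal only spells out the skip case.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    page_lsn: int                    # LSN of the last WAL record that touched the page
    checkpoint_needed: bool = False  # models BM_CHECKPOINT_NEEDED
    needs_fpw: bool = False          # models the proposed BM_NEEDS_FPW flag


def plan_wal_record(buf, prev_redo, cur_redo):
    """Decide how to log a modification of buf.

    Returns (include_fpi, fpw_skipped), where fpw_skipped models the
    proposed XLR_FPW_SKIPPED flag on the WAL record.
    """
    if buf.checkpoint_needed:
        # Page is still waiting to be written by the in-progress checkpoint:
        # compare against the *previous* checkpoint's redo pointer.
        if buf.page_lsn > prev_redo:
            # A full-page image was already logged after the earlier
            # checkpoint, so defer the new one until the page is flushed.
            buf.needs_fpw = True
            return (False, True)
        # Assumption: otherwise fall back to logging a full-page image now.
        return (True, False)
    # Ordinary rule: compare against the current redo pointer.
    return (buf.page_lsn <= cur_redo, False)


def flush_buffer(buf, wal):
    """Flush buf, first logging the deferred full-page image if one is owed."""
    if buf.needs_fpw:
        wal.append("standalone-fpi")  # hypothetical record type
        buf.needs_fpw = False
    buf.checkpoint_needed = False     # page has now been written out
```

In this model the deferred image is logged at flush time, which is what moves the full-page writes away from the start of the checkpoint and spreads them over its duration.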

How does this mechanism work during a base backup? Would pg_stop_backup
need to flush all buffers that have the BM_NEEDS_FPW flag set?

>
> Here's how this works out during replay:
>
> a) You start WAL replay from the latest checkpoint's Redo-pointer.
>
> When you see a WAL record that's been marked with XLR_FPW_SKIPPED, don't
> replay that record at all. It's OK because we know that there will be a
> separate record containing the full-page image of the page later in the
> stream.
>
> b) You are continuing WAL replay that started from an earlier checkpoint,
> and have already reached consistency.
>
> When you see a WAL record that's been marked with XLR_FPW_SKIPPED, replay it
> normally. It's OK, because the flag means that the page was modified after
> the earlier checkpoint already, and hence we must have seen a full-page
> image of it already. When you see one of the WAL records containing a
> separate full-page-image, ignore it.
>
> This scheme makes the b-case behave just as if the new checkpoint was never
> started. The regular WAL records in the stream are identical to what they
> would've been if the redo-pointer pointed to the earlier checkpoint. And the
> additional FPW records are simply ignored.
>
> In the a-case, it's not safe to replay the records marked with
> XLR_FPW_SKIPPED, because they don't contain FPWs, and you have all the usual
> torn-page hazards that come with that. However, the separate FPW records
> that come later in the stream will fix up those pages.
>
>
> Now, I'm sure there are issues with this scheme I haven't thought about, but
> I wanted to get this written down. Note this does not reduce the overall WAL
> volume - on the contrary - but it ought to reduce the spike.

ISTM that this can increase WAL volume, because a single data change can
generate both a normal WAL record and an FPW. No?

Regards,

-- 
Fujii Masao

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
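
For reference, the two replay cases in the quoted proposal can be written down as a small decision function. This is only a sketch in Python: the function name, the record-kind strings, and the `started_at_latest_redo` parameter are invented for illustration and do not correspond to anything in the PostgreSQL source.

```python
def replay_action(kind, started_at_latest_redo):
    """Return "apply" or "skip" for one WAL record during replay.

    kind is one of:
      "normal"         - ordinary record (with or without its own FPI)
      "fpw_skipped"    - record flagged with the proposed XLR_FPW_SKIPPED
      "standalone_fpi" - the separate full-page-image record written
                         just before the page is flushed
    started_at_latest_redo is True in case a) (replay began at the latest
    checkpoint's redo pointer) and False in case b) (replay began at an
    earlier checkpoint and is already consistent).
    """
    if kind == "fpw_skipped":
        # Case a): the page may be torn, so wait for the later stand-alone FPI.
        # Case b): the page is known good, so replay the record normally.
        return "skip" if started_at_latest_redo else "apply"
    if kind == "standalone_fpi":
        # Case a): this image repairs the page.
        # Case b): the image is redundant and is ignored.
        return "apply" if started_at_latest_redo else "skip"
    return "apply"
```

Each flagged record and its stand-alone image are mirror images of each other: whichever replay case applies, exactly one of the two is used for a given page.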