I thought that the biggest reason for the pgbench RW slowdown during a 
checkpoint was the flood of dirty page writes increasing the COMMIT latency.  
It turns out that the documentation which states that FPW's (full page writes) 
start "after a checkpoint" really means after a CKPT starts.  And this is the 
real cause of the deep dip in performance.  Maybe only I was fooled... :-)

If we can't eliminate FPW's, can we at least reduce their impact?  Instead of 
writing the before images of pages inline into the WAL, which increases the 
COMMIT latency, write these same images to a separate physical log file.  The 
key idea is that I don't believe that COMMIT's require these buffers to be 
immediately flushed to the physical log.  We only need to flush these before 
the dirty pages are written.  This delay allows the physical before image IO's 
to be decoupled and done in an efficient manner without an impact to COMMIT's.

1. When we generate a physical image add it to an in memory buffer of before 
page images.
2. Put the physical log offset of the before image into the WAL record.  This 
is the current physical log file size plus the offset in the in-memory buffer 
of pages.
3. Set a bit in the BufHdr indicating this was done.
4. COMMIT's do not need to worry about those buffers.
5. Periodically flush the in-memory buffer and clear the bit in the BufHdr.
6. During any dirty page flushing if we see the bit set, which should be rare, 
then make sure we get our before image flushed.  This would be similar to our 
LSN based XLogFlush().
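
To make the six steps above concrete, here is a minimal sketch in C.  It is 
not PostgreSQL code: SketchBufHdr, PhysLog, physlog_add_image(), 
physlog_flush() and write_dirty_page() are all hypothetical names, the 
"physical log" is only a byte counter rather than a real file, and there is 
no locking.  It only shows the ordering rule: the before image must be 
flushed before the dirty page is written, and COMMIT never calls into any of 
this.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE     8192
#define IMG_BUF_PAGES 4      /* capacity of the in-memory image buffer */
#define NUM_BUFFERS   8      /* size of the toy buffer pool */

/* Hypothetical stand-in for the per-buffer header. */
typedef struct {
    bool     dirty;
    bool     image_pending;  /* step 3: before image not yet flushed */
    uint64_t image_end;      /* physical-log offset just past the image */
} SketchBufHdr;

/* Hypothetical physical log: a flushed size plus an in-memory tail. */
typedef struct {
    uint64_t      flushed_size;  /* bytes already durable in the log */
    size_t        used;          /* bytes queued in memory */
    int           flush_count;   /* how many flushes were issued */
    unsigned char pages[IMG_BUF_PAGES * PAGE_SIZE];
} PhysLog;

static SketchBufHdr buffers[NUM_BUFFERS];
static PhysLog      plog;

/* Step 5: flush queued images and clear the bit on covered buffers. */
static void physlog_flush(void)
{
    if (plog.used == 0)
        return;
    /* A real implementation would write and fsync the images here. */
    plog.flushed_size += plog.used;
    plog.used = 0;
    plog.flush_count++;
    for (int i = 0; i < NUM_BUFFERS; i++)
        if (buffers[i].image_pending &&
            buffers[i].image_end <= plog.flushed_size)
            buffers[i].image_pending = false;
}

/* Steps 1-3: queue the image; the return value goes in the WAL record. */
static uint64_t physlog_add_image(int buf, const unsigned char *page)
{
    if (plog.used + PAGE_SIZE > sizeof(plog.pages))
        physlog_flush();                 /* in-memory buffer is full */

    /* Step 2: current log size plus the offset in the memory buffer. */
    uint64_t offset = plog.flushed_size + plog.used;

    memcpy(plog.pages + plog.used, page, PAGE_SIZE);
    plog.used += PAGE_SIZE;

    buffers[buf].dirty = true;
    buffers[buf].image_pending = true;
    buffers[buf].image_end = offset + PAGE_SIZE;
    return offset;
}

/* Step 6: before writing a dirty page, force out its before image. */
static void write_dirty_page(int buf)
{
    if (buffers[buf].image_pending)      /* expected to be rare */
        physlog_flush();
    /* The data page write itself would go here. */
    buffers[buf].dirty = false;
}
```

In this sketch, calling physlog_add_image() for two buffers and then 
write_dirty_page() on either one triggers a single flush that clears the bit 
on both, so the later write of the other page does not flush again.  Step 4 
is visible only by omission: nothing on a COMMIT path touches PhysLog.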

Do we need these before images for more than one CKPT?  I don't think so.  Do 
PITR's require before images since it is a continuous rollforward from a 
restore?  Just some considerations.

Do I need to back this physical log up?  I likely(?) need to deal with 
replication.

Turning off FPW gives about a 20%, maybe more, boost on a pgbench TPC-B RW 
workload which fits in the buffer cache.  Can I get this 20% improvement with a 
separate physical log of before page images?

Doing IO's off to the side, decoupled from the WAL stream, doesn't seem to 
impact COMMIT latency on modern SSD based storage systems.  For instance, you 
can hammer a shared data and WAL SSD filesystem with dirty page writes from the 
CKPT, at near the MAX IOPS of the SSD, and not impact COMMIT latency.  However, 
this presumes that the CKPT's natural spreading of dirty page writes across the 
CKPT target doesn't push too many outstanding IO's into the storage write Q on 
the OS/device.
NOTE: I don't believe the CKPT's throttling is perfect and I think a burst of 
dirty pages into the cache just before a CKPT might cause the Q to be flooded 
and this would then also further slow TPS during the CKPT.  But a fix to this 
is off topic from the FPW issue.

Thanks to Andres Freund for both making me aware of the Q depth impact on 
COMMIT latency and the hint that FPW might also be causing the CKPT slowdown.  
FYI, I always knew about FPW slowdown in general but I just didn't realize it 
was THE primary cause of CKPT TPS slowdown on pgbench.  NOTE: I realize that 
spinning media might exhibit different behavior.  And I did not say dirty 
page writing has NO impact on good SSD's.  It depends, and this is a subject 
for a later date as I have a theory as to why I sometimes see a sawtooth 
performance for pgbench TPC-B and sometimes a square wave but I want to prove 
it first.
