Teaser: a change made in 9.4 to simplify WAL segment compression made it easier to compress a low-activity-period WAL segment from 16 MB to about 27 kB ... but much harder to do better than that, as I was previously doing (about two orders of magnitude better).
At $work, we have a usually-low-activity PG database, so the used portion of each 16 MB WAL segment is almost always far smaller than 16 MB, and it's a big win for archived-WAL storage space if an archive_command can be written that compresses those files effectively. The database has also been running on a pre-9.4 version, and I'm currently migrating it to 9.5.3.

As I understand it, 9.4 is where commit 9a20a9b landed, which changed what happens in the unwritten 'tail' of log segments. In my understanding, before 9.4 the tail of a log segment on disk simply wasn't written, so (since segment recycling just renames a file that held some earlier segment) the remaining content was whatever had been there before recycling. That was never a problem for recovery (which could tell when it reached the end of real data), but it was not well compressible with a generic tool like gzip. Specialized tools like pg_clearxlogtail existed, but had to know too much about the internal format, and ended up unmaintained and therefore difficult to trust.

The change in 9.4 included this, from the git comment:

    This has one user-visible change: switching to a new WAL segment with
    pg_switch_xlog() now fills the remaining unused portion of the segment
    with zeros.

... thus making the segments easily compressible with bog-standard tools. So I can just point gzip at one of our WAL segments from a light-activity period and it goes from 16 MB down to about 27 kB. Nice, right?

But why does it break my earlier approach, which was doing about two orders of magnitude better, getting low-activity WAL segments down to 200 to 300 *bytes*? (Seriously: my last solid year of archived WAL is contained in a 613 MB zip file.)

That approach was based on using rsync (also bog-standard) to tease apart the changed and unchanged bits between the newly-archived segment and the last-seen content of the file with the same i-number. You would expect that to work just as well when the tail is always zeros as it did before, right?

What's breaking it now is the tiny bit of fine print that's in the code comment for AdvanceXLInsertBuffer but not in the git comment above:

    * ... Any new pages are initialized to zeros, with pages headers
    * initialized properly.

That innocuous "headers initialized" means the tail of the file is *almost* all zeros, but every 8 kB there is a tiny page header, and in each tiny header there is *one byte* that differs from its value in the pre-recycle content at the same i-node, because that byte reflects the WAL segment number. Before the 9.4 change there were still headers there, and they did contain a byte matching the segment number, but in the unwritten portion it naturally matched the pre-recycle segment number, and rsync easily detected the whole unchanged tail of the file. Now there is one changed byte every 8 kB, and the rsync output, instead of being 100x better than vanilla gzip, is about 3x worse.

Taking a step back, isn't overwriting the whole unused tail of each 16 MB segment really just an I/O-intensive way of telling the archive_command where the valid data ends? Could that not be done more efficiently by adding another code, say %e, to archive_command, which would be substituted with the offset of the end of the XLOG_SWITCH record? That way, however archive_command is implemented, it could simply know how much of the file to copy.
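To make the idea concrete, here is a rough sketch of what such an archive_command might look like, following the shape of the example in the docs. %e here is of course the *proposed* substitution, not one that exists today; %p and %f are the existing path/filename codes, and the archive directory is just a placeholder:

    # postgresql.conf -- sketch only; %e (byte offset of end of valid data)
    # is hypothetical, the archive path is a placeholder.
    archive_command = 'test ! -f /mnt/archive/%f.gz && head -c %e %p | gzip > /mnt/archive/%f.gz'

(The restore side would presumably need to pad the decompressed file back out to 16 MB before handing it to the server, but zero-padding is cheap.)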
Would it then be possible to go back to the old behavior (or make it selectable) of not overwriting the full 16 MB every time? Or did the 9.4 work also change enough other logic that things would now break if the tail were left unwritten?

-Chap