On Wed, Sep 11, 2019 at 2:04 PM Peter Geoghegan <p...@bowt.ie> wrote:
> > I haven't measured how these changes affect WAL size yet.
> > Do you have any suggestions on how to automate testing of new WAL records?
> > Is there any suitable place in regression tests?
>
> I don't know about the regression tests (I doubt that there is a
> natural place for such a test), but I came up with a rough test case.
> I more or less copied the approach that you took with the index build
> WAL reduction patches, though I also figured out a way of subtracting
> heapam WAL overhead to get a real figure. I attach the test case --
> note that you'll need to use the "land" database with this. (This test
> case might need to be improved, but it's a good start.)
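To recap, the measurement at the heart of that test case boils down to
something like the following sketch (not the attached script itself --
the table and column names here are made up):

-- Measure WAL generated by a workload, using psql's \gset to stash
-- the starting WAL position. Table/column names are hypothetical.
SELECT pg_current_wal_lsn() AS wal_start \gset

-- The workload of interest: inserts that also go through the index.
INSERT INTO land_parcels (parcel_id)
SELECT i FROM generate_series(1, 1000000) i;

-- Total WAL generated by the run. Running the same workload against
-- the table with its index dropped gives a heapam-only baseline to
-- subtract, leaving the real nbtree figure.
SELECT pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), :'wal_start'))
           AS wal_generated;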
Earlier today I used a test script similar to the "nbtree_wal_test.sql" test script that I posted on September 11th. I am concerned about the WAL overhead for cases that don't benefit from the patch (usually because they turn off deduplication altogether).

The details of the index tested were different this time, though. I used an index with the smallest possible tuple size: 16 bytes (that's the smallest possible size on 64-bit systems, which is what almost everybody uses these days -- the 8 byte IndexTupleData header plus a MAXALIGN()'d key area cannot come to less than 16 bytes). So any index with one or two int4 columns (or one int8 column) will generally have 16 byte IndexTuples, at least when there are no NULLs in the index. In general, 16 byte wide tuples are very, very common.

What I saw suggests that we will need to remove the new "postingoff" field from xl_btree_insert. (We can create a new XLog record for leaf page inserts that also need to split a posting list, without changing much else.) The way that *alignment* of WAL records affects these common 16 byte IndexTuple cases is the real problem.

Adding "postingoff" to xl_btree_insert increases the WAL required for INSERT_LEAF records by two bytes (sizeof(OffsetNumber)), as you'd expect -- pg_waldump output shows that they're 66 bytes, whereas they're only 64 bytes on the master branch. That doesn't sound so bad, but once you consider the MAXALIGN()'d size of whole records, it's really an extra 8 bytes. That is totally unacceptable. The vast majority of nbtree WAL records are bound to be INSERT_LEAF records, so as things stand we have added (almost) 12.5% space overhead (8 bytes on top of 64) to nbtree WAL for these common cases, which don't benefit from deduplication at all.

I haven't really looked into other types of WAL record just yet. The real-world overhead that we're adding to xl_btree_vacuum records is something that I will have to look into separately. I'm already pretty sure that adding two bytes to xl_btree_split is okay, though, because split records are far less numerous than xl_btree_insert records, and aren't affected by alignment in the same way (they're already several hundred bytes in almost all cases).

I also noticed something positive: the overhead of xl_btree_dedup WAL records seems to be very low with indexes that have hundreds of logical tuples for each distinct integer value. We don't seem to have a problem with "deduplication thrashing".
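To make the alignment effect concrete, here is a sketch of the two
record layouts and the arithmetic (illustrative only -- the "_patched"
struct name is made up, and the patch's actual layout may differ in
detail):

#include "storage/off.h"        /* OffsetNumber is a uint16 */

/* On master, the record-specific data for a leaf page insert: */
typedef struct xl_btree_insert
{
    OffsetNumber offnum;        /* 2 bytes */
} xl_btree_insert;

/*
 * The patch as tested adds the posting list split offset to the same
 * struct, so every leaf insert pays for it:
 */
typedef struct xl_btree_insert_patched  /* hypothetical name */
{
    OffsetNumber offnum;        /* 2 bytes */
    OffsetNumber postingoff;    /* 2 bytes; 0 unless a posting list
                                 * must be split */
} xl_btree_insert_patched;

/*
 * pg_waldump reports the total INSERT_LEAF record size (header, block
 * reference, and payload) as 64 bytes on master and 66 bytes with the
 * patch. But WAL records are stored MAXALIGN()'d, so on a 64-bit
 * system the space actually consumed goes from MAXALIGN(64) == 64 to
 * MAXALIGN(66) == 72 bytes -- 8 extra bytes, or 12.5%.
 */

--
Peter Geoghegan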