The comment in ReserveXLogInsertLocation() says:

 * This is the performance critical part of XLogInsert that must be serialized
 * across backends. The rest can happen mostly in parallel. Try to keep this
 * section as short as possible, insertpos_lck can be heavily contended on a
 * busy system.
We've worked out a way of reducing contention during ReserveXLogInsertLocation() by using atomic operations. The mechanism requires that we remove the xl_prev field from each WAL record header, which also reduces WAL volume by a few percent.

Currently, we store the start location of the previous WAL record in the xl_prev field of the WAL record header. Redo recovery is a forward-moving process, so we never need to consult xl_prev and read WAL backwards (there is one exception, more on that later [1]). So in theory we should be able to remove this field completely without compromising functionality or correctness.

But the presence of xl_prev lets us guard against torn WAL pages when a WAL record starts on a sector boundary. In case of a torn page, even though the WAL page looks sane, the WAL record could actually be a stale record retained from the older, recycled WAL file. The system guards against this by comparing the xl_prev field stored in the WAL record header with the WAL location of the previous record read; any mismatch is treated as end-of-WAL-stream. So we can't remove xl_prev entirely without giving up that protection.

We don't, however, need the full 8-byte previous-record pointer to detect torn pages. Anything that tells us the record does not belong to the current WAL segno is enough. I propose that we replace xl_prev with a much smaller 2-byte field, xl_walid (or whatever we decide to call it), holding the low-order 16 bits of the WAL segno to which the record belongs. While reading WAL, we check that the xl_walid stored in each record matches the current WAL segno's low-order 16 bits, and if not, treat that as the end of the stream. For this to work, we must ensure that WAL files are recycled in such a way that the xl_walid of the previous (to-be-recycled) file differs from the new one, or else zero out the new WAL file; that seems quite easy to do with the existing infrastructure. Sketches of the header change and the validation follow below.

Because of padding and alignment, replacing the 8-byte xl_prev with a 2-byte xl_walid effectively reduces the WAL record header by a full 8 bytes on a 64-bit machine. That reduces the amount of WAL produced and transferred to the standby: on pgbench tests we see about a 3-5% reduction in WAL traffic, higher in some tests depending on the workload.

There is yet another important benefit of removing xl_prev: we no longer need to track PrevBytePos in XLogCtlInsert. The insertpos_lck spinlock is then only guarding CurrBytePos, so we can replace it with an atomic 64-bit integer and remove the spinlock completely. The comment at the top of ReserveXLogInsertLocation() clearly states the importance of keeping the critical section as small as possible, and this patch achieves exactly that by using atomic variables (see the sketch below).
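To make the header change concrete, here is a minimal sketch of what the record header could look like. It is based on the current XLogRecord definition in src/include/access/xlogrecord.h; the field name xl_walid is the one proposed above, but the exact layout in the attached patch may differ:

typedef struct XLogRecord
{
    uint32          xl_tot_len;    /* total len of entire record */
    TransactionId   xl_xid;        /* xact id */
    uint16          xl_walid;      /* low-order 16 bits of the WAL segno
                                    * this record belongs to; replaces
                                    * the 8-byte xl_prev */
    uint8           xl_info;       /* flag bits */
    RmgrId          xl_rmid;       /* resource manager for this record */
    pg_crc32c       xl_crc;        /* CRC for this record */
} XLogRecord;

This accounts for the full 8-byte saving: the header shrinks from 24 to 16 bytes, because the 2-byte xl_walid simply occupies what used to be alignment padding in front of xl_crc.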
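The two checks that make this safe could look roughly as follows. This is a sketch, not the patch itself: report_invalid_record() and the XLByteToSeg() macro are the existing facilities, but the surrounding variables and the two install_* helpers are hypothetical.

/*
 * Read side (in the record-header validation): an xl_walid that does
 * not match the low-order 16 bits of the segment being read marks the
 * end of the WAL stream, just as an xl_prev mismatch does today.
 */
XLogSegNo   segno;

XLByteToSeg(RecPtr, segno, state->wal_segment_size);
if (record->xl_walid != (uint16) segno)
{
    report_invalid_record(state,
                          "record with incorrect xl_walid %u at %X/%X",
                          record->xl_walid,
                          (uint32) (RecPtr >> 32), (uint32) RecPtr);
    return false;
}

/*
 * Recycle side: the low-order 16 bits of two segnos collide only when
 * they differ by a multiple of 65536 segments, so recycling can almost
 * always rename the old file as usual; in the rare collision case we
 * zero-fill instead, so that a stale record can never carry a matching
 * xl_walid.
 */
if ((uint16) old_segno == (uint16) new_segno)
    install_zero_filled_segment(new_segno);         /* hypothetical helper */
else
    install_recycled_segment(old_segno, new_segno); /* hypothetical helper */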
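With PrevBytePos gone, the reservation boils down to a single fetch-add. A minimal sketch, assuming CurrBytePos becomes a pg_atomic_uint64 (port/atomics.h) and reusing xlog.c's existing byte-position conversion helpers; note that the function no longer needs to hand back a PrevPtr, since there is no xl_prev to fill in:

static void
ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos)
{
    XLogCtlInsert *Insert = &XLogCtl->Insert;
    uint64      startbytepos;

    size = MAXALIGN(size);

    /*
     * The whole former spinlock-protected critical section collapses to
     * one atomic fetch-add on the usable-byte position counter.
     */
    startbytepos = pg_atomic_fetch_add_u64(&Insert->CurrBytePos, size);

    *StartPos = XLogBytePosToRecPtr(startbytepos);
    *EndPos = XLogBytePosToEndRecPtr(startbytepos + size);
}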
Pavan ran some micro-benchmarks to measure the effectiveness of the approach. I (Pavan) wrote a wrapper on top of ReserveXLogInsertLocation() and exposed it as a SQL-callable function, then used pgbench with 1-16 clients, where each client effectively calls ReserveXLogInsertLocation() 1M times. The following results, from master and the patched code, are averaged across 5 runs; the tests were done on an i2.2xlarge AWS instance.

HEAD
 1 ... 24.24 tps
 2 ... 18.12 tps
 4 ... 10.95 tps
 8 ...  9.05 tps
16 ...  8.44 tps

As you would notice, the spinlock contention is evident even with just 2 clients and gets worse with 4 or more.

Patched
 1 ... 35.08 tps
 2 ... 31.99 tps
 4 ... 30.48 tps
 8 ... 40.44 tps
16 ... 50.14 tps

The patched code, on the other hand, scales much better as the client count grows. Those are micro-benchmarks, of course; you need a multi-CPU workload with heavy WAL inserts to show the benefits in full.

[1] pg_rewind is the only exception; it uses xl_prev to find the previous checkpoint record. But we can always start from the beginning of the WAL segment and read forward until we find the checkpoint record. The patch does just that, and passes pg_rewind's TAP tests.
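For the record, a hedged sketch of that forward scan, assuming the existing XLogReader API as pg_rewind uses it; xlogreader, searchptr and WalSegSz stand for pg_rewind's reader state, the starting point of the search, and the segment size. This illustrates the idea, not the patch's exact code:

XLogRecPtr  ptr;
XLogRecPtr  lastchkpt = InvalidXLogRecPtr;
XLogRecord *record;
char       *errormsg;

/* position at the first valid record of the containing segment */
ptr = XLogFindNextRecord(xlogreader,
                         searchptr - XLogSegmentOffset(searchptr, WalSegSz));

while ((record = XLogReadRecord(xlogreader, ptr, &errormsg)) != NULL &&
       xlogreader->ReadRecPtr < searchptr)
{
    uint8   info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;

    /* remember the last checkpoint record seen so far */
    if (XLogRecGetRmid(xlogreader) == RM_XLOG_ID &&
        (info == XLOG_CHECKPOINT_SHUTDOWN || info == XLOG_CHECKPOINT_ONLINE))
        lastchkpt = xlogreader->ReadRecPtr;

    ptr = InvalidXLogRecPtr;    /* continue from the reader's position */
}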
Patch credit: Simon Riggs and Pavan Deolasee

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: pg_wal_header_reduction_v1.patch