The comment in ReserveXLogInsertLocation() says:

 * This is the performance critical part of XLogInsert that must be serialized
 * across backends. The rest can happen mostly in parallel. Try to keep this
 * section as short as possible, insertpos_lck can be heavily contended on a
 * busy system.
We've worked out a way of reducing contention during ReserveXLogInsertLocation() by using atomic operations. The mechanism requires that we remove the xl_prev field from each WAL record header, which also reduces WAL volume by a few percent.

Currently, we store the start location of the previous WAL record in the xl_prev field of the WAL record header. Redo recovery is a forward-moving process, so we never need to consult xl_prev and read WAL backwards (there is one exception, more on that later [1]). So in theory we should be able to remove this field completely without compromising functionality or correctness.

But the presence of xl_prev lets us guard against torn WAL pages when a WAL record starts on a sector boundary. In case of a torn page, even though the WAL page looks sane, the WAL record could actually be a stale record retained from the older, recycled WAL file. The system guards against this by comparing the xl_prev field stored in the WAL record header with the WAL location of the previous record read; any mismatch is treated as end-of-WAL-stream. So we can't remove xl_prev entirely without giving up that protection.

We don't, however, need the full 8-byte previous-record pointer to detect torn pages. Anything that tells us the record does not belong to the current WAL segno is enough. I propose that we replace xl_prev with a much smaller 2-byte field, xl_walid (or whatever we decide to call it), holding the low-order 16 bits of the WAL segno to which the record belongs. While reading WAL, we check that the xl_walid stored in each record matches the current WAL segno's low-order 16 bits, and if not, treat that as the end of the stream. For this to work, we must ensure that WAL files are recycled in such a way that the xl_walid of the previous (to-be-recycled) file differs from the new one, or else zero out the new WAL file; that seems quite easy to do with the existing infrastructure. Sketches of the header change and the validation follow below.

Because of padding and alignment, replacing the 8-byte xl_prev with a 2-byte xl_walid effectively reduces the WAL record header by a full 8 bytes on a 64-bit machine. That reduces the amount of WAL produced and transferred to the standby: on pgbench tests we see about a 3-5% reduction in WAL traffic, higher in some tests depending on the workload.

There is yet another important benefit of removing xl_prev: we no longer need to track PrevBytePos in XLogCtlInsert. The insertpos_lck spinlock is then only guarding CurrBytePos, so we can replace it with an atomic 64-bit integer and remove the spinlock completely. The comment at the top of ReserveXLogInsertLocation() clearly states the importance of keeping the critical section as small as possible, and this patch achieves exactly that by using atomic variables (see the sketch below).
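To make the header change concrete, here is a minimal sketch of what the record header could look like. It is based on the current XLogRecord definition in src/include/access/xlogrecord.h; the field name xl_walid is the one proposed above, but the exact layout in the attached patch may differ:

typedef struct XLogRecord
{
    uint32          xl_tot_len;    /* total len of entire record */
    TransactionId   xl_xid;        /* xact id */
    uint16          xl_walid;      /* low-order 16 bits of the WAL segno
                                    * this record belongs to; replaces
                                    * the 8-byte xl_prev */
    uint8           xl_info;       /* flag bits */
    RmgrId          xl_rmid;       /* resource manager for this record */
    pg_crc32c       xl_crc;        /* CRC for this record */
} XLogRecord;

This accounts for the full 8-byte saving: the header shrinks from 24 to 16 bytes, because the 2-byte xl_walid simply occupies what used to be alignment padding in front of xl_crc.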
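The two checks that make this safe could look roughly as follows. This is a sketch, not the patch itself: report_invalid_record() and the XLByteToSeg() macro are the existing facilities, but the surrounding variables and the two install_* helpers are hypothetical.

/*
 * Read side (in the record-header validation): an xl_walid that does
 * not match the low-order 16 bits of the segment being read marks the
 * end of the WAL stream, just as an xl_prev mismatch does today.
 */
XLogSegNo   segno;

XLByteToSeg(RecPtr, segno, state->wal_segment_size);
if (record->xl_walid != (uint16) segno)
{
    report_invalid_record(state,
                          "record with incorrect xl_walid %u at %X/%X",
                          record->xl_walid,
                          (uint32) (RecPtr >> 32), (uint32) RecPtr);
    return false;
}

/*
 * Recycle side: the low-order 16 bits of two segnos collide only when
 * they differ by a multiple of 65536 segments, so recycling can almost
 * always rename the old file as usual; in the rare collision case we
 * zero-fill instead, so that a stale record can never carry a matching
 * xl_walid.
 */
if ((uint16) old_segno == (uint16) new_segno)
    install_zero_filled_segment(new_segno);         /* hypothetical helper */
else
    install_recycled_segment(old_segno, new_segno); /* hypothetical helper */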
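With PrevBytePos gone, the reservation boils down to a single fetch-add. A minimal sketch, assuming CurrBytePos becomes a pg_atomic_uint64 (port/atomics.h) and reusing xlog.c's existing byte-position conversion helpers; note that the function no longer needs to hand back a PrevPtr, since there is no xl_prev to fill in:

static void
ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos)
{
    XLogCtlInsert *Insert = &XLogCtl->Insert;
    uint64      startbytepos;

    size = MAXALIGN(size);

    /*
     * The whole former spinlock-protected critical section collapses to
     * one atomic fetch-add on the usable-byte position counter.
     */
    startbytepos = pg_atomic_fetch_add_u64(&Insert->CurrBytePos, size);

    *StartPos = XLogBytePosToRecPtr(startbytepos);
    *EndPos = XLogBytePosToEndRecPtr(startbytepos + size);
}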
Pavan ran some micro-benchmarks to measure the effectiveness of the approach. I (Pavan) wrote a wrapper on top of ReserveXLogInsertLocation() and exposed it as a SQL-callable function, then used pgbench with 1-16 clients, where each client effectively calls ReserveXLogInsertLocation() 1M times. The following results, from master and the patched code, are averaged across 5 runs; the tests were done on an i2.2xlarge AWS instance.

HEAD
 1 ... 24.24 tps
 2 ... 18.12 tps
 4 ... 10.95 tps
 8 ...  9.05 tps
16 ...  8.44 tps

As you would notice, the spinlock contention is evident even with just 2 clients and gets worse with 4 or more.

Patched
 1 ... 35.08 tps
 2 ... 31.99 tps
 4 ... 30.48 tps
 8 ... 40.44 tps
16 ... 50.14 tps

The patched code, on the other hand, scales much better as the client count grows. Those are micro-benchmarks, of course; you need a multi-CPU workload with heavy WAL inserts to show the benefits in full.

[1] pg_rewind is the only exception; it uses xl_prev to find the previous checkpoint record. But we can always start from the beginning of the WAL segment and read forward until we find the checkpoint record. The patch does just that, and passes pg_rewind's TAP tests.
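For the record, a hedged sketch of that forward scan, assuming the existing XLogReader API as pg_rewind uses it; xlogreader, searchptr and WalSegSz stand for pg_rewind's reader state, the starting point of the search, and the segment size. This illustrates the idea, not the patch's exact code:

XLogRecPtr  ptr;
XLogRecPtr  lastchkpt = InvalidXLogRecPtr;
XLogRecord *record;
char       *errormsg;

/* position at the first valid record of the containing segment */
ptr = XLogFindNextRecord(xlogreader,
                         searchptr - XLogSegmentOffset(searchptr, WalSegSz));

while ((record = XLogReadRecord(xlogreader, ptr, &errormsg)) != NULL &&
       xlogreader->ReadRecPtr < searchptr)
{
    uint8   info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;

    /* remember the last checkpoint record seen so far */
    if (XLogRecGetRmid(xlogreader) == RM_XLOG_ID &&
        (info == XLOG_CHECKPOINT_SHUTDOWN || info == XLOG_CHECKPOINT_ONLINE))
        lastchkpt = xlogreader->ReadRecPtr;

    ptr = InvalidXLogRecPtr;    /* continue from the reader's position */
}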
Patch credit: Simon Riggs and Pavan Deolasee

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: pg_wal_header_reduction_v1.patch