Hi all, Attached is a patch series that implements two features to the logical replication - ability to define a memory limit for the reorderbuffer (responsible for building the decoded transactions), and ability to stream large in-progress transactions (exceeding the memory limit).
I'm submitting those two changes together, because one builds on the other, and it's beneficial to discuss them together. PART 1: adding logical_work_mem memory limit (0001) --------------------------------------------------- Currently, limiting the amount of memory consumed by logical decoding is tricky (or you might say impossible) for several reasons: * The value is hard-coded, so it's not quite possible to customize it. * The amount of decoded changes to keep in memory is restricted by number of changes. It's not very unclear how this relates to memory consumption, as the change size depends on table structure, etc. * The number is "per (sub)transaction", so a transaction with many subtransactions may easily consume significant amount of memory without actually hitting the limit. So the patch does two things. Firstly, it introduces logical_work_mem, a GUC restricting memory consumed by all transactions currently kept in the reorder buffer. Secondly, it adds a simple memory accounting by tracking the amount of memory used in total (for the whole reorder buffer, to compare against logical_work_mem) and per transaction (so that we can quickly pick transaction to spill to disk). The one wrinkle on the patch is that the memory limit can't be enforced when reading changes spilled to disk - with multiple subtransactions, we can't easily predict how many changes to pre-read for each of them. At that point we still use the existing max_changes_in_memory limit. Luckily, changes introduced in the other parts of the patch should allow addressing this deficiency. PART 2: streaming of large in-progress transactions (0002-0006) --------------------------------------------------------------- Note: This part is split into multiple smaller chunks, addressing different parts of the logical decoding infrastructure. That's mostly to allow easier reviews, though. Ultimately, it's just one patch. Processing large transactions often results in significant apply lag, for a couple of reasons. One reason is network bandwidth - while we do decode the changes incrementally (as we read the WAL), we keep them locally, either in memory, or spilled to files. Then at commit time, all the changes get sent to the downstream (and applied) at the same time. For large transactions the time to do the network transfer may be significant, causing apply lag. This patch extends the logical replication infrastructure (output plugin API, reorder buffer, pgoutput, replication protocol etc.) so allow streaming of in-progress transactions instead of spilling them to local files. The extensions to the API are pretty straightforward. Aside from adding methods to stream changes/messages and commit a streamed transaction, the API needs a function to abort a streamed (sub)transaction, and functions to demarcate a block of streamed changes. To decode a transaction, we need to know all it's subtransactions, and invalidations. Currently, those are only known at commit time (although some assignments may be known earlier), but invalidations are only ever written in the commit record. So far that was fine, because we only decode/replay transactions at commit time, when all of this is known (because it's either in commit record, or written before it). But for in-progress transactions (i.e. the subject of interest here), that is not the case. So the patch modifies WAL-logging to ensure those two bits of information are written immediately (for wal_level=logical). For assignments that was fairly simple, thanks to existing caching. For invalidations, it requires a new WAL record type and a couple of changes in inval.c. On the apply side, we simply receive the streamed changes, write them into a file (one file for toplevel transaction, which is possible thanks to the assignments being known immediately). And then at commit time the changes are replayed locally, without having to copy a large chunk of data over network. WAL overhead ------------ Of course, these changes to WAL logging are not for free - logging assignments individually (instead of multiple subtransactions at once) means higher xlog record overhead. Similarly, (sub)transactions doing a lot of DDL may result in a lot of invalidations written to WAL (again, with full xlog record overhead per invalidation). I've done a number of tests to measure the impact, and for extreme corner cases the additional amount of WAL is about 40% in both cases. By an "extreme corner case" I mean a workloads intentionally triggering many assignments/invalidations, without doing a lot of meaningful work. For assignments, imagine a single-row table (no indexes), and a transaction like this one: BEGIN; UPDATE t SET v = v + 1; SAVEPOINT s1; UPDATE t SET v = v + 1; SAVEPOINT s2; UPDATE t SET v = v + 1; SAVEPOINT s3; ... UPDATE t SET v = v + 1; SAVEPOINT s10; UPDATE t SET v = v + 1; COMMIT; For invalidations, add a CREATE TEMPORARY TABLE to each subtransaction. For more realistic workloads (large table with indexes, runs long enough to generate FPIs, etc.) the overhead drops below 5%. Which is much more acceptable, of course, although not perfect. In both cases, there was pretty much no measurable impact on performance (as measured by tps). I do not think there's a way around this requirement (having assignments and invalidations), if we want to decode in-progress transactions. But perhaps it would be possible to do some sort of caching (say, at command level), to reduce the xlog record overhead? Not sure. All ideas are welcome, of course. In the worst case, I think we can add a GUC enabling this additional logging - when disabled, streaming of in-progress transactions would not be possible. Simplifying ReorderBuffer ------------------------- One interesting consequence of having assignments is that we could get rid of the ReorderBuffer iterator, used to merge changes from subxacts. The assignments allow us to keep changes for each toplevel transaction in a single list, in LSN order, and just walk it. Abort can be performed by remembering position of the first change in each subxact, and just discarding the tail. This is what the apply worker does with the streamed changes and aborts. It would also allow us to enforce the memory limit while restoring transactions spilled to disk, because we would not have the problem with restoring changes for many subtransactions. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
0001-Introduce-logical_work_mem-to-limit-ReorderBuffer-me.patch.gz
Description: application/gzip
0002-Issue-XLOG_XACT_ASSIGNMENT-with-wal_level-logical.patch.gz
Description: application/gzip
0003-Issue-individual-invalidations-with-wal_level-logica.patch.gz
Description: application/gzip
0004-Extend-the-output-plugin-API-with-stream-methods.patch.gz
Description: application/gzip
0005-Implement-streaming-mode-in-ReorderBuffer.patch.gz
Description: application/gzip
0006-Add-support-for-streaming-to-built-in-replication.patch.gz
Description: application/gzip