Hi all,

We have several reports that logical decoding uses much more memory than logical_decoding_work_mem [1][2][3]. For instance, in one of the reports [1], even though the user set logical_decoding_work_mem to '256MB', a walsender process was killed by OOM after using more than 4GB of memory.
As per the discussion in these threads so far, what happened was huge memory fragmentation in rb->tup_context. rb->tup_context uses a GenerationContext with 8MB memory blocks, and a memory block cannot be freed until all memory chunks in it are freed. If there is a long-running transaction making changes, its changes can be spread across many memory blocks, and we end up unable to free those blocks until the long-running transaction is evicted or completes. Since we don't account for fragmentation, block header size, or chunk header size in the per-transaction memory usage (i.e. txn->size), rb->size can stay below logical_decoding_work_mem while the memory actually allocated in the context is much higher (a standalone illustration of this follows below).

We can reproduce this issue with the attached patch rb_excessive_memory_reproducer.patch (not intended for commit), which adds memory usage reporting and includes a test. After running the TAP test contrib/test_decoding/t/002_rb_memory.pl, we can see a memory usage report in the server logs like the following:

LOG:  reorderbuffer memory: logical_decoding_work_mem=268435456 rb->size=17816832 rb->tup_context=1082130304 rb->context=1086267264

That is, logical decoding allocated 1GB of memory even though logical_decoding_work_mem is 256MB.

One idea to deal with this problem is to use MemoryContextMemAllocated() as the reorderbuffer's memory usage instead of rb->size. That is, we evict transactions until the value returned by MemoryContextMemAllocated() drops below logical_decoding_work_mem. However, this could end up evicting many (possibly all) transactions, because the transaction whose decoded tuple data is spread across the memory blocks might be evicted only after all other, larger transactions have been evicted (note that we evict transactions from largest to smallest). Therefore, if we want to do this, we would need to change the eviction algorithm, for example to evict in LSN order instead of transaction-size order. Furthermore, the reorderbuffer changes that are counted in txn->size (and therefore in rb->size) are stored in different memory contexts depending on their kind: for example, decoded tuples are stored in rb->tup_context, ReorderBufferChange entries in rb->change_context, and snapshot data in builder->context. So we would need to sort out which data is stored in which memory context, and which memory contexts should be accounted for in the reorderbuffer's memory usage, which could be a large amount of work.

Another idea is to have a per-transaction memory context for storing decoded tuples. That way, we can ensure that all memory blocks of the context are freed when the transaction is evicted or completes. I think this idea is simpler and worth considering, so I've attached a PoC patch, use_tup_context_per_ctx_v1.patch. Since decoded tuple data is created individually, when the corresponding WAL record is decoded, but released collectively, e.g. at transaction eviction, the bump memory context fits this case best. One exception is that we immediately free the decoded tuple data if the transaction is already aborted; however, in that case I think it is possible to skip the WAL decoding in the first place.
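To make the fragmentation scenario concrete, here is a standalone illustration, not part of either attached patch; the context name, chunk sizes, and loop counts are made up for the demonstration, but the 8MB block size matches what rb->tup_context uses:

#include "postgres.h"
#include "utils/memutils.h"

/*
 * Illustration of the fragmentation described above: chunks of a
 * "long-running transaction" are interleaved with short-lived chunks
 * in the same 8MB generation blocks, so freeing the short-lived
 * chunks releases no blocks.
 */
static void
generation_fragmentation_demo(void)
{
    MemoryContext tup_ctx = GenerationContextCreate(CurrentMemoryContext,
                                                    "demo tuples",
                                                    SLAB_LARGE_BLOCK_SIZE,
                                                    SLAB_LARGE_BLOCK_SIZE,
                                                    SLAB_LARGE_BLOCK_SIZE);
    void       *long_lived[128];

    for (int i = 0; i < 128; i++)
    {
        void       *short_lived[63];

        /* one long-lived chunk lands in roughly every 8MB block... */
        long_lived[i] = MemoryContextAlloc(tup_ctx, 128 * 1024);

        /* ...surrounded by short-lived chunks that we free right away */
        for (int j = 0; j < 63; j++)
            short_lived[j] = MemoryContextAlloc(tup_ctx, 128 * 1024);
        for (int j = 0; j < 63; j++)
            pfree(short_lived[j]);
    }

    /*
     * Only ~16MB of chunks remain live, yet roughly 1GB stays
     * allocated, since every block still contains one live chunk.
     * Compare rb->size vs. rb->tup_context in the log output above.
     */
    elog(LOG, "allocated: %zu", MemoryContextMemAllocated(tup_ctx, true));

    /* Freeing the "long-running transaction" finally releases blocks. */
    for (int i = 0; i < 128; i++)
        pfree(long_lived[i]);
    MemoryContextDelete(tup_ctx);
}

A GenerationContext does not reuse the space of freed chunks within a block; a block is returned only when its last chunk is freed, which is exactly what the interleaving above prevents.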
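For the first idea, a rough sketch of what the check could look like follows. This is not the existing ReorderBufferCheckMemoryLimit(); ReorderBufferLargestTXN() and ReorderBufferSerializeTXN() are existing static functions in reorderbuffer.c, and the rest is illustrative:

/*
 * Sketch (assumed to live in reorderbuffer.c): trigger eviction based
 * on the memory actually allocated in the whole context tree --
 * including block headers, chunk headers, and fragmentation -- rather
 * than on the accounted rb->size.
 */
static void
ReorderBufferCheckMemAllocated(ReorderBuffer *rb)
{
    /* "true" recurses into children: tup_context, change_context, etc. */
    while (MemoryContextMemAllocated(rb->context, true) >
           logical_decoding_work_mem * 1024L)
    {
        ReorderBufferTXN *txn = ReorderBufferLargestTXN(rb);

        if (txn == NULL)
            break;

        /*
         * Note the caveat from above: serializing the largest
         * transaction first may release no blocks at all if a smaller
         * transaction is the one pinning them, hence the need for
         * LSN-order eviction.
         */
        ReorderBufferSerializeTXN(rb, txn);
    }
}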
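And for the second idea, a minimal sketch under the assumption that ReorderBufferTXN grows a (hypothetical) tup_context field; the function names are made up and are not necessarily what the attached PoC patch uses:

/*
 * Sketch: each top-level transaction gets its own bump context for
 * decoded tuples, created lazily and destroyed wholesale.
 */

/* Return the context to allocate a transaction's decoded tuples in. */
static MemoryContext
ReorderBufferGetTupleContext(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
    /* Subtransactions share their top-level transaction's context. */
    ReorderBufferTXN *toptxn = (txn->toptxn != NULL) ? txn->toptxn : txn;

    /* Created lazily: many transactions never produce decoded tuples. */
    if (toptxn->tup_context == NULL)
        toptxn->tup_context = BumpContextCreate(rb->context,
                                                "txn tuples",
                                                ALLOCSET_DEFAULT_SIZES);
    return toptxn->tup_context;
}

/* Called when a top-level transaction is evicted or completes. */
static void
ReorderBufferFreeTupleContext(ReorderBufferTXN *txn)
{
    if (txn->toptxn == NULL && txn->tup_context != NULL)
    {
        /* Returns every block of decoded tuples in one step. */
        MemoryContextDelete(txn->tup_context);
        txn->tup_context = NULL;
    }
}

A bump context has no per-chunk headers (outside of assert-enabled builds) and does not support pfree() of individual chunks, which matches the allocate-individually, free-collectively pattern described above; the immediate-free path for already-aborted transactions would then have to be avoided, e.g. by skipping the WAL decoding as mentioned.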
One thing we need to consider is that the number of transaction entries, and hence of memory contexts, in the reorderbuffer could be very high, since it has entries for both top-level transactions and sub-transactions. To deal with that, the patch stores the decoded tuples of a subtransaction in its top-level transaction's tuple context (as in the sketch above). Since we always evict sub-transactions together with their top-level transaction, I think this would not be a problem. We would still have to check for performance degradation due to using many memory contexts.

Even with this idea, we would still have a mismatch between the actual amount of allocated memory and rb->size, but it would be much better than today. Something like the first idea would still be necessary to fully address the mismatch, and we can discuss it in a separate thread.

Regards,

[1] https://www.postgresql.org/message-id/CAMnUB3oYugXCBLSkih%2BqNsWQPciEwos6g_AMbnz_peNoxfHwyw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/17974-f8c9d353a62f414d%40postgresql.org
[3] https://www.postgresql.org/message-id/DB9PR07MB71808AC6C7770AF2FB36B95BCB252%40DB9PR07MB7180.eurprd07.prod.outlook.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com