Hi all,

We have several reports that logical decoding uses much more memory than logical_decoding_work_mem [1][2][3]. For instance, in one of the reports [1], even though the user set logical_decoding_work_mem to '256MB', a walsender process was killed by OOM after using more than 4GB of memory.
As per the discussion in these threads so far, what happened was huge memory fragmentation in rb->tup_context. rb->tup_context uses a GenerationContext with 8MB memory blocks, and a memory block cannot be freed until all memory chunks in it are freed. If there is a long-running transaction making changes, its changes can be spread across many memory blocks, and we end up unable to free those blocks until the long-running transaction is evicted or completes. Since we don't account for fragmentation, block header size, or chunk header size in the per-transaction memory usage (i.e. txn->size), rb->size can stay below logical_decoding_work_mem while the memory actually allocated in the context is much higher (a standalone illustration of this follows below).

We can reproduce this issue with the attached patch rb_excessive_memory_reproducer.patch (not intended for commit), which adds memory usage reporting and includes a test. After running the TAP test contrib/test_decoding/t/002_rb_memory.pl, we can see a memory usage report in the server logs like the following:

LOG:  reorderbuffer memory: logical_decoding_work_mem=268435456 rb->size=17816832 rb->tup_context=1082130304 rb->context=1086267264

That is, logical decoding allocated 1GB of memory even though logical_decoding_work_mem is 256MB.

One idea to deal with this problem is to use MemoryContextMemAllocated() as the reorderbuffer's memory usage instead of rb->size. That is, we evict transactions until the value returned by MemoryContextMemAllocated() drops below logical_decoding_work_mem. However, this could end up evicting many (possibly all) transactions, because the transaction whose decoded tuple data is spread across the memory blocks might be evicted only after all other, larger transactions have been evicted (note that we evict transactions from largest to smallest). Therefore, if we want to do this, we would need to change the eviction algorithm, for example to evict in LSN order instead of transaction-size order. Furthermore, the reorderbuffer changes that are counted in txn->size (and therefore in rb->size) are stored in different memory contexts depending on their kind: for example, decoded tuples are stored in rb->tup_context, ReorderBufferChange entries in rb->change_context, and snapshot data in builder->context. So we would need to sort out which data is stored in which memory context, and which memory contexts should be accounted for in the reorderbuffer's memory usage, which could be a large amount of work.

Another idea is to have a per-transaction memory context for storing decoded tuples. That way, we can ensure that all memory blocks of the context are freed when the transaction is evicted or completes. I think this idea is simpler and worth considering, so I've attached a PoC patch, use_tup_context_per_ctx_v1.patch. Since decoded tuple data is created individually, when the corresponding WAL record is decoded, but released collectively, e.g. at transaction eviction, the bump memory context fits this case best. One exception is that we immediately free the decoded tuple data if the transaction is already aborted; however, in that case I think it is possible to skip the WAL decoding in the first place.
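To make the fragmentation scenario concrete, here is a standalone illustration, not part of either attached patch; the context name, chunk sizes, and loop counts are made up for the demonstration, but the 8MB block size matches what rb->tup_context uses:

#include "postgres.h"
#include "utils/memutils.h"

/*
 * Illustration of the fragmentation described above: chunks of a
 * "long-running transaction" are interleaved with short-lived chunks
 * in the same 8MB generation blocks, so freeing the short-lived
 * chunks releases no blocks.
 */
static void
generation_fragmentation_demo(void)
{
    MemoryContext tup_ctx = GenerationContextCreate(CurrentMemoryContext,
                                                    "demo tuples",
                                                    SLAB_LARGE_BLOCK_SIZE,
                                                    SLAB_LARGE_BLOCK_SIZE,
                                                    SLAB_LARGE_BLOCK_SIZE);
    void       *long_lived[128];

    for (int i = 0; i < 128; i++)
    {
        void       *short_lived[63];

        /* one long-lived chunk lands in roughly every 8MB block... */
        long_lived[i] = MemoryContextAlloc(tup_ctx, 128 * 1024);

        /* ...surrounded by short-lived chunks that we free right away */
        for (int j = 0; j < 63; j++)
            short_lived[j] = MemoryContextAlloc(tup_ctx, 128 * 1024);
        for (int j = 0; j < 63; j++)
            pfree(short_lived[j]);
    }

    /*
     * Only ~16MB of chunks remain live, yet roughly 1GB stays
     * allocated, since every block still contains one live chunk.
     * Compare rb->size vs. rb->tup_context in the log output above.
     */
    elog(LOG, "allocated: %zu", MemoryContextMemAllocated(tup_ctx, true));

    /* Freeing the "long-running transaction" finally releases blocks. */
    for (int i = 0; i < 128; i++)
        pfree(long_lived[i]);
    MemoryContextDelete(tup_ctx);
}

A GenerationContext does not reuse the space of freed chunks within a block; a block is returned only when its last chunk is freed, which is exactly what the interleaving above prevents.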
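For the first idea, a rough sketch of what the check could look like follows. This is not the existing ReorderBufferCheckMemoryLimit(); ReorderBufferLargestTXN() and ReorderBufferSerializeTXN() are existing static functions in reorderbuffer.c, and the rest is illustrative:

/*
 * Sketch (assumed to live in reorderbuffer.c): trigger eviction based
 * on the memory actually allocated in the whole context tree --
 * including block headers, chunk headers, and fragmentation -- rather
 * than on the accounted rb->size.
 */
static void
ReorderBufferCheckMemAllocated(ReorderBuffer *rb)
{
    /* "true" recurses into children: tup_context, change_context, etc. */
    while (MemoryContextMemAllocated(rb->context, true) >
           logical_decoding_work_mem * 1024L)
    {
        ReorderBufferTXN *txn = ReorderBufferLargestTXN(rb);

        if (txn == NULL)
            break;

        /*
         * Note the caveat from above: serializing the largest
         * transaction first may release no blocks at all if a smaller
         * transaction is the one pinning them, hence the need for
         * LSN-order eviction.
         */
        ReorderBufferSerializeTXN(rb, txn);
    }
}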
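And for the second idea, a minimal sketch under the assumption that ReorderBufferTXN grows a (hypothetical) tup_context field; the function names are made up and are not necessarily what the attached PoC patch uses:

/*
 * Sketch: each top-level transaction gets its own bump context for
 * decoded tuples, created lazily and destroyed wholesale.
 */

/* Return the context to allocate a transaction's decoded tuples in. */
static MemoryContext
ReorderBufferGetTupleContext(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
    /* Subtransactions share their top-level transaction's context. */
    ReorderBufferTXN *toptxn = (txn->toptxn != NULL) ? txn->toptxn : txn;

    /* Created lazily: many transactions never produce decoded tuples. */
    if (toptxn->tup_context == NULL)
        toptxn->tup_context = BumpContextCreate(rb->context,
                                                "txn tuples",
                                                ALLOCSET_DEFAULT_SIZES);
    return toptxn->tup_context;
}

/* Called when a top-level transaction is evicted or completes. */
static void
ReorderBufferFreeTupleContext(ReorderBufferTXN *txn)
{
    if (txn->toptxn == NULL && txn->tup_context != NULL)
    {
        /* Returns every block of decoded tuples in one step. */
        MemoryContextDelete(txn->tup_context);
        txn->tup_context = NULL;
    }
}

A bump context has no per-chunk headers (outside of assert-enabled builds) and does not support pfree() of individual chunks, which matches the allocate-individually, free-collectively pattern described above; the immediate-free path for already-aborted transactions would then have to be avoided, e.g. by skipping the WAL decoding as mentioned.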
One thing we need to consider is that the number of transaction entries, and hence of memory contexts, in the reorderbuffer could be very high, since it has entries for both top-level transactions and sub-transactions. To deal with that, the patch stores the decoded tuples of a subtransaction in its top-level transaction's tuple context (as in the sketch above). Since we always evict sub-transactions together with their top-level transaction, I think this would not be a problem. We would still have to check for performance degradation due to using many memory contexts.

Even with this idea, we would still have a mismatch between the actual amount of allocated memory and rb->size, but it would be much better than today. Something like the first idea would still be necessary to fully address the mismatch, and we can discuss it in a separate thread.

Regards,

[1] https://www.postgresql.org/message-id/CAMnUB3oYugXCBLSkih%2BqNsWQPciEwos6g_AMbnz_peNoxfHwyw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/17974-f8c9d353a62f414d%40postgresql.org
[3] https://www.postgresql.org/message-id/DB9PR07MB71808AC6C7770AF2FB36B95BCB252%40DB9PR07MB7180.eurprd07.prod.outlook.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com