Hi all,

As the comment of ReorderBufferLargestTXN() says, it's very slow with many subtransactions:
/*
 * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
 *
 * XXX With many subtransactions this might be quite slow, because we'll have
 * to walk through all of them. There are some options how we could improve
 * that: (a) maintain some secondary structure with transactions sorted by
 * amount of changes, (b) not looking for the entirely largest transaction,
 * but e.g. for transaction using at least some fraction of the memory limit,
 * and (c) evicting multiple transactions at once, e.g. to free a given portion
 * of the memory limit (e.g. 50%).
 */

This is because the reorderbuffer has a transaction entry for each top-level transaction and each subtransaction, and ReorderBufferLargestTXN() walks through all of those entries to pick the transaction to evict. I've heard a report internally that replication lag became huge when decoding transactions each consisting of 500k subtransactions. Note that ReorderBufferLargestTXN() is used only in non-streaming mode.

Here is a test script for a many-subtransactions scenario. In my environment, logical decoding took over 2 minutes to decode one top-level transaction having 100k subtransactions.

-----
create table test (c int);

create or replace function testfn (cnt int) returns void as $$
begin
  for i in 1..cnt loop
    begin
      insert into test values (i);
    exception when division_by_zero then
      raise notice 'caught error';
      return;
    end;
  end loop;
end;
$$ language plpgsql;

select testfn(100000);

set logical_decoding_work_mem to '4MB';

-- assumes a logical replication slot 's' has been created beforehand
select count(*) from pg_logical_slot_peek_changes('s', null, null);
-----

To deal with this problem, I initially thought of idea (a) mentioned in the comment: use a binary heap to keep the transactions sorted by the amount of changes (i.e., size). But it doesn't seem to be a good idea to maintain all transactions ordered by size, since the size of each transaction can change frequently.

The attached patch uses a different approach consisting of three strategies: (1) maintain a list of transactions whose size is larger than 10% of logical_decoding_work_mem, and preferentially evict a transaction from this list. If the list is empty, all transactions are small enough, so (2) we evict the oldest top-level transaction from the rb->toplevel_by_lsn list. Evicting older transactions would help in freeing memory blocks in the GenerationContext. Finally, if this is also empty, (3) we evict a transaction whose size is > 0. Note that even after a transaction is evicted, its ReorderBufferTXN entry is not removed from rb->by_txn; its size simply becomes 0. In the worst case, where all (quite a few) transactions are smaller than 10% of the memory limit, we might end up checking many transactions to find entries with a non-zero size to evict. So the patch adds a new list to maintain all transactions that have at least one change in memory.

Summarizing the algorithm I've implemented in the patch:

1. Pick a transaction from the list of large transactions (larger than 10% of the memory limit).
2. Pick a transaction from the top-level transaction list, in LSN order.
3. Pick a transaction from the list of transactions that have at least one change in memory.

With the patch, the above test case completed within 3 seconds in my environment.

As a side note, idea (c) mentioned in the comment, evicting multiple transactions at once to free a given portion of the memory, would also help avoid going back and forth across the memory threshold. It's also worth considering.
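To make the selection order above concrete, here is a rough sketch (not the attached patch itself) of how the three strategies could be consulted. The list names rb->large_txns and rb->txns_with_changes and their dlist_node members large_node/mem_node are placeholder names I'm using for illustration:

#include "postgres.h"
#include "replication/reorderbuffer.h"

/* Rough sketch only; the attached patch may differ in detail. */
static ReorderBufferTXN *
ReorderBufferPickTXNToEvict(ReorderBuffer *rb)
{
	dlist_iter	iter;

	/* (1) prefer a transaction using more than 10% of logical_decoding_work_mem */
	if (!dlist_is_empty(&rb->large_txns))
		return dlist_head_element(ReorderBufferTXN, large_node,
								  &rb->large_txns);

	/* (2) otherwise the oldest top-level transaction, in LSN order */
	dlist_foreach(iter, &rb->toplevel_by_lsn)
	{
		ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN, node,
												iter.cur);

		/* skip entries already spilled to disk (their size is reset to 0) */
		if (txn->size > 0)
			return txn;
	}

	/* (3) otherwise any transaction that still has a change in memory */
	if (!dlist_is_empty(&rb->txns_with_changes))
		return dlist_head_element(ReorderBufferTXN, mem_node,
								  &rb->txns_with_changes);

	return NULL;				/* nothing evictable */
}

The idea is that steps (1) and (3) are cheap list lookups and step (2) stops at the first evictable entry, so we no longer have to scan every entry in rb->by_txn.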
Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

improve_eviction_rb_poc.patch
Description: Binary data