Hi all,

As the comment of ReorderBufferLargestTXN() says, it's very slow with many subtransactions:
/*
 * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
 *
 * XXX With many subtransactions this might be quite slow, because we'll have
 * to walk through all of them. There are some options how we could improve
 * that: (a) maintain some secondary structure with transactions sorted by
 * amount of changes, (b) not looking for the entirely largest transaction,
 * but e.g. for transaction using at least some fraction of the memory limit,
 * and (c) evicting multiple transactions at once, e.g. to free a given portion
 * of the memory limit (e.g. 50%).
 */

This is because the reorderbuffer has a transaction entry for each top-level transaction and each subtransaction, and ReorderBufferLargestTXN() walks through all of those entries to pick the transaction to evict. I've heard a report internally that replication lag became huge when decoding transactions each consisting of 500k subtransactions. Note that ReorderBufferLargestTXN() is used only in non-streaming mode.

Here is a test script for a many-subtransactions scenario. In my environment, logical decoding took over 2 minutes to decode one top-level transaction having 100k subtransactions.

-----
create table test (c int);

create or replace function testfn (cnt int) returns void as $$
begin
  for i in 1..cnt loop
    begin
      insert into test values (i);
    exception when division_by_zero then
      raise notice 'caught error';
      return;
    end;
  end loop;
end;
$$ language plpgsql;

select testfn(100000);

set logical_decoding_work_mem to '4MB';

-- assumes a logical replication slot 's' has been created beforehand
select count(*) from pg_logical_slot_peek_changes('s', null, null);
-----

To deal with this problem, I initially thought of idea (a) mentioned in the comment: use a binary heap to keep the transactions sorted by the amount of changes (i.e., size). But it doesn't seem to be a good idea to maintain all transactions ordered by size, since the size of each transaction can change frequently.

The attached patch uses a different approach consisting of three strategies: (1) maintain a list of transactions whose size is larger than 10% of logical_decoding_work_mem, and preferentially evict a transaction from this list. If the list is empty, all transactions are small enough, so (2) we evict the oldest top-level transaction from the rb->toplevel_by_lsn list. Evicting older transactions would help in freeing memory blocks in the GenerationContext. Finally, if this is also empty, (3) we evict a transaction whose size is > 0. Note that even after a transaction is evicted, its ReorderBufferTXN entry is not removed from rb->by_txn; its size simply becomes 0. In the worst case, where all (quite a few) transactions are smaller than 10% of the memory limit, we might end up checking many transactions to find entries with a non-zero size to evict. So the patch adds a new list to maintain all transactions that have at least one change in memory.

Summarizing the algorithm I've implemented in the patch:

1. Pick a transaction from the list of large transactions (larger than 10% of the memory limit).
2. Pick a transaction from the top-level transaction list, in LSN order.
3. Pick a transaction from the list of transactions that have at least one change in memory.

With the patch, the above test case completed within 3 seconds in my environment.

As a side note, idea (c) mentioned in the comment, evicting multiple transactions at once to free a given portion of the memory, would also help avoid going back and forth across the memory threshold. It's also worth considering.
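To make the selection order above concrete, here is a rough sketch (not the attached patch itself) of how the three strategies could be consulted. The list names rb->large_txns and rb->txns_with_changes and their dlist_node members large_node/mem_node are placeholder names I'm using for illustration:

#include "postgres.h"
#include "replication/reorderbuffer.h"

/* Rough sketch only; the attached patch may differ in detail. */
static ReorderBufferTXN *
ReorderBufferPickTXNToEvict(ReorderBuffer *rb)
{
	dlist_iter	iter;

	/* (1) prefer a transaction using more than 10% of logical_decoding_work_mem */
	if (!dlist_is_empty(&rb->large_txns))
		return dlist_head_element(ReorderBufferTXN, large_node,
								  &rb->large_txns);

	/* (2) otherwise the oldest top-level transaction, in LSN order */
	dlist_foreach(iter, &rb->toplevel_by_lsn)
	{
		ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN, node,
												iter.cur);

		/* skip entries already spilled to disk (their size is reset to 0) */
		if (txn->size > 0)
			return txn;
	}

	/* (3) otherwise any transaction that still has a change in memory */
	if (!dlist_is_empty(&rb->txns_with_changes))
		return dlist_head_element(ReorderBufferTXN, mem_node,
								  &rb->txns_with_changes);

	return NULL;				/* nothing evictable */
}

The idea is that steps (1) and (3) are cheap list lookups and step (2) stops at the first evictable entry, so we no longer have to scan every entry in rb->by_txn.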
Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

improve_eviction_rb_poc.patch
Description: Binary data