Hi,

On 2020-01-11 23:20:56 -0500, Tom Lane wrote:
> Tomas Vondra <tomas.von...@2ndquadrant.com> writes:
> Nah, don't think I believe that: the test inserts a bunch of tuples,
> but they look like they will all be *exactly* the same size.
>
> CREATE TABLE decoding_test(x integer, y text);
> ...
>
>     FOR i IN 1..10 LOOP
>         BEGIN
>             INSERT INTO decoding_test(x) SELECT generate_series(1,5000);
>         EXCEPTION
>             when division_by_zero then perform 'dummy';
>         END;

I think the reason this triggers higher memory usage on older versions
is that, before

commit cec2edfa7859279f36d2374770ca920c59c73dd8
Author: Amit Kapila <akap...@postgresql.org>
Date:   2019-11-16 17:49:33 +0530

    Add logical_decoding_work_mem to limit ReorderBuffer memory usage.

we enforced how many changes to keep in memory (vs. spilled to disk)

/*
 * Maximum number of changes kept in memory, per transaction. After that,
 * changes are spooled to disk.
 *
 * The current value should be sufficient to decode the entire transaction
 * without hitting disk in OLTP workloads, while starting to spool to disk in
 * other workloads reasonably fast.
 *
 * At some point in the future it probably makes sense to have a more elaborate
 * resource management here, but it's not entirely clear what that would look
 * like.
 */
static const Size max_changes_in_memory = 4096;

on a per-transaction basis, and that subtransactions count as *separate*
transactions for that purpose (as they can be rolled back
independently). Since the test generates loads of records across its ten
subtransactions, each of them ends up keeping quite a few changes (including
the tuples they point to!) in memory at the same time.
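
For reference, in the back branches that limit is checked for each
ReorderBufferTXN separately, i.e. once per toplevel transaction and once per
subtransaction. Quoting 12's reorderbuffer.c from memory, so treat this as a
sketch rather than the exact code:

static void
ReorderBufferCheckSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
	/*
	 * TODO: improve accounting so we cheaply can take subtransactions into
	 * account here.
	 */
	if (txn->nentries_mem >= max_changes_in_memory)
	{
		ReorderBufferSerializeTXN(rb, txn);
		Assert(txn->nentries_mem == 0);
	}
}

Note that txn->nentries_mem only counts that transaction's own changes, so
ten subtransactions can each sit just below the threshold.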

Due to the way the limit of 4096 interacts with the 5000 rows inserted
above, we only hit the out-of-memory error when loading changes back into
memory, not while decoding. That's because during decoding (before the commit
record has been seen), we spill to disk after 4096 changes:

2020-02-07 11:18:22.399 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 585 to disk
2020-02-07 11:18:22.419 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 586 to disk
2020-02-07 11:18:22.431 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 587 to disk
2020-02-07 11:18:22.443 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 588 to disk
2020-02-07 11:18:22.454 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 589 to disk
2020-02-07 11:18:22.465 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 590 to disk
2020-02-07 11:18:22.477 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 591 to disk
2020-02-07 11:18:22.488 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 592 to disk
2020-02-07 11:18:22.499 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 593 to disk
2020-02-07 11:18:22.511 PST [1136134][3/2] DEBUG:  spill 4096 changes in XID 594 to disk

so each subtransaction keeps 5000 - 4096 = 904 changes in memory, times
10. But when actually calling the output plugin (once the commit record
is reached), we load each subtransaction's changes back into memory from
its start. That first entails spilling that subtransaction's in-memory
tail to disk, and then restoring its first 4096 changes:

2020-02-07 11:18:22.515 PST [1136134][3/2] DEBUG:  StartSubTransaction(1) name: unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0
2020-02-07 11:18:22.515 PST [1136134][3/2] DEBUG:  StartSubTransaction(2) name: replay; blockState: SUB BEGIN; state: INPROGR, xid/subid/cid: 0/2/0
2020-02-07 11:18:22.515 PST [1136134][3/2] DEBUG:  spill 904 changes in XID 585 to disk
2020-02-07 11:18:22.524 PST [1136134][3/2] DEBUG:  restored 4096 changes in XID 585 into memory
2020-02-07 11:18:22.524 PST [1136134][3/2] DEBUG:  spill 904 changes in XID 586 to disk
2020-02-07 11:18:22.534 PST [1136134][3/2] DEBUG:  restored 4096 changes in XID 586 into memory
2020-02-07 11:18:22.534 PST [1136134][3/2] DEBUG:  spill 904 changes in XID 587 to disk
2020-02-07 11:18:22.544 PST [1136134][3/2] DEBUG:  restored 4096 changes in XID 587 into memory
2020-02-07 11:18:22.544 PST [1136134][3/2] DEBUG:  spill 904 changes in XID 588 to disk
2020-02-07 11:18:22.554 PST [1136134][3/2] DEBUG:  restored 4096 changes in XID 588 into memory
2020-02-07 11:18:22.554 PST [1136134][3/2] DEBUG:  spill 904 changes in XID 589 to disk
TopMemoryContext: 161440 total in 7 blocks; 80240 free (68 chunks); 81200 used
...

Because each subtransaction now has 4096 changes in memory, we actually
need more memory here than we did during the decoding phase, where all but
the "current" subtransaction only held 5000 - 4096 changes in memory.

If we instead change the test to insert 4096*2 - 1 = 8191 tuples per
subtransaction, each of them retains 4095 changes in memory after
spilling, and we run out of memory earlier, already while decoding:
2020-02-07 11:23:20.540 PST [1136134][3/12] DEBUG:  spill 4096 changes in XID 610 to disk
2020-02-07 11:23:20.565 PST [1136134][3/12] DEBUG:  spill 4096 changes in XID 611 to disk
2020-02-07 11:23:20.587 PST [1136134][3/12] DEBUG:  spill 4096 changes in XID 612 to disk
2020-02-07 11:23:20.608 PST [1136134][3/12] DEBUG:  spill 4096 changes in XID 613 to disk
2020-02-07 11:23:20.630 PST [1136134][3/12] DEBUG:  spill 4096 changes in XID 614 to disk
TopMemoryContext: 161440 total in 7 blocks; 79264 free (82 chunks); 82176 used
...
2020-02-07 11:23:20.655 PST [1136134][3/12] ERROR:  out of memory
2020-02-07 11:23:20.655 PST [1136134][3/12] DETAIL:  Failed on request of size 8208.
2020-02-07 11:23:20.655 PST [1136134][3/12] STATEMENT:  SELECT * FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL);


The reason the per-subxact enforcement of max_changes_in_memory
isn't as noticeable in 11 is that tuple data is allocated in a generation
context there. That means each of the 4096*10 tuples we have in memory
doesn't allocate MaxHeapTupleSize, but only something like ~30 bytes.
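
In 11 that allocation is, roughly (again from memory, so a sketch):

	buffer->tup_context = GenerationContextCreate(new_ctx,
												  "Tuples",
												  SLAB_LARGE_BLOCK_SIZE);
	...
	tuple = (ReorderBufferTupleBuf *)
		MemoryContextAlloc(rb->tup_context,
						   sizeof(ReorderBufferTupleBuf) +
						   MAXIMUM_ALIGNOF + alloc_len);

i.e. sized to the actual tuple, with the generation context packing many
small allocations into larger blocks.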


I wonder if we, in the back branches that don't have the generation
context, should just reduce the size of slab-allocated tuples to ~1024
bytes instead of MaxHeapTupleSize. That's an almost trivial change, as we
already have to support tuples above that limit (in case the oldtuple
in an update/delete contains toasted columns that we "inlined"). POC for
that attached.

Greetings,

Andres Freund
diff --git i/src/backend/replication/logical/reorderbuffer.c w/src/backend/replication/logical/reorderbuffer.c
index cd2a0111825..9a5b5530e75 100644
--- i/src/backend/replication/logical/reorderbuffer.c
+++ w/src/backend/replication/logical/reorderbuffer.c
@@ -421,6 +421,9 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 	pfree(change);
 }
 
+/* POC: cap slab-allocated tuple buffers at 1kB instead of MaxHeapTupleSize */
+#define TUPLEBUF_MAX_SIZE 1024
+
 /*
  * Get a fresh ReorderBufferTupleBuf fitting at least a tuple of size
  * tuple_len (excluding header overhead).
@@ -434,22 +437,20 @@ ReorderBufferGetTupleBuf(ReorderBuffer *rb, Size tuple_len)
 	alloc_len = tuple_len + SizeofHeapTupleHeader;
 
 	/*
-	 * Most tuples are below MaxHeapTupleSize, so we use a slab allocator for
-	 * those. Thus always allocate at least MaxHeapTupleSize. Note that tuples
-	 * generated for oldtuples can be bigger, as they don't have out-of-line
-	 * toast columns.
+	 * Most tuples are below TUPLEBUF_MAX_SIZE, so we use a slab allocator for
+	 * those. Thus always allocate at least TUPLEBUF_MAX_SIZE.
 	 */
-	if (alloc_len < MaxHeapTupleSize)
-		alloc_len = MaxHeapTupleSize;
+	if (alloc_len < TUPLEBUF_MAX_SIZE)
+		alloc_len = TUPLEBUF_MAX_SIZE;
 
 
 	/* if small enough, check the slab cache */
-	if (alloc_len <= MaxHeapTupleSize && rb->nr_cached_tuplebufs)
+	if (alloc_len <= TUPLEBUF_MAX_SIZE && rb->nr_cached_tuplebufs)
 	{
 		rb->nr_cached_tuplebufs--;
 		tuple = slist_container(ReorderBufferTupleBuf, node,
 								slist_pop_head_node(&rb->cached_tuplebufs));
-		Assert(tuple->alloc_tuple_size == MaxHeapTupleSize);
+		Assert(tuple->alloc_tuple_size == TUPLEBUF_MAX_SIZE);
 #ifdef USE_ASSERT_CHECKING
 		memset(&tuple->tuple, 0xa9, sizeof(HeapTupleData));
 		VALGRIND_MAKE_MEM_UNDEFINED(&tuple->tuple, sizeof(HeapTupleData));
@@ -480,7 +481,7 @@ void
 ReorderBufferReturnTupleBuf(ReorderBuffer *rb, ReorderBufferTupleBuf *tuple)
 {
 	/* check whether to put into the slab cache, oversized tuples never are */
-	if (tuple->alloc_tuple_size == MaxHeapTupleSize &&
+	if (tuple->alloc_tuple_size == TUPLEBUF_MAX_SIZE &&
 		rb->nr_cached_tuplebufs < max_cached_tuplebufs)
 	{
 		rb->nr_cached_tuplebufs++;
@@ -2588,6 +2589,9 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		restored++;
 	}
 
+	elog(DEBUG5, "restored %zu changes in XID %u into memory",
+		 restored, txn->xid);
+
 	return restored;
 }
 
