On Mon, Aug 28, 2023 at 12:26 PM Robert Haas <robertmh...@gmail.com> wrote:
>
> On Mon, Aug 28, 2023 at 10:00 AM Melanie Plageman
> <melanieplage...@gmail.com> wrote:
> Then there's the question of whether it's the right metric. My first
> reaction is to think that it sounds pretty good. One thing I really
> like about it is that if the table is being vacuumed frequently, then
> we freeze less aggressively, and if the table is being vacuumed
> infrequently, then we freeze more aggressively. That seems like a very
> desirable property. It also seems broadly good that this metric
> doesn't really care about reads. If there are a lot of reads on the
> system, or no reads at all, it doesn't really change the chances that
> a certain page is going to be written again soon, and since reads
> don't change the insert LSN, here again it seems to do the right
> thing. I'm a little less clear about whether it's good that it doesn't
> really depend on wall-clock time. Certainly, that's desirable from the
> point of view of not wanting to have to measure wall-clock time in
> places where we otherwise wouldn't have to, which tends to end up
> being expensive. However, if I were making all of my freezing
> decisions manually, I might be more freeze-positive on a low-velocity
> system where writes are more stretched out across time than on a
> high-velocity system where we're blasting through the LSN space at a
> higher rate. But maybe that's not a very important consideration, and
> I don't know what we'd do about it anyway.

By low-velocity, do you mean lower overall TPS? In that case, wouldn't you be
less likely to run into xid wraparound and thus need less aggressive
opportunistic freezing?

> >            Page Freezes/Page Frozen (less is better)
> >
> > |   | Master |     (1) |     (2) |     (3) |     (4) |     (5) |
> > |---+--------+---------+---------+---------+---------+---------|
> > | A |  28.50 |    3.89 |    1.08 |    1.15 |    1.10 |    1.10 |
> > | B |   1.00 |    1.06 |    1.65 |    1.03 |    1.59 |    1.00 |
> > | C |    N/A |    1.00 |    1.00 |    1.00 |    1.00 |    1.00 |
> > | D |   2.00 | 5199.15 | 5276.85 | 4830.45 | 5234.55 | 2193.55 |
> > | E |   7.90 |    3.21 |    2.73 |    2.70 |    2.69 |    2.43 |
> > | F |    N/A |    1.00 |    1.00 |    1.00 |    1.00 |    1.00 |
> > | G |    N/A |    1.00 |    1.00 |    1.00 |    1.00 |    1.00 |
> > | H |    N/A |    1.00 |    1.00 |    1.00 |    1.00 |    1.00 |
> > | I |    N/A |   42.00 |   42.00 |     N/A |   41.00 |     N/A |
>
> Hmm. I would say that the interesting rows here are A, D, and I, with
> rows C and E deserving honorable mention. In row A, master is bad.

So, this is where the caveat about the absolute number of page freezes
matters. In workload A, master did only 57 page freezes (spread across
the various pgbench tables), and at the end of the run only 2 pages were
still frozen -- that 57/2 is what produces the 28.50 ratio, so a handful
of freezes can swing the ratio dramatically.

> In row D, your algorithms are all bad, really bad. I don't quite
> understand how it can be that bad, actually.

So, I realize now that this test was poorly designed. I meant it to be a
worst-case scenario, but I think one critical part was wrong. In this
example, one client runs at full speed inserting a row and then updating
it, while another, rate-limited client periodically deletes old data to
keep the table at a constant size. I meant to bulk load the table with
enough data that the delete job would have data to delete from the
start, but I didn't. With the default autovacuum settings, over the
course of 45 minutes, I usually saw around 40 autovacuums of the table.
Because the delete job is rate-limited and the table starts empty, the
first autovacuum of the table ends up freezing many pages that are
deleted soon after. Thus the total number of page freezes is very high.

I will redo the benchmarking of workload D and start the table with the
number of rows which the DELETE job seeks to maintain. My
back-of-the-envelope math says that this will mean ratios closer to a
dozen (not 5000).

Also, I had doubled checkpoint_timeout, which likely explains why master
froze so few pages here (2 total freezes, neither of which was still
frozen at the end of the run): master only freezes opportunistically
when pruning emits an FPI, and longer checkpoints mean fewer FPIs. This
is another example where master's overall low number of page freezes
makes it difficult to compare to the alternatives using a ratio.

I didn't initially question the numbers because freezing data and then
deleting it right afterward seems like it would naturally be one of the
worst cases for opportunistic freezing -- but certainly not this bad.

> Row I looks bad for algorithms 1, 2, and 4: they freeze pages because
> it looks cheap, but the work doesn't really pay off.

Yes, the work queue example looks hard to handle.

> >            % Frozen at end of run
> >
> > |   | Master | (1) | (2) | (3) |  (4) | (5) |
> > |---+--------+-----+-----+-----+------+-----+
> > | A |      0 |   1 |  99 |   0 |   81 |   0 |
> > | B |     71 |  96 |  99 |   3 |   98 |   2 |
> > | C |      0 |   9 | 100 |   6 |   92 |   5 |
> > | D |      0 |   1 |   1 |   1 |    1 |   1 |
> > | E |      0 |  63 | 100 |  68 |  100 |  67 |
> > | F |      0 |   5 |  14 |   6 |   14 |   5 |
> > | G |      0 | 100 | 100 |  92 |  100 |  67 |
> > | H |      0 |  11 | 100 |   9 |   86 |   5 |
> > | I |      0 | 100 | 100 |   0 |  100 |   0 |
>
> So all of the algorithms here, but especially 1, 2, and 4, freeze a
> lot more often than master.
>
> If I understand correctly, we'd like to see small numbers for B, D,
> and I, and large numbers for the other workloads. None of the
> algorithms seem to achieve that. (3) and (5) seem like they always
> behave as well or better than master, but they produce small numbers
> for A, C, F, and H. (1), (2), and (4) regress B and I relative to
> master but do better than (3) and (5) on A, C, and the latter two also
> on E.
>
> B is such an important benchmarking workload that I'd be loathe to
> regress it, so if I had to pick on the basis of this data, my vote
> would be (3) or (5), provided whatever is happening with (D) in the
> previous metric is not as bad as it looks. What's your reason for
> preferring (4) and (5) over (2) and (3)? I'm not clear that these
> numbers give us much of an idea whether 10% or 33% or something else
> is better in general.

(1) seems bad to me because it doesn't consider whether freezing will be
useful -- only whether it will be cheap. It froze very little of the
cold data in workloads where a small percentage of the data was being
modified (especially workloads A, C, and H), and it froze a lot of data
in workloads where the data was being uniformly modified (workload B).

I suggested (4) and (5) because I think the "older than 33%" threshold
is better than the "older than 10%" threshold. I included both because I
am still unclear on what we value more: are we willing to freeze more
aggressively at the expense of emitting more FPIs, as long as it doesn't
affect throughput? For pretty much all of these workloads, the
algorithms which froze based on page modification recency OR whether an
FPI was required emitted many more FPIs than those which froze based
only on page modification recency.
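
To make the threshold question concrete, here is the shape of the check
in the WIP patch below, applied on top of the existing requirement that
freezing would leave the page all-frozen (FREEZE_CUTOFF_DIVISOR is 3
there, i.e. the "older than 33%" variant; 10 would give the "older than
10%" variant):

    current_lsn = GetXLogInsertRecPtr();
    lsns_since_last_vacuum = current_lsn - vacrel->last_vac_lsn;
    freeze_lsn_cutoff = current_lsn -
        (lsns_since_last_vacuum / FREEZE_CUTOFF_DIVISOR);

    /*
     * Opportunistically freeze only if it is cheap (buffer already
     * dirty, no FPI needed) and the page hasn't been modified within
     * the most recent 1/FREEZE_CUTOFF_DIVISOR of the WAL generated
     * since the last vacuum of this relation.
     */
    if (BufferIsProbablyDirty(buf) &&
        !XLogCheckBufferNeedsBackup(buf) &&
        PageGetLSN(page) < freeze_lsn_cutoff)
        ... freeze the page ...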

I've attached the WIP patch that I forgot to attach to my previous email.

I'll rerun workload D in a more reasonable way and be back with results.

- Melanie
From 51786ef91cba43bbb4985dc8ab86f2cdcbf54369 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Sat, 26 Aug 2023 20:16:00 -0400
Subject: [PATCH v1 1/3] Opportunistically freeze cold data when it is cheap

Instead of relying on whether pruning emitted an FPI while vacuuming to
decide whether to opportunistically freeze, determine whether doing so
will be cheap and useful. Determine whether it is cheap by checking if
the buffer is already dirty and whether freezing the page would require
emitting an FPI. Then, predict whether the page will stay frozen by
comparing the page LSN to the insert LSN at the end of the last vacuum
of the relation. This gives us some idea of whether the page has been
recently modified and, hopefully, some insight into whether it will soon
be modified again.
---
 src/backend/access/heap/vacuumlazy.c         | 26 ++++++++++++++----
 src/backend/storage/buffer/bufmgr.c          | 19 +++++++++++++
 src/backend/storage/buffer/localbuf.c        | 20 ++++++++++++++
 src/backend/utils/activity/pgstat_relation.c | 29 +++++++++++++++++++-
 src/include/pgstat.h                         |  8 +++++-
 src/include/storage/buf_internals.h          |  2 ++
 src/include/storage/bufmgr.h                 |  2 ++
 7 files changed, 98 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 6a41ee635d..e4d5b402c2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -210,6 +210,7 @@ typedef struct LVRelState
 	int64		live_tuples;	/* # live tuples remaining */
 	int64		recently_dead_tuples;	/* # dead, but not yet removable */
 	int64		missed_dead_tuples; /* # removable, but not removed */
+	XLogRecPtr	last_vac_lsn;	/* insert LSN at end of last vacuum */
 } LVRelState;
 
 /*
@@ -364,6 +365,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	errcallback.previous = error_context_stack;
 	error_context_stack = &errcallback;
 
+	vacrel->last_vac_lsn = pgstat_get_last_vac_lsn(RelationGetRelid(rel),
+			rel->rd_rel->relisshared);
+
 	/* Set up high level stuff about rel and its indexes */
 	vacrel->rel = rel;
 	vac_open_indexes(vacrel->rel, RowExclusiveLock, &vacrel->nindexes,
@@ -597,7 +601,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 						 rel->rd_rel->relisshared,
 						 Max(vacrel->new_live_tuples, 0),
 						 vacrel->recently_dead_tuples +
-						 vacrel->missed_dead_tuples);
+						 vacrel->missed_dead_tuples, ProcLastRecPtr);
+
 	pgstat_progress_end_command();
 
 	if (instrument)
@@ -1511,6 +1516,7 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
 	return false;
 }
 
+#define FREEZE_CUTOFF_DIVISOR 3	/* freeze only pages not modified within the newest 1/3 of WAL since the last vacuum */
 /*
  *	lazy_scan_prune() -- lazy_scan_heap() pruning and freezing.
  *
@@ -1551,9 +1557,11 @@ lazy_scan_prune(LVRelState *vacrel,
 				recently_dead_tuples;
 	int			nnewlpdead;
 	HeapPageFreeze pagefrz;
-	int64		fpi_before = pgWalUsage.wal_fpi;
 	OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
 	HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
+	XLogRecPtr current_lsn;
+	XLogRecPtr lsns_since_last_vacuum;
+	XLogRecPtr freeze_lsn_cutoff;
 
 	Assert(BufferGetBlockNumber(buf) == blkno);
 
@@ -1795,13 +1803,19 @@ retry:
 
 	/*
 	 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
-	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.  Also
-	 * freeze when pruning generated an FPI, if doing so means that we set the
-	 * page all-frozen afterwards (might not happen until final heap pass).
+	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.  Also,
+	 * if freezing would allow us to set the whole page all-frozen, then freeze
+	 * if the buffer is already dirty, freezing won't emit an FPI and the page
+	 * hasn't been modified recently.
 	 */
+	current_lsn = GetXLogInsertRecPtr();
+	lsns_since_last_vacuum = current_lsn - vacrel->last_vac_lsn;
+	freeze_lsn_cutoff = current_lsn - (lsns_since_last_vacuum / FREEZE_CUTOFF_DIVISOR);
+
 	if (pagefrz.freeze_required || tuples_frozen == 0 ||
 		(prunestate->all_visible && prunestate->all_frozen &&
-		 fpi_before != pgWalUsage.wal_fpi))
+		(BufferIsProbablyDirty(buf) && !XLogCheckBufferNeedsBackup(buf) &&
+		PageGetLSN(page) < freeze_lsn_cutoff)))
 	{
 		/*
 		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3bd82dbfca..9110b686ae 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2098,6 +2098,25 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 	return first_block;
 }
 
+bool
+BufferIsProbablyDirty(Buffer buffer)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	if (!BufferIsValid(buffer))
+		elog(ERROR, "bad buffer ID: %d", buffer);
+
+	if (BufferIsLocal(buffer))
+		return LocalBufferIsProbablyDirty(buffer);
+
+	bufHdr = GetBufferDescriptor(buffer - 1);
+
+	buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+	return buf_state & BM_DIRTY;
+}
+
 /*
  * MarkBufferDirty
  *
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 1735ec7141..f2d4f91336 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -436,6 +436,26 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
 	return first_block;
 }
 
+bool
+LocalBufferIsProbablyDirty(Buffer buffer)
+{
+	int			bufid;
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(BufferIsLocal(buffer));
+
+	bufid = -buffer - 1;
+
+	Assert(LocalRefCount[bufid] > 0);
+
+	bufHdr = GetLocalBufferDescriptor(bufid);
+
+	buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+	return buf_state & BM_DIRTY;
+}
+
 /*
  * MarkLocalBufferDirty -
  *	  mark a local buffer dirty
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 9876e0c1e8..cac0a4bc49 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -205,12 +205,37 @@ pgstat_drop_relation(Relation rel)
 	}
 }
 
+
+XLogRecPtr
+pgstat_get_last_vac_lsn(Oid tableoid, bool shared)
+{
+	PgStat_EntryRef *entry_ref;
+	PgStatShared_Relation *shtabentry;
+	PgStat_StatTabEntry *tabentry;
+	XLogRecPtr result;
+	Oid			dboid = (shared ? InvalidOid : MyDatabaseId);
+
+	if (!pgstat_track_counts)
+		return InvalidXLogRecPtr;
+
+	entry_ref = pgstat_get_entry_ref_locked(PGSTAT_KIND_RELATION,
+											dboid, tableoid, false);
+
+	shtabentry = (PgStatShared_Relation *) entry_ref->shared_stats;
+	tabentry = &shtabentry->stats;
+
+	result = tabentry->last_vac_lsn;
+	pgstat_unlock_entry(entry_ref);
+	return result;
+}
+
 /*
  * Report that the table was just vacuumed and flush IO statistics.
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
-					 PgStat_Counter livetuples, PgStat_Counter deadtuples)
+					 PgStat_Counter livetuples, PgStat_Counter deadtuples,
+					 XLogRecPtr last_vac_lsn)
 {
 	PgStat_EntryRef *entry_ref;
 	PgStatShared_Relation *shtabentry;
@@ -257,6 +282,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
 		tabentry->vacuum_count++;
 	}
 
+	tabentry->last_vac_lsn = last_vac_lsn;
+
 	pgstat_unlock_entry(entry_ref);
 
 	/*
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 57a2c0866a..9fe607c961 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -11,6 +11,7 @@
 #ifndef PGSTAT_H
 #define PGSTAT_H
 
+#include "access/xlogdefs.h"
 #include "datatype/timestamp.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"	/* for MAX_XFN_CHARS */
@@ -424,6 +425,7 @@ typedef struct PgStat_StatTabEntry
 	PgStat_Counter analyze_count;
 	TimestampTz last_autoanalyze_time;	/* autovacuum initiated */
 	PgStat_Counter autoanalyze_count;
+	XLogRecPtr	last_vac_lsn;	/* insert LSN at end of last vacuum */
 } PgStat_StatTabEntry;
 
 typedef struct PgStat_WalStats
@@ -589,7 +591,11 @@ extern void pgstat_assoc_relation(Relation rel);
 extern void pgstat_unlink_relation(Relation rel);
 
 extern void pgstat_report_vacuum(Oid tableoid, bool shared,
-								 PgStat_Counter livetuples, PgStat_Counter deadtuples);
+								 PgStat_Counter livetuples, PgStat_Counter deadtuples,
+								 XLogRecPtr last_vac_lsn);
+
+extern XLogRecPtr pgstat_get_last_vac_lsn(Oid tableoid, bool shared);
+
 extern void pgstat_report_analyze(Relation rel,
 								  PgStat_Counter livetuples, PgStat_Counter deadtuples,
 								  bool resetcounter);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index bc79a329a1..aaa7882099 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -430,6 +430,8 @@ extern BlockNumber ExtendBufferedRelLocal(BufferManagerRelation bmr,
 										  BlockNumber extend_upto,
 										  Buffer *buffers,
 										  uint32 *extended_by);
+
+extern bool LocalBufferIsProbablyDirty(Buffer buffer);
 extern void MarkLocalBufferDirty(Buffer buffer);
 extern void DropRelationLocalBuffers(RelFileLocator rlocator,
 									 ForkNumber forkNum,
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index b379c76e27..7458d5fd83 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -180,6 +180,8 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
+
+extern bool BufferIsProbablyDirty(Buffer buffer);
 extern void IncrBufferRefCount(Buffer buffer);
 extern void CheckBufferIsPinnedOnce(Buffer buffer);
 extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
-- 
2.37.2
