BRIN summarization vs. WAL logging

Tomas Vondra Tue, 25 Jan 2022 19:12:46 -0800

Hi,

In a thread about sequences and sync replication [1], I've explained that the issue we're observing is due to not waiting for WAL at commit if the transaction only did nextval(). In that case we don't flush WAL in RecordTransactionCommit, we don't wait for the sync replica, etc. The WAL may get lost in case of a crash, etc.

As I explained in the other thread, there are various other cases where a transaction generates WAL but does not have an XID, which is sufficient for not flushing/waiting at transaction commit. Some of those cases are probably fine (e.g. there are comments explaining why this is fine for the PRUNE record).
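
To make the rule above concrete, here is a deliberately simplified sketch of the commit-time decision. It is a toy model, not the actual RecordTransactionCommit code: the ToyXact struct and both function names are hypothetical, and it ignores everything except the two conditions mentioned above (whether the transaction has an XID and whether it wrote WAL).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, minimal view of a transaction at commit time. */
typedef struct
{
	bool		has_xid;	/* was an XID assigned? */
	bool		wrote_wal;	/* did the transaction generate any WAL? */
} ToyXact;

/*
 * Simplified rule from the text: the commit flushes WAL and waits for the
 * sync replica only if the transaction both wrote WAL and has an XID.
 */
bool
toy_commit_waits_for_wal(ToyXact x)
{
	return x.has_xid && x.wrote_wal;
}

/* WAL that was written but is not waited for may be lost in a crash. */
bool
toy_wal_at_risk(ToyXact x)
{
	bool		at_risk = x.wrote_wal && !toy_commit_waits_for_wal(x);

	/* in this model, only XID-less transactions can be at risk */
	assert(!(at_risk && x.has_xid));
	return at_risk;
}
```

In this model a transaction that only called nextval(), or only ran one of the BRIN functions discussed below, has wrote_wal = true and has_xid = false, so toy_wal_at_risk() returns true for it.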

But other cases (discovered by running regression tests with extra logging) looked a bit suspicious - particularly those that write multiple WAL messages, because what if we lose just some of those?

So I looked at two cases related to BRIN, mostly because those were fairly simple, and I think at least brin_summarize_range() is somewhat broken.



1) brin_desummarize_range()

This is pretty simple, because this function generates a single WAL record, without waiting for it to be flushed:


DESUMMARIZE pagesPerRange 1, heapBlk 0, page offset 9, blkref #0: ...

But if the cluster/VM/... crashes right after you ran the function (and it completed just fine, possibly even in an explicit transaction), that change will get lost. Not really a serious data corruption/loss, and you can simply run it again, but IMHO rather surprising.

Of course, most people are unlikely to run brin_desummarize_range() very often, so maybe it's acceptable? But then again, if we expect this to be a very rare operation, why skip the WAL flush at all?



2) brin_summarize_range()

Now, the issue I think is more serious, more likely to happen, and harder to fix. When summarizing a range, we write two WAL records:


INSERT heapBlk 2 pagesPerRange 2 offnum 2, blkref #0: rel 1663/63 ...
SAMEPAGE_UPDATE offnum 2, blkref #0: rel 1663/63341/73957 blk 2

So, what happens if we lose the second WAL record, e.g. due to a crash? To experiment with this, I wrote a trivial patch (attached) that allows crashing on a WAL message of a certain type by simply setting a GUC.


Now, consider this example:

  create table t (a int);
  insert into t select i from generate_series(1,5000) s(i);
  create index on t using brin (a);
  select brin_desummarize_range('t_a_idx', 1);

  set crash_on_wal_message = 'SAMEPAGE_UPDATE';

  select brin_summarize_range('t_a_idx', 5);

  PANIC:  crashing before 'SAMEPAGE_UPDATE' WAL message
  server closed the connection unexpectedly
  ...

After recovery, this is what we have:

  select * from brin_page_items(get_raw_page('t_a_idx', 2), 't_a_idx');

   ...  | allnulls | hasnulls | placeholder | value
   ... -+----------+----------+-------------+-------
   ...  | t        | f        | t           |
   (1 row)

So the BRIN tuple is still marked as placeholder, which is a problem because that means we'll always consider it as matching, making the bitmap index scan less efficient. And we'll *never* fix this, because just summarizing the range does nothing:


   select brin_summarize_range('t_a_idx', 5);
   brin_summarize_range
  ----------------------
                      0
  (1 row)

So it's still marked as placeholder, and to fix it you have to explicitly desummarize the range first.

The reason for this seems obvious - only the process that created the placeholder tuple is expected to mark it as "placeholder=false", but this is described as two WAL records. And if we lose the update, the tuple will stay marked as a placeholder forever.
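
The failure mode can be illustrated with a small toy model of WAL replay. This is only a sketch under simplifying assumptions, not BRIN code: RecType, ToyTuple and the toy_* functions are hypothetical names, the tuple carries just the two flags needed here, and toy_summarize() only mimics the observed behavior that re-summarizing a range which already has a tuple does nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy record types, named after the two WAL records shown earlier. */
typedef enum
{
	REC_INSERT,
	REC_SAMEPAGE_UPDATE
} RecType;

/* Toy BRIN tuple: only the state relevant to this illustration. */
typedef struct
{
	bool		exists;
	bool		placeholder;
} ToyTuple;

/* Replay the first nrecords records of wal onto an initially empty range. */
ToyTuple
toy_replay(const RecType *wal, size_t nrecords)
{
	ToyTuple	tup = {false, false};

	assert(wal != NULL || nrecords == 0);

	for (size_t i = 0; i < nrecords; i++)
	{
		switch (wal[i])
		{
			case REC_INSERT:
				/* summarization starts by inserting a placeholder tuple */
				tup.exists = true;
				tup.placeholder = true;
				break;
			case REC_SAMEPAGE_UPDATE:
				/* ... and later replaces it with the real summary */
				tup.placeholder = false;
				break;
		}
	}
	return tup;
}

/*
 * Mimics the behavior observed after recovery: a range that already has
 * a tuple is skipped, so a stuck placeholder is never repaired.
 * Returns the number of ranges summarized (0 or 1).
 */
int
toy_summarize(ToyTuple *tup)
{
	if (tup->exists)
		return 0;
	tup->exists = true;
	tup->placeholder = false;
	return 1;
}
```

Replaying both records yields a regular tuple, while replaying only the first one (the crash case) yields a tuple with placeholder still set, and a subsequent toy_summarize() call returns 0 and leaves it that way.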

Of course, this requires a crash while something is summarizing ranges. But consider that summarization is often done by autovacuum, so it's not just about hitting this from a manually executed brin_summarize_range().

I'm not quite sure what to do about this. Just doing XLogFlush() does not really fix this - it makes it less likely, but the root cause is that the change is described by multiple WAL messages that are not linked together in any way. We may lose the last message without noticing that, and the flush does not fix that.

I didn't look at the other cases mentioned in [1], but I wouldn't be surprised if some had a similar issue (e.g. the GIN pending list cleanup seems like another candidate).



regards

[1] https://www.postgresql.org/message-id/0f827a71-a01b-bcf9-fe77-3047a9d4a93c%40enterprisedb.com


--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 17257919dbf..daed8ba3bc8 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -190,6 +190,8 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 
 			xlrec.offnum = oldoff;
 
+			maybe_crash_on_wal("SAMEPAGE_UPDATE");
+
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBrinSamepageUpdate);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c9516e03fae..562192dc514 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -84,6 +84,8 @@ bool		XactDeferrable;
 
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
 
+char	   *crash_on_wal_message;
+
 /*
  * CheckXidAlive is a xid value pointing to a possibly ongoing (sub)
  * transaction.  Currently, it is used in logical decoding.  It's possible
@@ -6157,3 +6159,17 @@ xact_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+void
+maybe_crash_on_wal(char *msg)
+{
+	/* GUC not set or not the right message */
+	if (strcmp(crash_on_wal_message, msg) != 0)
+		return;
+
+	/* flush whatever was generated in this xact so far */
+	XLogFlush(XactLastRecEnd);
+
+	elog(PANIC, "crashing before '%s' WAL message", msg);
+}
+
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4c94f09c645..37bf20e7f04 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -139,6 +139,7 @@ extern char *temp_tablespaces;
 extern bool ignore_checksum_failure;
 extern bool ignore_invalid_pages;
 extern bool synchronize_seqscans;
+extern char *crash_on_wal_message;
 
 #ifdef TRACE_SYNCSCAN
 extern bool trace_syncscan;
@@ -4070,6 +4071,17 @@ static struct config_string ConfigureNamesString[] =
 		check_default_tablespace, NULL, NULL
 	},
 
+	{
+		{"crash_on_wal_message", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("crash on this WAL message"),
+			NULL,
+			GUC_IS_NAME
+		},
+		&crash_on_wal_message,
+		"",
+		NULL, NULL, NULL
+	},
+
 	{
 		{"temp_tablespaces", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Sets the tablespace(s) to use for temporary tables and sort files."),
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 02276d3edd5..fa13c60da32 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -487,4 +487,6 @@ extern void CancelBackup(void);
 /* in executor/nodeHash.c */
 extern size_t get_hash_memory_limit(void);
 
+extern void maybe_crash_on_wal(char *msg);
+
 #endif							/* MISCADMIN_H */
