On Fri, Oct 23, 2020 at 06:06:30PM +0900, Michael Paquier wrote:
> On Fri, Oct 23, 2020 at 04:31:56PM +0800, Julien Rouhaud wrote:
>> Mmm, is it really an improvement to report warnings during this
>> function execution?  Note also that PageIsVerified as-is won't report
>> a warning if a page is found as PageIsNew() but isn't actually all
>> zero, while still being reported as corrupted by the SRF.
>
> Yep, joining the point of above to just have no WARNINGs at all.

Now that we have d401c57, I have given this one more thought, and opted
for not generating a WARNING for now.  Hence, PageIsVerifiedExtended()
is called with WARNING generation disabled, but we still report a
checksum failure through it.

I have spent some time reviewing the tests, as I felt they were bulky.
In the reworked version attached, I have reduced the number of tests by
half, without reducing the coverage, mainly:
- Removed all the stderr and the return code tests, as we always
expected the commands to succeed, and safe_psql() can do the job
already.
- Merged the queries using pg_relation_check_pages into a single
routine, with the expected output (set of broken pages returned in the
SRF) in the arguments.
- Added some prefixes to the tests, to generate unique test names.  That
makes debugging easier.
- The query on pg_stat_database is run once at the beginning, and once
at the end with the number of checksum failures correctly updated.
- Added comments to document all the routines, and renamed some of them
mostly for consistency.
- Skipped system relations from the scan of pg_class, as scanning them
made the test more costly for nothing.
- I ran some tests on Windows, just in case.

I have also added a SearchSysCacheExists1() to double-check if the
relation is missing before opening it, added a CHECK_FOR_INTERRUPTS()
within the main check loop (where the error context is really helpful),
indented the code, bumped the catalogs (mostly a self-reminder), etc.

So, what do you think?
--
Michael
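
As a side note for anyone reading the TAP test in the attached patch, here is a minimal sketch of what the two uint16 offsets it corrupts correspond to, assuming the standard PageHeaderData layout (pd_lsn on bytes 0-7, pd_checksum on bytes 8-9, pd_upper on bytes 14-15). The helper names mirror the Perl routines of the test, while page_looks_new() and fill_fake_page() are hypothetical names used only for illustration, and BLCKSZ is assumed to be the default 8192. This is not backend code, just a standalone model of the byte manipulation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192					/* assumption: default block size */
#define CHECKSUM_UINT16_OFFSET 4	/* pd_checksum, byte offset 8 */
#define PD_UPPER_UINT16_OFFSET 7	/* pd_upper, byte offset 14 */

/* Read the uint16 stored in the given 2-byte slot, in native byte order. */
static uint16_t
get_uint16_from_page(const unsigned char *page, size_t slot)
{
	uint16_t	val;

	memcpy(&val, page + 2 * slot, sizeof(val));
	return val;
}

/* Overwrite the uint16 stored in the given 2-byte slot. */
static void
set_uint16_to_page(unsigned char *page, uint16_t data, size_t slot)
{
	assert(2 * slot + sizeof(data) <= BLCKSZ);
	memcpy(page + 2 * slot, &data, sizeof(data));
}

/* Same rule as PageIsNew(): a page with pd_upper set to zero looks new. */
static int
page_looks_new(const unsigned char *page)
{
	return get_uint16_from_page(page, PD_UPPER_UINT16_OFFSET) == 0;
}

/* Hypothetical helper: build a page with made-up, non-zero header fields. */
static void
fill_fake_page(unsigned char *page)
{
	memset(page, 0, BLCKSZ);
	set_uint16_to_page(page, 0xBEEF, CHECKSUM_UINT16_OFFSET);
	set_uint16_to_page(page, 8000, PD_UPPER_UINT16_OFFSET);
}
```

Zeroing slot 4 on such a page only touches the stored checksum, which matches the "broken checksum" case, while zeroing slot 7 makes the page pass the PageIsNew() test even though the rest of the page is not all zeros, which matches the "new page" case.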
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index f44a09b0c2..e522477780 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 202010201
+#define CATALOG_VERSION_NO 202010271
#endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index bbcac69d48..a66870bcc0 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10958,6 +10958,13 @@
proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
proargnames => '{tablespace,name,size,modification}',
prosrc => 'pg_ls_tmpdir_1arg' },
+{ oid => '9147', descr => 'check pages of a relation',
+ proname => 'pg_relation_check_pages', procost => '10000', prorows => '20',
+ proisstrict => 'f', proretset => 't', provolatile => 'v', proparallel => 'r',
+ prorettype => 'record', proargtypes => 'regclass text',
+ proallargtypes => '{regclass,text,text,int8}', proargmodes => '{i,i,o,o}',
+ proargnames => '{relation,fork,path,failed_block_num}',
+ prosrc => 'pg_relation_check_pages' },
# hash partitioning constraint function
{ oid => '5028', descr => 'hash partition CHECK constraint',
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ee91b8fa26..a21cab2eaf 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -240,6 +240,9 @@ extern void AtProcExit_LocalBuffers(void);
extern void TestForOldSnapshot_impl(Snapshot snapshot, Relation relation);
+extern bool CheckBuffer(struct SMgrRelationData *smgr, ForkNumber forknum,
+ BlockNumber blkno);
+
/* in freelist.c */
extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype);
extern void FreeAccessStrategy(BufferAccessStrategy strategy);
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 85cd147e21..c6dd084fbc 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1300,6 +1300,14 @@ LANGUAGE INTERNAL
STRICT VOLATILE
AS 'pg_create_logical_replication_slot';
+CREATE OR REPLACE FUNCTION pg_relation_check_pages(
+ IN relation regclass, IN fork text DEFAULT NULL,
+ OUT path text, OUT failed_block_num bigint)
+RETURNS SETOF record
+LANGUAGE internal
+VOLATILE PARALLEL RESTRICTED
+AS 'pg_relation_check_pages';
+
CREATE OR REPLACE FUNCTION
make_interval(years int4 DEFAULT 0, months int4 DEFAULT 0, weeks int4 DEFAULT 0,
days int4 DEFAULT 0, hours int4 DEFAULT 0, mins int4 DEFAULT 0,
@@ -1444,6 +1452,7 @@ AS 'unicode_is_normalized';
-- can later change who can access these functions, or leave them as only
-- available to superuser / cluster owner, if they choose.
--
+REVOKE EXECUTE ON FUNCTION pg_relation_check_pages(regclass, text) FROM public;
REVOKE EXECUTE ON FUNCTION pg_start_backup(text, boolean, boolean) FROM public;
REVOKE EXECUTE ON FUNCTION pg_stop_backup() FROM public;
REVOKE EXECUTE ON FUNCTION pg_stop_backup(boolean, boolean) FROM public;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3eee86afe5..eb5c917074 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4585,3 +4585,95 @@ TestForOldSnapshot_impl(Snapshot snapshot, Relation relation)
(errcode(ERRCODE_SNAPSHOT_TOO_OLD),
errmsg("snapshot too old")));
}
+
+
+/*
+ * CheckBuffer
+ *
+ * Check the state of a buffer without loading it into the shared buffers. To
+ * avoid torn pages and possible false positives when reading data, a shared
+ * LWLock is taken on the target buffer pool partition mapping, and we check
+ * if the page is in shared buffers or not. An I/O lock is taken on the block
+ * to prevent any concurrent activity from happening.
+ *
+ * If the page is found as dirty in the shared buffers, it is ignored as
+ * it will be flushed to disk either before the end of the next checkpoint
+ * or during recovery in the event of an unsafe shutdown.
+ *
+ * If the page is found in the shared buffers but is not dirty, we still
+ * check the state of its data on disk, as it could be possible that the
+ * page stayed in shared buffers for a rather long time while the on-disk
+ * data got corrupted.
+ *
+ * If the page is not found in shared buffers, the block is read from disk
+ * while holding the buffer pool partition mapping LWLock.
+ *
+ * The page data is stored in a private memory area local to this function
+ * while running the checks.
+ */
+bool
+CheckBuffer(SMgrRelation smgr, ForkNumber forknum, BlockNumber blkno)
+{
+ char buffer[BLCKSZ];
+ BufferTag buf_tag; /* identity of requested block */
+ uint32 buf_hash; /* hash value for buf_tag */
+ LWLock *partLock; /* buffer partition lock for the buffer */
+ BufferDesc *bufdesc;
+ int buf_id;
+
+ Assert(smgrexists(smgr, forknum));
+
+ /* create a tag so we can look up the buffer */
+ INIT_BUFFERTAG(buf_tag, smgr->smgr_rnode.node, forknum, blkno);
+
+ /* determine its hash code and partition lock ID */
+ buf_hash = BufTableHashCode(&buf_tag);
+ partLock = BufMappingPartitionLock(buf_hash);
+
+ /* see if the block is in the buffer pool or not */
+ LWLockAcquire(partLock, LW_SHARED);
+ buf_id = BufTableLookup(&buf_tag, buf_hash);
+ if (buf_id >= 0)
+ {
+ uint32 buf_state;
+
+ /*
+ * Found it. Now, retrieve its state to know what to do with it, and
+ * release the buffer header lock immediately, to limit overhead as
+ * much as possible. We keep the shared lightweight lock on the target
+ * buffer mapping partition for now, so this buffer cannot be evicted,
+ * and we acquire an I/O lock on the buffer as we may need to read its
+ * contents from disk.
+ */
+ bufdesc = GetBufferDescriptor(buf_id);
+
+ LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
+ buf_state = LockBufHdr(bufdesc);
+ UnlockBufHdr(bufdesc, buf_state);
+
+ /* If the page is dirty or invalid, skip it */
+ if ((buf_state & BM_DIRTY) || !(buf_state & BM_TAG_VALID))
+ {
+ LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
+ LWLockRelease(partLock);
+ return true;
+ }
+
+ /* Read the buffer from disk, with the I/O lock still held */
+ smgrread(smgr, forknum, blkno, buffer);
+ LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
+ }
+ else
+ {
+ /*
+ * Simply read the buffer. There's no risk of modification on it as
+ * we are holding the buffer pool partition mapping lock.
+ */
+ smgrread(smgr, forknum, blkno, buffer);
+ }
+
+ /* buffer lookup done, so now do its check */
+ LWLockRelease(partLock);
+
+ return PageIsVerifiedExtended(buffer, blkno, PIV_REPORT_STAT);
+}
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index b4d55e849b..e2279af1e5 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -69,6 +69,7 @@ OBJS = \
oid.o \
oracle_compat.o \
orderedsetaggs.o \
+ pagefuncs.o \
partitionfuncs.o \
pg_locale.o \
pg_lsn.o \
diff --git a/src/backend/utils/adt/pagefuncs.c b/src/backend/utils/adt/pagefuncs.c
new file mode 100644
index 0000000000..f34d56cf1f
--- /dev/null
+++ b/src/backend/utils/adt/pagefuncs.c
@@ -0,0 +1,230 @@
+/*-------------------------------------------------------------------------
+ *
+ * pagefuncs.c
+ * Functions for page related features.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/utils/adt/pagefuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/relation.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/lmgr.h"
+#include "storage/smgr.h"
+#include "utils/builtins.h"
+#include "utils/syscache.h"
+
+static void check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+ Oid relid, ForkNumber single_forknum);
+static void check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+ Relation relation, ForkNumber forknum);
+
+/*
+ * callback arguments for check_pages_error_callback()
+ */
+typedef struct CheckPagesErrorInfo
+{
+ char *path;
+ BlockNumber blkno;
+} CheckPagesErrorInfo;
+
+/*
+ * Error callback specific to check_relation_fork().
+ */
+static void
+check_pages_error_callback(void *arg)
+{
+ CheckPagesErrorInfo *errinfo = (CheckPagesErrorInfo *) arg;
+
+ errcontext("while checking page %u of path %s",
+ errinfo->blkno, errinfo->path);
+}
+
+/*
+ * pg_relation_check_pages
+ *
+ * Check the state of all the pages for one or more fork types in the given
+ * relation.
+ */
+Datum
+pg_relation_check_pages(PG_FUNCTION_ARGS)
+{
+ Oid relid;
+ ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ MemoryContext per_query_ctx;
+ MemoryContext oldcontext;
+ ForkNumber forknum;
+
+ /* Switch into long-lived context to construct returned data structures */
+ per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+ oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tupstore = tuplestore_begin_heap(true, false, work_mem);
+ rsinfo->returnMode = SFRM_Materialize;
+ rsinfo->setResult = tupstore;
+ rsinfo->setDesc = tupdesc;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ /* handle arguments */
+ if (PG_ARGISNULL(0))
+ {
+ /* Just leave if nothing is defined */
+ PG_RETURN_VOID();
+ }
+
+ /* By default all the forks of a relation are checked */
+ if (PG_ARGISNULL(1))
+ forknum = InvalidForkNumber;
+ else
+ {
+ const char *forkname = TextDatumGetCString(PG_GETARG_TEXT_PP(1));
+
+ forknum = forkname_to_number(forkname);
+ }
+
+ relid = PG_GETARG_OID(0);
+
+ check_one_relation(tupdesc, tupstore, relid, forknum);
+ tuplestore_donestoring(tupstore);
+
+ return (Datum) 0;
+}
+
+/*
+ * Perform the check on a single relation, possibly filtered with a single
+ * fork. This function will check if the given relation exists or not, as
+ * a relation could be dropped after checking for the list of relations and
+ * before getting here, and we don't want to error out in this case.
+ */
+static void
+check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+ Oid relid, ForkNumber single_forknum)
+{
+ Relation relation;
+ ForkNumber forknum;
+
+ /* Check if the relation exists, leaving if there is no such relation */
+ if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
+ return;
+
+ relation = relation_open(relid, AccessShareLock);
+
+ /*
+ * Sanity checks, returning no results if not supported. Temporary
+ * relations and relations without storage are out of scope.
+ */
+ if (!RELKIND_HAS_STORAGE(relation->rd_rel->relkind) ||
+ relation->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
+ {
+ relation_close(relation, AccessShareLock);
+ return;
+ }
+
+ RelationOpenSmgr(relation);
+
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ if (single_forknum != InvalidForkNumber && single_forknum != forknum)
+ continue;
+
+ if (smgrexists(relation->rd_smgr, forknum))
+ check_relation_fork(tupdesc, tupstore, relation, forknum);
+ }
+
+ relation_close(relation, AccessShareLock);
+}
+
+/*
+ * For a given relation and fork, do the real work of iterating over all pages
+ * and doing the check. Caller must hold an AccessShareLock on the given
+ * relation.
+ */
+static void
+check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+ Relation relation, ForkNumber forknum)
+{
+ BlockNumber blkno,
+ nblocks;
+ SMgrRelation smgr = relation->rd_smgr;
+ char *path;
+ CheckPagesErrorInfo errinfo;
+ ErrorContextCallback errcallback;
+
+ /* Number of output arguments in the SRF */
+#define PG_CHECK_RELATION_COLS 2
+
+ Assert(CheckRelationLockedByMe(relation, AccessShareLock, true));
+
+ /*
+ * We remember the number of blocks here. Since caller must hold a lock
+ * on the relation, we know that it won't be truncated while we're
+ * iterating over the blocks. Any block added after this function started
+ * won't be checked, but this is out of scope as such pages will be
+ * flushed before the next checkpoint's completion.
+ */
+ nblocks = RelationGetNumberOfBlocksInFork(relation, forknum);
+
+ path = relpathbackend(smgr->smgr_rnode.node,
+ smgr->smgr_rnode.backend,
+ forknum);
+
+ /*
+ * Error context to print some information about blocks and relations
+ * impacted by corruptions.
+ */
+ errinfo.path = pstrdup(path);
+ errinfo.blkno = 0;
+ errcallback.callback = check_pages_error_callback;
+ errcallback.arg = (void *) &errinfo;
+ errcallback.previous = error_context_stack;
+ error_context_stack = &errcallback;
+
+ for (blkno = 0; blkno < nblocks; blkno++)
+ {
+ Datum values[PG_CHECK_RELATION_COLS];
+ bool nulls[PG_CHECK_RELATION_COLS];
+ int i = 0;
+
+ /* Update block number for the error context */
+ errinfo.blkno = blkno;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Check the given buffer */
+ if (CheckBuffer(smgr, forknum, blkno))
+ continue;
+
+ memset(values, 0, sizeof(values));
+ memset(nulls, 0, sizeof(nulls));
+
+ values[i++] = CStringGetTextDatum(path);
+ values[i++] = Int64GetDatum((int64) blkno);
+
+ Assert(i == PG_CHECK_RELATION_COLS);
+
+ /* Save the corrupted blocks in the tuplestore. */
+ tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+
+ }
+
+ /* Pop the error context stack and free the path shared by all rows */
+ error_context_stack = errcallback.previous;
+ pfree(path);
+}
diff --git a/src/test/recovery/t/022_page_check.pl b/src/test/recovery/t/022_page_check.pl
new file mode 100644
index 0000000000..7e1f0d1fd8
--- /dev/null
+++ b/src/test/recovery/t/022_page_check.pl
@@ -0,0 +1,234 @@
+# Emulate on-disk corruptions of relation pages and find such corruptions
+# using pg_relation_check_pages().
+
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+our $CHECKSUM_UINT16_OFFSET = 4;
+our $PD_UPPER_UINT16_OFFSET = 7;
+our $BLOCKSIZE;
+our $TOTAL_NB_ERR = 0;
+
+# Grab a relation page worth a size of BLOCKSIZE from given $filename.
+# $blkno is the same block number as for a relation file.
+sub read_page
+{
+ my ($filename, $blkno) = @_;
+ my $block;
+
+ open(my $infile, '<', $filename) or die;
+ binmode($infile);
+ seek($infile, $blkno * $BLOCKSIZE, 0) or die($!);
+ my $success = read($infile, $block, $BLOCKSIZE);
+ die($!) if !defined($success);
+
+ close($infile);
+
+ return ($block);
+}
+
+# Update an existing page of size BLOCKSIZE with new contents in given
+# $filename. blkno is the block number assigned in the relation file.
+sub write_page
+{
+ my ($filename, $block, $blkno) = @_;
+
+ open(my $outfile, '+<', $filename) or die;
+ binmode($outfile);
+ sysseek($outfile, $blkno * $BLOCKSIZE, 0) or die($!);
+ my $nb = syswrite($outfile, $block, $BLOCKSIZE);
+
+ die($!) if not defined $nb;
+ die("Write error") if ($nb != $BLOCKSIZE);
+
+ $outfile->flush();
+
+ close($outfile);
+ return;
+}
+
+# Read 2 bytes from relation page at a given offset.
+sub get_uint16_from_page
+{
+ my ($block, $offset) = @_;
+
+ return (unpack("S*", $block))[$offset];
+}
+
+# Write 2 bytes to relation page at a given offset.
+sub set_uint16_to_page
+{
+ my ($block, $data, $offset) = @_;
+
+ my $pack = pack("S", $data);
+
+ # vec with 16 bits or more won't preserve native endianness.
+ vec($block, 2 * $offset, 8) = (unpack('C*', $pack))[0];
+ vec($block, (2 * $offset) + 1, 8) = (unpack('C*', $pack))[1];
+
+ return $block;
+}
+
+# Sanity check on pg_stat_database, looking at the number of checksum
+# failures.
+sub check_pg_stat_database
+{
+ my ($node, $test_prefix) = @_;
+
+ my $stdout = $node->safe_psql('postgres',
+ "SELECT "
+ . " sum(checksum_failures)"
+ . " FROM pg_catalog.pg_stat_database");
+ is($stdout, $TOTAL_NB_ERR,
+ "$test_prefix: pg_stat_database should have $TOTAL_NB_ERR error");
+
+ return;
+}
+
+# Run a round of page checks for any relation present in this test run.
+# $expected_broken is the psql output marking all the pages found as
+# corrupted using relname|blkno as format for each tuple returned.
+# $num_checksum is the number of checksum failures added to the global
+# counter matched later with pg_stat_database.
+#
+# Note that this has no need to check system relations as these would have
+# no corruptions: this test does not manipulate them and should by no means
+# break the cluster.
+sub run_page_checks
+{
+ my ($node, $num_checksum, $expected_broken, $test_prefix) = @_;
+
+ my $stdout = $node->safe_psql('postgres',
+ "SELECT relname, failed_block_num"
+ . " FROM (SELECT relname, (pg_catalog.pg_relation_check_pages(oid)).*"
+ . " FROM pg_class "
+ . " WHERE relkind in ('r','i', 'm') AND oid >= 16384) AS s");
+
+ # Check command result
+ is($stdout, $expected_broken,
+ "$test_prefix: output mismatch with pg_relation_check_pages()");
+
+ $TOTAL_NB_ERR += $num_checksum;
+ return;
+}
+
+# Perform various tests that modify a specified block at the specified
+# offset, checking that the page corruption is correctly detected. The
+# original contents of the page are restored back once done.
+# $broken_pages is the set of pages that are expected to be broken
+# as of the returned result of pg_relation_check_pages(). $num_checksum
+# is the number of checksum failures expected to be added after this
+# function is done.
+sub corrupt_and_test_block
+{
+ my ($node, $filename, $blkno, $offset, $broken_pages, $num_checksum,
+ $test_prefix)
+ = @_;
+ my $fake_data = hex '0x0000';
+
+ # Stop the server cleanly to flush any pages, and to prevent any
+ # concurrent updates on what is going to be updated.
+ $node->stop;
+ my $original_block = read_page($filename, $blkno);
+ my $original_data = get_uint16_from_page($original_block, $offset);
+
+ isnt($original_data, $fake_data,
+ "$test_prefix: fake data at offset $offset should be different from the existing one"
+ );
+
+ my $new_block = set_uint16_to_page($original_block, $fake_data, $offset);
+ isnt(
+ $original_data,
+ get_uint16_from_page($new_block, $offset),
+ "$test_prefix: The data at offset $offset should have been changed in memory"
+ );
+
+ write_page($filename, $new_block, $blkno);
+
+ my $written_data = get_uint16_from_page(read_page($filename, $blkno), $offset);
+
+ # Some offline checks to validate that the corrupted data is in place.
+ isnt($original_data, $written_data,
+ "$test_prefix: data written at offset $offset should be different from the original one"
+ );
+ is( get_uint16_from_page($new_block, $offset),
+ $written_data,
+ "$test_prefix: data written at offset $offset should be the same as the one in memory"
+ );
+ is($written_data, $fake_data,
+ "$test_prefix: The data written at offset $offset should be the one we wanted to write"
+ );
+
+ # The corruption is in place, start the server to run the checks.
+ $node->start;
+ run_page_checks($node, $num_checksum, $broken_pages, $test_prefix);
+
+ # Stop the server, put the original page back in place.
+ $node->stop;
+
+ $new_block = set_uint16_to_page($original_block, $original_data, $offset);
+ is( $original_data,
+ get_uint16_from_page($new_block, $offset),
+ "$test_prefix: data at offset $offset should have been restored in memory"
+ );
+
+ write_page($filename, $new_block, $blkno);
+ is( $original_data,
+ get_uint16_from_page(read_page($filename, $blkno), $offset),
+ "$test_prefix: data at offset $offset should have been restored on disk"
+ );
+
+ # There should be no errors now that the contents are back in place.
+ $node->start;
+ run_page_checks($node, 0, '', $test_prefix);
+}
+
+# Data checksums are necessary for this test.
+my $node = get_new_node('main');
+$node->init(extra => ['--data-checksums']);
+$node->start;
+
+my $stdout =
+ $node->safe_psql('postgres', "SELECT" . " current_setting('block_size')");
+
+$BLOCKSIZE = $stdout;
+
+# Basic schema to corrupt and check
+$node->safe_psql(
+ 'postgres', q|
+ CREATE TABLE public.t1(id integer);
+ INSERT INTO public.t1 SELECT generate_series(1, 100);
+ CHECKPOINT;
+|);
+
+# Get the path to the relation file that will get manipulated by the
+# follow-up tests with some on-disk corruptions.
+$stdout = $node->safe_psql('postgres',
+ "SELECT"
+ . " current_setting('data_directory') || '/' || pg_relation_filepath('t1')"
+);
+
+my $filename = $stdout;
+
+# Normal case without corruptions, this passes, with pg_stat_database
+# reporting no errors.
+check_pg_stat_database($node, 'start');
+
+# Test with a modified checksum. We use a zero checksum here as it's the only
+# one that cannot exist on a checksummed page. We also don't have an easy way
+# to compute what the checksum would be after a modification in a random place
+# in the block.
+corrupt_and_test_block($node, $filename, 0, $CHECKSUM_UINT16_OFFSET, 't1|0',
+ 1, 'broken checksum');
+
+# Test corruption making the block look like it's PageIsNew().
+corrupt_and_test_block($node, $filename, 0, $PD_UPPER_UINT16_OFFSET, 't1|0',
+ 0, 'new page');
+
+# Check that the number of errors in pg_stat_database matches what we
+# expect with the corruptions previously introduced.
+check_pg_stat_database($node, 'end');
diff --git a/src/test/regress/expected/pagefuncs.out b/src/test/regress/expected/pagefuncs.out
new file mode 100644
index 0000000000..38a72b01b3
--- /dev/null
+++ b/src/test/regress/expected/pagefuncs.out
@@ -0,0 +1,72 @@
+--
+-- Tests for functions related to relation pages
+--
+-- Restricted to superusers by default
+CREATE ROLE regress_pgfunc_user;
+SET ROLE regress_pgfunc_user;
+SELECT pg_relation_check_pages('pg_class'); -- error
+ERROR: permission denied for function pg_relation_check_pages
+SELECT pg_relation_check_pages('pg_class', 'main'); -- error
+ERROR: permission denied for function pg_relation_check_pages
+RESET ROLE;
+DROP ROLE regress_pgfunc_user;
+-- NULL and simple sanity checks
+SELECT pg_relation_check_pages(NULL); -- empty result
+ pg_relation_check_pages
+-------------------------
+(0 rows)
+
+SELECT pg_relation_check_pages(NULL, NULL); -- empty result
+ pg_relation_check_pages
+-------------------------
+(0 rows)
+
+SELECT pg_relation_check_pages('pg_class', 'invalid_fork'); -- error
+ERROR: invalid fork name
+HINT: Valid fork names are "main", "fsm", "vm", and "init".
+-- Relation types that are supported
+CREATE TABLE pgfunc_test_tab (id int);
+CREATE INDEX pgfunc_test_ind ON pgfunc_test_tab(id);
+INSERT INTO pgfunc_test_tab VALUES (generate_series(1,1000));
+SELECT pg_relation_check_pages('pgfunc_test_tab');
+ pg_relation_check_pages
+-------------------------
+(0 rows)
+
+SELECT pg_relation_check_pages('pgfunc_test_ind');
+ pg_relation_check_pages
+-------------------------
+(0 rows)
+
+DROP TABLE pgfunc_test_tab;
+CREATE MATERIALIZED VIEW pgfunc_test_matview AS SELECT 1;
+SELECT pg_relation_check_pages('pgfunc_test_matview');
+ pg_relation_check_pages
+-------------------------
+(0 rows)
+
+DROP MATERIALIZED VIEW pgfunc_test_matview;
+CREATE SEQUENCE pgfunc_test_seq;
+SELECT pg_relation_check_pages('pgfunc_test_seq');
+ pg_relation_check_pages
+-------------------------
+(0 rows)
+
+DROP SEQUENCE pgfunc_test_seq;
+-- pg_relation_check_pages() returns no results if passed relations that
+-- do not support the operation, like relations without storage or temporary
+-- relations.
+CREATE TEMPORARY TABLE pgfunc_test_temp AS SELECT generate_series(1,10) AS a;
+SELECT pg_relation_check_pages('pgfunc_test_temp');
+ pg_relation_check_pages
+-------------------------
+(0 rows)
+
+DROP TABLE pgfunc_test_temp;
+CREATE VIEW pgfunc_test_view AS SELECT 1;
+SELECT pg_relation_check_pages('pgfunc_test_view');
+ pg_relation_check_pages
+-------------------------
+(0 rows)
+
+DROP VIEW pgfunc_test_view;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index ae89ed7f0b..7a46a13252 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -112,7 +112,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
# ----------
# Another group of parallel tests
# ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain pagefuncs
# event triggers cannot run concurrently with any test that runs DDL
test: event_trigger
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 525bdc804f..9a80b80f73 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -197,6 +197,7 @@ test: hash_part
test: indexing
test: partition_aggregate
test: partition_info
+test: pagefuncs
test: tuplesort
test: explain
test: event_trigger
diff --git a/src/test/regress/sql/pagefuncs.sql b/src/test/regress/sql/pagefuncs.sql
new file mode 100644
index 0000000000..12d32eeae4
--- /dev/null
+++ b/src/test/regress/sql/pagefuncs.sql
@@ -0,0 +1,41 @@
+--
+-- Tests for functions related to relation pages
+--
+
+-- Restricted to superusers by default
+CREATE ROLE regress_pgfunc_user;
+SET ROLE regress_pgfunc_user;
+SELECT pg_relation_check_pages('pg_class'); -- error
+SELECT pg_relation_check_pages('pg_class', 'main'); -- error
+RESET ROLE;
+DROP ROLE regress_pgfunc_user;
+
+-- NULL and simple sanity checks
+SELECT pg_relation_check_pages(NULL); -- empty result
+SELECT pg_relation_check_pages(NULL, NULL); -- empty result
+SELECT pg_relation_check_pages('pg_class', 'invalid_fork'); -- error
+
+-- Relation types that are supported
+CREATE TABLE pgfunc_test_tab (id int);
+CREATE INDEX pgfunc_test_ind ON pgfunc_test_tab(id);
+INSERT INTO pgfunc_test_tab VALUES (generate_series(1,1000));
+SELECT pg_relation_check_pages('pgfunc_test_tab');
+SELECT pg_relation_check_pages('pgfunc_test_ind');
+DROP TABLE pgfunc_test_tab;
+
+CREATE MATERIALIZED VIEW pgfunc_test_matview AS SELECT 1;
+SELECT pg_relation_check_pages('pgfunc_test_matview');
+DROP MATERIALIZED VIEW pgfunc_test_matview;
+CREATE SEQUENCE pgfunc_test_seq;
+SELECT pg_relation_check_pages('pgfunc_test_seq');
+DROP SEQUENCE pgfunc_test_seq;
+
+-- pg_relation_check_pages() returns no results if passed relations that
+-- do not support the operation, like relations without storage or temporary
+-- relations.
+CREATE TEMPORARY TABLE pgfunc_test_temp AS SELECT generate_series(1,10) AS a;
+SELECT pg_relation_check_pages('pgfunc_test_temp');
+DROP TABLE pgfunc_test_temp;
+CREATE VIEW pgfunc_test_view AS SELECT 1;
+SELECT pg_relation_check_pages('pgfunc_test_view');
+DROP VIEW pgfunc_test_view;
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index f7f401b534..7ef2ec9972 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26182,6 +26182,56 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
</sect2>
+ <sect2 id="functions-data-sanity">
+ <title>Data Sanity Functions</title>
+
+ <para>
+ The functions shown in <xref linkend="functions-data-sanity-table"/>
+ provide ways to check the sanity of data files in the cluster.
+ </para>
+
+ <table id="functions-data-sanity-table">
+ <title>Data Sanity Functions</title>
+ <tgroup cols="3">
+ <thead>
+ <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+ </row>
+ </thead>
+
+ <tbody>
+ <row>
+ <entry>
+ <literal><function>pg_relation_check_pages(<parameter>relation</parameter> <type>regclass</type> [, <parameter>fork</parameter> <type>text</type> <literal>DEFAULT</literal> <literal>NULL</literal> ])</function></literal>
+ </entry>
+ <entry><type>setof record</type></entry>
+ <entry>Check the pages of a relation.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <indexterm>
+ <primary>pg_relation_check_pages</primary>
+ </indexterm>
+ <para id="functions-check-relation-note" xreflabel="pg_relation_check_pages">
+ <function>pg_relation_check_pages</function> iterates over all blocks of a
+ given relation and verifies if they are in a state where they can safely
+ be loaded into the shared buffers. If defined,
+ <replaceable>fork</replaceable> specifies that only the pages of the given
+ fork are to be verified. Fork can be <literal>'main'</literal> for the
+ main data fork, <literal>'fsm'</literal> for the free space map,
+ <literal>'vm'</literal> for the visibility map, or
+ <literal>'init'</literal> for the initialization fork. The default of
+ <literal>NULL</literal> means that all the forks of the relation are
+ checked. The function returns a list of blocks that are considered
+ corrupted, along with the path of the related file. Use of this function is
+ restricted to superusers by default but access may be granted to others
+ using <command>GRANT</command>.
+ </para>
+
+ </sect2>
+
</sect1>
<sect1 id="functions-trigger">
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ff853634bc..b6acade6c6 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -332,6 +332,7 @@ CatCacheHeader
CatalogId
CatalogIndexState
ChangeVarNodes_context
+CheckPagesErrorInfo
CheckPoint
CheckPointStmt
CheckpointStatsData
