Hi Masahiko-san,

Please find the updated patch with the following new changes:

1) It adds the code changes in heap_force_kill function to clear an
all-visible bit on the visibility map corresponding to the page that
is marked all-visible. Along the way it also clears PD_ALL_VISIBLE
flag on the page header.

2) It adds the code changes in heap_force_freeze function to reset the
ctid value in a tuple header if it is corrupted.

3) It adds several notes and examples in the documentation stating
when and how we need to use the functions provided by this module.

Please have a look and let me know for any other concern.

Thanks,

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

On Thu, Aug 20, 2020 at 11:43 AM Ashutosh Sharma <ashu.coe...@gmail.com> wrote:
>
> On Thu, Aug 20, 2020 at 11:04 AM Masahiko Sawada
> <masahiko.saw...@2ndquadrant.com> wrote:
> >
> > On Wed, 19 Aug 2020 at 20:45, Ashutosh Sharma <ashu.coe...@gmail.com> wrote:
> > >
> > > On Wed, Aug 19, 2020 at 3:55 PM Masahiko Sawada
> > > <masahiko.saw...@2ndquadrant.com> wrote:
> > > >
> > > > On Wed, 19 Aug 2020 at 15:09, Ashutosh Sharma <ashu.coe...@gmail.com> 
> > > > wrote:
> > > > >
> > > > > On Wed, Aug 19, 2020 at 9:27 AM Masahiko Sawada
> > > > > <masahiko.saw...@2ndquadrant.com> wrote:
> > > > > >
> > > > > > On Mon, 17 Aug 2020 at 15:05, Ashutosh Sharma 
> > > > > > <ashu.coe...@gmail.com> wrote:
> > > > > > >
> > > > > > > > pg_force_freeze() can revival a tuple that is already deleted 
> > > > > > > > but not
> > > > > > > > vacuumed yet. Therefore, the user might need to reindex indexes 
> > > > > > > > after
> > > > > > > > using that function. For instance, with the following script, 
> > > > > > > > the last
> > > > > > > > two queries: index scan and seq scan, will return different 
> > > > > > > > results.
> > > > > > > >
> > > > > > > > set enable_seqscan to off;
> > > > > > > > set enable_bitmapscan to off;
> > > > > > > > set enable_indexonlyscan to off;
> > > > > > > > create table tbl (a int primary key);
> > > > > > > > insert into tbl values (1);
> > > > > > > >
> > > > > > > > update tbl set a = a + 100 where a = 1;
> > > > > > > >
> > > > > > > > explain analyze select * from tbl where a < 200;
> > > > > > > >
> > > > > > > > -- revive deleted tuple on heap
> > > > > > > > select heap_force_freeze('tbl', array['(0,1)'::tid]);
> > > > > > > >
> > > > > > > > -- index scan returns 2 tuples
> > > > > > > > explain analyze select * from tbl where a < 200;
> > > > > > > >
> > > > > > > > -- seq scan returns 1 tuple
> > > > > > > > set enable_seqscan to on;
> > > > > > > > explain analyze select * from tbl;
> > > > > > > >
> > > > > > >
> > > > > > > I am not sure if this is the right use-case of pg_force_freeze
> > > > > > > function. I think we should only be running pg_force_freeze 
> > > > > > > function
> > > > > > > on a tuple for which VACUUM reports "found xmin ABC from before
> > > > > > > relfrozenxid PQR" sort of error otherwise it might worsen the 
> > > > > > > things
> > > > > > > instead of making it better.
> > > > > >
> > > > > > Should this also be documented? I think that it's hard to force the
> > > > > > user to always use this module in the right situation but we need to
> > > > > > show at least when to use.
> > > > > >
> > > > >
> > > > > I've already added some examples in the documentation explaining the
> > > > > use-case of force_freeze function. If required, I will also add a note
> > > > > about it.
> > > > >
> > > > > > > > Also, if a tuple updated and moved to another partition is 
> > > > > > > > revived by
> > > > > > > > heap_force_freeze(), its ctid still has special values:
> > > > > > > > MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I 
> > > > > > > > don't
> > > > > > > > see a problem yet caused by a visible tuple having the special 
> > > > > > > > ctid
> > > > > > > > value, but it might be worth considering either to reset ctid 
> > > > > > > > value as
> > > > > > > > well or to not freezing already-deleted tuple.
> > > > > > > >
> > > > > > >
> > > > > > > For this as well, the answer remains the same as above.
> > > > > >
> > > > > > Perhaps the same is true when a tuple header is corrupted including
> > > > > > xmin and ctid for some reason and the user wants to fix it? I'm
> > > > > > concerned that a live tuple having the wrong ctid will cause SEGV or
> > > > > > PANIC error in the future.
> > > > > >
> > > > >
> > > > > If a tuple header itself is corrupted, then I think we must kill that
> > > > > tuple. If only xmin and t_ctid fields are corrupted, then probably we
> > > > > can think of resetting the ctid value of that tuple. However, it won't
> > > > > be always possible to detect the corrupted ctid value. It's quite
> > > > > possible that the corrupted ctid field has valid values for block
> > > > > number and offset number in it, but it's actually corrupted and it
> > > > > would be difficult to consider such ctid as corrupted. Hence, we can't
> > > > > do anything about such types of corruption. Probably in such cases we
> > > > > need to run VACUUM FULL on such tables so that new ctid gets generated
> > > > > for each tuple in the table.
> > > >
> > > > Understood.
> > > >
> > > > Perhaps such corruption will be able to be detected by new heapam
> > > > check functions discussed on another thread. My point was that it
> > > > might be better to attempt making the tuple header sane state as much
> > > > as possible when fixing a live tuple in order to prevent further
> > > > problems such as databases crash by malicious attackers.
> > > >
> > >
> > > Agreed. So, to handle the ctid related concern that you raised, I'm
> > > planning to do the following changes to ensure that the tuple being
> > > marked as frozen contains the correct item pointer value. Please let
> > > me know if you are okay with these changes.
> >
> > Given that a live tuple never indicates to ve moved partitions, I
> > guess the first condition in the if statement is not necessary. The
> > rest looks good to me, although other hackers might think differently.
> >
>
> Okay, thanks for confirming. I am planning to go ahead with this
> approach. Will later see what others have to say about it.
>
> --
> With Regards,
> Ashutosh Sharma
> EnterpriseDB:http://www.enterprisedb.com
From 4615bd2356a8102fed3f71efeb7181f8857f04e6 Mon Sep 17 00:00:00 2001
From: ashu <ashutosh.sha...@enterprisedb.com>
Date: Fri, 21 Aug 2020 18:13:35 +0530
Subject: [PATCH] Add contrib/pg_surgery to perform surgery on a damaged heap
 table.

This contrib module basically adds a couple of functions named
heap_force_kill and heap_force_freeze that can be used in the scripts
to perform surgery on the damaged heap tables.

Ashutosh Sharma.
---
 contrib/Makefile                           |   1 +
 contrib/pg_surgery/Makefile                |  23 ++
 contrib/pg_surgery/expected/pg_surgery.out | 161 +++++++++++
 contrib/pg_surgery/heap_surgery.c          | 418 +++++++++++++++++++++++++++++
 contrib/pg_surgery/pg_surgery--1.0.sql     |  18 ++
 contrib/pg_surgery/pg_surgery.control      |   5 +
 contrib/pg_surgery/sql/pg_surgery.sql      |  89 ++++++
 doc/src/sgml/contrib.sgml                  |   1 +
 doc/src/sgml/filelist.sgml                 |   1 +
 doc/src/sgml/pgsurgery.sgml                | 136 ++++++++++
 10 files changed, 853 insertions(+)
 create mode 100644 contrib/pg_surgery/Makefile
 create mode 100644 contrib/pg_surgery/expected/pg_surgery.out
 create mode 100644 contrib/pg_surgery/heap_surgery.c
 create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql
 create mode 100644 contrib/pg_surgery/pg_surgery.control
 create mode 100644 contrib/pg_surgery/sql/pg_surgery.sql
 create mode 100644 doc/src/sgml/pgsurgery.sgml

diff --git a/contrib/Makefile b/contrib/Makefile
index 1846d41..c8d2a16 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
 		pg_prewarm	\
 		pg_standby	\
 		pg_stat_statements \
+		pg_surgery	\
 		pg_trgm		\
 		pgcrypto	\
 		pgrowlocks	\
diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile
new file mode 100644
index 0000000..ecf2e20
--- /dev/null
+++ b/contrib/pg_surgery/Makefile
@@ -0,0 +1,23 @@
+# contrib/pg_surgery/Makefile
+
+MODULE_big = pg_surgery
+OBJS = \
+	$(WIN32RES) \
+	heap_surgery.o
+
+EXTENSION = pg_surgery
+DATA = pg_surgery--1.0.sql
+PGFILEDESC = "pg_surgery - perform surgery on a damaged relation"
+
+REGRESS = pg_surgery
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_surgery
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_surgery/expected/pg_surgery.out b/contrib/pg_surgery/expected/pg_surgery.out
new file mode 100644
index 0000000..9858de2
--- /dev/null
+++ b/contrib/pg_surgery/expected/pg_surgery.out
@@ -0,0 +1,161 @@
+create extension pg_surgery;
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- supported relations passes.
+--
+-- normal heap table.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+select * from htab where xmin = 2;
+ a 
+---
+(0 rows)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+ heap_force_freeze 
+-------------------
+ 
+(1 row)
+
+select ctid, xmax from htab where xmin = 2;
+ ctid  | xmax 
+-------+------
+ (0,4) |    0
+(1 row)
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+ heap_force_kill 
+-----------------
+ 
+(1 row)
+
+select * from htab where ctid = '(0, 4)';
+ a 
+---
+(0 rows)
+
+rollback;
+-- materialized view.
+begin;
+create materialized view mvw as select a from generate_series(1, 3) a;
+select * from mvw where xmin = 2;
+ a 
+---
+(0 rows)
+
+select heap_force_freeze('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+ heap_force_freeze 
+-------------------
+ 
+(1 row)
+
+select * from mvw where xmin = 2;
+ a 
+---
+ 3
+(1 row)
+
+select heap_force_kill('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+ heap_force_kill 
+-----------------
+ 
+(1 row)
+
+select * from mvw where ctid = '(0, 3)';
+ a 
+---
+(0 rows)
+
+rollback;
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- unsupported relations fails.
+--
+-- partitioned tables (the parent table) doesn't contain any tuple.
+create table ptab (a int) partition by list (a);
+select heap_force_kill('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR:  only the relation using heap_tableam_handler is supported
+select heap_force_freeze('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR:  only the relation using heap_tableam_handler is supported
+create index ptab_idx on ptab (a);
+-- indexes are not supported, should fail.
+select heap_force_kill('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR:  only the relation using heap_tableam_handler is supported
+select heap_force_freeze('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR:  only the relation using heap_tableam_handler is supported
+create view vw as select 1;
+-- views are not supported as well. so, all these should fail.
+select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR:  only the relation using heap_tableam_handler is supported
+select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR:  only the relation using heap_tableam_handler is supported
+create sequence seq;
+-- sequences are not supported as well. so, all these functions should fail.
+select heap_force_kill('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR:  "seq" is not a table, materialized view, or TOAST table
+select heap_force_freeze('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+ERROR:  "seq" is not a table, materialized view, or TOAST table
+--
+-- Some negative test-cases with invalid inputs.
+--
+begin;
+create table htab(a int);
+-- invalid block number, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE:  skipping block 0 for relation "htab" because the block number is out of range
+ heap_force_kill 
+-----------------
+ 
+(1 row)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE:  skipping block 0 for relation "htab" because the block number is out of range
+ heap_force_freeze 
+-------------------
+ 
+(1 row)
+
+insert into htab values (10);
+-- invalid offset number, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 2)']::tid[]);
+NOTICE:  skipping tid (0, 2) for relation "htab" because the item number is out of range for this block
+ heap_force_kill 
+-----------------
+ 
+(1 row)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 2)']::tid[]);
+NOTICE:  skipping tid (0, 2) for relation "htab" because the item number is out of range for this block
+ heap_force_freeze 
+-------------------
+ 
+(1 row)
+
+-- dead tuple, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+ heap_force_kill 
+-----------------
+ 
+(1 row)
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE:  skipping tid (0, 1) for relation "htab" because it is marked dead
+ heap_force_kill 
+-----------------
+ 
+(1 row)
+
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+NOTICE:  skipping tid (0, 1) for relation "htab" because it is marked dead
+ heap_force_freeze 
+-------------------
+ 
+(1 row)
+
+rollback;
+-- cleanup.
+drop table ptab;
+drop view vw;
+drop sequence seq;
+drop extension pg_surgery;
diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
new file mode 100644
index 0000000..5228990
--- /dev/null
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -0,0 +1,418 @@
+/*-------------------------------------------------------------------------
+ *
+ * heap_surgery.c
+ *	  Functions to perform surgery on the damaged heap table.
+ *
+ * Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  contrib/pg_surgery/heap_surgery.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heapam.h"
+#include "access/visibilitymap.h"
+#include "catalog/pg_am_d.h"
+#include "catalog/pg_proc_d.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/acl.h"
+#include "utils/rel.h"
+
+PG_MODULE_MAGIC;
+
+/* Options to forcefully change the state of a heap tuple. */
+typedef enum HeapTupleForceOption
+{
+	HEAP_FORCE_KILL,
+	HEAP_FORCE_FREEZE
+} HeapTupleForceOption;
+
+PG_FUNCTION_INFO_V1(heap_force_kill);
+PG_FUNCTION_INFO_V1(heap_force_freeze);
+
+static int32 tidcmp(const void *a, const void *b);
+static Datum heap_force_common(FunctionCallInfo fcinfo,
+							   HeapTupleForceOption heap_force_opt);
+static void sanity_check_tid_array(ArrayType *ta, int *ntids);
+static void sanity_check_relation(Relation rel);
+static BlockNumber tids_same_page_fetch_offnums(ItemPointer tids, int ntids,
+												OffsetNumber *next_start_ptr,
+												OffsetNumber *offnos);
+
+/*-------------------------------------------------------------------------
+ * heap_force_kill()
+ *
+ * Force kill the tuple(s) pointed to by the item pointer(s) stored in the
+ * given tid array.
+ *
+ * Usage: SELECT heap_force_kill(regclass, tid[]);
+ *-------------------------------------------------------------------------
+ */
+Datum
+heap_force_kill(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_DATUM(heap_force_common(fcinfo, HEAP_FORCE_KILL));
+}
+
+/*-------------------------------------------------------------------------
+ * heap_force_freeze()
+ *
+ * Force freeze the tuple(s) pointed to by the item pointer(s) stored in the
+ * given tid array.
+ *
+ * Usage: SELECT heap_force_freeze(regclass, tid[]);
+ *-------------------------------------------------------------------------
+ */
+Datum
+heap_force_freeze(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_DATUM(heap_force_common(fcinfo, HEAP_FORCE_FREEZE));
+}
+
+/*-------------------------------------------------------------------------
+ * heap_force_common()
+ *
+ * Common code for heap_force_kill and heap_force_freeze
+ *-------------------------------------------------------------------------
+ */
+static Datum
+heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
+{
+	Oid				relid = PG_GETARG_OID(0);
+	ArrayType	   *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
+	ItemPointer		tids;
+	int				ntids,
+					nblocks;
+	Relation		rel;
+	OffsetNumber   *offnos;
+	OffsetNumber	noffs,
+					curr_start_ptr,
+					next_start_ptr;
+
+	if (RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is in progress"),
+				 errhint("heap surgery functions cannot be executed during recovery.")));
+
+	/* Basic sanity checking. */
+	sanity_check_tid_array(ta, &ntids);
+
+	rel = relation_open(relid, RowExclusiveLock);
+
+	sanity_check_relation(rel);
+
+	tids = ((ItemPointer) ARR_DATA_PTR(ta));
+
+	/*
+	 * If there is more than one tid in the array, sort it so that we can
+	 * easily fetch all the tids belonging to one particular page from the
+	 * array.
+	 */
+	if (ntids > 1)
+		qsort((void*) tids, ntids, sizeof(ItemPointerData), tidcmp);
+
+	offnos = (OffsetNumber *) palloc(ntids * sizeof(OffsetNumber));
+	noffs = curr_start_ptr = next_start_ptr = 0;
+	nblocks = RelationGetNumberOfBlocks(rel);
+
+	do
+	{
+		Buffer			buf;
+		Buffer			vmbuf = InvalidBuffer;
+		Page			page;
+		BlockNumber		blkno;
+		OffsetNumber	maxoffset;
+		int				i;
+		bool			did_modify_page = false;
+		bool			did_modify_vm = false;
+
+		/*
+		 * Get the offset numbers from the tids belonging to one particular page
+		 * and process them one by one.
+		 */
+		blkno = tids_same_page_fetch_offnums(tids, ntids, &next_start_ptr,
+											 offnos);
+
+		/* Calculate the number of offsets stored in offnos array. */
+		noffs = next_start_ptr - curr_start_ptr;
+
+		/*
+		 * Update the current start pointer so that next time when
+		 * tids_same_page_fetch_offnums() is called, we can calculate the number
+		 * of offsets present in the offnos array.
+		 */
+		curr_start_ptr = next_start_ptr;
+
+		/* Check whether the block number is valid. */
+		if (blkno >= nblocks)
+		{
+			ereport(NOTICE,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("skipping block %u for relation \"%s\" because the block number is out of range",
+							blkno, RelationGetRelationName(rel))));
+			continue;
+		}
+
+		CHECK_FOR_INTERRUPTS();
+
+		buf = ReadBuffer(rel, blkno);
+		LockBufferForCleanup(buf);
+
+		page = BufferGetPage(buf);
+
+		maxoffset = PageGetMaxOffsetNumber(page);
+
+		/*
+		 * Before entering the critical section, pin the visibility map page if
+		 * it appears to be necessary.
+		 */
+		if (heap_force_opt == HEAP_FORCE_KILL && PageIsAllVisible(page))
+			visibilitymap_pin(rel, blkno, &vmbuf);
+
+		/* No ereport(ERROR) from here until all the changes are logged. */
+		START_CRIT_SECTION();
+
+		for (i = 0; i < noffs; i++)
+		{
+			OffsetNumber	offno;
+			ItemId			itemid;
+
+			if (offnos[i] == 0 || offnos[i] > maxoffset)
+			{
+				ereport(NOTICE,
+						 errmsg("skipping tid (%u, %u) for relation \"%s\" because the item number is out of range for this block",
+								blkno, offnos[i], RelationGetRelationName(rel)));
+				continue;
+			}
+
+			itemid = PageGetItemId(page, offnos[i]);
+
+			/* Follow any redirections until we find something useful. */
+			while (ItemIdIsRedirected(itemid))
+			{
+				offno = ItemIdGetRedirect(itemid);
+				itemid = PageGetItemId(page, offno);
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/* Nothing to do if the itemid is unused or already dead. */
+			if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
+			{
+				if (!ItemIdIsUsed(itemid))
+					ereport(NOTICE,
+							(errmsg("skipping tid (%u, %u) for relation \"%s\" because it is marked unused",
+									blkno, offnos[i], RelationGetRelationName(rel))));
+				else
+					ereport(NOTICE,
+							(errmsg("skipping tid (%u, %u) for relation \"%s\" because it is marked dead",
+									blkno, offnos[i], RelationGetRelationName(rel))));
+				continue;
+			}
+
+			Assert(ItemIdIsNormal(itemid));
+
+			did_modify_page = true;
+
+			if (heap_force_opt == HEAP_FORCE_KILL)
+			{
+
+				ItemIdSetDead(itemid);
+
+				/*
+				 * If the page is marked all-visible, we must clear
+				 * PD_ALL_VISIBLE flag on the page header and an all-visible bit
+				 * on the visibility map corresponding to the page.
+				 */
+				if (PageIsAllVisible(page))
+				{
+					PageClearAllVisible(page);
+					visibilitymap_clear(rel, blkno, vmbuf,
+										VISIBILITYMAP_ALL_VISIBLE);
+					did_modify_vm = true;
+				}
+			}
+			else
+			{
+				HeapTupleHeader	htup;
+				ItemPointerData	ctid;
+
+				Assert(heap_force_opt == HEAP_FORCE_FREEZE);
+
+				ItemPointerSet(&ctid, blkno, offnos[i]);
+
+				htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+				/*
+				 * Make sure that this tuple holds the correct item pointer
+				 * value.
+				 */
+				if (!ItemPointerEquals(&ctid, &htup->t_ctid))
+					ItemPointerSet(&htup->t_ctid, blkno, offnos[i]);
+
+				HeapTupleHeaderSetXmin(htup, FrozenTransactionId);
+				HeapTupleHeaderSetXmax(htup, InvalidTransactionId);
+
+				/* We might have MOVED_OFF/MOVED_IN tuples in the database */
+				if (htup->t_infomask & HEAP_MOVED)
+				{
+					if (htup->t_infomask & HEAP_MOVED_OFF)
+						HeapTupleHeaderSetXvac(htup, InvalidTransactionId);
+					else
+						HeapTupleHeaderSetXvac(htup, FrozenTransactionId);
+				}
+
+				/*
+				 * Clear all the visibility-related bits of this tuple and mark
+				 * it as frozen. Also, get rid of HOT_UPDATED and KEYS_UPDATES
+				 * bits.
+				 */
+				htup->t_infomask &= ~HEAP_XACT_MASK;
+				htup->t_infomask |= (HEAP_XMIN_FROZEN | HEAP_XMAX_INVALID);
+				htup->t_infomask2 &= ~HEAP_HOT_UPDATED;
+				htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
+			}
+		}
+
+		/*
+		 * If the page was modified, only then, we mark the buffer dirty or do
+		 * the WAL logging.
+		 */
+		if (did_modify_page)
+		{
+			/* Mark buffer dirty before we write WAL. */
+			MarkBufferDirty(buf);
+
+			/* XLOG stuff */
+			if (RelationNeedsWAL(rel))
+				log_newpage_buffer(buf, true);
+		}
+
+		/* WAL log the VM page if it was modified. */
+		if (did_modify_vm && RelationNeedsWAL(rel))
+			log_newpage_buffer(vmbuf, false);
+
+		END_CRIT_SECTION();
+
+		UnlockReleaseBuffer(buf);
+
+		if (vmbuf != InvalidBuffer)
+			ReleaseBuffer(vmbuf);
+	} while (next_start_ptr != ntids);
+
+	relation_close(rel, RowExclusiveLock);
+
+	pfree(ta);
+	pfree(offnos);
+
+	PG_RETURN_VOID();
+}
+
+/*-------------------------------------------------------------------------
+ * tidcmp()
+ *
+ * Compare two item pointers, return -1, 0, or +1.
+ *
+ * See ItemPointerCompare for details.
+ * ------------------------------------------------------------------------
+ */
+static int32
+tidcmp(const void *a, const void *b)
+{
+	ItemPointer iptr1 = ((const ItemPointer) a);
+	ItemPointer iptr2 = ((const ItemPointer) b);
+
+	return ItemPointerCompare(iptr1, iptr2);
+}
+
+/*-------------------------------------------------------------------------
+ * sanity_check_tid_array()
+ *
+ * Perform sanity check on the given tid array.
+ * ------------------------------------------------------------------------
+ */
+static void
+sanity_check_tid_array(ArrayType *ta, int *ntids)
+{
+	if (ARR_HASNULL(ta) && array_contains_nulls(ta))
+		ereport(ERROR,
+				(errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
+				 errmsg("array must not contain nulls")));
+
+	*ntids = ArrayGetNItems(ARR_NDIM(ta), ARR_DIMS(ta));
+
+	if (*ntids == 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("empty tid array")));
+}
+
+/*-------------------------------------------------------------------------
+ * sanity_check_relation()
+ *
+ * Perform sanity check on the given relation.
+ * ------------------------------------------------------------------------
+ */
+static void
+sanity_check_relation(Relation rel)
+{
+	if (rel->rd_amhandler != HEAP_TABLE_AM_HANDLER_OID)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("only the relation using heap_tableam_handler is supported")));
+
+	if (rel->rd_rel->relkind != RELKIND_RELATION &&
+		rel->rd_rel->relkind != RELKIND_MATVIEW &&
+		rel->rd_rel->relkind != RELKIND_TOASTVALUE)
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("\"%s\" is not a table, materialized view, or TOAST table",
+						RelationGetRelationName(rel))));
+
+	/* Must be owner of the table or superuser. */
+	if (!pg_class_ownercheck(RelationGetRelid(rel), GetUserId()))
+		aclcheck_error(ACLCHECK_NOT_OWNER,
+					   get_relkind_objtype(rel->rd_rel->relkind),
+					   RelationGetRelationName(rel));
+}
+
+/*-------------------------------------------------------------------------
+ * tids_same_page_fetch_offnums()
+ *
+ * Find out all the tids residing in the same page as tids[next_start_ptr] and
+ * fetch the offset number stored in each of them into a caller-allocated offset
+ * number array.
+ * ------------------------------------------------------------------------
+ */
+static BlockNumber
+tids_same_page_fetch_offnums(ItemPointer tids, int ntids,
+							 OffsetNumber *next_start_ptr, OffsetNumber *offnos)
+{
+	int				i;
+	BlockNumber		prev_blkno,
+					blkno;
+	OffsetNumber	offno;
+
+	prev_blkno = blkno = InvalidBlockNumber;
+
+	for (i = *next_start_ptr; i < ntids; i++)
+	{
+		ItemPointerData tid = tids[i];
+
+		blkno = ItemPointerGetBlockNumberNoCheck(&tid);
+		offno = ItemPointerGetOffsetNumberNoCheck(&tid);
+
+		if (i == *next_start_ptr || (prev_blkno == blkno))
+			offnos[i - *next_start_ptr] = offno;
+		else
+			break;
+
+		prev_blkno = blkno;
+	}
+
+	*next_start_ptr = i;
+	return prev_blkno;
+}
\ No newline at end of file
diff --git a/contrib/pg_surgery/pg_surgery--1.0.sql b/contrib/pg_surgery/pg_surgery--1.0.sql
new file mode 100644
index 0000000..2ae7f22
--- /dev/null
+++ b/contrib/pg_surgery/pg_surgery--1.0.sql
@@ -0,0 +1,18 @@
+/* contrib/pg_surgery/pg_surgery--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_surgery" to load this file. \quit
+
+CREATE FUNCTION heap_force_kill(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_kill'
+LANGUAGE C STRICT;
+
+REVOKE EXECUTE ON FUNCTION heap_force_kill(regclass, tid[]) FROM PUBLIC;
+
+CREATE FUNCTION heap_force_freeze(reloid regclass, tids tid[])
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'heap_force_freeze'
+LANGUAGE C STRICT;
+
+REVOKE EXECUTE ON FUNCTION heap_force_freeze(regclass, tid[]) FROM PUBLIC;
\ No newline at end of file
diff --git a/contrib/pg_surgery/pg_surgery.control b/contrib/pg_surgery/pg_surgery.control
new file mode 100644
index 0000000..2bcdad1
--- /dev/null
+++ b/contrib/pg_surgery/pg_surgery.control
@@ -0,0 +1,5 @@
+# pg_surgery extension
+comment = 'extension to perform surgery on a damaged relation'
+default_version = '1.0'
+module_pathname = '$libdir/pg_surgery'
+relocatable = true
diff --git a/contrib/pg_surgery/sql/pg_surgery.sql b/contrib/pg_surgery/sql/pg_surgery.sql
new file mode 100644
index 0000000..4aef380
--- /dev/null
+++ b/contrib/pg_surgery/sql/pg_surgery.sql
@@ -0,0 +1,89 @@
+create extension pg_surgery;
+
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- supported relations passes.
+--
+
+-- normal heap table.
+begin;
+create table htab(a int);
+insert into htab values (100), (200), (300), (400), (500);
+
+select * from htab where xmin = 2;
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+select ctid, xmax from htab where xmin = 2;
+
+select heap_force_kill('htab'::regclass, ARRAY['(0, 4)']::tid[]);
+select * from htab where ctid = '(0, 4)';
+rollback;
+
+-- materialized view.
+begin;
+create materialized view mvw as select a from generate_series(1, 3) a;
+
+select * from mvw where xmin = 2;
+select heap_force_freeze('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+select * from mvw where xmin = 2;
+
+select heap_force_kill('mvw'::regclass, ARRAY['(0, 3)']::tid[]);
+select * from mvw where ctid = '(0, 3)';
+rollback;
+
+--
+-- check that using heap_force_kill and heap_force_freeze functions with the
+-- unsupported relations fails.
+--
+
+-- partitioned tables (the parent table) doesn't contain any tuple.
+create table ptab (a int) partition by list (a);
+
+select heap_force_kill('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('ptab'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create index ptab_idx on ptab (a);
+
+-- indexes are not supported, should fail.
+select heap_force_kill('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('ptab_idx'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create view vw as select 1;
+
+-- views are not supported as well. so, all these should fail.
+select heap_force_kill('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('vw'::regclass, ARRAY['(0, 1)']::tid[]);
+
+create sequence seq;
+
+-- sequences are not supported as well. so, all these functions should fail.
+select heap_force_kill('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('seq'::regclass, ARRAY['(0, 1)']::tid[]);
+
+--
+-- Some negative test-cases with invalid inputs.
+--
+begin;
+create table htab(a int);
+
+-- invalid block number, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+
+insert into htab values (10);
+
+-- invalid offset number, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 2)']::tid[]);
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 2)']::tid[]);
+
+-- dead tuple, should be skipped.
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_kill('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+select heap_force_freeze('htab'::regclass, ARRAY['(0, 1)']::tid[]);
+
+rollback;
+
+-- cleanup.
+drop table ptab;
+drop view vw;
+drop sequence seq;
+drop extension pg_surgery;
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index 261a559..c82dde2 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -125,6 +125,7 @@ CREATE EXTENSION <replaceable>module_name</replaceable>;
  &pgrowlocks;
  &pgstatstatements;
  &pgstattuple;
+ &pgsurgery;
  &pgtrgm;
  &pgvisibility;
  &postgres-fdw;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 64b5da0..828396d 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -139,6 +139,7 @@
 <!ENTITY pgstandby       SYSTEM "pgstandby.sgml">
 <!ENTITY pgstatstatements SYSTEM "pgstatstatements.sgml">
 <!ENTITY pgstattuple     SYSTEM "pgstattuple.sgml">
+<!ENTITY pgsurgery       SYSTEM "pgsurgery.sgml">
 <!ENTITY pgtrgm          SYSTEM "pgtrgm.sgml">
 <!ENTITY pgvisibility    SYSTEM "pgvisibility.sgml">
 <!ENTITY postgres-fdw    SYSTEM "postgres-fdw.sgml">
diff --git a/doc/src/sgml/pgsurgery.sgml b/doc/src/sgml/pgsurgery.sgml
new file mode 100644
index 0000000..8120c1b
--- /dev/null
+++ b/doc/src/sgml/pgsurgery.sgml
@@ -0,0 +1,136 @@
+<!-- doc/src/sgml/pgsurgery.sgml -->
+
+<sect1 id="pgsurgery" xreflabel="pg_surgery">
+ <title>pg_surgery</title>
+
+ <indexterm zone="pgsurgery">
+  <primary>pg_surgery</primary>
+ </indexterm>
+
+ <para>
+  The <filename>pg_surgery</filename> module provides various functions to
+  perform surgery on a damaged relation. It is intended to be used only with a
+  damaged relation and must be operated only on damaged tuples of such
+  relations.
+ </para>
+
+ <para>
+  By default, only the superusers have <literal>EXECUTE</literal> privilege on
+  the functions provided by this module. Once the extension has been installed,
+  users may issue <command>GRANT</command> commands to change the privileges on
+  the functions to allow others to execute them.
+ </para>
+
+ <note>
+ <para>
+  After a surgery is performed on a damaged relation using this module, we must
+  run VACUUM with DISABLE_PAGE_SKIPPING option on that relation to bring it back
+  into a sane state (see examples in the next section).
+ </para>
+ </note>
+
+ <note>
+ <para>
+  While performing surgery on a damaged relation, we must not be doing anything
+  else on that relation in parallel. This is to ensure that when we are
+  operating on a damaged tuple there is no other transaction trying to modify
+  that tuple.
+ </para>
+ </note>
+
+ <sect2>
+  <title>Functions</title>
+
+  <variablelist>
+   <varlistentry>
+    <term>
+     <function>heap_force_kill(regclass, tid[]) returns void</function>
+    </term>
+
+    <listitem>
+     <para>
+      <function>heap_force_kill</function> marks a <quote>used</quote> line
+      pointer as <quote>dead</quote> without examining the tuple. It is used to
+      force-kill a corrupted tuple in the table that may or may not be
+      accessible. It takes the relation oid and an array of item pointers as an
+      input arguments. Here is an example demonstrating when and how this
+      function is used:
+<programlisting>
+test=&gt; select * from t1 where ctid = '(0, 1)';
+ERROR:  could not access status of transaction 4007513275
+DETAIL:  Could not open file "pg_xact/0EED": No such file or directory.
+
+test=# select heap_force_kill('t1'::regclass, ARRAY['(0, 1)']::tid[]);
+-[ RECORD 1 ]---+-
+heap_force_kill |
+
+test=# vacuum (disable_page_skipping) t1;
+VACUUM 
+
+test=# select * from t1 where ctid = '(0, 1)';
+(0 rows)
+</programlisting>
+    </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
+     <function>heap_force_freeze(regclass, tid[]) returns void</function>
+    </term>
+
+    <listitem>
+     <para>
+      <function>heap_force_freeze</function> marks a live tuple (whose xmin is
+      corrupted) as frozen by setting its xmin to FrozenTransactionId and xmax
+      to InvalidTransactionId. It clears all the visibility related bits in the
+      infomask of the tuple and sets the infomask to HEAP_XMIN_FROZEN and
+      HEAP_XMAX_INVALID. It is used to force-freeze a tuple in the table for
+      which VACUUM fails with "found xmin ... from before relfrozenxid ..."
+      error. Similar to <function>heap_force_kill</function> function, this also
+      takes the relation oid and an array of item pointers as an input arguments.
+      Here is an example demonstrating when and how this function is used:
+<programlisting>
+test=&gt; vacuum t1;
+ERROR:  found xmin 507 from before relfrozenxid 515
+CONTEXT:  while scanning block 0 of relation "public.t1"
+
+test=# select ctid from t1 where xmin = 507;
+ ctid  
+-------
+ (0,3)
+(1 row)
+
+test=# select heap_force_freeze('t1'::regclass, ARRAY['(0, 3)']::tid[]);
+-[ RECORD 1 ]-----+-
+heap_force_freeze | 
+
+test=# vacuum (disable_page_skipping) t1;
+VACUUM
+</programlisting>
+    </para>
+
+    <note>
+    <para>
+     This function should only be operated on a tuple for which VACUUM fails
+     with "found xmin ... from before relfrozenxid ..." error. The user must
+     verify that only the xmin and ctid fields of the tuple are corrupted, if
+     any other fields of a tuple are corrupted, they must avoid running this
+     function on such tuples.
+    </para>
+    </note>
+   </listitem>
+  </varlistentry>
+
+  </variablelist>
+ </sect2>
+
+ <sect2>
+  <title>Authors</title>
+
+  <para>
+   Ashutosh Sharma <email>ashu.coe...@gmail.com</email>
+  </para>
+ </sect2>
+
+</sect1>
-- 
1.8.3.1

Reply via email to