Hello, hackers.

[ABSTRACT]

Executing queries on a hot standby is one of the most popular ways to scale an application's workload. Most modern Postgres installations have two standby nodes for high availability, so utilizing the replicas' CPU seems like a reasonable idea. At the same time, some queries (index scans) can be much slower on a hot standby than on the primary. This happens because the LP_DEAD index hint bit mechanism is ignored in index scans during recovery. That is done for a reason, of course [1]:

 * We do this because the xmin on the primary node could easily be
 * later than the xmin on the standby node, so that what the primary
 * thinks is killed is supposed to be visible on standby. So for correct
 * MVCC for queries during recovery we must ignore these hints and check
 * all tuples.

Also, according to [2] and cases like [3], it seems to be a good idea to support "ignore_killed_tuples" on standby.

The goal of this patch is to provide full support for index hint bits on hot standby. The mechanism should be based on well-tested functionality and should not cause a lot of recovery conflicts. This thread is the continuation of (and partly a copy-paste from) the previous one [4].

[PROBLEM]

The standby itself can set and read hint bits during recovery. Such bits are even correct according to the standby's visibility rules. The problem is the full-page-write WAL records coming from the primary: such records can carry hint bits that are invalid according to the standby's xmin. So, if we can be sure a scan never sees an invalid hint bit from the primary, the problem is solved - and we can even allow the standby to set its own LP_DEAD bits.

The idea is simple: let the primary WAL-log the hint bit horizon before the FPW somehow. That could produce a lot of additional WAL, but there are ways to avoid it:

1) Send only a single `latestRemovedXid` for all tuples marked dead during a page scan.
2) Remember the latest sent `latestRemovedXid` in shared memory, and optimistically skip WAL records with older xid values [5].

Such WAL records would still cause a lot of recovery conflicts on standbys. But we can be tricky here: use hint bits only if hot_standby_feedback is enabled and effective on the standby. If HSF is effective, conflicts are not possible. If HSF is off, the standby ignores both the hint bits and the additional conflict resolution. The major point here is that HSF is merely an optimization and has nothing to do with MVCC correctness.

[DETAILS]

The patch introduces a new WAL record (named XLOG_INDEX_HINT_BITS_HORIZON) that defines the xmin horizon a standby snapshot must satisfy in order to use LP_DEAD bits in an index scan.

`table_index_fetch_tuple` now returns a `latest_removed_xid` value in addition to `all_dead`. This value is used to advance `killedLatestRemovedXid` at the time `killedItems` is updated (see `IndexHintBitAdvanceLatestRemovedXid`).

The primary sends the value of `killedLatestRemovedXid` in an XLOG_INDEX_HINT_BITS_HORIZON record before it marks the page dirty after setting LP_DEAD bits on an index page (by calling `MarkBufferDirtyIndexHint`). The new WAL record is always emitted before any possible FPW. Such a record only needs to be sent if its `latestRemovedXid` is newer than the one previously sent for the current database (see `LogIndexHintBitsHorizonIfNeeded`).

There is a new flag in the PGPROC structure - `indexIgnoreKilledTuples`. If the flag is set to true, standby queries will use LP_DEAD bits in index scans. In that case the snapshot is required to satisfy the new horizon pushed by XLOG_INDEX_HINT_BITS_HORIZON records.

From the perspective of correctness it is safe to set `indexIgnoreKilledTuples` to either value, but `true` can cause recovery conflicts. It is a kind of compromise - use LP_DEAD bits but be subject to cancellation by XLOG_INDEX_HINT_BITS_HORIZON, or vice versa. What is the right way to decide on this compromise?
It is pretty simple: if `hot_standby_feedback` is on and confirmed feedback from the primary has been received, then set `indexIgnoreKilledTuples` (see `GetSnapshotIndexIgnoreKilledTuples`). As long as feedback is working as expected, the query will never be canceled by XLOG_INDEX_HINT_BITS_HORIZON.

To support cascading standby setups (where the feedback chain could be broken in the middle), an additional byte was added to the keep-alive message of the feedback protocol. This byte is used to make sure our xmin is honored by the primary (see `sender_propagates_feedback_to_primary`). Also, the WAL sender now always sends a keep-alive after receiving a feedback message.

This way, it is safe to use LP_DEAD bits received from the primary when we want to. And, as a result, it is safe to set LP_DEAD bits on the standby. Even if:

* the primary changes vacuum_defer_cleanup_age
* the standby is restarted
* the standby is promoted to primary
* a base backup is taken from the standby
* the standby is serving queries during recovery

nothing can go wrong here, because `HeapTupleIsSurelyDead` (and an index LP_DEAD bit as a result) requires the *heap* hint bits to already be set on the standby. So the same code that decides to set hint bits on the heap (which standbys have done for a long time already) also decides to set them in the index.

[EVALUATION]

It is hard to find an ideal performance test for this kind of optimization, but there is a possible example in the attachment. It uses the standard pgbench schema with an additional index on balance and randomized balance values. On the primary, the test does the following:

1) transfer some money from one of the top 100 richest accounts to one of the top 100 poorest accounts;
2) calculate the amount of money in the top 10 richest and top 10 poorest accounts (including an additional field to avoid an index-only scan).

On the standby, only step 2 is used. The patched version is about 9x faster for standby queries - 455 TPS unpatched versus 4192 TPS patched on my system. There is no visible difference on the primary.
To estimate the additional amount of WAL, I counted records in WAL segments under different conditions (pg_waldump pgdata/pg_wal/XXX | grep INDEX_HINT_BITS_HORIZON | wc -l):

- hot_standby_feedback=off - 5181 of 226274 records, ~2%
- hot_standby_feedback=on (without load on standby) - 70 of 202594 records, ~0.03%
- hot_standby_feedback=on (with load on standby) - 17 of 70504 records, ~0.02%

So, with HSF=on (which is the default) the WAL increase is not significant. Also, for HSF=off it should be possible to radically reduce the number of additional WAL records by reusing the `latestRemovedXid` from other records (like Heap2/CLEAN) in the "send only newer xid" optimization (I have skipped that for now for simplicity).

[CONCLUSION]

The only things we pay are a few additional WAL records and some moderate additional code complexity. In exchange, support for hint bits on standby is a huge advantage for many workloads: I was able to get more than a 900% performance boost (which is not surprising - index hint bits are simply a great optimization). And it works for almost all index types out of the box.

Another major point: everything is based on old, well-tested mechanics - query cancelation because of snapshot conflicts, setting heap hint bits on standby, and hot standby feedback.

[REFERENCES]

[1] - https://www.postgresql.org/message-id/flat/7067.1529246768%40sss.pgh.pa.us#d9e2e570ba34fc96c4300a362cbe8c38
[2] - https://www.postgresql.org/message-id/flat/12843.1529331619%40sss.pgh.pa.us#6df9694fdfd5d550fbb38e711d162be8
[3] - https://www.postgresql.org/message-id/flat/20170428133818.24368.33533%40wrigleys.postgresql.org
[4] - https://www.postgresql.org/message-id/flat/CANtu0ohOvgteBYmCMc2KERFiJUvpWGB0bRTbK_WseQH-L1jkrQ%40mail.gmail.com
[5] - https://www.postgresql.org/message-id/flat/CANtu0oigC0%2BH0UkxktyovdLLU67ikM0%2BDw3J4EQqiDDeGhcwsQ%40mail.gmail.com
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c index c8f7e781c6..625c8ddde3 100644 --- a/src/backend/access/gist/gistget.c +++ b/src/backend/access/gist/gistget.c @@ -17,6 +17,7 @@ #include "access/genam.h" #include "access/gist_private.h" #include "access/relscan.h" +#include "access/heapam_xlog.h" #include "lib/pairingheap.h" #include "miscadmin.h" #include "pgstat.h" @@ -87,7 +88,9 @@ gistkillitems(IndexScanDesc scan) if (killedsomething) { GistMarkPageHasGarbage(page); - MarkBufferDirtyHint(buffer, true); + MarkBufferDirtyIndexHint(buffer, true, + scan->indexRelation, + so->killedLatestRemovedXid); } UnlockReleaseBuffer(buffer); @@ -666,8 +669,12 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir) MemoryContextSwitchTo(oldCxt); } if (so->numKilled < MaxIndexTuplesPerPage) + { so->killedItems[so->numKilled++] = so->pageData[so->curPageData - 1].offnum; + IndexHintBitAdvanceLatestRemovedXid(scan->prior_tuple_removed_xid, + &so->killedLatestRemovedXid); + } } /* continuing to return tuples from a leaf page */ scan->xs_heaptid = so->pageData[so->curPageData].heapPtr; @@ -703,8 +710,12 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir) MemoryContextSwitchTo(oldCxt); } if (so->numKilled < MaxIndexTuplesPerPage) + { so->killedItems[so->numKilled++] = so->pageData[so->curPageData - 1].offnum; + IndexHintBitAdvanceLatestRemovedXid(scan->prior_tuple_removed_xid, + &so->killedLatestRemovedXid); + } } /* find and process the next index page */ do diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c index 61e92cf0f5..b959ac5f17 100644 --- a/src/backend/access/gist/gistscan.c +++ b/src/backend/access/gist/gistscan.c @@ -107,6 +107,7 @@ gistbeginscan(Relation r, int nkeys, int norderbys) } so->killedItems = NULL; /* until needed */ + so->killedLatestRemovedXid = InvalidTransactionId; so->numKilled = 0; so->curBlkno = InvalidBlockNumber; so->curPageLSN = InvalidXLogRecPtr; diff --git 
a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c index 0752fb38a9..c0e9fbac89 100644 --- a/src/backend/access/hash/hash.c +++ b/src/backend/access/hash/hash.c @@ -311,7 +311,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir) palloc(MaxIndexTuplesPerPage * sizeof(int)); if (so->numKilled < MaxIndexTuplesPerPage) + { so->killedItems[so->numKilled++] = so->currPos.itemIndex; + IndexHintBitAdvanceLatestRemovedXid(scan->prior_tuple_removed_xid, + &so->killedLatestRemovedXid); + } } /* @@ -379,6 +383,7 @@ hashbeginscan(Relation rel, int nkeys, int norderbys) so->hashso_buc_split = false; so->killedItems = NULL; + so->killedLatestRemovedXid = InvalidTransactionId; so->numKilled = 0; scan->opaque = so; diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c index 519872850e..5617144221 100644 --- a/src/backend/access/hash/hashutil.c +++ b/src/backend/access/hash/hashutil.c @@ -611,7 +611,9 @@ _hash_kill_items(IndexScanDesc scan) if (killedsomething) { opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES; - MarkBufferDirtyHint(buf, true); + MarkBufferDirtyIndexHint(buf, true, + scan->indexRelation, + so->killedLatestRemovedXid); } if (so->hashso_bucket_buf == so->currPos.buf || diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index faffbb1865..835ff8a2c0 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -1517,7 +1517,8 @@ heap_fetch(Relation relation, * * If all_dead is not NULL, we check non-visible tuples to see if they are * globally dead; *all_dead is set true if all members of the HOT chain - * are vacuumable, false if not. + * are vacuumable, false if not. Also, latest_removed_xid is set to the + * latest removed xid in a HOT chain. * * Unlike heap_fetch, the caller must already have pin and (at least) share * lock on the buffer; it is still pinned/locked at exit. 
Also unlike @@ -1526,7 +1527,7 @@ heap_fetch(Relation relation, bool heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer, Snapshot snapshot, HeapTuple heapTuple, - bool *all_dead, bool first_call) + bool *all_dead, TransactionId *latest_removed_xid, bool first_call) { Page dp = (Page) BufferGetPage(buffer); TransactionId prev_xmax = InvalidTransactionId; @@ -1537,9 +1538,12 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer, bool skip; GlobalVisState *vistest = NULL; + Assert((!all_dead) || (all_dead && latest_removed_xid)); /* If this is not the first call, previous call returned a (live!) tuple */ if (all_dead) *all_dead = first_call; + if (latest_removed_xid) + *latest_removed_xid = InvalidTransactionId; blkno = ItemPointerGetBlockNumber(tid); offnum = ItemPointerGetOffsetNumber(tid); @@ -1622,7 +1626,10 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer, PredicateLockTID(relation, &heapTuple->t_self, snapshot, HeapTupleHeaderGetXmin(heapTuple->t_data)); if (all_dead) + { *all_dead = false; + *latest_removed_xid = InvalidTransactionId; + } return true; } } @@ -1642,7 +1649,12 @@ heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer, vistest = GlobalVisTestFor(relation); if (!HeapTupleIsSurelyDead(heapTuple, vistest)) + { *all_dead = false; + *latest_removed_xid = InvalidTransactionId; + } + else + HeapTupleHeaderAdvanceLatestRemovedXid(heapTuple->t_data, latest_removed_xid); } /* @@ -7005,6 +7017,20 @@ HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple, /* *latestRemovedXid may still be invalid at end */ } +void +IndexHintBitAdvanceLatestRemovedXid(TransactionId killedTupleRemovedXid, + TransactionId *latestRemovedXid) +{ + if (TransactionIdIsNormal(killedTupleRemovedXid)) + { + if (!TransactionIdIsValid(*latestRemovedXid)) + *latestRemovedXid = killedTupleRemovedXid; + else + *latestRemovedXid = + TransactionIdLatest(killedTupleRemovedXid, 1, latestRemovedXid); + 
} +} + #ifdef USE_PREFETCH /* * Helper function for heap_index_delete_tuples. Issues prefetch requests for @@ -7254,7 +7280,7 @@ heap_index_delete_tuples(Relation rel, TM_IndexDeleteOp *delstate) /* Are any tuples from this HOT chain non-vacuumable? */ if (heap_hot_search_buffer(&tmp, rel, buf, &SnapshotNonVacuumable, - &heapTuple, NULL, true)) + &heapTuple, NULL, NULL, true)) continue; /* can't delete entry */ /* Caller will delete, since whole HOT chain is vacuumable */ diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c index 4a70e20a14..762e3cfd8c 100644 --- a/src/backend/access/heap/heapam_handler.c +++ b/src/backend/access/heap/heapam_handler.c @@ -113,7 +113,8 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan, ItemPointer tid, Snapshot snapshot, TupleTableSlot *slot, - bool *call_again, bool *all_dead) + bool *call_again, bool *all_dead, + TransactionId *latest_removed_xid) { IndexFetchHeapData *hscan = (IndexFetchHeapData *) scan; BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot; @@ -146,6 +147,7 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan, snapshot, &bslot->base.tupdata, all_dead, + latest_removed_xid, !*call_again); bslot->base.tupdata.t_self = *tid; LockBuffer(hscan->xs_cbuf, BUFFER_LOCK_UNLOCK); @@ -2173,7 +2175,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan, ItemPointerSet(&tid, page, offnum); if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot, - &heapTuple, NULL, true)) + &heapTuple, NULL, NULL, true)) hscan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid); } } diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c index c911c705ba..e4c1cd114c 100644 --- a/src/backend/access/index/genam.c +++ b/src/backend/access/index/genam.c @@ -20,7 +20,6 @@ #include "postgres.h" #include "access/genam.h" -#include "access/heapam.h" #include "access/relscan.h" #include "access/tableam.h" #include "access/transam.h" @@ 
-28,6 +27,7 @@ #include "lib/stringinfo.h" #include "miscadmin.h" #include "storage/bufmgr.h" +#include "storage/proc.h" #include "storage/procarray.h" #include "utils/acl.h" #include "utils/builtins.h" @@ -106,18 +106,16 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys) scan->xs_want_itup = false; /* may be set later */ /* - * During recovery we ignore killed tuples and don't bother to kill them - * either. We do this because the xmin on the primary node could easily be - * later than the xmin on the standby node, so that what the primary - * thinks is killed is supposed to be visible on standby. So for correct - * MVCC for queries during recovery we must ignore these hints and check - * all tuples. Do *not* set ignore_killed_tuples to true when running in a - * transaction that was started during recovery. xactStartedInRecovery - * should not be altered by index AMs. - */ + * For correct MVCC for queries during recovery, we could use + * index hint bits as on the primary. But to avoid frequent query + * cancellation we do it only if hot_standby_feedback is active and + * our xmin is honored on the primary. + * + * The decision is made in GetSnapshotIndexIgnoreKilledTuples. 
+ */ scan->kill_prior_tuple = false; - scan->xactStartedInRecovery = TransactionStartedDuringRecovery(); - scan->ignore_killed_tuples = !scan->xactStartedInRecovery; + scan->prior_tuple_removed_xid = InvalidTransactionId; + scan->ignore_killed_tuples = MyProc->indexIgnoreKilledTuples; scan->opaque = NULL; diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c index 3d2dbed708..bc5c11def3 100644 --- a/src/backend/access/index/indexam.c +++ b/src/backend/access/index/indexam.c @@ -309,6 +309,7 @@ index_rescan(IndexScanDesc scan, table_index_fetch_reset(scan->xs_heapfetch); scan->kill_prior_tuple = false; /* for safety */ + scan->prior_tuple_removed_xid = InvalidTransactionId; scan->xs_heap_continue = false; scan->indexRelation->rd_indam->amrescan(scan, keys, nkeys, @@ -386,6 +387,7 @@ index_restrpos(IndexScanDesc scan) table_index_fetch_reset(scan->xs_heapfetch); scan->kill_prior_tuple = false; /* for safety */ + scan->prior_tuple_removed_xid = InvalidTransactionId; scan->xs_heap_continue = false; scan->indexRelation->rd_indam->amrestrpos(scan); @@ -534,6 +536,7 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction) /* Reset kill flag immediately for safety */ scan->kill_prior_tuple = false; + scan->prior_tuple_removed_xid = InvalidTransactionId; scan->xs_heap_continue = false; /* If we're out of index entries, we're done */ @@ -574,12 +577,14 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction) bool index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot) { - bool all_dead = false; - bool found; + bool all_dead = false; + TransactionId latest_removed_xid = InvalidTransactionId; + bool found; found = table_index_fetch_tuple(scan->xs_heapfetch, &scan->xs_heaptid, scan->xs_snapshot, slot, - &scan->xs_heap_continue, &all_dead); + &scan->xs_heap_continue, + &all_dead, &latest_removed_xid); if (found) pgstat_count_heap_fetch(scan->indexRelation); @@ -587,13 +592,13 @@ index_fetch_heap(IndexScanDesc scan, 
TupleTableSlot *slot) /* * If we scanned a whole HOT chain and found only dead tuples, tell index * AM to kill its entry for that TID (this will take effect in the next - * amgettuple call, in index_getnext_tid). We do not do this when in - * recovery because it may violate MVCC to do so. See comments in - * RelationGetIndexScan(). + * amgettuple call, in index_getnext_tid). */ - if (!scan->xactStartedInRecovery) + if (scan->ignore_killed_tuples) + { scan->kill_prior_tuple = all_dead; - + scan->prior_tuple_removed_xid = latest_removed_xid; + } return found; } @@ -667,6 +672,7 @@ index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap) /* just make sure this is false... */ scan->kill_prior_tuple = false; + scan->prior_tuple_removed_xid = InvalidTransactionId; /* * have the am's getbitmap proc do all the work. diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c index e333603912..5282831c40 100644 --- a/src/backend/access/nbtree/nbtinsert.c +++ b/src/backend/access/nbtree/nbtinsert.c @@ -502,7 +502,8 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel, if (inposting || !ItemIdIsDead(curitemid)) { ItemPointerData htid; - bool all_dead = false; + bool all_dead = false; + TransactionId latest_removed_xid = InvalidTransactionId; if (!inposting) { @@ -556,7 +557,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel, */ else if (table_index_fetch_tuple_check(heapRel, &htid, &SnapshotDirty, - &all_dead)) + &all_dead, &latest_removed_xid)) { TransactionId xwait; @@ -613,7 +614,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel, */ htid = itup->t_tid; if (table_index_fetch_tuple_check(heapRel, &htid, - SnapshotSelf, NULL)) + SnapshotSelf, NULL, NULL)) { /* Normal case --- it's still live */ } @@ -687,9 +688,9 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel, * crucial. Be sure to mark the proper buffer dirty. 
*/ if (nbuf != InvalidBuffer) - MarkBufferDirtyHint(nbuf, true); + MarkBufferDirtyIndexHint(nbuf, true, rel, latest_removed_xid); else - MarkBufferDirtyHint(insertstate->buf, true); + MarkBufferDirtyIndexHint(insertstate->buf, true, rel, latest_removed_xid); } /* diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index 289bd3c15d..c35a34003b 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -21,7 +21,7 @@ #include "access/nbtree.h" #include "access/nbtxlog.h" #include "access/relscan.h" -#include "access/xlog.h" +#include "access/heapam_xlog.h" #include "commands/progress.h" #include "commands/vacuum.h" #include "miscadmin.h" @@ -272,7 +272,11 @@ btgettuple(IndexScanDesc scan, ScanDirection dir) so->killedItems = (int *) palloc(MaxTIDsPerBTreePage * sizeof(int)); if (so->numKilled < MaxTIDsPerBTreePage) + { so->killedItems[so->numKilled++] = so->currPos.itemIndex; + IndexHintBitAdvanceLatestRemovedXid(scan->prior_tuple_removed_xid, + &so->killedLatestRemovedXid); + } } /* @@ -378,6 +382,7 @@ btbeginscan(Relation rel, int nkeys, int norderbys) so->arrayContext = NULL; so->killedItems = NULL; /* until needed */ + so->killedLatestRemovedXid = InvalidTransactionId; so->numKilled = 0; /* diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c index d524310723..0ce93551ac 100644 --- a/src/backend/access/nbtree/nbtutils.c +++ b/src/backend/access/nbtree/nbtutils.c @@ -1883,7 +1883,9 @@ _bt_killitems(IndexScanDesc scan) if (killedsomething) { opaque->btpo_flags |= BTP_HAS_GARBAGE; - MarkBufferDirtyHint(so->currPos.buf, true); + MarkBufferDirtyIndexHint(so->currPos.buf, true, + scan->indexRelation, + so->killedLatestRemovedXid); } _bt_unlockbuf(scan->indexRelation, so->currPos.buf); diff --git a/src/backend/access/rmgrdesc/standbydesc.c b/src/backend/access/rmgrdesc/standbydesc.c index 01ee7ac6d2..32ebd730f1 100644 --- a/src/backend/access/rmgrdesc/standbydesc.c +++ 
b/src/backend/access/rmgrdesc/standbydesc.c @@ -36,6 +36,16 @@ standby_desc_running_xacts(StringInfo buf, xl_running_xacts *xlrec) appendStringInfoString(buf, "; subxid ovf"); } +static void +standby_desc_index_hint_bits_horizon(StringInfo buf, + xl_index_hint_bits_horizon *xlrec) +{ + char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM); + + appendStringInfo(buf, "latestRemovedXid %u in %s", + xlrec->latestRemovedXid, path); +} + void standby_desc(StringInfo buf, XLogReaderState *record) { @@ -66,6 +76,12 @@ standby_desc(StringInfo buf, XLogReaderState *record) xlrec->dbId, xlrec->tsId, xlrec->relcacheInitFileInval); } + else if (info == XLOG_INDEX_HINT_BITS_HORIZON) + { + xl_index_hint_bits_horizon *xlrec = (xl_index_hint_bits_horizon *) rec; + + standby_desc_index_hint_bits_horizon(buf, xlrec); + } } const char * @@ -84,6 +100,9 @@ standby_identify(uint8 info) case XLOG_INVALIDATIONS: id = "INVALIDATIONS"; break; + case XLOG_INDEX_HINT_BITS_HORIZON: + id = "INDEX_HINT_BITS_HORIZON"; + break; } return id; diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c index 5ea5bdd810..be02cad7e5 100644 --- a/src/backend/access/table/tableam.c +++ b/src/backend/access/table/tableam.c @@ -219,7 +219,8 @@ bool table_index_fetch_tuple_check(Relation rel, ItemPointer tid, Snapshot snapshot, - bool *all_dead) + bool *all_dead, + TransactionId *latest_removed_xid) { IndexFetchTableData *scan; TupleTableSlot *slot; @@ -229,7 +230,7 @@ table_index_fetch_tuple_check(Relation rel, slot = table_slot_create(rel, NULL); scan = table_index_fetch_begin(rel); found = table_index_fetch_tuple(scan, tid, snapshot, slot, &call_again, - all_dead); + all_dead, latest_removed_xid); table_index_fetch_end(scan); ExecDropSingleTupleTableSlot(slot); diff --git a/src/backend/commands/constraint.c b/src/backend/commands/constraint.c index d0063164a7..9e72af54ba 100644 --- a/src/backend/commands/constraint.c +++ b/src/backend/commands/constraint.c @@ -112,7 +112,7 @@ 
unique_key_recheck(PG_FUNCTION_ARGS) bool call_again = false; if (!table_index_fetch_tuple(scan, &tmptid, SnapshotSelf, slot, - &call_again, NULL)) + &call_again, NULL, NULL)) { /* * All rows referenced by the index entry are dead, so skip the diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index f75b52719d..f4c1f830d6 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -4104,6 +4104,9 @@ pgstat_get_wait_ipc(WaitEventIPC w) case WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT: event_name = "RecoveryConflictSnapshot"; break; + case WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT_INDEX_HINT_BITS: + event_name = "RecoveryConflictSnapshotIndexHintBits"; + break; case WAIT_EVENT_RECOVERY_CONFLICT_TABLESPACE: event_name = "RecoveryConflictTablespace"; break; diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c index afa1df00d0..8289f055b0 100644 --- a/src/backend/replication/logical/decode.c +++ b/src/backend/replication/logical/decode.c @@ -411,6 +411,8 @@ DecodeStandbyOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) * XLOG_XACT_INVALIDATIONS. So we don't need to do anything here. 
*/ break; + case XLOG_INDEX_HINT_BITS_HORIZON: + break; default: elog(ERROR, "unexpected RM_STANDBY_ID record type: %u", info); } diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c index 71d510e305..07681fdc6d 100644 --- a/src/backend/replication/logical/snapbuild.c +++ b/src/backend/replication/logical/snapbuild.c @@ -577,6 +577,8 @@ SnapBuildInitialSnapshot(SnapBuild *builder) #endif MyProc->xmin = snap->xmin; + // to keep it simple use index hint bits on the primary only + MyProc->indexIgnoreKilledTuples = !RecoveryInProgress(); /* allocate in transaction context */ newxip = (TransactionId *) diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index 723f513d8b..1199dffc45 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -225,6 +225,9 @@ WalReceiverMain(void) /* Advertise our PID so that the startup process can kill us */ walrcv->pid = MyProcPid; walrcv->walRcvState = WALRCV_STREAMING; + /* Initially true so we always send at least one feedback message */ + walrcv->sender_has_standby_xmin = true; + walrcv->sender_propagates_feedback_to_primary = false; /* Fetch information required to start streaming */ walrcv->ready_to_display = false; @@ -806,6 +809,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len) XLogRecPtr walEnd; TimestampTz sendTime; bool replyRequested; + bool senderPropagatesFeedbackToPrimary; resetStringInfo(&incoming_message); @@ -835,7 +839,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len) case 'k': /* Keepalive */ { /* copy message to StringInfo */ - hdrlen = sizeof(int64) + sizeof(int64) + sizeof(char); + hdrlen = sizeof(int64) + sizeof(int64) + sizeof(char) + sizeof(char); if (len != hdrlen) ereport(ERROR, (errcode(ERRCODE_PROTOCOL_VIOLATION), @@ -846,8 +850,10 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len) walEnd = pq_getmsgint64(&incoming_message); sendTime 
= pq_getmsgint64(&incoming_message); replyRequested = pq_getmsgbyte(&incoming_message); + senderPropagatesFeedbackToPrimary = pq_getmsgbyte(&incoming_message); ProcessWalSndrMessage(walEnd, sendTime); + WalRcv->sender_propagates_feedback_to_primary = senderPropagatesFeedbackToPrimary; /* If the primary requested a reply, send one immediately */ if (replyRequested) @@ -1110,15 +1116,13 @@ XLogWalRcvSendHSFeedback(bool immed) catalog_xmin; static TimestampTz sendTime = 0; - /* initially true so we always send at least one feedback message */ - static bool primary_has_standby_xmin = true; /* * If the user doesn't want status to be reported to the primary, be sure * to exit before doing anything at all. */ if ((wal_receiver_status_interval <= 0 || !hot_standby_feedback) && - !primary_has_standby_xmin) + !WalRcv->sender_has_standby_xmin) return; /* Get current timestamp. */ @@ -1188,9 +1192,9 @@ XLogWalRcvSendHSFeedback(bool immed) pq_sendint32(&reply_message, catalog_xmin_epoch); walrcv_send(wrconn, reply_message.data, reply_message.len); if (TransactionIdIsValid(xmin) || TransactionIdIsValid(catalog_xmin)) - primary_has_standby_xmin = true; + WalRcv->sender_has_standby_xmin = true; else - primary_has_standby_xmin = false; + WalRcv->sender_has_standby_xmin = false; } /* diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 8545c6c423..71fe08e1ab 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -2169,6 +2169,12 @@ ProcessStandbyHSFeedbackMessage(void) else MyProc->xmin = feedbackXmin; } + + /* + * Always send keep-alive after feedback to allow standby to maintain + * WalRcv->sender_propagates_feedback_to_primary. 
+ */ + WalSndKeepalive(false); } /* @@ -3450,7 +3456,10 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS) static void WalSndKeepalive(bool requestReply) { + bool am_propagating_feedback_to_primary; elog(DEBUG2, "sending replication keepalive"); + am_propagating_feedback_to_primary = !am_cascading_walsender + || (WalRcv->sender_has_standby_xmin && WalRcv->sender_propagates_feedback_to_primary); /* construct the message... */ resetStringInfo(&output_message); @@ -3458,6 +3467,7 @@ WalSndKeepalive(bool requestReply) pq_sendint64(&output_message, sentPtr); pq_sendint64(&output_message, GetCurrentTimestamp()); pq_sendbyte(&output_message, requestReply ? 1 : 0); + pq_sendbyte(&output_message, am_propagating_feedback_to_primary ? 1 : 0); /* ... and send it wrapped in CopyData */ pq_putmessage_noblock('d', output_message.data, output_message.len); diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 561c212092..1e7c1797c1 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -3898,6 +3898,22 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std) } } +/* + * MarkBufferDirtyIndexHint + * + * This is essentially the same as MarkBufferDirtyHint, except it WAL log + * new value for index hint bits horizon if required. + * + * Should be used instead of MarkBufferDirtyHint for LP_DEAD hints in indexes. + */ +void +MarkBufferDirtyIndexHint(Buffer buffer, bool buffer_std, + Relation rel, TransactionId latestRemovedXid) +{ + LogIndexHintBitsHorizonIfNeeded(rel, latestRemovedXid); + MarkBufferDirtyHint(buffer, buffer_std); +} + /* * Release buffer content locks for shared buffers. 
* diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c index f9bbe97b50..ce1b8f628a 100644 --- a/src/backend/storage/ipc/ipci.c +++ b/src/backend/storage/ipc/ipci.c @@ -267,6 +267,7 @@ CreateSharedMemoryAndSemaphores(void) BTreeShmemInit(); SyncScanShmemInit(); AsyncShmemInit(); + StandByShmemInit(); #ifdef EXEC_BACKEND diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c index cf12eda504..07863dbb2d 100644 --- a/src/backend/storage/ipc/procarray.c +++ b/src/backend/storage/ipc/procarray.c @@ -65,8 +65,10 @@ #include "utils/builtins.h" #include "utils/rel.h" #include "utils/snapmgr.h" +#include "replication/walreceiver.h" #define UINT32_ACCESS_ONCE(var) ((uint32)(*((volatile uint32 *)&(var)))) +#define BOOL_ACCESS_ONCE(var) ((bool)(*((volatile bool *)&(var)))) /* Our shared memory area */ typedef struct ProcArrayStruct @@ -655,6 +657,7 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid) proc->lxid = InvalidLocalTransactionId; proc->xmin = InvalidTransactionId; + proc->indexIgnoreKilledTuples = false; proc->delayChkpt = false; /* be sure this is cleared in abort */ proc->recoveryConflictPending = false; @@ -694,6 +697,7 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid) proc->xid = InvalidTransactionId; proc->lxid = InvalidLocalTransactionId; proc->xmin = InvalidTransactionId; + proc->indexIgnoreKilledTuples = false; proc->delayChkpt = false; /* be sure this is cleared in abort */ proc->recoveryConflictPending = false; @@ -877,6 +881,7 @@ ProcArrayClearTransaction(PGPROC *proc) proc->lxid = InvalidLocalTransactionId; proc->xmin = InvalidTransactionId; + proc->indexIgnoreKilledTuples = false; proc->recoveryConflictPending = false; Assert(!(proc->statusFlags & PROC_VACUUM_STATE_MASK)); @@ -2013,6 +2018,23 @@ GetSnapshotDataInitOldSnapshot(Snapshot snapshot) } } +static bool +GetSnapshotIndexIgnoreKilledTuples(Snapshot snapshot) +{ + /* + * Always use and set LP_DEAD bits on 
primary. On standby, do so + * only if hot_standby_feedback is enabled, the walsender has our xmin, + * and the walsender propagates feedback up to the primary (to avoid + * unnecessary cancellations). + * + * It is always safe to set it to true, but that could cause a high + * rate of conflicts. + */ + Assert(!RecoveryInProgress() || WalRcv); + return !snapshot->takenDuringRecovery || + (WalRcv->sender_propagates_feedback_to_primary && WalRcv->sender_has_standby_xmin); + } + /* * Helper function for GetSnapshotData() that checks if the bulk of the * visibility information in the snapshot is still valid. If so, it updates @@ -2057,7 +2079,10 @@ GetSnapshotDataReuse(Snapshot snapshot) * xmin. */ if (!TransactionIdIsValid(MyProc->xmin)) + { MyProc->xmin = TransactionXmin = snapshot->xmin; + MyProc->indexIgnoreKilledTuples = GetSnapshotIndexIgnoreKilledTuples(snapshot); + } RecentXmin = snapshot->xmin; Assert(TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)); @@ -2345,7 +2370,10 @@ GetSnapshotData(Snapshot snapshot) replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin; if (!TransactionIdIsValid(MyProc->xmin)) + { MyProc->xmin = TransactionXmin = xmin; + MyProc->indexIgnoreKilledTuples = GetSnapshotIndexIgnoreKilledTuples(snapshot); + } LWLockRelease(ProcArrayLock); @@ -2524,6 +2552,7 @@ ProcArrayInstallImportedXmin(TransactionId xmin, * we don't check that.) */ MyProc->xmin = TransactionXmin = xmin; + /* No need to change indexIgnoreKilledTuples because the restriction is relaxed. */
result = true; break; @@ -2567,6 +2596,8 @@ ProcArrayInstallRestoredXmin(TransactionId xmin, PGPROC *proc) TransactionIdPrecedesOrEquals(xid, xmin)) { MyProc->xmin = TransactionXmin = xmin; + /* Also copy indexIgnoreKilledTuples; this can be useful for parallel scans. */ + MyProc->indexIgnoreKilledTuples = proc->indexIgnoreKilledTuples; result = true; } @@ -3245,11 +3276,15 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0, * * If dbOid is valid we skip backends attached to other databases. * + * If onlyIndexIgnoreKilledTuples is true we include only backends + * with indexIgnoreKilledTuples set. + * * Be careful to *not* pfree the result from this function. We reuse * this array sufficiently often that we use malloc for the result. */ VirtualTransactionId * -GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid) +GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid, + bool onlyIndexIgnoreKilledTuples) { static VirtualTransactionId *vxids; ProcArrayStruct *arrayP = procArray; @@ -3287,6 +3322,8 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid) { /* Fetch xmin just once - can't change on us, but good coding */ TransactionId pxmin = UINT32_ACCESS_ONCE(proc->xmin); + bool indexIgnoreKilledTuples = + BOOL_ACCESS_ONCE(proc->indexIgnoreKilledTuples); /* * We ignore an invalid pxmin because this means that backend has @@ -3297,7 +3334,8 @@ GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid) * test here.
*/ if (!TransactionIdIsValid(limitXmin) || - (TransactionIdIsValid(pxmin) && !TransactionIdFollows(pxmin, limitXmin))) + (TransactionIdIsValid(pxmin) && !TransactionIdFollows(pxmin, limitXmin) && + (!onlyIndexIgnoreKilledTuples || indexIgnoreKilledTuples))) { VirtualTransactionId vxid; diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c index 39a30c00f7..3cffd64161 100644 --- a/src/backend/storage/ipc/standby.c +++ b/src/backend/storage/ipc/standby.c @@ -42,6 +42,7 @@ int max_standby_streaming_delay = 30 * 1000; bool log_recovery_conflict_waits = false; static HTAB *RecoveryLockLists; +static HTAB *IndexHintBitsHorizons; /* Flags set by timeout handlers */ static volatile sig_atomic_t got_standby_deadlock_timeout = false; @@ -65,6 +66,12 @@ typedef struct RecoveryLockListsEntry List *locks; } RecoveryLockListsEntry; +typedef struct IndexHintBitsHorizonsEntry +{ + Oid dbOid; + TransactionId hintHorizonXid; +} IndexHintBitsHorizonsEntry; + /* * InitRecoveryTransactionEnvironment * Initialize tracking of our primary's in-progress transactions. 
@@ -425,7 +432,8 @@ ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist, } void -ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode node) +ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, + RelFileNode node) { VirtualTransactionId *backends; @@ -444,7 +452,7 @@ ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode return; backends = GetConflictingVirtualXIDs(latestRemovedXid, - node.dbNode); + node.dbNode, false); ResolveRecoveryConflictWithVirtualXIDs(backends, PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, @@ -452,6 +460,22 @@ ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode true); } +void +ResolveIndexHintBitsRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, + RelFileNode node) +{ + VirtualTransactionId *backends; + + backends = GetConflictingVirtualXIDs(latestRemovedXid, + node.dbNode, true); + + ResolveRecoveryConflictWithVirtualXIDs( + backends, + PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, + WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT_INDEX_HINT_BITS, + true); +} + void ResolveRecoveryConflictWithTablespace(Oid tsid) { @@ -475,7 +499,7 @@ ResolveRecoveryConflictWithTablespace(Oid tsid) * We don't wait for commit because drop tablespace is non-transactional. 
*/ temp_file_users = GetConflictingVirtualXIDs(InvalidTransactionId, - InvalidOid); + InvalidOid, false); ResolveRecoveryConflictWithVirtualXIDs(temp_file_users, PROCSIG_RECOVERY_CONFLICT_TABLESPACE, WAIT_EVENT_RECOVERY_CONFLICT_TABLESPACE, @@ -1026,6 +1050,43 @@ StandbyReleaseOldLocks(TransactionId oldxid) } } +static bool +IsNewerIndexHintBitsHorizonXid(Oid dbOid, TransactionId latestRemovedXid) +{ + bool found, result; + IndexHintBitsHorizonsEntry* entry; + Assert(TransactionIdIsNormal(latestRemovedXid)); + + LWLockAcquire(IndexHintBitsHorizonShmemLock, LW_SHARED); + entry = (IndexHintBitsHorizonsEntry *) hash_search(IndexHintBitsHorizons, &dbOid, + HASH_FIND, &found); + + result = !found || TransactionIdPrecedes(entry->hintHorizonXid, latestRemovedXid); + LWLockRelease(IndexHintBitsHorizonShmemLock); + + return result; +} + +static void +UpsertLatestIndexHintBitsHorizonXid(Oid dbOid, TransactionId latestRemovedXid) +{ + + bool found; + IndexHintBitsHorizonsEntry* entry; + Assert(TransactionIdIsNormal(latestRemovedXid)); + + LWLockAcquire(IndexHintBitsHorizonShmemLock, LW_EXCLUSIVE); + + entry = (IndexHintBitsHorizonsEntry *) hash_search(IndexHintBitsHorizons, &dbOid, + HASH_ENTER, &found); + + if (!found || TransactionIdPrecedes(entry->hintHorizonXid, latestRemovedXid)) + entry->hintHorizonXid = latestRemovedXid; + + LWLockRelease(IndexHintBitsHorizonShmemLock); +} + + /* * -------------------------------------------------------------------- * Recovery handling for Rmgr RM_STANDBY_ID @@ -1081,6 +1142,16 @@ standby_redo(XLogReaderState *record) xlrec->dbId, xlrec->tsId); } + else if (info == XLOG_INDEX_HINT_BITS_HORIZON) { + if (InHotStandby) { + xl_index_hint_bits_horizon *xlrec = + (xl_index_hint_bits_horizon *) XLogRecGetData(record); + + ResolveIndexHintBitsRecoveryConflictWithSnapshot( + xlrec->latestRemovedXid, + xlrec->rnode); + } + } else elog(PANIC, "standby_redo: unknown op code %u", info); } @@ -1381,3 +1452,49 @@ 
get_recovery_conflict_desc(ProcSignalReason reason) return reasonDesc; } + +static void +LogIndexHintBitsHorizon(RelFileNode rnode, TransactionId latestRemovedXid) +{ + xl_index_hint_bits_horizon xlrec; + + xlrec.rnode = rnode; + xlrec.latestRemovedXid = latestRemovedXid; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xl_index_hint_bits_horizon)); + + XLogInsert(RM_STANDBY_ID, XLOG_INDEX_HINT_BITS_HORIZON); +} + +void +LogIndexHintBitsHorizonIfNeeded(Relation rel, TransactionId latestRemovedXid) +{ + if (!RecoveryInProgress() && XLogStandbyInfoActive() && + TransactionIdIsNormal(latestRemovedXid) && RelationNeedsWAL(rel)) { + if (IsNewerIndexHintBitsHorizonXid(rel->rd_node.dbNode, latestRemovedXid)) + { + LogIndexHintBitsHorizon(rel->rd_node, latestRemovedXid); + UpsertLatestIndexHintBitsHorizonXid(rel->rd_node.dbNode, + latestRemovedXid); + } + } +} + +void +StandByShmemInit(void) +{ + HASHCTL info; + + MemSet(&info, 0, sizeof(info)); + info.keysize = sizeof(Oid); + info.entrysize = sizeof(IndexHintBitsHorizonsEntry); + + LWLockAcquire(IndexHintBitsHorizonShmemLock, LW_EXCLUSIVE); + + IndexHintBitsHorizons = ShmemInitHash("IndexHintBitsHorizons", + 64, 64, + &info, HASH_ELEM | HASH_BLOBS); + + LWLockRelease(IndexHintBitsHorizonShmemLock); +} diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt index 774292fd94..e66f8fbb8b 100644 --- a/src/backend/storage/lmgr/lwlocknames.txt +++ b/src/backend/storage/lmgr/lwlocknames.txt @@ -53,3 +53,4 @@ XactTruncationLock 44 # 45 was XactTruncationLock until removal of BackendRandomLock WrapLimitsVacuumLock 46 NotifyQueueTailLock 47 +IndexHintBitsHorizonShmemLock 48 diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c index c87ffc6549..2da7eb69da 100644 --- a/src/backend/storage/lmgr/proc.c +++ b/src/backend/storage/lmgr/proc.c @@ -386,6 +386,7 @@ InitProcess(void) MyProc->fpLocalTransactionId = InvalidLocalTransactionId; MyProc->xid = 
InvalidTransactionId; MyProc->xmin = InvalidTransactionId; + MyProc->indexIgnoreKilledTuples = false; MyProc->pid = MyProcPid; /* backendId, databaseId and roleId will be filled in later */ MyProc->backendId = InvalidBackendId; @@ -569,6 +570,7 @@ InitAuxiliaryProcess(void) MyProc->fpLocalTransactionId = InvalidLocalTransactionId; MyProc->xid = InvalidTransactionId; MyProc->xmin = InvalidTransactionId; + MyProc->indexIgnoreKilledTuples = false; MyProc->backendId = InvalidBackendId; MyProc->databaseId = InvalidOid; MyProc->roleId = InvalidOid; diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c index ae16c3ed7d..bed98d6436 100644 --- a/src/backend/utils/time/snapmgr.c +++ b/src/backend/utils/time/snapmgr.c @@ -528,6 +528,10 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid, * the state for GlobalVis*. */ CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData); + /* To keep it simple, use index hint bits only on the primary for imported + * snapshots. + */ + MyProc->indexIgnoreKilledTuples = !RecoveryInProgress(); /* * Now copy appropriate fields from the source snapshot. 
@@ -932,6 +936,7 @@ SnapshotResetXmin(void) if (pairingheap_is_empty(&RegisteredSnapshots)) { MyProc->xmin = InvalidTransactionId; + MyProc->indexIgnoreKilledTuples = false; return; } @@ -939,6 +944,7 @@ SnapshotResetXmin(void) pairingheap_first(&RegisteredSnapshots)); if (TransactionIdPrecedes(MyProc->xmin, minSnapshot->xmin)) + /* No need to change indexIgnoreKilledTuples here because the xmin restriction is relaxed. */ MyProc->xmin = minSnapshot->xmin; } diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h index 553d364e2d..97c97c13c2 100644 --- a/src/include/access/gist_private.h +++ b/src/include/access/gist_private.h @@ -165,8 +165,9 @@ typedef struct GISTScanOpaqueData IndexOrderByDistance *distances; /* output area for gistindex_keytest */ /* info about killed items if any (killedItems is NULL if never used) */ - OffsetNumber *killedItems; /* offset numbers of killed items */ - int numKilled; /* number of currently stored items */ + OffsetNumber *killedItems; /* offset numbers of killed items */ + TransactionId killedLatestRemovedXid; /* latest removed xid of all killed items */ + int numKilled; /* number of currently stored items */ BlockNumber curBlkno; /* current number of block */ GistNSN curPageLSN; /* pos in the WAL stream when page was read */ diff --git a/src/include/access/hash.h b/src/include/access/hash.h index 1cce865be2..a3fc82192e 100644 --- a/src/include/access/hash.h +++ b/src/include/access/hash.h @@ -177,8 +177,9 @@ typedef struct HashScanOpaqueData */ bool hashso_buc_split; /* info about killed items if any (killedItems is NULL if never used) */ - int *killedItems; /* currPos.items indexes of killed items */ - int numKilled; /* number of currently stored items */ + int *killedItems; /* currPos.items indexes of killed items */ + TransactionId killedLatestRemovedXid; /* latest removed xid of all killed items */ + int numKilled; /* number of currently stored items */ /* * Identify all the matching items on a page and
save them in diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h index d96a47b1ce..a9ed0e0918 100644 --- a/src/include/access/heapam.h +++ b/src/include/access/heapam.h @@ -126,7 +126,8 @@ extern bool heap_fetch(Relation relation, Snapshot snapshot, HeapTuple tuple, Buffer *userbuf); extern bool heap_hot_search_buffer(ItemPointer tid, Relation relation, Buffer buffer, Snapshot snapshot, HeapTuple heapTuple, - bool *all_dead, bool first_call); + bool *all_dead, TransactionId *latest_removed_xid, + bool first_call); extern void heap_get_latest_tid(TableScanDesc scan, ItemPointer tid); diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h index 178d49710a..b49c3b4dc7 100644 --- a/src/include/access/heapam_xlog.h +++ b/src/include/access/heapam_xlog.h @@ -387,6 +387,8 @@ typedef struct xl_heap_rewrite_mapping extern void HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple, TransactionId *latestRemovedXid); +extern void IndexHintBitAdvanceLatestRemovedXid(TransactionId killedTupleRemovedXid, + TransactionId *latestRemovedXid); extern void heap_redo(XLogReaderState *record); extern void heap_desc(StringInfo buf, XLogReaderState *record); diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h index cad4f2bdeb..10257821fa 100644 --- a/src/include/access/nbtree.h +++ b/src/include/access/nbtree.h @@ -920,8 +920,9 @@ typedef struct BTScanOpaqueData MemoryContext arrayContext; /* scan-lifespan context for array data */ /* info about killed items if any (killedItems is NULL if never used) */ - int *killedItems; /* currPos.items indexes of killed items */ - int numKilled; /* number of currently stored items */ + int *killedItems; /* currPos.items indexes of killed items */ + TransactionId killedLatestRemovedXid;/* latest removed xid of all killed items */ + int numKilled; /* number of currently stored items */ /* * If we are doing an index-only scan, these are the tuple storage diff --git 
a/src/include/access/relscan.h b/src/include/access/relscan.h index 005f3fdd2b..7038e7fdae 100644 --- a/src/include/access/relscan.h +++ b/src/include/access/relscan.h @@ -122,10 +122,9 @@ typedef struct IndexScanDescData bool xs_temp_snap; /* unregister snapshot at scan end? */ /* signaling to index AM about killing index tuples */ - bool kill_prior_tuple; /* last-returned tuple is dead */ - bool ignore_killed_tuples; /* do not return killed entries */ - bool xactStartedInRecovery; /* prevents killing/seeing killed - * tuples */ + bool kill_prior_tuple; /* last-returned tuple is dead */ + TransactionId prior_tuple_removed_xid; /* removed xid for the dead tuple */ + bool ignore_killed_tuples; /* do not return killed entries */ /* index access method's private state */ void *opaque; /* access-method-specific info */ diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h index 33bffb6815..2a90040985 100644 --- a/src/include/access/tableam.h +++ b/src/include/access/tableam.h @@ -398,12 +398,17 @@ typedef struct TableAmRoutine * index_fetch_tuple iff it is guaranteed that no backend needs to see * that tuple. Index AMs can use that to avoid returning that tid in * future searches. + * + * *latest_removed_xid, if all_dead is not NULL, will be set to + * the latest removed xid of a HOT chain by table_index_fetch_tuple() + * iff it is guaranteed that no backend needs to see that tuple. */ bool (*index_fetch_tuple) (struct IndexFetchTableData *scan, ItemPointer tid, Snapshot snapshot, TupleTableSlot *slot, - bool *call_again, bool *all_dead); + bool *call_again, bool *all_dead, + TransactionId *latest_removed_xid); /* ------------------------------------------------------------------------ @@ -1112,6 +1117,10 @@ table_index_fetch_end(struct IndexFetchTableData *scan) * that tuple. Index AMs can use that to avoid returning that tid in future * searches.
* + * *latest_removed_xid, if all_dead is not NULL, will be set to the latest removed + * xid of a HOT chain by table_index_fetch_tuple() iff it is guaranteed that no + * backend needs to see that tuple. + * * The difference between this function and table_tuple_fetch_row_version() * is that this function returns the currently visible version of a row if * the AM supports storing multiple row versions reachable via a single index @@ -1124,7 +1133,8 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan, ItemPointer tid, Snapshot snapshot, TupleTableSlot *slot, - bool *call_again, bool *all_dead) + bool *call_again, bool *all_dead, + TransactionId *latest_removed_xid) { /* * We don't expect direct calls to table_index_fetch_tuple with valid @@ -1136,7 +1146,7 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan, return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot, slot, call_again, - all_dead); + all_dead, latest_removed_xid); } /* @@ -1148,7 +1158,8 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan, extern bool table_index_fetch_tuple_check(Relation rel, ItemPointer tid, Snapshot snapshot, - bool *all_dead); + bool *all_dead, + TransactionId *latest_removed_xid); /* ------------------------------------------------------------------------ diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 724068cf87..ac649703cd 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -992,6 +992,7 @@ typedef enum WAIT_EVENT_PROC_SIGNAL_BARRIER, WAIT_EVENT_PROMOTE, WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT, + WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT_INDEX_HINT_BITS, WAIT_EVENT_RECOVERY_CONFLICT_TABLESPACE, WAIT_EVENT_RECOVERY_PAUSE, WAIT_EVENT_REPLICATION_ORIGIN_DROP, diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h index 4313f516d3..0371223c1e 100644 --- a/src/include/replication/walreceiver.h +++ b/src/include/replication/walreceiver.h @@ -156,6 +156,12 @@ typedef struct * store semantics, 
so use sig_atomic_t. */ sig_atomic_t force_reply; /* used as a bool */ + + /* Whether the sender has received our xmin. */ + sig_atomic_t sender_has_standby_xmin; + + /* Whether the sender's feedback is propagated through the cascading replication chain up to the primary. */ + sig_atomic_t sender_propagates_feedback_to_primary; } WalRcvData; extern WalRcvData *WalRcv; diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index fb00fda6a7..cb4c9e9233 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -222,6 +222,8 @@ extern void BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum); extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std); +extern void MarkBufferDirtyIndexHint(Buffer buffer, bool buffer_std, + Relation rel, TransactionId latestRemovedXid); extern void UnlockBuffers(void); extern void LockBuffer(Buffer buffer, int mode); diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h index 683ab64f76..0a72160b61 100644 --- a/src/include/storage/proc.h +++ b/src/include/storage/proc.h @@ -165,6 +165,11 @@ struct PGPROC * though not required. Accessed without lock, if needed. */ bool recoveryConflictPending; + /* + * Flag allowing LP_DEAD bits in indexes to be read/set. + * Also used to raise recovery conflicts caused by index hint bits. + */ + bool indexIgnoreKilledTuples; /* Info about LWLock the process is currently waiting for, if any.
*/ bool lwWaiting; /* true if waiting for an LW lock */ diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h index b01fa52139..3b922f3fcb 100644 --- a/src/include/storage/procarray.h +++ b/src/include/storage/procarray.h @@ -70,7 +70,8 @@ extern bool IsBackendPid(int pid); extern VirtualTransactionId *GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0, bool allDbs, int excludeVacuum, int *nvxids); -extern VirtualTransactionId *GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid); +extern VirtualTransactionId *GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid, + bool onlyIndexIgnoreKilledTuples); extern pid_t CancelVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode); extern pid_t SignalVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode, bool conflictPending); diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h index 2b1f340b82..9758da768e 100644 --- a/src/include/storage/standby.h +++ b/src/include/storage/standby.h @@ -18,6 +18,7 @@ #include "storage/procsignal.h" #include "storage/relfilenode.h" #include "storage/standbydefs.h" +#include "utils/relcache.h" /* User-settable GUC parameters */ extern int vacuum_defer_cleanup_age; @@ -30,6 +31,9 @@ extern void ShutdownRecoveryTransactionEnvironment(void); extern void ResolveRecoveryConflictWithSnapshot(TransactionId latestRemovedXid, RelFileNode node); +extern void ResolveIndexHintBitsRecoveryConflictWithSnapshot( + TransactionId latestRemovedXid, + RelFileNode node); extern void ResolveRecoveryConflictWithTablespace(Oid tsid); extern void ResolveRecoveryConflictWithDatabase(Oid dbid); @@ -92,4 +96,8 @@ extern XLogRecPtr LogStandbySnapshot(void); extern void LogStandbyInvalidations(int nmsgs, SharedInvalidationMessage *msgs, bool relcacheInitFileInval); +extern void StandByShmemInit(void); +extern void LogIndexHintBitsHorizonIfNeeded(Relation rel, + TransactionId latestRemovedXid); + #endif /* 
STANDBY_H */ diff --git a/src/include/storage/standbydefs.h b/src/include/storage/standbydefs.h index d99e6f40c6..127de2e9eb 100644 --- a/src/include/storage/standbydefs.h +++ b/src/include/storage/standbydefs.h @@ -31,9 +31,10 @@ extern void standby_desc_invalidations(StringInfo buf, /* * XLOG message types */ -#define XLOG_STANDBY_LOCK 0x00 -#define XLOG_RUNNING_XACTS 0x10 -#define XLOG_INVALIDATIONS 0x20 +#define XLOG_STANDBY_LOCK 0x00 +#define XLOG_RUNNING_XACTS 0x10 +#define XLOG_INVALIDATIONS 0x20 +#define XLOG_INDEX_HINT_BITS_HORIZON 0x30 typedef struct xl_standby_locks { @@ -71,4 +72,10 @@ typedef struct xl_invalidations #define MinSizeOfInvalidations offsetof(xl_invalidations, msgs) +typedef struct xl_index_hint_bits_horizon +{ + RelFileNode rnode; + TransactionId latestRemovedXid; +} xl_index_hint_bits_horizon; + #endif /* STANDBYDEFS_H */
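For reviewers, the skip-if-older optimization implemented by `IsNewerIndexHintBitsHorizonXid`, `UpsertLatestIndexHintBitsHorizonXid`, and `LogIndexHintBitsHorizonIfNeeded` can be modeled outside the server. The following Python sketch is illustrative only — the names mirror the patch, and the xid comparison follows the modulo-2^32 semantics of `TransactionIdPrecedes` in transam.c:

```python
# Illustrative model of the per-database index-hint-bits horizon cache.
# TransactionId comparison is circular (modulo 2^32), as in transam.c.

def transaction_id_precedes(id1: int, id2: int) -> bool:
    """True if id1 logically precedes id2 under 32-bit wraparound
    (i.e. (int32)(id1 - id2) < 0 in the C implementation)."""
    return ((id1 - id2) & 0xFFFFFFFF) >= 0x80000000

class IndexHintBitsHorizons:
    """Models the shared hash keyed by database OID (see standby.c)."""
    def __init__(self):
        self._horizons = {}  # dbOid -> latest horizon xid already WAL-logged

    def is_newer(self, db_oid: int, latest_removed_xid: int) -> bool:
        # IsNewerIndexHintBitsHorizonXid: worth logging only if there is
        # no entry yet, or the cached horizon precedes the new xid.
        cached = self._horizons.get(db_oid)
        return cached is None or transaction_id_precedes(cached, latest_removed_xid)

    def upsert(self, db_oid: int, latest_removed_xid: int) -> None:
        # UpsertLatestIndexHintBitsHorizonXid: remember the newest value.
        cached = self._horizons.get(db_oid)
        if cached is None or transaction_id_precedes(cached, latest_removed_xid):
            self._horizons[db_oid] = latest_removed_xid

def log_horizon_if_needed(cache, db_oid, latest_removed_xid, wal):
    """Sketch of LogIndexHintBitsHorizonIfNeeded: emit a horizon record
    only when the horizon actually advances for this database."""
    if cache.is_newer(db_oid, latest_removed_xid):
        wal.append((db_oid, latest_removed_xid))  # XLOG_INDEX_HINT_BITS_HORIZON
        cache.upsert(db_oid, latest_removed_xid)

wal = []
cache = IndexHintBitsHorizons()
log_horizon_if_needed(cache, 1, 100, wal)
log_horizon_if_needed(cache, 1, 90, wal)   # older xid: record skipped
log_horizon_if_needed(cache, 1, 120, wal)  # newer xid: record emitted
assert wal == [(1, 100), (1, 120)]
```

This is why the additional WAL volume stays low: per database, only monotonically advancing horizons produce records.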
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index f05140dd42..d1dfde9a01 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -1733,6 +1733,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser <entry>Waiting for recovery conflict resolution for dropping a tablespace.</entry> </row> + <row> + <entry><literal>RecoveryConflictSnapshotIndexHintBits</literal></entry> + <entry>Waiting for recovery conflict resolution for index hint bits.</entry> + </row> <row> <entry><literal>RecoveryPause</literal></entry> <entry>Waiting for recovery to be resumed.</entry> @@ -1911,6 +1915,11 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser <entry>Waiting to read or update dynamic shared memory allocation information.</entry> </row> + <row> + <entry><literal>IndexHintHorizonShmem</literal></entry> + <entry>Waiting to read or update information about the latest index hint + bits horizon.</entry> + </row> <row> <entry><literal>LockFastPath</literal></entry> <entry>Waiting to read or update a process' fast-path lock diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README index 92205325fb..6188ce2f36 100644 --- a/src/backend/access/nbtree/README +++ b/src/backend/access/nbtree/README @@ -473,6 +473,11 @@ LSN of the page, and only act to set LP_DEAD bits when the LSN has not changed at all. (Avoiding dropping the pin entirely also makes it safe, of course.) +LP_DEAD bits are not WAL-logged themselves, but they can reach the standby as +part of a full-page write. To allow the use of LP_DEAD hints on hot standby +without MVCC failures, the latestRemovedXid of entries that are marked dead is +WAL-logged (the index hint bits horizon). + Bottom-Up deletion ------------------ @@ -653,17 +658,24 @@ lax about how same-level locks are acquired during recovery (most kinds of readers could still move right to recover if we didn't couple same-level locks), but we prefer to be conservative here.
-During recovery all index scans start with ignore_killed_tuples = false -and we never set kill_prior_tuple. We do this because the oldest xmin -on the standby server can be older than the oldest xmin on the primary -server, which means tuples can be marked LP_DEAD even when they are -still visible on the standby. We don't WAL log tuple LP_DEAD bits, but -they can still appear in the standby because of full page writes. So -we must always ignore them in standby, and that means it's not worth -setting them either. (When LP_DEAD-marked tuples are eventually deleted -on the primary, the deletion is WAL-logged. Queries that run on a -standby therefore get much of the benefit of any LP_DEAD setting that -takes place on the primary.) +There is some complexity in using LP_DEAD bits during recovery. Generally, +the bits can be set and read by scans, but a scan may encounter a bit that +was set on the primary. We don't WAL-log tuple LP_DEAD bits, but they can +still appear on the standby because of full-page writes. This can cause MVCC +failures because the oldest xmin on the standby server can be older than the +oldest xmin on the primary server, which means tuples can be marked LP_DEAD +even when they are still visible on the standby. + +To prevent such failures, we WAL-log the latestRemovedXid of entries that are +marked dead. This WAL record is used to raise conflicts on the standby, but +in some workloads it could cause a very high rate of conflicts. To eliminate +such conflicts, each scan decides whether to use hint bits at the moment it +takes its snapshot (the ignore_killed_tuples flag). If not, the bits are +ignored, but conflict resolution is skipped as well. The decision is based on +the effective hot_standby_feedback status: if hot_standby_feedback is enabled +and the backend's xmin is effectively honored by the primary, then +ignore_killed_tuples is set +to true.
BTW, hot_standby_feedback has nothing to do with MVCC correctness; it is +just taken into account to reduce unnecessary backend cancellations. Note that we talk about scans that are started during recovery. We go to a little trouble to allow a scan to start during recovery and end during diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index 1edc8180c1..401d87b2c0 100644 --- a/src/backend/access/transam/README +++ b/src/backend/access/transam/README @@ -633,13 +633,15 @@ In some cases, we write additional information to data blocks without writing a preceding WAL record. This should only happen iff the data can be reconstructed later following a crash and the action is simply a way of optimising for performance. When a hint is written we use -MarkBufferDirtyHint() to mark the block dirty. +MarkBufferDirtyHint() or MarkBufferDirtyIndexHint() to mark the block dirty. If the buffer is clean and checksums are in use then MarkBufferDirtyHint() inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image that includes the hint. We do this to avoid a partial page write, when we write the dirtied page. WAL is not written during recovery, so we simply skip -dirtying blocks because of hints when in recovery. +dirtying blocks because of hints when in recovery. MarkBufferDirtyIndexHint() +additionally may WAL-log a new index hint bits horizon value, allowing +LP_DEAD hints to be read/set on standby without MVCC failures. If you do decide to optimise away a WAL record, then any calls to MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
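To make the decision rule concrete, here is an illustrative Python model of `GetSnapshotIndexIgnoreKilledTuples` and of the filtered conflict search (`GetConflictingVirtualXIDs` with `onlyIndexIgnoreKilledTuples` set). It is a sketch only: field names mirror the patch, and the xid comparison is simplified to plain integers, ignoring wraparound:

```python
# Illustrative model of the standby-side decision described above: whether a
# snapshot may use LP_DEAD bits, and which backends a horizon record
# conflicts with. Not the patch code; names mirror it for readability.

def snapshot_ignores_killed_tuples(taken_during_recovery: bool,
                                   sender_has_standby_xmin: bool,
                                   sender_propagates_feedback: bool) -> bool:
    """GetSnapshotIndexIgnoreKilledTuples: on the primary always true; on a
    standby only while hot_standby_feedback is effective end to end."""
    if not taken_during_recovery:
        return True
    return sender_has_standby_xmin and sender_propagates_feedback

def conflicting_backends(backends, limit_xmin):
    """Model of GetConflictingVirtualXIDs(..., onlyIndexIgnoreKilledTuples):
    a horizon record conflicts only with backends that opted in to LP_DEAD
    bits and whose xmin does not follow latestRemovedXid.
    (Simplified integer comparison; real code uses TransactionIdFollows.)"""
    return [b["vxid"] for b in backends
            if b["xmin"] is not None
            and b["xmin"] <= limit_xmin
            and b["index_ignore_killed_tuples"]]

# A backend that ignores LP_DEAD bits is never canceled by a horizon record,
# which is exactly why turning hot_standby_feedback off avoids conflicts.
backends = [
    {"vxid": 1, "xmin": 90,  "index_ignore_killed_tuples": True},
    {"vxid": 2, "xmin": 90,  "index_ignore_killed_tuples": False},  # ignores bits
    {"vxid": 3, "xmin": 150, "index_ignore_killed_tuples": True},   # newer xmin
]
assert conflicting_backends(backends, 100) == [1]
```

The compromise described in the intro falls out of these two functions: a backend either opts in (uses LP_DEAD bits, risks cancellation by XLOG_INDEX_HINT_BITS_HORIZON) or opts out (ignores both).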
diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile index 96442ceb4e..6399184a8c 100644 --- a/src/test/recovery/Makefile +++ b/src/test/recovery/Makefile @@ -10,6 +10,7 @@ #------------------------------------------------------------------------- EXTRA_INSTALL=contrib/test_decoding +EXTRA_INSTALL+=contrib/pageinspect subdir = src/test/recovery top_builddir = ../../.. diff --git a/src/test/recovery/t/022_index_hint_bits.pl b/src/test/recovery/t/022_index_hint_bits.pl new file mode 100644 index 0000000000..95205b4d98 --- /dev/null +++ b/src/test/recovery/t/022_index_hint_bits.pl @@ -0,0 +1,283 @@ +# Checks that snapshots on standbys behave in a minimally reasonable +# way. +use strict; +use warnings; + +use PostgresNode; +use TestLib; +use Test::More tests => 15; +use Config; + +# Initialize primary node +my $node_primary = get_new_node('primary'); +$node_primary->init(allows_streaming => 1); +$node_primary->append_conf('postgresql.conf', qq{ + autovacuum = off + enable_seqscan = off + enable_indexonlyscan = off +}); +$node_primary->start; + +$node_primary->safe_psql('postgres', 'CREATE EXTENSION pageinspect'); +# Create test table with primary index +$node_primary->safe_psql( + 'postgres', 'CREATE TABLE test_index_hint (id int, value int)'); +$node_primary->safe_psql( + 'postgres', 'CREATE INDEX test_index ON test_index_hint (value, id)'); +# Fill it with some data; keep the number of records small to avoid a +# heap_page_prune_opt call, whose recovery conflict would hide the conflict +# caused by index hint bits +$node_primary->safe_psql('postgres', + 'INSERT INTO test_index_hint VALUES (generate_series(1, 30), 0)'); +# And vacuum to allow index hint bits to be set +$node_primary->safe_psql('postgres', 'VACUUM test_index_hint'); +# To fail fast in case of an FPW from the primary +$node_primary->safe_psql('postgres', 'CHECKPOINT'); + +# Take backup +my $backup_name = 'my_backup'; +$node_primary->backup($backup_name); + +# Restore standby node from backup
+my $node_standby_1 = get_new_node('standby_1'); +$node_standby_1->init_from_backup($node_primary, $backup_name, + has_streaming => 1); + +my $standby_settings = qq{ + max_standby_streaming_delay = 1 + wal_receiver_status_interval = 1 + hot_standby_feedback = on + enable_seqscan = off + enable_indexonlyscan = off +}; +$node_standby_1->append_conf('postgresql.conf', $standby_settings); +$node_standby_1->start; + +$node_standby_1->backup($backup_name); + +# Create second standby node linking to standby 1 +my $node_standby_2 = get_new_node('standby_2'); +$node_standby_2->init_from_backup($node_standby_1, $backup_name, + has_streaming => 1); +$node_standby_2->append_conf('postgresql.conf', $standby_settings); +$node_standby_2->start; + +# Make sure sender_propagates_feedback_to_primary is set on standbys +wait_hfs($node_primary, 1); +wait_hfs($node_standby_1, 1); + +# To avoid hanging while expecting some specific input from a psql +# instance being driven by us, add a timeout high enough that it +# should never trigger even on very slow machines, unless something +# is really wrong. 
+my $psql_timeout = IPC::Run::timer(30); + +# One psql to run command in repeatable read isolation level +my %psql_standby_repeatable_read = ('stdin' => '', 'stdout' => '', 'stderr' => ''); +$psql_standby_repeatable_read{run} = + IPC::Run::start( + [ 'psql', '-XAb', '-f', '-', '-d', $node_standby_1->connstr('postgres') ], + '<', \$psql_standby_repeatable_read{stdin}, + '>', \$psql_standby_repeatable_read{stdout}, + '2>', \$psql_standby_repeatable_read{stderr}, + $psql_timeout); + +# Another psql to run command in read committed isolation level +my %psql_standby_read_committed = ('stdin' => '', 'stdout' => '', 'stderr' => ''); +$psql_standby_read_committed{run} = + IPC::Run::start( + [ 'psql', '-XAb', '-f', '-', '-d', $node_standby_1->connstr('postgres') ], + '<', \$psql_standby_read_committed{stdin}, + '>', \$psql_standby_read_committed{stdout}, + '2>', \$psql_standby_read_committed{stderr}, + $psql_timeout); + +# Start RR transaction and read first row from index +ok(send_query_and_wait(\%psql_standby_repeatable_read, + q[ +BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ; +SELECT id FROM test_index_hint WHERE value = 0 ORDER BY id LIMIT 1; +], + qr/1\n\(1 row\)/m), + 'row is visible in repeatable read'); + +# Start RC transaction and read first row from index +ok(send_query_and_wait(\%psql_standby_read_committed, + q[ +BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED; +SELECT id FROM test_index_hint WHERE value = 0 ORDER BY id LIMIT 1; +], + qr/1\n\(1 row\)/m), + 'row is visible in read committed'); + +# Now delete first 10 rows in index +$node_primary->safe_psql('postgres', + 'UPDATE test_index_hint SET value = 1 WHERE id <= 10'); + +# Make sure hint bits are not set on primary +is(hints_num($node_primary), qq(0), 'no index hint bits are set on primary yet'); + +# Make sure page is not processed by heap_page_prune_opt +is(non_normal_num($node_primary), qq(0), 'all items are normal in heap'); + +# Wait for standbys to catch up transaction 
+wait_for_catchup_all();
+
+# Disable hot_standby_feedback to trigger conflicts later
+$node_standby_1->safe_psql('postgres',
+	'ALTER SYSTEM SET hot_standby_feedback = off;');
+$node_standby_1->reload;
+
+# Make sure sender_propagates_feedback_to_primary is not set on standby
+wait_hfs($node_primary, 0);
+wait_hfs($node_standby_1, 1);
+
+# Try to set hint bits in index on standby
+try_to_set_hint_bits();
+
+# Make sure read committed transaction is able to see correct data
+ok(send_query_and_wait(\%psql_standby_read_committed,
+	q/SELECT id FROM test_index_hint WHERE value = 0 ORDER BY id LIMIT 1;/,
+	qr/11\n\(1 row\)/m),
+	'session is not canceled for read committed');
+
+# Make sure previous queries did not set the hints on standby because
+# of the parallel transaction running
+ok(send_query_and_wait(\%psql_standby_repeatable_read,
+	q/SELECT id FROM test_index_hint WHERE value = 0 ORDER BY id LIMIT 1;/,
+	qr/1\n\(1 row\)/m),
+	'hints on standby are not set');
+
+is(hints_num($node_standby_1), qq(0), 'no index hint bits are set on standby yet');
+
+# Set index hint bits and replicate to standby
+$node_primary->safe_psql('postgres',
+	'SELECT id FROM test_index_hint WHERE value = 0 ORDER BY id LIMIT 1;');
+
+# Make sure page is not processed by heap_page_prune_opt
+is(non_normal_num($node_primary), qq(0), 'all items are normal in heap');
+# Make sure hint bits are set
+is(hints_num($node_primary), qq(10), 'hint bits are set on primary already');
+
+# Wait for standbys to catch up hint bits
+wait_for_catchup_all();
+
+# Make sure read committed transaction is able to see correct data
+ok(send_query_and_wait(\%psql_standby_read_committed,
+	q/SELECT id FROM test_index_hint WHERE value = 0 ORDER BY id LIMIT 1;/,
+	qr/11\n\(1 row\)/m),
+	'session is not canceled for read committed');
+
+# Make sure repeatable read transaction is canceled because of
+# XLOG_INDEX_HINT_BITS_HORIZON from primary
+ok((send_query_and_wait(\%psql_standby_repeatable_read,
+	q/SELECT id FROM test_index_hint WHERE value = 0 ORDER BY id LIMIT 1;/,
+	qr/.*terminating connection due to conflict with recovery.*/m)),
+	'session is canceled for repeatable read');
+
+# Try to set hint bits in index on standby
+try_to_set_hint_bits();
+
+is(hints_num($node_standby_1), qq(0),
+	'hints are not set on standby 1 because hs feedback is off');
+is(hints_num($node_standby_2), qq(0),
+	'hint bits are not set on standby 2 because hs feedback chain is broken');
+
+# Enable hot_standby_feedback to allow hint bits to be set
+$node_standby_1->safe_psql('postgres',
+	'ALTER SYSTEM SET hot_standby_feedback = on;');
+$node_standby_1->reload;
+
+# Make sure sender_propagates_feedback_to_primary is now set on standbys
+wait_hfs($node_primary, 1);
+wait_hfs($node_standby_1, 1);
+
+# Try to set hint bits in index on standby
+try_to_set_hint_bits();
+
+is(hints_num($node_standby_1), qq(10),
+	'hint bits are now set on standby 1 because feedback is on');
+is(hints_num($node_standby_2), qq(10),
+	'hint bits are now set on standby 2 because the feedback chain is uninterrupted');
+
+$node_primary->stop();
+$node_standby_1->stop();
+$node_standby_2->stop();
+
+# Send query, wait until string matches
+sub send_query_and_wait {
+	my ($psql, $query, $untl) = @_;
+
+	# send query
+	$$psql{stdin} .= $query;
+	$$psql{stdin} .= "\n";
+
+	# wait for query results
+	$$psql{run}->pump_nb();
+	while (1) {
+		# See PostgresNode.pm's psql()
+		$$psql{stdout} =~ s/\r\n/\n/g if $Config{osname} eq 'msys';
+
+		#diag("\n" . $$psql{stdout}); # for debugging
+		#diag("\n" . $$psql{stderr}); # for debugging
+
+		last if $$psql{stdout} =~ /$untl/;
+		last if $$psql{stderr} =~ /$untl/;
+
+		if ($psql_timeout->is_expired) {
+			BAIL_OUT("aborting wait: program timed out \n" .
+				"stream contents: >>$$psql{stdout}<< \n" .
+				"pattern searched for: $untl");
+			return 0;
+		}
+		if (not $$psql{run}->pumpable()) {
+			# psql exited; this is fine for some tests, so report no match
+			return 0;
+		}
+		$$psql{run}->pump();
+		select(undef, undef, undef, 0.01); # sleep a little
+
+	}
+
+	$$psql{stdout} = '';
+
+	return 1;
+}
+
+sub try_to_set_hint_bits {
+	# Try to set hint bits in index on standby
+	foreach (0 .. 3) {
+		$node_standby_1->safe_psql('postgres',
+			'SELECT * FROM test_index_hint WHERE value = 0 ORDER BY id LIMIT 1;');
+		$node_standby_2->safe_psql('postgres',
+			'SELECT * FROM test_index_hint WHERE value = 0 ORDER BY id LIMIT 1;');
+	}
+}
+
+sub wait_for_catchup_all {
+	$node_primary->wait_for_catchup($node_standby_1, 'replay',
+		$node_primary->lsn('insert'));
+	$node_standby_1->wait_for_catchup($node_standby_2, 'replay',
+		$node_standby_1->lsn('replay'));
+}
+
+sub hints_num {
+	my ($node) = @_;
+	return $node->safe_psql('postgres',
+		"SELECT count(*) FROM bt_page_items('test_index', 1) WHERE dead = true");
+}
+
+sub non_normal_num {
+	my ($node) = @_;
+	return $node->safe_psql('postgres',
+		"SELECT COUNT(*) FROM heap_page_items(get_raw_page('test_index_hint', 0)) WHERE lp_flags != 1");
+}
+
+sub wait_hfs {
+	my ($node, $n) = @_;
+	$node->poll_query_until('postgres',
+		"SELECT (SELECT COUNT(*) FROM (SELECT * FROM pg_stat_replication WHERE backend_xmin IS NOT NULL) AS X) = $n")
+		or die 'backend_xmin is invalid';
+	# Make sure we have received reply to feedback message
+	sleep(2);
+}
\ No newline at end of file