On Fri, Oct 16, 2020 at 1:58 PM Peter Geoghegan <p...@bowt.ie> wrote:
> Attached is v3, which is rebased against the master branch as of
> today. No real changes, though.

And now here's v4.

This version adds handling of posting list tuples, which I was
skipping over before. Highly contended leaf pages with posting list
tuples could still sometimes get "logically unnecessary" page splits
in v3, which this seems to fix (barring the most extreme cases of
version churn, where the patch cannot and should not prevent page
splits). Actually, posting list tuple handling was something that we
had in v1 but lost in v2, because v2 changed our general strategy to
focus on what is convenient to the heap/tableam, which is the most
important thing by far (the cost of failing must be low, fixed, well
understood, and well targeted). The fix was to include TIDs from
posting lists, while marking them "non-promising". Only plain tuples
that are duplicates are considered promising. Most of the time, only
promising tuples influence where we look for tuples to kill. The
exception is when two heap pages have an equal number of promising
tuples; in that case we break the tie using the total number of TIDs
on the leaf page that point to each heap page.
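
To make that concrete, here's a minimal sketch of how one leaf page
item ends up in the deltids array (the TM_IndexDelete fields are the
ones from the patch; the wrapper function and its isduplicate argument
are just for illustration):

static void
sketch_save_item_tids(IndexTuple itup, OffsetNumber offnum,
                      bool isduplicate,
                      TM_IndexDelete *deltids, int *ndeltids)
{
    if (!BTreeTupleIsPosting(itup))
    {
        /* Plain tuple: one TID, promising only when it's a duplicate */
        deltids[*ndeltids].tid = itup->t_tid;
        deltids[*ndeltids].ioffnum = offnum;
        deltids[*ndeltids].ispromising = isduplicate;
        deltids[*ndeltids].isdead = false;  /* tableam decides later */
        (*ndeltids)++;
    }
    else
    {
        /* Posting list tuple: save every TID, but never as promising */
        for (int p = 0; p < BTreeTupleGetNPosting(itup); p++)
        {
            deltids[*ndeltids].tid = *BTreeTupleGetPostingN(itup, p);
            deltids[*ndeltids].ioffnum = offnum;
            deltids[*ndeltids].ispromising = false;
            deltids[*ndeltids].isdead = false;
            (*ndeltids)++;
        }
    }
}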

I seem to be able to cap the cost paid when the optimization doesn't
work out: we can get away with visiting only *one* heap page before
giving up. And we now never visit more than 3 heap pages in total (2 is
the common case when the optimization is really effective). This
may sound conservative -- because it is -- but it seems to work in
practice. I may change my mind about that and decide to be less
conservative, but so far all of the evidence I've seen suggests that
it doesn't matter -- the heuristics seem to really work. Might as well
be completely conservative. I'm *sure* that we can afford one heap
access here -- we currently *always* visit at least one heap tuple in
roughly the same way during each and every unique index tuple insert
(not just when the page fills).
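
The giving-up logic itself lives on the tableam side, in
table_index_batch_check() (its body isn't quoted in this mail), so the
following is only a rough sketch of the shape of the cap I'm
describing -- every function name below is illustrative, not the
patch's actual code:

static int sketch_check_one_block(Relation rel, TM_IndexDelete *deltids,
                                  int ndeltids, BlockNumber blkno,
                                  Snapshot snapshot);   /* next sketch */

static int
sketch_batch_check(Relation rel, TM_IndexDelete *deltids, int ndeltids,
                   BlockNumber *blocks, int nblocks, Snapshot snapshot)
{
    int         nkilled = 0;

    /* "blocks" is assumed to be sorted most-promising-first */
    for (int b = 0; b < nblocks; b++)
    {
        if (b == 1 && nkilled == 0)
            break;          /* first heap page was a bust -- give up */
        if (b == 3)
            break;          /* hard cap: never visit more than 3 pages */

        nkilled += sketch_check_one_block(rel, deltids, ndeltids,
                                          blocks[b], snapshot);
    }

    return nkilled;
}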

Posting list TIDs are not the only kind of TID that gets marked
non-promising. We now also include TIDs from non-duplicate tuples. So
we're prepared to kill any of the TIDs on the page, though we only
really expect it to happen with the "promising" tuples (duplicates).
But why not be open to the possibility of killing some extra TIDs in
passing? We don't have to visit any extra heap pages to do so, so it's
practically free. Empirical evidence shows that this happens quite
often.
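
Continuing the sketch from above (again purely an illustration, not
the tableam code from the patch), the per-heap-page step simply checks
every deltids[] entry that happens to point at the page we're reading
anyway, promising or not:

static int
sketch_check_one_block(Relation rel, TM_IndexDelete *deltids,
                       int ndeltids, BlockNumber blkno,
                       Snapshot snapshot)
{
    int         nkilled = 0;

    for (int i = 0; i < ndeltids; i++)
    {
        if (ItemPointerGetBlockNumber(&deltids[i].tid) != blkno)
            continue;       /* entry points at some other heap page */

        /*
         * sketch_tuple_dead_to_all() is a made-up stand-in for the
         * real visibility check, which goes through the usual
         * index-fetch machinery with a non-vacuumable snapshot
         */
        if (sketch_tuple_dead_to_all(rel, &deltids[i].tid, snapshot))
        {
            deltids[i].isdead = true;
            nkilled++;
        }
    }

    return nkilled;
}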

Here's why this posting list strategy is a good one: we mark posting
list TIDs non-promising to represent that there are probably multiple
distinct logical rows involved, or at least that those TIDs didn't
work out before -- simple trial and error suggests that they aren't
very "promising", whatever the true reason happens to be. But why not
"keep an open mind" about whether the TIDs really are each for a
distinct logical row? If it just so happens that the posting list TIDs
really were multiple versions of the same logical row all along, then
there is a reasonable chance that there'll be even more versions on
that same heap page later on. When that happens, and we end up back on
the same B-Tree leaf page to think about dedup deletion once again,
it's pretty likely that we'll naturally end up looking at that same
heap page again because of the newer versions. At that point we won't
miss the opportunity to check the posting list's TIDs in passing. So
we get to delete the posting
list after all!
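
What makes "after all" work is the all-or-nothing rule that decides
whether an index tuple can actually be removed.  Roughly (simplified
from the loop in _bt_dedup_delete_pass(); sketch_tid_marked_dead() is
a made-up helper standing in for the patch's scan of the sorted
deltids array):

/* Caller already knows that at least one of itup's TIDs is dead */
static bool
sketch_can_kill_tuple(IndexTuple itup, TM_IndexDelete *deltids,
                      int ndeltids)
{
    if (!BTreeTupleIsPosting(itup))
        return true;        /* plain tuple: one dead TID is enough */

    /* Posting list tuple: every one of its table TIDs must be dead */
    for (int p = 0; p < BTreeTupleGetNPosting(itup); p++)
    {
        ItemPointer htid = BTreeTupleGetPostingN(itup, p);

        if (!sketch_tid_marked_dead(deltids, ndeltids, htid))
            return false;
    }

    return true;
}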

(If you're thinking "but we probably won't get lucky like that", then
consider that it doesn't have to happen on the next occasion when
delete deduplication happens on the same leaf page. Just at some point
in the future. This is due to the way we visit the heap pages that
look most promising first. It might take a couple of rounds of dedup
deletion to get back to the same heap page, but that's okay. The
simple heuristics proposed in the patch seem to have some really
interesting and useful consequences in practice. It's hard to quantify
how important these kinds of accidents of locality will be. I haven't
targeted this or that effect -- my heuristics are dead simple, and
based almost entirely on keeping costs down. You can think of it as
"increase the number of TIDs to increase our chances of success" if
you prefer.)

The life cycle of logical rows/heap tuples seems to naturally lead to
these kinds of opportunities. Obviously heapam is already keen on
storing related versions on the same heap block, without
any input from this patch. The patch is still evolving, and the
overall structure and interface certainly still need work -- I've
focussed on the algorithm so far. I could really use some feedback on
the high-level direction, though. It's a relatively short patch, even
with all of my README commentary. But... it's not an easy concept to
digest.

Note: I'm not really sure whether it's necessary to provide specialized
qsort routines for the sorting we do on lists of TIDs, etc. I did some
experimentation with that, but it remains an open question. For now I
rely on the patch that Thomas Munro posted a little while back to do
that, which is why it's included here.
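
For reference, the fallback if the specialization turns out not to be
worth it would just be a plain qsort() with a function-pointer
comparator wrapping the patch's _bt_indexdelete_cmp() (this wrapper is
illustrative, not something that's in the patch):

static int
deltids_cmp(const void *a, const void *b)
{
    return _bt_indexdelete_cmp((TM_IndexDelete *) a,
                               (TM_IndexDelete *) b);
}

The call site then becomes qsort(deltids, ndeltids,
sizeof(TM_IndexDelete), deltids_cmp) instead of the generated
qsort_deltids(deltids, ndeltids).  Whether the indirect comparisons
matter for an array of at most MaxTIDsPerBTreePage entries is the part
that still needs measuring.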

--
Peter Geoghegan
From dbd44a9982b772e751f5da9630a4151794131380 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 17 Aug 2020 21:31:56 +1200
Subject: [PATCH v4 1/2] Add sort_template.h for making fast sort functions.

Move our qsort implementation into a header that can be used to
define specialized functions for better performance.

Discussion: https://postgr.es/m/CA%2BhUKGKMQFVpjr106gRhwk6R-nXv0qOcTreZuQzxgpHESAL6dw%40mail.gmail.com
---
 src/include/lib/sort_template.h | 428 ++++++++++++++++++++++++++++++++
 1 file changed, 428 insertions(+)
 create mode 100644 src/include/lib/sort_template.h

diff --git a/src/include/lib/sort_template.h b/src/include/lib/sort_template.h
new file mode 100644
index 0000000000..a279bcf959
--- /dev/null
+++ b/src/include/lib/sort_template.h
@@ -0,0 +1,428 @@
+/*-------------------------------------------------------------------------
+ *
+ * sort_template.h
+ *
+ *	  A template for a sort algorithm that supports varying degrees of
+ *	  specialization.
+ *
+ * Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * Usage notes:
+ *
+ *	  To generate functions specialized for a type, the following parameter
+ *	  macros should be #define'd before this file is included.
+ *
+ *	  - ST_SORT - the name of a sort function to be generated
+ *	  - ST_ELEMENT_TYPE - type of the referenced elements
+ *	  - ST_DECLARE - if defined the functions and types are declared
+ *	  - ST_DEFINE - if defined the functions and types are defined
+ *	  - ST_SCOPE - scope (e.g. extern, static inline) for functions
+ *
+ *	  Instead of ST_ELEMENT_TYPE, ST_ELEMENT_TYPE_VOID can be defined.  Then
+ *	  the generated functions will automatically gain an "element_size"
+ *	  parameter.  This allows us to generate a traditional qsort function.
+ *
+ *	  One of the following macros must be defined, to show how to compare
+ *	  elements.  The first two options are arbitrary expressions depending
+ *	  on whether an extra pass-through argument is desired, and the third
+ *	  option should be defined if the sort function should receive a
+ *	  function pointer at runtime.
+ *
+ * 	  - ST_COMPARE(a, b) - a simple comparison expression
+ *	  - ST_COMPARE(a, b, arg) - variant that takes an extra argument
+ *	  - ST_COMPARE_RUNTIME_POINTER - sort function takes a function pointer
+ *
+ *	  To say that the comparator and therefore also sort function should
+ *	  receive an extra pass-through argument, specify the type of the
+ *	  argument.
+ *
+ *	  - ST_COMPARE_ARG_TYPE - type of extra argument
+ *
+ *	  The prototype of the generated sort function is:
+ *
+ *	  void ST_SORT(ST_ELEMENT_TYPE *data, size_t n,
+ *				   [size_t element_size,]
+ *				   [ST_SORT_compare_function compare,]
+ *				   [ST_COMPARE_ARG_TYPE *arg]);
+ *
+ *	  ST_SORT_compare_function is a function pointer of the following type:
+ *
+ *	  int (*)(const ST_ELEMENT_TYPE *a, const ST_ELEMENT_TYPE *b,
+ *			  [ST_COMPARE_ARG_TYPE *arg])
+ *
+ * HISTORY
+ *
+ *	  Modifications from vanilla NetBSD source:
+ *	  - Add do ... while() macro fix
+ *	  - Remove __inline, _DIAGASSERTs, __P
+ *	  - Remove ill-considered "swap_cnt" switch to insertion sort, in favor
+ *		of a simple check for presorted input.
+ *	  - Take care to recurse on the smaller partition, to bound stack usage
+ *	  - Convert into a header that can generate specialized functions
+ *
+ * IDENTIFICATION
+ *		src/include/lib/sort_template.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/*	  $NetBSD: qsort.c,v 1.13 2003/08/07 16:43:42 agc Exp $   */
+
+/*-
+ * Copyright (c) 1992, 1993
+ *	  The Regents of the University of California.  All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *	  notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *	  notice, this list of conditions and the following disclaimer in the
+ *	  documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of the University nor the names of its contributors
+ *	  may be used to endorse or promote products derived from this software
+ *	  without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/*
+ * Qsort routine based on J. L. Bentley and M. D. McIlroy,
+ * "Engineering a sort function",
+ * Software--Practice and Experience 23 (1993) 1249-1265.
+ *
+ * We have modified their original by adding a check for already-sorted
+ * input, which seems to be a win per discussions on pgsql-hackers around
+ * 2006-03-21.
+ *
+ * Also, we recurse on the smaller partition and iterate on the larger one,
+ * which ensures we cannot recurse more than log(N) levels (since the
+ * partition recursed to is surely no more than half of the input).  Bentley
+ * and McIlroy explicitly rejected doing this on the grounds that it's "not
+ * worth the effort", but we have seen crashes in the field due to stack
+ * overrun, so that judgment seems wrong.
+ */
+
+#define ST_MAKE_PREFIX(a) CppConcat(a,_)
+#define ST_MAKE_NAME(a,b) ST_MAKE_NAME_(ST_MAKE_PREFIX(a),b)
+#define ST_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/*
+ * If the element type is void, we'll also need an element_size argument
+ * because we don't know the size.
+ */
+#ifdef ST_ELEMENT_TYPE_VOID
+#define ST_ELEMENT_TYPE void
+#define ST_SORT_PROTO_SIZE , size_t element_size
+#define ST_SORT_INVOKE_SIZE , element_size
+#else
+#define ST_SORT_PROTO_SIZE
+#define ST_SORT_INVOKE_SIZE
+#endif
+
+/*
+ * If the user wants to be able to pass in compare functions at runtime,
+ * we'll need to make that an argument of the sort and med3 functions.
+ */
+#ifdef ST_COMPARE_RUNTIME_POINTER
+/*
+ * The type of the comparator function pointer that ST_SORT will take, unless
+ * you've already declared a type name manually and want to use that instead of
+ * having a new one defined.
+ */
+#ifndef ST_COMPARATOR_TYPE_NAME
+#define ST_COMPARATOR_TYPE_NAME ST_MAKE_NAME(ST_SORT, compare_function)
+#endif
+#define ST_COMPARE compare
+#ifndef ST_COMPARE_ARG_TYPE
+#define ST_SORT_PROTO_COMPARE , ST_COMPARATOR_TYPE_NAME compare
+#define ST_SORT_INVOKE_COMPARE , compare
+#else
+#define ST_SORT_PROTO_COMPARE , ST_COMPARATOR_TYPE_NAME compare
+#define ST_SORT_INVOKE_COMPARE , compare
+#endif
+#else
+#define ST_SORT_PROTO_COMPARE
+#define ST_SORT_INVOKE_COMPARE
+#endif
+
+/*
+ * If the user wants to use a compare function or expression that takes an
+ * extra argument, we'll need to make that an argument of the sort, compare and
+ * med3 functions.
+ */
+#ifdef ST_COMPARE_ARG_TYPE
+#define ST_SORT_PROTO_ARG , ST_COMPARE_ARG_TYPE *arg
+#define ST_SORT_INVOKE_ARG , arg
+#else
+#define ST_SORT_PROTO_ARG
+#define ST_SORT_INVOKE_ARG
+#endif
+
+#ifdef ST_DECLARE
+
+#ifdef ST_COMPARE_RUNTIME_POINTER
+typedef int (*ST_COMPARATOR_TYPE_NAME)(const ST_ELEMENT_TYPE *,
+									   const ST_ELEMENT_TYPE *
+									   ST_SORT_PROTO_ARG);
+#endif
+
+/* Declare the sort function.  Note optional arguments at end. */
+ST_SCOPE void ST_SORT(ST_ELEMENT_TYPE *first, size_t n
+					  ST_SORT_PROTO_SIZE
+					  ST_SORT_PROTO_COMPARE
+					  ST_SORT_PROTO_ARG);
+
+#endif
+
+#ifdef ST_DEFINE
+
+/* sort private helper functions */
+#define ST_MED3 ST_MAKE_NAME(ST_SORT, med3)
+#define ST_SWAP ST_MAKE_NAME(ST_SORT, swap)
+#define ST_SWAPN ST_MAKE_NAME(ST_SORT, swapn)
+
+/* Users expecting to run very large sorts may need them to be interruptible. */
+#ifdef ST_CHECK_FOR_INTERRUPTS
+#define DO_CHECK_FOR_INTERRUPTS() CHECK_FOR_INTERRUPTS()
+#else
+#define DO_CHECK_FOR_INTERRUPTS()
+#endif
+
+/*
+ * Create wrapper macros that know how to invoke compare, med3 and sort with
+ * the right arguments.
+ */
+#ifdef ST_COMPARE_RUNTIME_POINTER
+#define DO_COMPARE(a_, b_) ST_COMPARE((a_), (b_) ST_SORT_INVOKE_ARG)
+#elif defined(ST_COMPARE_ARG_TYPE)
+#define DO_COMPARE(a_, b_) ST_COMPARE((a_), (b_), arg)
+#else
+#define DO_COMPARE(a_, b_) ST_COMPARE((a_), (b_))
+#endif
+#define DO_MED3(a_, b_, c_)												\
+	ST_MED3((a_), (b_), (c_)											\
+			ST_SORT_INVOKE_COMPARE										\
+			ST_SORT_INVOKE_ARG)
+#define DO_SORT(a_, n_)													\
+	ST_SORT((a_), (n_)													\
+			ST_SORT_INVOKE_SIZE											\
+			ST_SORT_INVOKE_COMPARE										\
+			ST_SORT_INVOKE_ARG)
+
+/*
+ * If we're working with void pointers, we'll use pointer arithmetic based on
+ * uint8, and use the runtime element_size to step through the array and swap
+ * elements.  Otherwise we'll work with ST_ELEMENT_TYPE.
+ */
+#ifndef ST_ELEMENT_TYPE_VOID
+#define ST_POINTER_TYPE ST_ELEMENT_TYPE
+#define ST_POINTER_STEP 1
+#define DO_SWAPN(a_, b_, n_) ST_SWAPN((a_), (b_), (n_))
+#define DO_SWAP(a_, b_) ST_SWAP((a_), (b_))
+#else
+#define ST_POINTER_TYPE uint8
+#define ST_POINTER_STEP element_size
+#define DO_SWAPN(a_, b_, n_) ST_SWAPN((a_), (b_), (n_))
+#define DO_SWAP(a_, b_) DO_SWAPN((a_), (b_), element_size)
+#endif
+
+/*
+ * Find the median of three values.  Currently, performance seems to be best
+ * if the comparator is inlined here, but the med3 function is not inlined
+ * in the qsort function.
+ */
+static pg_noinline ST_ELEMENT_TYPE *
+ST_MED3(ST_ELEMENT_TYPE *a,
+		ST_ELEMENT_TYPE *b,
+		ST_ELEMENT_TYPE *c
+		ST_SORT_PROTO_COMPARE
+		ST_SORT_PROTO_ARG)
+{
+		return DO_COMPARE(a, b) < 0 ?
+		(DO_COMPARE(b, c) < 0 ? b : (DO_COMPARE(a, c) < 0 ? c : a))
+		: (DO_COMPARE(b, c) > 0 ? b : (DO_COMPARE(a, c) < 0 ? a : c));
+}
+
+static inline void
+ST_SWAP(ST_POINTER_TYPE *a, ST_POINTER_TYPE *b)
+{
+	ST_POINTER_TYPE tmp = *a;
+
+	*a = *b;
+	*b = tmp;
+}
+
+static inline void
+ST_SWAPN(ST_POINTER_TYPE *a, ST_POINTER_TYPE *b, size_t n)
+{
+	for (size_t i = 0; i < n; ++i)
+		ST_SWAP(&a[i], &b[i]);
+}
+
+/*
+ * Sort an array.
+ */
+ST_SCOPE void
+ST_SORT(ST_ELEMENT_TYPE *data, size_t n
+		ST_SORT_PROTO_SIZE
+		ST_SORT_PROTO_COMPARE
+		ST_SORT_PROTO_ARG)
+{
+	ST_POINTER_TYPE *a = (ST_POINTER_TYPE *) data,
+			   *pa,
+			   *pb,
+			   *pc,
+			   *pd,
+			   *pl,
+			   *pm,
+			   *pn;
+	size_t		d1,
+				d2;
+	int			r,
+				presorted;
+
+loop:
+	DO_CHECK_FOR_INTERRUPTS();
+	if (n < 7)
+	{
+		for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
+			 pm += ST_POINTER_STEP)
+			for (pl = pm; pl > a && DO_COMPARE(pl - ST_POINTER_STEP, pl) > 0;
+				 pl -= ST_POINTER_STEP)
+				DO_SWAP(pl, pl - ST_POINTER_STEP);
+		return;
+	}
+	presorted = 1;
+	for (pm = a + ST_POINTER_STEP; pm < a + n * ST_POINTER_STEP;
+		 pm += ST_POINTER_STEP)
+	{
+		DO_CHECK_FOR_INTERRUPTS();
+		if (DO_COMPARE(pm - ST_POINTER_STEP, pm) > 0)
+		{
+			presorted = 0;
+			break;
+		}
+	}
+	if (presorted)
+		return;
+	pm = a + (n / 2) * ST_POINTER_STEP;
+	if (n > 7)
+	{
+		pl = a;
+		pn = a + (n - 1) * ST_POINTER_STEP;
+		if (n > 40)
+		{
+			size_t		d = (n / 8) * ST_POINTER_STEP;
+
+			pl = DO_MED3(pl, pl + d, pl + 2 * d);
+			pm = DO_MED3(pm - d, pm, pm + d);
+			pn = DO_MED3(pn - 2 * d, pn - d, pn);
+		}
+		pm = DO_MED3(pl, pm, pn);
+	}
+	DO_SWAP(a, pm);
+	pa = pb = a + ST_POINTER_STEP;
+	pc = pd = a + (n - 1) * ST_POINTER_STEP;
+	for (;;)
+	{
+		while (pb <= pc && (r = DO_COMPARE(pb, a)) <= 0)
+		{
+			if (r == 0)
+			{
+				DO_SWAP(pa, pb);
+				pa += ST_POINTER_STEP;
+			}
+			pb += ST_POINTER_STEP;
+			DO_CHECK_FOR_INTERRUPTS();
+		}
+		while (pb <= pc && (r = DO_COMPARE(pc, a)) >= 0)
+		{
+			if (r == 0)
+			{
+				DO_SWAP(pc, pd);
+				pd -= ST_POINTER_STEP;
+			}
+			pc -= ST_POINTER_STEP;
+			DO_CHECK_FOR_INTERRUPTS();
+		}
+		if (pb > pc)
+			break;
+		DO_SWAP(pb, pc);
+		pb += ST_POINTER_STEP;
+		pc -= ST_POINTER_STEP;
+	}
+	pn = a + n * ST_POINTER_STEP;
+	d1 = Min(pa - a, pb - pa);
+	DO_SWAPN(a, pb - d1, d1);
+	d1 = Min(pd - pc, pn - pd - ST_POINTER_STEP);
+	DO_SWAPN(pb, pn - d1, d1);
+	d1 = pb - pa;
+	d2 = pd - pc;
+	if (d1 <= d2)
+	{
+		/* Recurse on left partition, then iterate on right partition */
+		if (d1 > ST_POINTER_STEP)
+			DO_SORT(a, d1 / ST_POINTER_STEP);
+		if (d2 > ST_POINTER_STEP)
+		{
+			/* Iterate rather than recurse to save stack space */
+			/* DO_SORT(pn - d2, d2 / ST_POINTER_STEP) */
+			a = pn - d2;
+			n = d2 / ST_POINTER_STEP;
+			goto loop;
+		}
+	}
+	else
+	{
+		/* Recurse on right partition, then iterate on left partition */
+		if (d2 > ST_POINTER_STEP)
+			DO_SORT(pn - d2, d2 / ST_POINTER_STEP);
+		if (d1 > ST_POINTER_STEP)
+		{
+			/* Iterate rather than recurse to save stack space */
+			/* DO_SORT(a, d1 / ST_POINTER_STEP) */
+			n = d1 / ST_POINTER_STEP;
+			goto loop;
+		}
+	}
+}
+#endif
+
+#undef DO_COMPARE
+#undef DO_MED3
+#undef DO_SORT
+#undef DO_SWAP
+#undef DO_SWAPN
+#undef ST_COMPARATOR_TYPE_NAME
+#undef ST_COMPARE
+#undef ST_COMPARE_ARG_TYPE
+#undef ST_COMPARE_RUNTIME_POINTER
+#undef ST_ELEMENT_TYPE
+#undef ST_ELEMENT_TYPE_VOID
+#undef ST_MAKE_NAME
+#undef ST_MAKE_NAME_
+#undef ST_MAKE_PREFIX
+#undef ST_MED3
+#undef ST_POINTER_STEP
+#undef ST_POINTER_TYPE
+#undef ST_SORT
+#undef ST_SORT_INVOKE_ARG
+#undef ST_SORT_INVOKE_COMPARE
+#undef ST_SORT_INVOKE_SIZE
+#undef ST_SORT_PROTO_ARG
+#undef ST_SORT_PROTO_COMPARE
+#undef ST_SORT_PROTO_SIZE
+#undef ST_SWAP
+#undef ST_SWAPN
-- 
2.25.1

From 96e78aa55eaae66e8efec4fb528f851977e71c1d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 30 Jun 2020 16:29:27 -0700
Subject: [PATCH v4 2/2] Add delete deduplication to nbtree.

Repurpose deduplication infrastructure to delete items in indexes at the
point where we'd usually have to split the page, even when they don't
have their LP_DEAD bits set.  Testing has shown that this is almost
completely effective at preventing "version index bloat" from non-HOT
updates, provided there are no long running transactions.

This is primarily valuable with leaf pages that contain mostly-distinct
index tuples, particularly with unique indexes.  It is intended to
complement deduplication.  Heuristics are used to guess which index
tuples are likely to point to no longer needed old table row versions.

Note that INCLUDE indexes support the optimization.
---
 src/include/access/genam.h               |  15 +
 src/include/access/heapam.h              |   3 +-
 src/include/access/nbtree.h              |   5 +-
 src/include/access/tableam.h             |  44 +-
 src/include/executor/executor.h          |   3 +-
 src/backend/access/heap/heapam.c         |  12 +-
 src/backend/access/heap/heapam_handler.c |   5 +-
 src/backend/access/nbtree/README         |  70 ++-
 src/backend/access/nbtree/nbtdedup.c     | 566 +++++++++++++++++++++--
 src/backend/access/nbtree/nbtinsert.c    |  41 +-
 src/backend/access/nbtree/nbtsort.c      |  12 +-
 src/backend/access/nbtree/nbtxlog.c      |   4 +-
 src/backend/access/table/tableam.c       | 299 +++++++++++-
 src/backend/commands/copy.c              |   5 +-
 src/backend/executor/execIndexing.c      |  41 +-
 src/backend/executor/execReplication.c   |   4 +-
 src/backend/executor/nodeModifyTable.c   |  14 +-
 17 files changed, 1063 insertions(+), 80 deletions(-)

diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 68d90f5141..7002da0716 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -108,10 +108,25 @@ typedef struct ParallelIndexScanDescData *ParallelIndexScanDesc;
  * call is made with UNIQUE_CHECK_EXISTING.  The tuple is already in the
  * index in this case, so it should not be inserted again.  Rather, just
  * check for conflicting live tuples (possibly blocking).
+ *
+ * UNIQUE_CHECK_NO indicates the absence of any unique checking.
+ * UNIQUE_CHECK_NO_WITH_UNCHANGED is a variant of UNIQUE_CHECK_NO that
+ * indicates that the index tuple comes from an UPDATE that did not modify
+ * the row in respect of any columns that are indexed.  The implementation
+ * requires a successor version, but there is no logical change.  Some
+ * index AMs can use this as a hint that can trigger optimizations.
+ *
+ * XXX: Adding UNIQUE_CHECK_NO_WITH_UNCHANGED like this kind of makes
+ * sense, since it's pretty natural to leave it up to index AMs to figure
+ * it out with unique indexes.  But what about when we insert NULLs into a
+ * unique index?  Isn't that case UNIQUE_CHECK_YES, and yet also a thing
+ * that nbtree pretty much treats as UNIQUE_CHECK_NO once it sees that the
+ * index tuple has NULLs?
  */
 typedef enum IndexUniqueCheck
 {
 	UNIQUE_CHECK_NO,			/* Don't do any uniqueness checking */
+	UNIQUE_CHECK_NO_WITH_UNCHANGED, /* "No logical change" duplicate */
 	UNIQUE_CHECK_YES,			/* Enforce uniqueness at insertion time */
 	UNIQUE_CHECK_PARTIAL,		/* Test uniqueness, but no error */
 	UNIQUE_CHECK_EXISTING		/* Check if existing tuple is unique */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 92b19dba32..d950083a7d 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -148,7 +148,8 @@ extern void heap_abort_speculative(Relation relation, ItemPointer tid);
 extern TM_Result heap_update(Relation relation, ItemPointer otid,
 							 HeapTuple newtup,
 							 CommandId cid, Snapshot crosscheck, bool wait,
-							 struct TM_FailureData *tmfd, LockTupleMode *lockmode);
+							 struct TM_FailureData *tmfd, LockTupleMode *lockmode,
+							 Bitmapset **modified_attrs_hint);
 extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
 								 CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
 								 bool follow_update,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 65d9698b89..05956c8868 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1029,11 +1029,12 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
  */
 extern void _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
 							   IndexTuple newitem, Size newitemsz,
-							   bool checkingunique);
+							   bool checkingunique, bool dedupdelete,
+							   bool allequalimage);
 extern void _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
 									OffsetNumber baseoff);
 extern bool _bt_dedup_save_htid(BTDedupState state, IndexTuple itup);
-extern Size _bt_dedup_finish_pending(Page newpage, BTDedupState state);
+extern Size _bt_dedup_merge_finish_pending(Page newpage, BTDedupState state);
 extern IndexTuple _bt_form_posting(IndexTuple base, ItemPointer htids,
 								   int nhtids);
 extern void _bt_update_posting(BTVacuumPosting vacposting);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 387eb34a61..8fc41d950f 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -128,6 +128,18 @@ typedef struct TM_FailureData
 	bool		traversed;
 } TM_FailureData;
 
+/*
+ * State used by table_index_batch_check() to perform "bottom up" deletion of
+ * duplicate index tuples
+ */
+typedef struct TM_IndexDelete
+{
+	ItemPointerData tid;		/* table TID from index tuple */
+	OffsetNumber ioffnum;		/* Index am identifies entries with this */
+	bool		ispromising;	/* Contribute to order we visit table blocks? */
+	bool		isdead;			/* Was tuple found dead? */
+} TM_IndexDelete;
+
 /* "options" flag bits for table_tuple_insert */
 /* TABLE_INSERT_SKIP_WAL was 0x0001; RelationNeedsWAL() now governs */
 #define TABLE_INSERT_SKIP_FSM		0x0002
@@ -396,7 +408,8 @@ typedef struct TableAmRoutine
 								 bool wait,
 								 TM_FailureData *tmfd,
 								 LockTupleMode *lockmode,
-								 bool *update_indexes);
+								 bool *update_indexes,
+								 Bitmapset **modified_attrs_hint);
 
 	/* see table_tuple_lock() for reference about parameters */
 	TM_Result	(*tuple_lock) (Relation rel,
@@ -1041,16 +1054,32 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
 }
 
 /*
- * This is a convenience wrapper around table_index_fetch_tuple() which
- * returns whether there are table tuple items corresponding to an index
- * entry.  This likely is only useful to verify if there's a conflict in a
- * unique index.
+ * These are convenience wrappers around table_index_fetch_tuple() which
+ * indicate whether there are table tuple items corresponding to an index
+ * entry.  Can be used to verify if there's a conflict in a unique index.
+ *
+ * table_index_batch_check() is a variant that is specialized to garbage
+ * collection of dead tuples in index access methods.  Duplicates are
+ * commonly caused by MVCC version churn when an optimization like
+ * heapam's HOT cannot be applied.  It can make sense to opportunistically
+ * guess that many index tuples are dead versions, particularly in unique
+ * indexes.
+ *
+ * Note that table_index_batch_check() sorts the deltids array so that the
+ * order of access is optimized.  Callers need to be able to deal with
+ * that.
  */
 extern bool table_index_fetch_tuple_check(Relation rel,
 										  ItemPointer tid,
 										  Snapshot snapshot,
 										  bool *all_dead);
 
+extern int	table_index_batch_check(Relation rel,
+									TM_IndexDelete *deltids,
+									int ndeltids,
+									Snapshot snapshot,
+									int npromisingkillsneeded,
+									int *nblocksaccessed);
 
 /* ------------------------------------------------------------------------
  * Functions for non-modifying operations on individual tuples
@@ -1311,12 +1340,13 @@ static inline TM_Result
 table_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
 				   CommandId cid, Snapshot snapshot, Snapshot crosscheck,
 				   bool wait, TM_FailureData *tmfd, LockTupleMode *lockmode,
-				   bool *update_indexes)
+				   bool *update_indexes, Bitmapset **modified_attrs_hint)
 {
 	return rel->rd_tableam->tuple_update(rel, otid, slot,
 										 cid, snapshot, crosscheck,
 										 wait, tmfd,
-										 lockmode, update_indexes);
+										 lockmode, update_indexes,
+										 modified_attrs_hint);
 }
 
 /*
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index b7978cd22e..f056a7b124 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -579,7 +579,8 @@ extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
 extern List *ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 								   TupleTableSlot *slot, EState *estate,
 								   bool noDupErr,
-								   bool *specConflict, List *arbiterIndexes);
+								   bool *specConflict, List *arbiterIndexes,
+								   Bitmapset *modified_attrs_hint);
 extern bool ExecCheckIndexConstraints(ResultRelInfo *resultRelInfo,
 									  TupleTableSlot *slot,
 									  EState *estate, ItemPointer conflictTid,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1585861a02..fa7ca33289 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2892,7 +2892,8 @@ simple_heap_delete(Relation relation, ItemPointer tid)
 TM_Result
 heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
 			CommandId cid, Snapshot crosscheck, bool wait,
-			TM_FailureData *tmfd, LockTupleMode *lockmode)
+			TM_FailureData *tmfd, LockTupleMode *lockmode,
+			Bitmapset **modified_attrs_hint)
 {
 	TM_Result	result;
 	TransactionId xid = GetCurrentTransactionId();
@@ -3758,10 +3759,15 @@ l2:
 	if (old_key_tuple != NULL && old_key_copied)
 		heap_freetuple(old_key_tuple);
 
+	/* Save for no logical changes hint when non-HOT update performed */
+	if (!use_hot_update && modified_attrs_hint)
+		*modified_attrs_hint = modified_attrs;
+	else
+		bms_free(modified_attrs);
+
 	bms_free(hot_attrs);
 	bms_free(key_attrs);
 	bms_free(id_attrs);
-	bms_free(modified_attrs);
 	bms_free(interesting_attrs);
 
 	return TM_Ok;
@@ -3891,7 +3897,7 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
 	result = heap_update(relation, otid, tup,
 						 GetCurrentCommandId(true), InvalidSnapshot,
 						 true /* wait for commit */ ,
-						 &tmfd, &lockmode);
+						 &tmfd, &lockmode, NULL);
 	switch (result)
 	{
 		case TM_SelfModified:
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index dcaea7135f..f32ed0a5f2 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -314,7 +314,8 @@ static TM_Result
 heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 					CommandId cid, Snapshot snapshot, Snapshot crosscheck,
 					bool wait, TM_FailureData *tmfd,
-					LockTupleMode *lockmode, bool *update_indexes)
+					LockTupleMode *lockmode, bool *update_indexes,
+					Bitmapset **modified_attrs_hint)
 {
 	bool		shouldFree = true;
 	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
@@ -325,7 +326,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 	tuple->t_tableOid = slot->tts_tableOid;
 
 	result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
-						 tmfd, lockmode);
+						 tmfd, lockmode, modified_attrs_hint);
 	ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
 
 	/*
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 9692e4cdf6..8560c5f6c3 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -807,7 +807,75 @@ Deduplication in unique indexes helps to prevent these pathological page
 splits.  Storing duplicates in a space efficient manner is not the goal,
 since in the long run there won't be any duplicates anyway.  Rather, we're
 buying time for standard garbage collection mechanisms to run before a
-page split is needed.
+page split is needed.  Also, the deduplication pass performs a targeted
+form of opportunistic deletion for unique indexes with version churn
+duplicates, as well as in cases where an UPDATE statement did not
+logically modify any indexed column, but nevertheless requires a successor
+index tuple.  The latter case happens when tableam optimizations such as
+heapam's HOT cannot be applied.  (We don't want to distinguish between
+version churn from UPDATEs and version churn from related INSERTs and
+DELETEs within a unique index.)
+
+The deduplication module usually opportunistically deletes whatever
+duplicates happen to be present on the page before moving on to
+deduplication proper, since in general some duplicates are likely to
+already be dead to everybody.  This mechanism is quite similar to
+on-the-fly deletion of index tuples that will already have failed to
+prevent a page split by the time deduplication is considered.  The main
+difference is that the tuples that get deleted are not opportunistically
+marked LP_DEAD by transactions that had to read the tuples in any case.
+
+The implementation must weigh the need to avoid a page split against the
+extra work performed with an exclusive buffer lock held.  It's possible to
+make this trade-off sensibly despite the uncertainty about versioning and
+update chains within nbtree.  In a unique index it's clear that there can
+only be one most recent committed version for any given value, which makes
+it certain that we'll delete some of the old versions --- at least in the
+absence of a long running transaction that holds back the xmin
+horizon, and barring extreme churn concentrated in one part of the key
+space.
+
+Deduplication-deletion in non-unique indexes is trickier (the
+implementation is almost the same, but the justification is more
+complicated).  In general there is nothing that assures us that there
+cannot be many logical rows that all have the same value in respect of an
+indexed column, which will cause us to waste time trying to find "old dead
+versions" among the duplicates that are actually distinct logical rows.
+We assume that all indexes work more or less like a unique index.  This
+works better than you'd think.  We at least know that there is some chance
+of UPDATE version churn in affected pages; the tuple we're trying to
+insert on the page at the point that this happens certainly originated
+that way, so there is a good chance that the same is true of existing,
+committed tuples.  We're only willing to access a small number of
+heap/table pages to determine if our guess is correct, so if we're totally
+wrong then we'll have accessed no more than 2 or so heap/table pages.
+Finally, and perhaps most importantly, we'll learn from our mistake.  The
+natural consequence of failing to deduplicate-delete is to do a
+deduplicate-merge pass.  That will merge together the duplicate index
+tuples -- we surmise that these correspond to multiple extant logical
+rows.  If and when there is another deduplication-delete pass on the same
+page, we'll skip over the posting list tuple.
+
+A posting list tuple may not actually point to one distinct logical row
+per TID, of course.  Even when our inference is totally wrong it still
+seems like a good idea to skip posting lists like this.  In general, the
+deduplication-deletion algorithm aims to maximize the chances of deleting
+_some_ tuples, while paying only a low fixed cost to access visibility
+information from the table.  In general it's possible that deleting just
+one or two index tuples will buy us many hours or days before the question
+of splitting the same leaf page comes up again -- and VACUUM may well
+visit the page in that time anyway.  If there really is intense pressure
+against the page, with many deduplication-delete passes occurring only
+milliseconds apart, then a version-driven page split is practically
+guaranteed to occur before long.  This resolves the situation.
+
+Negative feedback (such as failing to dedup-delete any tuples) is not
+really undesirable.  At worst it is an unavoidable part of how the
+algorithm works.  We require that our various approaches to handling an
+overflowing page (due partially or entirely to version churn) compete to
+determine how best to handle the problem in a localized fashion.  We
+expect to converge on a stable and roughly optimal behavior at each part
+of the key space in each index affected by version churn.
 
 Unique index leaf pages only get a deduplication pass when an insertion
 (that might have to split the page) observed an existing duplicate on the
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index f6be865b17..2c6e5740bc 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -16,13 +16,25 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
+static bool _bt_dedup_delete_pass(Relation rel, Buffer buf,
+								  Relation heapRel, Size newitemsz,
+								  bool *merge);
+static void _bt_dedup_merge_pass(Relation rel, Buffer buf,
+								 Relation heapRel, IndexTuple newitem,
+								 Size newitemsz, bool checkingunique);
+static void _bt_dedup_delete_finish_pending(BTDedupState state,
+											TM_IndexDelete *deltids,
+											int *ndeltids);
 static bool _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 							 OffsetNumber minoff, IndexTuple newitem);
 static void _bt_singleval_fillfactor(Page page, BTDedupState state,
 									 Size newitemsz);
+static inline int _bt_indexdelete_cmp(TM_IndexDelete *indexdelete1,
+									  TM_IndexDelete *indexdelete2);
 #ifdef USE_ASSERT_CHECKING
 static bool _bt_posting_valid(IndexTuple posting);
 #endif
@@ -32,16 +44,12 @@ static bool _bt_posting_valid(IndexTuple posting);
  * if we cannot successfully free at least newitemsz (we also need space for
  * newitem's line pointer, which isn't included in caller's newitemsz).
  *
- * The general approach taken here is to perform as much deduplication as
- * possible to free as much space as possible.  Note, however, that "single
- * value" strategy is sometimes used for !checkingunique callers, in which
- * case deduplication will leave a few tuples untouched at the end of the
- * page.  The general idea is to prepare the page for an anticipated page
- * split that uses nbtsplitloc.c's "single value" strategy to determine a
- * split point.  (There is no reason to deduplicate items that will end up on
- * the right half of the page after the anticipated page split; better to
- * handle those if and when the anticipated right half page gets its own
- * deduplication pass, following further inserts of duplicates.)
+ * There are two types of deduplication pass: The merge deduplication pass,
+ * where we merge together duplicate index tuples into a new posting list, and
+ * the delete deduplication pass, where old garbage version index tuples are
+ * deleted based on visibility information that we fetch from the table.  We
+ * generally expect to perform only one type of deduplication pass per call
+ * here, but it's possible that we'll end up doing both.
  *
  * This function should be called during insertion, when the page doesn't have
  * enough space to fit an incoming newitem.  If the BTP_HAS_GARBAGE page flag
@@ -54,27 +62,23 @@ static bool _bt_posting_valid(IndexTuple posting);
  */
 void
 _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
-				   IndexTuple newitem, Size newitemsz, bool checkingunique)
+				   IndexTuple newitem, Size newitemsz, bool checkingunique,
+				   bool dedupdelete, bool allequalimage)
 {
 	OffsetNumber offnum,
 				minoff,
 				maxoff;
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
-	Page		newpage;
 	OffsetNumber deletable[MaxIndexTuplesPerPage];
-	BTDedupState state;
 	int			ndeletable = 0;
-	Size		pagesaving = 0;
-	bool		singlevalstrat = false;
-	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
 	/*
 	 * We can't assume that there are no LP_DEAD items.  For one thing, VACUUM
 	 * will clear the BTP_HAS_GARBAGE hint without reliably removing items
 	 * that are marked LP_DEAD.  We don't want to unnecessarily unset LP_DEAD
-	 * bits when deduplicating items.  Allowing it would be correct, though
-	 * wasteful.
+	 * bits when deduplicating items by merging.  Allowing it would be
+	 * correct, though wasteful.
 	 */
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	minoff = P_FIRSTDATAKEY(opaque);
@@ -99,18 +103,417 @@ _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
 		 */
 		if (PageGetFreeSpace(page) >= newitemsz)
 			return;
-
-		/*
-		 * Reconsider number of items on page, in case _bt_delitems_delete()
-		 * managed to delete an item or two
-		 */
-		minoff = P_FIRSTDATAKEY(opaque);
-		maxoff = PageGetMaxOffsetNumber(page);
 	}
 
+	minoff = maxoff = InvalidOffsetNumber;	/* Invalidate */
+
+	/*
+	 * We're willing to do dedup deletion with a unique index that is not
+	 * generally safe for deduplication (though only when deduplicate_items
+	 * storage param is not explicitly set to 'off', which our caller checks
+	 * for us).
+	 *
+	 * The logic used by the !checkingunique _bt_dedup_delete_pass() case
+	 * relies on regular deduplication passes occurring, and merging together
+	 * index entries that point to distinct logical table rows that happen to
+	 * have the same key value (this might not happen immediately, but it
+	 * should happen before too long).  We're not willing to deduplicate when
+	 * the index isn't a unique index and isn't an index that is generally
+	 * safe for deduplication.  Exit early if we see that.
+	 */
+	if (!allequalimage && !checkingunique)
+		return;
+
 	/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
 	newitemsz += sizeof(ItemIdData);
 
+	if (checkingunique || dedupdelete)
+	{
+		bool		merge = true;
+
+		/* Perform delete deduplication pass */
+		if (_bt_dedup_delete_pass(rel, buf, heapRel, newitemsz, &merge))
+			return;
+
+		/*
+		 * _bt_dedup_delete_pass() may occasionally indicate no duplicates, in
+		 * which case we should give up now
+		 */
+		if (!merge)
+			return;
+
+		/* Fall back on merge deduplication.  This happens infrequently. */
+	}
+
+	/*
+	 * Perform merge deduplication pass, though only when index is
+	 * allequalimage -- otherwise it's not safe
+	 */
+	if (allequalimage)
+		_bt_dedup_merge_pass(rel, buf, heapRel, newitem, newitemsz,
+							 checkingunique);
+}
+
+#define ST_SORT qsort_deltids
+#define ST_ELEMENT_TYPE TM_IndexDelete
+#define ST_COMPARE(a, b) (_bt_indexdelete_cmp(a, b))
+#define ST_SCOPE static
+#define ST_DEFINE
+#include "lib/sort_template.h"
+
+/*
+ * Perform a delete deduplication pass.
+ *
+ * See if duplicate index tuples are eligible to be deleted, even though they
+ * don't have their LP_DEAD bit set already.  Give up if we have to access
+ * more than a few heap pages before we can free enough space to fit newitem.
+ *
+ * Note: Caller should have already deleted all existing items with their
+ * LP_DEAD bits set.
+ */
+static bool
+_bt_dedup_delete_pass(Relation rel, Buffer buf, Relation heapRel,
+					  Size newitemsz, bool *merge)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buf);
+	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	BTDedupState state;
+	TM_IndexDelete *deltids;
+	int			ndeltids,
+				npromisingdeltids,
+				ntableamkills,
+				ndeletable;
+	SnapshotData SnapshotNonVacuumable;
+	OffsetNumber deletable[MaxIndexTuplesPerPage];
+	int			nblocksaccessed;
+	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
+	state = (BTDedupState) palloc(sizeof(BTDedupStateData));
+	state->deduplicate = true;
+	state->nmaxitems = 0;
+	/* Final "posting list" size should not restrict anything */
+	state->maxpostingsize = BLCKSZ;
+	state->base = NULL;
+	state->baseoff = InvalidOffsetNumber;
+	state->basetupsize = 0;
+	state->htids = palloc(state->maxpostingsize);
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->phystupsize = 0;
+	state->nintervals = 0;
+
+	ndeltids = 0;
+	npromisingdeltids = 0;
+	deltids = palloc(MaxTIDsPerBTreePage * sizeof(TM_IndexDelete));
+
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (offnum == minoff)
+		{
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_dedup_save_htid(state, itup))
+		{
+
+		}
+		else
+		{
+			/* Handle interval, saving TIDs when "non-promising" */
+			_bt_dedup_delete_finish_pending(state, deltids, &ndeltids);
+
+			/* itup starts new pending interval */
+			_bt_dedup_start_pending(state, itup, offnum);
+		}
+	}
+	/* Handle the final interval, saving TIDs when "non-promising" */
+	_bt_dedup_delete_finish_pending(state, deltids, &ndeltids);
+
+	if (state->nintervals == 0)
+	{
+		/* No duplicates/promising tuples, don't bother trying */
+		pfree(state->htids);
+		pfree(state);
+		pfree(deltids);
+		/* Caller should avoid deduplication-by-merging pass */
+		*merge = false;
+		return false;
+	}
+
+	/*
+	 * deltids array now contains non-duplicate tuples, all of which are
+	 * marked non-promising.
+	 *
+	 * Add known duplicates to array now by extracting them from the dedup
+	 * intervals we just formed.  Most of the tuples are marked promising so
+	 * that the tableam infrastructure can focus its efforts there.  See
+	 * comment block below for a full explanation of promising tuples.
+	 */
+	for (int i = 0; i < state->nintervals; i++)
+	{
+		BTDedupInterval interval = state->intervals[i];
+
+		Assert(interval.nitems > 0);
+
+		for (int j = 0; j < interval.nitems; j++)
+		{
+			OffsetNumber dupoffnum = interval.baseoff + j;
+			ItemId		itemid = PageGetItemId(page, dupoffnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+			if (!BTreeTupleIsPosting(itup))
+			{
+				/*
+				 * Plain non-pivot tuple duplicate -- TID is promising
+				 */
+				deltids[ndeltids].tid = itup->t_tid;
+				deltids[ndeltids].ioffnum = dupoffnum;
+				deltids[ndeltids].ispromising = true;
+				deltids[ndeltids].isdead = false;	/* for now */
+				ndeltids++;
+				npromisingdeltids++;
+			}
+			else
+			{
+				/*
+				 * Posting list tuple duplicate -- TIDs are not promising, but
+				 * tableam might manage to delete them in passing
+				 */
+				Assert(_bt_posting_valid(itup));
+
+				for (int p = 0; p < BTreeTupleGetNPosting(itup); p++)
+				{
+					ItemPointer htid = BTreeTupleGetPostingN(itup, p);
+
+					deltids[ndeltids].tid = *htid;
+					deltids[ndeltids].ioffnum = dupoffnum;
+					deltids[ndeltids].ispromising = false;
+					deltids[ndeltids].isdead = false;	/* for now */
+					ndeltids++;
+				}
+			}
+		}
+	}
+
+	/* Done with dedup state */
+	pfree(state->htids);
+	pfree(state);
+
+	/*
+	 * Determine which TIDs are dead in deltids array (if we have any
+	 * duplicates at all, since we only really expect to find dead tuples
+	 * among duplicates).
+	 *
+	 * We aim to delete 10 "promising" tuples in total (i.e. to delete 10
+	 * non-pivot tuple duplicates that we consider promising).  This is
+	 * helpful when the same page receives very frequent delete deduplication
+	 * passes, where it makes sense to have fewer slightly larger WAL records.
+	 *
+	 * You might wonder why we don't tell tableam more about our preferences
+	 * (for example, we don't register a callback that figures out when
+	 * tableam has found enough dead TIDs to allow us to free just enough leaf
+	 * page space to avoid a page split -- even though that interface is quite
+	 * feasible).  Our precise preferences are unimportant because the high
+	 * level design of delete deduplication assumes asymmetry: the cost of
+	 * failing to delete even one tuple once per page is drastically lower
+	 * than the potential upside of never having to split a page affected by
+	 * version churn.
+	 *
+	 * Each failure to delete even one tuple here is, in effect, a learning
+	 * experience.  It results in caller falling back on splitting the page
+	 * (or on a merge deduplication pass), discouraging future calls back here
+	 * for the same page.  We converge on the most effective strategy for each
+	 * page in the index over time, through trial and error.  We accept well
+	 * understood small negative outcomes as the price of a potentially huge
+	 * (though uncertain) upside.  We are not interested in barely managing to
+	 * avoid a page split once -- we're playing a long game.
+	 *
+	 * We don't mark posting list tuples as promising specifically so that
+	 * cases where we have a failed delete deduplication pass followed by a
+	 * successful merge deduplication pass do not go on to waste time on
+	 * posting list tuples during any future delete deduplication passes.  We
+	 * effectively learned our lesson about the TIDs in the posting list
+	 * tuples -- all of the TIDs probably point to distinct logical rows that
+	 * we have no chance of deleting.  (Though even then we have some chance
+	 * of getting lucky if they really were version churn tuples, since there
+	 * might have been even more version churn affecting the same tableam
+	 * block, which we'll try to target directly -- the non-promising posting
+	 * list TIDs might well get picked up.)
+	 */
+	Assert(ndeltids > 0);
+	ntableamkills = 0;
+	if (npromisingdeltids > 0)
+	{
+		InitNonVacuumableSnapshot(SnapshotNonVacuumable,
+								  GlobalVisTestFor(heapRel));
+		ntableamkills = table_index_batch_check(heapRel, deltids, ndeltids,
+												&SnapshotNonVacuumable,
+												Min(10, npromisingdeltids),
+												&nblocksaccessed);
+	}
+
+	if (ntableamkills == 0)
+	{
+		pfree(deltids);
+		return false;
+	}
+
+	/*
+	 * We have at least one TID to kill (though maybe not if it comes from a
+	 * posting list due to the restriction on those below).  Deal with that.
+	 *
+	 * Must first sort deltids in order expected by loop.  See
+	 * _bt_indexdelete_cmp().
+	 */
+	qsort_deltids(deltids, ndeltids);
+	ndeletable = 0;
+	for (int i = 0; i < ndeltids; i++)
+	{
+		OffsetNumber deadoffnum = deltids[i].ioffnum;
+		ItemId		itemid = PageGetItemId(page, deadoffnum);
+		IndexTuple	itup;
+		bool		killtuple = false;
+
+		/* Skip if not dead to tableam, or when item already marked */
+		if (!deltids[i].isdead || ItemIdIsDead(itemid))
+			continue;
+
+		/*
+		 * See if we can actually delete the item.  For a plain non-pivot this
+		 * is trivial (we can).  But we'll only delete a posting list tuple
+		 * when all of its TIDs are dead.
+		 *
+		 * XXX: It actually wouldn't be that hard to do VACUUM-style granular
+		 * posting list TID deletes here (using _bt_update_posting()).  For
+		 * now we don't bother with that because it's not clear if it's
+		 * actually worth anything.  (We really want to kill duplicate tuple
+		 * groups before they can become posting list tuples in the first
+		 * place.)
+		 */
+		itup = (IndexTuple) PageGetItem(page, itemid);
+		if (!BTreeTupleIsPosting(itup))
+			killtuple = true;
+		else
+		{
+			ItemPointer dtid = &deltids[i].tid;
+			int			pi = i + 1;
+			int			nposting = BTreeTupleGetNPosting(itup);
+			int			j;
+
+			Assert(_bt_posting_valid(itup));
+
+			for (j = 0; j < nposting; j++)
+			{
+				ItemPointer item = BTreeTupleGetPostingN(itup, j);
+
+				if (!ItemPointerEquals(item, dtid))
+					break;		/* out of posting list loop */
+
+				/*
+				 * Read-ahead to next duplicate TID here.
+				 *
+				 * We rely on the assumption that not advancing dtid here will
+				 * prevent us from considering the posting list tuple fully
+				 * dead by not matching its next table TID in next outer loop
+				 * iteration.
+				 *
+				 * If, on the other hand, this is the final table TID in the
+				 * posting list tuple, then tuple gets killed regardless (i.e.
+				 * we handle the case where the last item is also the last
+				 * table TID in the last index tuple correctly -- posting
+				 * tuple still gets killed).
+				 */
+				if (pi < ndeltids && deltids[pi].isdead)
+					dtid = &deltids[pi++].tid;
+			}
+
+			if (j == nposting)
+				killtuple = true;
+		}
+
+		if (killtuple)
+		{
+			/*
+			 * Remember item as one we'll delete.
+			 *
+			 * No MarkBufferDirtyHint() call needed -- we'll physically delete
+			 * item in a moment anyway.  N.B.: We must not return before
+			 * reaching _bt_delitems_delete() when we reach here even once.
+			 *
+			 * (Actually, we don't really need to mark the ItemId dead either,
+			 * but we do so anyway because it's expected in opportunistic
+			 * deletion code called below.  XXX: We also rely on it being set
+			 * to avoid processing a posting list tuple twice, but that should
+			 * probably be fixed.)
+			 */
+			ItemIdMarkDead(itemid);
+			deletable[ndeletable++] = deadoffnum;
+		}
+
+		/* Try to break out early */
+		if (ntableamkills == ndeletable)
+			break;
+	}
+
+	Assert(ntableamkills >= ndeletable);
+
+	/* Done with deltids state */
+	pfree(deltids);
+
+	if (unlikely(ndeletable == 0))
+		return false;
+
+	/* Go through with deleting items we found */
+	_bt_delitems_delete(rel, buf, deletable, ndeletable, heapRel);
+
+	/* Return success when page split (or merge deduplication pass) avoided */
+	return PageGetExactFreeSpace(page) >= newitemsz;
+}
+
+/*
+ * Perform a merge deduplication pass.
+ *
+ * The general approach taken here is to perform as much deduplication as
+ * possible to free as much space as possible.  Note, however, that "single
+ * value" strategy is sometimes used for !checkingunique callers, in which
+ * case deduplication will leave a few tuples untouched at the end of the
+ * page.  The general idea is to prepare the page for an anticipated page
+ * split that uses nbtsplitloc.c's "single value" strategy to determine a
+ * split point.  (There is no reason to deduplicate items that will end up on
+ * the right half of the page after the anticipated page split; better to
+ * handle those if and when the anticipated right half page gets its own
+ * deduplication pass, following further inserts of duplicates.)
+ *
+ * Note: Caller should have already deleted all existing items with their
+ * LP_DEAD bits set.
+ */
+static void
+_bt_dedup_merge_pass(Relation rel, Buffer buf, Relation heapRel,
+					 IndexTuple newitem, Size newitemsz, bool checkingunique)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buf);
+	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	Page		newpage;
+	BTDedupState state;
+	Size		pagesaving = 0;
+	bool		singlevalstrat = false;
+	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+
 	/*
 	 * By here, it's clear that deduplication will definitely be attempted.
 	 * Initialize deduplication state.
@@ -138,6 +541,9 @@ _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
 	/* nintervals should be initialized to zero */
 	state->nintervals = 0;
 
+	minoff = P_FIRSTDATAKEY(opaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
 	/* Determine if "single value" strategy should be used */
 	if (!checkingunique)
 		singlevalstrat = _bt_do_singleval(rel, page, state, minoff, newitem);
@@ -203,7 +609,7 @@ _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
 			 * form new posting tuple, and actually update the page.  Else
 			 * reset the state and move on without modifying the page.
 			 */
-			pagesaving += _bt_dedup_finish_pending(newpage, state);
+			pagesaving += _bt_dedup_merge_finish_pending(newpage, state);
 
 			if (singlevalstrat)
 			{
@@ -235,7 +641,7 @@ _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
 	}
 
 	/* Handle the last item */
-	pagesaving += _bt_dedup_finish_pending(newpage, state);
+	pagesaving += _bt_dedup_merge_finish_pending(newpage, state);
 
 	/*
 	 * If no items suitable for deduplication were found, newpage must be
@@ -317,8 +723,8 @@ _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
  * Every tuple processed by deduplication either becomes the base tuple for a
  * posting list, or gets its heap TID(s) accepted into a pending posting list.
  * A tuple that starts out as the base tuple for a posting list will only
- * actually be rewritten within _bt_dedup_finish_pending() when it turns out
- * that there are duplicates that can be merged into the base tuple.
+ * actually be rewritten within _bt_dedup_merge_finish_pending() when it turns
+ * out that there are duplicates that can be merged into the base tuple.
  */
 void
 _bt_dedup_start_pending(BTDedupState state, IndexTuple base,
@@ -443,7 +849,7 @@ _bt_dedup_save_htid(BTDedupState state, IndexTuple itup)
  * where no deduplication was possible.
  */
 Size
-_bt_dedup_finish_pending(Page newpage, BTDedupState state)
+_bt_dedup_merge_finish_pending(Page newpage, BTDedupState state)
 {
 	OffsetNumber tupoff;
 	Size		tuplesz;
@@ -496,6 +902,68 @@ _bt_dedup_finish_pending(Page newpage, BTDedupState state)
 	return spacesaving;
 }
 
+/*
+ * Stripped down version of _bt_dedup_merge_finish_pending() used by
+ * _bt_dedup_delete_pass().
+ *
+ * Finalize interval of duplicates (duplicate group) without materializing the
+ * would-be posting list tuple.  We store all TIDs on the leaf page in
+ * caller's deltids array, but only TIDs that we determine are duplicates get
+ * marked as promising.  (Non-promising TIDs only get considered in passing,
+ * when they happen to be on the same table AM page as promising TIDs.)
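+ *
+ * Note that TIDs from intervals with more than one item don't get added to
+ * deltids here.  They are only recorded in state->intervals; caller adds
+ * them to the deltids array (marked as promising) later on.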
+ */
+static void
+_bt_dedup_delete_finish_pending(BTDedupState state, TM_IndexDelete *deltids,
+								int *ndeltids)
+{
+	Assert(state->nitems > 0);
+	Assert(state->nitems <= state->nhtids);
+	Assert(state->intervals[state->nintervals].baseoff == state->baseoff);
+
+	if (state->nitems == 1)
+	{
+		/* Remember non-duplicate's TID, but mark it not promising */
+		OffsetNumber offnum = state->baseoff;
+		IndexTuple	itup = state->base;
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			deltids[*ndeltids].tid = itup->t_tid;
+			deltids[*ndeltids].ioffnum = offnum;
+			deltids[*ndeltids].ispromising = false;
+			deltids[*ndeltids].isdead = false;	/* for now */
+			(*ndeltids)++;
+		}
+		else
+		{
+			Assert(_bt_posting_valid(itup));
+
+			for (int p = 0; p < BTreeTupleGetNPosting(itup); p++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, p);
+
+				deltids[*ndeltids].tid = *htid;
+				deltids[*ndeltids].ioffnum = offnum;
+				deltids[*ndeltids].ispromising = false;
+				deltids[*ndeltids].isdead = false;	/* for now */
+				(*ndeltids)++;
+			}
+		}
+	}
+	else
+	{
+		/* Dups in interval -- store in deltids later */
+		state->intervals[state->nintervals].nitems = state->nitems;
+		/* Increment nintervals, since we finalized another interval of duplicates */
+		state->nintervals++;
+	}
+
+	/* Reset state for next pending posting list */
+	state->nhtids = 0;
+	state->nitems = 0;
+	state->phystupsize = 0;
+}
+
 /*
  * Determine if page non-pivot tuples (data items) are all duplicates of the
  * same value -- if they are, deduplication's "single value" strategy should
@@ -809,6 +1277,44 @@ _bt_swap_posting(IndexTuple newitem, IndexTuple oposting, int postingoff)
 	return nposting;
 }
 
+/*
+ * qsort-style comparator used by _bt_dedup_delete_pass()
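+ *
+ * Sorts deltids entries back into the leaf page's physical order: by page
+ * offset number first, then by heap TID for entries that originally came
+ * from the same posting list tuple.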
+ */
+static inline int
+_bt_indexdelete_cmp(TM_IndexDelete *indexdelete1, TM_IndexDelete *indexdelete2)
+{
+	OffsetNumber offset1 = indexdelete1->ioffnum;
+	OffsetNumber offset2 = indexdelete2->ioffnum;
+	ItemPointer tid1 = &indexdelete1->tid;
+	ItemPointer tid2 = &indexdelete2->tid;
+
+	if (offset1 > offset2)
+		return 1;
+	if (offset1 < offset2)
+		return -1;
+
+	/* Must be posting list tuple -- restore TID order */
+	{
+		BlockNumber blk1 = ItemPointerGetBlockNumber(tid1);
+		BlockNumber blk2 = ItemPointerGetBlockNumber(tid2);
+
+		if (blk1 != blk2)
+			return (blk1 < blk2) ? -1 : 1;
+	}
+	{
+		OffsetNumber pos1 = ItemPointerGetOffsetNumber(tid1);
+		OffsetNumber pos2 = ItemPointerGetOffsetNumber(tid2);
+
+		if (pos1 != pos2)
+			return (pos1 < pos2) ? -1 : 1;
+	}
+
+	/* ItemPointer values should never be equal */
+	Assert(false);
+
+	return 0;
+}
+
 /*
  * Verify posting list invariants for "posting", which must be a posting list
  * tuple.  Used within assertions.
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index d36f7557c8..0dda82e823 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -37,6 +37,7 @@ static TransactionId _bt_check_unique(Relation rel, BTInsertState insertstate,
 static OffsetNumber _bt_findinsertloc(Relation rel,
 									  BTInsertState insertstate,
 									  bool checkingunique,
+									  bool logicallymodified,
 									  BTStack stack,
 									  Relation heapRel);
 static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
@@ -86,7 +87,9 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	BTInsertStateData insertstate;
 	BTScanInsert itup_key;
 	BTStack		stack;
-	bool		checkingunique = (checkUnique != UNIQUE_CHECK_NO);
+	bool		checkingunique = (checkUnique != UNIQUE_CHECK_NO &&
+								  checkUnique != UNIQUE_CHECK_NO_WITH_UNCHANGED);
+	bool		logicallymodified = (checkUnique != UNIQUE_CHECK_NO_WITH_UNCHANGED);
 
 	/* we need an insertion scan key to do our search, so build one */
 	itup_key = _bt_mkscankey(rel, itup);
@@ -235,7 +238,7 @@ search:
 		 * checkingunique.
 		 */
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
-									   stack, heapRel);
+									   logicallymodified, stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
 					   itup, insertstate.itemsz, newitemoff,
 					   insertstate.postingoff, false);
@@ -767,6 +770,11 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
  *		the right, rather than the first page.  In that case, this function
  *		moves right to the correct target page.
  *
+ *		If 'logicallymodified' is false, this insert is for an UPDATE that
+ *		didn't logically change the indexed value, but which nevertheless
+ *		requires a new index entry to point to a successor version.  This
+ *		hint from the executor influences the behavior of deduplication.
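+ *		(The executor signals this by passing UNIQUE_CHECK_NO_WITH_UNCHANGED
+ *		as the checkUnique argument, which _bt_doinsert() translates into
+ *		'logicallymodified' being false.)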
+ *
  *		(In a !heapkeyspace index, there can be multiple pages with the same
  *		high key, where the new tuple could legitimately be placed on.  In
  *		that case, the caller passes the first page containing duplicates,
@@ -790,6 +798,7 @@ static OffsetNumber
 _bt_findinsertloc(Relation rel,
 				  BTInsertState insertstate,
 				  bool checkingunique,
+				  bool logicallymodified,
 				  BTStack stack,
 				  Relation heapRel)
 {
@@ -873,14 +882,20 @@ _bt_findinsertloc(Relation rel,
 		/*
 		 * If the target page is full, see if we can obtain enough space by
 		 * erasing LP_DEAD items.  If that fails to free enough space, see if
-		 * we can avoid a page split by performing a deduplication pass over
-		 * the page.
+		 * we can avoid a page split by performing deduplication.  Usually
+		 * this means a deduplication merge pass, though a deduplication
+		 * delete pass is preferred when it looks like version churn is the
+		 * source of most of the duplicates.
 		 *
-		 * We only perform a deduplication pass for a checkingunique caller
-		 * when the incoming item is a duplicate of an existing item on the
-		 * leaf page.  This heuristic avoids wasting cycles -- we only expect
-		 * to benefit from deduplicating a unique index page when most or all
-		 * recently added items are duplicates.  See nbtree/README.
+		 * We only consider deduplication for a checkingunique caller when the
+		 * incoming item is a known duplicate of an existing item on the leaf
+		 * page.  This heuristic avoids wasting cycles.  The overarching goal
+		 * within a unique index is to prevent an unnecessary page split
+		 * altogether by delaying splits again and again (the goal is not to
+		 * save space).  If even one incoming tuple that gets added to this
+		 * page originates with an INSERT statement then a page split is all
+		 * but inevitable anyway --- that's why it's okay that our heuristic
+		 * only considers the current incoming newitem.  See nbtree/README.
 		 */
 		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
@@ -893,13 +908,15 @@ _bt_findinsertloc(Relation rel,
 				uniquedup = true;
 			}
 
-			if (itup_key->allequalimage && BTGetDeduplicateItems(rel) &&
-				(!checkingunique || uniquedup) &&
+			if (BTGetDeduplicateItems(rel) && (!checkingunique || uniquedup) &&
 				PageGetFreeSpace(page) < insertstate->itemsz)
 			{
+				bool	dedupdelete = !logicallymodified;
+
 				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
 								   insertstate->itup, insertstate->itemsz,
-								   checkingunique);
+								   checkingunique, dedupdelete,
+								   itup_key->allequalimage);
 				insertstate->bounds_valid = false;
 			}
 		}
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index efee86784b..ecfe79badb 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -273,7 +273,7 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 						   bool newfirstdataitem);
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup, Size truncextra);
-static void _bt_sort_dedup_finish_pending(BTWriteState *wstate,
+static void _bt_dedup_sort_finish_pending(BTWriteState *wstate,
 										  BTPageState *state,
 										  BTDedupState dstate);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
@@ -1068,11 +1068,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup,
  * Finalize pending posting list tuple, and add it to the index.  Final tuple
  * is based on saved base tuple, and saved list of heap TIDs.
  *
- * This is almost like _bt_dedup_finish_pending(), but it adds a new tuple
- * using _bt_buildadd().
+ * This is almost like _bt_dedup_merge_finish_pending(), but it adds a new
+ * tuple using _bt_buildadd().
  */
 static void
-_bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state,
+_bt_dedup_sort_finish_pending(BTWriteState *wstate, BTPageState *state,
 							  BTDedupState dstate)
 {
 	Assert(dstate->nitems > 0);
@@ -1371,7 +1371,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 				 * _bt_dedup_save_htid() opted to not merge current item into
 				 * pending posting list.
 				 */
-				_bt_sort_dedup_finish_pending(wstate, state, dstate);
+				_bt_dedup_sort_finish_pending(wstate, state, dstate);
 				pfree(dstate->base);
 
 				/* start new pending posting list with itup copy */
@@ -1390,7 +1390,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			 * Handle the last item (there must be a last item when the
 			 * tuplesort returned one or more tuples)
 			 */
-			_bt_sort_dedup_finish_pending(wstate, state, dstate);
+			_bt_dedup_sort_finish_pending(wstate, state, dstate);
 			pfree(dstate->base);
 			pfree(dstate->htids);
 		}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index bda9be2348..9186bdeea5 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -530,12 +530,12 @@ btree_xlog_dedup(XLogReaderState *record)
 			}
 			else
 			{
-				_bt_dedup_finish_pending(newpage, state);
+				_bt_dedup_merge_finish_pending(newpage, state);
 				_bt_dedup_start_pending(state, itup, offnum);
 			}
 		}
 
-		_bt_dedup_finish_pending(newpage, state);
+		_bt_dedup_merge_finish_pending(newpage, state);
 		Assert(state->nintervals == xlrec->nintervals);
 		Assert(memcmp(state->intervals, intervals,
 					  state->nintervals * sizeof(BTDedupInterval)) == 0);
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 6438c45716..5ef3530df4 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -30,6 +30,20 @@
 #include "storage/shmem.h"
 #include "storage/smgr.h"
 
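+/*
+ * Per-table-block counts used by table_index_batch_check_block_sort().
+ * There is one entry for each distinct table block that has at least one
+ * TID in caller's deltids array.
+ */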
+typedef struct TM_IndexDeleteCounts
+{
+	BlockNumber table_block;
+	int16		npromisingtids_in_block;
+	int16		ntids_in_block;
+} TM_IndexDeleteCounts;
+
+static int	table_index_batch_check_block_sort(TM_IndexDelete *deltids,
+											   int ndeltids);
+static inline int indexdelete_tids_cmp(TM_IndexDelete *indexdelete1,
+									   TM_IndexDelete *indexdelete2);
+static inline int indexdeletecount_ntids_cmp(TM_IndexDeleteCounts *count1,
+											 TM_IndexDeleteCounts *count2);
+
 /*
  * Constants to control the behavior of block allocation to parallel workers
  * during a parallel seqscan.  Technically these values do not need to be
@@ -207,9 +221,9 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc parallel_scan)
 /*
  * To perform that check simply start an index scan, create the necessary
  * slot, do the heap lookup, and shut everything down again. This could be
- * optimized, but is unlikely to matter from a performance POV. If there
- * frequently are live index pointers also matching a unique index key, the
- * CPU overhead of this routine is unlikely to matter.
+ * optimized, but is unlikely to matter from a performance POV. Note that
+ * table_index_batch_check() is optimized in this way, since it is designed
+ * as a batch operation.
  *
  * Note that *tid may be modified when we return true if the AM supports
  * storing multiple row versions reachable via a single index entry (like
@@ -236,6 +250,116 @@ table_index_fetch_tuple_check(Relation rel,
 	return found;
 }
 
+/*
+ * Specialized variant of table_index_fetch_tuple_check() that can be used
+ * by index AMs to perform "bottom up" deletion of duplicate index tuples.
+ * This is particularly likely to work well with unique indexes.
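+ *
+ * Caller (e.g. nbtree's _bt_dedup_delete_pass()) is expected to fill the
+ * deltids array with one entry per leaf page TID under consideration, to
+ * mark the entries that it takes to be duplicates as "promising", and to
+ * delete whichever index tuples have isdead set when we return.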
+ *
+ * Note: This routine sorts the deltids array, but does not modify any
+ * individual entry except to mark it as dead for caller.
+ *
+ * Returns total number of deltids that can be killed in index by caller.
+ *
+ * TODO: This should be combined with the equivalent of a call to
+ * table_compute_xid_horizon_for_tuples().
+ */
+int
+table_index_batch_check(Relation rel, TM_IndexDelete *deltids, int ndeltids,
+						Snapshot snapshot, int npromisingkillsneeded,
+						int *nblocksaccessed)
+{
+	IndexFetchTableData *scan;
+	TupleTableSlot *slot;
+	int			nkills = 0;
+	int			npromkills = 0;
+	BlockNumber last = InvalidBlockNumber;
+	bool		final_block = false;
+	int			nblocks_promising;
+
+	slot = table_slot_create(rel, NULL);
+	scan = table_index_fetch_begin(rel);
+
+	*nblocksaccessed = 0;
+	nblocks_promising = table_index_batch_check_block_sort(deltids, ndeltids);
+	for (int i = 0; i < ndeltids; i++)
+	{
+		ItemPointer tid = &deltids[i].tid;
+		bool		ispromising = deltids[i].ispromising;
+		bool		new_block = last != ItemPointerGetBlockNumber(tid);
+		ItemPointerData tmp;
+		bool		call_again = false;
+		bool		all_dead = false;
+		bool		found;
+
+		Assert(!deltids[i].isdead);
+
+		/*
+		 * If we reached caller's requirements while finishing off a table
+		 * block that we had already started on, quit as soon as we're about
+		 * to step onto a new block.  (We already decided not to access the
+		 * new block; here we're just carrying out that earlier decision.)
+		 */
+		if (final_block && new_block)
+			break;
+
+		if (new_block)
+		{
+			/*
+			 * Each time we're about to access a new block we end up here to
+			 * consider if it's really worth accessing.  Apply the following
+			 * tests, and quit without accessing the block if any of them applies:
+			 *
+			 * 1. Give up when we'd otherwise access the second table block in
+			 * line and have no kills (promising or otherwise) to show for
+			 * accessing the first/"most promising" block.
+			 *
+			 * 2. Give up once we've finished the final block that has at
+			 * least one promising tuple (or when there were no promising
+			 * tuples in the first place).  We never access a new block just
+			 * to get to non-promising tuples because the chances of success
+			 * are precisely zero with many workloads.
+			 *
+			 * 3. Give up before accessing a fourth block, no matter what.
+			 */
+			if (*nblocksaccessed == 1 && nkills == 0)
+				break;
+			if (nblocks_promising-- == 0)
+				break;
+			if (*nblocksaccessed == 3)
+				break;
+		}
+
+		tmp = *tid;
+		found = table_index_fetch_tuple(scan, &tmp, snapshot, slot,
+										&call_again, &all_dead);
+
+		if (new_block)
+			(*nblocksaccessed)++;
+		last = ItemPointerGetBlockNumber(tid);
+		if (!found && all_dead)
+		{
+			deltids[i].isdead = true;
+			nkills++;
+			if (ispromising)
+				npromkills++;
+		}
+
+		if (npromkills >= npromisingkillsneeded)
+		{
+			/*
+			 * Caller is satisfied, so we can quit now.  But before we do,
+			 * might as well finish off remaining TIDs on same table block (if
+			 * any).
+			 */
+			final_block = true;
+		}
+	}
+
+	table_index_fetch_end(scan);
+	ExecDropSingleTupleTableSlot(slot);
+
+	return nkills;
+}
 
 /* ------------------------------------------------------------------------
  * Functions for non-modifying operations on individual tuples
@@ -356,7 +480,7 @@ simple_table_tuple_update(Relation rel, ItemPointer otid,
 								GetCurrentCommandId(true),
 								snapshot, InvalidSnapshot,
 								true /* wait for commit */ ,
-								&tmfd, &lockmode, update_indexes);
+								&tmfd, &lockmode, update_indexes, NULL);
 
 	switch (result)
 	{
@@ -763,3 +887,170 @@ table_block_relation_estimate_size(Relation rel, int32 *attr_widths,
 	else
 		*allvisfrac = (double) relallvisible / curpages;
 }
+
+#define ST_SORT qsort_deltids_promising
+#define ST_ELEMENT_TYPE TM_IndexDelete
+#define ST_COMPARE(a, b) (indexdelete_tids_cmp(a, b))
+#define ST_SCOPE static
+#define ST_DEFINE
+#include "lib/sort_template.h"
+
+#define ST_SORT qsort_blockgroup
+#define ST_ELEMENT_TYPE TM_IndexDeleteCounts
+#define ST_COMPARE(a, b) (indexdeletecount_ntids_cmp(a, b))
+#define ST_SCOPE static
+#define ST_DEFINE
+#include "lib/sort_template.h"
+
+/*
+ * table_index_batch_check() requires that the deltids array be in a certain
+ * order before it gets started.  This helper routine handles that.
+ *
+ * TIDs are grouped together by block number, with ascending TID order within
+ * each group (i.e. in ascending TID offset number order).  The block number
+ * groups themselves are ordered in descending order of each block's count of
+ * promising candidate TIDs.  This order tends to maximize the number of TIDs
+ * that caller can kill in the index relative to the number of tableam blocks
+ * accessed.
+ *
+ * The goal of the sort order is to process as many dup table TIDs as
+ * possible with as few table buffer accesses as possible.  In practice it's
+ * frequently possible to kill relatively many TIDs with only one or two
+ * table page accesses due to the effect of locality.
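+ *
+ * For example (hypothetical numbers): given TIDs that point to table blocks
+ * 10, 27, and 31, with 4, 0, and 2 promising TIDs respectively, the block
+ * groups end up in the order 10, 31, 27, with TIDs in ascending offset
+ * order within each group.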
+ */
+static int
+table_index_batch_check_block_sort(TM_IndexDelete *deltids, int ndeltids)
+{
+	TM_IndexDeleteCounts *blockcounts;
+	TM_IndexDelete *reordereddeltids;
+	BlockNumber curblock = InvalidBlockNumber;
+	int			nblock_groups = 0;
+	int			nblocks_promising = 0;
+	int			ncopied = 0;
+
+	Assert(ndeltids > 0);
+
+	/* First sort caller's array by TID */
+	qsort_deltids_promising(deltids, ndeltids);
+
+	/* Calculate per-table-block count of TIDs */
+	blockcounts = palloc(sizeof(TM_IndexDeleteCounts) * ndeltids);
+	for (int i = 0; i < ndeltids; i++)
+	{
+		ItemPointer deltid = &deltids[i].tid;
+		bool		ispromising = deltids[i].ispromising;
+
+		if (curblock != ItemPointerGetBlockNumber(deltid))
+		{
+			/* New block group */
+			nblock_groups++;
+
+			curblock = ItemPointerGetBlockNumber(deltid);
+			blockcounts[nblock_groups - 1].table_block = curblock;
+			blockcounts[nblock_groups - 1].ntids_in_block = 1;
+			blockcounts[nblock_groups - 1].npromisingtids_in_block = 0;
+		}
+		else
+		{
+			blockcounts[nblock_groups - 1].ntids_in_block++;
+		}
+
+		if (ispromising)
+			blockcounts[nblock_groups - 1].npromisingtids_in_block++;
+	}
+
+	/*
+	 * Sort blockcounts by "promising" TID count in desc order, tiebreaking
+	 * first on each block's total TID count and then on block number
+	 */
+	qsort_blockgroup(blockcounts, nblock_groups);
+	reordereddeltids = palloc(ndeltids * sizeof(TM_IndexDelete));
+	for (int i = 0; i < nblock_groups; i++)
+	{
+		TM_IndexDeleteCounts *blockgroup = blockcounts + i;
+
+		for (int j = 0; j < ndeltids; j++)
+		{
+			ItemPointer tid = &deltids[j].tid;
+
+			if (blockgroup->table_block == ItemPointerGetBlockNumber(tid))
+			{
+				memcpy(reordereddeltids + ncopied, deltids + j,
+					   sizeof(TM_IndexDelete) * blockgroup->ntids_in_block);
+				ncopied += blockgroup->ntids_in_block;
+				if (blockgroup->npromisingtids_in_block > 0)
+					nblocks_promising++;
+				break;			/* Move on to next table block group */
+			}
+		}
+	}
+
+	/* Copy back final sorted array into caller's array */
+	memcpy(deltids, reordereddeltids, sizeof(TM_IndexDelete) * ndeltids);
+
+	/* be tidy */
+	pfree(reordereddeltids);
+	pfree(blockcounts);
+
+	return nblocks_promising;
+}
+
+/*
+ * qsort-style comparator used in table_index_batch_check_block_sort()
+ */
+static inline int
+indexdelete_tids_cmp(TM_IndexDelete *indexdelete1,
+					 TM_IndexDelete *indexdelete2)
+{
+	ItemPointer tid1 = &indexdelete1->tid;
+	ItemPointer tid2 = &indexdelete2->tid;
+
+	{
+		BlockNumber blk1 = ItemPointerGetBlockNumber(tid1);
+		BlockNumber blk2 = ItemPointerGetBlockNumber(tid2);
+
+		if (blk1 != blk2)
+			return (blk1 < blk2) ? -1 : 1;
+	}
+	{
+		OffsetNumber pos1 = ItemPointerGetOffsetNumber(tid1);
+		OffsetNumber pos2 = ItemPointerGetOffsetNumber(tid2);
+
+		if (pos1 != pos2)
+			return (pos1 < pos2) ? -1 : 1;
+	}
+
+	/* ItemPointer values should never be equal */
+	Assert(false);
+
+	return 0;
+}
+
+/*
+ * qsort-style comparator used in table_index_batch_check_block_sort()
+ */
+static inline int
+indexdeletecount_ntids_cmp(TM_IndexDeleteCounts *count1,
+						   TM_IndexDeleteCounts *count2)
+{
+	/* Invert usual order here (desc npromisingtids_in_block sort order) */
+	if (count1->npromisingtids_in_block > count2->npromisingtids_in_block)
+		return -1;
+	if (count1->npromisingtids_in_block < count2->npromisingtids_in_block)
+		return 1;
+
+	/* Tiebreak: desc ntids_in_block sort order */
+	if (count1->ntids_in_block > count2->ntids_in_block)
+		return -1;
+	if (count1->ntids_in_block < count2->ntids_in_block)
+		return 1;
+
+	/* Tiebreak: block number (asc order) */
+	if (count1->table_block > count2->table_block)
+		return 1;
+	if (count1->table_block < count2->table_block)
+		return -1;
+
+	Assert(false);
+
+	return 0;
+}
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 531bd7c73a..b9b625b883 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2523,7 +2523,7 @@ CopyMultiInsertBufferFlush(CopyMultiInsertInfo *miinfo,
 			recheckIndexes =
 				ExecInsertIndexTuples(resultRelInfo,
 									  buffer->slots[i], estate, false, NULL,
-									  NIL);
+									  NIL, NULL);
 			ExecARInsertTriggers(estate, resultRelInfo,
 								 slots[i], recheckIndexes,
 								 cstate->transition_capture);
@@ -3285,7 +3285,8 @@ CopyFrom(CopyState cstate)
 																   estate,
 																   false,
 																   NULL,
-																   NIL);
+																   NIL,
+																   NULL);
 					}
 
 					/* AFTER ROW INSERT Triggers */
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index c6b5bcba7b..d171d26b69 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -275,7 +275,8 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 					  EState *estate,
 					  bool noDupErr,
 					  bool *specConflict,
-					  List *arbiterIndexes)
+					  List *arbiterIndexes,
+					  Bitmapset *modified_attrs_hint)
 {
 	ItemPointer tupleid = &slot->tts_tid;
 	List	   *result = NIL;
@@ -389,6 +390,44 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		else
 			checkUnique = UNIQUE_CHECK_PARTIAL;
 
+		/*
+		 * We may have to hint to the index AM that this is a logically
+		 * unchanged index tuple.  This happens when we're inserting a
+		 * duplicate index tuple just to represent the successor version.
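+		 * That is, none of the index's key columns appear in the
+		 * modified_attrs_hint set passed down from the UPDATE (expression
+		 * columns and whole-row vars are a special case; see below).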
+		 */
+		if (checkUnique == UNIQUE_CHECK_NO && modified_attrs_hint)
+		{
+			bool		logicallyModified = false;
+
+			for (int attr = 0; attr < indexInfo->ii_NumIndexAttrs; attr++)
+			{
+				int			keycol = indexInfo->ii_IndexAttrNumbers[attr];
+
+				if (keycol > 0)
+				{
+					logicallyModified =
+						bms_is_member(keycol - FirstLowInvalidHeapAttributeNumber,
+									  modified_attrs_hint);
+					if (logicallyModified)
+						break;
+				}
+				else
+				{
+					/*
+					 * XXX: For now we always assume that expression indexes
+					 * and indexes with whole-row vars were not modified by an
+					 * UPDATE (i.e. they just use the dedup delete
+					 * optimization regardless of the details of the UPDATE).
+					 * Review this decision when the high level design is a
+					 * bit better worked out.
+					 */
+				}
+			}
+
+			if (!logicallyModified)
+				checkUnique = UNIQUE_CHECK_NO_WITH_UNCHANGED;
+		}
+
 		satisfiesConstraint =
 			index_insert(indexRelation, /* index relation */
 						 values,	/* array of index Datums */
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 01d26881e7..e97d05b448 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -445,7 +445,7 @@ ExecSimpleRelationInsert(ResultRelInfo *resultRelInfo,
 		if (resultRelInfo->ri_NumIndices > 0)
 			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 												   slot, estate, false, NULL,
-												   NIL);
+												   NIL, NULL);
 
 		/* AFTER ROW INSERT Triggers */
 		ExecARInsertTriggers(estate, resultRelInfo, slot,
@@ -513,7 +513,7 @@ ExecSimpleRelationUpdate(ResultRelInfo *resultRelInfo,
 		if (resultRelInfo->ri_NumIndices > 0 && update_indexes)
 			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 												   slot, estate, false, NULL,
-												   NIL);
+												   NIL, NULL);
 
 		/* AFTER ROW UPDATE Triggers */
 		ExecARUpdateTriggers(estate, resultRelInfo,
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 0c055ed408..718539d7b8 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -605,7 +605,7 @@ ExecInsert(ModifyTableState *mtstate,
 			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 												   slot, estate, true,
 												   &specConflict,
-												   arbiterIndexes);
+												   arbiterIndexes, NULL);
 
 			/* adjust the tuple's state accordingly */
 			table_tuple_complete_speculative(resultRelationDesc, slot,
@@ -644,7 +644,7 @@ ExecInsert(ModifyTableState *mtstate,
 			if (resultRelInfo->ri_NumIndices > 0)
 				recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 													   slot, estate, false,
-													   NULL, NIL);
+													   NULL, NIL, NULL);
 		}
 	}
 
@@ -1238,6 +1238,7 @@ ExecUpdate(ModifyTableState *mtstate,
 	TM_Result	result;
 	TM_FailureData tmfd;
 	List	   *recheckIndexes = NIL;
+	Bitmapset  *modified_attrs_hint = NULL;
 
 	/*
 	 * abort the operation if not running transactions
@@ -1401,7 +1402,8 @@ lreplace:;
 									estate->es_snapshot,
 									estate->es_crosscheck_snapshot,
 									true /* wait for commit */ ,
-									&tmfd, &lockmode, &update_indexes);
+									&tmfd, &lockmode, &update_indexes,
+									&modified_attrs_hint);
 
 		switch (result)
 		{
@@ -1532,9 +1534,13 @@ lreplace:;
 
 		/* insert index entries for tuple if necessary */
 		if (resultRelInfo->ri_NumIndices > 0 && update_indexes)
+		{
 			recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
 												   slot, estate, false,
-												   NULL, NIL);
+												   NULL, NIL,
+												   modified_attrs_hint);
+			bms_free(modified_attrs_hint);
+		}
 	}
 
 	if (canSetTag)
-- 
2.25.1
