06.08.2019 4:28, Peter Geoghegan wrote:
> Attached is v5, which is based on your v4. The three main differences
> between this and v4 are:
> * Removed BT_COMPRESS_THRESHOLD stuff, for the reasons explained in my
> July 24 e-mail. We can always add something like this back during
> performance validation of the patch. Right now, having no
> BT_COMPRESS_THRESHOLD limit definitely improves space utilization for
> certain important cases, which seems more important than the
> uncertain/speculative downside.
Fair enough.
I think we can measure performance and make a decision once the patch
stabilizes.
> * We now have experimental support for unique indexes. This is broken
> out into its own patch.
> * We now handle LP_DEAD items in a special way within
> _bt_insertonpg_in_posting().
> As you pointed out already, we do need to think about LP_DEAD items
> directly, rather than assuming that they cannot be on the page that
> _bt_insertonpg_in_posting() must process. More on that later.
>> If sizeof(t_info) + sizeof(key) < sizeof(t_tid), the resulting posting
>> tuple can be larger. This may happen if keysize <= 4 bytes.
>> In that situation the original tuples must each have been aligned to 16
>> bytes, and the resulting tuple is at most 24 bytes (6+2+4+6+6). So this
>> case is also safe.
> I still need to think about the exact details of alignment within
> _bt_insertonpg_in_posting(). I'm worried about boundary cases there. I
> could be wrong.
Could you explain more about these cases? I don't understand the
problem yet.
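To make the arithmetic from my earlier mail concrete, here is the worst
case as I understand it (just a sketch, assuming MAXALIGN is 8, as on
common 64-bit platforms):

    /* one original tuple: 8-byte header (t_tid + t_info) + 4-byte key */
    Size origsz    = MAXALIGN(sizeof(IndexTupleData) + 4);      /* 16 */

    /* posting tuple replacing two originals: header + key + 2 TIDs */
    Size postingsz = sizeof(IndexTupleData) + 4 +
                     2 * sizeof(ItemPointerData);               /* 24 */

    Assert(postingsz < 2 * origsz);

Even in this worst case, two 16-byte tuples are replaced by one 24-byte
posting tuple, so no space is lost. Is there a boundary condition that
this reasoning misses?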
>> The main reason why I decided to avoid applying compression to unique
>> indexes is the performance of microvacuum. It is not applied to items
>> inside a posting tuple, and I expect it to be important for unique
>> indexes, which ideally contain only a few live values.
> I found that the performance of my experimental patch with a unique
> index was significantly worse. It looks like this is a bad idea, as
> you predicted, though we may still want to do
> deduplication/compression with NULL values in unique indexes. I did
> learn a few things from implementing unique index support, though.
> BTW, there is a subtle bug in how my unique index patch does
> WAL-logging -- see my comments within
> index_compute_xid_horizon_for_tuples(). The bug shouldn't matter if
> replication isn't used. I don't think that we're going to use this
> experimental patch at all, so I didn't bother fixing the bug.
Thank you for the patch.
Still, I'd suggest leaving it as a possible future improvement, so that
it doesn't distract us from the original feature.
>> if (ItemIdIsDead(itemId))
>>     continue;
>>
>> In the previous review Rafia asked about "some reason".
>> Trying to figure out whether this situation is possible, I changed this
>> line to Assert(!ItemIdIsDead(itemId)) in our test version, and it failed
>> in a performance test. Unfortunately, I was not able to reproduce it.
> I found it easy enough to see LP_DEAD items within
> _bt_insertonpg_in_posting() when running pgbench with the extra unique
> index patch. To give you a simple example of how this can happen,
> consider the comments about BTP_HAS_GARBAGE within
> _bt_delitems_vacuum(). That probably isn't the only way it can happen,
> either. ISTM that we need to be prepared for LP_DEAD items during
> deduplication, rather than trying to prevent deduplication from ever
> having to see an LP_DEAD item.
I added to v6 another related fix for _bt_compress_one_page().
The previous code implicitly deleted DEAD items without calling
index_compute_xid_horizon_for_tuples(). The new code checks whether any
DEAD items exist on the page and removes them if so.
Another possible solution is to copy dead items as-is from the old page
to the new one, but I think it's good to remove dead tuples as soon as
possible.
> v5 makes _bt_insertonpg_in_posting() prepared to overwrite an
> existing item if it's an LP_DEAD item that falls in the same TID range
> (that's _bt_compare()-wise "equal" to an existing tuple, which may or
> may not be a posting list tuple already). I haven't made this code do
> something like call index_compute_xid_horizon_for_tuples(), even
> though that's needed for correctness (i.e. this new code is currently
> broken in the same way that I mentioned unique index support is
> broken).
Is it possible that the DEAD tuple to delete is smaller than itup?
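As for the missing index_compute_xid_horizon_for_tuples() call, I think
the fix would look roughly like this (an untested sketch; note that
heapRel is not currently passed down to _bt_insertonpg_in_posting(), so
its signature would have to change):

    /*
     * Compute the conflict horizon for the single LP_DEAD item we are
     * about to overwrite, before entering the critical section, so
     * that standbys can resolve recovery conflicts correctly.
     */
    OffsetNumber  deadoff = newitemoff;
    TransactionId latestRemovedXid;

    latestRemovedXid =
        index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
                                             &deadoff, 1);
    /* ... then include latestRemovedXid in the WAL record */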
> I also added a nearby FIXME comment to
> _bt_insertonpg_in_posting() -- I don't think that the code for
> splitting a posting list in two is currently crash-safe.
Good catch. It seems that I need to rearrange the code.
I'll send an updated patch this week.
> How do you feel about officially calling this deduplication, not
> compression? I think that it's a more accurate name for the technique.
I agree.
Should I rename all related functions and variables in the patch?
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
commit 9ac37503c71f7623413a2e406d81f5c9a4b02742
Author: Anastasia <a.lubennik...@postgrespro.ru>
Date: Tue Aug 13 17:00:41 2019 +0300
v6-0001-Compression-deduplication-in-nbtree.patch
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..504bca2 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
size_t tupsize;
BTScanInsert skey;
bool lowersizelimit;
+ ItemPointer scantid;
CHECK_FOR_INTERRUPTS();
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
/*
* Readonly callers may optionally verify that non-pivot tuples can
- * each be found by an independent search that starts from the root
+ * each be found by an independent search that starts from the root.
+ * Note that we deliberately don't do individual searches for each
+ * "logical" posting list tuple, since the posting list itself is
+ * validated by other checks.
*/
if (state->rootdescend && P_ISLEAF(topaque) &&
!bt_rootdescend(state, itup))
{
char *itid,
*htid;
+ ItemPointer tid = BTreeTupleGetHeapTID(itup);
itid = psprintf("(%u,%u)", state->targetblock, offset);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumber(&(itup->t_tid)),
- ItemPointerGetOffsetNumber(&(itup->t_tid)));
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("could not find tuple using search from root page in index \"%s\"",
RelationGetRelationName(state->rel)),
- errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
itid, htid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ /*
+ * If tuple is actually a posting list, make sure posting list TIDs
+ * are in order.
+ */
+ if (BTreeTupleIsPosting(itup))
+ {
+ ItemPointerData last;
+ ItemPointer current;
+
+ ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+ for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+ {
+
+ current = BTreeTupleGetPostingN(itup, i);
+
+ if (ItemPointerCompare(current, &last) <= 0)
+ {
+ char *itid,
+ *htid;
+
+ itid = psprintf("(%u,%u)", state->targetblock, offset);
+ htid = psprintf("(%u,%u)",
+ ItemPointerGetBlockNumberNoCheck(current),
+ ItemPointerGetOffsetNumberNoCheck(current));
+
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("posting list heap TIDs out of order in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+ itid, htid,
+ (uint32) (state->targetlsn >> 32),
+ (uint32) state->targetlsn)));
+ }
+
+ ItemPointerCopy(current, &last);
+ }
+ }
+
/* Build insertion scankey for current page offset */
skey = bt_mkscankey_pivotsearch(state->rel, itup);
@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
{
IndexTuple norm;
- norm = bt_normalize_tuple(state, itup);
- bloom_add_element(state->filter, (unsigned char *) norm,
- IndexTupleSize(norm));
- /* Be tidy */
- if (norm != itup)
- pfree(norm);
+ if (BTreeTupleIsPosting(itup))
+ {
+ IndexTuple onetup;
+
+ /* Fingerprint all elements of posting tuple one by one */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+ norm = bt_normalize_tuple(state, onetup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != onetup)
+ pfree(norm);
+ pfree(onetup);
+ }
+ }
+ else
+ {
+ norm = bt_normalize_tuple(state, itup);
+ bloom_add_element(state->filter, (unsigned char *) norm,
+ IndexTupleSize(norm));
+ /* Be tidy */
+ if (norm != itup)
+ pfree(norm);
+ }
}
/*
@@ -1087,7 +1153,8 @@ bt_target_page_check(BtreeCheckState *state)
*
* If there is a high key (if this is not the rightmost page on its
* entire level), check that high key actually is upper bound on all
- * page items.
+ * page items. If this is a posting list tuple, we'll need to set
+ * scantid to be highest TID in posting list.
*
* We prefer to check all items against high key rather than checking
* just the last and trusting that the operator class obeys the
@@ -1127,6 +1194,9 @@ bt_target_page_check(BtreeCheckState *state)
* tuple. (See also: "Notes About Data Representation" in the nbtree
* README.)
*/
+ scantid = skey->scantid;
+ if (!BTreeTupleIsPivot(itup))
+ skey->scantid = BTreeTupleGetMaxTID(itup);
if (!P_RIGHTMOST(topaque) &&
!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1220,7 @@ bt_target_page_check(BtreeCheckState *state)
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
}
+ skey->scantid = scantid;
/*
* * Item order check *
@@ -1164,11 +1235,13 @@ bt_target_page_check(BtreeCheckState *state)
*htid,
*nitid,
*nhtid;
+ ItemPointer tid;
itid = psprintf("(%u,%u)", state->targetblock, offset);
+ tid = BTreeTupleGetHeapTID(itup);
htid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
nitid = psprintf("(%u,%u)", state->targetblock,
OffsetNumberNext(offset));
@@ -1177,9 +1250,11 @@ bt_target_page_check(BtreeCheckState *state)
state->target,
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+ tid = BTreeTupleGetHeapTID(itup);
nhtid = psprintf("(%u,%u)",
- ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
- ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+ ItemPointerGetBlockNumberNoCheck(tid),
+ ItemPointerGetOffsetNumberNoCheck(tid));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1264,10 @@ bt_target_page_check(BtreeCheckState *state)
"higher index tid=%s (points to %s tid=%s) "
"page lsn=%X/%X.",
itid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
htid,
nitid,
- P_ISLEAF(topaque) ? "heap" : "index",
+ P_ISLEAF(topaque) ? "min heap" : "index",
nhtid,
(uint32) (state->targetlsn >> 32),
(uint32) state->targetlsn)));
@@ -1953,10 +2028,11 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
* verification. In particular, it won't try to normalize opclass-equal
* datums with potentially distinct representations (e.g., btree/numeric_ops
* index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though. For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have their own posting
+ * list, since dummy CREATE INDEX callback code generates new tuples with the
+ * same normalized representation. Compression is performed
+ * opportunistically, and in general there is no guarantee about how or when
+ * compression will be applied.
*/
static IndexTuple
bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2560,14 +2636,16 @@ static inline ItemPointer
BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
bool nonpivot)
{
- ItemPointer result = BTreeTupleGetHeapTID(itup);
+ ItemPointer result;
BlockNumber targetblock = state->targetblock;
- if (result == NULL && nonpivot)
+ if (BTreeTupleIsPivot(itup) == nonpivot)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
targetblock, RelationGetRelationName(state->rel))));
+ result = BTreeTupleGetHeapTID(itup);
+
return result;
}
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 5890f39..e96f5ec 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -41,6 +41,17 @@ static OffsetNumber _bt_findinsertloc(Relation rel,
BTStack stack,
Relation heapRel);
static void _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack);
+static void _bt_delete_and_insert(Relation rel,
+ Buffer buf,
+ IndexTuple newitup,
+ OffsetNumber newitemoff);
+static void _bt_insertonpg_in_posting(Relation rel, BTScanInsert itup_key,
+ Buffer buf,
+ Buffer cbuf,
+ BTStack stack,
+ IndexTuple itup,
+ OffsetNumber newitemoff,
+ bool split_only_page, int in_posting_offset);
static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
Buffer buf,
Buffer cbuf,
@@ -56,6 +67,8 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void insert_itupprev_to_page(Page page, BTCompressState *compressState);
+static void _bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel);
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -297,10 +310,17 @@ top:
* search bounds established within _bt_check_unique when insertion is
* checkingunique.
*/
+ insertstate.in_posting_offset = 0;
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
stack, heapRel);
- _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
- itup, newitemoff, false);
+
+ if (insertstate.in_posting_offset)
+ _bt_insertonpg_in_posting(rel, itup_key, insertstate.buf,
+ InvalidBuffer, stack, itup, newitemoff,
+ false, insertstate.in_posting_offset);
+ else
+ _bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer,
+ stack, itup, newitemoff, false);
}
else
{
@@ -435,6 +455,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* okay, we gotta fetch the heap tuple ... */
curitup = (IndexTuple) PageGetItem(page, curitemid);
+ Assert(!BTreeTupleIsPosting(curitup));
htid = curitup->t_tid;
/*
@@ -759,6 +780,26 @@ _bt_findinsertloc(Relation rel,
_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
insertstate->bounds_valid = false;
}
+
+ /*
+ * If the target page is full, try to compress the page
+ */
+ if (PageGetFreeSpace(page) < insertstate->itemsz && !checkingunique)
+ {
+ _bt_compress_one_page(rel, insertstate->buf, heapRel);
+ insertstate->bounds_valid = false; /* paranoia */
+
+ /*
+ * FIXME: _bt_vacuum_one_page() won't have cleared the
+ * BTP_HAS_GARBAGE flag when it didn't kill items. Maybe we
+ * should clear the BTP_HAS_GARBAGE flag bit from the page when
+ * compression avoids a page split -- _bt_vacuum_one_page() is
+ * expecting a page split that takes care of it.
+ *
+ * (On the other hand, maybe it doesn't matter very much. A
+ * comment update seems like the bare minimum we should do.)
+ */
+ }
}
else
{
@@ -900,6 +941,208 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
insertstate->bounds_valid = false;
}
+/*
+ * Delete the tuple at offset newitemoff and insert newitup at the same
+ * offset. All free space checks must have been done before calling this
+ * function.
+ *
+ * For use when updating a posting tuple.
+ */
+static void
+_bt_delete_and_insert(Relation rel,
+ Buffer buf,
+ IndexTuple newitup,
+ OffsetNumber newitemoff)
+{
+ Page page = BufferGetPage(buf);
+ Size newitupsz = IndexTupleSize(newitup);
+
+ newitupsz = MAXALIGN(newitupsz);
+
+ START_CRIT_SECTION();
+
+ PageIndexTupleDelete(page, newitemoff);
+
+ if (!_bt_pgaddtup(page, newitupsz, newitup, newitemoff))
+ elog(ERROR, "failed to insert compressed item in index \"%s\"",
+ RelationGetRelationName(rel));
+
+ MarkBufferDirty(buf);
+
+ /* Xlog stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ xl_btree_insert xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.offnum = newitemoff;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
+
+ Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+
+ /*
+ * Force full page write to keep code simple
+ *
+ * TODO: think of using XLOG_BTREE_INSERT_LEAF with a new tuple's data
+ */
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD | REGBUF_FORCE_IMAGE);
+ recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_INSERT_LEAF);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+}
+
+/*
+ * _bt_insertonpg_in_posting() --
+ * Insert a tuple on a particular page in the index
+ * (compression aware version).
+ *
+ * If the new tuple's key is equal to the key of a posting tuple that
+ * already exists on the page and its TID falls inside the min/max range of
+ * the existing posting list, update the posting tuple.
+ *
+ * This can only happen on a leaf page.
+ *
+ * newitemoff - offset of the posting tuple we must update
+ * in_posting_offset - position of the new tuple's TID in posting list
+ *
+ * If necessary, split the page.
+ */
+static void
+_bt_insertonpg_in_posting(Relation rel,
+ BTScanInsert itup_key,
+ Buffer buf,
+ Buffer cbuf,
+ BTStack stack,
+ IndexTuple itup,
+ OffsetNumber newitemoff,
+ bool split_only_page,
+ int in_posting_offset)
+{
+ IndexTuple origtup;
+ IndexTuple lefttup;
+ IndexTuple righttup;
+ ItemPointerData *ipd;
+ IndexTuple newitup;
+ ItemId itemid;
+ Page page;
+ int nipd,
+ nipd_right;
+
+ page = BufferGetPage(buf);
+ /* get old posting tuple */
+ itemid = PageGetItemId(page, newitemoff);
+ origtup = (IndexTuple) PageGetItem(page, itemid);
+ Assert(BTreeTupleIsPosting(origtup));
+ nipd = BTreeTupleGetNPosting(origtup);
+ Assert(in_posting_offset < nipd);
+ Assert(itup_key->scantid != NULL);
+ Assert(itup_key->heapkeyspace);
+
+ elog(DEBUG4, "(%u,%u) is min, (%u,%u) is max, (%u,%u) is new",
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(origtup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(origtup)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(origtup)),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetMaxTID(itup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetMaxTID(itup)));
+
+ /*
+ * Fist check if existing item is dead.
+ *
+ * Then check if the new itempointer fits into the tuple's posting list.
+ *
+ * Also check if new itempointer fits into the page.
+ *
+ * If not, posting tuple's split is required in both cases.
+ *
+ * XXX: Think some more about alignment - pg
+ */
+ if (ItemIdIsDead(itemid))
+ {
+ /* FIXME: We need to call index_compute_xid_horizon_for_tuples() */
+ elog(DEBUG4, "replacing LP_DEAD posting list item, new off %d",
+ newitemoff);
+ _bt_delete_and_insert(rel, buf, itup, newitemoff);
+ _bt_relbuf(rel, buf);
+ }
+ else if (BTMaxItemSize(page) < MAXALIGN(IndexTupleSize(origtup)) + MAXALIGN(sizeof(ItemPointerData)) ||
+ PageGetFreeSpace(page) < MAXALIGN(IndexTupleSize(origtup)) + MAXALIGN(sizeof(ItemPointerData)))
+ {
+ /*
+ * Split posting tuple into two halves.
+ *
+ * The left tuple contains all item pointers less than the new one,
+ * and the right tuple contains the new item pointer and all those to
+ * its right.
+ *
+ * TODO: Probably we can come up with a more clever algorithm.
+ */
+ lefttup = BTreeFormPostingTuple(origtup, BTreeTupleGetPosting(origtup),
+ in_posting_offset);
+
+ nipd_right = nipd - in_posting_offset + 1;
+ ipd = palloc0(sizeof(ItemPointerData) * nipd_right);
+ /* insert the new item pointer (an index tuple starts with its t_tid) */
+ memcpy(ipd, itup, sizeof(ItemPointerData));
+ /* copy item pointers from original tuple that belong on right */
+ memcpy(ipd + 1,
+ BTreeTupleGetPostingN(origtup, in_posting_offset),
+ sizeof(ItemPointerData) * (nipd - in_posting_offset));
+
+ righttup = BTreeFormPostingTuple(origtup, ipd, nipd_right);
+ elog(DEBUG4, "inserting inside posting list with split due to no space orig elements %d new off %d",
+ nipd, in_posting_offset);
+
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lefttup),
+ BTreeTupleGetHeapTID(righttup)) < 0);
+
+ /*
+ * Replace the old tuple with the left tuple on the page, and insert
+ * the right tuple using the ordinary _bt_insertonpg() function. If a
+ * split is required, _bt_insertonpg() will handle it.
+ *
+ * FIXME: This doesn't seem very crash safe -- what if we fail after
+ * _bt_delete_and_insert() but before _bt_insertonpg()? We could
+ * crash and then lose some of the logical tuples that used to be
+ * contained within original posting list, but will now go into new
+ * righttup posting list.
+ */
+ _bt_delete_and_insert(rel, buf, lefttup, newitemoff);
+ _bt_insertonpg(rel, itup_key, buf, InvalidBuffer,
+ stack, righttup, newitemoff + 1, false);
+
+ pfree(ipd);
+ pfree(lefttup);
+ pfree(righttup);
+ }
+ else
+ {
+ ipd = palloc0(sizeof(ItemPointerData) * (nipd + 1));
+ elog(DEBUG4, "inserting inside posting list due to apparent overlap");
+
+ /* copy item pointers from original tuple into ipd */
+ memcpy(ipd, BTreeTupleGetPosting(origtup),
+ sizeof(ItemPointerData) * in_posting_offset);
+ /* add item pointer of the new tuple into ipd */
+ memcpy(ipd + in_posting_offset, itup, sizeof(ItemPointerData));
+ /* copy item pointers from old tuple into ipd */
+ memcpy(ipd + in_posting_offset + 1,
+ BTreeTupleGetPostingN(origtup, in_posting_offset),
+ sizeof(ItemPointerData) * (nipd - in_posting_offset));
+
+ newitup = BTreeFormPostingTuple(itup, ipd, nipd + 1);
+
+ _bt_delete_and_insert(rel, buf, newitup, newitemoff);
+
+ pfree(ipd);
+ pfree(newitup);
+ _bt_relbuf(rel, buf);
+ }
+}
+
/*----------
* _bt_insertonpg() -- Insert a tuple on a particular page in the index.
*
@@ -2290,3 +2533,206 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
* the page.
*/
}
+
+/*
+ * Add the item accumulated in compressState (as a posting tuple when
+ * multiple TIDs were gathered) to the page being compressed. Raises an
+ * error if the insertion fails; the caller is responsible for ensuring
+ * that the item fits.
+ */
+static void
+insert_itupprev_to_page(Page page, BTCompressState *compressState)
+{
+ IndexTuple to_insert;
+ OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+ if (compressState->ntuples == 0)
+ to_insert = compressState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+ compressState->ipd,
+ compressState->ntuples);
+ to_insert = postingtuple;
+ pfree(compressState->ipd);
+ }
+
+ /* Add the new item into the page */
+ offnum = OffsetNumberNext(offnum);
+
+ elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+ compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+
+ if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+ offnum, false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add tuple to page while compresing it");
+
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ compressState->ntuples = 0;
+}
+
+/*
+ * Before splitting the page, try to compress items to free some space.
+ * If compression doesn't succeed, the buffer will contain the old state
+ * of the page. This function should be called after LP_DEAD items have
+ * been removed by _bt_vacuum_one_page().
+ */
+static void
+_bt_compress_one_page(Relation rel, Buffer buffer, Relation heapRel)
+{
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ Page newpage;
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ bool use_compression = false;
+ BTCompressState *compressState = NULL;
+ int natts = IndexRelationGetNumberOfAttributes(rel);
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+
+ /*
+ * Don't use compression for indexes with INCLUDEd columns or for
+ * unique indexes.
+ */
+ use_compression = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+ IndexRelationGetNumberOfAttributes(rel) &&
+ !rel->rd_index->indisunique);
+ if (!use_compression)
+ return;
+
+ /* init compress state needed to build posting tuples */
+ compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+ compressState->ipd = NULL;
+ compressState->ntuples = 0;
+ compressState->itupprev = NULL;
+ compressState->maxitemsize = BTMaxItemSize(page);
+ compressState->maxpostingsize = 0;
+
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /*
+ * Delete dead tuples, if any. We cannot simply skip them in the loop
+ * below, because it's necessary to generate a special XLOG record
+ * containing such tuples, so that latestRemovedXid can be computed on
+ * a standby server later.
+ *
+ * This should not affect performance, since it can only happen in the
+ * rare case where the BTP_HAS_GARBAGE flag was not set and
+ * _bt_vacuum_one_page was not called, or where _bt_vacuum_one_page
+ * didn't remove all dead items.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemid))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+
+ /*
+ * Scan over all items to see which ones can be compressed
+ */
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ newpage = PageGetTempPageCopySpecial(page);
+ elog(DEBUG4, "_bt_compress_one_page rel: %s,blkno: %u",
+ RelationGetRelationName(rel), BufferGetBlockNumber(buffer));
+
+ /* Copy High Key if any */
+ if (!P_RIGHTMOST(opaque))
+ {
+ ItemId itemid = PageGetItemId(page, P_HIKEY);
+ Size itemsz = ItemIdGetLength(itemid);
+ IndexTuple item = (IndexTuple) PageGetItem(page, itemid);
+
+ if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add highkey during compression");
+ }
+
+ /*
+ * Iterate over tuples on the page, try to compress them into posting
+ * lists and insert into new page.
+ */
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, itemId);
+
+ if (compressState->itupprev != NULL)
+ {
+ int n_equal_atts =
+ _bt_keep_natts_fast(rel, compressState->itupprev, itup);
+ int itup_ntuples = BTreeTupleIsPosting(itup) ?
+ BTreeTupleGetNPosting(itup) : 1;
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * When tuples are equal, create or update a posting list.
+ *
+ * If the posting list would be too big, insert it on the page
+ * and continue.
+ */
+ if (compressState->maxitemsize >
+ MAXALIGN(((IndexTupleSize(compressState->itupprev)
+ + (compressState->ntuples + itup_ntuples + 1) * sizeof(ItemPointerData)))))
+ {
+ _bt_add_posting_item(compressState, itup);
+ }
+ else
+ {
+ insert_itupprev_to_page(newpage, compressState);
+ }
+ }
+ else
+ {
+ insert_itupprev_to_page(newpage, compressState);
+ }
+ }
+
+ /*
+ * Copy the tuple into the temp variable itupprev, to compare it with
+ * the following tuple and maybe merge them into a posting tuple.
+ */
+ if (compressState->itupprev)
+ pfree(compressState->itupprev);
+ compressState->itupprev = CopyIndexTuple(itup);
+
+ Assert(IndexTupleSize(compressState->itupprev) <= compressState->maxitemsize);
+ }
+
+ /* Handle the last item. */
+ insert_itupprev_to_page(newpage, compressState);
+
+ START_CRIT_SECTION();
+
+ PageRestoreTempPage(newpage, page);
+ MarkBufferDirty(buffer);
+
+ /* Log full page write */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+
+ recptr = log_newpage_buffer(buffer, true);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ elog(DEBUG4, "_bt_compress_one_page. success");
+}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 9c1f7de..86c662d 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -983,14 +983,52 @@ _bt_page_recyclable(Page page)
void
_bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed)
{
Page page = BufferGetPage(buf);
BTPageOpaque opaque;
+ Size itemsz;
+ Size remaining_sz = 0;
+ char *remaining_buf = NULL;
+
+ /* XLOG stuff: build a contiguous buffer of the remaining tuples */
+ if (nremaining && RelationNeedsWAL(rel))
+ {
+ Size offset = 0;
+
+ for (int i = 0; i < nremaining; i++)
+ remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+ remaining_buf = palloc0(remaining_sz);
+ for (int i = 0; i < nremaining; i++)
+ {
+ itemsz = IndexTupleSize(remaining[i]);
+ memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+ offset += MAXALIGN(itemsz);
+ }
+ Assert(offset == remaining_sz);
+ }
/* No ereport(ERROR) until changes are logged */
START_CRIT_SECTION();
+ /* Handle posting tuples here */
+ for (int i = 0; i < nremaining; i++)
+ {
+ /* First, delete the old tuple. */
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = IndexTupleSize(remaining[i]);
+ itemsz = MAXALIGN(itemsz);
+
+ /* Add tuple with remaining ItemPointers to the page. */
+ if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to rewrite compressed item in index while doing vacuum");
+ }
+
/* Fix the page */
if (nitems > 0)
PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1058,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
xl_btree_vacuum xlrec_vacuum;
xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ xlrec_vacuum.nremaining = nremaining;
+ xlrec_vacuum.ndeleted = nitems;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1073,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
if (nitems > 0)
XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+ /*
+ * Save the offsets and the remaining tuples themselves. It's
+ * important to restore them in the correct order: first the
+ * remaining tuples, and only then the other deleted items.
+ */
+ if (nremaining > 0)
+ {
+ Assert(remaining_buf != NULL);
+ XLogRegisterBufData(0, (char *) remainingoffset,
+ nremaining * sizeof(OffsetNumber));
+ XLogRegisterBufData(0, remaining_buf, remaining_sz);
+ }
+
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
PageSetLSN(page, recptr);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..22fb228 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid, TransactionId *oldestBtpoXact);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+ int *nremaining);
/*
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+ vstate.lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
@@ -1193,6 +1196,9 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTuple remaining[MaxOffsetNumber];
+ OffsetNumber remainingoffset[MaxOffsetNumber];
+ int nremaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
* callback function.
*/
ndeletable = 0;
+ nremaining = 0;
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
if (callback)
@@ -1242,31 +1249,78 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if (BTreeTupleIsPosting(itup))
+ {
+ int nnewipd = 0;
+ ItemPointer newipd = NULL;
+
+ newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+ if (nnewipd == 0)
+ {
+ /*
+ * All TIDs in the posting list must be deleted; we can
+ * delete the whole tuple in the regular way.
+ */
+ deletable[ndeletable++] = offnum;
+ }
+ else if (nnewipd == BTreeTupleGetNPosting(itup))
+ {
+ /*
+ * All TIDs in the posting tuple must remain. Do
+ * nothing, just clean up.
+ */
+ pfree(newipd);
+ }
+ else if (nnewipd < BTreeTupleGetNPosting(itup))
+ {
+ /* Some TIDs from posting tuple must remain. */
+ Assert(nnewipd > 0);
+ Assert(newipd != NULL);
+
+ /*
+ * Form a new tuple that contains only the remaining TIDs.
+ * Remember this tuple and the offset of the old tuple so
+ * that it can be updated in place.
+ */
+ remainingoffset[nremaining] = offnum;
+ remaining[nremaining] = BTreeFormPostingTuple(itup, newipd, nnewipd);
+ nremaining++;
+ pfree(newipd);
+
+ Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts.
+ * That is only true as long as the callback function
+ * depends only upon whether the index tuple refers to
+ * heap tuples removed in the initial heap scan. When
+ * vacuum starts it derives a value of OldestXmin.
+ * Backends taking later snapshots could have a
+ * RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the
+ * index tuple state in isolation and decide to delete the
+ * index tuple, though currently it does not. If it ever
+ * did, we would need to reconsider whether
+ * XLOG_BTREE_VACUUM records should cause conflicts. If
+ * they did cause conflicts they would be fairly harsh
+ * conflicts, since we haven't yet worked out a way to
+ * pass a useful value for latestRemovedXid on the
+ * XLOG_BTREE_VACUUM records. This applies to *any* type
+ * of index that marks index tuples as killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1274,7 +1328,7 @@ restart:
* Apply any needed deletes. We issue just one _bt_delitems_vacuum()
* call per page, so as to minimize WAL traffic.
*/
- if (ndeletable > 0)
+ if (ndeletable > 0 || nremaining > 0)
{
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1345,7 @@ restart:
* that.
*/
_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+ remainingoffset, remaining, nremaining,
vstate->lastBlockVacuumed);
/*
@@ -1376,6 +1431,41 @@ restart:
}
/*
+ * btreevacuumPosting() -- vacuum a posting tuple.
+ *
+ * Returns a new palloc'd posting list with the remaining items.
+ * The posting list size is returned via nremaining.
+ *
+ * If all items are dead, nremaining is 0 and the resulting posting
+ * list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+ int remaining = 0;
+ int nitem = BTreeTupleGetNPosting(itup);
+ ItemPointer tmpitems = NULL,
+ items = BTreeTupleGetPosting(itup);
+
+ /*
+ * Check each TID in the posting list; save the live items into tmpitems.
+ */
+ for (int i = 0; i < nitem; i++)
+ {
+ if (vstate->callback(items + i, vstate->callback_state))
+ continue;
+
+ if (tmpitems == NULL)
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+ tmpitems[remaining++] = items[i];
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
+
+/*
* btcanreturn() -- Check whether btree indexes support index-only scans.
*
* btrees always do, so this is trivial.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 19735bf..2097597 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,6 +30,9 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr,
+ IndexTuple itup, int i);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -504,7 +507,8 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
/* We have low <= mid < high, so mid points at a real slot */
- result = _bt_compare(rel, key, page, mid);
+ result = _bt_compare_posting(rel, key, page, mid,
+ &(insertstate->in_posting_offset));
if (result >= cmpval)
low = mid + 1;
@@ -533,6 +537,55 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
return low;
}
+/*
+ * Compare an insertion-type scankey to a tuple on a page, taking
+ * posting tuples into account. If the key of a posting tuple is equal
+ * to the scankey, find the exact position inside the posting list,
+ * using the TID as an extra attribute.
+ */
+int32
+_bt_compare_posting(Relation rel,
+ BTScanInsert key,
+ Page page,
+ OffsetNumber offnum,
+ int *in_posting_offset)
+{
+ IndexTuple itup;
+ int result;
+
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+ result = _bt_compare(rel, key, page, offnum);
+
+ if (BTreeTupleIsPosting(itup) && result == 0)
+ {
+ int low,
+ high,
+ mid,
+ res;
+
+ low = 0;
+ /* "high" is past end of posting list for loop invariant */
+ high = BTreeTupleGetNPosting(itup);
+
+ while (high > low)
+ {
+ mid = low + ((high - low) / 2);
+ res = ItemPointerCompare(key->scantid,
+ BTreeTupleGetPostingN(itup, mid));
+
+ if (res >= 1)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ *in_posting_offset = high;
+ }
+
+ return result;
+}
+
/*----------
* _bt_compare() -- Compare insertion-type scankey to tuple on a page.
*
@@ -665,61 +718,120 @@ _bt_compare(Relation rel,
* Use the heap TID attribute and scantid to try to break the tie. The
* rules are the same as any other key attribute -- only the
* representation differs.
+ *
+ * When itup is a posting tuple, the check becomes more complex: it is
+ * possible that the scankey falls within the tuple's posting list TID
+ * range.
+ *
+ * _bt_compare() is multipurpose, so it simply returns 0 to report that
+ * the key matches the tuple at this offset.
+ *
+ * Use the special _bt_compare_posting() wrapper function to handle this
+ * case and recheck the posting tuple, finding the exact position of
+ * the scankey.
*/
- heapTid = BTreeTupleGetHeapTID(itup);
- if (key->scantid == NULL)
+ if (!BTreeTupleIsPosting(itup))
{
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid == NULL)
+ {
+ /*
+ * Most searches have a scankey that is considered greater than a
+ * truncated pivot tuple if and when the scankey has equal values
+ * for attributes up to and including the least significant
+ * untruncated attribute in tuple.
+ *
+ * For example, if an index has the minimum two attributes (single
+ * user key attribute, plus heap TID attribute), and a page's high
+ * key is ('foo', -inf), and scankey is ('foo', <omitted>), the
+ * search will not descend to the page to the left. The search
+ * will descend right instead. The truncated attribute in pivot
+ * tuple means that all non-pivot tuples on the page to the left
+ * are strictly < 'foo', so it isn't necessary to descend left. In
+ * other words, search doesn't have to descend left because it
+ * isn't interested in a match that has a heap TID value of -inf.
+ *
+ * However, some searches (pivotsearch searches) actually require
+ * that we descend left when this happens. -inf is treated as a
+ * possible match for omitted scankey attribute(s). This is
+ * needed by page deletion, which must re-find leaf pages that are
+ * targets for deletion using their high keys.
+ *
+ * Note: the heap TID part of the test ensures that scankey is
+ * being compared to a pivot tuple with one or more truncated key
+ * attributes.
+ *
+ * Note: pg_upgrade'd !heapkeyspace indexes must always descend to
+ * the left here, since they have no heap TID attribute (and
+ * cannot have any -inf key values in any case, since truncation
+ * can only remove non-key attributes). !heapkeyspace searches
+ * must always be prepared to deal with matches on both sides of
+ * the pivot once the leaf level is reached.
+ */
+ if (key->heapkeyspace && !key->pivotsearch &&
+ key->keysz == ntupatts && heapTid == NULL)
+ return 1;
+
+ /* All provided scankey arguments found to be equal */
+ return 0;
+ }
+
/*
- * Most searches have a scankey that is considered greater than a
- * truncated pivot tuple if and when the scankey has equal values for
- * attributes up to and including the least significant untruncated
- * attribute in tuple.
- *
- * For example, if an index has the minimum two attributes (single
- * user key attribute, plus heap TID attribute), and a page's high key
- * is ('foo', -inf), and scankey is ('foo', <omitted>), the search
- * will not descend to the page to the left. The search will descend
- * right instead. The truncated attribute in pivot tuple means that
- * all non-pivot tuples on the page to the left are strictly < 'foo',
- * so it isn't necessary to descend left. In other words, search
- * doesn't have to descend left because it isn't interested in a match
- * that has a heap TID value of -inf.
- *
- * However, some searches (pivotsearch searches) actually require that
- * we descend left when this happens. -inf is treated as a possible
- * match for omitted scankey attribute(s). This is needed by page
- * deletion, which must re-find leaf pages that are targets for
- * deletion using their high keys.
- *
- * Note: the heap TID part of the test ensures that scankey is being
- * compared to a pivot tuple with one or more truncated key
- * attributes.
- *
- * Note: pg_upgrade'd !heapkeyspace indexes must always descend to the
- * left here, since they have no heap TID attribute (and cannot have
- * any -inf key values in any case, since truncation can only remove
- * non-key attributes). !heapkeyspace searches must always be
- * prepared to deal with matches on both sides of the pivot once the
- * leaf level is reached.
+ * Treat truncated heap TID as minus infinity, since scankey has a key
+ * attribute value (scantid) that would otherwise be compared directly
*/
- if (key->heapkeyspace && !key->pivotsearch &&
- key->keysz == ntupatts && heapTid == NULL)
+ Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
+ if (heapTid == NULL)
return 1;
- /* All provided scankey arguments found to be equal */
- return 0;
+ Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
+ return ItemPointerCompare(key->scantid, heapTid);
}
+ else
+ {
+ heapTid = BTreeTupleGetHeapTID(itup);
+ if (key->scantid != NULL && heapTid != NULL)
+ {
+ int cmp = ItemPointerCompare(key->scantid, heapTid);
- /*
- * Treat truncated heap TID as minus infinity, since scankey has a key
- * attribute value (scantid) that would otherwise be compared directly
- */
- Assert(key->keysz == IndexRelationGetNumberOfKeyAttributes(rel));
- if (heapTid == NULL)
- return 1;
+ if (cmp == -1 || cmp == 0)
+ {
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is less than or equal to posting tuple (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return cmp;
+ }
- Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
- return ItemPointerCompare(key->scantid, heapTid);
+ heapTid = BTreeTupleGetMaxTID(itup);
+ cmp = ItemPointerCompare(key->scantid, heapTid);
+ if (cmp == 1)
+ {
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is greater than posting tuple (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return cmp;
+ }
+
+ /*
+ * If we got here, scantid is between the posting items of the
+ * tuple.
+ */
+ elog(DEBUG4, "offnum %d Scankey (%u,%u) is between posting items (%u,%u) and (%u,%u)",
+ offnum, ItemPointerGetBlockNumberNoCheck(key->scantid),
+ ItemPointerGetOffsetNumberNoCheck(key->scantid),
+ ItemPointerGetBlockNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+ ItemPointerGetOffsetNumberNoCheck(BTreeTupleGetHeapTID(itup)),
+ ItemPointerGetBlockNumberNoCheck(heapTid),
+ ItemPointerGetOffsetNumberNoCheck(heapTid));
+ return 0;
+ }
+ }
+
+ return 0;
}
/*
@@ -1456,6 +1568,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.prevTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1490,8 +1603,22 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (!BTreeTupleIsPosting(itup))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ else
+ {
+ /* Return posting list "logical" tuples */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup, i);
+ itemIndex++;
+ }
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1524,7 +1651,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (!continuescan)
so->currPos.moreRight = false;
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1532,7 +1659,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPostingIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1574,8 +1701,23 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (!BTreeTupleIsPosting(itup))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ else
+ {
+ /* Return posting list "logical" tuples */
+ /* XXX: Maybe this loop should be backwards? */
+ for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savepostingitem(so, itemIndex, offnum,
+ BTreeTupleGetPostingN(itup, i),
+ itup, i);
+ }
+ }
}
if (!continuescan)
{
@@ -1589,8 +1731,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1603,6 +1745,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
{
BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+ Assert(!BTreeTupleIsPosting(itup));
+
currItem->heapTid = itup->t_tid;
currItem->indexOffset = offnum;
if (so->currTuples)
@@ -1615,6 +1759,33 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
}
+/* Save an index item into so->currPos.items[itemIndex] for posting tuples. */
+static void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+ ItemPointer iptr, IndexTuple itup, int i)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ if (i == 0)
+ {
+ /* Save the key; it is the same for all tuples in the posting list */
+ Size itupsz = BTreeTupleGetPostingOffset(itup);
+
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+ so->currPos.prevTupleOffset = currItem->tupleOffset;
+ }
+ else
+ currItem->tupleOffset = so->currPos.prevTupleOffset;
+ }
+}
+
/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index b30cf9e..b058599 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTCompressState *compressState);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -972,6 +974,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* only shift the line pointer array back and forth, and overwrite
* the tuple space previously occupied by oitup. This is fairly
* cheap.
+ *
+ * If the lastleft tuple was a posting tuple, we'll truncate its
+ * posting list in _bt_truncate as well. Note that this applies
+ * only to leaf pages, since internal pages never contain posting
+ * tuples.
*/
ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1011,6 +1018,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* the minimum key for the new page.
*/
state->btps_minkey = CopyIndexTuple(oitup);
+ Assert(BTreeTupleIsPivot(state->btps_minkey));
/*
* Set the sibling links for both pages.
@@ -1052,6 +1060,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey == NULL);
state->btps_minkey = CopyIndexTuple(itup);
/* _bt_sortaddtup() will perform full truncation later */
+ BTreeTupleClearBtIsPosting(state->btps_minkey);
BTreeTupleSetNAtts(state->btps_minkey, 0);
}
@@ -1137,6 +1146,91 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
}
/*
+ * Add a new tuple (posting or non-posting) to the page while building
+ * the index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+ BTCompressState *compressState)
+{
+ IndexTuple to_insert;
+
+ /* Return if there is no tuple to insert */
+ if (state == NULL)
+ return;
+
+ if (compressState->ntuples == 0)
+ to_insert = compressState->itupprev;
+ else
+ {
+ IndexTuple postingtuple;
+
+ /* form a tuple with a posting list */
+ postingtuple = BTreeFormPostingTuple(compressState->itupprev,
+ compressState->ipd,
+ compressState->ntuples);
+ to_insert = postingtuple;
+ pfree(compressState->ipd);
+ }
+
+ _bt_buildadd(wstate, state, to_insert);
+
+ if (compressState->ntuples > 0)
+ pfree(to_insert);
+ compressState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in compressState.
+ *
+ * Helper function for _bt_load() and _bt_compress_one_page().
+ *
+ * Note: caller is responsible for size check to ensure that resulting tuple
+ * won't exceed BTMaxItemSize.
+ */
+void
+_bt_add_posting_item(BTCompressState *compressState, IndexTuple itup)
+{
+ int nposting = 0;
+
+ if (compressState->ntuples == 0)
+ {
+ compressState->ipd = palloc0(compressState->maxitemsize);
+
+ if (BTreeTupleIsPosting(compressState->itupprev))
+ {
+ /* if itupprev is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(compressState->itupprev);
+ memcpy(compressState->ipd,
+ BTreeTupleGetPosting(compressState->itupprev),
+ sizeof(ItemPointerData) * nposting);
+ compressState->ntuples += nposting;
+ }
+ else
+ {
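+ /* itupprev is a plain tuple; its first member is its heap TID */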
+ memcpy(compressState->ipd, compressState->itupprev,
+ sizeof(ItemPointerData));
+ compressState->ntuples++;
+ }
+ }
+
+ if (BTreeTupleIsPosting(itup))
+ {
+ /* if tuple is posting, add all its TIDs to the posting list */
+ nposting = BTreeTupleGetNPosting(itup);
+ memcpy(compressState->ipd + compressState->ntuples,
+ BTreeTupleGetPosting(itup),
+ sizeof(ItemPointerData) * nposting);
+ compressState->ntuples += nposting;
+ }
+ else
+ {
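+ /* itup is a plain tuple; its first member is its heap TID */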
+ memcpy(compressState->ipd + compressState->ntuples, itup,
+ sizeof(ItemPointerData));
+ compressState->ntuples++;
+ }
+}
+
+/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
*/
@@ -1150,9 +1244,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
bool load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
- keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+ keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+ natts = IndexRelationGetNumberOfAttributes(wstate->index);
SortSupport sortKeys;
int64 tuples_done = 0;
+ bool use_compression = false;
+ BTCompressState *compressState = NULL;
+
+ /*
+ * Don't use compression for indexes with INCLUDEd columns or for
+ * unique indexes.
+ */
+ use_compression = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+ IndexRelationGetNumberOfAttributes(wstate->index) &&
+ !wstate->index->rd_index->indisunique);
if (merge)
{
@@ -1266,19 +1371,89 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
}
else
{
- /* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
- true)) != NULL)
+ if (!use_compression)
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ /* merge is unnecessary */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
- _bt_buildadd(wstate, state, itup);
+ _bt_buildadd(wstate, state, itup);
- /* Report progress */
- pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
- ++tuples_done);
+ /* Report progress */
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+ ++tuples_done);
+ }
+ }
+ else
+ {
+ /* init compress state needed to build posting tuples */
+ compressState = (BTCompressState *) palloc0(sizeof(BTCompressState));
+ compressState->ipd = NULL;
+ compressState->ntuples = 0;
+ compressState->itupprev = NULL;
+ compressState->maxitemsize = 0;
+ compressState->maxpostingsize = 0;
+
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ compressState->maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ if (compressState->itupprev != NULL)
+ {
+ int n_equal_atts = _bt_keep_natts_fast(wstate->index,
+ compressState->itupprev, itup);
+
+ if (n_equal_atts > natts)
+ {
+ /*
+ * Tuples are equal. Create or update a posting list.
+ *
+ * If the posting list is too big, insert it on the page
+ * and continue.
+ */
+ if ((compressState->ntuples + 1) * sizeof(ItemPointerData) <
+ compressState->maxpostingsize)
+ _bt_add_posting_item(compressState, itup);
+ else
+ _bt_buildadd_posting(wstate, state,
+ compressState);
+ }
+ else
+ {
+ /*
+ * Tuples are not equal. Insert itupprev into index.
+ * Save current tuple for the next iteration.
+ */
+ _bt_buildadd_posting(wstate, state, compressState);
+ }
+ }
+
+ /*
+ * Save the current tuple so we can compare it with the next
+ * one, and perhaps merge them into a posting tuple.
+ */
+ if (compressState->itupprev)
+ pfree(compressState->itupprev);
+ compressState->itupprev = CopyIndexTuple(itup);
+
+ /* Compute the maximum size of the posting list for itupprev */
+ compressState->maxpostingsize = compressState->maxitemsize -
+ IndexInfoFindDataOffset(compressState->itupprev->t_info) -
+ MAXALIGN(IndexTupleSize(compressState->itupprev));
+ }
+
+ /* Handle the last item */
+ _bt_buildadd_posting(wstate, state, compressState);
}
}
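To make the compression loop concrete, as I read it: given four sorted
tuples with keys (A, A, A, B), the first A merely becomes itupprev; the
second and third A compare equal and their TIDs are accumulated via
_bt_add_posting_item(); B compares unequal, so _bt_buildadd_posting()
writes one posting tuple carrying the three A TIDs and B becomes the new
itupprev; the final call after the loop flushes B as a plain tuple, since
its posting list is empty.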
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index a7882fd..77e1d46 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -459,6 +459,7 @@ _bt_recsplitloc(FindSplitData *state,
int16 leftfree,
rightfree;
Size firstrightitemsz;
+ Size postingsubhikey = 0;
bool newitemisfirstonright;
/* Is the new item going to be the first item on the right page? */
@@ -466,10 +467,33 @@ _bt_recsplitloc(FindSplitData *state,
&& !newitemonleft);
if (newitemisfirstonright)
+ {
firstrightitemsz = state->newitemsz;
+
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+ postingsubhikey = IndexTupleSize(state->newitem) -
+ BTreeTupleGetPostingOffset(state->newitem);
+ }
else
+ {
firstrightitemsz = firstoldonrightsz;
+ /* Calculate posting list overhead, if any */
+ if (state->is_leaf)
+ {
+ ItemId itemid;
+ IndexTuple newhighkey;
+
+ itemid = PageGetItemId(state->page, firstoldonright);
+ newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+ if (BTreeTupleIsPosting(newhighkey))
+ postingsubhikey = IndexTupleSize(newhighkey) -
+ BTreeTupleGetPostingOffset(newhighkey);
+ }
+ }
+
/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
rightfree = state->rightspace -
@@ -492,9 +516,13 @@ _bt_recsplitloc(FindSplitData *state,
* adding a heap TID to the left half's new high key when splitting at the
* leaf level. In practice the new high key will often be smaller and
* will rarely be larger, but conservatively assume the worst case.
+ * Truncation always truncates away any posting list that appears in the
+ * first right tuple, though, so it's safe to subtract that overhead
+ * (while still conservatively assuming that truncation might have to add
+ * back a single heap TID using the pivot tuple heap TID representation).
*/
if (state->is_leaf)
- leftfree -= (int16) (firstrightitemsz +
+ leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
MAXALIGN(sizeof(ItemPointerData)));
else
leftfree -= (int16) firstrightitemsz;
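An illustrative example of the new accounting (ignoring item-header
overhead): if the would-be first right tuple is a 200-byte posting tuple
whose posting list starts at offset 40, then postingsubhikey is 160, and
the leaf-level charge against leftfree shrinks from
200 + MAXALIGN(sizeof(ItemPointerData)) to
40 + MAXALIGN(sizeof(ItemPointerData)), reflecting the fact that truncation
never keeps the posting list in the new high key.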
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 9b172c1..9552acb 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -111,8 +111,12 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
key->nextkey = false;
key->pivotsearch = false;
key->keysz = Min(indnkeyatts, tupnatts);
- key->scantid = key->heapkeyspace && itup ?
- BTreeTupleGetHeapTID(itup) : NULL;
+
+ if (itup && key->heapkeyspace)
+ key->scantid = BTreeTupleGetHeapTID(itup);
+ else
+ key->scantid = NULL;
+
skey = key->scankeys;
for (i = 0; i < indnkeyatts; i++)
{
@@ -1787,7 +1791,9 @@ _bt_killitems(IndexScanDesc scan)
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ /* No microvacuum for posting tuples */
+ if (!BTreeTupleIsPosting(ituple) &&
+ (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
{
/* found the item */
ItemIdMarkDead(iid);
@@ -2145,6 +2151,16 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
+ if (BTreeTupleIsPosting(firstright))
+ {
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetNAtts(pivot, keepnatts);
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+ }
+
+ Assert(!BTreeTupleIsPosting(pivot));
+
/*
* If there is a distinguishing key attribute within new pivot tuple,
* there is no need to add an explicit heap TID attribute
@@ -2161,6 +2177,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute to the new pivot tuple.
*/
Assert(natts != nkeyatts);
+ Assert(!BTreeTupleIsPosting(lastleft));
+ Assert(!BTreeTupleIsPosting(firstright));
newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
tidpivot = palloc0(newsize);
memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2168,6 +2186,27 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
pfree(pivot);
pivot = tidpivot;
}
+ else if (BTreeTupleIsPosting(firstright))
+ {
+ /*
+ * No truncation of key attributes was possible, since they are all
+ * equal. But firstright is a posting tuple, so we must still
+ * truncate its posting list away.
+ *
+ * It's necessary to add a heap TID attribute to the new pivot tuple.
+ */
+ newsize = BTreeTupleGetPostingOffset(firstright) +
+ MAXALIGN(sizeof(ItemPointerData));
+ pivot = palloc0(newsize);
+ memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+ pivot->t_info &= ~INDEX_SIZE_MASK;
+ pivot->t_info |= newsize;
+ BTreeTupleClearBtIsPosting(pivot);
+ BTreeTupleSetAltHeapTID(pivot);
+
+ Assert(!BTreeTupleIsPosting(pivot));
+ }
else
{
/*
@@ -2205,7 +2244,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
sizeof(ItemPointerData));
- ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
/*
* Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2255,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
- Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
- Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+ BTreeTupleGetHeapTID(firstright)) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(lastleft)) >= 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#else
/*
@@ -2231,7 +2273,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* attribute values along with lastleft's heap TID value when lastleft's
* TID happens to be greater than firstright's TID.
*/
- ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+ ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
/*
* Pivot heap TID should never be fully equal to firstright. Note that
@@ -2240,7 +2282,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
*/
ItemPointerSetOffsetNumber(pivotheaptid,
OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
- Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+ Assert(ItemPointerCompare(pivotheaptid,
+ BTreeTupleGetHeapTID(firstright)) < 0);
#endif
BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2330,6 +2373,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
* leaving excessive amounts of free space on either side of page split.
* Callers can rely on the fact that attributes considered equal here are
* definitely also equal according to _bt_keep_natts.
+ *
+ * To build a posting tuple we need to ensure that all attributes
+ * of both tuples are equal. Use this function to compare them.
+ * TODO: maybe it's worth renaming the function.
+ *
+ * XXX: Obviously we need infrastructure for making sure it is okay to use
+ * this for posting list stuff. For example, non-deterministic collations
+ * cannot use compression, and will not work with what we have now.
+ *
+ * XXX: Even then, we probably also need to worry about TOAST as a special
+ * case. Don't repeat bugs like the amcheck bug that was fixed in commit
+ * eba775345d23d2c999bbb412ae658b6dab36e3e8. As the test case added in that
+ * commit shows, we need to worry about pg_attribute.attstorage changing in
+ * the underlying table due to an ALTER TABLE (and maybe a few other things
+ * like that). In general, the "TOAST input state" of a TOASTable datum isn't
+ * something that we make many guarantees about today, so even with C
+ * collation text we could in theory get different answers from
+ * _bt_keep_natts_fast() and _bt_keep_natts(). This needs to be nailed down
+ * in some way.
*/
int
_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2415,7 +2477,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* Non-pivot tuples currently never use alternative heap TID
* representation -- even those within heapkeyspace indexes
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+ if (BTreeTupleIsPivot(itup))
return false;
/*
@@ -2470,7 +2532,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* that to decide if the tuple is a pre-v11 tuple.
*/
return tupnatts == 0 ||
- ((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
+ (!BTreeTupleIsPivot(itup) &&
ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
}
else
@@ -2497,7 +2559,7 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
* heapkeyspace index pivot tuples, regardless of whether or not there are
* non-key attributes.
*/
- if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+ if (!BTreeTupleIsPivot(itup))
return false;
/*
@@ -2549,6 +2611,8 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
if (!needheaptidspace && itemsz <= BTMaxItemSizeNoHeapTid(page))
return;
+ /* TODO: correct the error messages below for posting tuples */
+
/*
* Internal page insertions cannot fail here, because that would mean that
* an earlier leaf level insertion that should have failed didn't
@@ -2575,3 +2639,79 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
"or use full text indexing."),
errtableconstraint(heap, RelationGetRelationName(rel))));
}
+
+/*
+ * Given a base tuple that supplies the key datum, and a posting list,
+ * build a posting tuple.
+ *
+ * The base tuple may itself be a posting tuple, but we only use its key
+ * part; all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1, fall back to building a regular non-posting tuple. This
+ * avoids carrying posting-list overhead after vacuum has reduced a posting
+ * tuple to a single TID.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+ uint32 keysize,
+ newsize = 0;
+ IndexTuple itup;
+
+ /* We only need key part of the tuple */
+ if (BTreeTupleIsPosting(tuple))
+ keysize = BTreeTupleGetPostingOffset(tuple);
+ else
+ keysize = IndexTupleSize(tuple);
+
+ Assert(nipd > 0);
+
+ /* Add space needed for posting list */
+ if (nipd > 1)
+ newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+ else
+ newsize = keysize;
+
+ newsize = MAXALIGN(newsize);
+ itup = palloc0(newsize);
+ memcpy(itup, tuple, keysize);
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+
+ if (nipd > 1)
+ {
+ /* Form posting tuple, fill posting fields */
+
+ /* Set meta info about the posting list */
+ itup->t_info |= INDEX_ALT_TID_MASK;
+ BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+ /* sort the list to preserve TID order invariant */
+ qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+ (int (*) (const void *, const void *)) ItemPointerCompare);
+
+ /* Copy posting list into the posting tuple */
+ memcpy(BTreeTupleGetPosting(itup), ipd,
+ sizeof(ItemPointerData) * nipd);
+ }
+ else
+ {
+ /* To finish building a non-posting tuple, copy the TID from ipd */
+ itup->t_info &= ~INDEX_ALT_TID_MASK;
+ ItemPointerCopy(ipd, &itup->t_tid);
+ }
+
+ return itup;
+}
+
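As a usage sketch (illustrative only; itup1 and itup2 stand for two plain
tuples with equal keys):

	ItemPointerData ipd[2];
	IndexTuple	posting;

	ipd[0] = itup1->t_tid;
	ipd[1] = itup2->t_tid;
	/* input order doesn't matter: BTreeFormPostingTuple() qsorts ipd */
	posting = BTreeFormPostingTuple(itup1, ipd, 2);
	Assert(BTreeTupleIsPosting(posting));
	Assert(BTreeTupleGetNPosting(posting) == 2);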
+/*
+ * Opposite of BTreeFormPostingTuple: returns a regular tuple that contains
+ * the key, whose TID is the n'th TID of the original tuple's posting list.
+ * The result tuple is palloc'd in the caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+ Assert(BTreeTupleIsPosting(tuple));
+ return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
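For example, BTreeGetNthTupleOfPosting(tuple, 0) (n counted from zero)
yields a plain tuple carrying the posting list's first, i.e. minimum, TID.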
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c..538a6bc 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -386,8 +386,8 @@ btree_xlog_vacuum(XLogReaderState *record)
Buffer buffer;
Page page;
BTPageOpaque opaque;
-#ifdef UNUSED
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
/*
* This section of code is thought to be no longer needed, after analysis
@@ -478,14 +478,34 @@ btree_xlog_vacuum(XLogReaderState *record)
if (len > 0)
{
- OffsetNumber *unused;
- OffsetNumber *unend;
+ if (xlrec->nremaining)
+ {
+ OffsetNumber *remainingoffset;
+ IndexTuple remaining;
+ Size itemsz;
+
+ remainingoffset = (OffsetNumber *)
+ (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+ remaining = (IndexTuple) ((char *) remainingoffset +
+ xlrec->nremaining * sizeof(OffsetNumber));
- unused = (OffsetNumber *) ptr;
- unend = (OffsetNumber *) ((char *) ptr + len);
+ /* Replace surviving posting tuples with their updated versions */
+ for (int i = 0; i < xlrec->nremaining; i++)
+ {
+ PageIndexTupleDelete(page, remainingoffset[i]);
+
+ itemsz = MAXALIGN(IndexTupleSize(remaining));
+
+ if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+ false, false) == InvalidOffsetNumber)
+ elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+ remaining = (IndexTuple) ((char *) remaining + itemsz);
+ }
+ }
- if ((unend - unused) > 0)
- PageIndexMultiDelete(page, unused, unend - unused);
+ if (xlrec->ndeleted)
+ PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
}
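Note that PageIndexTupleDelete() immediately followed by PageAddItem() at
the same offset number leaves the offset numbers of all other items
unchanged, so the remainingoffset[] values stay valid throughout the loop,
and the target offsets deleted afterwards are likewise unaffected.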
/*
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index a14eb79..e4fa99a 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -46,8 +46,10 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
- appendStringInfo(buf, "lastBlockVacuumed %u",
- xlrec->lastBlockVacuumed);
+ appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+ xlrec->lastBlockVacuumed,
+ xlrec->nremaining,
+ xlrec->ndeleted);
break;
}
case XLOG_BTREE_DELETE:
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 744ffb6..b10c0d5 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -141,6 +141,10 @@ typedef IndexAttributeBitMapData * IndexAttributeBitMap;
* On such a page, N tuples could take one MAXALIGN quantum less space than
* estimated here, seemingly allowing one more tuple than estimated here.
* But such a page always has at least MAXALIGN special space, so we're safe.
+ *
+ * Note: btree leaf pages may contain posting tuples, which store duplicate
+ * TIDs more compactly, so such pages may hold more item pointers than this
+ * estimate allows for. Use MaxPostingIndexTuplesPerPage for btree instead.
*/
#define MaxIndexTuplesPerPage \
((int) ((BLCKSZ - SizeOfPageHeaderData) / \
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 83e0e6c..bacc77b 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
* t_tid | t_info | key values | INCLUDE columns, if any
*
* t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4. Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
*
* All other types of index tuples ("pivot" tuples) only have key columns,
* since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,39 @@ typedef struct BTMetaPageData
* omitted rather than truncated, since its representation is different to
* the non-pivot representation.)
*
+ * Non-pivot posting tuple format:
+ * t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicate keys more efficiently, we use a special
+ * tuple format: posting tuples. posting_list is an array of
+ * ItemPointerData.
+ *
+ * This type of compression never applies to system indexes, unique indexes,
+ * or indexes with INCLUDEd columns.
+ *
+ * To distinguish posting tuples from other tuples, we use the
+ * INDEX_ALT_TID_MASK flag in t_info together with the BT_IS_POSTING flag
+ * in t_tid. These flags redefine the content of the posting tuple's t_tid:
+ * - t_tid.ip_blkid contains the offset of the posting list.
+ * - the t_tid offset field contains the number of posting items the tuple
+ *   contains.
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * items, since that number is constrained by BTMaxItemSize.
+ *
+ * If a page contains so many duplicates that they do not fit into one
+ * posting tuple (bounded by BTMaxItemSize), the page may contain several
+ * posting tuples with the same key. A page can also contain both posting
+ * and non-posting tuples with the same key. Currently, posting tuples
+ * always contain at least two TIDs in the posting list.
+ *
+ * Posting tuples always have the same number of attributes as the index
+ * has generally.
+ *
* Pivot tuple format:
*
* t_tid | t_info | key values | [heap TID]
@@ -281,23 +313,144 @@ typedef struct BTMetaPageData
* bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
* future use. BT_N_KEYS_OFFSET_MASK should be large enough to store any
* number of columns/attributes <= INDEX_MAX_KEYS.
+ * The BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinguish posting tuples from pivot tuples.
*
* Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes). They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
*/
#define INDEX_ALT_TID_MASK INDEX_AM_RESERVED_BIT
/* Item pointer offset bits */
#define BT_RESERVED_OFFSET_MASK 0xF000
#define BT_N_KEYS_OFFSET_MASK 0x0FFF
+#define BT_N_POSTING_OFFSET_MASK 0x0FFF
#define BT_HEAP_TID_ATTR 0x1000
+#define BT_IS_POSTING 0x2000
+
+#define BTreeTupleIsPosting(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+ )
+
+#define BTreeTupleIsPivot(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+ )
-/* Get/set downlink block number */
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of index
+ * entries (heap TIDs) that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * more compactly, so MaxPostingIndexTuplesPerPage is larger than
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so compute the bound
+ * as if the page held three posting tuples with minimal-size keys, with
+ * the remaining space filled entirely by item pointers.
+ */
+#define MaxPostingIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData - \
+ 3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+ (sizeof(ItemPointerData)))
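Worked example, assuming the default 8 kB BLCKSZ and 8-byte MAXALIGN:
(8192 - 24 - 3 * (MAXALIGN(9) + 4)) / 6 = (8192 - 24 - 60) / 6 = 1351 item
pointers, roughly three times MaxIndexTuplesPerPage.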
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * While iterating over tuples during index build, or while applying
+ * compression to a single page, we remember the previous tuple in itupprev
+ * and compare the next one with it. If the tuples are equal, we save their
+ * TIDs in the posting list; ntuples is the current length of that list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTCompressState
+{
+ Size maxitemsize;
+ Size maxpostingsize;
+ IndexTuple itupprev;
+ int ntuples;
+ ItemPointerData *ipd;
+} BTCompressState;
+
+/* macros to work with posting tuples *BEGIN* */
+#define BTreeTupleSetBtIsPosting(itup) \
+ do { \
+ Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+ } while(0)
+
+#define BTreeTupleGetNPosting(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+ )
+
+#define BTreeTupleSetNPosting(itup, n) \
+ do { \
+ ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+ BTreeTupleSetBtIsPosting(itup); \
+ } while(0)
+
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * The caller is responsible for checking BTreeTupleIsPosting() first, to
+ * ensure it gets what it expects.
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+ )
+#define BTreeTupleSetPostingOffset(itup, offset) \
+ ( \
+ AssertMacro(BTreeTupleIsPosting(itup)), \
+ ItemPointerSetBlockNumber(&((itup)->t_tid), (offset)) \
+ )
+#define BTreeSetPostingMeta(itup, nposting, off) \
+ do { \
+ BTreeTupleSetNPosting(itup, nposting); \
+ BTreeTupleSetPostingOffset(itup, off); \
+ } while(0)
+
+#define BTreeTupleGetPosting(itup) \
+ (ItemPointerData*) ((char*)(itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+ (ItemPointerData*) (BTreeTupleGetPosting(itup) + (n))
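(BTreeTupleGetPosting() already yields an ItemPointerData pointer, so the
+ (n) in BTreeTupleGetPostingN() advances by whole ItemPointerData
elements, not bytes.)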
+
+/*
+ * Posting tuples always contain more than one TID. The minimum TID can be
+ * accessed using BTreeTupleGetHeapTID(). The maximum is accessed using
+ * BTreeTupleGetMaxTID().
+ */
+#define BTreeTupleGetMaxTID(itup) \
+ ( \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING))) ? \
+ ( \
+ (ItemPointer) (BTreeTupleGetPosting(itup) + (BTreeTupleGetNPosting(itup)-1)) \
+ ) \
+ : \
+ (ItemPointer) &((itup)->t_tid) \
+ )
+/* macros to work with posting tuples *END* */
+
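To see how the flag encoding fits together, a minimal sketch mirroring what
BTreeFormPostingTuple() does:

	/* itup was palloc'd with room for nipd TIDs after the key at keysize */
	itup->t_info |= INDEX_ALT_TID_MASK;
	BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));

	Assert(BTreeTupleIsPosting(itup));
	Assert(!BTreeTupleIsPivot(itup));
	Assert(BTreeTupleGetNPosting(itup) == nipd);
	Assert(BTreeTupleGetPostingOffset(itup) == SHORTALIGN(keysize));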
+/* Get/set downlink block number */
#define BTreeInnerTupleGetDownLink(itup) \
ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
#define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,7 +479,8 @@ typedef struct BTMetaPageData
*/
#define BTreeTupleGetNAtts(itup, rel) \
( \
- (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ ((itup)->t_info & INDEX_ALT_TID_MASK && \
+ ((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
( \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
) \
@@ -335,6 +489,7 @@ typedef struct BTMetaPageData
)
#define BTreeTupleSetNAtts(itup, n) \
do { \
+ Assert(!BTreeTupleIsPosting(itup)); \
(itup)->t_info |= INDEX_ALT_TID_MASK; \
ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
} while(0)
@@ -342,6 +497,8 @@ typedef struct BTMetaPageData
/*
* Get tiebreaker heap TID attribute, if any. Macro works with both pivot
* and non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * For non-pivot posting tuples, this returns the first TID from the
+ * posting list.
*/
#define BTreeTupleGetHeapTID(itup) \
( \
@@ -351,7 +508,10 @@ typedef struct BTMetaPageData
(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
sizeof(ItemPointerData)) \
) \
- : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
+ : (itup)->t_info & INDEX_ALT_TID_MASK ? \
+ (((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0) ? \
+ (ItemPointer) BTreeTupleGetPosting(itup) : NULL) \
+ : (ItemPointer) &((itup)->t_tid) \
)
/*
* Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
@@ -360,6 +520,7 @@ typedef struct BTMetaPageData
#define BTreeTupleSetAltHeapTID(itup) \
do { \
Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+ Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
ItemPointerSetOffsetNumber(&(itup)->t_tid, \
ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
} while(0)
@@ -501,6 +662,12 @@ typedef struct BTInsertStateData
Buffer buf;
/*
+ * If _bt_binsrch_insert() found the insertion point inside an existing
+ * posting list, the position within that list is saved here.
+ */
+ int in_posting_offset;
+
+ /*
* Cache of bounds within the current buffer. Only used for insertions
* where _bt_check_unique is called. See _bt_binsrch_insert and
* _bt_findinsertloc for details.
@@ -567,6 +734,8 @@ typedef struct BTScanPosData
* location in the associated tuple storage workspace.
*/
int nextTupleOffset;
+ /* prevTupleOffset is for posting list handling */
+ int prevTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -579,7 +748,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPostingIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -763,6 +932,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems, Relation heapRel);
extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
OffsetNumber *itemnos, int nitems,
+ OffsetNumber *remainingoffset,
+ IndexTuple *remaining, int nremaining,
BlockNumber lastBlockVacuumed);
extern int _bt_pagedel(Relation rel, Buffer buf);
@@ -775,6 +946,8 @@ extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
bool forupdate, BTStack stack, int access, Snapshot snapshot);
extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
+extern int32 _bt_compare_posting(Relation rel, BTScanInsert key, Page page,
+ OffsetNumber offnum, int *in_posting_offset);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -813,6 +986,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
OffsetNumber offnum);
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+ int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
/*
* prototypes for functions in nbtvalidate.c
@@ -825,5 +1001,7 @@ extern bool btvalidate(Oid opclassoid);
extern IndexBuildResult *btbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_add_posting_item(BTCompressState *compressState,
+ IndexTuple itup);
#endif /* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index afa614d..4b615e0 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -173,10 +173,19 @@ typedef struct xl_btree_vacuum
{
BlockNumber lastBlockVacuumed;
- /* TARGET OFFSET NUMBERS FOLLOW */
+ /*
+ * These fields are needed to find the beginning of the remaining
+ * tuples (updated posting tuples), which follow the arrays of offset
+ * numbers.
+ */
+ uint32 nremaining;
+ uint32 ndeleted;
+
+ /* TARGET OFFSET NUMBERS FOLLOW (ndeleted values) */
+ /* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+ /* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
} xl_btree_vacuum;
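For reference, this is how the replay side (btree_xlog_vacuum() above)
locates the three arrays in the block data:

	/* within btree_xlog_vacuum(); len is the Size returned for block 0 */
	char	   *ptr = XLogRecGetBlockData(record, 0, &len);
	OffsetNumber *deleted = (OffsetNumber *) ptr;
	OffsetNumber *remainingoffset = (OffsetNumber *)
		(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
	IndexTuple	remaining = (IndexTuple)
		((char *) remainingoffset + xlrec->nremaining * sizeof(OffsetNumber));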
-#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, ndeleted) + sizeof(uint32))
/*
* This is what we need to know about marking an empty branch for deletion.