18.02.2016 20:18, Anastasia Lubennikova:
04.02.2016 20:16, Peter Geoghegan:
On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova
<a.lubennik...@postgrespro.ru> wrote:
I fixed it in the new version (attached).
Thank you for the review.
At last, there is a new patch version 3.0. After some refactoring it
looks much better.
I described all details of the compression in this document
https://goo.gl/50O8Q0 (the same text without pictures is attached in
btc_readme_1.0.txt).
Consider it a rough draft of the readme. It contains some notes about
tricky moments of the implementation and questions about future work.
Please don't hesitate to comment on it.
Sorry, the previous patch was dirty. A hotfix is attached.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Compression. To be precise, it's not actually compression, but just an effective layout of ItemPointers on an index page.
compressed tuple = IndexTuple (with metadata in TID field + key) + PostingList
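
To make the layout concrete, here is a tiny sketch (not part of the patch; the helper name tuple_ntids is made up) showing how a tuple is read back through the accessor macros the patch adds to nbtree.h:

    static int
    tuple_ntids(IndexTuple itup, ItemPointer *tids)
    {
        if (BtreeTupleIsPosting(itup))
        {
            /*
             * Metadata lives in t_tid: the block field keeps the offset of
             * the posting list (with the BT_POSTING flag set), the offset
             * field keeps the number of TIDs in the list.
             */
            *tids = BtreeGetPosting(itup);   /* array of ItemPointerData */
            return BtreeGetNPosting(itup);
        }

        /* regular tuple: the single heap TID is in t_tid, as before */
        *tids = &itup->t_tid;
        return 1;
    }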
1. A GIN index fits extremely well for really large sets of repeating keys, but on the other hand, it completely fails to handle unique keys. For btree it is essential to have good performance and concurrency in all corner cases, with any number of duplicates. That's why we can't just copy the GIN implementation of item pointer compression. The first difference is that the btree algorithm performs compression (or, in other words, changes the index tuple layout) only if there is more than one tuple with a given key. This allows us to avoid the overhead of storing useless metadata for mostly distinct keys (see picture below). It seems that compression could be useful for unique indexes under heavy write/update load (because of MVCC copies), but I'm not sure whether this use case really exists; those tuples should be deleted by microvacuum as soon as possible. Anyway, I think it's worth adding a storage parameter for btree that enables/disables compression for each particular index, with compression of unique indexes set to off by default. System indexes do not support compression, for several reasons. First of all, because of the WIP state of the patch (debugging the system catalog isn't much fun). The next reason is that I know many places in the code where hardcoded assumptions or non-obvious syscache routines are used, and I do not feel brave enough to change that code. And last but not least, I don't see a good reason to do it.
2. If the index key is very small (smaller than the metadata) and the number of duplicates is small, compression could lead to index bloat instead of a decrease in index size (see picture below). I'm not sure whether it's worth handling this case separately, because it's really rare, and I consider it the DBA's job to disable compression on such indexes. But if you see any clear way to do this, it would be great.
3. For GIN indexes, if a posting list is too large, a posting tree is created. That design relies on the following assumptions:
Indexed keys are never deleted. It makes all the tree algorithms much easier.
There are always many duplicates. Otherwise, GIN becomes really inefficient.
There is no high concurrency. In order to add a new entry to a posting tree, we hold a lock on its root, so only one backend at a time can perform an insertion.
In btree we can't afford these assumptions, so we have to handle big posting lists in another way: if there are too many ItemPointers to fit into a single posting list, we just create another one. The overhead of this approach is that we have to store a duplicate of the key and the metadata, which leads to the problem of big keys: if the key size is close to BTMaxItemSize, compression gives us a really small benefit, if any at all (see picture below).
4. The more item pointers fit into a single posting list, the less often we have to split it and repeat the key. Therefore, the bigger BTMaxItemSize is, the better. The comment in nbtree.h says: "We actually need to be able to fit three items on every page, so restrict any one item to 1/3 the per-page available space." That is quite right for regular items, but a compressed index tuple already contains more than one item. Taking that into account, we can assert that BTMaxItemSize should be ~1/3 of the page size for regular items, and ~1/2 of the page size for compressed items. Are there any objections? I wonder if we can increase BTMaxItemSize under some other assumption. The problem I see here is that a varlena high key could be as big as the compressed tuple.
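
To illustrate the idea, a purely hypothetical variant of the limit could look like the sketch below (BTMaxPackedItemSize is a made-up name, and it does not address the varlena high key problem mentioned above); it simply reserves room for two items per page instead of three:

    #define BTMaxPackedItemSize(page) \
        ((PageGetPageSize(page) - \
          MAXALIGN(SizeOfPageHeaderData + 2 * sizeof(ItemIdData)) - \
          MAXALIGN(sizeof(BTPageOpaqueData))) / 2)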
5. CREATE INDEX. _bt_load. The btree build algorithm is as follows: do the heap scan, add tuples to the spool, sort the data, insert the ordered data from the spool into leaf index pages (_bt_load), then build the inner pages and the root. The main changes are applied to the _bt_load function. While loading tuples, we do not insert them one by one; instead, we compare each tuple with the previous one, and if they are equal we put them into a posting list. If the posting list has grown as large as can fit into an index tuple (maxpostingsize is computed as BTMaxItemSize minus the size of the regular index tuple), or if the following tuple is not equal to the previous one, we create a packed tuple from the posting list (if any) using BtreeFormPackedTuple and insert it into the page. We do the same when there are no more elements in the spool.
6. The high key is not real data, but just an upper bound on the keys allowed on the page, so there's no need to compress it. When copying a posting tuple into a high key, we have to get rid of the posting list: the posting tuple is truncated to the length of a regular tuple, and the metadata in its TID field is set to appropriate values. It's worth mentioning a very specific point in _bt_buildadd() here. If the current page is full (there is no room for a new tuple), we copy the last item on the page into the new page, and then rearrange the old page so that the 'last item' becomes its high key rather than a true data item. If the last tuple was compressed, we can truncate it before setting it as the high key. But if it had a big posting list, there will be plenty of free space left on the original page, so we must split the posting tuple into two pieces; see the picture below and the comments in the code. I'm not sure about the correctness of locking here, but I assume that there are no concurrent operations while the index is being built. Is that right?
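
For clarity, the truncation itself boils down to the following sketch (a hypothetical helper; the patch open-codes the same steps in _bt_split() and _bt_buildadd(), and the caller remains responsible for setting t_tid afterwards):

    static IndexTuple
    bt_truncate_posting(IndexTuple itup)
    {
        /* key part only: everything before the posting list */
        Size        keysize = BtreeGetPostingOffset(itup);
        IndexTuple  keytup = palloc0(keysize);

        memcpy(keytup, itup, keysize);
        keytup->t_info &= ~INDEX_SIZE_MASK;
        keytup->t_info |= keysize;
        /* caller sets keytup->t_tid, e.g. ItemPointerSet(..., blkno, P_HIKEY) */
        return keytup;
    }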
7. Another difference between GIN and btree is that item pointers in a GIN posting list/tree are always ordered, while btree doesn't strictly require this. If there are many duplicates in a btree, we don't bother to find the ideal place to keep TIDs ordered; the insertion has a choice whether or not to move right. Currently, we just try to find a page where there is room for the new key. The next TODO item is to keep item pointers in the posting list ordered (see the sketch below). The advantage is that the best compression of the posting list can be achieved on sorted TIDs. What do you think about it?
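
A minimal sketch of what that TODO could look like, assuming a hypothetical helper that keeps the list sorted with ItemPointerCompare() (the caller must guarantee room for nipd + 1 entries):

    static void
    posting_list_add_sorted(ItemPointerData *ipd, int nipd, ItemPointer newtid)
    {
        int     i = nipd;

        /* shift larger TIDs right until the insert position is found */
        while (i > 0 && ItemPointerCompare(&ipd[i - 1], newtid) > 0)
        {
            ipd[i] = ipd[i - 1];
            i--;
        }
        ipd[i] = *newtid;
    }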
8. Insertion. After we have found a suitable place for the insertion, we check whether the previous item has the same key. If so, and if there is enough room on the page to add one more pointer, we can add it to that item. There are two possible cases: if the old item is a regular tuple, we form a new compressed tuple (note that this case requires enough space for two TIDs, the metadata plus the new TID); otherwise, we just add the pointer to the existing posting list. Then we delete the old tuple and insert the new one.
9. Search. Fortunately, it's quite easy to change the search algorithm. If a compressed tuple is found, just go over all its TIDs and return them. If an index-only scan is in progress, just return the same tuple N times in a row. To avoid storing duplicates in the currTuples array, we save the key once and then connect it with the posting TIDs via tupleOffset. It's clear that if compression is applied, a page can contain more tuples than if it held only uncompressed ones; that is why MaxPackedIndexTuplesPerPage appears. The items array in BTScanPos (which works together with currTuples and tupleOffset) is preallocated with length = MaxPackedIndexTuplesPerPage, because we must be sure that all items will fit into it.
10. Split. The only change in this section is posting list truncation before inserting the tuple as a high key.
11. Vacuum. Check all TIDs in a posting list. If there are no live items left in the compressed tuple, delete the tuple. Otherwise do the following: form a new posting tuple that contains the remaining item pointers, delete the "old" posting tuple, and insert the new one back into the page. Microvacuum of compressed tuples is not implemented yet. It's possible to use the high bit of the offset field of an item pointer to flag killed items (see the sketch below), but that requires additional performance testing.
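
A rough sketch of that idea (hypothetical macros, not implemented in the patch); since MaxOffsetNumber is far below 0x8000, the high bit of ip_posid is otherwise unused:

    #define BT_POSTING_DEAD     0x8000

    #define PostingItemMarkDead(iptr) \
        ((iptr)->ip_posid |= BT_POSTING_DEAD)
    #define PostingItemIsDead(iptr) \
        (((iptr)->ip_posid & BT_POSTING_DEAD) != 0)
    #define PostingItemGetOffset(iptr) \
        ((OffsetNumber) ((iptr)->ip_posid & ~BT_POSTING_DEAD))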
12. Locking. Compressed index tuples use the same insertion and deletion functions as regular index tuples. Most of the operations are performed inside standard functions and don't need any specific locks, although this area definitely requires more thorough testing and review. All the operations where a posting tuple is updated in place (deleted and then inserted again with a new set of item pointers in its posting list) are performed with the special function _bt_pgupdtup(), as are the operations where we want to replace one tuple with another, e.g. in btvacuumpage() and _bt_buildadd() (see the issue related to high keys).
13. Xlog. TODO.
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e3c55eb..d6922d5 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -24,6 +24,8 @@
#include "storage/predicate.h"
#include "utils/tqual.h"
+#include "catalog/catalog.h"
+#include "utils/datum.h"
typedef struct
{
@@ -82,6 +84,7 @@ static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off);
static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
int keysz, ScanKey scankey);
+
static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
@@ -113,6 +116,11 @@ _bt_doinsert(Relation rel, IndexTuple itup,
BTStack stack;
Buffer buf;
OffsetNumber offset;
+ Page page;
+ TupleDesc itupdesc;
+ int nipd;
+ IndexTuple olditup;
+ Size sizetoadd;
/* we need an insertion scan key to do our search, so build one */
itup_scankey = _bt_mkscankey(rel, itup);
@@ -190,6 +198,7 @@ top:
if (checkUnique != UNIQUE_CHECK_EXISTING)
{
+ bool updposting = false;
/*
* The only conflict predicate locking cares about for indexes is when
* an index tuple insert conflicts with an existing lock. Since the
@@ -201,7 +210,45 @@ top:
/* do the insertion */
_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup,
stack, heapRel);
- _bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+
+ /*
+ * Decide, whether we can apply compression
+ */
+ page = BufferGetPage(buf);
+
+ if(!IsSystemRelation(rel)
+ && !rel->rd_index->indisunique
+ && offset != InvalidOffsetNumber
+ && offset <= PageGetMaxOffsetNumber(page))
+ {
+ itupdesc = RelationGetDescr(rel);
+ sizetoadd = sizeof(ItemPointerData);
+ olditup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offset));
+
+ if(_bt_isbinaryequal(itupdesc, olditup,
+ rel->rd_index->indnatts, itup))
+ {
+ if (!BtreeTupleIsPosting(olditup))
+ {
+ nipd = 1;
+ sizetoadd = sizetoadd*2;
+ }
+ else
+ nipd = BtreeGetNPosting(olditup);
+
+ if ((IndexTupleSize(olditup) + sizetoadd) <= BTMaxItemSize(page)
+ && PageGetFreeSpace(page) > sizetoadd)
+ updposting = true;
+ }
+ }
+
+ if (updposting)
+ {
+ _bt_pgupdtup(rel, page, offset, itup, true, olditup, nipd);
+ _bt_relbuf(rel, buf);
+ }
+ else
+ _bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
}
else
{
@@ -1042,6 +1089,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
itemid = PageGetItemId(origpage, P_HIKEY);
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
+
if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
false, false) == InvalidOffsetNumber)
{
@@ -1072,13 +1120,39 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
}
- if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+
+ if (BtreeTupleIsPosting(item))
+ {
+ Size hikeysize = BtreeGetPostingOffset(item);
+ IndexTuple hikey = palloc0(hikeysize);
+
+ /* Truncate posting before insert it as a hikey. */
+ memcpy (hikey, item, hikeysize);
+ hikey->t_info &= ~INDEX_SIZE_MASK;
+ hikey->t_info |= hikeysize;
+ ItemPointerSet(&(hikey->t_tid), origpagenumber, P_HIKEY);
+
+ if (PageAddItem(leftpage, (Item) hikey, hikeysize, leftoff,
false, false) == InvalidOffsetNumber)
+ {
+ memset(rightpage, 0, BufferGetPageSize(rbuf));
+ elog(ERROR, "failed to add hikey to the left sibling"
+ " while splitting block %u of index \"%s\"",
+ origpagenumber, RelationGetRelationName(rel));
+ }
+
+ pfree(hikey);
+ }
+ else
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
- elog(ERROR, "failed to add hikey to the left sibling"
- " while splitting block %u of index \"%s\"",
- origpagenumber, RelationGetRelationName(rel));
+ if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+ false, false) == InvalidOffsetNumber)
+ {
+ memset(rightpage, 0, BufferGetPageSize(rbuf));
+ elog(ERROR, "failed to add hikey to the left sibling"
+ " while splitting block %u of index \"%s\"",
+ origpagenumber, RelationGetRelationName(rel));
+ }
}
leftoff = OffsetNumberNext(leftoff);
@@ -2103,6 +2177,76 @@ _bt_pgaddtup(Page page,
}
/*
+ * _bt_pgupdtup() -- update a tuple in place.
+ * This function is used for purposes of deduplication of item pointers.
+ * If new tuple to insert is equal to the tuple that already exists on the page,
+ * we can avoid key insertion and just add new item pointer.
+ *
+ * offset is the position of olditup on the page.
+ * itup is the new tuple to insert
+ * concat - this flag shows, whether we should add new item to existing one
+ * or just replace old tuple with the new value. If concat is false, the
+ * following fields are senseless.
+ * nipd is the number of item pointers in old tuple.
+ * The caller is responsible for checking of free space on the page.
+ */
+void
+_bt_pgupdtup(Relation rel, Page page, OffsetNumber offset, IndexTuple itup,
+ bool concat, IndexTuple olditup, int nipd)
+{
+ ItemPointerData *ipd;
+ IndexTuple newitup;
+ Size newitupsz;
+
+ if (concat)
+ {
+ ipd = palloc0(sizeof(ItemPointerData)*(nipd + 1));
+
+ /* copy item pointers from old tuple into ipd */
+ if (BtreeTupleIsPosting(olditup))
+ memcpy(ipd, BtreeGetPosting(olditup), sizeof(ItemPointerData)*nipd);
+ else
+ memcpy(ipd, olditup, sizeof(ItemPointerData));
+
+ /* add item pointer of the new tuple into ipd */
+ memcpy(ipd+nipd, itup, sizeof(ItemPointerData));
+
+ newitup = BtreeReformPackedTuple(itup, ipd, nipd+1);
+
+ /*
+ * Update the tuple in place. We have already checked that the
+ * new tuple would fit into this page, so it's safe to delete
+ * old tuple and insert the new one without any side effects.
+ */
+ newitupsz = IndexTupleDSize(*newitup);
+ newitupsz = MAXALIGN(newitupsz);
+ }
+ else
+ {
+ newitup = itup;
+ newitupsz = IndexTupleSize(itup);
+ }
+
+ START_CRIT_SECTION();
+
+ PageIndexTupleDelete(page, offset);
+
+ if (!_bt_pgaddtup(page, newitupsz, newitup, offset))
+ elog(ERROR, "failed to insert compressed item in index \"%s\"",
+ RelationGetRelationName(rel));
+
+ //TODO add Xlog stuff
+
+ END_CRIT_SECTION();
+
+ if (concat)
+ {
+ pfree(ipd);
+ pfree(newitup);
+ }
+}
+
+/*
* _bt_isequal - used in _bt_doinsert in check for duplicates.
*
* This is very similar to _bt_compare, except for NULL handling.
@@ -2151,6 +2295,63 @@ _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
}
/*
+ * _bt_isbinaryequal - used in _bt_doinsert and _bt_load
+ * in check for duplicates. This is very similar to heap_tuple_attr_equals
+ * subroutine. And this function differs from _bt_isequal
+ * because here we require strict binary equality of tuples.
+ */
+bool
+_bt_isbinaryequal(TupleDesc itupdesc, IndexTuple itup,
+ int nindatts, IndexTuple ituptoinsert)
+{
+ AttrNumber attno;
+
+ for (attno = 1; attno <= nindatts; attno++)
+ {
+ Datum datum1,
+ datum2;
+ bool isnull1,
+ isnull2;
+ Form_pg_attribute att;
+
+ datum1 = index_getattr(itup, attno, itupdesc, &isnull1);
+ datum2 = index_getattr(ituptoinsert, attno, itupdesc, &isnull2);
+
+ /*
+ * If one value is NULL and other is not, then they are certainly not
+ * equal
+ */
+ if (isnull1 != isnull2)
+ return false;
+ /*
+ * We do simple binary comparison of the two datums. This may be overly
+ * strict because there can be multiple binary representations for the
+ * same logical value. But we should be OK as long as there are no false
+ * positives. Using a type-specific equality operator is messy because
+ * there could be multiple notions of equality in different operator
+ * classes; furthermore, we cannot safely invoke user-defined functions
+ * while holding exclusive buffer lock.
+ */
+ if (attno <= 0)
+ {
+ /* The only allowed system columns are OIDs, so do this */
+ if (DatumGetObjectId(datum1) != DatumGetObjectId(datum2))
+ return false;
+ }
+ else
+ {
+ Assert(attno <= itupdesc->natts);
+ att = itupdesc->attrs[attno - 1];
+ if(!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+ return false;
+ }
+ }
+
+ /* if we get here, the keys are equal */
+ return true;
+}
+
+/*
* _bt_vacuum_one_page - vacuum just one index page.
*
* Try to remove LP_DEAD items from the given page. The passed buffer
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index f2905cb..a08c500 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -74,7 +74,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BTCycleId cycleid);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
-
+static ItemPointer btreevacuumPosting(BTVacState *vstate,
+ ItemPointerData *items,int nitem, int *nremaining);
/*
* Btree handler function: return IndexAmRoutine with access method parameters
@@ -962,6 +963,7 @@ restart:
OffsetNumber offnum,
minoff,
maxoff;
+ IndexTuple remaining;
/*
* Trade in the initial read lock for a super-exclusive write lock on
@@ -1011,31 +1013,58 @@ restart:
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offnum));
- htup = &(itup->t_tid);
-
- /*
- * During Hot Standby we currently assume that
- * XLOG_BTREE_VACUUM records do not produce conflicts. That is
- * only true as long as the callback function depends only
- * upon whether the index tuple refers to heap tuples removed
- * in the initial heap scan. When vacuum starts it derives a
- * value of OldestXmin. Backends taking later snapshots could
- * have a RecentGlobalXmin with a later xid than the vacuum's
- * OldestXmin, so it is possible that row versions deleted
- * after OldestXmin could be marked as killed by other
- * backends. The callback function *could* look at the index
- * tuple state in isolation and decide to delete the index
- * tuple, though currently it does not. If it ever did, we
- * would need to reconsider whether XLOG_BTREE_VACUUM records
- * should cause conflicts. If they did cause conflicts they
- * would be fairly harsh conflicts, since we haven't yet
- * worked out a way to pass a useful value for
- * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
- * applies to *any* type of index that marks index tuples as
- * killed.
- */
- if (callback(htup, callback_state))
- deletable[ndeletable++] = offnum;
+ if(BtreeTupleIsPosting(itup))
+ {
+ ItemPointer newipd;
+ int nipd,
+ nnewipd;
+
+ nipd = BtreeGetNPosting(itup);
+ newipd = btreevacuumPosting(vstate, BtreeGetPosting(itup), nipd, &nnewipd);
+
+ if (newipd != NULL)
+ {
+ if (nnewipd > 0)
+ {
+ /* There are still some live tuples in the posting.
+ * 1) form new posting tuple, that contains remaining ipds
+ * 2) delete "old" posting and insert new posting back to the page
+ */
+ remaining = BtreeReformPackedTuple(itup, newipd, nnewipd);
+ _bt_pgupdtup(info->index, page, offnum, remaining, false, NULL, 0);
+ }
+ else
+ deletable[ndeletable++] = offnum;
+ }
+ }
+ else
+ {
+ htup = &(itup->t_tid);
+
+ /*
+ * During Hot Standby we currently assume that
+ * XLOG_BTREE_VACUUM records do not produce conflicts. That is
+ * only true as long as the callback function depends only
+ * upon whether the index tuple refers to heap tuples removed
+ * in the initial heap scan. When vacuum starts it derives a
+ * value of OldestXmin. Backends taking later snapshots could
+ * have a RecentGlobalXmin with a later xid than the vacuum's
+ * OldestXmin, so it is possible that row versions deleted
+ * after OldestXmin could be marked as killed by other
+ * backends. The callback function *could* look at the index
+ * tuple state in isolation and decide to delete the index
+ * tuple, though currently it does not. If it ever did, we
+ * would need to reconsider whether XLOG_BTREE_VACUUM records
+ * should cause conflicts. If they did cause conflicts they
+ * would be fairly harsh conflicts, since we haven't yet
+ * worked out a way to pass a useful value for
+ * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
+ * applies to *any* type of index that marks index tuples as
+ * killed.
+ */
+ if (callback(htup, callback_state))
+ deletable[ndeletable++] = offnum;
+ }
}
}
@@ -1160,3 +1189,50 @@ btcanreturn(Relation index, int attno)
{
return true;
}
+
+/*
+ * btreevacuumPosting() -- vacuums a posting list.
+ * The size of the list must be specified via number of items (nitems).
+ *
+ * If none of the items need to be removed, returns NULL. Otherwise returns
+ * a new palloc'd array with the remaining items. The number of remaining
+ * items is returned via nremaining.
+ */
+ItemPointer
+btreevacuumPosting(BTVacState *vstate, ItemPointerData *items,
+ int nitem, int *nremaining)
+{
+ int i,
+ remaining = 0;
+ ItemPointer tmpitems = NULL;
+ IndexBulkDeleteCallback callback = vstate->callback;
+ void *callback_state = vstate->callback_state;
+
+ /*
+ * Iterate over TIDs array
+ */
+ for (i = 0; i < nitem; i++)
+ {
+ if (callback(items + i, callback_state))
+ {
+ if (!tmpitems)
+ {
+ /*
+ * First TID to be deleted: allocate memory to hold the
+ * remaining items.
+ */
+ tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+ memcpy(tmpitems, items, sizeof(ItemPointerData) * i);
+ }
+ }
+ else
+ {
+ if (tmpitems)
+ tmpitems[remaining] = items[i];
+ remaining++;
+ }
+ }
+
+ *nremaining = remaining;
+ return tmpitems;
+}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 3db32e8..301c019 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -29,6 +29,8 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
+static void _bt_savePostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static Buffer _bt_walk_left(Relation rel, Buffer buf);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
@@ -1161,6 +1163,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
int itemIndex;
IndexTuple itup;
bool continuescan;
+ int i;
/*
* We must have the buffer pinned and locked, but the usual macro can't be
@@ -1195,6 +1198,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
+ so->currPos.prevTupleOffset = 0;
/*
* Now that the current page has been made consistent, the macro should be
@@ -1215,8 +1219,19 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (itup != NULL)
{
/* tuple passes all scan key conditions, so remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
- itemIndex++;
+ if (BtreeTupleIsPosting(itup))
+ {
+ for (i = 0; i < BtreeGetNPosting(itup); i++)
+ {
+ _bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+ itemIndex++;
+ }
+ }
+ else
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
}
if (!continuescan)
{
@@ -1228,7 +1243,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
offnum = OffsetNumberNext(offnum);
}
- Assert(itemIndex <= MaxIndexTuplesPerPage);
+ Assert(itemIndex <= MaxPackedIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
@@ -1236,7 +1251,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
else
{
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ itemIndex = MaxPackedIndexTuplesPerPage;
offnum = Min(offnum, maxoff);
@@ -1246,8 +1261,20 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (itup != NULL)
{
/* tuple passes all scan key conditions, so remember it */
- itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ if (BtreeTupleIsPosting(itup))
+ {
+ for (i = 0; i < BtreeGetNPosting(itup); i++)
+ {
+ itemIndex--;
+ _bt_savePostingitem(so, itemIndex, offnum, BtreeGetPostingN(itup, i), itup, i);
+ }
+ }
+ else
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+
}
if (!continuescan)
{
@@ -1261,8 +1288,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
- so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->currPos.lastItem = MaxPackedIndexTuplesPerPage - 1;
+ so->currPos.itemIndex = MaxPackedIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1288,6 +1315,37 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
}
/*
+ * Save an index item into so->currPos.items[itemIndex]
+ * Performing index-only scan, handle the first elem separately.
+ * Save the key once, and connect it with posting tids using tupleOffset.
+ */
+static void
+_bt_savePostingitem(BTScanOpaque so, int itemIndex,
+ OffsetNumber offnum, ItemPointer iptr, IndexTuple itup, int i)
+{
+ BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+ currItem->heapTid = *iptr;
+ currItem->indexOffset = offnum;
+
+ if (so->currTuples)
+ {
+ if (i == 0)
+ {
+ /* save key. the same for all tuples in the posting */
+ Size itupsz = BtreeGetPostingOffset(itup);
+ currItem->tupleOffset = so->currPos.nextTupleOffset;
+ memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+ so->currPos.nextTupleOffset += MAXALIGN(itupsz);
+ so->currPos.prevTupleOffset = currItem->tupleOffset;
+ }
+ else
+ currItem->tupleOffset = so->currPos.prevTupleOffset;
+ }
+}
+
+
+/*
* _bt_steppage() -- Step to next page containing valid data for scan
*
* On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 99a014e..e46930b 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -75,7 +75,7 @@
#include "utils/rel.h"
#include "utils/sortsupport.h"
#include "utils/tuplesort.h"
-
+#include "catalog/catalog.h"
/*
* Status record for spooling/sorting phase. (Note we may have two of
@@ -136,6 +136,9 @@ static void _bt_sortaddtup(Page page, Size itemsize,
static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
IndexTuple itup);
static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static SortSupport _bt_prepare_SortSupport(BTWriteState *wstate, int keysz);
+static int _bt_call_comparator(SortSupport sortKeys, int i,
+ IndexTuple itup, IndexTuple itup2, TupleDesc tupdes);
static void _bt_load(BTWriteState *wstate,
BTSpool *btspool, BTSpool *btspool2);
@@ -527,15 +530,120 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(last_off > P_FIRSTKEY);
ii = PageGetItemId(opage, last_off);
oitup = (IndexTuple) PageGetItem(opage, ii);
- _bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
/*
- * Move 'last' into the high key position on opage
+ * If the item is PostingTuple, we can cut it, because HIKEY
+ * is not considered as real data, and it need not to keep any
+ * ItemPointerData at all. And of course it need not to keep
+ * a list of ipd.
+ * But, if it had a big posting list, there will be plenty of
+ * free space on the opage. In that case we must split posting
+ * tuple into 2 pieces.
*/
- hii = PageGetItemId(opage, P_HIKEY);
- *hii = *ii;
- ItemIdSetUnused(ii); /* redundant */
- ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+ if (BtreeTupleIsPosting(oitup))
+ {
+ IndexTuple keytup;
+ Size keytupsz;
+ int nipd,
+ ntocut,
+ ntoleave;
+
+ nipd = BtreeGetNPosting(oitup);
+ ntocut = (sizeof(ItemIdData) + BtreeGetPostingOffset(oitup))/sizeof(ItemPointerData);
+ ntocut++; /* round up to be sure that we cut enough */
+ ntoleave = nipd - ntocut;
+
+ /*
+ * 0) Form key tuple, that doesn't contain any ipd.
+ * NOTE: key tuple will have blkno & offset suitable for P_HIKEY.
+ * any function that uses keytup should handle them itself.
+ */
+ keytupsz = BtreeGetPostingOffset(oitup);
+ keytup = palloc0(keytupsz);
+ memcpy (keytup, oitup, keytupsz);
+ keytup->t_info &= ~INDEX_SIZE_MASK;
+ keytup->t_info |= keytupsz;
+ ItemPointerSet(&(keytup->t_tid), oblkno, P_HIKEY);
+
+ if (ntocut < nipd)
+ {
+ ItemPointerData *newipd;
+ IndexTuple newitup,
+ newlasttup;
+ /*
+ * 1) Cut part of old tuple to shift to npage.
+ * And insert it as P_FIRSTKEY.
+ * This tuple is based on keytup.
+ * Blkno & offnum are reset in BtreeFormPackedTuple.
+ */
+ newipd = palloc0(sizeof(ItemPointerData)*ntocut);
+ /* Note, that we cut last 'ntocut' items */
+ memcpy(newipd, BtreeGetPosting(oitup)+ntoleave, sizeof(ItemPointerData)*ntocut);
+ newitup = BtreeFormPackedTuple(keytup, newipd, ntocut);
+
+ _bt_sortaddtup(npage, IndexTupleSize(newitup), newitup, P_FIRSTKEY);
+ pfree(newipd);
+ pfree(newitup);
+
+ /*
+ * 2) set last item to the P_HIKEY linp
+ * Move 'last' into the high key position on opage
+ * NOTE: Do this because of indextuple deletion algorithm, which
+ * doesn't allow to delete an item while we have unused one before it.
+ */
+ hii = PageGetItemId(opage, P_HIKEY);
+ *hii = *ii;
+ ItemIdSetUnused(ii); /* redundant */
+ ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+ /* 3) delete "wrong" high key, insert keytup as P_HIKEY. */
+ _bt_pgupdtup(wstate->index, opage, P_HIKEY, keytup, false, NULL, 0);
+
+ /* 4) form the part of old tuple with ntoleave ipds. And insert it as last tuple. */
+ newlasttup = BtreeFormPackedTuple(keytup, BtreeGetPosting(oitup), ntoleave);
+
+ _bt_sortaddtup(opage, IndexTupleSize(newlasttup), newlasttup, PageGetMaxOffsetNumber(opage)+1);
+
+ pfree(newlasttup);
+ }
+ else
+ {
+ /* The tuple isn't big enough to split it. Handle it as a regular tuple. */
+
+ /*
+ * 1) Shift the last tuple to npage.
+ * Insert it as P_FIRSTKEY.
+ */
+ _bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+ /* 2) set last item to the P_HIKEY linp */
+ /* Move 'last' into the high key position on opage */
+ hii = PageGetItemId(opage, P_HIKEY);
+ *hii = *ii;
+ ItemIdSetUnused(ii); /* redundant */
+ ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+
+ /* 3) delete "wrong" high key, insert keytup as P_HIKEY. */
+ _bt_pgupdtup(wstate->index, opage, P_HIKEY, keytup, false, NULL, 0);
+
+ }
+ pfree(keytup);
+ }
+ else
+ {
+ /*
+ * 1) Shift the last tuple to npage.
+ * Insert it as P_FIRSTKEY.
+ */
+ _bt_sortaddtup(npage, ItemIdGetLength(ii), oitup, P_FIRSTKEY);
+
+ /* 2) set last item to the P_HIKEY linp */
+ /* Move 'last' into the high key position on opage */
+ hii = PageGetItemId(opage, P_HIKEY);
+ *hii = *ii;
+ ItemIdSetUnused(ii); /* redundant */
+ ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
+ }
/*
* Link the old page into its parent, using its minimum key. If we
@@ -547,6 +655,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
Assert(state->btps_minkey != NULL);
ItemPointerSet(&(state->btps_minkey->t_tid), oblkno, P_HIKEY);
+
_bt_buildadd(wstate, state->btps_next, state->btps_minkey);
pfree(state->btps_minkey);
@@ -554,8 +663,12 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
* Save a copy of the minimum key for the new page. We have to copy
* it off the old page, not the new one, in case we are not at leaf
* level.
+ * We can not just copy oitup, because it could be posting tuple
+ * and it's more safe just to get new inserted hikey.
*/
- state->btps_minkey = CopyIndexTuple(oitup);
+ ItemId iihk = PageGetItemId(opage, P_HIKEY);
+ IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk);
+ state->btps_minkey = CopyIndexTuple(hikey);
/*
* Set the sibling links for both pages.
@@ -590,7 +703,29 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
if (last_off == P_HIKEY)
{
Assert(state->btps_minkey == NULL);
- state->btps_minkey = CopyIndexTuple(itup);
+
+ if (BtreeTupleIsPosting(itup))
+ {
+ Size keytupsz;
+ IndexTuple keytup;
+
+ /*
+ * 0) Form key tuple, that doesn't contain any ipd.
+ * NOTE: key tuple will have blkno & offset suitable for P_HIKEY.
+ * any function that uses keytup should handle them itself.
+ */
+ keytupsz = BtreeGetPostingOffset(itup);
+ keytup = palloc0(keytupsz);
+ memcpy (keytup, itup, keytupsz);
+
+ keytup->t_info &= ~INDEX_SIZE_MASK;
+ keytup->t_info |= keytupsz;
+ ItemPointerSet(&(keytup->t_tid), nblkno, P_HIKEY);
+
+ state->btps_minkey = CopyIndexTuple(keytup);
+ }
+ else
+ state->btps_minkey = CopyIndexTuple(itup);
}
/*
@@ -670,6 +805,71 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
}
/*
+ * Prepare SortSupport structure for indextuples comparison
+ */
+static SortSupport
+_bt_prepare_SortSupport(BTWriteState *wstate, int keysz)
+{
+ ScanKey indexScanKey;
+ SortSupport sortKeys;
+ int i;
+
+ /* Prepare SortSupport data for each column */
+ indexScanKey = _bt_mkscankey_nodata(wstate->index);
+ sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
+
+ for (i = 0; i < keysz; i++)
+ {
+ SortSupport sortKey = sortKeys + i;
+ ScanKey scanKey = indexScanKey + i;
+ int16 strategy;
+
+ sortKey->ssup_cxt = CurrentMemoryContext;
+ sortKey->ssup_collation = scanKey->sk_collation;
+ sortKey->ssup_nulls_first =
+ (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
+ sortKey->ssup_attno = scanKey->sk_attno;
+ /* Abbreviation is not supported here */
+ sortKey->abbreviate = false;
+
+ AssertState(sortKey->ssup_attno != 0);
+
+ strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
+ BTGreaterStrategyNumber : BTLessStrategyNumber;
+
+ PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
+ }
+
+ _bt_freeskey(indexScanKey);
+ return sortKeys;
+}
+
+/*
+ * Compare two tuples using sortKey on attribute i
+ */
+static int
+_bt_call_comparator(SortSupport sortKeys, int i,
+ IndexTuple itup, IndexTuple itup2, TupleDesc tupdes)
+{
+ SortSupport entry;
+ Datum attrDatum1,
+ attrDatum2;
+ bool isNull1,
+ isNull2;
+ int32 compare;
+
+ entry = sortKeys + i - 1;
+ attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
+ attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
+
+ compare = ApplySortComparator(attrDatum1, isNull1,
+ attrDatum2, isNull2,
+ entry);
+
+ return compare;
+}
+
+/*
* Read tuples in correct sort order from tuplesort, and load them into
* btree leaves.
*/
@@ -679,16 +879,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
BTPageState *state = NULL;
bool merge = (btspool2 != NULL);
IndexTuple itup,
- itup2 = NULL;
+ itup2 = NULL,
+ itupprev = NULL;
bool should_free,
should_free2,
load1;
TupleDesc tupdes = RelationGetDescr(wstate->index);
int i,
keysz = RelationGetNumberOfAttributes(wstate->index);
- ScanKey indexScanKey = NULL;
+ int ntuples = 0;
SortSupport sortKeys;
+ /* Prepare SortSupport structure for indextuples comparison */
+ sortKeys = (SortSupport)_bt_prepare_SortSupport(wstate, keysz);
+
if (merge)
{
/*
@@ -701,34 +905,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
true, &should_free);
itup2 = tuplesort_getindextuple(btspool2->sortstate,
true, &should_free2);
- indexScanKey = _bt_mkscankey_nodata(wstate->index);
-
- /* Prepare SortSupport data for each column */
- sortKeys = (SortSupport) palloc0(keysz * sizeof(SortSupportData));
-
- for (i = 0; i < keysz; i++)
- {
- SortSupport sortKey = sortKeys + i;
- ScanKey scanKey = indexScanKey + i;
- int16 strategy;
-
- sortKey->ssup_cxt = CurrentMemoryContext;
- sortKey->ssup_collation = scanKey->sk_collation;
- sortKey->ssup_nulls_first =
- (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
- sortKey->ssup_attno = scanKey->sk_attno;
- /* Abbreviation is not supported here */
- sortKey->abbreviate = false;
-
- AssertState(sortKey->ssup_attno != 0);
-
- strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
- BTGreaterStrategyNumber : BTLessStrategyNumber;
-
- PrepareSortSupportFromIndexRel(wstate->index, strategy, sortKey);
- }
-
- _bt_freeskey(indexScanKey);
for (;;)
{
@@ -742,20 +918,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
{
for (i = 1; i <= keysz; i++)
{
- SortSupport entry;
- Datum attrDatum1,
- attrDatum2;
- bool isNull1,
- isNull2;
- int32 compare;
-
- entry = sortKeys + i - 1;
- attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
- attrDatum2 = index_getattr(itup2, i, tupdes, &isNull2);
-
- compare = ApplySortComparator(attrDatum1, isNull1,
- attrDatum2, isNull2,
- entry);
+ int32 compare = _bt_call_comparator(sortKeys, i, itup, itup2, tupdes);
+
if (compare > 0)
{
load1 = false;
@@ -794,16 +958,123 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
else
{
/* merge is unnecessary */
- while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ Relation indexRelation = wstate->index;
+ Form_pg_index index = indexRelation->rd_index;
+
+ if (IsSystemRelation(indexRelation) || index->indisunique)
+ {
+ /* Do not use compression. */
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
true, &should_free)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ state = _bt_pagestate(wstate, 0);
+
+ _bt_buildadd(wstate, state, itup);
+ if (should_free)
+ pfree(itup);
+ }
+ }
+ else
{
- /* When we see first tuple, create first index page */
- if (state == NULL)
- state = _bt_pagestate(wstate, 0);
+ ItemPointerData *ipd = NULL;
+ IndexTuple postingtuple;
+ Size maxitemsize = 0,
+ maxpostingsize = 0;
- _bt_buildadd(wstate, state, itup);
- if (should_free)
- pfree(itup);
+ while ((itup = tuplesort_getindextuple(btspool->sortstate,
+ true, &should_free)) != NULL)
+ {
+ /* When we see first tuple, create first index page */
+ if (state == NULL)
+ {
+ state = _bt_pagestate(wstate, 0);
+ maxitemsize = BTMaxItemSize(state->btps_page);
+ }
+
+ /*
+ * Compare current tuple with previous one.
+ * If tuples are equal, we can unite them into a posting list.
+ */
+ if (itupprev != NULL)
+ {
+ if (_bt_isbinaryequal(tupdes, itupprev, index->indnatts, itup))
+ {
+ /* Tuples are equal. Create or update posting */
+ if (ntuples == 0)
+ {
+ /*
+ * We haven't suitable posting list yet, so allocate
+ * it and save both itupprev and current tuple.
+ */
+ ipd = palloc0(maxitemsize);
+
+ memcpy(ipd, itupprev, sizeof(ItemPointerData));
+ ntuples++;
+ memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+ ntuples++;
+ }
+ else
+ {
+ if ((ntuples+1)*sizeof(ItemPointerData) < maxpostingsize)
+ {
+ memcpy(ipd + ntuples, itup, sizeof(ItemPointerData));
+ ntuples++;
+ }
+ else
+ {
+ postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+ _bt_buildadd(wstate, state, postingtuple);
+ ntuples = 0;
+ pfree(ipd);
+ }
+ }
+
+ }
+ else
+ {
+ /* Tuples are not equal. Insert itupprev into index. */
+ if (ntuples == 0)
+ _bt_buildadd(wstate, state, itupprev);
+ else
+ {
+ postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+ _bt_buildadd(wstate, state, postingtuple);
+ ntuples = 0;
+ pfree(ipd);
+ }
+ }
+ }
+
+ /*
+ * Copy the tuple into temp variable itupprev
+ * to compare it with the following tuple
+ * and maybe unite them into a posting tuple
+ */
+ itupprev = CopyIndexTuple(itup);
+ if (should_free)
+ pfree(itup);
+
+ /* compute max size of ipd list */
+ maxpostingsize = maxitemsize - IndexInfoFindDataOffset(itupprev->t_info) - MAXALIGN(IndexTupleSize(itupprev));
+ }
+
+ /* Handle the last item.*/
+ if (ntuples == 0)
+ {
+ if (itupprev != NULL)
+ _bt_buildadd(wstate, state, itupprev);
+ }
+ else
+ {
+ Assert(ipd!=NULL);
+ Assert(itupprev != NULL);
+ postingtuple = BtreeFormPackedTuple(itupprev, ipd, ntuples);
+ _bt_buildadd(wstate, state, postingtuple);
+ ntuples = 0;
+ pfree(ipd);
+ }
}
}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index c850b48..8c9dda1 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1821,7 +1821,9 @@ _bt_killitems(IndexScanDesc scan)
ItemId iid = PageGetItemId(page, offnum);
IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
- if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+ /* No microvacuum for posting tuples */
+ if (!BtreeTupleIsPosting(ituple)
+ && (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)))
{
/* found the item */
ItemIdMarkDead(iid);
@@ -2063,3 +2065,69 @@ btoptions(Datum reloptions, bool validate)
{
return default_reloptions(reloptions, validate, RELOPT_KIND_BTREE);
}
+
+/*
+ * Already have basic index tuple that contains key datum
+ */
+IndexTuple
+BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+ uint32 newsize;
+ IndexTuple itup = CopyIndexTuple(tuple);
+
+ /*
+ * Determine and store offset to the posting list.
+ */
+ newsize = IndexTupleSize(itup);
+ newsize = SHORTALIGN(newsize);
+
+ /*
+ * Set meta info about the posting list.
+ */
+ BtreeSetPostingOffset(itup, newsize);
+ BtreeSetNPosting(itup, nipd);
+ /*
+ * Add space needed for posting list, if any. Then check that the tuple
+ * won't be too big to store.
+ */
+ newsize += sizeof(ItemPointerData)*nipd;
+ newsize = MAXALIGN(newsize);
+
+ /*
+ * Resize tuple if needed
+ */
+ if (newsize != IndexTupleSize(itup))
+ {
+ itup = repalloc(itup, newsize);
+
+ /*
+ * PostgreSQL 9.3 and earlier did not clear this new space, so we
+ * might find uninitialized padding when reading tuples from disk.
+ */
+ memset((char *) itup + IndexTupleSize(itup),
+ 0, newsize - IndexTupleSize(itup));
+ /* set new size in tuple header */
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= newsize;
+ }
+
+ /*
+ * Copy data into the posting tuple
+ */
+ memcpy(BtreeGetPosting(itup), data, sizeof(ItemPointerData)*nipd);
+ return itup;
+}
+
+IndexTuple
+BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd)
+{
+ int size;
+ if (BtreeTupleIsPosting(tuple))
+ {
+ size = BtreeGetPostingOffset(tuple);
+ tuple->t_info &= ~INDEX_SIZE_MASK;
+ tuple->t_info |= size;
+ }
+
+ return BtreeFormPackedTuple(tuple, data, nipd);
+}
diff --git a/src/include/access/itup.h b/src/include/access/itup.h
index 8350fa0..3dd19c0 100644
--- a/src/include/access/itup.h
+++ b/src/include/access/itup.h
@@ -138,7 +138,6 @@ typedef IndexAttributeBitMapData *IndexAttributeBitMap;
((int) ((BLCKSZ - SizeOfPageHeaderData) / \
(MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))))
-
/* routines in indextuple.c */
extern IndexTuple index_form_tuple(TupleDesc tupleDescriptor,
Datum *values, bool *isnull);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 06822fa..dc82ce7 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -122,6 +122,15 @@ typedef struct BTMetaPageData
MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
/*
+ * If compression is applied, the page could contain more tuples
+ * than if it has only uncompressed tuples, so we need new max value.
+ * Note that it is a rough upper estimate.
+ */
+#define MaxPackedIndexTuplesPerPage \
+ ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+ (sizeof(ItemPointerData))))
+
+/*
* The leaf-page fillfactor defaults to 90% but is user-adjustable.
* For pages above the leaf level, we use a fixed 70% fillfactor.
* The fillfactor is applied during index build and when splitting
@@ -538,6 +547,8 @@ typedef struct BTScanPosData
* location in the associated tuple storage workspace.
*/
int nextTupleOffset;
+ /* prevTupleOffset is for Posting list handling*/
+ int prevTupleOffset;
/*
* The items array is always ordered in index order (ie, increasing
@@ -550,7 +561,7 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
- BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+ BTScanPosItem items[MaxPackedIndexTuplesPerPage]; /* MUST BE LAST */
} BTScanPosData;
typedef BTScanPosData *BTScanPos;
@@ -651,6 +662,27 @@ typedef BTScanOpaqueData *BTScanOpaque;
#define SK_BT_DESC (INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
#define SK_BT_NULLS_FIRST (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
+
+/*
+ * We use our own ItemPointerGet(BlockNumber|OffsetNumber)
+ * to avoid Asserts, since sometimes the ip_posid isn't "valid"
+ */
+#define BtreeItemPointerGetBlockNumber(pointer) \
+ BlockIdGetBlockNumber(&(pointer)->ip_blkid)
+
+#define BtreeItemPointerGetOffsetNumber(pointer) \
+ ((pointer)->ip_posid)
+
+#define BT_POSTING (1<<31)
+#define BtreeGetNPosting(itup) BtreeItemPointerGetOffsetNumber(&(itup)->t_tid)
+#define BtreeSetNPosting(itup,n) ItemPointerSetOffsetNumber(&(itup)->t_tid,n)
+
+#define BtreeGetPostingOffset(itup) (BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & (~BT_POSTING))
+#define BtreeSetPostingOffset(itup,n) ItemPointerSetBlockNumber(&(itup)->t_tid,(n)|BT_POSTING)
+#define BtreeTupleIsPosting(itup) (BtreeItemPointerGetBlockNumber(&(itup)->t_tid) & BT_POSTING)
+#define BtreeGetPosting(itup) (ItemPointerData*) ((char*)(itup) + BtreeGetPostingOffset(itup))
+#define BtreeGetPostingN(itup,n) (ItemPointerData*) (BtreeGetPosting(itup) + n)
+
/*
* prototypes for functions in nbtree.c (external entry points for btree)
*/
@@ -684,6 +716,9 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, Relation heapRel);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
+extern void _bt_pgupdtup(Relation rel, Page page, OffsetNumber offset, IndexTuple itup,
+ bool concat, IndexTuple olditup, int nipd);
+extern bool _bt_isbinaryequal(TupleDesc itupdesc, IndexTuple itup, int nindatts, IndexTuple ituptoinsert);
/*
* prototypes for functions in nbtpage.c
@@ -715,8 +750,8 @@ extern BTStack _bt_search(Relation rel,
extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
int access);
-extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
- ScanKey scankey, bool nextkey);
+extern OffsetNumber _bt_binsrch( Relation rel, Buffer buf, int keysz,
+ ScanKey scankey, bool nextkey);
extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
@@ -747,6 +782,8 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern IndexTuple BtreeFormPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
+extern IndexTuple BtreeReformPackedTuple(IndexTuple tuple, ItemPointerData *data, int nipd);
/*
* prototypes for functions in nbtvalidate.c