Hi hackers!

Here is a patch with more concurrency-friendly vacuum of GIN.

===Short problem statement===
Currently GIN vacuum during cleaning posting tree can lock this tree
for a long time without real need.

===Problem statement===
Vacuum of posting tree (function ginVacuumPostingTree() in ginvacuum.c
) is doing two passes through posting tree:

1. ginVacuumPostingTreeLeaves() takes LockBufferForCleanup,
effectively excluding all inserts. Then it traverses down trough tree
taking exclusive lock, effectively excluding all reads. On leaf level
it calls ginVacuumPostingTreeLeaf(), which deletes all dead tuples. If
there are any empty page, root lock is not relaesed, it passes to
stage two.

Interim: vacuum_delay_point(), which can hand for a while, holding
CleanupLock on root.

2. If there are any empty pages, ginScanToDelete() scans through tree,
deleting empty lead pages, then deleting empty inner pages, if they
emerged after leaf page deletion.


===Proposed algorithm===
ginVacuumPostingTreeLeaves() takes SHARED locks on inner pages and
EXCLUSIVE locks on leaf pages. On leaf pages it calls
ginVacuumPostingTreeLeaf().

If ginVacuumPostingTreeLeaves() encounters subtree with all leafs
empty, then it takes LockBufferForCleanup() on page P pointing to
empty subtree. After taking lock on P, ginScanToDelete() is called for
P. For every parent of P we can guarantee that this procedure will not
be necessary: if P was empty subtree itself, we would pass this
procedure to parent (unless P is root, than behavior is effectively
equal to before-patched).


===Testing===
This patch solved a problem encountered by Evgeniy Efimkin and
Vladimir Borodin from Yandex.Mail.
They implemented testbed with GIN index

CREATE TABLE BOX (uid bigint,
                  lids integer[] NOT NULL
                  CHECK (array_ndims(lids) = 1));

CREATE OR REPLACE FUNCTION ulids(
    i_uid bigint,
    i_lids integer[]
) RETURNS bigint[] AS $$
    SELECT array_agg((i_uid << 32) | lid)
      FROM unnest(i_lids) lid;
$$ LANGUAGE SQL IMMUTABLE STRICT;

CREATE INDEX i_box_uid_lids
    ON box USING gin (ulids(uid, lids)) WITH (FASTUPDATE=OFF);

Executing by a pgbench following query

\setrandom uid 1 1500000
\setrandom lid 1 1000
\setrandom lid2 1 1000
\setrandom lid3 1 1000
BEGIN;
insert into box values(:uid,'{:lid,:lid2,:lid3}');
insert into box values(:uid,'{}');
END;

and eventually deleting some of data. This testbed showed VACUUM
holding inserts for up to tenths of seconds. They claim that proposed
patch made vacuum invisible in this test.

Evgeniy, Vladimir, if I missed something or you have something to add,
please join discussion.

===Known issues===
1. Probably, there exists better algorithms. I could not adapt B-tree
vacuum wizardy to GIN, but that does not mean it is impossible.

2. There may be more CleanUp locks taken. Under write-dense scenarios,
this can lead to longer vacuum, but it should not consume more
resources, just wait.

3. I have changed locks sequence during page deleteion. I think that
it is safe, but comments there were in fear of inserts (excluded by
CleanupLock). More details in a code of ginDeletePage().

4. ginVacuumPostingTreeLeaves() traverses children pages after release
of parent lock (event SHARED). This is safe if there is only one
vacuum at a time. Is there a reason to be afraid of concurrent
vacuums?


I will be happy if someone points me were is a best place to read
about B-tree vacuum process. Or if someone will explain me how it
works in a few words.
Dev process is here
https://github.com/x4m/postgres_g/pull/2

Thank you for reading, I'm looking forward to hear your thought on the matter.


Best regards, Andrey Borodin.
diff --git a/src/backend/access/gin/ginbtree.c 
b/src/backend/access/gin/ginbtree.c
index a0afec4..dc28c81 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -30,7 +30,7 @@ static void ginFinishSplit(GinBtree btree, GinBtreeStack 
*stack,
 /*
  * Lock buffer by needed method for search.
  */
-static int
+int
 ginTraverseLock(Buffer buffer, bool searchMode)
 {
        Page            page;
diff --git a/src/backend/access/gin/ginvacuum.c 
b/src/backend/access/gin/ginvacuum.c
index 2685a1c..bc54284 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -108,75 +108,17 @@ xlogVacuumPage(Relation index, Buffer buffer)
        PageSetLSN(page, recptr);
 }
 
-static bool
-ginVacuumPostingTreeLeaves(GinVacuumState *gvs, BlockNumber blkno, bool 
isRoot, Buffer *rootBuffer)
-{
-       Buffer          buffer;
-       Page            page;
-       bool            hasVoidPage = FALSE;
-       MemoryContext oldCxt;
-
-       buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno,
-                                                               RBM_NORMAL, 
gvs->strategy);
-       page = BufferGetPage(buffer);
-
-       /*
-        * We should be sure that we don't concurrent with inserts, insert 
process
-        * never release root page until end (but it can unlock it and lock
-        * again). New scan can't start but previously started ones work
-        * concurrently.
-        */
-       if (isRoot)
-               LockBufferForCleanup(buffer);
-       else
-               LockBuffer(buffer, GIN_EXCLUSIVE);
-
-       Assert(GinPageIsData(page));
-
-       if (GinPageIsLeaf(page))
-       {
-               oldCxt = MemoryContextSwitchTo(gvs->tmpCxt);
-               ginVacuumPostingTreeLeaf(gvs->index, buffer, gvs);
-               MemoryContextSwitchTo(oldCxt);
-               MemoryContextReset(gvs->tmpCxt);
-
-               /* if root is a leaf page, we don't desire further processing */
-               if (!isRoot && !hasVoidPage && GinDataLeafPageIsEmpty(page))
-                       hasVoidPage = TRUE;
-       }
-       else
-       {
-               OffsetNumber i;
-               bool            isChildHasVoid = FALSE;
-
-               for (i = FirstOffsetNumber; i <= 
GinPageGetOpaque(page)->maxoff; i++)
-               {
-                       PostingItem *pitem = GinDataPageGetPostingItem(page, i);
 
-                       if (ginVacuumPostingTreeLeaves(gvs, 
PostingItemGetBlockNumber(pitem), FALSE, NULL))
-                               isChildHasVoid = TRUE;
-               }
-
-               if (isChildHasVoid)
-                       hasVoidPage = TRUE;
-       }
+typedef struct DataPageDeleteStack
+{
+       struct DataPageDeleteStack *child;
+       struct DataPageDeleteStack *parent;
 
-       /*
-        * if we have root and there are empty pages in tree, then we don't
-        * release lock to go further processing and guarantee that tree is 
unused
-        */
-       if (!(isRoot && hasVoidPage))
-       {
-               UnlockReleaseBuffer(buffer);
-       }
-       else
-       {
-               Assert(rootBuffer);
-               *rootBuffer = buffer;
-       }
+       BlockNumber blkno;                      /* current block number */
+       BlockNumber leftBlkno;          /* rightest non-deleted page on left */
+       bool            isRoot;
+} DataPageDeleteStack;
 
-       return hasVoidPage;
-}
 
 /*
  * Delete a posting tree page.
@@ -193,8 +135,16 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber 
deleteBlkno, BlockNumber leftBlkn
        BlockNumber rightlink;
 
        /*
-        * Lock the pages in the same order as an insertion would, to avoid
+        * OBSOLETE COMMENT: Lock the pages in the same order as an insertion 
would, to avoid
         * deadlocks: left, then right, then parent.
+        *
+        * AB: I will delete this comment and all following single line 
comments.
+        * they are here to highlight changes in locking
+        *
+        * This function MUST be called only if someone of parent pages hold
+        * exclusive cleanup lock. This guarantees that no insertions currently
+        * happen in this subtree. We still acquire Exclusive lock to exclude
+        * reads. Parent and this page is already locked.
         */
        lBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, leftBlkno,
                                                                 RBM_NORMAL, 
gvs->strategy);
@@ -204,10 +154,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber 
deleteBlkno, BlockNumber leftBlkn
                                                                 RBM_NORMAL, 
gvs->strategy);
 
        LockBuffer(lBuffer, GIN_EXCLUSIVE);
-       LockBuffer(dBuffer, GIN_EXCLUSIVE);
-       if (!isParentRoot)                      /* parent is already locked by
-                                                                * 
LockBufferForCleanup() */
-               LockBuffer(pBuffer, GIN_EXCLUSIVE);
+       //LockBuffer(dBuffer, GIN_EXCLUSIVE);
+       //if (!isParentRoot)                    /* parent is already locked by
+       //                                                       * 
LockBufferForCleanup() */
+       //      LockBuffer(pBuffer, GIN_EXCLUSIVE);
 
        START_CRIT_SECTION();
 
@@ -271,26 +221,18 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber 
deleteBlkno, BlockNumber leftBlkn
                PageSetLSN(BufferGetPage(lBuffer), recptr);
        }
 
-       if (!isParentRoot)
-               LockBuffer(pBuffer, GIN_UNLOCK);
+       //if (!isParentRoot)
+       //      LockBuffer(pBuffer, GIN_UNLOCK);
+       // These comments will be deleted, explanation is upper
        ReleaseBuffer(pBuffer);
        UnlockReleaseBuffer(lBuffer);
-       UnlockReleaseBuffer(dBuffer);
+       ReleaseBuffer(dBuffer);//UnlockReleaseBuffer(dBuffer);
 
        END_CRIT_SECTION();
 
        gvs->result->pages_deleted++;
 }
 
-typedef struct DataPageDeleteStack
-{
-       struct DataPageDeleteStack *child;
-       struct DataPageDeleteStack *parent;
-
-       BlockNumber blkno;                      /* current block number */
-       BlockNumber leftBlkno;          /* rightest non-deleted page on left */
-       bool            isRoot;
-} DataPageDeleteStack;
 
 /*
  * scans posting tree and deletes empty pages
@@ -324,6 +266,10 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, 
bool isRoot,
 
        buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno,
                                                                RBM_NORMAL, 
gvs->strategy);
+
+       if(!isRoot)
+               LockBuffer(buffer, GIN_EXCLUSIVE);
+
        page = BufferGetPage(buffer);
 
        Assert(GinPageIsData(page));
@@ -358,6 +304,9 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, 
bool isRoot,
                }
        }
 
+       if(!isRoot)
+                       LockBuffer(buffer, GIN_UNLOCK);
+
        ReleaseBuffer(buffer);
 
        if (!meDelete)
@@ -366,37 +315,124 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, 
bool isRoot,
        return meDelete;
 }
 
-static void
-ginVacuumPostingTree(GinVacuumState *gvs, BlockNumber rootBlkno)
+
+/*
+ * Scan through posting tree, delete empty tuples from leaf pages.
+ * Also, this function collects empty subtrees (with all empty leafs).
+ * For parents of these subtrees CleanUp lock is taken, then we call
+ * ScanToDelete. This is done for every inner page, which points to
+ * empty subtree.
+ */
+static bool
+ginVacuumPostingTreeLeaves(GinVacuumState *gvs, BlockNumber blkno, bool isRoot)
 {
-       Buffer          rootBuffer = InvalidBuffer;
-       DataPageDeleteStack root,
-                          *ptr,
-                          *tmp;
+       Buffer          buffer;
+       Page            page;
+       bool            hasVoidPage = FALSE;
+       MemoryContext oldCxt;
 
-       if (ginVacuumPostingTreeLeaves(gvs, rootBlkno, TRUE, &rootBuffer) == 
FALSE)
+       buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno,
+                                                               RBM_NORMAL, 
gvs->strategy);
+       page = BufferGetPage(buffer);
+
+       ginTraverseLock(buffer,false);
+
+       Assert(GinPageIsData(page));
+
+       if (GinPageIsLeaf(page))
        {
-               Assert(rootBuffer == InvalidBuffer);
-               return;
+               oldCxt = MemoryContextSwitchTo(gvs->tmpCxt);
+               ginVacuumPostingTreeLeaf(gvs->index, buffer, gvs);
+               MemoryContextSwitchTo(oldCxt);
+               MemoryContextReset(gvs->tmpCxt);
+
+               /* if root is a leaf page, we don't desire further processing */
+               if (GinDataLeafPageIsEmpty(page))
+                       hasVoidPage = TRUE;
+
+               UnlockReleaseBuffer(buffer);
+
+               return hasVoidPage;
        }
+       else
+       {
+               OffsetNumber    i;
+               bool                    isChildHasVoid = FALSE;
+               bool                    isAnyNonempy = FALSE;
+               OffsetNumber    maxoff = GinPageGetOpaque(page)->maxoff;
+               BlockNumber*    children = palloc(sizeof(BlockNumber) * (maxoff 
+ 1));
 
-       memset(&root, 0, sizeof(DataPageDeleteStack));
-       root.leftBlkno = InvalidBlockNumber;
-       root.isRoot = TRUE;
+               /*
+                * Read all children BlockNumbers.
+                * Not sure it is safe if there are many concurrent vacuums.
+                */
 
-       vacuum_delay_point();
+               for (i = FirstOffsetNumber; i <= maxoff; i++)
+               {
+                       PostingItem *pitem = GinDataPageGetPostingItem(page, i);
 
-       ginScanToDelete(gvs, rootBlkno, TRUE, &root, InvalidOffsetNumber);
+                       children[i] = PostingItemGetBlockNumber(pitem);
+               }
 
-       ptr = root.child;
-       while (ptr)
-       {
-               tmp = ptr->child;
-               pfree(ptr);
-               ptr = tmp;
+               UnlockReleaseBuffer(buffer);
+
+               for (i = FirstOffsetNumber; i <= maxoff; i++)
+               {
+                       if (ginVacuumPostingTreeLeaves(gvs, children[i], FALSE))
+                               isChildHasVoid = TRUE;
+                       else
+                               isAnyNonempy = TRUE;
+               }
+
+               pfree(children);
+
+               vacuum_delay_point();
+
+               /*
+                * All subtree is empty - just return TRUE to indicate that 
parent must
+                * do a cleanup. Unless we are ROOT an there is way to go upper.
+                */
+
+               if(isChildHasVoid && !isAnyNonempy && !isRoot)
+                       return TRUE;
+
+               if(isChildHasVoid)
+               {
+                       DataPageDeleteStack root,
+                                          *ptr,
+                                          *tmp;
+
+                       buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, 
blkno,
+                                                                               
        RBM_NORMAL, gvs->strategy);
+                       LockBufferForCleanup(buffer);
+
+                       memset(&root, 0, sizeof(DataPageDeleteStack));
+                               root.leftBlkno = InvalidBlockNumber;
+                               root.isRoot = TRUE;
+
+                       ginScanToDelete(gvs, blkno, TRUE, &root, 
InvalidOffsetNumber);
+
+                       ptr = root.child;
+
+                       while (ptr)
+                       {
+                               tmp = ptr->child;
+                               pfree(ptr);
+                               ptr = tmp;
+                       }
+
+                       UnlockReleaseBuffer(buffer);
+               }
+
+               /* Here we have deleted all empty subtrees */
+               return FALSE;
        }
+}
 
-       UnlockReleaseBuffer(rootBuffer);
+static void
+ginVacuumPostingTree(GinVacuumState *gvs, BlockNumber rootBlkno)
+{
+       ginVacuumPostingTreeLeaves(gvs, rootBlkno, TRUE);
 }
 
 /*
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index bf589ab..b738f47 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -981,4 +981,6 @@ ginCompareItemPointers(ItemPointer a, ItemPointer b)
                return -1;
 }
 
+extern int ginTraverseLock(Buffer buffer, bool searchMode);
+
 #endif   /* GIN_PRIVATE_H */
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to