Here is v1 of the patch. It now includes changes to the README and more comments clarifying the changes to the locking model.
Let me also elaborate a bit on what is patched. The main portion of the changes is in ginVacuumPostingTreeLeaves(). Before the patch it traversed the tree depth-first, acquiring an exclusive lock on every page and removing dead tuples from the leaves, and it held a cleanup lock on the root for the whole pass. Now the function does the same DFS, but without the cleanup lock, taking only read locks on inner pages and an exclusive lock on a leaf before removing its dead tuples.

If the function finds empty leaves, it computes the minimal subtree that contains only empty pages and starts a scan for empty pages from the parent page pointing to that subtree. This scan acquires a cleanup lock on the root of the scan (not necessarily the root of the posting tree); the cleanup lock ensures that no inserts are in flight inside the subtree. The scan traverses the subtree depth-first, taking exclusive locks from left to right. For any page being deleted, every page to its left has already been locked and unlocked: new read scans cannot enter the subtree, and any old read scans have been flushed out by that lock/unlock pass.

Thus there should be no deadlock with ginStepRight(). We lock the page being deleted and then a page to its left, while ginStepRight() locks the left page and then the right page; but the presence of such a reader is excluded by the cleanup lock and by the DFS having already locked the upper and left parts of the tree. (A condensed sketch of this locking order is included right before the patch below.)

Thank you for reading this. Concurrency bothers me a lot, so if you see anything wrong or suspicious, please do not hesitate to post your thoughts.

Best regards, Andrey Borodin.
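P.S. Here is a condensed locking skeleton of the patched ginVacuumPostingTreeLeaves(), just to show the order of lock acquisitions at a glance. It is a paraphrase, not the patch code: the function name is a stand-in, the elided parts are marked with comments, and the authoritative version is the diff below.

/*
 * Locking skeleton of the patched ginVacuumPostingTreeLeaves(),
 * paraphrased for readability.  See the real code in the patch below.
 */
static bool
vacuum_subtree_sketch(GinVacuumState *gvs, BlockNumber blkno)
{
    Buffer      buffer;
    Page        page;
    bool        hasVoidSubtree = false; /* set by the recursion in the real code */

    buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno,
                                RBM_NORMAL, gvs->strategy);
    page = BufferGetPage(buffer);

    /* share lock on inner pages, exclusive lock on leaves */
    ginTraverseLock(buffer, false);

    if (GinPageIsLeaf(page))
    {
        bool        empty;

        /* ... remove dead tuples with ginVacuumPostingTreeLeaf() ... */
        empty = GinDataLeafPageIsEmpty(page);
        UnlockReleaseBuffer(buffer);
        return empty;           /* report a void leaf to the parent */
    }

    /* inner page: remember the downlinks, then drop the lock before recursing */
    UnlockReleaseBuffer(buffer);
    /* ... recurse into every child; a fully void child subtree sets hasVoidSubtree ... */

    if (hasVoidSubtree)
    {
        /*
         * Cleanup lock on the parent of the empty subtree: no insertion is
         * in flight below it and no new scan can enter it while we hold it.
         */
        buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno,
                                    RBM_NORMAL, gvs->strategy);
        LockBufferForCleanup(buffer);

        /* ... ginScanToDelete(): DFS, exclusive locks taken left to right ... */

        UnlockReleaseBuffer(buffer);
    }

    return false;               /* empty subtrees below are already handled */
}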
diff --git a/src/backend/access/gin/README b/src/backend/access/gin/README
index fade0cb..29dafce 100644
--- a/src/backend/access/gin/README
+++ b/src/backend/access/gin/README
@@ -314,10 +314,16 @@ deleted.
 The previous paragraph's reasoning only applies to searches, and only to
 posting trees. To protect from inserters following a downlink to a deleted
 page, vacuum simply locks out all concurrent insertions to the posting tree,
-by holding a super-exclusive lock on the posting tree root. Inserters hold a
-pin on the root page, but searches do not, so while new searches cannot begin
-while root page is locked, any already-in-progress scans can continue
-concurrently with vacuum. In the entry tree, we never delete pages.
+by holding a super-exclusive lock on the parent page of subtree with deletable
+pages. Inserters hold a pin on the root page, but searches do not, so while
+new searches cannot begin while root page is locked, any already-in-progress
+scans can continue concurrently with vacuum in corresponding subtree of
+posting tree. To exclude interference with readers vacuum takes exclusive
+locks in a depth-first scan in left-to-right order of page tuples. Leftmost
+page is never deleted. Thus before deleting any page we obtain exclusive
+lock on any left page, effectively excluding deadlock with any reader, despite
+taking parent lock first and not holding left lock at all.
+In the entry tree, we never delete pages.
 
 (This is quite different from the mechanism the btree indexam uses to make
 page-deletions safe; it stamps the deleted pages with an XID and keeps the
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index a0afec4..dc28c81 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -30,7 +30,7 @@ static void ginFinishSplit(GinBtree btree, GinBtreeStack *stack,
 /*
  * Lock buffer by needed method for search.
  */
-static int
+int
 ginTraverseLock(Buffer buffer, bool searchMode)
 {
     Page        page;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 2685a1c..9bd2f50 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -108,75 +108,17 @@ xlogVacuumPage(Relation index, Buffer buffer)
     PageSetLSN(page, recptr);
 }
 
-static bool
-ginVacuumPostingTreeLeaves(GinVacuumState *gvs, BlockNumber blkno, bool isRoot, Buffer *rootBuffer)
-{
-    Buffer      buffer;
-    Page        page;
-    bool        hasVoidPage = FALSE;
-    MemoryContext oldCxt;
-
-    buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno,
-                                RBM_NORMAL, gvs->strategy);
-    page = BufferGetPage(buffer);
-
-    /*
-     * We should be sure that we don't concurrent with inserts, insert process
-     * never release root page until end (but it can unlock it and lock
-     * again). New scan can't start but previously started ones work
-     * concurrently.
-     */
-    if (isRoot)
-        LockBufferForCleanup(buffer);
-    else
-        LockBuffer(buffer, GIN_EXCLUSIVE);
-
-    Assert(GinPageIsData(page));
-
-    if (GinPageIsLeaf(page))
-    {
-        oldCxt = MemoryContextSwitchTo(gvs->tmpCxt);
-        ginVacuumPostingTreeLeaf(gvs->index, buffer, gvs);
-        MemoryContextSwitchTo(oldCxt);
-        MemoryContextReset(gvs->tmpCxt);
-
-        /* if root is a leaf page, we don't desire further processing */
-        if (!isRoot && !hasVoidPage && GinDataLeafPageIsEmpty(page))
-            hasVoidPage = TRUE;
-    }
-    else
-    {
-        OffsetNumber i;
-        bool        isChildHasVoid = FALSE;
-
-        for (i = FirstOffsetNumber; i <= GinPageGetOpaque(page)->maxoff; i++)
-        {
-            PostingItem *pitem = GinDataPageGetPostingItem(page, i);
-            if (ginVacuumPostingTreeLeaves(gvs, PostingItemGetBlockNumber(pitem), FALSE, NULL))
-                isChildHasVoid = TRUE;
-        }
-
-        if (isChildHasVoid)
-            hasVoidPage = TRUE;
-    }
+typedef struct DataPageDeleteStack
+{
+    struct DataPageDeleteStack *child;
+    struct DataPageDeleteStack *parent;
 
-    /*
-     * if we have root and there are empty pages in tree, then we don't
-     * release lock to go further processing and guarantee that tree is unused
-     */
-    if (!(isRoot && hasVoidPage))
-    {
-        UnlockReleaseBuffer(buffer);
-    }
-    else
-    {
-        Assert(rootBuffer);
-        *rootBuffer = buffer;
-    }
+    BlockNumber blkno;          /* current block number */
+    BlockNumber leftBlkno;      /* rightest non-deleted page on left */
+    bool        isRoot;
+} DataPageDeleteStack;
 
-    return hasVoidPage;
-}
 
 /*
  * Delete a posting tree page.
@@ -193,8 +135,17 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
     BlockNumber rightlink;
 
     /*
-     * Lock the pages in the same order as an insertion would, to avoid
+     * OBSOLETE COMMENT: Lock the pages in the same order as an insertion would, to avoid
      * deadlocks: left, then right, then parent.
+     *
+     * AB: I will delete this comment and all following single-line comments;
+     * they are here to highlight the changes in locking.
+     *
+     * This function MUST be called only if one of the parent pages holds an
+     * exclusive cleanup lock. This guarantees that no insertions currently
+     * happen in this subtree. Caller also acquires exclusive lock on deletable
+     * page, having acquired and released exclusive lock on left page before.
+     * Left page was locked and released. Then parent and this page are locked.
      */
     lBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, leftBlkno,
                                  RBM_NORMAL, gvs->strategy);
@@ -203,11 +154,11 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
     pBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, parentBlkno,
                                  RBM_NORMAL, gvs->strategy);
 
-    LockBuffer(lBuffer, GIN_EXCLUSIVE);
-    LockBuffer(dBuffer, GIN_EXCLUSIVE);
-    if (!isParentRoot)          /* parent is already locked by
-                                 * LockBufferForCleanup() */
-        LockBuffer(pBuffer, GIN_EXCLUSIVE);
+    //LockBuffer(lBuffer, GIN_EXCLUSIVE);
+    //LockBuffer(dBuffer, GIN_EXCLUSIVE);
+    //if (!isParentRoot)        /* parent is already locked by
+    //                           * LockBufferForCleanup() */
+    //  LockBuffer(pBuffer, GIN_EXCLUSIVE);
 
     START_CRIT_SECTION();
 
@@ -271,26 +222,18 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
         PageSetLSN(BufferGetPage(lBuffer), recptr);
     }
 
-    if (!isParentRoot)
-        LockBuffer(pBuffer, GIN_UNLOCK);
+    //if (!isParentRoot)
+    //  LockBuffer(pBuffer, GIN_UNLOCK);
+    // These comments will be deleted; the explanation is above.
     ReleaseBuffer(pBuffer);
-    UnlockReleaseBuffer(lBuffer);
-    UnlockReleaseBuffer(dBuffer);
+    ReleaseBuffer(lBuffer);//UnlockReleaseBuffer(lBuffer);
+    ReleaseBuffer(dBuffer);//UnlockReleaseBuffer(dBuffer);
 
     END_CRIT_SECTION();
 
     gvs->result->pages_deleted++;
 }
 
-typedef struct DataPageDeleteStack
-{
-    struct DataPageDeleteStack *child;
-    struct DataPageDeleteStack *parent;
-
-    BlockNumber blkno;          /* current block number */
-    BlockNumber leftBlkno;      /* rightest non-deleted page on left */
-    bool        isRoot;
-} DataPageDeleteStack;
 
 /*
  * scans posting tree and deletes empty pages
@@ -324,6 +267,10 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
 
     buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno,
                                 RBM_NORMAL, gvs->strategy);
+
+    if(!isRoot)
+        LockBuffer(buffer, GIN_EXCLUSIVE);
+
     page = BufferGetPage(buffer);
 
     Assert(GinPageIsData(page));
@@ -358,6 +305,9 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
         }
     }
 
+    if(!isRoot)
+        LockBuffer(buffer, GIN_UNLOCK);
+
     ReleaseBuffer(buffer);
 
     if (!meDelete)
@@ -366,37 +316,124 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
     return meDelete;
 }
 
-static void
-ginVacuumPostingTree(GinVacuumState *gvs, BlockNumber rootBlkno)
+
+/*
+ * Scan through the posting tree, delete empty tuples from leaf pages.
+ * Also, this function collects empty subtrees (with all leafs empty).
+ * For parents of these subtrees a CleanUp lock is taken, then we call
+ * ScanToDelete. This is done for every inner page which points to an
+ * empty subtree.
+ */
+static bool
+ginVacuumPostingTreeLeaves(GinVacuumState *gvs, BlockNumber blkno, bool isRoot)
 {
-    Buffer      rootBuffer = InvalidBuffer;
-    DataPageDeleteStack root,
-               *ptr,
-               *tmp;
+    Buffer      buffer;
+    Page        page;
+    bool        hasVoidPage = FALSE;
+    MemoryContext oldCxt;
 
-    if (ginVacuumPostingTreeLeaves(gvs, rootBlkno, TRUE, &rootBuffer) == FALSE)
+    buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno,
+                                RBM_NORMAL, gvs->strategy);
+    page = BufferGetPage(buffer);
+
+    ginTraverseLock(buffer,false);
+
+    Assert(GinPageIsData(page));
+
+    if (GinPageIsLeaf(page))
     {
-        Assert(rootBuffer == InvalidBuffer);
-        return;
+        oldCxt = MemoryContextSwitchTo(gvs->tmpCxt);
+        ginVacuumPostingTreeLeaf(gvs->index, buffer, gvs);
+        MemoryContextSwitchTo(oldCxt);
+        MemoryContextReset(gvs->tmpCxt);
+
+        /* if root is a leaf page, we don't desire further processing */
+        if (GinDataLeafPageIsEmpty(page))
+            hasVoidPage = TRUE;
+
+        UnlockReleaseBuffer(buffer);
+
+        return hasVoidPage;
     }
+    else
+    {
+        OffsetNumber i;
+        bool        isChildHasVoid = FALSE;
+        bool        isAnyNonempy = FALSE;
+        OffsetNumber maxoff = GinPageGetOpaque(page)->maxoff;
+        BlockNumber* children = palloc(sizeof(BlockNumber) * (maxoff + 1));
 
-    memset(&root, 0, sizeof(DataPageDeleteStack));
-    root.leftBlkno = InvalidBlockNumber;
-    root.isRoot = TRUE;
+        /*
+         * Read all children BlockNumbers.
+         * Not sure it is safe if there are many concurrent vacuums.
+         */
 
-    vacuum_delay_point();
+        for (i = FirstOffsetNumber; i <= maxoff; i++)
+        {
+            PostingItem *pitem = GinDataPageGetPostingItem(page, i);
 
-    ginScanToDelete(gvs, rootBlkno, TRUE, &root, InvalidOffsetNumber);
+            children[i] = PostingItemGetBlockNumber(pitem);
+        }
 
-    ptr = root.child;
-    while (ptr)
-    {
-        tmp = ptr->child;
-        pfree(ptr);
-        ptr = tmp;
+        UnlockReleaseBuffer(buffer);
+
+        for (i = FirstOffsetNumber; i <= maxoff; i++)
+        {
+            if (ginVacuumPostingTreeLeaves(gvs, children[i], FALSE))
+                isChildHasVoid = TRUE;
+            else
+                isAnyNonempy = TRUE;
+        }
+
+        pfree(children);
+
+        vacuum_delay_point();
+
+        /*
+         * The whole subtree is empty - just return TRUE to indicate that the
+         * parent must do a cleanup, unless we are ROOT and there is no way to go upper.
+         */
+
+        if(isChildHasVoid && !isAnyNonempy && !isRoot)
+            return TRUE;
+
+        if(isChildHasVoid)
+        {
+            DataPageDeleteStack root,
+                       *ptr,
+                       *tmp;
+
+            buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno,
+                                        RBM_NORMAL, gvs->strategy);
+            LockBufferForCleanup(buffer);
+
+            memset(&root, 0, sizeof(DataPageDeleteStack));
+            root.leftBlkno = InvalidBlockNumber;
+            root.isRoot = TRUE;
+
+            ginScanToDelete(gvs, blkno, TRUE, &root, InvalidOffsetNumber);
+
+            ptr = root.child;
+
+            while (ptr)
+            {
+                tmp = ptr->child;
+                pfree(ptr);
+                ptr = tmp;
+            }
+
+            UnlockReleaseBuffer(buffer);
+        }
+
+        /* Here we have deleted all empty subtrees */
+        return FALSE;
     }
+}
 
-    UnlockReleaseBuffer(rootBuffer);
+static void
+ginVacuumPostingTree(GinVacuumState *gvs, BlockNumber rootBlkno)
+{
+    ginVacuumPostingTreeLeaves(gvs, rootBlkno, TRUE);
 }
 
 /*
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index bf589ab..b738f47 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -981,4 +981,6 @@ ginCompareItemPointers(ItemPointer a, ItemPointer b)
         return -1;
 }
 
+extern int ginTraverseLock(Buffer buffer, bool searchMode);
+
 #endif   /* GIN_PRIVATE_H */