Hi hackers! Here is a patch with more concurrency-friendly vacuum of GIN.
===Short problem statement=== Currently GIN vacuum during cleaning posting tree can lock this tree for a long time without real need. ===Problem statement=== Vacuum of posting tree (function ginVacuumPostingTree() in ginvacuum.c ) is doing two passes through posting tree: 1. ginVacuumPostingTreeLeaves() takes LockBufferForCleanup, effectively excluding all inserts. Then it traverses down trough tree taking exclusive lock, effectively excluding all reads. On leaf level it calls ginVacuumPostingTreeLeaf(), which deletes all dead tuples. If there are any empty page, root lock is not relaesed, it passes to stage two. Interim: vacuum_delay_point(), which can hand for a while, holding CleanupLock on root. 2. If there are any empty pages, ginScanToDelete() scans through tree, deleting empty lead pages, then deleting empty inner pages, if they emerged after leaf page deletion. ===Proposed algorithm=== ginVacuumPostingTreeLeaves() takes SHARED locks on inner pages and EXCLUSIVE locks on leaf pages. On leaf pages it calls ginVacuumPostingTreeLeaf(). If ginVacuumPostingTreeLeaves() encounters subtree with all leafs empty, then it takes LockBufferForCleanup() on page P pointing to empty subtree. After taking lock on P, ginScanToDelete() is called for P. For every parent of P we can guarantee that this procedure will not be necessary: if P was empty subtree itself, we would pass this procedure to parent (unless P is root, than behavior is effectively equal to before-patched). ===Testing=== This patch solved a problem encountered by Evgeniy Efimkin and Vladimir Borodin from Yandex.Mail. They implemented testbed with GIN index CREATE TABLE BOX (uid bigint, lids integer[] NOT NULL CHECK (array_ndims(lids) = 1)); CREATE OR REPLACE FUNCTION ulids( i_uid bigint, i_lids integer[] ) RETURNS bigint[] AS $$ SELECT array_agg((i_uid << 32) | lid) FROM unnest(i_lids) lid; $$ LANGUAGE SQL IMMUTABLE STRICT; CREATE INDEX i_box_uid_lids ON box USING gin (ulids(uid, lids)) WITH (FASTUPDATE=OFF); Executing by a pgbench following query \setrandom uid 1 1500000 \setrandom lid 1 1000 \setrandom lid2 1 1000 \setrandom lid3 1 1000 BEGIN; insert into box values(:uid,'{:lid,:lid2,:lid3}'); insert into box values(:uid,'{}'); END; and eventually deleting some of data. This testbed showed VACUUM holding inserts for up to tenths of seconds. They claim that proposed patch made vacuum invisible in this test. Evgeniy, Vladimir, if I missed something or you have something to add, please join discussion. ===Known issues=== 1. Probably, there exists better algorithms. I could not adapt B-tree vacuum wizardy to GIN, but that does not mean it is impossible. 2. There may be more CleanUp locks taken. Under write-dense scenarios, this can lead to longer vacuum, but it should not consume more resources, just wait. 3. I have changed locks sequence during page deleteion. I think that it is safe, but comments there were in fear of inserts (excluded by CleanupLock). More details in a code of ginDeletePage(). 4. ginVacuumPostingTreeLeaves() traverses children pages after release of parent lock (event SHARED). This is safe if there is only one vacuum at a time. Is there a reason to be afraid of concurrent vacuums? I will be happy if someone points me were is a best place to read about B-tree vacuum process. Or if someone will explain me how it works in a few words. Dev process is here https://github.com/x4m/postgres_g/pull/2 Thank you for reading, I'm looking forward to hear your thought on the matter. Best regards, Andrey Borodin.
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c index a0afec4..dc28c81 100644 --- a/src/backend/access/gin/ginbtree.c +++ b/src/backend/access/gin/ginbtree.c @@ -30,7 +30,7 @@ static void ginFinishSplit(GinBtree btree, GinBtreeStack *stack, /* * Lock buffer by needed method for search. */ -static int +int ginTraverseLock(Buffer buffer, bool searchMode) { Page page; diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c index 2685a1c..bc54284 100644 --- a/src/backend/access/gin/ginvacuum.c +++ b/src/backend/access/gin/ginvacuum.c @@ -108,75 +108,17 @@ xlogVacuumPage(Relation index, Buffer buffer) PageSetLSN(page, recptr); } -static bool -ginVacuumPostingTreeLeaves(GinVacuumState *gvs, BlockNumber blkno, bool isRoot, Buffer *rootBuffer) -{ - Buffer buffer; - Page page; - bool hasVoidPage = FALSE; - MemoryContext oldCxt; - - buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno, - RBM_NORMAL, gvs->strategy); - page = BufferGetPage(buffer); - - /* - * We should be sure that we don't concurrent with inserts, insert process - * never release root page until end (but it can unlock it and lock - * again). New scan can't start but previously started ones work - * concurrently. - */ - if (isRoot) - LockBufferForCleanup(buffer); - else - LockBuffer(buffer, GIN_EXCLUSIVE); - - Assert(GinPageIsData(page)); - - if (GinPageIsLeaf(page)) - { - oldCxt = MemoryContextSwitchTo(gvs->tmpCxt); - ginVacuumPostingTreeLeaf(gvs->index, buffer, gvs); - MemoryContextSwitchTo(oldCxt); - MemoryContextReset(gvs->tmpCxt); - - /* if root is a leaf page, we don't desire further processing */ - if (!isRoot && !hasVoidPage && GinDataLeafPageIsEmpty(page)) - hasVoidPage = TRUE; - } - else - { - OffsetNumber i; - bool isChildHasVoid = FALSE; - - for (i = FirstOffsetNumber; i <= GinPageGetOpaque(page)->maxoff; i++) - { - PostingItem *pitem = GinDataPageGetPostingItem(page, i); - if (ginVacuumPostingTreeLeaves(gvs, PostingItemGetBlockNumber(pitem), FALSE, NULL)) - isChildHasVoid = TRUE; - } - - if (isChildHasVoid) - hasVoidPage = TRUE; - } +typedef struct DataPageDeleteStack +{ + struct DataPageDeleteStack *child; + struct DataPageDeleteStack *parent; - /* - * if we have root and there are empty pages in tree, then we don't - * release lock to go further processing and guarantee that tree is unused - */ - if (!(isRoot && hasVoidPage)) - { - UnlockReleaseBuffer(buffer); - } - else - { - Assert(rootBuffer); - *rootBuffer = buffer; - } + BlockNumber blkno; /* current block number */ + BlockNumber leftBlkno; /* rightest non-deleted page on left */ + bool isRoot; +} DataPageDeleteStack; - return hasVoidPage; -} /* * Delete a posting tree page. @@ -193,8 +135,16 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn BlockNumber rightlink; /* - * Lock the pages in the same order as an insertion would, to avoid + * OBSOLETE COMMENT: Lock the pages in the same order as an insertion would, to avoid * deadlocks: left, then right, then parent. + * + * AB: I will delete this comment and all following single line comments. + * they are here to highlight changes in locking + * + * This function MUST be called only if someone of parent pages hold + * exclusive cleanup lock. This guarantees that no insertions currently + * happen in this subtree. We still acquire Exclusive lock to exclude + * reads. Parent and this page is already locked. */ lBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, leftBlkno, RBM_NORMAL, gvs->strategy); @@ -204,10 +154,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn RBM_NORMAL, gvs->strategy); LockBuffer(lBuffer, GIN_EXCLUSIVE); - LockBuffer(dBuffer, GIN_EXCLUSIVE); - if (!isParentRoot) /* parent is already locked by - * LockBufferForCleanup() */ - LockBuffer(pBuffer, GIN_EXCLUSIVE); + //LockBuffer(dBuffer, GIN_EXCLUSIVE); + //if (!isParentRoot) /* parent is already locked by + // * LockBufferForCleanup() */ + // LockBuffer(pBuffer, GIN_EXCLUSIVE); START_CRIT_SECTION(); @@ -271,26 +221,18 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn PageSetLSN(BufferGetPage(lBuffer), recptr); } - if (!isParentRoot) - LockBuffer(pBuffer, GIN_UNLOCK); + //if (!isParentRoot) + // LockBuffer(pBuffer, GIN_UNLOCK); + // These comments will be deleted, explanation is upper ReleaseBuffer(pBuffer); UnlockReleaseBuffer(lBuffer); - UnlockReleaseBuffer(dBuffer); + ReleaseBuffer(dBuffer);//UnlockReleaseBuffer(dBuffer); END_CRIT_SECTION(); gvs->result->pages_deleted++; } -typedef struct DataPageDeleteStack -{ - struct DataPageDeleteStack *child; - struct DataPageDeleteStack *parent; - - BlockNumber blkno; /* current block number */ - BlockNumber leftBlkno; /* rightest non-deleted page on left */ - bool isRoot; -} DataPageDeleteStack; /* * scans posting tree and deletes empty pages @@ -324,6 +266,10 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot, buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno, RBM_NORMAL, gvs->strategy); + + if(!isRoot) + LockBuffer(buffer, GIN_EXCLUSIVE); + page = BufferGetPage(buffer); Assert(GinPageIsData(page)); @@ -358,6 +304,9 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot, } } + if(!isRoot) + LockBuffer(buffer, GIN_UNLOCK); + ReleaseBuffer(buffer); if (!meDelete) @@ -366,37 +315,124 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot, return meDelete; } -static void -ginVacuumPostingTree(GinVacuumState *gvs, BlockNumber rootBlkno) + +/* + * Scan through posting tree, delete empty tuples from leaf pages. + * Also, this function collects empty subtrees (with all empty leafs). + * For parents of these subtrees CleanUp lock is taken, then we call + * ScanToDelete. This is done for every inner page, which points to + * empty subtree. + */ +static bool +ginVacuumPostingTreeLeaves(GinVacuumState *gvs, BlockNumber blkno, bool isRoot) { - Buffer rootBuffer = InvalidBuffer; - DataPageDeleteStack root, - *ptr, - *tmp; + Buffer buffer; + Page page; + bool hasVoidPage = FALSE; + MemoryContext oldCxt; - if (ginVacuumPostingTreeLeaves(gvs, rootBlkno, TRUE, &rootBuffer) == FALSE) + buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno, + RBM_NORMAL, gvs->strategy); + page = BufferGetPage(buffer); + + ginTraverseLock(buffer,false); + + Assert(GinPageIsData(page)); + + if (GinPageIsLeaf(page)) { - Assert(rootBuffer == InvalidBuffer); - return; + oldCxt = MemoryContextSwitchTo(gvs->tmpCxt); + ginVacuumPostingTreeLeaf(gvs->index, buffer, gvs); + MemoryContextSwitchTo(oldCxt); + MemoryContextReset(gvs->tmpCxt); + + /* if root is a leaf page, we don't desire further processing */ + if (GinDataLeafPageIsEmpty(page)) + hasVoidPage = TRUE; + + UnlockReleaseBuffer(buffer); + + return hasVoidPage; } + else + { + OffsetNumber i; + bool isChildHasVoid = FALSE; + bool isAnyNonempy = FALSE; + OffsetNumber maxoff = GinPageGetOpaque(page)->maxoff; + BlockNumber* children = palloc(sizeof(BlockNumber) * (maxoff + 1)); - memset(&root, 0, sizeof(DataPageDeleteStack)); - root.leftBlkno = InvalidBlockNumber; - root.isRoot = TRUE; + /* + * Read all children BlockNumbers. + * Not sure it is safe if there are many concurrent vacuums. + */ - vacuum_delay_point(); + for (i = FirstOffsetNumber; i <= maxoff; i++) + { + PostingItem *pitem = GinDataPageGetPostingItem(page, i); - ginScanToDelete(gvs, rootBlkno, TRUE, &root, InvalidOffsetNumber); + children[i] = PostingItemGetBlockNumber(pitem); + } - ptr = root.child; - while (ptr) - { - tmp = ptr->child; - pfree(ptr); - ptr = tmp; + UnlockReleaseBuffer(buffer); + + for (i = FirstOffsetNumber; i <= maxoff; i++) + { + if (ginVacuumPostingTreeLeaves(gvs, children[i], FALSE)) + isChildHasVoid = TRUE; + else + isAnyNonempy = TRUE; + } + + pfree(children); + + vacuum_delay_point(); + + /* + * All subtree is empty - just return TRUE to indicate that parent must + * do a cleanup. Unless we are ROOT an there is way to go upper. + */ + + if(isChildHasVoid && !isAnyNonempy && !isRoot) + return TRUE; + + if(isChildHasVoid) + { + DataPageDeleteStack root, + *ptr, + *tmp; + + buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno, + RBM_NORMAL, gvs->strategy); + LockBufferForCleanup(buffer); + + memset(&root, 0, sizeof(DataPageDeleteStack)); + root.leftBlkno = InvalidBlockNumber; + root.isRoot = TRUE; + + ginScanToDelete(gvs, blkno, TRUE, &root, InvalidOffsetNumber); + + ptr = root.child; + + while (ptr) + { + tmp = ptr->child; + pfree(ptr); + ptr = tmp; + } + + UnlockReleaseBuffer(buffer); + } + + /* Here we have deleted all empty subtrees */ + return FALSE; } +} - UnlockReleaseBuffer(rootBuffer); +static void +ginVacuumPostingTree(GinVacuumState *gvs, BlockNumber rootBlkno) +{ + ginVacuumPostingTreeLeaves(gvs, rootBlkno, TRUE); } /* diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h index bf589ab..b738f47 100644 --- a/src/include/access/gin_private.h +++ b/src/include/access/gin_private.h @@ -981,4 +981,6 @@ ginCompareItemPointers(ItemPointer a, ItemPointer b) return -1; } +extern int ginTraverseLock(Buffer buffer, bool searchMode); + #endif /* GIN_PRIVATE_H */
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers