On Wed, Feb 14, 2018 at 3:59 AM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Fri, Feb 9, 2018 at 11:48 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
>> On Fri, Feb 9, 2018 at 1:36 AM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
>>> On Fri, Feb 9, 2018 at 12:45 AM, Claudio Freire <klaussfre...@gmail.com> wrote:
>>>> On Thu, Feb 8, 2018 at 1:36 AM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
>>>>> On Tue, Feb 6, 2018 at 9:51 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
>>>>>> I can look into doing 3, that *might* get rid of the need to do that
>>>>>> initial FSM vacuum, but all other intermediate vacuums are still needed.
>>>>>
>>>>> Understood. So how about this patch focuses only on making FSM vacuum
>>>>> more frequent, and leaves the initial FSM vacuum and the handling of
>>>>> cancellation cases aside? The rest can be a separate patch.
>>>>
>>>> Ok.
>>>>
>>>> Attached split patches. All but the initial FSM pass is in 1, the
>>>> initial FSM pass as in the original patch is in 2.
>>>>
>>>> I will post an alternative patch for 2 when/if I have one. In the
>>>> meanwhile, we can talk about 1.
>>>
>>> Thank you for updating the patch!
>>>
>>> +       /* Tree pruning for partial vacuums */
>>> +       if (threshold)
>>> +       {
>>> +           child_avail = fsm_get_avail(page, slot);
>>> +           if (child_avail >= threshold)
>>> +               continue;
>>> +       }
>>>
>>> I think slots in a non-leaf page might not have a correct value
>>> because fsm_set_avail doesn't propagate up beyond pages. So do we need
>>> to check the root of the child page rather than a slot in the non-leaf
>>> page? I might be missing something though.
>>
>> I'm not sure I understand what you're asking.
>>
>> Yes, a partial FSM vacuum won't correct those inconsistencies. It
>> doesn't matter though, because as long as the root node (or whichever
>> inner node we're looking at at the time) reports free space, the lookup
>> for free space will recurse and find the lower nodes, even if they
>> report more space than the inner node reported. The goal of making
>> free space discoverable will have been achieved nonetheless.
>
> For example, if a slot of an internal node of the FSM tree has a value
> greater than the threshold while the root of the leaf page has a value
> less than the threshold, we want to update the internal node of the FSM
> tree but will skip it with your patch. I wonder if this can happen.
If I understand your point correctly, that would mean that space was
consumed since the last vacuum. Partial FSM vacuums don't concern
themselves with that case; the normal free-space search algorithm handles
it, retrying with the next slot until it succeeds (or until it gives up).
IIUC, each time such a failure is found while searching, the FSM gets
updated to avoid following that link a second time. See, in fsm_search:

    /*
     * At lower level, failure can happen if the value in the upper-
     * level node didn't reflect the value on the lower page. Update
     * the upper node, to avoid falling into the same trap again, and
     * start over.
     */

>> The final FSM vacuum pass isn't partial, to finally correct all those
>> small inconsistencies.
>
> Yes, but the purpose of this patch is to prevent table bloating from
> concurrent modifications?

Yes, by *marking* unmarked space. If the slot is above the threshold,
there is already marked free space on that branch, and searches will
already descend into it, so there's no pressing need to refine the mark.
The space already marked can serve while vacuum makes further progress.
I.e., it can wait.

> Here are other review comments.
>
> +       /* Tree pruning for partial vacuums */
> +       if (threshold)
> +       {
> +           child_avail = fsm_get_avail(page, slot);
> +           if (child_avail >= threshold)
> +               continue;
> +       }
>
> 'child_avail' is a category value while 'threshold' is an actual size
> of free space.

Good catch. It was meant to be a category, but I see the public interface
of the FSM doesn't usually expose categories, so I'm fixing that.

> The logic for finding max_freespace doesn't seem to work well for
> tables with indexes, because max_freespace is not updated based on
> post-compaction free space. I think we need to get the max free space
> while vacuuming the heap (e.g. at lazy_vacuum_heap) and use it as
> the threshold.

Good point.

> +       vacuum_fsm_every_pages = nblocks / 8;
> +       vacuum_fsm_every_pages = Max(vacuum_fsm_every_pages, 1048576);
>
> I think a comment for 1048576 is needed. And perhaps we can define it
> as a macro.

1M pages = 8GB with an 8KB page size. I can clarify.

Attached are new versions of the patches with these corrections.
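To make the pruning rule easier to reason about outside the patch context,
here is a minimal, self-contained sketch of the same idea. It is not
PostgreSQL code (the real FSM is a three-level tree of on-disk pages, and
this toy uses an in-memory binary tree instead; the names toy_fsm_vacuum
and Node are invented for illustration). The rule is the one under
discussion: a child that already advertises at least "threshold" is
trusted and its subtree is skipped, since searches will descend into it
anyway; a full vacuum (threshold = 0) repairs every stale slot. Note that
a stale slot inside a skipped subtree stays stale until the final
non-partial pass, which is exactly the "it can wait" case above.

/* toy_fsm_vacuum.c -- simplified model of a tree-pruned (partial) FSM vacuum */
#include <stdio.h>
#include <stdint.h>

typedef struct Node
{
    uint8_t      avail;     /* cached max free-space category of this subtree */
    struct Node *left;      /* both NULL for leaves (heap pages in the real FSM) */
    struct Node *right;
} Node;

/*
 * Recompute cached values bottom-up, but don't recurse into children that
 * already advertise at least "threshold".  Returns the node's repaired value.
 */
static uint8_t
toy_fsm_vacuum(Node *node, uint8_t threshold)
{
    Node       *children[2];
    uint8_t     max_avail = 0;
    int         i;

    if (node->left == NULL)
        return node->avail;     /* leaf: the stored value is authoritative */

    children[0] = node->left;
    children[1] = node->right;

    for (i = 0; i < 2; i++)
    {
        uint8_t     child_avail;

        /* Tree pruning for partial vacuums: enough space already visible */
        if (threshold != 0 && children[i]->avail >= threshold)
            child_avail = children[i]->avail;
        else
            child_avail = toy_fsm_vacuum(children[i], threshold);

        if (child_avail > max_avail)
            max_avail = child_avail;
    }

    node->avail = max_avail;    /* repair a stale cached value */
    return max_avail;
}

int
main(void)
{
    /* Left branch: the leaf has plenty of free space, but its parent is
     * stale (0), as if a vacuum was cancelled before updating it. */
    Node    lleaf = {200, NULL, NULL};
    Node    rleaf = {10, NULL, NULL};
    Node    lower = {0, &lleaf, &rleaf};

    /* Right branch: already correctly marked, so it gets pruned. */
    Node    aleaf = {150, NULL, NULL};
    Node    bleaf = {20, NULL, NULL};
    Node    upper = {150, &aleaf, &bleaf};

    Node    root = {150, &lower, &upper};

    /* Partial vacuum with threshold 100: repairs the left branch,
     * skips the right one.  Prints: root=200 left=200 right=150 */
    toy_fsm_vacuum(&root, 100);
    printf("after partial vacuum: root=%u left=%u right=%u\n",
           (unsigned) root.avail, (unsigned) lower.avail, (unsigned) upper.avail);

    return 0;
}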
From 0587d8fb9855314bbde73297856fe2116d3310a5 Mon Sep 17 00:00:00 2001 From: Claudio Freire <klaussfre...@gmail.com> Date: Thu, 8 Feb 2018 12:39:15 -0300 Subject: [PATCH 2/2] Vacuum: do a partial FSM vacuum at the beginning A partial FSM vacuum right at the start of the vacuum process helps ameliorate issues with cancelled autovacuums never updating the FSM. --- src/backend/commands/vacuumlazy.c | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c index b3d9552..68f9a6c 100644 --- a/src/backend/commands/vacuumlazy.c +++ b/src/backend/commands/vacuumlazy.c @@ -117,6 +117,18 @@ */ #define PREFETCH_SIZE ((BlockNumber) 32) +/* + * If autovacuums get regularly cancelled, the FSM may never be + * vacuumed. To work around that, we perform an initial partial + * FSM vacuum at the beginning of the vacuum process, to quickly + * make existing unmarked free space visible. To avoid useless + * redundant work, however, we avoid recursing into branches + * that already have a set amount of free space, we only try + * to discover unmarked free space. This value controls how + * much free space is enough free space in that context. + */ +#define INITIAL_PARTIAL_FSM_VACUUM_THRESHOLD ((Size) BLCKSZ/4) + typedef struct LVRelStats { /* hasindex = true means two-pass strategy; false means one-pass */ @@ -264,6 +276,17 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params, vacrelstats->pages_removed = 0; vacrelstats->lock_waiter_detected = false; + /* + * Vacuum the Free Space Map partially before we start. + * If an earlier vacuum was canceled, and that's likely in + * highly contended tables, we may need to finish up. If we do + * it now, we make the space visible to other backends regardless + * of whether we succeed in finishing this time around. + * Don't bother checking branches that already have usable space, + * though. + */ + FreeSpaceMapVacuum(onerel, INITIAL_PARTIAL_FSM_VACUUM_THRESHOLD); + /* Open all indexes of the relation */ vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel); vacrelstats->hasindex = (nindexes > 0); -- 1.8.4.5
From 00d98ac77e5db2ce9992a8c18fae815fa8012dce Mon Sep 17 00:00:00 2001 From: Claudio Freire <klaussfre...@gmail.com> Date: Fri, 28 Jul 2017 21:31:59 -0300 Subject: [PATCH 1/2] Vacuum: Update FSM more frequently Make Vacuum update the FSM more frequently, to avoid the case where autovacuum fails to reach the point where it updates the FSM in highly contended tables. Introduce a tree pruning threshold to FreeSpaceMapVacuum that avoids recursing into branches that already contain enough free space, to avoid having to traverse the whole FSM and thus induce quadratic costs. Intermediate FSM vacuums are only supposed to make enough free space visible to avoid extension until the final (non-partial) FSM vacuum. Run partial FSM vacuums after each index pass, which is a point at which whole ranges of the heap have been thorougly cleaned, and we can expect no further updates to those ranges of the FSM save for concurrent activity. When there are no indexes, and thus no index passes, run partial FSM vacuums every 8GB of dirtied pages or 1/8th of the relation, whichever is highest. This allows some partial work to be made visible without incurring quadratic cost. In any case, FSM are small in relation to the table, so even when quadratic cost is involved, it should not be problematic. Index passes already incur quadratic cost, and the addition of the FSM is unlikely to be measurable. Run a partial FSM vacuum with a low pruning threshold right at the beginning of the VACUUM to finish up any work left over from prior canceled vacuum runs, something that is common in highly contended tables when running only autovacuum. --- src/backend/access/brin/brin.c | 2 +- src/backend/access/brin/brin_pageops.c | 10 ++-- src/backend/commands/vacuumlazy.c | 80 ++++++++++++++++++++++++++++--- src/backend/storage/freespace/freespace.c | 29 ++++++++--- src/backend/storage/freespace/indexfsm.c | 2 +- src/include/storage/freespace.h | 2 +- 6 files changed, 105 insertions(+), 20 deletions(-) diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c index 68b3371..24d6df7 100644 --- a/src/backend/access/brin/brin.c +++ b/src/backend/access/brin/brin.c @@ -1461,5 +1461,5 @@ brin_vacuum_scan(Relation idxrel, BufferAccessStrategy strategy) * the way to the top. 
*/ if (vacuum_fsm) - FreeSpaceMapVacuum(idxrel); + FreeSpaceMapVacuum(idxrel, 0); } diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c index 60a7025..d4c7a87 100644 --- a/src/backend/access/brin/brin_pageops.c +++ b/src/backend/access/brin/brin_pageops.c @@ -136,7 +136,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange, brin_initialize_empty_new_buffer(idxrel, newbuf); UnlockReleaseBuffer(newbuf); if (extended) - FreeSpaceMapVacuum(idxrel); + FreeSpaceMapVacuum(idxrel, 0); } return false; } @@ -156,7 +156,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange, brin_initialize_empty_new_buffer(idxrel, newbuf); UnlockReleaseBuffer(newbuf); if (extended) - FreeSpaceMapVacuum(idxrel); + FreeSpaceMapVacuum(idxrel, 0); } return false; } @@ -211,7 +211,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange, LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK); if (extended) - FreeSpaceMapVacuum(idxrel); + FreeSpaceMapVacuum(idxrel, 0); return true; } @@ -313,7 +313,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange, { Assert(BlockNumberIsValid(newblk)); RecordPageWithFreeSpace(idxrel, newblk, freespace); - FreeSpaceMapVacuum(idxrel); + FreeSpaceMapVacuum(idxrel, 0); } return true; @@ -457,7 +457,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange, LockBuffer(revmapbuf, BUFFER_LOCK_UNLOCK); if (extended) - FreeSpaceMapVacuum(idxrel); + FreeSpaceMapVacuum(idxrel, 0); return off; } diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c index cf7f5e1..b3d9552 100644 --- a/src/backend/commands/vacuumlazy.c +++ b/src/backend/commands/vacuumlazy.c @@ -85,6 +85,20 @@ #define VACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* ms */ /* + * When a table has no indexes, vacuum the FSM at most every 1/Nth + * of the relation has been vacuumed to prevent bloat during long-running + * vacuums. This specifies the N. + */ +#define VACUUM_FSM_EVERY_FRACTION 8 + +/* + * When a table has no indexes, vacuum the FSM at most every this + * many dirty pages. With a default page size of 8kb, this value + * basically means 8GB of dirtied pages. + */ +#define VACUUM_FSM_EVERY_PAGES 1048576 + +/* * Guesstimation of number of dead tuples per page. This is used to * provide an upper limit to memory allocated when vacuuming small * tables. @@ -146,7 +160,7 @@ static BufferAccessStrategy vac_strategy; static void lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats, Relation *Irel, int nindexes, bool aggressive); -static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats); +static Size lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats); static bool lazy_check_needs_freeze(Buffer buf, bool *hastup); static void lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats, @@ -287,7 +301,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params, PROGRESS_VACUUM_PHASE_FINAL_CLEANUP); /* Vacuum the Free Space Map */ - FreeSpaceMapVacuum(onerel); + FreeSpaceMapVacuum(onerel, 0); /* * Update statistics in pg_class. 
@@ -470,7 +484,9 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats, TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid; TransactionId relminmxid = onerel->rd_rel->relminmxid; BlockNumber empty_pages, - vacuumed_pages; + vacuumed_pages, + vacuumed_pages_at_fsm_vac, + vacuum_fsm_every_pages; double num_tuples, tups_vacuumed, nkeep, @@ -480,6 +496,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats, PGRUsage ru0; Buffer vmbuffer = InvalidBuffer; BlockNumber next_unskippable_block; + Size max_freespace = 0; bool skipping_blocks; xl_heap_freeze_tuple *frozen; StringInfoData buf; @@ -504,7 +521,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats, get_namespace_name(RelationGetNamespace(onerel)), relname))); - empty_pages = vacuumed_pages = 0; + empty_pages = vacuumed_pages = vacuumed_pages_at_fsm_vac = 0; num_tuples = tups_vacuumed = nkeep = nunused = 0; indstats = (IndexBulkDeleteResult **) @@ -517,6 +534,16 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats, vacrelstats->nonempty_pages = 0; vacrelstats->latestRemovedXid = InvalidTransactionId; + /* + * Vacuum the FSM a few times in the middle if the relation is big + * and has no indexes. Once every some amount of dirtied pages, or + * fraction of the relation, whatever is bigger, to avoid quadratic cost. + * If it has indexes, this is ignored, and the FSM is vacuumed after + * each index pass. + */ + vacuum_fsm_every_pages = nblocks / VACUUM_FSM_EVERY_FRACTION; + vacuum_fsm_every_pages = Max(vacuum_fsm_every_pages, VACUUM_FSM_EVERY_PAGES); + lazy_space_alloc(vacrelstats, nblocks); frozen = palloc(sizeof(xl_heap_freeze_tuple) * MaxHeapTuplesPerPage); @@ -746,7 +773,9 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats, pgstat_progress_update_multi_param(2, hvp_index, hvp_val); /* Remove tuples from heap */ - lazy_vacuum_heap(onerel, vacrelstats); + freespace = lazy_vacuum_heap(onerel, vacrelstats); + if (freespace > max_freespace) + max_freespace = freespace; /* * Forget the now-vacuumed tuples, and press on, but be careful @@ -756,6 +785,14 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats, vacrelstats->num_dead_tuples = 0; vacrelstats->num_index_scans++; + /* + * Vacuum the Free Space Map to make the changes we made visible. + * Don't recurse into branches with more than max_freespace, + * as we can't set it higher already. + */ + FreeSpaceMapVacuum(onerel, max_freespace); + max_freespace = 0; + /* Report that we are once again scanning the heap */ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE, PROGRESS_VACUUM_PHASE_SCAN_HEAP); @@ -878,6 +915,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats, UnlockReleaseBuffer(buf); RecordPageWithFreeSpace(onerel, blkno, freespace); + if (freespace > max_freespace) + max_freespace = freespace; continue; } @@ -917,6 +956,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats, UnlockReleaseBuffer(buf); RecordPageWithFreeSpace(onerel, blkno, freespace); + if (freespace > max_freespace) + max_freespace = freespace; continue; } @@ -1272,7 +1313,24 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats, * taken if there are no indexes.) 
*/ if (vacrelstats->num_dead_tuples == prev_dead_count) + { RecordPageWithFreeSpace(onerel, blkno, freespace); + if (freespace > max_freespace) + max_freespace = freespace; + } + + /* + * If there are no indexes then we should periodically vacuum the FSM + * on huge relations to make free space visible early. + */ + if (nindexes == 0 && + (vacuumed_pages - vacuumed_pages_at_fsm_vac) > vacuum_fsm_every_pages) + { + /* Vacuum the Free Space Map */ + FreeSpaceMapVacuum(onerel, max_freespace); + vacuumed_pages_at_fsm_vac = vacuumed_pages; + max_freespace = 0; + } } /* report that everything is scanned and vacuumed */ @@ -1392,13 +1450,16 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats, * space on their pages. Pages not having dead tuples recorded from * lazy_scan_heap are not visited at all. * + * Returns the maximum amount of free space on vacuumed pages. + * * Note: the reason for doing this as a second pass is we cannot remove * the tuples until we've removed their index entries, and we want to * process index entry removal in batches as large as possible. */ -static void +static Size lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats) { + Size max_freespace = 0; int tupindex; int npages; PGRUsage ru0; @@ -1433,6 +1494,9 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats) page = BufferGetPage(buf); freespace = PageGetHeapFreeSpace(page); + if (freespace > max_freespace) + max_freespace = freespace; + UnlockReleaseBuffer(buf); RecordPageWithFreeSpace(onerel, tblk, freespace); npages++; @@ -1449,6 +1513,8 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats) RelationGetRelationName(onerel), tupindex, npages), errdetail_internal("%s", pg_rusage_show(&ru0)))); + + return max_freespace; } /* @@ -1922,6 +1988,8 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats) * We don't insert a vacuum delay point here, because we have an * exclusive lock on the table which we want to hold for as short a * time as possible. We still need to check for interrupts however. + * We might have to acquire the autovacuum lock, however, but that + * shouldn't pose a deadlock risk. */ CHECK_FOR_INTERRUPTS(); diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c index dd8ae52..455e994 100644 --- a/src/backend/storage/freespace/freespace.c +++ b/src/backend/storage/freespace/freespace.c @@ -108,7 +108,7 @@ static Size fsm_space_cat_to_avail(uint8 cat); static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot, uint8 newValue, uint8 minValue); static BlockNumber fsm_search(Relation rel, uint8 min_cat); -static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof); +static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, uint8 threshold, bool *eof); static BlockNumber fsm_get_lastblckno(Relation rel, FSMAddress addr); static void fsm_update_recursive(Relation rel, FSMAddress addr, uint8 new_cat); @@ -376,7 +376,7 @@ FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks) * FreeSpaceMapVacuum - scan and fix any inconsistencies in the FSM */ void -FreeSpaceMapVacuum(Relation rel) +FreeSpaceMapVacuum(Relation rel, Size threshold) { bool dummy; @@ -384,7 +384,7 @@ FreeSpaceMapVacuum(Relation rel) * Traverse the tree in depth-first order. The tree is stored physically * in depth-first order, so this should be pretty I/O efficient. 
*/ - fsm_vacuum_page(rel, FSM_ROOT_ADDRESS, &dummy); + fsm_vacuum_page(rel, FSM_ROOT_ADDRESS, fsm_space_avail_to_cat(threshold), &dummy); } /******** Internal routines ********/ @@ -663,6 +663,8 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks) * If minValue > 0, the updated page is also searched for a page with at * least minValue of free space. If one is found, its slot number is * returned, -1 otherwise. + * + * If minValue == 0, the value at the root node is returned. */ static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot, @@ -687,6 +689,8 @@ fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot, addr.level == FSM_BOTTOM_LEVEL, true); } + else + newslot = fsm_get_avail(page, 0); UnlockReleaseBuffer(buf); @@ -785,7 +789,7 @@ fsm_search(Relation rel, uint8 min_cat) * Recursive guts of FreeSpaceMapVacuum */ static uint8 -fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof_p) +fsm_vacuum_page(Relation rel, FSMAddress addr, uint8 threshold, bool *eof_p) { Buffer buf; Page page; @@ -816,11 +820,19 @@ fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof_p) { int child_avail; + /* Tree pruning for partial vacuums */ + if (threshold) + { + child_avail = fsm_get_avail(page, slot); + if (child_avail >= threshold) + continue; + } + CHECK_FOR_INTERRUPTS(); /* After we hit end-of-file, just clear the rest of the slots */ if (!eof) - child_avail = fsm_vacuum_page(rel, fsm_get_child(addr, slot), &eof); + child_avail = fsm_vacuum_page(rel, fsm_get_child(addr, slot), threshold, &eof); else child_avail = 0; @@ -884,6 +896,11 @@ fsm_update_recursive(Relation rel, FSMAddress addr, uint8 new_cat) * information in that. */ parent = fsm_get_parent(addr, &parentslot); - fsm_set_and_search(rel, parent, parentslot, new_cat, 0); + new_cat = fsm_set_and_search(rel, parent, parentslot, new_cat, 0); + + /* + * Bubble up, not the value we just set, but the one now in the root + * node of the just-updated page, which is the page's highest value. + */ fsm_update_recursive(rel, parent, new_cat); } diff --git a/src/backend/storage/freespace/indexfsm.c b/src/backend/storage/freespace/indexfsm.c index e21047b..dd77a16 100644 --- a/src/backend/storage/freespace/indexfsm.c +++ b/src/backend/storage/freespace/indexfsm.c @@ -70,5 +70,5 @@ RecordUsedIndexPage(Relation rel, BlockNumber usedBlock) void IndexFreeSpaceMapVacuum(Relation rel) { - FreeSpaceMapVacuum(rel); + FreeSpaceMapVacuum(rel, 0); } diff --git a/src/include/storage/freespace.h b/src/include/storage/freespace.h index a517d7e..e8bdaa2 100644 --- a/src/include/storage/freespace.h +++ b/src/include/storage/freespace.h @@ -31,7 +31,7 @@ extern void XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk, Size spaceAvail); extern void FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks); -extern void FreeSpaceMapVacuum(Relation rel); +extern void FreeSpaceMapVacuum(Relation rel, Size threshold); extern void UpdateFreeSpaceMap(Relation rel, BlockNumber startBlkNum, BlockNumber endBlkNum, -- 1.8.4.5
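As a quick sanity check of the two constants introduced above (not part of
the attached patches, just the arithmetic behind "1M pages = 8GB", and
assuming the default BLCKSZ of 8192 bytes; non-default block sizes scale
both numbers accordingly):

/* check_fsm_constants.c -- arithmetic behind the thresholds in the patches */
#include <stdio.h>

#define BLCKSZ 8192
#define VACUUM_FSM_EVERY_PAGES 1048576                      /* patch 1/2 */
#define INITIAL_PARTIAL_FSM_VACUUM_THRESHOLD (BLCKSZ / 4)   /* patch 2/2 */

int
main(void)
{
    /* 1048576 pages * 8 KB/page = 8 GiB of dirtied pages between FSM passes */
    printf("FSM vacuum cadence: %lld bytes (%.0f GiB)\n",
           (long long) VACUUM_FSM_EVERY_PAGES * BLCKSZ,
           (double) VACUUM_FSM_EVERY_PAGES * BLCKSZ / (1024.0 * 1024 * 1024));

    /* BLCKSZ / 4 = 2048 bytes: branches already advertising this much free
     * space are skipped by the initial partial FSM vacuum. */
    printf("initial partial FSM vacuum threshold: %d bytes\n",
           INITIAL_PARTIAL_FSM_VACUUM_THRESHOLD);
    return 0;
}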