We recently had a customer report a very strange problem involving a very large insert-only table: without explanation, insertions would stall for several seconds, causing application timeouts, process accumulation, and other nastiness.
After some investigation, we narrowed this down to the first VACUUM on the table after a standby got promoted. The connection between these factors wasn't obvious at first, but eventually we realized that VACUUM must have been skipping a bunch of pages because they had previously been marked all-frozen, so the FSM was not updated with the correct free space figures for those pages.

The FSM pages had been transmitted as full-page images in WAL before the promotion (because of wal_log_hints), so they contained optimistic numbers about the amount of free space, coming from the previous master. (Because this only happens on the first change to that FSM page after a checkpoint, it's quite likely that one page every few thousand or so contains optimistic figures while the others remain all zeroes, or something like that.)

Before VACUUM, nothing too bad would happen, because the upper layers of the FSM would not know about those optimistic numbers. But when VACUUM does FreeSpaceMapVacuum, it propagates those numbers upwards; as soon as that happens, inserters looking for pages are told about those pages (wrongly catalogued as containing sufficient free space), go to insert there, and fail because there isn't actually any free space. They then ask the FSM for another page; lather, rinse, repeat until all those pages are catalogued correctly by the FSM, at which point things continue normally. (There are many processes doing this chase-up concurrently and it seems a pretty contentious process, about which see the last paragraph; it can be seen in pg_xlogdump that it takes several seconds for things to settle.)

After considering several possible solutions, I propose to have heap_xlog_visible compute free space for any page being marked frozen; Pavan adds to that to have heap_xlog_clean compute free space for all pages as well. This means that if we later promote this standby and VACUUM skips all-frozen pages, their FSM numbers are going to be up to date anyway.
Patch attached.

Now, it's possible that the problem occurs for all-visible pages too, not just all-frozen ones. I haven't seen that one; maybe there's some reason why it cannot happen. But fixing both things together is an easy change in the proposed patch: just do it on xlrec->flags != 0 rather than checking for the specific all-frozen flag.

(This problem seems to be made worse by the fact that RecordAndGetPageWithFreeSpace (or rather fsm_set_and_search) holds an exclusive lock on the FSM page for the whole duration of update plus search. So when there are many inserters, they all race to the update process. Maybe it'd be less terrible if we released the exclusive lock after the update and grabbed a shared lock for the search in fsm_set_and_search, but we still need the exclusive lock for the update, so the contention point remains. Maybe there's not sufficient improvement to make a practical difference, so I'm not proposing changing this.)

--
Álvaro Herrera
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5016181fd7..d024b4fa59 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -8056,6 +8056,7 @@ heap_xlog_clean(XLogReaderState *record)
 	xl_heap_clean *xlrec = (xl_heap_clean *) XLogRecGetData(record);
 	Buffer buffer;
 	Size freespace = 0;
+	bool know_freespace = false;
 	RelFileNode rnode;
 	BlockNumber blkno;
 	XLogRedoAction action;
@@ -8107,8 +8108,6 @@ heap_xlog_clean(XLogReaderState *record)
 								nowdead, ndead,
 								nowunused, nunused);
 
-		freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
-
 		/*
 		 * Note: we don't worry about updating the page's prunability hints.
 		 * At worst this will cause an extra prune cycle to occur soon.
@@ -8118,16 +8117,16 @@ heap_xlog_clean(XLogReaderState *record)
 		MarkBufferDirty(buffer);
 	}
 	if (BufferIsValid(buffer))
+	{
+		freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
+		know_freespace = true;
 		UnlockReleaseBuffer(buffer);
+	}
 
 	/*
-	 * Update the FSM as well.
-	 *
-	 * XXX: Don't do this if the page was restored from full page image. We
-	 * don't bother to update the FSM in that case, it doesn't need to be
-	 * totally accurate anyway.
+	 * Update the FSM as well, if we can.
 	 */
-	if (action == BLK_NEEDS_REDO)
+	if (know_freespace)
 		XLogRecordPageWithFreeSpace(rnode, blkno, freespace);
 }
 
@@ -8149,6 +8148,8 @@ heap_xlog_visible(XLogReaderState *record)
 	Page page;
 	RelFileNode rnode;
 	BlockNumber blkno;
+	Size space;
+	bool know_freespace = false;
 	XLogRedoAction action;
 
 	XLogRecGetBlockTag(record, 1, &rnode, NULL, &blkno);
@@ -8201,8 +8202,31 @@ heap_xlog_visible(XLogReaderState *record)
 		 * wal_log_hints enabled.)
 		 */
 	}
+	if (BufferIsValid(buffer))
+	{
+		space = PageGetFreeSpace(BufferGetPage(buffer)); /* for later */
+		know_freespace = true;
 		UnlockReleaseBuffer(buffer);
+	}
+
+	/*
+	 * Since FSM is not WAL-logged and only updated heuristicaly, it easily
+	 * becomes stale in standbys. If the standby is later promoted and runs
+	 * VACUUM, it will skip updating individual free space figures for pages
+	 * that became frozen, which is troublesome when FreeSpaceMapVacuum
+	 * propagates too optimistic free space values to upper FSM layers; later
+	 * inserters try to use such pages only to find out that they are
+	 * unusable. This can cause long stalls when there are many such pages.
+	 *
+	 * Forestall those problems by updating FSM's idea about a page that is
+	 * becoming frozen.
+	 *
+	 * Do this regardless of full-page image being applied, since the FSM data
+	 * is not in the page anyway.
+	 */
+	if ((xlrec->flags & VISIBILITYMAP_ALL_FROZEN) && know_freespace)
+		XLogRecordPageWithFreeSpace(rnode, blkno, space);
 
 	/*
 	 * Even if we skipped the heap page update due to the LSN interlock, it's