On Fri, 8 Nov 2024 at 17:11, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote: > > We recently had a customer report a very strange problem, involving a > very large insert-only table: without explanation, insertions would > stall for several seconds, causing application timeout and process > accumulation and other nastiness. > > After some investigation, we narrowed this down to happening immediately > after the first VACUUM on the table right after a standby got promoted. > It wasn't at first obvious what the connection between these factors > was, but eventually we realized that VACUUM must have been skipping a > bunch of pages because they had been marked all-frozen previously, so > the FSM was not updated with the correct freespace figures for those > pages. The FSM pages had been transmitted as full-page images on WAL > before the promotion (because wal_log_hints), so they contained > optimistic numbers on amount of free space coming from the previous > master. (Because this only happens on the first change to that FSM page > after a checkpoint, it's quite likely that one page every few thousand > or so contains optimistic figures while the others remain all zeroes, or > something like that.) > > Before VACUUM, nothing too bad would happen, because the upper layers of > the FSM would not know about those optimistic numbers. But when VACUUM > does FreeSpaceMapVacuum, it propagates those numbers upwards; as soon as > that happens, inserters looking for pages would be told about those > pages (wrongly catalogued to contain sufficient free space), go to > insert there, and fail because there isn't actually any freespace; ask > FSM for another page, lather, rinse, repeat until all those pages are > all catalogued correctly by FSM, at which point things continue > normally. (There are many processes doing this chase-up concurrently > and it seems a pretty contentious process, about which see last > paragraph; it can be seen in pg_xlogdump that it takes several seconds > for things to settle). > > After considering several possible solutions, I propose to have > heap_xlog_visible compute free space for any page being marked frozen; > Pavan adds to that to have heap_xlog_clean compute free space for all > pages also. This means that if we later promote this standby and VACUUM > skips all-frozen pages, their FSM numbers are going to be up-to-date > anyway. Patch attached. > > > Now, it's possible that the problem occurs for all-visible pages not > just all-frozen. I haven't seen that one, maybe there's some reason why > it cannot. But fixing both things together is an easy change in the > proposed patch: just do it on xlrec->flags != 0 rather than checking for > the specific all-frozen flag. > > (This problem seems to be made worse by the fact that > RecordAndGetPageWithFreeSpace (or rather fsm_set_and_search) holds > exclusive lock on the FSM page for the whole duration of update plus > search. So when there are many inserters, they all race to the update > process. Maybe it'd be less terrible if we would release exclusive > after the update and grab shared lock for the search in > fsm_set_and_search, but we still have to have the exclusive for the > update, so the contention point remains. Maybe there's not sufficient > improvement to make a practical difference, so I'm not proposing > changing this.) > > -- > Álvaro Herrera
Hi! Sorry for disturbing you after so much time. Today, while I was doing my stuff and researching several FSM-related questions, I noticed that the comment in the `heap_xlog_visible` function used improper punctuation. After some investigation, I conclude that this is an oversight of ab7dbd6, which was proposed in this thread. I'd like to propose a fix for that. Sorry for making so much noise for this minor matter. -- Best regards, Kirill Reshke
v1-0001-Fixup-FSM-comment-inside-heap_xlog_visible.patch
Description: Binary data