On Mon, Jan 9, 2023 at 11:57 AM Andres Freund <and...@anarazel.de> wrote: > Afaict we'll need to backpatch this all the way?
I thought that we probably wouldn't need to, at first. But I now think that we really have to. I didn't realize that affected visibilitymap_set() calls could generate useless set-VM WAL records until you pointed it out. That's far more likely to happen than the race condition that I described -- it has nothing at all to do with concurrency. That's what clinches it for me. > > One of the calls to visibilitymap_set() during VACUUM's initial heap > > pass could unset a page's all-visible bit during the process of setting > > the same page's all-frozen bit. > > As just mentioned upthread, this just seems wrong. I don't know why this sentence ever made sense to me. Anyway, it's not important now. > Do you have a reproducer for this? No, but I'm quite certain that the race can happen. If it's important to have a reproducer then I can probably come up with one. I could likely figure out a way to write an isolation test that reliably triggers the issue. It would have to work by playing games with cleanup lock/buffer pin waits, since that's the only thing that the test can hook into to make things happen in just the right/wrong order. > > elog(WARNING, "page is not marked all-visible but > > visibility map bit is set in relation \"%s\" page %u", > > vacrel->relname, blkno); > > Hm. The message gets a bit less accurate with the change. Perhaps OK? OTOH, it > might be useful to know what bit was wrong when debugging problems. Theoretically it might change again, if we call visibilitymap_get_status() again. Maybe I should just broaden the error message a bit instead? > I still think such changes are inappropriate for a bugfix, particularly one > that needs to be backpatched. I'll remove the changes that are inessential in the next revision. I wouldn't have done it if I'd fully understood the seriousness of the issue from the start. If you're really concerned about diff size then I should point out that the changes to lazy_vacuum_heap_rel() aren't strictly necessary, and probably shouldn't be backpatched. I deemed that in scope because it's part of the same overall problem of updating the visibility map based on potentially stale information. It makes zero sense to check with the visibility map before updating it when we already know that the page is all-visible. I mean, are we trying to avoid the work of needlessly updating the visibility map in cases where its state was corrupt, but then became uncorrupt (relative to the heap page) by mistake? -- Peter Geoghegan