On Wed, Apr 29, 2020 at 2:56 PM Andres Freund <and...@anarazel.de> wrote:
> I'm not sure I see the advantage. Only doing so in the freezing case
> seems unlikely to cause additional conflicts, but I'm less sure about
> doing it always. btpo.xact will often be quite recent, right? So it'd
> likely cause more conflicts.

btpo.xact values come from a call to ReadNewTransactionId(). There
pretty much has to be one call to ReadNewTransactionId() per page
deletion (see comments about it within _bt_unlink_halfdead_page()). So
yes, I'd say that it could be very recent.

I don't mind continuing to do the conflicts in _bt_getbuf(), which
would delay it until the point where we really need the page --
especially if we could do that in a way that captures temporal
locality (e.g. your recyclable page chaining idea).
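For reference, the test that gates reuse today is approximately the
following (the existing _bt_page_recyclable(), lightly trimmed):

/*
 * Is an existing page recyclable?  A deleted page can only be reused
 * once its btpo.xact is older than RecentGlobalXmin -- i.e. once no
 * scan still in flight could be interested in it.
 */
bool
_bt_page_recyclable(Page page)
{
    BTPageOpaque opaque;

    /* An all-zeroes page can be reclaimed unconditionally */
    if (PageIsNew(page))
        return true;

    opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    return P_ISDELETED(opaque) &&
        TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin);
}

Since btpo.xact comes straight from ReadNewTransactionId() at deletion
time, the xmin horizon has to advance past a very recent XID before
the page can be reused at all.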
> I don't really see the problem with the check in _bt_getbuf()? I'd
> actually like to be *more* aggressive about putting pages in the FSM (or
> whatever), and that'd probably require checks like this. E.g. whenever
> we unlink a page, we should put it into the FSM (or something different,
> see below). And then do all the necessary checks in _bt_getbuf().

Basically, I would like to have a page state that represents "it
should be impossible for any scan to land on this page, except for
btvacuumscan()" -- without relying on 32-bit XIDs, and ideally without
relying on any other state to interpret what it really means. In
principle we can set a deleted page to that state at the earliest
point at which it becomes true, without it meaning anything else, or
requiring that we do anything else at the same time (e.g. actually
using it for the right half of a page in a page split, generating
recovery conflicts).

> It's pretty sad that one right now basically needs to vacuum twice to
> reuse pages in nbtree (once to delete the page, once to record it in the
> fsm). Usually the xmin horizon should advance much more quickly than
> that, allowing reuse earlier.

Yes, that's definitely bad. I like the general idea of making us more
aggressive with recycling. Marking pages as "recyclable" (not
"recycled") not too long after they first get deleted in VACUUM, and
then separately recycling them in _bt_getbuf(), is a cleaner design.
Separation of concerns.

That would build confidence in a more aggressive approach -- we could
add lots of cross-checks against landing on a recyclable page. Note
that we have had a bug in this exact mechanism in the past -- see
commit d3abbbebe52. If there were a bug then we might still land on
the page after it gets fully recycled, in which case the cross-checks
won't detect it. But ISTM that we'd always have a good chance of
landing on the page before that happens, in which case the cross-check
complains and we get a log message, and possibly even a bug report. We
wouldn't have to be particularly lucky to notice that our approach is
buggy when we go on to make page deletion more aggressive (in whatever
way). And we get the same cross-checks on standbys.

> > It makes sense that the FSM is not crash safe, I suppose, but we're
> > talking about relatively large amounts of free space here. Can't we
> > just do it properly/reliably?
>
> What do you mean by that? To have the FSM be crash-safe?

That, or the equivalent (pretty much your chaining idea), may well be
a good idea. But what I really meant was an explicit "recyclable" page
state. That's all. We may or may not also continue to rely on the FSM
in the same way. I suppose that we should try to get rid of the FSM in
nbtree -- I see the advantages. It's not essential to my proposal,
though.
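To make that a bit more concrete, I'm imagining little more than a new
page flag. This is only a sketch: BTP_RECYCLABLE is a placeholder
name, and the bit shown is simply the next one that happens to be free
today.

/* nbtree.h sketch: new flag bit alongside BTP_DELETED and friends */
#define BTP_RECYCLABLE  (1 << 8)    /* deleted page now safe to reuse */

/*
 * Sketch of a recyclability test that doesn't interpret 32-bit XIDs
 * (or any other state).  VACUUM would set the flag at the earliest
 * point at which no scan can land on the page; _bt_getbuf() would
 * test nothing but the flag itself.
 */
static bool
_bt_page_recyclable_sketch(Page page)
{
    BTPageOpaque opaque;

    if (PageIsNew(page))
        return true;

    opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    return (opaque->btpo_flags & BTP_RECYCLABLE) != 0;
}

The cross-checks also become trivial with something like this: any
code path other than btvacuumscan() that lands on a BTP_RECYCLABLE
page can simply complain, on the primary and on standbys alike.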
> It could make sense to just not have the FSM, and have a linked-list
> style stack of pages reachable from the meta page. That'd be especially
> advantageous if we kept xids associated with the "about to be
> recyclable" pages in the metapage, so it'd be cheap to evaluate.

I like that idea. But doesn't that also argue for delaying the
conflicts until we actually recycle a "recyclable" page?

> There's possibly some not-entirely-trivial locking concerns around such
> a list. Adding entries seems easy enough, because we currently only
> delete pages from within vacuum. But popping entries could be more
> complicated, I'm not exactly sure if there are potential lock nesting
> issues (nor am I actually sure there aren't existing ones).

A "recyclable" page state might help with this, too. _bt_getbuf() is a
bag of tricks, even leaving aside generating recovery conflicts.

If we are going to be more eager, then the cost of dirtying the page a
second time to mark it "recyclable" might mostly not matter, especially
if we chain the pages. That is, maybe VACUUM recomputes
RecentGlobalXmin at the end of its first btvacuumscan() scan (or at the
end of the whole VACUUM operation), when it notices that it is already
possible to mark many pages as "recyclable". Perhaps we won't write out
the page twice much of the time, because it won't have been that long
since VACUUM dirtied the page in order to delete it.

Yeah, we could be a lot more aggressive here, in a bunch of ways. As
I've said quite a few times, it seems like our implementation of "the
drain technique" is way more conservative than it needs to be (i.e. we
use ReadNewTransactionId() without considering any of the specifics of
the index). But if we mess up, we can't expect amcheck to detect the
problems, which would be totally transient -- we're talking about
incredibly subtle concurrency bugs. So IMV it's just not going to
happen until the whole thing becomes way less scary.

--
Peter Geoghegan