On Thu, Jan 18, 2024 at 11:46 AM Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Jan 18, 2024 at 11:17 AM Peter Geoghegan <p...@bowt.ie> wrote:
> > True. But the way that PageGetHeapFreeSpace() returns 0 for a page
> > with 291 LP_DEAD stubs is a much older behavior. When that happens it
> > is literally true that the page has lots of free space. And yet it's
> > not free space we can actually use. Not until those LP_DEAD items are
> > marked LP_UNUSED.
>
> To me, this is just accurate reporting. What we care about in this
> context is the amount of free space on the page that can be used to
> store a new tuple. When there are no line pointers available to be
> allocated, that amount is 0.
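To make the behavior we're describing concrete, here is a simplified
standalone sketch (illustrative stand-ins only, not the actual
PageGetHeapFreeSpace() or page layout code): a page whose line pointer
array is already at the 291-item limit, with nothing recyclable, reports
0 usable free space no matter how many raw bytes are free, and only
starts reporting its real free space once those LP_DEAD items are set
LP_UNUSED.

/*
 * Simplified standalone sketch -- not the actual PostgreSQL page code.
 * The struct and constants below are illustrative stand-ins for the real
 * page header, MaxHeapTuplesPerPage, etc.  The rule under discussion:
 * if no line pointer can be allocated or recycled, the page has 0
 * *usable* free space, whatever its raw byte count says.
 */
#include <stdio.h>

#define MAX_TUPLES_PER_PAGE 291     /* line pointer limit for an 8kB heap page */

typedef struct PageSummary
{
    int line_pointer_count;    /* allocated line pointers (live or dead) */
    int unused_line_pointers;  /* LP_UNUSED slots available for reuse */
    int raw_free_bytes;        /* bytes free between pd_lower and pd_upper */
} PageSummary;

/* Analogous in spirit to PageGetHeapFreeSpace(): raw free space only
 * counts if a line pointer for the new tuple can also be obtained. */
static int
usable_free_space(const PageSummary *page)
{
    if (page->line_pointer_count >= MAX_TUPLES_PER_PAGE &&
        page->unused_line_pointers == 0)
        return 0;
    return page->raw_free_bytes;
}

int
main(void)
{
    /* After lazy_scan_prune: 291 LP_DEAD stubs, plenty of raw space */
    PageSummary after_prune = {291, 0, 7000};
    /* After lazy_vacuum_heap_page(): the same stubs are now LP_UNUSED */
    PageSummary after_vacuum = {291, 291, 7000};

    printf("after prune:  %d usable bytes\n", usable_free_space(&after_prune));
    printf("after vacuum: %d usable bytes\n", usable_free_space(&after_vacuum));
    return 0;
}

Running that prints 0 for the first page and 7000 for the second, which
is exactly the "step function" I get into below.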
I agree. All I'm saying is this (can't imagine you'll disagree): It's
not okay if you fail to update the FSM a second time in the second heap
pass -- at least in some cases. It's reasonably frequent for a page that
has 0 usable free space when lazy_scan_prune returns to go on to have
almost BLCKSZ free space once lazy_vacuum_heap_page() is done with it.

While I am sympathetic to the argument that LP_DEAD item space just
isn't that important in general, that doesn't apply with this one
special case. This is a "step function" behavior, and is seen whenever
VACUUM runs following bulk deletes of tuples -- a rather common case.
Clearly the FSM shouldn't show pages that are actually completely empty
at the end of VACUUM as having no available free space after a VACUUM
finishes (on account of how they looked immediately after
lazy_scan_prune ran). That'd just be wrong.

> > Another big source of inaccuracies here is that we don't credit
> > RECENTLY_DEAD tuple space with being free space. Maybe that isn't a
> > huge problem, but it makes it even harder to believe that precision
> > in FSM accounting is an intrinsic good.
>
> The difficulty here is that we don't know how long it will be before
> that space can be reused. Those recently dead tuples could become dead
> within a few milliseconds or stick around for hours. I've wondered
> about the merits of some FSM that had built-in visibility awareness,
> i.e. the capability to record something like "page X currently has Y
> space free and after XID Z is all-visible it will have Y' space free".
> That seems complex, but without it, we either have to bet that the
> space will actually become free before anyone tries to use it, or that
> it won't. If whatever guess we make is wrong, bad things happen.

All true -- it is rather complex. Other systems with a heap table
access method based on a foundation of 2PL (Oracle, DB2) literally need
a transactionally consistent FSM structure. In fact I believe that
Oracle literally requires the equivalent of an MVCC snapshot read (a
"consistent get") to be able to access what seems like it ought to be
strictly a physical data structure correctly. Everything needs to work
in the rollback path, independent of whatever else may happen to the
page before an xact rolls back (i.e. independently of what other xacts
might end up doing with the page). This requires very tight
coordination to avoid bugs where a transaction cannot roll back due to
not having enough free space to restore the original tuple during UNDO.

I don't think it's desirable to have anything as delicate as that here.
But some rudimentary understanding of free space being allocated/leased
to certain transactions and/or backends does seem like a good idea.
There is some intrinsic value to these sorts of behaviors, even in a
system without any UNDO segments, where it is never strictly necessary.

> I think that the completely deterministic nature of the computation is
> a mistake regardless of anything else. That serves to focus contention
> rather than spreading it out, which is dumb, and would still be dumb
> with any other number of FSM_CATEGORIES.

That's a part of the problem too, I guess. The actual available free
space on each page is literally changing all the time, when measured at
FSM_CATEGORIES-wise granularity -- which leads to a mad dash among
backends that all need the same amount of free space for their new
tuple.
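To illustrate the determinism in question, here is a rough standalone
sketch of the quantization (simplified; not the real FSM code, though
the familiar numbers -- 256 categories over an 8kB block, so 32-byte
steps -- are assumed):

/*
 * Standalone sketch, not the actual FSM internals: free space is mapped
 * to one of FSM_CATEGORIES buckets, and a request is mapped to the
 * smallest bucket guaranteed to satisfy it.  Both mappings are pure
 * functions of their inputs, which is the determinism at issue.
 */
#include <stdio.h>

#define BLCKSZ          8192
#define FSM_CATEGORIES  256
#define FSM_CAT_STEP    (BLCKSZ / FSM_CATEGORIES)   /* 32 bytes per step */

/* Category advertised for a page with 'avail' bytes free (rounds down). */
static unsigned int
avail_to_cat(unsigned int avail)
{
    unsigned int cat = avail / FSM_CAT_STEP;

    if (cat > FSM_CATEGORIES - 1)
        cat = FSM_CATEGORIES - 1;
    return cat;
}

/* Minimum category guaranteed to satisfy a request (rounds up). */
static unsigned int
needed_to_cat(unsigned int needed)
{
    unsigned int cat = (needed + FSM_CAT_STEP - 1) / FSM_CAT_STEP;

    if (cat > FSM_CATEGORIES - 1)
        cat = FSM_CATEGORIES - 1;
    return cat;
}

int
main(void)
{
    /* Pages whose free space differs by less than a step share a bucket... */
    printf("page with 2000 bytes free -> category %u\n", avail_to_cat(2000));
    printf("page with 2030 bytes free -> category %u\n", avail_to_cat(2030));

    /* ...and every backend inserting a 100-byte tuple computes the same
     * minimum category, so they all converge on the same candidate pages. */
    printf("100-byte request          -> needs category %u\n",
           needed_to_cat(100));
    return 0;
}

Every backend inserting same-sized tuples computes the same category and
descends the FSM to the same candidate pages -- which is the
contention-focusing behavior you're describing.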
One reason why other systems pretty much require coarse-grained
increments of free space is the need to manage the WAL overhead for a
crash-safe FSM/free list structure.

> Yeah. I'm not sure we're actually going to change that right now, but
> I agree with the high-level point regardless, which I would summarize
> like this: The current system provides more precision about available
> free space than we actually need, while failing to provide some other
> things that we really do need. We need not agree today on exactly what
> those other things are or how best to get them in order to agree that
> the current system has significant flaws, and we do agree that it
> does.

I agree with this.

-- 
Peter Geoghegan