On Thu, Jan 18, 2024 at 11:46 AM Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Jan 18, 2024 at 11:17 AM Peter Geoghegan <p...@bowt.ie> wrote:
> > True. But the way that PageGetHeapFreeSpace() returns 0 for a page
> > with 291 LP_DEAD stubs is a much older behavior. When that happens it
> > is literally true that the page has lots of free space. And yet it's
> > not free space we can actually use. Not until those LP_DEAD items are
> > marked LP_UNUSED.
>
> To me, this is just accurate reporting. What we care about in this
> context is the amount of free space on the page that can be used to
> store a new tuple. When there are no line pointers available to be
> allocated, that amount is 0.
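To make the behavior we're describing concrete, here is a simplified
standalone sketch (illustrative stand-ins only, not the actual
PageGetHeapFreeSpace() or page layout code): a page whose line pointer
array is already at the 291-item limit, with nothing recyclable, reports
0 usable free space no matter how many raw bytes are free, and only
starts reporting its real free space once those LP_DEAD items are set
LP_UNUSED.

/*
 * Simplified standalone sketch -- not the actual PostgreSQL page code.
 * The struct and constants below are illustrative stand-ins for the real
 * page header, MaxHeapTuplesPerPage, etc.  The rule under discussion:
 * if no line pointer can be allocated or recycled, the page has 0
 * *usable* free space, whatever its raw byte count says.
 */
#include <stdio.h>

#define MAX_TUPLES_PER_PAGE 291     /* line pointer limit for an 8kB heap page */

typedef struct PageSummary
{
    int line_pointer_count;    /* allocated line pointers (live or dead) */
    int unused_line_pointers;  /* LP_UNUSED slots available for reuse */
    int raw_free_bytes;        /* bytes free between pd_lower and pd_upper */
} PageSummary;

/* Analogous in spirit to PageGetHeapFreeSpace(): raw free space only
 * counts if a line pointer for the new tuple can also be obtained. */
static int
usable_free_space(const PageSummary *page)
{
    if (page->line_pointer_count >= MAX_TUPLES_PER_PAGE &&
        page->unused_line_pointers == 0)
        return 0;
    return page->raw_free_bytes;
}

int
main(void)
{
    /* After lazy_scan_prune: 291 LP_DEAD stubs, plenty of raw space */
    PageSummary after_prune = {291, 0, 7000};
    /* After lazy_vacuum_heap_page(): the same stubs are now LP_UNUSED */
    PageSummary after_vacuum = {291, 291, 7000};

    printf("after prune:  %d usable bytes\n", usable_free_space(&after_prune));
    printf("after vacuum: %d usable bytes\n", usable_free_space(&after_vacuum));
    return 0;
}

Running that prints 0 for the first page and 7000 for the second, which
is exactly the "step function" I get into below.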
I agree. All I'm saying is this (can't imagine you'll disagree): It's
not okay if you fail to update the FSM a second time in the second heap
pass -- at least in some cases. It's reasonably frequent for a page that
has 0 usable free space when lazy_scan_prune returns to go on to have
almost BLCKSZ free space once lazy_vacuum_heap_page() is done with it.

While I am sympathetic to the argument that LP_DEAD item space just
isn't that important in general, that doesn't apply with this one
special case. This is a "step function" behavior, and is seen whenever
VACUUM runs following bulk deletes of tuples -- a rather common case.
Clearly the FSM shouldn't show pages that are actually completely empty
at the end of VACUUM as having no available free space after a VACUUM
finishes (on account of how they looked immediately after
lazy_scan_prune ran). That'd just be wrong.

> > Another big source of inaccuracies here is that we don't credit
> > RECENTLY_DEAD tuple space with being free space. Maybe that isn't a
> > huge problem, but it makes it even harder to believe that precision
> > in FSM accounting is an intrinsic good.
>
> The difficulty here is that we don't know how long it will be before
> that space can be reused. Those recently dead tuples could become dead
> within a few milliseconds or stick around for hours. I've wondered
> about the merits of some FSM that had built-in visibility awareness,
> i.e. the capability to record something like "page X currently has Y
> space free and after XID Z is all-visible it will have Y' space free".
> That seems complex, but without it, we either have to bet that the
> space will actually become free before anyone tries to use it, or that
> it won't. If whatever guess we make is wrong, bad things happen.

All true -- it is rather complex. Other systems with a heap table
access method based on a foundation of 2PL (Oracle, DB2) literally need
a transactionally consistent FSM structure. In fact I believe that
Oracle literally requires the equivalent of an MVCC snapshot read (a
"consistent get") to be able to access what seems like it ought to be
strictly a physical data structure correctly. Everything needs to work
in the rollback path, independent of whatever else may happen to the
page before an xact rolls back (i.e. independently of what other xacts
might end up doing with the page). This requires very tight
coordination to avoid bugs where a transaction cannot roll back due to
not having enough free space to restore the original tuple during UNDO.

I don't think it's desirable to have anything as delicate as that here.
But some rudimentary understanding of free space being allocated/leased
to certain transactions and/or backends does seem like a good idea.
There is some intrinsic value to these sorts of behaviors, even in a
system without any UNDO segments, where it is never strictly necessary.

> I think that the completely deterministic nature of the computation is
> a mistake regardless of anything else. That serves to focus contention
> rather than spreading it out, which is dumb, and would still be dumb
> with any other number of FSM_CATEGORIES.

That's a part of the problem too, I guess. The actual available free
space on each page is literally changing all the time, when measured at
FSM_CATEGORIES-wise granularity -- which leads to a mad dash among
backends that all need the same amount of free space for their new
tuple.
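To illustrate the determinism in question, here is a rough standalone
sketch of the quantization (simplified; not the real FSM code, though
the familiar numbers -- 256 categories over an 8kB block, so 32-byte
steps -- are assumed):

/*
 * Standalone sketch, not the actual FSM internals: free space is mapped
 * to one of FSM_CATEGORIES buckets, and a request is mapped to the
 * smallest bucket guaranteed to satisfy it.  Both mappings are pure
 * functions of their inputs, which is the determinism at issue.
 */
#include <stdio.h>

#define BLCKSZ          8192
#define FSM_CATEGORIES  256
#define FSM_CAT_STEP    (BLCKSZ / FSM_CATEGORIES)   /* 32 bytes per step */

/* Category advertised for a page with 'avail' bytes free (rounds down). */
static unsigned int
avail_to_cat(unsigned int avail)
{
    unsigned int cat = avail / FSM_CAT_STEP;

    if (cat > FSM_CATEGORIES - 1)
        cat = FSM_CATEGORIES - 1;
    return cat;
}

/* Minimum category guaranteed to satisfy a request (rounds up). */
static unsigned int
needed_to_cat(unsigned int needed)
{
    unsigned int cat = (needed + FSM_CAT_STEP - 1) / FSM_CAT_STEP;

    if (cat > FSM_CATEGORIES - 1)
        cat = FSM_CATEGORIES - 1;
    return cat;
}

int
main(void)
{
    /* Pages whose free space differs by less than a step share a bucket... */
    printf("page with 2000 bytes free -> category %u\n", avail_to_cat(2000));
    printf("page with 2030 bytes free -> category %u\n", avail_to_cat(2030));

    /* ...and every backend inserting a 100-byte tuple computes the same
     * minimum category, so they all converge on the same candidate pages. */
    printf("100-byte request          -> needs category %u\n",
           needed_to_cat(100));
    return 0;
}

Every backend inserting same-sized tuples computes the same category and
descends the FSM to the same candidate pages -- which is the
contention-focusing behavior you're describing.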
One reason why other systems pretty much require coarse-grained
increments of free space is the need to manage the WAL overhead for a
crash-safe FSM/free list structure.

> Yeah. I'm not sure we're actually going to change that right now, but
> I agree with the high-level point regardless, which I would summarize
> like this: The current system provides more precision about available
> free space than we actually need, while failing to provide some other
> things that we really do need. We need not agree today on exactly what
> those other things are or how best to get them in order to agree that
> the current system has significant flaws, and we do agree that it
> does.

I agree with this.

-- 
Peter Geoghegan