On 10/02/2021 00:14, Peter Geoghegan wrote:
There is a long standing problem with the way that nbtree page
deletion places deleted pages in the FSM for recycling: The use of a
32-bit XID within the deleted page (in the special
area's/BTPageOpaqueData struct's btpo.xact field) is not robust
against XID wraparound, which can lead to permanently leaking pages in
a variety of scenarios. The problems became worse with the addition of
the INDEX_CLEANUP option in Postgres 12 [1]. And, using a 32-bit XID
in this context creates risk for any further improvements in VACUUM
that similarly involve skipping whole indexes. For example, Masahiko
has been working on a patch that teaches VACUUM to skip indexes that
are known to have very little garbage [2].
Attached patch series fixes the issue once and for all. This is
something that I'm targeting for Postgres 14, since it's more or less
a bug fix.
Thanks for picking this up!
The first patch teaches nbtree to use 64-bit transaction IDs here, and
so makes it impossible to leak deleted nbtree pages. This patch is the
nbtree equivalent of commit 6655a729, which made GiST use 64-bit XIDs
due to exactly the same set of problems. The first patch also makes
the level field stored in nbtree page's special area/BTPageOpaqueData
reliably store the level, even in a deleted page. This allows me to
consistently use the level field within amcheck, including even within
deleted pages.
Is it really worth the trouble to maintain 'level' on deleted pages? All
you currently do with it is check that the BTP_LEAF flag is set iff
"level == 0", which seems pointless. I guess there could be some
forensic value in keeping 'level', but meh.
The second patch in the series adds new information to VACUUM VERBOSE.
This makes it easy to understand what's going on here. Index page
deletion related output becomes more useful. It might also help with
debugging the first patch.
Currently, VACUUM VERBOSE output for an index that has some page
deletions looks like this:
"38 index pages have been deleted, 38 are currently reusable."
With the second patch applied, we might see this output at the same
point in VACUUM VERBOSE output instead:
"38 index pages have been deleted, 0 are newly deleted, 38 are
currently reusable."
This means that out of the 38 of the pages that were found to be
marked deleted in the index, 0 were deleted by the VACUUM operation
whose output we see here. That is, there were 0 nbtree pages that were
newly marked BTP_DELETED within _bt_unlink_halfdead_page() during
*this particular* VACUUM -- the VACUUM operation that we see
instrumentation about here. It follows that the 38 deleted pages that
we encountered must have been marked BTP_DELETED by some previous
VACUUM operation.
In practice the "%u are currently reusable" output should never
include newly deleted pages, since there is no way that a page marked
BTP_DELETED can be put in the FSM during the same VACUUM operation --
that's unsafe (we need all of this recycling/XID indirection precisely
because we need to delay recycling until it is truly safe, of course).
Note that the "%u index pages have been deleted" output includes both
pages deleted by some previous VACUUM operation, and newly deleted
pages (no change there).
Note that the new "newly deleted" output is instrumentation about this
particular *VACUUM operation*. In contrast, the other two existing
output numbers ("deleted" and "currently reusable") are actually
instrumentation about the state of the *index as a whole* at a point
in time (barring concurrent recycling of pages counted in VACUUM by
some random _bt_getbuf() call in another backend). This fundamental
distinction is important here. All 3 numbers/stats that we output can
have different values, which can be used to debug the first patch. You
can directly observe uncommon cases just from the VERBOSE output, like
when a long running transaction holds up recycling of a deleted page
that was actually marked BTP_DELETED in an *earlier* VACUUM operation.
And so if the first patch had any bugs, there'd be a pretty good
chance that you could observe them using multiple VACUUM VERBOSE
operations -- you might notice something inconsistent or contradictory
just by examining the output over time, how things change, etc.
The full message on master is:
INFO: index "foo_pkey" now contains 250001 row versions in 2745 pages
DETAIL: 250000 index row versions were removed.
2056 index pages have been deleted, 1370 are currently reusable.
How about:
INFO: index "foo_pkey" now contains 250001 row versions in 2745 pages
DETAIL: 250000 index row versions and 686 pages were removed.
2056 index pages are now unused, 1370 are currently reusable.
The idea is that the first DETAIL line now says what the VACUUM did this
round, and the last line says what the state of the index is now. One
concern with that phrasing is that it might not be clear what "686 pages
were removed" means. We don't actually shrink the file. Then again, I'm
not sure if the "have been deleted" was any better in that regard.
It's still a bit weird that the "what VACUUM did this round" information
is sandwiched between the two other lines that talk about the state of
the index after the operation. But I think the language now makes it
more clear which is which. Or perhaps flip the INFO and first DETAIL
lines around like this:
INFO: 250000 index row versions and 686 pages were removed from index
"foo_pkey"
DETAIL: index now contains 250001 row versions in 2745 pages.
2056 index pages are now unused, of which 1370 are currently reusable.
For context, the more full message you get on master is:
postgres=# vacuum verbose foo;
INFO: vacuuming "public.foo"
INFO: scanned index "foo_pkey" to remove 250000 row versions
DETAIL: CPU: user: 0.16 s, system: 0.00 s, elapsed: 0.16 s
INFO: "foo": removed 250000 row versions in 1107 pages
DETAIL: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s
INFO: index "foo_pkey" now contains 250001 row versions in 2745 pages
DETAIL: 250000 index row versions were removed.
2056 index pages have been deleted, 1370 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "foo": found 250000 removable, 271 nonremovable row versions in
1108 out of 4425 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 1164
There were 87 unused item identifiers.
Skipped 0 pages due to buffer pins, 2212 frozen pages.
0 pages are entirely empty.
CPU: user: 0.27 s, system: 0.00 s, elapsed: 0.28 s.
VACUUM
That's pretty confusing, it's a mix of basically progress indicators
(vacuuming "public.foo"), CPU measurements, information about what was
removed, and what the state is afterwards. Would be nice to make that
more clear overall. But for now, for this particular INFO message,
perhaps make it more consistent with the lines printed by heapam, like this:
INFO: "foo_pkey": removed 250000 index row versions and 686 pages
DETAIL: index now contains 250001 row versions in 2745 pages.
2056 index pages are now unused, of which 1370 are currently reusable.
- Heikki