On Tue, Feb 9, 2021 at 10:48 AM Bossart, Nathan <bossa...@amazon.com> wrote:
> I'm hoping to gather some early feedback on a heap optimization I've
> been working on. In short, I'm hoping to add "partial heap only
> tuple" (PHOT) support, which would allow you to skip updating indexes
> for unchanged columns even when other indexes require updates. Today,
> HOT works wonders when no indexed columns are updated. However, as
> soon as you touch one indexed column, you lose that optimization
> entirely, as you must update every index on the table. The resulting
> performance impact is a pain point for many of our (AWS's) enterprise
> customers, so we'd like to lend a hand for some improvements in this
> area. For workloads involving a lot of columns and a lot of indexes,
> an optimization like PHOT can make a huge difference. I'm aware that
> there was a previous attempt a few years ago to add a similar
> optimization called WARM [0] [1]. However, I only noticed this
> previous effort after coming up with the design for PHOT, so I ended
> up taking a slightly different approach. I am also aware of a couple
> of recent nbtree improvements that may mitigate some of the impact of
> non-HOT updates [2] [3], but I am hoping that PHOT serves as a nice
> complement to those. I've attached a very early proof-of-concept
> patch with the design described below.
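To restate the quoted trade-off in code terms, as I understand it (a minimal sketch with made-up table and index names, not the attached PoC patch): with plain HOT an UPDATE either avoids new entries in every index or in none, while a PHOT-style scheme would only touch the indexes whose key columns actually changed.

/*
 * Minimal sketch (made-up names, not the attached PoC patch) of the
 * per-index decision being described.  With plain HOT, one changed
 * indexed column costs you new entries in every index; a PHOT-style
 * scheme would only touch the indexes whose key columns changed.
 */
#include <stdbool.h>
#include <stdio.h>

#define NINDEXES 3
#define NCOLS    8

typedef struct
{
    const char *name;
    int         keycols[2];     /* column numbers covered by this index */
    int         nkeycols;
} index_desc;

int
main(void)
{
    index_desc  indexes[NINDEXES] = {
        {"idx_a", {0}, 1},
        {"idx_b", {1}, 1},
        {"idx_c_d", {2, 3}, 2},
    };
    int         oldrow[NCOLS] = {1, 2, 3, 4, 5, 6, 7, 8};
    int         newrow[NCOLS] = {1, 9, 3, 4, 5, 6, 7, 8};   /* only column 1 changed */
    bool        any_indexed_change = false;

    for (int i = 0; i < NINDEXES; i++)
    {
        bool        changed = false;

        for (int k = 0; k < indexes[i].nkeycols; k++)
            if (oldrow[indexes[i].keycols[k]] != newrow[indexes[i].keycols[k]])
                changed = true;

        if (changed)
            any_indexed_change = true;

        /* PHOT-style: only indexes whose key columns changed get new entries */
        printf("%-8s %s\n", indexes[i].name,
               changed ? "needs a new index tuple" : "skipped");
    }

    /* HOT today is all-or-nothing: one changed indexed column costs every index */
    printf("plain HOT still possible: %s\n", any_indexed_change ? "no" : "yes");
    return 0;
}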
I would like to share some thoughts about how I think about optimizations like PHOT, and how they might fit together with my own work -- particularly the nbtree bottom-up index deletion feature you referenced. My remarks could apply equally well to WARM. Ordinarily this is the kind of thing that would be too hand-wavy for the mailing list, but we don't have the luxury of in-person communication right now.

Everybody tends to talk about HOT as if it works perfectly once you make some modest assumptions, such as "there are no long-running transactions" and "no UPDATEs will logically modify indexed columns". But I tend to doubt that that's truly the case -- I think that there are still pathological cases where HOT cannot keep the total table size stable in the long run, due to subtle effects that eventually aggregate into significant issues, like heap fragmentation. Ask Jan Wieck about the stability of some of the TPC-C/BenchmarkSQL tables to get one example of this. There is no reason to believe that PHOT will help with that. Maybe that's okay, but I would think carefully about what it means if I were undertaking this work. Ensuring stability in the on-disk size of tables in cases where the size of the logical database is stable should be an important goal of a project like PHOT or HOT.

If you want to get a better sense of how these inefficiencies can arise, I suggest using the recently added autovacuum logging to analyze how well HOT works today, with the technique that I go into here:

https://postgr.es/m/cah2-wzkju+nibskzunbdpz6trse+aqvupae+xgm8zvob4wq...@mail.gmail.com

Small inefficiencies in the on-disk structure have a tendency to aggregate over time, at least when there is no possible way to reverse them.

The bottom-up index deletion stuff is very effective as a backstop against index bloat, because things are generally very non-linear. The cost of an unnecessary page split is very high, and permanent. But we can make it cheap to *try* to avoid one using fairly simple heuristics. We can be reasonably confident that we're about to split the page unnecessarily, and use cues that ramp up the number of heap page accesses as needed. We ramp up during a bottom-up index deletion pass as earlier heap page accesses manage to free index tuples. This works very well because we can intervene very selectively. We aren't interested in deleting index tuples unless and until we really have to, and in general there tends to be quite a bit of free space available to temporarily store extra version duplicates -- that's what most index pages look like, even on the busiest of databases. It's possible for the bottom-up index deletion mechanism to be invoked very infrequently and yet make a huge difference. And when it fails to free anything, it fails permanently for that particular leaf page (because the page splits) -- so now we have lots of space for future index tuple insertions that cover the original page's key space. We won't thrash.

My intuition is that similar principles can be applied inside heapam. Failing to fit related versions on a heap page (having managed to do so for hours or days before that point) is more or less the heap page equivalent of a leaf page split from version churn (the pathology that bottom-up index deletion targets). For example, we could have a fallback mode that compresses old versions, used if and only if heap pruning was attempted but then failed.
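To make that shape concrete, here is a minimal sketch in C. The names (heap_page_sketch, hot_prune_sketch, compress_old_versions_sketch) are made up for illustration -- nothing like this exists in heapam today -- but it shows the escalation I have in mind: the cheap path handles the vast majority of cases, the expensive fallback only runs once pruning has failed, and moving the new version to another heap page is the last resort rather than the first.

/*
 * Sketch of an escalating placement strategy for UPDATEs -- hypothetical
 * names throughout, not existing heapam code.  The cheap path (pruning)
 * handles the vast majority of cases; the expensive fallback (compressing
 * old versions) only runs when pruning fails; migrating the new version to
 * another heap page is the last resort.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct
{
    size_t      free_space;     /* bytes currently free on the page */
    size_t      prunable;       /* bytes reclaimable by pruning dead versions */
    size_t      compressible;   /* bytes reclaimable by compressing old versions */
} heap_page_sketch;

/* Stand-in for opportunistic pruning: reclaim space from dead versions */
static void
hot_prune_sketch(heap_page_sketch *page)
{
    page->free_space += page->prunable;
    page->prunable = 0;
}

/* Stand-in for the hypothetical expensive fallback discussed above */
static void
compress_old_versions_sketch(heap_page_sketch *page)
{
    page->free_space += page->compressible;
    page->compressible = 0;
}

typedef enum { SAME_PAGE, SAME_PAGE_AFTER_FALLBACK, NEW_PAGE } placement;

static placement
place_new_version_sketch(heap_page_sketch *page, size_t new_tuple_len)
{
    /* Fast path: the successor version simply fits */
    if (page->free_space >= new_tuple_len)
        return SAME_PAGE;

    /* The path that has worked "a million times before": prune, then retry */
    hot_prune_sketch(page);
    if (page->free_space >= new_tuple_len)
        return SAME_PAGE;

    /* Rare, relatively expensive fallback -- only after pruning failed */
    compress_old_versions_sketch(page);
    if (page->free_space >= new_tuple_len)
        return SAME_PAGE_AFTER_FALLBACK;

    /* Last resort: a permanent solution to a temporary problem */
    return NEW_PAGE;
}

int
main(void)
{
    heap_page_sketch page = {16, 40, 80};   /* free, prunable, compressible bytes */

    printf("placement: %d\n", (int) place_new_version_sketch(&page, 100));
    return 0;
}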
We should always try to avoid migrating to a new heap page, because that amounts to a permanent solution to a temporary problem. We should perhaps make the updater work to prove that it's truly necessary, rather than giving up immediately (i.e. assuming that it must be necessary at the first sign of trouble). We might have successfully fit the successor heap tuple version a million times before just by HOT pruning, and yet currently we give up just because it didn't work on the million-and-first occasion -- don't you think that's kind of silly? We may be able to afford a fallback strategy that is relatively expensive, provided it is rarely used. And it might be very effective in the aggregate, despite being rarely used -- it might give us just what we were missing before. Just try harder when you run into a problem every once in a blue moon! A diversity of strategies, with fallback behavior, is sometimes the best overall strategy.

Don't underestimate the contribution of rare and seemingly insignificant adverse events. Consider the lifecycle of the data over time. If we quit trying to fit new versions on the same heap page at the first sign of real trouble, then it's only a matter of time until widespread heap fragmentation results -- each heap page only has to be unlucky once, and in the long run it's inevitable that they all will be. We could probably do better at nipping that in the bud at the level of individual heap pages and opportunistic prune operations.

I'm sure that it would be useful to not have to rely on bottom-up index deletion in so many cases -- I think that the idea of "a better HOT" might still be very helpful. Bottom-up index deletion is only supposed to be a backstop against pathological behavior (version churn page splits), which is probably always going to be possible with a sufficiently extreme workload.

I don't believe that the current levels of version churn/write amplification that we still see with Postgres must be addressed by totally eliminating cases where multiple versions of the same logical row live together in the same heap page. That idea is a false dichotomy, and it fails to acknowledge how often the current design works very well. When and how it fails to work well with a real workload and real tuning (especially heap fill factor tuning) is probably not well understood. Why not start with that?

Our default heap fill factor is 100. Maybe that's the right decision, but it significantly impedes the ability of HOT to keep the size of tables stable over time. Just because heap fill factor 90 also has issues today doesn't mean that each pathological behavior cannot be fixed through targeted intervention. Maybe the myth that HOT works perfectly once you make some modest assumptions could come true.

--
Peter Geoghegan
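PS: Since the ramp-up behavior of bottom-up index deletion came up above (as the backstop I'd like us to have to rely on less), here is the same kind of sketch for that side of things -- again with made-up names (leaf_attempt, try_delete_on_heap_block), not the actual nbtree code. The only point is the shape of the heuristic: trying is cheap, and we only spend additional heap page accesses while earlier ones keep freeing index tuples.

/*
 * Minimal sketch only -- hypothetical names, not the real nbtree/heapam
 * routines.  It models the ramp-up behavior described above: each attempt
 * starts with a small budget of heap page accesses, and the budget grows
 * only while earlier accesses keep freeing index tuples.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct
{
    size_t      freed_so_far;   /* index tuples freed during this attempt */
    size_t      space_needed;   /* tuples we must free to avoid a page split */
} leaf_attempt;

/* Stand-in for visiting one promising heap block; pretend it frees one tuple */
static size_t
try_delete_on_heap_block(leaf_attempt *attempt, size_t blkidx)
{
    (void) attempt;
    (void) blkidx;
    return 1;
}

/* Returns true if enough index tuples were freed to avoid splitting the page */
static bool
bottom_up_delete_sketch(leaf_attempt *attempt)
{
    size_t      budget = 3;     /* start cheap: only a few heap page accesses */
    size_t      blkidx = 0;

    while (blkidx < budget)
    {
        size_t      freed = try_delete_on_heap_block(attempt, blkidx++);

        attempt->freed_so_far += freed;

        /* Earlier accesses paid off, so a few more accesses are justified */
        if (freed > 0 && budget < 12)
            budget += 2;

        if (attempt->freed_so_far >= attempt->space_needed)
            return true;        /* page split avoided */
    }

    /*
     * Give up: the caller splits the page.  That failure is permanent for
     * this leaf page, but the split also leaves plenty of free space for
     * the key space in question, so we won't thrash.
     */
    return false;
}

int
main(void)
{
    leaf_attempt attempt = {0, 5};

    printf("avoided split: %s\n",
           bottom_up_delete_sketch(&attempt) ? "yes" : "no");
    return 0;
}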