On Mon, Aug 16, 2021 at 5:15 PM Peter Geoghegan <p...@bowt.ie> wrote:
> This is actually quite simple -- the chaos that follows is where it
> gets truly complicated (and not necessarily worth getting into just
> yet). The update of a given order and its order lines entries takes
> place hours later, within a delivery transaction.
Update on this: aborted transactions play an important role in the chaos that we see with BenchmarkSQL/TPC-C. In one way that's surprising: only 1% of all transactions abort (per the TPC-C spec). In another way it's not surprising at all -- it's totally consistent with what I've been saying about open and closed pages. When a page that's initially allocated and opened by a backend becomes closed following some inserts, the page should be left in a more or less pristine state; owning backends that open and close pages need to be "good citizens". The definition of "pristine" must also include "no already-aborted tuples" -- aborted-xact tuples are a problem that we can and should nip in the bud.

This is really an extension of what I've been saying about not disrupting naturally occurring temporal and spatial locality. We're far better off if the space freed from aborted heap tuples is reused by the very next transaction running in the very same backend. Plus it's not that hard; we can know for sure that this space is immediately available, which is not generally true of pruning. And so free space from aborted-xact tuples is quite special. Even with the master branch, the problems with the FSM are much less noticeable once you configure BenchmarkSQL to not abort any transactions [1].

You need to consider the whole picture to understand how the issue of aborted transactions has such an outsized negative impact on performance and space efficiency. It's a whole *chain* of events:

* If just 1% of all transactions abort, that's enough to leave an average of about 1 aborted tuple on every "order" table page (which naturally has small heap tuples [2]), and about 10 on a minority of the really big "new order" table's pages. Maybe 1 in 10 or 12 "order line" table heap pages are affected in this way.

* The overall impact is that small (but not tiny) pockets of free space are left randomly strewn all over the place. Not just LP_DEAD line pointer stubs -- entire aborted heap tuples.

* In theory we could address this problem using opportunistic pruning. But in practice that simply cannot happen today, because we don't touch the inserted rows until literally hours later. The same rows will eventually have updates and selects, but that won't help -- too little, too late. That's just how opportunistic pruning works: currently you need some kind of scan node for opportunistic pruning to kick in, so continually inserting transactions (from new orders) are fundamentally unable to do any opportunistic pruning -- much less opportunistic pruning that kicks in at exactly the right time.

* When an autovacuum worker runs against these tables, it will of course notice this diffuse free space from aborted tuples and put it in the FSM. That free space will be reused, but by totally unrelated logical rows.

What we really seem to need here is some rudimentary form of "eager physical rollback"; the really important thing seems to be to not allow the problem to happen in the first place. I have a basic prototype implementation, built on top of the larger prototype patch. It opportunistically prunes away aborted heap tuples on target/open pages, bypassing the usual pd_prune_xid check we have in heap_page_prune_opt() -- that check doesn't make sense with the new form of targeted, abort-oriented pruning. The new pruning mechanism makes a really big difference on its own, without even involving the FSM. The best cure for all kinds of bloat is prevention, and this comes pretty close to prevention.

I think that this general idea can be pushed further. I've talked about aborts, but what about the commit path? I don't see why we can't make inserting backends responsible for setting hint bits on recently inserted rows -- the cost of putting off that work can only go up, no matter what (unless maybe the user drops the table).
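To make the "prune your own aborted tuples before the page is closed" idea concrete, here's a deliberately tiny Python toy model. To be clear about what's assumed: this is nothing like PostgreSQL's real heap page format, heap_page_prune_opt(), or the prototype patch -- Page, fill_page, prune_aborted, PAGE_CAPACITY, and ABORT_EVERY are all invented for illustration, with ABORT_EVERY = 100 standing in for the ~1% abort rate from the TPC-C spec:

```python
# Toy model of "eager physical rollback" on a backend-owned open page.
# NOT PostgreSQL's actual data structures -- a sketch of the idea only.

PAGE_CAPACITY = 100  # assumed tuples per heap page (small "order" tuples)
ABORT_EVERY = 100    # every 100th transaction aborts (models ~1% abort rate)


class Page:
    def __init__(self):
        self.tuples = []  # (xid, aborted) pairs; a stand-in for heap tuples

    def free_slots(self):
        return PAGE_CAPACITY - len(self.tuples)


def fill_page(page, start_xid):
    """One backend inserts into its open page until the page is full."""
    xid = start_xid
    while page.free_slots() > 0:
        aborted = (xid % ABORT_EVERY == 0)  # this xact will abort
        page.tuples.append((xid, aborted))
        xid += 1
    return xid  # next xid for this backend


def prune_aborted(page):
    """Eagerly prune aborted tuples before the page is closed, so the
    owning backend can reuse the space for its very next transaction --
    no detour through the FSM, no unrelated rows moving in."""
    before = len(page.tuples)
    page.tuples = [t for t in page.tuples if not t[1]]
    return before - len(page.tuples)


page = Page()
next_xid = fill_page(page, start_xid=1)
freed = prune_aborted(page)           # with a ~1% abort rate: ~1 per page
next_xid = fill_page(page, next_xid)  # same backend reuses the space now
leftover = sum(1 for t in page.tuples if t[1])
print(f"pruned before close: {freed}, aborted tuples left behind: {leftover}")
# -> pruned before close: 1, aborted tuples left behind: 0
```

With the same assumed numbers, skipping the prune_aborted() step strands about one aborted tuple per page -- exactly the diffuse pockets of free space described above, which otherwise sit there until autovacuum finds them and hands them to the FSM for unrelated rows.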
Again, the missing piece seems to be a general sense of ownership of heap pages by transactions and/or backends. In order for a backend to be able to clean up its own mess while it's still cheap, it has to understand what that actually means.

Don't forget that the TPC-C workload doesn't have that many updates (at least not for the tables I'm focussed on here). There's nothing that we tend to think of as an adversarial case for Postgres, except perhaps the updates against the "order" table, which are always non-HOT updates due to an issue with the indexes. That's the extent of it, though -- every other table easily has 99%+ of all updates use HOT (albeit with some heap fill factor tuning). We should *expect* long term stability here, based on the particulars of the workload; we only need to update every "order" row and every "order line" row exactly once. After that they can just age out of shared_buffers forever (actually they might be read again, but never modified again, and reads were never the problem). The big "order line" table is by far the biggest problem, even though it manages to use HOT for practically all updates.

[1] https://github.com/wieck/benchmarksql/commit/66b8db073545026dc76ef513d2b0e318d2f3d5a2
[2] https://apps.dtic.mil/dtic/tr/fulltext/u2/a264793.pdf -- "Table 1: Summary of Logical Database"

-- 
Peter Geoghegan