On Tue, Dec 17, 2019 at 1:58 PM Bruce Momjian <br...@momjian.us> wrote:
> > Attached is v26, which adds this new criteria/heuristic for unique
> > indexes.  We now seem to consistently get good results with unique
> > indexes.
>
> In the past we tried to increase the number of cases where HOT updates
> can happen but were unable to.
Right -- the WARM project. The Z-heap project won't change the fundamentals here, either: it isn't going to solve the problem of requiring that the index AM create a new set of physical index tuples in at least *some* cases. A heap tuple cannot be updated in-place when even one indexed column changes, and when that happens you're not much better off than you were with the classic heapam, because indexes get bloated in a way that wouldn't happen with Oracle. (Even so, Z-heap isn't sensitive to when and how opportunistic heap pruning takes place, and doesn't have the same issue with having to fit the heap tuple on the same page or create a new HOT chain. That will make things much better with some workloads.)

> Would this help with non-HOT updates?

Definitely, yes. The strategy used with unique indexes is specifically designed to avoid "unnecessary" page splits altogether -- it only makes sense because of the possibility of non-HOT UPDATEs with mostly-unchanged index tuples.

Thinking about what's going on here from first principles is what drove the unique index deduplication design: with many real-world unique indexes, the true cause of most or all B-Tree page splits is "version churn". I view these page splits as a permanent solution to a temporary problem -- we *permanently* degrade the index structure in order to absorb a *temporary* burst of versions that need to be stored. That's really bad.

Consider a classic pgbench workload, for example. The smaller indexes on the smaller tables (pgbench_tellers_pkey and pgbench_branches_pkey) have leaf pages that will almost certainly be split a few minutes in, even though the UPDATEs on the underlying tables never modify indexed columns (i.e. even though HOT is as effective as it possibly could be with this unthrottled workload). In fact, the resulting split pages will themselves usually be split again, and maybe even once more after that.
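To make the "version churn" point concrete, here is a toy Python model of a unique-index leaf page (entirely my own illustration -- the constants, the 50/50 split rule, and the cleanup step are simplifications, not the patch's actual algorithm). Every UPDATE adds a new index tuple for an existing key; without any removal of superseded versions the page overflows and splits even though the set of distinct keys never changes, while cleaning up old versions avoids splits entirely:

```python
# Toy model of "version churn" page splits in a unique index.
# PAGE_CAPACITY and the split/cleanup logic are illustrative only.

PAGE_CAPACITY = 370              # index tuples that fit on one leaf page

def run_updates(n_keys, n_updates, cleanup_old_versions):
    """Return how many leaf page splits a stream of UPDATEs causes."""
    pages = [[(key, 0) for key in range(n_keys)]]     # one leaf page
    splits = 0
    for i in range(n_updates):
        key = i % n_keys                  # UPDATE an existing row only
        page = next(p for p in pages if any(k == key for k, _ in p))
        if cleanup_old_versions:
            # Dedup-style cleanup: drop superseded versions of this key
            page[:] = [(k, v) for k, v in page if k != key]
        page.append((key, i + 1))         # new tuple for the new version
        if len(page) > PAGE_CAPACITY:     # page overflows: split it 50/50
            page.sort()
            half = len(page) // 2
            pages.append(page[half:])
            del page[half:]
            splits += 1
    return splits

# 100 distinct keys, 5,000 UPDATEs, no indexed column ever changes:
print(run_updates(100, 5_000, cleanup_old_versions=False))  # many splits
print(run_updates(100, 5_000, cleanup_old_versions=True))   # 0 splits
```

The "logical contents" (100 distinct keys) are identical in both runs; only the handling of old versions differs.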
We start out with leaf pages that each store just under 370 items (with fillfactor 90 + 8KiB BLCKSZ), and end up with leaf pages that often have fewer than 50 items (sometimes as few as 10), even though the "logical contents" of the index are *totally* unchanged. Users could almost consider this pathological.

Of course, it's easy to imagine cases where it matters a lot more than classic pgbench (pgbench_tellers_pkey and pgbench_branches_pkey are always small, so the effect is easy to see, which is why I went with that example). For example, a long running transaction would probably significantly bloat even the large pgbench index (pgbench_accounts_pkey) -- typically you won't see that with classic pgbench until you do something to frustrate VACUUM (and opportunistic cleanup). (I have mostly been using non-HOT UPDATEs to test the patch, though.)

In theory we could go even further by having some kind of version store for indexes, and using it to stash old versions rather than performing a page split. Then you wouldn't have any page splits at all in the pgbench indexes; VACUUM would eventually be able to return the index to its "pristine" state. The trade-off with that design would be that index scans would have to access two index pages for a while (a leaf page, plus its subsidiary old-version page). Maybe we can actually go that far in the future -- various database research papers describe designs like this (they do things like determine whether a "version split" or a "value split" should be performed).

What we have now is an incremental improvement with no apparent downside for unique indexes -- the way that deduplication is triggered for unique indexes makes it almost certain to be a win. When deduplication isn't triggered, everything works the same way as before; it's "zero overhead" for unique indexes that don't benefit.
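The "just under 370 items" figure can be reproduced with back-of-the-envelope arithmetic from the standard 8KiB page layout. The constants below approximate PostgreSQL's on-disk layout for a single int4-column index (page header 24 bytes, B-Tree special area 16 bytes, 4-byte line pointer, and an index tuple MAXALIGNed to 16 bytes); treat them as a rough sketch rather than authoritative sizes:

```python
# Rough leaf-tuple density for a single-int-column B-Tree index,
# 8KiB BLCKSZ, fillfactor 90.  Constants approximate the on-disk layout.
BLCKSZ = 8192
PAGE_HEADER = 24        # fixed page header
BTREE_SPECIAL = 16      # B-Tree special space at the end of the page
LINE_POINTER = 4        # per-tuple item identifier
INDEX_TUPLE = 16        # 8-byte tuple header + 4-byte key, MAXALIGNed
FILLFACTOR = 0.90

usable = (BLCKSZ - PAGE_HEADER - BTREE_SPECIAL) * FILLFACTOR
per_page = int(usable // (LINE_POINTER + INDEX_TUPLE))
print(per_page)         # -> 366, i.e. "just under 370"
```

By the same arithmetic, a post-churn page with 50 items is using well under a fifth of its usable space.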
The design augments existing garbage collection mechanisms, particularly the way in which we set LP_DEAD bits within _bt_check_unique().

> Do we have any benchmarks where non-HOT updates cause slowdowns that we
> can test on this?

AFAICT, any workload with lots of non-HOT updates will benefit at least a little bit -- indexes finish up smaller, throughput is higher, and query latency is lower. With the right distribution of values, the patch can mostly control bloat in an index that would otherwise double in size, which is much more significant. I have already reported on this [1]. More recently, I've also observed 15%-20% increases in TPS with similar workloads (with commensurate reductions in query latency). This was with a simple gaussian distribution for pgbench_accounts.aid, and a non-unique index with deduplication enabled on pgbench_accounts.abalance. (The patch helps control the size of both indexes, especially the extra non-unique one.)

[1] https://postgr.es/m/CAH2-WzkXHhjhmUYfVvu6afbojU97MST8RUT1U=hld2w-gc5...@mail.gmail.com

--
Peter Geoghegan