On Fri, Mar 22, 2019 at 2:15 PM Peter Geoghegan <p...@bowt.ie> wrote:
> On Thu, Mar 21, 2019 at 10:28 AM Peter Geoghegan <p...@bowt.ie> wrote:
> > I'll likely push the remaining two patches on Sunday or Monday.
>
> I noticed that if I initdb and run "make installcheck" with and
> without the "split after new tuple" optimization patch, the largest
> system catalog indexes shrink quite noticeably:
I pushed this final patch a week ago, as commit f21668f3, concluding work on integrating the patch series. I have some closing thoughts that I'd like to share before closing out the project.

I was casually discussing this project over IM with Robert the other day. I was asked a question I'd often asked myself about the "split after new item" heuristics: What if you're wrong? What if some "black swan" type workload fools your heuristics into bloating an index uncontrollably?

I gave an answer to his question that may have seemed kind of inscrutable. My intuition about the worst case for the heuristics is based on its similarity to the worst case for quicksort. Any real-world instance of quicksort going quadratic is essentially a case where we *consistently* do the wrong thing when selecting a pivot. A random pivot selection will still perform reasonably well, because we'll still choose the median pivot on average. A malicious actor will always be able to fool any quicksort implementation into going quadratic [1] in certain circumstances. We're defending against Murphy, not Machiavelli, though, so that's okay.

I think that I can produce a more tangible argument than this, though. The attached patch removes every heuristic that limits the application of the "split after new item" optimization (it doesn't force the optimization in the case of rightmost splits, or in the case where the new item happens to be first on the page, since the caller isn't prepared for that). This is an attempt to come up with a wildly exaggerated worst case. Nevertheless, the consequences are not actually all that bad. Summary:

* The "UK land registry" test case that I leaned on a lot for the patch has a final index that's about 1% larger. However, it was about 16% smaller compared to Postgres without the patch, so this is not a problem.

* Most of the TPC-C indexes are actually slightly smaller, because we didn't quite go as far as we could have (TPC-C strongly rewards this optimization). 8 out of the 10 indexes are either smaller or unchanged. The customer name index is about 28% larger, though, as is the oorder table index.

* TPC-E never benefits from the "split after new item" optimization, and yet the picture isn't so bad here either. The holding history PK is about 40% bigger, which is quite bad, and the biggest regression overall. However, in the other affected cases indexes are about 15% larger, which is not that bad.

Also attached are the regressions from my test suite in the form of diff files. These are the full details of the regressions, just in case that's interesting to somebody.

This isn't the final word. I'm not asking anybody to accept with total certainty that there can never be a "black swan" workload that the heuristics consistently mishandle, leading to pathological performance. However, I think it's fair to say that the risk of that happening has been managed well. The attached test patch literally removes any restraint on applying the optimization, and yet we arguably do no worse than Postgres 11 would overall.

Once again, I would like to thank my collaborators for all their help, especially Heikki.

[1] https://www.cs.dartmouth.edu/~doug/mdmspe.pdf

-- 
Peter Geoghegan
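To make the quicksort intuition concrete, here is a small standalone Python sketch (my own illustration, nothing to do with the patch). It counts comparisons for a textbook Lomuto-partition quicksort on already-sorted input, the classic bad case for naive first-element pivoting, and shows that random pivot selection stays in n log n territory on the very same input:

```python
import random

def quicksort(a, lo, hi, pick, counter):
    # Lomuto partition; recurse on the smaller side only, to bound depth
    while lo < hi:
        p = pick(lo, hi)
        a[p], a[hi] = a[hi], a[p]          # move chosen pivot to the end
        pivot, i = a[hi], lo
        for j in range(lo, hi):
            counter[0] += 1                # one comparison per element
            if a[j] < pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]          # pivot into final position i
        if i - lo < hi - i:
            quicksort(a, lo, i - 1, pick, counter)
            lo = i + 1
        else:
            quicksort(a, i + 1, hi, pick, counter)
            hi = i - 1

def count_comparisons(n, pick):
    a = list(range(n))                     # sorted input: bad for naive pivots
    c = [0]
    quicksort(a, 0, n - 1, pick, c)
    assert a == sorted(a)
    return c[0]

first = count_comparisons(2000, lambda lo, hi: lo)          # first element
rng = random.Random(42)
rand = count_comparisons(2000, lambda lo, hi: rng.randint(lo, hi))
print(f"first-element pivot: {first} comparisons")          # n(n-1)/2
print(f"random pivot:        {rand} comparisons")
```

Consistently picking the first element goes quadratic here (exactly n(n-1)/2 comparisons), while random pivots land one to two orders of magnitude lower. That's the shape of the argument: only a *consistently* wrong choice compounds.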
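To give a feel for what the optimization is aiming at, here is a deliberately crude standalone simulation (my own toy model, not nbtree's actual split logic; page routing, high keys, fillfactor and suffix truncation are all ignored). It fills a "leaf level" with many localized ascending insertion points, roughly TPC-C-like, and compares average page utilization for naive 50:50 splits against splitting immediately after the newly inserted item:

```python
import bisect

def insert(pages, key, split_after_new, cap=100):
    """Insert key into a toy leaf level: a sorted list of sorted pages."""
    if not pages[0]:                        # very first insertion
        pages[0].append(key)
        return
    firsts = [p[0] for p in pages]          # route to page by low bound
    idx = max(bisect.bisect_right(firsts, key) - 1, 0)
    page = pages[idx]
    pos = bisect.bisect_left(page, key)
    page.insert(pos, key)
    if len(page) > cap:                     # overflow: split this page
        split = pos + 1 if split_after_new else len(page) // 2
        split = max(1, min(split, len(page) - 1))   # keep both halves nonempty
        pages[idx:idx + 1] = [page[:split], page[split:]]

def avg_utilization(split_after_new, groups=20, per_group=500, cap=100):
    # Round-robin ascending insertions within many disjoint key ranges,
    # mimicking grouped, localized ascending inserts (a TPC-C-like pattern)
    pages = [[]]
    for i in range(per_group):
        for g in range(groups):
            insert(pages, g * 10**6 + i, split_after_new, cap)
    return (groups * per_group) / (len(pages) * cap)

naive = avg_utilization(split_after_new=False)
smart = avg_utilization(split_after_new=True)
print(f"50:50 split utilization:          {naive:.2f}")
print(f"split-after-new-item utilization: {smart:.2f}")
```

With these toy parameters the 50:50 strategy strands a half-empty page behind every insertion point (roughly 50% utilization), while splitting after the new item leaves packed pages behind (well over 70%). That's the effect the heuristics try to capture, and part of why misfiring merely costs some space rather than compounding.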
always-split-after-new-item.patch
Description: Binary data
land_balanced.diff
Description: Binary data
tpcc_balanced.diff
Description: Binary data
tpce_balanced.diff
Description: Binary data