On Wed, Jan 29, 2020 at 6:56 AM Robert Haas <robertmh...@gmail.com> wrote:
> This (and the rest of the explanation) don't really address my
> concern. I understand that deduplicating in lieu of splitting a page
> in a unique index is highly likely to be a win. What I don't
> understand is why it shouldn't just be a win, period. Not splitting a
> page seems like it has a big upside regardless of whether the index
> is unique -- and in fact, the upside could be a lot bigger for a
> non-unique index. If the coarse-grained LP_DEAD thing is the problem,
> then I can grasp that issue, but you don't seem very worried about
> that.
You're right that I'm not worried about the coarse-grained LP_DEAD thing here. What I'm concerned about is cases where we attempt deduplication, but it doesn't work out because there are no duplicates -- that means we waste some cycles. Or cases where we manage to delay a split, but only for a very short period of time -- in theory it would be preferable to just accept the page split up front. However, in practice we can't make these distinctions, since it would hinge upon predicting the future, and we don't have a good heuristic. The fact that a deduplication pass barely manages to prevent an immediate page split isn't a useful proxy for how likely it is that the page will split in any timeframe. We might have prevented it from happening for another 2 milliseconds, or we might have prevented it forever. It's totally workload dependent.

The good news is that these extra cycles aren't very noticeable, even with a workload where deduplication doesn't help at all (e.g. with several indexes on an append-only table, and few or no duplicates). The cycles are generally a fixed cost. Furthermore, it seems to be possible to virtually avoid the problem in the case of unique indexes by applying the incoming-item-is-duplicate heuristic (there's a toy sketch of the idea at the end of this mail). Maybe I am worrying over nothing.

> Generally, I think it's a bad idea to give the user an "emergency off
> switch" and then sometimes ignore it. If the feature seems to be
> generally beneficial, but you're worried that there might be
> regressions in obscure cases, then turn it on by default, and give
> the user the ability to forcibly turn it off. But don't give the user
> the opportunity to forcibly turn it off sometimes. Nobody's going to
> run around setting a reloption just for fun -- they're going to do it
> because they hit a problem.

Actually, we do. There is both a reloption and a GUC. The GUC only works with non-unique indexes, where the extra cost I describe might be an issue (it can at least be demonstrated in a benchmark). The reloption works with both unique and non-unique indexes. It will be useful for turning off deduplication selectively in non-unique indexes. In the case of unique indexes, it can be thought of as a debugging thing (though we really don't want users to think about deduplication in unique indexes at all). I'm really having a hard time imagining or demonstrating any downside with unique indexes, given the heuristic, so ISTM that turning off deduplication really is just a debugging thing there.

My general assumption is that 99%+ of users will want to use deduplication everywhere. I am concerned about the remaining ~1% of users who might have a workload that is slightly regressed by deduplication. Even this small minority of users will still want to use deduplication in unique indexes. Plus we don't really want to talk about deduplication in unique indexes to users, since it'll probably confuse them. That's another reason to treat each case differently. Again, maybe I'm making an excessively thin distinction.

I really want to be able to enable the feature everywhere, while also not getting even one complaint about it. Perhaps that's just not a realistic or useful goal.

-- 
Peter Geoghegan
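To illustrate the shape of the decision I'm describing, here is a toy model in C. Everything here is invented for illustration -- the miniature page structure, the function names, and the "drop the duplicate" dedup pass are not the patch's actual code, which works in terms of nbtree posting lists that preserve heap TIDs. It shows how the incoming-item-is-duplicate heuristic lets a unique index skip the wasted cycles entirely: a dedup pass is only attempted when the incoming key is already on the page, which in a unique index can only mean old row versions -- exactly the case where deduplication buys time for cleanup.

#include <stdbool.h>
#include <stdio.h>

#define MAX_ITEMS 8

typedef struct LeafPage
{
    int  keys[MAX_ITEMS];   /* kept in sorted order, as on a real leaf page */
    int  nitems;
    bool is_unique_index;
} LeafPage;

static bool
page_has_duplicate(const LeafPage *page, int key)
{
    for (int i = 0; i < page->nitems; i++)
        if (page->keys[i] == key)
            return true;
    return false;
}

/*
 * Toy dedup pass: the real code merges equal keys' heap TIDs into a
 * posting list; here we just drop the extra copies to model freed space.
 */
static bool
dedup_pass(LeafPage *page)
{
    int  nkept = 0;
    bool freed = false;

    for (int i = 0; i < page->nitems; i++)
    {
        if (nkept > 0 && page->keys[nkept - 1] == page->keys[i])
            freed = true;   /* "merged" into the previous item */
        else
            page->keys[nkept++] = page->keys[i];
    }
    page->nitems = nkept;
    return freed;
}

/* Returns true if the page split was (notionally) avoided. */
static bool
insert_on_full_page(LeafPage *page, int new_key)
{
    /*
     * Unique-index heuristic: don't bother with a dedup pass unless the
     * incoming key already exists on the page. No duplicates means no
     * version churn, so just accept the split up front.
     */
    if (page->is_unique_index && !page_has_duplicate(page, new_key))
        return false;

    if (dedup_pass(page) && page->nitems < MAX_ITEMS)
    {
        page->keys[page->nitems++] = new_key;   /* sort order ignored here */
        return true;
    }
    return false;   /* dedup didn't free enough space; split anyway */
}

int
main(void)
{
    /* A full "unique index" page with duplicates from version churn */
    LeafPage page = {{1, 1, 2, 2, 3, 3, 4, 5}, MAX_ITEMS, true};

    printf("split avoided: %s\n",
           insert_on_full_page(&page, 3) ? "yes" : "no");
    return 0;
}

In this toy, inserting another 3 into the full page triggers a dedup pass that frees three slots, so the split is avoided. With an all-distinct page the function returns immediately -- that early exit is the fixed cost being avoided in the unique-index case (and presumably the duplicate check can piggyback on work the unique-constraint enforcement has to do anyway, though the details of that are beyond this sketch).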