On Fri, Nov 18, 2016 at 5:27 PM, Claudio Freire <klaussfre...@gmail.com> wrote: > I've been changing the on-disk format considerably, to one that makes > more sense.
I can see how that would be necessary. I'm going to suggest some more things that need a new btree version number now, too. I realized today, quite suddenly, why I disliked your original patch: it didn't go far enough with embracing the idea of heap-item-pointer-as-key. In other words, I didn't like that you didn't change anything in the internal pages. Maybe you should put heap TIDs someplace in new internal pages, so that even there we treat them as part of the key. This may significantly benefit UPDATE-heavy workloads where HOT doesn't work out so well. Particularly for non-unique indexes, we currently have to wade through a lot of bloat from the leaf level, rather than jumping straight to the correct leaf page when updating or inserting. You might not want to do the same with unique indexes, where we must initially buffer lock "the first leaf page that a duplicate could be on" while we potentially look in one or more additional sibling pages (but bloat won't be so bad there, perhaps). Or, you might want to make sure that B-Tree tuple insertions still find that "first page" to check, despite still generally treating heap item pointer as part of the key proper (in unique indexes, it can still behave like NULL, and not participate in uniqueness checking, while still essentially being part of the key/sort order). Of course, it isn't great to add stuff to the internal pages, but if you're also implementing suffix truncation, there is probably no harm at all. You're only going to affect cases that will also greatly benefit from having that extra information, to guide insertions of new values to the right place more efficiently (by inserters, I mostly mean insertions from UPDATEs). It also occurs to me that our performance for updates in the event of many NULL values in indexes is probably really bad, and could be helped by this. You'll want to test all of this with a workload that isn't very sympathetic to HOT, of course -- most of these benefits are seen when HOT doesn't help. > However, I haven't tested the current state of the patch thoroughly. > > If a WIP is fine, I can post the what I have, rebased. I'm mostly curious about the effects on bloat. I now feel like you haven't sufficiently examined the potential benefits there, since you never made heap item pointer a first class part of the key space. Maybe you'd like to look into that yourself first? >> I also think that this might help with bitmap index scans. > > How so? I was thinking about teaching nbtree to start the scan at the level above the leaf level. If the heap TID was part of the key proper, one can imagine it being much cheaper to build a bitmap for a moderate cardinality value (e.g., NULL values needed for "WHERE foo IS NULL" index scans). The exact details would need to be worked out, but one can imagine building a bitmap that is much larger than necessary one level up from the leaf, then lazily unsetting bits as we descend the B-Tree to the leaf level and find pages whose bitmap bits can be unset. We might balance the avoidance of I/O during the scan against how much we might expect to save in a subsequent bitmap heap scan, and so on. This might be based on a selectivity estimate. That's all fairly hand-wavy, certainly, but I see significant potential along these lines. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers