On Fri, Sep 28, 2018 at 8:00 AM Peter Eisentraut
<peter.eisentr...@2ndquadrant.com> wrote:
> On 21/09/2018 01:18, Peter Geoghegan wrote:
> > * This means that there is a compatibility issue for anyone that is
> > already right on the threshold -- we *really* don't want to see a
> > REINDEX fail, but that seems like a possibility that we need to talk
> > about now.
>
> When would the REINDEX need to happen? Will the code still be able to
> read and write v3 btrees?
The patch doesn't do that at the moment, because I've been busy refining it, and because there are a couple of outstanding questions about how to go about it -- the questions I pose on this thread. I accept that it's absolutely essential that nbtree be able to read both v2 and v3 indexes as part of any new v4, without a measurable performance penalty. That's the standard that this project should be held to.

A REINDEX will never *need* to happen. v2 and v3 indexes will gradually go extinct, without many users actually noticing.

The on-disk representation of my patch leaves several status bits in INDEX_ALT_TID_MASK tuples free (3 of the 4 will remain, since I'm now using 1 for BT_HEAP_TID_ATTR), so it should be easier to add various further enhancements in a v5 or v6 of nbtree. This is similar to how changes to GIN were managed in the past (it may be interesting to look at a GIN leaf page with pg_hexedit, since it'll show you the gory details in a relatively accessible way). I can imagine an INDEX_ALT_TID_MASK bit being used for tuples that point to the heap -- not just for pivot tuples. I have an eye on things like duplicate lists at the leaf level, which would probably work like a GIN posting list.

> Could there perhaps be an amcheck or
> pageinspect feature that tells you ahead of time if there are too large
> items in an old index?

That would be easy, but it might not be any better than just having REINDEX or CREATE INDEX [CONCURRENTLY] throw an error. They're already pretty fast.

I could easily raise a WARNING when amcheck is run against a pre-v4 index that has an index tuple that's too big to make it under the new, lower limit. Actually, I could even write an SQL query that had pageinspect notice affected tuples, without changing any C code (a rough sketch of such a query appears in the postscript below).

Bear in mind that TOAST compression incidentally plays a big role here. It makes it very unlikely that indexes in the field are right at the old 2712 byte threshold, without even 8 bytes of wiggle room, because it's impossible to predict how well pglz compression will work with that kind of precision. Several highly improbable things need to happen at the same time before REINDEX can break. I cannot see how any app could have evolved to depend on having 2712 bytes, without even a single MAXALIGN() quantum to spare.

I wrote a stress test around the new "1/3 of a page" restriction (2704 bytes on 8KB pages, rather than the old 2712). It involved a large text attribute with PLAIN storage, since I couldn't sensibly test the restriction while using pglz compression in the index (a sketch of the setup also appears in the postscript below). When all of your tuples are 2704 bytes, you end up with a ridiculously tall B-Tree that performs horribly. I think that I saw 11 levels with the test case. The tallest B-Tree that you'll ever see in the wild is probably one that's 5 levels deep, which is very tall indeed. Because of the logarithmic rate at which a B-Tree gains new levels, 11 levels is just ludicrous.

(Granted, you only have to have one tuple that's precisely 2712 bytes in length for REINDEX to break.)

--
Peter Geoghegan
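
PS: Here is a rough sketch of the kind of pageinspect query I have in mind. Treat it as an illustration only -- it assumes the default 8kB block size and a made-up index name (old_idx), and deleted pages, if any, may need to be skipped separately:

-- Report tuples in an existing (pre-v4) index that would not fit under
-- the proposed 2704 byte v4 limit.  Block 0 (the meta page) is skipped;
-- itemlen is the MAXALIGN()'d on-page size, so it compares directly
-- against the limit.
SELECT blkno, itemoffset, itemlen
FROM generate_series(1, pg_relation_size('old_idx') / 8192 - 1) AS blkno,
     LATERAL bt_page_items('old_idx', blkno::int)
WHERE itemlen > 2704;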
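
And here is roughly what the stress test setup looked like. Again, just a sketch -- the names are made up, and the exact amount of padding needed to land on 2704 bytes once the varlena header and MAXALIGN() are accounted for may need adjusting:

-- Large text attribute with PLAIN storage, so that pglz/TOAST compression
-- cannot shrink the index tuples.  Only a couple of tuples fit on each
-- page, so the tree gains levels very quickly as rows are added.
CREATE TABLE pathological (filler text);
ALTER TABLE pathological ALTER COLUMN filler SET STORAGE PLAIN;
INSERT INTO pathological
SELECT repeat('x', 2680) || lpad(i::text, 10, '0')
FROM generate_series(1, 100000) i;
CREATE INDEX ON pathological (filler);
-- "SELECT level FROM bt_metap('pathological_filler_idx')" then shows how
-- tall the resulting tree is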