Hi Peter, Thank you very much for your attention to this patch. Let me comment some points of your message.
On Sun, Jul 7, 2019 at 2:09 AM Peter Geoghegan <p...@bowt.ie> wrote: > On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova > <a.lubennik...@postgrespro.ru> wrote: > > The new version of the patch is attached. > > This version is even simpler than the previous one, > > thanks to the recent btree design changes and all the feedback I received. > > I consider it ready for review and testing. > > I took a closer look at this patch, and have some general thoughts on > its design, and specific feedback on the implementation. > > Preserving the *logical contents* of B-Tree indexes that use > compression is very important -- that should not change in a way that > outside code can notice. The heap TID itself should count as logical > contents here, since we want to be able to implement retail index > tuple deletion in the future. Even without retail index tuple > deletion, amcheck's "rootdescend" verification assumes that it can > find one specific tuple (which could now just be one specific "logical > tuple") using specific key values from the heap, including the heap > tuple's heap TID. This requirement makes things a bit harder for your > patch, because you have to deal with one or two edge-cases that you > currently don't handle: insertion of new duplicates that fall inside > the min/max range of some existing posting list. That should be rare > enough in practice, so the performance penalty won't be too bad. This > probably means that code within _bt_findinsertloc() and/or > _bt_binsrch_insert() will need to think about a logical tuple as a > distinct thing from a physical tuple, though that won't be necessary > in most places. Could you please elaborate more on preserving the logical contents? I can understand it as following: "B-Tree should have the same structure and invariants as if each TID in posting list be a separate tuple". So, if we imagine each TID to become separate tuple it would be the same B-tree, which just can magically sometimes store more tuples in page. Is my understanding correct? But outside code will still notice changes as soon as it directly accesses B-tree pages (like contrib/amcheck does). Do you mean we need an API for accessing logical B-tree tuples or something? > The need to "preserve the logical contents" also means that the patch > will need to recognize when indexes are not safe as a target for > compression/deduplication (maybe we should call this feature > deduplilcation, so it's clear how it differs from TOAST?). For > example, if we have a case-insensitive ICU collation, then it is not > okay to treat an opclass-equal pair of text strings that use the > collation as having the same value when considering merging the two > into one. You don't actually do that in the patch, but you also don't > try to deal with the fact that such a pair of strings are equal, and > so must have their final positions determined by the heap TID column > (deduplication within _bt_compress_one_page() must respect that). > Possibly equal-but-distinct values seems like a problem that's not > worth truly fixing, but it will be necessary to store metadata about > whether or not we're willing to do deduplication in the meta page, > based on operator class and collation details. That seems like a > restriction that we're just going to have to accept, though I'm not > too worried about exactly what that will look like right now. We can > work it out later. I think in order to deduplicate "equal but distinct" values we need at least to give up with index only scans. Because we have no restriction that equal according to B-tree opclass values are same for other operations and/or user output. > I think that the need to be careful about the logical contents of > indexes already causes bugs, even with "safe for compression" indexes. > For example, I can sometimes see an assertion failure > within_bt_truncate(), at the point where we check if heap TID values > are safe: > > /* > * Lehman and Yao require that the downlink to the right page, which is to > * be inserted into the parent page in the second phase of a page split be > * a strict lower bound on items on the right page, and a non-strict upper > * bound for items on the left page. Assert that heap TIDs follow these > * invariants, since a heap TID value is apparently needed as a > * tiebreaker. > */ > #ifndef DEBUG_NO_TRUNCATE > Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft), > BTreeTupleGetMinTID(firstright)) < 0); > ... > > This bug is not that easy to see, but it will happen with a big index, > even without updates or deletes. I think that this happens because > compression can allow the "logical tuples" to be in the wrong heap TID > order when there are multiple posting lists for the same value. As I > said, I think that it's necessary to see a posting list as being > comprised of multiple logical tuples in the context of inserting new > tuples, even when you're not performing compression or splitting the > page. I also see that amcheck's bt_index_parent_check() function > fails, though bt_index_check() does not fail when I don't use any of > its extra verification options. (You haven't updated amcheck, but I > don't think that you need to update it for these basic checks to > continue to work.) Do I understand correctly that current patch may produce posting lists of the same value with overlapping ranges of TIDs? If so, it's definitely wrong. > * Maybe we could do compression with unique indexes when inserting > values with NULLs? Note that we now treat an insertion of a tuple with > NULLs into a unique index as if it wasn't even a unique index -- see > the "checkingunique" optimization at the beginning of _bt_doinsert(). > Having many NULL values in a unique index is probably fairly common. I think unique indexes may benefit from deduplication not only because of NULL values. Non-HOT updates produce duplicates of non-NULL values in unique indexes. And those duplicates can take significant space. ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company