On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova <a.lubennik...@postgrespro.ru> wrote: > The new version of the patch is attached. > This version is even simpler than the previous one, > thanks to the recent btree design changes and all the feedback I received. > I consider it ready for review and testing.
I took a closer look at this patch, and have some general thoughts on its design, and specific feedback on the implementation. Preserving the *logical contents* of B-Tree indexes that use compression is very important -- that should not change in a way that outside code can notice. The heap TID itself should count as logical contents here, since we want to be able to implement retail index tuple deletion in the future. Even without retail index tuple deletion, amcheck's "rootdescend" verification assumes that it can find one specific tuple (which could now just be one specific "logical tuple") using specific key values from the heap, including the heap tuple's heap TID. This requirement makes things a bit harder for your patch, because you have to deal with one or two edge-cases that you currently don't handle: insertion of new duplicates that fall inside the min/max range of some existing posting list. That should be rare enough in practice, so the performance penalty won't be too bad. This probably means that code within _bt_findinsertloc() and/or _bt_binsrch_insert() will need to think about a logical tuple as a distinct thing from a physical tuple, though that won't be necessary in most places. The need to "preserve the logical contents" also means that the patch will need to recognize when indexes are not safe as a target for compression/deduplication (maybe we should call this feature deduplilcation, so it's clear how it differs from TOAST?). For example, if we have a case-insensitive ICU collation, then it is not okay to treat an opclass-equal pair of text strings that use the collation as having the same value when considering merging the two into one. You don't actually do that in the patch, but you also don't try to deal with the fact that such a pair of strings are equal, and so must have their final positions determined by the heap TID column (deduplication within _bt_compress_one_page() must respect that). Possibly equal-but-distinct values seems like a problem that's not worth truly fixing, but it will be necessary to store metadata about whether or not we're willing to do deduplication in the meta page, based on operator class and collation details. That seems like a restriction that we're just going to have to accept, though I'm not too worried about exactly what that will look like right now. We can work it out later. I think that the need to be careful about the logical contents of indexes already causes bugs, even with "safe for compression" indexes. For example, I can sometimes see an assertion failure within_bt_truncate(), at the point where we check if heap TID values are safe: /* * Lehman and Yao require that the downlink to the right page, which is to * be inserted into the parent page in the second phase of a page split be * a strict lower bound on items on the right page, and a non-strict upper * bound for items on the left page. Assert that heap TIDs follow these * invariants, since a heap TID value is apparently needed as a * tiebreaker. */ #ifndef DEBUG_NO_TRUNCATE Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft), BTreeTupleGetMinTID(firstright)) < 0); ... This bug is not that easy to see, but it will happen with a big index, even without updates or deletes. I think that this happens because compression can allow the "logical tuples" to be in the wrong heap TID order when there are multiple posting lists for the same value. As I said, I think that it's necessary to see a posting list as being comprised of multiple logical tuples in the context of inserting new tuples, even when you're not performing compression or splitting the page. I also see that amcheck's bt_index_parent_check() function fails, though bt_index_check() does not fail when I don't use any of its extra verification options. (You haven't updated amcheck, but I don't think that you need to update it for these basic checks to continue to work.) Other feedback on specific things: * A good way to assess whether or not the "logical tuple versus physical tuple" thing works is to make sure that amcheck's "rootdescend" verification works with a variety of indexes. As I said, it has the same requirements for nbtree as retail index tuple deletion will. * _bt_findinsertloc() should not call _bt_compress_one_page() for !heapkeyspace (version 3) indexes -- the second call to _bt_compress_one_page() should be removed. * Why can't compression be used on system catalog indexes? I understand that they are not a compelling case, but we tend to do things the same way with catalog tables and indexes unless there is a very good reason not to (e.g. HOT, suffix truncation). I see that the tests fail when that restriction is removed, but I don't think that that has anything to do with system catalogs. I think that that's due to a bug somewhere else. Why have this restriction at all? * It looks like we could be less conservative in nbtsplitloc.c to good effect. We know for sure that a posting list will be truncated down to one heap TID even in the worst case, so we can safely assume that the new high key will be a lot smaller than the firstright tuple that it is based on when it has a posting list. We only have to keep one TID. This will allow us to leave more tuples on the left half of the page in certain cases, further improving space utilization. * Don't you need to update nbtdesc.c? * Maybe we could do compression with unique indexes when inserting values with NULLs? Note that we now treat an insertion of a tuple with NULLs into a unique index as if it wasn't even a unique index -- see the "checkingunique" optimization at the beginning of _bt_doinsert(). Having many NULL values in a unique index is probably fairly common. * It looks like amcheck's heapallindexed verification needs to have normalization added, to avoid false positives. This situation is specifically anticipated by existing comments above bt_normalize_tuple(). Again, being careful about "logical versus physical tuple" seems necessary. * Doesn't the nbtsearch.c/_bt_readpage() code that deals with backwards scans need to return posting lists backwards, not forwards? It seems like a good idea to try to "preserve the logical contents" here too, just to be conservative. Within nbtsort.c: * Is the new code in _bt_buildadd() actually needed? If so, why? * insert_itupprev_to_page_buildadd() is only called within nbtsort.c, and so should be static. The name also seems very long. * add_item_to_posting() is called within both nbtsort.c and nbtinsert.c, and so should remain non-static, but have less generic (and shorter) name. (Use the usual _bt_* style instead.) * Is nbtsort.c the right place for these functions, anyway? (Maybe, but maybe not, IMV.) I ran pgindent on the patch, and made some small manual whitespace adjustments, which is attached. There are no real changes, but some of the formatting in the original version you posted was hard to read. Please work off this for your next revision. -- Peter Geoghegan
0001-btree_compression_pg12_v1.patch-with-pg_indent-run.patch
Description: Binary data