On Wed, Jul 31, 2019 at 9:23 AM Anastasia Lubennikova
<a.lubennik...@postgrespro.ru> wrote:
> > * Included my own pageinspect hack to visualize the minimum TIDs in
> > posting lists. It's broken out into a separate patch file. The code is
> > very rough, but it might help someone else, so I thought I'd include
> > it.
> Cool, I think we should add it to the final patchset,
> probably, as separate function by analogy with tuple_data_split.
Good idea. Attached is v5, which is based on your v4. The three main differences between this and v4 are:

* Removed BT_COMPRESS_THRESHOLD stuff, for the reasons explained in my July 24 e-mail. We can always add something like this back during performance validation of the patch. Right now, having no BT_COMPRESS_THRESHOLD limit definitely improves space utilization for certain important cases, which seems more important than the uncertain/speculative downside.

* We now have experimental support for unique indexes. This is broken out into its own patch.

* We now handle LP_DEAD items in a special way within _bt_insertonpg_in_posting(). As you pointed out already, we do need to think about LP_DEAD items directly, rather than assuming that they cannot be on the page that _bt_insertonpg_in_posting() must process. More on that later.

> If sizeof(t_info) + sizeof(key) < sizeof(t_tid), resulting posting tuple
> can be larger. It may happen if keysize <= 4 byte.
> In this situation original tuples must have been aligned to size 16
> bytes each, and resulting tuple is at most 24 bytes (6+2+4+6+6). So this
> case is also safe.

I still need to think about the exact details of alignment within _bt_insertonpg_in_posting(). I'm worried about boundary cases there. I could be wrong.

> I changed DEBUG message to ERROR in v4 and it passes all regression tests.
> I doubt that it covers all corner cases, so I'll try to add more special
> tests.

It also passes my tests, FWIW.

> Hmm, I can't get the problem.
> In current implementation each posting tuple is smaller than BTMaxItemSize,
> so no split can lead to having tuple of larger size.

That sounds correct, then.

> No, we don't need them both. I don't mind combining them into one macro.
> Actually, we never needed BTreeTupleGetMinTID(),
> since its functionality is covered by BTreeTupleGetHeapTID.

I've removed BTreeTupleGetMinTID() in v5. I think it's fine to just have a comment next to BTreeTupleGetHeapTID(), and another comment next to BTreeTupleGetMaxTID().

> The main reason why I decided to avoid applying compression to unique
> indexes is the performance of microvacuum. It is not applied to items
> inside a posting tuple. And I expect it to be important for unique
> indexes, which ideally contain only a few live values.

I found that the performance of my experimental patch with unique indexes was significantly worse. It looks like this is a bad idea, as you predicted, though we may still want to do deduplication/compression with NULL values in unique indexes. I did learn a few things from implementing unique index support, though.

BTW, there is a subtle bug in how my unique index patch does WAL-logging -- see my comments within index_compute_xid_horizon_for_tuples(). The bug shouldn't matter if replication isn't used. I don't think that we're going to use this experimental patch at all, so I didn't bother fixing the bug.

> if (ItemIdIsDead(itemId))
>     continue;
>
> In the previous review Rafia asked about "some reason".
> Trying to figure out if this situation possible, I changed this line to
> Assert(!ItemIdIsDead(itemId)) in our test version. And it failed in a
> performance test. Unfortunately, I was not able to reproduce it.

I found it easy enough to see LP_DEAD items within _bt_insertonpg_in_posting() when running pgbench with the extra unique index patch. To give you a simple example of how this can happen, consider the comments about BTP_HAS_GARBAGE within _bt_delitems_vacuum(). That probably isn't the only way it can happen, either.
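To make that concrete, here is a minimal sketch of the kind of loop that has to confront dead items when it scans a leaf page looking for tuples to merge. This is illustration only, not the v5 code -- the function name and overall structure are invented:

#include "postgres.h"
#include "access/nbtree.h"

/*
 * Illustration only: walk a leaf page, deciding item by item whether the
 * tuple can be folded into a pending posting list.  The dead-item branch
 * is the part that cannot simply be asserted away.
 */
static void
dedup_scan_sketch(Page page)
{
	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
	OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
	OffsetNumber offnum;

	for (offnum = P_FIRSTDATAKEY(opaque);
		 offnum <= maxoff;
		 offnum = OffsetNumberNext(offnum))
	{
		ItemId		itemid = PageGetItemId(page, offnum);
		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);

		if (ItemIdIsDead(itemid))
		{
			/*
			 * Dead items really do show up here (easy to see with pgbench
			 * plus the unique index patch).  A real fix presumably has to
			 * remember these items so that
			 * index_compute_xid_horizon_for_tuples() can be called before
			 * they are overwritten -- the part that v5 doesn't do yet.
			 */
			continue;			/* placeholder -- not what the final code
								 * should do */
		}

		/* ... otherwise merge itup into the pending posting list ... */
		(void) itup;
	}
}

The point is only that the dead-item branch is reachable in practice, so the code has to do something deliberate there rather than asserting that it cannot happen.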
ISTM that we need to be prepared for LP_DEAD items during deduplication, rather than trying to prevent deduplication from ever having to see an LP_DEAD item. v5 makes _bt_insertonpg_in_posting() prepared to overwrite an existing item if it's an LP_DEAD item that falls in the same TID range (that is, one that is _bt_compare()-wise "equal" to an existing tuple, which may or may not be a posting list tuple already). I haven't made this code do something like call index_compute_xid_horizon_for_tuples(), even though that's needed for correctness (i.e. this new code is currently broken in the same way that I mentioned unique index support is broken). I also added a nearby FIXME comment to _bt_insertonpg_in_posting() -- I don't think that the code for splitting a posting list in two is currently crash-safe.

How do you feel about officially calling this deduplication, not compression? I think that it's a more accurate name for the technique.

--
Peter Geoghegan
v5-0001-Compression-deduplication-in-nbtree.patch
Description: Binary data
v5-0003-DEBUG-Add-pageinspect-instrumentation.patch
Description: Binary data
v5-0002-Experimental-support-for-unique-indexes.patch
Description: Binary data