Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

Peter Geoghegan Thu, 04 Jul 2019 10:39:14 -0700

On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova
<a.lubennik...@postgrespro.ru> wrote:
> i - number of distinct values in the index.
> So i=1 means that all rows have the same key,
> and i=10000000 means that all keys are different.
>
> i           old size (MB)   new size (MB)
> 1           215             88
> 1000        215             90
> 100000      215             71
> 10000000    214             214
>
> For more, see the attached diagram with test results.


I tried this on my own "UK land registry" test data [1], which was
originally used for the v12 nbtree work. My test case has a
low-cardinality, multi-column text index. I chose this test case because
it was convenient for me.

On v12/master, the index is 1100MB, whereas with your patch it ends
up being 196MB -- over 5.5x smaller!
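
The ratios are easy to check from the figures in this thread. A minimal
sketch in Python follows; the sizes are the ones reported above, and the
helper name shrink_factor is only illustrative, not anything from the patch:

```python
def shrink_factor(old_mb, new_mb):
    """Return how many times smaller the new index is than the old one."""
    return old_mb / new_mb

# "UK land registry" index: v12/master vs. patched, sizes in MB.
land_registry = shrink_factor(1100, 196)

# (distinct values, old size MB, new size MB) from the quoted synthetic test.
synthetic = [
    (1, 215, 88),
    (1000, 215, 90),
    (100000, 215, 71),
    (10000000, 214, 214),
]
synthetic_factors = {
    i: round(shrink_factor(old, new), 2) for i, old, new in synthetic
}

print(f"land registry: {land_registry:.2f}x smaller")
for i, factor in synthetic_factors.items():
    print(f"i={i}: {factor}x smaller")
```

The all-unique case (i=10000000) comes out at a factor of 1.0, which is
what you would expect when there are no duplicates to merge.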

I also tried it out with the "Mouse genome informatics" database [2],
which was already improved considerably by the v12 work on duplicates.
This is helped tremendously by your patch. It's not quite 5.5x across
the board, of course. There are 187 indexes (on 28 tables), and almost
all of the indexes are smaller. Actually, *most* of the indexes are
*much* smaller. Very often 50% smaller.

I don't have time to do an in-depth analysis of these results today,
but clearly the patch is very effective on real world data. I think
that we tend to underestimate just how common indexes with a huge
number of duplicates are.

[1] https://postgr.es/m/cah2-wzn_nayk4pr0hrwo0stwhmxjp5qyu+x8vppt030xpqr...@mail.gmail.com
[2] http://www.informatics.jax.org/software.shtml
--
Peter Geoghegan
