On 08.08.2020 21:18, Alexander Korotkov wrote:
On Sat, Aug 8, 2020 at 5:07 PM Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:
I agree with you that losing the sequential order of B-Tree pages may have
a negative impact on performance.
But it is first of all critical for order-by and range queries, when we
should traverse several subsequent leaf pages.
It is less critical for exact-search or delete/insert operations.
Efficiency of merge operations mostly depends on how many keys
will be stored at the same B-Tree page.
What do you mean by "mostly"?  Given PostgreSQL has quite small (8k)
pages, sequential read is several times faster than random read on SSDs
(dozens of times on HDDs).  I don't think this is something to
neglect.
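The gap can be illustrated with a back-of-envelope calculation. The device figures below are illustrative assumptions (typical ballpark numbers for a SATA SSD and a 7200 RPM HDD), not measurements:

```python
# Back-of-envelope: random vs sequential reads of 8 KB pages.
# All device figures are assumed ballpark values, not benchmarks.

PAGE = 8 * 1024  # PostgreSQL page size in bytes

# Hypothetical HDD: ~8 ms average seek, ~150 MB/s sequential throughput
hdd_random_pages_per_s = 1 / 0.008                # ~125 pages/s
hdd_seq_pages_per_s = 150 * 1024 * 1024 / PAGE    # ~19200 pages/s

# Hypothetical SATA SSD: ~0.1 ms random access, ~500 MB/s sequential
ssd_random_pages_per_s = 1 / 0.0001               # ~10000 pages/s
ssd_seq_pages_per_s = 500 * 1024 * 1024 / PAGE    # ~64000 pages/s

hdd_ratio = hdd_seq_pages_per_s / hdd_random_pages_per_s
ssd_ratio = ssd_seq_pages_per_s / ssd_random_pages_per_s
print(f"HDD: sequential/random = {hdd_ratio:.0f}x")
print(f"SSD: sequential/random = {ssd_ratio:.1f}x")
```

Under these assumptions the HDD ratio lands in the "dozens of times" range and the SSD ratio in the single digits, consistent with the claim above.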

When you insert one record in a B-Tree, the order of pages doesn't matter at all.
If you insert ten records into one leaf page, then the order is also not so important.
If you insert 100 records, 50 go to one page and 50 to the next page,
then insertion may be faster if the second page immediately follows the first one on disk.
But such an insertion may cause a page split and so allocation of a new page,
so sequential write order can still be violated.
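A toy sketch of this effect, with made-up page capacity and key sequence (not PostgreSQL's actual split logic): a split allocates the next free page number, so the right sibling need not be physically adjacent, and the logical leaf order drifts away from the physical one.

```python
# Toy model of leaf-page splits breaking on-disk order.
# Pages hold CAPACITY keys; a split moves the upper half of a full page
# into a freshly allocated page number.

CAPACITY = 4             # keys per leaf page (toy value)
pages = {0: []}          # physical page number -> sorted keys
order = [0]              # logical (key) order of leaf pages
next_page = 1            # next free physical page number

def insert(key):
    global next_page
    # find the rightmost leaf whose smallest key is <= key
    idx = 0
    for i, p in enumerate(order):
        if pages[p] and pages[p][0] <= key:
            idx = i
    p = order[idx]
    pages[p].append(key)
    pages[p].sort()
    if len(pages[p]) > CAPACITY:          # split: upper half to a new page
        mid = len(pages[p]) // 2
        pages[next_page] = pages[p][mid:]
        pages[p] = pages[p][:mid]
        order.insert(idx + 1, next_page)
        next_page += 1

# ascending inserts split page 0 into pages 0 and 1 ...
for k in [10, 20, 30, 40, 50]:
    insert(k)
# ... then small keys split page 0 again, allocating page 2 between them
for k in [1, 2, 3]:
    insert(k)

print("logical leaf order:", order)  # physical numbers no longer monotonic
```

After these inserts the logical leaf chain visits physical pages `[0, 2, 1]`: a range scan over this toy index would jump backwards on disk.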

And it is first of all
determined by the size of the top index and the key distribution.
How can you be sure that the top index can fit in memory?  On production
systems, typically there are multiple consumers of memory: other
tables, indexes, other LSMs.  This is one of the reasons why LSM
implementations have multiple levels: they don't know in advance which
levels fit in memory.  Another reason is dealing with very large
datasets.  And I believe there is a quite strong reason to keep page
order sequential within a level.

There is no guarantee that the top index is kept in memory.
But since top index pages are frequently accessed, I hope that the buffer manager's cache replacement
algorithm does its best to keep them in memory.
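The intuition can be sketched with a simple LRU model (PostgreSQL actually uses a clock-sweep variant, but the hot-page behavior is similar). The page counts and access pattern below are hypothetical:

```python
# Sketch: frequently accessed top-index pages survive in an LRU cache
# even when the cache is far smaller than the whole index.
from collections import OrderedDict
import random

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()

    def access(self, page):
        if page in self.pages:
            self.pages.move_to_end(page)          # mark as recently used
        else:
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)    # evict least recently used
            self.pages[page] = True

random.seed(0)
cache = LRUCache(capacity=100)     # cache holds 100 of ~10000 pages
TOP = range(10)                    # hypothetical top-index pages
LEAVES = range(10, 10_000)         # many rarely reused leaf pages

for _ in range(5_000):
    for p in TOP:                          # every lookup descends via the top
        cache.access(p)
    cache.access(random.choice(LEAVES))    # then touches one random leaf

resident = sum(1 for p in TOP if p in cache.pages)
print(f"top-index pages still cached: {resident}/10")
```

Because the top pages are touched on every descent, they are never the least recently used, so the leaf traffic evicts only other leaves.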

I'm OK with your design for a third-party extension.  It's very cool
to have.  But I'm -1 for something like this to get into core
PostgreSQL, assuming it's feasible to push some effort and get
state-of-art LSM there.
I realize that it is not a true LSM.
But still I want to note that it is able to provide a ~10x increase in insert speed when the size of the index is comparable with RAM size.
And the "true LSM" from RocksDB shows similar results.
Maybe if the size of the index were 100 times larger than the size of RAM, RocksDB would be significantly faster than Lsm3.
But modern servers have 0.5-1TB of RAM.
I can't believe that there are databases with 100TB indexes.


