On 05.08.2020 02:59, Alexander Korotkov wrote:
The things you're writing make me uneasy. I initially understood
lsm3 as a quick and dirty prototype, while you're probably keeping
some design in your mind (for instance, the original design of LSM).
However, your message makes me think you're trying to defend the
approach currently implemented in the lsm3 extension. Therefore, I
have to criticise this approach.
1) The base index can degrade. First, merges can cause page
splits, so the logical ordering of pages becomes less correlated
with their physical ordering with each merge.
2) If your workload includes updates and/or deletes, page
utilization may also degrade.
3) While the base index degrades, merge performance degrades too.
Traversing the base index in logical order will require more and more
random reads (at some point almost every page read will be random).
As the base index becomes large and/or bloated, you push fewer top
index tuples to a single base index page (at some point you will push
one tuple per page).
The original LSM design implies strict guarantees on average resources
spent per index operation. Your design doesn't. Moreover, I bet lsm3
will degrade significantly even on an insert-only workload. It should
degrade to the performance level of B-tree once you insert enough
data. Try something like number_of_merges =
number_of_tuples_per_index_page * 2 and you should see this
degradation. A real LSM doesn't degrade that way.
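The degradation described in point 3 can be illustrated with a toy model (all numbers below are hypothetical, not taken from lsm3): with a fixed-size top index flushed into an ever-growing base index, the average number of top-index tuples landing on each base leaf page shrinks as the base grows, until a merge touches roughly one page per tuple.

```python
# Toy model: average top-index tuples merged into each base leaf page.
# TUPLES_PER_PAGE and TOP_INDEX_TUPLES are assumed, illustrative values.
TUPLES_PER_PAGE = 200        # assumed nbtree leaf page capacity
TOP_INDEX_TUPLES = 10_000    # top index is flushed at this size

def tuples_per_base_page(base_tuples):
    """Merged tuples per base leaf page, assuming uniformly random keys."""
    base_pages = max(1, base_tuples // TUPLES_PER_PAGE)
    return TOP_INDEX_TUPLES / base_pages

for base in (10_000, 1_000_000, 100_000_000):
    # As the base grows, the ratio drops below 1: almost every merged
    # tuple then costs its own random page read/write.
    print(base, tuples_per_base_page(base))
```

With these numbers the ratio falls from 200 tuples per page to 0.02, which is the point where merge cost per tuple approaches that of a plain B-tree insert.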
I mostly agree with your criticism.
My Lsm3 is not a true LSM, but from my point of view it preserves the basic
principles of LSM: a fast and small top index and bulk updates of the main index.
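The principle can be sketched in a few lines (this is a simplified model, not the actual Lsm3 C code; the threshold and in-memory lists are assumptions for illustration): inserts land in a small sorted "top" level, and once it reaches a threshold it is merged into the large "base" level in a single ordered pass.

```python
import bisect

class TwoLevelIndex:
    """Minimal sketch of the two-level idea behind Lsm3."""

    def __init__(self, top_limit=4):
        self.top = []            # small, fast level absorbing new inserts
        self.base = []           # large level, updated only in bulk
        self.top_limit = top_limit

    def insert(self, key):
        bisect.insort(self.top, key)
        if len(self.top) >= self.top_limit:
            self._merge()

    def _merge(self):
        # Bulk update: one ordered pass over both levels, then the
        # top level is emptied (analogous to truncating the top index).
        merged, i, j = [], 0, 0
        while i < len(self.top) and j < len(self.base):
            if self.top[i] <= self.base[j]:
                merged.append(self.top[i]); i += 1
            else:
                merged.append(self.base[j]); j += 1
        merged.extend(self.top[i:])
        merged.extend(self.base[j:])
        self.base, self.top = merged, []

    def lookup(self, key):
        # A search must consult both levels.
        if key in self.top:
            return True
        i = bisect.bisect_left(self.base, key)
        return i < len(self.base) and self.base[i] == key
```

A lookup checks the small top level first and then binary-searches the base level, mirroring how a query against Lsm3 has to probe both indexes.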
My experiments with RocksDB show that degradation also takes place in
that case. More experiments are needed to compare the two approaches.
Concerning degradation of the base index: B-Tree is itself a balanced tree.
Yes, insertion of random keys can cause splits of B-Tree pages.
In the worst case half of each B-Tree page will be empty, so the B-Tree will
be two times larger than the ideal tree.
This may cause a slowdown of up to two times, but that is all: there should
not be unbounded degradation of speed tending to zero.
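A back-of-envelope check of that bound (page capacity is a hypothetical number): if a split leaves each leaf at least half full, the page count, and hence scan cost, is at most a constant factor of two above the fully packed tree.

```python
# Worst-case nbtree bloat bound: each leaf stays >= half full after a
# split, so the tree is at most 2x the size of a fully packed one.
PAGE_CAPACITY = 200            # assumed tuples per fully packed leaf
MIN_FILL_AFTER_SPLIT = 0.5     # each page keeps at least half its tuples

def page_counts(n_tuples):
    """Return (ideal_pages, worst_case_pages) for n_tuples keys."""
    ideal = -(-n_tuples // PAGE_CAPACITY)   # ceiling division
    worst = -(-n_tuples // int(PAGE_CAPACITY * MIN_FILL_AFTER_SPLIT))
    return ideal, worst

ideal, worst = page_counts(1_000_000)
print(worst / ideal)   # bounded constant, does not tend to infinity
```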
Right now vacuum processes Lsm3 indexes in the usual way.
Removing records from the top indexes may not be needed at all (just because
these indexes will be truncated in any case).
But as the size of the top index is expected to be small enough,
vacuuming it should not take a long time,
so I didn't try to avoid it (although it should not be difficult - just
disable ambulkdelete for the corresponding nbtree wrappers).
It doesn't seem important, but I don't get your point here. Postgres
expects ambulkdelete to delete TIDs from the index. If you don't delete
them from the top index, those TIDs will be merged into the base index. And
that could lead to wrong query answers unless you eliminate those TIDs in
a different way (during the merge stage or something).
Yes, you are right. It is not possible to avoid deleting TIDs from the top
indexes.
Concerning deletes from the main index - I do not understand how they can be
optimized.
This is a trick you can learn from almost every LSM implementation.
For instance, check the leveldb docs [1] about the "delete marker". For
sure, that requires some redesign of vacuum and can't be done in an
extension (at least not in a reasonable way). But, frankly speaking, I
think core modifications are inevitable to utilize the power of LSM in
PostgreSQL.
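The delete-marker idea can be sketched like this (a simplified model of the leveldb-style tombstone, not PostgreSQL code; the operation names are assumptions): a delete is recorded as a tombstone in the small top level, and the expensive removal from the big base level happens only later, during the bulk merge, when the tombstone cancels the matching key.

```python
# Sketch of LSM "delete markers" (tombstones): deletes are cheap records
# in the top level; the base level is only cleaned during a merge.
PUT, DEL = "put", "del"

def merge(base, top_ops):
    """base: sorted list of keys; top_ops: {key: PUT or DEL} (latest op wins)."""
    result = []
    for key in sorted(set(base) | set(top_ops)):
        op = top_ops.get(key)
        if op == DEL:
            continue            # tombstone: key is dropped during the merge
        result.append(key)      # untouched base key, or a new PUT
    return result

print(merge([1, 2, 3], {2: DEL, 4: PUT}))  # → [1, 3, 4]
```

Applied to the thread's question: instead of vacuum deleting each TID from the main index one by one, deletions would ride along with the next bulk merge, which is why it would need vacuum redesign in core.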
The main idea of Lsm3 was to investigate whether it is possible to
achieve the same result as with a "classical" LSM
using standard Postgres nbtree indexes. Right now it seems to me that the
answer is positive, but I have not performed
exhaustive measurements. For example, I have not measured vacuum overhead
(it was enabled, so vacuuming took place
in my benchmark, but I have not tried to separate its overhead and
influence on performance), index search speed, ...
Links
1. https://github.com/google/leveldb/blob/master/doc/impl.md
------
Regards,
Alexander Korotkov