[
https://issues.apache.org/jira/browse/IGNITE-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Bessonov updated IGNITE-17571:
-----------------------------------
Epic Link: (was: IGNITE-16923)
> Implement GC for MV storages
> ----------------------------
>
> Key: IGNITE-17571
> URL: https://issues.apache.org/jira/browse/IGNITE-17571
> Project: Ignite
> Issue Type: Improvement
> Reporter: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
>
> h3. Basics
> The current implementation can only work with infinite storage space. This is
> because the storage works in append-only mode (except for transaction
> rollbacks).
> There's a value in the system called the "low watermark" (LW). It's guaranteed
> that no new RO transactions will be started at a time earlier than the LW. The
> existence of such a value allows us to safely delete all versions before it. We
> will not discuss the mechanics of acquiring this value, assuming that it is
> simply given.
> It's expected that deletion will be performed by background workers, without
> holding any transaction locks. The process is independent on each RAFT group
> member. GC shouldn't affect transactions in any way, if possible.
> h3. API
> The originally intended design looks like the following:
> {code:java}
> cleanupCompletedVersions(@Nullable byte[] lower, boolean fromInclusive,
> @Nullable byte[] upper, boolean toInclusive, Timestamp timestamp){code}
> Now, I don't think that this is necessarily a good design. The main problem
> with it is the existence of bounds:
> * First of all, why not just have an inclusive lower bound and an exclusive
> upper bound, like almost all other bounded methods in existence?
> * Second, I believe that this API was proposed under the assumption that row
> ids would have a uniform distribution, so that every "range" cleanup would
> result in a roughly equal amount of work. This is simply not true in the
> current implementation.
> RowId is a timestamp-based value that exists in a very narrow time slice,
> making most ranges empty and meaningless.
> Besides, the way row ids are stored is actually up to the storage. There are no
> restrictions on byte order when physically storing RowId objects.
> h3. Problems
> There's one thing that worries me: indexes.
> Because storages are unaware of table schemas, it's impossible to clean up
> indexes from within the storage. This gets me thinking. The API should be
> flexible enough that index cleanup can be performed as an external operation
> over the storage. This will also reduce the amount of work that we need to do
> for the implementation.
> To be specific, it feels like the method would look like this:
> {code:java}
> RowId cleanup(HybridTimestamp threshold, RowId startId, int numRows,
> BiConsumer<RowId, BinaryRow> indexCleaner);{code}
> Some explanation is required.
> * threshold represents the same thing as before: the low watermark.
> * startId is the row that should be iterated first in the current batch.
> * numRows is the number of rows that should be cleaned in the current batch. By
> this I mean individual versions, not chains.
> * the cleaner closure must be invoked to clean indexes after every individual
> version removal. Right now it doesn't look optimal to me, but I struggle to
> find a good solution for efficient index cleanup.
> * the next rowId is returned and should be used as startId in the next call;
> "null" means the cleanup is complete. In that case it can be restarted from the
> beginning or simply postponed until a new low watermark value is available.
> How to operate it:
> * numRows has two strategic purposes:
> ** to control the size of batches in "runConsistently" closures.
> ** to be able to run many cleanups in parallel, avoiding pool starvation.
> Every task is split into small chunks.
> * cleanup should start from the smallest possible row id.
> * the low watermark value can change between calls. This is fine and even
> preferable.
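> To make the contract concrete, here is a minimal sketch of the batched method
> over a toy in-memory store. All names and types here (ToyMvStore, Version, a
> UUID standing in for RowId, a long standing in for HybridTimestamp) are
> simplified stand-ins for illustration, not the real storage API:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;
import java.util.function.BiConsumer;

/**
 * Illustration only: a toy MV store keyed by a UUID (standing in for RowId),
 * where each value is a version chain sorted newest-first by commit timestamp
 * (a long standing in for HybridTimestamp). Not the real storage API.
 */
class ToyMvStore {
    /** One committed version: commit timestamp plus an opaque payload. */
    record Version(long commitTs, String row) {}

    /** RowId -> versions, newest first. */
    final TreeMap<UUID, List<Version>> chains = new TreeMap<>();

    /**
     * Removes versions that are no longer visible to any RO transaction at or
     * after {@code threshold}, at most {@code numRows} individual versions per
     * call. Returns the row id to resume from, or null when the pass is done.
     */
    UUID cleanup(long threshold, UUID startId, int numRows,
                 BiConsumer<UUID, String> indexCleaner) {
        int removed = 0;
        for (Map.Entry<UUID, List<Version>> e : chains.tailMap(startId, true).entrySet()) {
            List<Version> chain = e.getValue();
            // Index of the newest version visible at the threshold (if any);
            // everything strictly older than it is garbage.
            int keep = 0;
            while (keep < chain.size() && chain.get(keep).commitTs() > threshold) {
                keep++;
            }
            // Remove garbage from the tail (oldest) upwards.
            for (int i = chain.size() - 1; i > keep; i--) {
                if (removed == numRows) {
                    return e.getKey(); // Quota exhausted, resume here next time.
                }
                Version garbage = chain.remove(i);
                // The cleaner runs only AFTER the version left the row store.
                indexCleaner.accept(e.getKey(), garbage.row());
                removed++;
            }
        }
        return null; // The pass over the whole store is complete.
    }
}
```

> Note that the quota is checked per version, so a call can stop in the middle
> of a chain and resume from the same row id on the next call.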
> h3. Alternative
> Some might find the given API too complicated. There's another way that
> requires fewer actions from the "client".
> {code:java}
> void cleanup(HybridTimestamp threshold, RowId startId, BiConsumer<RowId,
> BinaryRow> indexCleaner);{code}
> In a way, both of these methods are equivalent; the difference is minuscule.
> One can be transformed into the other with relative ease. If anything, this
> one is even preferable as a first implementation, because it's smaller and
> leaves less room for errors.
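> To illustrate the claimed equivalence, here is how the void variant could be
> layered on top of the batched one. BatchedCleanup and CleanupDriver are
> hypothetical names with toy types, used only to show the transformation:

```java
import java.util.UUID;
import java.util.function.BiConsumer;

/**
 * Illustration of the equivalence claim: given the batched form (modeled as a
 * functional interface with toy types), the "all at once" form is a trivial
 * driver loop. Not the real storage API.
 */
class CleanupDriver {
    /** Batched form: returns the row id to resume from, or null when done. */
    interface BatchedCleanup {
        UUID cleanup(long threshold, UUID startId, int numRows,
                     BiConsumer<UUID, String> indexCleaner);
    }

    /** The alternative void form, expressed through the batched one. */
    static void cleanupAll(BatchedCleanup storage, long threshold,
                           BiConsumer<UUID, String> indexCleaner) {
        UUID next = new UUID(0, 0); // Smallest possible row id.
        while (next != null) {
            // Small batches keep "runConsistently" sections short and let
            // several cleanups share a pool without starving it.
            next = storage.cleanup(threshold, next, 100, indexCleaner);
        }
    }
}
```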
> h3. Different engines
> This is the part where I express my thoughts on what can (and will) go wrong
> in different storages.
> Generally speaking, right now there are three implementations that must be
> compatible in behavior:
> # the test storage, which uses a ConcurrentSkipListMap and actual chains
> represented as singly-linked lists
> # the two B+Tree-based storages. They also use chains represented as
> singly-linked lists, but stored off-heap this time
> # the RocksDB-based storage. It uses a flat structure, so it's just a big mess
> of (RowId, Timestamp?) pairs in the key set
> Let's discuss them one by one:
> # This is the easiest one. Currently, links to next chain elements are final.
> They must become volatile. There's no manual "freeing" of the memory used by
> chains, so it'll just work.
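> A sketch of that first point, with toy types (VersionChain is a hypothetical
> name): the "next" link becomes volatile, and GC simply cuts the chain after
> the newest version visible at the threshold, letting the JVM reclaim the
> unreachable tail:

```java
/**
 * Sketch for the test storage: a version chain node whose "next" link is
 * volatile instead of final, so a background GC thread can unlink old
 * versions while readers traverse the chain. Toy types, not the real code.
 */
class VersionChain {
    final long commitTs;
    final String row;
    volatile VersionChain next; // Was final; must be volatile for GC to unlink.

    VersionChain(long commitTs, String row, VersionChain next) {
        this.commitTs = commitTs;
        this.row = row;
        this.next = next;
    }

    /**
     * Drops every version strictly older than the newest one visible at the
     * threshold. Readers that already hold a reference to a dropped node can
     * still finish reading it; they just won't reach it through the chain.
     */
    void dropOldVersions(long threshold) {
        VersionChain node = this;
        // Walk to the newest version visible at the threshold (if any)...
        while (node.commitTs > threshold && node.next != null) {
            node = node.next;
        }
        // ...and cut the chain right after it. No manual freeing is needed:
        // unreachable nodes are reclaimed by the JVM garbage collector.
        node.next = null;
    }
}
```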
> # This one is hard (in my opinion). Imagine having a "RowVersion" instance
> and trying to read the "next" row versions, while either looking for a
> specific version or performing index build/cleanup, it doesn't matter. How do
> you guarantee that the link is valid and has not been deleted while you were
> doing your thing?
> Out of the box, there's no way to tell that an item in a data page has been
> deleted (to be completely honest, there _might_ be a loophole that exploits
> the fact that there's only a single _write_ thread that inserts anything and
> maybe a bunch of GC threads that delete data, but I wouldn't go that route).
> This is because _links_ have no rotation id, it's just not there. And we will
> not add one for this reason alone; it would complicate the design too much
> and possibly even compromise a lot of things, like data density.
> For this reason, some external synchronization is required. Links must never
> be dereferenced without 100% certainty that they actually hold the data that
> developers expect them to hold. There are no exceptions to this rule.
> The exact way to synchronize things is up to whoever implements GC in the
> engine.
> (There's also a problem with completely deleting rows from the tree. It may
> clash with iterators that cache data from the pages they have read. This may
> also result in obsolete links.)
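> One possible shape of that external synchronization, sketched deliberately
> bluntly with a partition-wide read-write lock (LinkGuard is a hypothetical
> name; the real engine would likely pick something finer-grained):

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.LongFunction;

/**
 * Illustration only: links into data pages must never be dereferenced unless
 * something guarantees the pointed-to row version still exists. One blunt way
 * is a partition-wide read-write lock: readers dereference links under the
 * read lock, GC frees row versions under the write lock.
 */
class LinkGuard {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    /** Dereference a link while GC is guaranteed not to free it. */
    <T> T readUnderGuard(long link, LongFunction<T> dereference) {
        lock.readLock().lock();
        try {
            return dereference.apply(link);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Free row versions while no reader can hold a stale link. */
    void gcUnderGuard(Runnable freeVersions) {
        lock.writeLock().lock();
        try {
            freeVersions.run();
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```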
> # RocksDB has an interesting property - it effectively creates a
> snapshot for each cursor that's created. I don't think we have ever really
> explored the realm of bugs that can occur because of that. Things that may
> break include:
> - transactional scans
> - index cleanup
> - GC
> I believe that an occasional call to {{RocksIteratorInterface#refresh}} will
> help here. But before doing that, we should clearly identify the problematic
> places and write tests that actually demonstrate a problem. Doing such things
> blindly is dangerous.
> h3. Random notes
> The passed closure should only be executed when the specific version has
> already been deleted. Otherwise the data in the row store would still match
> the indexes and the indexes would not be cleaned, which is wrong. There must
> be a test for this, of course.
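> The invariant that such a test should pin down can be sketched like this
> (a toy single-version store with hypothetical names):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/**
 * Sketch of the invariant to test: the index cleaner must observe the row
 * store AFTER the version has been removed from it. Toy single-version store,
 * not the real API.
 */
class CleanerOrderCheck {
    final Map<String, String> rowStore = new HashMap<>();

    /** Removes the version first, then notifies the cleaner. */
    void removeVersion(String rowId, BiConsumer<String, String> indexCleaner) {
        String removed = rowStore.remove(rowId);
        if (removed != null) {
            // At this point a lookup by rowId misses, so the cleaner can tell
            // the removed version apart from a live one and drop index entries.
            indexCleaner.accept(rowId, removed);
        }
    }
}
```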
> Unsurprisingly, integration of this method into the Raft state machine is out
> of scope. I don't think that LW propagation is implemented at this point; I
> may be wrong. Anyway, we'd better discuss it with the people implementing
> transactions right now.
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)