[ https://issues.apache.org/jira/browse/IGNITE-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ivan Bessonov resolved IGNITE-17571.
------------------------------------
    Fix Version/s: 3.0
       Resolution: Fixed

Implement GC for MV storages
----------------------------

        Key: IGNITE-17571
        URL: https://issues.apache.org/jira/browse/IGNITE-17571
    Project: Ignite
 Issue Type: Epic
   Reporter: Ivan Bessonov
   Priority: Major
     Labels: ignite-3
    Fix For: 3.0

h3. Basics

The current implementation only works with infinite storage space, because the storage operates in append-only mode (except for transaction rollbacks).

There is a value in the system called the "low watermark". It is guaranteed that no new RO transaction will be started at a time earlier than the low watermark, which makes it safe to delete all versions that precede it. We will not discuss the mechanics of acquiring this value here and simply assume that it is given.

Deletion is expected to be performed by background workers, without holding any transaction locks. The process runs independently on each RAFT group member. GC shouldn't affect transactions in any way, if possible.

h3. API

The originally intended design looks like the following:

{code:java}
cleanupCompletedVersions(@Nullable byte[] lower, boolean fromInclusive, @Nullable byte[] upper, boolean toInclusive, Timestamp timestamp)
{code}

Now, I don't think that this is necessarily a good design. The main problem with it is the existence of bounds:
* First of all, why not just have an inclusive lower bound and an exclusive upper bound, like almost every other method with bounds in existence?
* Second, I believe this API was proposed on the assumption that row ids are uniformly distributed, so that every "range" cleanup results in a roughly equal amount of work. This is simply not true in the current implementation: RowId is a timestamp-based value that exists in a very narrow time slice, which makes most ranges empty and meaningless. Besides, the way row ids are stored is up to the storage - there are no restrictions on byte order when physically storing RowId objects.

h3. Problems

There is one thing that worries me: indexes. Because storages are unaware of table schemas, they cannot clean up indexes themselves. This gets me thinking: the API should be flexible enough that index cleanup can be performed as an external operation over the storage. This will also reduce the amount of work needed for the implementation.

To be specific, it feels like the method would look like this:

{code:java}
RowId cleanup(HybridTimestamp threshold, RowId startId, int numRows, BiConsumer<RowId, BinaryRow> indexCleaner);
{code}

An explanation is required:
* threshold represents the same thing as before - the low watermark.
* startId is the first row to iterate in the current batch.
* numRows is the number of rows to clean in the current batch. By this I mean individual versions, not chains.
* The cleaner closure must be used to clean indexes after every individual version removal. Right now this doesn't look optimal to me, but I struggle to find a good solution for efficient index cleanup.
* The next rowId is returned and should be used as startId in the next call. "null" means that the cleanup is complete; in that case it can be restarted from the beginning or simply postponed until a new low watermark value is available.

How to operate it (a sketch of such a worker loop follows this list):
* numRows has two strategic purposes:
** to control the size of batches in "runConsistently" closures;
** to be able to run many cleanups in parallel while avoiding pool starvation - every task is split into small chunks.
* Cleanup should start from the smallest possible row id.
* The low watermark value can change between calls. This is fine and even preferable.
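To make the intended usage concrete, here is a minimal sketch of such a worker loop. It is an illustration only: the MvPartitionStorage owner of the proposed method, the lowWatermark() supplier, the batch size and the RowId.lowestRowId helper are assumptions, not part of the proposal.

{code:java}
// Hypothetical GC worker loop over the proposed batched API.
// "lowWatermark()" and "RowId.lowestRowId()" are illustrative assumptions.
void runGc(MvPartitionStorage storage, int partitionId, BiConsumer<RowId, BinaryRow> indexCleaner) {
    RowId startId = RowId.lowestRowId(partitionId); // start from the smallest possible row id
    int numRows = 100; // small batches keep "runConsistently" closures short

    while (true) {
        // Picking up a fresh low watermark between batches is fine and even preferable.
        HybridTimestamp threshold = lowWatermark();

        RowId nextId = storage.cleanup(threshold, startId, numRows, indexCleaner);

        if (nextId == null) {
            return; // cleanup is complete; postpone until the watermark advances
        }

        startId = nextId;
    }
}
{code}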
BUT, such a design is a "backwards" design, and it gives the storage too much responsibility.

h3. Alternative

Some might find the API above too complicated. There is another way that requires fewer actions from the "client":

{code:java}
void cleanup(HybridTimestamp threshold, RowId startId, BiConsumer<RowId, BinaryRow> indexCleaner);
{code}

In a way, the two methods are equivalent - the difference is minuscule, and one can be transformed into the other with relative ease. If anything, this one is even preferable as a first implementation, because it's smaller and there's less room for errors.

Another alternative is to have an internal queue, like in the case of TTL in Ignite 2. This would be the simplest API to implement (a sketch of the consuming side follows):

{code:java}
@Nullable Pair<RowId, BinaryRow> enqueue(HybridTimestamp threshold);
{code}
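Assuming that enqueue pops and returns one garbage version at a time, the consuming side could look roughly like the following. The getFirst()/getSecond() accessors on Pair are assumptions for illustration.

{code:java}
// Hypothetical consumer of the queue-based API: drain garbage versions
// one by one until nothing older than the threshold remains.
void drainGarbage(MvPartitionStorage storage, HybridTimestamp threshold, BiConsumer<RowId, BinaryRow> indexCleaner) {
    Pair<RowId, BinaryRow> removed;

    while ((removed = storage.enqueue(threshold)) != null) {
        // The removed version is handed back so that index cleanup can be
        // performed outside of the schema-unaware storage.
        indexCleaner.accept(removed.getFirst(), removed.getSecond());
    }
}
{code}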
h3. Different engines

This is the part where I express my thoughts on what can (and will) go wrong in different storages. Generally speaking, right now there are three implementations that must be compatible in behavior:
# a test storage that uses a ConcurrentSkipListMap and actual version chains, represented as singly-linked lists;
# two B+Tree-based storages, which also use chains represented as singly-linked lists, but off-heap this time;
# a RocksDB-based storage, which uses a flat structure - just a big set of (RowId, Timestamp?) pairs in the keyset.

Let's discuss them one by one:
# This is the easiest one. Currently, the links to the next chain elements are final; they must become volatile. There is no manual "freeing" of the memory used by chains, so it'll just work.
# This one is hard (in my opinion). Imagine having a "RowVersion" instance and trying to read the "next" row versions, while either looking for a specific version or performing an index build/cleanup - it doesn't matter which. How do you guarantee that the link is valid and has not been deleted while you were doing your thing? Out of the box, there's no way to tell that an item in a data page has been deleted (to be completely honest, there _might_ be a loophole that exploits the fact that there's only a single _write_ thread that inserts anything and maybe a bunch of GC threads that delete data, but I wouldn't go that route). This is because _links_ have no rotation id - it's simply not there, and we will not add one for this reason alone; that would complicate the design too much and could even compromise things like data density. For this reason, some external synchronization is required. Links must never be dereferenced without 100% certainty that they actually hold the data that developers expect them to hold. There are no exceptions to this rule. The exact way to synchronize things is up to whoever implements GC in the engine. (There is also a problem with completely deleting rows from the tree: it may clash with iterators that cache content from the pages they have already read, which can likewise leave obsolete links around.) There are two directions in which to search for the answer, and the truth lies somewhere in between:
- update closures guarantee isolation, but they affect throughput;
- explicit latches for row ids also guarantee isolation, but they affect throughput too.
Ignite 2 goes the second route, but for a different reason: in that particular case, latches already exist because of the atomic protocol, and they're simply reused.
# RocksDB has an interesting property: it effectively creates a snapshot for every cursor. I don't think we have ever really explored the realm of bugs that can occur because of that. Things that may break include:
- transactional scans,
- index cleanup,
- GC.
I believe that an occasional call of {{RocksIteratorInterface#refresh}} will help here (see the sketch below). But before doing that, we should clearly identify the problematic places and write tests that actually demonstrate the problem. Doing such things blindly is dangerous.
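For reference, this is roughly what such a periodic refresh could look like with the RocksDB Java API. It is a sketch under assumptions: the batch threshold is arbitrary, and the entry processing is abstracted behind a consumer. The subtlety worth showing is that refresh() invalidates the iterator position, so the caller must re-seek.

{code:java}
// Sketch: a long-lived RocksIterator pins the implicit snapshot it was
// created with; refresh() drops that snapshot and reads the latest state.
void scanWithRefresh(RocksDB db, BiConsumer<byte[], byte[]> process) throws RocksDBException {
    try (RocksIterator it = db.newIterator()) {
        int processed = 0;

        for (it.seekToFirst(); it.isValid(); ) {
            byte[] key = it.key();

            process.accept(key, it.value()); // whatever the scan/GC does with the entry

            it.next();

            if (++processed % 10_000 == 0) { // arbitrary batch size
                it.refresh(); // see data written after the iterator was created
                it.seek(key); // refresh() invalidates the position; restore it

                if (it.isValid() && Arrays.equals(it.key(), key)) {
                    it.next(); // 'key' is already processed, step past it
                }
            }
        }
    }
}
{code}

Whether the re-seek cost is acceptable for transactional scans is exactly the kind of question the tests mentioned above should answer first.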