Ivan Bessonov created IGNITE-19269:
--------------------------------------
Summary: Introduce proper local RowId locks
Key: IGNITE-19269
URL: https://issues.apache.org/jira/browse/IGNITE-19269
Project: Ignite
Issue Type: Improvement
Reporter: Ivan Bessonov
Currently, concurrency in MV storages is a bit convoluted. The list of issues
includes:
* An implicit assumption that no one inserts the same RowId concurrently.
There's a chance that this assumption is already violated, which could lead to
B+Tree corruptions, for example.
* Hidden locks that cause non-trivial bugs like this one:
https://issues.apache.org/jira/browse/IGNITE-19169
* The MvPartitionStorage#scanVersions implementation for the B+Tree reads the
entire chain at once, because we couldn't figure out how to do it properly in a
concurrent environment. This may lead to drastic performance drops in the
future.
* GC code is complicated and sub-optimal. We need to implement the same
approach that's used in Ignite 2.x: read, lock, dequeue, cleanup, unlock. We
can also cache part of the queue beforehand, making the process faster. This is
not possible in the current implementation.
* Maybe something else that I can't recall right now.
h3. Proposed solution
As a wise man once said, "{_}Explicit is better than implicit{_}".
I propose having an explicit concurrency model akin to Ignite 2.x's. We need to
introduce two new methods to the MV partition storage interface.
{code:java}
/**
 * Synchronously locks the passed RowId.
 */
void lock(RowId rowId);

/**
 * Tries to synchronously lock the passed RowId.
 * Used to implement deadlock prevention.
 */
boolean tryLock(RowId rowId);{code}
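{{ReentrantLockByRowId}} is internal to Ignite, so as a rough illustration only, here is a minimal sketch of what these two methods could delegate to. The {{RowId}} record and the {{RowIdLockTable}} class below are hypothetical stand-ins, not the actual classes:
{code:java}
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

/** Simplified stand-in for the internal RowId class. */
record RowId(UUID uuid) {}

/**
 * Minimal sketch of a per-RowId lock table, roughly what lock(RowId) and
 * tryLock(RowId) could delegate to. Not the actual ReentrantLockByRowId.
 */
class RowIdLockTable {
    private final ConcurrentHashMap<RowId, ReentrantLock> locks = new ConcurrentHashMap<>();

    /** Blocks until the lock for the given RowId is acquired. */
    void lock(RowId rowId) {
        locks.computeIfAbsent(rowId, id -> new ReentrantLock()).lock();
    }

    /** Returns immediately; true if the lock was acquired. */
    boolean tryLock(RowId rowId) {
        return locks.computeIfAbsent(rowId, id -> new ReentrantLock()).tryLock();
    }

    /** Releases the lock; would be called automatically at the end of "runConsistently". */
    void unlock(RowId rowId) {
        locks.get(rowId).unlock();
    }
}{code}
Since the locks are reentrant, repeated lock/tryLock calls on the same RowId from the same thread succeed, matching the "hidden locks" scenarios where a caller already holds the lock.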
These methods should delegate locking to the client, basically exposing the
{{ReentrantLockByRowId}} API. Expected invariants:
* Locks can only be acquired inside of the "runConsistently" closure.
* Unlock is still supposed to happen automatically at the end of
"runConsistently". This is dictated by the way the RocksDB implementation works
(batches).
* The client must guarantee that there are no deadlocks, either by sorting the
data or by using "tryLock".
* All read operations, except for "scanVersions", may be executed without
locks.
* "scanVersions" must hold a lock, because it's not atomic. This is not a big
deal, because its only two usages are:
** index cleanup, where we already hold a lock;
** rebalance, where we already hold another lock, so adding a new one is not a
problem.
* The "scanVersions" implementation must also be fixed. Maybe we need to add a
hint, like "loadAllVersions", but it's not required.
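To illustrate the deadlock-avoidance invariant above: a client can acquire locks in a single global order by sorting RowIds first, so two closures touching the same rows never lock them in opposite orders. The classes below are illustrative stand-ins (RowIds are reduced to plain UUIDs), not the actual storage API:
{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.UUID;

/** Illustration only: a fake storage that records the order in which RowIds were locked. */
class LockOrderRecorder {
    final List<UUID> lockOrder = new ArrayList<>();

    void lock(UUID rowId) {
        lockOrder.add(rowId);
    }
}

class DeadlockAvoidanceSketch {
    /**
     * Sorts the RowIds into one global order before locking, so no two
     * closures can wait on each other's locks in a cycle.
     */
    static void lockAllSorted(LockOrderRecorder storage, List<UUID> rowIds) {
        List<UUID> sorted = new ArrayList<>(rowIds);
        sorted.sort(Comparator.naturalOrder());

        for (UUID rowId : sorted) {
            storage.lock(rowId);
        }
    }
}{code}
The alternative, for code paths that can't pre-sort (like GC below), is "tryLock" with an early exit.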
This approach, properly implemented, solves all of the problems mentioned
above, except for GC.
h3. Changes to GC
As stated before, GC flow must be changed to something like this:
{code:java}
while (true) {
    var entry = partition.peek(lw);

    if (entry == null) {
        return;
    }

    if (!partition.tryLock(entry.rowId())) {
        return; // Must exit the closure to avoid deadlocks.
    }

    if (!partition.dequeue(entry)) {
        continue; // Or return. We may or may not exit the closure
                  // in this case; both options are valid.
    }

    cleanupIndexes(entry);
}{code}
Right now, all storage implementations have a similar construct inside of
them. All I propose is to move it outside.
h3. Tangent changes
org.apache.ignite.internal.table.distributed.raft.PartitionDataStorage is
basically a copy of the original interface, except for a few methods. Maybe the
copied methods should live in a shared super-interface. It's up to the
developer to decide.
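As a sketch of what such a super-interface could look like (all names below are hypothetical placeholders, not the actual Ignite interfaces):
{code:java}
import java.util.function.Supplier;

/** Hypothetical super-interface holding the methods both interfaces currently duplicate. */
interface CommonPartitionOps {
    long lastAppliedIndex();

    <V> V runConsistently(Supplier<V> closure);
}

/** Stand-in for MvPartitionStorage: keeps only its own methods. */
interface MvStorageLike extends CommonPartitionOps {
    // MV-specific methods would go here.
}

/** Stand-in for PartitionDataStorage: only the few methods that differ. */
interface RaftDataStorageLike extends CommonPartitionOps {
    // RAFT-layer-specific methods would go here.
}{code}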
--
This message was sent by Atlassian Jira
(v8.20.10#820010)