Ivan Bessonov created IGNITE-19269:
--------------------------------------

             Summary: Introduce proper local RowId locks
                 Key: IGNITE-19269
                 URL: https://issues.apache.org/jira/browse/IGNITE-19269
             Project: Ignite
          Issue Type: Improvement
            Reporter: Ivan Bessonov


Currently, concurrency in MV storages is a bit convoluted. The list of issues includes:
 * An implicit assumption that no one inserts the same RowId concurrently. There's a chance that this assumption is already violated, which could lead to B+Tree corruption, for example.
 * Hidden locks that cause non-trivial bugs, like https://issues.apache.org/jira/browse/IGNITE-19169
 * The MvPartitionStorage#scanVersions implementation for the B+Tree reads the entire chain at once, because we couldn't figure out how to do it properly in a concurrent environment. This may lead to drastic performance drops in the future.
 * The GC code is complicated and sub-optimal. We need to implement the same approach that's used in Ignite 2.x: read, lock, dequeue, cleanup, unlock. We could also cache part of the queue beforehand, making the process faster. That is not possible in the current implementation.
 * Possibly other issues that I can't recall right now.

h3. Proposed solution

As a wise man once said, "{_}Explicit is better than implicit{_}".

I propose having an explicit concurrency model akin to the one in Ignite 2.x. We need to introduce two new methods to the MV partition storage interface.
{code:java}
/**
 * Synchronously locks the passed RowId.
 */
void lock(RowId rowId);

/**
 * Tries to lock the passed RowId without blocking.
 * Used to implement deadlock prevention.
 */
boolean tryLock(RowId rowId);{code}
These methods should delegate locking to the client, basically exposing the 
{{ReentrantLockByRowId}} API. Expected invariants:
 * Locks can only be acquired inside of the "runConsistently" closure.
 * Unlocking still happens automatically at the end of "runConsistently". This is dictated by the way the RocksDB implementation works (write batches).
 * The client must guarantee that there are no deadlocks, either by sorting the data or by using "tryLock".
 * All read operations, except for "scanVersions", may be executed without 
locks.
 * "scanVersions" must be called under a lock, because it's not atomic. This is not a big deal, because it has only two usages:
 ** index cleanup, where we already hold a lock
 ** rebalance, where we already hold another lock, so adding a new one is not a problem
 * The "scanVersions" implementation must also be fixed. Maybe we need to add a hint, like "loadAllVersions", but it's not required.
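The deadlock-avoidance invariant above (sort the data or use "tryLock") can be sketched as follows. This is a minimal illustrative model, not the real Ignite API: the {{RowId}} record, its fields, and the {{RowIdLocks}} class (a stand-in for a {{ReentrantLockByRowId}}-style lock table) are all assumptions made for the example.
{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative stand-in for Ignite's RowId: comparable, so a batch can be sorted. */
record RowId(long mostSig, long leastSig) implements Comparable<RowId> {
    @Override
    public int compareTo(RowId o) {
        int c = Long.compare(mostSig, o.mostSig);
        return c != 0 ? c : Long.compare(leastSig, o.leastSig);
    }
}

/** Minimal per-RowId lock table, sketching a ReentrantLockByRowId-style class. */
class RowIdLocks {
    private final ConcurrentHashMap<RowId, ReentrantLock> locks = new ConcurrentHashMap<>();

    void lock(RowId rowId) {
        locks.computeIfAbsent(rowId, k -> new ReentrantLock()).lock();
    }

    boolean tryLock(RowId rowId) {
        return locks.computeIfAbsent(rowId, k -> new ReentrantLock()).tryLock();
    }

    void unlock(RowId rowId) {
        locks.get(rowId).unlock();
    }
}

public class SortedLockingSketch {
    public static void main(String[] args) {
        RowIdLocks locks = new RowIdLocks();

        // A write batch touching several rows: sort it first, so every writer
        // acquires locks in the same global order. A total order on lock
        // acquisition rules out lock-order deadlocks.
        List<RowId> batch = new ArrayList<>(List.of(
                new RowId(2, 0), new RowId(1, 0), new RowId(3, 0)));
        batch.sort(Comparator.naturalOrder());

        batch.forEach(locks::lock);
        try {
            // ... apply the writes (the body of "runConsistently") ...
        } finally {
            batch.forEach(locks::unlock);
        }
    }
}
{code}
When sorting is impractical (e.g. the lock set is discovered incrementally, as in GC), "tryLock" takes over: on failure the caller backs off and retries instead of blocking, which prevents the cycle a deadlock would need.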

With a proper implementation, this approach solves all of the problems mentioned above, except for GC.
h3. Changes to GC

As stated before, GC flow must be changed to something like this:
{code:java}
// Inside the "runConsistently" closure, looping over the GC queue:
while (true) {
    var entry = partition.peek(lw);

    if (entry == null) {
        return;
    }

    if (!partition.tryLock(entry.rowId())) {
        return; // Must exit the closure to avoid deadlocks.
    }

    if (!partition.dequeue(entry)) {
        continue; // Or return. We may or may not exit the closure
                  // in this case, both options are valid.
    }

    cleanupIndexes(entry);
}{code}
Right now, all storage implementations contain a similar construct internally. All I propose is to move it outside.
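A self-contained sketch of what the hoisted loop might look like. Everything here is illustrative: {{PartitionStub}}, {{GcEntry}}, and the method names are assumptions standing in for the real storage API, and the stub is single-threaded, so "tryLock" always succeeds.
{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

/** Illustrative stand-ins for Ignite's RowId and GC queue entry (not the real API). */
record RowId(long value) {}
record GcEntry(RowId rowId) {}

/** Minimal storage stub exposing just what the shared GC loop needs. */
class PartitionStub {
    final Queue<GcEntry> gcQueue = new ArrayDeque<>();
    int cleaned;

    GcEntry peek() { return gcQueue.peek(); }
    boolean tryLock(RowId rowId) { return true; } // single-threaded stub: always succeeds
    boolean dequeue(GcEntry e) { return gcQueue.remove(e); }
    void unlock(RowId rowId) {}
}

public class SharedGcLoop {
    /** The loop from the ticket, hoisted out of the storage implementations. */
    static void vacuum(PartitionStub partition, int batchSize) {
        for (int i = 0; i < batchSize; i++) {
            GcEntry entry = partition.peek();

            if (entry == null) {
                return; // Queue is drained.
            }

            if (!partition.tryLock(entry.rowId())) {
                return; // Exit instead of blocking, to avoid deadlocks.
            }

            try {
                if (!partition.dequeue(entry)) {
                    continue; // Someone else already dequeued this entry.
                }

                partition.cleaned++; // Stands in for cleanupIndexes(entry).
            } finally {
                partition.unlock(entry.rowId());
            }
        }
    }

    public static void main(String[] args) {
        PartitionStub p = new PartitionStub();
        p.gcQueue.add(new GcEntry(new RowId(1)));
        p.gcQueue.add(new GcEntry(new RowId(2)));

        vacuum(p, 10);

        System.out.println(p.cleaned); // prints "2"
    }
}
{code}
With the loop shared, each storage engine only has to implement peek/tryLock/dequeue, and the Ignite 2.x-style "read, lock, dequeue, cleanup, unlock" sequence lives in exactly one place.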
h3. Tangential changes

org.apache.ignite.internal.table.distributed.raft.PartitionDataStorage is basically a copy of the original interface, except for a few methods. Maybe the copied methods should be declared in a shared super-interface; it's up to the developer to decide.
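The super-interface idea could look roughly like this. The interface and method names below are purely illustrative (only {{lock}}/{{tryLock}} come from this ticket); this is a sketch of the refactoring shape, not the real Ignite interfaces.
{code:java}
/** Illustrative stand-in for Ignite's RowId (not the real class). */
record RowId(long value) {}

/**
 * Hypothetical shared super-interface: the methods that are currently copied
 * verbatim between MvPartitionStorage and PartitionDataStorage would be
 * declared once here, and both interfaces would extend it.
 */
interface CommonPartitionOps {
    void lock(RowId rowId);

    boolean tryLock(RowId rowId);
}

/** MV-specific read/write methods would stay in the storage-facing interface. */
interface MvPartitionStorageLike extends CommonPartitionOps {
}

/** Raft-facing additions would stay in the replication-facing interface. */
interface PartitionDataStorageLike extends CommonPartitionOps {
}
{code}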



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
