[ https://issues.apache.org/jira/browse/IGNITE-19269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719454#comment-17719454 ]
Roman Puchkovskiy commented on IGNITE-19269:
--------------------------------------------

Sorry, I forgot to do it in time: LGTM!

> Introduce proper local RowId locks
> ----------------------------------
>
>                 Key: IGNITE-19269
>                 URL: https://issues.apache.org/jira/browse/IGNITE-19269
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Ivan Bessonov
>            Assignee: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>
>          Time Spent: 6h
>  Remaining Estimate: 0h
>
> Currently, concurrency in MV storages is a bit convoluted. The list of
> issues includes:
>  * An implicit assumption that the same RowId is never inserted
> concurrently. There's a chance that this assumption is already violated,
> which could lead to B+Tree corruption, for example.
>  * Hidden locks that cause non-trivial bugs like
> https://issues.apache.org/jira/browse/IGNITE-19169
>  * The MvPartitionStorage#scanVersions implementation for the B+Tree reads
> the entire version chain at once, because we couldn't figure out how to do
> it properly in a concurrent environment. This may lead to drastic
> performance drops in the future.
>  * GC code is complicated and sub-optimal. We need to implement the same
> approach that's used in Ignite 2.x: read, lock, dequeue, cleanup, unlock.
> We could also cache part of the queue beforehand, making the process
> faster; that is not possible in the current implementation.
>  * Maybe something else that I can't remember.
> h3. Proposed solution
> As a wise man once said, "{_}Explicit is better than implicit{_}".
> I propose having an explicit concurrency model akin to Ignite 2.x's. We
> need to introduce two new methods to the MV partition storage interface:
> {code:java}
> /**
>  * Synchronously locks the passed RowId.
>  */
> void lock(RowId rowId);
>
> /**
>  * Tries to synchronously lock the passed RowId.
>  * Used to implement deadlock prevention.
>  */
> boolean tryLock(RowId rowId);{code}
> These methods should delegate locking to the client, basically exposing
> the {{ReentrantLockByRowId}} API. Expected invariants:
>  * Locks are only allowed inside of the "runConsistently" closure.
>  * Unlock is still supposed to happen automatically at the end of
> "runConsistently". This is dictated by the way the RocksDB implementation
> works (batches).
> The client must guarantee that there are no deadlocks, either by sorting
> the data or by using "tryLock" (see the sketch after the GC section below).
>  * All read operations, except for "scanVersions", may be executed without
> locks.
>  * "scanVersions" must take a lock, because it's not atomic. This is not a
> big deal, because its only two usages are:
>  ** index cleanup, where we already hold a lock;
>  ** rebalance, where we already hold another lock, so adding a new one is
> not a problem.
>  * The "scanVersions" implementation must also be fixed. Maybe we need to
> add a hint, like "loadAllVersions", but it's not required.
> This approach, properly implemented, solves all of the problems mentioned
> above, except for GC.
> h3. Changes to GC
> As stated before, the GC flow must be changed to something like this:
> {code:java}
> while (true) {
>     var entry = partition.peek(lw);
>
>     if (entry == null) {
>         return;
>     }
>
>     if (!partition.tryLock(entry.rowId())) {
>         return; // Must exit the closure to avoid deadlocks.
>     }
>
>     if (!partition.dequeue(entry)) {
>         continue; // Or return. We may or may not exit the closure
>                   // in this case; both options are valid.
>     }
>
>     cleanupIndexes(entry);
> }{code}
> Right now, all storage implementations have a similar construct inside
> them. All I propose is to move it outside.
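For reference, here is a minimal, self-contained sketch of the sorted-locking discipline the invariants above require. It is not code from this ticket: the Partition and RowId types below are simplified hypothetical stand-ins, and only lock/tryLock/runConsistently mirror the proposed API.

{code:java}
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for the MV partition storage with the proposed methods. */
interface Partition {
    void lock(RowId rowId);

    boolean tryLock(RowId rowId);

    <V> V runConsistently(Supplier<V> closure);
}

/** Simplified stand-in for the real RowId; what matters here is the total order. */
record RowId(long high, long low) implements Comparable<RowId> {
    @Override
    public int compareTo(RowId other) {
        int cmp = Long.compare(high, other.high);

        return cmp != 0 ? cmp : Long.compare(low, other.low);
    }
}

class SortedLockingExample {
    /** Locks all rows in natural order, so concurrent callers cannot deadlock. */
    static void updateRows(Partition partition, List<RowId> rowIds) {
        List<RowId> sorted = rowIds.stream().sorted().toList();

        partition.runConsistently(() -> {
            for (RowId rowId : sorted) {
                // Every caller acquires locks in the same global order,
                // which rules out lock-order inversion.
                partition.lock(rowId);
            }

            // ... apply writes here; per the invariants above, all locks are
            // released automatically when the closure completes.
            return null;
        });
    }
}
{code}

When a global order is not available, e.g. when rows are discovered on the fly as in the GC loop, the "tryLock" path applies instead: on failure, exit the closure and retry later.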
> h3. Tangent changes
> org.apache.ignite.internal.table.distributed.raft.PartitionDataStorage is
> basically a copy of the original interface, except for a few methods. Maybe
> the copied methods should exist in a shared super-interface; it's up to the
> developer to decide (one possible shape is sketched below).
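One possible shape for such a super-interface, assuming hypothetical names for everything except PartitionDataStorage, MvPartitionStorage, and the lock/tryLock methods proposed above:

{code:java}
/** Hypothetical common contract for the methods both interfaces declare today. */
interface CommonPartitionStorage {
    void lock(RowId rowId);

    boolean tryLock(RowId rowId);

    // ... the other methods currently duplicated verbatim.
}

/** The original storage interface keeps only its storage-specific methods. */
interface MvPartitionStorage extends CommonPartitionStorage {
    // Storage-specific methods.
}

/** The raft-facing copy inherits the shared part instead of re-declaring it. */
interface PartitionDataStorage extends CommonPartitionStorage {
    // Raft-specific methods.
}
{code}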