[ 
https://issues.apache.org/jira/browse/IGNITE-25030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-25030:
--------------------------------------

    Assignee: Ivan Bessonov

> Metastorage compaction blocks other threads by reading from drive
> -----------------------------------------------------------------
>
>                 Key: IGNITE-25030
>                 URL: https://issues.apache.org/jira/browse/IGNITE-25030
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Ivan Bessonov
>            Assignee: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>
> h3. Problem description
> This happened:
> {code:java}
> 2025-04-02 17:30:25:607 +0000 
> [WARNING][%node18%common-scheduler-0][FailureManager] Possible failure 
> suppressed according to a configured handler [hnd=NoOpFailureHandler 
> [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet 
> [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], 
> failureCtx=SYSTEM_WORKER_BLOCKED] org.apache.ignite.lang.IgniteException: 
> IGN-WORKERS-1 TraceId:c2c4156f-a173-4531-b5c4-2adb672f4825 A critical thread 
> is blocked for 46737 ms that is more than the allowed 500 ms, it is 
> "%node18%MessagingService-inbound-Default-0-2" prio=10 Id=149 WAITING on 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@21433869 owned 
> by "%node18%metastorage-compaction-executor-0" Id=324{code}
> The fact that we're doing it in a network thread will be addressed in a 
> separate issue. All we need to know in this context is that:
>  * When we call {{{}RocksDbKeyValueStorage.timestampByRevision{}}}, we wait 
> for the read lock, which is unnecessary: the meta-storage is a 
> multi-versioned storage by design, so such locks should not exist. The same 
> problem applies to all other read operations.
>  * When we hold the write lock in the compaction thread, or potentially in 
> any other write operation in a meta-storage FSM thread, we might spend an 
> unexpectedly long time there, which is simply unacceptable. Reading data 
> from the storage should be avoided while we're holding a lock like that.
> I created this issue as a *Bug* because it is a serious design flaw, and it 
> makes the system completely unresponsive in certain cases.
> h3. Potential improvements
> Here is a semi-organized list of thoughts that might be converted into a 
> design later on.
>  * The RW lock in its current form must be removed.
>  * Every time we read the data, we:
>  ** Fix the revision number for the operation.
>  ** Read the consistent cut.
>  ** Read optimistically, assuming that compaction doesn't happen.
>  ** If we find that compaction has indeed happened by the time we finish 
> reading, and the data might be invalid, we throw a compaction exception 
> (see the optimistic-read sketch after this list).
>  * Resolving a timestamp into a revision (or the opposite) should never 
> require a lock either.
>  ** We should use the same optimistic strategy as with regular reads.
>  ** We should introduce an on-heap cache for the most recent values, like we 
> did for terms in the raft log manager.
>  *** While doing that, we should first confirm that it'll pay off, i.e. that 
> we are indeed more likely to resolve recent timestamps/revisions.
>  *** Unlike term values in the raft log manager, here we might want to cache 
> a few hundred values. We also need concurrent read access to the cache, so 
> the implementation will differ. Just make sure to make it 
> {*}non-blocking{*} (see the cache sketch after this list).
>  *** _Should be done in a separate Jira_
>  * The same goes for reading the index/term/configuration/etc. There must be 
> a simple way to make these non-blocking as well.
>  * The fun part - FSM operations. There are several ways to improve them:
>  ** Remove the write lock. Use a read-evaluate-write approach.
>  ** When we read the data in the FSM, we should not fix the revision; reading 
> the "latest" revision is more efficient. _Might be done in a separate Jira._
>  ** We should always remember that there are only two writing agents here: 
> the state machine and the compaction process.
>  ** The state machine is a latency-sensitive part of the system. We should 
> avoid blocking it with concurrent compaction. This restriction dictates the 
> rest of the design.
>  ** When we read the list of existing revisions for an entry and prepend a 
> new revision to it, we don't lock anything. This means that by the time we 
> write this list back into the storage, it might contain some compacted 
> revisions. *This is fine.* The compactor itself should keep this race in 
> mind and expect such stale entries in the revisions list. If a value is not 
> found, it has been compacted previously. In my understanding, this is the 
> only adequate approach for the task.
>  ** By the way, we should also expect some stale entries while just reading 
> the data. If something is in the process of being compacted, we shouldn't 
> even try to read it, but if we do, we should be ready for data that has 
> concurrently disappeared (see the FSM sketch after this list).
>  * Compactor:
>  ** Here we can't allow writing stale data into the revisions list of any 
> entry, because (unlike in the FSM) that would lead to inconsistent data.
>  ** I propose doing the {{db.write(writeOptions, batch);}} calls while 
> holding an exclusive lock.
>  ** The FSM just acquires the lock and writes the data, without any retries, 
> because it has priority.
>  ** The compactor should have a check like "is it safe to write the batch 
> that I prepared":
>  *** If the answer is "yes", we proceed with the write.
>  *** If the answer is "no" or "don't know", we:
>  **** Don't proceed with writing the batch.
>  **** Refresh the cursor.
>  **** Re-read the data and prepare a new batch.
>  **** Repeat.
>  *** The goal is a check with a small likelihood of "don't know". I propose 
> having a sufficiently large bloom filter that is updated synchronously. It 
> should be cleared every time we refresh the iterator.
>  ** Such an approach will save us from constant "refresh" calls on the 
> iterator, which are slow. They will become rare, and will almost disappear 
> if there's not much concurrent load (see the compactor sketch after this 
> list).
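> h3. Illustrative sketches
> The sketches below are only meant to illustrate the ideas above; every 
> class, method and field name in them is a hypothetical placeholder, not the 
> actual Ignite 3 API.
> First, the optimistic-read flow: fix the revision, read without locks, then 
> validate against the compaction watermark and throw if the result might be 
> stale.
> {code:java}
> import java.util.concurrent.atomic.AtomicLong;
> 
> /** Minimal sketch; type and method names are hypothetical. */
> public class OptimisticReadSketch {
>     /** Latest revision applied by the FSM. */
>     private final AtomicLong latestRevision = new AtomicLong(0);
> 
>     /** Revision up to which compaction has already completed. */
>     private final AtomicLong compactedRevision = new AtomicLong(-1);
> 
>     /** Thrown when a read raced with compaction and may have seen stale data. */
>     public static class CompactedException extends RuntimeException {
>         CompactedException(long revision) {
>             super("Revision " + revision + " has already been compacted");
>         }
>     }
> 
>     public byte[] get(byte[] key, long revisionUpperBound) {
>         // 1. Fix the revision number for the operation (the consistent cut).
>         long revision = Math.min(revisionUpperBound, latestRevision.get());
> 
>         // 2. Read optimistically, without taking any lock.
>         byte[] value = readFromRocksDb(key, revision);
> 
>         // 3. If compaction caught up with the fixed revision while we were
>         //    reading, the result may be based on partially removed data.
>         if (revision <= compactedRevision.get()) {
>             throw new CompactedException(revision);
>         }
> 
>         return value;
>     }
> 
>     private byte[] readFromRocksDb(byte[] key, long revision) {
>         return null; // Placeholder for the actual multi-versioned lookup.
>     }
> }
> {code}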
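> The non-blocking on-heap cache for resolving recent timestamps/revisions 
> could look roughly like this; it assumes a single writer (the FSM thread) 
> and a few hundred cached values, as suggested above:
> {code:java}
> import java.util.Map;
> import java.util.concurrent.ConcurrentNavigableMap;
> import java.util.concurrent.ConcurrentSkipListMap;
> 
> /** Minimal sketch of a cache of the most recent revision/timestamp pairs. */
> public class RecentRevisionsCache {
>     private static final int MAX_SIZE = 256;
> 
>     /** Lock-free ordered map: readers never block, single writer (FSM thread). */
>     private final ConcurrentNavigableMap<Long, Long> revisionToTimestamp =
>             new ConcurrentSkipListMap<>();
> 
>     /** Called by the FSM for every applied revision, so the cache is contiguous. */
>     public void put(long revision, long timestamp) {
>         revisionToTimestamp.put(revision, timestamp);
> 
>         // Evict the oldest entries; with a single writer this simple loop is enough.
>         while (revisionToTimestamp.size() > MAX_SIZE) {
>             revisionToTimestamp.pollFirstEntry();
>         }
>     }
> 
>     /** Returns the timestamp, or null on a miss (fall back to the storage read). */
>     public Long timestampByRevision(long revision) {
>         return revisionToTimestamp.get(revision);
>     }
> 
>     /** Greatest cached revision with a timestamp <= ts, or null if not cached. */
>     public Long revisionByTimestamp(long timestamp) {
>         for (Map.Entry<Long, Long> entry : revisionToTimestamp.descendingMap().entrySet()) {
>             if (entry.getValue() <= timestamp) {
>                 return entry.getKey();
>             }
>         }
>         return null;
>     }
> }
> {code}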
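> For the FSM, the read-evaluate-write approach and the "stale entries in the 
> revisions list are fine" rule could look like this (again, all names are 
> placeholders):
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> 
> /** Minimal sketch of the FSM update path with the write lock removed. */
> public class FsmUpdateSketch {
>     /** Read-evaluate-write, no lock: only the FSM thread writes regular entries. */
>     public void put(byte[] key, byte[] value, long newRevision) {
>         List<Long> revisions = new ArrayList<>(readRevisionsList(key));
> 
>         revisions.add(0, newRevision);
> 
>         // By the time this lands in the storage, some of the older revisions
>         // in the list may already be compacted. That is fine: readers and the
>         // compactor treat a missing value as "compacted previously".
>         writeValue(key, newRevision, value);
>         writeRevisionsList(key, revisions);
>     }
> 
>     /** Reader side: be ready for data that disappeared concurrently. */
>     public byte[] read(byte[] key, long revisionUpperBound) {
>         for (long revision : readRevisionsList(key)) {
>             if (revision <= revisionUpperBound) {
>                 byte[] value = readValue(key, revision);
> 
>                 if (value == null) {
>                     // The revision is still listed, but its value is gone: it
>                     // was compacted concurrently. Report that instead of
>                     // treating it as corruption.
>                     throw new IllegalStateException("Revision " + revision + " was compacted");
>                 }
> 
>                 return value;
>             }
>         }
> 
>         return null; // No revision at or below the requested bound.
>     }
> 
>     // Placeholders for the actual storage operations.
>     private List<Long> readRevisionsList(byte[] key) { return List.of(); }
>     private byte[] readValue(byte[] key, long revision) { return null; }
>     private void writeValue(byte[] key, long revision, byte[] value) { }
>     private void writeRevisionsList(byte[] key, List<Long> revisions) { }
> }
> {code}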
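> Finally, the compactor's "is it safe to write the batch" loop. The exclusive 
> lock is held only around {{db.write()}}, the FSM never retries, and the 
> compactor drops its batch, refreshes the iterator and resets the filter when 
> the check is inconclusive. Guava's {{BloomFilter}} and the batching details 
> are illustration-only assumptions:
> {code:java}
> import com.google.common.hash.BloomFilter;
> import com.google.common.hash.Funnels;
> import java.util.ArrayList;
> import java.util.List;
> import java.util.concurrent.locks.ReentrantLock;
> import org.rocksdb.RocksDB;
> import org.rocksdb.RocksDBException;
> import org.rocksdb.RocksIterator;
> import org.rocksdb.WriteBatch;
> import org.rocksdb.WriteOptions;
> 
> /** Minimal sketch; not the actual implementation. */
> public class CompactorSketch {
>     private final RocksDB db;
>     private final WriteOptions writeOptions = new WriteOptions();
> 
>     /** Exclusive lock held only around db.write() calls, never around reads. */
>     private final ReentrantLock writeLock = new ReentrantLock();
> 
>     /** Keys touched by the FSM since the compactor last refreshed its iterator. */
>     private BloomFilter<byte[]> dirtyKeys = newFilter();
> 
>     public CompactorSketch(RocksDB db) {
>         this.db = db;
>     }
> 
>     private static BloomFilter<byte[]> newFilter() {
>         return BloomFilter.create(Funnels.byteArrayFunnel(), 1_000_000, 0.001);
>     }
> 
>     /** FSM write path: no retries, it has priority; it only records touched keys. */
>     public void fsmWrite(WriteBatch batch, List<byte[]> touchedKeys) throws RocksDBException {
>         writeLock.lock();
>         try {
>             db.write(writeOptions, batch);
>             touchedKeys.forEach(dirtyKeys::put);
>         } finally {
>             writeLock.unlock();
>         }
>     }
> 
>     /** Compactor write path: validate the prepared batch before committing it. */
>     public void compactBatch(RocksIterator it) throws RocksDBException {
>         while (true) {
>             // Prepare a batch from the current iterator state, without any locks.
>             List<byte[]> batchKeys = new ArrayList<>();
> 
>             try (WriteBatch batch = prepareCompactionBatch(it, batchKeys)) {
>                 writeLock.lock();
>                 try {
>                     if (isSafeToWrite(batchKeys)) {
>                         // "Yes": none of the batch keys were touched concurrently.
>                         db.write(writeOptions, batch);
> 
>                         return;
>                     }
> 
>                     // "No" or "don't know": drop the batch, refresh the cursor
>                     // and reset the filter atomically w.r.t. FSM writes, retry.
>                     it.refresh();
>                     dirtyKeys = newFilter();
>                 } finally {
>                     writeLock.unlock();
>                 }
>             }
>         }
>     }
> 
>     private boolean isSafeToWrite(List<byte[]> batchKeys) {
>         return batchKeys.stream().noneMatch(dirtyKeys::mightContain);
>     }
> 
>     private WriteBatch prepareCompactionBatch(RocksIterator it, List<byte[]> keys) {
>         return new WriteBatch(); // Placeholder for the actual compaction logic.
>     }
> }
> {code}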



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
