[ https://issues.apache.org/jira/browse/IGNITE-25030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ivan Bessonov reassigned IGNITE-25030:
--------------------------------------

Assignee: Ivan Bessonov

Metastorage compaction blocks other threads by reading from drive
-----------------------------------------------------------------

Key: IGNITE-25030
URL: https://issues.apache.org/jira/browse/IGNITE-25030
Project: Ignite
Issue Type: Bug
Reporter: Ivan Bessonov
Assignee: Ivan Bessonov
Priority: Major
Labels: ignite-3

h3. Problem description

This happened:

{code:java}
2025-04-02 17:30:25:607 +0000 [WARNING][%node18%common-scheduler-0][FailureManager]
Possible failure suppressed according to a configured handler [hnd=NoOpFailureHandler
[super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED,
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=SYSTEM_WORKER_BLOCKED]
org.apache.ignite.lang.IgniteException: IGN-WORKERS-1 TraceId:c2c4156f-a173-4531-b5c4-2adb672f4825
A critical thread is blocked for 46737 ms that is more than the allowed 500 ms, it is
"%node18%MessagingService-inbound-Default-0-2" prio=10 Id=149 WAITING on
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@21433869 owned by
"%node18%metastorage-compaction-executor-0" Id=324
{code}

The fact that we're doing this in a network thread will be addressed in a separate issue. All we need to know in this context is that:
* When we call {{RocksDbKeyValueStorage.timestampByRevision}}, we wait for a read lock, which is kind of stupid. The same problem applies to all the other read operations. I'm calling it stupid because the meta-storage is a multi-versioned storage by design, and such locks should not exist.
* While the compaction thread (or potentially any other write operation in a meta-storage FSM thread) holds the write lock, it might spend an unexpectedly large amount of time there, which is simply unacceptable. Reading data from the storage should be avoided while holding a lock like that.

I created this issue as a *Bug* because it is a serious design flaw, and it makes the system completely unresponsive in certain cases.

h3. Potential improvements

Here I will present a semi-organized list of thoughts that might be converted into a design later on.
* The RW lock in its current form must be removed.
* Every time we read data, we (see the code sketch below):
** Fix the revision number for the operation.
** Read the consistent cut.
** Read optimistically, assuming that compaction doesn't happen.
** If, once we have finished reading, we find that compaction has indeed happened and the data might be invalid, we throw a compaction exception.
* Resolving a timestamp into a revision (or the opposite) should never require a lock either.
** We should use the same optimistic strategy as for regular reads.
** We should introduce an on-heap cache for the most recent values, like we did for terms in the raft log manager.
*** While doing that, we should ensure that it'll make sense. In order to do that, we need to check that we are more likely to resolve recent timestamps/revisions.
*** Unlike term values in the raft log manager, here we might want to cache a few hundred values. We also need concurrent read access to the cache, meaning that the implementation will be different. Just make sure to make it {*}non-blocking{*}.
*** _Should be done in a separate Jira._
* Same for reading index/term/configuration/etc. There must be a simple way to make these non-blocking as well.
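Below is a minimal sketch of that optimistic read flow, assuming a single compaction watermark that the compactor advances after each round. The names ({{OptimisticReadSketch}}, {{doReadAt}}, {{Entry}}, {{CompactedException}}) are illustrative placeholders, not the actual {{RocksDbKeyValueStorage}} API:

{code:java}
import java.util.concurrent.atomic.AtomicLong;

/**
 * Minimal sketch of a lock-free, optimistic read. The revision is fixed up front,
 * the read is done without any lock, and the compaction watermark is re-checked
 * afterwards; if compaction overtook the requested revision, the result is discarded.
 */
class OptimisticReadSketch {
    /** Revision up to (and including) which the storage is compacted; advanced by the compactor. */
    private final AtomicLong compactedRevision = new AtomicLong(-1);

    Entry get(byte[] key, long revUpperBound) {
        // Read optimistically, assuming compaction doesn't happen concurrently.
        Entry entry = doReadAt(key, revUpperBound);

        // Validate after the read: if the watermark has passed our revision,
        // the data we just read may already be gone or partially rewritten.
        if (revUpperBound <= compactedRevision.get()) {
            throw new CompactedException(revUpperBound);
        }

        return entry;
    }

    /** Called by the compactor once a compaction round is durable. */
    void advanceCompactedRevision(long revision) {
        compactedRevision.updateAndGet(prev -> Math.max(prev, revision));
    }

    private Entry doReadAt(byte[] key, long revUpperBound) {
        // Placeholder for the RocksDB lookup of the newest revision <= revUpperBound.
        throw new UnsupportedOperationException("sketch only");
    }

    record Entry(byte[] key, byte[] value, long revision) {}

    static class CompactedException extends RuntimeException {
        CompactedException(long revision) {
            super("Revision " + revision + " has already been compacted");
        }
    }
}
{code}

The same check-after-read pattern would cover {{timestampByRevision}} and the other resolution paths; the on-heap revision/timestamp cache mentioned above would simply sit in front of the storage lookup.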
* The fun part: FSM operations. There are several ways to improve them:
** Remove the write lock. Use a read-evaluate-write approach.
** When we read data in the FSM, we should not fix the revision; we should read the "latest" revision, which is more efficient. _Might be done in a separate Jira._
** We should always remember that there are only two writing agents here: the state machine and the compaction process.
** The state machine is a latency-sensitive part of the system. We should avoid blocking it with concurrent compaction. This restriction dictates the rest of the design.
** When we read the list of existing revisions for an entry and prepend a new revision to it, we don't lock anything. This means that by the time we write this list back into the storage, it might contain some compacted revisions. *This is fine.* The compactor itself should keep this race in mind and expect such trash in the revisions list. If a value is not found, it has been compacted previously. In my understanding, this is the only adequate approach for the task.
** By the way, we should also expect some trash there while just reading the data. If something is in the process of being compacted, we shouldn't even try to read it, but if we do, we should be ready for data that has concurrently disappeared.
* Compactor (see the code sketch after this list):
** Here we can't allow writing trash data into the revisions list of any entry, because (unlike in the FSM) that would lead to inconsistent data.
** I propose doing the {{db.write(writeOptions, batch);}} calls while holding an exclusive lock.
** The FSM just acquires the lock and writes the data, without any retries, because it has priority.
** The compactor should have a check like "is it safe to write the batch that I prepared":
*** If the answer is "yes", we proceed with the write.
*** If the answer is "no" or "don't know", we:
**** Don't proceed with writing the batch.
**** Refresh the cursor.
**** Re-read the data and prepare a new batch.
**** Repeat.
*** The goal is to have a check with a small likelihood of "maybe". I propose having a sufficiently large bloom filter and updating it synchronously. It should be cleared every time we refresh the iterator.
** Such an approach will save us from constant "refresh" calls on the iterator, which are slow. They will become rare, and will almost disappear if there's not much concurrent load.
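To illustrate the compactor part, here is a minimal sketch of the FSM/compactor write coordination, assuming a plain {{ReentrantLock}} around the batch writes and a home-grown bloom filter over entry keys. All names ({{CompactionWriteSketch}}, {{fsmWrite}}, {{compactorWrite}}, {{KeyBloomFilter}}) are hypothetical, not the actual Ignite code:

{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Sketch of the write coordination: the FSM writes under the lock without retries,
 * the compactor validates its prepared batch against a bloom filter of keys the FSM
 * has touched since the last cursor refresh, and retries only on a (rare) hit.
 */
class CompactionWriteSketch {
    private final ReentrantLock writeLock = new ReentrantLock();

    /** Keys written by the FSM since the compactor last refreshed its cursor. */
    private volatile KeyBloomFilter dirtyKeys = new KeyBloomFilter(1 << 20);

    /** FSM write path: has priority, never retries. */
    void fsmWrite(byte[] key, Runnable writeBatchToDb) {
        writeLock.lock();
        try {
            dirtyKeys.put(key);   // updated synchronously, under the same lock
            writeBatchToDb.run(); // db.write(writeOptions, batch)
        } finally {
            writeLock.unlock();
        }
    }

    /** Compactor write path: "is it safe to write the batch that I prepared?" */
    void compactorWrite(List<byte[]> batchKeys, Runnable writeBatchToDb, Runnable refreshCursorAndRebuildBatch) {
        while (true) {
            writeLock.lock();
            try {
                boolean maybeTouchedByFsm = batchKeys.stream().anyMatch(dirtyKeys::mightContain);

                if (!maybeTouchedByFsm) {
                    writeBatchToDb.run(); // the answer is "yes": proceed with the write
                    return;
                }

                // The answer is "no" or "don't know": drop the batch and clear the
                // filter together with the upcoming cursor refresh.
                dirtyKeys = new KeyBloomFilter(1 << 20);
            } finally {
                writeLock.unlock();
            }

            refreshCursorAndRebuildBatch.run(); // re-read the data, prepare a new batch, repeat
        }
    }

    /** Tiny bloom filter over byte[] keys; false positives only cost an extra refresh. */
    static final class KeyBloomFilter {
        private final long[] bits;

        KeyBloomFilter(int nBits) {
            this.bits = new long[nBits / 64];
        }

        void put(byte[] key) {
            int h = Arrays.hashCode(key);
            set(h);
            set(Integer.rotateLeft(h, 16) ^ 0x9E3779B9);
        }

        boolean mightContain(byte[] key) {
            int h = Arrays.hashCode(key);
            return get(h) && get(Integer.rotateLeft(h, 16) ^ 0x9E3779B9);
        }

        private void set(int hash) {
            int idx = (hash & Integer.MAX_VALUE) % (bits.length * 64);
            bits[idx >>> 6] |= 1L << (idx & 63);
        }

        private boolean get(int hash) {
            int idx = (hash & Integer.MAX_VALUE) % (bits.length * 64);
            return (bits[idx >>> 6] & (1L << (idx & 63))) != 0;
        }
    }
}
{code}

A false positive only forces an extra cursor refresh, while the absence of false negatives is what keeps the compactor from writing back a revisions list that the FSM has just rewritten.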