[ https://issues.apache.org/jira/browse/IGNITE-25030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ivan Bessonov updated IGNITE-25030:
-----------------------------------
Description:

h3. Problem description

This happened:
{code:java}
2025-04-02 17:30:25:607 +0000 [WARNING][%node18%common-scheduler-0][FailureManager] Possible failure suppressed according to a configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=SYSTEM_WORKER_BLOCKED] org.apache.ignite.lang.IgniteException: IGN-WORKERS-1 TraceId:c2c4156f-a173-4531-b5c4-2adb672f4825 A critical thread is blocked for 46737 ms that is more than the allowed 500 ms, it is "%node18%MessagingService-inbound-Default-0-2" prio=10 Id=149 WAITING on java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@21433869 owned by "%node18%metastorage-compaction-executor-0" Id=324{code}
The fact that we're doing this in a network thread will be addressed in a separate issue. All we need to know in this context is that:
* When we call {{RocksDbKeyValueStorage.timestampByRevision}}, we wait for a read lock, which makes little sense. The same problem applies to all other read operations: the meta-storage is a multi-versioned storage by design, and such locks should not exist.
* While the compaction thread holds the write lock (or any other write operation holds it in a meta-storage FSM thread), we might spend an unexpectedly large amount of time there, which is simply unacceptable. Reading data from the storage should be avoided while holding a lock like that.

I created this issue as a *Bug* because it is a serious design flaw that makes the system completely unresponsive in certain cases.

h3. Potential improvements

Here is a semi-organized list of thoughts that might be converted into a design later on.
* The RW lock in its current form must be removed.
* Every time we read data, we (see the read sketch after this list):
** Fix the revision number for the operation.
** Read the consistent cut.
** Read optimistically, assuming that compaction doesn't happen.
** If, once we have finished reading, we discover that compaction did happen and the data might be invalid, we throw a compaction exception.
* Resolving a timestamp into a revision (or the opposite) should never require a lock either.
** We should use the same optimistic strategy as for regular reads.
** We should introduce an on-heap cache for the most recent values, like we did for terms in the raft log manager.
*** While doing that, we should make sure it pays off: we need to check that recent timestamps/revisions are the ones resolved most often.
*** Unlike term values in the raft log manager, here we might want to cache a few hundred values. We also need concurrent read access to the cache, so the implementation will be different. Just make sure it is *non-blocking*.
*** _Should be done in a separate Jira._
* The same applies to reading index/term/configuration/etc. There must be a simple way to make those reads non-blocking as well.
* The fun part - FSM operations. Several ways to improve them:
** Remove the write lock. Use a read-evaluate-write approach.
** When we read data in the FSM, we should not fix the revision; reading the "latest" revision is more efficient. _Might be done in a separate Jira._
** We should always remember that there are only two writing agents here: the state machine and the compaction process.
** The state machine is the latency-sensitive part of the system. We should avoid blocking it with concurrent compaction. This restriction dictates the rest of the design.
** When we read the list of existing revisions for an entry and prepend a new revision to it, we don't lock anything. This means that by the time we write this list back into the storage, it might contain some already-compacted revisions. *This is fine.* The compactor itself should keep this race in mind and expect such trash in a revisions list. If a value is not found, it has been compacted previously. In my understanding, this is the only adequate approach for the task.
** By the way, we should also expect some trash there while merely reading data. If something is in the process of being compacted, we shouldn't even try to read it, but if we do, we should be ready for data that has concurrently disappeared.
* Compactor (see the write-protocol sketch after this list):
** Here we can't allow writing trash data into the revisions list of any entry, because (unlike in the FSM) that would lead to inconsistent data.
** I propose doing the {{db.write(writeOptions, batch);}} calls while holding an exclusive lock.
** The FSM just acquires the lock and writes the data, without any retries, because it has priority.
** The compactor should have a check like "is it safe to write the batch that I prepared":
*** If the answer is "yes", we proceed with the write.
*** If the answer is "no" or "don't know", we:
**** Don't write the batch.
**** Refresh the cursor.
**** Re-read the data and prepare a new batch.
**** Repeat.
*** The goal is to have a check with a small likelihood of "maybe". I propose having a sufficiently large bloom filter and updating it synchronously. It should be cleared every time we refresh the iterator.
** Such an approach will save us from constant "refresh" calls on the iterator, which are slow. They will become rare, and will almost disappear if there is not much concurrent load.
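Below is a minimal sketch of the optimistic read path from the list above. All names ({{latestRevision}}, {{compactedRevision}}, the local {{CompactedException}}) are placeholders rather than the real storage API; only the order of operations matters: fix the revision, read without any locks, then re-check the compaction watermark and report a compaction error if the miss might have been caused by concurrent cleanup.
{code:java}
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BiFunction;

/**
 * Minimal sketch of the lock-free optimistic read. All names are hypothetical;
 * only the validation order matters: fix the revision, read without locks,
 * then re-check the compaction watermark.
 */
class OptimisticReadSketch {
    /** Latest applied revision, published by the FSM after each write. */
    final AtomicLong latestRevision = new AtomicLong();

    /** Revision up to (and including) which compaction has completed. */
    final AtomicLong compactedRevision = new AtomicLong(-1);

    /** Thrown when the requested revision may have been compacted away. */
    static class CompactedException extends RuntimeException {
        CompactedException(long revision) {
            super("Requested revision has been compacted: " + revision);
        }
    }

    /**
     * @param read Lock-free read from RocksDB: (key, revision) -> value, or null if nothing is found.
     */
    byte[] get(byte[] key, long revUpperBound, BiFunction<byte[], Long, byte[]> read) {
        // 1. Fix the revision for this operation - the "consistent cut".
        long rev = Math.min(revUpperBound, latestRevision.get());

        // 2. Read optimistically, assuming that compaction doesn't happen.
        byte[] value = read.apply(key, rev);

        // 3. Validate after the fact: if compaction has caught up with the fixed
        //    revision, the miss might be caused by a concurrently removed revision.
        if (value == null && rev <= compactedRevision.get()) {
            throw new CompactedException(rev);
        }

        return value; // A null here genuinely means "no such key at this revision".
    }
}
{code}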
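And a sketch of the compactor/FSM write protocol from the "Compactor" item, again with hypothetical names ({{BatchPreparer}} stands in for the compactor's cursor logic) and a deliberately crude single-hash bloom filter. The point is only the locking discipline: the exclusive lock covers just the {{db.write(writeOptions, batch)}} calls, the FSM never retries, and the compactor retries (clearing the filter and refreshing its iterator) whenever its prepared batch might race with a concurrent FSM write. With a large enough filter the "maybe" branch fires rarely, so iterator refreshes become rare as well.
{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteBatch;
import org.rocksdb.WriteOptions;

/** Sketch only: apart from the RocksDB classes, every name here is hypothetical. */
class CompactorWriteSketch {
    private final RocksDB db;
    private final WriteOptions writeOptions;
    private final ReentrantLock writeLock = new ReentrantLock();

    /** Crude single-hash bloom filter of keys touched by the FSM since the last clear (conservative). */
    private final long[] touched = new long[1 << 16];

    CompactorWriteSketch(RocksDB db, WriteOptions writeOptions) {
        this.db = db;
        this.writeOptions = writeOptions;
    }

    /** FSM write path: acquire the lock, write, record the touched keys, no retries. */
    void fsmWrite(WriteBatch batch, List<byte[]> keysInBatch) throws RocksDBException {
        writeLock.lock();
        try {
            db.write(writeOptions, batch);
            keysInBatch.forEach(this::mark);
        } finally {
            writeLock.unlock();
        }
    }

    /** Compactor path: retry until the prepared batch is provably safe to apply. */
    void compact(List<byte[]> keys, BatchPreparer preparer) throws RocksDBException {
        while (true) {
            // Built off-lock from a cursor over the keyspace; may be based on stale reads.
            try (WriteBatch batch = preparer.prepareBatch(keys)) {
                writeLock.lock();
                try {
                    if (keys.stream().noneMatch(this::maybeTouched)) {
                        db.write(writeOptions, batch); // Definitely no race with the FSM.
                        return;
                    }
                    // "No" or "don't know": clear before re-reading, so that FSM writes
                    // performed during the re-read are recorded again.
                    Arrays.fill(touched, 0L);
                } finally {
                    writeLock.unlock();
                }
            }
            preparer.refreshIterator(); // Rare: only on a possible conflict.
        }
    }

    private void mark(byte[] key) {
        int h = Arrays.hashCode(key);
        touched[(h >>> 6) & (touched.length - 1)] |= 1L << (h & 63);
    }

    private boolean maybeTouched(byte[] key) {
        int h = Arrays.hashCode(key);
        return (touched[(h >>> 6) & (touched.length - 1)] & 1L << (h & 63)) != 0;
    }

    /** Hypothetical hook into the compactor's cursor over the data to clean up. */
    interface BatchPreparer {
        WriteBatch prepareBatch(List<byte[]> keys) throws RocksDBException;

        void refreshIterator() throws RocksDBException;
    }
}
{code}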
> Metastorage compaction blocks other threads by reading from drive
> -----------------------------------------------------------------
>
> Key: IGNITE-25030
> URL: https://issues.apache.org/jira/browse/IGNITE-25030
> Project: Ignite
> Issue Type: Bug
> Reporter: Ivan Bessonov
> Priority: Major
> Labels: ignite-3