[ 
https://issues.apache.org/jira/browse/IGNITE-25030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-25030:
-----------------------------------
    Description: 
h3. Problem description

This happened:
{code:java}
2025-04-02 17:30:25:607 +0000 
[WARNING][%node18%common-scheduler-0][FailureManager] Possible failure 
suppressed according to a configured handler [hnd=NoOpFailureHandler 
[super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet 
[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], 
failureCtx=SYSTEM_WORKER_BLOCKED] org.apache.ignite.lang.IgniteException: 
IGN-WORKERS-1 TraceId:c2c4156f-a173-4531-b5c4-2adb672f4825 A critical thread is 
blocked for 46737 ms that is more than the allowed 500 ms, it is 
"%node18%MessagingService-inbound-Default-0-2" prio=10 Id=149 WAITING on 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@21433869 owned by 
"%node18%metastorage-compaction-executor-0" Id=324{code}
The fact that we're doing this read in a network thread will be addressed in a 
separate issue. All we need to know in this context is that:
 * When we're doing {{{}RocksDbKeyValueStorage.timestampByRevision{}}}, we're 
waiting for a read lock, which is kind of stupid. This problem actually extends to 
all the other read operations. I'm calling it stupid because the meta-storage is a 
multi-versioned storage by design, and such locks should not exist.
 * When we're holding a write lock in the compaction thread, or potentially during 
any other write operation in a meta-storage FSM thread, we might spend an 
unexpectedly large amount of time there, which is simply unacceptable. Reading 
data from the storage should be avoided while holding a lock like that.

I created this issue as a *Bug* because it is a serious design flaw, and it 
makes the system completely unresponsive in certain cases.
h3. Potential improvements

Here I will present a semi-organized list of thoughts that might be converted 
into a design later on.
 * The RW lock in its current form must be removed.
 * Every time we read the data, we:
 ** Fix the revision number for the operation.
 ** Read the consistent cut.
 ** Read optimistically, assuming that compaction doesn't happen.
 ** If we find that compaction has indeed happened by the time we finish reading 
the data, and the data might be invalid, we throw a compaction exception (see the 
first sketch after this list).
 * Resolving a timestamp into a revision (or the opposite) should never require a 
lock either.
 ** We should use the same optimistic strategy as with regular reads.
 ** We should introduce an on-heap cache for the most recent values, like we 
did for terms in the raft log manager (see the second sketch after this list).
 *** While doing that, we should ensure that it will actually make sense: we need 
to check that we are indeed more likely to resolve recent timestamps/revisions.
 *** Unlike the term values in the raft log manager, here we might want to cache a 
few hundred values. We also need concurrent read access to the cache, meaning 
that the implementation will be different. Just make sure to make it {*}non 
blocking{*}.
 *** _Should be done in a separate Jira_
 * The same goes for reading the index/term/configuration/etc. There must be a 
simple way to make those reads non-blocking as well.
 * The fun part - FSM operations. Several ways to improve them:
 ** Remove the write lock. Use a read-evaluate-write approach.
 ** When we read data in the FSM, we should not fix the revision; we should 
read the "latest" revision, which is more efficient. _Might be done in a separate 
Jira._
 ** We should always remember - there are only two writing agents here: a state 
machine and a compaction process.
 ** The state machine is a latency-sensitive part of the system. We should avoid 
blocking it with concurrent compaction. This restriction dictates the rest of 
the design.
 ** When we read the list of existing revisions for an entry and prepend a new 
revision to it, we don't lock anything. This means that by the time we write 
this list back into the storage, it might contain some compacted revisions. *This 
is fine.* The compactor itself should keep this race in mind and expect such 
trash in the revisions list. If a value is not found, it has been compacted 
previously. In my understanding, this is the only adequate approach to the 
task.
 ** By the way, we should also expect some trash there while just 
reading the data. If something is in the process of being compacted, we 
shouldn't even try to read it, but if we do, we should be ready for data that has 
concurrently disappeared.
 * Compactor:
 ** Here we can't allow writing trash data into the revisions list of any entry, 
because (unlike in the FSM) that would lead to inconsistent data.
 ** I propose doing the {{db.write(writeOptions, batch);}} calls while holding 
an exclusive lock (see the last sketch after this list).
 ** The FSM just acquires the lock and writes the data, without any retries, 
because it has priority.
 ** The compactor should have a check like "is it safe to write the batch that I 
prepared":
 *** If the answer is "yes", we proceed with the write.
 *** If the answer is "no" or "don't know", we:
 **** Don't proceed with writing the batch.
 **** Refresh the cursor.
 **** Re-read the data and prepare a new batch.
 **** Repeat.
 *** The goal is to have a check with a small likelihood of "don't know". 
I propose having a sufficiently large bloom filter and updating it synchronously. 
It should be cleared every time we refresh the iterator.
 ** Such an approach will save us from constant "refresh" calls on the iterator, 
which are slow. They will become rare, and will almost disappear if there's 
not much concurrent load.
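
Below is a minimal sketch of the optimistic read protocol outlined above: fix the revision, read without taking any lock, then validate against the compacted revision and signal a compaction error if the read may have observed partially compacted data. The class, the {{CompactedException}} type and the revision bookkeeping are illustrative assumptions, not the actual {{RocksDbKeyValueStorage}} code.
{code:java}
import java.util.concurrent.atomic.AtomicLong;

class OptimisticMetaStorageReader {
    /** Highest revision removed by compaction; advanced by the compactor once a run completes. */
    private final AtomicLong compactedRevision = new AtomicLong(-1);

    <T> T read(long revisionUpperBound, ReadOperation<T> op) {
        // Fix the revision for the operation - this is the "consistent cut" we read at.
        long revision = revisionUpperBound;

        // Read optimistically, without any lock, assuming compaction doesn't happen.
        T result = op.read(revision);

        // If compaction caught up with the revision we read at, the result may
        // reference data that disappeared mid-read - throw a compaction exception.
        if (revision <= compactedRevision.get()) {
            throw new CompactedException(revision);
        }

        return result;
    }

    /** Called by the compactor after a prefix of revisions has been fully removed. */
    void onCompacted(long revision) {
        compactedRevision.set(revision);
    }

    interface ReadOperation<T> {
        T read(long revision);
    }

    static class CompactedException extends RuntimeException {
        CompactedException(long revision) {
            super("Requested revision has already been compacted: " + revision);
        }
    }
}
{code}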

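For the non-blocking cache of recent revision/timestamp pairs, something along these lines could work. This is a sketch under assumptions: plain {{long}} timestamps, a hypothetical class name, and eviction performed only by the single FSM writer thread; readers never block.
{code:java}
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.function.LongFunction;

class RecentRevisionToTimestampCache {
    /** Roughly "a few hundred" most recent revisions. */
    private static final int MAX_ENTRIES = 512;

    /** Lock-free for readers; only the single FSM writer thread mutates it. */
    private final ConcurrentNavigableMap<Long, Long> cache = new ConcurrentSkipListMap<>();

    /** Called from the FSM thread whenever a new revision is applied. */
    void put(long revision, long timestamp) {
        cache.put(revision, timestamp);

        // Evict the oldest entries; safe because there is a single mutating thread.
        while (cache.size() > MAX_ENTRIES) {
            cache.pollFirstEntry();
        }
    }

    /** Non-blocking resolution; {@code storageLookup} is the slow RocksDB fallback on a miss. */
    long timestampByRevision(long revision, LongFunction<Long> storageLookup) {
        Long cached = cache.get(revision);

        return cached != null ? cached : storageLookup.apply(revision);
    }
}
{code}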
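
Finally, a rough sketch of the proposed compactor/FSM write coordination: the FSM writes its batch under the exclusive lock without retries, while the compactor validates its prepared batch against a bloom filter of keys recently written by the FSM and, if the check says "no" or "don't know", refreshes the iterator, clears the filter and retries. The class, the {{BatchPreparer}}/{{PreparedBatch}} callbacks and the use of Guava's bloom filter are assumptions made for illustration; the sketch assumes RocksJava's {{db.write(writeOptions, batch)}} and {{RocksIterator#refresh()}} calls.
{code:java}
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.util.concurrent.locks.ReentrantLock;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.RocksIterator;
import org.rocksdb.WriteBatch;
import org.rocksdb.WriteOptions;

class CompactionWriteCoordinator {
    private final RocksDB db;
    private final WriteOptions writeOptions = new WriteOptions();
    private final ReentrantLock writeLock = new ReentrantLock();

    /** Keys written by the FSM since the compactor last refreshed its iterator. Guarded by writeLock. */
    private BloomFilter<byte[]> recentFsmWrites = newFilter();

    CompactionWriteCoordinator(RocksDB db) {
        this.db = db;
    }

    /** FSM path: take the lock, record the touched keys, write the batch, no retries. */
    void fsmWrite(WriteBatch batch, Iterable<byte[]> touchedKeys) throws RocksDBException {
        writeLock.lock();
        try {
            for (byte[] key : touchedKeys) {
                recentFsmWrites.put(key);
            }
            db.write(writeOptions, batch);
        } finally {
            writeLock.unlock();
        }
    }

    /** Compactor path: write only if the prepared batch is provably safe, otherwise refresh and retry. */
    void compactorWrite(RocksIterator it, BatchPreparer preparer) throws RocksDBException {
        while (true) {
            PreparedBatch prepared = preparer.prepare(it);

            writeLock.lock();
            try {
                if (isSafeToWrite(prepared)) {
                    db.write(writeOptions, prepared.batch());
                    return;
                }

                // "No" or "don't know": refresh the cursor and clear the filter while
                // still holding the lock, so no concurrent FSM write is lost, then retry.
                it.refresh();
                recentFsmWrites = newFilter();
            } finally {
                writeLock.unlock();
            }
        }
    }

    private boolean isSafeToWrite(PreparedBatch prepared) {
        for (byte[] key : prepared.keys()) {
            if (recentFsmWrites.mightContain(key)) {
                return false; // The batch may be based on stale data for this key.
            }
        }
        return true;
    }

    private static BloomFilter<byte[]> newFilter() {
        return BloomFilter.create(Funnels.byteArrayFunnel(), 1_000_000);
    }

    /** Compaction-side callbacks; both interfaces are hypothetical. */
    interface BatchPreparer {
        PreparedBatch prepare(RocksIterator it) throws RocksDBException;
    }

    interface PreparedBatch {
        WriteBatch batch();
        Iterable<byte[]> keys();
    }
}
{code}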


> Metastorage compaction blocks other threads by reading from drive
> -----------------------------------------------------------------
>
>                 Key: IGNITE-25030
>                 URL: https://issues.apache.org/jira/browse/IGNITE-25030
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
