[ 
https://issues.apache.org/jira/browse/IGNITE-19419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-19419:
----------------------------------
    Description: 
I tried to run a SQL benchmark (slightly modified [Benchbase TPC-C 
scenario|https://github.com/cmu-db/benchbase/wiki/TPC-C]) against Ignite3 
(commit hash b48ddcba7cd2bd3b9a053ae131c25b44a0400e27).

Ignite3 was running in a single-node mode with -Xmx8G -Xms8G memory limit 
(increased memory size comparing to IGNITE-19412). After some time it has 
become unavailable to the client. Log is filled with error messages like the 
following:

 
{code:java}
2023-05-03 17:51:57:705 +0300 
[WARNING][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][NodeImpl] Node 
<e92d1429-965a-4cbd-af37-126844f352ca_part_18/node0> got error: Error 
[type=ERROR_TYPE_LOG, status=Status[EBUSY<1009>: LogManager is busy, disk queue 
overload.]].
2023-05-03 17:51:57:705 +0300 
[WARNING][CompletableFutureDelayScheduler][RaftGroupServiceImpl] Recoverable 
error during the request type=ActionRequestImpl occurred (will be retried on 
the randomly selected node): 
java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
        at 
java.base/java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367)
        at 
java.base/java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376)
        at 
java.base/java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019)
        at 
java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
        at 
java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
        at 
java.base/java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792)
        at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at 
java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.TimeoutException
        ... 7 more
2023-05-03 17:51:57:705 +0300 
[ERROR][%node0%JRaft-LogManager-Disruptor-_stripe_10-0][NodeImpl] Node 
<e92d1429-965a-4cbd-af37-126844f352ca_part_5/node0> append [2268, 2268] failed, 
status=Status[EIO<1014>: Corrupted LogStorage].
2023-05-03 17:51:57:705 +0300 
[WARNING][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][FSMCallerImpl] 
FSMCaller already in error status, ignore new error
Error [type=ERROR_TYPE_LOG, status=Status[EBUSY<1009>: LogManager is busy, disk 
queue overload.]]
        at 
org.apache.ignite.raft.jraft.storage.impl.LogManagerImpl.reportError(LogManagerImpl.java:587)
        at 
org.apache.ignite.raft.jraft.storage.impl.LogManagerImpl.appendEntries(LogManagerImpl.java:337)
        at 
org.apache.ignite.raft.jraft.core.NodeImpl.executeApplyingTasks(NodeImpl.java:1621)
        at 
org.apache.ignite.raft.jraft.core.NodeImpl$LogEntryAndClosureHandler.onEvent(NodeImpl.java:305)
        at 
org.apache.ignite.raft.jraft.core.NodeImpl$LogEntryAndClosureHandler.onEvent(NodeImpl.java:288)
        at 
org.apache.ignite.raft.jraft.disruptor.StripedDisruptor$StripeEntryHandler.onEvent(StripedDisruptor.java:217)
        at 
org.apache.ignite.raft.jraft.disruptor.StripedDisruptor$StripeEntryHandler.onEvent(StripedDisruptor.java:181)
        at 
com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
        at java.base/java.lang.Thread.run(Thread.java:829)
2023-05-03 17:51:57:706 +0300 
[INFO][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][ReplicatorGroupImpl] Fail 
to find the next candidate.
2023-05-03 17:51:57:706 +0300 
[ERROR][%node0%JRaft-LogManager-Disruptor-_stripe_10-0][NodeImpl] Node 
<e92d1429-965a-4cbd-af37-126844f352ca_part_18/node0> append [2271, 2271] 
failed, status=Status[EIO<1014>: Corrupted LogStorage].
2023-05-03 17:51:57:706 +0300 
[INFO][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][StateMachineAdapter] 
onLeaderStop: status=Status[EBADNODE<10009>: Raft node(leader or candidate) is 
in error.].
2023-05-03 17:51:57:706 +0300 
[ERROR][%node0%JRaft-LogManager-Disruptor-_stripe_10-0][NodeImpl] Node 
<30181391-2f04-40c0-8c35-a77592b0731f_part_5/node0> append [2285, 2285] failed, 
status=Status[EIO<1014>: Corrupted LogStorage].
{code}
More errors could be observed in the attached log archive.

Actually the log storage is not corrupted but the queue is overloaded. There 
are some performance improvements in main JRaft repository ( 
[https://github.com/sofastack/sofa-jraft] ) that may fix that, more exactly it 
is commit 5de2fbbcabb70ddbefb06a1d3737821781c3e85c in sofa-jraft repo.

  was:
I tried to run a SQL benchmark (slightly modified [Benchbase TPC-C 
scenario|https://github.com/cmu-db/benchbase/wiki/TPC-C]) against Ignite3 
(commit hash b48ddcba7cd2bd3b9a053ae131c25b44a0400e27).

Ignite3 was running in a single-node mode with -Xmx8G -Xms8G memory limit 
(increased memory size comparing to IGNITE-19412). After some time it has 
become unavailable to the client. Log is filled with error messages like the 
following:

 {code:java}
2023-05-03 17:51:57:705 +0300 
[WARNING][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][NodeImpl] Node 
<e92d1429-965a-4cbd-af37-126844f352ca_part_18/node0> got error: Error 
[type=ERROR_TYPE_LOG, status=Status[EBUSY<1009>: LogManager is busy, disk queue 
overload.]].
2023-05-03 17:51:57:705 +0300 
[WARNING][CompletableFutureDelayScheduler][RaftGroupServiceImpl] Recoverable 
error during the request type=ActionRequestImpl occurred (will be retried on 
the randomly selected node): 
java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
        at 
java.base/java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367)
        at 
java.base/java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376)
        at 
java.base/java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019)
        at 
java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
        at 
java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
        at 
java.base/java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792)
        at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at 
java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.TimeoutException
        ... 7 more
2023-05-03 17:51:57:705 +0300 
[ERROR][%node0%JRaft-LogManager-Disruptor-_stripe_10-0][NodeImpl] Node 
<e92d1429-965a-4cbd-af37-126844f352ca_part_5/node0> append [2268, 2268] failed, 
status=Status[EIO<1014>: Corrupted LogStorage].
2023-05-03 17:51:57:705 +0300 
[WARNING][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][FSMCallerImpl] 
FSMCaller already in error status, ignore new error
Error [type=ERROR_TYPE_LOG, status=Status[EBUSY<1009>: LogManager is busy, disk 
queue overload.]]
        at 
org.apache.ignite.raft.jraft.storage.impl.LogManagerImpl.reportError(LogManagerImpl.java:587)
        at 
org.apache.ignite.raft.jraft.storage.impl.LogManagerImpl.appendEntries(LogManagerImpl.java:337)
        at 
org.apache.ignite.raft.jraft.core.NodeImpl.executeApplyingTasks(NodeImpl.java:1621)
        at 
org.apache.ignite.raft.jraft.core.NodeImpl$LogEntryAndClosureHandler.onEvent(NodeImpl.java:305)
        at 
org.apache.ignite.raft.jraft.core.NodeImpl$LogEntryAndClosureHandler.onEvent(NodeImpl.java:288)
        at 
org.apache.ignite.raft.jraft.disruptor.StripedDisruptor$StripeEntryHandler.onEvent(StripedDisruptor.java:217)
        at 
org.apache.ignite.raft.jraft.disruptor.StripedDisruptor$StripeEntryHandler.onEvent(StripedDisruptor.java:181)
        at 
com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
        at java.base/java.lang.Thread.run(Thread.java:829)
2023-05-03 17:51:57:706 +0300 
[INFO][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][ReplicatorGroupImpl] Fail 
to find the next candidate.
2023-05-03 17:51:57:706 +0300 
[ERROR][%node0%JRaft-LogManager-Disruptor-_stripe_10-0][NodeImpl] Node 
<e92d1429-965a-4cbd-af37-126844f352ca_part_18/node0> append [2271, 2271] 
failed, status=Status[EIO<1014>: Corrupted LogStorage].
2023-05-03 17:51:57:706 +0300 
[INFO][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][StateMachineAdapter] 
onLeaderStop: status=Status[EBADNODE<10009>: Raft node(leader or candidate) is 
in error.].
2023-05-03 17:51:57:706 +0300 
[ERROR][%node0%JRaft-LogManager-Disruptor-_stripe_10-0][NodeImpl] Node 
<30181391-2f04-40c0-8c35-a77592b0731f_part_5/node0> append [2285, 2285] failed, 
status=Status[EIO<1014>: Corrupted LogStorage].
{code}

More errors could be observed in the attached log archive.


> Corrupted LogStorage under intensive SQL load
> ---------------------------------------------
>
>                 Key: IGNITE-19419
>                 URL: https://issues.apache.org/jira/browse/IGNITE-19419
>             Project: Ignite
>          Issue Type: Bug
>          Components: sql
>    Affects Versions: 3.0.0-beta1
>            Reporter: Andrey Khitrin
>            Priority: Major
>              Labels: ignite-3, storage-engine
>         Attachments: ai3.jstack, logs.zip
>
>
> I tried to run a SQL benchmark (slightly modified [Benchbase TPC-C 
> scenario|https://github.com/cmu-db/benchbase/wiki/TPC-C]) against Ignite3 
> (commit hash b48ddcba7cd2bd3b9a053ae131c25b44a0400e27).
> Ignite3 was running in a single-node mode with -Xmx8G -Xms8G memory limit 
> (increased memory size comparing to IGNITE-19412). After some time it has 
> become unavailable to the client. Log is filled with error messages like the 
> following:
>  
> {code:java}
> 2023-05-03 17:51:57:705 +0300 
> [WARNING][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][NodeImpl] Node 
> <e92d1429-965a-4cbd-af37-126844f352ca_part_18/node0> got error: Error 
> [type=ERROR_TYPE_LOG, status=Status[EBUSY<1009>: LogManager is busy, disk 
> queue overload.]].
> 2023-05-03 17:51:57:705 +0300 
> [WARNING][CompletableFutureDelayScheduler][RaftGroupServiceImpl] Recoverable 
> error during the request type=ActionRequestImpl occurred (will be retried on 
> the randomly selected node): 
> java.util.concurrent.CompletionException: 
> java.util.concurrent.TimeoutException
>       at 
> java.base/java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367)
>       at 
> java.base/java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376)
>       at 
> java.base/java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019)
>       at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>       at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>       at 
> java.base/java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792)
>       at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>       at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>       at 
> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>       at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.util.concurrent.TimeoutException
>       ... 7 more
> 2023-05-03 17:51:57:705 +0300 
> [ERROR][%node0%JRaft-LogManager-Disruptor-_stripe_10-0][NodeImpl] Node 
> <e92d1429-965a-4cbd-af37-126844f352ca_part_5/node0> append [2268, 2268] 
> failed, status=Status[EIO<1014>: Corrupted LogStorage].
> 2023-05-03 17:51:57:705 +0300 
> [WARNING][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][FSMCallerImpl] 
> FSMCaller already in error status, ignore new error
> Error [type=ERROR_TYPE_LOG, status=Status[EBUSY<1009>: LogManager is busy, 
> disk queue overload.]]
>       at 
> org.apache.ignite.raft.jraft.storage.impl.LogManagerImpl.reportError(LogManagerImpl.java:587)
>       at 
> org.apache.ignite.raft.jraft.storage.impl.LogManagerImpl.appendEntries(LogManagerImpl.java:337)
>       at 
> org.apache.ignite.raft.jraft.core.NodeImpl.executeApplyingTasks(NodeImpl.java:1621)
>       at 
> org.apache.ignite.raft.jraft.core.NodeImpl$LogEntryAndClosureHandler.onEvent(NodeImpl.java:305)
>       at 
> org.apache.ignite.raft.jraft.core.NodeImpl$LogEntryAndClosureHandler.onEvent(NodeImpl.java:288)
>       at 
> org.apache.ignite.raft.jraft.disruptor.StripedDisruptor$StripeEntryHandler.onEvent(StripedDisruptor.java:217)
>       at 
> org.apache.ignite.raft.jraft.disruptor.StripedDisruptor$StripeEntryHandler.onEvent(StripedDisruptor.java:181)
>       at 
> com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
>       at java.base/java.lang.Thread.run(Thread.java:829)
> 2023-05-03 17:51:57:706 +0300 
> [INFO][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][ReplicatorGroupImpl] 
> Fail to find the next candidate.
> 2023-05-03 17:51:57:706 +0300 
> [ERROR][%node0%JRaft-LogManager-Disruptor-_stripe_10-0][NodeImpl] Node 
> <e92d1429-965a-4cbd-af37-126844f352ca_part_18/node0> append [2271, 2271] 
> failed, status=Status[EIO<1014>: Corrupted LogStorage].
> 2023-05-03 17:51:57:706 +0300 
> [INFO][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][StateMachineAdapter] 
> onLeaderStop: status=Status[EBADNODE<10009>: Raft node(leader or candidate) 
> is in error.].
> 2023-05-03 17:51:57:706 +0300 
> [ERROR][%node0%JRaft-LogManager-Disruptor-_stripe_10-0][NodeImpl] Node 
> <30181391-2f04-40c0-8c35-a77592b0731f_part_5/node0> append [2285, 2285] 
> failed, status=Status[EIO<1014>: Corrupted LogStorage].
> {code}
> More errors could be observed in the attached log archive.
> Actually the log storage is not corrupted but the queue is overloaded. There 
> are some performance improvements in main JRaft repository ( 
> [https://github.com/sofastack/sofa-jraft] ) that may fix that, more exactly 
> it is commit 5de2fbbcabb70ddbefb06a1d3737821781c3e85c in sofa-jraft repo.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to