[ 
https://issues.apache.org/jira/browse/IGNITE-22424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikita Sivkov resolved IGNITE-22424.
------------------------------------
    Resolution: Cannot Reproduce

> Insert and read errors under high cluster load
> ----------------------------------------------
>
>                 Key: IGNITE-22424
>                 URL: https://issues.apache.org/jira/browse/IGNITE-22424
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Nikita Sivkov
>            Priority: Major
>              Labels: ignite-3, ignite3_performance
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> h2. Setup
>  * 1 or 3 server nodes cluster
>  * 3 client nodes
>  * 64+ threads per client performing KV put/get
>  * Each node (server or client) is an AWS instance of {{c5d.4xlarge}}
> h2. Steps
>  * Client nodes do put 15m records
>  * Client nodes do get 15m records
> h2. Expected result
> No errors occurred, all records inserted and read after successfully.
> h2. Actual result
> On some point, cluster degrades and throws errors on inserting/reading.
> Error examples:
>  * The primary replica has changed
> {code:java}
> org.apache.ignite.lang.IgniteException: The primary replica has changed 
> [expectedLeaseholderName=poc-tester-SERVER-192.168.1.58-id-0, 
> currentLeaseholderName=null, 
> expectedLeaseholderId=b4425409-2d71-482d-a5d6-468522569cd0, 
> currentLeaseholderId=null, 
> expectedEnlistmentConsistencyToken=112569527664115719, 
> currentEnlistmentConsistencyToken=null]
>       at 
> java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:710) 
> ~[?:?]
>       at 
> org.apache.ignite.internal.util.ExceptionUtils$1.copy(ExceptionUtils.java:789)
>  ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
>       at 
> org.apache.ignite.internal.util.ExceptionUtils$ExceptionFactory.createCopy(ExceptionUtils.java:723)
>  ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
>       at 
> org.apache.ignite.internal.util.ExceptionUtils.copyExceptionWithCause(ExceptionUtils.java:525)
>  ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
>       at 
> org.apache.ignite.internal.client.ClientUtils.copyExceptionWithCauseIfPossible(ClientUtils.java:73)
>  ~[ignite-client-3.0.0-SNAPSHOT.jar:?]
>       at 
> org.apache.ignite.internal.client.ClientUtils.ensurePublicException(ClientUtils.java:54)
>  ~[ignite-client-3.0.0-SNAPSHOT.jar:?]
>       at 
> org.apache.ignite.internal.client.ClientUtils.sync(ClientUtils.java:97) 
> ~[ignite-client-3.0.0-SNAPSHOT.jar:?]
>       at 
> org.apache.ignite.internal.client.table.ClientKeyValueBinaryView.get(ClientKeyValueBinaryView.java:78)
>  ~[ignite-client-3.0.0-SNAPSHOT.jar:?]
>       at 
> org.apache.ignite.internal.client.table.ClientKeyValueBinaryView.get(ClientKeyValueBinaryView.java:59)
>  ~[ignite-client-3.0.0-SNAPSHOT.jar:?]
>       at site.ycsb.db.ignite3.IgniteClient.read(IgniteClient.java:90) 
> [ignite3-binding-2024.11.jar:?]
>       at site.ycsb.DBWrapper.read(DBWrapper.java:157) [core-2024.11.jar:?]
>       at 
> site.ycsb.workloads.CoreWorkload.doTransactionRead(CoreWorkload.java:803) 
> [core-2024.11.jar:?]
>       at 
> site.ycsb.workloads.CoreWorkload.doTransaction(CoreWorkload.java:722) 
> [core-2024.11.jar:?]
>       at site.ycsb.ClientThread.run(ClientThread.java:145) 
> [core-2024.11.jar:?]
>       at java.lang.Thread.run(Thread.java:829) [?:?] {code}
>  
>  * Replication is timed out
> {code:java}
> org.apache.ignite.tx.TransactionException: Replication is timed out 
> [replicaGrpId=10_part_4]
>       at 
> java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:710) 
> ~[?:?]
>       at 
> org.apache.ignite.internal.util.ExceptionUtils$1.copy(ExceptionUtils.java:789)
>  ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
>       at 
> org.apache.ignite.internal.util.ExceptionUtils$ExceptionFactory.createCopy(ExceptionUtils.java:723)
>  ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
>       at 
> org.apache.ignite.internal.util.ExceptionUtils.copyExceptionWithCause(ExceptionUtils.java:525)
>  ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
>       at 
> org.apache.ignite.internal.client.ClientUtils.copyExceptionWithCauseIfPossible(ClientUtils.java:73)
>  ~[ignite-client-3.0.0-SNAPSHOT.jar:?]
>       at 
> org.apache.ignite.internal.client.ClientUtils.ensurePublicException(ClientUtils.java:54)
>  ~[ignite-client-3.0.0-SNAPSHOT.jar:?]
>       at 
> org.apache.ignite.internal.client.ClientUtils.sync(ClientUtils.java:97) 
> ~[ignite-client-3.0.0-SNAPSHOT.jar:?]
>       at 
> org.apache.ignite.internal.client.table.ClientKeyValueBinaryView.get(ClientKeyValueBinaryView.java:78)
>  ~[ignite-client-3.0.0-SNAPSHOT.jar:?]
>       at 
> org.apache.ignite.internal.client.table.ClientKeyValueBinaryView.get(ClientKeyValueBinaryView.java:59)
>  ~[ignite-client-3.0.0-SNAPSHOT.jar:?]
>       at site.ycsb.db.ignite3.IgniteClient.read(IgniteClient.java:90) 
> [ignite3-binding-2024.11.jar:?]
>       at site.ycsb.DBWrapper.read(DBWrapper.java:157) [core-2024.11.jar:?]
>       at 
> site.ycsb.workloads.CoreWorkload.doTransactionRead(CoreWorkload.java:803) 
> [core-2024.11.jar:?]
>       at 
> site.ycsb.workloads.CoreWorkload.doTransaction(CoreWorkload.java:722) 
> [core-2024.11.jar:?]
>       at site.ycsb.ClientThread.run(ClientThread.java:145) 
> [core-2024.11.jar:?]
>       at java.lang.Thread.run(Thread.java:829) [?:?] {code}
>  
>  * A critical thread is blocked for XXX ms that is more than the allowed 500 
> ms
> {code:java}
> 2024-06-04 22:26:49:031 +0300 
> [ERROR][%poc-tester-SERVER-192.168.1.119-id-0%common-scheduler-0][CriticalWorkerWatchdog]
>  A critical thread is blocked for 512 ms that is more than the allowed 500 
> ms, it is "poc-tester-SERVER-192.168.1.119-id-0-client-32" prio=10 Id=122 
> RUNNABLE (in native)
>     at java.base@11.0.12/sun.nio.ch.EPoll.wait(Native Method)
>     at 
> java.base@11.0.12/sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:120)
>     at 
> java.base@11.0.12/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:124)
>     -  locked io.netty.channel.nio.SelectedSelectionKeySet@71be3d25
>     -  locked sun.nio.ch.EPollSelectorImpl@5aebdd7
>     at java.base@11.0.12/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:141)
>     at 
> app//io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:68)
>     at app//io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:879)
>     at app//io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:526)
>     at 
> app//io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994)
>     at 
> app//io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>     at 
> app//io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.base@11.0.12/java.lang.Thread.run(Thread.java:829){code}
>  
>  * Recoverable error during the request occurred (will be retried on the 
> randomly selected node)
> {code:java}
> 2024-06-04 22:26:49:306 +0300 
> [WARNING][CompletableFutureDelayScheduler][RaftGroupServiceImpl] Recoverable 
> error during the request occurred (will be retried on the randomly selected 
> node) [request=WriteActionRequestImpl [command=[0, 9, 41, -117, -128, -40, 
> -77, -3, -106, -7, -57, 1], deserializedCommand=SafeTimeSyncCommandImpl 
> [safeTimeLong=112559993935757322], groupId=10_part_10], peer=Peer 
> [consistentId=poc-tester-SERVER-192.168.1.119-id-0, idx=0], newPeer=Peer 
> [consistentId=poc-tester-SERVER-192.168.1.119-id-0, idx=0]].
> java.util.concurrent.CompletionException: 
> java.util.concurrent.TimeoutException
>     at 
> java.base/java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367)
>     at 
> java.base/java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376)
>     at 
> java.base/java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019)
>     at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>     at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>     at 
> java.base/java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792)
>     at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>     at 
> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.util.concurrent.TimeoutException {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to