Kirill Sizov created IGNITE-24742:
--------------------------------------

             Summary: Disaster recovery fails on unstable topology
                 Key: IGNITE-24742
                 URL: https://issues.apache.org/jira/browse/IGNITE-24742
             Project: Ignite
          Issue Type: Bug
            Reporter:  Kirill Sizov


*Preconditions*

Randomly start and stop cluster nodes and immediately perform a manual 
reset.

 

*Expected behavior*

The manual reset completes successfully.

 

*Actual behavior*

Sometimes the manual reset fails with the following exception on the client:
{noformat}
org.apache.ignite.compute.ComputeException: Job execution failed: 
org.apache.ignite.internal.table.distributed.disaster.exceptions.DisasterRecoveryException:
 IGN-RECOVERY-3 TraceId:3ccaf9ac-39ab-4c51-88df-1fe2f38e2e0d 
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 
/192.168.210.9:3344
    at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:733) 
~[?:?]
    at 
org.apache.ignite.internal.util.ExceptionUtils$1.copy(ExceptionUtils.java:877) 
~[ignite-core-3.1.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.util.ExceptionUtils$ExceptionFactory.createCopy(ExceptionUtils.java:811)
 ~[ignite-core-3.1.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.util.ExceptionUtils.copyExceptionWithCause(ExceptionUtils.java:613)
 ~[ignite-core-3.1.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.util.ViewUtils.copyExceptionWithCauseIfPossible(ViewUtils.java:91)
 ~[ignite-core-3.1.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.util.ViewUtils.ensurePublicException(ViewUtils.java:71)
 ~[ignite-core-3.1.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.client.compute.ClientCompute.sync(ClientCompute.java:569)
 ~[ignite-client-3.1.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.client.compute.ClientCompute.execute(ClientCompute.java:236)
 ~[ignite-client-3.1.0-SNAPSHOT.jar:?]
    at org.apache.ignite.compute.IgniteCompute.execute(IgniteCompute.java:200) 
~[ignite-api-3.1.0-SNAPSHOT.jar:?]
    at 
org.gridgain.poc.framework.worker.ignite3.task.ChaosTask.body0(ChaosTask.java:151)
 ~[poc-tester-ignite3-0.5.0-SNAPSHOT.jar:?]
    at 
org.gridgain.poc.framework.worker.ignite3.task.AbstractTask.body(AbstractTask.java:356)
 ~[poc-tester-ignite3-0.5.0-SNAPSHOT.jar:?]
    at 
org.gridgain.poc.framework.worker.task.TaskLooper.run(TaskLooper.java:77) 
[poc-tester-core-0.5.0-SNAPSHOT.jar:?]
    at java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: org.apache.ignite.compute.ComputeException: Job execution failed: 
org.apache.ignite.internal.table.distributed.disaster.exceptions.DisasterRecoveryException:
 IGN-RECOVERY-3 TraceId:3ccaf9ac-39ab-4c51-88df-1fe2f38e2e0d 
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 
/192.168.210.9:3344
    at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:733) 
~[?:?]
    at 
org.apache.ignite.internal.util.ExceptionUtils$1.copy(ExceptionUtils.java:877) 
~[ignite-core-3.1.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.util.ExceptionUtils$ExceptionFactory.createCopy(ExceptionUtils.java:811)
 ~[ignite-core-3.1.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.util.ExceptionUtils.copyExceptionWithCause(ExceptionUtils.java:613)
 ~[ignite-core-3.1.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.client.TcpClientChannel.readError(TcpClientChannel.java:555)
 ~[ignite-client-3.1.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.client.TcpClientChannel.processNextMessage(TcpClientChannel.java:449)
 ~[ignite-client-3.1.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.client.TcpClientChannel.onMessage(TcpClientChannel.java:272)
 ~[ignite-client-3.1.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.client.io.netty.NettyClientConnection.onMessage(NettyClientConnection.java:117)
 ~[ignite-client-3.1.0-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.client.io.netty.NettyClientMessageHandler.channelRead(NettyClientMessageHandler.java:33)
 ~[ignite-client-3.1.0-SNAPSHOT.jar:?]
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
 ~[netty-transport-4.1.119.Final.jar:4.1.119.Final]
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
 ~[netty-transport-4.1.119.Final.jar:4.1.119.Final]
    at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
 ~[netty-transport-4.1.119.Final.jar:4.1.119.Final]
    at 
io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
 ~[netty-codec-4.1.119.Final.jar:4.1.119.Final]
    at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
 ~[netty-codec-4.1.119.Final.jar:4.1.119.Final]
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
 ~[netty-transport-4.1.119.Final.jar:4.1.119.Final]
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
 ~[netty-transport-4.1.119.Final.jar:4.1.119.Final]
    at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
 ~[netty-transport-4.1.119.Final.jar:4.1.119.Final]
    at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1357)
 ~[netty-transport-4.1.119.Final.jar:4.1.119.Final]
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
 ~[netty-transport-4.1.119.Final.jar:4.1.119.Final]
    at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
 ~[netty-transport-4.1.119.Final.jar:4.1.119.Final]
    at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:868)
 ~[netty-transport-4.1.119.Final.jar:4.1.119.Final]
    at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
 ~[netty-transport-4.1.119.Final.jar:4.1.119.Final]
    at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:796) 
~[netty-transport-4.1.119.Final.jar:4.1.119.Final]
    at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:732)
 ~[netty-transport-4.1.119.Final.jar:4.1.119.Final]
    at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:658) 
~[netty-transport-4.1.119.Final.jar:4.1.119.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) 
~[netty-transport-4.1.119.Final.jar:4.1.119.Final]
    at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998)
 ~[netty-common-4.1.119.Final.jar:4.1.119.Final]
    at 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) 
~[netty-common-4.1.119.Final.jar:4.1.119.Final]
    at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 ~[netty-common-4.1.119.Final.jar:4.1.119.Final]
    ... 1 more {noformat}
The exception on the server is:
{noformat}
 2025-03-10 08:10:22:071 +0000 
[ERROR][%poc-tester-SERVER-192.168.210.164-id-0%partition-operations-8][GroupUpdateRequest]
 Failed to reset partition
java.util.concurrent.CompletionException: 
org.apache.ignite.internal.table.distributed.disaster.exceptions.DisasterRecoveryException:
 IGN-RECOVERY-3 TraceId:3ccaf9ac-39ab-4c51-88df-1fe2f38e2e0d 
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 
/192.168.210.9:3344
    at 
java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315)
    at 
java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320)
    at 
java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:936)
    at 
java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911)
    at 
java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
    at 
java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:614)
    at 
java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:914)
    at 
java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:482)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: 
org.apache.ignite.internal.table.distributed.disaster.exceptions.DisasterRecoveryException:
 IGN-RECOVERY-3 TraceId:3ccaf9ac-39ab-4c51-88df-1fe2f38e2e0d 
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 
/192.168.210.9:3344
    at 
org.apache.ignite.internal.table.distributed.disaster.DisasterRecoveryManager.lambda$localPartitionStatesInternal$9(DisasterRecoveryManager.java:550)
    at 
java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934)
    ... 8 more
Caused by: java.util.concurrent.CompletionException: 
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 
/192.168.210.9:3344
    at 
java.base/java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:368)
    at 
java.base/java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:377)
    at 
java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1152)
    at 
java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
    at 
java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2194)
    at 
org.apache.ignite.internal.network.netty.NettyUtils.lambda$toCompletableFuture$0(NettyUtils.java:74)
    at 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
    at 
io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583)
    at 
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559)
    at 
io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
    at 
io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
    at 
io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)
    at 
io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)
    at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:326)
    at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:342)
    at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:784)
    at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:732)
    at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:658)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
    at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    ... 1 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: 
Connection refused: /192.168.210.9:3344
Caused by: java.net.ConnectException: Connection refused
    at java.base/sun.nio.ch.Net.pollConnect(Native Method)
    at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:682)
    at 
java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:973)
    at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:336)
    at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:339)
    at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:784)
    at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:732)
    at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:658)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
    at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:1583){noformat}

Likely the logical topology had not yet been updated at the time of the 
request: the disaster recovery manager sends messages to all known nodes, 
receives a connection exception for a node that is no longer reachable, and 
fails the whole reset.
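
One possible direction, sketched below under stated assumptions, is to collect 
per-node responses while tolerating connection failures instead of failing the 
whole fan-out. This is not the actual DisasterRecoveryManager code: the class 
TolerantFanOut, the method collectFromReachable and the String node names are 
hypothetical, and the sketch uses only java.util.concurrent. Whether silently 
skipping unreachable nodes is acceptable for a partition reset is a separate 
design question that this sketch does not answer.

```java
import java.net.ConnectException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical helper: fan out a request and skip nodes that refuse the connection. */
final class TolerantFanOut {
    /** Strips CompletionException wrappers added by CompletableFuture stages. */
    static Throwable unwrap(Throwable t) {
        while (t instanceof CompletionException && t.getCause() != null) {
            t = t.getCause();
        }
        return t;
    }

    /** Returns true if a ConnectException appears anywhere in the cause chain. */
    static boolean isConnectFailure(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof ConnectException) {
                return true;
            }
        }
        return false;
    }

    /**
     * Sends a request to every node and returns the responses of reachable nodes.
     * Connection failures are skipped; any other failure fails the whole result.
     * Assumes successful responses are non-null.
     */
    static <T> CompletableFuture<Map<String, T>> collectFromReachable(
            List<String> nodes, Function<String, CompletableFuture<T>> request) {
        Map<String, T> results = new ConcurrentHashMap<>();
        List<CompletableFuture<Void>> futures = new ArrayList<>();

        for (String node : nodes) {
            CompletableFuture<Void> f = request.apply(node).handle((res, err) -> {
                if (err == null) {
                    results.put(node, res);
                } else if (!isConnectFailure(err)) {
                    // Unexpected error: propagate it to the aggregated future.
                    throw new CompletionException(unwrap(err));
                }
                // Connection failure: the node probably left, so ignore it.
                return null;
            });
            futures.add(f);
        }

        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
                .thenApply(v -> results);
    }

    public static void main(String[] args) {
        CompletableFuture<String> refused = new CompletableFuture<>();
        refused.completeExceptionally(new ConnectException("Connection refused"));

        Map<String, String> states = TolerantFanOut.<String>collectFromReachable(
                List.of("node-a", "node-b"),
                node -> node.equals("node-b")
                        ? refused
                        : CompletableFuture.completedFuture("HEALTHY")
        ).join();

        System.out.println(states);
    }
}
```

In this sketch the request to node-b fails with a ConnectException and is 
skipped, so only node-a contributes to the result. Errors of any other type 
still fail the aggregated future.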



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
