Kirill Sizov created IGNITE-24742: --------------------------------------
Summary: Disaster recovery fails on unstable topology Key: IGNITE-24742 URL: https://issues.apache.org/jira/browse/IGNITE-24742 Project: Ignite Issue Type: Bug Reporter: Kirill Sizov *Preconditions* Randomly start and stop nodes of the cluster and immediately perform manual reset. *Expected behavior* Manual reset completes successfully *Actual behavior* Sometimes manual reset fails. {noformat} org.apache.ignite.compute.ComputeException: Job execution failed: org.apache.ignite.internal.table.distributed.disaster.exceptions.DisasterRecoveryException: IGN-RECOVERY-3 TraceId:3ccaf9ac-39ab-4c51-88df-1fe2f38e2e0d io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /192.168.210.9:3344 at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:733) ~[?:?] at org.apache.ignite.internal.util.ExceptionUtils$1.copy(ExceptionUtils.java:877) ~[ignite-core-3.1.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.util.ExceptionUtils$ExceptionFactory.createCopy(ExceptionUtils.java:811) ~[ignite-core-3.1.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.util.ExceptionUtils.copyExceptionWithCause(ExceptionUtils.java:613) ~[ignite-core-3.1.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.util.ViewUtils.copyExceptionWithCauseIfPossible(ViewUtils.java:91) ~[ignite-core-3.1.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.util.ViewUtils.ensurePublicException(ViewUtils.java:71) ~[ignite-core-3.1.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.client.compute.ClientCompute.sync(ClientCompute.java:569) ~[ignite-client-3.1.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.client.compute.ClientCompute.execute(ClientCompute.java:236) ~[ignite-client-3.1.0-SNAPSHOT.jar:?] at org.apache.ignite.compute.IgniteCompute.execute(IgniteCompute.java:200) ~[ignite-api-3.1.0-SNAPSHOT.jar:?] at org.gridgain.poc.framework.worker.ignite3.task.ChaosTask.body0(ChaosTask.java:151) ~[poc-tester-ignite3-0.5.0-SNAPSHOT.jar:?] at org.gridgain.poc.framework.worker.ignite3.task.AbstractTask.body(AbstractTask.java:356) ~[poc-tester-ignite3-0.5.0-SNAPSHOT.jar:?] at org.gridgain.poc.framework.worker.task.TaskLooper.run(TaskLooper.java:77) [poc-tester-core-0.5.0-SNAPSHOT.jar:?] at java.lang.Thread.run(Thread.java:1583) [?:?] Caused by: org.apache.ignite.compute.ComputeException: Job execution failed: org.apache.ignite.internal.table.distributed.disaster.exceptions.DisasterRecoveryException: IGN-RECOVERY-3 TraceId:3ccaf9ac-39ab-4c51-88df-1fe2f38e2e0d io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /192.168.210.9:3344 at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:733) ~[?:?] at org.apache.ignite.internal.util.ExceptionUtils$1.copy(ExceptionUtils.java:877) ~[ignite-core-3.1.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.util.ExceptionUtils$ExceptionFactory.createCopy(ExceptionUtils.java:811) ~[ignite-core-3.1.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.util.ExceptionUtils.copyExceptionWithCause(ExceptionUtils.java:613) ~[ignite-core-3.1.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.client.TcpClientChannel.readError(TcpClientChannel.java:555) ~[ignite-client-3.1.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.client.TcpClientChannel.processNextMessage(TcpClientChannel.java:449) ~[ignite-client-3.1.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.client.TcpClientChannel.onMessage(TcpClientChannel.java:272) ~[ignite-client-3.1.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.client.io.netty.NettyClientConnection.onMessage(NettyClientConnection.java:117) ~[ignite-client-3.1.0-SNAPSHOT.jar:?] at org.apache.ignite.internal.client.io.netty.NettyClientMessageHandler.channelRead(NettyClientMessageHandler.java:33) ~[ignite-client-3.1.0-SNAPSHOT.jar:?] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.119.Final.jar:4.1.119.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.119.Final.jar:4.1.119.Final] at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.119.Final.jar:4.1.119.Final] at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346) ~[netty-codec-4.1.119.Final.jar:4.1.119.Final] at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318) ~[netty-codec-4.1.119.Final.jar:4.1.119.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[netty-transport-4.1.119.Final.jar:4.1.119.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.119.Final.jar:4.1.119.Final] at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.119.Final.jar:4.1.119.Final] at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1357) ~[netty-transport-4.1.119.Final.jar:4.1.119.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[netty-transport-4.1.119.Final.jar:4.1.119.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.119.Final.jar:4.1.119.Final] at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:868) ~[netty-transport-4.1.119.Final.jar:4.1.119.Final] at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[netty-transport-4.1.119.Final.jar:4.1.119.Final] at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:796) ~[netty-transport-4.1.119.Final.jar:4.1.119.Final] at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:732) ~[netty-transport-4.1.119.Final.jar:4.1.119.Final] at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:658) ~[netty-transport-4.1.119.Final.jar:4.1.119.Final] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) ~[netty-transport-4.1.119.Final.jar:4.1.119.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998) ~[netty-common-4.1.119.Final.jar:4.1.119.Final] at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.119.Final.jar:4.1.119.Final] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.119.Final.jar:4.1.119.Final] ... 1 more {noformat} The exception on server is: {noformat} 2025-03-10 08:10:22:071 +0000 [ERROR][%poc-tester-SERVER-192.168.210.164-id-0%partition-operations-8][GroupUpdateRequest] Failed to reset partition java.util.concurrent.CompletionException: org.apache.ignite.internal.table.distributed.disaster.exceptions.DisasterRecoveryException: IGN-RECOVERY-3 TraceId:3ccaf9ac-39ab-4c51-88df-1fe2f38e2e0d io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /192.168.210.9:3344 at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315) at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320) at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:936) at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911) at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) at java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:614) at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:914) at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:482) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) at java.base/java.lang.Thread.run(Thread.java:1583) Caused by: org.apache.ignite.internal.table.distributed.disaster.exceptions.DisasterRecoveryException: IGN-RECOVERY-3 TraceId:3ccaf9ac-39ab-4c51-88df-1fe2f38e2e0d io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /192.168.210.9:3344 at org.apache.ignite.internal.table.distributed.disaster.DisasterRecoveryManager.lambda$localPartitionStatesInternal$9(DisasterRecoveryManager.java:550) at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934) ... 8 more Caused by: java.util.concurrent.CompletionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /192.168.210.9:3344 at java.base/java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:368) at java.base/java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:377) at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1152) at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2194) at org.apache.ignite.internal.network.netty.NettyUtils.lambda$toCompletableFuture$0(NettyUtils.java:74) at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590) at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583) at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559) at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492) at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636) at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629) at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118) at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:326) at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:342) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:784) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:732) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:658) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998) at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ... 1 more Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /192.168.210.9:3344 Caused by: java.net.ConnectException: Connection refused at java.base/sun.nio.ch.Net.pollConnect(Native Method) at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:682) at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:973) at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:336) at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:339) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:784) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:732) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:658) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998) at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.base/java.lang.Thread.run(Thread.java:1583){noformat} Likely the logical topology has not been updated at the time of the request, disaster recovery manager sends messages to all known nodes, receives a connection exception and fails the whole reset. -- This message was sent by Atlassian Jira (v8.20.10#820010)