Dear Team,
I've been banging my head against this issue for weeks now and I really
hope you can help me understand what's going wrong.
I have two Ignite nodes running on two OVH VPSes with Ubuntu 18.04; let's
call them "coordinator" and "backup".
The two nodes share the same configuration, including a
TcpDiscoveryVmIpFinder with static IPs, an SSL context factory for
exchanging data securely, etc. (see the configuration at the end of this email).
The persistent caches are configured with one backup, the objective
being that my data lives on two separate servers, in case OVH's servers
burn down (which is exactly what happened to the Strasbourg site this
year...). The cache configuration code goes:
CacheConfiguration<String, User> configuration = new CacheConfiguration<String, User>()
        .setSqlSchema(SQL_SCHEMA)
        .setName(MY_CACHE)
        .setCacheMode(CacheMode.PARTITIONED)
        .setIndexedTypes(String.class, User.class)
        .setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL)
        .setBackups(1) // one backup copy, so the data survives the loss of either server
        .setAffinity(new RendezvousAffinityFunction(true));
The nodes start up and for a few days everything seems to go well. If I
look at the baseline using control.sh I get:
Control utility [ver. 2.10.0#20210310-sha1:bc24f6ba]
2021 Copyright(C) Apache Software Foundation
User: root
Time: 2021-08-18T04:00:04.663
Command [BASELINE] started
Arguments: --baseline
--------------------------------------------------------------------------------
Cluster state: active
Current topology version: 5
Baseline auto adjustment disabled: softTimeout=300000
Current topology version: 5 (Coordinator:
ConsistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277,
Address=vps611671.ovh.net/coordinator.ip.address, Order=2)
Baseline nodes:
ConsistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277,
Address=vps611671.ovh.net/coordinator.ip.address, State=ONLINE, Order=2
ConsistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46,
Address=205.ip-54-38-190.eu/backup.ip.address, State=ONLINE, Order=5
--------------------------------------------------------------------------------
Number of baseline nodes: 2
Other nodes not found.
Command [BASELINE] finished with code: 0
Control utility has completed execution at: 2021-08-18T04:00:06.313
Execution time: 1650 ms
Unfortunately, a few days later I checked the baseline again and noticed
that each node had gone its own way: the coordinator moved to topology
version 6, the backup node to topology version 7, and each node now
reports the other as OFFLINE.
Coordinator's baseline:
--------------------------------------------------------------------------------
Cluster state: active
Current topology version: 6
Baseline auto adjustment disabled: softTimeout=300000
Current topology version: 6 (Coordinator:
ConsistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277,
Address=vps611671.ovh.net/coordinator.ip.address, Order=2)
Baseline nodes:
ConsistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277,
Address=vps611671.ovh.net/coordinator.ip.address, State=ONLINE, Order=2
ConsistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, State=OFFLINE
--------------------------------------------------------------------------------
Number of baseline nodes: 2
Other nodes not found.
Command [BASELINE] finished with code: 0
Control utility has completed execution at: 2021-08-18T19:46:56.847
Execution time: 1173 ms
Backup's baseline:
--------------------------------------------------------------------------------
Cluster state: active
Current topology version: 7
Baseline auto adjustment disabled: softTimeout=300000
Current topology version: 7 (Coordinator:
ConsistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46,
Address=205.ip-54-38-190.eu/backup.ip.address, Order=5)
Baseline nodes:
ConsistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, State=OFFLINE
ConsistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46,
Address=205.ip-54-38-190.eu/backup.ip.address, State=ONLINE, Order=5
--------------------------------------------------------------------------------
Number of baseline nodes: 2
Other nodes not found.
Command [BASELINE] finished with code: 0
Control utility has completed execution at: 2021-08-18T19:49:38.300
Execution time: 397 ms
At this point I can't restart the "backup" node; if I try, I get this
nice error:
Caused by: class org.apache.ignite.spi.IgniteSpiException:
BaselineTopology of joining node (72fbc939-bf09-42cb-a7e4-12896046cfc0)
is not compatible with BaselineTopology in the cluster. Branching
history of cluster BlT ([1060612220]) doesn't contain branching point
hash of joining node BlT (173037243). Consider cleaning persistent
storage of the node and adding it to the cluster again.
at
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.checkFailedError(TcpDiscoverySpi.java:2052)
at
org.apache.ignite.spi.discovery.tcp.ServerImpl.joinTopology(ServerImpl.java:1197)
at
org.apache.ignite.spi.discovery.tcp.ServerImpl.spiStart(ServerImpl.java:472)
at
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.spiStart(TcpDiscoverySpi.java:2154)
at
org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:278)
... 13 more
The only option I've found so far (as suggested in
https://stackoverflow.com/questions/66807984/clusters-baselinehistory-doesnt-contain-branching-point-hash-of-joining-node-b)
is to remove the backup node from the baseline on the coordinator with a
"--baseline remove", wipe out the "backup" node's work folder, restart
it, and reattach it to the baseline with a "--baseline add". Obviously
that's a drastic workaround, since it triggers a rebalance of several
gigabytes of data.
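For completeness, the sequence I end up running is roughly this (the
consistent ID is the backup node's, taken from the --baseline output above):

# on the coordinator: drop the failed node from the baseline
control.sh --baseline remove c9fe91c7-117c-449e-b7bc-829c1bc8ae46

# on the backup node: stop Ignite and wipe its persistence
# (I delete the whole Ignite work folder), then start the node again

# back on the coordinator: put the freshly joined node back in the baseline
control.sh --baseline add c9fe91c7-117c-449e-b7bc-829c1bc8ae46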
I've looked at the logs: while the nodes do indeed seem to lose sight of
each other, I can't find the reason for the split. I thought the whole
point of a baseline was to keep the nodes' configuration consistent even
when one or the other becomes unhealthy for a while...
Below I have included the logs of the coordinator node, the backup node,
and the application that connects to Ignite (a Java-based app connecting
in client mode).
You can see that at 04:06:53 the app warns of a failed node (the
consistent ID points to the backup node), then a few seconds later
reports a ~15-second JVM pause and a blocked system-critical thread, and
spews an IgniteException followed by thread dumps.
At exactly the same moment (04:06:53) the coordinator node warns of a
"Timed out waiting for message delivery receipt" followed by a "Failed
to send message to next node". It then moves to topology version 6...
Something similar happens on the "backup" node at about the same time
(the clock in that node's logs is two hours behind for some reason: it
says 02:00:00 but it actually is 04:00:00). It ends up moving to its own
topology version 7.
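Given the 15-second JVM pause reported by the app and the ~10-second
delivery-receipt timeout in the discovery warnings, I'm wondering whether
I should simply raise the failure detection timeouts above the longest
pause I can reasonably expect. This is only an untested idea on my part,
and the values below are a guess, to be added to the IgniteConfiguration
bean shown at the end of this email:

<!-- untested idea: tolerate longer GC/VM pauses before kicking a node out -->
<property name="failureDetectionTimeout" value="30000"/>
<property name="clientFailureDetectionTimeout" value="60000"/>

Would that be the right knob to turn, or would I just be hiding the real
problem?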
These three logs have been appended at the bottom of this email.
So I have two questions here:
1) When this happens, is there any way to "roll back" to topology
version 5, which included both nodes and worked like a charm? Am I
condemned to wiping out the offending node?
2) What on earth might be causing this? I'm pretty sure Ignite works
great, otherwise ING wouldn't be using it. So I guess it must be
something on my side? Any ideas?
Any light you could shed or suggestions you could offer on this matter
would be truly appreciated.
Thanks,
Diego
*** nodes' configuration ***
<?xml version="1.0" encoding="UTF-8" ?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="
           http://www.springframework.org/schema/beans
           http://www.springframework.org/schema/beans/spring-beans.xsd">
    <bean class="org.apache.ignite.configuration.IgniteConfiguration">
        <property name="igniteInstanceName" value="production-ignite"/>
        <property name="metricsLogFrequency" value="0"/>
        <property name="clientMode" value="false"/>
        <property name="sslContextFactory">
            <bean class="org.apache.ignite.ssl.SslContextFactory">
                <property name="keyStoreFilePath" value="/etc/apache-ignite/node.jks"/>
                <property name="keyStorePassword" value="************"/>
                <property name="trustStoreFilePath" value="/etc/apache-ignite/trust.jks"/>
                <property name="trustStorePassword" value="************"/>
                <property name="protocol" value="TLSv1.2"/>
            </bean>
        </property>
        <property name="binaryConfiguration">
            <bean class="org.apache.ignite.configuration.BinaryConfiguration">
                <property name="idMapper">
                    <bean class="org.apache.ignite.binary.BinaryBasicIdMapper">
                        <property name="lowerCase" value="false"/>
                    </bean>
                </property>
                <property name="nameMapper">
                    <bean class="org.apache.ignite.binary.BinaryBasicNameMapper"/>
                </property>
            </bean>
        </property>
        <property name="dataStorageConfiguration">
            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
                <property name="defaultDataRegionConfiguration">
                    <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                        <property name="persistenceEnabled" value="true"/>
                    </bean>
                </property>
                <property name="dataRegionConfigurations">
                    <list>
                        <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                            <property name="name" value="Srp6aRegion"/>
                            <property name="persistenceEnabled" value="false"/>
                        </bean>
                        <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                            <property name="name" value="NotificationsRegion"/>
                            <property name="persistenceEnabled" value="false"/>
                            <property name="pageEvictionMode" value="RANDOM_2_LRU"/>
                            <property name="initialSize" value="#{10L * 1024 * 1024}"/><!-- 10 MB -->
                            <property name="maxSize" value="#{500L * 1024 * 1024}"/><!-- 500 MB -->
                        </bean>
                    </list>
                </property>
            </bean>
        </property>
        <property name="communicationSpi">
            <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
                <property name="localPort" value="47100"/>
                <property name="localPortRange" value="3"/>
            </bean>
        </property>
        <property name="discoverySpi">
            <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <property name="localPort" value="47500"/>
                <property name="localPortRange" value="3"/>
                <property name="ipFinder">
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                        <property name="addresses">
                            <list>
                                <value>127.0.0.1:47500..47501</value>
                                <value>coordinator.ip.address</value>
                                <value>backup.ip.address</value>
                            </list>
                        </property>
                    </bean>
                </property>
            </bean>
        </property>
    </bean>
</beans>
*** Application logs ***
2021-08-18 04:06:53.977 WARN 7663 ---
[disco-event-worker-#56%local-client%]
o.a.i.i.m.d.GridDiscoveryManager : Node FAILED: TcpDiscoveryNode
[id=703bd274-96ea-4567-8c2c-215efd859a2e,
consistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, addrs=ArrayList
[0:0:0:0:0:0:0:1%lo, 127.0.0.1, backup.ip.address], sockAddrs=HashSet
[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500,
205.ip-54-38-190.eu/backup.ip.address:47500], discPort=47500, order=5,
intOrder=4, lastExchangeTime=1628259536802, loc=false,
ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false]
2021-08-18 04:07:10.160 WARN 7663 --- [jvm-pause-detector-worker]
o.a.i.i.IgniteKernal%local-client : Possible too long JVM pause:
15328 milliseconds.
2021-08-18 04:07:10.158 ERROR 7663 ---
[tcp-client-disco-msg-worker-#4%local-client%-#42%local-client%]
o.apache.ignite.internal.util.typedef.G : Blocked system-critical
thread has been detected. This can lead to cluster-wide undefined
behaviour [workerName=partition-exchanger,
threadName=exchange-worker-#57%local-client%, blockedFor=15s]
2021-08-18 04:07:10.200 WARN 7663 ---
[tcp-client-disco-msg-worker-#4%local-client%-#42%local-client%]
ROOT : Possible failure suppressed
accordingly to a configured handler [hnd=StopNodeOrHaltFailureHandler
[tryStop=false, timeout=0, super=AbstractFailureHandler
[ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED,
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
[name=partition-exchanger, igniteInstanceName=local-client,
finished=false, heartbeatTs=1629252430180]]]
org.apache.ignite.IgniteException: GridWorker [name=partition-exchanger,
igniteInstanceName=local-client, finished=false, heartbeatTs=1629252430180]
at java.util.Arrays.fill(Arrays.java:2846) ~[na:1.8.0_292]
at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.CachePartitionFullCountersMap.clear(CachePartitionFullCountersMap.java:99)
~[ignite-core-2.10.0.jar!/:2.10.0]
at
org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtPartitionTopologyImpl.beforeExchange(GridDhtPartitionTopologyImpl.java:599)
~[ignite-core-2.10.0.jar!/:2.10.0]
at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.initTopologies(GridDhtPartitionsExchangeFuture.java:1204)
~[ignite-core-2.10.0.jar!/:2.10.0]
at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:1037)
~[ignite-core-2.10.0.jar!/:2.10.0]
at
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:3373)
~[ignite-core-2.10.0.jar!/:2.10.0]
at
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:3195)
~[ignite-core-2.10.0.jar!/:2.10.0]
at
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
~[ignite-core-2.10.0.jar!/:2.10.0]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_292]
2021-08-18 04:07:10.218 WARN 7663 ---
[tcp-client-disco-msg-worker-#4%local-client%-#42%local-client%]
o.a.i.i.p.failure.FailureProcessor : No deadlocked threads detected.
2021-08-18 04:07:10.357 WARN 7663 ---
[tcp-client-disco-msg-worker-#4%local-client%-#42%local-client%]
o.a.i.i.p.failure.FailureProcessor : Thread dump at 2021/08/18
04:07:10 CEST
Thread [name="sys-#23585%local-client%", id=25717, state=TIMED_WAITING,
blockCnt=0, waitCnt=1]
Lock
[object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@418da723,
ownerName=null, ownerId=-1]
at sun.misc.Unsafe.park(Native Method)
at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at
java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
at
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Thread [name="sys-#23584%local-client%", id=25716, state=TIMED_WAITING,
blockCnt=0, waitCnt=1]
Lock
[object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@418da723,
ownerName=null, ownerId=-1]
at sun.misc.Unsafe.park(Native Method)
at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at
java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
*** "coordinator" node logs ***
[04:06:53,070][WARNING][tcp-disco-msg-worker-[703bd274
backup.ip.address:47500
crd]-#2%production-ignite%-#48%production-ignite%][TcpDiscoverySpi]
Timed out waiting for message delivery receipt (most probably, the
reason is in long GC pauses on remote node; consider tuning GC and
increasing 'ackTimeout' configuration property). Will retry to send
message with increased timeout [currentTimeout=9490,
rmtAddr=205.ip-54-38-190.eu/backup.ip.address:47500, rmtPort=47500]
[04:06:53,120][WARNING][tcp-disco-msg-worker-[703bd274
backup.ip.address:47500
crd]-#2%production-ignite%-#48%production-ignite%][TcpDiscoverySpi]
Failed to send message to next node
[msg=TcpDiscoveryConnectionCheckMessage
[super=TcpDiscoveryAbstractMessage [sndNodeId=null,
id=116986c1b71-c79804fa-52e8-457f-88c8-eeb78b3f34fa,
verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null,
isClient=false]], next=TcpDiscoveryNode
[id=703bd274-96ea-4567-8c2c-215efd859a2e,
consistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, addrs=ArrayList
[0:0:0:0:0:0:0:1%lo, 127.0.0.1, backup.ip.address], sockAddrs=HashSet
[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500,
205.ip-54-38-190.eu/backup.ip.address:47500], discPort=47500, order=5,
intOrder=4, lastExchangeTime=1628259536555, loc=false,
ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false], errMsg=Failed to
send message to next node [msg=TcpDiscoveryConnectionCheckMessage
[super=TcpDiscoveryAbstractMessage [sndNodeId=null,
id=116986c1b71-c79804fa-52e8-457f-88c8-eeb78b3f34fa,
verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null,
isClient=false]], next=ClusterNode
[id=703bd274-96ea-4567-8c2c-215efd859a2e, order=5,
addr=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, backup.ip.address], daemon=false]]]
[04:06:53,132][WARNING][tcp-disco-msg-worker-[703bd274
backup.ip.address:47500
crd]-#2%production-ignite%-#48%production-ignite%][TcpDiscoverySpi]
Local node has detected failed nodes and started cluster-wide procedure.
To speed up failure detection please see 'Failure Detection' section
under javadoc for 'TcpDiscoverySpi'
[04:06:53,214][INFO][tcp-disco-sock-reader-[703bd274
backup.ip.address:45841]-#10%production-ignite%-#185%production-ignite%][TcpDiscoverySpi]
Finished serving remote node connection
[rmtAddr=/backup.ip.address:45841, rmtPort=45841
[04:06:53,362][WARNING][disco-event-worker-#55%production-ignite%][GridDiscoveryManager]
Node FAILED: TcpDiscoveryNode [id=703bd274-96ea-4567-8c2c-215efd859a2e,
consistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, addrs=ArrayList
[0:0:0:0:0:0:0:1%lo, 127.0.0.1, backup.ip.address], sockAddrs=HashSet
[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500,
205.ip-54-38-190.eu/backup.ip.address:47500], discPort=47500, order=5,
intOrder=4, lastExchangeTime=1628259536555, loc=false,
ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false]
[04:06:53,444][INFO][disco-event-worker-#55%production-ignite%][GridDiscoveryManager]
Topology snapshot [ver=6, locNode=c79804fa, servers=1, clients=1,
state=ACTIVE, CPUs=2, offheap=3.5GB, heap=3.4GB]
[04:06:53,444][INFO][disco-event-worker-#55%production-ignite%][GridDiscoveryManager]
^-- Baseline [id=11, size=2, online=1, offline=1]
[04:06:53,942][INFO][exchange-worker-#57%production-ignite%][time]
Started exchange init [topVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0], crd=true, evt=NODE_FAILED,
evtNode=703bd274-96ea-4567-8c2c-215efd859a2e, customEvt=null,
allowMerge=false, exchangeFreeSwitch=true]
[04:06:54,030][INFO][sys-#116790%production-ignite%][GridAffinityAssignmentCache]
Local node affinity assignment distribution is not ideal
[cache=UserCache, expectedPrimary=1024.00, actualPrimary=1024,
expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[04:06:54,032][INFO][exchange-worker-#57%production-ignite%][GridAffinityAssignmentCache]
Local node affinity assignment distribution is not ideal
[cache=ParameterCache, expectedPrimary=1024.00, actualPrimary=1024,
expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[04:06:54,035][INFO][sys-#116796%production-ignite%][GridAffinityAssignmentCache]
Local node affinity assignment distribution is not ideal
[cache=UserEventCache, expectedPrimary=1024.00, actualPrimary=1024,
expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[04:06:54,036][INFO][sys-#116793%production-ignite%][GridAffinityAssignmentCache]
Local node affinity assignment distribution is not ideal
[cache=MessageLifeCache, expectedPrimary=1024.00, actualPrimary=1024,
expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[04:06:54,038][INFO][sys-#116792%production-ignite%][GridAffinityAssignmentCache]
Local node affinity assignment distribution is not ideal
[cache=MessageCache, expectedPrimary=1024.00, actualPrimary=1024,
expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[04:06:54,043][INFO][sys-#116791%production-ignite%][GridAffinityAssignmentCache]
Local node affinity assignment distribution is not ideal
[cache=CliauthCache, expectedPrimary=1024.00, actualPrimary=1024,
expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[04:06:54,092][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture]
Finished waiting for partition release future
[topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], waitTime=0ms,
futInfo=NA, mode=DISTRIBUTED]
[04:06:54,097][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture]
Finished waiting for partitions release latch: ServerLatch [permits=0,
pendingAcks=HashSet [], super=CompletableLatch [id=CompletableLatchUid
[id=exchange-free-replicated, topVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0]]]]
[04:06:54,098][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture]
Finished waiting for partition release future
[topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], waitTime=0ms,
futInfo=NA, mode=DISTRIBUTED]
[04:06:54,098][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture]
Finished waiting for partitions release latch: ServerLatch [permits=0,
pendingAcks=HashSet [], super=CompletableLatch [id=CompletableLatchUid
[id=exchange-free-partitioned, topVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0]]]]
[04:06:54,240][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture]
Finish exchange future [startVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0], err=null, rebalanced=true, wasRebalanced=true]
[04:06:54,308][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture]
Completed partition exchange
[localNode=c79804fa-52e8-457f-88c8-eeb78b3f34fa,
exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
[topVer=6, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode
[id=703bd274-96ea-4567-8c2c-215efd859a2e,
consistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, addrs=ArrayList
[0:0:0:0:0:0:0:1%lo, 127.0.0.1, backup.ip.address], sockAddrs=HashSet
[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500,
205.ip-54-38-190.eu/backup.ip.address:47500], discPort=47500, order=5,
intOrder=4, lastExchangeTime=1628259536555, loc=false,
ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false], rebalanced=true,
done=true, newCrdFut=null], topVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0]]
[04:06:54,308][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture]
Exchange timings [startVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0], stage="Waiting in exchange queue" (39 ms),
stage="Exchange parameters initialization" (30 ms), stage="Determine
exchange type" (131 ms), stage="Preloading notification" (5 ms),
stage="Wait partitions release [latch=exchange-free-replicated]" (9 ms),
stage="Wait partitions release latch [latch=exchange-free-replicated]"
(5 ms), stage="Wait partitions release
[latch=exchange-free-partitioned]" (0 ms), stage="Wait partitions
release latch [latch=exchange-free-partitioned]" (0 ms), stage="After
states restored callback" (18 ms), stage="WAL history reservation" (19
ms), stage="Finalize update counters" (52 ms), stage="Detect lost
partitions" (57 ms), stage="Exchange done" (61 ms), stage="Total time"
(426 ms)]
[04:06:54,316][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture]
Exchange longest local stages [startVer=AffinityTopologyVersion
[topVer=6, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0], stage="Affinity initialization (exchange-free switch on
fully-rebalanced topology) [grp=MessageLifeCache]" (108 ms)
(parent=Determine exchange type), stage="Affinity initialization
(exchange-free switch on fully-rebalanced topology) [grp=CliauthCache]"
(105 ms) (parent=Determine exchange type), stage="Affinity
initialization (exchange-free switch on fully-rebalanced topology)
[grp=UserEventCache]" (100 ms) (parent=Determine exchange type)]
[04:06:54,357][INFO][exchange-worker-#57%production-ignite%][time]
Finished exchange init [topVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0], crd=true]
[04:06:54,473][INFO][exchange-worker-#57%production-ignite%][GridCachePartitionExchangeManager]
Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion
[topVer=6, minorTopVer=0], force=false, evt=NODE_FAILED,
node=703bd274-96ea-4567-8c2c-215efd859a2e]
*** "backup" node logs ***
(remember there's a 2 hour difference in the logs. 02:06:54 is in fact
04:06:54)
[02:06:54,048][WARNING][tcp-disco-msg-worker-[c79804fa
coordinator.ip.address:47500]-#2%production-ignite%-#55%production-ignite%][TcpDiscoverySpi]
Timed out waiting for message delivery receipt (most probably, the
reason is in long GC pauses on remote node; consider tuning GC and
increasing 'ackTimeout' configuration property). Will retry to send
message with increased timeout [currentTimeout=9494,
rmtAddr=vps611671.ovh.net/coordinator.ip.address:47500, rmtPort=47500]
[02:06:54,125][WARNING][tcp-disco-msg-worker-[c79804fa
coordinator.ip.address:47500]-#2%production-ignite%-#55%production-ignite%][TcpDiscoverySpi]
Failed to send message to next node
[msg=TcpDiscoveryConnectionCheckMessage
[super=TcpDiscoveryAbstractMessage [sndNodeId=null,
id=4654f4c1b71-703bd274-96ea-4567-8c2c-215efd859a2e,
verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null,
isClient=false]], next=TcpDiscoveryNode
[id=c79804fa-52e8-457f-88c8-eeb78b3f34fa,
consistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, addrs=ArrayList
[0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address],
sockAddrs=HashSet [vps611671.ovh.net/coordinator.ip.address:47500,
/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=2,
intOrder=2, lastExchangeTime=1628259538158, loc=false,
ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false], errMsg=Failed to
send message to next node [msg=TcpDiscoveryConnectionCheckMessage
[super=TcpDiscoveryAbstractMessage [sndNodeId=null,
id=4654f4c1b71-703bd274-96ea-4567-8c2c-215efd859a2e,
verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null,
isClient=false]], next=ClusterNode
[id=c79804fa-52e8-457f-88c8-eeb78b3f34fa, order=2,
addr=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address],
daemon=false]]]
[02:06:54,130][WARNING][tcp-disco-msg-worker-[c79804fa
coordinator.ip.address:47500]-#2%production-ignite%-#55%production-ignite%][TcpDiscoverySpi]
Local node has detected failed nodes and started cluster-wide procedure.
To speed up failure detection please see 'Failure Detection' section
under javadoc for 'TcpDiscoverySpi'
[02:06:54,135][WARNING][disco-event-worker-#59%production-ignite%][GridDiscoveryManager]
Node FAILED: TcpDiscoveryNode [id=c79804fa-52e8-457f-88c8-eeb78b3f34fa,
consistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, addrs=ArrayList
[0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address],
sockAddrs=HashSet [vps611671.ovh.net/coordinator.ip.address:47500,
/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=2,
intOrder=2, lastExchangeTime=1628259538158, loc=false,
ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false]
[02:06:54,137][INFO][disco-event-worker-#59%production-ignite%][GridDiscoveryManager]
Topology snapshot [ver=6, locNode=703bd274, servers=1, clients=1,
state=ACTIVE, CPUs=4, offheap=3.5GB, heap=3.4GB]
[02:06:54,138][INFO][disco-event-worker-#59%production-ignite%][GridDiscoveryManager]
Coordinator changed [prev=TcpDiscoveryNode
[id=c79804fa-52e8-457f-88c8-eeb78b3f34fa,
consistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, addrs=ArrayList
[0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address],
sockAddrs=HashSet [vps611671.ovh.net/coordinator.ip.address:47500,
/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=2,
intOrder=2, lastExchangeTime=1628259538158, loc=false,
ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false], cur=TcpDiscoveryNode
[id=703bd274-96ea-4567-8c2c-215efd859a2e,
consistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, addrs=ArrayList
[0:0:0:0:0:0:0:1%lo, 127.0.0.1, backup.ip.address], sockAddrs=HashSet
[/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500,
205.ip-54-38-190.eu/backup.ip.address:47500], discPort=47500, order=5,
intOrder=4, lastExchangeTime=1629252404041, loc=true,
ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false]]
[02:06:54,138][INFO][disco-event-worker-#59%production-ignite%][GridDiscoveryManager]
^-- Baseline [id=11, size=2, online=1, offline=1]
[02:06:54,149][INFO][disco-event-worker-#59%production-ignite%][MvccProcessorImpl]
Assigned mvcc coordinator [crd=MvccCoordinator
[topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0],
nodeId=703bd274-96ea-4567-8c2c-215efd859a2e, ver=1628257415621,
local=true, initialized=false]]
[02:06:54,167][INFO][sys-#122238%production-ignite%][ExchangeLatchManager]
Become new coordinator 703bd274-96ea-4567-8c2c-215efd859a2e
[02:06:54,168][INFO][exchange-worker-#61%production-ignite%][time]
Started exchange init [topVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0], crd=true, evt=NODE_FAILED,
evtNode=c79804fa-52e8-457f-88c8-eeb78b3f34fa, customEvt=null,
allowMerge=false, exchangeFreeSwitch=true]
[02:06:54,170][INFO][exchange-worker-#61%production-ignite%][GridAffinityAssignmentCache]
Local node affinity assignment distribution is not ideal
[cache=CliauthCache, expectedPrimary=1024.00, actualPrimary=1024,
expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[02:06:54,171][INFO][sys-#122239%production-ignite%][GridAffinityAssignmentCache]
Local node affinity assignment distribution is not ideal
[cache=MessageCache, expectedPrimary=1024.00, actualPrimary=1024,
expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[02:06:54,172][INFO][exchange-worker-#61%production-ignite%][GridAffinityAssignmentCache]
Local node affinity assignment distribution is not ideal
[cache=MessageLifeCache, expectedPrimary=1024.00, actualPrimary=1024,
expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[02:06:54,175][INFO][exchange-worker-#61%production-ignite%][GridAffinityAssignmentCache]
Local node affinity assignment distribution is not ideal
[cache=UserEventCache, expectedPrimary=1024.00, actualPrimary=1024,
expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[02:06:54,177][INFO][sys-#122240%production-ignite%][GridAffinityAssignmentCache]
Local node affinity assignment distribution is not ideal
[cache=UserCache, expectedPrimary=1024.00, actualPrimary=1024,
expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[02:06:54,185][INFO][sys-#122239%production-ignite%][GridAffinityAssignmentCache]
Local node affinity assignment distribution is not ideal
[cache=ParameterCache, expectedPrimary=1024.00, actualPrimary=1024,
expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[02:06:54,195][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture]
Finished waiting for partition release future
[topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], waitTime=0ms,
futInfo=NA, mode=DISTRIBUTED]
[02:06:54,196][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture]
Finished waiting for partitions release latch: ServerLatch [permits=0,
pendingAcks=HashSet [], super=CompletableLatch [id=CompletableLatchUid
[id=exchange-free-replicated, topVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0]]]]
[02:06:54,197][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture]
Finished waiting for partition release future
[topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], waitTime=0ms,
futInfo=NA, mode=DISTRIBUTED]
[02:06:54,197][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture]
Finished waiting for partitions release latch: ServerLatch [permits=0,
pendingAcks=HashSet [], super=CompletableLatch [id=CompletableLatchUid
[id=exchange-free-partitioned, topVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0]]]]
[02:06:54,229][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture]
Finish exchange future [startVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0], err=null, rebalanced=true, wasRebalanced=true]
[02:06:54,242][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture]
Completed partition exchange
[localNode=703bd274-96ea-4567-8c2c-215efd859a2e,
exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
[topVer=6, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode
[id=c79804fa-52e8-457f-88c8-eeb78b3f34fa,
consistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, addrs=ArrayList
[0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address],
sockAddrs=HashSet [vps611671.ovh.net/coordinator.ip.address:47500,
/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=2,
intOrder=2, lastExchangeTime=1628259538158, loc=false,
ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false], rebalanced=true,
done=true, newCrdFut=null], topVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0]]
[02:06:54,242][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture]
Exchange timings [startVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0], stage="Waiting in exchange queue" (0 ms),
stage="Exchange parameters initialization" (0 ms), stage="Determine
exchange type" (18 ms), stage="Preloading notification" (0 ms),
stage="Wait partitions release [latch=exchange-free-replicated]" (8 ms),
stage="Wait partitions release latch [latch=exchange-free-replicated]"
(1 ms), stage="Wait partitions release
[latch=exchange-free-partitioned]" (0 ms), stage="Wait partitions
release latch [latch=exchange-free-partitioned]" (0 ms), stage="After
states restored callback" (0 ms), stage="WAL history reservation" (3
ms), stage="Finalize update counters" (10 ms), stage="Detect lost
partitions" (25 ms), stage="Exchange done" (6 ms), stage="Total time"
(71 ms)]
[02:06:54,242][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture]
Exchange longest local stages [startVer=AffinityTopologyVersion
[topVer=6, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0], stage="Affinity initialization (exchange-free switch on
fully-rebalanced topology) [grp=ParameterCache]" (13 ms)
(parent=Determine exchange type), stage="Affinity initialization
(exchange-free switch on fully-rebalanced topology) [grp=UserCache]" (10
ms) (parent=Determine exchange type), stage="Affinity initialization
(exchange-free switch on fully-rebalanced topology)
[grp=UserEventCache]" (9 ms) (parent=Determine exchange type)]
[02:06:54,243][INFO][exchange-worker-#61%production-ignite%][time]
Finished exchange init [topVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0], crd=true]
[02:06:54,253][INFO][exchange-worker-#61%production-ignite%][GridCachePartitionExchangeManager]
Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion
[topVer=6, minorTopVer=0], force=false, evt=NODE_FAILED,
node=c79804fa-52e8-457f-88c8-eeb78b3f34fa]
[02:07:14,182][WARNING][tcp-disco-msg-worker-[c79804fa
coordinator.ip.address:47500]-#2%production-ignite%-#55%production-ignite%][TcpDiscoverySpi]
Failing client node due to not receiving metrics updates from client
node within 'IgniteConfiguration.clientFailureDetectionTimeout'
(consider increasing configuration property) [timeout=30000,
node=TcpDiscoveryNode [id=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad,
consistentId=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, addrs=ArrayList
[0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address],
sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0,
vps611671.ovh.net/coordinator.ip.address:0], discPort=0, order=3,
intOrder=3, lastExchangeTime=1628259538158, loc=false,
ver=2.10.0#20210310-sha1:bc24f6ba, isClient=true]]
[02:07:14,183][WARNING][disco-event-worker-#59%production-ignite%][GridDiscoveryManager]
Node FAILED: TcpDiscoveryNode [id=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad,
consistentId=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, addrs=ArrayList
[0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address],
sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0,
vps611671.ovh.net/coordinator.ip.address:0], discPort=0, order=3,
intOrder=3, lastExchangeTime=1628259538158, loc=false,
ver=2.10.0#20210310-sha1:bc24f6ba, isClient=true]
[02:07:14,183][INFO][disco-event-worker-#59%production-ignite%][GridDiscoveryManager]
Topology snapshot [ver=7, locNode=703bd274, servers=1, clients=0,
state=ACTIVE, CPUs=2, offheap=3.5GB, heap=1.7GB]
[02:07:14,183][INFO][disco-event-worker-#59%production-ignite%][GridDiscoveryManager]
^-- Baseline [id=11, size=2, online=1, offline=1]
[02:07:14,189][INFO][exchange-worker-#61%production-ignite%][time]
Started exchange init [topVer=AffinityTopologyVersion [topVer=7,
minorTopVer=0], crd=true, evt=NODE_FAILED,
evtNode=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, customEvt=null,
allowMerge=true, exchangeFreeSwitch=false]
[02:07:14,190][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture]
Finish exchange future [startVer=AffinityTopologyVersion [topVer=7,
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=7,
minorTopVer=0], err=null, rebalanced=true, wasRebalanced=true]
[02:07:14,192][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture]
Completed partition exchange
[localNode=703bd274-96ea-4567-8c2c-215efd859a2e,
exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
[topVer=7, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode
[id=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad,
consistentId=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, addrs=ArrayList
[0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address],
sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0,
vps611671.ovh.net/coordinator.ip.address:0], discPort=0, order=3,
intOrder=3, lastExchangeTime=1628259538158, loc=false,
ver=2.10.0#20210310-sha1:bc24f6ba, isClient=true], rebalanced=true,
done=true, newCrdFut=null], topVer=AffinityTopologyVersion [topVer=7,
minorTopVer=0]]
[02:07:14,192][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture]
Exchange timings [startVer=AffinityTopologyVersion [topVer=7,
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=7,
minorTopVer=0], stage="Waiting in exchange queue" (0 ms),
stage="Exchange parameters initialization" (0 ms), stage="Determine
exchange type" (0 ms), stage="Exchange done" (2 ms), stage="Total time"
(2 ms)]
[02:07:14,192][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture]
Exchange longest local stages [startVer=AffinityTopologyVersion
[topVer=7, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=7,
minorTopVer=0]]
[02:07:14,192][INFO][exchange-worker-#61%production-ignite%][time]
Finished exchange init [topVer=AffinityTopologyVersion [topVer=7,
minorTopVer=0], crd=true]
[02:07:14,195][INFO][exchange-worker-#61%production-ignite%][GridCachePartitionExchangeManager]
Skipping rebalancing (no affinity changes) [top=AffinityTopologyVersion
[topVer=7, minorTopVer=0], evt=NODE_FAILED,
evtNode=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, client=false]