Dear Team,

I've been banging my head against this issue for weeks now, and I really hope you can help me understand what's going wrong.

I have two Ignite nodes, each running on an OVH VPS with Ubuntu 18.04: let's call them "coordinator" and "backup".

The two nodes share the same configuration, including a TcpDiscoveryVmIpFinder with static IPs, an SSL context factory for securing data exchange, etc. (see the full configuration at the end of this email).

The persistent caches are configured so that data has one backup; the objective is to keep my data on two separate servers in case an OVH datacenter burns down (which is exactly what happened to the Strasbourg site this year...). The code goes:

CacheConfiguration<String, User> configuration = new CacheConfiguration<String, User>()
        .setSqlSchema(SQL_SCHEMA)
        .setName(MY_CACHE)
        .setCacheMode(CacheMode.PARTITIONED)
        .setIndexedTypes(String.class, User.class)
        .setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL)
        .setBackups(1)
        .setAffinity(new RendezvousAffinityFunction(true));

So the nodes are started and for a few days all seems to go well. If I look at the baseline using control.sh I get:

    Control utility [ver. 2.10.0#20210310-sha1:bc24f6ba]
    2021 Copyright(C) Apache Software Foundation
    User: root
    Time: 2021-08-18T04:00:04.663
    Command [BASELINE] started
    Arguments: --baseline
--------------------------------------------------------------------------------
    Cluster state: active
    Current topology version: 5
    Baseline auto adjustment disabled: softTimeout=300000

    Current topology version: 5 (Coordinator: ConsistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, Address=vps611671.ovh.net/coordinator.ip.address, Order=2)

    Baseline nodes:
        ConsistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, Address=vps611671.ovh.net/coordinator.ip.address, State=ONLINE, Order=2
        ConsistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, Address=205.ip-54-38-190.eu/backup.ip.address, State=ONLINE, Order=5
--------------------------------------------------------------------------------
    Number of baseline nodes: 2

    Other nodes not found.
    Command [BASELINE] finished with code: 0
    Control utility has completed execution at: 2021-08-18T04:00:06.313
    Execution time: 1650 ms

Unfortunately, after a few days I went to check the baseline and noticed that each node had gone its own way: the coordinator moved to topology version 6 and the backup node to topology version 7. Each node reports the other as OFFLINE.

Coordinator's baseline:

--------------------------------------------------------------------------------
    Cluster state: active
    Current topology version: 6
    Baseline auto adjustment disabled: softTimeout=300000

    Current topology version: 6 (Coordinator: ConsistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, Address=vps611671.ovh.net/coordinator.ip.address, Order=2)

    Baseline nodes:
        ConsistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, Address=vps611671.ovh.net/coordinator.ip.address, State=ONLINE, Order=2
        ConsistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, State=OFFLINE
--------------------------------------------------------------------------------
    Number of baseline nodes: 2

    Other nodes not found.
    Command [BASELINE] finished with code: 0
    Control utility has completed execution at: 2021-08-18T19:46:56.847
    Execution time: 1173 ms

Backup's baseline:

--------------------------------------------------------------------------------
    Cluster state: active
    Current topology version: 7
    Baseline auto adjustment disabled: softTimeout=300000

    Current topology version: 7 (Coordinator: ConsistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, Address=205.ip-54-38-190.eu/backup.ip.address, Order=5)

    Baseline nodes:
        ConsistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, State=OFFLINE
        ConsistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, Address=205.ip-54-38-190.eu/backup.ip.address, State=ONLINE, Order=5
--------------------------------------------------------------------------------
    Number of baseline nodes: 2

    Other nodes not found.
    Command [BASELINE] finished with code: 0
    Control utility has completed execution at: 2021-08-18T19:49:38.300
    Execution time: 397 ms

At this point I can't restart the "backup" node; whenever I try, I get this nice error:

Caused by: class org.apache.ignite.spi.IgniteSpiException: BaselineTopology of joining node (72fbc939-bf09-42cb-a7e4-12896046cfc0) is not compatible with BaselineTopology in the cluster. Branching history of cluster BlT ([1060612220]) doesn't contain branching point hash of joining node BlT (173037243). Consider cleaning persistent storage of the node and adding it to the cluster again.
        at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.checkFailedError(TcpDiscoverySpi.java:2052)
        at org.apache.ignite.spi.discovery.tcp.ServerImpl.joinTopology(ServerImpl.java:1197)
        at org.apache.ignite.spi.discovery.tcp.ServerImpl.spiStart(ServerImpl.java:472)
        at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.spiStart(TcpDiscoverySpi.java:2154)
        at org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:278)
        ... 13 more

The only option I've found so far is to remove the backup node from the baseline on the coordinator with "--baseline remove", essentially wipe out the "backup" node's work folder, restart it, and reattach it to the coordinator with "--baseline add".

Obviously that's a drastic workaround, since it triggers a rebalancing of several gigabytes of data. Unfortunately it's the only one I've found so far (as suggested in https://stackoverflow.com/questions/66807984/clusters-baselinehistory-doesnt-contain-branching-point-hash-of-joining-node-b).
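
For reference, the procedure I follow is roughly this (a sketch; the consistent ID is the backup node's ID from the baseline output above, and the wiped path assumes the default work directory, so adjust to your setup):

    # on the coordinator: drop the dead node from the baseline
    ./control.sh --baseline remove c9fe91c7-117c-449e-b7bc-829c1bc8ae46 --yes

    # on the backup node: stop Ignite, then wipe its persistence files
    rm -rf $IGNITE_HOME/work/db

    # restart the backup node, wait for it to join, then re-add it
    ./control.sh --baseline add c9fe91c7-117c-449e-b7bc-829c1bc8ae46 --yes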

I've looked at the logs: while it does indeed seem that the nodes lose sight of each other, I can't find the reason for the split. I thought the whole point of a baseline was to keep the nodes together even when one or the other becomes unhealthy for a while...

Below I have included logs from the coordinator node, the backup node, and the application that connects to Ignite (a Java-based app running in client mode).

You can see that at 04:06:53 the app warns of a failed node (the consistent ID points to the backup node), then a few seconds later reports a possible 15-second JVM pause (15328 ms) and spews an IgniteException followed by a thread dump.
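
That 15-second pause is what makes me suspect GC (or the VPS itself stalling). To confirm, I'm planning to enable GC logging on both the app and the server nodes; since everything runs on Java 8 (1.8.0_292 per the stack traces), something like these JVM options (the log path is just an example):

    -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/ignite/gc.log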

At exactly the same moment (04:06:53) the coordinator node warns "Timed out waiting for message delivery receipt" followed by "Failed to send message to next node". It then moves to topology version 6...

Something similar happens on the "backup" node at about the same time (the clock in that server's logs is two hours behind for some reason: 02:00:00 is actually 04:00:00). It ends up moving to its own topology version 7.
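
Given that both servers complain about missed delivery receipts, one thing I'm considering (not applied yet, just a sketch) is raising the failure detection timeout in the IgniteConfiguration bean, so that a single long pause doesn't immediately split the cluster:

    <!-- sketch: raise failure detection from the 10 s default to 60 s -->
    <property name="failureDetectionTimeout" value="60000"/>

The logs also suggest increasing TcpDiscoverySpi's 'ackTimeout', but as I understand it, setting 'ackTimeout' explicitly makes Ignite ignore the failure detection timeout for discovery, so I'd rather start with the global knob.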

These three logs have been appended at the bottom of this email.


So I have two questions here:

1) When this happens, is there any way to "roll back" to topology version 5, which included both nodes and worked like a charm? Am I condemned to wiping out the offending node?
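
(I did notice that control.sh has a "--baseline version <topologyVersion>" subcommand; would something like "control.sh --baseline version 5" help here, or does the branching-point mismatch make that a non-starter?)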

2) What on earth might be causing this? I'm pretty sure Ignite works great, otherwise ING wouldn't be using it. So I guess it must be something on my side? Any ideas?

Any light or suggestion you could bring me on this matter would be truly appreciated.

Thanks,

Diego


*** nodes configuration ***


<?xml version="1.0" encoding="UTF-8" ?>

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="
       http://www.springframework.org/schema/beans
       http://www.springframework.org/schema/beans/spring-beans.xsd">

    <bean class="org.apache.ignite.configuration.IgniteConfiguration">

        <property name="igniteInstanceName" value="production-ignite"/>
        <property name="metricsLogFrequency" value="0"/>
        <property name="clientMode" value="false"/>

        <property name="sslContextFactory">
            <bean class="org.apache.ignite.ssl.SslContextFactory">
                <property name="keyStoreFilePath" value="/etc/apache-ignite/node.jks"/>
                <property name="keyStorePassword" value="************"/>
                <property name="trustStoreFilePath" value="/etc/apache-ignite/trust.jks"/>
                <property name="trustStorePassword" value="************"/>
                <property name="protocol" value="TLSv1.2"/>
            </bean>
        </property>

        <property name="binaryConfiguration">
            <bean class="org.apache.ignite.configuration.BinaryConfiguration">
                <property name="idMapper">
                    <bean class="org.apache.ignite.binary.BinaryBasicIdMapper">
                        <property name="lowerCase" value="false"/>
                    </bean>
                </property>
                <property name="nameMapper">
                    <bean class="org.apache.ignite.binary.BinaryBasicNameMapper"/>
                </property>
            </bean>
        </property>

        <property name="dataStorageConfiguration">
            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">

                <property name="defaultDataRegionConfiguration">
                    <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                        <property name="persistenceEnabled" value="true"/>
                    </bean>
                </property>

                <property name="dataRegionConfigurations">
                    <list>
                        <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                            <property name="name" value="Srp6aRegion"/>
                            <property name="persistenceEnabled" value="false"/>
                        </bean>

                        <bean class="org.apache.ignite.configuration.DataRegionConfiguration">                             <property name="name" value="NotificationsRegion"/>                             <property name="persistenceEnabled" value="false"/>                             <property name="pageEvictionMode" value="RANDOM_2_LRU"/>                             <property name="initialSize" value="#{10L * 1024 * 1024}"/><!-- 10Mb -->                             <property name="maxSize" value="#{500L * 1024 * 1024}"/><!-- 500Mb -->
                        </bean>
                    </list>
                </property>
            </bean>
        </property>

        <property name="communicationSpi">
            <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
                <property name="localPort" value="47100"/>
                <property name="localPortRange" value="3"/>
            </bean>
        </property>

        <property name="discoverySpi">
            <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <property name="localPort" value="47500"/>
                <property name="localPortRange" value="3"/>
                <property name="ipFinder">
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                        <property name="addresses">
                            <list>
                                <value>127.0.0.1:47500..47501</value>
                                <value>coordinator.ip.address</value>
                                <value>backup.ip.address</value>
                            </list>
                        </property>
                    </bean>
                </property>
            </bean>
        </property>
    </bean>
</beans>


*** Application logs ***


2021-08-18 04:06:53.977  WARN 7663 --- [disco-event-worker-#56%local-client%] o.a.i.i.m.d.GridDiscoveryManager         : Node FAILED: TcpDiscoveryNode [id=703bd274-96ea-4567-8c2c-215efd859a2e, consistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, backup.ip.address], sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, 205.ip-54-38-190.eu/backup.ip.address:47500], discPort=47500, order=5, intOrder=4, lastExchangeTime=1628259536802, loc=false, ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false]
2021-08-18 04:07:10.160  WARN 7663 --- [jvm-pause-detector-worker] o.a.i.i.IgniteKernal%local-client        : Possible too long JVM pause: 15328 milliseconds.
2021-08-18 04:07:10.158 ERROR 7663 --- [tcp-client-disco-msg-worker-#4%local-client%-#42%local-client%] o.apache.ignite.internal.util.typedef.G  : Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=partition-exchanger, threadName=exchange-worker-#57%local-client%, blockedFor=15s]
2021-08-18 04:07:10.200  WARN 7663 --- [tcp-client-disco-msg-worker-#4%local-client%-#42%local-client%] ROOT                                     : Possible failure suppressed accordingly to a configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=local-client, finished=false, heartbeatTs=1629252430180]]]

org.apache.ignite.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=local-client, finished=false, heartbeatTs=1629252430180]
    at java.util.Arrays.fill(Arrays.java:2846) ~[na:1.8.0_292]
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.CachePartitionFullCountersMap.clear(CachePartitionFullCountersMap.java:99) ~[ignite-core-2.10.0.jar!/:2.10.0]
    at org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtPartitionTopologyImpl.beforeExchange(GridDhtPartitionTopologyImpl.java:599) ~[ignite-core-2.10.0.jar!/:2.10.0]
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.initTopologies(GridDhtPartitionsExchangeFuture.java:1204) ~[ignite-core-2.10.0.jar!/:2.10.0]
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:1037) ~[ignite-core-2.10.0.jar!/:2.10.0]
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:3373) ~[ignite-core-2.10.0.jar!/:2.10.0]
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:3195) ~[ignite-core-2.10.0.jar!/:2.10.0]
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) ~[ignite-core-2.10.0.jar!/:2.10.0]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_292]

2021-08-18 04:07:10.218  WARN 7663 --- [tcp-client-disco-msg-worker-#4%local-client%-#42%local-client%] o.a.i.i.p.failure.FailureProcessor       : No deadlocked threads detected.
2021-08-18 04:07:10.357  WARN 7663 --- [tcp-client-disco-msg-worker-#4%local-client%-#42%local-client%] o.a.i.i.p.failure.FailureProcessor       : Thread dump at 2021/08/18 04:07:10 CEST
Thread [name="sys-#23585%local-client%", id=25717, state=TIMED_WAITING, blockCnt=0, waitCnt=1]
    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@418da723, ownerName=null, ownerId=-1]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Thread [name="sys-#23584%local-client%", id=25716, state=TIMED_WAITING, blockCnt=0, waitCnt=1]     Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@418da723, ownerName=null, ownerId=-1]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)


*** "coordinator" node logs ***


[04:06:53,070][WARNING][tcp-disco-msg-worker-[703bd274 backup.ip.address:47500 crd]-#2%production-ignite%-#48%production-ignite%][TcpDiscoverySpi] Timed out waiting for message delivery receipt (most probably, the reason is in long GC pauses on remote node; consider tuning GC and increasing 'ackTimeout' configuration property). Will retry to send message with increased timeout [currentTimeout=9490, rmtAddr=205.ip-54-38-190.eu/backup.ip.address:47500, rmtPort=47500]
[04:06:53,120][WARNING][tcp-disco-msg-worker-[703bd274 backup.ip.address:47500 crd]-#2%production-ignite%-#48%production-ignite%][TcpDiscoverySpi] Failed to send message to next node [msg=TcpDiscoveryConnectionCheckMessage [super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=116986c1b71-c79804fa-52e8-457f-88c8-eeb78b3f34fa, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=TcpDiscoveryNode [id=703bd274-96ea-4567-8c2c-215efd859a2e, consistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, backup.ip.address], sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, 205.ip-54-38-190.eu/backup.ip.address:47500], discPort=47500, order=5, intOrder=4, lastExchangeTime=1628259536555, loc=false, ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false], errMsg=Failed to send message to next node [msg=TcpDiscoveryConnectionCheckMessage [super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=116986c1b71-c79804fa-52e8-457f-88c8-eeb78b3f34fa, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=ClusterNode [id=703bd274-96ea-4567-8c2c-215efd859a2e, order=5, addr=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, backup.ip.address], daemon=false]]]
[04:06:53,132][WARNING][tcp-disco-msg-worker-[703bd274 backup.ip.address:47500 crd]-#2%production-ignite%-#48%production-ignite%][TcpDiscoverySpi] Local node has detected failed nodes and started cluster-wide procedure. To speed up failure detection please see 'Failure Detection' section under javadoc for 'TcpDiscoverySpi'
[04:06:53,214][INFO][tcp-disco-sock-reader-[703bd274 backup.ip.address:45841]-#10%production-ignite%-#185%production-ignite%][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/backup.ip.address:45841, rmtPort=45841]
[04:06:53,362][WARNING][disco-event-worker-#55%production-ignite%][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=703bd274-96ea-4567-8c2c-215efd859a2e, consistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, backup.ip.address], sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, 205.ip-54-38-190.eu/backup.ip.address:47500], discPort=47500, order=5, intOrder=4, lastExchangeTime=1628259536555, loc=false, ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false]
[04:06:53,444][INFO][disco-event-worker-#55%production-ignite%][GridDiscoveryManager] Topology snapshot [ver=6, locNode=c79804fa, servers=1, clients=1, state=ACTIVE, CPUs=2, offheap=3.5GB, heap=3.4GB]
[04:06:53,444][INFO][disco-event-worker-#55%production-ignite%][GridDiscoveryManager] ^-- Baseline [id=11, size=2, online=1, offline=1]
[04:06:53,942][INFO][exchange-worker-#57%production-ignite%][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], crd=true, evt=NODE_FAILED, evtNode=703bd274-96ea-4567-8c2c-215efd859a2e, customEvt=null, allowMerge=false, exchangeFreeSwitch=true]
[04:06:54,030][INFO][sys-#116790%production-ignite%][GridAffinityAssignmentCache] Local node affinity assignment distribution is not ideal [cache=UserCache, expectedPrimary=1024.00, actualPrimary=1024, expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[04:06:54,032][INFO][exchange-worker-#57%production-ignite%][GridAffinityAssignmentCache] Local node affinity assignment distribution is not ideal [cache=ParameterCache, expectedPrimary=1024.00, actualPrimary=1024, expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[04:06:54,035][INFO][sys-#116796%production-ignite%][GridAffinityAssignmentCache] Local node affinity assignment distribution is not ideal [cache=UserEventCache, expectedPrimary=1024.00, actualPrimary=1024, expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[04:06:54,036][INFO][sys-#116793%production-ignite%][GridAffinityAssignmentCache] Local node affinity assignment distribution is not ideal [cache=MessageLifeCache, expectedPrimary=1024.00, actualPrimary=1024, expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[04:06:54,038][INFO][sys-#116792%production-ignite%][GridAffinityAssignmentCache] Local node affinity assignment distribution is not ideal [cache=MessageCache, expectedPrimary=1024.00, actualPrimary=1024, expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[04:06:54,043][INFO][sys-#116791%production-ignite%][GridAffinityAssignmentCache] Local node affinity assignment distribution is not ideal [cache=CliauthCache, expectedPrimary=1024.00, actualPrimary=1024, expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[04:06:54,092][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture] Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], waitTime=0ms, futInfo=NA, mode=DISTRIBUTED]
[04:06:54,097][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture] Finished waiting for partitions release latch: ServerLatch [permits=0, pendingAcks=HashSet [], super=CompletableLatch [id=CompletableLatchUid [id=exchange-free-replicated, topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0]]]]
[04:06:54,098][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture] Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], waitTime=0ms, futInfo=NA, mode=DISTRIBUTED]
[04:06:54,098][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture] Finished waiting for partitions release latch: ServerLatch [permits=0, pendingAcks=HashSet [], super=CompletableLatch [id=CompletableLatchUid [id=exchange-free-partitioned, topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0]]]]
[04:06:54,240][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], err=null, rebalanced=true, wasRebalanced=true]
[04:06:54,308][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=c79804fa-52e8-457f-88c8-eeb78b3f34fa, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=703bd274-96ea-4567-8c2c-215efd859a2e, consistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, backup.ip.address], sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, 205.ip-54-38-190.eu/backup.ip.address:47500], discPort=47500, order=5, intOrder=4, lastExchangeTime=1628259536555, loc=false, ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false], rebalanced=true, done=true, newCrdFut=null], topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0]]
[04:06:54,308][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture] Exchange timings [startVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], stage="Waiting in exchange queue" (39 ms), stage="Exchange parameters initialization" (30 ms), stage="Determine exchange type" (131 ms), stage="Preloading notification" (5 ms), stage="Wait partitions release [latch=exchange-free-replicated]" (9 ms), stage="Wait partitions release latch [latch=exchange-free-replicated]" (5 ms), stage="Wait partitions release [latch=exchange-free-partitioned]" (0 ms), stage="Wait partitions release latch [latch=exchange-free-partitioned]" (0 ms), stage="After states restored callback" (18 ms), stage="WAL history reservation" (19 ms), stage="Finalize update counters" (52 ms), stage="Detect lost partitions" (57 ms), stage="Exchange done" (61 ms), stage="Total time" (426 ms)]
[04:06:54,316][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture] Exchange longest local stages [startVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], stage="Affinity initialization (exchange-free switch on fully-rebalanced topology) [grp=MessageLifeCache]" (108 ms) (parent=Determine exchange type), stage="Affinity initialization (exchange-free switch on fully-rebalanced topology) [grp=CliauthCache]" (105 ms) (parent=Determine exchange type), stage="Affinity initialization (exchange-free switch on fully-rebalanced topology) [grp=UserEventCache]" (100 ms) (parent=Determine exchange type)]
[04:06:54,357][INFO][exchange-worker-#57%production-ignite%][time] Finished exchange init [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], crd=true]
[04:06:54,473][INFO][exchange-worker-#57%production-ignite%][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=6, minorTopVer=0], force=false, evt=NODE_FAILED, node=703bd274-96ea-4567-8c2c-215efd859a2e]


*** "backup" node logs ***
(remember the two-hour offset in these logs: 02:06:54 is in fact 04:06:54)


[02:06:54,048][WARNING][tcp-disco-msg-worker-[c79804fa coordinator.ip.address:47500]-#2%production-ignite%-#55%production-ignite%][TcpDiscoverySpi] Timed out waiting for message delivery receipt (most probably, the reason is in long GC pauses on remote node; consider tuning GC and increasing 'ackTimeout' configuration property). Will retry to send message with increased timeout [currentTimeout=9494, rmtAddr=vps611671.ovh.net/coordinator.ip.address:47500, rmtPort=47500]
[02:06:54,125][WARNING][tcp-disco-msg-worker-[c79804fa coordinator.ip.address:47500]-#2%production-ignite%-#55%production-ignite%][TcpDiscoverySpi] Failed to send message to next node [msg=TcpDiscoveryConnectionCheckMessage [super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=4654f4c1b71-703bd274-96ea-4567-8c2c-215efd859a2e, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=TcpDiscoveryNode [id=c79804fa-52e8-457f-88c8-eeb78b3f34fa, consistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address], sockAddrs=HashSet [vps611671.ovh.net/coordinator.ip.address:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1628259538158, loc=false, ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false], errMsg=Failed to send message to next node [msg=TcpDiscoveryConnectionCheckMessage [super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=4654f4c1b71-703bd274-96ea-4567-8c2c-215efd859a2e, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=ClusterNode [id=c79804fa-52e8-457f-88c8-eeb78b3f34fa, order=2, addr=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address], daemon=false]]]
[02:06:54,130][WARNING][tcp-disco-msg-worker-[c79804fa coordinator.ip.address:47500]-#2%production-ignite%-#55%production-ignite%][TcpDiscoverySpi] Local node has detected failed nodes and started cluster-wide procedure. To speed up failure detection please see 'Failure Detection' section under javadoc for 'TcpDiscoverySpi'
[02:06:54,135][WARNING][disco-event-worker-#59%production-ignite%][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=c79804fa-52e8-457f-88c8-eeb78b3f34fa, consistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address], sockAddrs=HashSet [vps611671.ovh.net/coordinator.ip.address:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1628259538158, loc=false, ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false]
[02:06:54,137][INFO][disco-event-worker-#59%production-ignite%][GridDiscoveryManager] Topology snapshot [ver=6, locNode=703bd274, servers=1, clients=1, state=ACTIVE, CPUs=4, offheap=3.5GB, heap=3.4GB]
[02:06:54,138][INFO][disco-event-worker-#59%production-ignite%][GridDiscoveryManager] Coordinator changed [prev=TcpDiscoveryNode [id=c79804fa-52e8-457f-88c8-eeb78b3f34fa, consistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address], sockAddrs=HashSet [vps611671.ovh.net/coordinator.ip.address:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1628259538158, loc=false, ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false], cur=TcpDiscoveryNode [id=703bd274-96ea-4567-8c2c-215efd859a2e, consistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, backup.ip.address], sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, 205.ip-54-38-190.eu/backup.ip.address:47500], discPort=47500, order=5, intOrder=4, lastExchangeTime=1629252404041, loc=true, ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false]]
[02:06:54,138][INFO][disco-event-worker-#59%production-ignite%][GridDiscoveryManager] ^-- Baseline [id=11, size=2, online=1, offline=1]
[02:06:54,149][INFO][disco-event-worker-#59%production-ignite%][MvccProcessorImpl] Assigned mvcc coordinator [crd=MvccCoordinator [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], nodeId=703bd274-96ea-4567-8c2c-215efd859a2e, ver=1628257415621, local=true, initialized=false]]
[02:06:54,167][INFO][sys-#122238%production-ignite%][ExchangeLatchManager] Become new coordinator 703bd274-96ea-4567-8c2c-215efd859a2e
[02:06:54,168][INFO][exchange-worker-#61%production-ignite%][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], crd=true, evt=NODE_FAILED, evtNode=c79804fa-52e8-457f-88c8-eeb78b3f34fa, customEvt=null, allowMerge=false, exchangeFreeSwitch=true]
[02:06:54,170][INFO][exchange-worker-#61%production-ignite%][GridAffinityAssignmentCache] Local node affinity assignment distribution is not ideal [cache=CliauthCache, expectedPrimary=1024.00, actualPrimary=1024, expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[02:06:54,171][INFO][sys-#122239%production-ignite%][GridAffinityAssignmentCache] Local node affinity assignment distribution is not ideal [cache=MessageCache, expectedPrimary=1024.00, actualPrimary=1024, expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[02:06:54,172][INFO][exchange-worker-#61%production-ignite%][GridAffinityAssignmentCache] Local node affinity assignment distribution is not ideal [cache=MessageLifeCache, expectedPrimary=1024.00, actualPrimary=1024, expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[02:06:54,175][INFO][exchange-worker-#61%production-ignite%][GridAffinityAssignmentCache] Local node affinity assignment distribution is not ideal [cache=UserEventCache, expectedPrimary=1024.00, actualPrimary=1024, expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[02:06:54,177][INFO][sys-#122240%production-ignite%][GridAffinityAssignmentCache] Local node affinity assignment distribution is not ideal [cache=UserCache, expectedPrimary=1024.00, actualPrimary=1024, expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[02:06:54,185][INFO][sys-#122239%production-ignite%][GridAffinityAssignmentCache] Local node affinity assignment distribution is not ideal [cache=ParameterCache, expectedPrimary=1024.00, actualPrimary=1024, expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[02:06:54,195][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], waitTime=0ms, futInfo=NA, mode=DISTRIBUTED]
[02:06:54,196][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] Finished waiting for partitions release latch: ServerLatch [permits=0, pendingAcks=HashSet [], super=CompletableLatch [id=CompletableLatchUid [id=exchange-free-replicated, topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0]]]]
[02:06:54,197][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], waitTime=0ms, futInfo=NA, mode=DISTRIBUTED]
[02:06:54,197][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] Finished waiting for partitions release latch: ServerLatch [permits=0, pendingAcks=HashSet [], super=CompletableLatch [id=CompletableLatchUid [id=exchange-free-partitioned, topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0]]]]
[02:06:54,229][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], err=null, rebalanced=true, wasRebalanced=true]
[02:06:54,242][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=703bd274-96ea-4567-8c2c-215efd859a2e, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=c79804fa-52e8-457f-88c8-eeb78b3f34fa, consistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address], sockAddrs=HashSet [vps611671.ovh.net/coordinator.ip.address:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1628259538158, loc=false, ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false], rebalanced=true, done=true, newCrdFut=null], topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0]]
[02:06:54,242][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] Exchange timings [startVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], stage="Waiting in exchange queue" (0 ms), stage="Exchange parameters initialization" (0 ms), stage="Determine exchange type" (18 ms), stage="Preloading notification" (0 ms), stage="Wait partitions release [latch=exchange-free-replicated]" (8 ms), stage="Wait partitions release latch [latch=exchange-free-replicated]" (1 ms), stage="Wait partitions release [latch=exchange-free-partitioned]" (0 ms), stage="Wait partitions release latch [latch=exchange-free-partitioned]" (0 ms), stage="After states restored callback" (0 ms), stage="WAL history reservation" (3 ms), stage="Finalize update counters" (10 ms), stage="Detect lost partitions" (25 ms), stage="Exchange done" (6 ms), stage="Total time" (71 ms)]
[02:06:54,242][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] Exchange longest local stages [startVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], stage="Affinity initialization (exchange-free switch on fully-rebalanced topology) [grp=ParameterCache]" (13 ms) (parent=Determine exchange type), stage="Affinity initialization (exchange-free switch on fully-rebalanced topology) [grp=UserCache]" (10 ms) (parent=Determine exchange type), stage="Affinity initialization (exchange-free switch on fully-rebalanced topology) [grp=UserEventCache]" (9 ms) (parent=Determine exchange type)]
[02:06:54,243][INFO][exchange-worker-#61%production-ignite%][time] Finished exchange init [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], crd=true]
[02:06:54,253][INFO][exchange-worker-#61%production-ignite%][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=6, minorTopVer=0], force=false, evt=NODE_FAILED, node=c79804fa-52e8-457f-88c8-eeb78b3f34fa]
[02:07:14,182][WARNING][tcp-disco-msg-worker-[c79804fa coordinator.ip.address:47500]-#2%production-ignite%-#55%production-ignite%][TcpDiscoverySpi] Failing client node due to not receiving metrics updates from client node within 'IgniteConfiguration.clientFailureDetectionTimeout' (consider increasing configuration property) [timeout=30000, node=TcpDiscoveryNode [id=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, consistentId=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address], sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, vps611671.ovh.net/coordinator.ip.address:0], discPort=0, order=3, intOrder=3, lastExchangeTime=1628259538158, loc=false, ver=2.10.0#20210310-sha1:bc24f6ba, isClient=true]]
[02:07:14,183][WARNING][disco-event-worker-#59%production-ignite%][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, consistentId=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address], sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, vps611671.ovh.net/coordinator.ip.address:0], discPort=0, order=3, intOrder=3, lastExchangeTime=1628259538158, loc=false, ver=2.10.0#20210310-sha1:bc24f6ba, isClient=true]
[02:07:14,183][INFO][disco-event-worker-#59%production-ignite%][GridDiscoveryManager] Topology snapshot [ver=7, locNode=703bd274, servers=1, clients=0, state=ACTIVE, CPUs=2, offheap=3.5GB, heap=1.7GB]
[02:07:14,183][INFO][disco-event-worker-#59%production-ignite%][GridDiscoveryManager] ^-- Baseline [id=11, size=2, online=1, offline=1]
[02:07:14,189][INFO][exchange-worker-#61%production-ignite%][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=7, minorTopVer=0], crd=true, evt=NODE_FAILED, evtNode=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, customEvt=null, allowMerge=true, exchangeFreeSwitch=false]
[02:07:14,190][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=7, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=7, minorTopVer=0], err=null, rebalanced=true, wasRebalanced=true]
[02:07:14,192][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=703bd274-96ea-4567-8c2c-215efd859a2e, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=7, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, consistentId=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address], sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, vps611671.ovh.net/coordinator.ip.address:0], discPort=0, order=3, intOrder=3, lastExchangeTime=1628259538158, loc=false, ver=2.10.0#20210310-sha1:bc24f6ba, isClient=true], rebalanced=true, done=true, newCrdFut=null], topVer=AffinityTopologyVersion [topVer=7, minorTopVer=0]]
[02:07:14,192][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] Exchange timings [startVer=AffinityTopologyVersion [topVer=7, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=7, minorTopVer=0], stage="Waiting in exchange queue" (0 ms), stage="Exchange parameters initialization" (0 ms), stage="Determine exchange type" (0 ms), stage="Exchange done" (2 ms), stage="Total time" (2 ms)]
[02:07:14,192][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] Exchange longest local stages [startVer=AffinityTopologyVersion [topVer=7, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=7, minorTopVer=0]]
[02:07:14,192][INFO][exchange-worker-#61%production-ignite%][time] Finished exchange init [topVer=AffinityTopologyVersion [topVer=7, minorTopVer=0], crd=true]
[02:07:14,195][INFO][exchange-worker-#61%production-ignite%][GridCachePartitionExchangeManager] Skipping rebalancing (no affinity changes) [top=AffinityTopologyVersion [topVer=7, minorTopVer=0], evt=NODE_FAILED, evtNode=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, client=false]


