Dear Team,

I've been banging my head against this issue for weeks now, and I really hope you can help me understand what's going wrong.

I have two Ignite nodes, each running on an OVH VPS with Ubuntu 18.04: let's call them "coordinator" and "backup".

The two nodes share the same configuration, including a TcpDiscoveryVmIpFinder with static IPs, an SSL context factory for securing data exchange, etc. (see the full configuration at the end of this email).

The persistent caches are configured so that data has one backup; the objective is to keep my data on two separate servers in case an OVH datacenter burns down (which is exactly what happened to the Strasbourg site this year...). The code goes:

CacheConfiguration<String, User> configuration = new CacheConfiguration<String, User>()
        .setSqlSchema(SQL_SCHEMA)
        .setName(MY_CACHE)
        .setCacheMode(CacheMode.PARTITIONED)
        .setIndexedTypes(String.class, User.class)
        .setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL)
        .setBackups(1)
        .setAffinity(new RendezvousAffinityFunction(true));

So the nodes are started and for a few days all seems to go well. If I look at the baseline using control.sh I get:

    Control utility [ver. 2.10.0#20210310-sha1:bc24f6ba]
    2021 Copyright(C) Apache Software Foundation
    User: root
    Time: 2021-08-18T04:00:04.663
    Command [BASELINE] started
    Arguments: --baseline
--------------------------------------------------------------------------------
    Cluster state: active
    Current topology version: 5
    Baseline auto adjustment disabled: softTimeout=300000

    Current topology version: 5 (Coordinator: ConsistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, Address=vps611671.ovh.net/coordinator.ip.address, Order=2)

    Baseline nodes:
        ConsistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, Address=vps611671.ovh.net/coordinator.ip.address, State=ONLINE, Order=2
        ConsistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, Address=205.ip-54-38-190.eu/backup.ip.address, State=ONLINE, Order=5
--------------------------------------------------------------------------------
    Number of baseline nodes: 2

    Other nodes not found.
    Command [BASELINE] finished with code: 0
    Control utility has completed execution at: 2021-08-18T04:00:06.313
    Execution time: 1650 ms

Unfortunately, after a few days I went to check the baseline and noticed that each node had gone its own way: the coordinator moved to topology version 6 and the backup node to topology version 7. Each node reports the other as OFFLINE.

Coordinator's baseline:

--------------------------------------------------------------------------------
    Cluster state: active
    Current topology version: 6
    Baseline auto adjustment disabled: softTimeout=300000

    Current topology version: 6 (Coordinator: ConsistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, Address=vps611671.ovh.net/coordinator.ip.address, Order=2)

    Baseline nodes:
        ConsistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, Address=vps611671.ovh.net/coordinator.ip.address, State=ONLINE, Order=2
        ConsistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, State=OFFLINE
--------------------------------------------------------------------------------
    Number of baseline nodes: 2

    Other nodes not found.
    Command [BASELINE] finished with code: 0
    Control utility has completed execution at: 2021-08-18T19:46:56.847
    Execution time: 1173 ms

Backup's baseline:

--------------------------------------------------------------------------------
    Cluster state: active
    Current topology version: 7
    Baseline auto adjustment disabled: softTimeout=300000

    Current topology version: 7 (Coordinator: ConsistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, Address=205.ip-54-38-190.eu/backup.ip.address, Order=5)

    Baseline nodes:
        ConsistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, State=OFFLINE
        ConsistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, Address=205.ip-54-38-190.eu/backup.ip.address, State=ONLINE, Order=5
--------------------------------------------------------------------------------
    Number of baseline nodes: 2

    Other nodes not found.
    Command [BASELINE] finished with code: 0
    Control utility has completed execution at: 2021-08-18T19:49:38.300
    Execution time: 397 ms

At this point I can't restart the "backup" node; whenever I try, I get this nice error:

Caused by: class org.apache.ignite.spi.IgniteSpiException: BaselineTopology of joining node (72fbc939-bf09-42cb-a7e4-12896046cfc0) is not compatible with BaselineTopology in the cluster. Branching history of cluster BlT ([1060612220]) doesn't contain branching point hash of joining node BlT (173037243). Consider cleaning persistent storage of the node and adding it to the cluster again.
        at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.checkFailedError(TcpDiscoverySpi.java:2052)
        at org.apache.ignite.spi.discovery.tcp.ServerImpl.joinTopology(ServerImpl.java:1197)
        at org.apache.ignite.spi.discovery.tcp.ServerImpl.spiStart(ServerImpl.java:472)
        at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.spiStart(TcpDiscoverySpi.java:2154)
        at org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:278)
        ... 13 more

The only option I've found so far is to remove the backup node from the baseline on the coordinator with "--baseline remove", essentially wipe out the "backup" node's work folder, restart it, and reattach it to the coordinator with "--baseline add".

Obviously that's a drastic workaround, since it triggers a rebalancing of several gigabytes of data. Unfortunately it's the only one I've found so far (as suggested in https://stackoverflow.com/questions/66807984/clusters-baselinehistory-doesnt-contain-branching-point-hash-of-joining-node-b).
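
For reference, the procedure I follow is roughly this (a sketch; the consistent ID is the backup node's ID from the baseline output above, and the wiped path assumes the default work directory, so adjust to your setup):

    # on the coordinator: drop the dead node from the baseline
    ./control.sh --baseline remove c9fe91c7-117c-449e-b7bc-829c1bc8ae46 --yes

    # on the backup node: stop Ignite, then wipe its persistence files
    rm -rf $IGNITE_HOME/work/db

    # restart the backup node, wait for it to join, then re-add it
    ./control.sh --baseline add c9fe91c7-117c-449e-b7bc-829c1bc8ae46 --yes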

I've looked at the logs: while it does indeed seem that the nodes lose sight of each other, I can't find the reason for the split. I thought the whole point of a baseline was to keep the nodes together even when one or the other becomes unhealthy for a while...

Below I have included logs from the coordinator node, the backup node, and the application that connects to Ignite (a Java-based app running in client mode).

You can see that at 04:06:53 the app warns of a failed node (the consistent ID points to the backup node), then a few seconds later reports a possible 15-second JVM pause (15328 ms) and spews an IgniteException followed by a thread dump.
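
That 15-second pause is what makes me suspect GC (or the VPS itself stalling). To confirm, I'm planning to enable GC logging on both the app and the server nodes; since everything runs on Java 8 (1.8.0_292 per the stack traces), something like these JVM options (the log path is just an example):

    -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/ignite/gc.log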

At exactly the same moment (04:06:53) the coordinator node warns "Timed out waiting for message delivery receipt" followed by "Failed to send message to next node". It then moves to topology version 6...

Something similar happens on the "backup" node at about the same time (the clock in that server's logs is two hours behind for some reason: 02:00:00 is actually 04:00:00). It ends up moving to its own topology version 7.
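
Given that both servers complain about missed delivery receipts, one thing I'm considering (not applied yet, just a sketch) is raising the failure detection timeout in the IgniteConfiguration bean, so that a single long pause doesn't immediately split the cluster:

    <!-- sketch: raise failure detection from the 10 s default to 60 s -->
    <property name="failureDetectionTimeout" value="60000"/>

The logs also suggest increasing TcpDiscoverySpi's 'ackTimeout', but as I understand it, setting 'ackTimeout' explicitly makes Ignite ignore the failure detection timeout for discovery, so I'd rather start with the global knob.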

These three logs have been appended at the bottom of this email.


So I have two questions here:

1) When this happens, is there any way to "roll back" to topology version 5, which included both nodes and worked like a charm? Am I condemned to wiping out the offending node?
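
(I did notice that control.sh has a "--baseline version <topologyVersion>" subcommand; would something like "control.sh --baseline version 5" help here, or does the branching-point mismatch make that a non-starter?)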

2) What on earth might be causing this? I'm pretty sure Ignite works great, otherwise ING wouldn't be using it. So I guess it must be something on my side? Any ideas?

Any light or suggestion you could bring me on this matter would be truly appreciated.

Thanks,

Diego


*** nodes configuration ***


<?xml version="1.0" encoding="UTF-8" ?>

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="
       http://www.springframework.org/schema/beans
       http://www.springframework.org/schema/beans/spring-beans.xsd">

    <bean class="org.apache.ignite.configuration.IgniteConfiguration">

        <property name="igniteInstanceName" value="production-ignite"/>
        <property name="metricsLogFrequency" value="0"/>
        <property name="clientMode" value="false"/>

        <property name="sslContextFactory">
            <bean class="org.apache.ignite.ssl.SslContextFactory">
                <property name="keyStoreFilePath" value="/etc/apache-ignite/node.jks"/>
                <property name="keyStorePassword" value="************"/>
                <property name="trustStoreFilePath" value="/etc/apache-ignite/trust.jks"/>
                <property name="trustStorePassword" value="************"/>
                <property name="protocol" value="TLSv1.2"/>
            </bean>
        </property>

        <property name="binaryConfiguration">
            <bean class="org.apache.ignite.configuration.BinaryConfiguration">
                <property name="idMapper">
                    <bean class="org.apache.ignite.binary.BinaryBasicIdMapper">
                        <property name="lowerCase" value="false"/>
                    </bean>
                </property>
                <property name="nameMapper">
                    <bean class="org.apache.ignite.binary.BinaryBasicNameMapper"/>
                </property>
            </bean>
        </property>

        <property name="dataStorageConfiguration">
            <bean class="org.apache.ignite.configuration.DataStorageConfiguration">

                <property name="defaultDataRegionConfiguration">
                    <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                        <property name="persistenceEnabled" value="true"/>
                    </bean>
                </property>

                <property name="dataRegionConfigurations">
                    <list>
                        <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                            <property name="name" value="Srp6aRegion"/>
                            <property name="persistenceEnabled" value="false"/>
                        </bean>

                        <bean class="org.apache.ignite.configuration.DataRegionConfiguration">                             <property name="name" value="NotificationsRegion"/>                             <property name="persistenceEnabled" value="false"/>                             <property name="pageEvictionMode" value="RANDOM_2_LRU"/>                             <property name="initialSize" value="#{10L * 1024 * 1024}"/><!-- 10Mb -->                             <property name="maxSize" value="#{500L * 1024 * 1024}"/><!-- 500Mb -->
                        </bean>
                    </list>
                </property>
            </bean>
        </property>

        <property name="communicationSpi">
            <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
                <property name="localPort" value="47100"/>
                <property name="localPortRange" value="3"/>
            </bean>
        </property>

        <property name="discoverySpi">
            <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <property name="localPort" value="47500"/>
                <property name="localPortRange" value="3"/>
                <property name="ipFinder">
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                        <property name="addresses">
                            <list>
                                <value>127.0.0.1:47500..47501</value>
                                <value>coordinator.ip.address</value>
                                <value>backup.ip.address</value>
                            </list>
                        </property>
                    </bean>
                </property>
            </bean>
        </property>
    </bean>
</beans>


*** Application logs ***


2021-08-18 04:06:53.977  WARN 7663 --- [disco-event-worker-#56%local-client%] o.a.i.i.m.d.GridDiscoveryManager         : Node FAILED: TcpDiscoveryNode [id=703bd274-96ea-4567-8c2c-215efd859a2e, consistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, backup.ip.address], sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, 205.ip-54-38-190.eu/backup.ip.address:47500], discPort=47500, order=5, intOrder=4, lastExchangeTime=1628259536802, loc=false, ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false]
2021-08-18 04:07:10.160  WARN 7663 --- [jvm-pause-detector-worker] o.a.i.i.IgniteKernal%local-client        : Possible too long JVM pause: 15328 milliseconds.
2021-08-18 04:07:10.158 ERROR 7663 --- [tcp-client-disco-msg-worker-#4%local-client%-#42%local-client%] o.apache.ignite.internal.util.typedef.G  : Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=partition-exchanger, threadName=exchange-worker-#57%local-client%, blockedFor=15s]
2021-08-18 04:07:10.200  WARN 7663 --- [tcp-client-disco-msg-worker-#4%local-client%-#42%local-client%] ROOT                                     : Possible failure suppressed accordingly to a configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=local-client, finished=false, heartbeatTs=1629252430180]]]

org.apache.ignite.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=local-client, finished=false, heartbeatTs=1629252430180]
    at java.util.Arrays.fill(Arrays.java:2846) ~[na:1.8.0_292]
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.CachePartitionFullCountersMap.clear(CachePartitionFullCountersMap.java:99) ~[ignite-core-2.10.0.jar!/:2.10.0]
    at org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtPartitionTopologyImpl.beforeExchange(GridDhtPartitionTopologyImpl.java:599) ~[ignite-core-2.10.0.jar!/:2.10.0]
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.initTopologies(GridDhtPartitionsExchangeFuture.java:1204) ~[ignite-core-2.10.0.jar!/:2.10.0]
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:1037) ~[ignite-core-2.10.0.jar!/:2.10.0]
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:3373) ~[ignite-core-2.10.0.jar!/:2.10.0]
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:3195) ~[ignite-core-2.10.0.jar!/:2.10.0]
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) ~[ignite-core-2.10.0.jar!/:2.10.0]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_292]

2021-08-18 04:07:10.218  WARN 7663 --- [tcp-client-disco-msg-worker-#4%local-client%-#42%local-client%] o.a.i.i.p.failure.FailureProcessor       : No deadlocked threads detected.
2021-08-18 04:07:10.357  WARN 7663 --- [tcp-client-disco-msg-worker-#4%local-client%-#42%local-client%] o.a.i.i.p.failure.FailureProcessor       : Thread dump at 2021/08/18 04:07:10 CEST
Thread [name="sys-#23585%local-client%", id=25717, state=TIMED_WAITING, blockCnt=0, waitCnt=1]
    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@418da723, ownerName=null, ownerId=-1]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Thread [name="sys-#23584%local-client%", id=25716, state=TIMED_WAITING, blockCnt=0, waitCnt=1]     Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@418da723, ownerName=null, ownerId=-1]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)


*** "coordinator" node logs ***


[04:06:53,070][WARNING][tcp-disco-msg-worker-[703bd274 backup.ip.address:47500 crd]-#2%production-ignite%-#48%production-ignite%][TcpDiscoverySpi] Timed out waiting for message delivery receipt (most probably, the reason is in long GC pauses on remote node; consider tuning GC and increasing 'ackTimeout' configuration property). Will retry to send message with increased timeout [currentTimeout=9490, rmtAddr=205.ip-54-38-190.eu/backup.ip.address:47500, rmtPort=47500]
[04:06:53,120][WARNING][tcp-disco-msg-worker-[703bd274 backup.ip.address:47500 crd]-#2%production-ignite%-#48%production-ignite%][TcpDiscoverySpi] Failed to send message to next node [msg=TcpDiscoveryConnectionCheckMessage [super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=116986c1b71-c79804fa-52e8-457f-88c8-eeb78b3f34fa, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=TcpDiscoveryNode [id=703bd274-96ea-4567-8c2c-215efd859a2e, consistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, backup.ip.address], sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, 205.ip-54-38-190.eu/backup.ip.address:47500], discPort=47500, order=5, intOrder=4, lastExchangeTime=1628259536555, loc=false, ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false], errMsg=Failed to send message to next node [msg=TcpDiscoveryConnectionCheckMessage [super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=116986c1b71-c79804fa-52e8-457f-88c8-eeb78b3f34fa, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=ClusterNode [id=703bd274-96ea-4567-8c2c-215efd859a2e, order=5, addr=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, backup.ip.address], daemon=false]]]
[04:06:53,132][WARNING][tcp-disco-msg-worker-[703bd274 backup.ip.address:47500 crd]-#2%production-ignite%-#48%production-ignite%][TcpDiscoverySpi] Local node has detected failed nodes and started cluster-wide procedure. To speed up failure detection please see 'Failure Detection' section under javadoc for 'TcpDiscoverySpi'
[04:06:53,214][INFO][tcp-disco-sock-reader-[703bd274 backup.ip.address:45841]-#10%production-ignite%-#185%production-ignite%][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/backup.ip.address:45841, rmtPort=45841]
[04:06:53,362][WARNING][disco-event-worker-#55%production-ignite%][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=703bd274-96ea-4567-8c2c-215efd859a2e, consistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, backup.ip.address], sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, 205.ip-54-38-190.eu/backup.ip.address:47500], discPort=47500, order=5, intOrder=4, lastExchangeTime=1628259536555, loc=false, ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false]
[04:06:53,444][INFO][disco-event-worker-#55%production-ignite%][GridDiscoveryManager] Topology snapshot [ver=6, locNode=c79804fa, servers=1, clients=1, state=ACTIVE, CPUs=2, offheap=3.5GB, heap=3.4GB]
[04:06:53,444][INFO][disco-event-worker-#55%production-ignite%][GridDiscoveryManager] ^-- Baseline [id=11, size=2, online=1, offline=1]
[04:06:53,942][INFO][exchange-worker-#57%production-ignite%][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], crd=true, evt=NODE_FAILED, evtNode=703bd274-96ea-4567-8c2c-215efd859a2e, customEvt=null, allowMerge=false, exchangeFreeSwitch=true]
[04:06:54,030][INFO][sys-#116790%production-ignite%][GridAffinityAssignmentCache] Local node affinity assignment distribution is not ideal [cache=UserCache, expectedPrimary=1024.00, actualPrimary=1024, expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[04:06:54,032][INFO][exchange-worker-#57%production-ignite%][GridAffinityAssignmentCache] Local node affinity assignment distribution is not ideal [cache=ParameterCache, expectedPrimary=1024.00, actualPrimary=1024, expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[04:06:54,035][INFO][sys-#116796%production-ignite%][GridAffinityAssignmentCache] Local node affinity assignment distribution is not ideal [cache=UserEventCache, expectedPrimary=1024.00, actualPrimary=1024, expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[04:06:54,036][INFO][sys-#116793%production-ignite%][GridAffinityAssignmentCache] Local node affinity assignment distribution is not ideal [cache=MessageLifeCache, expectedPrimary=1024.00, actualPrimary=1024, expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[04:06:54,038][INFO][sys-#116792%production-ignite%][GridAffinityAssignmentCache] Local node affinity assignment distribution is not ideal [cache=MessageCache, expectedPrimary=1024.00, actualPrimary=1024, expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[04:06:54,043][INFO][sys-#116791%production-ignite%][GridAffinityAssignmentCache] Local node affinity assignment distribution is not ideal [cache=CliauthCache, expectedPrimary=1024.00, actualPrimary=1024, expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[04:06:54,092][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture] Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], waitTime=0ms, futInfo=NA, mode=DISTRIBUTED]
[04:06:54,097][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture] Finished waiting for partitions release latch: ServerLatch [permits=0, pendingAcks=HashSet [], super=CompletableLatch [id=CompletableLatchUid [id=exchange-free-replicated, topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0]]]]
[04:06:54,098][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture] Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], waitTime=0ms, futInfo=NA, mode=DISTRIBUTED]
[04:06:54,098][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture] Finished waiting for partitions release latch: ServerLatch [permits=0, pendingAcks=HashSet [], super=CompletableLatch [id=CompletableLatchUid [id=exchange-free-partitioned, topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0]]]]
[04:06:54,240][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], err=null, rebalanced=true, wasRebalanced=true]
[04:06:54,308][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=c79804fa-52e8-457f-88c8-eeb78b3f34fa, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=703bd274-96ea-4567-8c2c-215efd859a2e, consistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, backup.ip.address], sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, 205.ip-54-38-190.eu/backup.ip.address:47500], discPort=47500, order=5, intOrder=4, lastExchangeTime=1628259536555, loc=false, ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false], rebalanced=true, done=true, newCrdFut=null], topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0]]
[04:06:54,308][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture] Exchange timings [startVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], stage="Waiting in exchange queue" (39 ms), stage="Exchange parameters initialization" (30 ms), stage="Determine exchange type" (131 ms), stage="Preloading notification" (5 ms), stage="Wait partitions release [latch=exchange-free-replicated]" (9 ms), stage="Wait partitions release latch [latch=exchange-free-replicated]" (5 ms), stage="Wait partitions release [latch=exchange-free-partitioned]" (0 ms), stage="Wait partitions release latch [latch=exchange-free-partitioned]" (0 ms), stage="After states restored callback" (18 ms), stage="WAL history reservation" (19 ms), stage="Finalize update counters" (52 ms), stage="Detect lost partitions" (57 ms), stage="Exchange done" (61 ms), stage="Total time" (426 ms)]
[04:06:54,316][INFO][exchange-worker-#57%production-ignite%][GridDhtPartitionsExchangeFuture] Exchange longest local stages [startVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], stage="Affinity initialization (exchange-free switch on fully-rebalanced topology) [grp=MessageLifeCache]" (108 ms) (parent=Determine exchange type), stage="Affinity initialization (exchange-free switch on fully-rebalanced topology) [grp=CliauthCache]" (105 ms) (parent=Determine exchange type), stage="Affinity initialization (exchange-free switch on fully-rebalanced topology) [grp=UserEventCache]" (100 ms) (parent=Determine exchange type)]
[04:06:54,357][INFO][exchange-worker-#57%production-ignite%][time] Finished exchange init [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], crd=true]
[04:06:54,473][INFO][exchange-worker-#57%production-ignite%][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=6, minorTopVer=0], force=false, evt=NODE_FAILED, node=703bd274-96ea-4567-8c2c-215efd859a2e]


*** "backup" node logs ***
(remember the two-hour offset in these logs: 02:06:54 is in fact 04:06:54)


[02:06:54,048][WARNING][tcp-disco-msg-worker-[c79804fa coordinator.ip.address:47500]-#2%production-ignite%-#55%production-ignite%][TcpDiscoverySpi] Timed out waiting for message delivery receipt (most probably, the reason is in long GC pauses on remote node; consider tuning GC and increasing 'ackTimeout' configuration property). Will retry to send message with increased timeout [currentTimeout=9494, rmtAddr=vps611671.ovh.net/coordinator.ip.address:47500, rmtPort=47500]
[02:06:54,125][WARNING][tcp-disco-msg-worker-[c79804fa coordinator.ip.address:47500]-#2%production-ignite%-#55%production-ignite%][TcpDiscoverySpi] Failed to send message to next node [msg=TcpDiscoveryConnectionCheckMessage [super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=4654f4c1b71-703bd274-96ea-4567-8c2c-215efd859a2e, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=TcpDiscoveryNode [id=c79804fa-52e8-457f-88c8-eeb78b3f34fa, consistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address], sockAddrs=HashSet [vps611671.ovh.net/coordinator.ip.address:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1628259538158, loc=false, ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false], errMsg=Failed to send message to next node [msg=TcpDiscoveryConnectionCheckMessage [super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=4654f4c1b71-703bd274-96ea-4567-8c2c-215efd859a2e, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=ClusterNode [id=c79804fa-52e8-457f-88c8-eeb78b3f34fa, order=2, addr=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address], daemon=false]]]
[02:06:54,130][WARNING][tcp-disco-msg-worker-[c79804fa coordinator.ip.address:47500]-#2%production-ignite%-#55%production-ignite%][TcpDiscoverySpi] Local node has detected failed nodes and started cluster-wide procedure. To speed up failure detection please see 'Failure Detection' section under javadoc for 'TcpDiscoverySpi'
[02:06:54,135][WARNING][disco-event-worker-#59%production-ignite%][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=c79804fa-52e8-457f-88c8-eeb78b3f34fa, consistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address], sockAddrs=HashSet [vps611671.ovh.net/coordinator.ip.address:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1628259538158, loc=false, ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false]
[02:06:54,137][INFO][disco-event-worker-#59%production-ignite%][GridDiscoveryManager] Topology snapshot [ver=6, locNode=703bd274, servers=1, clients=1, state=ACTIVE, CPUs=4, offheap=3.5GB, heap=3.4GB]
[02:06:54,138][INFO][disco-event-worker-#59%production-ignite%][GridDiscoveryManager] Coordinator changed [prev=TcpDiscoveryNode [id=c79804fa-52e8-457f-88c8-eeb78b3f34fa, consistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address], sockAddrs=HashSet [vps611671.ovh.net/coordinator.ip.address:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1628259538158, loc=false, ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false], cur=TcpDiscoveryNode [id=703bd274-96ea-4567-8c2c-215efd859a2e, consistentId=c9fe91c7-117c-449e-b7bc-829c1bc8ae46, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, backup.ip.address], sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, 205.ip-54-38-190.eu/backup.ip.address:47500], discPort=47500, order=5, intOrder=4, lastExchangeTime=1629252404041, loc=true, ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false]]
[02:06:54,138][INFO][disco-event-worker-#59%production-ignite%][GridDiscoveryManager] ^-- Baseline [id=11, size=2, online=1, offline=1]
[02:06:54,149][INFO][disco-event-worker-#59%production-ignite%][MvccProcessorImpl] Assigned mvcc coordinator [crd=MvccCoordinator [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], nodeId=703bd274-96ea-4567-8c2c-215efd859a2e, ver=1628257415621, local=true, initialized=false]]
[02:06:54,167][INFO][sys-#122238%production-ignite%][ExchangeLatchManager] Become new coordinator 703bd274-96ea-4567-8c2c-215efd859a2e
[02:06:54,168][INFO][exchange-worker-#61%production-ignite%][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], crd=true, evt=NODE_FAILED, evtNode=c79804fa-52e8-457f-88c8-eeb78b3f34fa, customEvt=null, allowMerge=false, exchangeFreeSwitch=true]
[02:06:54,170][INFO][exchange-worker-#61%production-ignite%][GridAffinityAssignmentCache] Local node affinity assignment distribution is not ideal [cache=CliauthCache, expectedPrimary=1024.00, actualPrimary=1024, expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[02:06:54,171][INFO][sys-#122239%production-ignite%][GridAffinityAssignmentCache] Local node affinity assignment distribution is not ideal [cache=MessageCache, expectedPrimary=1024.00, actualPrimary=1024, expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[02:06:54,172][INFO][exchange-worker-#61%production-ignite%][GridAffinityAssignmentCache] Local node affinity assignment distribution is not ideal [cache=MessageLifeCache, expectedPrimary=1024.00, actualPrimary=1024, expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[02:06:54,175][INFO][exchange-worker-#61%production-ignite%][GridAffinityAssignmentCache] Local node affinity assignment distribution is not ideal [cache=UserEventCache, expectedPrimary=1024.00, actualPrimary=1024, expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[02:06:54,177][INFO][sys-#122240%production-ignite%][GridAffinityAssignmentCache] Local node affinity assignment distribution is not ideal [cache=UserCache, expectedPrimary=1024.00, actualPrimary=1024, expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[02:06:54,185][INFO][sys-#122239%production-ignite%][GridAffinityAssignmentCache] Local node affinity assignment distribution is not ideal [cache=ParameterCache, expectedPrimary=1024.00, actualPrimary=1024, expectedBackups=1024.00, actualBackups=0, warningThreshold=50.00%]
[02:06:54,195][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], waitTime=0ms, futInfo=NA, mode=DISTRIBUTED]
[02:06:54,196][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] Finished waiting for partitions release latch: ServerLatch [permits=0, pendingAcks=HashSet [], super=CompletableLatch [id=CompletableLatchUid [id=exchange-free-replicated, topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0]]]]
[02:06:54,197][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] Finished waiting for partition release future [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], waitTime=0ms, futInfo=NA, mode=DISTRIBUTED]
[02:06:54,197][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] Finished waiting for partitions release latch: ServerLatch [permits=0, pendingAcks=HashSet [], super=CompletableLatch [id=CompletableLatchUid [id=exchange-free-partitioned, topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0]]]]
[02:06:54,229][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], err=null, rebalanced=true, wasRebalanced=true]
[02:06:54,242][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=703bd274-96ea-4567-8c2c-215efd859a2e, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=c79804fa-52e8-457f-88c8-eeb78b3f34fa, consistentId=1defbc7d-427c-4264-b64c-23c8dd6f9277, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address], sockAddrs=HashSet [vps611671.ovh.net/coordinator.ip.address:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1628259538158, loc=false, ver=2.10.0#20210310-sha1:bc24f6ba, isClient=false], rebalanced=true, done=true, newCrdFut=null], topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0]]
[02:06:54,242][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] Exchange timings [startVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], stage="Waiting in exchange queue" (0 ms), stage="Exchange parameters initialization" (0 ms), stage="Determine exchange type" (18 ms), stage="Preloading notification" (0 ms), stage="Wait partitions release [latch=exchange-free-replicated]" (8 ms), stage="Wait partitions release latch [latch=exchange-free-replicated]" (1 ms), stage="Wait partitions release [latch=exchange-free-partitioned]" (0 ms), stage="Wait partitions release latch [latch=exchange-free-partitioned]" (0 ms), stage="After states restored callback" (0 ms), stage="WAL history reservation" (3 ms), stage="Finalize update counters" (10 ms), stage="Detect lost partitions" (25 ms), stage="Exchange done" (6 ms), stage="Total time" (71 ms)]
[02:06:54,242][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] Exchange longest local stages [startVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], stage="Affinity initialization (exchange-free switch on fully-rebalanced topology) [grp=ParameterCache]" (13 ms) (parent=Determine exchange type), stage="Affinity initialization (exchange-free switch on fully-rebalanced topology) [grp=UserCache]" (10 ms) (parent=Determine exchange type), stage="Affinity initialization (exchange-free switch on fully-rebalanced topology) [grp=UserEventCache]" (9 ms) (parent=Determine exchange type)]
[02:06:54,243][INFO][exchange-worker-#61%production-ignite%][time] Finished exchange init [topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0], crd=true]
[02:06:54,253][INFO][exchange-worker-#61%production-ignite%][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=6, minorTopVer=0], force=false, evt=NODE_FAILED, node=c79804fa-52e8-457f-88c8-eeb78b3f34fa]
[02:07:14,182][WARNING][tcp-disco-msg-worker-[c79804fa coordinator.ip.address:47500]-#2%production-ignite%-#55%production-ignite%][TcpDiscoverySpi] Failing client node due to not receiving metrics updates from client node within 'IgniteConfiguration.clientFailureDetectionTimeout' (consider increasing configuration property) [timeout=30000, node=TcpDiscoveryNode [id=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, consistentId=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address], sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, vps611671.ovh.net/coordinator.ip.address:0], discPort=0, order=3, intOrder=3, lastExchangeTime=1628259538158, loc=false, ver=2.10.0#20210310-sha1:bc24f6ba, isClient=true]]
[02:07:14,183][WARNING][disco-event-worker-#59%production-ignite%][GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, consistentId=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address], sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, vps611671.ovh.net/coordinator.ip.address:0], discPort=0, order=3, intOrder=3, lastExchangeTime=1628259538158, loc=false, ver=2.10.0#20210310-sha1:bc24f6ba, isClient=true]
[02:07:14,183][INFO][disco-event-worker-#59%production-ignite%][GridDiscoveryManager] Topology snapshot [ver=7, locNode=703bd274, servers=1, clients=0, state=ACTIVE, CPUs=2, offheap=3.5GB, heap=1.7GB]
[02:07:14,183][INFO][disco-event-worker-#59%production-ignite%][GridDiscoveryManager] ^-- Baseline [id=11, size=2, online=1, offline=1]
[02:07:14,189][INFO][exchange-worker-#61%production-ignite%][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=7, minorTopVer=0], crd=true, evt=NODE_FAILED, evtNode=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, customEvt=null, allowMerge=true, exchangeFreeSwitch=false]
[02:07:14,190][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=7, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=7, minorTopVer=0], err=null, rebalanced=true, wasRebalanced=true]
[02:07:14,192][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=703bd274-96ea-4567-8c2c-215efd859a2e, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=7, minorTopVer=0], evt=NODE_FAILED, evtNode=TcpDiscoveryNode [id=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, consistentId=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, coordinator.ip.address], sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, vps611671.ovh.net/coordinator.ip.address:0], discPort=0, order=3, intOrder=3, lastExchangeTime=1628259538158, loc=false, ver=2.10.0#20210310-sha1:bc24f6ba, isClient=true], rebalanced=true, done=true, newCrdFut=null], topVer=AffinityTopologyVersion [topVer=7, minorTopVer=0]]
[02:07:14,192][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] Exchange timings [startVer=AffinityTopologyVersion [topVer=7, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=7, minorTopVer=0], stage="Waiting in exchange queue" (0 ms), stage="Exchange parameters initialization" (0 ms), stage="Determine exchange type" (0 ms), stage="Exchange done" (2 ms), stage="Total time" (2 ms)]
[02:07:14,192][INFO][exchange-worker-#61%production-ignite%][GridDhtPartitionsExchangeFuture] Exchange longest local stages [startVer=AffinityTopologyVersion [topVer=7, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=7, minorTopVer=0]]
[02:07:14,192][INFO][exchange-worker-#61%production-ignite%][time] Finished exchange init [topVer=AffinityTopologyVersion [topVer=7, minorTopVer=0], crd=true]
[02:07:14,195][INFO][exchange-worker-#61%production-ignite%][GridCachePartitionExchangeManager] Skipping rebalancing (no affinity changes) [top=AffinityTopologyVersion [topVer=7, minorTopVer=0], evt=NODE_FAILED, evtNode=40d8cfb1-a6ff-4ffe-8c46-9816e8a2e6ad, client=false]


