Hello!

It looks like your nodes has re-joined with different consistent IDs/data
dirs and thus some of your data was not accessible.

Please make sure that your nodes preserve their consistent IDs/data dirs
over restart.

It also looks like your one failing node has formed this topology as
opposed to two surviving ones, which seem to have restarted and rejoined to
it. What's the specific ordering of events?

Regards,
-- 
Ilya Kasnacheev


пт, 11 июн. 2021 г. в 04:30, Devin Bost <devin.b...@gmail.com>:

> We encountered a situation after a node unexpectedly went down and came
> back up.
> After it came back, none of our transactions were going through (due to
> rollbacks), and we started getting a lot of exceptions in the logs. (I've
> added the exceptions at the bottom of this message.)
> We were getting "Failed to execute the cache operation (all partition
> owners have left the grid, partition data has been lost)", so we tried to
> reset the partitions (since these are persistent caches), and the commands
> succeeded, but we kept seeing errors.
>
> We checked the cluster state, and it looks like we have two nodes that
> came up with different IDs.
>
> Cluster state: active
> Current topology version: 1170
> Baseline auto adjustment disabled: softTimeout=300000
> Current topology version: 1170 (Coordinator:
> ConsistentId=1455b414-5389-454a-9609-8dd1d15a2430, Order=1)
> Baseline nodes:
>     ConsistentId=1a0aa611-58b7-479a-b1e6-735e31f87ed9, State=ONLINE,
> Order=1169
>     ConsistentId=92bc8407-30f1-433d-9c32-5eeb759c73be, State=OFFLINE
>     ConsistentId=b5875ab9-7923-46c9-b3f3-1550455a24e5, State=OFFLINE
>
> --------------------------------------------------------------------------------
> Number of baseline nodes: 3
> Other nodes:
>     ConsistentId=1455b414-5389-454a-9609-8dd1d15a2430, Order=1
>     ConsistentId=5e8f3b03-aa20-45aa-892a-37e988e3741f, Order=2
>
>
> Could this be a split-brain scenario?
>
> Here's the more complete logs:
>
>     javax.cache.CacheException: class
> org.apache.ignite.internal.processors.cache.CacheInvalidStateException:
> Failed to execute the cache operation (all partition owners have left the
> grid, partition data has been lost) [cacheName=propensity-customer,
> partition=430, key=com.company.PropensityKey [idHash=362342248,
> hash=42458921, customerId=142045188, variant=MODEL_A]]
>         at
> org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1270)
>         at
> org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.cacheException(IgniteCacheProxyImpl.java:2083)
>         at
> org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.get(IgniteCacheProxyImpl.java:1110)
>         at
> org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.get(GatewayProtectedCacheProxy.java:676)
>         at
> org.apache.ignite.internal.processors.platform.client.cache.ClientCacheGetRequest.process(ClientCacheGetRequest.java:41)
>         at
> org.apache.ignite.internal.processors.platform.client.ClientRequestHandler.handle(ClientRequestHandler.java:99)
>         at
> org.apache.ignite.internal.processors.odbc.ClientListenerNioListener.onMessage(ClientListenerNioListener.java:202)
>         at
> org.apache.ignite.internal.processors.odbc.ClientListenerNioListener.onMessage(ClientListenerNioListener.java:56)
>         at
> org.apache.ignite.internal.util.nio.GridNioFilterChain$TailFilter.onMessageReceived(GridNioFilterChain.java:279)
>         at
> org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:109)
>         at
> org.apache.ignite.internal.util.nio.GridNioAsyncNotifyFilter$3.body(GridNioAsyncNotifyFilter.java:97)
>         at
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
>         at
> org.apache.ignite.internal.util.worker.GridWorkerPool$1.run(GridWorkerPool.java:70)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>     Caused by: class
> org.apache.ignite.internal.processors.cache.CacheInvalidStateException:
> Failed to execute the cache operation (all partition owners have left the
> grid, partition data has been lost) [cacheName=propensity-customer,
> partition=430, key=com.company.PropensityKey [idHash=362342248,
> hash=42458921, customerId=142045188, variant=MODEL_A]]
>         at
> org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTopologyFutureAdapter.validateKey(GridDhtTopologyFutureAdapter.java:209)
>         at
> org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTopologyFutureAdapter.validateCache(GridDhtTopologyFutureAdapter.java:128)
>         at
> org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.validate(GridPartitionedSingleGetFuture.java:859)
>         at
> org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.map(GridPartitionedSingleGetFuture.java:277)
>         at
> org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.init(GridPartitionedSingleGetFuture.java:244)
>         at
> org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedCache.getAsync(GridDhtColocatedCache.java:297)
>         at
> org.apache.ignite.internal.processors.cache.GridCacheAdapter.get(GridCacheAdapter.java:4844)
>         at
> org.apache.ignite.internal.processors.cache.GridCacheAdapter.repairableGet(GridCacheAdapter.java:4810)
>         at
> org.apache.ignite.internal.processors.cache.GridCacheAdapter.get(GridCacheAdapter.java:1469)
>         at
> org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.get(IgniteCacheProxyImpl.java:1107)
>         ... 13 more
>
> [21:48:07,376][SEVERE][client-connector-#212][ClientListenerNioListener]
> Failed to process client request
> [req=o.a.i.i.processors.platform.client.cache.ClientCacheGetRequest@78739a4f
> ]
>     javax.cache.CacheException: class
> org.apache.ignite.internal.processors.cache.CacheInvalidStateException:
> Failed to execute the cache operation (all partition owners have left the
> grid, partition data has been lost) [cacheName=propensity-customer,
> partition=343, key=com.company.PropensityKey [idHash=1459588566,
> hash=-1494388806, customerId=156017972, variant=MODEL_A]]
>         at
> org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1270)
>         at
> org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.cacheException(IgniteCacheProxyImpl.java:2083)
>         at
> org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.get(IgniteCacheProxyImpl.java:1110)
>         at
> org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.get(GatewayProtectedCacheProxy.java:676)
>         at
> org.apache.ignite.internal.processors.platform.client.cache.ClientCacheGetRequest.process(ClientCacheGetRequest.java:41)
>         at
> org.apache.ignite.internal.processors.platform.client.ClientRequestHandler.handle(ClientRequestHandler.java:99)
>         at
> org.apache.ignite.internal.processors.odbc.ClientListenerNioListener.onMessage(ClientListenerNioListener.java:202)
>         at
> org.apache.ignite.internal.processors.odbc.ClientListenerNioListener.onMessage(ClientListenerNioListener.java:56)
>         at
> org.apache.ignite.internal.util.nio.GridNioFilterChain$TailFilter.onMessageReceived(GridNioFilterChain.java:279)
>         at
> org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:109)
>         at
> org.apache.ignite.internal.util.nio.GridNioAsyncNotifyFilter$3.body(GridNioAsyncNotifyFilter.java:97)
>         at
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
>         at
> org.apache.ignite.internal.util.worker.GridWorkerPool$1.run(GridWorkerPool.java:70)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>     Caused by: class
> org.apache.ignite.internal.processors.cache.CacheInvalidStateException:
> Failed to execute the cache operation (all partition owners have left the
> grid, partition data has been lost) [cacheName=propensity-customer,
> partition=343, key=com.company.PropensityKey [idHash=1459588566,
> hash=-1494388806, customerId=156017972, variant=MODEL_A]]
>         at
> org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTopologyFutureAdapter.validateKey(GridDhtTopologyFutureAdapter.java:209)
>         at
> org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTopologyFutureAdapter.validateCache(GridDhtTopologyFutureAdapter.java:128)
>         at
> org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.validate(GridPartitionedSingleGetFuture.java:859)
>         at
> org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.map(GridPartitionedSingleGetFuture.java:277)
>         at
> org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.init(GridPartitionedSingleGetFuture.java:244)
>         at
> org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedCache.getAsync(GridDhtColocatedCache.java:297)
>         at
> org.apache.ignite.internal.processors.cache.GridCacheAdapter.get(GridCacheAdapter.java:4844)
>         at
> org.apache.ignite.internal.processors.cache.GridCacheAdapter.repairableGet(GridCacheAdapter.java:4810)
>         at
> org.apache.ignite.internal.processors.cache.GridCacheAdapter.get(GridCacheAdapter.java:1469)
>         at
> org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.get(IgniteCacheProxyImpl.java:1107)
>         ... 13 more
>
> [21:48:07,388][SEVERE][client-connector-#245][ClientListenerNioListener]
> Failed to process client request
> [req=o.a.i.i.processors.platform.client.cache.ClientCacheGetRequest@32d3379d
> ]
>     javax.cache.CacheException: class
> org.apache.ignite.internal.processors.cache.CacheInvalidStateException:
> Failed to execute the cache operation (all partition owners have left the
> grid, partition data has been lost) [cacheName=propensity-customer,
> partition=343, key=com.company.PropensityKey [idHash=1589948377,
> hash=-1494388806, customerId=156017972, variant=MODEL_A]]
>         at
> org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1270)
>         at
> org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.cacheException(IgniteCacheProxyImpl.java:2083)
>         at
> org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.get(IgniteCacheProxyImpl.java:1110)
>         at
> org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.get(GatewayProtectedCacheProxy.java:676)
>         at
> org.apache.ignite.internal.processors.platform.client.cache.ClientCacheGetRequest.process(ClientCacheGetRequest.java:41)
>         at
> org.apache.ignite.internal.processors.platform.client.ClientRequestHandler.handle(ClientRequestHandler.java:99)
>         at
> org.apache.ignite.internal.processors.odbc.ClientListenerNioListener.onMessage(ClientListenerNioListener.java:202)
>         at
> org.apache.ignite.internal.processors.odbc.ClientListenerNioListener.onMessage(ClientListenerNioListener.java:56)
>         at
> org.apache.ignite.internal.util.nio.GridNioFilterChain$TailFilter.onMessageReceived(GridNioFilterChain.java:279)
>         at
> org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:109)
>         at
> org.apache.ignite.internal.util.nio.GridNioAsyncNotifyFilter$3.body(GridNioAsyncNotifyFilter.java:97)
>         at
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
>         at
> org.apache.ignite.internal.util.worker.GridWorkerPool$1.run(GridWorkerPool.java:70)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>     Caused by: class
> org.apache.ignite.internal.processors.cache.CacheInvalidStateException:
> Failed to execute the cache operation (all partition owners have left the
> grid, partition data has been lost) [cacheName=propensity-customer,
> partition=343, key=com.company.PropensityKey [idHash=1589948377,
> hash=-1494388806, customerId=156017972, variant=MODEL_A]]
>         at
> org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTopologyFutureAdapter.validateKey(GridDhtTopologyFutureAdapter.java:209)
>         at
> org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTopologyFutureAdapter.validateCache(GridDhtTopologyFutureAdapter.java:128)
>         at
> org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.validate(GridPartitionedSingleGetFuture.java:859)
>         at
> org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.map(GridPartitionedSingleGetFuture.java:277)
>         at
> org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.init(GridPartitionedSingleGetFuture.java:244)
>         at
> org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedCache.getAsync(GridDhtColocatedCache.java:297)
>         at
> org.apache.ignite.internal.processors.cache.GridCacheAdapter.get(GridCacheAdapter.java:4844)
>         at
> org.apache.ignite.internal.processors.cache.GridCacheAdapter.repairableGet(GridCacheAdapter.java:4810)
>         at
> org.apache.ignite.internal.processors.cache.GridCacheAdapter.get(GridCacheAdapter.java:1469)
>         at
> org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.get(IgniteCacheProxyImpl.java:1107)
>         ... 13 more
>
> [21:48:07,410][SEVERE][client-connector-#226][ClientListenerNioListener]
> Failed to process client request
> [req=o.a.i.i.processors.platform.client.cache.ClientCacheGetRequest@254ae7a
> ]
>     javax.cache.CacheException: class
> org.apache.ignite.internal.processors.cache.CacheInvalidStateException:
> Failed to execute the cache operation (all partition owners have left the
> grid, partition data has been lost) [cacheName=propensity-customer,
> partition=752, key=com.company.PropensityKey [idHash=224478459,
> hash=-1842810664, customerId=165875208, variant=MODEL_A]]
>         at
> org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1270)
>         at
> org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.cacheException(IgniteCacheProxyImpl.java:2083)
>         at
> org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.get(IgniteCacheProxyImpl.java:1110)
>         at
> org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.get(GatewayProtectedCacheProxy.java:676)
>         at
> org.apache.ignite.internal.processors.platform.client.cache.ClientCacheGetRequest.process(ClientCacheGetRequest.java:41)
>         at
> org.apache.ignite.internal.processors.platform.client.ClientRequestHandler.handle(ClientRequestHandler.java:99)
>         at
> org.apache.ignite.internal.processors.odbc.ClientListenerNioListener.onMessage(ClientListenerNioListener.java:202)
>         at
> org.apache.ignite.internal.processors.odbc.ClientListenerNioListener.onMessage(ClientListenerNioListener.java:56)
>         at
> org.apache.ignite.internal.util.nio.GridNioFilterChain$TailFilter.onMessageReceived(GridNioFilterChain.java:279)
>
>
> Devin G. Bost
>

Reply via email to