We encountered a situation after a node unexpectedly went down and came back up. After it came back, none of our transactions were going through (due to rollbacks), and we started getting a lot of exceptions in the logs. (I've added the exceptions at the bottom of this message.) We were getting "Failed to execute the cache operation (all partition owners have left the grid, partition data has been lost)", so we tried to reset the partitions (since these are persistent caches), and the commands succeeded, but we kept seeing errors.
We checked the cluster state, and it looks like we have two nodes that came up with different IDs. Cluster state: active Current topology version: 1170 Baseline auto adjustment disabled: softTimeout=300000 Current topology version: 1170 (Coordinator: ConsistentId=1455b414-5389-454a-9609-8dd1d15a2430, Order=1) Baseline nodes: ConsistentId=1a0aa611-58b7-479a-b1e6-735e31f87ed9, State=ONLINE, Order=1169 ConsistentId=92bc8407-30f1-433d-9c32-5eeb759c73be, State=OFFLINE ConsistentId=b5875ab9-7923-46c9-b3f3-1550455a24e5, State=OFFLINE -------------------------------------------------------------------------------- Number of baseline nodes: 3 Other nodes: ConsistentId=1455b414-5389-454a-9609-8dd1d15a2430, Order=1 ConsistentId=5e8f3b03-aa20-45aa-892a-37e988e3741f, Order=2 Could this be a split-brain scenario? Here's the more complete logs: javax.cache.CacheException: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute the cache operation (all partition owners have left the grid, partition data has been lost) [cacheName=propensity-customer, partition=430, key=com.company.PropensityKey [idHash=362342248, hash=42458921, customerId=142045188, variant=MODEL_A]] at org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1270) at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.cacheException(IgniteCacheProxyImpl.java:2083) at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.get(IgniteCacheProxyImpl.java:1110) at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.get(GatewayProtectedCacheProxy.java:676) at org.apache.ignite.internal.processors.platform.client.cache.ClientCacheGetRequest.process(ClientCacheGetRequest.java:41) at org.apache.ignite.internal.processors.platform.client.ClientRequestHandler.handle(ClientRequestHandler.java:99) at org.apache.ignite.internal.processors.odbc.ClientListenerNioListener.onMessage(ClientListenerNioListener.java:202) at org.apache.ignite.internal.processors.odbc.ClientListenerNioListener.onMessage(ClientListenerNioListener.java:56) at org.apache.ignite.internal.util.nio.GridNioFilterChain$TailFilter.onMessageReceived(GridNioFilterChain.java:279) at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:109) at org.apache.ignite.internal.util.nio.GridNioAsyncNotifyFilter$3.body(GridNioAsyncNotifyFilter.java:97) at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) at org.apache.ignite.internal.util.worker.GridWorkerPool$1.run(GridWorkerPool.java:70) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute the cache operation (all partition owners have left the grid, partition data has been lost) [cacheName=propensity-customer, partition=430, key=com.company.PropensityKey [idHash=362342248, hash=42458921, customerId=142045188, variant=MODEL_A]] at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTopologyFutureAdapter.validateKey(GridDhtTopologyFutureAdapter.java:209) at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTopologyFutureAdapter.validateCache(GridDhtTopologyFutureAdapter.java:128) at org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.validate(GridPartitionedSingleGetFuture.java:859) at org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.map(GridPartitionedSingleGetFuture.java:277) at org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.init(GridPartitionedSingleGetFuture.java:244) at org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedCache.getAsync(GridDhtColocatedCache.java:297) at org.apache.ignite.internal.processors.cache.GridCacheAdapter.get(GridCacheAdapter.java:4844) at org.apache.ignite.internal.processors.cache.GridCacheAdapter.repairableGet(GridCacheAdapter.java:4810) at org.apache.ignite.internal.processors.cache.GridCacheAdapter.get(GridCacheAdapter.java:1469) at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.get(IgniteCacheProxyImpl.java:1107) ... 13 more [21:48:07,376][SEVERE][client-connector-#212][ClientListenerNioListener] Failed to process client request [req=o.a.i.i.processors.platform.client.cache.ClientCacheGetRequest@78739a4f ] javax.cache.CacheException: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute the cache operation (all partition owners have left the grid, partition data has been lost) [cacheName=propensity-customer, partition=343, key=com.company.PropensityKey [idHash=1459588566, hash=-1494388806, customerId=156017972, variant=MODEL_A]] at org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1270) at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.cacheException(IgniteCacheProxyImpl.java:2083) at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.get(IgniteCacheProxyImpl.java:1110) at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.get(GatewayProtectedCacheProxy.java:676) at org.apache.ignite.internal.processors.platform.client.cache.ClientCacheGetRequest.process(ClientCacheGetRequest.java:41) at org.apache.ignite.internal.processors.platform.client.ClientRequestHandler.handle(ClientRequestHandler.java:99) at org.apache.ignite.internal.processors.odbc.ClientListenerNioListener.onMessage(ClientListenerNioListener.java:202) at org.apache.ignite.internal.processors.odbc.ClientListenerNioListener.onMessage(ClientListenerNioListener.java:56) at org.apache.ignite.internal.util.nio.GridNioFilterChain$TailFilter.onMessageReceived(GridNioFilterChain.java:279) at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:109) at org.apache.ignite.internal.util.nio.GridNioAsyncNotifyFilter$3.body(GridNioAsyncNotifyFilter.java:97) at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) at org.apache.ignite.internal.util.worker.GridWorkerPool$1.run(GridWorkerPool.java:70) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute the cache operation (all partition owners have left the grid, partition data has been lost) [cacheName=propensity-customer, partition=343, key=com.company.PropensityKey [idHash=1459588566, hash=-1494388806, customerId=156017972, variant=MODEL_A]] at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTopologyFutureAdapter.validateKey(GridDhtTopologyFutureAdapter.java:209) at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTopologyFutureAdapter.validateCache(GridDhtTopologyFutureAdapter.java:128) at org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.validate(GridPartitionedSingleGetFuture.java:859) at org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.map(GridPartitionedSingleGetFuture.java:277) at org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.init(GridPartitionedSingleGetFuture.java:244) at org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedCache.getAsync(GridDhtColocatedCache.java:297) at org.apache.ignite.internal.processors.cache.GridCacheAdapter.get(GridCacheAdapter.java:4844) at org.apache.ignite.internal.processors.cache.GridCacheAdapter.repairableGet(GridCacheAdapter.java:4810) at org.apache.ignite.internal.processors.cache.GridCacheAdapter.get(GridCacheAdapter.java:1469) at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.get(IgniteCacheProxyImpl.java:1107) ... 13 more [21:48:07,388][SEVERE][client-connector-#245][ClientListenerNioListener] Failed to process client request [req=o.a.i.i.processors.platform.client.cache.ClientCacheGetRequest@32d3379d ] javax.cache.CacheException: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute the cache operation (all partition owners have left the grid, partition data has been lost) [cacheName=propensity-customer, partition=343, key=com.company.PropensityKey [idHash=1589948377, hash=-1494388806, customerId=156017972, variant=MODEL_A]] at org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1270) at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.cacheException(IgniteCacheProxyImpl.java:2083) at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.get(IgniteCacheProxyImpl.java:1110) at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.get(GatewayProtectedCacheProxy.java:676) at org.apache.ignite.internal.processors.platform.client.cache.ClientCacheGetRequest.process(ClientCacheGetRequest.java:41) at org.apache.ignite.internal.processors.platform.client.ClientRequestHandler.handle(ClientRequestHandler.java:99) at org.apache.ignite.internal.processors.odbc.ClientListenerNioListener.onMessage(ClientListenerNioListener.java:202) at org.apache.ignite.internal.processors.odbc.ClientListenerNioListener.onMessage(ClientListenerNioListener.java:56) at org.apache.ignite.internal.util.nio.GridNioFilterChain$TailFilter.onMessageReceived(GridNioFilterChain.java:279) at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedMessageReceived(GridNioFilterAdapter.java:109) at org.apache.ignite.internal.util.nio.GridNioAsyncNotifyFilter$3.body(GridNioAsyncNotifyFilter.java:97) at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) at org.apache.ignite.internal.util.worker.GridWorkerPool$1.run(GridWorkerPool.java:70) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute the cache operation (all partition owners have left the grid, partition data has been lost) [cacheName=propensity-customer, partition=343, key=com.company.PropensityKey [idHash=1589948377, hash=-1494388806, customerId=156017972, variant=MODEL_A]] at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTopologyFutureAdapter.validateKey(GridDhtTopologyFutureAdapter.java:209) at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTopologyFutureAdapter.validateCache(GridDhtTopologyFutureAdapter.java:128) at org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.validate(GridPartitionedSingleGetFuture.java:859) at org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.map(GridPartitionedSingleGetFuture.java:277) at org.apache.ignite.internal.processors.cache.distributed.dht.GridPartitionedSingleGetFuture.init(GridPartitionedSingleGetFuture.java:244) at org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedCache.getAsync(GridDhtColocatedCache.java:297) at org.apache.ignite.internal.processors.cache.GridCacheAdapter.get(GridCacheAdapter.java:4844) at org.apache.ignite.internal.processors.cache.GridCacheAdapter.repairableGet(GridCacheAdapter.java:4810) at org.apache.ignite.internal.processors.cache.GridCacheAdapter.get(GridCacheAdapter.java:1469) at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.get(IgniteCacheProxyImpl.java:1107) ... 13 more [21:48:07,410][SEVERE][client-connector-#226][ClientListenerNioListener] Failed to process client request [req=o.a.i.i.processors.platform.client.cache.ClientCacheGetRequest@254ae7a] javax.cache.CacheException: class org.apache.ignite.internal.processors.cache.CacheInvalidStateException: Failed to execute the cache operation (all partition owners have left the grid, partition data has been lost) [cacheName=propensity-customer, partition=752, key=com.company.PropensityKey [idHash=224478459, hash=-1842810664, customerId=165875208, variant=MODEL_A]] at org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1270) at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.cacheException(IgniteCacheProxyImpl.java:2083) at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.get(IgniteCacheProxyImpl.java:1110) at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.get(GatewayProtectedCacheProxy.java:676) at org.apache.ignite.internal.processors.platform.client.cache.ClientCacheGetRequest.process(ClientCacheGetRequest.java:41) at org.apache.ignite.internal.processors.platform.client.ClientRequestHandler.handle(ClientRequestHandler.java:99) at org.apache.ignite.internal.processors.odbc.ClientListenerNioListener.onMessage(ClientListenerNioListener.java:202) at org.apache.ignite.internal.processors.odbc.ClientListenerNioListener.onMessage(ClientListenerNioListener.java:56) at org.apache.ignite.internal.util.nio.GridNioFilterChain$TailFilter.onMessageReceived(GridNioFilterChain.java:279) Devin G. Bost