Hi Yakov,

1. Yes.
2. If you mean that nodeMap is accessed in the onNodeRemoved(UUID nodeID) method of the GridCacheSemaphoreImpl class, it shouldn't be a problem, but it can easily be changed not to do so.
3. The failing tests are:
   org.apache.ignite.internal.processors.cache.datastructures.GridCacheAbstractDataStructuresFailoverSelfTest#testSemaphoreConstantTopologyChangeFailoverSafe()
   org.apache.ignite.internal.processors.cache.datastructures.GridCacheAbstractDataStructuresFailoverSelfTest#testSemaphoreConstantMultipleTopologyChangeFailoverSafe()

I think the problem is with the atomicity of the simulated grid failure: once stopGrid() is called for a node, other threads on that same node start throwing interrupted exceptions, which in turn are not handled properly in GridCacheAbstractDataStructuresFailoverSelfTest. Those exceptions shouldn't be dealt with inside GridCacheSemaphoreImpl itself. In a real-world node failure scenario, all those threads would fail at the same time (none of them would influence the rest of the grid anymore).

I think fixing the issues Denis is working on (IGNITE-801 and IGNITE-803) could fix this.

Am I right? Does it make sense?

Best regards,
Vladisav

On Tue, Nov 17, 2015 at 5:40 PM, Yakov Zhdanov <yzhda...@apache.org> wrote:

> Vladislav,
>
> I started to review the latest changes and have couple of questions:
>
> 1. latest changes are here - https://github.com/apache/ignite/pull/120? Is
> that correct?
> 2.
> org.apache.ignite.internal.processors.datastructures.GridCacheSemaphoreImpl.Sync#nodeMap
> is accessed in both sync and unsync context. Are you sure this is fine.
> 3. As far as failing test - can you please isolate it into separate junit
> or point out existing one?
>
> --Yakov
>
> 2015-11-11 12:33 GMT+03:00 Vladisav Jelisavcic <vladis...@gmail.com>:
>
> > Yakov,
> >
> > sorry for running a bit late.
> >
> > > Vladislav, do you have any updates for
> > > https://issues.apache.org/jira/browse/IGNITE-638? Or any questions?
> > >
> > > --Yakov
> >
> > I have problems with some fail-over scenarios;
> > It seems that if the two nodes are in the middle of acquiring or releasing
> > the semaphore,
> > and one of them fails, all nodes get:
> >
> > [09:36:38,509][ERROR][ignite-#13%pub-null%][GridCacheSemaphoreImpl]
> > <ignite-atomics-sys-cache> Failed to compare and set:
> > o.a.i.i.processors.datastructures.GridCacheSemaphoreImpl$Sync$1@5528b728
> > class org.apache.ignite.internal.cluster.ClusterTopologyCheckedException:
> > Failed to acquire lock for keys (primary node left grid, retry transaction
> > if possible) [keys=[UserKeyCacheObjectImpl [val=GridCacheInternalKeyImpl
> > [name=ac83b8cb-3052-49a6-9301-81b20b0ecf3a], hasValBytes=true]],
> > node=c321fcc4-5db5-4b03-9811-6a5587f2c253]
> > ...
> > Caused by: class
> > org.apache.ignite.internal.cluster.ClusterTopologyCheckedException: Failed
> > to acquire lock for keys (primary node left grid, retry transaction if
> > possible) [keys=[UserKeyCacheObjectImpl [val=GridCacheInternalKeyImpl
> > [name=ac83b8cb-3052-49a6-9301-81b20b0ecf3a], hasValBytes=true]],
> > node=c321fcc4-5db5-4b03-9811-6a5587f2c253]
> > at
> > org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.newTopologyException(GridDhtColocatedLockFuture.java:1199)
> > ... 10 more
> >
> > I'm still trying to find out how to exactly reproduce this behavior,
> > I'll send you more details once I try few more things.
> >
> > I am still using partitioned cache, does it make sense to use replicated
> > cache instead?
> >
> > Other than that, I'm done with everything else.
> >
> > Thanks,
> > Vladisav
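
For reference, the "retry transaction if possible" hint in the quoted log could be handled on the caller side roughly as sketched below. This is only an illustration in plain Java under stated assumptions: TopologyChangedException and retryOnTopologyChange are hypothetical names standing in for the topology failure and a retry helper; they are not Ignite classes or APIs, and this is not how GridCacheSemaphoreImpl behaves today.

```java
import java.util.concurrent.Callable;

// Hypothetical stand-in for a "primary node left grid" failure; NOT an Ignite class.
class TopologyChangedException extends Exception {
    TopologyChangedException(String msg) {
        super(msg);
    }
}

class RetrySketch {
    // Runs op and retries it, up to maxAttempts attempts in total, whenever the
    // (hypothetical) topology exception is thrown. Any other exception propagates
    // immediately. If every attempt fails, the last topology exception is rethrown.
    static <T> T retryOnTopologyChange(Callable<T> op, int maxAttempts) throws Exception {
        if (maxAttempts < 1)
            throw new IllegalArgumentException("maxAttempts must be >= 1");

        TopologyChangedException last = null;

        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            }
            catch (TopologyChangedException e) {
                last = e; // Topology changed in the middle of the operation: try again.
            }
        }

        throw last;
    }
}
```

Whether such a retry belongs in the semaphore implementation or in the calling code is a separate question; the sketch only shows the shape of the loop.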