Hi Yakov,
1. Yes

2. If you mean that nodeMap is accessed in the onNodeRemoved(UUID nodeID)
method of the GridCacheSemaphoreImpl class,
it shouldn't be a problem, but it can easily be changed not to do so.

3.
org.apache.ignite.internal.processors.cache.datastructures.GridCacheAbstractDataStructuresFailoverSelfTest#testSemaphoreConstantTopologyChangeFailoverSafe()
org.apache.ignite.internal.processors.cache.datastructures.GridCacheAbstractDataStructuresFailoverSelfTest#testSemaphoreConstantMultipleTopologyChangeFailoverSafe()

I think the problem is with the atomicity of the simulated grid failure:
once stopGrid() is called for a node, other threads on the same node start
throwing interrupted exceptions,
which in turn are not handled properly in
GridCacheAbstractDataStructuresFailoverSelfTest.
Those exceptions shouldn't be dealt with inside GridCacheSemaphoreImpl
itself.
In a real-world node failure scenario, all those threads would fail at the
same time
(none of them would influence the rest of the grid anymore).
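
To illustrate what I mean by handling the interruption in the test rather
than in the semaphore, here is a minimal sketch. It is not taken from the
test suite: java.util.concurrent.Semaphore is only a stand-in for the Ignite
semaphore, Thread.interrupt() is only a stand-in for the effect of stopGrid()
on the node's threads, and the names InterruptTolerantWorker and
runUntilInterrupted are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical helper: a test worker that treats interruption as an expected shutdown. */
class InterruptTolerantWorker {
    /** Outcome of one worker run. */
    static final class Result {
        final boolean interrupted;   // true if the worker saw an InterruptedException
        final Throwable unexpected;  // any other failure, or null

        Result(boolean interrupted, Throwable unexpected) {
            this.interrupted = interrupted;
            this.unexpected = unexpected;
        }
    }

    /** Runs an acquire/release loop, interrupts it, and reports how the worker ended. */
    static Result runUntilInterrupted(Semaphore sem) {
        AtomicBoolean interrupted = new AtomicBoolean();
        AtomicReference<Throwable> unexpected = new AtomicReference<>();
        CountDownLatch started = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            started.countDown();
            try {
                while (true) {
                    sem.acquire();
                    try {
                        Thread.sleep(1); // simulated work while holding the permit
                    } finally {
                        sem.release();
                    }
                }
            } catch (InterruptedException e) {
                // Expected once the "node" is stopped: record it and exit quietly.
                interrupted.set(true);
            } catch (Throwable t) {
                // Anything else is a real failure the test should report.
                unexpected.set(t);
            }
        });

        try {
            worker.start();
            started.await();
            Thread.sleep(50);
            worker.interrupt(); // assumption: stands in for stopGrid() on the worker's node
            worker.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Test driver was interrupted", e);
        }

        return new Result(interrupted.get(), unexpected.get());
    }
}
```

With this shape, the test thread decides that an interruption caused by a
node stop is not a failure, and the semaphore implementation stays unchanged.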

I think fixing the issues Denis is working on (IGNITE-801 and
IGNITE-803) could fix this.
Am I right? Does it make sense?


Best regards,
Vladisav




On Tue, Nov 17, 2015 at 5:40 PM, Yakov Zhdanov <yzhda...@apache.org> wrote:

> Vladislav,
>
> I started to review the latest changes and have a couple of questions:
>
> 1. The latest changes are here - https://github.com/apache/ignite/pull/120. Is
> that correct?
> 2.
> org.apache.ignite.internal.processors.datastructures.GridCacheSemaphoreImpl.Sync#nodeMap
> is accessed in both sync and unsync context. Are you sure this is fine?
> 3. As for the failing test - can you please isolate it into a separate junit
> or point out an existing one?
>
> --Yakov
>
> 2015-11-11 12:33 GMT+03:00 Vladisav Jelisavcic <vladis...@gmail.com>:
>
> > Yakov,
> >
> > Sorry for running a bit late.
> >
> > > Vladislav, do you have any updates for
> > > https://issues.apache.org/jira/browse/IGNITE-638? Or any questions?
> > >
> > > --Yakov
> >
> > I have problems with some fail-over scenarios.
> > It seems that if two nodes are in the middle of acquiring or releasing
> > the semaphore,
> > and one of them fails, all nodes get:
> >
> > [09:36:38,509][ERROR][ignite-#13%pub-null%][GridCacheSemaphoreImpl]
> > <ignite-atomics-sys-cache> Failed to compare and set:
> > o.a.i.i.processors.datastructures.GridCacheSemaphoreImpl$Sync$1@5528b728
> > class org.apache.ignite.internal.cluster.ClusterTopologyCheckedException:
> > Failed to acquire lock for keys (primary node left grid, retry transaction
> > if possible) [keys=[UserKeyCacheObjectImpl [val=GridCacheInternalKeyImpl
> > [name=ac83b8cb-3052-49a6-9301-81b20b0ecf3a], hasValBytes=true]],
> > node=c321fcc4-5db5-4b03-9811-6a5587f2c253]
> > ...
> > Caused by: class
> > org.apache.ignite.internal.cluster.ClusterTopologyCheckedException: Failed
> > to acquire lock for keys (primary node left grid, retry transaction if
> > possible) [keys=[UserKeyCacheObjectImpl [val=GridCacheInternalKeyImpl
> > [name=ac83b8cb-3052-49a6-9301-81b20b0ecf3a], hasValBytes=true]],
> > node=c321fcc4-5db5-4b03-9811-6a5587f2c253]
> > at org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedLockFuture.newTopologyException(GridDhtColocatedLockFuture.java:1199)
> > ... 10 more
> >
> >
> > I'm still trying to find out exactly how to reproduce this behavior;
> > I'll send you more details once I try a few more things.
> >
> > I am still using a partitioned cache; does it make sense to use a replicated
> > cache instead?
> >
> >
> > Other than that, I'm done with everything else.
> >
> > Thanks,
> > Vladisav
> >
> >
>
