[ https://issues.apache.org/jira/browse/IGNITE-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432094#comment-16432094 ]
Alexey Goncharuk commented on IGNITE-7871:
------------------------------------------
I also found this deadlock in TC tests:
{code}
Found one Java-level deadlock:
=============================
"sys-#55123%dht.GridCacheAtomicNearCacheSelfTest2%":
waiting to lock monitor 0x00007f58a019a7c8 (object 0x00000000e3e33370, a
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager$ClientLatch),
which is held by
"exchange-worker-#55118%dht.GridCacheAtomicNearCacheSelfTest2%"
"exchange-worker-#55118%dht.GridCacheAtomicNearCacheSelfTest2%":
waiting for ownable synchronizer 0x00000000de084358, (a
java.util.concurrent.locks.ReentrantLock$NonfairSync),
which is held by "sys-#55123%dht.GridCacheAtomicNearCacheSelfTest2%"
Java stack information for the threads listed above:
===================================================
"sys-#55123%dht.GridCacheAtomicNearCacheSelfTest2%":
at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager$ClientLatch.newCoordinator(ExchangeLatchManager.java:565)
- waiting to lock <0x00000000e3e33370> (a
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager$ClientLatch)
at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager$ClientLatch.access$300(ExchangeLatchManager.java:521)
at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager.processNodeLeft(ExchangeLatchManager.java:373)
at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager.lambda$null$1(ExchangeLatchManager.java:115)
at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager$$Lambda$36/1235895228.run(Unknown
Source)
at
org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6746)
at
org.apache.ignite.internal.processors.closure.GridClosureProcessor$1.body(GridClosureProcessor.java:827)
at
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
"exchange-worker-#55118%dht.GridCacheAtomicNearCacheSelfTest2%":
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000de084358> (a
java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at
java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager.processAck(ExchangeLatchManager.java:268)
at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager.lambda$new$0(ExchangeLatchManager.java:101)
at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager$$Lambda$1/832828638.onMessage(Unknown
Source)
at
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
at
org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1184)
at
org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1632)
at
org.apache.ignite.internal.managers.communication.GridIoManager.sendToGridTopic(GridIoManager.java:1715)
at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager$ClientLatch.sendAck(ExchangeLatchManager.java:578)
at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager$ClientLatch.countDown(ExchangeLatchManager.java:596)
- locked <0x00000000e3e33370> (a
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.latch.ExchangeLatchManager$ClientLatch)
at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.waitPartitionRelease(GridDhtPartitionsExchangeFuture.java:1322)
at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:1111)
at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:712)
at
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2401)
at
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2290)
at
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
at java.lang.Thread.run(Thread.java:745)
{code}
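Reading the two stacks together, this looks like a classic ABBA lock-order inversion: the sys thread (processNodeLeft) holds the ExchangeLatchManager ReentrantLock and tries to enter the ClientLatch monitor in newCoordinator, while the exchange worker enters the ClientLatch monitor in countDown and, because the ack sent to the local node is delivered synchronously through GridIoManager, ends up in processAck trying to take that same ReentrantLock. The sketch below condenses the two call paths into a minimal, self-contained form; the class and method names are illustrative only, not the actual Ignite code, and whether a given run deadlocks depends on thread interleaving.
{code}
import java.util.concurrent.locks.ReentrantLock;

/** Minimal reproduction of the ABBA lock-order inversion seen in the trace (hypothetical names). */
public class LatchDeadlockSketch {
    /** Analog of the manager-level ReentrantLock (0x...de084358 in the trace). */
    private final ReentrantLock managerLock = new ReentrantLock();

    /** Analog of the ClientLatch instance whose monitor is contended (0x...e3e33370 in the trace). */
    private final Object clientLatch = new Object();

    /** Path of the sys thread: node-left handler -> latch coordinator change. */
    void onNodeLeft() {
        managerLock.lock();               // lock A
        try {
            synchronized (clientLatch) {  // then monitor B -> blocks while the worker holds B
                // reassign the latch coordinator
            }
        } finally {
            managerLock.unlock();
        }
    }

    /** Path of the exchange worker: countDown -> sendAck -> local delivery -> processAck. */
    void countDown() {
        synchronized (clientLatch) {      // monitor B
            processAck();                 // the trace shows the local ack handled on the same thread
        }
    }

    private void processAck() {
        managerLock.lock();               // then lock A -> blocks while the sys thread holds A: deadlock
        try {
            // update latch state for the received ack
        } finally {
            managerLock.unlock();
        }
    }

    public static void main(String[] args) {
        LatchDeadlockSketch s = new LatchDeadlockSketch();
        new Thread(s::onNodeLeft, "sys-thread").start();
        new Thread(s::countDown, "exchange-worker").start();
    }
}
{code}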
> Implement 2-phase waiting for partition release
> -----------------------------------------------
>
> Key: IGNITE-7871
> URL: https://issues.apache.org/jira/browse/IGNITE-7871
> Project: Ignite
> Issue Type: Improvement
> Components: cache
> Affects Versions: 2.4
> Reporter: Pavel Kovalenko
> Assignee: Alexey Goncharuk
> Priority: Major
> Fix For: 2.5
>
>
> Using the validation implemented in IGNITE-7467 we can observe the following
> situation:
> Suppose we have a partition owned by nodes N1 (primary) and N2 (backup).
> 1) Exchange is started.
> 2) N2 finishes waiting for partition release and starts to create the Single
> message (with update counters).
> 3) N1 is still waiting for partition release.
> 4) There is a pending cache update N1 -> N2. This update completes after step 2.
> 5) This update increments the update counters on both N1 and N2.
> 6) N1 finishes waiting for partition release, while N2 has already sent the
> Single message to the coordinator with an outdated update counter.
> 7) The coordinator sees different partition update counters for N1 and N2.
> Validation fails, although the data is actually equal.
> Solution:
> Every server node participating in PME should wait until all other server
> nodes have finished their ongoing updates (i.e. have finished the wait for
> partition release step); see the sketch below.
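For context on the proposed fix: 2-phase waiting means each node first drains its own in-flight updates (the existing local wait for partition release), then waits on a cluster-wide barrier until every other server node reports that it has done the same, and only then reads its update counters into the Single message. The sketch below illustrates just that ordering with a hypothetical DistributedLatch/PartitionReleaseFuture API; it is not the actual GridDhtPartitionsExchangeFuture or ExchangeLatchManager code.
{code}
/**
 * Conceptual sketch of 2-phase waiting for partition release
 * (hypothetical API, not the actual Ignite implementation).
 */
public class TwoPhasePartitionRelease {
    /** Hypothetical cluster-wide latch: countDown() on the local node, await() until all server nodes counted down. */
    interface DistributedLatch {
        void countDown();
        void await() throws InterruptedException;
    }

    /** Hypothetical future that completes when all local ongoing cache updates have finished. */
    interface PartitionReleaseFuture {
        void get() throws InterruptedException;
    }

    private final DistributedLatch releasedLatch;      // "I have drained my local updates"
    private final PartitionReleaseFuture localRelease; // local in-flight updates

    TwoPhasePartitionRelease(DistributedLatch releasedLatch, PartitionReleaseFuture localRelease) {
        this.releasedLatch = releasedLatch;
        this.localRelease = localRelease;
    }

    /**
     * Phase 1: wait until this node's ongoing updates are drained.
     * Phase 2: wait until every other server node has drained its updates too.
     * Only after both phases may the node read its update counters and build the
     * Single message, so a pending update (step 4 in the scenario above) can no
     * longer bump counters after the counters were read on another node.
     */
    void waitPartitionRelease() throws InterruptedException {
        localRelease.get();        // phase 1: local updates finished
        releasedLatch.countDown(); // announce phase 1 completion to the cluster
        releasedLatch.await();     // phase 2: wait for all server nodes to announce completion
        // safe point: update counters are stable cluster-wide; build the Single message now
    }
}
{code}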