Hi all,
We have recently started seeing an issue in our cluster. We run 2 client 
instances and 26 Apache Ignite instances, all on AWS r4.2xlarge nodes. 
When a thread tries to fetch an atomicLong or atomicReference, it gets stuck 
and never returns. This usually happens on 1 or 2 Ignite instances at a time. 
I am not sure why it happens, so any help would be really appreciated. We use 
Ignite 2.7.5.
This is the thread dump while trying to get an atomicReference:
"main" #1 prio=5 os_prio=0 cpu=3528.41ms elapsed=1067.33s allocated=312M defined_classes=9309 tid=0x00007f4ce4046fc0 nid=0x1537 waiting on condition  [0x00007f4cece90000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.7/Native Method)
        - parking to wait for  <0x00007f4cbfe7c7d0> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(java.base@11.0.7/LockSupport.java:194)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.7/AbstractQueuedSynchronizer.java:885)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.7/AbstractQueuedSynchronizer.java:1039)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.7/AbstractQueuedSynchronizer.java:1345)
        at java.util.concurrent.CountDownLatch.await(java.base@11.0.7/CountDownLatch.java:232)
        at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:7612)
        at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.awaitInitialization(DataStructuresProcessor.java:1147)
        at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.getAtomic(DataStructuresProcessor.java:506)
        at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.atomicReference(DataStructuresProcessor.java:744)
        at org.apache.ignite.internal.IgniteKernal.atomicReference(IgniteKernal.java:3743)
        at org.apache.ignite.internal.IgniteKernal.atomicReference(IgniteKernal.java:3732)
        at company.explore.cache.persist.SavedAudienceLocationProvider.getSavedAudienceLocation(SavedAudienceLocationProvider.java:27)
        at company.explore.listeners.lifecycle.LifecycleListener.configureSavedAudienceLocation(LifecycleListener.java:45)
        at company.explore.listeners.lifecycle.LifecycleListener.onLifecycleEvent(LifecycleListener.java:38)
        at org.apache.ignite.internal.IgniteKernal.notifyLifecycleBeans(IgniteKernal.java:725)
        at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1156)
        at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2038)
        at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1730)
        - locked <0x00007f4cbf072a38> (a org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance)
        at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1158)
        at org.apache.ignite.internal.IgnitionEx.startConfigurations(IgnitionEx.java:1076)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:962)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:861)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:731)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:700)
        at org.apache.ignite.Ignition.start(Ignition.java:348)
        at org.apache.ignite.startup.cmdline.CommandLineStartup.main(CommandLineStartup.java:301)

Because the main thread is stuck, any Ignition.ignite() call hangs as well, 
so the job never goes through:
"pub-#22" #48 prio=5 os_prio=0 cpu=5.76ms elapsed=1036.50s allocated=421K defined_classes=6 tid=0x00007f4ce4cf3990 nid=0x1607 waiting on condition  [0x00007f40375f6000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.7/Native Method)
        - parking to wait for  <0x00007f4cbf16d9e0> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(java.base@11.0.7/LockSupport.java:194)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.7/AbstractQueuedSynchronizer.java:885)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.7/AbstractQueuedSynchronizer.java:1039)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.7/AbstractQueuedSynchronizer.java:1345)
        at java.util.concurrent.CountDownLatch.await(java.base@11.0.7/CountDownLatch.java:232)
        at org.apache.ignite.internal.util.IgniteUtils.awaitQuiet(IgniteUtils.java:7657)
        at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.grid(IgnitionEx.java:1671)
        at org.apache.ignite.internal.IgnitionEx.grid(IgnitionEx.java:1389)
        at org.apache.ignite.internal.IgnitionEx.grid(IgnitionEx.java:1258)
        at org.apache.ignite.Ignition.ignite(Ignition.java:489)
        at company.explore.dataload.person.LoadPersonAttributeJob.call(LoadPersonAttributeJob.java:58)
        at company.explore.dataload.person.LoadPersonAttributeJob.call(LoadPersonAttributeJob.java:31)
        at org.apache.ignite.internal.processors.closure.GridClosureProcessor$C2.execute(GridClosureProcessor.java:1855)
        at org.apache.ignite.internal.processors.job.GridJobWorker$2.call(GridJobWorker.java:568)
        at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6817)
        at org.apache.ignite.internal.processors.job.GridJobWorker.execute0(GridJobWorker.java:562)
        at org.apache.ignite.internal.processors.job.GridJobWorker.body(GridJobWorker.java:491)
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.7/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.7/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(java.base@11.0.7/Thread.java:834)
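
To make my reading of the two dumps above concrete, here is a minimal plain-Java sketch of the cascade. It uses no Ignite classes: the class, field, and method names are hypothetical stand-ins, and the timeouts exist only so the sketch terminates (the real calls in the dumps wait without a timeout). It only models what the stack traces show, namely that the startup thread waits on one latch inside a lifecycle callback, and that callers of Ignition.ignite() wait on a second latch until startup finishes. It does not explain why the first latch is never released.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Plain-Java sketch of the two waits seen in the dumps. All names are hypothetical. */
public class LatchCascadeSketch {
    // Stand-in for the latch awaited under DataStructuresProcessor.awaitInitialization.
    private final CountDownLatch dataStructuresReady = new CountDownLatch(1);
    // Stand-in for the latch awaited under IgnitionEx$IgniteNamedInstance.grid.
    private final CountDownLatch nodeStarted = new CountDownLatch(1);

    /** Simulates whatever event would normally release the first latch. */
    public void markDataStructuresReady() {
        dataStructuresReady.countDown();
    }

    /** Simulates startup: the lifecycle callback must return before the node counts as started. */
    public boolean startNode(long timeoutMs) {
        if (!await(dataStructuresReady, timeoutMs)) {
            return false; // callback never returned, so nodeStarted is never released
        }
        nodeStarted.countDown();
        return true;
    }

    /** Simulates a job thread looking up the node instance. */
    public boolean lookupNode(long timeoutMs) {
        return await(nodeStarted, timeoutMs);
    }

    private static boolean await(CountDownLatch latch, long timeoutMs) {
        try {
            return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        LatchCascadeSketch stuck = new LatchCascadeSketch();
        System.out.println("start completed: " + stuck.startNode(200));   // false
        System.out.println("lookup completed: " + stuck.lookupNode(200)); // false
    }
}
```

In the sketch, calling markDataStructuresReady() before startNode() lets both calls return true, which matches the behaviour we see on the healthy instances.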

Similarly, here is a thread waiting on a CountDownLatch while trying to get 
an atomicLong:
"pub-#489" #608 prio=5 os_prio=0 cpu=16.80ms elapsed=7076.10s allocated=2409K defined_classes=17 tid=0x00007f48c8014c60 nid=0x5bd5 waiting on condition  [0x00007f48359e1000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.7/Native Method)
        - parking to wait for  <0x00007f518aba6060> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(java.base@11.0.7/LockSupport.java:194)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.7/AbstractQueuedSynchronizer.java:885)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.7/AbstractQueuedSynchronizer.java:1039)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.7/AbstractQueuedSynchronizer.java:1345)
        at java.util.concurrent.CountDownLatch.await(java.base@11.0.7/CountDownLatch.java:232)
        at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:7612)
        at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.awaitInitialization(DataStructuresProcessor.java:1147)
        at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.getAtomic(DataStructuresProcessor.java:506)
        at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.atomicLong(DataStructuresProcessor.java:463)
        at org.apache.ignite.internal.IgniteKernal.atomicLong(IgniteKernal.java:3716)
        at org.apache.ignite.internal.IgniteKernal.atomicLong(IgniteKernal.java:3705)
        at company.explore.cache.persist.person.SerializationStatus.getSerializeCounter(SerializationStatus.java:86)
        at company.explore.cache.persist.person.SerializationStatus.startNodeSerialization(SerializationStatus.java:21)
        at company.explore.cache.persist.personv2.PersonSerializationJob.serializePeopleData(PersonSerializationJob.java:98)
        at company.explore.cache.persist.personv2.PersonSerializationJob.run(PersonSerializationJob.java:75)
        at org.apache.ignite.internal.processors.closure.GridClosureProcessor$C4.execute(GridClosureProcessor.java:1944)
        at org.apache.ignite.internal.processors.job.GridJobWorker$2.call(GridJobWorker.java:568)
        at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6817)
        at org.apache.ignite.internal.processors.job.GridJobWorker.execute0(GridJobWorker.java:562)
        at org.apache.ignite.internal.processors.job.GridJobWorker.body(GridJobWorker.java:491)
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.7/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.7/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run(java.base@11.0.7/Thread.java:834)

These issues have only started appearing in the past 2 months or so; the 
system had been very stable for a long time before that.
Since the issue is not consistent, I am not sure how to create a reproducer 
project, but I can provide logs or anything else if needed.
The full thread dumps are too large to include here, so I have posted them on 
pastebin. Please find the links below:
Atomic Reference related thread dump: pastebin.com/ydNMFSEP
Atomic Long related thread dump: pastebin.com/psJgwi3F
Any help is much appreciated. Thanks!
Best Regards,
Paul