Hello Justin,

We managed to get a thread dump of the service while the issue is in effect; the relevant thread stacks are below. They were taken with Artemis version *2.6.4*, and we also managed to reproduce the issue locally on the latest version of Artemis (*2.10.1*).
Stacks of the 2 *deadlocked* threads:

"Thread-1074 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@3738bd7a)" - *Thread t@11854*
    java.lang.Thread.State: BLOCKED
    at org.apache.activemq.artemis.core.persistence.impl.journal.LargeServerMessageImpl.getEncodeSize(LargeServerMessageImpl.java:174)
    - waiting to lock <1eb68b32> (a org.apache.activemq.artemis.core.persistence.impl.journal.LargeServerMessageImpl) owned by "Thread-25 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$5@59ccdde4)" t@202
    at org.apache.activemq.artemis.core.persistence.impl.journal.codec.LargeMessagePersister.getEncodeSize(LargeMessagePersister.java:59)
    at org.apache.activemq.artemis.core.persistence.impl.journal.codec.LargeMessagePersister.getEncodeSize(LargeMessagePersister.java:25)
    at org.apache.activemq.artemis.core.journal.impl.dataformat.JournalAddRecord.getEncodeSize(JournalAddRecord.java:79)
    at org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendRecord(JournalImpl.java:2785)
    at org.apache.activemq.artemis.core.journal.impl.JournalImpl.access$100(JournalImpl.java:91)
    at org.apache.activemq.artemis.core.journal.impl.JournalImpl$1.run(JournalImpl.java:843)
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
    at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66)
    at org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$1194/0x0000000801ecd440.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)

    Locked ownable synchronizers:
    - locked <4c6aa3d9> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"Thread-25 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$5@59ccdde4)" - *Thread t@202*
    java.lang.Thread.State: WAITING
    at jdk.internal.misc.Unsafe.park(Native Method)
    - parking to wait for <7807bddf> (a java.util.concurrent.CountDownLatch$Sync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
    at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:232)
    at org.apache.activemq.artemis.utils.SimpleFutureImpl.get(SimpleFutureImpl.java:62)
    at org.apache.activemq.artemis.core.journal.impl.JournalImpl.checkKnownRecordID(JournalImpl.java:1073)
    at org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:943)
    at org.apache.activemq.artemis.core.persistence.impl.journal.AbstractJournalStorageManager.confirmPendingLargeMessage(AbstractJournalStorageManager.java:360)
    at org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl.confirmLargeMessageSend(PostOfficeImpl.java:1308)
    - locked <1eb68b32> (a org.apache.activemq.artemis.core.persistence.impl.journal.LargeServerMessageImpl)
    at org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl.processRoute(PostOfficeImpl.java:1254)
    at org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl.route(PostOfficeImpl.java:904)
    at org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl.route(PostOfficeImpl.java:799)
    at org.apache.activemq.artemis.core.server.impl.ServerSessionImpl.doSend(ServerSessionImpl.java:1801)
    - locked <3396ca5d> (a org.apache.activemq.artemis.core.server.impl.ServerSessionImpl)
    at org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler.sendContinuations(ServerSessionPacketHandler.java:993)
    - locked <67aa6e6> (a java.lang.Object)
    at org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler.slowPacketHandler(ServerSessionPacketHandler.java:314)
    at org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler.onMessagePacket(ServerSessionPacketHandler.java:287)
    at org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler$$Lambda$1694/0x000000080219f840.onMessage(Unknown Source)
    at org.apache.activemq.artemis.utils.actors.Actor.doTask(Actor.java:33)
    at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66)
    at org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$1194/0x0000000801ecd440.run(Unknown Source)
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
    at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66)
    at org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$1194/0x0000000801ecd440.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)

    Locked ownable synchronizers:
    - locked <7fc2876c> (a java.util.concurrent.ThreadPoolExecutor$Worker)

From what we can see, this issue occurs only during large message processing. The two threads are locked like this:

1. *Thread t@11854*: the JournalImpl.appendAddRecord() task handed off to the appendExecutor (JournalImpl:946) is delayed for some reason.
2. *Thread t@202*: because appendAdd is asynchronous, JournalImpl.appendDeleteRecord() is executed next; it takes the lock on the LargeMessage object (PostOfficeImpl:1305) and gets stuck on the appendExecutor queue behind *Thread t@11854*'s task (JournalImpl:1058).
3. *Thread t@11854*: the appendAddRecord() task on the appendExecutor reaches the point where it needs the lock on the same LargeMessage object (LargeServerMessageImpl:173), but it can't get it because *Thread t@202* holds it.

So the deadlock is: *Thread t@11854* is waiting for the lock on the LargeMessage object, which will never be released by *Thread t@202*, because *Thread t@202* is waiting for processing on the appendExecutor queue behind *Thread t@11854*.
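To make the cycle easier to see, here is a minimal, self-contained Java sketch of the same kind of deadlock (hypothetical names, not Artemis code): one thread holds a monitor while it waits on a single-threaded executor, and the task already running on that executor needs the same monitor. Running it hangs, and a thread dump shows the same two states as above (one thread BLOCKED on the monitor, the other WAITING on a latch).

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Minimal sketch of the reported deadlock pattern (hypothetical names, not Artemis code).
// "appendExecutor" stands in for the journal's ordered append executor and
// "largeMessageLock" for the LargeServerMessageImpl monitor.
public class AppendExecutorDeadlockSketch {

    static final Object largeMessageLock = new Object();
    static final ExecutorService appendExecutor = Executors.newSingleThreadExecutor();

    public static void main(String[] args) throws Exception {
        CountDownLatch lockHeld = new CountDownLatch(1);

        // Thread B plays the role of t@202: it holds the large-message monitor
        // (as in confirmLargeMessageSend) and then waits for a task it queued on
        // the append executor (as in appendDeleteRecord -> checkKnownRecordID).
        Thread threadB = new Thread(() -> {
            synchronized (largeMessageLock) {
                lockHeld.countDown();

                // Give the "append add" task (submitted below) time to start and
                // block on the monitor, as happens when appendAddRecord runs async.
                sleepQuietly(500);

                CountDownLatch deleteDone = new CountDownLatch(1);
                appendExecutor.execute(deleteDone::countDown); // queued behind the add task
                awaitQuietly(deleteDone);                      // never completes -> deadlock
            }
        }, "thread-B-like-t@202");
        threadB.start();

        lockHeld.await();

        // The executor thread plays the role of t@11854: the queued "append add"
        // task needs the same monitor to compute the record's encode size.
        appendExecutor.execute(() -> {
            synchronized (largeMessageLock) { // blocks forever: thread B holds the monitor
                // ... encode and write the add record ...
            }
        });
        // The JVM never exits: take a thread dump here to see the two stuck threads.
    }

    static void sleepQuietly(long millis) {
        try { Thread.sleep(millis); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    static void awaitQuietly(CountDownLatch latch) {
        try { latch.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}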
This eventually blocks all other producer threads, so *no messages can be produced to the queue*:

"flow-action-completed-355" - Thread t@12775
    java.lang.Thread.State: TIMED_WAITING
    at jdk.internal.misc.Unsafe.park(Native Method)
    - parking to wait for <4876219f> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:235)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2211)
    at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:389)
    - locked <5582112e> (a java.lang.Object)
    at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:319)
    at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.sessionClose(ActiveMQSessionContext.java:543)
    at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.close(ClientSessionImpl.java:1234)
    at org.apache.activemq.artemis.jms.client.ActiveMQSession.close(ActiveMQSession.java:280)
    - locked <5fe5842b> (a org.apache.activemq.artemis.jms.client.ActiveMQConnection)

"flow-action-completed-351" - Thread t@12771
    java.lang.Thread.State: BLOCKED
    at org.apache.activemq.artemis.jms.client.ActiveMQConnection.createSession(ActiveMQConnection.java:234)
    - waiting to lock <5fe5842b> (a org.apache.activemq.artemis.jms.client.ActiveMQConnection) owned by "flow-action-completed-355" t@12775
    at org.springframework.jms.connection.SingleConnectionFactory.createSession(SingleConnectionFactory.java:481)
    at org.springframework.jms.connection.CachingConnectionFactory.getSession(CachingConnectionFactory.java:232)
    at org.springframework.jms.connection.SingleConnectionFactory$SharedConnectionInvocationHandler.invoke(SingleConnectionFactory.java:648)

This issue occurred 3 times in the course of 2 weeks. In the meantime we updated Artemis to the latest version (2.10.1), and for now we increased *minLargeMessageSize* so that all messages are processed as regular messages: we run Artemis embedded and keep entire messages in memory anyway, so we never actually get any benefit from processing a message as a large message. This way we avoid the part of the code that causes the deadlock.
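In case it's useful, this is roughly what the client-side part of that workaround looks like (a sketch only: the in-VM URL and the 10 MiB threshold are illustrative, and in our service the factory is actually built by Spring):

import javax.jms.ConnectionFactory;

import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

// Sketch of the workaround: raise the client-side minLargeMessageSize above the
// biggest message we expect, so nothing is ever handled as a "large" message.
public class NoLargeMessagesFactory {

    public static ConnectionFactory create() {
        ActiveMQConnectionFactory cf = new ActiveMQConnectionFactory("vm://0");
        cf.setMinLargeMessageSize(10 * 1024 * 1024); // default is 100 KiB (102400 bytes)
        return cf;
    }

    // The same setting can also be passed as a URL parameter,
    // e.g. "vm://0?minLargeMessageSize=10485760".
}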
We would just like someone to confirm that there is a bug present here. If more info is needed, we will provide it.

Kind regards,
Mario

On Mon, 18 Nov 2019 18:21, Justin Bertram <jbert...@apache.org> wrote:

> > stack traces for regular/large messages attached
>
> Email attachments are scrubbed so none came through.
>
> > Service is running within a docker container and folder containing the
> > journal is mapped to the host machine.
>
> Could you be more specific here? How exactly is the folder mapped? Is a
> networked file-system involved here?
>
> > After some debugging we came to the conclusion that either threads
> > writing to the journal were blocked for an extended period of time, or
> > journal compact operation lasted a long time/was blocked for some reason
> > and held write lock on journal during that time.
>
> What specifically led you to this conclusion? It's hard to offer insight
> without additional details, especially thread dumps from the time of the
> incident. Have you seen this issue just once in over a year of service?
>
> Justin
>
> On Mon, 18 Nov 2019 at 16:14, Mario Mahovlić <mariomahov...@gmail.com> wrote:
>
> > We run Artemis embedded on our Spring service; it ran OK for over a year,
> > but at some point we started getting timeout exceptions when producing
> > messages to the queue (stack traces for regular/large messages attached).
> >
> > We produce both regular and large messages to the queue and we got
> > timeouts for both types (large messages are ~130 kB on average). Message
> > production rate to the queue at the time of the incident was ~100k per hour.
> >
> > Artemis is running in persistent mode using a file journal on disk. As
> > mentioned in the title, no error- or warn-level logs were logged on the
> > Artemis server side, and the timeouts stopped after a service restart.
> >
> > The service is running within a Docker container and the folder containing
> > the journal is mapped to the host machine.
> >
> > Metrics for the node on which the service was running show no disk I/O
> > issues at that time.
> >
> > Artemis version: 2.6.4, Spring Boot version: 2.1.5.RELEASE
> >
> > Relevant Artemis settings (the rest of the settings are default):
> >
> > durable: true
> > max-size-bytes: 1GB
> > address-full-policy: FAIL
> > journal-sync-non-transactional: false
> > journal-sync-transactional: false
> >
> > If more info is needed we will try to provide it on request.
> >
> > After some debugging we came to the conclusion that either threads writing
> > to the journal were blocked for an extended period of time, or the journal
> > compact operation lasted a long time / was blocked for some reason and held
> > the write lock on the journal during that time.
> >
> > Unfortunately we took no thread dumps during the incident to see where
> > exactly the threads were stuck. We didn't manage to find any similar
> > incidents reported on these boards, so we would like to check whether anyone
> > has any other idea what might cause this behavior?
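As a footnote to the settings quoted above, here is a rough programmatic equivalent in case it helps reproduce the setup (a sketch only; our real broker is configured through Spring Boot, and the acceptor and journal directory below are placeholders, not our actual values):

import org.apache.activemq.artemis.core.config.Configuration;
import org.apache.activemq.artemis.core.config.impl.ConfigurationImpl;
import org.apache.activemq.artemis.core.server.embedded.EmbeddedActiveMQ;
import org.apache.activemq.artemis.core.settings.impl.AddressFullMessagePolicy;
import org.apache.activemq.artemis.core.settings.impl.AddressSettings;

// Rough programmatic equivalent of the quoted settings (illustrative sketch only).
public class EmbeddedBrokerSketch {

    public static void main(String[] args) throws Exception {
        Configuration config = new ConfigurationImpl()
                .setPersistenceEnabled(true)                     // persistent mode ("durable: true")
                .setJournalSyncNonTransactional(false)           // journal-sync-non-transactional: false
                .setJournalSyncTransactional(false)              // journal-sync-transactional: false
                .setJournalDirectory("/var/lib/artemis/journal") // placeholder path
                .addAcceptorConfiguration("in-vm", "vm://0");    // placeholder acceptor

        AddressSettings addressSettings = new AddressSettings()
                .setMaxSizeBytes(1024L * 1024 * 1024)                        // max-size-bytes: 1GB
                .setAddressFullMessagePolicy(AddressFullMessagePolicy.FAIL); // address-full-policy: FAIL
        config.addAddressesSetting("#", addressSettings);

        EmbeddedActiveMQ broker = new EmbeddedActiveMQ();
        broker.setConfiguration(config);
        broker.start();
    }
}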