Hello Justin,

We managed to take a thread dump of the service while the issue was in
effect; the relevant thread stacks are below.
These thread stacks were taken with Artemis version *2.6.4*; we also
managed to reproduce the issue locally on the latest version of Artemis
(*2.10.1*).

Stacks of the 2 *deadlocked* threads:

"Thread-1074
(ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@3738bd7a)"
- *Thread t@11854*
   java.lang.Thread.State: BLOCKED
        at
org.apache.activemq.artemis.core.persistence.impl.journal.LargeServerMessageImpl.getEncodeSize(LargeServerMessageImpl.java:174)
        - waiting to lock <1eb68b32> (a
org.apache.activemq.artemis.core.persistence.impl.journal.LargeServerMessageImpl)
owned by "Thread-25
(ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$5@59ccdde4)"
t@202
        at
org.apache.activemq.artemis.core.persistence.impl.journal.codec.LargeMessagePersister.getEncodeSize(LargeMessagePersister.java:59)
        at
org.apache.activemq.artemis.core.persistence.impl.journal.codec.LargeMessagePersister.getEncodeSize(LargeMessagePersister.java:25)
        at
org.apache.activemq.artemis.core.journal.impl.dataformat.JournalAddRecord.getEncodeSize(JournalAddRecord.java:79)
        at
org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendRecord(JournalImpl.java:2785)
        at
org.apache.activemq.artemis.core.journal.impl.JournalImpl.access$100(JournalImpl.java:91)
        at
org.apache.activemq.artemis.core.journal.impl.JournalImpl$1.run(JournalImpl.java:843)
        at
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
        at
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
        at
org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66)
        at
org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$1194/0x0000000801ecd440.run(Unknown
Source)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at
org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)

   Locked ownable synchronizers:
        - locked <4c6aa3d9> (a
java.util.concurrent.ThreadPoolExecutor$Worker)


"Thread-25
(ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$5@59ccdde4)"
- *Thread t@202*
   java.lang.Thread.State: WAITING
        at jdk.internal.misc.Unsafe.park(Native Method)
        - parking to wait for <7807bddf> (a
java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
        at
java.util.concurrent.CountDownLatch.await(CountDownLatch.java:232)
        at
org.apache.activemq.artemis.utils.SimpleFutureImpl.get(SimpleFutureImpl.java:62)
        at
org.apache.activemq.artemis.core.journal.impl.JournalImpl.checkKnownRecordID(JournalImpl.java:1073)
        at
org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:943)
        at
org.apache.activemq.artemis.core.persistence.impl.journal.AbstractJournalStorageManager.confirmPendingLargeMessage(AbstractJournalStorageManager.java:360)
        at
org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl.confirmLargeMessageSend(PostOfficeImpl.java:1308)
        - locked <1eb68b32> (a
org.apache.activemq.artemis.core.persistence.impl.journal.LargeServerMessageImpl)
        at
org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl.processRoute(PostOfficeImpl.java:1254)
        at
org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl.route(PostOfficeImpl.java:904)
        at
org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl.route(PostOfficeImpl.java:799)
        at
org.apache.activemq.artemis.core.server.impl.ServerSessionImpl.doSend(ServerSessionImpl.java:1801)
        - locked <3396ca5d> (a
org.apache.activemq.artemis.core.server.impl.ServerSessionImpl)
        at
org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler.sendContinuations(ServerSessionPacketHandler.java:993)
        - locked <67aa6e6> (a java.lang.Object)
        at
org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler.slowPacketHandler(ServerSessionPacketHandler.java:314)
        at
org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler.onMessagePacket(ServerSessionPacketHandler.java:287)
        at
org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler$$Lambda$1694/0x000000080219f840.onMessage(Unknown
Source)
        at
org.apache.activemq.artemis.utils.actors.Actor.doTask(Actor.java:33)
        at
org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66)
        at
org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$1194/0x0000000801ecd440.run(Unknown
Source)
        at
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
        at
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
        at
org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66)
        at
org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$1194/0x0000000801ecd440.run(Unknown
Source)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at
org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)

   Locked ownable synchronizers:
        - locked <7fc2876c> (a
java.util.concurrent.ThreadPoolExecutor$Worker)

From what we can see, this issue occurs only during large-message processing.
The two threads are deadlocked as follows:

   1. *Thread t@11854*: the JournalImpl.appendAddRecord() task queued on the
   appendExecutor (JournalImpl:946) is delayed for some reason
   2. *Thread t@202*: because appendAdd is asynchronous,
   JournalImpl.appendDeleteRecord() runs next; it takes the lock on the
   LargeMessage object (PostOfficeImpl:1305) and then gets stuck on the
   appendExecutor queue behind *Thread t@11854*'s task (JournalImpl:1058)
   3. *Thread t@11854*: the appendAddRecord() task on the appendExecutor
   reaches the point where it needs the lock on the same LargeMessage object
   (LargeServerMessageImpl:173), but it can't get it because *Thread t@202*
   holds it

So the deadlock is: *Thread t@11854* is waiting for the lock on the
LargeMessage object, which will never be released by *Thread t@202* because
that thread is itself waiting for processing on the appendExecutor queue
behind *Thread t@11854*'s task.
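
To make the interaction concrete, below is a minimal, self-contained Java
sketch of the pattern as we understand it. Everything in it is illustrative
rather than the actual Artemis code: a plain Object stands in for the
LargeServerMessageImpl monitor and a single-threaded executor stands in for
the journal's appendExecutor.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AppendExecutorDeadlockSketch {
    public static void main(String[] args) throws Exception {
        Object largeMessage = new Object();                  // stand-in for the LargeServerMessageImpl monitor
        ExecutorService appendExecutor = Executors.newSingleThreadExecutor(); // stand-in for the journal appendExecutor

        // Taking the monitor up front just makes the interleaving deterministic for
        // the sketch; in the broker the appendAdd task is queued before the delete path runs.
        synchronized (largeMessage) {
            // (1) the async appendAdd task is queued; when it runs it needs the
            //     message monitor (cf. LargeServerMessageImpl.getEncodeSize)
            appendExecutor.submit(() -> {
                synchronized (largeMessage) {
                    // encode and write the add record
                }
            });

            // (2) the delete path, still holding the monitor
            //     (cf. PostOfficeImpl.confirmLargeMessageSend), waits for another task
            //     on the same single-threaded executor (cf. JournalImpl.checkKnownRecordID)
            Future<?> known = appendExecutor.submit(() -> { /* checkKnownRecordID work */ });
            known.get(); // (3) never returns: queued behind (1), which needs the monitor we hold
        }
    }
}

Running this hangs in the same shape as the two stacks above: the executor
thread is BLOCKED on the message monitor, while the thread that owns the
monitor is parked waiting on the executor's queue.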

This eventually blocks all other producer threads, so *no messages can be
produced to the queue*:

"flow-action-completed-355" - Thread t@12775
   java.lang.Thread.State: TIMED_WAITING
        at jdk.internal.misc.Unsafe.park(Native Method)
        - parking to wait for <4876219f> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:235)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2211)
        at
org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:389)
        - locked <5582112e> (a java.lang.Object)
        at
org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:319)
        at
org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.sessionClose(ActiveMQSessionContext.java:543)
        at
org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.close(ClientSessionImpl.java:1234)
        at
org.apache.activemq.artemis.jms.client.ActiveMQSession.close(ActiveMQSession.java:280)
        - locked <5fe5842b> (a
org.apache.activemq.artemis.jms.client.ActiveMQConnection)


"flow-action-completed-351" - Thread t@12771
   java.lang.Thread.State: BLOCKED
        at
org.apache.activemq.artemis.jms.client.ActiveMQConnection.createSession(ActiveMQConnection.java:234)
        - waiting to lock <5fe5842b> (a
org.apache.activemq.artemis.jms.client.ActiveMQConnection) owned by
"flow-action-completed-355" t@12775
        at
org.springframework.jms.connection.SingleConnectionFactory.createSession(SingleConnectionFactory.java:481)
        at
org.springframework.jms.connection.CachingConnectionFactory.getSession(CachingConnectionFactory.java:232)
        at
org.springframework.jms.connection.SingleConnectionFactory$SharedConnectionInvocationHandler.invoke(SingleConnectionFactory.java:648)

This issue occurred 3 times in the course of 2 weeks.
In the meantime we updated Artemis to the latest version (2.10.1), and for
now we have increased *minLargeMessageSize* so that all messages are
processed as regular messages: we run Artemis embedded and keep entire
messages in memory anyway, so we never actually get any benefit from the
large-message path. This way we avoid the part of the code that causes the
deadlock.
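
For reference, this is roughly how we raised the threshold. A minimal sketch
assuming a plain ActiveMQConnectionFactory with an in-VM connector; the URL
and the 1 MiB value are placeholders rather than our exact configuration:

import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

public class LargeMessageWorkaround {
    public static ActiveMQConnectionFactory connectionFactory() {
        // in-VM connector to the embedded broker (placeholder URL)
        ActiveMQConnectionFactory cf = new ActiveMQConnectionFactory("vm://0");
        // The default minLargeMessageSize is 100 KiB; our payloads average ~130 KB,
        // so raising the threshold keeps them on the regular (non-large) message path.
        cf.setMinLargeMessageSize(1024 * 1024); // 1 MiB (placeholder value)
        return cf;
    }
}
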
We would just like someone to confirm that there is indeed a bug here.
If more info is needed, we will provide it.

Kind regards,
Mario

 On Mon, 18 Nov 2019 18:21, Justin Bertram <jbert...@apache.org> wrote:


> > stack traces for regular/large messages attached
>
> Email attachments are scrubbed so none came through.
>
> > Service is running within a docker container and folder containing the
> > journal is mapped to the host machine.
>
> Could you be more specific here? How exactly is the folder mapped? Is a
> networked file-system involved here?
>
> > After some debugging we came to the conclusion that either threads
> > writing to the journal were blocked for an extended period of time, or
> > journal compact operation lasted a long time/was blocked for some reason
> > and held write lock on journal during that time.
>
> What specifically led you to this conclusion? It's hard to offer insight
> without additional details, especially thread dumps from the time of the
> incident. Have you seen this issue just once in over a year of service?
>
> Justin


On Mon, 18 Nov 2019 at 16:14, Mario Mahovlić <mariomahov...@gmail.com>
wrote:

> We run Artemis embedded on our Spring service, it ran ok for over a year,
> however at some point we started getting timeout exceptions when producing
> messages to the queue. (stack traces for regular/large messages attached).
>
> We produce both regular and large messages to the queue and we got
> timeouts for both types (large messages are ~130kb on average). Message
> production rate to the queue at the time of incident was ~100k per hour.
>
> Artemis is running in persistent mode using file journal on disk. As
> mentioned in the title no error or warn level logs were logged on artemis
> server side and timeouts stopped after service restart.
>
>
> Service is running within a docker container and folder containing the
> journal is mapped to the host machine.
>
> Metrics for the node on which service was running show no disk I/O issues
> at that time.
>
> Artemis version: 2.6.4, Spring boot version: 2.1.5.RELEASE
>
> Relevant artemis settings (rest of the settings are default):
>
> durable: true
> max-size-bytes : 1GB
> address-full-policy: FAIL
> journal-sync-non-transactional : false
> journal-sync-transactional: false
>
> If more info is needed we will try to provide it on request.
>
> After some debugging we came to the conclusion that either threads writing
> to the journal were blocked for an extended period of time, or journal
> compact operation lasted a long time/was blocked for some reason and held
> write lock on journal during that time.
>
> Unfortunately we took no thread dumps during the incident to see where
> exactly the threads were stuck. We didn't manage to find any similar
> incidents reported on these boards so we would like to check out if anyone
> has any other idea what might cause this behavior?
>
