[ https://issues.apache.org/jira/browse/ARTEMIS-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frederik Fournes resolved ARTEMIS-5907.
---------------------------------------
    Resolution: Invalid

> OOME caused by accumulation of PageTransactionInfoImpl and JournalRecord 
> objects in paged queue with no consumer
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: ARTEMIS-5907
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-5907
>             Project: Artemis
>          Issue Type: Bug
>    Affects Versions: 2.39.0
>            Reporter: Frederik Fournes
>            Priority: Minor
>
> Hi all,
> we are running Apache ActiveMQ Artemis 2.39.0 on OpenJDK 17 in a Kubernetes 
> environment and experienced an OutOfMemoryError on our production broker. We 
> suspect this may be a bug. We have been investigating the root 
> cause and would appreciate the community's input on our findings and open 
> questions.
> h3. Environment
>  - Artemis version: 2.39.0
>  - Java: OpenJDK 17 (G1 GC)
>  - JVM: -Xms4G -Xmx9G
>  - global-max-size: 800M
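> For reference, the `global-max-size` setting above lives in broker.xml; a minimal fragment (values from this environment; the address-setting match pattern is illustrative, our real broker.xml may scope it differently):

```xml
<core xmlns="urn:activemq:core">
  <!-- broker-wide memory threshold before addresses start paging -->
  <global-max-size>800M</global-max-size>

  <address-settings>
    <!-- illustrative match; paging kicks in once the limit is hit -->
    <address-setting match="ecp.endpoint.#">
      <address-full-policy>PAGE</address-full-policy>
    </address-setting>
  </address-settings>
</core>
```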
> h3. Situation
> Our setup uses a software with an internal Artemis broker per endpoint. The 
> broker handles message routing between a Business Application (BA), the 
> endpoint itself, and a central broker.
> A feature called "AMQP Send Handler" writes a SendEvent into the queue 
> `ecp.endpoint.send.event` for every message the endpoint sends. This handler 
> was enabled, but no consumer was ever connected to this queue.
> Over approximately 1.5 years of continuous operation, this queue accumulated 
> 22,240,016 messages with 0 consumers and 0 acknowledgements.
> h3. The OOME
> The JVM heap showed a sawtooth pattern consistently reaching ~95%, with the 
> GC managing to recover each time. Eventually a single spike pushed usage to 
> ~99.9% and triggered the OutOfMemoryError.
> h3. Heap analysis
> We ran `jcmd 1 GC.class_histogram` on the production broker and found the 
> following top heap consumers:
> ||Class||Instances||Bytes||
> |PageTransactionInfoImpl|22,132,340|1,062,352,320 (~1 GB)|
> |ConcurrentHashMap$Node|22,160,468|709,134,976 (~676 MB)|
> |JournalRecord|22,256,890|534,165,360 (~509 MB)|
> |Long|22,132,609|531,182,616 (~506 MB)|
> The instance counts correlate almost exactly with the 22M stuck messages in 
> `ecp.endpoint.send.event`. These four object types alone consumed 
> approximately 2.8 GB of heap.
> All other objects (AMQPStandardMessage, MessageReferenceImpl, etc.) had 
> normal counts (~130K instances), consistent with the actively processed 
> queues.
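> The per-instance sizes implied by the histogram are consistent with small fixed-size bookkeeping records rather than message bodies; a quick back-of-the-envelope check (numbers copied from the table above):

```python
# Heap histogram rows from the production broker: class -> (instances, bytes).
histogram = {
    "PageTransactionInfoImpl": (22_132_340, 1_062_352_320),
    "ConcurrentHashMap$Node":  (22_160_468,   709_134_976),
    "JournalRecord":           (22_256_890,   534_165_360),
    "Long":                    (22_132_609,   531_182_616),
}

total = 0
for cls, (instances, nbytes) in histogram.items():
    total += nbytes
    # Shallow size per instance: 48/32/24/24 bytes - small records, not payloads.
    print(f"{cls}: {nbytes / instances:.0f} bytes/instance")

print(f"total: {total / 1e9:.2f} GB")  # ~2.84 GB, matching the ~2.8 GB above
```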
> h3. Resolution
> We purged the 22M messages from the queue using `removeMessages` with a low 
> flushLimit. The heap usage dropped significantly after the purge. We also 
> disabled the Send Handler to prevent re-accumulation.
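> For anyone hitting the same situation: `removeMessages(flushLimit, filter)` is a `QueueControl` management operation, and one way to drive it without the console is Jolokia's `exec` endpoint. A minimal sketch of building that request (the broker name and Jolokia URL are assumptions for illustration; only the payload construction is shown):

```python
import json

def remove_messages_payload(broker: str, address: str, queue: str,
                            flush_limit: int = 100, msg_filter: str = "") -> dict:
    """Build a Jolokia 'exec' request for QueueControl.removeMessages.

    A small flush limit keeps each journal transaction bounded while
    purging tens of millions of messages.
    """
    # Artemis MBean naming: quoted broker/address/queue components.
    mbean = (f'org.apache.activemq.artemis:broker="{broker}",'
             f'component=addresses,address="{address}",'
             f'subcomponent=queues,routing-type="anycast",queue="{queue}"')
    return {
        "type": "exec",
        "mbean": mbean,
        # Overloaded operation, so the full signature is required.
        "operation": "removeMessages(int,java.lang.String)",
        "arguments": [flush_limit, msg_filter],  # empty filter = all messages
    }

payload = remove_messages_payload("ecp-broker", "ecp.endpoint.send.event",
                                  "ecp.endpoint.send.event")
print(json.dumps(payload, indent=2))
# POST this to the broker's Jolokia endpoint (credentials required), e.g.
# http://localhost:8161/console/jolokia/
```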
> h3. Reproduction attempt (ACCE environment)
> We attempted to reproduce this on a test broker with the same Artemis version 
> and identical broker.xml configuration, but with -Xmx1G. We sent >10M 
> messages to the same queue (no consumer). However, the heap histogram showed 
> a very different picture:
> ||Class||PROD (22M msgs)||ACCE (10M+ msgs)||
> |PageTransactionInfoImpl|22,132,340|188,264|
> |JournalRecord|22,256,890|315,352|
> |MessageReferenceImpl|124,518|127,035|
> Despite having millions of paged messages, the ACCE broker only held ~188K 
> PageTransactionInfoImpl objects in heap (vs. 22M in PROD). JVM usage stayed 
> stable around 50%.
> h3. Questions for the community
> 1. Can someone confirm that Artemis keeps a PageTransactionInfoImpl and 
> JournalRecord in heap for each paged message as long as the message is not 
> consumed/acknowledged? Is this by design?
> 2. Why is there such a large discrepancy between PROD and ACCE? Both have the 
> same broker configuration, both had millions of paged messages with 0 
> consumers. Our hypothesis is that the long-running production environment 
> (1.5 years, continuous message flow across other queues) leads to journal 
> fragmentation/accumulation that prevents journal compaction from cleaning up 
> the PageTransactionInfoImpl records, whereas in the short-lived test scenario 
> the compaction process works efficiently. Is this plausible?
> 3. Shouldn't the paging mechanism prevent exactly this scenario, i.e. the 
> heap filling up because of a large number of stored messages?
> Thanks in advance for any insights.
> Best regards
> Frederik



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
