[
https://issues.apache.org/jira/browse/ARTEMIS-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Frederik Fournes updated ARTEMIS-5907:
--------------------------------------
Description:
Hi all,
we are running Apache ActiveMQ Artemis 2.39.0 on OpenJDK 17 in a Kubernetes
environment and experienced an OutOfMemoryError on our production broker. We
suspect this may be a bug. We have investigated the root cause and would
appreciate the community's input on our findings and open questions.
h3. Environment
- Artemis version: 2.39.0
- Java: OpenJDK 17 (G1 GC)
- JVM: -Xms4g, -Xmx9g
- global-max-size: 800M
h3. Situation
Our setup uses software that embeds an Artemis broker in each endpoint. The
broker handles message routing between a Business Application (BA), the
endpoint itself, and a central broker.
A feature called "AMQP Send Handler" writes a SendEvent into the queue
`ecp.endpoint.send.event` for every message the endpoint sends. This handler
was enabled, but no consumer was ever connected to this queue.
Over approximately 1.5 years of continuous operation, this queue accumulated
22,240,016 messages with 0 consumers and 0 acknowledgements.
h3. The OOME
The JVM heap showed a sawtooth pattern consistently reaching ~95%, with the GC
managing to recover each time. Eventually a single spike pushed usage to ~99.9%
and triggered the OutOfMemoryError.
h3. Heap analysis
We ran `jcmd 1 GC.class_histogram` on the production broker and found the
following top heap consumers:
||Class||Instances||Bytes||
|PageTransactionInfoImpl|22,132,340|1,062,352,320 (~1 GB)|
|ConcurrentHashMap$Node|22,160,468|709,134,976 (~676 MB)|
|JournalRecord|22,256,890|534,165,360 (~509 MB)|
|Long|22,132,609|531,182,616 (~506 MB)|
The instance counts correlate almost exactly with the 22M stuck messages in
`ecp.endpoint.send.event`. These four object types alone consumed approximately
2.8 GB of heap.
All other objects (AMQPStandardMessage, MessageReferenceImpl, etc.) had normal
counts (~130K instances), consistent with the actively processed queues.
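As a sanity check on the ~2.8 GB figure, the per-message heap overhead implied by the histogram can be computed directly. The byte counts below are copied from the table above; attributing all instances of these four classes to the stuck queue is an approximation, since a small share belongs to the actively processed queues:

```python
# Byte totals of the four dominant classes, taken from the PROD histogram above.
totals = {
    "PageTransactionInfoImpl": 1_062_352_320,
    "ConcurrentHashMap$Node": 709_134_976,
    "JournalRecord": 534_165_360,
    "Long": 531_182_616,
}
stuck_messages = 22_240_016  # message count of ecp.endpoint.send.event

total_bytes = sum(totals.values())
print(f"total: {total_bytes / 1e9:.2f} GB")                            # 2.84 GB
print(f"per stuck message: {total_bytes / stuck_messages:.0f} bytes")  # 128 bytes
```

So each unconsumed paged message pins roughly 128 bytes of bookkeeping on the heap, which scales to the observed ~2.8 GB at 22M messages.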
h3. Resolution
We purged the 22M messages from the queue using `removeMessages` with a low
flushLimit. The heap usage dropped significantly after the purge. We also
disabled the Send Handler to prevent re-accumulation.
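For reference, the purge can be scripted against the broker's Jolokia endpoint (bundled with the Artemis web console) by invoking QueueControl.removeMessages(flushLimit, filter). The sketch below only builds the request body; the broker name, address, and routing type in the MBean object name are assumptions that must match your broker.xml:

```python
import json

def remove_messages_request(broker, address, queue,
                            flush_limit=100, filter_expr=""):
    """Build a Jolokia 'exec' payload for QueueControl.removeMessages.

    The object-name layout matches recent Artemis versions; broker name,
    address, and routing-type are placeholders for your actual configuration.
    """
    mbean = (f'org.apache.activemq.artemis:broker="{broker}",'
             f'component=addresses,address="{address}",'
             f'subcomponent=queues,routing-type="anycast",queue="{queue}"')
    return json.dumps({
        "type": "exec",
        "mbean": mbean,
        "operation": "removeMessages(int,java.lang.String)",
        "arguments": [flush_limit, filter_expr],  # empty filter = all messages
    })

payload = remove_messages_request("0.0.0.0", "ecp.endpoint.send.event",
                                  "ecp.endpoint.send.event", flush_limit=100)
```

The payload would then be POSTed to the console's Jolokia URL (e.g. http://localhost:8161/console/jolokia) with the console credentials. A low flushLimit keeps each batch small, which is what made the purge feasible at 22M messages.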
h3. Reproduction attempt (ACCE environment)
We attempted to reproduce this on a test broker with the same Artemis version
and identical broker.xml configuration, but with -Xmx 1G. We sent >10M messages
to the same queue (no consumer). However, the heap histogram showed a very
different picture:
||Class||PROD (22M msgs)||ACCE (10M+ msgs)||
|PageTransactionInfoImpl|22,132,340|188,264|
|JournalRecord|22,256,890|315,352|
|MessageReferenceImpl|124,518|127,035|
Despite having millions of paged messages, the ACCE broker only held ~188K
PageTransactionInfoImpl objects in heap (vs. 22M in PROD). JVM usage stayed
stable around 50%.
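The discrepancy can be quantified as a retention ratio, i.e. the fraction of paged messages that still hold a PageTransactionInfoImpl on the heap. Note that the ACCE message count is only known as ">10M", so 10M is used as a lower bound there:

```python
# Fraction of paged messages still holding a PageTransactionInfoImpl on the heap.
prod_msgs, prod_ptx = 22_240_016, 22_132_340
acce_msgs, acce_ptx = 10_000_000, 188_264  # >10M sent, so this ratio is an upper bound

prod_ratio = prod_ptx / prod_msgs  # ~99.5%: essentially one object per message
acce_ratio = acce_ptx / acce_msgs  # <=1.9%: almost all records were cleaned up
print(f"PROD: {prod_ratio:.1%}, ACCE: <= {acce_ratio:.1%}")
```

PROD retains roughly one record per message while ACCE retains under 2%, which is what motivates the compaction hypothesis in question 2 below.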
h3. Questions for the community
1. Can someone confirm that Artemis keeps a PageTransactionInfoImpl and a
JournalRecord on the heap for each paged message for as long as the message is
not consumed/acknowledged? Is this by design?
2. Why is there such a large discrepancy between PROD and ACCE? Both have the
same broker configuration, and both had millions of paged messages with 0
consumers. Our hypothesis is that the long-running production environment (1.5
years of continuous message flow across other queues) leads to journal
fragmentation/accumulation that prevents journal compaction from cleaning up
the PageTransactionInfoImpl records, whereas in the short-lived test scenario
compaction works efficiently. Is this plausible?
3. Shouldn't the paging mechanism prevent exactly this scenario, i.e. the heap
filling up because of a large number of messages?
Thanks in advance for any insights.
Best regards
Frederik
> OOME caused by accumulation of PageTransactionInfoImpl and JournalRecord
> objects in paged queue with no consumer
> ----------------------------------------------------------------------------------------------------------------
>
> Key: ARTEMIS-5907
> URL: https://issues.apache.org/jira/browse/ARTEMIS-5907
> Project: Artemis
> Issue Type: Bug
> Affects Versions: 2.39.0
> Reporter: Frederik Fournes
> Priority: Minor
--
This message was sent by Atlassian Jira
(v8.20.10#820010)