Hello,

In addition to some previous issues [1], I have further problems with my Artemis
cluster. My setup [2] is a symmetric two-node cluster of colocated instances
with scale-down. Besides the node restart leaving replication in a problematic
state [1], there are other issues, namely:

1) After running for approximately two weeks, one of the nodes crashed due to
heap space exhaustion. Heap dump analysis suggests that the cluster connection
failed and millions of messages accumulated in the internal store-and-forward
queue, eventually causing an OOM exception - I guess the internal messages are
not paged?

2) I have now run the cluster for ~2 weeks and it has ended up in a state where
messages are being redistributed from node 1 to node 2 BUT not the other way
around. This may be the same issue as 1), but I cannot tell for sure. I tried
setting the core server logging level to DEBUG on node 2 and sending messages
to a test topic, but I see no references to the address name in the Artemis
logs.

I realize that these problems are difficult to address given the information
at hand and the circumstances in which they occur: they (excl. the issue
described in [1]) only start to appear after the cluster has been running for a
long time, and there is no apparent cause or easy way to reproduce them. I
would, however, appreciate any tips for debugging these issues further, or
advice on where to look for a probable cause.

- Ilkka

[1] Backup voting issue: 
http://activemq.2283324.n4.nabble.com/Artemis-2-5-0-Problems-with-colocated-scaledown-td4737583.html#a4737808
[2] Sample brokers: 
https://github.com/ilkkavi/activemq-artemis/tree/scaledown-issue/issues/IssueExample/src/main/resources/activemq
