On Wed, May 2, 2018 at 3:01 AM, Ilkka Virolainen <ilkka.virolai...@bitwise.fi> wrote:
> Hello,
>
> As well as some previous issues [1], I have some problems with my Artemis
> cluster. My setup [2] is a symmetric two-node cluster of colocated instances
> with scaledown. Besides the node restart leaving replication in a problematic
> state [1], there are other issues, namely:
>
> 1) After running for approximately two weeks, one of the nodes crashed due to
> heap space exhaustion. Heap dump analysis indicates that the cluster
> connection failed and millions of messages ended up in the internal
> store-and-forward queue, eventually causing an OOM exception - I guess the
> internal messages are not paged?

You can configure it to page. Also, on the cluster connection you can
configure the maximum number of reconnect attempts. I'm not talking about
replication here; this is probably about another node that is still connected.
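For example, something along these lines in broker.xml would turn on paging
and cap the reconnect attempts. The names and values here are only
illustrative, not taken from your sample brokers:

    <!-- Page to disk once an address holds ~100 MB in memory instead of
         keeping everything on the heap; the wildcard match also covers the
         internal store-and-forward addresses. -->
    <address-settings>
       <address-setting match="#">
          <max-size-bytes>104857600</max-size-bytes>
          <page-size-bytes>10485760</page-size-bytes>
          <address-full-policy>PAGE</address-full-policy>
       </address-setting>
    </address-settings>

    <!-- Give up on a dead cluster connection after 10 attempts instead of
         retrying forever (-1); the name and connector-ref are illustrative. -->
    <cluster-connections>
       <cluster-connection name="my-cluster">
          <connector-ref>artemis</connector-ref>
          <retry-interval>500</retry-interval>
          <retry-interval-multiplier>1.5</retry-interval-multiplier>
          <max-retry-interval>5000</max-retry-interval>
          <reconnect-attempts>10</reconnect-attempts>
          <message-load-balancing>ON_DEMAND</message-load-balancing>
          <max-hops>1</max-hops>
          <discovery-group-ref discovery-group-name="dg-group1"/>
       </cluster-connection>
    </cluster-connections>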
> 2) I have now run the cluster for ~2 weeks, and it has ended up in a state
> where messages are being redistributed from node 1 to node 2 BUT not the
> other way around. This may be the same issue as 1), but I cannot tell for
> sure. I tried setting the core server logging level to DEBUG on node 2 and
> sending messages to a test topic, but I get no references to the address name
> in the Artemis logs.

Check what I said above about reconnects on the cluster connection. If you
were using master, there is a way to consume the messages from the internal
queue and send them on manually with a producer/consumer - roughly along the
lines of the sketch at the end of this mail. You would need to get a snapshot
build from master.

> I realize that it's difficult to address these problems given the information
> at hand and the circumstances in which they occur: they (excl. the issue
> described in [1]) start to appear after running a cluster for a long time,
> and there is no apparent cause and no easy way to reproduce them. I would,
> however, appreciate it if anyone has tips for debugging this further or
> advice on where to look for a probable cause.
>
> - Ilkka
>
> [1] Backup voting issue:
> http://activemq.2283324.n4.nabble.com/Artemis-2-5-0-Problems-with-colocated-scaledown-td4737583.html#a4737808
> [2] Sample brokers:
> https://github.com/ilkkavi/activemq-artemis/tree/scaledown-issue/issues/IssueExample/src/main/resources/activemq

--
Clebert Suconic
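A rough, untested sketch of that drain-and-resend idea with the core client,
assuming a build that lets you consume the internal queue directly (hence the
master snapshot). The broker URL, the credentials-free session, the queue name
and the _AMQ_ROUTE_TO header name are assumptions; check the real
store-and-forward queue name (typically
$.artemis.internal.sf.<cluster-name>.<remote-node-id>) via the management
console first:

    import org.apache.activemq.artemis.api.core.client.ActiveMQClient;
    import org.apache.activemq.artemis.api.core.client.ClientConsumer;
    import org.apache.activemq.artemis.api.core.client.ClientMessage;
    import org.apache.activemq.artemis.api.core.client.ClientProducer;
    import org.apache.activemq.artemis.api.core.client.ClientSession;
    import org.apache.activemq.artemis.api.core.client.ClientSessionFactory;
    import org.apache.activemq.artemis.api.core.client.ServerLocator;

    public class DrainStoreAndForwardQueue {
       public static void main(String[] args) throws Exception {
          // Assumed name; the real one follows
          // $.artemis.internal.sf.<cluster-name>.<remote-node-id>.
          String snfQueue = "$.artemis.internal.sf.my-cluster.<node-id>";

          ServerLocator locator =
                ActiveMQClient.createServerLocator("tcp://localhost:61616");
          try (ClientSessionFactory factory = locator.createSessionFactory();
               // fully transacted session: nothing is lost if a resend fails
               ClientSession session = factory.createSession(false, false)) {
             ClientConsumer consumer = session.createConsumer(snfQueue);
             ClientProducer producer = session.createProducer(); // anonymous
             session.start();

             ClientMessage msg;
             while ((msg = consumer.receive(1000)) != null) {
                // strip the internal routing header so the broker routes the
                // message normally again (header name is an assumption)
                msg.removeProperty("_AMQ_ROUTE_TO");
                // the message still carries its original address
                producer.send(msg.getAddress(), msg);
                msg.acknowledge();
             }
             session.commit();
          } finally {
             locator.close();
          }
       }
    }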