> java.lang.OutOfMemoryError: Java heap space

My guess is that you were approaching your heap limit, which caused the
preceding issues. Once you hit an OutOfMemoryError then, in my experience,
all bets are off so to speak and you need to restart the broker.

It's not clear why you couldn't connect to the slave. I recommend you take
a look at your global-max-size as well as your max-size-bytes
configuration, as these are designed to help prevent OOME when receiving
lots of messages.
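For reference, here is a minimal broker.xml sketch of where those two
settings live, assuming the PAGE address-full-policy. The 100Mb/10Mb values
and the catch-all match are only illustrative placeholders, not a
recommendation for your workload:

    <core xmlns="urn:activemq:core">

      <!-- estimate of the total memory all in-flight messages may use,
           across every address, before the address-full-policy is applied
           (100Mb is an illustrative value only) -->
      <global-max-size>100Mb</global-max-size>

      <address-settings>
        <!-- "#" matches every address; narrow this to suit your layout -->
        <address-setting match="#">
          <!-- per-address memory estimate (illustrative value only) -->
          <max-size-bytes>10Mb</max-size-bytes>
          <!-- PAGE writes overflow messages to the paging directory on disk
               instead of keeping them all in the heap -->
          <address-full-policy>PAGE</address-full-policy>
        </address-setting>
      </address-settings>

    </core>

Note that if global-max-size isn't set explicitly the broker defaults it to
half of the JVM's -Xmx, so it's worth checking what that works out to on
your machines.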
Justin

On Wed, Jan 20, 2021 at 9:12 AM Ben Warrick <horseatingwe...@gmail.com> wrote:

> I had an incident last night. I can share the actual log files with
> anyone interested in seeing the full picture. Below is a summary. I also
> have OS resource tracking data.
>
> I have a pair of Artemis servers using a shared store -- an NFS-mounted
> folder. My use case is enterprise integration. I use Artemis to pass
> messages between interfaces. For example, the website publishes an order
> to an address which has several queues. All the interfaces that need that
> message subscribe to their queue.
>
> It looks to me like an issue with the page file. Does the page file in
> this case mean virtual memory, or is it talking about the paging folder
> inside the data folder?
>
> This morning I could not bring up the console, but I was able to send and
> receive messages using JMS. I tried restarting the primary -- which did
> not help the console, but after that I could no longer get a JMS
> connection. After shutting down the primary, the secondary did not take
> over. From the logs it looked like it did, the objects were deployed, but
> the whole Artemis tab disappeared from the console and I was not able to
> make a JMS connection.
>
> So then I shut both down, waited a few minutes, and started the primary.
> It came up fine. Then I started the secondary, waited a little while, and
> shut down the primary, just as a test. This time the secondary took over
> quickly. I restarted the primary and it happily took control back.
>
> Any advice would be much appreciated.
>
> Thanks!
>
> Ben
>
>
> *** Log Summary ***
>
> 2021-01-19 14:58:51 UTC this appears in the log:
>
> ERROR [org.apache.activemq.artemis.core.journal.impl.JournalImpl]
> appendRollbackRecord:java.lang.IllegalStateException: Cannot find tx with
> id 33193877: java.lang.IllegalStateException: Cannot find tx with id
> 33193877
>
> Along with some Java dump, directly followed by a few of these:
>
> WARN [org.apache.activemq.artemis.core.server] AMQ222104: IO Error
> completing the transaction, code = 6, message = Cannot find tx with id
> 33193877
>
> 2021-01-19 15:41:10 The configuration reloads and the objects all
> redeploy. But then:
>
> 2021-01-19 15:52:51 ERROR
> [org.apache.activemq.artemis.core.journal.impl.JournalImpl]
> appendRollbackRecord:java.lang.IllegalStateException: Cannot find tx with
> id 33318276: java.lang.IllegalStateException: Cannot find tx with id
> 33318276
>
> Then several of these:
>
> 2021-01-19 15:52:51,666 WARN [org.apache.activemq.artemis.core.server]
> AMQ222104: IO Error completing the transaction, code = 6, message = Cannot
> find tx with id 33318276
>
> And a few minutes later I see about 1 billion of these:
>
> 2021-01-19 16:01:21,164 WARN [org.apache.activemq.artemis.core.server]
> AMQ222033: Page file 000000008.page had incomplete records at position
> 8,081,827 at record number 6,504
> 2021-01-19 16:01:21,165 WARN [org.apache.activemq.artemis.core.server]
> AMQ222288: Page 8, message 6,504 could not be found on offset 8,081,827,
> with starting message 6,504. This represents a logic error or inconsistency
> on the data, and the system will try once again from the beggining of the
> page file.
>
> These end the following day with several of these:
>
> 2021-01-20 03:10:22,764 WARN [org.apache.activemq.artemis.core.client]
> AMQ212037: Connection failure to /13.68.117.113:42767 has been detected:
> AMQ229014: Did not receive data from /{IP}:{port} within the 60,000ms
> connection TTL. The connection will now be closed.
> [code=CONNECTION_TIMEDOUT]
>
> For several different IP addresses and ports. Then I see both of these
> errors happening together, until I get a nice
>
> Java heap space: java.lang.OutOfMemoryError: Java heap space
>
> Followed by a lot of interesting-looking Java dump. A bunch of these:
>
> ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout (10
> seconds) on acceptor "artemis" during protocol handshake with
> /52.179.166.52:61538 has occurred.
>
> Then a bunch of handshake errors:
>
> ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout (10
> seconds) on acceptor "artemis" during protocol handshake with
> /52.179.166.52:5972 has occurred.
>
> A bunch more of these:
>
> ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout (10
> seconds) on acceptor "artemis" during protocol handshake with
> /{IP}.52:22481 has occurred.
>
> This one appears and looks useful, maybe:
>
> 2021-01-20 03:30:48,627 ERROR
> [io.netty.util.concurrent.DefaultPromise.rejectedExecution] Failed to
> submit a listener notification task. Event loop shut down?:
> java.util.concurrent.RejectedExecutionException: event executor terminated
>
> More heap space complaints and Java dump... Then numerous of these:
>
> WARN [io.netty.channel.AbstractChannel] Force-closing a channel whose
> registration task was not accepted by an event loop: [id: 0x6be932f5,
> L:/10.40.50.17:61616 - R:/52.179.166.52:45604]:
> java.util.concurrent.RejectedExecutionException: event executor terminated
>
> And that's the state where I found it this morning.