I can't really say whether your proposed configuration is "wise" or not since I don't know all the details of your use-case. However, I can say a few things conclusively...

First, global-max-size is measured in bytes, so 4096 will only give you 4KB of space. If you actually want 4 gigabytes then use "4G", "4096MB", etc.

Second, since the default global-max-size is calculated as 1/2 of Xmx, an Xmx of 8G would give you a global-max-size of 4G by default, so explicitly configuring global-max-size to the same value would be functionally redundant. Since that default calculation wasn't sufficient to mitigate OOME previously, I suspect it probably wouldn't be sufficient to prevent OOME in the future if your load gets large enough.

As I indicated before, you'll probably need to monitor your broker under load and adjust global-max-size if you want to find an optimal value. Typically this kind of tuning is done in a pre-production environment with a workload equivalent to what you'd experience in production. If you just want a safe, "ballpark" estimate then I'd recommend starting with 1/4 of Xmx. Therefore, if your Xmx were 8G then use <global-max-size>2G</global-max-size>.
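To make that concrete, here's a rough sketch of the two settings involved, using the values discussed above. The file locations are the defaults under a broker instance's etc/ directory, the surrounding JAVA_ARGS flags are elided, and the numbers are just a starting point, so adjust everything to your own environment:

  # etc/artemis.profile -- JVM heap for the broker process
  JAVA_ARGS="... -Xms1024M -Xmx8G ..."

  <!-- etc/broker.xml, as a direct child of the <core> element -->
  <!-- accepts plain bytes or K/M/G suffixes, e.g. 2G -->
  <global-max-size>2G</global-max-size>

With an 8G heap and a 2G global-max-size the broker should start paging well before the heap is exhausted; whether 2G is actually the right number for your workload is something only load testing will tell you.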
Justin

On Wed, Jan 20, 2021 at 6:04 PM Ben Warrick <horseatingwe...@gmail.com> wrote:

> I see. With 16G on the machine, is this a wise setup?
>
> -Xms1024M -Xmx8G
>
> <global-max-size>4096</global-max-size>
>
> On Wed, Jan 20, 2021 at 1:28 PM Justin Bertram <jbert...@apache.org> wrote:
>
> > > I don't have max-size-bytes or global-max-size set. However, according to the comments in the broker.xml file, global-max-size defaults to half of -Xmx, so shouldn't Artemis start using the page file after it uses 1G?
> >
> > Yes. However, given a few recent issues like ARTEMIS-3021 [1] and ARTEMIS-3061 [2] using half of Xmx may not be sufficiently conservative to mitigate OOME.
> >
> > I recommend monitoring your broker(s) and explicitly setting max-size-bytes and/or global-max-size for your use-case.
> >
> > Justin
> >
> > [1] https://issues.apache.org/jira/browse/ARTEMIS-3021
> > [2] https://issues.apache.org/jira/browse/ARTEMIS-3061
> >
> > On Wed, Jan 20, 2021 at 12:15 PM Ben Warrick <horseatingwe...@gmail.com> wrote:
> >
> > > Hi Justin,
> > >
> > > There is 16GB of memory on the VM. And I have JVM memory set to -Xms512M -Xmx2G in the artemis.profile file, so I can increase those.
> > >
> > > I don't have max-size-bytes or global-max-size set. However, according to the comments in the broker.xml file, global-max-size defaults to half of -Xmx, so shouldn't Artemis start using the page file after it uses 1G?
> > >
> > > I've been tracking heap usage, and it did spike a few times when the problem started, so you're probably right about this.
> > >
> > > Ben
> > >
> > > On Wed, Jan 20, 2021 at 12:03 PM Justin Bertram <jbert...@apache.org> wrote:
> > >
> > > > > java.lang.OutOfMemoryError: Java heap space
> > > >
> > > > My guess is that you were approaching your heap limit which caused the preceding issues. Once you hit an OutOfMemoryError then in my experience all bets are off so to speak and you need to restart the broker.
> > > >
> > > > It's not clear why you couldn't connect to the slave.
> > > >
> > > > I recommend you take a look at your global-max-size as well as your max-size-bytes configuration as these are designed to help prevent OOME when receiving lots of messages.
> > > >
> > > > Justin
> > > >
> > > > On Wed, Jan 20, 2021 at 9:12 AM Ben Warrick <horseatingwe...@gmail.com> wrote:
> > > >
> > > > > I had an incident last night. I can share the actual log files with anyone interested in seeing the full picture. Below is a summary. I also have OS resource tracking data.
> > > > >
> > > > > I have a pair of Artemis servers using a shared store -- an NFS mounted folder. My use case is enterprise integration. I use artemis to pass messages between interfaces. For example, the website publishes an order to an address which has several queues. All the interfaces that need that message subscribe to their queue.
> > > > >
> > > > > It looks to me like an issue with the page file. Is the page file in this case mean virtual memory, or is it talking about the paging folder inside the data folder?
> > > > >
> > > > > This morning I could not bring up the console, but I was able to send and receive messages using JMS. I tried restarting the primary -- which did not help the console, but after that I could no longer get a JMS connection. After shutting down the primary, the secondary did not take over. From the logs it looked like it did, the objects were deployed, but the whole Artemis tab disappeared from the console and I was not able to make a JMS connection.
> > > > >
> > > > > So then I shut both down. Waited a few minutes, and started the primary. It came up fine. So then I started the secondary, waited a little while, and shut down the primary, just as a test. This time the secondary took over quickly. I restarted the primary and it happily took control back.
> > > > >
> > > > > Any advice would be much appreciated.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Ben
> > > > >
> > > > >
> > > > > *** Log Summary ***
> > > > >
> > > > > 2021-01-19 14:58:51 UTC this appears in the log:
> > > > >
> > > > > ERROR [org.apache.activemq.artemis.core.journal.impl.JournalImpl] appendRollbackRecord:java.lang.IllegalStateException: Cannot find tx with id 33193877: java.lang.IllegalStateException: Cannot find tx with id 33193877
> > > > >
> > > > > Along with some Java dump, directly followed by a few of these:
> > > > >
> > > > > WARN [org.apache.activemq.artemis.core.server] AMQ222104: IO Error completing the transaction, code = 6, message = Cannot find tx with id 33193877
> > > > >
> > > > > 2021-01-19 15:41:10 The configuration reloads and the objects all redeploy. But then:
> > > > >
> > > > > 2021-01-19 15:52:51 ERROR [org.apache.activemq.artemis.core.journal.impl.JournalImpl] appendRollbackRecord:java.lang.IllegalStateException: Cannot find tx with id 33318276: java.lang.IllegalStateException: Cannot find tx with id 33318276
> > > > >
> > > > > Then several
> > > > >
> > > > > 2021-01-19 15:52:51,666 WARN [org.apache.activemq.artemis.core.server] AMQ222104: IO Error completing the transaction, code = 6, message = Cannot find tx with id 33318276
> > > > >
> > > > > And a few minutes later I see about 1 billion of these:
> > > > >
> > > > > 2021-01-19 16:01:21,164 WARN [org.apache.activemq.artemis.core.server] AMQ222033: Page file 000000008.page had incomplete records at position 8,081,827 at record number 6,504
> > > > > 2021-01-19 16:01:21,165 WARN [org.apache.activemq.artemis.core.server] AMQ222288: Page 8, message 6,504 could not be found on offset 8,081,827, with starting message 6,504. This represents a logic error or inconsistency on the data, and the system will try once again from the beggining of the page file.
> > > > >
> > > > > These end the following day with several of these:
> > > > >
> > > > > 2021-01-20 03:10:22,764 WARN [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure to /13.68.117.113:42767 has been detected: AMQ229014: Did not receive data from /{IP}:{port} within the 60,000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> > > > >
> > > > > For several different IP addresses and ports. Then I see both of these two errors happening together, until I get a nice
> > > > >
> > > > > Java heap space: java.lang.OutOfMemoryError: Java heap space
> > > > >
> > > > > Followed by a lot of interesting looking Java dump. A bunch of these:
> > > > >
> > > > > ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout (10 seconds) on acceptor "artemis" during protocol handshake with /52.179.166.52:61538 has occurred.
> > > > >
> > > > > Then a bunch of handshake errors
> > > > >
> > > > > ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout (10 seconds) on acceptor "artemis" during protocol handshake with /52.179.166.52:5972 has occurred.
> > > > >
> > > > > A bunch more of these:
> > > > >
> > > > > ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout (10 seconds) on acceptor "artemis" during protocol handshake with /{IP}.52:22481 has occurred.
> > > > >
> > > > > This one appears and looks useful maybe:
> > > > >
> > > > > 2021-01-20 03:30:48,627 ERROR [io.netty.util.concurrent.DefaultPromise.rejectedExecution] Failed to submit a listener notification task. Event loop shut down?: java.util.concurrent.RejectedExecutionException: event executor terminated
> > > > >
> > > > > More heap space complaints and java dump... Then numerous of these:
> > > > >
> > > > > WARN [io.netty.channel.AbstractChannel] Force-closing a channel whose registration task was not accepted by an event loop: [id: 0x6be932f5, L:/10.40.50.17:61616 - R:/52.179.166.52:45604]: java.util.concurrent.RejectedExecutionException: event executor terminated
> > > > >
> > > > > And that's the state where I found it this morning.