I see. With 16G on the machine, is this a wise setup?

-Xms1024M -Xmx8G
<global-max-size>4096</global-max-size>
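(For context, a minimal sketch of where those two values live, using the numbers above purely as placeholders rather than a recommendation. Note that global-max-size is interpreted in bytes unless a unit suffix is used, so a bare 4096 would mean 4 KiB rather than 4 GiB; recent releases also accept suffixed values such as 4Gb, but verify that against your version.)

    # artemis.profile -- keep the rest of the existing JAVA_ARGS line as-is
    JAVA_ARGS="... -Xms1024M -Xmx8G ..."

    <!-- broker.xml, inside <core>: 4 GiB written out in plain bytes;
         if omitted, half of -Xmx is used, as noted below -->
    <global-max-size>4294967296</global-max-size>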
On Wed, Jan 20, 2021 at 1:28 PM Justin Bertram <jbert...@apache.org> wrote:

> > I don't have max-size-bytes or global-max-size set. However, according to the comments in the broker.xml file, global-max-size defaults to half of -Xmx, so shouldn't Artemis start using the page file after it uses 1G?
>
> Yes. However, given a few recent issues like ARTEMIS-3021 [1] and ARTEMIS-3061 [2], using half of Xmx may not be sufficiently conservative to mitigate OOME.
>
> I recommend monitoring your broker(s) and explicitly setting max-size-bytes and/or global-max-size for your use case.
>
>
> Justin
>
> [1] https://issues.apache.org/jira/browse/ARTEMIS-3021
> [2] https://issues.apache.org/jira/browse/ARTEMIS-3061
>
> On Wed, Jan 20, 2021 at 12:15 PM Ben Warrick <horseatingwe...@gmail.com> wrote:
>
> > Hi Justin,
> >
> > There is 16GB of memory on the VM, and I have JVM memory set to -Xms512M -Xmx2G in the artemis.profile file, so I can increase those.
> >
> > I don't have max-size-bytes or global-max-size set. However, according to the comments in the broker.xml file, global-max-size defaults to half of -Xmx, so shouldn't Artemis start using the page file after it uses 1G?
> >
> > I've been tracking heap usage, and it did spike a few times when the problem started, so you're probably right about this.
> >
> > Ben
> >
> > On Wed, Jan 20, 2021 at 12:03 PM Justin Bertram <jbert...@apache.org> wrote:
> >
> > > > java.lang.OutOfMemoryError: Java heap space
> > >
> > > My guess is that you were approaching your heap limit, which caused the preceding issues. Once you hit an OutOfMemoryError then, in my experience, all bets are off (so to speak) and you need to restart the broker.
> > >
> > > It's not clear why you couldn't connect to the slave.
> > >
> > > I recommend you take a look at your global-max-size as well as your max-size-bytes configuration, as these are designed to help prevent OOME when receiving lots of messages.
> > >
> > >
> > > Justin
> > >
> > > On Wed, Jan 20, 2021 at 9:12 AM Ben Warrick <horseatingwe...@gmail.com> wrote:
> > >
> > > > I had an incident last night. I can share the actual log files with anyone interested in seeing the full picture. Below is a summary. I also have OS resource-tracking data.
> > > >
> > > > I have a pair of Artemis servers using a shared store -- an NFS-mounted folder. My use case is enterprise integration: I use Artemis to pass messages between interfaces. For example, the website publishes an order to an address which has several queues, and all the interfaces that need that message subscribe to their own queue.
> > > >
> > > > It looks to me like an issue with the page file. Does the page file in this case mean virtual memory, or is it the paging folder inside the data folder?
> > > >
> > > > This morning I could not bring up the console, but I was able to send and receive messages using JMS. I tried restarting the primary, which did not help the console, and after that I could no longer get a JMS connection. After shutting down the primary, the secondary did not take over. From the logs it looked like it had (the objects were deployed), but the whole Artemis tab disappeared from the console and I was not able to make a JMS connection.
> > > >
> > > > So then I shut both down,
> > > > waited a few minutes, and started the primary. It came up fine. Then I started the secondary, waited a little while, and shut down the primary, just as a test. This time the secondary took over quickly. I restarted the primary and it happily took control back.
> > > >
> > > > Any advice would be much appreciated.
> > > >
> > > > Thanks!
> > > >
> > > > Ben
> > > >
> > > > *** Log Summary ***
> > > >
> > > > 2021-01-19 14:58:51 UTC this appears in the log:
> > > >
> > > > ERROR [org.apache.activemq.artemis.core.journal.impl.JournalImpl] appendRollbackRecord:java.lang.IllegalStateException: Cannot find tx with id 33193877: java.lang.IllegalStateException: Cannot find tx with id 33193877
> > > >
> > > > Along with some Java dump, directly followed by a few of these:
> > > >
> > > > WARN [org.apache.activemq.artemis.core.server] AMQ222104: IO Error completing the transaction, code = 6, message = Cannot find tx with id 33193877
> > > >
> > > > 2021-01-19 15:41:10 The configuration reloads and the objects all redeploy. But then:
> > > >
> > > > 2021-01-19 15:52:51 ERROR [org.apache.activemq.artemis.core.journal.impl.JournalImpl] appendRollbackRecord:java.lang.IllegalStateException: Cannot find tx with id 33318276: java.lang.IllegalStateException: Cannot find tx with id 33318276
> > > >
> > > > Then several of these:
> > > >
> > > > 2021-01-19 15:52:51,666 WARN [org.apache.activemq.artemis.core.server] AMQ222104: IO Error completing the transaction, code = 6, message = Cannot find tx with id 33318276
> > > >
> > > > And a few minutes later I see about a billion of these:
> > > >
> > > > 2021-01-19 16:01:21,164 WARN [org.apache.activemq.artemis.core.server] AMQ222033: Page file 000000008.page had incomplete records at position 8,081,827 at record number 6,504
> > > > 2021-01-19 16:01:21,165 WARN [org.apache.activemq.artemis.core.server] AMQ222288: Page 8, message 6,504 could not be found on offset 8,081,827, with starting message 6,504. This represents a logic error or inconsistency on the data, and the system will try once again from the beggining of the page file.
> > > >
> > > > These end the following day with several of these, for several different IP addresses and ports:
> > > >
> > > > 2021-01-20 03:10:22,764 WARN [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure to /13.68.117.113:42767 has been detected: AMQ229014: Did not receive data from /{IP}:{port} within the 60,000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> > > >
> > > > Then I see both of these errors happening together, until I get a nice
> > > >
> > > > Java heap space: java.lang.OutOfMemoryError: Java heap space
> > > >
> > > > Followed by a lot of interesting-looking Java dump. A bunch of these:
> > > >
> > > > ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout (10 seconds) on acceptor "artemis" during protocol handshake with /52.179.166.52:61538 has occurred.
> > > >
> > > > Then a bunch of handshake errors:
> > > >
> > > > ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout (10 seconds) on acceptor "artemis" during protocol handshake with /52.179.166.52:5972 has occurred.
> > > > A bunch more of these:
> > > >
> > > > ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout (10 seconds) on acceptor "artemis" during protocol handshake with /{IP}.52:22481 has occurred.
> > > >
> > > > This one appears and looks like it might be useful:
> > > >
> > > > 2021-01-20 03:30:48,627 ERROR [io.netty.util.concurrent.DefaultPromise.rejectedExecution] Failed to submit a listener notification task. Event loop shut down?: java.util.concurrent.RejectedExecutionException: event executor terminated
> > > >
> > > > More heap space complaints and Java dump... Then numerous of these:
> > > >
> > > > WARN [io.netty.channel.AbstractChannel] Force-closing a channel whose registration task was not accepted by an event loop: [id: 0x6be932f5, L:/10.40.50.17:61616 - R:/52.179.166.52:45604]: java.util.concurrent.RejectedExecutionException: event executor terminated
> > > >
> > > > And that's the state where I found it this morning.
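For reference, the explicit per-address limits recommended above live in the address-settings section of broker.xml. A minimal sketch with purely illustrative values follows; note that the page files named in the AMQ222033/AMQ222288 warnings (e.g. 000000008.page) are the broker's own paging files under its paging directory (data/paging by default), not OS virtual memory:

<address-settings>
   <address-setting match="#">
      <!-- start paging this address to disk once it holds roughly 100 MiB
           of messages in memory, writing page files of about 10 MiB each -->
      <max-size-bytes>104857600</max-size-bytes>
      <page-size-bytes>10485760</page-size-bytes>
      <address-full-policy>PAGE</address-full-policy>
   </address-setting>
</address-settings>

With address-full-policy set to PAGE, messages beyond the limit are spilled to those page files on disk rather than held in the heap, which is how max-size-bytes and global-max-size are meant to keep the broker clear of OOME.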