I'm not aware of any broker-specific dangers of adding +ExitOnOutOfMemoryError.
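
For reference, here is a minimal sketch of how the two flags discussed in this thread might be appended to JAVA_ARGS in artemis.profile. It assumes a Linux broker instance where the profile is sourced as a shell script; the exact file location and the existing contents of JAVA_ARGS depend on your installation and are not shown here.

```shell
# Hypothetical excerpt from the broker instance's etc/artemis.profile.
# Keep whatever JAVA_ARGS your profile already defines; this only appends to it.
JAVA_ARGS="${JAVA_ARGS:-}"

# Stop the JVM when an OutOfMemoryError occurs, so the primary no longer
# holds the lock on the shared journal and the backup can activate.
JAVA_ARGS="$JAVA_ARGS -XX:+ExitOnOutOfMemoryError"

# Optional: write a heap dump when the OOME occurs, for post-mortem analysis.
JAVA_ARGS="$JAVA_ARGS -XX:+HeapDumpOnOutOfMemoryError"
```

The broker has to be restarted for the changed JVM arguments to take effect.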

It's worth noting that the broker is written knowing that at any point the
JVM or OS could crash or there could be a hardware failure of some kind.
This is, in part, why transactions are implemented a particular way, why we
only acknowledge durable messages once the data has been flushed to disk,
etc. In short, there _should_ never be any message/journal corruption. If
there were, it would be considered a bug and it would be fixed.


Justin

On Thu, Jan 2, 2025 at 12:21 AM Vilius Šumskas <vilius.sums...@rivile.lt>
wrote:

> I will rephrase my question. Are there any exact dangers in adding
> +ExitOnOutOfMemoryError? For example, message/journal corruption if the JVM
> shuts down abruptly because of an OOME in one part of the broker, but
> another part is still working?
>
> --
> Vilius
>
> -----Original Message-----
> From: Justin Bertram <jbert...@apache.org>
> Sent: Thursday, January 2, 2025 3:42 AM
> To: users@activemq.apache.org
> Subject: Re: backup not activated after OOM on primary
>
> The caveat with adding +ExitOnOutOfMemoryError is that the JVM will now
> exit when an OOME occurs rather than simply carrying on. Keep in mind
> that an OOME isn't necessarily a death sentence in and of itself. It is
> technically possible (although unlikely) for the broker to recover.
>
> I believe it's not in the default artemis.profile because it was
> essentially brand new when Artemis 2.0 was released, and it hasn't been
> possible to add it and change the default behavior in a minor release.
>
>
> Justin
>
> On Wed, Jan 1, 2025 at 4:32 PM Vilius Šumskas <vilius.sums...@rivile.lt>
> wrote:
>
> > Are there any caveats to adding +ExitOnOutOfMemoryError? Just wondering
> > why it's not in the default JAVA_ARGS in "artemis.profile".
> >
> > --
> > Vilius
> >
> > -----Original Message-----
> > From: Justin Bertram <jbert...@apache.org>
> > Sent: Wednesday, January 1, 2025 11:01 PM
> > To: users@activemq.apache.org
> > Subject: Re: backup not activated after OOM on primary
> >
> > I think what you're seeing is expected. An OOME usually isn't enough
> > to trigger the broker to fail completely and trigger a failover. As
> > you can see, the broker continued to run after the OOME which means it
> > was still holding the lock on the shared journal (preventing the
> > backup from activating). If you want to ensure the broker fails over
> > in this situation you should pass this to the JVM:
> >
> >   -XX:+ExitOnOutOfMemoryError
> >
> > This will ensure the JVM stops when an OOME occurs which will then
> > allow the backup to activate.
> >
> > It might also be worth passing this as well:
> >
> >   -XX:+HeapDumpOnOutOfMemoryError
> >
> > This will allow you to do some post-mortem analysis and see exactly
> > why the OOME occurred.
> >
> >
> > Justin
> >
> > On Wed, Jan 1, 2025 at 12:17 PM Vilius Šumskas
> > <vilius.sums...@rivile.lt> wrote:
> >
> > > Hi,
> > >
> > > we had an incident where our applications sent too much traffic to
> > > the Artemis broker and the broker got Java Heap Out Of Memory errors.
> > > I’m trying to understand why the backup broker never became primary
> > > after this happened.
> > > We run an Artemis static cluster with two nodes (primary and backup)
> > > under Shared Storage. Configuration is pretty straightforward:
> > >
> > > <ha-policy>
> > >    <shared-store>
> > >       <primary>
> > >          <failover-on-shutdown>true</failover-on-shutdown>
> > >       </primary>
> > >    </shared-store>
> > > </ha-policy>
> > >
> > > <ha-policy>
> > >    <shared-store>
> > >       <backup>
> > >          <failover-on-shutdown>true</failover-on-shutdown>
> > >       </backup>
> > >    </shared-store>
> > > </ha-policy>
> > >
> > > Backup always becomes primary if we reboot primary during maintenance.
> > > We also tested our HA configuration with various other tests, like
> > > disabling network connections, killing the storage mount point, etc.,
> > > so I’m positive the configuration should be correct.
> > >
> > > Primary logs during that time:
> > > https://p.defau.lt/?cNVdPPEMN2qM8XbLZomKdQ
> > > Backup logs during that time:
> > > https://p.defau.lt/?nJvfQKdc4rrUGC9lL7_JCA
> > >
> > > The OOM happened at ~2:51 and the primary stayed in this state until
> > > I restarted it at ~13:25.
> > >
> > > Any pointers are much appreciated!
> > >
> > > --
> > > Best Regards,
> > >
> > > Vilius Šumskas
> > > Rivile
> > > IT manager
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscr...@activemq.apache.org
> > For additional commands, e-mail: users-h...@activemq.apache.org
> > For further information, visit: https://activemq.apache.org/contact