Re: backup not activated after OOM on primary

Justin Bertram Wed, 01 Jan 2025 17:42:53 -0800

The caveat with adding -XX:+ExitOnOutOfMemoryError is that the JVM will
exit when an OOME occurs rather than simply carrying on. Keep in mind
that an OOME isn't necessarily a death sentence in and of itself. It is
technically possible (although unlikely) for the broker to recover.
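Here is a minimal, hypothetical Java demo of that point. It is not Artemis code, and it assumes a HotSpot JVM started without -XX:+ExitOnOutOfMemoryError. An OutOfMemoryError is an ordinary Throwable, so a thread that catches it can keep running and allocate again. The class and method names are invented for illustration. A one-off oversized request is also a much kinder case than a broker genuinely running out of heap under load.

```java
// Hypothetical demo, not Artemis code. Run it WITHOUT -XX:+ExitOnOutOfMemoryError.
class OomeRecoveryDemo {

    /**
     * Requests an array larger than the JVM allows, catches the resulting
     * OutOfMemoryError, then allocates again to show the thread carried on.
     */
    static boolean recoversFromOome() {
        boolean caught = false;
        try {
            // On HotSpot, Integer.MAX_VALUE elements is over the maximum array
            // length, so this request fails with an OutOfMemoryError.
            long[] impossible = new long[Integer.MAX_VALUE];
            System.out.println("unexpectedly allocated " + impossible.length + " longs");
        } catch (OutOfMemoryError e) {
            // By default the error is just a Throwable; nothing forces the JVM to exit.
            caught = true;
            System.out.println("caught: " + e);
        }
        // Ordinary allocation still works afterwards.
        long[] small = new long[1024];
        return caught && small.length == 1024;
    }

    public static void main(String[] args) {
        System.out.println("recovered: " + recoversFromOome());
    }
}
```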


I believe it's not in the default artemis.profile because the flag was
essentially brand new when Artemis 2.0 was released. Adding it since then
would change the default behavior, which isn't possible in a minor release.


Justin

On Wed, Jan 1, 2025 at 4:32 PM Vilius Šumskas <vilius.sums...@rivile.lt>
wrote:

> Are there any caveats to adding +ExitOnOutOfMemoryError? I'm just wondering
> why it's not in the default JAVA_ARGS in "artemis.profile".
>
> --
>     Vilius
>
> -----Original Message-----
> From: Justin Bertram <jbert...@apache.org>
> Sent: Wednesday, January 1, 2025 11:01 PM
> To: users@activemq.apache.org
> Subject: Re: backup not activated after OOM on primary
>
> I think what you're seeing is expected. An OOME usually isn't enough to
> make the broker fail completely and trigger a failover. As you can
> see, the broker continued to run after the OOME which means it was still
> holding the lock on the shared journal (preventing the backup from
> activating). If you want to ensure the broker fails over in this situation
> you should pass this to the JVM:
>
>   -XX:+ExitOnOutOfMemoryError
>
> This will ensure the JVM stops when an OOME occurs, which will then allow
> the backup to activate.
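To make the failover reasoning concrete, here is a toy Java model. It is an illustration only and not how Artemis implements its shared-store lock. A ReentrantLock stands in for the primary's lock on the shared journal, and releasing that lock stands in for the JVM exiting. The names SharedStoreToyModel and backupActivates are hypothetical.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Toy model only: a ReentrantLock stands in for the primary's lock on the shared
// journal, and unlocking on OOME stands in for the JVM exiting.
class SharedStoreToyModel {

    /** Returns true if the backup can take the lock after the primary hits an OOME. */
    static boolean backupActivates(boolean exitOnOome) {
        ReentrantLock journalLock = new ReentrantLock();

        Thread primary = new Thread(() -> {
            journalLock.lock(); // the primary is live and owns the shared store
            try {
                throw new OutOfMemoryError("simulated OOME");
            } catch (OutOfMemoryError e) {
                if (exitOnOome) {
                    // Stand-in for -XX:+ExitOnOutOfMemoryError: process death frees the lock.
                    journalLock.unlock();
                }
                // Otherwise the primary "carries on". A ReentrantLock is not released
                // when its owner thread ends, so the lock stays held.
            }
        });

        try {
            primary.start();
            primary.join();
            // Backup side: try to take the lock, but give up after a short wait.
            boolean acquired = journalLock.tryLock(100, TimeUnit.MILLISECONDS);
            if (acquired) {
                journalLock.unlock();
            }
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("default JVM behaviour -> backup activates: " + backupActivates(false));
        System.out.println("exit on OOME          -> backup activates: " + backupActivates(true));
    }
}
```

With exitOnOome set to false, the backup's tryLock times out because the primary still owns the lock, which mirrors the broker that kept running after the OOME. With true, the backup acquires it.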
>
> It might also be worth passing this:
>
>   -XX:+HeapDumpOnOutOfMemoryError
>
> This will allow you to do some post-mortem analysis and see exactly why
> the OOME occurred.
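One way to confirm that both flags actually reached the JVM is to ask HotSpot directly. This is a sketch that assumes a HotSpot-based JDK, because it uses com.sun.management.HotSpotDiagnosticMXBean. It only inspects the JVM it runs in. The class name OomeFlagCheck is hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Sketch: ask the running HotSpot JVM whether the OOME-related flags are enabled.
class OomeFlagCheck {

    /** Returns the current value of a boolean HotSpot -XX flag in this JVM. */
    static boolean isFlagOn(String flagName) {
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // getVMOption throws IllegalArgumentException if no such flag exists.
        return Boolean.parseBoolean(hotspot.getVMOption(flagName).getValue());
    }

    public static void main(String[] args) {
        for (String flag : new String[] {"ExitOnOutOfMemoryError", "HeapDumpOnOutOfMemoryError"}) {
            System.out.println(flag + " = " + isFlagOn(flag));
        }
    }
}
```

For a broker that is already running, `jcmd <pid> VM.flags` from the JDK reports the same information without any code.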
>
>
> Justin
>
> On Wed, Jan 1, 2025 at 12:17 PM Vilius Šumskas <vilius.sums...@rivile.lt>
> wrote:
>
> > Hi,
> >
> > we had an incident where our applications sent too much traffic to
> > the Artemis broker, and the broker hit Java heap OutOfMemoryErrors. I'm
> > trying to understand why the backup broker never became primary after
> > this happened.
> > We run a static Artemis cluster with two nodes (primary and backup)
> > using shared storage. The configuration is pretty straightforward
> > (primary first, then backup):
> >       <ha-policy>
> >          <shared-store>
> >             <primary>
> >                <failover-on-shutdown>true</failover-on-shutdown>
> >             </primary>
> >          </shared-store>
> >       </ha-policy>
> >
> >       <ha-policy>
> >          <shared-store>
> >             <backup>
> >                <failover-on-shutdown>true</failover-on-shutdown>
> >             </backup>
> >          </shared-store>
> >       </ha-policy>
> >
> > The backup always becomes primary if we reboot the primary during
> > maintenance. We also tested our HA configuration in various other ways,
> > like disabling network connections, killing the storage mount point,
> > etc., so I'm positive the configuration is correct.
> >
> > Primary logs during that time:
> > https://p.defau.lt/?cNVdPPEMN2qM8XbLZomKdQ
> > Backup logs during that time:
> > https://p.defau.lt/?nJvfQKdc4rrUGC9lL7_JCA
> >
> > The OOM happened at ~2:51, and the primary stayed in this state until I
> > restarted it at ~13:25.
> >
> > Any pointers are much appreciated!
> >
> > --
> >    Best Regards,
> >
> >     Vilius Šumskas
> >     Rivile
> >     IT manager
> >
> >
>
