I think what you're seeing is expected. An OOME usually isn't enough to
make the broker fail completely and trigger a failover. As you can see,
the broker continued to run after the OOME, which means it was still
holding the lock on the shared journal (preventing the backup from
activating). If you want to ensure the broker fails over in this situation
you should pass this to the JVM:

  -XX:+ExitOnOutOfMemoryError

This will ensure the JVM exits when an OOME occurs, which releases the
lock on the shared journal and allows the backup to activate.

It might also be worth passing this as well:

  -XX:+HeapDumpOnOutOfMemoryError

This will allow you to do some post-mortem analysis and see exactly why the
OOME occurred.
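
As a rough sketch of where these go: in a default broker instance the JVM
arguments are normally set through JAVA_ARGS in etc/artemis.profile. I'm
assuming that is how your broker is started (a service wrapper or container
image may set JVM options somewhere else), and the dump path in the comment
is just a hypothetical example:

```shell
# Sketch for <broker-instance>/etc/artemis.profile (assumed location).
# Append both OOME flags to whatever JAVA_ARGS already contains.
JAVA_ARGS="${JAVA_ARGS:-} -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError"

# Optional: choose where the heap dump is written. The path below is
# hypothetical; pick a location with enough free space for a full heap.
# JAVA_ARGS="$JAVA_ARGS -XX:HeapDumpPath=/var/lib/artemis/dumps"
```

The broker needs a restart to pick up the new arguments.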


Justin

On Wed, Jan 1, 2025 at 12:17 PM Vilius Šumskas <vilius.sums...@rivile.lt>
wrote:

> Hi,
>
> we had an incident where our applications sent too much traffic to the
> Artemis broker and the broker got Java Heap Out Of Memory errors. I’m
> trying to understand why the backup broker never became primary after this
> happened. We run an Artemis static cluster with two nodes (primary and
> backup) under Shared Storage. Configuration is pretty straightforward:
>       <ha-policy>
>          <shared-store>
>             <primary>
>                <failover-on-shutdown>true</failover-on-shutdown>
>             </primary>
>          </shared-store>
>       </ha-policy>
>
>       <ha-policy>
>          <shared-store>
>             <backup>
>                <failover-on-shutdown>true</failover-on-shutdown>
>             </backup>
>          </shared-store>
>       </ha-policy>
>
> The backup always becomes primary if we reboot the primary during
> maintenance. We also tested our HA configuration with various other tests,
> like disabling network connections, killing the storage mount point, etc.,
> so I’m positive the configuration should be correct.
>
> Primary logs during that time: https://p.defau.lt/?cNVdPPEMN2qM8XbLZomKdQ
> Backup logs during that time: https://p.defau.lt/?nJvfQKdc4rrUGC9lL7_JCA
>
> The OOM happened at ~2:51 and the primary stayed in this state until I
> restarted it at ~13:25.
>
> Any pointers are much appreciated!
>
> --
>    Best Regards,
>
>     Vilius Šumskas
>     Rivile
>     IT manager
>
>
