Re: backup not activated after OOM on primary

Justin Bertram Mon, 06 Jan 2025 11:07:47 -0800

I'm not aware of any broker-specific dangers of adding
+ExitOnOutOfMemoryError.
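
For reference, here is a minimal sketch of how the flag might be added. It
assumes the broker instance's etc/artemis.profile is sourced as a shell
script and already defines JAVA_ARGS, so check your own file before copying
anything. The heap dump flag is the one mentioned elsewhere in this thread.

```shell
# Sketch only: append the two JVM flags discussed in this thread to JAVA_ARGS.
# Assumption: this line sits after the existing JAVA_ARGS definition in
# <broker-instance>/etc/artemis.profile, so the current value is preserved.
JAVA_ARGS="${JAVA_ARGS:-} -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError"
```

The broker has to be restarted for the change to take effect.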


It's worth noting that the broker is written knowing that at any point the
JVM or OS could crash or there could be a hardware failure of some kind.
This is, in part, why transactions are implemented a particular way, why we
only acknowledge durable messages once the data has been flushed to disk,
etc. In short, there _should_ never be any message/journal corruption. If
there was then it would be considered a bug and it would be fixed.


Justin

On Thu, Jan 2, 2025 at 12:21 AM Vilius Šumskas <vilius.sums...@rivile.lt>
wrote:

> I will rephrase my question. Are there any specific dangers in adding
> +ExitOnOutOfMemoryError? For example, message/journal corruption if the JVM
> shuts down abruptly because of an OOME in one part of the broker while
> another part is still working?
>
> --
>     Vilius
>
> -----Original Message-----
> From: Justin Bertram <jbert...@apache.org>
> Sent: Thursday, January 2, 2025 3:42 AM
> To: users@activemq.apache.org
> Subject: Re: backup not activated after OOM on primary
>
> The caveat with adding +ExitOnOutOfMemoryError is that the JVM will now
> exit when an OOME occurs rather than simply carrying on. Keep in mind
> that an OOME isn't necessarily a death sentence in and of itself. It is
> technically possible (although unlikely) for the broker to recover.
>
> I believe it's not in the default artemis.profile because it was
> essentially brand new when Artemis 2.0 was released, and it hasn't been
> possible to add it and change the default behavior in a minor release.
>
>
> Justin
>
> On Wed, Jan 1, 2025 at 4:32 PM Vilius Šumskas <vilius.sums...@rivile.lt>
> wrote:
>
> > Are there any caveats to adding +ExitOnOutOfMemoryError? Just wondering
> > why it's not in the default JAVA_ARGS in "artemis.profile".
> >
> > --
> >     Vilius
> >
> > -----Original Message-----
> > From: Justin Bertram <jbert...@apache.org>
> > Sent: Wednesday, January 1, 2025 11:01 PM
> > To: users@activemq.apache.org
> > Subject: Re: backup not activated after OOM on primary
> >
> > I think what you're seeing is expected. An OOME usually isn't enough
> > to trigger the broker to fail completely and trigger a failover. As
> > you can see, the broker continued to run after the OOME which means it
> > was still holding the lock on the shared journal (preventing the
> > backup from activating). If you want to ensure the broker fails over
> > in this situation you should pass this to the JVM:
> >
> >   -XX:+ExitOnOutOfMemoryError
> >
> > This will ensure the JVM stops when an OOME occurs which will then
> > allow the backup to activate.
> >
> > It might also be worth passing this as well:
> >
> >   -XX:+HeapDumpOnOutOfMemoryError
> >
> > This will allow you to do some post-mortem analysis and see exactly
> > why the OOME occurred.
> >
> >
> > Justin
> >
> > On Wed, Jan 1, 2025 at 12:17 PM Vilius Šumskas
> > <vilius.sums...@rivile.lt>
> > wrote:
> >
> > > Hi,
> > >
> > > > we had an incident where our applications sent too much traffic to
> > > > the Artemis broker and the broker got Java heap out-of-memory errors.
> > > > I’m trying to understand why the backup broker never became primary
> > > > after this happened.
> > > > We run an Artemis static cluster with two nodes (primary and backup)
> > > > under shared storage. The configuration is pretty straightforward:
> > >       <ha-policy>
> > >          <shared-store>
> > >             <primary>
> > >                <failover-on-shutdown>true</failover-on-shutdown>
> > >             </primary>
> > >          </shared-store>
> > >       </ha-policy>
> > >
> > >       <ha-policy>
> > >          <shared-store>
> > >             <backup>
> > >                <failover-on-shutdown>true</failover-on-shutdown>
> > >             </backup>
> > >          </shared-store>
> > >       </ha-policy>
> > >
> > > > The backup always becomes primary if we reboot the primary during
> > > > maintenance. We also tested our HA configuration in various other
> > > > ways, like disabling network connections, killing the storage mount
> > > > point, etc., so I’m positive the configuration is correct.
> > >
> > > Primary logs during that time:
> > > https://p.defau.lt/?cNVdPPEMN2qM8XbLZomKdQ
> > > Backup logs during that time:
> > > https://p.defau.lt/?nJvfQKdc4rrUGC9lL7_JCA
> > >
> > > > The OOM happened at ~2:51 and the primary stayed in this state until I
> > > > restarted it at ~13:25.
> > >
> > > Any pointers are much appreciated!
> > >
> > > --
> > >    Best Regards,
> > >
> > >     Vilius Šumskas
> > >     Rivile
> > >     IT manager
> > >
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscr...@activemq.apache.org
> > For additional commands, e-mail: users-h...@activemq.apache.org For
> > further information, visit: https://activemq.apache.org/contact
> >
> >
>
