Yes, but I’d want to make sure it reproduces the issues we’ve seen. And due
to the nature of the messages, it takes time to make data that is fully
anonymous and still realistic.
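One way to keep such a dataset fully anonymous is to reproduce only the
message-size distribution with random payloads. A minimal Java sketch of that
idea follows; the concrete sizes and the ~1% share of large messages are
placeholder assumptions, not the real distribution:

```java
import java.util.Random;

// Sketch of a payload generator for a load simulator: mostly small
// messages plus a small fraction of large (> 10 KB) ones. The sizes and
// the ~1% large-message ratio are placeholders, not the real distribution.
public class PayloadGenerator {
    private final Random rnd = new Random(42); // fixed seed => reproducible runs

    // Returns an anonymous payload of a sampled size.
    public byte[] next() {
        int size = rnd.nextInt(100) == 0
                ? 10_240 + rnd.nextInt(90_000)  // ~1% large: 10 KB .. ~100 KB
                : 200 + rnd.nextInt(1_800);     // ~99% small: 200 B .. 2 KB
        byte[] payload = new byte[size];
        rnd.nextBytes(payload); // random bytes: realistic size, nothing sensitive
        return payload;
    }

    public static void main(String[] args) {
        PayloadGenerator gen = new PayloadGenerator();
        int large = 0;
        for (int i = 0; i < 10_000; i++) {
            if (gen.next().length >= 10_240) large++;
        }
        System.out.println("large messages out of 10000: " + large);
    }
}
```

A real simulator would wrap this in JMS producers/consumers against the
target queues, but the size sampling is the part that removes the
anonymization bottleneck.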

On Wed, 19 Feb 2025 at 21:06, Justin Bertram <jbert...@apache.org> wrote:

> On the other thread you mentioned a "little Java microservice that
> simulated my specific messages and size distribution...and load on specific
> queues and topics." Is that something you could provide (e.g. on GitHub)?
>
>
> Justin
>
> On Wed, Feb 19, 2025 at 2:02 PM Christian Kurmann <c...@tere.tech> wrote:
>
> > I will see what I can do. As mentioned, I processed millions of real
> > financial messages. This will take a while.
> >
> > On Wed, 19 Feb 2025 at 20:59, Justin Bertram <jbert...@apache.org> wrote:
> >
> > > Could you provide some form of the test-case you were running so I
> > > could investigate further?
> > >
> > >
> > > Justin
> > >
> > > On Wed, Feb 19, 2025 at 1:56 PM Christian Kurmann <c...@tere.tech> wrote:
> > >
> > > > No, sorry to be unclear on this.
> > > > We never got long-term high-performance runs with Artemis.
> > > > We tried for about a year.
> > > > It’s fine when we only have low traffic, but even then it seemed to
> > > > grow sluggish, requiring us to start weekly restarts.
> > > > It was simply unstable as soon as we got a large amount of data to
> > > > process, and we didn’t manage to tune it or our software for our
> > > > use case.
> > > > ActiveMQ Classic works fine.
> > > >
> > > > On Wed, 19 Feb 2025 at 20:44, Justin Bertram <jbert...@apache.org> wrote:
> > > >
> > > > > To be clear, everything worked in Artemis 2.30.0 and then began
> > > > > failing after an upgrade?
> > > > >
> > > > >
> > > > > Justin
> > > > >
> > > > > On Wed, Feb 19, 2025 at 1:40 PM Christian Kurmann <c...@tere.tech> wrote:
> > > > >
> > > > > > Hi Justin.
> > > > > > Yes, and I’m sorry it didn’t work out for us.
> > > > > > The problem was a combination of using operators on OpenShift,
> > > > > > which among other things make it difficult to change settings due
> > > > > > to the internal translation of attributes to actual config values
> > > > > > in the running pods, and the issues I alluded to earlier, whereby
> > > > > > we got several variations of bottlenecks that interacted to
> > > > > > soft-lock the system.
> > > > > > Very often the disks ran full, but this was preceded by a sudden
> > > > > > drop in performance. I believe our microservice architecture,
> > > > > > where processes read from a queue and then write to another queue
> > > > > > on the same ActiveMQ, is a bit of an anti-pattern for the
> > > > > > assumptions Artemis makes in its throttling.
> > > > > >
> > > > > > We had a consistent issue with dead connections that didn’t show
> > > > > > up in the Artemis GUI and only became noticeable in the logs (and
> > > > > > in failed connection attempts by clients), but we never made any
> > > > > > headway into finding out why they existed or how to close them
> > > > > > other than by restarting, and «solved» the issue simply by
> > > > > > removing limits. Not sure if this was a cause, a symptom, or just
> > > > > > another open issue.
> > > > > >
> > > > > > We tried several versions of Artemis and drivers (both our
> > > > > > original OpenWire setup and a rewritten setup with the Core
> > > > > > protocol) and paid Red Hat support, without being able to get rid
> > > > > > of the crashes. We did see different patterns in how fast and how
> > > > > > badly it crashed with OpenWire vs. Core. Core was definitely
> > > > > > faster and held out longer, but then crashed in a worse way.
> > > > > > Again, I assume that’s just an artifact of our data load being
> > > > > > too much for Artemis to handle gracefully.
> > > > > >
> > > > > > Switching back to ActiveMQ Classic resolved the issues for us,
> > > > > > insofar as we managed to pass all of our performance tests, and
> > > > > > it’s been running in several production sites for months now
> > > > > > without the previous issues.
> > > > > > It could be that it’s just luck, in that some things are a little
> > > > > > slower on ActiveMQ Classic, letting us process all our data
> > > > > > successfully in an overall faster time because we don’t hit the
> > > > > > same bottlenecks or throttling.
> > > > > >
> > > > > > Sorry, long email without any real information. I did try
> > > > > > sending logs and more specific details earlier, though,
> > > > > > especially to Red Hat.
> > > > > >
> > > > > > Overall, thanks for asking, and I massively appreciate the work
> > > > > > you all do on the software and here on the help thread.
> > > > > >
> > > > > > On Wed, 19 Feb 2025 at 20:20, Justin Bertram <jbert...@apache.org> wrote:
> > > > > >
> > > > > > > I saw your email on the "Are there any hardware
> > > > > > > recommendations for ActiveMQ Classic?" thread about the
> > > > > > > performance issues you observed with Artemis. Is this the issue
> > > > > > > you were referring to? If so, did you try the configuration
> > > > > > > change I mentioned? Any follow-up you could provide here would
> > > > > > > be valuable.
> > > > > > >
> > > > > > >
> > > > > > > Justin
> > > > > > >
> > > > > > > On Fri, May 10, 2024 at 6:51 AM Christian Kurmann <c...@tere.tech> wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > > Would someone be able to guide me to documentation for
> > > > > > > > options to handle a problem I've encountered testing
> > > > > > > > Artemis > 2.30?
> > > > > > > >
> > > > > > > > I have a set of ~50 Java microservices processing 1.5 million
> > > > > > > > messages, which communicate among themselves via Artemis
> > > > > > > > using OpenWire.
> > > > > > > > To test, I have a dataset containing lots of small messages
> > > > > > > > and a realistic amount of large messages (> 10 KB), which
> > > > > > > > simulates a full prod day but can be processed in under 1 h.
> > > > > > > >
> > > > > > > > Using Artemis 2.30 I can get my data processed; however,
> > > > > > > > with later versions the system hangs due to a large backlog
> > > > > > > > on an important queue which is used by all services to write
> > > > > > > > trace data.
> > > > > > > > All microservices process data in batches of max 100 msgs
> > > > > > > > while reading and writing to the queues.
> > > > > > > >
> > > > > > > > The main issue is that when this happens, both the reading
> > > > > > > > and writing clients of this queue simply hang waiting for
> > > > > > > > Artemis to return an ACK which never comes.
> > > > > > > > During this time Artemis does not log anything suspicious.
> > > > > > > > However, 10 minutes later I do see paging starting and I get
> > > > > > > > the warning:
> > > > > > > >
> > > > > > > > 2024-05-10 09:26:12,605 INFO [io.hawt.web.auth.LoginServlet] Hawtio login is using 1800 sec. HttpSession timeout
> > > > > > > > 2024-05-10 09:26:12,614 INFO [io.hawt.web.auth.LoginServlet] Logging in user: webadmin
> > > > > > > > 2024-05-10 09:26:12,906 INFO [io.hawt.web.auth.keycloak.KeycloakServlet] Keycloak integration is disabled
> > > > > > > > 2024-05-10 09:26:12,955 INFO [io.hawt.web.proxy.ProxyServlet] Proxy servlet is disabled
> > > > > > > > 2024-05-10 09:48:56,955 INFO [org.apache.activemq.artemis.core.server] AMQ222038: Starting paging on address 'IMS.PRINTS.V2'; size=10280995859 bytes (1820178 messages); maxSize=-1 bytes (-1 messages); globalSize=10309600627 bytes (1825169 messages); globalMaxSize=10309599232 bytes (-1 messages);
> > > > > > > > 2024-05-10 09:48:56,962 WARN [org.apache.activemq.artemis.core.server.Queue] AMQ224127: Message dispatch from paging is blocked. Address IMS.PRINTS.V2/Queue IMS.PRINTS.V2 will not read any more messages from paging until pending messages are acknowledged. There are currently 14500 messages pending (51829364 bytes) with max reads at maxPageReadMessages(-1) and maxPageReadBytes(20971520). Either increase reading attributes at the address-settings or change your consumers to acknowledge more often.
> > > > > > > > 2024-05-10 09:49:24,458 INFO [org.apache.activemq.artemis.core.server] AMQ222038: Starting paging on address 'PRINTS'; size=28608213 bytes (4992 messages); maxSize=-1 bytes (-1 messages); globalSize=10309604144 bytes (1825170 messages); globalMaxSize=10309599232 bytes (-1 messages);
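As a quick cross-check of the figures in that warning, the following sketch
restates them as standalone arithmetic; every constant is copied from the log
above, and nothing here talks to a broker:

```java
// Sanity check of the figures in the AMQ224127 warning; all constants are
// copied from the log, and this is standalone arithmetic, not broker code.
public class PagingMath {
    public static void main(String[] args) {
        long globalSize    = 10_309_600_627L; // bytes held across all addresses
        long globalMaxSize = 10_309_599_232L; // configured global-max-size
        // globalSize just exceeds globalMaxSize, which is why AMQ222038 paging
        // starts even though the address itself has maxSize=-1 (unlimited).
        System.out.println("over global limit: " + (globalSize > globalMaxSize));

        long pendingBytes     = 51_829_364L;  // delivered-but-unacked, per the warning
        long maxPageReadBytes = 20_971_520L;  // 20 MiB read limit, per the warning
        // Pending bytes are roughly 2.5x the read limit, so dispatch from
        // paging stays blocked until consumers acknowledge.
        System.out.printf("pending / maxPageReadBytes = %.2f%n",
                (double) pendingBytes / maxPageReadBytes);
    }
}
```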
> > > > > > > >
> > > > > > > > I see in the commit that added this warning that
> > > > > > > > maxPageReadBytes is not something I can actually change, so I
> > > > > > > > assume any solution needs to happen earlier in the chain:
> > > > > > > > https://www.mail-archive.com/commits@activemq.apache.org/msg61667.html
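For reference: recent Artemis releases document address-settings named
max-read-page-messages and max-read-page-bytes that govern these limits,
though whether a given version, or the OpenShift operator, actually exposes
them needs to be verified against its own docs. A broker.xml sketch with
purely illustrative values:

```xml
<!-- Sketch only: setting names as documented for recent Artemis versions;
     the match and values are illustrative, and an operator-managed broker
     may translate or override them. -->
<address-settings>
   <address-setting match="IMS.PRINTS.V2">
      <!-- default is 20971520 bytes (20 MiB); raising it lets more paged
           data sit delivered-but-unacknowledged before dispatch blocks -->
      <max-read-page-bytes>104857600</max-read-page-bytes>
      <!-- -1 = no message-count limit, matching maxPageReadMessages(-1) -->
      <max-read-page-messages>-1</max-read-page-messages>
   </address-setting>
</address-settings>
```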
> > > > > > > >
> > > > > > > > But I can't find anything sensible to configure to help me
> > > > > > > > here.
> > > > > > > > I'm also concerned about this suddenly appearing with a new
> > > > > > > > Artemis version; I can reproduce the crash with my dataset,
> > > > > > > > as well as verify that with 2.30 it still works fine.
> > > > > > > > The fact that there is nothing in the logs also worries me.
> > > > > > > > Reconnecting the clients allows messages to be processed;
> > > > > > > > however, the system crashes again very quickly due to all the
> > > > > > > > pending transactions starting again and leading to the same
> > > > > > > > issue.
> > > > > > > >
> > > > > > > > Running on OpenShift using the AMQ Cloud operator.
> > > > > > > > Clients connect via OpenWire, using activemq-client-5.17.2.jar
> > > > > > > > and Java 17.
> > > > > > > >
> > > > > > > > All input is appreciated, and I'm happy to run as many tests
> > > > > > > > as required.
> > > > > > > >
> > > > > > > > Regards, Chris
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
