Hello.

We're using ActiveMQ Classic version 5.18.3. In our latest load tests we see
some errors that seem to be related to out-of-order message delivery. The
broker is configured with PostgreSQL persistence.

The Camel+Java producers send messages to different queues: some for
business data, others for tracing. The tracing queue is consumed by a Java
application with a single consumer that expects messages to be delivered in
the order of the business operations.
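
For reference, the consuming side looks roughly like this (a simplified
sketch; the broker URL, queue name and processing are illustrative):

import javax.jms.Connection;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

public class TracingConsumer {
    public static void main(String[] args) throws Exception {
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://activemq:61616");
        Connection connection = factory.createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageConsumer consumer =
                session.createConsumer(session.createQueue("tracing"));
        while (true) {
            Message message = consumer.receive();
            // Each step is written to the DB; a constraint fails if a step
            // arrives before its predecessors.
            process(message);
        }
    }

    private static void process(Message message) { /* DB insert, etc. */ }
}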

If we stop the consumer application, let the producers fill the tracing
queue, and restart the consumer once production is complete, we never see
any error. This doesn't mean it cannot happen, only that over millions of
messages we have seen none.

But if we let the producers and the consumer run at the same time, random
errors occur because of out-of-order operations. For a test batch of
248,000 tracing messages, about 2 out of 5 executions lead to at least one
ordering error.

The test injection (using JMeter) sends 15,000 business JMS messages into
15 queues; these are processed by Camel routes that produce the 248,000
tracing messages. JMeter reports a send rate of about 70 messages per second.
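
Roughly, each business route has this shape (a simplified sketch; the queue
and header names are illustrative):

import org.apache.camel.builder.RouteBuilder;

public class BusinessRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("activemq:queue:business.q1")
            // business processing, then one tracing message per step;
            // date_traitement is set just before the send
            .process(exchange -> exchange.getIn().setHeader(
                    "date_traitement", java.time.Instant.now().toString()))
            .to("activemq:queue:tracing");
    }
}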

We've added a trigger on the activemq_msgs table to copy the tracing
queue's messages into a backup table. From this table, here are the IDs and
timestamps of the 5 tracing messages for an operation that fails:

|id    |etape               |date_traitement         |jms_timestamp_ms|jms_timestamp                |consumer_timestamp
|------|--------------------|------------------------|----------------|-----------------------------|------------------------------
|13 050|initier_echange     |2025-01-14T19:05:01.731Z|1736881502569   |2025-01-14 20:05:02.569 +0100|2025-01-14 20:16:15.226 +0100
|13 073|initier_utr         |2025-01-14T19:05:01.770Z|1736881501763   |2025-01-14 20:05:01.763 +0100|2025-01-14 20:16:16.335 +0100
|13 113|diffuser_donnees_utr|2025-01-14T19:05:01.836Z|1736881501763   |2025-01-14 20:05:01.763 +0100|2025-01-14 20:15:44.892 +0100 <<< ERROR
|13 129|diffuser_donnees_utr|2025-01-14T19:05:01.868Z|1736881501763   |2025-01-14 20:05:01.763 +0100|2025-01-14 20:16:19.240 +0100
|13 130|diffuser_utr        |2025-01-14T19:05:01.874Z|1736881501763   |2025-01-14 20:05:01.763 +0100|2025-01-14 20:16:19.274 +0100

Column 1 is the persistence ID generated by ActiveMQ. These IDs appear to
be in the right order.
Column 2 is our tracing step; the steps are expected to be processed in
this order.
Column 3 is the date+time set as a header just before the JMS message is
built and sent. Everything looks consistent here.
Column 4 is the JMSTimestamp generated by the ActiveMQ sender (column 5 is
the same value as a human-readable timestamp). Here we see the first
strange values: the last four messages share the same timestamp, and that
timestamp is earlier than the timestamp of the first message sent.

The last column for rows 1, 2, 4 and 5 is the timestamp of the trace
processing on the consumer side. Messages arrive late since we have a
single consumer, but these four messages are delivered in the right order.

The last timestamp on row 3 is the error trace: this step
(diffuser_donnees_utr) cannot be processed before the two previous steps,
so a DB constraint throws an error. According to these values, message 3,
at 2025-01-14 20:15:44.892, was received before message 1, at
2025-01-14 20:16:15.226...
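
In case it helps, this is roughly how we can log what the consumer actually
receives, to correlate with the table above (a sketch; getBrokerInTime()
and getBrokerOutTime() are ActiveMQ-specific accessors on the received
message):

import javax.jms.JMSException;
import javax.jms.Message;
import org.apache.activemq.command.ActiveMQMessage;

public class TimestampLog {
    static void log(Message message) throws JMSException {
        // JMSTimestamp is assigned by the client library inside send()
        long jmsTimestamp = message.getJMSTimestamp();
        if (message instanceof ActiveMQMessage) {
            ActiveMQMessage amq = (ActiveMQMessage) message;
            System.out.printf("id=%s jmsTimestamp=%d brokerInTime=%d brokerOutTime=%d%n",
                    message.getJMSMessageID(), jmsTimestamp,
                    amq.getBrokerInTime(), amq.getBrokerOutTime());
        }
    }
}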

FYI, Karaf, ActiveMQ and the consuming application all run on the same K8s
cluster, and all the pods have the same, correct time.

As we have very few errors like this, we don't know how to interpret them.
They seem to appear only when many messages on the tracing queue are
produced and consumed simultaneously.

Is there any previous report of this kind of issue?
How can we explain that the JMSTimestamp values are not consistent with the
message creation order?
Is the delivery order based on a timestamp? This could explain message 3,
but messages 2, 4 and 5 are processed at the correct time.

Any idea that could help us find an explanation is welcome.

Thanks in advance for your help.

Regards.
