We recently implemented message scheduling on an Artemis system that had
been otherwise stable for two years.  The system processes updates from a
web service that a partner submits 24x7.  Because some of the internal
systems we depend on are often offline for maintenance between midnight and
7 AM, we implemented delayed delivery for the messages dispatched to the
queue between those hours.  We set the overnight messages to be delivered
at 08:00, tested it, and put the change into production.   For about two
weeks this worked fine.

When the volume grew to more than 50,000 scheduled messages per night, the
system began deadlocking predictably a minute or two past 8 AM.  This
happened every morning for more than two weeks.  The critical-analyzer
would detect the fault and halt the VM within a couple of minutes of the
delayed messages scheduled for 08:00 starting to process.

We fixed (or maybe just avoided) the issue by changing the delayed delivery
slightly.  For each message we use a PRNG to pick two random numbers
between 0 and 59 and use them as the minutes and seconds components of the
delivery time, which spreads delivery between 08:00:00 and 08:59:59.  Since
we made that change, the deadlocks have not returned.
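
To make the workaround concrete, here is a minimal sketch of the delay
calculation described above.  It is not our production code: the class and
method names are made up for illustration, and treating the maintenance
window as 00:00 up to (but not including) 07:00 is an assumption.

```java
import java.time.LocalTime;
import java.time.ZonedDateTime;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper; names are illustrative, not from the real system.
final class SpreadDelivery {

    // Assumed maintenance window: [00:00, 07:00).
    static final LocalTime WINDOW_END = LocalTime.of(7, 0);

    /** True if a message dispatched at time t should be delayed. */
    static boolean inMaintenanceWindow(LocalTime t) {
        return t.isBefore(WINDOW_END);
    }

    /** Epoch millis for 08:mm:ss on the same calendar day as 'now'. */
    static long deliveryTimeMillis(ZonedDateTime now, int minute, int second) {
        return now.toLocalDate()
                  .atTime(8, minute, second)
                  .atZone(now.getZone())
                  .toInstant()
                  .toEpochMilli();
    }

    /** Picks a random delivery time between 08:00:00 and 08:59:59. */
    static long randomDeliveryTimeMillis(ZonedDateTime now) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        int minute = rnd.nextInt(60); // 0..59
        int second = rnd.nextInt(60); // 0..59
        return deliveryTimeMillis(now, minute, second);
    }

    public static void main(String[] args) {
        ZonedDateTime now = ZonedDateTime.now();
        if (inMaintenanceWindow(now.toLocalTime())) {
            long when = randomDeliveryTimeMillis(now);
            // With a JMS client this value would go on the message, e.g.
            // message.setLongProperty("_AMQ_SCHED_DELIVERY", when);
            System.out.println("deliver at epoch millis " + when);
        } else {
            System.out.println("outside the window, deliver immediately");
        }
    }
}
```

The only difference from the original 08:00 version is that the minute and
second are random instead of both being zero.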

Does this sound like a misconfiguration of Artemis on my part, or should I
bundle up all the logs and config files and submit this as a bug?
Here's a quick overview of the configuration:

Artemis servers: Two, running v2.11.0 in symmetric cluster mode (we also
tried standalone mode while the issue was occurring; it didn't help)
JVM: OpenJDK v11 - J9 VM
Global-Max-Size: 104857600 (100 MiB)
Xms/Xmx: 512M/2G
Platform: Windows
Message size: < 1K/message
Scheduled messages: 10k - 100k / night
Total messages: approx 500k/day
Consumers: MDBs on 12 WildFly servers, about 500 instances

Thanks for any insights anyone can offer,
-a
