We recently implemented message scheduling on an Artemis system that had been otherwise stable for two years. The system processes updates that a partner submits 24x7 through a web service. Because some of the internal systems we depend on are often offline for maintenance between midnight and 7 AM, we added delayed delivery for messages dispatched to the queue during those hours: overnight messages are scheduled for delivery at 08:00. We tested the change and put it into production, and for about two weeks it worked fine.
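For reference, the producer-side scheduling looks roughly like this (a simplified sketch assuming plain JMS and Artemis's _AMQ_SCHED_DELIVERY message property; the real code pulls the window and delivery time from configuration):

    import javax.jms.JMSException;
    import javax.jms.MessageProducer;
    import javax.jms.Session;
    import javax.jms.TextMessage;
    import java.time.LocalDate;
    import java.time.LocalTime;
    import java.time.ZoneId;

    public class OvernightScheduler {

        // Downstream systems may be offline for maintenance until 07:00.
        private static final LocalTime WINDOW_END = LocalTime.of(7, 0);
        private static final LocalTime DELIVER_AT = LocalTime.of(8, 0);

        static void send(Session session, MessageProducer producer, String payload) throws JMSException {
            TextMessage msg = session.createTextMessage(payload);
            LocalTime now = LocalTime.now();
            // Between midnight and 07:00, hold the message until 08:00 the same day.
            if (now.isBefore(WINDOW_END)) {
                long deliverAt = LocalDate.now().atTime(DELIVER_AT)
                        .atZone(ZoneId.systemDefault()).toInstant().toEpochMilli();
                msg.setLongProperty("_AMQ_SCHED_DELIVERY", deliverAt);
            }
            producer.send(msg);
        }
    }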
When the volume increased to 50,000+ such messages being scheduled per night, the system began deadlocking predictably every morning, a minute or two past 8 AM, right as processing of the delayed messages scheduled for 08:00 got underway. This happened consistently for more than two weeks; each morning the critical analyzer would detect the fault and halt the VM within a couple of minutes.

We fixed (or maybe just avoided) the issue by changing the delayed delivery slightly: for each message we now use a PRNG to pick random minutes and seconds between 0 and 59 when calculating the delay time, which spreads delivery out between 08:00:00 and 08:59:59. As soon as we made that change, the deadlocks never returned.

Does this sound like a misconfiguration of Artemis on my part, or something I should bundle up (logs and config files) and submit as a bug report?

Here's a quick overview of the configuration:

Artemis servers: two running v2.11.0 in a symmetric cluster (we tried standalone mode during the issue; it didn't help)
JVM: OpenJDK 11, J9 VM
global-max-size: 104857600
Xms/Xmx: 512M/2G
Platform: Windows
Message size: < 1 KB/message
Scheduled messages: 10k-100k/night
Total messages: approx. 500k/day
Consumers: MDBs on 12 WildFly servers, about 500 instances

Thanks for any insights anyone can offer,
-a
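P.S. In case it helps, the jitter change amounts to replacing the fixed 08:00 timestamp with one spread across the hour, something like this (again a simplified sketch, same assumptions as the snippet above):

    import java.time.LocalDate;
    import java.time.ZoneId;
    import java.util.concurrent.ThreadLocalRandom;

    // Pick a random delivery time between 08:00:00 and 08:59:59 today,
    // so the tens of thousands of scheduled messages don't all become
    // deliverable at the same instant.
    static long jitteredDeliveryTime() {
        long eightAm = LocalDate.now().atTime(8, 0)
                .atZone(ZoneId.systemDefault()).toInstant().toEpochMilli();
        long minutes = ThreadLocalRandom.current().nextInt(60); // 0-59
        long seconds = ThreadLocalRandom.current().nextInt(60); // 0-59
        return eightAm + (minutes * 60 + seconds) * 1000;
    }

The result goes into _AMQ_SCHED_DELIVERY the same way as before.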