On Apr 29, 2015 4:21 AM, "James Green" <james.mk.gr...@gmail.com> wrote:
>
> On 28 April 2015 at 14:09, Tim Bain <tb...@alumni.duke.edu> wrote:
>
> > On Apr 28, 2015 3:21 AM, "James Green" <james.mk.gr...@gmail.com> wrote:
> > >
> > > So to re-work my understanding, the pre-fetch buffer in the connected
> > > client is filled with messages and the broker's queue counters react as if
> > > this was not the case (else we'd see 1,000 drop from the pending queue
> > > extremely quickly, and slowly get dequeued, which we do not see).
> >
> > Pretty much right, though I wouldn't say the broker's counters react as if
> > this was not the case; rather, the broker's dispatch counter increases
> > immediately, but the dequeue counter won't increase until the broker removes
> > the message, and that won't happen until the consumer acks it. Until that
> > happens, the message exists in both places and the counters reflect that.
> >
> > It sounds like you're observing the broker through the web console; there's
> > WAY more information available through the JMX beans, and you'll understand
> > this better by watching them instead of the web console. So I highly
> > recommend firing up JConsole and looking at the JMX beans.
>
> We're going through a firewall, which pretty much rules JMX out - plenty of
> bad experiences of that scenario.
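For reference, the dispatch and dequeue counters discussed above are just
attributes on each queue's MBean, so wherever JMX is reachable they can be
read programmatically as well as through JConsole. A rough, untested sketch
follows; it assumes the ActiveMQ 5.8+ JMX object-name layout, a JMX connector
reachable on localhost:1099, the default broker name "localhost", and a
placeholder queue name, so adjust all of those to your setup:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class QueueCounters {
    public static void main(String[] args) throws Exception {
        // Assumption: the broker's JMX connector is reachable on localhost:1099.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi");
        try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();
            // Assumption: ActiveMQ 5.8+ object-name layout, broker name "localhost",
            // queue "TEST.QUEUE"; older 5.x brokers use a different layout.
            ObjectName queue = new ObjectName(
                    "org.apache.activemq:type=Broker,brokerName=localhost,"
                            + "destinationType=Queue,destinationName=TEST.QUEUE");
            // DispatchCount rises as soon as messages are pushed into a consumer's
            // prefetch buffer; DequeueCount only rises once the consumer acks.
            System.out.println("QueueSize     = " + mbsc.getAttribute(queue, "QueueSize"));
            System.out.println("DispatchCount = " + mbsc.getAttribute(queue, "DispatchCount"));
            System.out.println("DequeueCount  = " + mbsc.getAttribute(queue, "DequeueCount"));
            System.out.println("InFlightCount = " + mbsc.getAttribute(queue, "InFlightCount"));
        }
    }
}

Watching DispatchCount versus DequeueCount while a consumer works through its
prefetch buffer shows exactly the dispatch-now, dequeue-on-ack behavior
described above.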
Even if you can't do it on the production broker, you should use JConsole on
a dev/test one and just poke around; you'll understand the broker much better
for having done that. (Doing exactly that last year is how I'm able to answer
your questions now.) This is a learning opportunity, not a solution for your
immediate problem (though it might have helped with that too); take the time
and do it.

With that being said, why can't you run JConsole on the machine hosting the
production broker, to avoid the firewall? I get that there are probably
limited users who can SSH to that box and you wouldn't want to do that
regularly, but given that you're hitting problems in production that you need
to debug, I'd think that wouldn't be an unreasonable thing to do.

> We tried with Hawt.io but it does not work properly against a remote
> broker. The queues appear but they cannot be browsed and the health tab is
> empty. Others report the same.
>
> > I agree, I don't understand that, particularly because even if the broker
> > was so loaded down that you were hitting that timeout, I don't see how that
> > would result in a failed delivery attempt. Your receive() call would just
> > return null and Camel would just call receive() again and everything would
> > be fine. (This is exactly what happens when there aren't any messages on
> > the queue, and nothing bad happens then.) So my gut reaction is that the
> > timeout is a red herring and something else is going on. Have you switched
> > that setting between the two values while playing identical messages
> > (either generated or recorded) to be sure that that setting really is the
> > cause of this behavior?
>
> We've not, mainly because we've not spent the time on recording the
> individual messages for individual playback.

Let us know what you see when you do that. I still think this is probably a
red herring and something else is the root cause, but I don't have any
guesses about what it might be.

> > Also, when messages are failing, do all of them fail? If it's only some of
> > them, what's the common thread?
>
> We get bursts into the DLQ, which we've put down to possible machine
> loading issues at the time. Still nothing recorded since the time-out
> change.
>
> Those that were in the DLQ were consumed fine when we reconfigured the app
> to read from there, so it's not individual messages that will always fail,
> either.
>
> James

That's good to know, but there still could be a common theme between those
messages that causes them to be the ones that fail when this happens. Let us
know when/if you find one.

Can you reproduce the problem on a non-production instance? If so, attach a
debugger to the broker and set breakpoints on each line that triggers a
redelivery (or a move to the DLQ) and see if stepping through tells you what
the problem is.

Tim
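P.S. To make the "receive() would just return null" point concrete, here is a
minimal, untested JMS polling sketch, roughly the pattern Camel's polling
consumer follows per the discussion above. The broker URL and queue name are
placeholders; adjust to your environment:

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

public class PollOnce {
    public static void main(String[] args) throws Exception {
        // Assumption: broker at tcp://localhost:61616, queue "TEST.QUEUE".
        ConnectionFactory cf = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection conn = cf.createConnection();
        try {
            conn.start();
            Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageConsumer consumer =
                    session.createConsumer(session.createQueue("TEST.QUEUE"));

            // A null return simply means nothing arrived within the timeout.
            // It is not a failed delivery, so by itself it cannot trigger a
            // redelivery or a move to the DLQ; the caller just polls again.
            Message msg = consumer.receive(1000);
            if (msg == null) {
                System.out.println("No message within the timeout; poll again.");
            } else {
                System.out.println("Received " + msg.getJMSMessageID());
            }
        } finally {
            conn.close();
        }
    }
}

If messages are landing in the DLQ, something must be failing (or rolling
back) actual deliveries, which is why breakpoints on the redelivery/DLQ code
paths are worth the effort.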