On Apr 29, 2015 4:21 AM, "James Green" <james.mk.gr...@gmail.com> wrote:
>
> On 28 April 2015 at 14:09, Tim Bain <tb...@alumni.duke.edu> wrote:
>
> > On Apr 28, 2015 3:21 AM, "James Green" <james.mk.gr...@gmail.com> wrote:
> > >
> > >
> > > So to re-work my understanding, the pre-fetch buffer in the connected
> > > client is filled with messages and the broker's queue counters react as
> > > if this was not the case (else we'd see 1,000 drop from the pending
> > > queue extremely quickly, and slowly get dequeued which we do not see).
> >
> > Pretty much right, though I wouldn't say the broker's counters react as
> > if this was not the case; rather, the broker's dispatch counter increases
> > immediately but the dequeue counter won't increase until the broker
> > removes the message, and that won't happen until the consumer acks it.
> > Until that happens, the message exists in both places and the counters
> > reflect that.
> >
> > It sounds like you're observing the broker through the web console;
> > there's WAY more information available through the JMX beans and you'll
> > understand this better by watching them instead of the web console.  So I
> > highly recommend firing up JConsole and looking at the JMX beans.
> >
>
> We're going through a firewall which pretty much rules JMX out - plenty of
> bad experiences of that scenario.

Even if you can't do it on the production broker, you should use JConsole
on a dev/test one and just poke around; you'll understand the broker much
better for having done that.  (Doing exactly that last year is how I'm able
to answer your questions now.)  This is a learning opportunity, not a
solution for your immediate problem (though it might have helped with that
too); take the time and do it.

With that being said, why can't you run JConsole on the machine hosting the
production broker, to avoid the firewall?  I get that there are probably
limited users who can SSH to that box and you wouldn't want to do that
regularly, but given that you're hitting problems in production that you
need to debug, I'd think that wouldn't be an unreasonable thing to do.
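
If you can get even a single port through (or an SSH tunnel to the broker
host), you don't strictly need JConsole; a few lines of plain JMX code will
read the same counters we've been talking about.  Something like the sketch
below, assuming the broker exposes a remote JMX connector on port 1099 and
uses the newer ObjectName layout -- the broker name and queue name are
placeholders you'd swap for your own:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class QueueCounters {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; point it at the broker's JMX port (or a tunnel).
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi");
            try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection conn = jmxc.getMBeanServerConnection();
                // Adjust brokerName/destinationName; older 5.x releases use a
                // different ObjectName layout, so check with JConsole first.
                ObjectName queue = new ObjectName(
                        "org.apache.activemq:type=Broker,brokerName=localhost,"
                        + "destinationType=Queue,destinationName=MY.QUEUE");
                // DispatchCount vs. DequeueCount is the distinction above:
                // dispatched-but-unacked messages sit in the gap between them.
                for (String attr : new String[] {"QueueSize", "InFlightCount",
                        "DispatchCount", "DequeueCount"}) {
                    System.out.println(attr + " = "
                            + conn.getAttribute(queue, attr));
                }
            }
        }
    }

Even run against a dev broker, that's usually enough to see whether messages
are sitting in-flight (dispatched but unacked) rather than truly stuck on the
queue.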

> We tried with Hawt.io but it does not work properly against a remote
> broker. The queues appear but they cannot be browsed and the health tab is
> empty. Others report the same.
>
> > I agree, I don't understand that, particularly because even if the broker
> > was so loaded down that you were hitting that timeout, I don't see how
> > that would result in a failed delivery attempt.  Your receive() call would
> > just return null and Camel would just call receive() again and everything
> > would be fine.  (This is exactly what happens when there aren't any
> > messages on the queue, and nothing bad happens then.)  So my gut reaction
> > is that the timeout is a red herring and something else is going on.  Have
> > you switched that setting between the two values while playing identical
> > messages (either generated or recorded) to be sure that that setting
> > really is the cause of this behavior?
> >
>
> We've not, mainly because we've not spent the time on recording the
> individual messages for individual playback.

Let us know what you see when you do that.  I still think this is probably
a red herring and something else is the root cause, but I don't have any
guesses about what it might be.
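
For reference, the behavior described above is easier to see with Camel
stripped away.  A minimal plain-JMS polling loop (broker URL and queue name
are made up) looks roughly like this: receive(timeout) simply returns null
when nothing arrives in time and the caller polls again, with no redelivery
or DLQ move involved:

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.Message;
    import javax.jms.MessageConsumer;
    import javax.jms.Session;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class PollingConsumer {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory =
                    new ActiveMQConnectionFactory("tcp://localhost:61616");
            Connection connection = factory.createConnection();
            connection.start();
            Session session =
                    connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageConsumer consumer =
                    session.createConsumer(session.createQueue("MY.QUEUE"));
            while (true) {
                // A null here just means nothing arrived within the timeout
                // window; it is not a failed delivery and nothing gets
                // redelivered or sent to the DLQ.
                Message message = consumer.receive(1000);
                if (message == null) {
                    continue;
                }
                // ... process the message ...
            }
        }
    }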

> > Also, when messages are failing, do all of them fail?  If it's only some
> > of them, what's the common thread?
> >
> >
> We get bursts into the DLQ, which we've put down to possible machine
> loading issues at the time. Still nothing recorded since the time-out
> change.
>
> Those that were in the DLQ were consumed fine when we reconfigured the app
> to read from there so it's not individual messages that will always fail,
> either.
>
> James

That's good to know, but there still could be a common theme between those
messages that causes them to be the ones that fail when this happens.  Let
us know when/if you find one.

Can you reproduce the problem on a non-production instance?  If so, attach
a debugger to the broker and set breakpoints on each line that triggers a
redelivery (or a move to the DLQ) and see if stepping through tells you
what the problem is.
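
In case it helps with getting a debugger attached: the usual approach is to
start the broker JVM with the standard JDWP agent enabled, e.g.

    -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005

(exactly where you add that depends on how you launch ActiveMQ; the startup
scripts let you append extra JVM options), then point your IDE's remote-debug
configuration at port 5005 on the test box and set the breakpoints there.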

Tim
