Bumping the logging level is key to this; even if you succeed in finding
data files that demonstrate what happened, without the more detailed
logging it may be impossible to tell *why* it happened. If you can get a
disk attached to the machine that's large enough to hold months of logs,
that's great; just make sure you turn off Log4J's feature that keeps only a
limited number of rolled files (I think it's set to 10 by default). Or you
could run a script in a tight loop that watches for a log file to roll over
and copies the newly-rolled-over file to a network file location or to
cloud storage such as S3.
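
If you're on log4j 1.x, the setting I mean is the RollingFileAppender's
MaxBackupIndex (paired with MaxFileSize). For the copy-and-ship idea,
something along these lines would do; it's just a rough sketch in Python,
and the paths and the activemq.log.* naming are guesses on my part, so
adjust them for your setup (or swap the copy for an S3 upload):

    import glob
    import os
    import shutil
    import time

    LOG_DIR = "/opt/activemq/data"       # assumption: where the broker writes its logs
    DEST_DIR = "/mnt/logarchive/broker1" # assumption: a network mount
    ROLL_PATTERN = "activemq.log.*"      # rolled-over files, e.g. activemq.log.1

    seen = {}  # file name -> mtime of the version we already shipped

    while True:
        for path in glob.glob(os.path.join(LOG_DIR, ROLL_PATTERN)):
            name = os.path.basename(path)
            mtime = os.path.getmtime(path)
            if seen.get(name) == mtime:
                continue  # this rolled file hasn't changed since we last copied it
            time.sleep(1)  # give the appender a moment to finish writing
            dest = os.path.join(DEST_DIR, "%s.%d" % (name, int(mtime)))
            shutil.copy2(path, dest)
            seen[name] = mtime
        time.sleep(10)  # poll interval; frequent, but not busy-waiting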

I'd suggest you keep at least the last two sets of saved data files (that
is, don't delete your current set when you analyze it and find it clean;
wait until the next one checks out before you delete it), because once you
find the first bad set, you'll want to compare it to the last good set to
see what changed. There's a good chance you already thought of this (so
please ignore the suggestion if I'm telling you something you already
figured out), but since this takes months I'd hate to see you get to the
end and then realize that you don't have everything you want/need.
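
A small retention script along these lines would keep the copies from
piling up while guaranteeing the last good set is still around when the
first bad one shows up. Again just a sketch; the snapshot layout, directory
names, and keep-count are assumptions:

    import os
    import shutil

    SNAPSHOT_ROOT = "/backups/kahadb-snapshots"  # assumption: one timestamped subdir per copy
    KEEP = 2                                     # current set plus the last one that checked out

    # Subdirectories named like 2017-10-27T01-00 so lexical order == chronological order.
    snapshots = sorted(
        d for d in os.listdir(SNAPSHOT_ROOT)
        if os.path.isdir(os.path.join(SNAPSHOT_ROOT, d))
    )

    for old in snapshots[:-KEEP]:
        shutil.rmtree(os.path.join(SNAPSHOT_ROOT, old))
        print("pruned", old)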

I'd also suggest you run the analysis more frequently (hourly, maybe, or
even more often if possible) to try to limit the size of the delta between
the last good and first bad snapshots. The closer they are, the better the
odds that you'll be able to identify which difference between the files is
the relevant one (because most of the differences will be valid and
unrelated). You're looking for a needle in a haystack, and there's a strong
possibility that the needle is invisible and its presence can only be
deduced from context, so anything you can do to shrink the size of the
haystack is in your best interest, especially since this takes so long to
manifest itself.
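
To make that comparison mechanical, a first pass could just diff the file
inventories of two consecutive snapshots (names, sizes, checksums) before
reaching for the KahaDB-specific utilities. This sketch assumes the
timestamped-snapshot layout from above; the paths are made up:

    import hashlib
    import os

    def inventory(snapshot_dir):
        # Map relative path -> (size, sha256) for every file in a snapshot copy.
        result = {}
        for root, _dirs, files in os.walk(snapshot_dir):
            for name in files:
                path = os.path.join(root, name)
                rel = os.path.relpath(path, snapshot_dir)
                with open(path, "rb") as f:
                    digest = hashlib.sha256(f.read()).hexdigest()
                result[rel] = (os.path.getsize(path), digest)
        return result

    def diff_snapshots(last_good, first_bad):
        a, b = inventory(last_good), inventory(first_bad)
        for rel in sorted(set(a) | set(b)):
            if rel not in a:
                print("only in bad: ", rel)
            elif rel not in b:
                print("only in good:", rel)
            elif a[rel] != b[rel]:
                print("changed:     ", rel, a[rel][0], "->", b[rel][0], "bytes")

    # Point these at the last clean copy and the first broken one.
    diff_snapshots("/backups/kahadb-snapshots/2017-10-20T01-00",
                   "/backups/kahadb-snapshots/2017-10-27T01-00")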

Sorry I don't have any better ideas for how to analyze this.

Tim

On Oct 27, 2017 1:10 AM, "hakanj" <hakan.johans...@jeppesen.com> wrote:

> It is possible that there is a bug in the ack compaction feature, but if so
> it only shows itself in some very special cases. I have tried stressing it,
> but to no avail. In all my tests the broker does the right thing.
>
> There are no cron jobs deleting any files in this system. The broker thinks
> that the KahaDB database is structurally sound. No complaints about any
> missing files during startup.
>
> When we looked in the logs of our other applications we could see that the
> resurrected messages were processed as expected, when expected, including a
> transactional commit to the broker. Before the broker was taken down the
> queue was empty, with matching EnqueueCount and DequeueCount.
>
> The TRACE level idea is great and will probably show what is happening, but
> I am afraid of running out of disk space. The logs would be huge, as we need
> to run the system for several months before the issue happens. The broker is
> running on a virtual machine with limited available disk. There is also the
> performance impact to take into consideration. We would need to do some
> performance measurements before trying this in production. If the
> performance impact is not too bad, then we could ask for a proper disk to be
> added to the machine and move the logging there with TRACE enabled.
>
> My current plan is to take a copy of the kahadb directory once a week and
> take a look at it. I have found some utilities that show the contents of
> those files. Right now the error has not yet occurred, but the system has
> only been running for a week so far since the last restart.
>
>
> --
> Sent from: http://activemq.2283324.n4.nabble.com/ActiveMQ-User-f2341805.html
