The broker already has a variety of checks that try to warn you about certain
potential problems and to defensively prevent you from running into them
(producer flow control, for example), but I'm not aware of a check
specifically on the number of live KahaDB log files. Partly that's because
I'm not aware of any universal limit on the number of files that a broker (or
any arbitrary process, for that matter) can create over time, so I'm not sure
anyone would have programmed a check against that number, nor that there's a
single threshold that could be applied to all installations no matter what
their environments. If you can provide more information about exactly what
errors are seen when the broker "shuts down because there are too many files
in the directory (I think it was 3000), [and] cannot be restarted we think
due to a JVM limitation," we could see whether there is indeed some
reasonable threshold that could be applied universally.
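
In the meantime, if you want an early warning rather than a hard failure, one
option is to count the journal files yourself from cron or whatever
monitoring you already run. A rough sketch, where the directory and the
threshold are placeholders for your installation rather than ActiveMQ
defaults:

import java.io.File;

// Sketch only: warn when the KahaDB journal directory holds more
// db-*.log files than a threshold you choose.
public class KahaDbJournalCheck {
    public static void main(String[] args) {
        File kahaDir = new File("/opt/activemq/data/kahadb"); // your dataDirectory
        int threshold = 500; // pick something well below where you hit trouble

        File[] journals = kahaDir.listFiles(
                (dir, name) -> name.startsWith("db-") && name.endsWith(".log"));
        int count = (journals == null) ? 0 : journals.length;

        if (count > threshold) {
            System.err.println("WARNING: " + count + " KahaDB journal files in "
                    + kahaDir + " (threshold " + threshold + ")");
            System.exit(1); // non-zero exit so a wrapper script can alert
        }
    }
}

That won't tell you why the files aren't being GC'ed, but it would give you
the alert you asked about before things become catastrophic.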
Tim

On Mon, Nov 6, 2017 at 3:05 AM, Lionel van den Berg <lion...@gmail.com> wrote:

> Thanks for the detailed response. I generally agree it's not flawed and
> it is likely my configuration; I'm trying to take steps to track down
> the cause, but of course it is behaving now.
>
> I do think some defensiveness to protect it from shutting down,
> regardless of configuration, would be a good idea. Is there some system
> in place that would allow me to be alerted to potentially catastrophic
> issues before they happen?
>
> On 21 October 2017 at 14:08, Tim Bain <tb...@alumni.duke.edu> wrote:
>
> > Responses inline.
> >
> > On Fri, Oct 20, 2017 at 5:46 AM, Lionel van den Berg
> > <lion...@gmail.com> wrote:
> >
> > > Hi, thanks for the response.
> > >
> > > Some questions on these points from the troubleshooting page.
> > >
> > > 1. *It contains a pending message for a destination or durable
> > > topic subscription*
> > >
> > > This seems a little flawed: if a consumer I have little control
> > > over is misbehaving, then my ActiveMQ can end up shut down and
> > > unrecoverable. Is there some way of timing this out or similar?
> >
> > There are multiple ways of discarding messages that are not being
> > consumed, which are detailed at
> > http://activemq.apache.org/slow-consumer-handling.html (several of
> > which it sounds like you're already using); a minimal client-side
> > sketch follows after my response to #4. Keep in mind that unconsumed
> > DLQ messages are unconsumed messages, so you'll want to make sure
> > you address those messages as well;
> > http://activemq.apache.org/message-redelivery-and-dlq-handling.html
> > contains additional information about handling messages in the
> > context of the DLQ. And no, I wouldn't say it's flawed; it just
> > means you have to do some configuration work that you haven't yet
> > done.
> >
> > > 2. *It contains an ack for a message which is in an in-use data
> > > file - the ack cannot be removed as a recovery would then mark the
> > > message for redelivery*
> > >
> > > Same comment as 1.
> >
> > Same response as for #1. There's one additional wrinkle (KahaDB
> > keeps an entire data file alive because of a single live message,
> > which in turn means you have to keep the acks for later messages
> > that live in later data files), but that's been partially mitigated
> > by the addition of the ability to compact acks by replaying them
> > into the current data file, which should allow any data file that
> > contains no live non-ack messages to be GC'ed. So there's a small
> > portion of this that's purely the result of KahaDB's design as a
> > non-compacting data store, but it's a problem only when there's an
> > old unacknowledged message, which takes us back to #1.
> >
> > > 3. *The journal references a pending transaction*
> > >
> > > I'm not using transactions, but are there transactions under the
> > > hood?
> >
> > No, this would only apply if you were directly using transactions,
> > so it doesn't apply to you.
> >
> > > 4. *It is a journal file, and there may be a pending write to it*
> > >
> > > Why would this be the case?
> >
> > If we haven't finished flushing the file (the journal uses a
> > buffer-then-flush paradigm). This will be an infrequent situation
> > and should affect only a small number of data files, so if you're
> > having a problem with the number of files kept, it's not because of
> > this. It's just included in the list for completeness.
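> >
> > Here's the client-side sketch promised under #1: a rough example of
> > message expiration using the standard JMS time-to-live setting, so
> > unconsumed messages eventually expire instead of pinning journal
> > files forever. The broker URL and queue name below are placeholders,
> > and keep in mind that expired messages can themselves land on the
> > DLQ, which is why the DLQ handling above still matters:
> >
> > import javax.jms.*;
> > import org.apache.activemq.ActiveMQConnectionFactory;
> >
> > // Sketch only: every message sent by this producer expires after 7
> > // days if no consumer has taken it by then.
> > public class ExpiringProducer {
> >     public static void main(String[] args) throws JMSException {
> >         Connection connection = new ActiveMQConnectionFactory(
> >                 "tcp://localhost:61616").createConnection();
> >         connection.start();
> >         try {
> >             Session session =
> >                     connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
> >             MessageProducer producer =
> >                     session.createProducer(session.createQueue("TEST.QUEUE"));
> >             producer.setTimeToLive(7 * 24 * 60 * 60 * 1000L); // ms
> >             producer.send(session.createTextMessage("payload"));
> >         } finally {
> >             connection.close();
> >         }
> >     }
> > }
> >
> > If changing your producers isn't an option, the pages linked above
> > also cover broker-side alternatives.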
> > > I'll see if I can change the logging settings, since on the first
> > > occurrence the number of log files does not seem to have been an
> > > issue. I have it configured to keep messages for 7 days, so
> > > regardless of the above conditions I would have thought that at
> > > that expiry the log would be cleaned up, so we don't end up in
> > > such a situation where the system stops and cannot restart.
> >
> > If you are indeed configured as you describe, I would think that log
> > cleanup would indeed happen as you expect, which means that either
> > there's an undiscovered bug in our code or you're not configured the
> > way you think you are.
> >
> > The page I linked to originally has instructions for how to
> > determine which destinations have messages that are preventing the
> > KahaDB data files from being deleted, which might let you
> > investigate further (for example, by looking at the messages and
> > their attributes to see if timestamps are being set correctly).
> >
> > Tim
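
P.S. For that timestamp check, a JMS QueueBrowser lets you look at
message headers without consuming anything. Another rough sketch, again
with the broker URL and queue name as placeholders:

import java.util.Enumeration;
import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

// Sketch only: browse a queue and print the headers that control expiry.
public class BrowseTimestamps {
    public static void main(String[] args) throws JMSException {
        Connection connection = new ActiveMQConnectionFactory(
                "tcp://localhost:61616").createConnection();
        connection.start();
        try {
            Session session =
                    connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            QueueBrowser browser =
                    session.createBrowser(session.createQueue("TEST.QUEUE"));
            Enumeration<?> messages = browser.getEnumeration();
            while (messages.hasMoreElements()) {
                Message m = (Message) messages.nextElement();
                // a JMSExpiration of 0 means the message never expires
                System.out.println(m.getJMSMessageID()
                        + " timestamp=" + m.getJMSTimestamp()
                        + " expiration=" + m.getJMSExpiration());
            }
        } finally {
            connection.close();
        }
    }
}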