[ https://issues.apache.org/jira/browse/KAFKA-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15382511#comment-15382511 ]
Peter Davis commented on KAFKA-3894:
------------------------------------

Re: "the broker seems to be working"

You may regret not taking action now. As Tim mentioned from the talk at the Kafka Summit (http://www.slideshare.net/jjkoshy/kafkaesque-days-at-linked-in-in-2015/49), if __consumer_offsets is not compacted and has accumulated millions (or billions!) of messages, it can take many minutes for the broker to elect a new coordinator after any kind of hiccup. *Your new consumers may be hung during this time!*

However, even shutting down brokers to change the configuration will cause coordinator elections which will cause an outage. It seems like not having a "hot spare" for Offset Managers is a liability here…

We were bit by this bug and it caused all kinds of headaches until we managed to get __consumer_offsets cleaned up again.

> Log Cleaner thread crashes and never restarts
> ---------------------------------------------
>
>                 Key: KAFKA-3894
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3894
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8.2.2, 0.9.0.1
>         Environment: Oracle JDK 8
> Ubuntu Precise
>            Reporter: Tim Carey-Smith
>              Labels: compaction
>
> The log-cleaner thread can crash if the number of keys in a topic grows to be
> too large to fit into the dedupe buffer.
> The result of this is a log line:
> {quote}
> broker=0 pri=ERROR t=kafka-log-cleaner-thread-0 at=LogCleaner
> \[kafka-log-cleaner-thread-0\], Error due to
> java.lang.IllegalArgumentException: requirement failed: 9750860 messages in
> segment MY_FAVORITE_TOPIC-2/00000000000047580165.log but offset map can fit
> only 5033164. You can increase log.cleaner.dedupe.buffer.size or decrease
> log.cleaner.threads
> {quote}
> As a result, the broker is left in a potentially dangerous situation where
> cleaning of compacted topics is not running.
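For anyone sizing the buffer from the numbers in the quoted error, here is a rough sketch of the arithmetic. It assumes a 128 MiB log.cleaner.dedupe.buffer.size, a single cleaner thread, 24 bytes per offset-map entry (a 16-byte key hash plus an 8-byte offset) and a load factor of 0.9. None of those constants are stated in the ticket; they are my assumptions about the broker's defaults and internals, and they happen to reproduce the 5033164 figure. The function names are made up for illustration.

```python
import math

# Assumed constants (not taken from the ticket).
BYTES_PER_ENTRY = 24   # assumed: 16-byte hash + 8-byte offset per map slot
LOAD_FACTOR = 0.9      # assumed default fill limit of the offset map


def offset_map_capacity(dedupe_buffer_bytes, cleaner_threads=1):
    """Approximate number of entries one cleaner thread's offset map can hold."""
    per_thread_bytes = dedupe_buffer_bytes // cleaner_threads
    slots = per_thread_bytes // BYTES_PER_ENTRY
    return int(slots * LOAD_FACTOR)


def min_buffer_bytes(entries_needed, cleaner_threads=1):
    """Smallest total dedupe buffer (bytes) whose per-thread map fits entries_needed."""
    slots = math.ceil(entries_needed / LOAD_FACTOR)
    # Guard against floating-point rounding leaving us one slot short.
    while int(slots * LOAD_FACTOR) < entries_needed:
        slots += 1
    return slots * BYTES_PER_ENTRY * cleaner_threads


if __name__ == "__main__":
    print(offset_map_capacity(128 * 1024 * 1024))   # capacity under the assumed defaults
    print(min_buffer_bytes(9750860) / (1024 * 1024)) # MiB needed for the segment in the error
```

Under these assumptions the segment in the error would need roughly 248 MiB of dedupe buffer per cleaner thread, so simply doubling the buffer from 128 MiB to 256 MiB would clear this particular segment, with little headroom.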
> It is unclear if the broader strategy for the {{LogCleaner}} is the reason
> for this upper bound, or if this is a value which must be tuned for each
> specific use-case.
> Of more immediate concern is the fact that the thread crash is not visible
> via JMX or exposed as some form of service degradation.
> Some short-term remediations we have made are:
> * increasing the size of the dedupe buffer
> * monitoring the log-cleaner threads inside the JVM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
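On the second remediation in the description (monitoring the log-cleaner threads), one possible approach from outside the broker is to scan a JVM thread dump for thread names that start with kafka-log-cleaner-thread-, the name that appears in the quoted error. This is only a sketch: it assumes a dump in the usual jstack layout, where each thread entry begins with the thread name in double quotes, and the helper names are hypothetical.

```python
import re

# Thread-name prefix taken from the log line quoted in the issue.
CLEANER_PREFIX = "kafka-log-cleaner-thread-"


def cleaner_threads(thread_dump):
    """Return the names of log-cleaner threads found in a thread dump string."""
    # Assumed format: each thread entry starts a line with "thread name" in quotes.
    names = re.findall(r'^"([^"]+)"', thread_dump, flags=re.MULTILINE)
    return [name for name in names if name.startswith(CLEANER_PREFIX)]


def cleaner_is_alive(thread_dump, expected=1):
    """True if at least `expected` log-cleaner threads are present in the dump."""
    return len(cleaner_threads(thread_dump)) >= expected
```

The dump text could come from running `jstack <pid>` against the broker process on a schedule and alerting when `cleaner_is_alive` returns False. Since the ticket says the crash is not visible via JMX, a check like this, or an alert on the ERROR log line itself, may be the only signal available.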