Hi,

You're running into the issue in https://issues.apache.org/jira/plugins/servlet/mobile#issue/KAFKA-3894 and possibly https://issues.apache.org/jira/plugins/servlet/mobile#issue/KAFKA-3587 (which is fixed in 0.10). Sadly, right now there's no way to know in advance how large a dedupe buffer size you need - it depends on the write throughput and the number of unique keys going to that topic.
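That said, you can get a rough ceiling if you assume the offset map spends about 24 bytes per unique key (a 16-byte MD5 hash plus an 8-byte offset, as in the cleaner's SkimpyOffsetMap) and applies the default 0.9 load factor (log.cleaner.io.buffer.load.factor). Those constants are my own reading of the 0.9/0.10 code, so treat the sketch below as back-of-the-envelope only:

// Rough estimate of how many unique keys a given dedupe buffer can cover.
// Assumes ~24 bytes per map entry and the default 0.9 load factor (my assumptions).
public class DedupeBufferEstimate {
    public static void main(String[] args) {
        long dedupeBufferBytes = 536_870_912L; // log.cleaner.dedupe.buffer.size (total, shared by all cleaner threads)
        int cleanerThreads = 1;                // log.cleaner.threads
        int bytesPerEntry = 24;                // 16-byte MD5 hash + 8-byte offset
        double loadFactor = 0.9;               // log.cleaner.io.buffer.load.factor

        long keysPerPass = (long) (dedupeBufferBytes / cleanerThreads / bytesPerEntry * loadFactor);
        System.out.printf("Offset map can hold roughly %,d unique keys per cleaning pass%n", keysPerPass);
    }
}

With the 0.9.x default of 128 MiB this works out to about 5,033,164 keys, which matches the "offset map can fit only 5033164" figure in the trace quoted below. The buffer is also shared across log.cleaner.threads, which is why the error suggests lowering the thread count as an alternative.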
For now I'd recommend:

a) Upgrade to 0.10, as KAFKA-3587 is fixed there. Kafka doesn't backport patches (as far as I'm aware), so you need to upgrade.

b) Monitor and alert on the log cleaner thread dying. This can be done by getting a thread dump over JMX, loading the thread names, and ensuring one containing "log-cleaner" is always running (a minimal sketch of such a check is at the bottom of this mail, below the quoted thread). Alternatively, monitoring the number of log segments for compacted topics, or the number of open file descriptors, will serve as an OK proxy. When this thread does crash, you have to remediate by increasing the dedupe buffer size.

We're exploring solutions in KAFKA-3894, and would love your feedback there if you have any thoughts.

Thanks

Tom Crayford
Heroku Kafka

On Wednesday, 13 July 2016, Rakesh Vidyadharan <rvidyadha...@gracenote.com> wrote:

> We ran into this as well, and I ended up with the following that works for us.
>
> log.cleaner.dedupe.buffer.size=536870912
> log.cleaner.io.buffer.size=20000000
>
> On 13/07/2016 14:01, "Lawrence Weikum" <lwei...@pandora.com> wrote:
>
> >Apologies. Here is the full trace from a broker:
> >
> >[2016-06-24 09:57:39,881] ERROR [kafka-log-cleaner-thread-0], Error due to (kafka.log.LogCleaner)
> >java.lang.IllegalArgumentException: requirement failed: 9730197928 messages in segment __consumer_offsets-36/00000000000000000000.log but offset map can fit only 5033164. You can increase log.cleaner.dedupe.buffer.size or decrease log.cleaner.threads
> >        at scala.Predef$.require(Predef.scala:219)
> >        at kafka.log.Cleaner$$anonfun$buildOffsetMap$4.apply(LogCleaner.scala:584)
> >        at kafka.log.Cleaner$$anonfun$buildOffsetMap$4.apply(LogCleaner.scala:580)
> >        at scala.collection.immutable.Stream$StreamWithFilter.foreach(Stream.scala:570)
> >        at kafka.log.Cleaner.buildOffsetMap(LogCleaner.scala:580)
> >        at kafka.log.Cleaner.clean(LogCleaner.scala:322)
> >        at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:230)
> >        at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:208)
> >        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
> >[2016-06-24 09:57:39,881] INFO [kafka-log-cleaner-thread-0], Stopped (kafka.log.LogCleaner)
> >
> >Is log.cleaner.dedupe.buffer.size a broker setting? What is a good number to set it to?
> >
> >Lawrence Weikum
> >
> >On 7/13/16, 11:18 AM, "Manikumar Reddy" <manikumar.re...@gmail.com> wrote:
> >
> >Can you post the complete error stack trace? Yes, you need to restart the affected brokers.
> >You can tweak log.cleaner.dedupe.buffer.size, log.cleaner.io.buffer.size configs.
> >
> >Some related JIRAs:
> >
> >https://issues.apache.org/jira/browse/KAFKA-3587
> >https://issues.apache.org/jira/browse/KAFKA-3894
> >https://issues.apache.org/jira/browse/KAFKA-3915
> >
> >On Wed, Jul 13, 2016 at 10:36 PM, Lawrence Weikum <lwei...@pandora.com> wrote:
> >
> >> Oh interesting. I didn’t know about that log file until now.
> >>
> >> The only error that has been populated among all brokers showing this behavior is:
> >>
> >> ERROR [kafka-log-cleaner-thread-0], Error due to (kafka.log.LogCleaner)
> >>
> >> Then we see many messages like this:
> >>
> >> INFO Compaction for partition [__consumer_offsets,30] is resumed (kafka.log.LogCleaner)
> >> INFO The cleaning for partition [__consumer_offsets,30] is aborted (kafka.log.LogCleaner)
> >>
> >> Using Visual VM, I do not see any log-cleaner threads in those brokers. I
> >> do see it in the brokers not showing this behavior though.
> >>
> >> Any idea why the LogCleaner failed?
> >>
> >> As a temporary fix, should we restart the affected brokers?
> >>
> >> Thanks again!
> >>
> >> Lawrence Weikum
> >>
> >> On 7/13/16, 10:34 AM, "Manikumar Reddy" <manikumar.re...@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> Are you seeing any errors in log-cleaner.log? The log-cleaner thread can crash on certain errors.
> >>
> >> Thanks
> >> Manikumar
> >>
> >> On Wed, Jul 13, 2016 at 9:54 PM, Lawrence Weikum <lwei...@pandora.com> wrote:
> >>
> >> > Hello,
> >> >
> >> > We’re seeing a strange behavior in Kafka 0.9.0.1 which occurs about every
> >> > other week. I’m curious if others have seen it and know of a solution.
> >> >
> >> > Setup and scenario:
> >> >
> >> > - Brokers initially set up with log compaction turned off
> >> >
> >> > - After 30 days, log compaction was turned on
> >> >
> >> > - At this time, the number of open FDs was ~30K per broker
> >> >
> >> > - After 2 days, the __consumer_offsets topic was compacted fully;
> >> > open FDs reduced to ~5K per broker
> >> >
> >> > - The cluster has been under normal load for roughly 7 days
> >> >
> >> > - At the 7-day mark, the __consumer_offsets topic seems to have stopped
> >> > compacting on two of the brokers, and on those brokers the FD count is up to ~25K
> >> >
> >> > We have tried rebalancing the partitions before. The first time, the
> >> > destination broker had compacted the data fine and open FDs were low. The
> >> > second time, the destination broker kept the FDs open.
> >> >
> >> > In all the broker logs, we’re seeing this message:
> >> > INFO [Group Metadata Manager on Broker 8]: Removed 0 expired offsets in 0
> >> > milliseconds. (kafka.coordinator.GroupMetadataManager)
> >> >
> >> > There are only 4 consumers at the moment on the cluster; one topic with 92
> >> > partitions.
> >> >
> >> > Is there a reason why log compaction may stop working, or why the
> >> > __consumer_offsets topic would start holding thousands of FDs?
> >> >
> >> > Thank you all for your help!
> >> >
> >> > Lawrence Weikum
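PS - here's the kind of JMX check I mean in (b) above. It's a minimal, illustrative sketch: the broker-host:9999 endpoint is a placeholder for your own host and port, and it assumes remote JMX is enabled on the broker (e.g. by exporting JMX_PORT).

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class LogCleanerAliveCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint - replace with your broker's JMX host/port.
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);

            // Look for any live thread whose name contains "log-cleaner".
            boolean cleanerAlive = false;
            for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
                if (info != null && info.getThreadName().contains("log-cleaner")) {
                    cleanerAlive = true;
                    break;
                }
            }

            if (!cleanerAlive) {
                System.err.println("ALERT: no log-cleaner thread running on this broker");
                System.exit(1);
            }
            System.out.println("log-cleaner thread is alive");
        }
    }
}

Wire the exit code into whatever alerting you already use. In your trace the thread shows up as kafka-log-cleaner-thread-0, so matching on "log-cleaner" should be enough.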