[ https://issues.apache.org/jira/browse/KAFKA-5452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeff Chao resolved KAFKA-5452.
------------------------------
    Resolution: Resolved

Following up after a long while. After talking offline with [~wushujames], the original thought was to choose a sensible default in relation to disk I/O. I think it's best to leave this default in place and avoid baking in assumptions about the underlying infrastructure; that way, operators are free to tune to their own expectations. Closing this.

> Aggressive log compaction ratio appears to have no negative effect on log-compacted topics
> ------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-5452
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5452
>             Project: Kafka
>          Issue Type: Improvement
>          Components: config, core, log
>    Affects Versions: 0.10.2.0, 0.10.2.1
>         Environment: Ubuntu Trusty (14.04.5), Oracle JDK 8
>            Reporter: Jeff Chao
>              Labels: performance
>         Attachments: 200mbs-dirty0-dirty-1-dirty05.png, flame-graph-200mbs-dirty0.png, flame-graph-200mbs-dirty0.svg
>
>
> Some of our users are seeing unintuitive/unexpected behavior with log-compacted topics: they receive multiple records for the same key when consuming. This happens because throughput on those topics is low enough that the compaction condition ({{min.cleanable.dirty.ratio = 0.5}}, the default) is never met, so compaction doesn't kick in.
> This prompted us to test and tune {{min.cleanable.dirty.ratio}} per topic in our clusters (an example override is sketched below). It appears that more aggressive log compaction ratios don't have negative effects on CPU and memory utilization. If this is truly the case, we should consider changing the default from {{0.5}} to something more aggressive.
> Setup:
> # 8 brokers
> # 5 zk nodes
> # 32 partitions on a topic
> # replication factor 3
> # log roll 3 hours
> # log segment bytes 1 GB
> # log retention 24 hours
> # all messages to a single key
> # all messages to a unique key
> # all messages to a bounded key range [0, 999]
> # {{min.cleanable.dirty.ratio}} per topic = {{0}}, {{0.5}}, and {{1}}
> # 200 MB/s sustained produce and consume traffic
> Observations:
> We verified that the log cleaner threads were performing work by checking the logs and the {{cleaner-offset-checkpoint}} file for all topics. We also observed that the log cleaner's {{time-since-last-run-ms}} metric was normal, never going above the default of 15 seconds.
> Under-replicated partitions stayed steady, and so did replication lag.
> Here's an example test run where we try out {{min.cleanable.dirty.ratio = 0}}, {{min.cleanable.dirty.ratio = 1}}, and {{min.cleanable.dirty.ratio = 0.5}}. Troughs in between the peaks represent zero traffic and reconfiguring of topics.
> (200mbs-dirty0-dirty-1-dirty05.png attached)
> !200mbs-dirty0-dirty-1-dirty05.png|thumbnail!
> Memory utilization is fine, and more interestingly, CPU usage doesn't appear to differ much between the settings.
> To get more detail, here is a flame graph (raw svg attached) of the run for {{min.cleanable.dirty.ratio = 0}}. The flame graphs for the conservative and default ratios are equivalent.
> (flame-graph-200mbs-dirty0.png attached)
> !flame-graph-200mbs-dirty0.png|thumbnail!
> Notice that the majority of CPU is spent in:
> # SSL operations (on reads/writes)
> # KafkaApis::handleFetchRequest (ReplicaManager::fetchMessages)
> # KafkaApis::handleOffsetFetchRequest
> We also have examples from small-scale test runs which show similar behavior, but with scaled-down CPU usage.
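> For reference, here's a minimal sketch of the kind of per-topic override involved. On the affected broker versions it would be applied with {{kafka-configs.sh}} ({{--alter --entity-type topics --add-config min.cleanable.dirty.ratio=...}}); the snippet below does the same thing programmatically and assumes a 0.11+ {{AdminClient}} plus placeholder broker and topic names:
> {code:java}
> import java.util.Arrays;
> import java.util.Collections;
> import java.util.Properties;
>
> import org.apache.kafka.clients.admin.AdminClient;
> import org.apache.kafka.clients.admin.AdminClientConfig;
> import org.apache.kafka.clients.admin.Config;
> import org.apache.kafka.clients.admin.ConfigEntry;
> import org.apache.kafka.common.config.ConfigResource;
>
> public class SetDirtyRatio {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder broker address
>
>         try (AdminClient admin = AdminClient.create(props)) {
>             // Topic-level override; topic name is illustrative.
>             ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "compacted-test");
>             Config overrides = new Config(Arrays.asList(
>                     new ConfigEntry("cleanup.policy", "compact"),
>                     new ConfigEntry("min.cleanable.dirty.ratio", "0"))); // 0, 0.5, or 1 in the runs above
>
>             // Note: the non-incremental alterConfigs call replaces all existing
>             // topic-level overrides with exactly the entries supplied here.
>             admin.alterConfigs(Collections.singletonMap(topic, overrides)).all().get();
>         }
>     }
> }
> {code}
> The same overrides can alternatively be supplied at topic creation time with {{kafka-topics.sh --create --config ...}}.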
> It seems counterintuitive that there's no apparent difference in CPU between aggressive and conservative compaction ratios, so we'd like to get some thoughts from the community.
> We're looking for feedback on whether anyone else has experienced this behavior, or, if CPU really isn't affected, whether anyone has seen something related instead.
> If this holds up, we'd be happy to discuss further and provide a patch.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)