
Jeff Chao resolved KAFKA-5452.
    Resolution: Resolved

Following up after a long while. I talked offline with [~wushujames]; the 
original thought was to choose a sensible default in relation to disk I/O. I 
think it's best to leave the default as it is and avoid making assumptions 
about the underlying infrastructure. That way, operators are free to tune it 
to their expectations. Closing this.

> Aggressive log compaction ratio appears to have no negative effect on 
> log-compacted topics
> ------------------------------------------------------------------------------------------
>                 Key: KAFKA-5452
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5452
>             Project: Kafka
>          Issue Type: Improvement
>          Components: config, core, log
>    Affects Versions:,
>         Environment: Ubuntu Trusty (14.04.5), Oracle JDK 8
>            Reporter: Jeff Chao
>              Labels: performance
>         Attachments: 200mbs-dirty0-dirty-1-dirty05.png, 
> flame-graph-200mbs-dirty0.png, flame-graph-200mbs-dirty0.svg
> Some of our users are seeing unintuitive/unexpected behavior with 
> log-compacted topics where they receive multiple records for the same key 
> when consuming. This is a result of low throughput on log-compacted topics 
> such that conditions ({{min.cleanable.dirty.ratio = 0.5}}, default) aren't 
> met for compaction to kick in.
> This prompted us to test and tune {{min.cleanable.dirty.ratio}} in our 
> clusters. It appears that more aggressive log compaction ratios don't 
> have negative effects on CPU and memory utilization. If this is truly the 
> case, we should consider changing the default from {{0.5}} to something more 
> aggressive.
> Setup:
> # 8 brokers
> # 5 zk nodes
> # 32 partitions on a topic
> # replication factor 3
> # log roll 3 hours
> # log segment bytes 1 GB
> # log retention 24 hours
> # all messages to a single key
> # all messages to a unique key
> # all messages to a bounded key range [0, 999]
> # {{min.cleanable.dirty.ratio}} per topic = {{0}}, {{0.5}}, and {{1}}
> # 200 MB/s sustained, produce and consume traffic
> Observations:
> We were able to verify log cleaner threads were performing work by checking 
> the logs and verifying the {{cleaner-offset-checkpoint}} file for all topics. 
> We also observed the log cleaner's {{time-since-last-run-ms}} metric was 
> normal, never going above the default of 15 seconds.
> Under-replicated partitions stayed steady, same for replication lag.
> Here's an example test run where we try out {{min.cleanable.dirty.ratio = 
> 0}}, {{min.cleanable.dirty.ratio = 1}}, and {{min.cleanable.dirty.ratio = 
> 0.5}}. Troughs in between the peaks represent zero traffic and reconfiguring 
> of topics.
> (200mbs-dirty0-dirty-1-dirty05.png attached)
> !200mbs-dirty0-dirty-1-dirty05.png|thumbnail!
> Memory utilization is fine, but more interestingly, CPU utilization doesn't 
> appear to differ much between the runs.
> To get more detail, here is a flame graph (raw svg attached) of the run for 
> {{min.cleanable.dirty.ratio = 0}}. The conservative and default ratio flame 
> graphs are equivalent.
> (flame-graph-200mbs-dirty0.png attached)
> !flame-graph-200mbs-dirty0.png|thumbnail!
> Notice that the majority of CPU is coming from:
> # SSL operations (on reads/writes)
> # KafkaApis::handleFetchRequest (ReplicaManager::fetchMessages)
> # KafkaApis::handleOffsetFetchRequest
> We also have examples from small scale test runs which show similar behavior 
> but with scaled down CPU usage.
> It seems counterintuitive that there's no apparent difference in CPU between 
> aggressive and conservative compaction ratios, so we'd like to get some 
> thoughts from the community.
> We're looking for feedback on whether anyone else has experienced this 
> behavior or, if CPU isn't affected, whether anyone has seen some other 
> related effect instead.
> If this is true, then we'd be happy to discuss further and provide a patch.
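
For readers unfamiliar with the setting, the condition the issue describes 
(low-throughput topics never reaching {{min.cleanable.dirty.ratio}}) can be 
sketched roughly as below. This is an illustrative simplification, not the 
broker's actual code: the function names and byte counts are hypothetical, and 
it assumes a log becomes eligible only when its dirty ratio strictly exceeds 
the configured threshold.

```python
def dirty_ratio(clean_bytes: int, dirty_bytes: int) -> float:
    """Fraction of the log that has not been compacted yet (hypothetical helper)."""
    total = clean_bytes + dirty_bytes
    return dirty_bytes / total if total > 0 else 0.0


def eligible_for_compaction(clean_bytes: int, dirty_bytes: int,
                            min_cleanable_dirty_ratio: float = 0.5) -> bool:
    """Assumption: a log is cleaned only when its dirty ratio exceeds the threshold."""
    return dirty_ratio(clean_bytes, dirty_bytes) > min_cleanable_dirty_ratio


if __name__ == "__main__":
    # Made-up low-throughput topic: 900 MB already compacted, 100 MB of new writes.
    clean, dirty = 900 * 1024**2, 100 * 1024**2
    for threshold in (0.0, 0.5, 1.0):
        print(threshold, eligible_for_compaction(clean, dirty, threshold))
```

Under these assumptions, a topic whose dirty portion stays at 10% is never 
cleaned with the default of {{0.5}}, so consumers can keep seeing several 
records for the same key, while a threshold of {{0}} makes any dirty data 
eligible.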

This message was sent by Atlassian JIRA
