Jeff Chao created KAFKA-5452:
--------------------------------

             Summary: Aggressive log compaction ratio appears to have no 
negative effect on log-compacted topics
                 Key: KAFKA-5452
                 URL: https://issues.apache.org/jira/browse/KAFKA-5452
             Project: Kafka
          Issue Type: Improvement
          Components: config, core, log
    Affects Versions: 0.10.2.1, 0.10.2.0
         Environment: Ubuntu Trusty (14.04.5), Oracle JDK 8
            Reporter: Jeff Chao
         Attachments: 200mbs-dirty0-dirty-1-dirty05.png, 
flame-graph-200mbs-dirty0.png, flame-graph-200mbs-dirty0.svg

Some of our users are seeing unintuitive/unexpected behavior with log-compacted 
topics: when consuming, they receive multiple records for the same key. This is 
a result of low throughput on those topics, such that the compaction condition 
({{min.cleanable.dirty.ratio = 0.5}}, the default) is never met and the log 
cleaner never kicks in.
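
For context on why these topics never compact: the log cleaner only considers a 
partition once the uncleaned ("dirty") portion of the log makes up at least 
{{min.cleanable.dirty.ratio}} of the total log size. Here's a rough sketch of 
that check (names are illustrative, not Kafka internals):

{code:java}
// Rough sketch of the cleaner's eligibility check. Names are illustrative,
// not Kafka internals: a partition is only considered for compaction once the
// dirty portion of the log is a large enough fraction of the total.
public final class DirtyRatioSketch {

    static boolean eligibleForCompaction(long cleanBytes, long dirtyBytes,
                                         double minCleanableDirtyRatio) {
        double dirtyRatio = (double) dirtyBytes / (cleanBytes + dirtyBytes);
        return dirtyRatio >= minCleanableDirtyRatio;
    }

    public static void main(String[] args) {
        // A low-throughput compacted topic: 10 GB already cleaned, 1 GB of new data.
        long clean = 10L * 1024 * 1024 * 1024;
        long dirty = 1L * 1024 * 1024 * 1024;
        System.out.println(eligibleForCompaction(clean, dirty, 0.5)); // false -> duplicates linger
        System.out.println(eligibleForCompaction(clean, dirty, 0.0)); // true  -> cleaner may run
    }
}
{code}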

This prompted us to test and tune {{min.cleanable.dirty.ratio}} in our 
clusters. It appears that more aggressive log compaction ratios have no 
negative effect on CPU and memory utilization. If this is truly the case, we 
should consider changing the default from {{0.5}} to something more aggressive.
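
For reference, here's a minimal sketch of lowering the ratio on a single topic 
with the Java {{AdminClient}}. {{incrementalAlterConfigs}} is from clients newer 
than the 0.10.2 brokers in this test ({{kafka-configs.sh}} can make the same 
change), and the topic name and bootstrap address are placeholders:

{code:java}
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import org.apache.kafka.common.config.TopicConfig;

public class LowerDirtyRatio {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            // Topic-level override of min.cleanable.dirty.ratio (placeholder topic name).
            ConfigResource topic =
                new ConfigResource(ConfigResource.Type.TOPIC, "compacted-test-topic");
            AlterConfigOp op = new AlterConfigOp(
                new ConfigEntry(TopicConfig.MIN_CLEANABLE_DIRTY_RATIO_CONFIG, "0.1"),
                AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(
                    Collections.singletonMap(topic, Collections.singletonList(op)))
                .all().get();
        }
    }
}
{code}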

Setup (a sketch of the topic-side configuration follows the list):

# 8 brokers
# 5 zk nodes
# 32 partitions on a topic
# replication factor 3
# log roll 3 hours
# log segment bytes 1 GB
# log retention 24 hours
# all messages to a single key
# all messages to unique keys
# all messages to a bounded key range [0, 999]
# {{min.cleanable.dirty.ratio}} per topic = {{0}}, {{0.5}}, and {{1}}
# 200 MB/s sustained produce and consume traffic
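
As mentioned above, here is a sketch of the topic-side half of that setup using 
the Java {{AdminClient}}. {{createTopics}} and the {{TopicConfig}} constants are 
from clients newer than the brokers under test, the topic name and bootstrap 
address are placeholders, and the roll/retention values are shown as topic-level 
overrides purely for illustration (broker-level defaults work just as well):

{code:java}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateCompactedTestTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            Map<String, String> configs = new HashMap<>();
            configs.put(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT);
            configs.put(TopicConfig.SEGMENT_MS_CONFIG, String.valueOf(3L * 60 * 60 * 1000));    // 3 h log roll
            configs.put(TopicConfig.SEGMENT_BYTES_CONFIG, String.valueOf(1024 * 1024 * 1024));  // 1 GB segments
            configs.put(TopicConfig.RETENTION_MS_CONFIG, String.valueOf(24L * 60 * 60 * 1000)); // 24 h retention
            configs.put(TopicConfig.MIN_CLEANABLE_DIRTY_RATIO_CONFIG, "0");                     // also 0.5 and 1

            // 32 partitions, replication factor 3, per the setup above.
            NewTopic topic = new NewTopic("compacted-test-topic", 32, (short) 3).configs(configs);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
{code}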

Observations:

We verified that the log cleaner threads were doing work by checking the broker 
logs and the {{cleaner-offset-checkpoint}} file for all topics. We also observed 
that the log cleaner's {{time-since-last-run-ms}} metric stayed normal, never 
going above the default cleaner backoff of 15 seconds ({{log.cleaner.backoff.ms}}).
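
For anyone reproducing this, here's roughly the kind of JMX polling involved. 
The MBean name below 
({{kafka.log:type=LogCleanerManager,name=time-since-last-run-ms}}) may vary 
between broker versions, and the host/port are placeholders:

{code:java}
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CheckCleanerLastRun {
    public static void main(String[] args) throws Exception {
        // Assumes the broker was started with remote JMX enabled on port 9999 (placeholder).
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // MBean name is an assumption; it may differ across Kafka versions.
            ObjectName cleaner = new ObjectName(
                "kafka.log:type=LogCleanerManager,name=time-since-last-run-ms");
            Object millisSinceLastRun = conn.getAttribute(cleaner, "Value");
            System.out.println("time-since-last-run-ms = " + millisSinceLastRun);
        }
    }
}
{code}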

Under-replicated partitions stayed steady, as did replication lag.

Here's an example test run where we tried {{min.cleanable.dirty.ratio = 0}}, 
{{min.cleanable.dirty.ratio = 1}}, and {{min.cleanable.dirty.ratio = 0.5}}. 
The troughs between the peaks are periods of zero traffic while topics were 
being reconfigured.

!200mbs-dirty0-dirty-1-dirty05.png|thumbnail!

Memory utilization is fine and, more interestingly, CPU utilization shows 
little difference across the three ratios.

For more detail, here is a flame graph (raw SVG attached) of the run with 
{{min.cleanable.dirty.ratio = 0}}. The flame graphs for the conservative and 
default ratios look equivalent.

!flame-graph-200mbs-dirty0.png|thumbnail!

Notice that the majority of CPU time is coming from:

# SSL operations (on reads/writes)
# {{KafkaApis::handleFetchRequest}} ({{ReplicaManager::fetchMessages}})
# {{KafkaApis::handleOffsetFetchRequest}}

We also have examples from smaller-scale test runs that show similar behavior, 
just with proportionally lower CPU usage.

It seems counterintuitive that there's no apparent difference in CPU between 
aggressive and conservative compaction ratios, so we'd like to get some 
thoughts from the community.

We're looking for feedback on whether anyone else has experienced this behavior, 
or, if CPU really isn't affected, whether anyone has seen something related 
instead.

If this is true, then we'd be happy to discuss further and provide a patch.


