[ https://issues.apache.org/jira/browse/KAFKA-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15246565#comment-15246565 ]
Jiangjie Qin edited comment on KAFKA-3565 at 4/18/16 9:22 PM:
--------------------------------------------------------------

[~ijuma] I see. So we do know that the throughput of a single-user-thread producer will be lower compared with 0.9, but we are trying to understand why the throughput seems even lower than our expectation, considering the amount of overhead introduced in KIP-32. I did the following test:
1. Launch a one-broker cluster running 0.9.
2. Launch another one-broker cluster running trunk.
3. Use the tweaked 0.9 ProducerPerformance and the trunk ProducerPerformance to produce to an 8-partition topic.

I am not able to reproduce the result you had for gzip.
{noformat}
./kafka-run-class.sh org.apache.kafka.tools.ProducerPerformance --topic becket_test_1_replica_8_partition --num-records 500000 --record-size 1000 --throughput 100000 --valueBound 50000 --producer-props bootstrap.servers=localhost:9092 acks=1 max.in.flight.requests.per.connection=1 compression.type=gzip batch.size=500000 client.id=becket

The result from 0.9:
500000 records sent, 3734.548306 records/sec (3.56 MB/sec), 368.73 ms avg latency, 790.00 ms max latency, 368 ms 50th, 535 ms 95th, 597 ms 99th, 723 ms 99.9th.

The result from trunk:
500000 records sent, 11028.276501 records/sec (10.52 MB/sec), 4.08 ms avg latency, 148.00 ms max latency, 4 ms 50th, 6 ms 95th, 9 ms 99th, 57 ms 99.9th.
{noformat}

The results for snappy with 100B messages are the following:
{noformat}
./kafka-run-class.sh org.apache.kafka.tools.ProducerPerformance --topic becket_test_1_replica_8_partition --num-records 100000000 --record-size 100 --throughput 10000 --valueBound 50000 --producer-props bootstrap.servers=localhost:9092 acks=1 max.in.flight.requests.per.connection=1 compression.type=snappy batch.size=500000 client.id=becket

The result from 0.9:
100000000 records sent, 358709.649648 records/sec (34.21 MB/sec), 22.84 ms avg latency, 388.00 ms max latency, 21 ms 50th, 30 ms 95th, 44 ms 99th, 237 ms 99.9th.

The result from trunk:
100000000 records sent, 272133.279995 records/sec (25.95 MB/sec), 13.96 ms avg latency, 1057.00 ms max latency, 9 ms 50th, 26 ms 95th, 145 ms 99th, 915 ms 99.9th.
{noformat}

I took a closer look at the ProducerPerformance metrics; there are a few differences with (w) and without (w/o) KIP-31/32:
1. The batch size: 212721 (w) vs 475194 (w/o)
2. The request rate: 134 (w) vs 81 (w/o)
3. The record queue time: 4.9 ms (w) vs 18 ms (w/o)

This indicates that in general the sender thread is running more iterations after KIP-31/32 due to the smaller latency from the broker on trunk (in fact I think this is the metric we should care about most). That also means more batches are rolled out, and more lock grabbing. Those things can impact the throughput of a single user thread. (A short sketch of how these metrics can be read programmatically is at the end of this comment.)

While the throughput of a single user thread is important, if we take the producer as a system, there are too many factors that can affect it. One thing I noticed about producer performance is that you have to tune it, e.g. if I change the configuration of the trunk ProducerPerformance to batch.size=100000 and linger.ms=100 (the tuned invocation is sketched below), the result I get is similar to the 0.9 result:
{{100000000 records sent, 349094.971287 records/sec (33.29 MB/sec), 25.34 ms avg latency, 540.00 ms max latency, 23 ms 50th, 31 ms 95th, 107 ms 99th, 388 ms 99.9th.}}

I think we can say that with KIP-31/32:
1. The brokers are able to handle the requests much faster, so the throughput of the broker increased.
2. Each user thread of the producer might be slower because of the 8-byte overhead (a rough size calculation is below).
But users can increase the number of user threads or tune the producer to get better throughput.
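For reference, the tuned invocation would look like the following; it is the same command as the snappy run above, with only batch.size lowered and linger.ms added (all other flags unchanged):
{noformat}
./kafka-run-class.sh org.apache.kafka.tools.ProducerPerformance --topic becket_test_1_replica_8_partition --num-records 100000000 --record-size 100 --throughput 10000 --valueBound 50000 --producer-props bootstrap.servers=localhost:9092 acks=1 max.in.flight.requests.per.connection=1 compression.type=snappy batch.size=100000 linger.ms=100 client.id=becket
{noformat}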
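In case it helps anyone reproduce the metric comparison outside the perf tool, here is a minimal sketch (not part of the ticket; the bootstrap server, topic, record count, and config values are illustrative) that reads the same three producer metrics, batch-size-avg, request-rate, and record-queue-time-avg, from KafkaProducer#metrics():
{code}
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class ProducerMetricsProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Illustrative values -- match them to the perf-tool run being compared.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "1");
        props.put("max.in.flight.requests.per.connection", "1");
        props.put("compression.type", "snappy");
        props.put("batch.size", "100000");  // the tuned value discussed above
        props.put("linger.ms", "100");      // give the accumulator time to fill batches
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props);
        byte[] payload = new byte[100];
        for (int i = 0; i < 1000000; i++)
            producer.send(new ProducerRecord<byte[], byte[]>("becket_test_1_replica_8_partition", payload));

        // Sample the same metrics compared in the comment above.
        for (Map.Entry<MetricName, ? extends Metric> entry : producer.metrics().entrySet()) {
            String name = entry.getKey().name();
            if (name.equals("batch-size-avg") || name.equals("request-rate")
                    || name.equals("record-queue-time-avg"))
                System.out.println(name + " = " + entry.getValue().value());
        }
        producer.close();
    }
}
{code}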
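As a rough sanity check on point 2 (my back-of-the-envelope numbers, not from the ticket): the v0 message format carries about 26 bytes of framing per message (offset, size, CRC, magic, attributes, key/value lengths), and the KIP-32 timestamp raises that to about 34 bytes. For 100-byte values that is roughly a 6% increase in bytes per record (134/126), which is in the same ballpark as the ~9% drop in records/sec in the uncompressed runs in the description. The larger gap in the compressed runs suggests the extra batch rolling and lock grabbing noted above matters more than the 8 bytes themselves.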
> Producer's throughput lower with compressed data after KIP-31/32
> ----------------------------------------------------------------
>
>                 Key: KAFKA-3565
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3565
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Ismael Juma
>            Priority: Critical
>             Fix For: 0.10.0.0
>
> Relative offsets were introduced by KIP-31 so that the broker does not have
> to recompress data (this was previously required after offsets were
> assigned). The implicit assumption is that reducing CPU usage required by
> recompression would mean that producer throughput for compressed data would
> increase.
> However, this doesn't seem to be the case:
> {code}
> Commit: eee95228fabe1643baa016a2d49fb0a9fe2c66bd (one before KIP-31/32)
> test_id: 2016-04-15--012.kafkatest.tests.benchmark_test.Benchmark.test_producer_throughput.topic=topic-replication-factor-three.security_protocol=PLAINTEXT.acks=1.message_size=100.compression_type=snappy
> status: PASS
> run time: 59.030 seconds
> {"records_per_sec": 519418.343653, "mb_per_sec": 49.54}
> {code}
> Full results: https://gist.github.com/ijuma/0afada4ff51ad6a5ac2125714d748292
> {code}
> Commit: fa594c811e4e329b6e7b897bce910c6772c46c0f (KIP-31/32)
> test_id: 2016-04-15--013.kafkatest.tests.benchmark_test.Benchmark.test_producer_throughput.topic=topic-replication-factor-three.security_protocol=PLAINTEXT.acks=1.message_size=100.compression_type=snappy
> status: PASS
> run time: 1 minute 0.243 seconds
> {"records_per_sec": 427308.818848, "mb_per_sec": 40.75}
> {code}
> Full results: https://gist.github.com/ijuma/e49430f0548c4de5691ad47696f5c87d
> The difference for the uncompressed case is smaller (and within what one
> would expect given the additional size overhead caused by the timestamp
> field):
> {code}
> Commit: eee95228fabe1643baa016a2d49fb0a9fe2c66bd (one before KIP-31/32)
> test_id: 2016-04-15--010.kafkatest.tests.benchmark_test.Benchmark.test_producer_throughput.topic=topic-replication-factor-three.security_protocol=PLAINTEXT.acks=1.message_size=100
> status: PASS
> run time: 1 minute 4.176 seconds
> {"records_per_sec": 321018.17747, "mb_per_sec": 30.61}
> {code}
> Full results: https://gist.github.com/ijuma/5fec369d686751a2d84debae8f324d4f
> {code}
> Commit: fa594c811e4e329b6e7b897bce910c6772c46c0f (KIP-31/32)
> test_id: 2016-04-15--014.kafkatest.tests.benchmark_test.Benchmark.test_producer_throughput.topic=topic-replication-factor-three.security_protocol=PLAINTEXT.acks=1.message_size=100
> status: PASS
> run time: 1 minute 5.079 seconds
> {"records_per_sec": 291777.608696, "mb_per_sec": 27.83}
> {code}
> Full results: https://gist.github.com/ijuma/1d35bd831ff9931448b0294bd9b787ed