I have been testing kafka for the past week or so and figured I would share
my results so far.


I am not sure if the formatting will hold up in email, but here are the
results in a Google doc...all 1,100 of them


https://docs.google.com/spreadsheets/d/1UL-o2MiV0gHZtL4jFWNyqRTQl41LFdM0upjRIwCWNgQ/edit?usp=sharing



One thing I found is there appears to be a bottleneck in
kafka-producer-perf-test.sh
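
For context, every number in the spreadsheet comes from that script.  The
flag names below match the 0.8-era version as best I remember them (newer
releases use --topic/--num-records/--record-size/--producer-props instead),
so treat this as a sketch of the shape of a run rather than the literal
command; the broker host and topic name are placeholders:

  bin/kafka-producer-perf-test.sh \
    --broker-list broker1:9092 \
    --topics perf-topic-1 \
    --messages 10000000 \
    --message-size 2200 \
    --batch-size 400 \
    --threads 8 \
    --request-num-acks 0

The MB.sec and nMsg.sec columns in the tables further down are taken
straight from the throughput figures this script reports when a run
finishes.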


The servers I used for testing have 12 7.2K drives and 16 cores.  I was NOT
able to scale the broker past 350MB/sec when adding drives, even though I
was able to get 150MB/sec from a single drive.  I wanted to determine the
source of the low utilization.


I tried changing the following; a sketch of the broker-side properties
appears after the list.

·        log.flush.interval.messages on the broker

·        log.flush.interval.ms flush on the broker

·        num.io.threads on the broker

·        thread settings on the producer

·        producer message sizes

·        producer batch sizes

·        different numbers of topics (which impacts the number of drives
used)
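
For reference, the broker-side settings above live in server.properties; an
excerpt of the kind of overrides I experimented with looks roughly like this
(the values are only examples, not a recommendation):

  # server.properties excerpt - flush and I/O thread tuning
  # flush a partition's log to disk after this many messages...
  log.flush.interval.messages=10000
  # ...or after this many milliseconds, whichever comes first
  log.flush.interval.ms=1000
  # number of threads the broker uses for disk I/O
  num.io.threads=12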

None of the above had any impact.  The last thing I tried was running
multiple producers, which had a very noticeable impact.  As previously
mentioned, I had already tested the thread setting of the producer and found
it to scale when increasing the thread count through 1, 2, 4 and 8.  After
that it plateaued, so I had been using 8 threads for each test.  To show the
impact of the number of producers I created 12 topics with partition counts
from 1 to 12.  I used a single broker with no replication and configured
the producer(s) to send 10 million 2200 byte messages in batches of 400
with no ack.
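
Setting up that topic layout is just a loop over the topic admin script;
roughly this (the zookeeper address and topic names are placeholders, and
older installs use kafka-create-topic.sh instead of kafka-topics.sh):

  # create 12 topics: perf-topic-1 with 1 partition up to perf-topic-12 with 12
  for i in $(seq 1 12); do
    bin/kafka-topics.sh --create --zookeeper zk1:2181 \
      --topic perf-topic-$i --partitions $i --replication-factor 1
  done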


Running with three producers gives almost double the throughput of a single
producer.
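
The multi-producer runs were simply several copies of the perf-test command
sketched earlier started at the same time.  From one shell that is
essentially the following (paths and hosts are placeholders, and pointing
all three producers at one topic here is just for illustration):

  # start three producer processes in parallel, then wait for all to finish
  for p in 1 2 3; do
    bin/kafka-producer-perf-test.sh --broker-list broker1:9092 \
      --topics perf-topic-12 --messages 10000000 --message-size 2200 \
      --batch-size 400 --threads 8 --request-num-acks 0 &
  done
  wait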


Other Key points learned so far

·        Ensure you are using the correct network interface (use
advertised.host.name if the servers have multiple interfaces)

·        Use batching on the producer – with a single broker, sending 2200
byte messages in batches of 200 resulted in 283MB/sec, vs. 44MB/sec with a
batch size of 1

·        The message size, the configuration of request.required.acks and
the number of replicas (only when ack is set to all) had the most influence
on the overall throughput.

·        The following table shows the results of testing with message sizes
of 200, 300, 1000 and 2200 bytes on a three node cluster.  Each message
size was tested with the three available ack modes (NONE, LEADER and ALL,
i.e. request.required.acks of 0, 1 and -1) and with replication of two and
three copies.  The Per Server columns are the Replica=3 totals divided
across the three brokers.  Having three copies of the data is recommended,
however both are included for reference.

message.size  acks     Replica=2             Replica=3             Per Server
                       MB.sec   nMsg.sec     MB.sec   nMsg.sec     MB.sec   nMsg.sec
200           NONE     251      1,313,888    237      1,242,390    79       414,130
300           NONE     345      1,204,384    320      1,120,197    107      373,399
1000          NONE     522      546,896      515      540,541      172      180,180
2200          NONE     368      175,165      367      174,709      122      58,236
200           LEADER   115      604,376      141      739,754      47       246,585
300           LEADER   186      650,280      192      670,062      64       223,354
1000          LEADER   340      356,659      328      343,808      109      114,603
2200          LEADER   310      147,846      293      139,729      98       46,576
200           ALL      74       385,594      58       304,386      19       101,462
300           ALL      105      367,282      78       272,316      26       90,772
1000          ALL      203      212,400      124      130,305      41       43,435
2200          ALL      212      100,820      136      64,835       45       21,612



Some observations from the above table

·        Increasing the number of replicas when request.required.acks is
none or leader only has limited impact on overall performance (additional
resources are required to replicate the data, but during the tests this did
not impact producer throughput)

·        Compression is not shown because the data generated for the test
is not realistic for a production workload (GZIP compressed the test data
300:1, which is unrealistic)

·        For some reason a message size of 1000 bytes performed the best.
I need to look into this more.


Thanks

Bert
