Hi,

I talked with Gwen at Strata last week and promised to share some of my 
experiences benchmarking an app reliant on the new producer. I'm using 
relatively meaty boxes running my producer code (24 core/64GB RAM) but I wasn't 
pushing them until I got them on the same 10GbE fabric as the Kafka cluster they 
are using (saturating the prior 1GbE NICs was just too easy). There are 5 
brokers, 24 core/192GB RAM/8*2TB disks, running 0.8.2.1.

With lots of cores and a dedicated box the question was then how to deploy my 
application. In particular, how many worker threads and how many instances of 
the KafkaProducer to share amongst them. I also wanted to see how things would 
change as I scaled up the thread count.

I ripped out the data retrieval part of my app (it reads from S3) and instead 
replaced it with some code to produce random records of average size 500 bytes 
but varying between 250 and 750. I started the app running, ignored the first 
25M messages, then measured the timing for the next 100M and calculated the 
average messages/sec written to Kafka across that run.

Starting small I created 4 application threads with a range of approaches to 
sharing KafkaProducer instances. The records written to the Kafka cluster per 
second were as follows:

4 threads all sharing 1 client: 332K
4 threads sharing 2 clients: 333K
4 threads, dedicated client per thread: 310K

Note that when I had 2 KafkaProducer clients, as in the second line above, each 
was used by 2 threads. Similarly below: the number of threads divided by the 
number of clients is the max number of threads per KafkaProducer instance.
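
The sharing scheme amounts to pinning thread i to producer (i % numProducers); KafkaProducer is thread-safe, so several worker threads can share one instance without extra locking. A sketch of that mapping (my own illustration, not the app's code):

```java
public class ProducerSharing {
    // Thread i always uses producer (i % numProducers). In the real app
    // each slot would hold a KafkaProducer<byte[], byte[]>.
    static int producerIndexFor(int threadIndex, int numProducers) {
        return threadIndex % numProducers;
    }

    // Max number of threads any single producer instance ends up serving:
    // ceil(numThreads / numProducers).
    static int maxThreadsPerProducer(int numThreads, int numProducers) {
        return (numThreads + numProducers - 1) / numProducers;
    }
}
```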

As can be seen from the above, there's not much in it. Scaling up to 8 
application threads the numbers looked like:

8 threads sharing 1 client: 387K
8 threads sharing 2 clients: 609K
8 threads sharing 4 clients: 628K
8 threads with a dedicated client per thread: 527K

This time sharing a single producer client across all threads has by far the 
worst performance and isn't much better than when using 4 application threads. 
The 2 and 4 client options are much better and are in the ballpark of 2x the 
4-thread performance. A dedicated client per thread isn't quite as good, but 
isn't so far off as to be unusable. So then, taking it to 16 application threads:

16 threads sharing 1 client: 380K
16 threads sharing 2 clients: 675K
16 threads sharing 4 clients: 869K
16 threads sharing 8 clients: 733K
16 threads with a dedicated client per thread: 491K

This gives a much clearer performance curve. The 16 thread/4 client 
configuration is by far the best performer, but it is still far from 4x the 
4-thread or 2x the 8-thread mark. At this point I seem to be hitting some 
limiting factor. On the client machine memory was still lightly used and the 
network was peaking just over 4Gb/sec, but CPU load was showing 1-minute load 
averages around 18-20. CPU load did seem to increase along with the number of 
KafkaProducer instances, but that is a recollection from memory and not backed 
by hard numbers.

For completeness' sake I did do a 24 thread test, but the numbers are as you'd 
expect: 1 client and 24 clients both showed poor performance, while 4, 6, or 8 
clients (24 has more ways of dividing evenly!) all showed performance around 
that of the 16 thread/4 client run above. The other configs were in between.

With my full application I've found the best deployment so far is to have 
multiple instances running on the same box. I can get much better performance 
from 3 instances each with 8 threads than from 1 instance with 24 threads. This 
is almost certainly because, when adding in my own application logic and the 
AWS clients, there is just a lot more contention (not to mention much more i/o 
wait) in each application instance. The benchmark variant doesn't have as much 
happening, but just to compare I ran a few concurrent instances:

2 copies of 8 threads sharing 4 clients: 780K total
2 copies of 8 threads sharing 2 clients: 870K total
3 copies of 8 threads sharing 2 clients: 945K total

So, bottom line: around 900K/sec is the max I can get from one of these hosts 
for my application. At that point I brought a 2nd host to bear and ran 2 
concurrent instances of the best performing config on each:

2 copies of 16 threads sharing 4 clients on 2 hosts: 1458K total

This doesn't quite give 2x the single box performance but it does show that the 
cluster has capacity to spare beyond what the single client host can drive. 
This was also backed up by the metrics on the brokers: they got busy, but only 
moderately so given the amount of work they were doing.

At this point things did get a bit 'ragged edge'. I noticed a very high rate 
of ISR churn on the brokers; it looked like the replicas were having trouble 
keeping up with the leader, and hosts were constantly being dropped out of and 
then re-added to the ISR. I had set the test topic to have a relatively low 
partition count (1 per spindle), so I doubled that to see if it would help the 
ISRs remain stable. And my performance fell through the floor. So whereas I 
thought this was an equation involving application threads and producer 
instances, perhaps partition count is a third variable. I need to look into 
that some more, but so far it looks like, for my application (I'm not 
suggesting this is a universal truth), sharing a KafkaProducer instance amongst 
around 4 threads is the sweet spot.

I'll be doing further profiling of my application so I'll flag to the list 
anything that appears within the producer itself. And because 900K messages/sec 
was so close to a significant number, I modified my code that generates the 
messages to keep the key random for each message but to reuse message bodies 
across multiple messages. At that point 1.05M messages/sec was possible from a 
single box. Nice. :)
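
That last tweak was just a small pool of pre-built bodies handed out round-robin, with only the key freshly randomised per message. A sketch of the idea (pool size and key length here are my own picks for illustration, not the real app's values):

```java
import java.util.concurrent.ThreadLocalRandom;

public class ReusedBodies {
    // Pre-built message bodies, 250-750 bytes each, reused across sends.
    static final byte[][] POOL = new byte[32][];
    static {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        for (int i = 0; i < POOL.length; i++) {
            POOL[i] = new byte[rnd.nextInt(250, 751)];
            rnd.nextBytes(POOL[i]);
        }
    }

    // Keys stay random per message so partitioning is still spread out.
    static byte[] randomKey() {
        byte[] key = new byte[16];
        ThreadLocalRandom.current().nextBytes(key);
        return key;
    }

    // Message n simply reuses body (n % poolSize).
    static byte[] bodyFor(long seq) {
        return POOL[(int) (seq % POOL.length)];
    }
}
```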

This turned out much longer than planned, I probably should have blogged this 
somewhere. If anyone reads this far, I hope it is useful or of interest; I'd be 
interested in hearing whether the profiles I'm seeing are expected and if any 
other tests would be useful.

Regards
Garry
