Hi Tom, I appreciate you taking the time to respond to my request. I believe at the moment we only have 2 to 4 virtual machines running on a single host. This is probably not ideal, but this is what we are stuck with - essentially "cloud" VM hardware.
I am not sure about the producer, I believe it to be the new producer - how would I check this? This is part of our issue at the moment - we are having trouble with the metrics. Our ganglia server seems overwhelmed. We set up a small test cluster and found replica lag to be the biggest issue. Before we lost our metrics I noticed a lot of leader change activity - could this be a symptom of the offline partitions?

Kind regards,

Jahn Roux

-----Original Message-----
From: Tom Crayford [mailto:tcrayf...@heroku.com]
Sent: Tuesday, May 24, 2016 3:07 PM
To: Users
Subject: Re: Large kafka deployment on virtual hardware

Jahn,

Are all these brokers running on the same underlying machine? Doing so seems highly against the usual fault tolerance properties of Kafka, and I'd expect there to be some hidden performance issues in the hypervisor at that point.

Are you running with the new producer or the old one?

Are you monitoring Kafka's internal metrics on each broker? Issues with e.g. offline partitions and other things could cause that kind of impact.

Thanks

Tom Crayford
Heroku Kafka

On Tue, May 24, 2016 at 9:56 AM, Jahn Roux <j...@comprsa.com> wrote:

> Thank you for the response. Yes, we have had a number of experts
> investigate the underlying resource provision and there are no clear
> issues that stand out - from a virtual and host hardware/resource
> perspective the system is busy but nothing indicates it is overburdened.
>
> Kind regards,
>
> Jahn Roux
>
> -----Original Message-----
> From: Sharninder [mailto:sharnin...@gmail.com]
> Sent: Tuesday, May 24, 2016 10:49 AM
> To: users@kafka.apache.org
> Subject: Re: Large kafka deployment on virtual hardware
>
> I'm sure you checked this but since these are virtual machines, is it
> possible there is just contention for resources? Network clogged or
> some other simpler explanation like that?
>
> On Mon, May 23, 2016 at 9:42 PM, Jahn Roux <j...@comprsa.com> wrote:
>
> > I have a large Kafka deployment on virtual hardware: 120 brokers on
> > 32 GB memory, 8 core virtual machines. Gigabit network, RHEL 6.7. 4
> > topics, 1200 partitions each, replication factor of 2 and running
> > Kafka 0.8.1.2
> >
> > We are running into issues where our cluster is not keeping up. We
> > have 4 sets of producers (30 producers per set) set to produce to
> > the 4 topics (producers produce to multiple topics). The messages
> > are about 150 bytes on average and we are attempting to produce
> > between 1 million and 2 million messages a second per producer set.
> >
> > We run into issues after about 1 million messages a second - just
> > for that producer set, the producer buffers fill up and we are
> > blocked from producing messages. This does not seem to impact the
> > other producer sets - they run without issues until they too reach
> > about 1m messages a second.
> >
> > Looking at the metrics available to us we do not see a bottleneck,
> > we don't see disk I/O maxing out, CPU and network are nominal. We
> > have tried increasing and decreasing the Kafka cluster size to no
> > avail, we have gone from 100 partitions to 1200 partitions per
> > topic. We have increased and decreased the number of producers and
> > yet we run into the same issues. Our Kafka config is mostly out the
> > box - 1 hour log roll/retention, increased the buffer sizes a bit
> > but other than that it's out the box.
> >
> > I was wondering if someone has some recommendations for identifying
> > the bottleneck and/or what configuration values we should be taking
> > a look at?
> > Are there known issues with Kafka on virtualized hardware or things
> > to watch out for when deploying to VMs? Are there use cases where
> > Kafka is being used in a similar way - +4 million messages a second
> > of discrete 150 byte messages?
> >
> > Kind regards,
> >
> > Jahn Roux
> >
> > ---
> > This email has been checked for viruses by Avast antivirus software.
> > https://www.avast.com/antivirus
>
> --
> Sharninder
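
For reference, the figures quoted in the original message (120 brokers, 4 producer sets of 30 producers, 150 byte messages, replication factor 2, gigabit network) can be turned into a rough per-broker and per-producer load estimate. The sketch below is only a back-of-envelope calculation, not a measurement: it assumes load spreads evenly across brokers and producers and ignores protocol overhead, compression and consumer traffic, none of which the thread specifies. The function name `estimate` and the constant names are made up for illustration.

```python
# Back-of-envelope load estimate using the figures quoted in the thread.
# Assumptions (NOT from the thread): load is spread evenly across brokers
# and producers; protocol overhead, compression and consumer traffic are
# ignored.

BROKERS = 120
PRODUCER_SETS = 4
PRODUCERS_PER_SET = 30
AVG_MESSAGE_BYTES = 150
REPLICATION_FACTOR = 2
GIGABIT_BYTES_PER_SEC = 1_000_000_000 / 8  # 125 MB/s nominal line rate


def estimate(messages_per_sec_per_set):
    """Return rough per-producer and per-broker rates for one per-set message rate."""
    set_bytes = messages_per_sec_per_set * AVG_MESSAGE_BYTES
    cluster_bytes = set_bytes * PRODUCER_SETS
    # Every message arrives at one leader and at each follower replica,
    # so total broker-inbound traffic is the produce rate times the
    # replication factor.
    cluster_inbound_bytes = cluster_bytes * REPLICATION_FACTOR
    bytes_in_per_broker = cluster_inbound_bytes / BROKERS
    return {
        "messages_per_producer": messages_per_sec_per_set / PRODUCERS_PER_SET,
        "bytes_per_producer": set_bytes / PRODUCERS_PER_SET,
        "bytes_in_per_broker": bytes_in_per_broker,
        "nic_fraction_per_broker": bytes_in_per_broker / GIGABIT_BYTES_PER_SEC,
    }


if __name__ == "__main__":
    for rate in (1_000_000, 2_000_000):
        print(rate, estimate(rate))
```

Under these assumptions, 1 million messages a second per set works out to roughly 10 MB/s inbound per broker, well below gigabit line rate, and about 33,000 messages a second per producer. That would be consistent with the report that network looks nominal, and suggests looking at other factors (producer type and settings, replica lag, hypervisor effects) rather than raw bandwidth.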