2-4 VMs per host seems OK to me. As long as the network isn't saturated or
dropping packets, you're probably fine.

The new producer is this class:
https://kafka.apache.org/082/javadoc/org/apache/kafka/clients/producer/KafkaProducer.html
If you're using that class, you're on the new producer; otherwise you're on
the old one.
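
One rough way to check, if you can run code on the producer application's
classpath, is sketched below. It only reports which producer classes are
loadable. The `ProducerCheck` helper is hypothetical, and the old-producer
class names (`kafka.javaapi.producer.Producer`, `kafka.producer.Producer`)
are my assumption about what a 0.8.x application would import. A class being
on the classpath doesn't prove the application uses it, so searching the
producer code for the import is still the more reliable test.

```java
public class ProducerCheck {
    // Class name taken from the javadoc link above (the new producer).
    static final String NEW_PRODUCER =
            "org.apache.kafka.clients.producer.KafkaProducer";

    // Assumed old-producer entry points: the Java API wrapper and the Scala class.
    static final String[] OLD_PRODUCERS = {
            "kafka.javaapi.producer.Producer",
            "kafka.producer.Producer"
    };

    // Returns true if the named class can be found, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ProducerCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(NEW_PRODUCER + " loadable: " + isOnClasspath(NEW_PRODUCER));
        for (String name : OLD_PRODUCERS) {
            System.out.println(name + " loadable: " + isOnClasspath(name));
        }
    }
}
```

Run it with the same classpath the producers use; if only the old classes are
loadable, the application can't be on the new producer.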

Lots of leader change activity is consistent with offline partitions, or at
least with other controller instability. Did you check the GC logs on your
brokers for long GC pauses? How much JVM heap are you giving each broker?
What about the ZooKeeper nodes: do they have enough heap, low pause times,
etc.?

Another thing to think about is that 4800 partitions is a lot, and 120
brokers *is* a lot of brokers. For comparison, we've seen Kafka happily
handle 1.5 million messages/s on an 8-node cluster with 16 partitions
(though that was under 0.9.0.0).
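
To put rough numbers on that, here is a back-of-the-envelope sketch using
only the figures from your original mail (4 topics, 1200 partitions each,
replication factor 2, 120 brokers, 150 byte messages, a 4 million messages/s
target). The `ClusterMath` class is hypothetical, and it assumes load is
spread perfectly evenly and ignores protocol and batching overhead, so treat
the results as orders of magnitude only.

```java
public class ClusterMath {
    // Total partitions across all topics.
    static long totalPartitions(int topics, int partitionsPerTopic) {
        return (long) topics * partitionsPerTopic;
    }

    // Partition replicas each broker hosts, assuming an even spread.
    static long replicasPerBroker(long partitions, int replicationFactor, int brokers) {
        return partitions * replicationFactor / brokers;
    }

    // Payload bytes written per broker per second (in MB/s), counting every replica.
    static double perBrokerWriteMBps(long messagesPerSec, int bytesPerMessage,
                                     int replicationFactor, int brokers) {
        return (double) messagesPerSec * bytesPerMessage * replicationFactor
                / brokers / 1_000_000.0;
    }

    public static void main(String[] args) {
        long partitions = totalPartitions(4, 1200);
        System.out.println("partitions: " + partitions);
        System.out.println("replicas per broker: " + replicasPerBroker(partitions, 2, 120));
        System.out.println("write MB/s per broker: "
                + perBrokerWriteMBps(4_000_000L, 150, 2, 120));
    }
}
```

With those inputs it works out to 4800 partitions, 80 partition replicas per
broker, and about 10 MB/s of payload written per broker, which is well under
what a gigabit link can carry. That fits with raw bandwidth not being the
obvious bottleneck.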

0.8.X also has a number of bugs that can leave the Kafka controller stuck.
We've been finding 0.9 to be much more stable overall (we run thousands of
clusters with an ops team of 3 people who aren't even full time), and would
recommend moving off 0.8.X as soon as you can.

Thanks

Tom



On Tue, May 24, 2016 at 2:28 PM, Jahn Roux <j...@comprsa.com> wrote:

> Hi Tom, I appreciate you taking the time to respond to my request.
>
> I believe at the moment we only have 2 to 4 virtuals running on a single
> host. This is probably not ideal, but this is what we are stuck with -
> essentially "cloud" VM hardware.
>
> I am not sure about the producer, I believe it to be the new producer -
> how would I check this?
>
> This is part of our issue at the moment - we are having trouble with the
> metrics. Our ganglia server seems overwhelmed. We set up a small test
> cluster and found replica lag to be the biggest issue. Before we lost our
> metrics I noticed a lot of leader change activity - could this be a symptom
> of the offline partitions?
>
> Kind regards,
>
> Jahn Roux
>
>
> -----Original Message-----
> From: Tom Crayford [mailto:tcrayf...@heroku.com]
> Sent: Tuesday, May 24, 2016 3:07 PM
> To: Users
> Subject: Re: Large kafka deployment on virtual hardware
>
> Jahn,
>
> Are all these brokers running on the same underlying machine? Doing so
> seems highly against the usual fault tolerance properties of Kafka, and I'd
> expect there to be some hidden performance issues in the hypervisor at that
> point.
>
> Are you running with the new producer or the old one?
>
> Are you monitoring Kafka's internal metrics on each broker? Issues with
> e.g. offline partitions and other things could cause that kind of impact.
>
> Thanks
>
> Tom Crayford
> Heroku Kafka
>
> On Tue, May 24, 2016 at 9:56 AM, Jahn Roux <j...@comprsa.com> wrote:
>
> > Thank you for the response. Yes, we have had a number of experts
> > investigate the underlying resource provision and there are no clear
> > issues that stand out - from a virtual and host hardware/resource
> > perspective the system is busy but nothing indicates it is overburdened.
> >
> > Kind regards,
> >
> > Jahn Roux
> >
> > -----Original Message-----
> > From: Sharninder [mailto:sharnin...@gmail.com]
> > Sent: Tuesday, May 24, 2016 10:49 AM
> > To: users@kafka.apache.org
> > Subject: Re: Large kafka deployment on virtual hardware
> >
> > I'm sure you checked this but since these are virtual machines, is it
> > possible there is just contention for resources? Network clogged or
> > some other simpler explanation like that?
> >
> > On Mon, May 23, 2016 at 9:42 PM, Jahn Roux <j...@comprsa.com> wrote:
> >
> > > I have a large Kafka deployment on virtual hardware: 120 brokers on
> > > 32gb memory 8 core virtual machines. Gigabit network, RHEL 6.7. 4
> > > Topics, 1200 partitions each, replication factor of 2 and running
> > > Kafka 0.8.1.2
> > >
> > >
> > >
> > > We are running into issues where our cluster is not keeping up. We
> > > have 4 sets of producers (30 producers per set) set to produce to
> > > the 4 topics (producers produce to multiple topics). The messages are
> > > about 150 byte on average and we are attempting to produce between 1
> > > million and 2 million messages a second per producer set.
> > >
> > >
> > >
> > > We run into issues after about 1 million messages a second - just
> > > for that producer set, the producer buffers fill up and we are
> > > blocked from producing messages. This does not seem to impact the
> > > other producer sets - they run without issues until they too reach
> > > about 1m messages a second.
> > >
> > >
> > >
> > > Looking at the metrics available to us we do not see a bottleneck,
> > > we don't see disk I/O maxing out, CPU and network are nominal. We
> > > have tried increasing and decreasing the Kafka cluster size to no
> > > avail, we have gone from 100 partitions to 1200 partitions per
> > > topic. We have increased and decreased the number of producers and
> > > yet we run into the same issues. Our Kafka config is mostly out the
> > > box - 1 hour log roll/retention, increased the buffer sizes a bit
> > > but other than that it's out the box.
> > >
> > >
> > >
> > > I was wondering if someone has some recommendations for identifying
> > > the bottleneck and/or what configuration values we should be taking
> > > a look at?
> > > Is there known issues with Kafka on virtualized hardware or things
> > > to watch out for when deploying to VMs? Are there use cases where
> > > Kafka is being used in a similar way - +4 million messages a second
> > > of discrete 150 byte messages?
> > >
> > >
> > >
> > > Kind regards,
> > >
> > >
> > >
> > > Jahn Roux
> > >
> > >
> > >
> > >
> > >
> > > ---
> > > This email has been checked for viruses by Avast antivirus software.
> > > https://www.avast.com/antivirus
> > >
> >
> >
> >
> > --
> > --
> > Sharninder
> >
> >
> >
> >
>
>
>
>
