Interesting topic. How would buffering in RAM help in reality though (just trying to work through the scenario in my head):
producer tries to connect to a broker, it fails, so it appends the message to an in-memory store. If the broker is down for, say, 20 minutes and then comes back online, won't this create problems when the producer creates a new message and there are 20 minutes of backlog, so the broker is now handling more load (assuming you are sending those in-memory messages using a different thread)?

On Fri, Apr 12, 2013 at 11:21 AM, Philip O'Toole <phi...@loggly.com> wrote:

> This is just my opinion of course (who else's could it be? :-)) but I think from an engineering point of view, one must spend one's time making the Producer-Kafka connection solid, if it is mission-critical.
>
> Kafka is all about getting messages to disk, and assuming your disks are solid (and 0.8 has replication) those messages are safe. To then try to build a system to cope with the Kafka brokers being unavailable seems like you're setting yourself up for infinite regress. And to write code in the Producer to spool to disk seems even more pointless. If you're that worried, why not run a dedicated Kafka broker on the same node as the Producer, and connect over localhost? To turn around and write code to spool to disk, because the primary system that *spools to disk* is down, seems to be missing the point.
>
> That said, even by going over localhost, I guess the network connection could go down. In that case, Producers should buffer in RAM, and start sending some major alerts to the Operations team. But this should almost *never happen*. If it is happening regularly, *something is fundamentally wrong with your system design*. Those Producers should also refuse any more incoming traffic and await intervention. Even bringing up "netcat -l" and letting it suck in the data and write it to disk would work then. Alternatives include having Producers connect to a load-balancer with multiple Kafka brokers behind it, which helps you deal with any one Kafka broker failing. Or just have your Producers connect directly to multiple Kafka brokers, and switch over as needed if any one broker goes down.
>
> I don't know if the standard Kafka producer that ships with Kafka supports buffering in RAM in an emergency. We wrote our own that does, with a focus on speed and simplicity, but I expect it will very rarely, if ever, buffer in RAM.
>
> Building and using semi-reliable system after semi-reliable system, and chaining them all together, hoping to be more tolerant of failure, is not necessarily a good approach. Instead, identifying that one system that is critical, and ensuring that it remains up (redundant installations, redundant disks, redundant network connections, etc.) is a better approach IMHO.
>
> Philip
>
>
> On Fri, Apr 12, 2013 at 7:54 AM, Jun Rao <jun...@gmail.com> wrote:
>
> > Another way to handle this is to provision enough client and broker servers so that the peak load can be handled without spooling.
> >
> > Thanks,
> >
> > Jun
> >
> >
> > On Thu, Apr 11, 2013 at 5:45 PM, Piotr Kozikowski <pi...@liveramp.com> wrote:
> >
> > > Jun,
> > >
> > > When talking about "catastrophic consequences" I was actually only referring to the producer side. In our use case (logging requests from webapp servers), a spike in traffic would force us to either tolerate a dramatic increase in the response time, or drop messages, both of which are really undesirable. Hence the need to absorb spikes with some system on top of Kafka, unless the spooling feature mentioned by Wing (https://issues.apache.org/jira/browse/KAFKA-156) is implemented. This is assuming there are a lot more producer machines than broker nodes, so each producer would absorb a small part of the extra load from the spike.
> > >
> > > Piotr
> > >
> > > On Wed, Apr 10, 2013 at 10:17 PM, Jun Rao <jun...@gmail.com> wrote:
> > >
> > > > Piotr,
> > > >
> > > > Actually, could you clarify what "catastrophic consequences" you saw on the broker side? Do clients time out due to longer serving time, or something else?
> > > >
> > > > Going forward, we plan to add per-client quotas (KAFKA-656) to prevent the brokers from being overwhelmed by a runaway client.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > >
> > > > On Wed, Apr 10, 2013 at 12:04 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Is there anything one can do to "defend" against:
> > > > >
> > > > > "Trying to push more data than the brokers can handle for any sustained period of time has catastrophic consequences, regardless of what timeout settings are used. In our use case this means that we need to either ensure we have spare capacity for spikes, or use something on top of Kafka to absorb spikes."
> > > > >
> > > > > ?
> > > > > Thanks,
> > > > > Otis
> > > > > ----
> > > > > Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm
> > > > >
> > > > > > ________________________________
> > > > > > From: Piotr Kozikowski <pi...@liveramp.com>
> > > > > > To: users@kafka.apache.org
> > > > > > Sent: Tuesday, April 9, 2013 1:23 PM
> > > > > > Subject: Re: Analysis of producer performance
> > > > > >
> > > > > > Jun,
> > > > > >
> > > > > > Thank you for your comments. I'll reply point by point for clarity.
> > > > > >
> > > > > > 1. We were aware of the migration tool, but since we haven't used Kafka in production yet we just started using the 0.8 version directly.
> > > > > >
> > > > > > 2. I hadn't seen those particular slides, very interesting. I'm not sure we're testing the same thing though. In our case we vary the number of physical machines, but each one has 10 threads accessing a pool of Kafka producer objects, and in theory a single machine is enough to saturate the brokers (which our test mostly confirms). Also, assuming that the slides are based on the built-in producer performance tool, I know that we started getting very different numbers once we switched to using "real" (actual production log) messages. Compression may also be a factor in case it wasn't configured the same way in those tests.
> > > > > >
> > > > > > 3. In the latency section, there are two tests, one for average and another for maximum latency. Each one has two graphs presenting the exact same data but at different levels of zoom. The first one is to observe small variations of latency when target throughput <= actual throughput. The second is to observe the overall shape of the graph once latency starts growing when target throughput > actual throughput. I hope that makes sense.
> > > > > >
> > > > > > 4. That sounds great, looking forward to it.
> > > > > >
> > > > > > Piotr
> > > > > >
> > > > > > On Mon, Apr 8, 2013 at 9:48 PM, Jun Rao <jun...@gmail.com> wrote:
> > > > > >
> > > > > > > Piotr,
> > > > > > >
> > > > > > > Thanks for sharing this. Very interesting and useful study. A few comments:
> > > > > > >
> > > > > > > 1. For existing 0.7 users, we have a migration tool that mirrors data from an 0.7 cluster to an 0.8 cluster. Applications can upgrade to 0.8 by upgrading consumers first, followed by producers.
> > > > > > >
> > > > > > > 2. Have you looked at the Kafka ApacheCon slides (http://www.slideshare.net/junrao/kafka-replication-apachecon2013)? Towards the end, there are some performance numbers too. The figure for throughput vs. #producers is different from what you have. Not sure if this is because you have turned on compression.
> > > > > > >
> > > > > > > 3. Not sure that I understand the difference btw the first 2 graphs in the latency section. What's different btw the 2 tests?
> > > > > > >
> > > > > > > 4. Post 0.8, we plan to improve the producer side throughput by implementing non-blocking sockets on the client side.
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Apr 8, 2013 at 4:42 PM, Piotr Kozikowski <pi...@liveramp.com> wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > At LiveRamp we are considering replacing Scribe with Kafka, and as a first step we ran some tests to evaluate producer performance. You can find our preliminary results here: https://blog.liveramp.com/2013/04/08/kafka-0-8-producer-performance-2/. We hope this will be useful for some folks, and if anyone has comments or suggestions about what to do differently to obtain better results, your feedback will be very welcome.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Piotr
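
For what it's worth, a minimal sketch of the "buffer in RAM and drain from a separate thread" idea raised at the top of this thread might look like the code below. This is not the stock Kafka producer; kafkaSend() and brokerAvailable() are hypothetical placeholders for whatever producer client and health check you actually use. The two points it tries to illustrate are (a) the buffer is bounded, so when it fills up the producer refuses traffic and pages operations instead of growing without limit, and (b) the drain thread replays the backlog at a throttled rate, so a broker that comes back after 20 minutes is not hit with the entire backlog on top of live traffic.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Sketch only: kafkaSend() and brokerAvailable() are placeholders,
    // not real Kafka client APIs.
    public class BufferingSender {

        // Bounded backlog: when it is full we refuse traffic instead of eating all RAM.
        private final BlockingQueue<byte[]> backlog = new ArrayBlockingQueue<byte[]>(100000);

        // Called by the application. Returns false once the buffer is full,
        // which is the point at which you alert operations and shed load.
        public boolean send(byte[] message) {
            // Simplified: only send directly when nothing is buffered, so new
            // messages roughly queue behind the backlog (ordering is not strict
            // during the transition).
            if (backlog.isEmpty() && brokerAvailable()) {
                try {
                    kafkaSend(message);
                    return true;
                } catch (Exception e) {
                    // broker just went away; fall through and buffer
                }
            }
            return backlog.offer(message);
        }

        // Run on a separate thread. Replays the backlog at a throttled rate so a
        // broker that returns after a long outage is not swamped by the replay.
        public void drainLoop() throws InterruptedException {
            while (true) {
                byte[] message = backlog.take();   // blocks while there is no backlog
                while (true) {
                    try {
                        kafkaSend(message);
                        break;
                    } catch (Exception e) {
                        Thread.sleep(1000);        // broker still down; retry slowly
                    }
                }
                Thread.sleep(5);                   // crude rate limit on backlog replay
            }
        }

        // Placeholders -- wire these to your actual producer client and health check.
        private void kafkaSend(byte[] message) throws Exception { /* producer.send(...) */ }
        private boolean brokerAvailable() { return true; }
    }

Whether something like this is worth building, versus simply making the Producer-to-Kafka path solid as Philip suggests, is of course exactly the trade-off being debated above.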