Interesting topic. How would buffering in RAM help in reality though (just trying to work through the scenario in my head):
producer tries to connect to a broker, it fails, so it appends the message to an in-memory store. If the broker is down for, say, 20 minutes and then comes back online, won't this create problems when the producer creates a new message and there are 20 minutes of backlog, so the broker is now handling more load (assuming you are sending those in-memory messages using a different thread)?

On Fri, Apr 12, 2013 at 11:21 AM, Philip O'Toole <phi...@loggly.com> wrote:

> This is just my opinion of course (who else's could it be? :-)) but I think from an engineering point of view, one must spend one's time making the Producer-Kafka connection solid, if it is mission-critical.
>
> Kafka is all about getting messages to disk, and assuming your disks are solid (and 0.8 has replication) those messages are safe. To then try to build a system to cope with the Kafka brokers being unavailable seems like you're setting yourself up for infinite regress. And to write code in the Producer to spool to disk seems even more pointless. If you're that worried, why not run a dedicated Kafka broker on the same node as the Producer, and connect over localhost? To turn around and write code to spool to disk, because the primary system that *spools to disk* is down, seems to be missing the point.
>
> That said, even by going over localhost, I guess the network connection could go down. In that case, Producers should buffer in RAM, and start sending some major alerts to the Operations team. But this should almost *never happen*. If it is happening regularly, *something is fundamentally wrong with your system design*. Those Producers should also refuse any more incoming traffic and await intervention. Even bringing up "netcat -l" and letting it suck in the data and write it to disk would work then. Alternatives include having Producers connect to a load-balancer with multiple Kafka brokers behind it, which helps you deal with any one Kafka broker failing. Or just have your Producers connect directly to multiple Kafka brokers, and switch over as needed if any one broker goes down.
>
> I don't know if the standard Kafka producer that ships with Kafka supports buffering in RAM in an emergency. We wrote our own that does, with a focus on speed and simplicity, but I expect it will very rarely, if ever, buffer in RAM.
>
> Building and using semi-reliable system after semi-reliable system, and chaining them all together, hoping to be more tolerant of failure, is not necessarily a good approach. Instead, identifying that one system that is critical, and ensuring that it remains up (redundant installations, redundant disks, redundant network connections, etc.) is a better approach IMHO.
>
> Philip
>
>
> On Fri, Apr 12, 2013 at 7:54 AM, Jun Rao <jun...@gmail.com> wrote:
>
> > Another way to handle this is to provision enough client and broker servers so that the peak load can be handled without spooling.
> >
> > Thanks,
> >
> > Jun
> >
> >
> > On Thu, Apr 11, 2013 at 5:45 PM, Piotr Kozikowski <pi...@liveramp.com> wrote:
> >
> > > Jun,
> > >
> > > When talking about "catastrophic consequences" I was actually only referring to the producer side. In our use case (logging requests from webapp servers), a spike in traffic would force us to either tolerate a dramatic increase in the response time, or drop messages, both of which are really undesirable. Hence the need to absorb spikes with some system on top of Kafka, unless the spooling feature mentioned by Wing (https://issues.apache.org/jira/browse/KAFKA-156) is implemented. This is assuming there are a lot more producer machines than broker nodes, so each producer would absorb a small part of the extra load from the spike.
> > >
> > > Piotr
> > >
> > > On Wed, Apr 10, 2013 at 10:17 PM, Jun Rao <jun...@gmail.com> wrote:
> > >
> > > > Piotr,
> > > >
> > > > Actually, could you clarify what "catastrophic consequences" you saw on the broker side? Do clients time out due to longer serving time, or something else?
> > > >
> > > > Going forward, we plan to add per-client quotas (KAFKA-656) to prevent the brokers from being overwhelmed by a runaway client.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > >
> > > > On Wed, Apr 10, 2013 at 12:04 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Is there anything one can do to "defend" against:
> > > > >
> > > > > "Trying to push more data than the brokers can handle for any sustained period of time has catastrophic consequences, regardless of what timeout settings are used. In our use case this means that we need to either ensure we have spare capacity for spikes, or use something on top of Kafka to absorb spikes."
> > > > >
> > > > > ?
> > > > > Thanks,
> > > > > Otis
> > > > > ----
> > > > > Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm
> > > > >
> > > > > > ________________________________
> > > > > > From: Piotr Kozikowski <pi...@liveramp.com>
> > > > > > To: users@kafka.apache.org
> > > > > > Sent: Tuesday, April 9, 2013 1:23 PM
> > > > > > Subject: Re: Analysis of producer performance
> > > > > >
> > > > > > Jun,
> > > > > >
> > > > > > Thank you for your comments. I'll reply point by point for clarity.
> > > > > >
> > > > > > 1. We were aware of the migration tool, but since we haven't used Kafka in production yet we just started using the 0.8 version directly.
> > > > > >
> > > > > > 2. I hadn't seen those particular slides, very interesting. I'm not sure we're testing the same thing though. In our case we vary the number of physical machines, but each one has 10 threads accessing a pool of Kafka producer objects, and in theory a single machine is enough to saturate the brokers (which our test mostly confirms). Also, assuming that the slides are based on the built-in producer performance tool, I know that we started getting very different numbers once we switched to using "real" (actual production log) messages. Compression may also be a factor in case it wasn't configured the same way in those tests.
> > > > > >
> > > > > > 3. In the latency section, there are two tests, one for average and another for maximum latency. Each one has two graphs presenting the exact same data but at different levels of zoom. The first one is to observe small variations of latency when target throughput <= actual throughput. The second is to observe the overall shape of the graph once latency starts growing when target throughput > actual throughput. I hope that makes sense.
> > > > > >
> > > > > > 4. That sounds great, looking forward to it.
> > > > > >
> > > > > > Piotr
> > > > > >
> > > > > > On Mon, Apr 8, 2013 at 9:48 PM, Jun Rao <jun...@gmail.com> wrote:
> > > > > >
> > > > > > > Piotr,
> > > > > > >
> > > > > > > Thanks for sharing this. Very interesting and useful study. A few comments:
> > > > > > >
> > > > > > > 1. For existing 0.7 users, we have a migration tool that mirrors data from an 0.7 cluster to an 0.8 cluster. Applications can upgrade to 0.8 by upgrading consumers first, followed by producers.
> > > > > > >
> > > > > > > 2. Have you looked at the Kafka ApacheCon slides (http://www.slideshare.net/junrao/kafka-replication-apachecon2013)? Towards the end, there are some performance numbers too. The figure for throughput vs. #producers is different from what you have. Not sure if this is because you have turned on compression.
> > > > > > >
> > > > > > > 3. Not sure that I understand the difference btw the first 2 graphs in the latency section. What's different btw the 2 tests?
> > > > > > >
> > > > > > > 4. Post 0.8, we plan to improve the producer side throughput by implementing non-blocking sockets on the client side.
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Apr 8, 2013 at 4:42 PM, Piotr Kozikowski <pi...@liveramp.com> wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > At LiveRamp we are considering replacing Scribe with Kafka, and as a first step we ran some tests to evaluate producer performance. You can find our preliminary results here: https://blog.liveramp.com/2013/04/08/kafka-0-8-producer-performance-2/. We hope this will be useful for some folks, and if anyone has comments or suggestions about what to do differently to obtain better results, your feedback will be very welcome.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Piotr
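
For what it's worth, a minimal sketch of the "buffer in RAM and drain from a separate thread" idea raised at the top of this thread might look like the code below. This is not the stock Kafka producer; kafkaSend() and brokerAvailable() are hypothetical placeholders for whatever producer client and health check you actually use. The two points it tries to illustrate are (a) the buffer is bounded, so when it fills up the producer refuses traffic and pages operations instead of growing without limit, and (b) the drain thread replays the backlog at a throttled rate, so a broker that comes back after 20 minutes is not hit with the entire backlog on top of live traffic.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Sketch only: kafkaSend() and brokerAvailable() are placeholders,
    // not real Kafka client APIs.
    public class BufferingSender {

        // Bounded backlog: when it is full we refuse traffic instead of eating all RAM.
        private final BlockingQueue<byte[]> backlog = new ArrayBlockingQueue<byte[]>(100000);

        // Called by the application. Returns false once the buffer is full,
        // which is the point at which you alert operations and shed load.
        public boolean send(byte[] message) {
            // Simplified: only send directly when nothing is buffered, so new
            // messages roughly queue behind the backlog (ordering is not strict
            // during the transition).
            if (backlog.isEmpty() && brokerAvailable()) {
                try {
                    kafkaSend(message);
                    return true;
                } catch (Exception e) {
                    // broker just went away; fall through and buffer
                }
            }
            return backlog.offer(message);
        }

        // Run on a separate thread. Replays the backlog at a throttled rate so a
        // broker that returns after a long outage is not swamped by the replay.
        public void drainLoop() throws InterruptedException {
            while (true) {
                byte[] message = backlog.take();   // blocks while there is no backlog
                while (true) {
                    try {
                        kafkaSend(message);
                        break;
                    } catch (Exception e) {
                        Thread.sleep(1000);        // broker still down; retry slowly
                    }
                }
                Thread.sleep(5);                   // crude rate limit on backlog replay
            }
        }

        // Placeholders -- wire these to your actual producer client and health check.
        private void kafkaSend(byte[] message) throws Exception { /* producer.send(...) */ }
        private boolean brokerAvailable() { return true; }
    }

Whether something like this is worth building, versus simply making the Producer-to-Kafka path solid as Philip suggests, is of course exactly the trade-off being debated above.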