RE: why did Kafka choose pull instead of push for a consumer ?

Tauzell, Dave Fri, 23 Sep 2016 05:11:28 -0700

Kafka writes each message to its log, but the OS initially holds those writes
in its in-memory disk cache.  Kafka periodically calls fsync() to tell the OS
to flush the disk cache to the actual disk.  Kafka gets high availability by
replicating messages to other brokers, so that the messages are in memory on
several machines at once.  If all the replicas fail at around the same time,
you could lose data.
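
As a rough illustration of the difference between a write and an fsync(),
here is a minimal Python sketch. It is not Kafka code: the function name
append_messages, the flush_every parameter, and the file name demo.log are
hypothetical, chosen only to show that a write lands in the OS cache first and
is forced to disk only when fsync() is called.

```python
import os
import tempfile


def append_messages(path, messages, flush_every):
    """Append messages to a log file, calling fsync() every `flush_every` writes.

    Returns the number of fsync() calls that were made.
    """
    fsync_calls = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for count, message in enumerate(messages, start=1):
            # The write returns once the data is in the OS disk cache,
            # which is not yet the physical disk.
            os.write(fd, message + b"\n")
            if count % flush_every == 0:
                # Ask the OS to push the cached data to the disk.
                os.fsync(fd)
                fsync_calls += 1
    finally:
        os.close(fd)
    return fsync_calls


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "demo.log")
        batch = [b"message-%d" % i for i in range(10)]
        # One fsync() per message: slowest, smallest loss window.
        print(append_messages(log_path, batch, flush_every=1))
        # One fsync() per five messages: faster, larger loss window.
        print(append_messages(log_path, batch, flush_every=5))
```

With flush_every=1 each message is forced to disk before the next one is
written, at the cost of one fsync() per message. Larger values leave more
unflushed messages in memory, which is the window that replication to other
brokers is meant to cover.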


-Dave

-----Original Message-----
From: kant kodali [mailto:kanth...@gmail.com]
Sent: Friday, September 23, 2016 5:18 AM
To: users@kafka.apache.org
Subject: Re: why did Kafka choose pull instead of push for a consumer ?

@Gerard
Here are my initial benchmarks:

Producer on Machine 1 (m4.xlarge on AWS)
Broker on Machine 2 (m4.xlarge on AWS)
Consumer on Machine 3 (m4.xlarge on AWS)
Data size: 1.2KB
Receive throughput: ~24K
Kafka receive throughput: ~58K (same exact configuration)

All the benchmarks I ran used default options. What the Pulsar guys are
saying is that Kafka doesn't persist every message by default. Instead it
batches them for a period of time and then persists them, so if the JVM
crashes before it persists, all the messages in the batch are lost, whereas
Pulsar guarantees strong durability by storing every message to a write-ahead
log, so messages are never lost.

My question now is: what settings do I need to change in Kafka so that it
stores every message? That way I am comparing apples to apples.
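
For reference, a hedged sketch of the settings usually involved in this kind
of comparison, assuming a Kafka 0.10-era broker and producer. The property
names are existing Kafka configuration keys, but the values are illustrative
only, and whether they make the two systems strictly comparable is an
assumption to check against the documentation for the versions under test.

```properties
# Broker (server.properties): flush to disk after every message instead of
# leaving flushing to the OS. Expect a large drop in throughput.
log.flush.interval.messages=1

# Alternative: the same behaviour for a single topic (topic-level override).
# flush.messages=1

# Producer: wait until all in-sync replicas have acknowledged each send.
acks=all

# Broker or topic: minimum number of in-sync replicas that must acknowledge
# a write when acks=all is used (the value 2 is illustrative).
min.insync.replicas=2
```

The Kafka documentation generally recommends relying on replication for
durability rather than forcing a flush per message, so a benchmark run with
these settings measures a configuration that is not typical for Kafka.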

On Fri, Sep 23, 2016 12:06 AM, Gerard Klijs gerard.kl...@dizzit.com
wrote:
I haven't tried it myself, nor very likely will in the near future, but
since it's also distributed I guess that with a large enough cluster you
will be able to handle any load. Some things Kafka might be better at are
more connectors being available, a better at-least-once guarantee, and
better monitoring options. I really don't know, but if latency is really
important Pulsar might be better. They used Kafka before at Yahoo and maybe
still do for some stuff; recent work on https://github.com/yahoo/kafka-manager
seems to suggest so.

Alternatively you could configure a Kafka topic/producer/consumer to limit
latency, and that may also be enough to get a low enough latency. It would
certainly be interesting to compare the two, with the same hardware, and
with high load.

On Thu, Sep 22, 2016 at 6:01 PM kant kodali <kanth...@gmail.com> wrote:

> @Gerard Thanks for this. It looks good. Any benchmarks on this,
> throughput-wise?
>
> On Thu, Sep 22, 2016 7:45 AM, Gerard Klijs gerard.kl...@dizzit.com
> wrote:
> We have a simple application producing 1 msg/sec, and did nothing to
> optimise the performance, and have about a 10 msec delay between
> consumer and producer. When low latency is important, maybe Pulsar is
> a better fit:
> https://www.datanami.com/2016/09/07/yahoos-new-pulsar-kafka-competitor/
>
> On Tue, Sep 20, 2016 at 2:24 PM Michael Freeman <mikfree...@gmail.com>
> wrote:
>
> > Thanks for sharing Radek, great article.
> >
> > Michael
> >
> > > On 17 Sep 2016, at 21:13, Radoslaw Gruchalski
> > > <ra...@gruchalski.com> wrote:
> > >
> > > Please read this article:
> > >
> > > https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
> > >
> > > –
> > > Best regards,
> > > Radek Gruchalski
> > > ra...@gruchalski.com
> > >
> > > On September 17, 2016 at 9:49:43 PM, kant kodali
> > > (kanth...@gmail.com) wrote:
> > >
> > > Still, it should be possible to implement using reactive streams,
> > > right? Could you please enlighten me on the major differences you
> > > see between a commit log and a message queue? I see them as
> > > differing only in implementation, not in functionality, so I would
> > > be glad to hear your thoughts.
> > >
> > > On Sat, Sep 17, 2016 12:39 PM, Radoslaw Gruchalski
> > > ra...@gruchalski.com wrote:
> > >
> > > Kafka is not a queue. It’s a distributed commit log.
> > >
> > > –
> > > Best regards,
> > > Radek Gruchalski
> > > ra...@gruchalski.com
> > >
> > > On September 17, 2016 at 9:23:09 PM, kant kodali
> > > (kanth...@gmail.com) wrote:
> > >
> > > Hmm... Looks like Kafka is written in Scala. There is this thing
> > > called reactive streams, where a slow consumer can apply back
> > > pressure if it is consuming slowly. Even with Java this is possible
> > > with a library called RxJava, and these ideas will be incorporated
> > > in Java 9 as well.
> > >
> > > I still don't see why they would pick poll just to solve this one
> > > problem while compromising on others. Poll just doesn't sound
> > > realtime. I heard from some people that they would set poll to
> > > 100ms. Well, 1) that is a lot of time. 2) Financial applications
> > > require microsecond latency. Kafka, from what I understand, looks
> > > like it has a very high latency, and here is the article:
> > > http://bravenewgeek.com/dissecting-message-queues/ I usually don't
> > > go by articles, but I ran my own experiments on different queues
> > > and my numbers are very close to this article, so I would say
> > > whoever wrote this article has done a good job. 3) Poll does
> > > generate unnecessary traffic when the data isn't available.
> > >
> > > Finally, I'm still not sure why they would pick poll(). Or do they
> > > plan on introducing reactive streams?
> > >
> > > Thanks,
> > > kant
> > >
> > > On Sat, Sep 17, 2016 5:14 AM, Radoslaw Gruchalski
> > > ra...@gruchalski.com wrote:
> > >
> > > I'm only guessing here whether this is the reason:
> > >
> > > Pull is much more sensible when a lot of data is pushed through. It
> > > allows consumers to consume at their own pace, so slow consumers do
> > > not slow the complete system down.
> > >
> > > --
> > > Best regards,
> > > Rad
> > >
> > > On Sat, Sep 17, 2016 at 11:18 AM +0200, "kant kodali"
> > > <kanth...@gmail.com> wrote:
> > >
> > > Why did Kafka choose pull instead of push for a consumer? Push
> > > sounds more realtime to me than poll. Also, wouldn't poll just keep
> > > polling even when there are no messages in the broker, causing more
> > > traffic? Please enlighten me.