Hi Joris,

Thank you so much. I plan to write a Java consumer and a Java producer for
my benchmark. Can you recommend an example that I can use as a reference to
write my basic Java producer and simple Java consumer? I'll be sure to share
the throughput number I get with the community. Maybe I'll even write a blog
post about it. I hope it is more than 23 messages per second per partition :PPPPP
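
A minimal sketch of the parts of such a benchmark that do not depend on the
client library, assuming the setup discussed in this thread (loopback,
fire-and-forget, 512-byte messages). The class name BenchmarkSketch, its
method names and the localhost:9092 address are illustrative assumptions. The
property keys are standard Kafka producer settings that would be handed to a
KafkaProducer from the kafka-clients library. The 20-second elapsed time is a
made-up placeholder, not a measurement. Applying the same arithmetic to the
LinkedIn figures quoted further down (7 trillion messages per day over
7 million partitions) gives roughly 11.6 messages per second per partition.

```java
import java.util.Properties;
import java.util.concurrent.TimeUnit;

/** Illustrative helper (hypothetical, not part of Kafka): producer settings and throughput arithmetic. */
final class BenchmarkSketch {

    private BenchmarkSketch() {}

    /** Producer settings for a fire-and-forget run against a single local broker. */
    static Properties fireAndForgetProducerProps() {
        Properties props = new Properties();
        // Assumption: a broker listening on the quickstart default address.
        props.setProperty("bootstrap.servers", "localhost:9092");
        // acks=0: the producer does not wait for any broker acknowledgement.
        props.setProperty("acks", "0");
        // Raw byte payloads; these serializer classes ship with kafka-clients.
        props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        return props;
    }

    /** Messages per second for a message count and a wall-clock duration in nanoseconds. */
    static double messagesPerSecond(long messages, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            throw new IllegalArgumentException("elapsedNanos must be positive");
        }
        return messages / (elapsedNanos / 1e9);
    }

    /** Payload rate in MB/s (1 MB = 1,000,000 bytes) for fixed-size messages. */
    static double megabytesPerSecond(long messages, int bytesPerMessage, long elapsedNanos) {
        return messagesPerSecond(messages, elapsedNanos) * bytesPerMessage / 1e6;
    }

    public static void main(String[] args) {
        // Figures quoted in the thread: 7 trillion messages per day, 7 million partitions.
        long perPartitionPerDay = 7_000_000_000_000L / 7_000_000L; // 1,000,000
        double perPartitionPerSecond =
                messagesPerSecond(perPartitionPerDay, TimeUnit.DAYS.toNanos(1));
        System.out.printf("LinkedIn average per partition: %.2f msg/s%n", perPartitionPerSecond);

        // Placeholder measurement: 10 million 512-byte messages in 20 s (invented number).
        long elapsedNanos = TimeUnit.SECONDS.toNanos(20);
        System.out.printf("benchmark: %.0f msg/s, %.1f MB/s%n",
                messagesPerSecond(10_000_000L, elapsedNanos),
                megabytesPerSecond(10_000_000L, 512, elapsedNanos));
    }
}
```

In a real run, System.nanoTime() would be read just before the first send and
again once the last consumer has received its final message, and the
difference passed in as elapsedNanos.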

Cheers,

M. Queen


On Thu, Jan 6, 2022 at 2:14 PM Joris Peeters <joris.mg.peet...@gmail.com>
wrote:

> I'd just follow the instructions in https://kafka.apache.org/quickstart to
> set up Kafka and Zookeeper on a single node, by running the Java processes
> directly. Or you can run them in Docker.
>
> For the producer and consumer I'd personally use Python, as it's the
> easiest to get going. You may want to look at
> https://kafka-python.readthedocs.io/en/master/# (easier) and
> https://github.com/confluentinc/confluent-kafka-python (faster). Similar
> things exist for Go, Java, C++, ...
> Or I'm sure there are some benchmark setups out there that you can tweak a
> little.
>
> I appreciate that setting up everything on localhost will be easier and
> lead to big numbers, but bear in mind that it's typically all the other
> real-life stuff (remote connections, replication, at-least-once, ...) that
> causes massive slowdowns compared to localhost, and those are things banks
> eventually tend to need (I work in the finance industry myself). What you're
> doing is a very useful benchmark, but I'd surround it with the above
> caveats to avoid overpromising.
>
> -J
>
>
> On Thu, Jan 6, 2022 at 4:58 PM Marisa Queen <marisa.queen...@gmail.com>
> wrote:
>
> > Hi Joris,
> >
> > I've spoken to him. His answers are below:
> >
> >
> > On Thu, Jan 6, 2022 at 1:37 PM Joris Peeters <joris.mg.peet...@gmail.com>
> > wrote:
> >
> > > There are a few unknown parameters here that might influence the answer,
> > > though. Off the top of my head, at least:
> > > - How much replication of the data is needed (for high availability), and
> > > how many acks for the producer? (If fire-and-forget it can be faster; if
> > > you need to replicate and ack from 3 brokers in different DCs then it
> > > will be slower)
> > >
> >
> > Let's assume no high-availability for now, for simplicity's sake.
> > Fire-and-forget like he said. We don't want to overcomplicate this simple
> > benchmark and we want the highest possible throughput number.
> >
> >
> > > - Transactions? (If end-to-end exactly-once then it's a lot slower)
> > >
> >
> > Again no transactions. Let's keep it simple.
> >
> >
> > > - Size of the messages? (If each message is a GB it will obviously be
> > > slower)
> > >
> >
> > Let's assume 512 bytes. Powers of two are fun!
> >
> >
> > > - Distance and bandwidth between the producers, Kafka & the consumers?
> > > (If the network links get saturated that would limit the performance.
> > > Latency is likely less important than throughput, but if your consumers
> > > are in Tokyo and the producer in London then it will likely also be
> > > slower)
> > >
> >
> >
> > Loopback, same machine, for the love of God. Let's not even go there. We
> > want the highest possible throughput. I accept the limit of the speed of
> > light. If network particularities, and distances, are to be included in
> > this measurement then it is basically worth nothing. Loopback eliminates
> > all those network variables that we surely don't want to include in the
> > benchmark.
> >
> >
> > >
> > > FWIW, I find that the producer side is generally the limiting factor,
> > > especially if there is only one.
> > > I'd take a look at e.g. the Appendix test details on
> > > https://docs.confluent.io/2.0.0/clients/librdkafka/INTRODUCTION_8md.html .
> > > I haven't yet seen a faster Kafka impl than rdkafka, so those would be
> > > reasonable upper bounds.
> > >
> >
> >
> > Thanks for your reply, Joris. Can you point me to a Hello World Kafka
> > example, so I can set up this very basic and BARE BONES Kafka system,
> > without any of the complications you correctly mentioned above? I have 10
> > million messages that I need to send from producers to consumers. I have 1
> > topic, 1 producer for this topic, 4 partitions for this topic and 4
> > consumers, one for each partition. Everything loopback, same machine, no
> > high-availability, transactions, etc. just KAFKA BARE BONES. What can be
> > more trivial and basic than that?
> >
> > Cheers,
> >
> > M. Queen
> >
> >
> > >
> > > On Thu, Jan 6, 2022 at 4:25 PM Marisa Queen <marisa.queen...@gmail.com>
> > > wrote:
> > >
> > > > Hi Israel,
> > > >
> > > > Your email is great, but I'm afraid to forward it to my customer because
> > > > it doesn't answer his question.
> > > >
> > > > I'm hoping that other members of this list will be able to give me a
> > > > more NUMERIC answer, let's wait and see.
> > > >
> > > > Just to give you some follow-up on your answer, when you say:
> > > >
> > > > > 30 passengers per driver or aircraft per day may not sound impressive
> > > > > but 750,000 passengers per day all together is how you should look at it
> > > >
> > > > Well, with this reasoning one can come up with any desired throughput
> > > > number by just adding more partitions. Do you see my customer's point
> > > > that this does not make any sense? Adding more partitions also does not
> > > > come for free, because messages need to be separated into the newly
> > > > created partitions and ordering across partitions will be lost. Order is
> > > > important for some messages, so adding more and more partitions towards
> > > > an infinite throughput is not an option.
> > > >
> > > > I've just spoken to him here, his reply was:
> > > >
> > > > "Marisa, I'm asking a very simple question for a very basic Kafka
> > > > scenario. If I can't get an answer for that, then I'm in trouble. Can you
> > > > please find out with your peers/community what is a good throughput
> > > > number to have in mind for the scenario I've been describing. Again it is
> > > > a very basic and simple scenario: I have 10 million messages that I need
> > > > to send from producers to consumers. Let's assume I have 1 topic, 1
> > > > producer for this topic, 4 partitions for this topic and 4 consumers, one
> > > > for each partition. What I would like to know is: How long is it going to
> > > > take for these 10 million messages to travel all the way from the
> > > > producer to the consumers? That's the throughput performance number I'm
> > > > interested in."
> > > >
> > > > I surely won't tell him: "Hey, that's easy, you have 4 partitions, each
> > > > partition according to LinkedIn can handle 23 messages per second, so we
> > > > are looking for a 92 messages per second throughput here!"
> > > >
> > > > Cheers,
> > > >
> > > > M. Queen
> > > >
> > > >
> > > > On Thu, Jan 6, 2022 at 12:58 PM Israel Ekpo <israele...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi Marisa
> > > > >
> > > > > I think there may be some confusion about the throughput for each
> > > > > partition and I want to explain briefly using some analogies.
> > > > >
> > > > > Using transportation for example, if we were to pick an airline or
> > > > > ridesharing organization to describe the volume of customers they can
> > > > > support per day, we would have to look at how many total customers
> > > > > American Airlines can service in a day or how many customers Uber or
> > > > > Lyft can serve in a day. We would not zero in on only the number of
> > > > > customers a particular driver can service or the number of passengers a
> > > > > particular aircraft can service in a day. That would be very limiting
> > > > > considering the hundreds of thousands of aircraft or drivers actively
> > > > > transporting passengers in real time.
> > > > >
> > > > > 30 passengers per driver or aircraft per day may not sound impressive
> > > > > but 750,000 passengers per day all together is how you should look at it.
> > > > >
> > > > > Partitions in Kafka are just a logical unit for organizing and storing
> > > > > data within a Kafka topic. You should not base your analysis on just
> > > > > what a subunit of storage is able to support.
> > > > >
> > > > > I would recommend taking a look at Kafka Summit talks on performance
> > > > > and benchmarks to get some understanding of what Kafka is able to do
> > > > > and the applicable use cases in the Financial Services industry.
> > > > >
> > > > > A lot of reputable organizations already trust Kafka today for their
> > > > > needs, so this is already proven:
> > > > >
> > > > > https://kafka.apache.org/powered-by
> > > > >
> > > > > I hope this helps.
> > > > >
> > > > > Israel Ekpo
> > > > > Lead Instructor, IzzyAcademy.com
> > > > > https://www.youtube.com/c/izzyacademy
> > > > > https://izzyacademy.com/
> > > > >
> > > > >
> > > > > On Thu, Jan 6, 2022 at 10:01 AM Marisa Queen <marisa.queen...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Cheers from NYC!
> > > > > >
> > > > > > I'm trying to give a performance number to a potential client (from
> > > > > > the financial market) who asked me the following question:
> > > > > >
> > > > > > *"If I have a Kafka system setup in the best way possible for
> > > > > > performance, what is an approximate number that I can have in mind for
> > > > > > the throughput of this system?"*
> > > > > >
> > > > > > The client proceeded to say:
> > > > > >
> > > > > > *"What I want to know specifically, is how many messages per second
> > > > > > can I send from one side of my distributed system to the other side
> > > > > > with Apache Kafka."*
> > > > > >
> > > > > > And he concluded with:
> > > > > >
> > > > > > *"To give you an example, let's say I have 10 million messages that I
> > > > > > need to send from producers to consumers. Let's assume I have 1 topic,
> > > > > > 1 producer for this topic, 4 partitions for this topic and 4
> > > > > > consumers, one for each partition. What I would like to know is: How
> > > > > > long is it going to take for these 10 million messages to travel all
> > > > > > the way from the producer to the consumers? That's the throughput
> > > > > > performance number I'm interested in."*
> > > > > >
> > > > > > I read in a reddit post yesterday (for some reason I can't find the
> > > > > > post anymore) that Kafka is able to handle 7 trillion messages per
> > > > > > day. The LinkedIn article about it says:
> > > > > >
> > > > > > *"We maintain over 100 Kafka clusters with more than 4,000 brokers,
> > > > > > which serve more than 100,000 topics and 7 million partitions. The
> > > > > > total number of messages handled by LinkedIn’s Kafka deployments
> > > > > > recently surpassed 7 trillion per day."*
> > > > > >
> > > > > > The OP of the reddit post went on to say that WhatsApp is handling
> > > > > > around 64 billion messages per day (740,000 msgs per sec x 24 x 60 x
> > > > > > 60) and that 7 trillion for LinkedIn is a huge number, giving a
> > > > > > whopping 81 million messages per second for LinkedIn. But that doesn't
> > > > > > matter for my question.
> > > > > >
> > > > > > 7 trillion messages divided by 7 million partitions gives us 1
> > > > > > million messages per day per partition. So to calculate the throughput
> > > > > > we do:
> > > > > >
> > > > > >     1 million divided by 60 divided by 60 divided by 24 => *23
> > > > > >     messages per second per partition*
> > > > > >
> > > > > > We'll all agree that 23 messages per second per partition for
> > > > > > throughput performance is very low, so I can't give this number to my
> > > > > > potential client.
> > > > > >
> > > > > > So my question is: *What number should I give to my potential client?*
> > > > > > Note that he is a stubborn and strict bank CTO, so he won't take any
> > > > > > talk from me. He wants a mathematical answer using the scientific
> > > > > > method.
> > > > > >
> > > > > > Has anyone been in my shoes and can shed some light on this Kafka
> > > > > > throughput performance topic?
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > M. Queen
> > > > > >
> > > > >
> > > >
> > >
> >
>
