Thanks a bunch for the detailed response and tips! It looks like I have a
couple of knobs, at least one of which should work. I will do some runs to
figure out what works best for my use case.

Thanks again.

On Thu, Feb 26, 2015 at 9:03 AM, Jeff Wartes <jwar...@whitepages.com> wrote:

>
> A note on throughput with an at-least-once guarantee using the high-level
> consumer:
>
> The core unit of concurrency in kafka is the partition, because you can't
> have more active consumers in a group than partitions. Although you can ask
> for two messages from a given client instance and process those in parallel,
> the commit semantics mean that once you've asked for two messages, you *must*
> handle *both* of them successfully before you commit. If you commit before
> you've dealt with them both, you risk losing your at-least-once guarantee.
> If you can get away with it, it's easier to use single-threaded message
> processing than to manage outstanding messages. You just get a message,
> process it, and repeat. (You'd probably still want to refrain from calling
> commit after every message, though.)
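>
> For illustration, here's a rough sketch of that single-threaded loop using
> the 0.8.x high-level consumer. The topic name, ZK address, group id, and the
> process() body are placeholders I made up, not something from this thread:
>
>   import java.util.Collections;
>   import java.util.List;
>   import java.util.Map;
>   import java.util.Properties;
>   import kafka.consumer.Consumer;
>   import kafka.consumer.ConsumerConfig;
>   import kafka.consumer.ConsumerIterator;
>   import kafka.consumer.KafkaStream;
>   import kafka.javaapi.consumer.ConsumerConnector;
>   import kafka.message.MessageAndMetadata;
>
>   public class SingleThreadedWorker {
>     public static void main(String[] args) {
>       Properties props = new Properties();
>       props.put("zookeeper.connect", "zk1:2181");   // placeholder address
>       props.put("group.id", "my-group");            // placeholder group id
>       props.put("auto.commit.enable", "false");     // commit manually, after processing
>
>       ConsumerConnector connector =
>           Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
>       Map<String, List<KafkaStream<byte[], byte[]>>> streams =
>           connector.createMessageStreams(Collections.singletonMap("mytopic", 1));
>       ConsumerIterator<byte[], byte[]> it =
>           streams.get("mytopic").get(0).iterator();
>
>       while (it.hasNext()) {                 // blocks until a message arrives
>         MessageAndMetadata<byte[], byte[]> msg = it.next();
>         process(msg.message());              // e.g. save to the destination database
>         connector.commitOffsets();           // in practice, commit less often than this
>       }
>     }
>
>     private static void process(byte[] message) {
>       // placeholder for the real per-message work
>     }
>   }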
>
> Unfortunately, you've suggested processing a message takes some 400ms, so
> that's 2.5 messages/sec/consumer. To get 5k messages/sec, you'd need 2000
> partitions (clients) if you're using the partition as the unit of
> concurrency.
>
> Unfortunately again, the neat automatic partition assignment in the
> high-level consumer starts to bog down as the number of partitions goes
> up. My coworker found he was waiting more than 15 minutes for assignment
> to converge using 1000 partitions, and you suffer that whenever a consumer
> is created or destroyed.
>
>
> So you'll probably need to either handle batches of messages from one
> consumer instance concurrently, or use the lower-level consumer and handle
> partition assignment and broker failure cases yourself. Some tips if you
> want to stick with the high-level consumer:
>
> - Use a dedicated thread to manage a single kafka consumer. Don't use
> multiple threads talking to the same consumer instance. Load a batch of
> messages into a thread-safe data structure, and work from that to get your
> message-processing concurrency. (See the sketch after this list.)
> - Once you've pulled a batch of messages from the consumer, you must
> handle ALL of those messages before you can commit, or ask for more
> messages. Put another way, you can only commit when every message you've
> pulled has been handled to your satisfaction.
> - You'll still want a reasonable number of partitions. The partition count
> becomes the limit on your batch-processing concurrency, rather than your
> message-processing concurrency. You'll still need to do similar math to
> figure out how many partitions you need based on how long it takes to process
> a batch. If you're processing every message in the batch in parallel, the
> time it takes to process the batch is the max of the time it took to
> process any message in the batch.
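>
> To make that batch pattern concrete, here's a rough sketch of my own (not
> something from this thread): one thread owns the consumer and pulls a batch,
> a pool processes the batch in parallel, and commitOffsets() is only called
> after every message in the batch has been handled. The batch size, pool
> size, and the process() body are placeholder assumptions.
>
>   import java.util.ArrayList;
>   import java.util.List;
>   import java.util.concurrent.ExecutorService;
>   import java.util.concurrent.Executors;
>   import java.util.concurrent.Future;
>   import kafka.consumer.ConsumerIterator;
>   import kafka.javaapi.consumer.ConsumerConnector;
>
>   public class BatchingWorker {
>     private static final int BATCH_SIZE = 100;   // placeholder
>
>     // Only this thread ever touches the consumer iterator.
>     static void run(ConsumerConnector connector,
>                     ConsumerIterator<byte[], byte[]> it) throws Exception {
>       ExecutorService pool = Executors.newFixedThreadPool(16);
>       while (true) {
>         // Pull a batch. In practice you'd set consumer.timeout.ms and catch
>         // ConsumerTimeoutException so a partial batch doesn't block forever.
>         List<byte[]> batch = new ArrayList<byte[]>(BATCH_SIZE);
>         while (batch.size() < BATCH_SIZE && it.hasNext()) {
>           batch.add(it.next().message());
>         }
>
>         // Process the whole batch concurrently.
>         List<Future<?>> results = new ArrayList<Future<?>>(batch.size());
>         for (final byte[] message : batch) {
>           results.add(pool.submit(new Runnable() {
>             public void run() { process(message); }
>           }));
>         }
>
>         // Wait for ALL of them before committing, to keep at-least-once.
>         for (Future<?> f : results) {
>           f.get();   // rethrows if any message failed; don't commit in that case
>         }
>         connector.commitOffsets();
>       }
>     }
>
>     private static void process(byte[] message) {
>       // placeholder for the 300-500ms of real work per message
>     }
>   }
>
> With numbers like yours: if a batch of 100 is processed fully in parallel
> and the slowest message takes ~500ms, one consumer handles roughly 200
> messages/sec, so on the order of 25 partitions (plus headroom for fetch time
> and stragglers) could cover 5k/sec.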
>
>
>
>
>
> On 2/25/15, 9:58 AM, "Gwen Shapira" <gshap...@cloudera.com> wrote:
>
> >I don't have good numbers, but I've noticed that I usually scale the number
> >of partitions by the consumer rate rather than the producer rate.
> >
> >Writing to HDFS can be a bit slow (30MB/s is pretty typical, IIRC), so if I
> >need to write 5G a second, I need at least 15 consumers, which means at
> >least 15 partitions. Hopefully your consumers will be doing better. Maybe
> >your bottleneck will be 1GbE network speed. Who knows?
> >
> >A small-scale benchmark on your specific setup can go a long way in
> >capacity planning :)
> >
> >Gwen
> >
> >On Wed, Feb 25, 2015 at 9:45 AM, Anand Somani <meatfor...@gmail.com>
> >wrote:
> >
> >> Sweet! So I would not need to depend on ZK for consumption anymore. Thanks
> >> for the response Gwen, I will take a look at the link you have provided.
> >>
> >> From what I have read so far, for my scenario to work correctly I would
> >> have multiple partitions and a consumer per partition, is that correct? So
> >> to improve throughput on the consumer side, I will need to play with the
> >> number of partitions. Is there any recommendation on the ratio of
> >> partitions per topic, or can that be scaled up/out with more powerful
> >> hardware?
> >>
> >> Thanks
> >> Anand
> >>
> >> On Tue, Feb 24, 2015 at 8:11 PM, Gwen Shapira <gshap...@cloudera.com>
> >> wrote:
> >>
> >> > * ZK was not built for a 5K/s-writes type of load.
> >> > * Kafka 0.8.2.0 allows you to commit offsets to Kafka rather than ZK. I
> >> > believe this is recommended.
> >> > * You can also commit in batches (i.e. commit every 100 messages). This
> >> > will reduce the writes and still give you at-least-once, while
> >> > controlling the number of duplicates in case of failure. (A rough sketch
> >> > of this follows below the link.)
> >> > * Yes, this can be done with the high-level consumer. I give a few tips
> >> > here:
> >> >
> >> > http://ingest.tips/2014/10/12/kafka-high-level-consumer-frequently-missing-pieces/
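> >> >
> >> > For what it's worth, here is a rough sketch (mine, not from this thread)
> >> > of the "store offsets in Kafka, commit every N messages" idea with the
> >> > 0.8.2 high-level consumer; the ZK address, group id, and N=100 are
> >> > placeholder assumptions:
> >> >
> >> >   import java.util.Properties;
> >> >   import kafka.consumer.ConsumerIterator;
> >> >   import kafka.javaapi.consumer.ConsumerConnector;
> >> >
> >> >   public class BatchCommitExample {
> >> >     // 0.8.2 consumer properties that move offset commits from ZK to Kafka;
> >> >     // pass these to ConsumerConfig when creating the connector.
> >> >     static Properties consumerProps() {
> >> >       Properties props = new Properties();
> >> >       props.put("zookeeper.connect", "zk1:2181"); // still used for group membership
> >> >       props.put("group.id", "my-group");
> >> >       props.put("auto.commit.enable", "false");   // commit manually
> >> >       props.put("offsets.storage", "kafka");      // store offsets in Kafka, not ZK
> >> >       props.put("dual.commit.enabled", "false");  // don't also write them to ZK
> >> >       return props;
> >> >     }
> >> >
> >> >     // Commit once every n messages instead of once per message.
> >> >     static void consume(ConsumerConnector connector,
> >> >                         ConsumerIterator<byte[], byte[]> it, int n) {
> >> >       int sinceLastCommit = 0;
> >> >       while (it.hasNext()) {
> >> >         process(it.next().message());
> >> >         if (++sinceLastCommit >= n) { // on failure you may reprocess up to n messages
> >> >           connector.commitOffsets();
> >> >           sinceLastCommit = 0;
> >> >         }
> >> >       }
> >> >     }
> >> >
> >> >     private static void process(byte[] message) {
> >> >       // save to the destination database before the offset is committed
> >> >     }
> >> >   }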
> >> >
> >> > Gwen
> >> >
> >> > On Tue, Feb 24, 2015 at 1:57 PM, Anand Somani <meatfor...@gmail.com>
> >> > wrote:
> >> >
> >> > > Hi,
> >> > >
> >> > > This is a little long, since I wanted to explain the use case and then
> >> > > ask questions, so thanks for your attention.
> >> > >
> >> > > Use case:
> >> > >
> >> > > We have a use case where everything in the queue has to be consumed at
> >> > > least once. So the consumer has to have "consumed" (saved in some
> >> > > destination database) the message before confirming consumption to
> >> > > kafka (or ZK). Now, from what I have read so far this is possible, and
> >> > > we will have consumer groups and partitions. Here are some
> >> > > facts/numbers for our case:
> >> > >
> >> > > * We will potentially have messages with peaks of 5k/second.
> >> > > * We can play with the message size if that makes any difference (keep
> >> > > it < 100 bytes for a link, or put in the entire message with an avg
> >> > > size of 2-5K bytes).
> >> > > * We do not need replication, but might have a kafka cluster to handle
> >> > > the load.
> >> > > * Also, consuming a message (doing the work) will take anywhere from
> >> > > 300-500ms, and generally we would like the consumer to be no more than
> >> > > 1-2 minutes behind. So if a message shows up in the queue, it should
> >> > > show up in the database within 2 minutes.
> >> > >
> >> > > The questions I have are:
> >> > >   * If this has been covered before, please point me to it. Thanks.
> >> > >   * Is a "controlled commit per consumed message" possible/recommended
> >> > > at this load (I have read about some concerns around ZK)?
> >> > >   * Are there any recommendations on configuration in terms of
> >> > > partitions relative to the number of messages or consumers? Maybe more
> >> > > queues/topics?
> >> > >   * Anything else that we might need to watch out for?
> >> > >   * As for the client, I should be able to do this (control when the
> >> > > offset commit happens) with the high-level consumer, I suppose?
> >> > >
> >> > >
> >> > > Thanks
> >> > > Anand
> >> > >
> >> >
> >>
>
>
