Hi,

Thanks for sharing this and asking for feedback.  Sorry, I am probably missing
something basic, but I'm not sure how a multi-threaded consumer would
work.  I can imagine it's either:

a) I just have one thread poll Kafka.  If I want to process messages in
multiple threads, then I deal with that after polling, e.g. stick them into
a blocking queue or something, and have more threads that read from the
queue.

b) each thread creates its own KafkaConsumer.  They are all registered the
same way, and I leave it to Kafka to figure out what data to give to each
one.


(a) certainly makes things simple, but I worry about throughput -- is that
just as good as having one thread consuming each partition?
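
To make (a) concrete, here's roughly what I'm picturing -- this is just
pseudo-code, ConsumerRecord and the poll() signature are my guesses from the
javadoc:

BlockingQueue<ConsumerRecord> queue = new LinkedBlockingQueue<ConsumerRecord>(10000);

// the one thread that talks to the KafkaConsumer
while (running) {
  for (ConsumerRecord record : consumer.poll(100)) {
    queue.put(record);  // blocks if the workers fall behind (back-pressure)
  }
}

// each of the N worker threads just does:
while (running) {
  process(queue.take());
}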

(b) makes it a bit of a pain to figure out how many threads to use.  I
assume there is no point in using more threads than there are partitions,
so first you've got to figure out how many partitions there are in each
topic.  It might be nice if there were some util functions to simplify this.
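
Something purely hypothetical like this is what I mean -- I don't see
anything like it in the javadoc, consumerUtils / partitionCountsFor are
made up:

// hypothetical util that asks the cluster for partition counts per topic
Map<String, Integer> partitionCounts = consumerUtils.partitionCountsFor("topicA", "topicB");
int numPartitions = partitionCounts.get("topicA") + partitionCounts.get("topicB");
int numThreads = Math.min(maxThreads, numPartitions);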


Also, since the initial call to subscribe doesn't give the partition
assignment, does that mean the first call to poll() will always call the
ConsumerRebalanceCallback?
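
E.g., would something like this get invoked inside the very first poll()?
(I'm guessing at how the callback gets registered and at its method
signatures -- treat this as pseudo-code, not the actual proposed API):

consumer.subscribe("my-topic", new ConsumerRebalanceCallback() {
  public void onPartitionsAssigned(Consumer consumer, Collection<TopicPartition> partitions) {
    // does this fire during the first poll(), once the group decides who owns what?
    System.out.println("assigned: " + partitions);
  }
  public void onPartitionsRevoked(Consumer consumer, Collection<TopicPartition> partitions) {
    // presumably a good place to save/commit offsets before the partitions go away
    System.out.println("revoked: " + partitions);
  }
});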

Probably a short code sample would clear up all my questions.  I'm
imagining pseudo-code like:


int numPartitions = ...
int numThreads = Math.min(maxThreads, numPartitions);
// maybe this should be something even more complicated, taking into account
// how many other active consumers there are right now for the given group

List<MyConsumer> consumers = new ArrayList<MyConsumer>();
for (int i = 0; i < numThreads; i++) {
  MyConsumer c = new MyConsumer();
  c.subscribe(...);
  // if subscribe() is expensive, then this should already happen in another thread
  consumers.add(c);
}

// if each subscribe() happened in a different thread, we should put a
// barrier here, so everybody subscribes before anybody begins polling

// now launch a thread per consumer, where each one polls
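
For that last step I'm picturing something like the following -- again just
guessing at the record types, and process() is a stand-in for whatever
MyConsumer actually does with a record:

ExecutorService executor = Executors.newFixedThreadPool(numThreads);
for (final MyConsumer c : consumers) {
  executor.submit(new Runnable() {
    public void run() {
      while (!Thread.currentThread().isInterrupted()) {
        // assuming poll(timeout) hands back whatever records are available
        // for the partitions this consumer currently owns
        for (ConsumerRecord record : c.poll(100)) {
          c.process(record);
        }
      }
    }
  });
}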



If I'm on the right track, I'd like to expand this example to show how
each "MyConsumer" can keep track of its partitions & offsets, even in the
face of rebalances.  As Jay said, I think a minimal code example could
really help us see the utility & faults of the API.
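
For the offset-tracking part, this is very roughly what I have in mind
(the method names on ConsumerRecord and the commitOffsets() call are just
guesses / placeholders):

class MyConsumer {
  // one KafkaConsumer per MyConsumer, only ever touched by one thread
  private KafkaConsumer consumer = ...;
  private Map<TopicPartition, Long> lastProcessed = new HashMap<TopicPartition, Long>();

  void runLoop() {
    while (running) {
      for (ConsumerRecord record : consumer.poll(100)) {
        process(record);
        // remember how far we've gotten, per partition
        lastProcessed.put(new TopicPartition(record.topic(), record.partition()), record.offset());
      }
    }
  }

  // hooked up to the rebalance callback, before partitions are taken away
  void onRevoked(Collection<TopicPartition> partitions) {
    commitOffsets(lastProcessed);  // push to whatever offset store we're using
    for (TopicPartition tp : partitions) lastProcessed.remove(tp);
  }
}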

Overall I really like what I see -- it seems like a big improvement!

thanks,
Imran



On Mon, Feb 10, 2014 at 12:54 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:

> As mentioned in previous emails, we are also working on a re-implementation
> of the consumer. I would like to use this email thread to discuss the
> details of the public API. I would also like us to be picky about this
> public api now so it is as good as possible and we don't need to break it
> in the future.
>
> The best way to get a feel for the API is actually to take a look at the
> javadoc <http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/KafkaConsumer.html>,
> the hope is to get the api docs good enough so that it is self-explanatory.
> You can also take a look at the configs here
> <http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/ConsumerConfig.html>
>
> Some background info on implementation:
>
> At a high level the primary difference in this consumer is that it removes
> the distinction between the "high-level" and "low-level" consumer. The new
> consumer API is non-blocking and, instead of returning a blocking iterator,
> the consumer provides a poll() API that returns a list of records. We think
> this is better compared to the blocking iterators since it effectively
> decouples the threading strategy used for processing messages from the
> consumer. It is worth noting that the consumer is entirely single threaded
> and runs in the user thread. The advantage is that it can be easily
> rewritten in less multi-threading-friendly languages. The consumer batches
> data and multiplexes I/O over TCP connections to each of the brokers it
> communicates with, for high throughput. The consumer also allows long poll
> to reduce the end-to-end message latency for low throughput data.
>
> The consumer provides a group management facility that supports the concept
> of a group with multiple consumer instances (just like the current
> consumer). This is done through a custom heartbeat and group management
> protocol transparent to the user. At the same time, it allows users the
> option to subscribe to a fixed set of partitions and not use group
> management at all. The offset management strategy defaults to Kafka based
> offset management and the API provides a way for the user to use a
> customized offset store to manage the consumer's offsets.
>
> A key difference in this consumer also is the fact that it does not depend
> on zookeeper at all.
>
> More details about the new consumer design are here
> <https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design>
>
> Please take a look at the new API
> <http://people.apache.org/~nehanarkhede/kafka-0.9-consumer-javadoc/doc/kafka/clients/consumer/KafkaConsumer.html>
> and give us any thoughts you may have.
>
> Thanks,
> Neha
>
