great work, Dibyendu. looks like this would be a popular contribution. expanding on bharat's question a bit:

what happens if you submit multiple receivers to the cluster by creating and unioning multiple DStreams, as in the kinesis example here:
https://github.com/apache/spark/blob/ae58aea2d1435b5bb011e68127e1bcddc2edf5b2/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala#L123

for more context, the kinesis implementation above uses the Kinesis Client Library (KCL) to automatically assign - and load balance - stream shards among all KCL threads from all receivers (potentially coming and going as nodes die) on all executors/nodes, using DynamoDB as the association data store. ZooKeeper would be used for your Kafka consumers, of course, with ZooKeeper watches to handle the ephemeral nodes. and I see you're using Curator, which makes things easier.

as bharat suggested, running multiple receivers/dstreams may be desirable from a scalability and fault-tolerance standpoint. is this type of load balancing possible among your different Kafka consumers running in different ephemeral JVMs?

and isn't it fun proposing a popular piece of code? the question floodgates have opened! haha. :)

-chris

On Tue, Aug 26, 2014 at 7:29 AM, Dibyendu Bhattacharya <
dibyendu.bhattach...@gmail.com> wrote:

> Hi Bharat,
>
> Thanks for your email. If the "Kafka Reader" worker process dies, it will
> be replaced by a different machine, and it will start consuming from the
> offset where it left off (for each partition). The same would happen even
> if I had an individual Receiver for every partition.
>
> Regards,
> Dibyendu
>
>
> On Tue, Aug 26, 2014 at 5:43 AM, bharatvenkat <bvenkat.sp...@gmail.com>
> wrote:
>
>> I like this consumer for what it promises - better control over offsets
>> and recovery from failures. If I understand this right, it still uses a
>> single worker process to read from Kafka (one thread per partition) - is
>> there a way to specify multiple worker processes (on different machines)
>> to read from Kafka? Maybe one worker process for each partition?
>>
>> If there is no such option, what happens when the single machine hosting
>> the "Kafka Reader" worker process dies and is replaced by a different
>> machine (like in the cloud)?
>>
>> Thanks,
>> Bharat
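
For reference, below is a minimal sketch of the union-of-multiple-receivers pattern Chris references above. It is written against the stock receiver-based KafkaUtils.createStream API from spark-streaming-kafka (Spark 1.x), not against Dibyendu's consumer, whose factory method may differ; the ZooKeeper quorum, group id, topic name, and receiver count are placeholders.

// Sketch: multiple receiver-backed DStreams unioned into one, as in the
// Kinesis example linked above. Uses the stock Spark 1.x Kafka receiver API.
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object MultiReceiverKafkaWordCount {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("MultiReceiverKafka"), Seconds(2))

    val zkQuorum     = "zk1:2181,zk2:2181"   // placeholder ZooKeeper quorum
    val groupId      = "low-level-consumer"  // placeholder consumer group
    val numReceivers = 3                     // one receiver per DStream

    // One receiver-backed DStream per call; Spark places each receiver on an
    // executor, so the Kafka read load is spread across machines.
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, groupId, Map("mytopic" -> 1),
        StorageLevel.MEMORY_AND_DISK_SER)
    }

    // Union into a single DStream for downstream processing, mirroring the
    // Kinesis example.
    val lines = ssc.union(streams).map(_._2)
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Each createStream call yields its own receiver, so several calls spread the read work across up to that many executors; the union then presents them downstream as a single DStream, exactly as the Kinesis example does with its shard receivers.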
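And a rough sketch of the ephemeral-node-plus-watch recipe Chris describes for tracking consumers with Curator: each consumer JVM registers an ephemeral node, and every member watches the registry so it can rebalance partitions when consumers come and go. This is the generic recipe, not Dibyendu's actual implementation; the registry path and connect string are placeholders.

// Sketch: consumer membership via Curator ephemeral nodes and a child watch.
import org.apache.curator.framework.{CuratorFramework, CuratorFrameworkFactory}
import org.apache.curator.framework.recipes.cache.{PathChildrenCache, PathChildrenCacheEvent, PathChildrenCacheListener}
import org.apache.curator.retry.ExponentialBackoffRetry
import org.apache.zookeeper.CreateMode

object ConsumerMembership {
  def main(args: Array[String]): Unit = {
    val client = CuratorFrameworkFactory.newClient("zk1:2181", new ExponentialBackoffRetry(1000, 3))
    client.start()

    val registry = "/consumers/low-level-kafka/ids"  // placeholder registry path

    // Ephemeral-sequential node: ZooKeeper removes it automatically when this
    // JVM's session dies, which is what lets the survivors detect the failure.
    client.create()
      .creatingParentsIfNeeded()
      .withMode(CreateMode.EPHEMERAL_SEQUENTIAL)
      .forPath(registry + "/consumer-", Array.emptyByteArray)

    // Watch the registry; on any membership change, re-divide the topic's
    // partitions over the currently live consumers.
    val cache = new PathChildrenCache(client, registry, true)
    cache.getListenable.addListener(new PathChildrenCacheListener {
      override def childEvent(c: CuratorFramework, event: PathChildrenCacheEvent): Unit =
        event.getType match {
          case PathChildrenCacheEvent.Type.CHILD_ADDED |
               PathChildrenCacheEvent.Type.CHILD_REMOVED =>
            println(s"rebalancing across ${cache.getCurrentData.size()} live consumers")
          case _ => // ignore connection-state and data-update events
        }
    })
    cache.start()
  }
}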
what happens if you submit multiple receivers to the cluster by creating and unioning multiple DStreams as in the kinesis example here: https://github.com/apache/spark/blob/ae58aea2d1435b5bb011e68127e1bcddc2edf5b2/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala#L123 for more context, the kinesis implementation above uses the Kinesis Client Library (KCL) to automatically assign - and load balance - stream shards among all KCL threads from all receivers (potentially coming and going as nodes die) on all executors/nodes using DynamoDB as the association data store. ZooKeeper would be used for your Kafka consumers, of course. and ZooKeeper watches to handle the ephemeral nodes. and I see you're using Curator, which makes things easier. as bharat suggested, running multiple receivers/dstreams may be desirable from a scalability and fault tolerance standpoint. is this type of load balancing possible among your different Kafka consumers running in different ephemeral JVMs? and isn't it fun proposing a popular piece of code? the question floodgates have opened! haha. :) -chris On Tue, Aug 26, 2014 at 7:29 AM, Dibyendu Bhattacharya < dibyendu.bhattach...@gmail.com> wrote: > Hi Bharat, > > Thanks for your email. If the "Kafka Reader" worker process dies, it will > be replaced by different machine, and it will start consuming from the > offset where it left over ( for each partition). Same case can happen even > if I tried to have individual Receiver for every partition. > > Regards, > Dibyendu > > > On Tue, Aug 26, 2014 at 5:43 AM, bharatvenkat <bvenkat.sp...@gmail.com> > wrote: > >> I like this consumer for what it promises - better control over offset and >> recovery from failures. If I understand this right, it still uses single >> worker process to read from Kafka (one thread per partition) - is there a >> way to specify multiple worker processes (on different machines) to read >> from Kafka? Maybe one worker process for each partition? >> >> If there is no such option, what happens when the single machine hosting >> the >> "Kafka Reader" worker process dies and is replaced by a different machine >> (like in cloud)? >> >> Thanks, >> Bharat >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/Low-Level-Kafka-Consumer-for-Spark-tp11258p12788.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> >