Hi Yan,

That is a good suggestion. I believe non-ZooKeeper offset management will
be a feature in the upcoming Kafka 0.8.2 release, tentatively scheduled for
September.

https://cwiki.apache.org/confluence/display/KAFKA/Inbuilt+Consumer+Offset+Management

That should make this fairly easy to implement, but it will still require
explicit offset commits to avoid data loss, which is different from the
current KafkaUtils implementation.
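
To illustrate what I mean by explicit commits, here is a minimal sketch of
the commit-after-processing pattern. Every helper below is a hypothetical
stand-in, not a real Kafka 0.8.2 API:

    // Minimal sketch: commit only after the data is safely handed to Spark.
    var offset = readLastCommittedOffset()    // hypothetical helper
    while (!stopped) {                        // hypothetical shutdown flag
      val batch = fetchBatch(offset)          // hypothetical fetch from Kafka
      batch.foreach(msg => store(msg))        // hand messages to Spark first
      offset = batch.lastOffset + 1
      commitOffset(offset)                    // explicit commit; at-least-once
    }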

Jonathan

On Mon, Aug 4, 2014 at 4:51 PM, Yan Fang <yanfang...@gmail.com> wrote:

> Another suggestion that may help: you could consider using Kafka to store
> the latest offsets instead of ZooKeeper. There are at least two benefits:
> 1) it lowers the load on ZK, and 2) it supports replay from an arbitrary
> offset. This is how Samza <http://samza.incubator.apache.org/> handles
> Kafka offsets; the doc is here
> <http://samza.incubator.apache.org/learn/documentation/0.7.0/container/checkpointing.html>.
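> For illustration only, replaying from a saved offset might look like the
> sketch below, assuming the Kafka 0.8 Scala simple-consumer API (broker
> host, topic, and offset values here are made up):
>
>     import kafka.api.FetchRequestBuilder
>     import kafka.consumer.SimpleConsumer
>
>     // Replay from a saved offset instead of whatever ZK last recorded.
>     val consumer = new SimpleConsumer("broker-host", 9092, 100000, 64 * 1024, "replay-client")
>     val savedOffset = 12345L  // offset read back from Kafka (or any store)
>     val req = new FetchRequestBuilder()
>       .clientId("replay-client")
>       .addFetch("my-topic", 0, savedOffset, 100000)
>       .build()
>     val response = consumer.fetch(req)
>     for (mo <- response.messageSet("my-topic", 0)) {
>       // process mo.message; use mo.nextOffset for the next fetch
>     }
>     consumer.close()
>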
> Thank you.
>
> Cheers,
>
> Fang, Yan
> yanfang...@gmail.com
> +1 (206) 849-4108
>
>
> On Sun, Aug 3, 2014 at 8:59 PM, Patrick Wendell <pwend...@gmail.com>
> wrote:
>
>> I'll let TD chime in on this one, but I'm guessing this would be a
>> welcome addition. It's great to see community effort on adding new
>> streams/receivers; adding a Java API for receivers was something we did
>> specifically to allow this :)
>>
>> - Patrick
>>
>>
>> On Sat, Aug 2, 2014 at 10:09 AM, Dibyendu Bhattacharya <
>> dibyendu.bhattach...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have implemented a low-level Kafka consumer for Spark Streaming using
>>> the Kafka SimpleConsumer API, which gives better control over Kafka
>>> offset management and recovery from failures. The present Spark
>>> KafkaUtils uses the high-level Kafka consumer API, and the offset control
>>> I wanted is not possible with the high-level consumer.
>>>
>>> This project is available in the repo below:
>>>
>>> https://github.com/dibbhatt/kafka-spark-consumer
>>>
>>>
>>> I have implemented a custom receiver,
>>> consumer.kafka.client.KafkaReceiver. The KafkaReceiver uses the low-level
>>> Kafka consumer API (implemented in the consumer.kafka package) to fetch
>>> messages from Kafka and 'store' them in Spark.
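>>>
>>> A minimal sketch of this receiver pattern, using Spark's public Receiver
>>> API (the class name and constructor below are made up, not the actual
>>> KafkaReceiver in the repo):
>>>
>>>     import org.apache.spark.storage.StorageLevel
>>>     import org.apache.spark.streaming.receiver.Receiver
>>>
>>>     class SimpleKafkaReceiver(topic: String, partition: Int)
>>>         extends Receiver[Array[Byte]](StorageLevel.MEMORY_AND_DISK) {
>>>
>>>       def onStart(): Unit = {
>>>         // Fetch on a background thread so onStart returns quickly.
>>>         new Thread("Kafka Partition Consumer") {
>>>           override def run(): Unit = receive()
>>>         }.start()
>>>       }
>>>
>>>       def onStop(): Unit = { /* signal the fetch loop to exit */ }
>>>
>>>       private def receive(): Unit = {
>>>         while (!isStopped()) {
>>>           // Fetch from the Kafka partition via the simple-consumer API,
>>>           // then hand each message to Spark: store(messageBytes)
>>>         }
>>>       }
>>>     }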
>>>
>>> The logic detects the number of partitions for a topic and spawns that
>>> many threads (individual consumer instances). The Kafka consumer uses
>>> ZooKeeper to store the latest offset for each partition, which helps
>>> recovery in case of failure. The consumer logic is tolerant to ZK
>>> failures, Kafka partition-leader changes, Kafka broker failures, recovery
>>> from offset errors, and other failover aspects.
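>>>
>>> For illustration only, partition discovery might look like the sketch
>>> below, assuming the Kafka 0.8 Scala simple-consumer API (host and topic
>>> are made up, and the real logic in the project is more involved):
>>>
>>>     import kafka.api.TopicMetadataRequest
>>>     import kafka.consumer.SimpleConsumer
>>>
>>>     // Ask any broker for the topic's partition metadata.
>>>     val consumer = new SimpleConsumer("broker-host", 9092, 100000, 64 * 1024, "meta-client")
>>>     val meta = consumer.send(new TopicMetadataRequest(Seq("my-topic"), 0))
>>>     val partitions = meta.topicsMetadata.flatMap(_.partitionsMetadata)
>>>     partitions.foreach { p =>
>>>       // Spawn one consumer thread (or one Spark receiver) per partition.
>>>       println(s"partition ${p.partitionId}, leader ${p.leader}")
>>>     }
>>>     consumer.close()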
>>>
>>> consumer.kafka.client.Consumer is a sample consumer that uses these
>>> Kafka receivers to generate DStreams from Kafka and applies an output
>>> operation to every message of the RDD.
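>>>
>>> For illustration only, wiring it up might look like this (using the
>>> hypothetical SimpleKafkaReceiver from the sketch above, not the repo's
>>> actual API):
>>>
>>>     import org.apache.spark.SparkConf
>>>     import org.apache.spark.streaming.{Seconds, StreamingContext}
>>>
>>>     val conf = new SparkConf().setAppName("KafkaSparkConsumer")
>>>     val ssc = new StreamingContext(conf, Seconds(10))
>>>
>>>     // One receiver per partition, unioned into a single DStream.
>>>     val streams = (0 until 4).map(p => ssc.receiverStream(new SimpleKafkaReceiver("my-topic", p)))
>>>     val unified = ssc.union(streams)
>>>
>>>     // Output operation applied to every message of each RDD.
>>>     unified.foreachRDD(rdd => rdd.foreach(bytes => println(new String(bytes, "UTF-8"))))
>>>
>>>     ssc.start()
>>>     ssc.awaitTermination()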
>>>
>>> We are planning to use this Kafka Spark consumer for near-real-time
>>> indexing of Kafka messages into a target search cluster, and for
>>> near-real-time aggregation into target NoSQL storage.
>>>
>>> Kindly let me know your views. Also, if this looks good, can I
>>> contribute it to the Spark Streaming project?
>>>
>>> Regards,
>>> Dibyendu
>>>
>>
>>
>
