Re: [Spark Streaming] kafka consumer announce

Evgeniy Shishkin Thu, 21 Aug 2014 11:15:37 -0700

>> On 21 Aug 2014, at 20:25, Tim Smith wrote:
>> 
>> Thanks. Discovering kafka metadata from zookeeper instead of brokers
>> is nicer. Saving metadata and offsets to HBase, is that optional or
>> mandatory?
>> Can it be made optional (default to zookeeper)?
>>

For now we have implemented, and somewhat hardcoded, only HBase as the offset 
store.
We are aware that this is inconvenient for many users and use cases.

We have plans to abstract this and make an interface for offset management. 
Patches are really welcome.
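
As a rough illustration of what such an abstraction might look like, here is a 
minimal sketch. The names OffsetStore, InMemoryOffsetStore, commit and fetch are 
hypothetical and are not part of spark-kafka-streaming; an HBase- or 
ZooKeeper-backed store would implement the same two methods.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical offset storage abstraction (an assumption, not the project's API). */
interface OffsetStore {
    /** Record the offset processed so far for one topic partition. */
    void commit(String topic, int partition, long offset);

    /** Return the last committed offset, or empty if nothing was committed yet. */
    Optional<Long> fetch(String topic, int partition);
}

/** In-memory implementation, useful only for tests and for showing the contract. */
final class InMemoryOffsetStore implements OffsetStore {
    private final Map<String, Long> offsets = new ConcurrentHashMap<>();

    private static String key(String topic, int partition) {
        return topic + "/" + partition;
    }

    @Override
    public void commit(String topic, int partition, long offset) {
        // Overwrite the stored offset for this topic partition.
        offsets.put(key(topic, partition), offset);
    }

    @Override
    public Optional<Long> fetch(String topic, int partition) {
        return Optional.ofNullable(offsets.get(key(topic, partition)));
    }
}
```

With an interface along these lines, the consumer would depend only on 
OffsetStore, and the choice of backend would become a configuration matter.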

The main reason we implemented it this way at first is that ZooKeeper is not 
suitable for this kind of task.
As you may know, the default high-level consumer API stores offsets in ZooKeeper, 
with a default commit interval of 30 seconds.
Committing more often causes problems in ZooKeeper.
This opens a fairly large window for data loss or double processing.
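
To put a number on that window, here is a back-of-the-envelope sketch. The 
throughput of 1000 messages per second is an assumed figure for illustration 
only; the 30-second interval is the one mentioned above. The class and method 
names are hypothetical.

```java
final class CommitWindow {
    /**
     * Upper bound on messages consumed since the last offset commit,
     * i.e. messages that may be lost or processed twice after a crash.
     */
    static long maxUncommittedMessages(long messagesPerSecond, long commitIntervalMs) {
        return messagesPerSecond * commitIntervalMs / 1000;
    }

    public static void main(String[] args) {
        // Assumed rate: 1000 msg/s; commit interval: 30 s.
        System.out.println(maxUncommittedMessages(1_000, 30_000)); // prints 30000
    }
}
```

Shortening the commit interval shrinks this bound proportionally, which is why 
a store that tolerates frequent writes matters here.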

Our solution stores offsets in HBase because we were already using it to store 
our streaming computation results,
and HBase handles OLTP-style workloads very well.

For Kafka's 0.9 release, there will be a major rewrite of the consumer API. 
One of the main topics of discussion is offset storage and management.
They say there will be some support from Kafka's brokers. I think they want to 
store processed offsets in a special kind of topic.
And they will have an interface for implementing custom offset storage.

To summarize: we do have plans to rewrite offset management and patches are 
welcome.

>> 
>> 
>> On Thu, Aug 21, 2014 at 6:17 AM, Evgeniy Shishkin <[email protected]> 
>> wrote:
>>> Hello,
>>> 
>>> we are glad to announce yet another kafka input stream.
>>> 
>>> Available at https://github.com/wgnet/spark-kafka-streaming
>>> 
>>> It has been used in production for about 3 months.
>>> We will be happy to hear your feedback.
>>> 
>>> Custom Spark Kafka consumer based on Kafka SimpleConsumer API.
>>> 
>>> Features
>>> 
>>>       • discover kafka metadata from zookeeper (more reliable than from 
>>> brokers, does not depend on broker list changes)
>>>       • reading from multiple topics
>>>       • reliably handles leader election and topic reassignment
>>>       • saves offsets and stream metadata in hbase (more robust than 
>>> zookeeper)
>>>       • supports metrics via spark metrics mechanism (jmx, graphite, etc.)
>>> Todo
>>> 
>>>       • abstract offset storage
>>>       • time controlled offsets commit
>>>       • refactor kafka message to rdd elements transformation (flatmapper 
>>> method)
>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>> 
> 
