Hi,

This fault-tolerance aspect is already taken care of in the Kafka-Spark Consumer code, in ZkCoordinator.java (for example, when the leader of a partition changes). Basically it refreshes the PartitionManagers every X seconds to make sure the partition details stay correct and the consumer doesn't fail.
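To make the idea concrete, the refresh loop amounts to roughly the Java sketch below. This is a minimal sketch of the pattern only, not the project's actual API: the Coordinator interface and method names are illustrative stand-ins for what ZkCoordinator.java does with its PartitionManagers.

// Minimal sketch only: Coordinator stands in for the role ZkCoordinator.java
// plays; every name here except ZkCoordinator/PartitionManager is
// illustrative, not from the actual kafka-spark-consumer code.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RefreshLoopSketch {

    // Hypothetical stand-in for ZkCoordinator: re-reads partition metadata
    // and rebuilds any PartitionManager whose leader has moved.
    interface Coordinator {
        void refresh();
    }

    public static void start(final Coordinator coordinator, long refreshSeconds) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(new Runnable() {
            @Override
            public void run() {
                try {
                    // Every X seconds: verify partition details (e.g. the
                    // current leader) and replace stale PartitionManagers so
                    // the consumer keeps running instead of failing.
                    coordinator.refresh();
                } catch (Exception e) {
                    // A transient ZK or broker hiccup should not kill the
                    // refresh loop; log and retry on the next tick.
                    e.printStackTrace();
                }
            }
        }, 0, refreshSeconds, TimeUnit.SECONDS);
    }
}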
Dib

On Tue, Aug 5, 2014 at 8:01 PM, Shao, Saisai <saisai.s...@intel.com> wrote:

> Hi,
>
> I think this is an awesome feature for the Spark Streaming Kafka interface
> to offer users control over partition offsets, so users can build more
> applications on top of it.
>
> What concerns me is that if we want to do offset management,
> fault-tolerance control and the rest, we have to take on the role that
> ZookeeperConsumerConnector currently plays, and that is a big area to
> cover; for example, when a node fails, how do we hand its partitions over
> to another consumer? I'm not sure what your thoughts are.
>
> Thanks
> Jerry
>
> From: Dibyendu Bhattacharya [mailto:dibyendu.bhattach...@gmail.com]
> Sent: Tuesday, August 05, 2014 5:15 PM
> To: Jonathan Hodges; dev@spark.apache.org
> Cc: user
> Subject: Re: Low Level Kafka Consumer for Spark
>
> Thanks Jonathan,
>
> Yes, until non-ZK-based offset management is available in Kafka, I need to
> maintain the offsets in ZK. And yes, in both cases an explicit commit is
> necessary. I modified the Low Level Kafka Spark Consumer a little so that
> the Receiver spawns a thread for every partition of the topic and performs
> the 'store' operation from multiple threads. It would be good if the
> receiver.store methods were made thread-safe, which they presently are not.
>
> Waiting for TD's comment on this Kafka Spark Low Level consumer.
>
> Regards,
> Dibyendu
>
> On Tue, Aug 5, 2014 at 5:32 AM, Jonathan Hodges <hodg...@gmail.com> wrote:
>
> Hi Yan,
>
> That is a good suggestion. I believe non-Zookeeper offset management will
> be a feature in the upcoming Kafka 0.8.2 release, tentatively scheduled
> for September.
>
> https://cwiki.apache.org/confluence/display/KAFKA/Inbuilt+Consumer+Offset+Management
>
> That should make this fairly easy to implement, but it will still require
> explicit offset commits to avoid data loss, which is different from the
> current KafkaUtils implementation.
>
> Jonathan
>
> On Mon, Aug 4, 2014 at 4:51 PM, Yan Fang <yanfang...@gmail.com> wrote:
>
> Another suggestion that may help: you can consider using Kafka to store
> the latest offsets instead of Zookeeper. There are at least two benefits:
> 1) it lowers the workload on ZK, and 2) it supports replay from a given
> offset. This is how Samza (http://samza.incubator.apache.org/) deals with
> Kafka offsets; the doc is here:
> http://samza.incubator.apache.org/learn/documentation/0.7.0/container/checkpointing.html
> Thank you.
>
> Cheers,
>
> Fang, Yan
> yanfang...@gmail.com
> +1 (206) 849-4108
>
> On Sun, Aug 3, 2014 at 8:59 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>
> I'll let TD chime in on this one, but I'm guessing this would be a welcome
> addition. It's great to see community effort on adding new
> streams/receivers; adding a Java API for receivers was something we did
> specifically to allow this :)
>
> - Patrick
>
> On Sat, Aug 2, 2014 at 10:09 AM, Dibyendu Bhattacharya
> <dibyendu.bhattach...@gmail.com> wrote:
>
> Hi,
>
> I have implemented a Low Level Kafka Consumer for Spark Streaming using
> the Kafka Simple Consumer API. This API gives better control over Kafka
> offset management and recovery from failures.
> As the present Spark KafkaUtils uses the high-level Kafka Consumer API, I
> wanted better control over offset management, which is not possible with
> the Kafka high-level consumer.
>
> This project is available in the repo below:
>
> https://github.com/dibbhatt/kafka-spark-consumer
>
> I have implemented a custom receiver, consumer.kafka.client.KafkaReceiver.
> The KafkaReceiver uses the low-level Kafka Consumer API (implemented in
> the consumer.kafka packages) to fetch messages from Kafka and 'store' them
> in Spark.
>
> The logic detects the number of partitions for a topic and spawns that
> many threads (individual consumer instances). The Kafka consumer uses
> Zookeeper to store the latest offset for each partition, which helps it
> recover in case of failure. The Kafka consumer logic is tolerant to ZK
> failures, Kafka partition-leader changes, Kafka broker failures, recovery
> from offset errors, and other fail-over aspects.
>
> consumer.kafka.client.Consumer is a sample consumer that uses these Kafka
> receivers to generate DStreams from Kafka and applies an output operation
> to every message of the RDD.
>
> We are planning to use this Kafka Spark Consumer to perform near-real-time
> indexing of Kafka messages into a target search cluster, and also
> near-real-time aggregation into a target NoSQL store.
>
> Kindly let me know your views. Also, if this looks good, can I contribute
> it to the Spark Streaming project?
>
> Regards,
> Dibyendu
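P.S. For anyone on the thread who hasn't used the Simple Consumer API: the fetch-then-commit cycle being discussed looks roughly like the sketch below against Kafka 0.8's kafka.javaapi classes. It's a minimal sketch, not the receiver's actual code: commitOffsetToZk is a hypothetical placeholder for the ZooKeeper write (to a path such as /consumers/<group>/offsets/<topic>/<partition>), and the leader re-discovery and offset-reset handling the receiver really does are elided.

// Sketch of one fetch/commit cycle with the Kafka 0.8 Simple Consumer API.
// Assumes a connected kafka.javaapi.consumer.SimpleConsumer; the
// commitOffsetToZk helper is hypothetical.
import java.nio.ByteBuffer;

import kafka.api.FetchRequest;
import kafka.api.FetchRequestBuilder;
import kafka.javaapi.FetchResponse;
import kafka.javaapi.consumer.SimpleConsumer;
import kafka.message.MessageAndOffset;

public class LowLevelFetchSketch {

    // Fetches one batch starting at 'offset' and returns the next offset
    // to fetch from.
    public static long fetchOnce(SimpleConsumer consumer, String topic,
                                 int partition, long offset) {
        FetchRequest req = new FetchRequestBuilder()
                .clientId("kafka-spark-consumer")
                .addFetch(topic, partition, offset, 64 * 1024) // 64 KB fetch
                .build();
        FetchResponse response = consumer.fetch(req);

        if (response.hasError()) {
            // e.g. OffsetOutOfRange or NotLeaderForPartition: the real
            // receiver resets the offset or re-finds the leader here.
            throw new RuntimeException(
                    "Fetch failed: " + response.errorCode(topic, partition));
        }

        for (MessageAndOffset mo : response.messageSet(topic, partition)) {
            ByteBuffer payload = mo.message().payload();
            byte[] bytes = new byte[payload.limit()];
            payload.get(bytes);
            // In the receiver, this is where store(bytes) hands the
            // message to Spark Streaming.
            offset = mo.nextOffset();
        }

        // Explicit commit only after the batch is handed to Spark.
        commitOffsetToZk(topic, partition, offset);
        return offset;
    }

    // Hypothetical: persists the next offset to ZooKeeper (or, once Kafka
    // 0.8.2's inbuilt offset management lands, to Kafka itself).
    private static void commitOffsetToZk(String topic, int partition,
                                         long offset) {
        // left abstract in this sketch
    }
}

The design point the thread keeps coming back to is that explicit commit: unlike the high-level consumer's auto-commit, the offset only advances in ZK after Spark has the data, which is what makes replay-from-offset on recovery possible.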