Thanks Jason and Cody. I'll try to explain the multi-DC case a bit better. As I mentioned before, I'm planning to use one Kafka cluster and two or more distinct Spark clusters.
Let's say we have the following DC configuration in the nominal case, where the Kafka partitions are consumed uniformly by the two datacenters:

DataCenter | Spark      | Kafka Consumer Group | Kafka partition (P1 to P4)
DC 1       | Master 1.1 |                      |
           | Worker 1.1 | my_group             | P1
           | Worker 1.2 | my_group             | P2
DC 2       | Master 2.1 |                      |
           | Worker 2.1 | my_group             | P3
           | Worker 2.2 | my_group             | P4

In case of a DC crash, I would like the partitions to be rebalanced onto the healthy DC, something like the following:

DataCenter | Spark      | Kafka Consumer Group | Kafka partition (P1 to P4)
DC 1       | Master 1.1 |                      |
           | Worker 1.1 | my_group             | P1*, P3*
           | Worker 1.2 | my_group             | P2*, P4*
DC 2       | Master 2.1 |                      |
           | Worker 2.1 | my_group             | P3
           | Worker 2.2 | my_group             | P4

I would like to know if this is possible:
- using a consumer group?
- using the direct approach? I prefer this one as I don't want to activate the WAL.

Hope the explanation is better!

On Mon, Apr 18, 2016 at 6:50 PM, Cody Koeninger <c...@koeninger.org> wrote:
> The current direct stream only handles exactly the partitions specified at
> startup. You'd have to restart the job if you changed partitions.
>
> https://issues.apache.org/jira/browse/SPARK-12177 has the ongoing work
> towards using the Kafka 0.10 consumer, which would allow for dynamic
> topic partitions.
>
> Regarding your multi-DC questions, I'm not really clear on what you're
> saying.
>
> On Mon, Apr 18, 2016 at 11:18 AM, Erwan ALLAIN <eallain.po...@gmail.com>
> wrote:
> > Hello,
> >
> > I'm currently designing a solution where 2 distinct Spark clusters (2
> > datacenters) share the same Kafka cluster (Kafka rack-aware or manual
> > broker repartition).
> > The aims are:
> > - preventing DC crash: using Kafka resiliency and the consumer group
> > mechanism (or something else?)
> > - keeping offsets consistent among replicas (vs. MirrorMaker, which does
> > not keep offsets)
> >
> > I have several questions.
> >
> > 1) Dynamic repartition (one or 2 DCs)
> >
> > I'm using KafkaDirectStream, which maps one Kafka partition to one Spark
> > partition. Is it possible to handle new or removed partitions?
> > In the compute method, it looks like we are always using the currentOffset
> > map to query the next batch, and therefore it's always the same number of
> > partitions? Can we request metadata at each batch?
> >
> > 2) Multi-DC Spark
> >
> > Using the direct approach, a way to achieve this would be:
> > - to "assign" (Kafka 0.9 term) all topics to the 2 Spark clusters
> > - only one is reading the partition (check every x interval, "lock" stored
> > in Cassandra for instance)
> >
> > => not sure if it works, just an idea
> >
> > Using a consumer group:
> > - commit offsets manually at the end of the batch
> >
> > => Does Spark handle partition rebalancing?
> >
> > I'd appreciate any ideas! Let me know if it's not clear.
> >
> > Erwan
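
To make the consumer group option concrete, here is a minimal standalone sketch (not the Spark direct stream; the broker address, topic name and per-record processing are placeholder assumptions) of what each worker in both DCs could run with the plain Kafka 0.9 consumer API. Because every consumer shares group.id "my_group" and uses subscribe() rather than assign(), the group coordinator moves P3/P4 over to the DC 1 workers if DC 2's consumers disappear, which is the rebalancing shown in the second table; offsets are committed manually at the end of each batch, as in the quoted message:

import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer

object MyGroupWorker {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka1:9092")   // placeholder broker list
    props.put("group.id", "my_group")               // same group in DC 1 and DC 2
    props.put("enable.auto.commit", "false")        // commit manually after each batch
    props.put("key.deserializer", classOf[StringDeserializer].getName)
    props.put("value.deserializer", classOf[StringDeserializer].getName)

    val consumer = new KafkaConsumer[String, String](props)
    // subscribe() (not assign()) lets the group coordinator rebalance P1..P4
    // across whichever members of my_group are still alive
    consumer.subscribe(Collections.singletonList("my_topic"))  // placeholder topic

    while (true) {
      val records = consumer.poll(1000)
      records.asScala.foreach { record =>
        println(record.value())  // placeholder for the real per-record processing
      }
      consumer.commitSync()      // commit only once the batch has been processed
    }
  }
}

With a setup like this, the failover in the second table comes from the Kafka group protocol itself rather than from Spark; whether the Spark direct stream can behave the same way would depend on the Kafka 0.10 consumer work in SPARK-12177 that Cody mentions.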