I think the bigger question is what happens to Kafka and your downstream data store when DC2 crashes.
From a Spark point of view, starting up a post-crash job in a new data
center isn't really different from starting up a post-crash job in the
original data center.

On Tue, Apr 19, 2016 at 3:32 AM, Erwan ALLAIN <eallain.po...@gmail.com> wrote:
> Thanks Jason and Cody. I'll try to explain the multi-DC case a bit better.
>
> As I mentioned before, I'm planning to use one Kafka cluster and two or
> more distinct Spark clusters.
>
> Let's say we have the following DC configuration in the nominal case.
> Kafka partitions are consumed uniformly by the 2 datacenters.
>
> DataCenter   Spark        Kafka Consumer Group   Kafka partition (P1 to P4)
> DC 1         Master 1.1
>              Worker 1.1   my_group               P1
>              Worker 1.2   my_group               P2
> DC 2         Master 2.1
>              Worker 2.1   my_group               P3
>              Worker 2.2   my_group               P4
>
> In case of a DC crash, I would like a rebalancing of the partitions onto
> the healthy DC, something as follows:
>
> DataCenter   Spark        Kafka Consumer Group   Kafka partition (P1 to P4)
> DC 1         Master 1.1
>              Worker 1.1   my_group               P1*, P3*
>              Worker 1.2   my_group               P2*, P4*
> DC 2         Master 2.1
>              Worker 2.1   my_group               P3
>              Worker 2.2   my_group               P4
>
> I would like to know if it's possible:
> - using a consumer group?
> - using the direct approach? I prefer this one as I don't want to
>   activate the WAL.
>
> Hope the explanation is better!
>
>
> On Mon, Apr 18, 2016 at 6:50 PM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> The current direct stream only handles exactly the partitions
>> specified at startup. You'd have to restart the job if you changed
>> partitions.
>>
>> https://issues.apache.org/jira/browse/SPARK-12177 has the ongoing work
>> towards using the Kafka 0.10 consumer, which would allow for dynamic
>> topic partitions.
>>
>> Regarding your multi-DC questions, I'm not really clear on what you're
>> saying.
>>
>> On Mon, Apr 18, 2016 at 11:18 AM, Erwan ALLAIN <eallain.po...@gmail.com>
>> wrote:
>> > Hello,
>> >
>> > I'm currently designing a solution where two distinct Spark clusters
>> > (2 datacenters) share the same Kafka cluster (rack-aware Kafka or
>> > manual broker assignment).
>> > The aims are:
>> > - surviving a DC crash: using Kafka resiliency and the consumer group
>> >   mechanism (or something else?)
>> > - keeping offsets consistent among replicas (unlike MirrorMaker, which
>> >   does not preserve offsets)
>> >
>> > I have several questions.
>> >
>> > 1) Dynamic repartitioning (one or 2 DCs)
>> >
>> > I'm using the Kafka direct stream, which maps one Kafka partition to
>> > one Spark partition. Is it possible to handle new or removed
>> > partitions?
>> > In the compute method, it looks like we always use the currentOffsets
>> > map to query the next batch, so it's always the same number of
>> > partitions? Can we request metadata at each batch?
>> >
>> > 2) Multi-DC Spark
>> >
>> > Using the direct approach, a way to achieve this would be:
>> > - "assign" (Kafka 0.9 term) all topics to both Spark clusters
>> > - only one reads a given partition (check every X interval, "lock"
>> >   stored in Cassandra for instance)
>> >
>> > => not sure if it works, just an idea
>> >
>> > Using a consumer group:
>> > - commit offsets manually at the end of the batch
>> >
>> > => Does Spark handle partition rebalancing?
>> >
>> > I'd appreciate any ideas! Let me know if it's not clear.
>> >
>> > Erwan
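
For reference, here is a minimal sketch (against the spark-streaming-kafka
0.8 direct API this thread is about) of starting a direct stream from
offsets kept in an external store and saving the offset ranges at the end
of each batch. The broker addresses, batch interval and the
loadOffsets/saveOffsets helpers are placeholders, not a real API; the point
is only that a job restarted in either DC from the same stored offsets
behaves the same way, and that the partition set is fixed when the stream
is created.

import kafka.serializer.StringDecoder
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

object DirectStreamRestart {

  // Hypothetical helpers: persist offsets in whatever store both DCs can
  // reach (Cassandra, ZooKeeper, a DB, ...). Not a real library API.
  def loadOffsets(): Map[TopicAndPartition, Long] = ???
  def saveOffsets(ranges: Array[OffsetRange]): Unit = ???

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("restart-from-stored-offsets")
    val ssc = new StreamingContext(conf, Seconds(30))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

    // The partitions (and starting offsets) are fixed here, at startup --
    // this is the "exactly the partitions specified at startup" behaviour.
    val fromOffsets: Map[TopicAndPartition, Long] = loadOffsets()

    val messageHandler = (mmd: MessageAndMetadata[String, String]) =>
      (mmd.key, mmd.message)

    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets, messageHandler)

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... write results to the downstream store ...
      saveOffsets(ranges) // commit "manually at the end of the batch"
    }

    ssc.start()
    ssc.awaitTermination()
  }
}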
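
And one possible reading of the "lock stored in Cassandra" idea quoted
above, sketched as a per-partition lease taken with a lightweight
transaction. The keyspace, table and TTL are made up for illustration; it
is not a complete solution (actually extending a lease you already hold
would need a conditional UPDATE, and a lease expiring mid-batch needs
care), and since the direct stream's partitions are fixed at startup, a DC
taking over partitions would still have to restart its job with the new
assignment.

import com.datastax.driver.core.Cluster

object PartitionLease {

  // Contact point and keyspace are placeholders.
  val cluster = Cluster.builder().addContactPoint("cassandra-host").build()
  val session = cluster.connect("streaming")

  // Assumed schema:
  // CREATE TABLE streaming.partition_lease (
  //   topic text, partition int, owner text,
  //   PRIMARY KEY ((topic, partition)));

  /** Try to acquire a lease on a topic-partition for this DC. */
  def tryAcquire(topic: String, partition: Int, dc: String, ttlSeconds: Int): Boolean = {
    val rs = session.execute(
      s"INSERT INTO partition_lease (topic, partition, owner) " +
        s"VALUES ('$topic', $partition, '$dc') IF NOT EXISTS USING TTL $ttlSeconds")
    val row = rs.one()
    // [applied] = true  -> we now own the lease until the TTL expires
    // [applied] = false -> the row already exists; it may already be ours
    row.getBool("[applied]") || row.getString("owner") == dc
  }
}

Each DC would call tryAcquire for the partitions it wants before building
its direct stream, and a watchdog could re-check ownership every X interval
as suggested above.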