The current direct stream only handles exactly the partitions specified at startup. You'd have to restart the job if you changed partitions.
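To make that concrete, here's a minimal sketch of the 0.8-style direct stream (the broker list, topic name and starting offsets below are placeholders, not from your setup). The set of TopicAndPartition keys passed at creation is exactly what the stream reads for its whole lifetime, so a partition added to the topic later is never picked up:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object FixedPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("fixed-partitions"), Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

    // The partition set is frozen here: only mytopic partitions 0 and 1
    // will ever be read by this stream. Starting offsets are placeholders.
    val fromOffsets = Map(
      TopicAndPartition("mytopic", 0) -> 0L,
      TopicAndPartition("mytopic", 1) -> 0L)

    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

    stream.foreachRDD(rdd => println(s"records in batch: ${rdd.count()}"))

    ssc.start()
    ssc.awaitTermination()
  }
}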
https://issues.apache.org/jira/browse/SPARK-12177 has the ongoing work towards using the Kafka 0.10 consumer, which would allow for dynamic topic-partitions.

Regarding your multi-DC questions, I'm not really clear on what you're saying.

On Mon, Apr 18, 2016 at 11:18 AM, Erwan ALLAIN <eallain.po...@gmail.com> wrote:
> Hello,
>
> I'm currently designing a solution where 2 distinct Spark clusters (2
> datacenters) share the same Kafka (Kafka rack aware or manual broker
> repartition).
> The aims are:
> - surviving a DC crash: using Kafka resiliency and the consumer group
>   mechanism (or something else?)
> - keeping offsets consistent across replicas (vs. MirrorMaker, which
>   does not preserve offsets)
>
> I have several questions.
>
> 1) Dynamic repartition (one or 2 DCs)
>
> I'm using KafkaDirectStream, which maps one Kafka partition to one Spark
> partition. Is it possible to handle new or removed partitions?
> In the compute method, it looks like we always use the currentOffsets map
> to query the next batch, so it's always the same number of partitions.
> Can we request metadata at each batch?
>
> 2) Multi-DC Spark
>
> Using the direct approach, a way to achieve this would be:
> - "assign" (Kafka 0.9 term) all topics to both Spark clusters
> - only one of them actually reads a given partition (check every x
>   interval, with a "lock" stored in Cassandra for instance)
>
> => not sure if it works, just an idea
>
> Using a consumer group:
> - commit offsets manually at the end of the batch
>
> => Does Spark handle partition rebalancing?
>
> I'd appreciate any ideas! Let me know if it's not clear.
>
> Erwan
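For the "commit offsets manually at the end of the batch" idea above, the usual pattern with the direct stream looks roughly like this (saveOffsets is a hypothetical helper standing in for a write to whatever store you use, e.g. Cassandra; it is not a Spark API):

import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

object ManualOffsetSketch {
  // Hypothetical helper: persist the batch's offset ranges to an external
  // store (Cassandra in the design described above).
  def saveOffsets(ranges: Array[OffsetRange]): Unit =
    ranges.foreach(r =>
      println(s"${r.topic}-${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))

  def process(stream: DStream[(String, String)]): Unit = {
    stream.foreachRDD { rdd =>
      // The cast only works on the RDD that comes straight out of the
      // direct stream, before any transformation that changes partitioning.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // ... process the batch here ...

      // Record offsets only after the batch's work has succeeded, so a
      // restart replays from the last stored position (at-least-once).
      saveOffsets(offsetRanges)
    }
  }
}

If the offsets live in the same store as the "lock", the standby DC could pick up from wherever the active DC last recorded its position, which seems to be the point of keeping offsets consistent across DCs.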