Hello,

I'm currently designing a solution where two distinct Spark clusters (one per datacenter) share the same Kafka cluster (relying on Kafka rack awareness or a manual partition assignment across brokers). The aims are:
- surviving a DC crash, by relying on Kafka's resiliency and the consumer group mechanism (or something else?)
- keeping offsets consistent across replicas (as opposed to MirrorMaker, which does not preserve offsets)
I have several questions.

1) Dynamic repartitioning (one or two DCs)
I'm using KafkaDirectStream, which maps each Kafka partition to one Spark partition. Is it possible to handle added or removed partitions? In the compute method, it looks like the currentOffsets map is always used to query the next batch, so the number of partitions never changes. Can we refresh the topic metadata at each batch? (see sketch A below)

2) Multi-DC Spark

*Using the direct approach*, a way to achieve this would be:
- "assign" (in Kafka 0.9 terms) all topic partitions to both Spark clusters
- have only one of them actually read each partition (checked every x interval, with a "lock" stored in Cassandra for instance) => not sure if it works, just an idea (see sketch B below)

*Using a consumer group*
- commit offsets manually at the end of the batch => does Spark handle partition rebalancing? (see sketch C below)

I'd appreciate any ideas! Let me know if anything is unclear.

Erwan
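
PS: a few rough sketches of what I have in mind, in case it helps the discussion. All broker addresses, topic, table and group names below are made up, and I'm assuming the spark-streaming-kafka-0-10 connector would be an option; please correct me if that assumption is wrong.

Sketch A (question 1): as far as I can tell, the 0.8 direct API fixes the partition set when the stream is created, while the 0-10 connector's Subscribe strategy lets the underlying consumer re-check topic metadata, so a partition added later should be picked up by a later batch:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("dc-shared-reader")    // made-up app name
val ssc  = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker1:9092,broker2:9092",       // made-up brokers
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "dc-shared-group",                 // made-up group id
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)         // commit manually, see sketch C
)

// Subscribe (instead of Assign with a fixed fromOffsets map) lets the driver's
// consumer re-discover partitions, so partitions added after the stream was
// created should show up in subsequent batches.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Seq("my-topic"), kafkaParams)    // made-up topic
)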
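
Sketch B (question 2, direct approach): for the "only one DC reads a partition" idea, each driver could try to take a short-lived lease per topic-partition in Cassandra through a lightweight transaction before each batch; only the cluster whose insert is applied processes that partition, and the TTL lets the other DC take over if the owner disappears. Table, keyspace and host names are hypothetical:

import com.datastax.driver.core.Cluster

// Hypothetical table:
//   CREATE TABLE streaming.partition_locks (resource text PRIMARY KEY, owner text);

val cassandra = Cluster.builder().addContactPoint("cassandra-host").build()  // made-up host
val session   = cassandra.connect("streaming")                               // made-up keyspace

// Try to take a short-lived lease on a topic-partition with a lightweight
// transaction; the TTL makes the lock expire on its own, so if the owning DC
// dies the other DC can win the next attempt after at most ttlSeconds.
def tryAcquire(resource: String, owner: String, ttlSeconds: Int): Boolean = {
  val rs = session.execute(
    s"INSERT INTO partition_locks (resource, owner) VALUES ('$resource', '$owner') " +
    s"IF NOT EXISTS USING TTL $ttlSeconds")
  rs.wasApplied()   // true only for the cluster that won the race
}

// e.g. on the driver before each batch:
// val iOwnIt = tryAcquire("my-topic-0", "spark-dc1", 60)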
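
Sketch C (question 2, consumer group): committing offsets manually at the end of the batch could look like this with the 0-10 connector, reusing the stream from sketch A:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// Commit each batch's offsets back to Kafka only after the batch output has
// been written, so a restart resumes from the last fully processed batch.
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... write the batch to the sink here (Cassandra, HDFS, ...) ...

  // commitAsync stores the offsets in Kafka under the stream's group.id
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

Since the commit is asynchronous and only gives at-least-once semantics, the sink would still have to tolerate replayed records.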