Hello,

I'm currently designing a solution in which two distinct Spark clusters (one
per datacenter) share the same Kafka cluster (rack-aware replica placement,
or manual broker assignment).
The aims are:
- surviving a DC crash: relying on Kafka resiliency and the consumer group
mechanism (or something else?)
- keeping offsets consistent across replicas (unlike MirrorMaker, which does
not preserve offsets)

I have several questions

1) Dynamic repartitioning (one or two DCs)

I'm using KafkaDirectStream, which maps each Kafka partition to one Spark
partition. Is it possible to handle newly added or removed partitions?
In the compute method, it looks like the currentOffsets map is always used to
query the next batch, so the number of partitions never changes. Can we
request topic metadata at each batch?
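To make the question concrete, here is a minimal sketch of what "refresh the
partition set at each batch" could mean, kept as pure collections so it is
self-contained. `refresh`, the `(topic, partition)` tuple type, and the idea
of starting new partitions at offset 0 are all my own assumptions, not the
actual DirectKafkaInputDStream internals; in reality the live set would come
from a Kafka metadata request.

```scala
// Sketch: per-batch partition discovery. `current` is the offsets we already
// track (like currentOffsets in the direct stream); `live` is what a fresh
// metadata request reports. New partitions start at 0L (or the smallest
// available offset in a real implementation); vanished ones are dropped.
object PartitionDiscovery {
  type TopicPartition = (String, Int)

  def refresh(current: Map[TopicPartition, Long],
              live: Set[TopicPartition]): Map[TopicPartition, Long] = {
    val kept  = current.filter { case (tp, _) => live.contains(tp) }
    val added = (live -- current.keySet).map(tp => tp -> 0L).toMap
    kept ++ added
  }
}
```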

2) Multi DC Spark

*Using the direct approach,* one way to achieve this would be:
- "assign" (in Kafka 0.9 terms) all topic partitions to both Spark clusters
- have only one cluster actually read each partition (checked every X
interval, with a "lock" stored in Cassandra, for instance)

=> not sure if it works, just an idea
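A sketch of that lock idea, assuming a compare-and-set primitive with expiry
such as Cassandra's lightweight transactions (INSERT ... IF NOT EXISTS plus a
TTL). A ConcurrentHashMap stands in for the Cassandra table here, and the
`PartitionLock` class, its `ttlMs` parameter, and the injected clock are all
hypothetical names of mine:

```scala
import java.util.concurrent.ConcurrentHashMap

// One lease per partition: whichever DC claims it first reads the partition;
// if the owner stops renewing, the lease expires and the other DC takes over.
class PartitionLock(ttlMs: Long,
                    now: () => Long = () => System.currentTimeMillis()) {
  private case class Lease(owner: String, expiresAt: Long)
  private val leases = new ConcurrentHashMap[String, Lease]()

  // Atomically claim `partition` for `owner`. Succeeds if the lease is free,
  // expired, or already held by `owner` (renewal).
  def tryAcquire(partition: String, owner: String): Boolean = {
    val t = now()
    val winner = leases.merge(
      partition,
      Lease(owner, t + ttlMs),
      (old, fresh) => if (old.owner == owner || old.expiresAt <= t) fresh else old)
    winner.owner == owner
  }
}
```

Each Spark cluster would call tryAcquire on its check interval and process
only the partitions it currently holds; a crashed DC simply stops renewing.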

*Using a consumer group*
- commit offsets manually at the end of each batch

=> Does Spark handle partition rebalancing?
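What I mean by "commit offsets manually at the end of the batch" is the usual
at-least-once pattern: do the batch's work first, persist the offsets only
afterwards, so a crash replays the batch rather than losing it. A minimal
stand-alone sketch, where `ManualCommit`, `processBatch`, and the in-memory
`committed` map (standing in for ZooKeeper, Kafka's offset storage, or
Cassandra) are illustrative names of mine:

```scala
object ManualCommit {
  type TopicPartition = (String, Int)

  // Returns the new committed-offset map; `write` is the batch's output
  // action. A crash between step 1 and step 2 replays the batch (at-least-
  // once), never skips it.
  def processBatch(batch: Map[TopicPartition, Seq[String]],
                   committed: Map[TopicPartition, Long],
                   write: Seq[String] => Unit): Map[TopicPartition, Long] = {
    // 1. Do the work first...
    batch.values.foreach(write)
    // 2. ...then advance the offsets.
    committed ++ batch.map { case (tp, msgs) =>
      tp -> (committed.getOrElse(tp, 0L) + msgs.size)
    }
  }
}
```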

I'd appreciate any ideas! Let me know if anything is unclear.

Erwan
