Hi Erwan,

You might consider InsightEdge: http://insightedge.io. It supports WAN
replication between data grids, which would save you from having to reinvent
the wheel. Additionally, RDDs can be shared between developers in the same DC.

Thanks,
Jason

> On Apr 18, 2016, at 11:18 AM, Erwan ALLAIN <eallain.po...@gmail.com> wrote:
> 
> Hello,
> 
> I'm currently designing a solution where two distinct Spark clusters (two 
> datacenters) share the same Kafka (rack-aware Kafka or manual broker 
> repartition). 
> The aims are:
> - surviving a DC crash, using Kafka resiliency and the consumer group 
> mechanism (or something else?)
> - keeping offsets consistent among replicas (vs. MirrorMaker, which does not 
> keep offsets)
> 
> I have several questions 
> 
> 1) Dynamic repartition (one or 2 DC)
> 
> I'm using KafkaDirectStream, which maps one Kafka partition to one Spark 
> partition. Is it possible to handle new or removed partitions? 
> In the compute method, it looks like we always use the currentOffset map to 
> query the next batch, so the number of partitions is always the same? Can we 
> request metadata at each batch?
> 
> 2) Multi DC Spark
> 
> Using the direct approach, a way to achieve this would be 
> - to "assign" (Kafka 0.9 term) all topics to the two Spark clusters
> - to have only one of them read each partition (check every x interval, with 
> a "lock" stored in Cassandra, for instance)
> 
> => not sure if it works, just an idea
> 
> Using a consumer group
> - commit offsets manually at the end of the batch
> 
> => Does Spark handle partition rebalancing?
> 
> I'd appreciate any ideas! Let me know if it's not clear.
> 
> Erwan
> 
> 
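
The "lock" idea in question 2 (only one cluster reads a given partition,
re-checked every x interval) can be sketched as a lease with an expiry time.
This is only a rough illustration, not something from Spark or Kafka: the
LeaseStore class, partitions_to_read, the "dc1"/"dc2" owner names and the
"events-N" partition names are all hypothetical, and the in-memory dict stands
in for a shared store. A real store would need an atomic compare-and-set for
the acquire step (in Cassandra, presumably a lightweight transaction).

```python
class LeaseStore:
    """In-memory stand-in for a shared lock table (hypothetical)."""

    def __init__(self):
        # partition name -> (owner, lease expiry time)
        self._leases = {}

    def try_acquire(self, partition, owner, ttl, now):
        """Take or renew the lease on a partition; return True on success."""
        holder = self._leases.get(partition)
        # Grant if the lease is free, has expired, or is already ours (renewal).
        if holder is None or holder[1] <= now or holder[0] == owner:
            self._leases[partition] = (owner, now + ttl)
            return True
        return False


def partitions_to_read(store, owner, partitions, ttl, now):
    """Partitions this cluster may read in the current interval."""
    return [p for p in partitions if store.try_acquire(p, owner, ttl, now)]


store = LeaseStore()
parts = ["events-0", "events-1"]

# dc1 checks first and takes both leases; dc2 gets nothing while they are valid.
dc1_first = partitions_to_read(store, "dc1", parts, ttl=30, now=0)
dc2_blocked = partitions_to_read(store, "dc2", parts, ttl=30, now=10)

# If dc1 stops renewing (e.g. the DC crashed), dc2 takes over after expiry.
dc2_takeover = partitions_to_read(store, "dc2", parts, ttl=30, now=31)
```

The lease duration bounds how long partitions go unread after a crash, and the
check interval has to be shorter than the lease so the live cluster can renew
in time. It does not by itself guarantee that a batch already in flight on the
old owner has finished, so offsets would still need to be stored carefully.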
