At that scale, it’s best not to do coordination at the application layer.
How much of your data is transactional in nature {all, some, none}? By which I mean ACID-compliant.

> On Apr 19, 2016, at 10:53 AM, Erwan ALLAIN <eallain.po...@gmail.com> wrote:
>
> Cody, you're right, that was an example. The target architecture would be 3 DCs :)
> Good point on ZK, I'll have to check that.
>
> About Spark, both instances will run at the same time, but on different
> topics. It would be quite useless to have 2 DCs working on the same set
> of data.
> I just want the healthy Spark instance, in case of a crash, to work on all
> topics (taking over the dead Spark instance's load).
>
> Does it seem an awkward design?
>
> On Tue, Apr 19, 2016 at 5:38 PM, Cody Koeninger <c...@koeninger.org> wrote:
> Maybe I'm missing something, but I don't see how you get a quorum in only 2
> datacenters (without the split-brain problem, etc.). I also don't know how
> well ZK will work cross-datacenter.
>
> As far as the Spark side of things goes, if it's idempotent, why not just
> run both instances all the time?
>
> On Tue, Apr 19, 2016 at 10:21 AM, Erwan ALLAIN <eallain.po...@gmail.com> wrote:
> I'm describing disaster recovery, but the same setup can be used to take one
> datacenter offline for an upgrade, for instance.
>
> From my point of view, when DC2 crashes:
>
> On the Kafka side:
> - the Kafka cluster will lose one or more brokers (partition leaders and replicas)
> - lost partition leaders will be re-elected in the remaining healthy DC
>
> => if the number of in-sync replicas is above the minimum threshold, Kafka
> should remain operational
>
> On the downstream datastore side (say Cassandra, for instance):
> - deploy across the 2 DCs in (QUORUM / QUORUM)
> - idempotent writes
>
> => it should be OK (depending on the replication factor)
>
> On the Spark side:
> - processing should be idempotent, which allows us to restart from the last
> committed offset
>
> I understand that starting up a post-crash job would work.
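The idempotence point above can be made concrete with a toy sketch (plain Python, not Spark or Cassandra; `upsert_batch` and the sample keys are invented for illustration): if writes are keyed, last-write-wins upserts, replaying a batch after a crash leaves the store in the same state as processing it exactly once, which is what makes restarting from the last committed offset safe.

```python
# Toy illustration (not Spark/Cassandra): idempotent keyed upserts mean
# that replaying a batch after a crash cannot corrupt the datastore.

def upsert_batch(store, batch):
    """Apply a batch of (key, value) records as last-write-wins upserts."""
    for key, value in batch:
        store[key] = value
    return store

batch = [("user:1", 10), ("user:2", 20)]

once = upsert_batch({}, batch)
# Simulate a crash after the batch was written but before the offset was
# committed: on restart, the same batch is fetched and applied again.
twice = upsert_batch(dict(once), batch)

assert once == twice  # replay is harmless, so at-least-once delivery is fine
```

This is why at-least-once delivery (restart from the last committed offset, possibly re-reading a few records) is sufficient here and no write-ahead log is needed on the Spark side.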
>
> The question is: how can we detect that DC2 has crashed, so that we can
> start a new job?
>
> Could dynamic topic partitions (at each KafkaRDD creation, for instance)
> plus topic subscription be the answer?
>
> I appreciate your effort.
>
> On Tue, Apr 19, 2016 at 4:25 PM, Jason Nerothin <jasonnerot...@gmail.com> wrote:
> Is the main concern uptime or disaster recovery?
>
>> On Apr 19, 2016, at 9:12 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> I think the bigger question is what happens to Kafka and your downstream
>> data store when DC2 crashes.
>>
>> From a Spark point of view, starting up a post-crash job in a new data
>> center isn't really different from starting up a post-crash job in the
>> original data center.
>>
>> On Tue, Apr 19, 2016 at 3:32 AM, Erwan ALLAIN <eallain.po...@gmail.com> wrote:
>> Thanks Jason and Cody. I'll try to explain the multi-DC case a bit better.
>>
>> As I mentioned before, I'm planning to use one Kafka cluster and 2 or more
>> distinct Spark clusters.
>>
>> Let's say we have the following DC configuration in the nominal case.
>> Kafka partitions are consumed uniformly by the 2 datacenters.
>>
>> DataCenter | Spark      | Kafka Consumer Group | Kafka partition (P1 to P4)
>> DC 1       | Master 1.1 |                      |
>>            | Worker 1.1 | my_group             | P1
>>            | Worker 1.2 | my_group             | P2
>> DC 2       | Master 2.1 |                      |
>>            | Worker 2.1 | my_group             | P3
>>            | Worker 2.2 | my_group             | P4
>>
>> In case of a DC crash, I would like the partitions to be rebalanced onto
>> the healthy DC, something as follows:
>>
>> DataCenter | Spark      | Kafka Consumer Group | Kafka partition (P1 to P4)
>> DC 1       | Master 1.1 |                      |
>>            | Worker 1.1 | my_group             | P1, P3
>>            | Worker 1.2 | my_group             | P2, P4
>> DC 2       | Master 2.1 |                      |
>>            | Worker 2.1 | my_group             | P3
>>            | Worker 2.2 | my_group             | P4
>>
>> I would like to know if this is possible:
>> - using a consumer group?
>> - using the direct approach? I prefer this one, as I don't want to activate the WAL.
>>
>> Hope the explanation is better!
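The failover described in the tables above can be sketched in plain Python (the worker names come from the tables; `assign` is a made-up stand-in for the assignment a real Kafka consumer group computes server-side during a rebalance): partitions are spread round-robin over whichever consumers are alive, so when DC 2's workers drop out of the group, P3 and P4 land on DC 1's workers.

```python
# Toy round-robin partition assignment, mimicking what a consumer-group
# rebalance produces when members join or leave the group.

def assign(partitions, consumers):
    """Spread partitions round-robin over the currently live consumers."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = ["P1", "P2", "P3", "P4"]

# Nominal case: workers in both DCs are members of my_group.
nominal = assign(partitions, ["worker1.1", "worker1.2", "worker2.1", "worker2.2"])
assert nominal == {"worker1.1": ["P1"], "worker1.2": ["P2"],
                   "worker2.1": ["P3"], "worker2.2": ["P4"]}

# DC 2 crashes: its workers leave the group, and P3/P4 are reassigned
# to the surviving DC 1 workers.
failover = assign(partitions, ["worker1.1", "worker1.2"])
assert failover == {"worker1.1": ["P1", "P3"], "worker1.2": ["P2", "P4"]}
```

Whether this reassignment happens automatically depends on the API: a consumer group that *subscribes* to topics gets rebalanced among its live members, whereas (as noted downthread) the direct stream of that era pinned the partition set at startup.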
>>
>> On Mon, Apr 18, 2016 at 6:50 PM, Cody Koeninger <c...@koeninger.org> wrote:
>> The current direct stream only handles exactly the partitions
>> specified at startup. You'd have to restart the job if you changed
>> partitions.
>>
>> https://issues.apache.org/jira/browse/SPARK-12177 has the ongoing work
>> towards using the Kafka 0.10 consumer, which would allow for dynamic
>> topic partitions.
>>
>> Regarding your multi-DC questions, I'm not really clear on what you're
>> saying.
>>
>> On Mon, Apr 18, 2016 at 11:18 AM, Erwan ALLAIN <eallain.po...@gmail.com> wrote:
>> > Hello,
>> >
>> > I'm currently designing a solution where 2 distinct Spark clusters (2
>> > datacenters) share the same Kafka cluster (Kafka rack-aware or manual
>> > broker repartition).
>> > The aims are:
>> > - surviving a DC crash: using Kafka resiliency and the consumer group
>> > mechanism (or something else?)
>> > - keeping offsets consistent among replicas (vs. MirrorMaker, which does
>> > not preserve offsets)
>> >
>> > I have several questions.
>> >
>> > 1) Dynamic repartition (one or 2 DCs)
>> >
>> > I'm using KafkaDirectStream, which maps one Kafka partition to one Spark
>> > partition. Is it possible to handle new or removed partitions?
>> > In the compute method, it looks like we always use the currentOffsets
>> > map to query the next batch, and therefore it's always the same number
>> > of partitions? Can we request metadata at each batch?
>> >
>> > 2) Multi-DC Spark
>> >
>> > Using the direct approach, one way to achieve this would be:
>> > - "assign" (Kafka 0.9 term) all topics to both Spark clusters
>> > - only one reads a given partition (check every x interval, with a
>> > "lock" stored in Cassandra, for instance)
>> >
>> > => not sure if it works, just an idea
>> >
>> > Using a consumer group:
>> > - commit offsets manually at the end of the batch
>> >
>> > => Does Spark handle partition rebalancing?
>> >
>> > I'd appreciate any ideas! Let me know if it's not clear.
>> >
>> > Erwan
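The "lock stored in Cassandra" idea from the original message can be sketched as a lease with a TTL. The sketch below is an in-memory Python stand-in; the `Lease` class and its methods are invented for illustration, and in Cassandra this might be implemented with lightweight transactions and a TTL on the lock row, though the thread doesn't specify a mechanism. The standby cluster periodically tries to acquire the lease and only succeeds once the active cluster stops renewing it.

```python
import time

class Lease:
    """Toy single-holder lease with a TTL, standing in for a lock row in a
    shared datastore: whichever cluster holds the lease consumes the topics."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, who, now=None):
        now = time.monotonic() if now is None else now
        # Acquisition succeeds if the lease is free, has expired, or is
        # being renewed by its current holder.
        if self.holder in (None, who) or now >= self.expires_at:
            self.holder = who
            self.expires_at = now + self.ttl
            return True
        return False

lease = Lease(ttl=10.0)
assert lease.try_acquire("DC2", now=0.0)      # DC2 becomes the active cluster
assert not lease.try_acquire("DC1", now=5.0)  # DC1 stays passive while DC2 renews
# DC2 crashes and stops renewing; once the TTL lapses, DC1 takes over.
assert lease.try_acquire("DC1", now=11.0)
```

The TTL doubles as the failure detector asked about upthread ("how can we detect when DC2 crashes?"): no explicit crash notification is needed, because a crashed cluster simply stops renewing and the lease expires on its own.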