Maybe I'm missing something, but I don't see how you get a quorum in only 2 datacenters (without split-brain problems, etc.). I also don't know how well ZK will work cross-datacenter.
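As a rough sketch of the quorum arithmetic behind that point (assuming, purely for illustration, a ZooKeeper ensemble split evenly across the two DCs; the node counts are made up):

// Illustrative only: majority quorum over an ensemble split across 2 DCs.
// ZooKeeper needs a strict majority of the full ensemble to stay writable,
// so an even split across 2 DCs cannot survive the loss of either DC.
object TwoDcQuorum {
  def hasQuorum(aliveNodes: Int, ensembleSize: Int): Boolean =
    aliveNodes > ensembleSize / 2

  def main(args: Array[String]): Unit = {
    val nodesPerDc = 2                   // assumed: 2 ZK nodes in each DC
    val ensembleSize = 2 * nodesPerDc    // 4 nodes total
    // One DC is lost, or the inter-DC link is cut:
    println(hasQuorum(nodesPerDc, ensembleSize)) // false -> no leader election
  }
}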
As far as the Spark side of things goes, if it's idempotent, why not just run both instances all the time?

On Tue, Apr 19, 2016 at 10:21 AM, Erwan ALLAIN <eallain.po...@gmail.com> wrote:

> I'm describing disaster recovery, but it can also be used to take one
> datacenter offline for an upgrade, for instance.
>
> From my point of view, when DC2 crashes:
>
> *On the Kafka side:*
> - the Kafka cluster will lose one or more brokers (partition leaders and
> replicas)
> - lost partition leaders will be re-elected in the remaining healthy DC
>
> => if the number of in-sync replicas is above the minimum threshold,
> Kafka should be operational
>
> *On the downstream datastore side (say Cassandra, for instance):*
> - deployed across the 2 DCs with (QUORUM / QUORUM)
> - idempotent writes
>
> => it should be OK (depends on the replication factor)
>
> *On the Spark side:*
> - processing should be idempotent, which will allow us to restart from
> the last committed offset
>
> I understand that starting up a post-crash job would work.
>
> The question is: how can we detect that DC2 has crashed in order to start
> a new job?
>
> Dynamic topic partitions (at each KafkaRDD creation, for instance) + topic
> subscription may be the answer?
>
> I appreciate your effort.
>
> On Tue, Apr 19, 2016 at 4:25 PM, Jason Nerothin <jasonnerot...@gmail.com>
> wrote:
>
>> Is the main concern uptime or disaster recovery?
>>
>> On Apr 19, 2016, at 9:12 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> I think the bigger question is what happens to Kafka and your downstream
>> data store when DC2 crashes.
>>
>> From a Spark point of view, starting up a post-crash job in a new data
>> center isn't really different from starting up a post-crash job in the
>> original data center.
>>
>> On Tue, Apr 19, 2016 at 3:32 AM, Erwan ALLAIN <eallain.po...@gmail.com>
>> wrote:
>>
>>> Thanks Jason and Cody. I'll try to explain the multi-DC case a bit
>>> better.
>>>
>>> As I mentioned before, I'm planning to use one Kafka cluster and 2 or
>>> more distinct Spark clusters.
>>>
>>> Let's say we have the following DC configuration in the nominal case.
>>> Kafka partitions are consumed uniformly by the 2 datacenters.
>>>
>>> DataCenter  Spark       Kafka Consumer Group  Kafka partition (P1 to P4)
>>> DC 1        Master 1.1
>>>             Worker 1.1  my_group              P1
>>>             Worker 1.2  my_group              P2
>>> DC 2        Master 2.1
>>>             Worker 2.1  my_group              P3
>>>             Worker 2.2  my_group              P4
>>>
>>> In case of a DC crash, I would like a rebalancing of partitions onto
>>> the healthy DC, something as follows:
>>>
>>> DataCenter  Spark       Kafka Consumer Group  Kafka partition (P1 to P4)
>>> DC 1        Master 1.1
>>>             Worker 1.1  my_group              P1*, P3*
>>>             Worker 1.2  my_group              P2*, P4*
>>> DC 2        Master 2.1
>>>             Worker 2.1  my_group              P3
>>>             Worker 2.2  my_group              P4
>>>
>>> I would like to know if it's possible:
>>> - using a consumer group?
>>> - using the direct approach? I prefer this one as I don't want to
>>> activate the WAL.
>>>
>>> Hope the explanation is better!
>>>
>>>
>>> On Mon, Apr 18, 2016 at 6:50 PM, Cody Koeninger <c...@koeninger.org>
>>> wrote:
>>>
>>>> The current direct stream only handles exactly the partitions
>>>> specified at startup. You'd have to restart the job if you changed
>>>> partitions.
>>>>
>>>> https://issues.apache.org/jira/browse/SPARK-12177 has the ongoing work
>>>> towards using the kafka 0.10 consumer, which would allow for dynamic
>>>> topic partitions.
>>>>
>>>> Regarding your multi-DC questions, I'm not really clear on what you're
>>>> saying.
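For concreteness, a minimal sketch of what "partitions specified at startup" means with the Kafka 0.8-based direct stream: the TopicAndPartition keys in fromOffsets fix the partition set for the life of the job. The broker list, topic name and starting offsets below are illustrative assumptions, not from the thread.

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object FixedPartitionDirectStream {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("fixed-partition-direct-stream")
        .setMaster("local[2]"),
      Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

    // The partition set (P1..P4 here) is fixed at stream creation time;
    // a partition added to the topic later requires restarting the job.
    val fromOffsets: Map[TopicAndPartition, Long] =
      (0 to 3).map(p => TopicAndPartition("my_topic", p) -> 0L).toMap

    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

    stream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}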
>>>>
>>>> On Mon, Apr 18, 2016 at 11:18 AM, Erwan ALLAIN <eallain.po...@gmail.com>
>>>> wrote:
>>>> > Hello,
>>>> >
>>>> > I'm currently designing a solution where 2 distinct Spark clusters
>>>> > (2 datacenters) share the same Kafka cluster (rack-aware Kafka or
>>>> > manual broker placement).
>>>> > The aims are:
>>>> > - surviving a DC crash: using Kafka resiliency and the consumer group
>>>> > mechanism (or something else?)
>>>> > - keeping offsets consistent among replicas (vs. MirrorMaker, which
>>>> > does not preserve offsets)
>>>> >
>>>> > I have several questions.
>>>> >
>>>> > 1) Dynamic repartitioning (one or 2 DCs)
>>>> >
>>>> > I'm using the Kafka direct stream, which maps one Kafka partition to
>>>> > one Spark partition. Is it possible to handle new or removed
>>>> > partitions?
>>>> > In the compute method, it looks like we always use the currentOffsets
>>>> > map to query the next batch, and therefore it's always the same
>>>> > number of partitions? Can we request metadata at each batch?
>>>> >
>>>> > 2) Multi-DC Spark
>>>> >
>>>> > Using the direct approach, a way to achieve this would be:
>>>> > - "assign" (Kafka 0.9 term) all topics to the 2 Spark clusters
>>>> > - only one reads the partitions (check every x interval, "lock"
>>>> > stored in Cassandra for instance)
>>>> >
>>>> > => not sure if it works, just an idea
>>>> >
>>>> > Using a consumer group:
>>>> > - commit offsets manually at the end of the batch
>>>> >
>>>> > => Does Spark handle partition rebalancing?
>>>> >
>>>> > I'd appreciate any ideas! Let me know if it's not clear.
>>>> >
>>>> > Erwan
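On the "commit offsets manually at the end of the batch" idea with the direct approach, a sketch of what that can look like: the batch's offset ranges are read via HasOffsetRanges and persisted only after the idempotent writes succeed. The broker list and topic are illustrative assumptions, and a real job would store the offsets in Cassandra/ZK rather than print them.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

object ManualOffsetCommitSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("manual-offset-commit").setMaster("local[2]"),
      Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my_topic"))

    stream.foreachRDD { rdd =>
      // Offset ranges covered by this batch, taken on the original KafkaRDD.
      val offsetRanges: Array[OffsetRange] =
        rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      rdd.foreachPartition { records =>
        records.foreach { case (key, value) =>
          // Placeholder for the idempotent write to the downstream store
          // (e.g. a Cassandra upsert keyed so that a replayed batch overwrites).
          println(s"write $key -> $value")
        }
      }

      // "Commit" only after the writes succeed, so a post-crash job started
      // in either DC can resume from the last stored offsets.
      offsetRanges.foreach { r =>
        println(s"${r.topic}-${r.partition}: ${r.fromOffset} -> ${r.untilOffset}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Because the offsets live outside Kafka's consumer-group machinery, either DC can start a replacement job against the same offset store; idempotent writes make the replay between the last stored offset and the crash harmless.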