I think the bigger question is what happens to Kafka and your downstream data store when DC2 crashes.
From a Spark point of view, starting up a post-crash job in a new data
center isn't really different from starting up a post-crash job in the
original data center.

On Tue, Apr 19, 2016 at 3:32 AM, Erwan ALLAIN <eallain.po...@gmail.com> wrote:
> Thanks Jason and Cody. I'll try to explain the multi-DC case a bit better.
>
> As I mentioned before, I'm planning to use one Kafka cluster and two or
> more distinct Spark clusters.
>
> Let's say we have the following DC configuration in the nominal case.
> Kafka partitions are consumed uniformly by the 2 datacenters.
>
> DataCenter   Spark        Kafka Consumer Group   Kafka partition (P1 to P4)
> DC 1         Master 1.1
>              Worker 1.1   my_group               P1
>              Worker 1.2   my_group               P2
> DC 2         Master 2.1
>              Worker 2.1   my_group               P3
>              Worker 2.2   my_group               P4
>
> In case of a DC crash, I would like a rebalancing of the partitions onto
> the healthy DC, something as follows:
>
> DataCenter   Spark        Kafka Consumer Group   Kafka partition (P1 to P4)
> DC 1         Master 1.1
>              Worker 1.1   my_group               P1*, P3*
>              Worker 1.2   my_group               P2*, P4*
> DC 2         Master 2.1
>              Worker 2.1   my_group               P3
>              Worker 2.2   my_group               P4
>
> I would like to know if it's possible:
> - using a consumer group?
> - using the direct approach? I prefer this one as I don't want to
>   activate the WAL.
>
> Hope the explanation is better!
>
>
> On Mon, Apr 18, 2016 at 6:50 PM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> The current direct stream only handles exactly the partitions
>> specified at startup. You'd have to restart the job if you changed
>> partitions.
>>
>> https://issues.apache.org/jira/browse/SPARK-12177 has the ongoing work
>> towards using the Kafka 0.10 consumer, which would allow for dynamic
>> topic partitions.
>>
>> Regarding your multi-DC questions, I'm not really clear on what you're
>> saying.
>>
>> On Mon, Apr 18, 2016 at 11:18 AM, Erwan ALLAIN <eallain.po...@gmail.com>
>> wrote:
>> > Hello,
>> >
>> > I'm currently designing a solution where two distinct Spark clusters
>> > (2 datacenters) share the same Kafka cluster (rack-aware Kafka or
>> > manual broker assignment).
>> > The aims are:
>> > - surviving a DC crash: using Kafka resiliency and the consumer group
>> >   mechanism (or something else?)
>> > - keeping offsets consistent among replicas (unlike MirrorMaker, which
>> >   does not preserve offsets)
>> >
>> > I have several questions.
>> >
>> > 1) Dynamic repartitioning (one or 2 DCs)
>> >
>> > I'm using the Kafka direct stream, which maps one Kafka partition to
>> > one Spark partition. Is it possible to handle new or removed
>> > partitions?
>> > In the compute method, it looks like we always use the currentOffsets
>> > map to query the next batch, so it's always the same number of
>> > partitions? Can we request metadata at each batch?
>> >
>> > 2) Multi-DC Spark
>> >
>> > Using the direct approach, a way to achieve this would be:
>> > - "assign" (Kafka 0.9 term) all topics to both Spark clusters
>> > - only one reads a given partition (check every X interval, "lock"
>> >   stored in Cassandra for instance)
>> >
>> > => not sure if it works, just an idea
>> >
>> > Using a consumer group:
>> > - commit offsets manually at the end of the batch
>> >
>> > => Does Spark handle partition rebalancing?
>> >
>> > I'd appreciate any ideas! Let me know if it's not clear.
>> >
>> > Erwan
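
For reference, here is a minimal sketch (against the spark-streaming-kafka
0.8 direct API this thread is about) of starting a direct stream from
offsets kept in an external store and saving the offset ranges at the end
of each batch. The broker addresses, batch interval and the
loadOffsets/saveOffsets helpers are placeholders, not a real API; the point
is only that a job restarted in either DC from the same stored offsets
behaves the same way, and that the partition set is fixed when the stream
is created.

import kafka.serializer.StringDecoder
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

object DirectStreamRestart {

  // Hypothetical helpers: persist offsets in whatever store both DCs can
  // reach (Cassandra, ZooKeeper, a DB, ...). Not a real library API.
  def loadOffsets(): Map[TopicAndPartition, Long] = ???
  def saveOffsets(ranges: Array[OffsetRange]): Unit = ???

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("restart-from-stored-offsets")
    val ssc = new StreamingContext(conf, Seconds(30))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

    // The partitions (and starting offsets) are fixed here, at startup --
    // this is the "exactly the partitions specified at startup" behaviour.
    val fromOffsets: Map[TopicAndPartition, Long] = loadOffsets()

    val messageHandler = (mmd: MessageAndMetadata[String, String]) =>
      (mmd.key, mmd.message)

    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets, messageHandler)

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... write results to the downstream store ...
      saveOffsets(ranges) // commit "manually at the end of the batch"
    }

    ssc.start()
    ssc.awaitTermination()
  }
}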
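
And one possible reading of the "lock stored in Cassandra" idea quoted
above, sketched as a per-partition lease taken with a lightweight
transaction. The keyspace, table and TTL are made up for illustration; it
is not a complete solution (actually extending a lease you already hold
would need a conditional UPDATE, and a lease expiring mid-batch needs
care), and since the direct stream's partitions are fixed at startup, a DC
taking over partitions would still have to restart its job with the new
assignment.

import com.datastax.driver.core.Cluster

object PartitionLease {

  // Contact point and keyspace are placeholders.
  val cluster = Cluster.builder().addContactPoint("cassandra-host").build()
  val session = cluster.connect("streaming")

  // Assumed schema:
  // CREATE TABLE streaming.partition_lease (
  //   topic text, partition int, owner text,
  //   PRIMARY KEY ((topic, partition)));

  /** Try to acquire a lease on a topic-partition for this DC. */
  def tryAcquire(topic: String, partition: Int, dc: String, ttlSeconds: Int): Boolean = {
    val rs = session.execute(
      s"INSERT INTO partition_lease (topic, partition, owner) " +
        s"VALUES ('$topic', $partition, '$dc') IF NOT EXISTS USING TTL $ttlSeconds")
    val row = rs.one()
    // [applied] = true  -> we now own the lease until the TTL expires
    // [applied] = false -> the row already exists; it may already be ours
    row.getBool("[applied]") || row.getString("owner") == dc
  }
}

Each DC would call tryAcquire for the partitions it wants before building
its direct stream, and a watchdog could re-check ownership every X interval
as suggested above.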