Hi Andrew,

While I'm definitely no expert on this topic, my first thought was whether
this could be addressed by the approach proposed in FLIP-246:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-246%3A+Multi+Cluster+Kafka+Source
I'm also looping in Mason Chen, who was the initiator of that FLIP :)

Best regards,

Martijn

On Wed, Oct 5, 2022 at 10:00 AM Andrew Otto <o...@wikimedia.org> wrote:

> (Ah, note that I am considering very simple streaming apps here, e.g.
> event enrichment apps. No windowing or complex state. The only state is
> the Kafka offsets, which I suppose would also have to be managed from
> Kafka, not from Flink state.)
>
> On Wed, Oct 5, 2022 at 9:54 AM Andrew Otto <o...@wikimedia.org> wrote:
>
>> Hi all,
>>
>> *tl;dr: Would it be possible to make a KafkaSource that uses Kafka's
>> built-in consumer assignment for Flink tasks?*
>>
>> At the Wikimedia Foundation we are evaluating
>> <https://phabricator.wikimedia.org/T307944> whether we can use a Kafka
>> 'stretch' cluster to simplify the multi-datacenter deployment
>> architecture of streaming applications.
>>
>> A Kafka stretch cluster is one in which the brokers span multiple
>> datacenters, relying on the usual Kafka broker replication for multi-DC
>> replication (rather than something like Kafka MirrorMaker). This is
>> feasible with Kafka today mostly because of follower fetching
>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-392%3A+Allow+consumers+to+fetch+from+closest+replica>
>> support, allowing consumers to be assigned to consume from partitions
>> that are 'closest' to them, e.g. in the same 'rack' (or DC :) ).
>>
>> Having a single Kafka cluster makes datacenter failover for streaming
>> applications a little bit simpler, as there is only one set of offsets
>> to use when saving state. We can run a streaming app in active/passive
>> mode. This would allow us to stop the app in one datacenter, and then
>> start it up in another, using the same state snapshot and the same
>> Kafka cluster.
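[For readers following along: the follower fetching (KIP-392) setup Andrew
describes is configured with broker- and consumer-side properties roughly
like the following. The `dc-a` rack value is a made-up example; the property
names themselves are the real KIP-392 ones.]

```properties
# Broker side: tag each broker with its datacenter and enable the
# rack-aware replica selector so consumers may fetch from followers.
broker.rack=dc-a
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector

# Consumer side: declare which "rack" (here: datacenter) this client is in,
# so the broker can point it at the closest in-sync replica.
client.rack=dc-a
```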
>> But, this got me wondering: would it be possible to run a streaming
>> app in an active/active mode, where in normal operation half of the
>> work is done in each DC, and in failover all of the work would
>> automatically fail over to the online DC?
>>
>> I don't think running a single Flink application cross-DC would work
>> well; there's too much inter-node traffic happening, and the Flink
>> tasks don't have any DC awareness.
>>
>> But would it be possible to run two separate streaming applications,
>> one in each DC, but in the *same Kafka consumer group*? I believe
>> that, if the streaming app was using Kafka's usual consumer assignment
>> and rebalancing protocol, it would. Kafka would just see clients
>> connecting from either DC in the same consumer group, and assign each
>> consumer an equal number of partitions to consume, resulting in equal
>> partition balancing across DCs. If we shut down one of the streaming
>> apps, Kafka would automatically rebalance the consumers in the
>> consumer group, assigning all of the work to the remaining streaming
>> app in the other DC.
>>
>> I got excited about this possibility, only to learn that Flink's
>> KafkaSource does not use Kafka for consumer assignment. I think I
>> understand why it does this: the Source API can do a lot more than
>> Kafka, so having some kind of state management (offsets) and task
>> assignment (Kafka consumer rebalance protocol) outside of the usual
>> Flink Source would be pretty weird. Implementing offset tracking and
>> task assignment inside of the KafkaSource allows it to work like any
>> other Source implementation.
>>
>> However, this active/active multi-DC streaming app idea seems pretty
>> compelling, as it would greatly reduce operator/SRE overhead. It seems
>> to me that any Kafka streaming app that did use Kafka's built-in
>> consumer assignment protocol (like Kafka Streams) would be deployable
>> in this way. But in Flink this is not possible because of the way it
>> assigns tasks.
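[The active/active behavior Andrew describes can be illustrated with a toy
model of a consumer-group rebalance. This is not Kafka code — the real
assignment is done by the group coordinator and the configured partition
assignor — just a stdlib sketch of the even-split-then-failover behavior;
the consumer names are invented.]

```python
def assign(partitions, consumers):
    """Spread partitions as evenly as possible over the sorted consumer
    ids, similar in spirit to Kafka's round-robin assignor."""
    consumers = sorted(consumers)
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = list(range(12))

# Normal operation: two consumers per DC, all in one consumer group.
# Each of the four consumers ends up with 3 of the 12 partitions.
both_dcs = assign(partitions, ["dc-a-1", "dc-a-2", "dc-b-1", "dc-b-2"])
print({c: len(ps) for c, ps in both_dcs.items()})

# DC B goes down: its consumers leave the group, Kafka rebalances, and
# the two surviving DC A consumers pick up 6 partitions each.
only_a = assign(partitions, ["dc-a-1", "dc-a-2"])
print({c: len(ps) for c, ps in only_a.items()})
```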
>> I'm writing this email to see what others think about this, and to
>> ask whether it might be possible to implement a KafkaSource that
>> assigns tasks using Kafka's usual consumer assignment protocol.
>> Hopefully someone more knowledgeable about Sources and TaskSplits,
>> etc. could advise here.
>>
>> Thank you!
>>
>> - Andrew Otto
>>   Wikimedia Foundation
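[To make the obstacle concrete: Flink's KafkaSource discovers partitions
itself and has its split enumerator hand them to parallel reader subtasks,
rather than joining a Kafka consumer group. A deliberately simplified toy
of such a static scheme (not Flink's actual algorithm) shows why two
independent jobs sharing a `group.id` would not split the work:]

```python
def flink_style_assign(partitions, parallelism):
    """Hypothetical simplified enumerator: owner = partition id modulo
    parallelism. Assignment is computed inside the job, so it only
    changes when the job itself is rescaled or restarted."""
    return {p: p % parallelism for p in partitions}

partitions = list(range(12))
job_1 = flink_style_assign(partitions, parallelism=4)
job_2 = flink_style_assign(partitions, parallelism=4)

# Two independent jobs each claim *all* 12 partitions for themselves —
# nothing brokers the split between them, so every record is read twice.
print(job_1 == job_2)  # True
```

A Kafka-group-based KafkaSource would instead have to let the group
coordinator decide which subtask owns which partition, which is exactly the
mismatch with Flink's split/checkpoint model that Andrew points out above.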