[ https://issues.apache.org/jira/browse/FLINK-15670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053232#comment-17053232 ]
Yuan Mei edited comment on FLINK-15670 at 3/6/20, 10:19 AM:
------------------------------------------------------------

*Meeting Outline*

Sketching the meeting agenda here to make the meeting more efficient.

*Scope/Purpose of the Problem*

# +Data reuse+: It is not necessary to store intermediate results if they are not targeted for reuse.
** Internal reuse, e.g., for failure recovery.
** External reuse, i.e., providing a read API.
** It is important to differentiate these two, since they may lead to different design choices.
# +Avoiding shuffles+: Use a custom partitioner in the Kafka producer to avoid an extra shuffle step (a partitioner sketch follows the outline).
# +Failure recovery+: Use the stored intermediate results for recovery.
** Intermediate results need to maintain watermark information.
** The restored state needs to maintain watermark information (strictly speaking, a watermark should be stored together with state even in global snapshotting and recovery).

*Watermark*

+Problem+: If multiple sink subtasks write to the same partition, how do we decide the watermark?

+Why it is a problem+: Subtasks run independently and progress at different speeds.

+How the current model works+: Watermarks are kept in each individual channel, then aligned and decided at the downstream nodes.

+Possible solutions+:
# Push the watermark alignment logic up and use a coordinator to align watermarks amongst subtasks. The coordinator is responsible for maintaining the minimal watermark amongst all subtasks (a coordinator sketch follows the outline).
** Pros: Can support both internal and external reuse.
** Cons:
### Slightly delayed watermarks (this may marginally affect how much data an operator holds, but it is not a big deal).
### Extra cost to send watermarks to the coordinator and to broadcast them back. Depending on the parallelism and on how watermarks are generated, the coordinator may significantly affect performance (though probably only in extreme cases).
### Complicates the runtime logic, as you can imagine.
# Simulate how data partitions are handled today; in other words, store each channel's partition in a separate Kafka partition (a partition-layout sketch follows the outline).
** Pros:
### Keeps the watermark logic strictly consistent with the current model.
### Leaves the runtime logic untouched.
** Cons:
### External reuse is difficult to reason about (though not completely impossible).
### The Kafka partition count is difficult for a user to reason about (we can make this simple, though).
### Partition assignment and the Kafka consumer logic become complicated (but we can wrap this for the user).
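A minimal producer-side sketch of the "avoid shuffle" idea, assuming the topic's partition count equals the downstream operator's parallelism and that producer and consumer share the Flink job's maxParallelism. {{KeyGroupAlignedPartitioner}} and the {{flink.max.parallelism}} config key are hypothetical; the key-group math reuses Flink's real {{KeyGroupRangeAssignment}}:

{code:java}
import java.util.Map;

import org.apache.flink.runtime.state.KeyGroupRangeAssignment;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

/**
 * Hypothetical sketch: route each record to the Kafka partition whose index
 * equals the Flink subtask that would own the record's key group, so a
 * downstream keyBy() on the same key needs no network shuffle.
 */
public class KeyGroupAlignedPartitioner implements Partitioner {

    private int maxParallelism;

    @Override
    public void configure(Map<String, ?> configs) {
        // Assumed config key; must match the Flink job's maxParallelism,
        // otherwise key groups and partitions will not line up.
        maxParallelism = Integer.parseInt(String.valueOf(configs.get("flink.max.parallelism")));
    }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionCountForTopic(topic);
        // Same hashing Flink applies for keyBy(); the producer record key must
        // be the same type/value as the Flink key so hashCode() agrees.
        int keyGroup = KeyGroupRangeAssignment.assignToKeyGroup(key, maxParallelism);
        // Map key group -> operator index exactly the way Flink does,
        // treating the partition count as the downstream parallelism.
        return KeyGroupRangeAssignment.computeOperatorIndexForKeyGroup(
                maxParallelism, numPartitions, keyGroup);
    }

    @Override
    public void close() {}
}
{code}

The same two {{KeyGroupRangeAssignment}} calls could drive the source side as well, letting each consumer subtask subscribe exactly to the partitions whose key groups it owns.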
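For option 1, a minimal sketch of the coordinator's alignment rule (all names are hypothetical; how reports and broadcasts travel, e.g. over RPC, is left out). It applies the same min-over-inputs rule that downstream operators use across channels today; the cons listed above come from the report/broadcast round trip this adds:

{code:java}
import java.util.Arrays;
import java.util.OptionalLong;

/**
 * Hypothetical sketch: every sink subtask periodically reports its local
 * watermark; the coordinator tracks the minimum across all subtasks and
 * emits a new global watermark only when that minimum advances.
 */
public class WatermarkCoordinator {

    private final long[] subtaskWatermarks; // latest report per subtask
    private long globalWatermark = Long.MIN_VALUE;

    public WatermarkCoordinator(int parallelism) {
        subtaskWatermarks = new long[parallelism];
        // Until every subtask has reported, the minimum stays at
        // Long.MIN_VALUE and no global watermark is emitted.
        Arrays.fill(subtaskWatermarks, Long.MIN_VALUE);
    }

    /** Returns the advanced global watermark to broadcast, if any. */
    public synchronized OptionalLong onWatermarkReport(int subtaskIndex, long watermark) {
        subtaskWatermarks[subtaskIndex] =
                Math.max(subtaskWatermarks[subtaskIndex], watermark);
        long min = Arrays.stream(subtaskWatermarks).min().getAsLong();
        if (min > globalWatermark) {
            globalWatermark = min;
            return OptionalLong.of(globalWatermark); // broadcast to all subtasks
        }
        return OptionalLong.empty(); // minimum did not move
    }
}
{code}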
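For option 2, a minimal sketch of the partition layout, assuming the topic is created with M * N partitions for M producer subtasks and N consumer channels ({{ChannelPartitionLayout}} is a hypothetical name). Each consumer subtask then reads its own M partitions and aligns watermarks across them with the same min rule the network stack applies per channel today:

{code:java}
/**
 * Hypothetical sketch: the pair (producerSubtask, consumerChannel) maps to a
 * dedicated Kafka partition, so each "channel" of the current model is
 * preserved one-to-one in the topic.
 */
public final class ChannelPartitionLayout {

    private ChannelPartitionLayout() {}

    /** Partition that producer subtask p writes for consumer channel c. */
    public static int partitionFor(int producerSubtask, int consumerChannel,
                                   int consumerParallelism) {
        return producerSubtask * consumerParallelism + consumerChannel;
    }

    /** The partitions consumer channel c must subscribe to (one per producer). */
    public static int[] partitionsFor(int consumerChannel, int producerParallelism,
                                      int consumerParallelism) {
        int[] partitions = new int[producerParallelism];
        for (int p = 0; p < producerParallelism; p++) {
            partitions[p] = partitionFor(p, consumerChannel, consumerParallelism);
        }
        return partitions;
    }

    /** Watermark for a consumer channel: the minimum across its partitions. */
    public static long alignedWatermark(long[] perPartitionWatermarks) {
        long min = Long.MAX_VALUE;
        for (long w : perPartitionWatermarks) {
            min = Math.min(min, w);
        }
        return min;
    }
}
{code}

This keeps the watermark semantics identical to the current model, but it also exposes the M * N partition count and the subscription logic to external readers, which is exactly the con listed above.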
> Provide a Kafka Source/Sink pair that aligns Kafka's Partitions and Flink's KeyGroups
> --------------------------------------------------------------------------------------
>
>                 Key: FLINK-15670
>                 URL: https://issues.apache.org/jira/browse/FLINK-15670
>             Project: Flink
>          Issue Type: New Feature
>          Components: API / DataStream, Connectors / Kafka
>            Reporter: Stephan Ewen
>            Priority: Major
>              Labels: usability
>             Fix For: 1.11.0
>
> This Source/Sink pair would serve two purposes:
> 1. You can read topics that are already partitioned by key and process them without partitioning them again (avoid shuffles).
> 2. You can use this to shuffle through Kafka, thereby decomposing the job into smaller jobs and independent pipelined regions that fail over independently.