[ https://issues.apache.org/jira/browse/FLINK-15670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053232#comment-17053232 ]
Yuan Mei edited comment on FLINK-15670 at 3/6/20, 10:19 AM:
------------------------------------------------------------

*Meeting Outline*

Sketching the meeting agenda here to make the meeting more efficient.

*Scope/Purpose of the Problem*

# +Data reuse+: It is not necessary to store intermediate results if they are not targeted for reuse.
** Internal reuse, e.g., for failure recovery.
** External reuse, i.e., providing a read API.
** It is important to differentiate these two, since they may lead to different design choices.
# +Avoiding shuffles+: Use a custom partitioner in the Kafka producer to avoid an extra shuffle step (a partitioner sketch follows the outline).
# +Failure recovery+: Use the stored intermediate results for recovery.
** Intermediate results need to maintain watermark information.
** The restored state needs to maintain watermark information (strictly speaking, a watermark should be stored together with state even in global snapshotting and recovery).

*Watermark*

+Problem+: If multiple sink subtasks write to the same partition, how do we decide the watermark?

+Why it is a problem+: Subtasks run independently and progress at different speeds.

+How the current model works+: Watermarks are kept in each individual channel, then aligned and decided at the downstream nodes.

+Possible solutions+:
# Push the watermark alignment logic up and use a coordinator to align watermarks amongst subtasks. The coordinator is responsible for maintaining the minimal watermark amongst all subtasks (a coordinator sketch follows the outline).
** Pros: Can support both internal and external reuse.
** Cons:
### Slightly delayed watermarks (this may marginally affect how much data an operator holds, but it is not a big deal).
### Extra cost to send watermarks to the coordinator and to broadcast them back. Depending on the parallelism and on how watermarks are generated, the coordinator may significantly affect performance (though probably only in extreme cases).
### Complicates the runtime logic, as you can imagine.
# Simulate how data partitions are handled today; in other words, store each channel's partition in a separate Kafka partition (a partition-layout sketch follows the outline).
** Pros:
### Keeps the watermark logic strictly consistent with the current model.
### Leaves the runtime logic untouched.
** Cons:
### External reuse is difficult to reason about (though not completely impossible).
### The Kafka partition count is difficult for a user to reason about (we can make this simple, though).
### Partition assignment and the Kafka consumer logic become complicated (but we can wrap this for the user).
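A minimal producer-side sketch of the "avoid shuffle" idea, assuming the topic's partition count equals the downstream operator's parallelism and that producer and consumer share the Flink job's maxParallelism. {{KeyGroupAlignedPartitioner}} and the {{flink.max.parallelism}} config key are hypothetical; the key-group math reuses Flink's real {{KeyGroupRangeAssignment}}:

{code:java}
import java.util.Map;

import org.apache.flink.runtime.state.KeyGroupRangeAssignment;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

/**
 * Hypothetical sketch: route each record to the Kafka partition whose index
 * equals the Flink subtask that would own the record's key group, so a
 * downstream keyBy() on the same key needs no network shuffle.
 */
public class KeyGroupAlignedPartitioner implements Partitioner {

    private int maxParallelism;

    @Override
    public void configure(Map<String, ?> configs) {
        // Assumed config key; must match the Flink job's maxParallelism,
        // otherwise key groups and partitions will not line up.
        maxParallelism = Integer.parseInt(String.valueOf(configs.get("flink.max.parallelism")));
    }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionCountForTopic(topic);
        // Same hashing Flink applies for keyBy(); the producer record key must
        // be the same type/value as the Flink key so hashCode() agrees.
        int keyGroup = KeyGroupRangeAssignment.assignToKeyGroup(key, maxParallelism);
        // Map key group -> operator index exactly the way Flink does,
        // treating the partition count as the downstream parallelism.
        return KeyGroupRangeAssignment.computeOperatorIndexForKeyGroup(
                maxParallelism, numPartitions, keyGroup);
    }

    @Override
    public void close() {}
}
{code}

The same two {{KeyGroupRangeAssignment}} calls could drive the source side as well, letting each consumer subtask subscribe exactly to the partitions whose key groups it owns.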
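For option 1, a minimal sketch of the coordinator's alignment rule (all names are hypothetical; how reports and broadcasts travel, e.g. over RPC, is left out). It applies the same min-over-inputs rule that downstream operators use across channels today; the cons listed above come from the report/broadcast round trip this adds:

{code:java}
import java.util.Arrays;
import java.util.OptionalLong;

/**
 * Hypothetical sketch: every sink subtask periodically reports its local
 * watermark; the coordinator tracks the minimum across all subtasks and
 * emits a new global watermark only when that minimum advances.
 */
public class WatermarkCoordinator {

    private final long[] subtaskWatermarks; // latest report per subtask
    private long globalWatermark = Long.MIN_VALUE;

    public WatermarkCoordinator(int parallelism) {
        subtaskWatermarks = new long[parallelism];
        // Until every subtask has reported, the minimum stays at
        // Long.MIN_VALUE and no global watermark is emitted.
        Arrays.fill(subtaskWatermarks, Long.MIN_VALUE);
    }

    /** Returns the advanced global watermark to broadcast, if any. */
    public synchronized OptionalLong onWatermarkReport(int subtaskIndex, long watermark) {
        subtaskWatermarks[subtaskIndex] =
                Math.max(subtaskWatermarks[subtaskIndex], watermark);
        long min = Arrays.stream(subtaskWatermarks).min().getAsLong();
        if (min > globalWatermark) {
            globalWatermark = min;
            return OptionalLong.of(globalWatermark); // broadcast to all subtasks
        }
        return OptionalLong.empty(); // minimum did not move
    }
}
{code}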
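For option 2, a minimal sketch of the partition layout, assuming the topic is created with M * N partitions for M producer subtasks and N consumer channels ({{ChannelPartitionLayout}} is a hypothetical name). Each consumer subtask then reads its own M partitions and aligns watermarks across them with the same min rule the network stack applies per channel today:

{code:java}
/**
 * Hypothetical sketch: the pair (producerSubtask, consumerChannel) maps to a
 * dedicated Kafka partition, so each "channel" of the current model is
 * preserved one-to-one in the topic.
 */
public final class ChannelPartitionLayout {

    private ChannelPartitionLayout() {}

    /** Partition that producer subtask p writes for consumer channel c. */
    public static int partitionFor(int producerSubtask, int consumerChannel,
                                   int consumerParallelism) {
        return producerSubtask * consumerParallelism + consumerChannel;
    }

    /** The partitions consumer channel c must subscribe to (one per producer). */
    public static int[] partitionsFor(int consumerChannel, int producerParallelism,
                                      int consumerParallelism) {
        int[] partitions = new int[producerParallelism];
        for (int p = 0; p < producerParallelism; p++) {
            partitions[p] = partitionFor(p, consumerChannel, consumerParallelism);
        }
        return partitions;
    }

    /** Watermark for a consumer channel: the minimum across its partitions. */
    public static long alignedWatermark(long[] perPartitionWatermarks) {
        long min = Long.MAX_VALUE;
        for (long w : perPartitionWatermarks) {
            min = Math.min(min, w);
        }
        return min;
    }
}
{code}

This keeps the watermark semantics identical to the current model, but it also exposes the M * N partition count and the subscription logic to external readers, which is exactly the con listed above.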
> Provide a Kafka Source/Sink pair that aligns Kafka's Partitions and Flink's KeyGroups
> --------------------------------------------------------------------------------------
>
>                 Key: FLINK-15670
>                 URL: https://issues.apache.org/jira/browse/FLINK-15670
>             Project: Flink
>          Issue Type: New Feature
>          Components: API / DataStream, Connectors / Kafka
>            Reporter: Stephan Ewen
>            Priority: Major
>              Labels: usability
>             Fix For: 1.11.0
>
> This Source/Sink pair would serve two purposes:
> 1. You can read topics that are already partitioned by key and process them without partitioning them again (avoid shuffles).
> 2. You can use this to shuffle through Kafka, thereby decomposing the job into smaller jobs and independent pipelined regions that fail over independently.