Hi Xu,

Thanks for the response.

I am currently using Spark Streaming to process data from Kafka in
microbatches, writing each microbatch's data to a dedicated prefix in S3.
Since Spark Streaming is lazy, it processes data only when a microbatch is
created or triggered, leaving resources idle until then. This approach is
not true streaming. To improve resource utilization and process data as it
arrives, I am considering switching to Flink.

Both Spark's Kafka source and Flink's Kafka source (with parallelism > 1) use
multithreaded processing. However, Spark's Kafka source readers share a global
microbatch epoch, ensuring a consistent view across readers. In contrast,
Flink's Kafka source split readers do not share a global epoch or identifier
that could be used to divide the stream into chunks.

So yes: this use case requires all split readers of a source to emit a
special event with the same epoch at the same time.
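
To make the chunking idea concrete, here is a minimal sketch in plain Java
(no Flink API involved). It assumes the epoch is derived from each record's
event timestamp and a fixed interval, so every reader computes the same
identifier without coordination. The class name, method names, and the
"epoch=" prefix layout are hypothetical, chosen only for illustration.

```java
import java.time.Duration;
import java.time.Instant;

public class EpochChunker {

    // Hypothetical helper: start of the fixed-size interval containing the
    // given event timestamp. floorDiv keeps negative timestamps in the
    // correct interval instead of rounding toward zero.
    static long epochStartMillis(long eventTimeMillis, Duration interval) {
        long size = interval.toMillis();
        return Math.floorDiv(eventTimeMillis, size) * size;
    }

    // Hypothetical prefix layout: one destination prefix per epoch.
    static String prefixFor(long eventTimeMillis, Duration interval) {
        return "epoch=" + epochStartMillis(eventTimeMillis, interval);
    }

    public static void main(String[] args) {
        Duration interval = Duration.ofMinutes(5);
        long t1 = Instant.parse("2025-01-03T14:02:10Z").toEpochMilli();
        long t2 = Instant.parse("2025-01-03T14:04:59Z").toEpochMilli();
        long t3 = Instant.parse("2025-01-03T14:05:00Z").toEpochMilli();

        // t1 and t2 fall in the same 5-minute interval; t3 starts the next one.
        System.out.println(prefixFor(t1, interval).equals(prefixFor(t2, interval)));
        System.out.println(prefixFor(t1, interval).equals(prefixFor(t3, interval)));
    }
}
```

This only gives every reader the same epoch for a given record. It does not
tell a writer when an epoch is complete across all readers, which is the part
that needs an aligned special event.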
Thanks

    On Friday, January 3, 2025 at 06:21:34 AM PST, Xu Huang 
<huangxu.wal...@gmail.com> wrote:  
 
 Hi, Anil

I don't understand what you mean by Global Watermark. Are you trying to
have all Sources emit a special event with the same epoch at the same time?
Is there a specific use case for this question?

Happy new year!

Best,
Xu Huang

Anil Dasari <dasaria...@myyahoo.com.invalid> wrote on Fri, Jan 3, 2025 at 14:02:

>  Hello Xu, Happy new year. Thank you for FLIP-499 and FLIP-467.
> I tried to split/chunk streams based on fixed timestamp intervals and
> route them to the appropriate destination. A few months ago, I evaluated
> the following options and found that Flink currently lacks direct support
> for a global watermark or timer that can share consistent information (such
> as an epoch or identifier) across task nodes.
> 1. Windowing: While promising, this approach requires record-level checks
> for flushing, as window data isn't accessible throughout the pipeline.
> 2. Window + Trigger: This method buffers events until the trigger interval
> is reached, impacting real-time processing since events are processed only
> when the trigger fires.
> 3. Processing Time: Processing time is localized to each file writer,
> causing inconsistencies across task managers.
> 4. Watermark: Watermarks are specific to each source task. Additionally,
> the initial watermark (before the first event) is not epoch-based, leading
> to further challenges.
> Would global watermarks address this use case? If not, could this use case
> align with any of the proposed FLIPs?
> Thanks in advance.
>
>    On Thursday, January 2, 2025 at 09:06:31 PM PST, Xu Huang <
> huangxu.wal...@gmail.com> wrote:
>
>  Hi Devs,
>
> Weijie Guo and I would like to initiate a discussion about FLIP-499:
> Support Event Time by Generalized Watermark in DataStream V2
> <
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-499%3A+Support+Event+Time+by+Generalized+Watermark+in+DataStream+V2
> >
> [1].
>
> Event time is a fundamental feature of Flink that has been widely adopted.
> For instance, the Window operator can determine whether to trigger a window
> based on event time, and users can register timers using the event time.
> FLIP-467[2] introduces the Generalized Watermark in DataStream V2, enabling
> users to define specific events that can be emitted from a source or other
> operators, propagated along the streams, received by downstream operators,
> and aligned during propagation. Within this framework, the traditional
> (event-time) Watermark can be viewed as a special instance of the
> Generalized Watermark already provided by Flink.
>
> To make it easy for users to use event time in DataStream V2, this FLIP
> will implement an event time extension in DataStream V2 based on Generalized
> Watermark.
>
> For more details, please refer to FLIP-499 [1]. We look forward to your
> feedback.
>
> Best,
>
> Xu Huang
>
> [1] https://cwiki.apache.org/confluence/x/pQz0Ew
>
> [2] https://cwiki.apache.org/confluence/x/oA6TEg
>
  
