I presume you're coming from Spark and looking for something like RDD.foreach. Flink has no direct equivalent. I think you can use a batch job for processing and storing the data, and handle everything else in custom code outside of Flink. The hard way is to implement a custom connector, which is possible but more complex.
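If all you need is a per-record side effect, the lightweight version of that custom code is a sink function rather than a full connector. A rough, untested sketch (ForeachLikeSink is my own name, not a Flink API):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class ForeachLikeSink extends RichSinkFunction<String> {

    @Override
    public void invoke(String value, Context context) {
        // Per-record side effect, roughly what RDD.foreach does in Spark,
        // e.g. a write to an external store. Printing stands in for that here.
        System.out.println(value);
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements("a", "b", "c").addSink(new ForeachLikeSink());
        env.execute("foreach-like-sink");
    }
}

Note this gives you no transactional guarantees; anything that must commit or roll back atomically pushes you toward the full connector / two-phase-commit route.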
G

On Wed, Oct 30, 2024 at 6:37 AM Anil Dasari <dasaria...@myyahoo.com.invalid> wrote:

> Hi Venkat, thanks for the reply.
> Microbatching is a data processing technique where small batches of data
> are collected and processed together at regular intervals. However, I'm
> aiming to avoid traditional micro-batch processing by tagging records
> within a time window as a batch, allowing for near-real-time data
> processing. I'm currently exploring Flink for the following use case:
> 1. Group data by a time window and write it to S3 under the appropriate prefix.
> 2. Generate metrics for the microbatch and, if needed, store them in S3.
> 3. Send metrics to an external system to notify that Step 1 has been completed.
> If any part of the process fails, the entire microbatch step should be
> rolled back. I'm planning to implement a two-phase commit sink for Steps 2
> and 3. The primary challenge is tagging the record set with an epoch time
> across all tasks within a window, so it can be used in the sink to create
> committable splits, similar to the processing time in the Flink file sink.
> Thanks
>
> On Tuesday, October 29, 2024 at 09:40:40 PM PDT, Venkatakrishnan Sowrirajan <vsowr...@asu.edu> wrote:
>
> Can you share more details on what you mean by micro-batching? Can you
> explain with an example to understand it better?
>
> Thanks
> Venkat
>
> On Tue, Oct 29, 2024, 1:22 PM Anil Dasari <dasaria...@myyahoo.com.invalid> wrote:
>
> > Hello team,
> > I apologize for reaching out on the dev mailing list. I'm working on
> > implementing micro-batching with near-real-time processing.
> > I've seen similar questions in the Flink Slack channel and user mailing
> > list, but there hasn't been much discussion or feedback. Here are the
> > options I've explored:
> > 1. Windowing: This approach looked promising, but the flushing mechanism
> > requires record-level information checks, as window data isn't accessible
> > throughout the pipeline.
> > 2. Window + Trigger: This method buffers events until the trigger interval
> > is reached, which affects real-time processing; events are only processed
> > when the trigger fires.
> > 3. Processing time: The processing time is specific to each file writer,
> > resulting in inconsistencies across different task managers.
> > 4. Watermark: There's no global watermark; it's specific to each source
> > task, and the initial watermark information (before the first watermark
> > event) isn't epoch-based.
> > I'm looking to write data grouped by time (micro-batch time). What's the
> > best approach to achieve micro-batching in Flink?
> > Let me know if you have any questions.
> > Thanks.
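PS: on the tagging challenge above, one way to get a deterministic batch id is to align processing time down to the interval boundary, so every subtask computes the same tag for records in the same interval. A rough, untested sketch (MicroBatchTagger is my own name, not a Flink class; note it still relies on wall clocks, so records near a boundary can be tagged differently on different task managers, which is exactly the option-3 inconsistency you mentioned):

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class MicroBatchTagger<T> extends ProcessFunction<T, Tuple2<Long, T>> {

    private final long intervalMillis;

    public MicroBatchTagger(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    @Override
    public void processElement(T value, Context ctx, Collector<Tuple2<Long, T>> out) {
        // Epoch-aligned batch id: floor the current processing time to the
        // interval, e.g. 60_000 ms buckets => 12:00:00, 12:01:00, ...
        long now = ctx.timerService().currentProcessingTime();
        long batchId = now - (now % intervalMillis);
        out.collect(Tuple2.of(batchId, value));
    }
}

A downstream sink could then group committables by the Long tag and commit or roll back a whole micro-batch at once.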