I presume you're coming from Spark and looking for something like RDD.foreach. Flink has no direct equivalent. I think you can use a batch job for processing and storing the data, and handle everything else in custom code outside of Flink. The hard way is to implement a custom connector, which is possible but more complex.
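If all you need is a per-record side effect, the lightweight version of that custom code is a sink function rather than a full connector. A rough, untested sketch (ForeachLikeSink is my own name, not a Flink API):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class ForeachLikeSink extends RichSinkFunction<String> {

    @Override
    public void invoke(String value, Context context) {
        // Per-record side effect, roughly what RDD.foreach does in Spark,
        // e.g. a write to an external store. Printing stands in for that here.
        System.out.println(value);
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements("a", "b", "c").addSink(new ForeachLikeSink());
        env.execute("foreach-like-sink");
    }
}

Note this gives you no transactional guarantees; anything that must commit or roll back atomically pushes you toward the full connector / two-phase-commit route.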
G

On Wed, Oct 30, 2024 at 6:37 AM Anil Dasari <dasaria...@myyahoo.com.invalid> wrote:

> Hi Venkat, thanks for the reply.
> Microbatching is a data processing technique where small batches of data
> are collected and processed together at regular intervals. However, I'm
> aiming to avoid traditional micro-batch processing by tagging records
> within a time window as a batch, allowing for near-real-time data
> processing. I'm currently exploring Flink for the following use case:
> 1. Group data by a time window and write it to S3 under the appropriate prefix.
> 2. Generate metrics for the microbatch and, if needed, store them in S3.
> 3. Send metrics to an external system to notify that Step 1 has been completed.
> If any part of the process fails, the entire microbatch step should be
> rolled back. I'm planning to implement a two-phase commit sink for Steps 2
> and 3. The primary challenge is tagging the record set with an epoch time
> across all tasks within a window, so it can be used in the sink to create
> committable splits, similar to the processing time in the Flink file sink.
> Thanks
>
> On Tuesday, October 29, 2024 at 09:40:40 PM PDT, Venkatakrishnan Sowrirajan <vsowr...@asu.edu> wrote:
>
> Can you share more details on what you mean by micro-batching? Can you
> explain with an example to understand it better?
>
> Thanks
> Venkat
>
> On Tue, Oct 29, 2024, 1:22 PM Anil Dasari <dasaria...@myyahoo.com.invalid> wrote:
>
> > Hello team,
> > I apologize for reaching out on the dev mailing list. I'm working on
> > implementing micro-batching with near-real-time processing.
> > I've seen similar questions in the Flink Slack channel and user mailing
> > list, but there hasn't been much discussion or feedback. Here are the
> > options I've explored:
> > 1. Windowing: This approach looked promising, but the flushing mechanism
> > requires record-level information checks, as window data isn't accessible
> > throughout the pipeline.
> > 2. Window + Trigger: This method buffers events until the trigger interval
> > is reached, which affects real-time processing; events are only processed
> > when the trigger fires.
> > 3. Processing time: The processing time is specific to each file writer,
> > resulting in inconsistencies across different task managers.
> > 4. Watermark: There's no global watermark; it's specific to each source
> > task, and the initial watermark information (before the first watermark
> > event) isn't epoch-based.
> > I'm looking to write data grouped by time (micro-batch time). What's the
> > best approach to achieve micro-batching in Flink?
> > Let me know if you have any questions.
> > Thanks.
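PS: on the tagging challenge above, one way to get a deterministic batch id is to align processing time down to the interval boundary, so every subtask computes the same tag for records in the same interval. A rough, untested sketch (MicroBatchTagger is my own name, not a Flink class; note it still relies on wall clocks, so records near a boundary can be tagged differently on different task managers, which is exactly the option-3 inconsistency you mentioned):

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class MicroBatchTagger<T> extends ProcessFunction<T, Tuple2<Long, T>> {

    private final long intervalMillis;

    public MicroBatchTagger(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    @Override
    public void processElement(T value, Context ctx, Collector<Tuple2<Long, T>> out) {
        // Epoch-aligned batch id: floor the current processing time to the
        // interval, e.g. 60_000 ms buckets => 12:00:00, 12:01:00, ...
        long now = ctx.timerService().currentProcessingTime();
        long batchId = now - (now % intervalMillis);
        out.collect(Tuple2.of(batchId, value));
    }
}

A downstream sink could then group committables by the Long tag and commit or roll back a whole micro-batch at once.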