Hey, could you please check if a bucket assigner is already enough? If not, what's missing?
FileSink<RowData> orcSink =
    FileSink.forBulkFormat(new Path("s3a://mybucket/flink_file_sink_orc_test"), factory)
        .withBucketAssigner(
            new DateTimeBucketAssigner<>("'dt='yyyyMMdd/'hour='HH", ZoneId.of("Asia/Shanghai")))
        .build();

On Mon, Nov 4, 2024 at 6:36 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:

> I presume you're coming from Spark and are looking for something like
> RDD.foreach. There is no such feature in Flink. I think you can use a
> batch job for processing and storing the data; all the rest can be done
> in custom code outside of Flink.
>
> The hard way is to implement a custom connector, which is possible but
> more complex.
>
> G
>
> On Wed, Oct 30, 2024 at 6:37 AM Anil Dasari <dasaria...@myyahoo.com.invalid> wrote:
>
> > Hi Venkat, thanks for the reply.
> > Micro-batching is a data processing technique where small batches of
> > data are collected and processed together at regular intervals.
> > However, I'm aiming to avoid traditional micro-batch processing by
> > tagging records within a time window as a batch, allowing for
> > near-real-time data processing. I'm currently exploring Flink for the
> > following use case:
> >
> > 1. Group data by a time window and write it to S3 under the
> >    appropriate prefix.
> > 2. Generate metrics for the micro-batch and, if needed, store them in S3.
> > 3. Send metrics to an external system to notify that step 1 has been
> >    completed.
> >
> > If any part of the process fails, the entire micro-batch step should
> > be rolled back. I'm planning to implement a two-phase-commit sink for
> > steps 2 and 3. The primary challenge is tagging the record set with an
> > epoch time across all tasks within a window, so that the sink can use
> > it to create committable splits, much like the processing time in the
> > Flink file sink.
> >
> > Thanks
> >
> > On Tuesday, October 29, 2024 at 09:40:40 PM PDT, Venkatakrishnan
> > Sowrirajan <vsowr...@asu.edu> wrote:
> >
> > Can you share more details on what you mean by micro-batching? Can you
> > explain with an example so we understand it better?
> >
> > Thanks
> > Venkat
> >
> > On Tue, Oct 29, 2024, 1:22 PM Anil Dasari <dasaria...@myyahoo.com.invalid> wrote:
> >
> > > Hello team,
> > > I apologize for reaching out on the dev mailing list. I'm working on
> > > implementing micro-batching with near-real-time processing.
> > > I've seen similar questions in the Flink Slack channel and on the
> > > user mailing list, but there hasn't been much discussion or
> > > feedback. Here are the options I've explored:
> > >
> > > 1. Windowing: This approach looked promising, but the flushing
> > >    mechanism requires record-level checks, as window data isn't
> > >    accessible throughout the pipeline.
> > > 2. Window + trigger: This method buffers events until the trigger
> > >    interval is reached, which hurts real-time processing; events are
> > >    only processed when the trigger fires.
> > > 3. Processing time: The processing time is specific to each file
> > >    writer, resulting in inconsistencies across task managers.
> > > 4. Watermark: There is no global watermark; it's specific to each
> > >    source task, and the initial watermark (before the first
> > >    watermark event) isn't epoch-based.
> > >
> > > I'm looking to write data grouped by time (micro-batch time). What's
> > > the best approach to achieve micro-batching in Flink?
> > > Let me know if you have any questions.
> > >
> > > Thanks.
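If the DateTimeBucketAssigner alone turns out not to be enough, here is roughly what I was sketching instead: a custom BucketAssigner that truncates processing time down to a fixed epoch-aligned interval, so every subtask writing during the same interval targets the same epoch-named prefix, which seems to be what Anil needs for building committable splits. This is only an untested sketch; the class name, the "batch=" prefix, and the window length are mine, and it still inherits the limitation Anil lists as option 3: processing time is per subtask, so records near a window boundary can land in adjacent buckets.

import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;
import org.apache.flink.table.data.RowData;

/** Buckets records by a fixed epoch-aligned interval. Untested sketch. */
public class EpochWindowBucketAssigner implements BucketAssigner<RowData, String> {

    private final long windowMillis;

    public EpochWindowBucketAssigner(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    @Override
    public String getBucketId(RowData element, Context context) {
        // Truncate the subtask's processing time to the window start so that
        // all subtasks writing in the same interval share one bucket id,
        // e.g. "batch=1730700000000".
        long epoch = context.currentProcessingTime() / windowMillis * windowMillis;
        return "batch=" + epoch;
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }
}

It would drop into the builder above as
.withBucketAssigner(new EpochWindowBucketAssigner(60 * 60 * 1000L))
in place of the DateTimeBucketAssigner.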