Hey,

Could you please check whether a bucket assigner is already enough? If not,
what's missing?

FileSink<RowData> orcSink = FileSink
        .forBulkFormat(new Path("s3a://mybucket/flink_file_sink_orc_test"), factory)
        .withBucketAssigner(new DateTimeBucketAssigner<>("'dt='yyyyMMdd/'hour='HH",
                ZoneId.of("Asia/Shanghai")))
        .build();
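For context, DateTimeBucketAssigner derives the bucket from the processing
time of whichever subtask writes the record, so subtasks can place records
for the same instant into different buckets around an interval boundary. If
stricter epoch tagging is needed, a custom assigner is small. A minimal
sketch, assuming a fixed 5-minute batch interval (the class name and the
interval are illustrative, not from this thread):

import java.time.Duration;
import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;
import org.apache.flink.table.data.RowData;

// Hypothetical: buckets rows by processing time truncated to a fixed
// interval, so every subtask writing within the same wall-clock interval
// (up to clock skew) produces the same bucket id.
public class EpochBucketAssigner implements BucketAssigner<RowData, String> {

    private static final long INTERVAL_MS = Duration.ofMinutes(5).toMillis();

    @Override
    public String getBucketId(RowData element, Context context) {
        // Truncate the current processing time down to the interval boundary.
        long epoch = context.currentProcessingTime() / INTERVAL_MS * INTERVAL_MS;
        return "batch=" + epoch;
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }
}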


On Mon, Nov 4, 2024 at 6:36 PM Gabor Somogyi <gabor.g.somo...@gmail.com>
wrote:

> Presume you're coming from Spark and looking for something like
> RDD.foreach.
> In Flink there is no such feature. I think you can use a batch job for
> processing and storing the data.
> All the rest can be done in custom code outside of Flink.
>
> The hard way is to implement a custom connector, which is possible but more
> complex.
>
> G
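A minimal sketch of the batch-job route Gabor suggests, assuming the input
can be bounded per run (everything surrounding these lines is assumed):

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Bounded execution: the job reads one micro-batch worth of input and
// finishes, so any "store the data, then notify" follow-up can run as
// ordinary driver code after env.execute() returns.
env.setRuntimeMode(RuntimeExecutionMode.BATCH);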
>
>
> On Wed, Oct 30, 2024 at 6:37 AM Anil Dasari <dasaria...@myyahoo.com.invalid>
> wrote:
>
> > Hi Venkat,
> >
> > Thanks for the reply.
> >
> > Microbatching is a data processing technique where small batches of data
> > are collected and processed together at regular intervals. However, I'm
> > aiming to avoid traditional micro-batch processing by tagging records
> > within a time window as a batch, allowing for near-real-time data
> > processing. I'm currently exploring Flink for the following use case:
> >
> > 1. Group data by a time window and write it to S3 under the appropriate
> > prefix.
> > 2. Generate metrics for the microbatch and, if needed, store them in S3.
> > 3. Send metrics to an external system to notify that Step 1 has been
> > completed.
> >
> > If any part of the process fails, the entire microbatch step should be
> > rolled back. I'm planning to implement a two-phase commit sink for Steps
> > 2 and 3. The primary challenge is tagging the record set with an epoch
> > time across all tasks within a window, so it can be used in the sink to
> > create committable splits, much like the processing time in the Flink
> > file sink.
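A skeleton of the two-phase commit sink planned for Steps 2 and 3, sketched
with Flink's classic TwoPhaseCommitSinkFunction (newer Sink V2 releases have
an equivalent committer mechanism); the transaction type, the batch id, and
the metric/notify calls are assumptions, not from this thread:

import org.apache.flink.api.common.typeutils.base.StringSerializer;
import org.apache.flink.api.common.typeutils.base.VoidSerializer;
import org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction;
import org.apache.flink.table.data.RowData;

// Hypothetical: one transaction per checkpoint; commit() is where metrics
// are written to S3 and the external system is notified, so a failed
// checkpoint rolls Steps 2 and 3 back together via abort().
public class MetricsNotifySink
        extends TwoPhaseCommitSinkFunction<RowData, String, Void> {

    public MetricsNotifySink() {
        super(StringSerializer.INSTANCE, VoidSerializer.INSTANCE);
    }

    @Override
    protected String beginTransaction() {
        // e.g. an epoch-based batch id shared by all records of the window
        return "batch-" + System.currentTimeMillis();
    }

    @Override
    protected void invoke(String txn, RowData value, Context context) {
        // accumulate per-batch metrics here (assumed)
    }

    @Override
    protected void preCommit(String txn) {
        // flush anything buffered; called when the checkpoint barrier arrives
    }

    @Override
    protected void commit(String txn) {
        // write metrics to S3 and notify the external system (assumed)
    }

    @Override
    protected void abort(String txn) {
        // discard this batch's metrics so the whole step rolls back
    }
}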
> >
> > Thanks
> >
> > On Tuesday, October 29, 2024 at 09:40:40 PM PDT, Venkatakrishnan
> > Sowrirajan <vsowr...@asu.edu> wrote:
> >
> > Can you share more details on what you mean by micro-batching? Can you
> > explain with an example so I can understand it better?
> >
> > Thanks
> > Venkat
> >
> > On Tue, Oct 29, 2024, 1:22 PM Anil Dasari <dasaria...@myyahoo.com.invalid>
> > wrote:
> >
> > > Hello team,
> > > I apologize for reaching out on the dev mailing list. I'm working on
> > > implementing micro-batching with near real-time processing.
> > > I've seen similar questions in the Flink Slack channel and user mailing
> > > list, but there hasn't been much discussion or feedback. Here are the
> > > options I've explored:
> > > 1. Windowing: This approach looked promising, but the flushing
> > > mechanism requires record-level information checks, as window data
> > > isn't accessible throughout the pipeline.
> > > 2. Window + Trigger: This method buffers events until the trigger
> > > interval is reached, which affects real-time processing; events are
> > > only processed when the trigger fires (see the sketch after this list).
> > > 3. Processing Time: The processing time is specific to each file
> > > writer, resulting in inconsistencies across different task managers.
> > > 4. Watermark: There's no global watermark; it's specific to each source
> > > task, and the initial watermark (before the first watermark event)
> > > isn't epoch-based.
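A sketch of option 2's Window + Trigger shape with an early-firing trigger,
so results are emitted before the window closes; the key selector, the
interval values, and WriteBatchFunction are illustrative assumptions:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.triggers.ContinuousProcessingTimeTrigger;

DataStream<Event> events = ...; // Event is an assumed record type

events
    .keyBy(e -> e.getKey())
    // one micro-batch per 5-minute processing-time window
    .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
    // fire every 10 seconds instead of only at window end, trading some
    // repeated evaluation for near-real-time output
    .trigger(ContinuousProcessingTimeTrigger.of(Time.seconds(10)))
    .process(new WriteBatchFunction()); // assumed ProcessWindowFunction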
> > > I'm looking to write data grouped by time (micro-batch time). What’s
> > > the best approach to achieve micro-batching in Flink?
> > >
> > > Let me know if you have any questions.
> > >
> > > Thanks.
> > >
> > >
> >
>
