Re: Question: Need Guidance on Timers and Shuffle in Beam-on-Flink

Jan Lukavský Wed, 20 May 2026 23:27:01 -0700

Replies inline.

On 5/20/26 18:05, vamsikrishna korada wrote:

Thanks for the reply, Jan.
The job is similar to an ETL job. We continuously read events from a Kafka topic and write them to files every 5 minutes. We are considering using a |Timer| to close the current file and open a new writer every 5 minutes.

If you don't use state (which implies a shuffle under the current Flink Runner implementation), this approach will lose data on failures (and restarts) of the Pipeline.

You might try using bundles & the checkpointing interval to drive flushing of the files (you would have to flush in @FinishBundle, set the checkpointing interval to about 5 minutes and the bundle size large enough). This _could_ work I suppose, but it will not produce exactly 5-minute-long "windows", will likely produce duplicate records on output and might have consequences for other parts of your Pipeline (if you do some more complex processing).
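To make the duplicate-records caveat concrete, here is a minimal plain-Java sketch of per-bundle buffering. It is only a model under stated assumptions: `BundleFlushModel` and its methods are hypothetical names that mirror the bundle lifecycle of a DoFn, no Beam classes are used, and the `sink` list stands in for files already written to an external system.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical plain-Java model of per-bundle buffering; not a Beam API. */
final class BundleFlushModel {
    // Elements seen in the bundle currently being processed.
    private final List<String> buffer = new ArrayList<>();
    // Stands in for output that has already reached an external system.
    private final List<String> sink = new ArrayList<>();

    void startBundle() {
        buffer.clear();
    }

    void processElement(String element) {
        buffer.add(element);
    }

    /** Flushes the whole bundle to the sink in one go. */
    void finishBundle() {
        sink.addAll(buffer);
        buffer.clear();
    }

    /** Runs one bundle from start to finish. */
    void runBundle(List<String> elements) {
        startBundle();
        for (String element : elements) {
            processElement(element);
        }
        finishBundle();
    }

    List<String> sink() {
        return sink;
    }

    public static void main(String[] args) {
        BundleFlushModel model = new BundleFlushModel();
        // The bundle is flushed, but assume the job fails before the
        // checkpoint covering it completes.
        model.runBundle(List.of("a", "b"));
        // After restart the same input is replayed and flushed again.
        model.runBundle(List.of("a", "b"));
        System.out.println(model.sink());
    }
}
```

In this model the sink has no way to tell that the second flush is a replay, so the same elements end up in the output twice. Avoiding that would need idempotent writes or some other deduplication on the output side.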

You can also have a look at the @RequiresStableInput DoFn annotation, which buffers input data until checkpoint, which will work somewhat similarly to the case above, but you would have to make sure that all elements you process are persisted (if written to an external system).

In general, trying to avoid the shuffle in this case seems like a harsh requirement, I would probably try to relax it, as it enables more "Beam native" processing (e.g. you can then use GroupIntoBatches, FileIO, etc.).

> If you don't worry about consistency, but only need "rough inexact estimates", then using a pure stateless DoFn without @Timer can do the trick.
Could you please elaborate on how this approach would work?
My understanding is that using a timer would require a keyed state, which would introduce a shuffle of the data. Is that correct? We are trying to avoid the shuffle if possible.
Also, could you please help me understand why the output would be considered rough/inexact when using a stateless |DoFn|? Is there a possibility of records being dropped, or is the concern mainly around consistency guarantees?

Using Beam timers requires a shuffle (for grouping keys), but nothing stops you from using processing-time (system clock). As mentioned above, without additional care, stateless processing is not supposed to buffer data anywhere (because that would be stateful processing) and doing so will lose data on failures.
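As a rough illustration of the system-clock idea, the sketch below rolls over to a new "file" once 5 minutes of processing time have passed. It is plain Java and only a stand-in for what the body of a stateless DoFn might do: `ProcessingTimeRoller` and all of its members are hypothetical names, and the injected clock is an assumption made so the behaviour can be exercised without waiting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

/** Hypothetical plain-Java stand-in for a stateless per-element function; not a Beam API. */
final class ProcessingTimeRoller {
    private final long intervalMillis;
    private final LongSupplier clock; // e.g. System::currentTimeMillis
    // Held in memory only: nothing here is checkpointed.
    private final List<String> openFile = new ArrayList<>();
    private final List<List<String>> closedFiles = new ArrayList<>();
    private long openedAt;

    ProcessingTimeRoller(long intervalMillis, LongSupplier clock) {
        this.intervalMillis = intervalMillis;
        this.clock = clock;
        this.openedAt = clock.getAsLong();
    }

    /** Called once per element; the rollover check only happens here. */
    void process(String element) {
        long now = clock.getAsLong();
        if (now - openedAt >= intervalMillis) {
            roll(now);
        }
        openFile.add(element);
    }

    /** Closes the current file (if non-empty) and starts a new one. */
    private void roll(long now) {
        if (!openFile.isEmpty()) {
            closedFiles.add(new ArrayList<>(openFile));
            openFile.clear();
        }
        openedAt = now;
    }

    List<List<String>> closedFiles() {
        return closedFiles;
    }

    /** Number of elements that would be lost if the worker died now. */
    int unflushed() {
        return openFile.size();
    }

    public static void main(String[] args) {
        ProcessingTimeRoller roller =
                new ProcessingTimeRoller(5 * 60 * 1000L, System::currentTimeMillis);
        roller.process("event-1");
        System.out.println("unflushed elements: " + roller.unflushed());
    }
}
```

Two things in this sketch line up with the caveats above. The rollover is checked only when an element arrives, so with sparse traffic a file can stay open well past 5 minutes. And whatever sits in `openFile` exists only in memory, so a failure or restart at that point drops it.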


On Wed, 20 May 2026 at 21:15, Jan Lukavský wrote:

    Hi Vamsi,

    short answer is - it depends. :)

    There are many unknowns in your question. First of all - what kind
    of logic do you refer to? Does it need to modify (i.e. join with)
    the incoming data? Or is it just some (volatile) monitoring?

    If you need timers for output data consistency - then yes, under
    the current Flink Runner implementation there will necessarily be
    a shuffle.
    If you don't worry about consistency, but only need "rough
    inexact estimates", then using a pure stateless DoFn without
    @Timer can do the trick.

    Can you provide more details on your use case?

     Jan

    On 5/20/26 14:29, vamsikrishna korada wrote:


    Hi Beam Community,

    I’m reaching out for some guidance on a Beam Flink streaming job
    I’m working on.

    We are reading from a Kafka topic, where the traffic can be
    either sparse or high-volume, and we need to run a piece of logic
    periodically, roughly every 5 minutes.

    We considered using |@Timer|, but based on the Beam docs, timers
    require keyed state, which introduces a shuffle. We would like to
    avoid this shuffle if possible.

    Is there a way to trigger periodic logic in a Beam pipeline
    without causing a data shuffle?



    Thanks,

    Vamsi
