On Mon, Oct 12, 2020 at 9:23 PM KV 59 <kvajjal...@gmail.com> wrote: > Thanks for your responses. > > I have a follow-up question, when you say > >> elementCountAtLeast means that the runner can buffer as many as it wants >> and can decide to offer a low latency pipeline by triggering often or >> better throughput through the use of buffering. > > > Does it mean, I as a pipeline developer cannot control how often the > runner triggers? >
(Please correct if I am wrong): Yes, as I recall when trigger conditions meet, it allows the runner to fire the trigger, but the runner can choose when to fire. > > Kishore > > On Mon, Oct 12, 2020 at 2:15 PM Kenneth Knowles <k...@apache.org> wrote: > >> Another thing to keep in mind - apologies if it was already clear: >> triggering governs aggregation (GBK / Combine). It does not have any effect >> on stateful DoFn. >> >> On Mon, Oct 12, 2020 at 9:24 AM Luke Cwik <lc...@google.com> wrote: >> >>> The default trigger will only fire when the global window closes which >>> does happen with sources whose watermark goes > GlobalWindow.MAX_TIMESTAMP >>> or during pipeline drain with partial results in streaming. Bounded sources >>> commonly have their watermark advance to the end of time when they complete >>> and some unbounded sources can stop producing output if they detect the end. >>> >>> Parallelization for stateful DoFns are per key and window. >>> Parallelization for GBK is per key and window pane. Note that >>> elementCountAtLeast means that the runner can buffer as many as it wants >>> and can decide to offer a low latency pipeline by triggering often or >>> better throughput through the use of buffering. >>> >>> >>> >>> On Mon, Oct 12, 2020 at 8:22 AM KV 59 <kvajjal...@gmail.com> wrote: >>> >>>> Hi All, >>>> >>>> I'm building a pipeline to process events as they come and do not >>>> really care about the event time and watermark. I'm more interested in not >>>> discarding the events and reducing the latency. The downstream pipeline has >>>> a stateful DoFn. I understand that the default window strategy is Global >>>> Windows,. I did not completely understand the default trigger as per >>>> >>>> https://beam.apache.org/releases/javadoc/2.23.0/org/apache/beam/sdk/transforms/windowing/DefaultTrigger.html >>>> it says Repeatedly.forever(AfterWatermark.pastEndOfWindow()), In case >>>> of global window how does this work (there is no end of window)?. >>>> >>>> My source is Google PubSub and pipeline is running on Dataflow runner I >>>> have defined my window transform as below >>>> >>>> input.apply(TRANSFORM_NAME, Window.<T>into(new GlobalWindows()) >>>>> .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1))) >>>>> .withAllowedLateness(Duration.standardDays(1)).discardingFiredPanes()) >>>> >>>> >>>> A couple of questions >>>> >>>> 1. Is triggering after each element inefficient in terms of >>>> persistence(serialization) after each element and also parallelism >>>> triggering after each looks like a serial execution? >>>> 2. How does Dataflow parallelize in such cases of triggers? >>>> >>>> >>>> Thanks and appreciate the responses. >>>> >>>> Kishore >>>> >>>