Another thing to keep in mind - apologies if it was already clear: triggering governs aggregation (GBK / Combine). It does not have any effect on stateful DoFn.
On Mon, Oct 12, 2020 at 9:24 AM Luke Cwik <lc...@google.com> wrote: > The default trigger will only fire when the global window closes which > does happen with sources whose watermark goes > GlobalWindow.MAX_TIMESTAMP > or during pipeline drain with partial results in streaming. Bounded sources > commonly have their watermark advance to the end of time when they complete > and some unbounded sources can stop producing output if they detect the end. > > Parallelization for stateful DoFns are per key and window. Parallelization > for GBK is per key and window pane. Note that elementCountAtLeast means > that the runner can buffer as many as it wants and can decide to offer a > low latency pipeline by triggering often or better throughput through the > use of buffering. > > > > On Mon, Oct 12, 2020 at 8:22 AM KV 59 <kvajjal...@gmail.com> wrote: > >> Hi All, >> >> I'm building a pipeline to process events as they come and do not really >> care about the event time and watermark. I'm more interested in not >> discarding the events and reducing the latency. The downstream pipeline has >> a stateful DoFn. I understand that the default window strategy is Global >> Windows,. I did not completely understand the default trigger as per >> >> https://beam.apache.org/releases/javadoc/2.23.0/org/apache/beam/sdk/transforms/windowing/DefaultTrigger.html >> it says Repeatedly.forever(AfterWatermark.pastEndOfWindow()), In case of >> global window how does this work (there is no end of window)?. >> >> My source is Google PubSub and pipeline is running on Dataflow runner I >> have defined my window transform as below >> >> input.apply(TRANSFORM_NAME, Window.<T>into(new GlobalWindows()) >>> .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1))) >>> .withAllowedLateness(Duration.standardDays(1)).discardingFiredPanes()) >> >> >> A couple of questions >> >> 1. Is triggering after each element inefficient in terms of >> persistence(serialization) after each element and also parallelism >> triggering after each looks like a serial execution? >> 2. How does Dataflow parallelize in such cases of triggers? >> >> >> Thanks and appreciate the responses. >> >> Kishore >> >