Another thing to keep in mind (apologies if it was already clear):
triggering governs aggregation (GBK / Combine). It has no effect on
stateful DoFns.
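To make the distinction concrete, here is a minimal sketch of a stateful DoFn (the class and state names are illustrative, not from the thread): state is read and written per element as it arrives, scoped to the key and window, independent of whatever trigger is configured on the input PCollection.

```java
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Illustrative stateful DoFn: counts events per key and emits the running
// count on every element. Triggers never gate this output; only GBK/Combine
// panes are trigger-driven.
class RunningCountFn extends DoFn<KV<String, String>, KV<String, Long>> {

  @StateId("count")
  private final StateSpec<ValueState<Long>> countSpec = StateSpecs.value();

  @ProcessElement
  public void process(
      @Element KV<String, String> element,
      @StateId("count") ValueState<Long> count,
      OutputReceiver<KV<String, Long>> out) {
    // State is scoped per key and window; with GlobalWindows, per key.
    Long previous = count.read();
    long current = (previous == null ? 0L : previous) + 1;
    count.write(current);
    // Fires per processed element, not per trigger pane.
    out.output(KV.of(element.getKey(), current));
  }
}
```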

On Mon, Oct 12, 2020 at 9:24 AM Luke Cwik <lc...@google.com> wrote:

> The default trigger will only fire when the global window closes, which
> does happen with sources whose watermark advances past
> GlobalWindow.MAX_TIMESTAMP, or during pipeline drain with partial results
> in streaming. Bounded sources commonly advance their watermark to the end
> of time when they complete, and some unbounded sources can stop producing
> output if they detect the end of input.
>
> Parallelization for stateful DoFns is per key and window. Parallelization
> for GBK is per key and window pane. Note that elementCountAtLeast means
> the runner can buffer as many elements as it wants: it can offer a
> low-latency pipeline by triggering often, or better throughput by
> buffering more before firing.
>
>
>
> On Mon, Oct 12, 2020 at 8:22 AM KV 59 <kvajjal...@gmail.com> wrote:
>
>> Hi All,
>>
>> I'm building a pipeline to process events as they come and do not really
>> care about the event time and watermark. I'm more interested in not
>> discarding events and in reducing latency. The downstream pipeline has a
>> stateful DoFn. I understand that the default window strategy is Global
>> Windows. I did not completely understand the default trigger; per
>>
>> https://beam.apache.org/releases/javadoc/2.23.0/org/apache/beam/sdk/transforms/windowing/DefaultTrigger.html
>>
>> it is Repeatedly.forever(AfterWatermark.pastEndOfWindow()). In the case
>> of the global window, how does this work (there is no end of window)?
>>
>> My source is Google PubSub and the pipeline is running on the Dataflow
>> runner. I have defined my window transform as below:
>>
>>> input.apply(TRANSFORM_NAME, Window.<T>into(new GlobalWindows())
>>>     .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
>>>     .withAllowedLateness(Duration.standardDays(1))
>>>     .discardingFiredPanes())
>>
>>
>> A couple of questions
>>
>>    1. Is triggering after each element inefficient, given that state must
>>    be persisted (serialized) after every element? And does triggering
>>    after each element effectively serialize execution?
>>    2. How does Dataflow parallelize in such cases of triggers?
>>
>>
>> Thanks and appreciate the responses.
>>
>> Kishore
>>
>
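For what it's worth, the latency/throughput tradeoff described above can be expressed directly in the trigger. A hedged sketch (the element count and delay below are illustrative placeholders, not recommendations):

```java
import org.apache.beam.sdk.transforms.windowing.AfterFirst;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

class TriggerChoices {

  // Fire on every element: minimal latency, maximal per-pane overhead.
  static <T> Window<T> perElement() {
    return Window.<T>into(new GlobalWindows())
        .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardDays(1))
        .discardingFiredPanes();
  }

  // Let the runner buffer: fire at 1000 elements, or 10 seconds after the
  // first element in the pane, whichever comes first.
  static <T> Window<T> batched() {
    return Window.<T>into(new GlobalWindows())
        .triggering(Repeatedly.forever(AfterFirst.of(
            AfterPane.elementCountAtLeast(1000),
            AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(10)))))
        .withAllowedLateness(Duration.standardDays(1))
        .discardingFiredPanes();
  }
}
```

Note that a stateful DoFn downstream sees elements the same way in either case; only GBK/Combine pane boundaries change.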
