+dev <d...@beam.apache.org> +Reuven Lax <re...@google.com>

Advancing the watermark to infinity does have an effect on the
GlobalWindow. The GlobalWindow ends a little bit before infinity :-). That
is why this works to cause the output even for unbounded aggregations.

Kenn

On Fri, May 21, 2021 at 5:10 AM Jeff Klukas <jklu...@mozilla.com> wrote:

> Beam users,
>
> We're attempting to write a Java pipeline that uses Count.perKey() to
> collect event counts, and then flush those to an HTTP API every ten minutes
> based on processing time.
>
> We've tried expressing this using GlobalWindows with an
> AfterProcessingTime trigger, but we find that when we drain the pipeline
> we end up with entries in the droppedDueToLateness metric. This was
> initially surprising, but may be line line with documented behavior [0]:
>
> > When you issue the Drain command, Dataflow immediately closes any
> in-process windows and fires all triggers. The system does not wait for any
> outstanding time-based windows to finish. Dataflow causes open windows to
> close by advancing the system watermark to infinity
>
> Perhaps advancing watermark to infinity has no effect on GlobalWindows, so
> we attempted to get around this by using a fixed but arbitrarily-long
> window:
>
>     FixedWindows.of(Duration.standardDays(36500))
>
> The first few tests with this configuration came back clean, but the third
> test again showed droppedDueToLateness after calling Drain. You can see
> this current configuration in [1].
>
> Is there a pattern for reliably flushing on Drain when doing processing
> time-based aggregates like this?
>
> [0]
> https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline#effects_of_draining_a_job
> [1]
> https://github.com/mozilla/gcp-ingestion/pull/1689/files#diff-1d75ce2cbda625465d5971a83d842dd35e2eaded2c2dd2b6c7d0d7cdfd459115R58-R71
>
>

Reply via email to