+dev <d...@beam.apache.org> +Reuven Lax <re...@google.com> Advancing the watermark to infinity does have an effect on the GlobalWindow. The GlobalWindow ends a little bit before infinity :-). That is why this works to cause the output even for unbounded aggregations.
Kenn On Fri, May 21, 2021 at 5:10 AM Jeff Klukas <jklu...@mozilla.com> wrote: > Beam users, > > We're attempting to write a Java pipeline that uses Count.perKey() to > collect event counts, and then flush those to an HTTP API every ten minutes > based on processing time. > > We've tried expressing this using GlobalWindows with an > AfterProcessingTime trigger, but we find that when we drain the pipeline > we end up with entries in the droppedDueToLateness metric. This was > initially surprising, but may be line line with documented behavior [0]: > > > When you issue the Drain command, Dataflow immediately closes any > in-process windows and fires all triggers. The system does not wait for any > outstanding time-based windows to finish. Dataflow causes open windows to > close by advancing the system watermark to infinity > > Perhaps advancing watermark to infinity has no effect on GlobalWindows, so > we attempted to get around this by using a fixed but arbitrarily-long > window: > > FixedWindows.of(Duration.standardDays(36500)) > > The first few tests with this configuration came back clean, but the third > test again showed droppedDueToLateness after calling Drain. You can see > this current configuration in [1]. > > Is there a pattern for reliably flushing on Drain when doing processing > time-based aggregates like this? > > [0] > https://cloud.google.com/dataflow/docs/guides/stopping-a-pipeline#effects_of_draining_a_job > [1] > https://github.com/mozilla/gcp-ingestion/pull/1689/files#diff-1d75ce2cbda625465d5971a83d842dd35e2eaded2c2dd2b6c7d0d7cdfd459115R58-R71 > >