On 2/27/24 19:49, Robert Bradshaw via dev wrote:
On Tue, Feb 27, 2024 at 10:39 AM Jan Lukavský <je...@seznam.cz> wrote:
On 2/27/24 19:22, Robert Bradshaw via dev wrote:
On Mon, Feb 26, 2024 at 11:45 AM Kenneth Knowles <k...@apache.org> wrote:
Pulling out focus points:
On Fri, Feb 23, 2024 at 7:21 PM Robert Bradshaw via dev <dev@beam.apache.org>
wrote:
I can't act on something yet [...] but I expect to be able to [...] at some
time in the processing-time future.
I like this as a clear and internally-consistent feature description. It
describes ProcessContinuation and those timers which serve the same purpose as
ProcessContinuation.
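To make the feature description concrete, here is a plain-Python simulation of the ProcessContinuation semantics being described (this is illustrative only, not Beam's actual API): a `process()` call that can't act yet returns a request to be resumed at a later processing time, and the runner re-invokes it then. The `run` loop and element shape are made up for the sketch.

```python
from dataclasses import dataclass

# Illustrative simulation of ProcessContinuation semantics (not the Beam API):
# process() may return "resume me after d seconds of processing time" instead
# of finishing, and the runner re-invokes it at that future processing time.

@dataclass
class ProcessContinuation:
    resume_delay: float  # processing-time seconds to wait before re-invoking

def process(element, now):
    """Can't act on `element` until its ready_at time; ask to be resumed."""
    if now < element["ready_at"]:
        return ProcessContinuation(resume_delay=element["ready_at"] - now)
    return [element["value"]]

def run(element, start=0.0):
    """Toy runner loop: honors resume requests by advancing processing time."""
    now = start
    while True:
        result = process(element, now)
        if isinstance(result, ProcessContinuation):
            now += result.resume_delay  # runner waits, then re-invokes
        else:
            return now, result
```

For example, `run({"ready_at": 5.0, "value": "x"})` is re-invoked once at processing time 5.0 and only then emits its output.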
On Fri, Feb 23, 2024 at 7:21 PM Robert Bradshaw via dev <dev@beam.apache.org>
wrote:
I can't think of a batch or streaming scenario where it would be correct to not
wait at least that long
The main reason we created timers was to take action in the absence of data. The archetypal
use case for processing-time timers was/is "flush data from state if it has been
sitting there too long". For this use case, the right behavior for batch is to skip
the timer; it would actually be incorrect to wait.
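A plain-Python model of the flush use case described above (again a sketch, not Beam's stateful DoFn API) shows why waiting is wrong in batch: once the input is exhausted, no more data can arrive, so firing the pending flush immediately produces the same output as waiting out the timer on the wall clock.

```python
# Hypothetical simulation of the "flush stale state" processing-time timer
# pattern (plain Python, not Beam's API).

def run_flush_pipeline(events, flush_delay):
    """events: ordered list of (processing_time, value). Returns flushed batches."""
    buffer, deadline, output = [], None, []

    def flush():
        if buffer:
            output.append(list(buffer))
            buffer.clear()

    for t, value in events:
        if deadline is not None and t >= deadline:
            flush()  # the timer came due while waiting for this element
            deadline = None
        buffer.append(value)
        if deadline is None:
            deadline = t + flush_delay  # (re)set the processing-time flush timer

    # Batch end-of-input: no more data can ever arrive, so the correct behavior
    # is to fire the pending flush now rather than wall-clock wait for it.
    flush()
    return output
```

With `flush_delay=5`, the input `[(0, "a"), (1, "b"), (10, "c")]` yields `[["a", "b"], ["c"]]`: the first batch flushes when the timer expires mid-stream, and the trailing element flushes at end of input without any real waiting.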
Good point calling out the distinction between "I need to wait in case
there's more data" and "I need to wait for something external." We
can't currently distinguish between the two, but a batch runner can
say something definitive about the first. It feels like we need a new
primitive (or at least new signaling information on our existing
primitive).
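One hypothetical shape the "new signaling information" suggested above could take (this is speculation for illustration, not an existing or proposed Beam API): tag each processing-time timer with why it is waiting, so a batch runner can fire data-completeness timers eagerly at end of input while still honoring genuinely external waits.

```python
from enum import Enum

# Hypothetical sketch, not an existing Beam API: annotate processing-time
# timers with the reason they wait, so a batch runner knows which ones are
# safe to fire early once the input is exhausted.

class WaitReason(Enum):
    MORE_DATA = "more_data"  # waiting in case more input arrives
    EXTERNAL = "external"    # waiting on an external condition (e.g. a quota reset)

def batch_runner_should_wait(reason: WaitReason, input_exhausted: bool) -> bool:
    if reason is WaitReason.MORE_DATA:
        # No more data can come: firing the timer immediately is correct.
        return not input_exhausted
    # External condition: real processing time must pass, even in batch.
    return True
```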
Runners signal end of data to a DoFn via the (input) watermark. Is there
a need for additional information?
Yes, and I agree that watermarks/event timestamps are a much better
way to track data completeness (if possible).
Unfortunately, processing-time timers don't specify whether they're
waiting for additional data or for an external/environmental change,
meaning we can't use the (event-time) watermark to determine whether
they're safe to trigger.
+1
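To illustrate why event-time timers sidestep this problem: they fire when the input watermark passes their timestamp, and a batch runner can advance the watermark to +infinity once the input is exhausted, firing all pending timers deterministically with no wall-clock waiting. A plain-Python model of that firing rule (not Beam's API):

```python
import math

# Sketch of event-time timer firing: a timer fires once the watermark has
# passed its timestamp. At batch end-of-input the watermark jumps to +inf,
# so every pending timer fires immediately and deterministically.

def fire_event_time_timers(timers, watermark):
    """Given timer timestamps and the current watermark, return (fired, pending)."""
    fired = sorted(t for t in timers if t <= watermark)
    pending = [t for t in timers if t > watermark]
    return fired, pending

fired, pending = fire_event_time_timers([10.0, 25.0, 40.0], watermark=30.0)
# End of batch input: watermark advances to +inf, remaining timers fire.
fired_at_end, _ = fire_event_time_timers(pending, watermark=math.inf)
```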