I would like some way to expose watermarks to the user. They affect how
records are processed, so they are relevant to users.
`current_watermark()` is a good option.
The implementation might be engine-specific, but it is a very relevant
concept for authors of streaming pipelines.
Ideally
Slight correction/clarification: we now use the "previous" watermark to
determine late records. Records emitted by a previous (upstream) stateful
operator are valid inputs for non-first stateful operators, and dropping
records based on the same criteria would discard them. Please look back
which
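
To illustrate the point about the "previous" watermark, here is a minimal sketch in plain Python. It is not engine code: the `Record` type, the helper names, the watermark values, and the strict `<` comparison are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: str
    event_time: int  # event time in arbitrary units (hypothetical)


def is_late(record: Record, watermark: int) -> bool:
    # Assumption: a record is late when its event time is behind the watermark.
    return record.event_time < watermark


def downstream_inputs(records, previous_watermark, current_watermark, use_previous):
    """Filter the input of a non-first stateful operator.

    use_previous=True compares against the previous watermark,
    use_previous=False compares against the current watermark.
    """
    threshold = previous_watermark if use_previous else current_watermark
    return [r for r in records if not is_late(r, threshold)]


# Hypothetical batch: the watermark advanced from 10 to 20, and the upstream
# stateful operator emitted a result stamped with event time 15.
upstream_output = [Record("a", 15)]

# With the current watermark, the upstream result is dropped as late.
kept_with_current = downstream_inputs(upstream_output, 10, 20, use_previous=False)

# With the previous watermark, the upstream result is kept.
kept_with_previous = downstream_inputs(upstream_output, 10, 20, use_previous=True)
```

In this toy setup the downstream operator loses a valid upstream result only when it filters against the current watermark.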
We would prefer not to expose the internal mechanism to the public.
As you are a very detail-oriented engineer tracking major changes, you
might have noticed that we "changed" the definition of a late record while
fixing late records. Previously, a late record was defined as a record having
event time timest
Thank you for the clarification, Jungtaek 🙏 Indeed, it doesn't sound like
a feature in high demand among end users; I haven't seen it come up much on
StackOverflow or the mailing lists. I was just curious about the reasons.
Using arbitrary stateful processing could indeed be a workaround! But
IMHO it
Technically speaking, "late data" is data that cannot be processed because
the engine has already discarded the state associated with it.
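
As a rough illustration of that definition, the sketch below models a toy tumbling-window counter in plain Python. The class and method names are hypothetical and do not correspond to any engine's API. It assumes state for a window is evicted once the watermark reaches the window's end.

```python
class WindowedCounter:
    """Toy tumbling-window counter that evicts state as the watermark advances."""

    def __init__(self, window_size: int):
        self.window_size = window_size
        self.state = {}      # window start -> record count
        self.watermark = 0

    def _window_start(self, event_time: int) -> int:
        return event_time - event_time % self.window_size

    def advance_watermark(self, watermark: int) -> dict:
        """Move the watermark forward and evict windows that ended at or before it."""
        self.watermark = max(self.watermark, watermark)
        finalized = {}
        for start in sorted(self.state):
            if start + self.window_size <= self.watermark:
                finalized[start] = self.state.pop(start)
        return finalized

    def process(self, event_time: int) -> bool:
        """Return True if the record was counted, False if it was late."""
        start = self._window_start(event_time)
        if start + self.window_size <= self.watermark:
            # The state for this window is already gone, so the record is late.
            return False
        self.state[start] = self.state.get(start, 0) + 1
        return True


counter = WindowedCounter(window_size=10)
first = counter.process(3)                  # lands in window [0, 10)
second = counter.process(7)                 # lands in window [0, 10)
finalized = counter.advance_watermark(12)   # evicts window [0, 10)
late = counter.process(5)                   # its window was evicted
on_time = counter.process(11)               # window [10, 20) is still open
```

Here the record with event time 5 is rejected only because the state it would update has been evicted, not because of anything in the record itself.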
That said, the only reason watermarks exist in streaming is to handle
stateful operators. From the engine's point of view, there is