Re: Watermark on late data only

2023-10-10 Thread Raghu Angadi
I like some way to expose watermarks to the user. It does affect the processing of the records, so it is relevant for the users. `current_watermark()` is a good option. The implementation of this might be engine specific. But it is a very relevant concept for authors of streaming pipelines. Ideally

Re: Watermark on late data only

2023-10-10 Thread Jungtaek Lim
slight correction/clarification: We now take the "previous" watermark to determine the late record, because they are valid inputs for non-first stateful operators dropping records based on the same criteria would drop valid records from previous (upstream) stateful operators. Please look back which

Re: Watermark on late data only

2023-10-10 Thread Jungtaek Lim
We wouldn't like to expose the internal mechanism to the public. As you are a very detail oriented engineer tracking major changes, you might notice that we "changed" the definition of late record while fixing late records. Previously the late record is defined as a record having event time timest

Re: Watermark on late data only

2023-10-10 Thread Bartosz Konieczny
Thank you for the clarification, Jungtaek 🙏 Indeed, it doesn't sound like a highly demanded feature from the end users, haven't seen that a lot on StackOverflow or mailing lists. I was just curious about the reasons. Using the arbitrary stateful processing could be indeed a workaround! But IMHO it

Re: Watermark on late data only

2023-10-09 Thread Jungtaek Lim
Technically speaking, "late data" represents the data which cannot be processed due to the fact the engine threw out the state associated with the data already. That said, the only reason watermark does exist for streaming is to handle stateful operators. From the engine's point of view, there is