Hi,

This is an interesting topic. I suppose the watermark is defined based on
the event time since it's mainly used, or designed, for the event time
processing. Flink provides the event time processing mechanism because it's
widely needed. Every event has its event time and we usually need to group
or order by the event time. On the other hand, this also means that we can
process events from different sources as the event time is naturally of the
same scale.

However, just as you say, technically speaking the event timestamp can be
replaced with any other meaningful number (or event a comparable), and the
(event time) watermark should change accordingly. If we promise this field
and its watermark of all sources are of the same scale, we can process the
data/event from the sources together with it just like the event time. As
the event time processing and event time timer service doesn't rely on the
actual time point or duration, I suppose this can be implemented by
defining it as the event time, if it contains only positive numbers.


On Thu, Feb 2, 2023 at 5:18 PM Jan Lukavský <je...@seznam.cz> wrote:

> Hi,
>
> I will not speak about details related to Flink specifically, the
> concept of watermarks is more abstract, so I'll leave implementation
> details aside.
>
> Speaking generally, yes, there is a set of requirements that must be met
> in order to be able to generate a system that uses watermarks.
>
> The primary question is what are watermarks used for? The answer is - we
> need watermarks to be able to define a partially stable order of
> _events_. Event is an immutable piece of data that can be _observed_
> (i.e. processed) with various consumer-dependent delays (two consumers
> of the event can see the event at different processing times), or a
> specific (local) timestamp. Generally an event tells us that something,
> somewhere happened at given local timestamp.
>
> Watermarks create markers in processing time of each observer, so that
> the observer is able to tell if two events (e.g. event "close
> time-window T1" and "new data with timestamp T2 arrived") can be ordered
> (that is being able to tell which one is - globally! - preceding the
> other).
>
> Having said that - there is a general algebra for "timestamps" - and
> therefore watermarks. A timestamp can be any object that defines the
> following operations:
>
>   - a less-than relation <, i.e. t1 < t2: bool, this relation needs to
> be a antisymmetric, so t1 < t2 implies not t2 < t1
>
>   - a function min_timestamp_following(timestamp t1, duration):
> timestamp t2, that returns the minimal timestamp, for which t1 +
> duration < t2 (this function is actually a definition of duration)
>
> These two conditions allows to construct a working streaming processing
> system, which means there should be no problem using different
> "timestamps", provided we know how to construct the above.
>
> Using "a different number" for timestamps and watermarks seems valid in
> this sense, provided you are fine with the implicit definition of
> duration, that is currently defined as simple t2 - t1.
>
> I tried to explain why it is not good to expect that two events can be
> globally ordered and what is the actual role of watermarks in this in a
> twitter thread [1], if anyone interested.
>
> Best,
>
>   Jan
>
> [1]
>
> https://twitter.com/janl_apache/status/1478757956263071745?s=20&t=cMXfPHS8EjPrbF8jys43BQ
>
> On 2/2/23 00:18, Yaroslav Tkachenko wrote:
> > Hey everyone,
> >
> > I'm wondering if anyone has done any experiments trying to use
> > non-temporal watermarks? For example, a dataset may contain some kind
> > of virtual timestamp / version field that behaves just like a regular
> > timestamp (monotonically increasing, etc.), but has a different scale
> > / range.
> >
> > As far as I can see Flink assumes that the values used for event times
> > and watermark generation are actually timestamps and the Table API
> > requires you to define watermarks on TIMESTAMP columns.
> >
> > Practically speaking timestamp is just a number, so if I have a
> > "timeline" that consists of 1000 monotonically increasing integers,
> > for example, the concepts like late-arriving
> > data, bounded-out-of-orderness, etc. still work.
> >
> > Thanks for sharing any thoughts you might have on this topic!
>

Reply via email to