Interesting KIP. It's a known problem, and the proposed solution makes
sense to me.
Nit: it seems you are mixing the terms "out-of-order" and "late" and
using them as synonyms, which we usually don't do.
"Out-of-order" is the more generic term, while "late" means after the
grace period (hence, late arriving data is still out-of-order, but it's
a subset of out-of-order records: all late records are out-of-order,
but not the other way around). I think the KIP could gain clarity and
avoid potential confusion by not using both terms as synonyms.
In the end, based on this definition, "late" data cannot be handled at
all right now: "late" data is always dropped.
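To make the distinction concrete, here is a minimal sketch in plain Java
(no Kafka dependency). The class name, the `Kind` enum, and the
simplified stream-time and window-close rules are assumptions for
illustration only, not the actual Kafka Streams implementation:

```java
// Simplified model: tumbling windows of fixed size plus a grace period.
// All names here are hypothetical; this is not Kafka Streams code.
class LatenessSketch {
    enum Kind { IN_ORDER, OUT_OF_ORDER, LATE }

    private final long windowSizeMs;
    private final long graceMs;
    // Stream time = highest record timestamp seen so far (-1 = none yet).
    private long streamTime = -1L;

    LatenessSketch(long windowSizeMs, long graceMs) {
        this.windowSizeMs = windowSizeMs;
        this.graceMs = graceMs;
    }

    // Classifies a record by its (non-negative) timestamp, in arrival order.
    Kind classify(long timestampMs) {
        boolean outOfOrder = timestampMs < streamTime;
        streamTime = Math.max(streamTime, timestampMs);

        long windowStart = timestampMs - (timestampMs % windowSizeMs);
        long windowEnd = windowStart + windowSizeMs;

        // Assumed rule: the window is closed once stream time has reached
        // window end plus grace; such a record would be dropped.
        if (windowEnd + graceMs <= streamTime) {
            return Kind.LATE;
        }
        return outOfOrder ? Kind.OUT_OF_ORDER : Kind.IN_ORDER;
    }

    public static void main(String[] args) {
        LatenessSketch sketch = new LatenessSketch(10L, 5L);
        for (long ts : new long[] {3L, 12L, 8L, 20L, 9L}) {
            System.out.println(ts + " -> " + sketch.classify(ts));
        }
    }
}
```

With a window size of 10 and a grace period of 5, the record with
timestamp 8 that arrives after timestamp 12 is out-of-order but still
accepted, while the record with timestamp 9 that arrives after
timestamp 20 is late. In this model an in-order record can never be
late, which matches the subset relationship described above.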
> Today the only way to implement this semantic is using a punctuation and
> manually storing events in an aggregation (which is inefficient for many
> reasons and can cause severe bottlenecks)
Not sure if I follow this argument? I don't see any relationship to
punctuations. Can you elaborate?
It's also not clear to me why "storing events in an aggregation" is
problematic, and/or how this new window type would avoid it?
> extends Windows<TimeWindow>
Not sure if this is the right thing to do? I understand your intent,
but to me, `TimeWindow` has certain semantics, implying we only put data
with record-ts between the window bounds into the window. Not sure if my
interpretation is off, and re-using it could be ok? It's just something
we should figure out, I guess.
To be fair: it's not 100% clearly specified whether these semantics
apply or not. If we believe they don't apply, happy to pivot on this,
but we might want to update the JavaDocs if we only want to use it as a
"container class" (for lack of a better word).
Which actually raises a different issue: I just realized that
`TimeWindow` is in an `internals` package, but `TimeWindows#windowsFor`
does return `Map<Long, TimeWindow>`, thus making it effectively kinda
public... It seems we should do something about this? End users do not
really need to use `TimeWindow` -- it's really internal, and it seems
`windowsFor` is effectively a leaky abstraction...
If we keep `TimeWindow` internal, I think we can easily be more relaxed
with the applied semantics. But it's kinda leaking right now, which does
not make me happy... (Of course, I also don't want to derail and hijack
this KIP for some related, but orthogonal issue in the API...)
One other nit I am wondering about is whether it might be good to
specify the abstract methods inherited from `Windows` in the KIP for
clarity? Even if they are implicitly well specified, it might be
beneficial to spell them out. After all, `StreamTimeWindows` needs to
implement these methods.
And of course:
> Note: I really don't like the name I came up with StreamTimeWindows... if
> anyone has a better name in mind please suggest one!
I am not a fan of the name either, but I'm not really sure what a better
name could be. In the end, it's more of an "offset ordered window"
rather than a "time window", because it does not really take the record
timestamps into account to figure out which window a record goes into;
it really uses the offset order.
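A rough sketch of the difference I have in mind, in plain Java. This is
only my reading of the proposed semantics, assuming the window is picked
from the stream time at arrival; the class and method names are made up
for illustration and are not taken from the KIP or from Kafka Streams:

```java
// Hypothetical comparison of two window-assignment rules for
// non-overlapping windows; timestamps are assumed to be non-negative.
class WindowAssignmentSketch {
    private final long sizeMs;
    // Stream time = highest record timestamp seen so far (-1 = none yet).
    private long streamTime = -1L;

    WindowAssignmentSketch(long sizeMs) {
        this.sizeMs = sizeMs;
    }

    // Timestamp-based: the window start depends only on the record's
    // own timestamp, regardless of when the record arrives.
    long byRecordTimestamp(long timestampMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    // Arrival-order-based (assumed semantics): the window start depends
    // on the stream time at the moment the record arrives.
    long byArrivalOrder(long timestampMs) {
        streamTime = Math.max(streamTime, timestampMs);
        return streamTime - (streamTime % sizeMs);
    }

    public static void main(String[] args) {
        WindowAssignmentSketch sketch = new WindowAssignmentSketch(10L);
        for (long ts : new long[] {3L, 12L, 8L}) {
            System.out.println("ts=" + ts
                + " byRecordTimestamp=" + sketch.byRecordTimestamp(ts)
                + " byArrivalOrder=" + sketch.byArrivalOrder(ts));
        }
    }
}
```

Here the record with timestamp 8 that arrives after timestamp 12 would
go into the window starting at 0 under the timestamp-based rule, but
into the window starting at 10 under the arrival-order rule.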
I believe that the size of the window itself does not matter too much
for naming, so even if the size is defined in time, I don't think that
the term "time" must be part of the window name. (That's also why I am
wondering if we should go with `extends Windows<TimeWindow>` or add
something new, which we hopefully can keep internal...)
Not sure if we ever intend to make `StreamTimeWindows` overlapping?
Given the described use-cases it seems very unlikely? So effectively
it's a "tumbling window". Of course, we should not just overload this
term, but there is no `TumblingWindows` in the API -- only the term
Tumbling Window in the docs to describe the special case of
non-overlapping `TimeWindows`.
So maybe `OffsetTumblingWindows` could work? Or `OffsetOrderedWindows`?
Just spitballing here, to get a discussion going...
-Matthias
On 1/22/25 8:26 PM, Almog Gavra wrote:
Hello!
I'd like to initiate a discussion thread on KIP-1127:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-1127+Flexible+Windows+for+Late+Arriving+Data
This KIP aims to make it easier to specify windowing semantics that are
more tolerable to late arriving data, particularly with suppression.
Would appreciate any and all feedback.
Cheers,
Almog