Thanks for the KIP Ivan. Having a built-in deduplication operator is for
sure a good addition.

Couple of questions:

(1) Using the `idExtractor` has the issue that data might not be
co-partitioned as you mentioned in the KIP. Thus, I am wondering if it
might be better to do deduplication only on the key? If one sets a new
key upstream (ie, extracts the deduplication id into the key), the
`distinct` operator could automatically repartition the data and thus we
would avoid user errors.

(2) What is the motivation for allowing the `idExtractor` to return
`null`? Might be good to have some use-case examples for this feature.

(2) Is using a `TimeWindow` really what we want? I was wondering if a
`SlidingWindow` might be better? Or maybe we need a new type of window?

It would be helpful if you could describe potential use cases in more
detail. -- I am mainly wondering about hopping window? Each record would
always falls into multiple window and thus would be emitted multiple
times, ie, each time the window closes. Is this really a valid use case?

It seems that for de-duplication, one wants to have some "expiration
time", ie, for each ID, deduplicate all consecutive records with the
same ID and emit the first record after the "expiration time" passed. In
terms of a window, this would mean that the window starts at `r.ts` and
ends at `r.ts + windowSize`, ie, the window is aligned to the data.
TimeWindows are aligned to the epoch though. While `SlidingWindows` also
align to the data, for the aggregation use-case they go backward in
time, while we need a window that goes forward in time. It's an open
question if we can re-purpose `SlidingWindows` -- it might be ok the
make the alignment (into the past vs into the future) an operator
dependent behavior?

(3) `isPersistent` -- instead of using this flag, it seems better to
allow users to pass in a `Materialized` parameter next to
`DistinctParameters` to configure the state store?

(4) I am wondering if we should really have 4 overloads for
`DistinctParameters.with()`? It might be better to have one overload
with all require parameters, and add optional parameters using the
builder pattern? This seems to follow the DSL Grammer proposal.

(5) Even if it might be an implementation detail (and maybe the KIP
itself does not need to mention it), can you give a high level overview
how you intent to implement it (that would be easier to grog, compared
to reading the PR).



-Matthias

On 8/23/20 4:29 PM, Ivan Ponomarev wrote:
> Sorry, I forgot to add [DISCUSS] tag to the topic
> 
> 24.08.2020 2:27, Ivan Ponomarev пишет:
>> Hello,
>>
>> I'd like to start a discussion for KIP-655.
>>
>> KIP-655:
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-655%3A+Windowed+Distinct+Operation+for+Kafka+Streams+API
>>
>>
>> I also opened a proof-of-concept PR for you to experiment with the API:
>>
>> PR#9210: https://github.com/apache/kafka/pull/9210
>>
>> Regards,
>>
>> Ivan Ponomarev
> 

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to