Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-08-16 Thread Matthias J. Sax
Thanks for the use case details. I think it actually aligns to my proposal, especially, as you say you want to react to changes asap: if you apply an "aggregation" the result would be emitted at the end (or close) of the window, and you cannot react to changes right away. But your use case also po

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-08-09 Thread Ivan Ponomarev
> To recap: Could it be that the idea to apply a DISTINCT-aggregation is > for a different use-case than to remove duplicate messages from a KStream? OK, imagine the following: We have 10 thermometers. Each second they transmit the current measured temperature. The id of the thermometer i

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-08-08 Thread Matthias J. Sax
Thanks for sharing your thoughts. I guess my first question about why using the key boils down to the use case, and maybe you have something else in mind than I do. >> I see it this way: we define 'distinct' operation as returning a single >> record per time window per selected key, I believe thi

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-08-06 Thread Ivan Ponomarev
> - Why restrict de-duplication for the key only? Should we not also > consider the value (or make it somehow flexible and let the user choose?) Wasn't it the first idea that we abandoned (I mean, to provide 'KeyExtractor' and so on)? In order to keep things simple we decided to make .selec

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-08-03 Thread Matthias J. Sax
Couple of questions: - Why restrict de-duplication for the key only? Should we not also consider the value (or make it somehow flexible and let the user choose?) - I am wondering if the return type should be `KStream` instead of a `KTable`? If I deduplicate a stream, I might expect a stream bac

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-08-01 Thread Ivan Ponomarev
Hi Bruno, I'm sorry for the delay with the answer. Unfortunately your messages were put to spam folder, that's why I didn't answer them right away. Concerning your question about comparing serialized values vs. using equals: I think it must be clear now due to John's explanations. Distinct i

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-07-28 Thread John Roesler
Hi Bruno, I had previously been thinking to use equals(), since I thought that this might be a stateless operation. Comparing the serialized form requires a serde and a fairly expensive serialization operation, so while byte equality is superior to equals(), we shouldn't use it in operations unles

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-07-26 Thread Ivan Ponomarev
Hi! I have updated the KIP with the definition of what is considered to be a duplicated record ("The records are considered to be duplicates iff serialized forms of their keys are equal.") I have also added all the standard overloads for the distinct() method. These overloads are the same as

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-07-23 Thread Bruno Cadonna
Hi Ivan and John, 1. John, could you clarify why comparing serialized values seems the way to go, now? 2. Ivan, Could you please answer my questions that I posted earlier? I will repost it here: Ivan, could you please make this matter a bit clearer in the KIP? Actually, thinking about it aga

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-07-22 Thread John Roesler
Hi Ivan, Sorry for the late amendment, but I was just reviewing the KIP again for your vote thread, and it struck me that, if this is a stateful operation, we need a few more overloads. Following the example from the other stateful windowed operators, we should have: distinct() distinct(Named) d

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-07-22 Thread John Roesler
Hi Ivan, Thanks for the reply. 1. I think I might have gotten myself confused. I was thinking of this operation as stateless, but now I'm not sure what I was thinking... This operator has to be stateful, right? In that case, I agree that comparing serialized values seems to be way to do it. 2. T

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-07-20 Thread Ivan Ponomarev
Hi all, 1. Actually I always thought about the serialized byte array only -- at least this is what local stores depend upon, and what Kafka itself depends upon when doing log compaction. I can imagine a case where two different byte arrays deserialize to objects which are `equals` to each ot

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-07-13 Thread Bruno Cadonna
Hi, Thanks for your comments, John! 1. equals() seems to be quite flexible in this case. I guess binary comparison would not work well if the timestamp is part of the value. 2. Then this seems to be a misunderstanding on my side. Ivan, could you please make this matter a bit clearer in the K

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-07-12 Thread John Roesler
Hi all, Bruno raised some very good points. I’d like to chime in with additional context. 1. Great point. We faced a similar problem defining KIP-557. For 557, we chose to use the serialized byte array instead of the equals() method, but I think the situation in KIP-655 is a bit different. I

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-07-12 Thread Bruno Cadonna
Hi Ivan, Thank you for the KIP! Some aspects are not clear to me from the KIP and I have a proposal. 1. The KIP does not describe the criteria that define a duplicate. Could you add a definition of duplicate to the KIP? 2. The KIP does not describe what happens if distinct() is applied on a

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-07-11 Thread Ivan Ponomarev
Hi, John! > To summarize, you are now only proposing the zero-arg distict() method to be added to TimeWindowedKStream and SessionWindowedKStream, right? Right, yes, it's that simple! Okay, since you and Israel Ekpo both agreed, I'll start voting thread and will start the implementation. Bu

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-07-11 Thread Israel Ekpo
Hello Ivan, These are the two proposal you have shared with us KIP-655: Windowed Distinct Operation for Kafka Streams API https://cwiki.apache.org/confluence/display/KAFKA/KIP-655%3A+Windowed+Distinct+Operation+for+Kafka+Streams+API KIP-759: Unneeded repartition canceling https://cwiki.apache.or

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-07-10 Thread John Roesler
Hi Ivan, Sorry for the silence! I have just re-read the proposal. To summarize, you are now only proposing the zero-arg distict() method to be added to TimeWindowedKStream and SessionWindowedKStream, right? I’m in favor of this proposal. Thanks, John On Sat, Jul 10, 2021, at 10:18, Ivan Pon

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-07-10 Thread Ivan Ponomarev
Hello everyone, I would like to remind you about KIP-655 and KIP-759 just in case they got lost in your inbox. Now the initial proposal is split into two independent and smaller ones, so it must be easier to review them. Of course, if you have time. Regards, Ivan 24.06.2021 18:11, Ivan P

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-06-24 Thread Mohan Parthasarathy
Ivan, I read through the discussion and your new proposal. I have a couple of questions. 1. As we have cancelRepartition, wouldn't selectKey be sufficient. You still have idExtractor. Maybe I misunderstood the discussion. 2. isPersistent should be replaced by Materialized. It looked like you agre

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-06-24 Thread Ivan Ponomarev
Hello all, I have rewritten the KIP-655 summarizing what was agreed upon during this discussion (now the proposal is much simpler and less invasive). I have also created KIP-759 (cancelRepartition operation) and started a discussion for it. Regards, Ivan. 04.06.2021 8:15, Matthias J. Sa

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-06-03 Thread Matthias J. Sax
Just skimmed over the thread -- first of all, I am glad that we could merge KIP-418 and ship it :) About the re-partitioning concerns, there are already two tickets for it: - https://issues.apache.org/jira/browse/KAFKA-4835 - https://issues.apache.org/jira/browse/KAFKA-10844 Thus, it seems bes

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-06-02 Thread John Roesler
Thanks, Ivan! That sounds like a great plan to me. Two smaller KIPs are easier to agree on than one big one. I agree hopping and sliding windows will actually have a duplicating effect. We can avoid adding distinct() to the sliding window interface, but hopping windows are just a different pa

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-05-26 Thread Ivan Ponomarev
Hi John! I think that your proposal is just fantastic, it simplifies things a lot! I also felt uncomfortable due to the fact that the proposed `distinct()` is not somewhere near `count()` and `reduce(..)`. But `selectKey(..).groupByKey().windowedBy(..).distinct()` didn't look like a correct o

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-05-24 Thread John Roesler
Hey there, Ivan! In typical fashion, I'm going to make a somewhat outlandish proposal. I'm hoping that we can side-step some of the complications that have arisen. Please bear with me. It seems like `distinct()` is not fundamentally unlike other windowed "aggregation" operations. Your concern abo

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2021-05-23 Thread Ivan Ponomarev
Hello everyone, let me revive the discussion for KIP-655. Now I have some time again and I'm eager to finalize this. Based on what was already discussed, I think that we can split the discussion into three topics for our convenience. The three topics are: - idExtractor (how should we extr

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2020-09-14 Thread Sophie Blee-Goldman
Hey all, I'm not convinced either epoch-aligned or data-aligned will fit all possible use cases. Both seem totally reasonable to me: data-aligned is useful for example when you know that a large number of updates to a single key will occur in short bursts, and epoch- aligned when you specifically

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2020-09-14 Thread Ivan Ponomarev
Hi Matthias, Thanks for your review! It made me think deeper, and indeed I understood that I was missing some important details. To simplify, let me explain my particular use case first so I can refer to it later. We have a system that collects information about ongoing live sporting event

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2020-09-02 Thread Matthias J. Sax
Thanks for the KIP Ivan. Having a built-in deduplication operator is for sure a good addition. Couple of questions: (1) Using the `idExtractor` has the issue that data might not be co-partitioned as you mentioned in the KIP. Thus, I am wondering if it might be better to do deduplication only on t

Re: [DISCUSS] KIP-655: Windowed "Distinct" Operation for KStream

2020-08-23 Thread Ivan Ponomarev
Sorry, I forgot to add [DISCUSS] tag to the topic 24.08.2020 2:27, Ivan Ponomarev пишет: Hello, I'd like to start a discussion for KIP-655. KIP-655: https://cwiki.apache.org/confluence/display/KAFKA/KIP-655%3A+Windowed+Distinct+Operation+for+Kafka+Streams+API I also opened a proof-of-conc