Thanks for the use case details. I think it actually aligns to my
proposal, especially, as you say you want to react to changes asap: if
you apply an "aggregation" the result would be emitted at the end (or
close) of the window, and you cannot react to changes right away.
But your use case also po
> To recap: Could it be that the idea to apply a DISTINCT-aggregation is
> for a different use-case than to remove duplicate messages from a
KStream?
OK, imagine the following:
We have 10 thermometers. Each second they transmit the current
measured temperature. The id of the thermometer i
Thanks for sharing your thoughts. I guess my first question about why
using the key boils down to the use case, and maybe you have something
else in mind than I do.
>> I see it this way: we define 'distinct' operation as returning a single
>> record per time window per selected key,
I believe thi
> - Why restrict de-duplication for the key only? Should we not also
> consider the value (or make it somehow flexible and let the user choose?)
Wasn't it the first idea that we abandoned (I mean, to provide
'KeyExtractor' and so on)?
In order to keep things simple we decided to make
.selec
Couple of questions:
- Why restrict de-duplication for the key only? Should we not also
consider the value (or make it somehow flexible and let the user choose?)
- I am wondering if the return type should be `KStream` instead of a
`KTable`? If I deduplicate a stream, I might expect a stream bac
Hi Bruno,
I'm sorry for the delay with the answer. Unfortunately your messages
were put to spam folder, that's why I didn't answer them right away.
Concerning your question about comparing serialized values vs. using
equals: I think it must be clear now due to John's explanations.
Distinct i
Hi Bruno,
I had previously been thinking to use equals(), since I
thought that this might be a stateless operation. Comparing
the serialized form requires a serde and a fairly expensive
serialization operation, so while byte equality is superior
to equals(), we shouldn't use it in operations unles
Hi!
I have updated the KIP with the definition of what is considered to be a
duplicated record ("The records are considered to be duplicates iff
serialized forms of their keys are equal.")
I have also added all the standard overloads for the distinct() method.
These overloads are the same as
Hi Ivan and John,
1. John, could you clarify why comparing serialized values seems the way
to go, now?
2. Ivan, Could you please answer my questions that I posted earlier? I
will repost it here:
Ivan, could you please make this matter a bit clearer in the KIP?
Actually, thinking about it aga
Hi Ivan,
Sorry for the late amendment, but I was just reviewing the
KIP again for your vote thread, and it struck me that, if
this is a stateful operation, we need a few more overloads.
Following the example from the other stateful windowed
operators, we should have:
distinct()
distinct(Named)
d
Hi Ivan,
Thanks for the reply.
1. I think I might have gotten myself confused. I was
thinking of this operation as stateless, but now I'm not
sure what I was thinking... This operator has to be
stateful, right? In that case, I agree that comparing
serialized values seems to be way to do it.
2. T
Hi all,
1. Actually I always thought about the serialized byte array only -- at
least this is what local stores depend upon, and what Kafka itself
depends upon when doing log compaction.
I can imagine a case where two different byte arrays deserialize to
objects which are `equals` to each ot
Hi,
Thanks for your comments, John!
1. equals() seems to be quite flexible in this case. I guess binary
comparison would not work well if the timestamp is part of the value.
2. Then this seems to be a misunderstanding on my side. Ivan, could you
please make this matter a bit clearer in the K
Hi all,
Bruno raised some very good points. I’d like to chime in with additional
context.
1. Great point. We faced a similar problem defining KIP-557. For 557, we chose
to use the serialized byte array instead of the equals() method, but I think
the situation in KIP-655 is a bit different. I
Hi Ivan,
Thank you for the KIP!
Some aspects are not clear to me from the KIP and I have a proposal.
1. The KIP does not describe the criteria that define a duplicate. Could
you add a definition of duplicate to the KIP?
2. The KIP does not describe what happens if distinct() is applied on a
Hi, John!
> To summarize, you are now only proposing the zero-arg distict()
method to be added to TimeWindowedKStream and SessionWindowedKStream, right?
Right, yes, it's that simple!
Okay, since you and Israel Ekpo both agreed, I'll start voting thread
and will start the implementation.
Bu
Hello Ivan,
These are the two proposal you have shared with us
KIP-655: Windowed Distinct Operation for Kafka Streams API
https://cwiki.apache.org/confluence/display/KAFKA/KIP-655%3A+Windowed+Distinct+Operation+for+Kafka+Streams+API
KIP-759: Unneeded repartition canceling
https://cwiki.apache.or
Hi Ivan,
Sorry for the silence!
I have just re-read the proposal.
To summarize, you are now only proposing the zero-arg distict() method to be
added to TimeWindowedKStream and SessionWindowedKStream, right?
I’m in favor of this proposal.
Thanks,
John
On Sat, Jul 10, 2021, at 10:18, Ivan Pon
Hello everyone,
I would like to remind you about KIP-655 and KIP-759 just in case they
got lost in your inbox.
Now the initial proposal is split into two independent and smaller ones,
so it must be easier to review them. Of course, if you have time.
Regards,
Ivan
24.06.2021 18:11, Ivan P
Ivan,
I read through the discussion and your new proposal. I have a couple of
questions.
1. As we have cancelRepartition, wouldn't selectKey be sufficient. You
still have idExtractor. Maybe I misunderstood the discussion.
2. isPersistent should be replaced by Materialized. It looked like you
agre
Hello all,
I have rewritten the KIP-655 summarizing what was agreed upon during
this discussion (now the proposal is much simpler and less invasive).
I have also created KIP-759 (cancelRepartition operation) and started a
discussion for it.
Regards,
Ivan.
04.06.2021 8:15, Matthias J. Sa
Just skimmed over the thread -- first of all, I am glad that we could
merge KIP-418 and ship it :)
About the re-partitioning concerns, there are already two tickets for it:
- https://issues.apache.org/jira/browse/KAFKA-4835
- https://issues.apache.org/jira/browse/KAFKA-10844
Thus, it seems bes
Thanks, Ivan!
That sounds like a great plan to me. Two smaller KIPs are easier to agree on
than one big one.
I agree hopping and sliding windows will actually have a duplicating effect. We
can avoid adding distinct() to the sliding window interface, but hopping
windows are just a different pa
Hi John!
I think that your proposal is just fantastic, it simplifies things a lot!
I also felt uncomfortable due to the fact that the proposed `distinct()`
is not somewhere near `count()` and `reduce(..)`. But
`selectKey(..).groupByKey().windowedBy(..).distinct()` didn't look like
a correct o
Hey there, Ivan!
In typical fashion, I'm going to make a somewhat outlandish
proposal. I'm hoping that we can side-step some of the
complications that have arisen. Please bear with me.
It seems like `distinct()` is not fundamentally unlike other windowed
"aggregation" operations. Your concern abo
Hello everyone,
let me revive the discussion for KIP-655. Now I have some time again and
I'm eager to finalize this.
Based on what was already discussed, I think that we can split the
discussion into three topics for our convenience.
The three topics are:
- idExtractor (how should we extr
Hey all,
I'm not convinced either epoch-aligned or data-aligned will fit all
possible use cases.
Both seem totally reasonable to me: data-aligned is useful for example when
you know
that a large number of updates to a single key will occur in short bursts,
and epoch-
aligned when you specifically
Hi Matthias,
Thanks for your review! It made me think deeper, and indeed I understood
that I was missing some important details.
To simplify, let me explain my particular use case first so I can refer
to it later.
We have a system that collects information about ongoing live sporting
event
Thanks for the KIP Ivan. Having a built-in deduplication operator is for
sure a good addition.
Couple of questions:
(1) Using the `idExtractor` has the issue that data might not be
co-partitioned as you mentioned in the KIP. Thus, I am wondering if it
might be better to do deduplication only on t
Sorry, I forgot to add [DISCUSS] tag to the topic
24.08.2020 2:27, Ivan Ponomarev пишет:
Hello,
I'd like to start a discussion for KIP-655.
KIP-655:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-655%3A+Windowed+Distinct+Operation+for+Kafka+Streams+API
I also opened a proof-of-conc
30 matches
Mail list logo