[jira] [Comment Edited] (KAFKA-8770) Either switch to or add an option for emit-on-change

Richard Yu (Jira) Fri, 10 Jan 2020 14:10:46 -0800


    [ 
https://issues.apache.org/jira/browse/KAFKA-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013230#comment-17013230
 ]


Richard Yu edited comment on KAFKA-8770 at 1/10/20 10:09 PM:
-------------------------------------------------------------

[~vvcephei] Actually, I did think of something that might be useful as a 
performance enhancement. As mentioned in the JIRA description, Kafka Streams 
would load the prior result and compare it with the new one. However, loading 
the full prior result could still slow processing considerably. I propose 
that instead of loading the prior result, we fetch only the hash code of that 
prior result.

If the update is a no-op, the hash code of the prior result will equal the 
hash code of the new result. However, if the result has _changed,_ then, 
provided the hash code function has been implemented correctly, the hash code 
should have changed as well (barring a collision). Therefore, what should be 
done is the following:
 # We keep the hash codes of prior results in a state store or whatever other 
storage we have available. 
 # Whenever we obtain a new processed result, we retrieve the corresponding 
prior hash code to see whether it has changed. 
 # We update the store / table as necessary if the hash code has changed. 
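
The three steps above could be sketched roughly as follows. This is only an 
illustration under stated assumptions, not Kafka Streams code: the class 
HashEmitOnChange, the method shouldEmit, and the in-memory map standing in for 
a state store are all hypothetical names, and results are assumed to be 
available as serialized bytes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these names exist in Kafka Streams.
class HashEmitOnChange {
    // Step 1: stand-in for a store that maps each key to the hash code
    // of the last result that was forwarded for it.
    private final Map<String, Integer> priorHashes = new HashMap<>();

    // Returns true if the new result should be forwarded downstream.
    boolean shouldEmit(String key, byte[] serializedResult) {
        int newHash = Arrays.hashCode(serializedResult);
        // Step 2: look up only the prior hash code, not the prior result.
        Integer oldHash = priorHashes.get(key);
        if (oldHash != null && oldHash == newHash) {
            return false; // same hash: treat the update as a no-op
        }
        // Step 3: the hash changed (or the key is new), so update the store.
        priorHashes.put(key, newHash);
        return true;
    }
}
```

One caveat with this sketch: a 32-bit hash code can collide, so a result that 
really did change would occasionally be treated as a no-op. A wider digest 
would make that less likely but would not rule it out.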


> Either switch to or add an option for emit-on-change
> ----------------------------------------------------
>
>                 Key: KAFKA-8770
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8770
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: John Roesler
>            Priority: Major
>              Labels: needs-kip
>
> Currently, Streams offers two emission models:
> * emit-on-window-close: (using Suppression)
> * emit-on-update: (i.e., emit a new result whenever a new record is 
> processed, regardless of whether the result has changed)
> There is also an option to drop some intermediate results, either using 
> caching or suppression.
> However, there is no support for emit-on-change, in which results would be 
> forwarded only if the result has changed. This has been reported to be 
> extremely valuable as a performance optimization for some high-traffic 
> applications, and it reduces the computational burden both internally, for 
> downstream Streams operations, and for external systems that consume the 
> results and currently have to deal with a lot of "no-op" changes.
> It would be pretty straightforward to implement this by loading the prior 
> result before a stateful operation and comparing it with the new result 
> before persisting or forwarding. In many cases, we load the prior result 
> anyway, so the performance impact may not be significant either.
> One design challenge is what to do with timestamps. If we get one record at 
> time 1 that produces a result, and then another at time 2 that produces a 
> no-op, what should be the timestamp of the result, 1 or 2? emit-on-change 
> would require us to say 1.
> Clearly, we'd need to do some serious benchmarks to evaluate any potential 
> implementation of emit-on-change.
> Another design challenge is to decide if we should just automatically provide 
> emit-on-change for stateful operators, or if it should be configurable. 
> Configuration increases complexity, so unless the performance impact is high, 
> we may just want to change the emission model without a configuration.
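
For comparison, the load-and-compare approach from the issue description, 
together with its timestamp rule (a no-op at time 2 leaves the result stamped 
with time 1), might look something like the sketch below. Everything here is 
an assumption made for illustration: CompareEmitOnChange, Stamped, and update 
are hypothetical names, values are plain strings, and a HashMap stands in for 
the state store.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;

// Hypothetical sketch; none of these names exist in Kafka Streams.
class CompareEmitOnChange {
    // A result paired with the timestamp of the record that produced it.
    static final class Stamped {
        final String value;
        final long timestamp;

        Stamped(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Stand-in for a state store holding the last forwarded result per key.
    private final Map<String, Stamped> store = new HashMap<>();

    // Returns the result to forward, or empty if the update is a no-op.
    Optional<Stamped> update(String key, String newValue, long recordTimestamp) {
        Stamped prior = store.get(key); // load the prior result
        if (prior != null && Objects.equals(prior.value, newValue)) {
            // No-op: keep the stored value and its original timestamp.
            return Optional.empty();
        }
        Stamped next = new Stamped(newValue, recordTimestamp);
        store.put(key, next); // persist before forwarding
        return Optional.of(next);
    }

    // Exposes the stored result so the retained timestamp can be inspected.
    Stamped get(String key) {
        return store.get(key);
    }
}
```

With this sketch, a record at time 1 that yields "x" is forwarded, a record at 
time 2 that yields "x" again is dropped, and the stored result still carries 
timestamp 1.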



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
