[ https://issues.apache.org/jira/browse/KAFKA-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16940802#comment-16940802 ]
Di Campo commented on KAFKA-8770: --------------------------------- This comes from the following use case: Relation between pageview and session are sent down to a separate topic (i.e. to which session this pageview corresponds to, as a {{(pageviewId, sessionId)}} ). As more pageviews are added to the session, they are sent to (in a {{sessionStore.toStream.flatMap}} fashion, where pageview ids are kept in the aggregator). The relation of a pageview with its session may change, whether because of a session merge, a session cut by custom logic, etc. But most times it is the same sessionId value, so I want to have a first value rightaway to have near-realtime associations, but lower the traffic that is sent by unnecessary updates. > Either switch to or add an option for emit-on-change > ---------------------------------------------------- > > Key: KAFKA-8770 > URL: https://issues.apache.org/jira/browse/KAFKA-8770 > Project: Kafka > Issue Type: Improvement > Components: streams > Reporter: John Roesler > Priority: Major > Labels: needs-kip > > Currently, Streams offers two emission models: > * emit-on-window-close: (using Suppression) > * emit-on-update: (i.e., emit a new result whenever a new record is > processed, regardless of whether the result has changed) > There is also an option to drop some intermediate results, either using > caching or suppression. > However, there is no support for emit-on-change, in which results would be > forwarded only if the result has changed. This has been reported to be > extremely valuable as a performance optimizations for some high-traffic > applications, and it reduces the computational burden both internally for > downstream Streams operations, as well as for external systems that consume > the results, and currently have to deal with a lot of "no-op" changes. > It would be pretty straightforward to implement this, by loading the prior > results before a stateful operation and comparing with the new result before > persisting or forwarding. In many cases, we load the prior result anyway, so > it may not be a significant performance impact either. > One design challenge is what to do with timestamps. If we get one record at > time 1 that produces a result, and then another at time 2 that produces a > no-op, what should be the timestamp of the result, 1 or 2? emit-on-change > would require us to say 1. > Clearly, we'd need to do some serious benchmarks to evaluate any potential > implementation of emit-on-change. > Another design challenge is to decide if we should just automatically provide > emit-on-change for stateful operators, or if it should be configurable. > Configuration increases complexity, so unless the performance impact is high, > we may just want to change the emission model without a configuration. -- This message was sent by Atlassian Jira (v8.3.4#803005)