[jira] [Commented] (KAFKA-3542) Add "repartition (+ join)" operations to streams

Guozhang Wang (JIRA) Mon, 11 Apr 2016 18:29:53 -0700

    [ 
https://issues.apache.org/jira/browse/KAFKA-3542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236401#comment-15236401
 ]


Guozhang Wang commented on KAFKA-3542:
--------------------------------------

That is a fair point.

Currently we let users specify a selector for KTable aggregations, which we 
are working to extract out of the "aggregate" operator into a separate 
"groupBy" operator (KAFKA-3337). As for KStreams, right now we only allow 
users to join / aggregate on the record keys; if people want to use different 
keys for joining, they first need to "map" to the new key. 
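
To make the "map to the new key, then join on the record key" idea concrete, 
here is a minimal plain-Java sketch. It deliberately uses no Kafka Streams 
classes: RekeyJoinSketch, rekey and joinByKey are hypothetical names, records 
are modeled as Map.Entry pairs, and the "u1:book" value layout is an 
assumption made only for this illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Plain-Java model of "map to a new key, then join on the key".
// Hypothetical names throughout; this is not the Kafka Streams API.
class RekeyJoinSketch {

    // Re-key every record: keep the value, derive the new key from (key, value).
    static <K, V, K2> List<Map.Entry<K2, V>> rekey(
            List<Map.Entry<K, V>> records, BiFunction<K, V, K2> selector) {
        List<Map.Entry<K2, V>> out = new ArrayList<>();
        for (Map.Entry<K, V> r : records) {
            K2 newKey = selector.apply(r.getKey(), r.getValue());
            out.add(new SimpleEntry<>(newKey, r.getValue()));
        }
        return out;
    }

    // Key-equality join: one output line per pair of records with equal keys.
    static <K, L, R> List<String> joinByKey(
            List<Map.Entry<K, L>> left, List<Map.Entry<K, R>> right) {
        Map<K, List<R>> index = new HashMap<>();
        for (Map.Entry<K, R> r : right) {
            index.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<K, L> l : left) {
            for (R r : index.getOrDefault(l.getKey(), Collections.<R>emptyList())) {
                out.add(l.getKey() + " -> (" + l.getValue() + ", " + r + ")");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Orders are keyed by order id; the value starts with the user id (foreign key).
        List<Map.Entry<String, String>> orders = new ArrayList<>();
        orders.add(new SimpleEntry<>("o1", "u1:book"));
        orders.add(new SimpleEntry<>("o2", "u2:pen"));

        // Users are keyed by user id.
        List<Map.Entry<String, String>> users = new ArrayList<>();
        users.add(new SimpleEntry<>("u1", "alice"));
        users.add(new SimpleEntry<>("u2", "bob"));

        // "map" step: switch the order key from order id to user id.
        List<Map.Entry<String, String>> byUser =
                rekey(orders, (orderId, value) -> value.split(":")[0]);

        // The join now only needs key equality.
        joinByKey(byUser, users).forEach(System.out::println);
    }
}
```

The point of the sketch is only that the join itself never looks inside the 
value; all of the "join on a different column" logic lives in the re-keying 
step.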

We are also adding another API for setting the re-partition key more 
intuitively, as described in KAFKA-3430, which I think would make this cleaner.

Note that the aggregate / join semantics for KStream and KTable are 
different; more details are here:

http://docs.confluent.io/2.1.0-alpha1/streams/developer-guide.html#kafka-streams-dsl

> Add "repartition (+ join)" operations to streams
> ------------------------------------------------
>
>                 Key: KAFKA-3542
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3542
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>    Affects Versions: 0.10.0.0
>            Reporter: Greg Fodor
>            Assignee: Guozhang Wang
>            Priority: Minor
>
> A common operation in Kafka Streams seems to be to repartition the stream 
> onto a different column, usually for joining. The current way I've been doing 
> this:
> - Perform a map on the stream to the same value with a new key (the key we're 
> going to join on, usually a foreign key)
> - Sink the stream into a new topic
> - Create a new stream sourcing that topic
> - Perform the join
> Note that without explicitly sinking the intermediate topic, the topology 
> will fail to build because of the assertion that both sides of a join are 
> connected to source nodes. When you perform a map, the link between the 
> source nodes and the tail node of the topology is broken (by setting the 
> source nodes to null) so you are forced to sink to use that output in a join.
> It seems that this pattern could possibly be rolled into much simpler 
> operation(s). For example, the map could be changed into a "repartition" 
> method where you just return the new key. And the join itself could be 
> simplified by letting you specify a re-partition function on either side of 
> the join and create the intermediate topic implicitly.
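
As a rough illustration of why the re-keyed stream has to pass through an 
intermediate topic before the join, the plain-Java sketch below models 
partition assignment. It assumes a simple hashCode-modulo partitioner, which 
is not Kafka's actual default partitioner, and RepartitionSketch, 
partitionFor and repartition are hypothetical names that are not part of 
Kafka Streams.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java model of writing re-keyed records to an intermediate topic.
// Hypothetical names; the hash function is an assumption for illustration.
class RepartitionSketch {

    // Assumed partitioner: non-negative hash of the key, modulo partition count.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Distribute records by their (new) key, as a sink to a topic would.
    static List<List<Map.Entry<String, String>>> repartition(
            List<Map.Entry<String, String>> records, int numPartitions) {
        List<List<Map.Entry<String, String>>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (Map.Entry<String, String> r : records) {
            partitions.get(partitionFor(r.getKey(), numPartitions)).add(r);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Records after the "map" step: keyed by user id (the join key).
        List<Map.Entry<String, String>> rekeyed = new ArrayList<>();
        rekeyed.add(new SimpleEntry<>("u1", "u1:book"));
        rekeyed.add(new SimpleEntry<>("u2", "u2:pen"));
        rekeyed.add(new SimpleEntry<>("u1", "u1:lamp"));

        // After the shuffle, all records sharing a key sit in one partition,
        // so a task reading that partition sees every match for its keys.
        List<List<Map.Entry<String, String>>> parts = repartition(rekeyed, 3);
        for (int p = 0; p < parts.size(); p++) {
            System.out.println("partition " + p + ": " + parts.get(p));
        }
    }
}
```

Before the shuffle, records still sit in the partitions chosen by their old 
keys, so two records with the same new key can live in different partitions; 
that is the gap the intermediate topic closes in this model.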



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
