[jira] [Commented] (KAFKA-3542) Add "repartition (+ join)" operations to streams

Greg Fodor (JIRA) Mon, 11 Apr 2016 18:01:22 -0700

    [ 
https://issues.apache.org/jira/browse/KAFKA-3542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236374#comment-15236374
 ]


Greg Fodor commented on KAFKA-3542:
-----------------------------------

Right, this map approach is what I am doing right now before all of my joins, 
though I didn't realize I could use through() to generate a joinable stream 
without explicitly sourcing it from the new topic. I will see whether some of 
my joins can be satisfied with the aggregator-first approach. What bothers me 
about the current map -> sink approach is that the map is not really DRY (I 
should only need to specify the selector to re-partition on) and that the 
intermediate topic name should just be generated. I agree that an implicit 
through() call could be useful in place of the assertion currently made to 
determine whether two streams are joinable.
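
To make the DRY point concrete, here is a rough sketch of the shape I have in mind. It is plain Java and deliberately does not use the Kafka Streams API: Keyed, Repartitioned, repartition() and generateTopicName() are all hypothetical names made up for illustration, and the naming scheme for the generated topic is only an assumption. The caller supplies nothing but the key selector; the value is carried over unchanged and the intermediate topic name is generated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BiFunction;

public class SelectorOnlyRepartition {
    /** Hypothetical stand-in for a keyed stream record (not a Kafka class). */
    static final class Keyed<K, V> {
        final K key;
        final V value;
        Keyed(K key, V value) { this.key = key; this.value = value; }
    }

    /** Re-keyed records plus the name of the generated intermediate topic. */
    static final class Repartitioned<K, V> {
        final String topic;
        final List<Keyed<K, V>> records;
        Repartitioned(String topic, List<Keyed<K, V>> records) {
            this.topic = topic;
            this.records = records;
        }
    }

    private static final AtomicInteger NEXT_ID = new AtomicInteger();

    /** Assumed naming scheme; the caller never has to invent a topic name. */
    static String generateTopicName(String applicationId) {
        return applicationId + "-repartition-" + NEXT_ID.getAndIncrement();
    }

    /** The caller specifies only the selector for the new key. */
    static <K, V, K2> Repartitioned<K2, V> repartition(
            String applicationId,
            List<Keyed<K, V>> input,
            BiFunction<K, V, K2> keySelector) {
        List<Keyed<K2, V>> out = new ArrayList<>();
        for (Keyed<K, V> r : input) {
            // Same value, new key: the part of the map that is currently boilerplate.
            out.add(new Keyed<>(keySelector.apply(r.key, r.value), r.value));
        }
        return new Repartitioned<>(generateTopicName(applicationId), out);
    }

    public static void main(String[] args) {
        List<Keyed<String, String>> orders = new ArrayList<>();
        orders.add(new Keyed<>("o1", "u1:book"));
        // Re-key orders by the user id embedded in the value.
        Repartitioned<String, String> byUser =
                repartition("my-app", orders, (k, v) -> v.split(":")[0]);
        System.out.println(byUser.topic + " " + byUser.records.get(0).key);
    }
}
```

Whether the real operation should expose the generated name at all is an open design question; it is returned here only so the sketch can be inspected.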

> Add "repartition (+ join)" operations to streams
> ------------------------------------------------
>
>                 Key: KAFKA-3542
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3542
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>    Affects Versions: 0.10.0.0
>            Reporter: Greg Fodor
>            Assignee: Guozhang Wang
>            Priority: Minor
>
> A common operation in Kafka Streams seems to be to repartition the stream 
> onto a different column, usually for joining. The current way I've been doing 
> this:
> - Perform a map on the stream to the same value with a new key (the key we're 
> going to join on, usually a foreign key)
> - Sink the stream into a new topic
> - Create a new stream sourcing that topic
> - Perform the join
> Note that without explicitly sinking the intermediate topic, the topology 
> will fail to build because of the assertion that both sides of a join are 
> connected to source nodes. When you perform a map, the link between the 
> source nodes and the tail node of the topology is broken (by setting the 
> source nodes to null) so you are forced to sink to use that output in a join.
> It seems that this pattern could be rolled into one or more much simpler 
> operations. For example, the map could be changed into a "repartition" 
> method where you just return the new key. And the join itself could be 
> simplified by letting you specify a re-partition function on either side of 
> the join and create the intermediate topic implicitly.
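
The four steps quoted above (map to a new key, sink, re-source, join) can be illustrated with a small simulation. This is a minimal sketch in plain Java, not Kafka Streams code: Rec, partitionFor(), repartition() and join() are hypothetical helpers, the sample data is invented, and the partitioner uses hashCode() as a simplifying assumption rather than Kafka's actual hashing. It only shows why the join becomes possible once both sides are partitioned on the same key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class RepartitionJoinSketch {
    /** Hypothetical stand-in for a keyed stream record. */
    static final class Rec {
        final String key;
        final String value;
        Rec(String key, String value) { this.key = key; this.value = value; }
    }

    /** Simplified partitioner (assumption: hashCode, not Kafka's real hashing). */
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Steps 1-3: re-key each record and place it in a partitioned intermediate topic. */
    static List<List<Rec>> repartition(List<Rec> input,
                                       Function<Rec, String> keySelector,
                                       int numPartitions) {
        List<List<Rec>> topic = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) topic.add(new ArrayList<>());
        for (Rec r : input) {
            String newKey = keySelector.apply(r);
            topic.get(partitionFor(newKey, numPartitions)).add(new Rec(newKey, r.value));
        }
        return topic;
    }

    /** Step 4: join partition by partition; assumes both sides have the same partition count. */
    static List<String> join(List<List<Rec>> left, List<List<Rec>> right) {
        List<String> out = new ArrayList<>();
        for (int p = 0; p < left.size(); p++) {
            Map<String, String> table = new HashMap<>();
            for (Rec r : right.get(p)) table.put(r.key, r.value);
            for (Rec l : left.get(p)) {
                String match = table.get(l.key);
                if (match != null) out.add(l.value + "|" + match);
            }
        }
        return out;
    }

    /** Orders keyed by order id are re-keyed by user id, then joined with users. */
    static List<String> demo() {
        int n = 4;
        List<Rec> orders = new ArrayList<>();
        orders.add(new Rec("o1", "u1:book"));
        orders.add(new Rec("o2", "u2:pen"));
        orders.add(new Rec("o3", "u1:lamp"));
        List<Rec> users = new ArrayList<>();
        users.add(new Rec("u1", "alice"));
        users.add(new Rec("u2", "bob"));
        List<List<Rec>> ordersByUser = repartition(orders, r -> r.value.split(":")[0], n);
        List<List<Rec>> usersByUser = repartition(users, r -> r.key, n);
        return join(ordersByUser, usersByUser);
    }

    public static void main(String[] args) {
        for (String s : demo()) System.out.println(s);
    }
}
```

Because both sides go through the same partitioner with the same key, every order lands in the partition that holds its user, so all three orders find a match. Had the orders stayed partitioned by order id, a per-partition lookup by user id would have no such guarantee.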



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
