[GitHub] [kafka] vpapavas commented on a diff in pull request #12555: Optimize self-join

GitBox Thu, 01 Sep 2022 03:10:26 -0700


vpapavas commented on code in PR #12555:
URL: https://github.com/apache/kafka/pull/12555#discussion_r960457146



##########
streams/src/main/java/org/apache/kafka/streams/kstream/internals/InternalStreamsBuilder.java:
##########
@@ -270,16 +272,20 @@ private void maybeAddNodeForOptimizationMetadata(final GraphNode node) {
 
     // use this method for testing only
     public void buildAndOptimizeTopology() {
-        buildAndOptimizeTopology(false);
+        buildAndOptimizeTopology(false, false);
     }
 
-    public void buildAndOptimizeTopology(final boolean optimizeTopology) {
+    public void buildAndOptimizeTopology(
+        final boolean optimizeTopology, final boolean optimizeSelfJoin) {
 
         mergeDuplicateSourceNodes();
         if (optimizeTopology) {
             LOG.debug("Optimizing the Kafka Streams graph for repartition nodes");
             optimizeKTableSourceTopics();
             maybeOptimizeRepartitionOperations();
+            if (optimizeSelfJoin) {

Review Comment:
   Hey @guozhangwang!
   Cases 1 and 2 are optimizable and will be optimized by the algorithm. I have 
tests for these cases in `InternalStreamsBuilderTest`.  
   Case 3 is not optimizable and won't be recognized. The reason is that 
processors like `mapValues`, `filter`, or `transform` are black boxes: there 
is no way to know how they change the contents of a stream, and hence no way 
to tell whether the two sides of the join are still the same. 
   Case 4 could be optimized, but I did not consider it. Initially I had only 
cases like 1 and 2 in scope. I can support it by adding a special rule to the 
rewriter: if the parent of the join is a merge, then multiple source nodes are 
fine as long as they are all ancestors of the merge node. 
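
A rough sketch of what such a merge rule might look like is below. It is only an illustration under stated assumptions: `Node`, `Kind`, `sourceAncestors`, and `sourcesPermitSelfJoin` are hypothetical names invented here, not the real `GraphNode` API or the rewriter in this PR, and the check covers only the source-node condition (it does not look for black-box processors between the merge and the join).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the topology graph; NOT the real GraphNode classes.
final class SelfJoinMergeRuleSketch {

    enum Kind { SOURCE, MERGE, JOIN, OTHER }

    static final class Node {
        final String name;
        final Kind kind;
        final List<Node> parents = new ArrayList<>();

        Node(final String name, final Kind kind, final Node... parents) {
            this.name = name;
            this.kind = kind;
            for (final Node p : parents) {
                this.parents.add(p);
            }
        }
    }

    // Collects every SOURCE node reachable by following parent links from start.
    static Set<Node> sourceAncestors(final Node start) {
        final Set<Node> sources = new HashSet<>();
        collect(start, sources, new HashSet<>());
        return sources;
    }

    private static void collect(final Node node, final Set<Node> sources, final Set<Node> visited) {
        if (!visited.add(node)) {
            return; // already visited via another path
        }
        if (node.kind == Kind.SOURCE) {
            sources.add(node);
        }
        for (final Node p : node.parents) {
            collect(p, sources, visited);
        }
    }

    // Assumed rule: a single source is fine; several sources are fine only if
    // a parent of the join is a merge and every source is an ancestor of it.
    static boolean sourcesPermitSelfJoin(final Node join) {
        final Set<Node> joinSources = sourceAncestors(join);
        if (joinSources.isEmpty()) {
            return false;
        }
        if (joinSources.size() == 1) {
            return true;
        }
        for (final Node parent : join.parents) {
            if (parent.kind == Kind.MERGE && sourceAncestors(parent).containsAll(joinSources)) {
                return true;
            }
        }
        return false;
    }

    public static void main(final String[] args) {
        final Node a = new Node("A", Kind.SOURCE);
        final Node b = new Node("B", Kind.SOURCE);
        final Node c = new Node("C", Kind.SOURCE);
        final Node merge = new Node("merge", Kind.MERGE, a, b);

        // Both sides of the join are the merged stream.
        System.out.println(sourcesPermitSelfJoin(new Node("join", Kind.JOIN, merge, merge)));
        // One side is the merged stream, the other an unrelated source.
        System.out.println(sourcesPermitSelfJoin(new Node("join", Kind.JOIN, merge, c)));
    }
}
```

In this toy model the first join passes the check and the second does not, because source `C` is not an ancestor of the merge node.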



