Re: [PR] [WIP][SPARK-50892][SQL]Add UnionLoopExec, physical operator for recursion, to perform execution of recursive queries [spark]

via GitHub Wed, 05 Mar 2025 05:52:07 -0800


Pajaraja commented on code in PR #49955:
URL: https://github.com/apache/spark/pull/49955#discussion_r1981202703



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##########
@@ -1032,6 +1044,42 @@ object ColumnPruning extends Rule[LogicalPlan] {
         p
       }
 
+    case p @ Project(_, ul: UnionLoop) =>
+      if (!ul.outputSet.subsetOf(p.references)) {
+        val newAnchorChildProj = prunedChild(ul.anchor, p.references)
+        val neededIndicesListRef = {

Review Comment:
   This doesn't work because at every step we need to produce something 
readable for the next iteration of UnionLoopRef (except if we truly do not need 
it). An example is the Fibonacci generation: 
   
   ```
   WITH RECURSIVE fibonacci AS (
     VALUES (0, 1) AS t(a, b)
     UNION ALL
     SELECT b, a + b FROM fibonacci WHERE a < 10
   )
   SELECT a FROM fibonacci ORDER BY a;
   ```
   
   This will fail with the above approach, as it will prune the recursive child 
to only return the next fibonacci number in the first iteration of the 
recursion. However, when trying to calculate the next fibonacci number, since 
we only memorized the last one, we won't be able to preform the calculation.
   
   Now as for my approach, its pretty similar, just that we merge two sets of 
projection references to obtain everything we need to keep: the project right 
above UnionLoop (which we also consider in your approach), OR a project right 
above UnionLoopRef (which may not exist in which case we don't prune at all). 
   
   For this I create two sets of indices (one for each project) and merge them. 
Then I take make the pruned children in a way similar to what you suggested (I 
also started from modifying how pruning Union is handled). The reason I opted 
for sets of indices over sets of objects is because the columns down from 
UnionLoopRef might get renamed or given a different id in the process, but 
since the output of UnionLoopRef and UnionLoop should be of the same arity (and 
have the same types - they should logically correspond to each other!), the 
indices should correspond to the things we need accordingly.
   
   I do notice that I made a small mistake in creating indicesForRef which 
should be fixed now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [WIP][SPARK-50892][SQL]Add UnionLoopExec, physical operator for recursion, to perform execution of recursive queries [spark]

Reply via email to