cloud-fan commented on PR #49955:
URL: https://github.com/apache/spark/pull/49955#issuecomment-2703286093

   Regarding limit pushdown, it's a complicated problem and let me put a 
summary here:
   1. limit pushdown is important, as a recursive CTE may generate an infinite 
sequence and we need a limit to stop the recursion.
   2. global limit pushdown is quite straightforward: we get the row count for 
each iteration and we stop recursion if the total row count exceeds the limit.
   3. local limit pushdown is tricky: by definition, local limit means that 
each RDD partition produces at most N records. However, recursive CTE unions 
all DataFrames produced by each iteration, so we can't stop the loop as it 
generates a new RDD in each iteration, which becomes new partitions in the 
final unioned RDD. One idea is we add `.coelesce(1)` at the end so that we can 
treat local limit the same as global limit and use it to stop the loop earlier.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to