cloud-fan commented on PR #49955: URL: https://github.com/apache/spark/pull/49955#issuecomment-2703286093
Regarding limit pushdown, it's a complicated problem and let me put a summary here: 1. limit pushdown is important, as a recursive CTE may generate an infinite sequence and we need a limit to stop the recursion. 2. global limit pushdown is quite straightforward: we get the row count for each iteration and we stop recursion if the total row count exceeds the limit. 3. local limit pushdown is tricky: by definition, local limit means that each RDD partition produces at most N records. However, recursive CTE unions all DataFrames produced by each iteration, so we can't stop the loop as it generates a new RDD in each iteration, which becomes new partitions in the final unioned RDD. One idea is we add `.coelesce(1)` at the end so that we can treat local limit the same as global limit and use it to stop the loop earlier. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org