peter-toth commented on PR #49955: URL: https://github.com/apache/spark/pull/49955#issuecomment-2703425498
> Regarding limit pushdown, it's a complicated problem and let me put a summary here:
> 1. limit pushdown is important, as a recursive CTE may generate an infinite sequence and we need a limit to stop the recursion.
> 2. global limit pushdown is quite straightforward: we get the row count for each iteration and we stop recursion if the total row count exceeds the limit.
> 3. local limit pushdown is tricky: by definition, local limit means that each RDD partition produces at most N records. However, recursive CTE unions all DataFrames produced by each iteration, so we can't stop the loop as it generates a new RDD in each iteration, which becomes new partitions in the final unioned RDD. One idea is we add `.coalesce(1)` at the end so that we can treat local limit the same as global limit and use it to stop the loop earlier.

Hmm, but we know the exact number of rows returned in each iteration. Why would we start a new iteration if the total number of rows returned so far already exceeds the limit? If we apply `LocalLimit(n)` (where n is the remaining number of rows we need) in each iteration, then we can get more rows than n (at most n * number of partitions), but that's still OK as the outer `GlobalLimit` should handle it. `UnionLoop` shouldn't be different from `Union`; only the `LocalLimit` should decrease dynamically in its legs.
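
To make the idea concrete, here is a minimal sketch in plain Python (not Spark code) that models each iteration's output as a list of partitions. The names `local_limit`, `global_limit`, `union_loop`, `step`, and the `max_iterations` guard are hypothetical and chosen for illustration only; they are not Spark's internals. It only shows that a per-iteration local limit of "remaining rows" can stop the loop, with the outer global limit trimming any overshoot.

```python
from typing import Callable, List

Partitions = List[List[int]]  # one inner list per simulated partition


def local_limit(parts: Partitions, n: int) -> Partitions:
    # Local limit: each partition keeps at most n rows.
    return [p[:n] for p in parts]


def global_limit(parts: Partitions, n: int) -> List[int]:
    # Global limit: at most n rows across all partitions.
    return [row for p in parts for row in p][:n]


def union_loop(anchor: Partitions,
               step: Callable[[Partitions], Partitions],
               limit: int,
               max_iterations: int = 100) -> Partitions:
    # Hypothetical loop: the local limit applied to each leg shrinks
    # to the number of rows still needed.
    remaining = limit
    result: Partitions = []
    current = local_limit(anchor, remaining)
    for _ in range(max_iterations):  # safety guard, an assumption of this sketch
        rows = sum(len(p) for p in current)
        if rows == 0:
            break  # recursion produced nothing new
        result.extend(current)  # new partitions in the unioned result
        remaining -= rows
        if remaining <= 0:
            break  # enough rows collected, do not start a new iteration
        current = local_limit(step(current), remaining)
    return result


# An "infinite" recursion: every row r produces r + 1 in the next iteration.
step = lambda parts: [[r + 1 for r in p] for p in parts]
parts = union_loop([[0, 10], [20, 30]], step, limit=5)
total_rows = sum(len(p) for p in parts)  # 6: overshoots the limit of 5
final_rows = global_limit(parts, 5)      # [0, 10, 20, 30, 1]
```

In this run the first iteration returns 4 rows, so the second leg is limited to 1 row per partition and returns 2 rows across its 2 partitions. The loop stops with 6 rows, and the outer global limit trims the result to 5.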