peter-toth commented on PR #49955:
URL: https://github.com/apache/spark/pull/49955#issuecomment-2703425498

   > Regarding limit pushdown, it's a complicated problem and let me put a 
summary here:
   > 1. limit pushdown is important, as a recursive CTE may generate an 
infinite sequence and we need a limit to stop the recursion.
   > 2. global limit pushdown is quite straightforward: we get the row count 
for each iteration and we stop recursion if the total row count exceeds the 
limit.
   > 3. local limit pushdown is tricky: by definition, local limit means that 
each RDD partition produces at most N records. However, recursive CTE unions 
all DataFrames produced by each iteration, so we can't stop the loop as it 
generates a new RDD in each iteration, which becomes new partitions in the 
final unioned RDD. One idea is to add `.coalesce(1)` at the end so that we can 
treat local limit the same as global limit and use it to stop the loop earlier.
   
   Hmm, but we know the exact number of rows returned in each iteration. Why 
would we start a new iteration if the total number of rows returned so far 
already exceeds the limit? If we apply `LocalLimit(n)` (where n is the remaining 
number of rows we need) in each iteration, then we can get more rows than n (at 
most n * number of partitions), but that's still ok as the outer `GlobalLimit` 
should handle it.
   `UnionLoop` shouldn't behave differently from `Union`; just the `LocalLimit` 
should decrease dynamically in its legs.
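
   The proposal can be sketched as a driver-side loop (a minimal Python 
simulation of the control flow, not actual Spark code; `run_iteration`, the 
partition layout, and the row counts are hypothetical stand-ins):

```python
# Simulation of the proposed UnionLoop limit handling: each iteration applies
# LocalLimit(remaining) per partition, and the loop stops as soon as the rows
# collected so far reach the limit. `run_iteration` stands in for executing
# one step of the recursive CTE; here every partition yields 3 rows per step.

def run_iteration(step, num_partitions):
    # Hypothetical stand-in for one recursion step producing partitioned rows.
    return [[(step, p, i) for i in range(3)] for p in range(num_partitions)]

def union_loop_with_limit(limit, num_partitions=2, max_steps=100):
    collected = []
    step = 0
    while len(collected) < limit and step < max_steps:
        remaining = limit - len(collected)  # dynamically decreasing LocalLimit
        partitions = run_iteration(step, num_partitions)
        # LocalLimit(remaining): each partition returns at most `remaining`
        # rows, so one step may yield up to remaining * num_partitions rows.
        for part in partitions:
            collected.extend(part[:remaining])
        step += 1
    # The outer GlobalLimit trims any overshoot back down to `limit`.
    return collected[:limit]
```

   The overshoot is bounded (at most remaining * number of partitions per 
step), which is exactly why the outer global limit still has to do the final 
trim.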


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
