peter-toth commented on PR #49955:
URL: https://github.com/apache/spark/pull/49955#issuecomment-2705956689

   > Ideally recursive CTE should stop if the last iteration generates no data. 
Pushing down the LIMIT and applying an early stop is an optimization and should 
not change the query result.
   
   UnionLoop is special in that it dynamically generates partitions and may 
never stop generating them. So pushing down a limit isn't just an 
optimization; it is also the mechanism that stops the loop.
   
   > - With local limit 10, we can't early stop and still need to wait for the 
100th iteration. At the end, we return a union RDD with 100 partitions, and 
each partition has one row. The query result is the same as not pushing down 
the LIMIT and performing it after the recursive CTE.
   
   I think you are saying that it shouldn't matter what the LocalLimit is 
around a UnionLoop with regard to stopping the loop. (But each returned 
partition can be LocalLimit-ed, of course.)
   I would argue against this. I think `LocalLimit(n)`'s purpose is to 
provide a cheap max-`n`-row limiter. And indeed, when it comes to regular 
nodes, like union, with a fixed number of partitions of unknown size, this 
limit should be applied to each partition separately during planning 
(`LocalLimit(n)` can be pushed down to the legs). But since the purpose of the 
operator is to return at most `n` rows, in the special case of `UnionLoop`, 
where we know the exact row count of each iteration at runtime, we can 
actually use this limit to stop generating partitions too.
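   To illustrate the "cheap per-partition limiter" reading of `LocalLimit(n)` 
on a regular union with a fixed number of partitions, here is a plain-Python 
sketch (not Spark code; `local_limit` is a hypothetical helper):

```python
# Sketch only: LocalLimit(n) as a cheap per-partition row limiter. Each
# partition is capped independently, without any shuffle, so the operator
# may return up to n rows *per partition*, not n rows globally.

def local_limit(partitions, n):
    # Cap every partition at n rows.
    return [p[:n] for p in partitions]

# Two fixed union legs; a LocalLimit(2) can be pushed down to each leg.
union_legs = [[1, 2, 3], [4, 5]]
limited = local_limit(union_legs, 2)
```

   For a regular union this per-partition capping is all `LocalLimit` can do; 
for `UnionLoop`, where the row count of each iteration is known at runtime, 
the same `n` can additionally serve as the stop condition for the loop.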
   
   > So one idea is to add a .coalesce(1) for the recursive CTE result so that 
local limit is the same as global limit.
   
   If you coalesce the partitions of a _unioned RDD_ into 1 partition and 
apply the local limit to that single partition, then that local limit is 
effectively a global limit and can be used to stop generating partitions.
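   A plain-Python sketch of that idea (hypothetical names, not Spark code): 
flattening an unbounded stream of partitions into one and consuming it lazily 
makes the local limit behave as a global limit:

```python
from itertools import islice

# Sketch only: after "coalescing" to a single partition, a local limit on
# that partition is equivalent to a global limit. Because consumption is
# lazy, no partition beyond the one containing the n-th row is generated.

def coalesce_and_local_limit(partition_gen, n):
    # Flatten the (possibly infinite) stream of partitions into one
    # logical partition and take at most n rows from it.
    flat = (row for part in partition_gen for row in part)
    return list(islice(flat, n))

def endless_partitions():
    i = 0
    while True:          # one row per iteration, never empty
        yield [i]
        i += 1

rows = coalesce_and_local_limit(endless_partitions(), 10)
```

   The laziness is what makes this a stopping mechanism rather than a 
post-hoc filter: the generator is never advanced past the 10th row.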


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

