peter-toth commented on PR #49955: URL: https://github.com/apache/spark/pull/49955#issuecomment-2705956689
> Ideally recursive CTE should stop if the last iteration generates no data. Pushing down the LIMIT and applying an early stop is an optimization and should not change the query result.

`UnionLoop` is special in that it dynamically generates partitions and possibly never stops generating them. So pushing down a limit isn't just an optimization; it is the mechanism that stops the loop.

> - With local limit 10, we can't early stop and still need to wait for the 100th iteration. At the end, we return a union RDD with 100 partitions, and each partition has one row. The query result is the same as not pushing down the LIMIT and performing it after the recursive CTE.

I think you are saying that, as far as stopping the loop is concerned, it shouldn't matter what LocalLimit is around a `UnionLoop`. (But each returned partition can of course be LocalLimit-ed.) I would argue against this. The purpose of `LocalLimit(n)` is to provide a cheap "at most `n` rows" limiter. And indeed, when it comes to regular nodes like `Union`, which have a fixed number of partitions of unknown size, this limit is applied separately to each partition during planning (`LocalLimit(n)` can be pushed down to the legs). But since the operator's purpose is to return at most `n` rows, in the special case of `UnionLoop`, where we know the exact row count of each iteration at runtime, we can also use this limit to stop generating partitions.

> So one idea is to add a .coalesce(1) for the recursive CTE result so that local limit is the same as global limit.

If you coalesce the partitions of a _unioned RDD_ into 1 partition so as to apply the LocalLimit to the coalesced partition, then you could use that LocalLimit as a GlobalLimit to stop generating partitions.
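To make the argument concrete, here is a minimal driver-side sketch (not Spark's actual `UnionLoopExec`; all names are hypothetical) of the idea above: because `UnionLoop` produces its partitions one iteration at a time, a pushed-down limit can double as the loop's stop condition once the cumulative row count reaches it, i.e. the situation where a LocalLimit is effectively a GlobalLimit, such as after `coalesce(1)`.

```scala
// Hypothetical sketch, not Spark code: union the rows of each iteration
// of a "recursive leg" until `limit` rows have been produced.
object UnionLoopLimitSketch {
  /** Runs `step` starting from `seed`, appending each iteration's rows.
    * Without the limit acting as a stop condition, a step that always
    * yields rows would loop forever. */
  def runWithLimit[A](seed: Seq[A], step: Seq[A] => Seq[A], limit: Int): Seq[A] = {
    var current = seed
    val out = scala.collection.mutable.ArrayBuffer.empty[A]
    while (current.nonEmpty && out.size < limit) {
      out ++= current.take(limit - out.size) // LocalLimit applied to this partition
      current = step(current)                // generate the next iteration
    }
    out.toSeq
  }

  def main(args: Array[String]): Unit = {
    // A counter CTE: each iteration yields the single row n + 1, forever.
    // The limit of 10 stops the loop after exactly 10 iterations.
    val rows = runWithLimit(Seq(1), (prev: Seq[Int]) => prev.map(_ + 1), 10)
    println(rows.mkString(","))
  }
}
```

Under the `coalesce(1)` proposal, the per-partition limit and the total-row limit coincide, which is exactly what makes the `out.size < limit` check a valid termination test here.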
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org