[PR] Add LimitPushdown optimization rule and CoalesceBatchesExec fetch [datafusion]

via GitHub Thu, 25 Jul 2024 08:26:53 -0700


alihandroid opened a new pull request, #11652:
URL: https://github.com/apache/datafusion/pull/11652


   ## Which issue does this PR close?
   
   Closes #9792.
   
   ## Rationale for this change
   
   Physical plans can be optimized further by pushing `GlobalLimitExec` and 
`LocalLimitExec` down through certain nodes, or by replacing their child nodes 
with versions that carry a fetch limit, without changing the result. This reduces 
unnecessary data transfer and processing, making plan execution more efficient.
   
   `CoalesceBatchesExec` can also benefit from this improvement, so this PR adds 
fetch limit support to it.
   
   For example,
   ```
   GlobalLimitExec: skip=0, fetch=5
     StreamingTableExec: partition_sizes=1, projection=[c1, c2, c3], infinite_source=true
   ```
   can be turned into
   ```
   StreamingTableExec: partition_sizes=1, projection=[c1, c2, c3], infinite_source=true, fetch=5
   ```
   and
   ```
   GlobalLimitExec: skip=0, fetch=5
     CoalescePartitionsExec
       FilterExec: c3@2 > 0
         RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1
           StreamingTableExec: partition_sizes=1, projection=[c1, c2, c3], infinite_source=true
   ```
   can be turned into
   ```
   GlobalLimitExec: skip=0, fetch=5
     CoalescePartitionsExec
       LocalLimitExec: fetch=5
         FilterExec: c3@2 > 0
           RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1
             StreamingTableExec: partition_sizes=1, projection=[c1, c2, c3], infinite_source=true
   ```
   without changing the result, while using fewer resources and finishing faster.
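
The two rewrites above can be sketched on a toy plan tree. This is only an illustration, not the code in this PR: the `Plan` enum and `push_down_limit` function are hypothetical names, the rule handles just the two shapes shown above, and it assumes that the limit under `CoalescePartitions` must keep `skip + fetch` rows per partition so the global limit can still apply its skip.

```rust
/// Toy physical plan, loosely modelled on the operators named above.
/// Hypothetical type for illustration; not DataFusion's `ExecutionPlan`.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    GlobalLimit { skip: usize, fetch: Option<usize>, input: Box<Plan> },
    LocalLimit { fetch: usize, input: Box<Plan> },
    CoalescePartitions { input: Box<Plan> },
    Filter { predicate: String, input: Box<Plan> },
    StreamingTable { fetch: Option<usize> },
}

/// Single-pass sketch of the two rewrites:
/// 1. a limit directly over a source that accepts a fetch is folded into it;
/// 2. a limit over a partition merge gets a per-partition limit added below it.
/// Any other plan is returned unchanged.
fn push_down_limit(plan: Plan) -> Plan {
    match plan {
        Plan::GlobalLimit { skip, fetch: Some(fetch), input } => match *input {
            // Case 1: the source can stop after `fetch` rows by itself.
            Plan::StreamingTable { fetch: None } if skip == 0 => {
                Plan::StreamingTable { fetch: Some(fetch) }
            }
            // Case 2: keep the global limit, but cap every partition first.
            Plan::CoalescePartitions { input: child } => Plan::GlobalLimit {
                skip,
                fetch: Some(fetch),
                input: Box::new(Plan::CoalescePartitions {
                    input: Box::new(Plan::LocalLimit {
                        fetch: skip + fetch,
                        input: child,
                    }),
                }),
            },
            other => Plan::GlobalLimit {
                skip,
                fetch: Some(fetch),
                input: Box::new(other),
            },
        },
        other => other,
    }
}
```

Note that the sketch places the `LocalLimit` above the `Filter` rather than below it, as in the example plan: limiting before filtering could drop rows that would have passed the predicate.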
   
   The physical plan in the following excerpt
   
https://github.com/apache/datafusion/blob/ecf5323eaa38869ed2f911b02f98e17aa6db639a/datafusion/sqllogictest/test_files/repartition.slt#L116-L129
   
   will turn into 
   ```
   01)GlobalLimitExec: skip=0, fetch=5
   02)--CoalescePartitionsExec
   03)----CoalesceBatchesExec: target_batch_size=8192, fetch=5
   04)------FilterExec: c3@2 > 0
   05)--------RepartitionExec: partitioning=RoundRobinBatch(3), input_partitions=1
   06)----------StreamingTableExec: partition_sizes=1, projection=[c1, c2, c3], infinite_source=true
   ```
   
   Other examples can be found in the tests provided in `limit_pushdown.rs` and 
in other `.slt` tests.
   
   ## What changes are included in this PR?
   
   Implement `LimitPushdown` Rule:
     - Introduce new APIs in the `ExecutionPlan` trait:
       - `with_fetch(&self, fetch: Option<usize>) -> Option<Arc<dyn ExecutionPlan>>`: 
Returns a version of the node with the given fetch limit if supported, `None` 
otherwise. The default implementation returns `None`.
       - `supports_limit_pushdown(&self) -> bool`: Returns `true` if a limit can 
be pushed down through the node. The default implementation returns `false`.
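
As a rough illustration of how the two default methods behave, here is a minimal sketch on a stand-in trait. Everything prefixed `Toy` is hypothetical, as is the `describe` helper; only the two method signatures mirror the ones listed above. Which real operators opt in is decided in the PR itself, not here.

```rust
use std::sync::Arc;

/// Stand-in for the real trait, reduced to the two new methods.
trait ToyExecutionPlan {
    /// Hypothetical display helper, used only by this sketch.
    fn describe(&self) -> String;

    /// Returns a copy of this node that stops after `fetch` rows,
    /// or `None` when the node cannot enforce a fetch limit itself.
    fn with_fetch(&self, _fetch: Option<usize>) -> Option<Arc<dyn ToyExecutionPlan>> {
        None
    }

    /// Whether a limit may be moved below this node without changing results.
    fn supports_limit_pushdown(&self) -> bool {
        false
    }
}

/// A node that can absorb a fetch limit.
struct ToyCoalesceBatches {
    target_batch_size: usize,
    fetch: Option<usize>,
}

impl ToyExecutionPlan for ToyCoalesceBatches {
    fn describe(&self) -> String {
        match self.fetch {
            Some(n) => format!(
                "ToyCoalesceBatches: target_batch_size={}, fetch={}",
                self.target_batch_size, n
            ),
            None => format!(
                "ToyCoalesceBatches: target_batch_size={}",
                self.target_batch_size
            ),
        }
    }

    fn with_fetch(&self, fetch: Option<usize>) -> Option<Arc<dyn ToyExecutionPlan>> {
        Some(Arc::new(ToyCoalesceBatches {
            target_batch_size: self.target_batch_size,
            fetch,
        }))
    }
}

/// A row-preserving node: limiting its input is the same as limiting its output.
struct ToyProjection;

impl ToyExecutionPlan for ToyProjection {
    fn describe(&self) -> String {
        "ToyProjection".to_string()
    }

    fn supports_limit_pushdown(&self) -> bool {
        true
    }
}

/// A node that relies on both defaults: no fetch support, no pushdown.
struct ToyFilter;

impl ToyExecutionPlan for ToyFilter {
    fn describe(&self) -> String {
        "ToyFilter".to_string()
    }
}
```

Because both methods have conservative defaults, existing operators keep their current behavior unless they override one of them.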
   
   Add fetch support to `CoalesceBatchesExec`:
     - Add `fetch` field and `with_fetch` implementation
     - Implement fetch limit functionality
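
The idea behind a fetch limit on a coalescing operator can be sketched without Arrow, using `Vec<i32>` as a stand-in for a record batch. The function name `coalesce_with_fetch` and its signature are hypothetical, and the sketch assumes `target_batch_size >= 1`; it is not the implementation in this PR.

```rust
/// Buffers small input batches until at least `target_batch_size` rows are
/// collected, then emits them as one batch. With `fetch = Some(n)`, at most
/// `n` rows are emitted in total and the input is not read past that point.
fn coalesce_with_fetch(
    input: impl IntoIterator<Item = Vec<i32>>,
    target_batch_size: usize,
    fetch: Option<usize>,
) -> Vec<Vec<i32>> {
    let mut output = Vec::new();
    let mut buffer: Vec<i32> = Vec::new();
    // Rows still allowed through; effectively unlimited when no fetch is set.
    let mut remaining = fetch.unwrap_or(usize::MAX);

    for batch in input {
        if remaining == 0 {
            // Limit reached: stop pulling from the input.
            break;
        }
        // Truncate the batch that crosses the limit.
        let take = batch.len().min(remaining);
        buffer.extend_from_slice(&batch[..take]);
        remaining -= take;

        // Flush when the buffer is large enough or the limit was just hit.
        if !buffer.is_empty() && (buffer.len() >= target_batch_size || remaining == 0) {
            output.push(std::mem::take(&mut buffer));
        }
    }

    // Flush whatever is left once the input is exhausted.
    if !buffer.is_empty() {
        output.push(buffer);
    }
    output
}
```

Stopping the loop as soon as the limit is reached is what matters for sources such as the `infinite_source=true` table in the plans above, since the input would otherwise never end.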
   
   ## Are these changes tested?
   
   Unit tests are provided for `LimitPushdown` and the new fetch support in 
`CoalesceBatchesExec`.
   
   ## Are there any user-facing changes?
   
   No. The changes only affect performance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
