Re: [I] Optimized version of `SortPreservingMerge` that doesn't actually compare sort keys of the key ranges are ordered [datafusion]

via GitHub Thu, 07 Nov 2024 10:28:44 -0800


suremarc commented on issue #10316:
URL: https://github.com/apache/datafusion/issues/10316#issuecomment-2462945990


   I went ahead and made an attempt at implementing this over in this draft PR: 
#13296
   
   I was able to reuse the `MinMaxStatistics` code with some small changes 
which was pretty cool. 
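
   For context, the kind of check that min/max statistics make possible can 
be sketched roughly as below. This is a simplified stand-in, not the 
`MinMaxStatistics` code or the code in the PR: `KeyRange` and `concat_order` 
are hypothetical names, and the sort key is assumed to be a single `i64` 
column sorted ascending.

```rust
/// Hypothetical stand-in for the min/max of the sort key within one partition.
/// Assumes `min <= max` and that each partition is already sorted ascending.
#[derive(Debug, Clone, Copy, PartialEq)]
struct KeyRange {
    min: i64,
    max: i64,
}

/// Returns the partition indices in an order in which the partitions could be
/// concatenated without comparing individual rows, or `None` if some key
/// ranges overlap and a real merge is still required.
fn concat_order(ranges: &[KeyRange]) -> Option<Vec<usize>> {
    let mut order: Vec<usize> = (0..ranges.len()).collect();
    // Order the partitions by their lower bound (upper bound breaks ties).
    order.sort_by_key(|&i| (ranges[i].min, ranges[i].max));
    // Once ordered, the ranges are non-overlapping if every partition ends
    // no later than the next one begins. Equal boundary keys are allowed
    // because the concatenated output is still sorted.
    let disjoint = order
        .windows(2)
        .all(|w| ranges[w[0]].max <= ranges[w[1]].min);
    disjoint.then_some(order)
}
```

   When `concat_order` returns `Some`, the partitions can be emitted one 
after another in that order; when it returns `None`, the regular merge is 
still needed.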
   
   I have no nontrivial way to test this code until we get statistics per 
partition. In the PR I proposed the following API:
   
   ```rust
   pub trait ExecutionPlan: [...] {
       // [...]
       
       fn statistics_by_partition(&self) -> Result<Vec<Statistics>> {
           // Return global statistics by default
           Ok(vec![
               self.statistics()?;
               self.properties().partitioning.partition_count()
           ])
       }
   }
   ```
   
   For the statistics to be useful, we'll need non-default implementations. 
So I'm wondering whether I should implement this method only for 
`ParquetExec` in my PR, so I can get some tests in. Nothing would be 
finalized yet, of course. 
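
   To illustrate the difference between the default and a non-default 
implementation, here is a minimal sketch that uses simplified stand-in types 
instead of the real DataFusion ones. `Stats`, `Plan`, `OpaquePlan` and 
`FileScan` are all hypothetical names, and an actual `ParquetExec` 
implementation would look different.

```rust
/// Hypothetical, heavily simplified stand-in for `Statistics`.
#[derive(Debug, Clone, PartialEq)]
struct Stats {
    num_rows: usize,
    min_key: i64,
    max_key: i64,
}

/// Simplified stand-in for `ExecutionPlan`, with only the pieces needed here.
trait Plan {
    fn partition_count(&self) -> usize;
    fn statistics(&self) -> Stats;

    /// Default: repeat the global statistics once per partition.
    fn statistics_by_partition(&self) -> Vec<Stats> {
        vec![self.statistics(); self.partition_count()]
    }
}

/// A plan that only knows global statistics and keeps the default.
struct OpaquePlan {
    global: Stats,
    partitions: usize,
}

impl Plan for OpaquePlan {
    fn partition_count(&self) -> usize {
        self.partitions
    }
    fn statistics(&self) -> Stats {
        self.global.clone()
    }
}

/// A scan that knows statistics for each partition (for example, derived
/// from file metadata) and overrides the default.
struct FileScan {
    global: Stats,
    per_partition: Vec<Stats>,
}

impl Plan for FileScan {
    fn partition_count(&self) -> usize {
        self.per_partition.len()
    }
    fn statistics(&self) -> Stats {
        self.global.clone()
    }
    fn statistics_by_partition(&self) -> Vec<Stats> {
        self.per_partition.clone()
    }
}
```

   With the default, every partition reports the same global min/max, which 
are valid but loose bounds, so the key ranges never look disjoint. Only an 
override like the one on `FileScan` gives per-partition ranges that can be 
compared.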
   
   As discussed in [Epic: Statistics 
Improvements](https://github.com/apache/datafusion/issues/8227#issuecomment-2457565135)
 we will need #8078 in order for this code to actually work properly in all 
situations, but I believe @alamb is working on that. 



