berkaysynnada commented on PR #15503:
URL: https://github.com/apache/datafusion/pull/15503#issuecomment-2785682160

   > What do you think?
   
   I'm still thinking we should unify the API's. We've discussed the issue with 
our team, and got a common opinion: 
   
   If I give an example from my concerns, there will be many duplications like 
this:
   
   ```rust
       fn statistics(&self) -> Result<Statistics> {
           let stats = Self::statistics_helper(
               self.schema(),
               self.input().statistics()?,
               self.predicate(),
               self.default_selectivity,
           )?;
           Ok(stats.project(self.projection.as_ref()))
       }
   
       fn statistics_by_partition(&self) -> Result<PartitionedStatistics> {
           let input_stats = self.input.statistics_by_partition()?;
   
           let stats: Result<Vec<Arc<Statistics>>> = input_stats
               .iter()
               .map(|stat| {
                   let stat = Self::statistics_helper(
                       self.schema(),
                       stat.clone(),
                       self.predicate(),
                       self.default_selectivity,
                   )
                   .map(|stat| stat.project(self.projection.as_ref()))?;
                   Ok(Arc::new(stat))
               })
               .collect();
   
           Ok(PartitionedStatistics::new(stats?))
       }
   ```
   
   There are/will be duplications of statistics() logics in each operator like 
this, because the calculations are the same, whether the stats are coming from 
the whole table or just for one partition. We can avoid the duplications and 
write efficient functional statistics() implementations if we adopt 
   ```rust
   fn statistics(&self, partition: Option<usize>) -> Result<Statistics>
   ```
   style. So, that alternative wins clearly for me against other alternatives. 
It also does not modify the other existing structs/API's, and propose an 
extensible way while enabling the statistics access over any partition. 
   
   TLDR, updating the API as `fn statistics(&self, partition: Option<usize>) -> 
Result<Statistics>` has a minimal change, doesn't force us to follow an 
immature design path, reduce duplications, and enables partition-based stats 
access, that's the main goal.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to