berkaysynnada commented on PR #13933:
URL: https://github.com/apache/datafusion/pull/13933#issuecomment-2565152257

   I believe extending the Statistics with sort information is dangerous, as it 
deviates from the single-responsibility principle and creates the burden of 
maintaining order information in two places (Statistics and equivalences).
   
   I wonder if we can utilize the `sorting_columns()` API of `WriterProperties` 
and `WriterPropertiesBuilder` instead. I also think this information should be 
supplied through `TableParquetOptions` rather than `DataFrameWriteOptions`, 
which would make `sort_by` in `DataFrameWriteOptions` redundant.
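
   As a rough sketch of that idea: the sort order is declared once on the 
writer properties, which the table-level options own, and consumers read it 
back from there instead of from `Statistics`. Every type below (`SortKey`, 
`WriterProps`, `WriterPropsBuilder`, `TableOptions`) is a hypothetical stand-in 
written for illustration. They are not the parquet-rs or DataFusion types, 
whose actual signatures may differ.

```rust
// Hypothetical stand-in for one sort key; NOT the real parquet type.
#[derive(Debug, Clone, PartialEq)]
struct SortKey {
    column_idx: usize,
    descending: bool,
    nulls_first: bool,
}

// Hypothetical stand-in for immutable writer properties.
#[derive(Debug)]
struct WriterProps {
    sorting_columns: Option<Vec<SortKey>>,
}

// Hypothetical stand-in for the builder that produces `WriterProps`.
#[derive(Debug, Default)]
struct WriterPropsBuilder {
    sorting_columns: Option<Vec<SortKey>>,
}

impl WriterPropsBuilder {
    // Record the declared sort order; `None` means "no declared order".
    fn set_sorting_columns(mut self, cols: Option<Vec<SortKey>>) -> Self {
        self.sorting_columns = cols;
        self
    }

    fn build(self) -> WriterProps {
        WriterProps { sorting_columns: self.sorting_columns }
    }
}

impl WriterProps {
    fn builder() -> WriterPropsBuilder {
        WriterPropsBuilder::default()
    }

    // Read-only accessor: the single place the declared order is stored.
    fn sorting_columns(&self) -> Option<&Vec<SortKey>> {
        self.sorting_columns.as_ref()
    }
}

// Hypothetical table-level options that own the writer properties, so no
// separate `sort_by` is needed on per-write options.
struct TableOptions {
    writer: WriterProps,
}

// Consumers derive the ordering from the writer properties; an absent
// declaration yields an empty ordering.
fn declared_order(opts: &TableOptions) -> Vec<SortKey> {
    opts.writer.sorting_columns().cloned().unwrap_or_default()
}

// Example: a table declared as sorted ascending on column 0, nulls last.
fn example_options() -> TableOptions {
    let key = SortKey { column_idx: 0, descending: false, nulls_first: false };
    TableOptions {
        writer: WriterProps::builder()
            .set_sorting_columns(Some(vec![key]))
            .build(),
    }
}
```

   With this shape, `declared_order(&example_options())` returns the single 
declared key, and options built without `set_sorting_columns` yield an empty 
ordering, so the order information lives in one place only.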
   
   We are currently working on an extensive refactor of the Statistics 
framework, and storing order information in it does not seem like the right 
approach in either its current state or the new version. I also don’t think 
this should be a fork-specific implementation, as it’s a common need, but we 
need to find a smoother way to address it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

