berkaysynnada commented on PR #13933: URL: https://github.com/apache/datafusion/pull/13933#issuecomment-2565152257
I believe extending `Statistics` with sort information is dangerous: it deviates from the single-responsibility principle and creates the burden of maintaining order information in two places (`Statistics` and equivalences). I wonder if we can use the `sorting_columns()` API of `WriterProperties` and `WriterPropertiesBuilder` instead. I also think this information should be supplied through `TableParquetOptions` rather than `DataFrameWriteOptions`, which would make `sort_by` in `DataFrameWriteOptions` redundant.

We are currently working on an extensive refactor of the Statistics framework, and neither in its current state nor in the new version does storing order information in it seem like the right approach. I also don't think this should be a fork-specific implementation, since it's a common need, but we need to find a smoother way to address it.
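To make the suggestion concrete, here is a rough sketch of the idea: the declared sort order lives only in the writer properties, derived from a table-level option, and nothing is duplicated into statistics. Every type and function below (`SortingColumn`, `WriterProps`, `SortKey`, `writer_props_for`) is a hypothetical stand-in written for illustration. They only mimic the general shape of the Parquet writer properties and are not the real parquet or DataFusion API.

```rust
// All types here are hypothetical stand-ins, not the real parquet or DataFusion API.

/// Mimics the shape of a Parquet sorting-column entry.
#[derive(Debug, Clone, PartialEq)]
pub struct SortingColumn {
    pub column_idx: i32,
    pub descending: bool,
    pub nulls_first: bool,
}

/// Stand-in for writer properties: the only place the sort order is stored.
#[derive(Debug, Default)]
pub struct WriterProps {
    sorting_columns: Option<Vec<SortingColumn>>,
}

impl WriterProps {
    /// Returns the declared sort order, if any.
    pub fn sorting_columns(&self) -> Option<&Vec<SortingColumn>> {
        self.sorting_columns.as_ref()
    }
}

/// Stand-in for a table-level Parquet option that names one sort key.
pub struct SortKey {
    pub column: String,
    pub descending: bool,
    pub nulls_first: bool,
}

/// Resolves each sort key's column name against the schema and builds
/// writer properties carrying the resulting sorting columns.
pub fn writer_props_for(
    schema: &[&str],
    sort_keys: &[SortKey],
) -> Result<WriterProps, String> {
    // No declared order: leave the sorting columns unset.
    if sort_keys.is_empty() {
        return Ok(WriterProps::default());
    }
    let mut cols = Vec::with_capacity(sort_keys.len());
    for key in sort_keys {
        // Translate the column name into its position in the schema.
        let idx = schema
            .iter()
            .position(|name| *name == key.column)
            .ok_or_else(|| format!("unknown sort column: {}", key.column))?;
        cols.push(SortingColumn {
            column_idx: idx as i32,
            descending: key.descending,
            nulls_first: key.nulls_first,
        });
    }
    Ok(WriterProps {
        sorting_columns: Some(cols),
    })
}
```

With this shape, a reader of the written file could recover the order from the file metadata, and the planner would keep tracking orderings through equivalences only.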