AdamGS commented on code in PR #14754: URL: https://github.com/apache/datafusion/pull/14754#discussion_r1963752446
##########
datafusion/core/src/datasource/data_source.rs:
##########
@@ -62,9 +64,33 @@ pub trait FileSource: Send + Sync {
     fn fmt_extra(&self, _t: DisplayFormatType, _f: &mut Formatter) -> fmt::Result {
         Ok(())
     }
-    /// Return true if the file format supports repartition
+
+    /// If supported by the [`FileSource`], redistribute files across partitions according to their size.
+    /// Allows custom file formats to implement their own repartitioning logic.
     ///
-    /// If this returns true, the DataSourceExec may repartition the data
-    /// by breaking up the input files into multiple smaller groups.
-    fn supports_repartition(&self, config: &FileScanConfig) -> bool;
+    /// Provides a default repartitioning behavior, see comments on [`FileGroupPartitioner`] for more detail.
+    fn repartitioned(
+        &self,
+        target_partitions: usize,
+        repartition_file_min_size: usize,
+        output_ordering: Option<LexOrdering>,
+        config: &FileScanConfig,
+    ) -> datafusion_common::Result<Option<FileScanConfig>> {
+        if config.file_compression_type.is_compressed() || config.new_lines_in_values {

Review Comment:
   The first check just makes sense to me: compression makes it impossible (or at least very hard) to map byte ranges to actual offsets in the uncompressed file. The second one is CSV-only, but as long as it's there I think it's fine to respect it here.
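
To make the reasoning behind the guard concrete, here is a minimal sketch that does not use DataFusion itself. `ScanSettings` and `split_byte_ranges` are hypothetical names invented for this illustration; only the two flags mirror the fields read in the diff. It also assumes that "declining to repartition" is expressed by returning `None`. Compressed bytes are refused because their offsets do not correspond to offsets in the decompressed data. The newline case is refused on the assumption that a range reader realigns to the next newline, which is not a reliable record boundary when quoted CSV values may contain newlines.

```rust
/// Hypothetical stand-in for the two `FileScanConfig` fields the guard reads.
#[derive(Debug, Clone, Copy)]
pub struct ScanSettings {
    /// True when the file is stored with a compression codec.
    pub is_compressed: bool,
    /// True when (CSV) values may contain embedded newlines.
    pub new_lines_in_values: bool,
}

/// Splits a file of `file_size` bytes into at most `target_partitions`
/// half-open `(start, end)` byte ranges.
///
/// Returns `None` when splitting by byte range is not safe or not meaningful.
pub fn split_byte_ranges(
    settings: ScanSettings,
    file_size: u64,
    target_partitions: u64,
) -> Option<Vec<(u64, u64)>> {
    // Same shape as the guard under review: refuse to split in either case.
    if settings.is_compressed || settings.new_lines_in_values {
        return None;
    }
    if file_size == 0 || target_partitions == 0 {
        return None;
    }

    // Ceiling division so that the ranges cover the whole file.
    let chunk = (file_size + target_partitions - 1) / target_partitions;

    let mut ranges = Vec::new();
    let mut start = 0;
    while start < file_size {
        let end = (start + chunk).min(file_size);
        ranges.push((start, end));
        start = end;
    }
    Some(ranges)
}
```

With both flags unset, a 10-byte file and three target partitions produce three contiguous ranges. Setting either flag makes the function return `None`, leaving the caller to read the file as a single unit.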