Re: [PR] Allow `FileSource`-specific repartitioning [datafusion]

via GitHub Thu, 20 Feb 2025 07:10:17 -0800


AdamGS commented on code in PR #14754:
URL: https://github.com/apache/datafusion/pull/14754#discussion_r1963752446



##########
datafusion/core/src/datasource/data_source.rs:
##########
@@ -62,9 +64,33 @@ pub trait FileSource: Send + Sync {
     fn fmt_extra(&self, _t: DisplayFormatType, _f: &mut Formatter) -> fmt::Result {
         Ok(())
     }
-    /// Return true if the file format supports repartition
+
+    /// If supported by the [`FileSource`], redistribute files across partitions according to their size.
+    /// Allows custom file formats to implement their own repartitioning logic.
     ///
-    /// If this returns true, the DataSourceExec may repartition the data
-    /// by breaking up the input files into multiple smaller groups.
-    fn supports_repartition(&self, config: &FileScanConfig) -> bool;
+    /// Provides a default repartitioning behavior, see comments on [`FileGroupPartitioner`] for more detail.
+    fn repartitioned(
+        &self,
+        target_partitions: usize,
+        repartition_file_min_size: usize,
+        output_ordering: Option<LexOrdering>,
+        config: &FileScanConfig,
+    ) -> datafusion_common::Result<Option<FileScanConfig>> {
+        if config.file_compression_type.is_compressed() || config.new_lines_in_values {

Review Comment:
   The first check makes sense to me: compression makes it impossible (or at 
least very hard) to map byte ranges to actual offsets in the uncompressed file. 
The second one is CSV-only, but as long as it's there I think it's fine to 
respect it here.
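
   To illustrate the reasoning, here is a minimal, self-contained sketch of the 
guard. It is not the PR's implementation: `ScanFlags` and `split_byte_ranges` 
are hypothetical names, and the even split only stands in for whatever 
[`FileGroupPartitioner`] actually does. It shows the two conditions 
short-circuiting before any byte ranges are computed.

```rust
use std::ops::Range;

/// Hypothetical stand-in for the two `FileScanConfig` fields checked in the guard.
struct ScanFlags {
    is_compressed: bool,
    new_lines_in_values: bool,
}

/// Split `file_size` bytes into at most `target_partitions` contiguous ranges.
/// Returns `None` when splitting by byte range is not considered safe.
fn split_byte_ranges(
    flags: &ScanFlags,
    file_size: u64,
    target_partitions: usize,
) -> Option<Vec<Range<u64>>> {
    // Compressed: offsets in the compressed stream do not map to offsets in
    // the uncompressed data. Newlines in values: a line break near a range
    // boundary may sit inside a quoted CSV field rather than end a record.
    if flags.is_compressed || flags.new_lines_in_values {
        return None;
    }
    if file_size == 0 || target_partitions == 0 {
        return None;
    }

    // Ceiling division so that the ranges cover the whole file.
    let n = target_partitions as u64;
    let chunk = (file_size + n - 1) / n;

    let mut ranges = Vec::new();
    let mut start = 0u64;
    while start < file_size {
        let end = (start + chunk).min(file_size);
        ranges.push(start..end);
        start = end;
    }
    Some(ranges)
}
```

   With both flags unset, a 10-byte file and 3 target partitions give the 
ranges `0..4`, `4..8` and `8..10`; with either flag set the function returns 
`None` and the file is left as a single unit. A real reader would still need to 
align each range to a record boundary.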



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


