Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

via GitHub Sun, 06 Apr 2025 01:19:59 -0700


berkaysynnada commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2030077255



##########
datafusion/datasource/src/file_scan_config.rs:
##########
@@ -858,6 +858,96 @@ impl FileScanConfig {
         })
     }
 
+    /// Splits file groups into new groups based on statistics to enable 
efficient parallel processing.
+    ///
+    /// The method distributes files across a target number of partitions 
while ensuring
+    /// files within each partition maintain sort order based on their min/max 
statistics.
+    ///
+    /// The algorithm works by:
+    /// 1. Takes files sorted by minimum values
+    /// 2. For each file:
+    ///   - Finds eligible groups (empty or where file's min > group's last 
max)
+    ///   - Selects the smallest eligible group
+    ///   - Creates a new group if needed
+    ///
+    /// # Parameters
+    /// * `table_schema`: Schema containing information about the columns
+    /// * `file_groups`: The original file groups to split
+    /// * `sort_order`: The lexicographical ordering to maintain within each 
group
+    /// * `target_partitions`: The desired number of output partitions
+    ///
+    /// # Returns
+    /// A new set of file groups, where files within each group are 
non-overlapping with respect to
+    /// their min/max statistics and maintain the specified sort order.
+    pub fn split_groups_by_statistics_with_target_partitions(
+        table_schema: &SchemaRef,
+        file_groups: &[FileGroup],
+        sort_order: &LexOrdering,
+        target_partitions: usize,

Review Comment:
   Do we really need `target_partitions` as a parameter here? I think if this 
function takes that, it should return `Result<Option<...>>`, in case of the 
groups cannot split into the given partition count, being ordered (so we 
shouldn't check that at the caller side).
   
   The other option is removing that parameter, since the same logic can be 
written without that (no need to initiate the `file_groups_indices` based on 
this count)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

Reply via email to