Re: [PR] Add `statistics_by_partition API` to ExecutionPlan [datafusion]

via GitHub Thu, 10 Apr 2025 11:22:56 -0700


alamb commented on code in PR #15503:
URL: https://github.com/apache/datafusion/pull/15503#discussion_r2038006221



##########
datafusion/physical-plan/src/statistics.rs:
##########
@@ -0,0 +1,196 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Defines the cross join plan for loading the left side of the cross join
+//! and producing batches in parallel for the right partitions
+
+use datafusion_common::stats::Precision;
+use datafusion_common::{ColumnStatistics, ScalarValue, Statistics};
+use std::mem;
+use std::ops::Index;
+use std::sync::Arc;
+
+/// Represents statistics data grouped by partition.
+///
+/// This structure maintains a collection of statistics, one for each partition
+/// of a distributed dataset, allowing access to statistics by partition index.
+#[derive(Debug, Clone)]
+pub struct PartitionedStatistics {
+    inner: Vec<Arc<Statistics>>,
+}
+
+impl PartitionedStatistics {
+    pub fn new(statistics: Vec<Arc<Statistics>>) -> Self {
+        Self { inner: statistics }
+    }
+
+    pub fn statistics(&self, partition_idx: usize) -> &Statistics {
+        &self.inner[partition_idx]
+    }
+
+    pub fn get_statistics(&self, partition_idx: usize) -> Option<&Statistics> {
+        self.inner.get(partition_idx).map(|arc| arc.as_ref())
+    }
+
+    pub fn iter(&self) -> impl Iterator<Item = &Statistics> {
+        self.inner.iter().map(|arc| arc.as_ref())
+    }
+
+    pub fn is_empty(&self) -> bool {
+        self.inner.is_empty()
+    }
+
+    pub fn len(&self) -> usize {
+        self.inner.len()
+    }
+}
+
+impl Index<usize> for PartitionedStatistics {
+    type Output = Statistics;
+
+    fn index(&self, partition_idx: usize) -> &Self::Output {
+        self.statistics(partition_idx)
+    }
+}
+
+/// Generic function to compute statistics across multiple items that have statistics
+pub fn compute_summary_statistics<T, I>(
+    items: I,
+    column_count: usize,

Review Comment:
   I don't think `column_count` is enough here. Specifically, if we are merging statistics from multiple files, their columns need to be aligned first. For example, if we have two files:
   * File 1: `(a int32, b int32)`
   * File 2: `(b int32, a int32)`

   I think the code in this PR will combine statistics by column position, so the statistics for column `a` in one file would be merged with those for column `b` in the other.

   Instead, I recommend adding a function that knows how to map columns from the file schema to the table schema (filling in any missing columns with `ColumnStatistics::new_unknown`) before combining them.

   I think this would be easier to implement after the APIs in https://github.com/apache/datafusion/pull/15661 are in place.
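
To illustrate the alignment step suggested above, here is a minimal sketch. It deliberately avoids the real DataFusion types: `ColStats`, `align_to_table_schema`, and `merge_column` are hypothetical stand-ins invented for this example, with `None` playing the role of an unknown statistic. It shows only the idea of mapping by column name before merging, not how the PR or DataFusion implements it.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for per-column statistics (not a DataFusion type).
/// `None` means the value is unknown.
#[derive(Debug, Clone, PartialEq, Default)]
struct ColStats {
    null_count: Option<usize>,
    min: Option<i32>,
    max: Option<i32>,
}

/// Reorders one file's column statistics into table-schema order.
/// Columns that the file does not contain get all-unknown statistics.
fn align_to_table_schema(
    file_columns: &[&str],
    file_stats: &[ColStats],
    table_columns: &[&str],
) -> Vec<ColStats> {
    // Look statistics up by column name rather than by position.
    let by_name: HashMap<&str, &ColStats> = file_columns
        .iter()
        .copied()
        .zip(file_stats.iter())
        .collect();

    table_columns
        .iter()
        .map(|name| by_name.get(name).copied().cloned().unwrap_or_default())
        .collect()
}

/// Merges statistics for the same column from two files.
/// If either side is unknown, the merged value is unknown.
fn merge_column(a: &ColStats, b: &ColStats) -> ColStats {
    ColStats {
        null_count: a.null_count.zip(b.null_count).map(|(x, y)| x + y),
        min: a.min.zip(b.min).map(|(x, y)| x.min(y)),
        max: a.max.zip(b.max).map(|(x, y)| x.max(y)),
    }
}
```

With File 1 as `(a, b)` and File 2 as `(b, a)`, aligning both to the table order `(a, b)` before calling `merge_column` pairwise keeps the statistics for `a` and `b` separate, which merging by position alone would not.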



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
