Re: [PR] feat: support merge for `Distribution` [datafusion]
xudong963 commented on PR #15296: URL: https://github.com/apache/datafusion/pull/15296#issuecomment-2743593315

> Do you know any use cases where this method would be especially useful? If so, maybe we can study one of those cases in more detail. That could help us understand the real need and guide us toward a more solid algorithm.

Yes, we're considering restarting the [work](https://github.com/apache/datafusion/pull/13296/files#diff-8d786f45bc2d5bf629754a119ed6fa7998dcff7faacd954c45945b7047b87fa1R498), and given that `Precision` will be replaced with `Distribution`, I opened the proposal to discuss how to do **merge** for `Distribution`.

> a) Mixture Model (Weighted Average of PDFs)
> This is a method for combining different probability distributions.
> p1(x) and p2(x) are PDFs, and we give them equal weight (0.5). The combined PDF would be:
> _p_mix(x) = 0.5 * p1(x) + 0.5 * p2(x)_
> This creates a probabilistic blend of the two. The result is still a valid PDF (non-negative and integrates to 1).

It seems that my current approach is close to the mixture model; a toy sketch of that merge follows below.
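For intuition, here is a tiny self-contained sketch of the mixture-model merge. The `Pdf` type and `merge_mixture` helper are hypothetical illustrations for discrete distributions, not the `Distribution` API from this PR:

```rust
/// Illustrative only: NOT DataFusion's `Distribution` type.
#[derive(Debug, Clone)]
struct Pdf {
    /// (value, probability) pairs; probabilities sum to 1.
    mass: Vec<(f64, f64)>,
}

/// Merge two PDFs as a mixture model: p_mix(x) = w * p1(x) + (1 - w) * p2(x).
/// The result is still a valid PDF (non-negative, sums to 1).
fn merge_mixture(p1: &Pdf, p2: &Pdf, w: f64) -> Pdf {
    assert!((0.0..=1.0).contains(&w));
    let mut mass: Vec<(f64, f64)> = p1.mass.iter().map(|&(x, p)| (x, w * p)).collect();
    for &(x, p) in &p2.mass {
        // Accumulate mass for values present in both distributions.
        match mass.iter_mut().find(|(y, _)| *y == x) {
            Some((_, q)) => *q += (1.0 - w) * p,
            None => mass.push((x, (1.0 - w) * p)),
        }
    }
    Pdf { mass }
}

fn main() {
    let p1 = Pdf { mass: vec![(1.0, 0.5), (2.0, 0.5)] };
    let p2 = Pdf { mass: vec![(2.0, 1.0)] };
    // Equal weights: value 1.0 keeps mass 0.25, value 2.0 accumulates 0.75.
    println!("{:?}", merge_mixture(&p1, &p2, 0.5));
}
```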
Re: [I] multiply overflow in stats.rs [datafusion]
LindaSummer commented on issue #13775: URL: https://github.com/apache/datafusion/issues/13775#issuecomment-2746049186

Hi @Speculative,

Thanks very much for your investigation! ❤ I'm sorry that last month, due to personal circumstances, I didn't have enough time to follow up on this issue. I will try to follow your reproduction steps for this problem, and I will update this thread or reach out to you when I have any new progress or problems reproducing, if you don't mind.

Best Regards,
Edward
Re: [I] Reduce number of tokio blocking threads in SortExec spill [datafusion]
alamb commented on issue #15323: URL: https://github.com/apache/datafusion/issues/15323#issuecomment-2744217807

Makes sense -- with 183 spill files, we probably would need to merge in stages (sketched below). For example, starting with 183 spill files:

1. run 10 jobs, each merging about 18 files into one (results in 10 files)
2. run the final merge of those 10 files

This results in 2x the IO (we have to read / write each row twice), but it would at least be possible to parallelize the merges of the earlier step.

I think @2010YOUY01 was starting to look into a SpillFileManager -- this is the kind of behavior I would imagine being part of such a thing
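To make the staging concrete, here is a minimal sketch of how such a two-stage plan could be computed. The function and file names are hypothetical, not part of any existing SpillFileManager API:

```rust
/// Split `files` into `num_jobs` groups that can be merged in parallel;
/// the final stage then merges the intermediate results. Returns the
/// groups for stage 1 and the fan-in of the final merge.
fn two_stage_merge_plan(files: Vec<String>, num_jobs: usize) -> (Vec<Vec<String>>, usize) {
    let group_size = files.len().div_ceil(num_jobs);
    let groups: Vec<Vec<String>> = files.chunks(group_size).map(|c| c.to_vec()).collect();
    let final_fan_in = groups.len(); // number of inputs to the final merge
    (groups, final_fan_in)
}

fn main() {
    let files: Vec<String> = (0..183).map(|i| format!("spill-{i}.arrow")).collect();
    let (groups, final_fan_in) = two_stage_merge_plan(files, 10);
    // 10 parallel jobs of ~18-19 files each, then one final 10-way merge:
    // each row is read and written twice (2x IO), as described above.
    assert_eq!(groups.len(), 10);
    assert_eq!(final_fan_in, 10);
}
```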
Re: [PR] Improve performance of `first_value` by implementing special `GroupsAccumulator` [datafusion]
Dandandan commented on code in PR #15266: URL: https://github.com/apache/datafusion/pull/15266#discussion_r2008942439

## datafusion/functions-aggregate/src/first_last.rs:

@@ -179,6 +292,420 @@ impl AggregateUDFImpl for FirstValue {

```rust
struct FirstPrimitiveGroupsAccumulator<T>
where
    T: ArrowPrimitiveType + Send,
{
    // ===== state =====
    vals: Vec<T::Native>,
    // Stores ordering values, of the aggregator requirement corresponding to first value
    // of the aggregator.
    // The `orderings` are stored row-wise, meaning that `orderings[group_idx]`
    // represents the ordering values corresponding to the `group_idx`-th group.
    orderings: Vec<Vec<ScalarValue>>,
    // At the beginning, `is_sets[group_idx]` is false, which means `first` is not seen yet.
    // Once we see the first value, we set the `is_sets[group_idx]` flag
    is_sets: BooleanBufferBuilder,
    // null_builder[group_idx] == false => vals[group_idx] is null
    null_builder: BooleanBufferBuilder,
    // size of `self.orderings`
    // Calculating the memory usage of `self.orderings` using `ScalarValue::size_of_vec` is quite costly.
    // Therefore, we cache it and compute `size_of` only after each update
    // to avoid calling `ScalarValue::size_of_vec` by Self.size.
    size_of_orderings: usize,

    // ===== options =====

    // Stores the applicable ordering requirement.
    ordering_req: LexOrdering,
    // derived from `ordering_req`.
    sort_options: Vec<SortOptions>,
    // Stores whether incoming data already satisfies the ordering requirement.
    input_requirement_satisfied: bool,
    // Ignore null values.
    ignore_nulls: bool,
    /// The output type
    data_type: DataType,
    default_orderings: Vec<ScalarValue>,
}

impl<T> FirstPrimitiveGroupsAccumulator<T>
where
    T: ArrowPrimitiveType + Send,
{
    fn try_new(
        ordering_req: LexOrdering,
        ignore_nulls: bool,
        data_type: &DataType,
        ordering_dtypes: &[DataType],
    ) -> Result<Self> {
        let requirement_satisfied = ordering_req.is_empty();

        let default_orderings = ordering_dtypes
            .iter()
            .map(ScalarValue::try_from)
            .collect::<Result<Vec<_>>>()?;

        let sort_options = get_sort_options(ordering_req.as_ref());

        Ok(Self {
            null_builder: BooleanBufferBuilder::new(0),
            ordering_req,
            sort_options,
            input_requirement_satisfied: requirement_satisfied,
            ignore_nulls,
            default_orderings,
            data_type: data_type.clone(),
            vals: Vec::new(),
            orderings: Vec::new(),
            is_sets: BooleanBufferBuilder::new(0),
            size_of_orderings: 0,
        })
    }

    fn need_update(&self, group_idx: usize) -> bool {
        if !self.is_sets.get_bit(group_idx) {
            return true;
        }

        if self.ignore_nulls && !self.null_builder.get_bit(group_idx) {
            return true;
        }

        !self.input_requirement_satisfied
    }

    fn should_update_state(
        &self,
        group_idx: usize,
        new_ordering_values: &[ScalarValue],
    ) -> Result<bool> {
        if !self.is_sets.get_bit(group_idx) {
            return Ok(true);
        }

        assert!(new_ordering_values.len() == self.ordering_req.len());
        let current_ordering = &self.orderings[group_idx];
        compare_rows(current_ordering, new_ordering_values, &self.sort_options)
            .map(|x| x.is_gt())
    }

    fn take_orderings(&mut self, emit_to: EmitTo) -> Vec<Vec<ScalarValue>> {
        let result = emit_to.take_needed(&mut self.orderings);

        match emit_to {
            EmitTo::All => self.size_of_orderings = 0,
            EmitTo::First(_) => {
                self.size_of_orderings -=
                    result.iter().map(ScalarValue::size_of_vec).sum::<usize>()
            }
        }

        result
    }

    fn take_need(
        bool_buf_builder: &mut BooleanBufferBuilder,
        emit_to: EmitTo,
    ) -> BooleanBuffer {
        let bool_buf = bool_buf_builder.finish();
        match emit_to {
            EmitTo::All => bool_buf,
            EmitTo::First(n) => {
                // split off the first N values in seen_values
                //
                // TODO make this more efficient rather than two
                // copies and bitwise manipulation
                let first_n: BooleanBuffer = bool_buf.iter().take(n).collect();
                // reset the existing buffer
                for b in bool_buf.iter().skip(n) {
                    bool_buf_builder.append(b);
                }
                first_n
            }
        }
    }

    fn resize_states(&mut self, new_size: usize) {
        self.vals.resize(new_size, T::default_value());

        self.null_builder.resize(new_size);

        if self.orderings.len() < new_size {
            let
```
Re: [PR] Always use `PartitionMode::Auto` in planner [datafusion]
ozankabak commented on code in PR #15339: URL: https://github.com/apache/datafusion/pull/15339#discussion_r2008738870

## datafusion/sqllogictest/test_files/explain_tree.slt:

@@ -345,63 +345,68 @@ FROM
 physical_plan
 01)┌───┐
-02)│CoalesceBatchesExec│
+02)│ ProjectionExec │

Review Comment:
Why do we have this projection here? Shouldn't the built-in projection of hash join handle this?
Re: [PR] Only unnest source for `EmptyRelation` [datafusion]
goldmedal commented on code in PR #15159: URL: https://github.com/apache/datafusion/pull/15159#discussion_r2008701304

## datafusion/sql/tests/cases/plan_to_sql.rs:

@@ -633,6 +633,18 @@ fn roundtrip_statement_with_dialect() -> Result<()> {

```rust
        TestStatementWithDialect {
            sql: "SELECT unnest([1, 2, 3, 4]) from unnest([1, 2, 3]);",
            expected: r#"SELECT UNNEST([1, 2, 3, 4]) AS UNNEST(make_array(Int64(1),Int64(2),Int64(3),Int64(4))) FROM UNNEST([1, 2, 3])"#,
            parser_dialect: Box::new(GenericDialect {}),
            unparser_dialect: Box::new(CustomDialectBuilder::default().with_unnest_as_table_factor(true).build()),
        },
        TestStatementWithDialect {
            sql: "SELECT unnest([1, 2, 3, 4]) from unnest([1, 2, 3]);",
            expected: r#"SELECT UNNEST([1, 2, 3, 4]) AS UNNEST(make_array(Int64(1),Int64(2),Int64(3),Int64(4))) FROM UNNEST([1, 2, 3])"#,
            parser_dialect: Box::new(GenericDialect {}),
            unparser_dialect: Box::new(CustomDialectBuilder::default().with_unnest_as_table_factor(true).build()),
        },
```

Review Comment:
These two look the same. 🤔 I think we can remove one of them.
Re: [PR] Triggering extended tests through PR comment [datafusion]
Omega359 commented on PR #15101: URL: https://github.com/apache/datafusion/pull/15101#issuecomment-2743066275

@alamb
Re: [PR] Improve collection during repr and repr_html [datafusion-python]
timsaucer commented on code in PR #1036: URL: https://github.com/apache/datafusion-python/pull/1036#discussion_r2008764438

## src/dataframe.rs:

@@ -771,3 +871,82 @@ fn record_batch_into_schema(
     RecordBatch::try_new(schema, data_arrays)
 }

```rust
/// This is a helper function to return the first non-empty record batch from executing a DataFrame.
/// It additionally returns a bool, which indicates if there are more record batches available.
/// We do this so we can determine if we should indicate to the user that the data has been
/// truncated. This collects until we have achieved both of these conditions:
///
/// - We have collected our minimum number of rows
/// - We have reached our limit, either data size or maximum number of rows
///
/// Otherwise it will return when the stream has been exhausted. If you want a specific number of
/// rows, set min_rows == max_rows.
async fn collect_record_batches_to_display(
    df: DataFrame,
    min_rows: usize,
    max_rows: usize,
) -> Result<(Vec<RecordBatch>, bool), DataFusionError> {
    let mut stream = df.execute_stream().await?;
    let mut size_estimate_so_far = 0;
    let mut rows_so_far = 0;
    let mut record_batches = Vec::default();
    let mut has_more = false;

    while (size_estimate_so_far < MAX_TABLE_BYTES_TO_DISPLAY && rows_so_far < max_rows)
        || rows_so_far < min_rows
    {
        let mut rb = match stream.next().await {
            None => {
                break;
            }
            Some(Ok(r)) => r,
            Some(Err(e)) => return Err(e),
        };

        let mut rows_in_rb = rb.num_rows();
        if rows_in_rb > 0 {
            size_estimate_so_far += rb.get_array_memory_size();

            if size_estimate_so_far > MAX_TABLE_BYTES_TO_DISPLAY {
                let ratio = MAX_TABLE_BYTES_TO_DISPLAY as f32 / size_estimate_so_far as f32;
                let total_rows = rows_in_rb + rows_so_far;

                let mut reduced_row_num = (total_rows as f32 * ratio).round() as usize;
```

Review Comment:
Yes. And the data size is an estimate as well. The point is to get a general ballpark and not necessarily an exact measure. It should still indicate that the data have been truncated.
Re: [PR] Support Avg distinct [datafusion]
qazxcdswe123 commented on code in PR #15356: URL: https://github.com/apache/datafusion/pull/15356#discussion_r2008792246

## datafusion/functions-aggregate-common/src/aggregate/avg_distinct/decimal.rs:

@@ -0,0 +1,133 @@

```rust
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements.  See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership.  The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License.  You may obtain a copy of the License at
//
//   http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied.  See the License for the
// specific language governing permissions and limitations
// under the License.

use ahash::RandomState;
use arrow::{
    array::{ArrayRef, ArrowNativeTypeOp, PrimitiveArray},
    datatypes::{ArrowNativeType, DecimalType},
};
use datafusion_common::{
    cast::{as_list_array, as_primitive_array},
    utils::{memory::estimate_memory_size, SingleRowListArrayBuilder},
    HashSet, ScalarValue,
};
use datafusion_expr_common::accumulator::Accumulator;
use std::{fmt::Debug, sync::Arc};

use crate::utils::{DecimalAverager, Hashable};

/// Generic implementation of `AVG DISTINCT` for Decimal types.
/// Handles both Decimal128Type and Decimal256Type.
#[derive(Debug)]
pub struct DecimalDistinctAvgAccumulator<T: DecimalType> {
    values: HashSet<Hashable<T::Native>, RandomState>,
    sum_scale: i8,
    target_precision: u8,
    target_scale: i8,
}

impl<T: DecimalType> DecimalDistinctAvgAccumulator<T> {
    pub fn with_decimal_params(
        sum_scale: i8,
        target_precision: u8,
        target_scale: i8,
    ) -> Self {
        Self {
            values: HashSet::default(),
            sum_scale,
            target_precision,
            target_scale,
        }
    }
}

impl<T: DecimalType> Accumulator for DecimalDistinctAvgAccumulator<T> {
    fn state(&mut self) -> datafusion_common::Result<Vec<ScalarValue>> {
        let arr = Arc::new(
            PrimitiveArray::<T>::from_iter_values(self.values.iter().map(|v| v.0))
                .with_data_type(T::TYPE_CONSTRUCTOR(T::MAX_PRECISION, self.sum_scale)),
```

Review Comment:
I just use MAX_PRECISION and I think it should be fine?
[I] Add "end to end parquet reading test" for WASM [datafusion]
alamb opened a new issue, #15357: URL: https://github.com/apache/datafusion/issues/15357

I think the current code tests the re-exported Parquet functionalities, not touching the DataFusion-related code. Ideally, we should test the end-to-end Parquet reading process. The process roughly looks like this (a sketch follows below):

1. Create an [in-memory object_store](https://docs.rs/object_store/latest/object_store/memory/struct.InMemory.html), and put the Parquet data you generated into the object_store.
2. [Register the object_store](https://github.com/apache/datafusion/blob/3269f01b42021cfab181577d579b0544808b4fca/datafusion/core/src/execution/context/mod.rs#L494) along with the path to DataFusion.
3. Run a SQL query from the DataFusion side to see if the results can be read back.

A loosely related test can be found here: https://github.com/XiangpengHao/parquet-viewer/blob/main/src/tests.rs#L9

_Originally posted by @XiangpengHao in https://github.com/apache/datafusion/pull/15325#discussion_r2008190901_
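A rough sketch of what such a test could look like, assuming the `datafusion`, `object_store`, `url`, and `tokio` crates. The `mem://bucket` scheme is an arbitrary choice, and the exact `put` payload conversion varies across object_store versions:

```rust
use std::sync::Arc;

use datafusion::arrow::array::Int32Array;
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::parquet::arrow::ArrowWriter;
use datafusion::prelude::*;
use object_store::{memory::InMemory, path::Path, ObjectStore};
use url::Url;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Generate a small Parquet file entirely in memory.
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        Arc::clone(&schema),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
    )?;
    let mut parquet_bytes = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut parquet_bytes, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;

    // 1. Put the Parquet bytes into an in-memory object store.
    let store = Arc::new(InMemory::new());
    store
        .put(&Path::from("data/test.parquet"), parquet_bytes.into())
        .await?;

    // 2. Register the object store with DataFusion under a custom scheme.
    let ctx = SessionContext::new();
    ctx.register_object_store(&Url::parse("mem://bucket").unwrap(), store);

    // 3. Read the data back through the full DataFusion SQL path.
    ctx.register_parquet(
        "t",
        "mem://bucket/data/test.parquet",
        ParquetReadOptions::default(),
    )
    .await?;
    let results = ctx.sql("SELECT count(*) FROM t").await?.collect().await?;
    assert!(!results.is_empty());
    Ok(())
}
```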
Re: [I] Add "end to end parquet reading test" for WASM [datafusion]
jsai28 commented on issue #15357: URL: https://github.com/apache/datafusion/issues/15357#issuecomment-2745315413

take
Re: [I] Improve Spill Performance: `mmap` the spill files [datafusion]
alamb commented on issue #15321: URL: https://github.com/apache/datafusion/issues/15321#issuecomment-2745343112

BTW I am not sure the code is really in a great position to do this one yet -- it might help to wait for @2010YOUY01 (or help him) to pull some of the spilling code into its own structure -- see https://github.com/apache/datafusion/pull/15355
Re: [PR] fix type coercion for uint/int's [datafusion]
Omega359 commented on PR #15341: URL: https://github.com/apache/datafusion/pull/15341#issuecomment-2745343678

> However, I am struggling to understand the implications of this change to a user. Like for example, if we were going to add a note about this in the upgrade / release notes, what would it say?

An issue was fixed where type coercion between expressions using certain mathematical operations on mixed unsigned / signed types could produce values with an incorrect type, one not large enough to encompass all the possible values of both input types. For example, comparing an unsigned int32 with a signed int32 could result in values having the int32 type (where it should be int64) and could produce a "Can't cast .." error for any unsigned value larger than the maximum int32 value. This change may result in expressions unexpectedly having a 'larger' output type than they would have had in previous releases.

> Or put another way, what problem is this PR solving (the ticket just describes what the code does as wrong but it doesn't say why) 🤔

```Rust
let df = df
    .select_columns(&[P_ID, IDENTITY_KEY_VALUE])?
    .with_column(
        IDENTITY_KEY_VALUE,
        cast(hex_to_u64.call(vec![col(IDENTITY_KEY_VALUE)]), DataType::UInt64),
    )?
    .with_column(PARTITION_COLUMN, col(IDENTITY_KEY_VALUE).rem(lit(64)))?;
```

That currently throws `Cast error: Can't cast value 16858640775341098663 to type Int32`. Is it easy to work around? Yes. Should it happen? No.
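For intuition on why the coercion has to widen, here is a tiny pure-Rust illustration (not DataFusion code) of the ranges involved:

```rust
// Neither 32-bit type can represent the other's full range, but i64 holds both,
// which is why UInt32 vs. Int32 must coerce to Int64.
fn main() {
    let unsigned: u32 = 4_294_967_295; // u32::MAX, not representable as i32
    let signed: i32 = -1;              // negative, not representable as u32

    // Coercing both sides to i64 compares the true values losslessly.
    assert!(i64::from(unsigned) > i64::from(signed));

    // Coercing to i32 instead would wrap u32::MAX around to -1 and compare equal.
    assert_eq!(unsigned as i32, signed);
}
```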
Re: [I] Support `merge` for `Distribution` [datafusion]
xudong963 closed issue #15290: Support `merge` for `Distribution`
URL: https://github.com/apache/datafusion/issues/15290
Re: [PR] feat: support merge for `Distribution` [datafusion]
xudong963 closed pull request #15296: feat: support merge for `Distribution`
URL: https://github.com/apache/datafusion/pull/15296
Re: [PR] 1064/enhancement/add functions to Expr class [datafusion-python]
timsaucer commented on PR #1074: URL: https://github.com/apache/datafusion-python/pull/1074#issuecomment-2745345231

Let me know when it's ready for review
Re: [PR] feat: support merge for `Distribution` [datafusion]
xudong963 commented on PR #15296: URL: https://github.com/apache/datafusion/pull/15296#issuecomment-2745345742

Thanks for your suggestions!! @alamb @ozankabak @berkaysynnada and @kosiew

I'll continue this work after the `Migrate to Distribution from Precision` effort is done. I think once that lands, the new statistics framework will be relatively stable and we can see how `Distribution` is integrated into the DataFusion core. It will definitely help me do further work around the new statistics framework.

Again, I sincerely appreciate you taking the time to review and discuss ❤️
Re: [PR] 1064/enhancement/add functions to Expr class [datafusion-python]
timsaucer commented on PR #1074: URL: https://github.com/apache/datafusion-python/pull/1074#issuecomment-2745345097

I do love this PR. I hadn't looked at it since it's in draft, but I fully endorse.
Re: [I] DeltaLake integration not working (Python) (FFI Table providers not working) [datafusion-python]
riziles closed issue #1077: DeltaLake integration not working (Python) (FFI Table providers not working)
URL: https://github.com/apache/datafusion-python/issues/1077
Re: [PR] feat: add test to check for `ctx.read_json()` [datafusion-ballista]
westhide commented on PR #1212: URL: https://github.com/apache/datafusion-ballista/pull/1212#issuecomment-2743805128

> I believe the whole job should be cancelled

Yes, working on it.
Re: [PR] Blog post on Parquet filter pushdown [datafusion-site]
XiangpengHao commented on PR #61: URL: https://github.com/apache/datafusion-site/pull/61#issuecomment-2745401638

Thank you @kevinjqliu
Re: [PR] build(deps): bump datafusion-substrait from 45.0.0 to 46.0.0 [datafusion-python]
dependabot[bot] closed pull request #1048: build(deps): bump datafusion-substrait from 45.0.0 to 46.0.0
URL: https://github.com/apache/datafusion-python/pull/1048
Re: [PR] build(deps): bump datafusion from 45.0.0 to 46.0.0 [datafusion-python]
dependabot[bot] commented on PR #1051: URL: https://github.com/apache/datafusion-python/pull/1051#issuecomment-2745403955

Looks like datafusion is up-to-date now, so this is no longer needed.
Re: [PR] build(deps): bump datafusion-proto from 45.0.0 to 46.0.0 [datafusion-python]
dependabot[bot] commented on PR #1049: URL: https://github.com/apache/datafusion-python/pull/1049#issuecomment-2745403935

Looks like datafusion-proto is up-to-date now, so this is no longer needed.
Re: [PR] build(deps): bump datafusion-ffi from 45.0.0 to 46.0.0 [datafusion-python]
dependabot[bot] closed pull request #1050: build(deps): bump datafusion-ffi from 45.0.0 to 46.0.0
URL: https://github.com/apache/datafusion-python/pull/1050
Re: [PR] build(deps): bump datafusion-proto from 45.0.0 to 46.0.0 [datafusion-python]
dependabot[bot] closed pull request #1049: build(deps): bump datafusion-proto from 45.0.0 to 46.0.0
URL: https://github.com/apache/datafusion-python/pull/1049
Re: [PR] build(deps): bump uuid from 1.13.1 to 1.16.0 [datafusion-python]
dependabot[bot] commented on PR #1068: URL: https://github.com/apache/datafusion-python/pull/1068#issuecomment-2745403994

Looks like uuid is up-to-date now, so this is no longer needed.
Re: [PR] build(deps): bump datafusion-substrait from 45.0.0 to 46.0.0 [datafusion-python]
dependabot[bot] commented on PR #1048: URL: https://github.com/apache/datafusion-python/pull/1048#issuecomment-2745403927

Looks like datafusion-substrait is up-to-date now, so this is no longer needed.
Re: [PR] build(deps): bump arrow from 54.2.0 to 54.2.1 [datafusion-python]
dependabot[bot] closed pull request #1038: build(deps): bump arrow from 54.2.0 to 54.2.1
URL: https://github.com/apache/datafusion-python/pull/1038
Re: [PR] feat: Update DataFusion dependency to 46 [datafusion-python]
timsaucer merged PR #1079: URL: https://github.com/apache/datafusion-python/pull/1079
Re: [PR] Parse `SUBSTR` as alias for `SUBSTRING` [datafusion-sqlparser-rs]
iffyio merged PR #1769: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1769
[PR] feat: Update DataFusion dependency to 46 [datafusion-python]
timsaucer opened a new pull request, #1079: URL: https://github.com/apache/datafusion-python/pull/1079

# Which issue does this PR close?

None

# Rationale for this change

In preparing for the next release, update the datafusion dependency.

# What changes are included in this PR?

Corresponding changes in the Expr representation due to upstream changes. Window and Aggregate functions have a slight change to their structs.

# Are there any user-facing changes?

None. These are all internal changes.
Re: [PR] Documentation: Plan custom expressions [datafusion]
alamb commented on code in PR #15353: URL: https://github.com/apache/datafusion/pull/15353#discussion_r2008792240

## docs/source/library-user-guide/adding-udfs.md:

@@ -1160,6 +1160,89 @@ async fn main() -> Result<()> {

## Custom Expression Planning

DataFusion provides native support for a limited set of SQL operators by default. For operators not natively defined, developers can extend DataFusion's functionality by implementing custom expression planning. This extensibility is a core feature of DataFusion, allowing it to be customized for particular workloads and requirements.

### Implementing Custom Expression Planning

To extend DataFusion with support for custom operators not natively available, you need to:

1. Implement the `ExprPlanner` trait: This allows you to define custom logic for planning expressions that DataFusion doesn't natively recognize. The trait provides the necessary interface to translate logical expressions into physical execution plans.

   For a detailed documentation please see: [Trait ExprPlanner](https://docs.rs/datafusion/latest/datafusion/logical_expr/planner/trait.ExprPlanner.html)

2. Register your custom planner: Integrate your implementation with DataFusion's `SessionContext` to ensure your custom planning logic is invoked during the query optimization and execution planning phase.

   For a detailed documentation please see: [fn register_expr_planner](https://docs.rs/datafusion/latest/datafusion/execution/trait.FunctionRegistry.html#method.register_expr_planner)

Review Comment:
```suggestion
   For a detailed documentation see: [fn register_expr_planner](https://docs.rs/datafusion/latest/datafusion/execution/trait.FunctionRegistry.html#method.register_expr_planner)
```

## docs/source/library-user-guide/adding-udfs.md:

(same hunk as above, continuing)

See example below:

```rust
# use arrow::array::RecordBatch;
# use std::sync::Arc;

# use datafusion::common::{assert_batches_eq, DFSchema};
# use datafusion::error::Result;
# use datafusion::execution::FunctionRegistry;
# use datafusion::logical_expr::Operator;
# use datafusion::prelude::*;
# use datafusion::sql::sqlparser::ast::BinaryOperator;
# use datafusion_common::ScalarValue;
# use datafusion_expr::expr::Alias;
# use datafusion_expr::planner::{ExprPlanner, PlannerResult, RawBinaryExpr};
# use datafusion_expr::BinaryExpr;

# #[derive(Debug)]
# // Define the custom planner
# struct MyCustomPlanner;

// Implement ExprPlanner for custom operator logic
impl ExprPlanner for MyCustomPlanner {
    fn plan_binary_op(
        &self,
        expr: RawBinaryExpr,
        _schema: &DFSchema,
    ) -> Result<PlannerResult<RawBinaryExpr>> {
        match &expr.op {
            // Map `->` to string concatenation
            BinaryOperator::Arrow => {
                // Rewrite `->` as a string concatenation operation
                // - `left` and `right` are the operands (e.g., 'hello' and 'world')
                // - `Operator::StringConcat` tells DataFusion to concatenate them
                Ok(PlannerResult::Planned(Expr::BinaryExpr(BinaryExpr {
                    left: Box::new(expr.left.clone()),
                    right: Box::new(expr.right.clone()),
                    op: Operator::StringConcat,
                })))
            }
            _ => Ok(PlannerResult::Original(expr)),
        }
    }
}

use datafusion::execution::context::SessionContext;
use datafusion::arrow::util::pretty;

#[tokio::main]
async fn main() -> Result<()> {
    let config =
```
Re: [I] DeltaLake integration not working (Python) (FFI Table providers not working) [datafusion-python]
riziles commented on issue #1077: URL: https://github.com/apache/datafusion-python/issues/1077#issuecomment-2745324305

@timsaucer , I upgraded deltalake to 0.25.4 and datafusion to 45.2.0, and the example above now works fine. Thank you!
Re: [PR] feat: Add `datafusion-spark` crate [datafusion]
alamb commented on code in PR #15168: URL: https://github.com/apache/datafusion/pull/15168#discussion_r2008804385

## datafusion/spark/src/function/math/expm1.rs:

@@ -0,0 +1,169 @@

```rust
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements.  See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership.  The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License.  You may obtain a copy of the License at
//
//   http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied.  See the License for the
// specific language governing permissions and limitations
// under the License.

use crate::function::error_utils::{
    invalid_arg_count_exec_err, unsupported_data_type_exec_err,
};
use arrow::array::{ArrayRef, AsArray};
use arrow::datatypes::{DataType, Float64Type};
use datafusion_common::{Result, ScalarValue};
use datafusion_expr::{
    ColumnarValue, ScalarFunctionArgs, ScalarUDFImpl, Signature, Volatility,
};
use datafusion_macros::user_doc;
use std::any::Any;
use std::sync::Arc;

#[user_doc(
    doc_section(label = "Spark Math Functions"),
    description = "Returns exp(expr) - 1 as a Float64.",
    syntax_example = "expm1(expr)",
    sql_example = r#"```sql
> select expm1(0);
+----------+
| expm1(0) |
+----------+
| 0.0      |
+----------+
> select expm1(1);
+-------------------+
| expm1(1)          |
+-------------------+
| 1.718281828459045 |
+-------------------+
```"#,
    argument(
        name = "expr",
        description = "An expression that evaluates to a numeric."
    )
)]
#[derive(Debug)]
pub struct SparkExpm1 {
    signature: Signature,
    aliases: Vec<String>,
}

impl Default for SparkExpm1 {
    fn default() -> Self {
        Self::new()
    }
}

impl SparkExpm1 {
    pub fn new() -> Self {
        Self {
            signature: Signature::user_defined(Volatility::Immutable),
            aliases: vec!["spark_expm1".to_string()],
        }
    }
}

impl ScalarUDFImpl for SparkExpm1 {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn name(&self) -> &str {
        "expm1"
    }

    fn signature(&self) -> &Signature {
        &self.signature
    }

    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
        Ok(DataType::Float64)
    }

    fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
        if args.args.len() != 1 {
            return Err(invalid_arg_count_exec_err("expm1", (1, 1), args.args.len()));
        }
        match &args.args[0] {
            ColumnarValue::Scalar(ScalarValue::Float64(value)) => Ok(
                ColumnarValue::Scalar(ScalarValue::Float64(value.map(|x| x.exp_m1()))),
            ),
            ColumnarValue::Array(array) => match array.data_type() {
                DataType::Float64 => Ok(ColumnarValue::Array(Arc::new(
                    array
                        .as_primitive::<Float64Type>()
                        .unary::<_, Float64Type>(|x| x.exp_m1()),
                ) as ArrayRef)),
                other => Err(unsupported_data_type_exec_err(
                    "expm1",
                    format!("{}", DataType::Float64).as_str(),
                    other,
                )),
            },
            other => Err(unsupported_data_type_exec_err(
                "expm1",
                format!("{}", DataType::Float64).as_str(),
                &other.data_type(),
            )),
        }
    }

    fn aliases(&self) -> &[String] {
        &self.aliases
    }

    fn coerce_types(&self, arg_types: &[DataType]) -> Result<Vec<DataType>> {
        if arg_types.len() != 1 {
            return Err(invalid_arg_count_exec_err("expm1", (1, 1), arg_types.len()));
        }
        if arg_types[0].is_numeric() {
            Ok(vec![DataType::Float64])
        } else {
            Err(unsupported_data_type_exec_err(
                "expm1",
                "Numeric Type",
                &arg_types[0],
            ))
        }
    }
}

#[cfg(test)]
mod tests {
    use crate::function::math::expm1::SparkExpm1;
    use crate::function::utils::test::test_scalar_function;
    use arrow::array::{Array, Float64Array};
    use arrow::datatypes::DataType::Float64;
    use datafusion_common::{Result, ScalarValue};
    use datafusion_expr::{ColumnarValue, ScalarUDFImpl};

    macro_rules! test_expm1_float64_invoke {
        ($INPUT:expr, $EXPECTED:expr) => {
            test_scalar_function!(
                SparkExpm1::new(),
```
Re: [PR] fix type coercion for uint/int's [datafusion]
alamb commented on code in PR #15341: URL: https://github.com/apache/datafusion/pull/15341#discussion_r2008804758

## datafusion/optimizer/tests/optimizer_integration.rs:

@@ -267,8 +267,8 @@ fn push_down_filter_groupby_expr_contains_alias() {
     let sql = "SELECT * FROM (SELECT (col_int32 + col_uint32) AS c, count(*) FROM test GROUP BY 1) where c > 3";
     let plan = test_sql(sql).unwrap();
     let expected = "Projection: test.col_int32 + test.col_uint32 AS c, count(Int64(1)) AS count(*)\
-    \n  Aggregate: groupBy=[[test.col_int32 + CAST(test.col_uint32 AS Int32)]], aggr=[[count(Int64(1))]]\
-    \n    Filter: test.col_int32 + CAST(test.col_uint32 AS Int32) > Int32(3)\
+    \n  Aggregate: groupBy=[[CAST(test.col_int32 AS Int64) + CAST(test.col_uint32 AS Int64)]], aggr=[[count(Int64(1))]]\

Review Comment:
this might be slower -- as now the larger column type is used (so it needs to do a 64 bit comparison rather than 32 bit) 🤔

Review Comment:
but it probably also doesn't lose precision
Re: [PR] fix type coercion for uint/int's [datafusion]
Omega359 commented on code in PR #15341: URL: https://github.com/apache/datafusion/pull/15341#discussion_r2008808511

## datafusion/optimizer/tests/optimizer_integration.rs:

(same hunk as in the previous comment)

Review Comment:
Correctness > performance.
Re: [I] Extended tests failing on main [datafusion]
alamb commented on issue #15359: URL: https://github.com/apache/datafusion/issues/15359#issuecomment-2745343956

Instructions to update the expected results are in https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/README.md#running-tests-sqlite
[I] [DISCUSS] In-house implementation of certain dependency package. [datafusion]
logan-keede opened a new issue, #15360: URL: https://github.com/apache/datafusion/issues/15360

### Is your feature request related to a problem or challenge?

DataFusion has many dependencies, which increase binary size and compile time while saving us the trouble of maintaining those implementations in our own codebase.

### Describe the solution you'd like

This issue is to identify any crate/dependency `shallow` and `stable` enough to justify an in-house implementation.

'shallow': We do not use much of the original crate, i.e. we just use a small part that can be implemented in-house without having to implement/port the massive dependency tree that crate has. The shallower the dependency, the more beneficial an in-house implementation would be.

'stable': The code is mostly stable and will not require much active maintenance after an in-house implementation lands. (While all code eventually requires some changes, it is about how frequently and disruptively it will give us a headache.)

### Describe alternatives you've considered

Do not have any in-house implementations. Pay the price of increased binary size and compile time.

### Additional context

_Originally suggested by @ozankabak while discussing my GSoC 2025 Proposal for "Optimizing compile time and binary size" over Discord, and before that in #14478._
Re: [PR] 1064/enhancement/add functions to Expr class [datafusion-python]
deanm commented on PR #1074: URL: https://github.com/apache/datafusion-python/pull/1074#issuecomment-2745349544

I'll mark it ready and just do a second PR for more methods.
Re: [PR] Support Avg distinct [datafusion]
qazxcdswe123 commented on code in PR #15356: URL: https://github.com/apache/datafusion/pull/15356#discussion_r2008777574

## datafusion/sqllogictest/test_files/aggregate.slt:

@@ -6686,3 +6688,35 @@ SELECT a, median(b), arrow_typeof(median(b)) FROM group_median_all_nulls GROUP B
 group0 NULL Int32
 group1 NULL Int32
+
+statement ok
+create table t_decimal (c decimal(10, 4)) as values (100.00), (125.00), (175.00), (200.00), (200.00), (300.00), (null), (null);
+
+# Test avg_distinct for Decimal128
+query RT
+select avg(distinct c), arrow_typeof(avg(distinct c)) from t_decimal;
+----
+180 Decimal128(14, 8)

Review Comment:
I'm not sure whether the types `Decimal128(14, 8)` here and `Decimal256(54, 6)` below are correct
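For what it's worth, assuming DataFusion's usual decimal `avg` return-type rule (precision and scale each grow by 4, capped at the type maximum -- an assumption worth verifying against `avg_return_type`), `Decimal128(14, 8)` is exactly what a `decimal(10, 4)` input would produce:

```rust
// Hedged sketch of the assumed avg return-type rule for Decimal128.
fn avg_decimal128_type(precision: u8, scale: i8) -> (u8, i8) {
    const DECIMAL128_MAX_PRECISION: u8 = 38;
    const DECIMAL128_MAX_SCALE: i8 = 38;
    (
        DECIMAL128_MAX_PRECISION.min(precision + 4),
        DECIMAL128_MAX_SCALE.min(scale + 4),
    )
}

fn main() {
    // decimal(10, 4) input -> Decimal128(14, 8), matching the test output above.
    assert_eq!(avg_decimal128_type(10, 4), (14, 8));
}
```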
[I] Extended tests failing on main [datafusion]
alamb opened a new issue, #15359: URL: https://github.com/apache/datafusion/issues/15359

### Describe the bug

example https://github.com/apache/datafusion/actions/runs/14005503215/job/39218865563

```
External error: query is expected to fail with error:
    (regex) DataFusion error: Error during planning: Projection references non\-aggregate values: Expression cor0\.col1 could not be resolved from available columns: cor0\.col2
but got error:
    DataFusion error: Error during planning: Column in SELECT must be in GROUP BY or an aggregate function: While expanding wildcard, column "cor0.col1" must appear in the GROUP BY clause or must be part of an aggregate function, currently only "cor0.col2" appears in the SELECT clause satisfies this requirement
[SQL] SELECT - COALESCE ( - 54, + cor0.col1 * + NULLIF ( + cor0.col1, cor0.col0 ), - cor0.col1 ) AS col0 FROM tab1 cor0 GROUP BY cor0.col2
at ../../datafusion-testing/data/sqlite/random/groupby/slt_good_7.slt:14159
```

Happened first with this PR:
- https://github.com/apache/datafusion/pull/15287

I think the expected messages just need updating

### To Reproduce

_No response_

### Expected behavior

_No response_

### Additional context

_No response_
Re: [PR] SET statements: scope modifier for multiple assignments [datafusion-sqlparser-rs]
alamb commented on PR #1772: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1772#issuecomment-2745343279

Thank you all for keeping this train (of PRs) rolling. Very impressive I must say
Re: [PR] 1064/enhancement/add functions to Expr class [datafusion-python]
timsaucer commented on PR #1074: URL: https://github.com/apache/datafusion-python/pull/1074#issuecomment-2745373049

At a high level I like some aspects of a namespace. It would especially be nice to use them to clean up our documentation in `functions`. I haven't thought through what this would mean for organizing both `functions` and `Expr`. We would at least need to mark some helper functions as deprecated for a few releases if we did this. It's a really good suggestion, but I just want to make sure we have a consistent story across the repo. I haven't looked into how we could do this in PyO3 or whether it would just be in the wrappers.
Re: [I] [DISCUSS] In-house implementation of certain dependencies. [datafusion]
ozankabak commented on issue #15360: URL: https://github.com/apache/datafusion/issues/15360#issuecomment-2745373075

If we can write a script to approximately estimate "how much" of an external crate we are using, we can create a starter list of candidates for this
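As a rough starting point (not an existing DataFusion tool), a script could approximate usage by counting the distinct `use` paths a codebase pulls from a crate. Grouped imports like `use chrono::{a, b}` are counted as one entry here, so this is only a ball-park proxy:

```rust
use std::collections::BTreeSet;
use std::fs;
use std::path::Path;

/// Recursively collect the distinct `use <krate>::...` paths in `.rs` files under `dir`.
fn visit(dir: &Path, krate: &str, items: &mut BTreeSet<String>) {
    let Ok(entries) = fs::read_dir(dir) else { return };
    for entry in entries.flatten() {
        let path = entry.path();
        if path.is_dir() {
            visit(&path, krate, items);
        } else if path.extension().is_some_and(|e| e == "rs") {
            let Ok(src) = fs::read_to_string(&path) else { continue };
            let prefix = format!("use {krate}::");
            for line in src.lines().map(str::trim) {
                if let Some(rest) = line.strip_prefix(&prefix) {
                    items.insert(rest.trim_end_matches(';').to_string());
                }
            }
        }
    }
}

fn main() {
    let mut items = BTreeSet::new();
    // Example: how many distinct chrono paths does the `datafusion` directory import?
    visit(Path::new("datafusion"), "chrono", &mut items);
    println!("distinct `use chrono::...` paths: {}", items.len());
}
```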
Re: [PR] build(deps): bump datafusion-ffi from 45.0.0 to 46.0.0 [datafusion-python]
dependabot[bot] commented on PR #1050: URL: https://github.com/apache/datafusion-python/pull/1050#issuecomment-2745404001 Looks like datafusion-ffi is up-to-date now, so this is no longer needed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] build(deps): bump datafusion from 45.0.0 to 46.0.0 [datafusion-python]
dependabot[bot] closed pull request #1051: build(deps): bump datafusion from 45.0.0 to 46.0.0 URL: https://github.com/apache/datafusion-python/pull/1051 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] build(deps): bump arrow from 54.2.0 to 54.2.1 [datafusion-python]
dependabot[bot] commented on PR #1038: URL: https://github.com/apache/datafusion-python/pull/1038#issuecomment-2745403943 Looks like arrow is up-to-date now, so this is no longer needed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] simplify `array_has` UDF to `InList` expr when haystack is constant [datafusion]
alamb commented on code in PR #15354: URL: https://github.com/apache/datafusion/pull/15354#discussion_r2008786953 ## datafusion/functions-nested/src/array_has.rs: ## @@ -121,6 +123,43 @@ impl ScalarUDFImpl for ArrayHas { Ok(DataType::Boolean) } +fn simplify( +&self, +mut args: Vec<Expr>, +_info: &dyn datafusion_expr::simplify::SimplifyInfo, +) -> Result<ExprSimplifyResult> { +let [haystack, _needle] = take_function_args(self.name(), &args)?; + +// if the haystack is a constant list, we can use an inlist expression which is more +// efficient because the haystack is not varying per-row +if let Expr::Literal(ScalarValue::List(array)) = haystack { +assert_eq!(array.len(), 1); // guarantee of ScalarValue +if let Ok(scalar_values) = +ScalarValue::convert_array_to_scalar_vec(array.as_ref()) +{ +assert_eq!(scalar_values.len(), 1); +let list = scalar_values +.into_iter() +.flatten() +.map(Expr::Literal) +.collect(); + +// ok to pop here, we will not use args again +let needle = args.pop().unwrap(); +return Ok(ExprSimplifyResult::Simplified(Expr::InList(InList { +expr: Box::new(needle), +list, +negated: false, +}))); +} +} + +// TODO: support LargeList / FixedSizeList? +// (not supported by `convert_array_to_scalar_vec`) Review Comment: I suggest doing the work in a follow-on ticket / PR. So that would mean: 1. Filing a ticket with a reproducer (perhaps modified from https://github.com/apache/datafusion/issues/14533) 2. leaving a TODO with a link to that ticket here -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
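For reference, the rewrite this simplification produces is the same one you could build by hand with the `Expr` API. A tiny sketch (the column name and literal values are made up; `col`, `lit`, and `Expr::in_list` are public `datafusion_expr` helpers):

```rust
use datafusion_expr::{col, lit, Expr};

// array_has(make_array(1, 2, 3), x) with a constant haystack is equivalent
// to the InList expression `x IN (1, 2, 3)`, which avoids re-evaluating the
// haystack for every row.
fn inlist_equivalent() -> Expr {
    col("x").in_list(vec![lit(1), lit(2), lit(3)], /* negated = */ false)
}
```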
[PR] Support Avg distinct [datafusion]
qazxcdswe123 opened a new pull request, #15356: URL: https://github.com/apache/datafusion/pull/15356 ## Which issue does this PR close? - Closes #2408 ## Rationale for this change Related: https://github.com/apache/datafusion/pull/15099 ## What changes are included in this PR? 1. implemented a separate accumulator type for `Float64`, `Decimal128`, and `Decimal256` 2. tests ## Are these changes tested? Existing tests pass and new ones were added ## Are there any user-facing changes? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
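For intuition, the core idea behind a distinct-average accumulator can be sketched in a few lines. This is an illustration only, not the PR's implementation: it covers the `f64` case by deduplicating on the bit pattern (since `f64` is not `Hash`/`Eq`), whereas the PR integrates with DataFusion's `Accumulator` trait and also covers the decimal types.

```rust
use std::collections::HashSet;

// Sketch of AVG(DISTINCT) for f64: store each distinct value once (keyed by
// its bit pattern), then average the survivors at evaluation time.
#[derive(Default)]
struct DistinctAvgF64 {
    seen: HashSet<u64>,
}

impl DistinctAvgF64 {
    fn update(&mut self, value: f64) {
        self.seen.insert(value.to_bits());
    }

    fn evaluate(&self) -> Option<f64> {
        if self.seen.is_empty() {
            return None; // AVG of an empty input is NULL
        }
        let sum: f64 = self.seen.iter().copied().map(f64::from_bits).sum();
        Some(sum / self.seen.len() as f64)
    }
}

fn main() {
    let mut acc = DistinctAvgF64::default();
    for v in [1.0, 2.0, 2.0, 3.0] {
        acc.update(v);
    }
    assert_eq!(acc.evaluate(), Some(2.0)); // distinct values: 1, 2, 3
}
```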
Re: [PR] 1064/enhancement/add functions to Expr class [datafusion-python]
deanm commented on PR #1074: URL: https://github.com/apache/datafusion-python/pull/1074#issuecomment-2745350211 @timsaucer did you have any thoughts on the namespaces? Going from this to that would wind up being a breaking change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Blog post on Parquet filter pushdown [datafusion-site]
XiangpengHao commented on PR #61: URL: https://github.com/apache/datafusion-site/pull/61#issuecomment-2745351963 Thank you @alamb @comphead @Omega359 @kevinjqliu I believe all comments have been addressed! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Blog post on Parquet filter pushdown [datafusion-site]
kevinjqliu commented on code in PR #61: URL: https://github.com/apache/datafusion-site/pull/61#discussion_r2008823867 ## content/blog/2025-03-21-parquet-pushdown.md: ## @@ -0,0 +1,312 @@ +--- +layout: post +title: Efficient Filter Pushdown in Parquet +date: 2025-03-21 +author: Xiangpeng Hao +categories: [performance] +--- + + +figure { + margin: 20px 0; +} + +figure img { + display: block; + max-width: 80%; +} + +figcaption { + font-style: italic; + margin-top: 10px; + color: #555; + font-size: 0.9em; + max-width: 80%; +} + + + + +_Editor's Note: This blog was first published on [Xiangpeng Hao's blog]. Thanks to [InfluxData] for sponsoring this work as part of his PhD funding._ + +[Xiangpeng Hao's blog]: https://blog.xiangpeng.systems/posts/parquet-pushdown/ +[InfluxData]: https://www.influxdata.com/ + + + +In the [previous post], we discussed how [Apache DataFusion] prunes [Apache Parquet] files to skip irrelevant **files/row_groups** (sometimes also [pages](https://parquet.apache.org/docs/file-format/pageindex/)). + +This post discusses how Parquet readers skip irrelevant **rows** while scanning data, +leveraging Parquet's columnar layout by first reading only filter columns, +and then selectively reading other columns only for matching rows. + + +[previous post]: https://datafusion.apache.org/blog/2025/03/20/parquet-pruning +[Apache DataFusion]: https://datafusion.apache.org/ +[Apache Parquet]: https://parquet.apache.org/ + +## Why filter pushdown in Parquet? + +Below is an example query that reads sensor data with filters on `date_time` and `location`. +Without filter pushdown, all rows from location, val, and date_time columns are decoded before `location='office'` is evaluated. Filter pushdown is especially useful when the filter is selective, i.e., removes many rows. + + +```sql +SELECT val, location +FROM sensor_data +WHERE date_time > '2025-03-11' AND location = 'office'; +``` + + + + +Parquet pruning skips irrelevant files/row_groups, while filter pushdown skips irrelevant rows. Without filter pushdown, all rows from location, val, and date_time columns are decoded before `location='office'` is evaluated. Filter pushdown is especially useful when the filter is selective, i.e., removes many rows. + + + + +In our setup, sensor data is aggregated by date — each day has its own Parquet file. +At planning time, DataFusion prunes the unneeded Parquet files, i.e., `2025-03-10.parquet` and `2025-03-11.parquet`. + +Once the files to read are located, the [*DataFusion's current default implementation*](https://github.com/apache/datafusion/issues/3463) reads all the projected columns (`sensor_id`, `val`, and `location`) into Arrow RecordBatches, then applies the filters over `location` to get the final set of rows. + +A better approach is called **filter pushdown**, which evaluates filter conditions first and only decodes data that passes these conditions. +In practice, this works by first processing only the filter columns (`date_time` and `location`), building a boolean mask of rows that satisfy our conditions, then using this mask to selectively decode only the relevant rows from other columns (`sensor_id`, `val`). +This eliminates the waste of decoding rows that will be immediately filtered out. + +While simple in theory, practical implementations often make performance worse. + +## How can filter pushdown be slower? 
+ +At a high level, the Parquet reader first builds a filter mask -- essentially a boolean array indicating which rows meet the filter criteria -- and then uses this mask to selectively decode only the needed rows from the remaining columns in the projection. + +Let's dig into details of [how filter pushdown is implemented](https://github.com/apache/arrow-rs/blob/d5339f31a60a4bd8a4256e7120fe32603249d88e/parquet/src/arrow/async_reader/mod.rs#L618-L712) in the current Rust Parquet reader implementation, illustrated in the following figure. + + + + +Implementation of filter pushdown in Rust Parquet readers -- the first phase builds the filter mask, the second phase applies the filter mask to the other columns + + + +The filter pushdown has two phases: + +1. Build the filter mask (steps 1-3) + +2. Use the filter mask to selectively decode other columns (steps 4-7), e.g., output step 3 is used as input for step 5 and 7. + +Within each phase, it takes three steps from Parquet to Arrow: + +1. Decompress the Parquet pages using generic decompression algorithms like LZ4, Zstd, etc. (steps 1, 4, 6) + +2. Decode the page content into Arrow format (steps 2, 5, 7) + +3. Evaluate the filter over Arrow data (step 3) + +In the figure above, we can see that `location` is **decompressed and decoded twice**, first when building the filter mask (steps 1, 2), and second when building the output (steps 4, 5). +This happens for all columns that appear both in the filter and output. + +The table below shows the corresponding CPU
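The two-phase flow described in the quoted post maps directly onto Arrow's public kernels. A toy, self-contained illustration using `filter_record_batch` (this is not the Parquet reader's internal code, which works on encoded pages rather than already-decoded batches):

```rust
use std::sync::Arc;

use datafusion::arrow::array::{BooleanArray, Int32Array, StringArray};
use datafusion::arrow::compute::filter_record_batch;
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::error::ArrowError;
use datafusion::arrow::record_batch::RecordBatch;

fn main() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("location", DataType::Utf8, false),
        Field::new("val", DataType::Int32, false),
    ]));
    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(StringArray::from(vec!["office", "lab", "office"])),
            Arc::new(Int32Array::from(vec![1, 2, 3])),
        ],
    )?;

    // Phase 1: evaluate the predicate on the filter column only,
    // producing a boolean mask
    let locations = batch
        .column(0)
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("location is a StringArray");
    let mask: BooleanArray = locations
        .iter()
        .map(|v| Some(v == Some("office")))
        .collect();

    // Phase 2: apply the mask so only matching rows of the other
    // columns are materialized
    let filtered = filter_record_batch(&batch, &mask)?;
    assert_eq!(filtered.num_rows(), 2);
    Ok(())
}
```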
[PR] build(deps): bump mimalloc from 0.1.43 to 0.1.44 [datafusion-python]
dependabot[bot] opened a new pull request, #1080: URL: https://github.com/apache/datafusion-python/pull/1080 Bumps [mimalloc](https://github.com/purpleprotocol/mimalloc_rust) from 0.1.43 to 0.1.44. Release notes Sourced from [mimalloc's releases](https://github.com/purpleprotocol/mimalloc_rust/releases): Version 0.1.44 Changes: Mimalloc v2.2.2 Commits - [bbf6130](https://github.com/purpleprotocol/mimalloc_rust/commit/bbf61305ad2d9d1c4b417bd277e36caec31d21b7) v0.1.44 - [3e7e321](https://github.com/purpleprotocol/mimalloc_rust/commit/3e7e3214ea3397ffc922baaa42933e5091482073) Patch windows - [f37c56a](https://github.com/purpleprotocol/mimalloc_rust/commit/f37c56ad7c3ed13904afd272e59f3ca80f7b02b8) Clippy - [c00538a](https://github.com/purpleprotocol/mimalloc_rust/commit/c00538aa0b553e7eecab40b114fedb56f414ddc7) Update to mimalloc v2.2.2 - [cc1c72a](https://github.com/purpleprotocol/mimalloc_rust/commit/cc1c72a62bf3edefe5abf842b60092f088740860) Merge pull request [#126](https://redirect.github.com/purpleprotocol/mimalloc_rust/issues/126) from richerfu/master - [e638bd6](https://github.com/purpleprotocol/mimalloc_rust/commit/e638bd6055ad4df3aa4edbfddfc1d0a09d073692) feat: use env to adapt to more general scenarios for atomic See full diff in [compare view](https://github.com/purpleprotocol/mimalloc_rust/compare/v0.1.43...v0.1.44) [Dependabot compatibility score](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: add test to check for `ctx.read_json()` [datafusion-ballista]
westhide commented on PR #1212: URL: https://github.com/apache/datafusion-ballista/pull/1212#issuecomment-2745126578 > ah no, sorry for misunderstanding please do not cancel this PR. what I meant, in case of this type of error, ballista job should be cancelled. Hello @milenkovicm, the change to cancel the job after task-definition preparation fails is ready for review. Thanks~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Migrate physical plan tests to `insta` (Part-1) [datafusion]
Shreyaskr1409 commented on PR #15313: URL: https://github.com/apache/datafusion/pull/15313#issuecomment-2745028266 @alamb @blaginin I have resolved all the conversations. Please take a look. Also, my bad for a few unnecessary commits here and there; I am still getting acquainted with the workflow. Thank you :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Ballista client keep blocking when prepare_task_definition or prepare_multi_task_definition fail [datafusion-ballista]
westhide opened a new issue, #1214: URL: https://github.com/apache/datafusion-ballista/issues/1214 **Describe the bug** The Ballista client keeps blocking when `prepare_task_definition` or `prepare_multi_task_definition` fails. **To Reproduce** (no steps provided) **Expected behavior** The job should be cancelled and the error surfaced to the client instead of blocking indefinitely. **Additional context** (none) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [DISCUSS] Switch to `tree` explain by default [datafusion]
alamb commented on issue #15343: URL: https://github.com/apache/datafusion/issues/15343#issuecomment-2745139051 I wonder if we could just change the default format for `datafusion-cli` (it is a config setting) 🤔 Downstream projects could also then "opt-in" if they wanted nicer default explain plans -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Perf: Support Utf8View datatype single column comparisons for SortPre… [datafusion]
zhuqi-lucas commented on PR #15348: URL: https://github.com/apache/datafusion/pull/15348#issuecomment-2745186535 Updated the results for short-string sort, which benefits a lot from the StringView type; added Q11 to the sort benchmark:

```diff
-const SORT_QUERIES: [&'static str; 10] = [
+const SORT_QUERIES: [&'static str; 11] = [
     // Q1: 1 sort key (type: INTEGER, cardinality: 7) + 1 payload column
     r#"
     SELECT l_linenumber, l_partkey
@@ -159,6 +159,12 @@ impl RunOpt {
     FROM lineitem
     ORDER BY l_orderkey, l_suppkey, l_linenumber, l_comment
     "#,
+    // Q11: 1 sort key (type: VARCHAR, cardinality: 4.5M) + 1 payload column
+    r#"
+    SELECT l_shipmode, l_comment, l_partkey
+    FROM lineitem
+    ORDER BY l_shipmode;
+    "#,
 ];
```

This PR:

```
Q11 iteration 0 took 5645.3 ms and returned 59986052 rows
Q11 iteration 1 took 5641.1 ms and returned 59986052 rows
Q11 iteration 2 took 5520.6 ms and returned 59986052 rows
Q11 avg time: 5602.33 ms
```

The main:

```
Q11 iteration 0 took 6687.5 ms and returned 59986052 rows
Q11 iteration 1 took 6504.5 ms and returned 59986052 rows
Q11 iteration 2 took 6544.6 ms and returned 59986052 rows
Q11 avg time: 6578.87 ms
```

About 20% performance improvement. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Improved error for expand wildcard rule [datafusion]
ozankabak commented on PR #15287: URL: https://github.com/apache/datafusion/pull/15287#issuecomment-2745187508 This PR seems to have broken `main`. @alamb -- Extended tests break very frequently these days. We should prioritize completing the work on running them before merge. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Comet 0.7.0 [datafusion-site]
alamb commented on code in PR #63: URL: https://github.com/apache/datafusion-site/pull/63#discussion_r2008719590 ## content/blog/2025-03-20-datafusion-comet-0.7.0.md: ## @@ -0,0 +1,134 @@ +--- +layout: post +title: Apache DataFusion Comet 0.7.0 Release +date: 2025-03-20 +author: pmc +categories: [subprojects] +--- + + + +The Apache DataFusion PMC is pleased to announce version 0.7.0 of the [Comet](https://datafusion.apache.org/comet/) subproject. + +Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for +improved performance and efficiency without requiring any code changes. + +Comet runs on commodity hardware and aims to provide 100% compatibility with Apache Spark. Any operators or +expressions that are not fully compatible will fall back to Spark unless explicitly enabled by the user. Refer +to the [compatibility guide] for more information. + +[compatibility guide]: https://datafusion.apache.org/comet/user-guide/compatibility.html + +This release covers approximately four weeks of development work and is the result of merging 46 PRs from 11 +contributors. See the [change log] for more information. + +[change log]: https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.7.0.md + +## Release Highlights + +### Performance + +Comet 0.7.0 has improved performance compared to the previous release due to improvements in the native shuffle +implementation and performance improvements in DataFusion 46. + +For single-node TPC-H at 100 GB, Comet now delivers a **greater than 2x speedup** compared to Spark using the same +CPU and RAM. Even with **half the resources**, Comet still provides a measurable performance improvement.
+ + + +*These benchmarks were performed on a Linux workstation with PCIe 5, AMD 7950X CPU (16 cores), 128 GB RAM, and data +stored locally in Parquet format on NVMe storage. Spark was running in Kubernetes with hard memory limits.* + +## Shuffle Improvements + +There are several improvements to shuffle in this release: + +- When running in off-heap mode (which is the recommended approach), Comet was using the wrong memory allocator + implementation for some types of shuffle operation, which could result in OOM rather than spilling to disk. +- The number of spill files is drastically reduced. In previous releases, each instance of ShuffleMapTask could + potentially create a new spill file for each output partition each time that spill was invoked. Comet now creates + a maximum of one spill file per output partition per instance of ShuffleMapTask, which is appended to in subsequent + spills. +- There was a flaw in the memory accounting which resulted in Comet requesting approximately twice the amount of + memory that was needed, resulting in premature spilling. This is now resolved. +- The metric for number of spilled bytes is now accurate. It was previously reporting invalid information. + +## Improved Hash Join Performance + +When using the `spark.comet.exec.replaceSortMergeJoin` setting to replace sort-merge joins with hash joins, Comet +will now do a better job of picking the optimal build side. Thanks to [@hayman
Re: [PR] Comet 0.7.0 [datafusion-site]
alamb commented on PR #63: URL: https://github.com/apache/datafusion-site/pull/63#issuecomment-2745149502 Released site: https://datafusion.apache.org/blog/2025/03/20/datafusion-comet-0.7.0/ 👍 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] DeltaLake integration not working (Python) (FFI Table providers not working) [datafusion-python]
timsaucer commented on issue #1077: URL: https://github.com/apache/datafusion-python/issues/1077#issuecomment-2745248393 Which version of datafusion-python are you using? If you have 45.2.0 can you please try downgrading to 44.0.0? If that solves it, I know what the problem is. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Improve collection during repr and repr_html [datafusion-python]
timsaucer commented on code in PR #1036: URL: https://github.com/apache/datafusion-python/pull/1036#discussion_r2008766096 ## src/dataframe.rs: ## @@ -111,56 +116,151 @@ impl PyDataFrame { } fn __repr__(&self, py: Python) -> PyDataFusionResult { -let df = self.df.as_ref().clone().limit(0, Some(10))?; -let batches = wait_for_future(py, df.collect())?; -let batches_as_string = pretty::pretty_format_batches(&batches); -match batches_as_string { -Ok(batch) => Ok(format!("DataFrame()\n{batch}")), -Err(err) => Ok(format!("Error: {:?}", err.to_string())), +let (batches, has_more) = wait_for_future( +py, +collect_record_batches_to_display(self.df.as_ref().clone(), 10, 10), +)?; +if batches.is_empty() { +// This should not be reached, but do it for safety since we index into the vector below +return Ok("No data to display".to_string()); } -} -fn _repr_html_(&self, py: Python) -> PyDataFusionResult { -let mut html_str = "\n".to_string(); +let batches_as_displ = + pretty::pretty_format_batches(&batches).map_err(py_datafusion_err)?; + +let additional_str = match has_more { +true => "\nData truncated.", +false => "", +}; -let df = self.df.as_ref().clone().limit(0, Some(10))?; -let batches = wait_for_future(py, df.collect())?; +Ok(format!("DataFrame()\n{batches_as_displ}{additional_str}")) +} +fn _repr_html_(&self, py: Python) -> PyDataFusionResult { +let (batches, has_more) = wait_for_future( +py, +collect_record_batches_to_display( +self.df.as_ref().clone(), +MIN_TABLE_ROWS_TO_DISPLAY, +usize::MAX, +), +)?; Review Comment: Added to follow on issue https://github.com/apache/datafusion-python/issues/1078 ## src/dataframe.rs: ## @@ -70,6 +72,9 @@ impl PyTableProvider { PyTable::new(table_provider) } } +const MAX_TABLE_BYTES_TO_DISPLAY: usize = 2 * 1024 * 1024; // 2 MB Review Comment: Added to issue https://github.com/apache/datafusion-python/issues/1078 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Improve collection during repr and repr_html [datafusion-python]
timsaucer commented on code in PR #1036: URL: https://github.com/apache/datafusion-python/pull/1036#discussion_r2008764218 ## src/dataframe.rs: ## @@ -771,3 +871,82 @@ fn record_batch_into_schema( RecordBatch::try_new(schema, data_arrays) } + +/// This is a helper function to return the first non-empty record batch from executing a DataFrame. +/// It additionally returns a bool, which indicates if there are more record batches available. +/// We do this so we can determine if we should indicate to the user that the data has been +/// truncated. This collects until we have achieved both of these two conditions +/// +/// - We have collected our minimum number of rows +/// - We have reached our limit, either data size or maximum number of rows +/// +/// Otherwise it will return when the stream is exhausted. If you want a specific number of +/// rows, set min_rows == max_rows. +async fn collect_record_batches_to_display( +df: DataFrame, +min_rows: usize, +max_rows: usize, +) -> Result<(Vec<RecordBatch>, bool), DataFusionError> { +let mut stream = df.execute_stream().await?; Review Comment: I'll switch to your approach -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
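For readers following along, a simplified sketch of this collection strategy is below. It assumes the DataFusion `DataFrame` stream API; error handling is minimal, the byte-size limit from the real function is omitted, and the final "peek" discards one batch, which is acceptable here because it is only used to set the truncation flag.

```rust
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::dataframe::DataFrame;
use datafusion::error::DataFusionError;
use futures::StreamExt;

// Collect batches until at least `min_rows` are buffered (or `max_rows` is
// exceeded); the returned bool reports whether more data remained, so the
// caller can render a "Data truncated" note.
async fn collect_for_display(
    df: DataFrame,
    min_rows: usize,
    max_rows: usize,
) -> Result<(Vec<RecordBatch>, bool), DataFusionError> {
    let mut stream = df.execute_stream().await?;
    let mut batches = Vec::new();
    let mut rows = 0;
    while rows < min_rows {
        match stream.next().await {
            Some(batch) => {
                let batch = batch?;
                rows += batch.num_rows();
                batches.push(batch);
            }
            // Stream exhausted before the minimum: nothing was truncated
            None => return Ok((batches, false)),
        }
        if rows >= max_rows {
            // Hit the hard cap; treat anything beyond it as truncated
            return Ok((batches, true));
        }
    }
    // Peek once more to learn whether the output was truncated
    let has_more = stream.next().await.is_some();
    Ok((batches, has_more))
}
```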
[I] Improve html table rendering formatting [datafusion-python]
timsaucer opened a new issue, #1078: URL: https://github.com/apache/datafusion-python/issues/1078 **Is your feature request related to a problem or challenge? Please describe what you are trying to do.** This is a follow-on to https://github.com/apache/datafusion-python/pull/1036 There is a suggestion that we should enable a feature to customize the html table rendering. **Describe the solution you'd like** Allow the user to set the amount of data returned. Right now it's hard coded to 2 MB. The actual data size is approximated. We would like this to be customizable. Allow the user to disable the styling that is applied to the html output. Optional: Add a way for a user to specify a formatting function to replace the default `ArrayFormatter`. This would enable users to change the data view when they have custom metadata Nice to have: Split up some of the html generation into a set of helper functions. **Describe alternatives you've considered** None **Additional context** If this code is getting too large, it may make sense to split it off into a separate file. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] fix: Redundant files spilled during external sort + introduce `SpillManager` [datafusion]
2010YOUY01 opened a new pull request, #15355: URL: https://github.com/apache/datafusion/pull/15355 ## Which issue does this PR close? Related to https://github.com/apache/datafusion/pull/14975 ## Rationale for this change ### What's the inefficiency Let's walk through an example: there is one external sort query with 1 partition, and the sort exec has: - 10MB memory limit - 1MB `sort_spill_reservation_bytes` (see configuration explanation in https://datafusion.apache.org/user-guide/configs.html) During the execution: 1. `SortExec` will read in batches, and the 10MB memory limit is reached 2. It will sort all buffered batches in place, and merge them at once. Note that only the 1MB buffer from the 10MB total limit is pre-reserved for merging, so there is only 1MB available to store the merged output. 3. After we have collected 1MB of merged batches, one spill will be triggered; this 1MB space will be cleared, and the merging can continue. **Inefficiency:** Now `ExternalSorter` will create a new spill file for each 1MB of merged batches; after spilling all intermediates, all spilled files will be merged at once, so there are too many files to merge. **Ideal case:** All batches in a single sorted run can be incrementally appended to a single file. Reproducer: Execute datafusion-cli with `cargo run --profile release-nonlto -- --mem-pool-type fair -m 10M` ```sql set datafusion.execution.sort_spill_reservation_bytes = 100; set datafusion.execution.target_partitions = 4; explain analyze select * from generate_series(1, 100) as t1(v1) order by v1; ``` Main: 10 spills PR: 2 spills ### Rationale for the fix Introduced a new spill interface `SpillManager` with the ability to incrementally append batches to a written file. 1. `SpillManager` is designed to do `RecordBatch <---> raw file`, and configurations can be put inside `SpillManager` to control how the serialization is done physically, for future optimizations. Example configurations: - General-purpose compression like `lz4` - Specialized encoding other than the current `Arrow IPC`, or configurations to change the `IPC Writer` behavior - See `datafusion-comet`'s proprietary serde implementation in https://github.com/apache/datafusion-comet/pull/1190 - One example of an extra configuration can be https://github.com/apache/datafusion/issues/15320 2. `SpillManager` is not responsible for holding spilled files inside, because the logical representation of those files can vary; it's clearer to place those raw files inside the spilling operators. For example, `vec` is managed inside `SortExec`, where the implicit rule is that within each file all entries are sorted by the sort keys; also, in `Comet`'s `ShuffleWriterExec`, each partition should maintain one in-progress file. If we kept those tempfiles inside `SpillManager`, it would be hard to clearly define those implicit requirements. 3. Additionally, `SpillManager` is responsible for updating related statistics; the spill-related metrics should be the same across operators, so this part of the code can also be reused. Also, a total disk usage limit for spilled files can be easily implemented upon it. ### Why refactor and introduce `SpillManager` This fix can be implemented without a major refactor. However, this change is included to prepare for supporting disk limits for spilling queries, as described in https://github.com/apache/datafusion/pull/14975 ## What changes are included in this PR? 1. Group spilling-related metrics into one struct 2. Introduce `SpillManager` 3.
Update `SortExec` to use the new `SpillManager` interface ### TODO: - [ ] There are two extra operators that can be changed to this new interface (`Aggregate` and `SortMergeJoin`); they're planned to be included in this PR, and I plan to do that after getting some review feedback. ## Are these changes tested? For the too-many-spills issue: one test case is updated, and more comments were added above the assertion to prevent regression. For `SpillManager`: unit tests are included. ## Are there any user-facing changes? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
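To make the incremental-append idea concrete, here is a minimal, self-contained sketch using the plain Arrow IPC `FileWriter` directly; the actual `SpillManager` wraps this pattern with metrics and `DiskManager` integration, so the names and error handling here are illustrative only.

```rust
use std::{fs::File, sync::Arc};

use datafusion::arrow::array::Int32Array;
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::error::ArrowError;
use datafusion::arrow::ipc::writer::FileWriter;
use datafusion::arrow::record_batch::RecordBatch;

// Keep one IPC writer open per sorted run and append each merged batch to
// it, so a run produces a single spill file instead of one file per spill.
fn spill_sorted_run(batches: &[RecordBatch], path: &str) -> Result<(), ArrowError> {
    let schema = batches[0].schema();
    let mut writer = FileWriter::try_new(File::create(path)?, schema.as_ref())?;
    for batch in batches {
        // All batches belong to the same sorted run, so appending them in
        // order keeps the file globally ordered on the sort keys
        writer.write(batch)?;
    }
    writer.finish()
}

fn main() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int32, false)]));
    let batch =
        RecordBatch::try_new(schema, vec![Arc::new(Int32Array::from(vec![1, 2, 3]))])?;
    spill_sorted_run(&[batch], "/tmp/run0.spill")
}
```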
Re: [PR] Improve collection during repr and repr_html [datafusion-python]
timsaucer commented on code in PR #1036: URL: https://github.com/apache/datafusion-python/pull/1036#discussion_r2008764621 ## src/dataframe.rs: ## @@ -70,6 +72,9 @@ impl PyTableProvider { PyTable::new(table_provider) } } +const MAX_TABLE_BYTES_TO_DISPLAY: usize = 2 * 1024 * 1024; // 2 MB Review Comment: How about I open an issue to enhance this to be configurable as well as the follow on part about disabling the styling? I'd like to get this in so we fix explain and add some useful functionality now and then we can get these things tightened up in the next iteration. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Improve collection during repr and repr_html [datafusion-python]
timsaucer commented on code in PR #1036: URL: https://github.com/apache/datafusion-python/pull/1036#discussion_r2008766157 ## src/dataframe.rs: ## @@ -111,56 +116,151 @@ impl PyDataFrame { } fn __repr__(&self, py: Python) -> PyDataFusionResult { -let df = self.df.as_ref().clone().limit(0, Some(10))?; -let batches = wait_for_future(py, df.collect())?; -let batches_as_string = pretty::pretty_format_batches(&batches); -match batches_as_string { -Ok(batch) => Ok(format!("DataFrame()\n{batch}")), -Err(err) => Ok(format!("Error: {:?}", err.to_string())), +let (batches, has_more) = wait_for_future( +py, +collect_record_batches_to_display(self.df.as_ref().clone(), 10, 10), +)?; +if batches.is_empty() { +// This should not be reached, but do it for safety since we index into the vector below +return Ok("No data to display".to_string()); } -} -fn _repr_html_(&self, py: Python) -> PyDataFusionResult { -let mut html_str = "\n".to_string(); +let batches_as_displ = + pretty::pretty_format_batches(&batches).map_err(py_datafusion_err)?; + +let additional_str = match has_more { +true => "\nData truncated.", +false => "", +}; -let df = self.df.as_ref().clone().limit(0, Some(10))?; -let batches = wait_for_future(py, df.collect())?; +Ok(format!("DataFrame()\n{batches_as_displ}{additional_str}")) +} +fn _repr_html_(&self, py: Python) -> PyDataFusionResult { +let (batches, has_more) = wait_for_future( +py, +collect_record_batches_to_display( +self.df.as_ref().clone(), +MIN_TABLE_ROWS_TO_DISPLAY, +usize::MAX, +), +)?; if batches.is_empty() { -html_str.push_str("\n"); -return Ok(html_str); +// This should not be reached, but do it for safety since we index into the vector below +return Ok("No data to display".to_string()); } +let table_uuid = uuid::Uuid::new_v4().to_string(); + +let mut html_str = " + +.expandable-container { +display: inline-block; +max-width: 200px; +} +.expandable { +white-space: nowrap; +overflow: hidden; +text-overflow: ellipsis; +display: block; +} +.full-text { +display: none; +white-space: normal; +} +.expand-btn { +cursor: pointer; +color: blue; +text-decoration: underline; +border: none; +background: none; +font-size: inherit; +display: block; +margin-top: 5px; +} + + + Review Comment: Added to issue https://github.com/apache/datafusion-python/issues/1078 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Enforce JOIN plan to require condition [datafusion]
goldmedal commented on code in PR #15334: URL: https://github.com/apache/datafusion/pull/15334#discussion_r2008714660 ## datafusion/sqllogictest/test_files/tpch/plans/q22.slt.part: ## @@ -65,7 +65,7 @@ logical_plan 07)Projection: customer.c_phone, customer.c_acctbal 08)--LeftAnti Join: customer.c_custkey = __correlated_sq_1.o_custkey 09)Filter: substr(customer.c_phone, Int64(1), Int64(2)) IN ([Utf8View("13"), Utf8View("31"), Utf8View("23"), Utf8View("29"), Utf8View("30"), Utf8View("18"), Utf8View("17")]) -10)--TableScan: customer projection=[c_custkey, c_phone, c_acctbal], partial_filters=[substr(customer.c_phone, Int64(1), Int64(2)) IN ([Utf8View("13"), Utf8View("31"), Utf8View("23"), Utf8View("29"), Utf8View("30"), Utf8View("18"), Utf8View("17")])] +10)--TableScan: customer projection=[c_custkey, c_phone, c_acctbal], partial_filters=[substr(customer.c_phone, Int64(1), Int64(2)) IN ([Utf8View("13"), Utf8View("31"), Utf8View("23"), Utf8View("29"), Utf8View("30"), Utf8View("18"), Utf8View("17")]), Boolean(true)] Review Comment: I benchmarked to ensure the plan change won't impact the performance. ``` Benchmark tpch_sf1.json ┏━━┳━┳┳━━┓ ┃ Query┃main ┃ fix_join-check ┃ Change ┃ ┡━━╇━╇╇━━┩ │ QQuery 1 │ 68.97ms │70.14ms │no change │ │ QQuery 2 │ 18.72ms │18.55ms │no change │ │ QQuery 3 │ 30.47ms │29.46ms │no change │ │ QQuery 4 │ 21.45ms │21.77ms │no change │ │ QQuery 5 │ 43.53ms │41.56ms │no change │ │ QQuery 6 │ 14.39ms │14.88ms │no change │ │ QQuery 7 │ 55.03ms │55.30ms │no change │ │ QQuery 8 │ 41.24ms │39.45ms │no change │ │ QQuery 9 │ 50.84ms │53.19ms │no change │ │ QQuery 10│ 44.85ms │44.83ms │no change │ │ QQuery 11│ 13.42ms │13.22ms │no change │ │ QQuery 12│ 28.49ms │28.20ms │no change │ │ QQuery 13│ 29.56ms │29.38ms │no change │ │ QQuery 14│ 23.03ms │24.40ms │ 1.06x slower │ │ QQuery 15│ 33.59ms │34.98ms │no change │ │ QQuery 16│ 13.07ms │13.08ms │no change │ │ QQuery 17│ 58.00ms │58.67ms │no change │ │ QQuery 18│ 76.00ms │75.06ms │no change │ │ QQuery 19│ 39.99ms │38.40ms │no change │ │ QQuery 20│ 29.04ms │29.79ms │no change │ │ QQuery 21│ 62.14ms │64.14ms │no change │ │ QQuery 22│ 13.82ms │14.01ms │no change │ └──┴─┴┴──┘ ┏━━━┳━━┓ ┃ Benchmark Summary ┃ ┃ ┡━━━╇━━┩ │ Total Time (main) │ 809.65ms │ │ Total Time (fix_join-check) │ 812.45ms │ │ Average Time (main) │ 36.80ms │ │ Average Time (fix_join-check) │ 36.93ms │ │ Queries Faster│0 │ │ Queries Slower│1 │ │ Queries with No Change│ 21 │ └───┴──┘ ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [DISCUSS] Switch to `tree` explain by default [datafusion]
ozankabak commented on issue #15343: URL: https://github.com/apache/datafusion/issues/15343#issuecomment-2745139974 > I wonder if we could just change the default format for `datafusion-cli` (it is a config setting) 🤔 Downstream projects could also then "opt-in" if they wanted nicer default explain plans Makes sense to change the default format for `datafusion-cli` 👍 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
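If the explain format is indeed exposed as a session config, the opt-in could be as small as one `SET` statement. A sketch, assuming the key is named `datafusion.explain.format` (the exact key name and accepted values should be checked against the config docs):

```rust
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

// Opt a session into tree-formatted EXPLAIN output, assuming the
// hypothesized `datafusion.explain.format` config key exists.
async fn opt_in_tree_explain(ctx: &SessionContext) -> Result<()> {
    ctx.sql("SET datafusion.explain.format = 'tree'")
        .await?
        .collect()
        .await?;
    Ok(())
}
```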
Re: [I] `avg(distinct)` support [datafusion]
qazxcdswe123 commented on issue #2408: URL: https://github.com/apache/datafusion/issues/2408#issuecomment-2745155964 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Added tests with are writing into parquet files in memory for issue #… [datafusion]
alamb commented on code in PR #15325: URL: https://github.com/apache/datafusion/pull/15325#discussion_r2008783615 ## datafusion/wasmtest/src/lib.rs: ## @@ -182,4 +182,29 @@ mod test { let task_ctx = ctx.task_ctx(); let _ = collect(physical_plan, task_ctx).await.unwrap(); } + +#[wasm_bindgen_test(unsupported = tokio::test)] +async fn test_parquet_write() { +let schema = Arc::new(Schema::new(vec![ +Field::new("id", DataType::Int32, false), +Field::new("value", DataType::Utf8, false), +])); + +let data: Vec = vec![ +Arc::new(Int32Array::from(vec![1])), +Arc::new(StringArray::from(vec!["a"])), +]; + +let batch = RecordBatch::try_new(schema.clone(), data).unwrap(); +let mut buffer = Vec::new(); +let mut writer = datafusion::parquet::arrow::ArrowWriter::try_new( +&mut buffer, +schema.clone(), +None, +) +.unwrap(); + +writer.write(&batch).unwrap(); Review Comment: I filed a ticket to track this work: - https://github.com/apache/datafusion/issues/15357 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Migrate physical plan tests to `insta` (Part-1) [datafusion]
alamb commented on PR #15313: URL: https://github.com/apache/datafusion/pull/15313#issuecomment-2745294370 > @alamb @blaginin I have resolved all the conversations. Please take a look. Also, my bad for a few unnecessary commits here and there; I am still getting acquainted with the workflow. Thank you :) No worries -- thanks @Shreyaskr1409. Please don't worry about extra commits -- they are all squashed when merged to main. Multiple commits make the history of a PR much easier to see -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Added tests with are writing into parquet files in memory for issue #… [datafusion]
alamb commented on PR #15325: URL: https://github.com/apache/datafusion/pull/15325#issuecomment-2745294019 Thanks again @pranavJibhakate -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Improve collection during repr and repr_html [datafusion-python]
timsaucer merged PR #1036: URL: https://github.com/apache/datafusion-python/pull/1036 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Added tests with are writing into parquet files in memory for issue #… [datafusion]
alamb merged PR #15325: URL: https://github.com/apache/datafusion/pull/15325 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Migrate physical plan tests to `insta` (Part-1) [datafusion]
alamb commented on PR #15313: URL: https://github.com/apache/datafusion/pull/15313#issuecomment-2745294680 I took a look at the most recent commits and they look good to me. Thanks again @Shreyaskr1409 ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Migrate physical plan tests to `insta` (Part-1) [datafusion]
alamb merged PR #15313: URL: https://github.com/apache/datafusion/pull/15313 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Perf: Support Utf8View datatype single column comparisons for SortPre… [datafusion]
alamb commented on code in PR #15348: URL: https://github.com/apache/datafusion/pull/15348#discussion_r2008785345 ## datafusion/physical-plan/src/sorts/cursor.rs: ## @@ -281,6 +281,33 @@ impl CursorArray for GenericByteArray { } } +impl CursorArray for StringViewArray { +type Values = StringViewArray; +fn values(&self) -> Self { +self.clone() +} +} + +impl CursorValues for StringViewArray { +fn len(&self) -> usize { +self.views().len() +} + +fn eq(l: &Self, l_idx: usize, r: &Self, r_idx: usize) -> bool { +unsafe { GenericByteViewArray::compare_unchecked(l, l_idx, r, r_idx).is_eq() } Review Comment: I agree it would be good to justify the use of unchecked (which I think is ok here). The docs say https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html#method.compare_unchecked So maybe the safety argument is mostly "The left/right_idx must be within range of each array". It also seems like we need to be comparing the null masks too 🤔 -- like checking if the values are null before comparing. Given that this comparison is typically *the* hottest part of a merge operation, maybe we should try using unchecked comparisons elsewhere -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
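For illustration, the null handling being discussed could look roughly like the sketch below; this is not the PR's code, and it assumes nulls sort first. The `unsafe` block is justified by the explicit bounds check, matching the documented contract that both indices must be in range.

```rust
use std::cmp::Ordering;

use datafusion::arrow::array::{Array, StringViewArray};

// Null-aware wrapper over the unchecked byte-view comparison: nulls sort
// first, and both indices are bounds-checked before the unsafe call.
fn cmp_nulls_first(
    l: &StringViewArray,
    l_idx: usize,
    r: &StringViewArray,
    r_idx: usize,
) -> Ordering {
    assert!(l_idx < l.len() && r_idx < r.len(), "indices must be in bounds");
    match (l.is_null(l_idx), r.is_null(r_idx)) {
        (true, true) => Ordering::Equal,
        (true, false) => Ordering::Less,
        (false, true) => Ordering::Greater,
        // SAFETY: both indices were bounds-checked above
        (false, false) => unsafe {
            StringViewArray::compare_unchecked(l, l_idx, r, r_idx)
        },
    }
}

fn main() {
    let arr: StringViewArray = vec![Some("apple"), None].into_iter().collect();
    assert_eq!(cmp_nulls_first(&arr, 1, &arr, 0), Ordering::Less); // null < "apple"
}
```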
[I] limit max disk usage for spilling queries [datafusion]
alamb opened a new issue, #15358: URL: https://github.com/apache/datafusion/issues/15358 ### Is your feature request related to a problem or challenge? Breaking rationale from https://github.com/apache/datafusion/pull/14975#issue-2890626662 into its own ticket: For memory-limited queries, executors might write temporary results to disk to reduce memory load. It's important to have a configuration option to limit the max disk usage, in case some query would bloat the disk and cause other issues. DuckDB provides a similar configuration: ``` max_temp_directory_size | The maximum amount of data stored inside the 'temp_directory' (when set) (e.g., 1GB) ``` ### Describe the solution you'd like Provide a way to limit the total disk usage of queries that spill to disk ### Describe alternatives you've considered Add a configuration option max_temp_directory_size to the disk manager (defaulting to 100GB); if the limit is reached across all spilled files, an error will be returned. ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
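One way the check could be layered on, assuming spill writers report their bytes to a shared counter; the type and method names below are illustrative, not the eventual DataFusion API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Shared accounting for all spill files: writers reserve bytes as they
// flush, and an error surfaces once the configured limit would be exceeded.
struct TempDirBudget {
    max_temp_directory_size: usize,
    used: AtomicUsize,
}

impl TempDirBudget {
    fn try_reserve(&self, bytes: usize) -> Result<(), String> {
        let prev = self.used.fetch_add(bytes, Ordering::SeqCst);
        if prev + bytes > self.max_temp_directory_size {
            // Roll back the reservation and report the failure
            self.used.fetch_sub(bytes, Ordering::SeqCst);
            return Err(format!(
                "temp directory limit of {} bytes exceeded",
                self.max_temp_directory_size
            ));
        }
        Ok(())
    }
}

fn main() {
    let budget = TempDirBudget {
        max_temp_directory_size: 100 * 1024 * 1024 * 1024, // e.g. a 100GB default
        used: AtomicUsize::new(0),
    };
    assert!(budget.try_reserve(1024).is_ok());
}
```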
Re: [I] Add test coverage for wasm32 + parquet build [datafusion]
alamb closed issue #15158: Add test coverage for wasm32 + parquet build URL: https://github.com/apache/datafusion/issues/15158 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Redundant files spilled during external sort + introduce `SpillManager` [datafusion]
alamb commented on PR #15355: URL: https://github.com/apache/datafusion/pull/15355#issuecomment-2745303386 - I filed https://github.com/apache/datafusion/issues/15358 to track the feature request, and linked this ticket -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Make ClickBench Q23 Go Faster [datafusion]
alamb commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740983293 BTW combined with @adriangb's PR here - https://github.com/apache/datafusion/pull/15301 It will likely go crazy fast 🚀 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Redundant files spilled during external sort + introduce `SpillManager` [datafusion]
alamb commented on code in PR #15355: URL: https://github.com/apache/datafusion/pull/15355#discussion_r2008788363

## datafusion/physical-plan/src/sorts/sort.rs: ##

```diff
@@ -230,9 +219,14 @@ struct ExternalSorter {
     /// if `Self::in_mem_batches` are sorted
     in_mem_batches_sorted: bool,
 
-    /// If data has previously been spilled, the locations of the
-    /// spill files (in Arrow IPC format)
-    spills: Vec<RefCountedTempFile>,
+    /// During external sorting, in-memory intermediate data will be appended to
+    /// this file incrementally. Once finished, this file will be moved to [`Self::finished_spill_files`].
+    in_progress_spill_file: Option<InProgressSpillFile>,
+    /// If data has previously been spilled, the locations of the spill files (in
+    /// Arrow IPC format)
+    /// Within the same spill file, the data might be chunked into multiple batches,
+    /// and ordered by sort keys.
+    finished_spill_files: Vec<RefCountedTempFile>,
```

Review Comment: It might make more sense to have the `SpillManager` own these files so there can't be different sets of references

## datafusion/physical-plan/src/sorts/sort.rs: ##

```diff
@@ -65,23 +63,14 @@ struct ExternalSorterMetrics {
     /// metrics
     baseline: BaselineMetrics,
 
-    /// count of spills during the execution of the operator
```

Review Comment: Nice

## datafusion/physical-plan/src/spill.rs: ##

```diff
@@ -223,25 +229,182 @@ impl IPCStreamWriter {
     }
 }
 
+/// The `SpillManager` is responsible for the following tasks:
```

Review Comment: Love the spill manager 👍

## datafusion/physical-plan/src/sorts/sort.rs: ##

```diff
@@ -379,46 +382,64 @@ impl ExternalSorter {
     /// How many bytes have been spilled to disk?
     fn spilled_bytes(&self) -> usize {
-        self.metrics.spilled_bytes.value()
+        self.metrics.spill_metrics.spilled_bytes.value()
     }
 
     /// How many rows have been spilled to disk?
     fn spilled_rows(&self) -> usize {
-        self.metrics.spilled_rows.value()
+        self.metrics.spill_metrics.spilled_rows.value()
     }
 
     /// How many spill files have been created?
     fn spill_count(&self) -> usize {
-        self.metrics.spill_count.value()
+        self.metrics.spill_metrics.spill_file_count.value()
     }
 
-    /// Writes any `in_memory_batches` to a spill file and clears
-    /// the batches. The contents of the spill file are sorted.
-    ///
-    /// Returns the amount of memory freed.
-    async fn spill(&mut self) -> Result<usize> {
+    /// When calling, all `in_mem_batches` must be sorted, and then all of them will
+    /// be appended to the in-progress spill file.
+    async fn spill_append(&mut self) -> Result<()> {
         // we could always get a chance to free some memory as long as we are holding some
         if self.in_mem_batches.is_empty() {
-            return Ok(0);
+            return Ok(());
+        }
+
+        // Lazily initialize the in-progress spill file
+        if self.in_progress_spill_file.is_none() {
+            self.in_progress_spill_file =
+                Some(self.spill_manager.create_in_progress_file("Sorting")?);
         }
 
         self.organize_stringview_arrays()?;
 
         debug!("Spilling sort data of ExternalSorter to disk whilst inserting");
 
-        let spill_file = self.runtime.disk_manager.create_tmp_file("Sorting")?;
         let batches = std::mem::take(&mut self.in_mem_batches);
-        let (spilled_rows, spilled_bytes) = spill_record_batches(
-            &batches,
-            spill_file.path().into(),
-            Arc::clone(&self.schema),
-        )?;
-        let used = self.reservation.free();
-        self.metrics.spill_count.add(1);
-        self.metrics.spilled_bytes.add(spilled_bytes);
-        self.metrics.spilled_rows.add(spilled_rows);
-        self.spills.push(spill_file);
-        Ok(used)
+        self.reservation.free();
+
+        let in_progress_file = self.in_progress_spill_file.as_mut().ok_or_else(|| {
+            internal_datafusion_err!("In-progress spill file should be initialized")
+        })?;
+
+        for batch in batches {
```

Review Comment: I don't understand this logic -- I thought that each individual `self.in_mem_batches` was sorted but they aren't sorted overall. Thus if we write them back to back to the same spill file, the spill file itself won't be sorted.

Like if the two in memory batches are

| A | B |
|---|---|
| 1 | 10 |
| 2 | 10 |
| 2 | 10 |

| A | B |
|---|---|
| 1 | 10 |
| 2 | 10 |
| 2 | 10 |

I think this code would produce a single spill file like

| A | B |
|---|---|
| 1 | 10 |
| 2 | 10 |
| 2 | 10 |
| 1 | 10 |
| 2 | 10 |
| 2 | 10 |

Which is not sorted 🤔

On the other hand all the tests are passing so maybe I misunderstand what this is doing (or we have a testing gap)

## datafusion/physical-plan/src/sorts/sort.rs: ##
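To make the sortedness concern in the review above concrete: appending individually sorted runs back to back does not yield a globally sorted sequence, while a 2-way merge of the same runs does. A minimal, self-contained Rust sketch, independent of DataFusion's types:

```rust
// Two individually sorted runs, mirroring the review example above.
fn main() {
    let run1 = vec![1, 2, 2];
    let run2 = vec![1, 2, 2];

    // Appending the runs back to back gives 1, 2, 2, 1, 2, 2 -- not globally sorted.
    let concatenated: Vec<i32> = run1.iter().chain(run2.iter()).copied().collect();
    assert!(!concatenated.windows(2).all(|w| w[0] <= w[1]));

    // A 2-way merge of the same runs gives 1, 1, 2, 2, 2, 2 -- globally sorted.
    let mut merged = Vec::with_capacity(run1.len() + run2.len());
    let (mut i, mut j) = (0, 0);
    while i < run1.len() && j < run2.len() {
        if run1[i] <= run2[j] {
            merged.push(run1[i]);
            i += 1;
        } else {
            merged.push(run2[j]);
            j += 1;
        }
    }
    // Drain whichever run still has elements left.
    merged.extend_from_slice(&run1[i..]);
    merged.extend_from_slice(&run2[j..]);
    assert!(merged.windows(2).all(|w| w[0] <= w[1]));
}
```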
Re: [PR] fix: Redundant files spilled during external sort + introduce `SpillManager` [datafusion]
alamb commented on PR #15355: URL: https://github.com/apache/datafusion/pull/15355#issuecomment-2745309065

> There are two extra operators that can be changed to this new interface (Aggregate and SortMergeJoin), they're planned to be included in this PR. I plan to do it after getting some review feedback.

I request that we do this feature in multiple smaller PRs, which will be easier to review / understand.

BTW I think this PR may address some of this issue too:
- https://github.com/apache/datafusion/issues/15323

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Blog post on Parquet filter pushdown [datafusion-site]
omar commented on code in PR #61: URL: https://github.com/apache/datafusion-site/pull/61#discussion_r2008990238

## content/blog/2025-03-21-parquet-pushdown.md: ##

@@ -0,0 +1,312 @@
+---
+layout: post
+title: Efficient Filter Pushdown in Parquet
+date: 2025-03-21
+author: Xiangpeng Hao
+categories: [performance]
+---
+
+
+figure {
+  margin: 20px 0;
+}
+
+figure img {
+  display: block;
+  max-width: 80%;
+}
+
+figcaption {
+  font-style: italic;
+  margin-top: 10px;
+  color: #555;
+  font-size: 0.9em;
+  max-width: 80%;
+}
+
+
+
+_Editor's Note: This blog was first published on [Xiangpeng Hao's blog]. Thanks to [InfluxData] for sponsoring this work as part of his PhD funding._
+
+[Xiangpeng Hao's blog]: https://blog.xiangpeng.systems/posts/parquet-pushdown/
+[InfluxData]: https://www.influxdata.com/
+
+
+
+In the [previous post], we discussed how [Apache DataFusion] prunes [Apache Parquet] files to skip irrelevant **files/row_groups** (sometimes also [pages](https://parquet.apache.org/docs/file-format/pageindex/)).
+
+This post discusses how Parquet readers skip irrelevant **rows** while scanning data,
+leveraging Parquet's columnar layout by first reading only filter columns,
+and then selectively reading other columns only for matching rows.
+
+
+[previous post]: https://datafusion.apache.org/blog/2025/03/20/parquet-pruning
+[Apache DataFusion]: https://datafusion.apache.org/
+[Apache Parquet]: https://parquet.apache.org/
+
+## Why filter pushdown in Parquet?
+
+Below is an example query that reads sensor data with filters on `date_time` and `location`.
+Without filter pushdown, all rows from location, val, and date_time columns are decoded before `location='office'` is evaluated. Filter pushdown is especially useful when the filter is selective, i.e., removes many rows.
+
+
+```sql
+SELECT val, location
+FROM sensor_data
+WHERE date_time > '2025-03-11' AND location = 'office';
+```
+
+
+
+
+Parquet pruning skips irrelevant files/row_groups, while filter pushdown skips irrelevant rows. Without filter pushdown, all rows from location, val, and date_time columns are decoded before `location='office'` is evaluated. Filter pushdown is especially useful when the filter is selective, i.e., removes many rows.
+
+
+
+
+In our setup, sensor data is aggregated by date — each day has its own Parquet file.
+At planning time, DataFusion prunes the unneeded Parquet files, i.e., `2025-03-10.parquet` and `2025-03-11.parquet`.
+
+Once the files to read are located, [*DataFusion's current default implementation*](https://github.com/apache/datafusion/issues/3463) reads all the projected columns (`sensor_id`, `val`, and `location`) into Arrow RecordBatches, then applies the filters over `location` to get the final set of rows.
+
+A better approach is called **filter pushdown**, which evaluates filter conditions first and only decodes data that passes these conditions.
+In practice, this works by first processing only the filter columns (`date_time` and `location`), building a boolean mask of rows that satisfy our conditions, then using this mask to selectively decode only the relevant rows from other columns (`sensor_id`, `val`).
+This eliminates the waste of decoding rows that will be immediately filtered out.
+
+While simple in theory, practical implementations often make performance worse.
+
+## How can filter pushdown be slower?
+
+At a high level, the Parquet reader first builds a filter mask -- essentially a boolean array indicating which rows meet the filter criteria -- and then uses this mask to selectively decode only the needed rows from the remaining columns in the projection.
+
+Let's dig into details of [how filter pushdown is implemented](https://github.com/apache/arrow-rs/blob/d5339f31a60a4bd8a4256e7120fe32603249d88e/parquet/src/arrow/async_reader/mod.rs#L618-L712) in the current Rust Parquet reader implementation, illustrated in the following figure.
+
+
+
+
+Implementation of filter pushdown in Rust Parquet readers -- the first phase builds the filter mask, the second phase applies the filter mask to the other columns
+
+
+
+The filter pushdown has two phases:
+
+1. Build the filter mask (steps 1-3)
+
+2. Use the filter mask to selectively decode other columns (steps 4-7), e.g., the output of step 3 is used as input for steps 5 and 7.
+
+Within each phase, it takes three steps from Parquet to Arrow:
+
+1. Decompress the Parquet pages using generic decompression algorithms like LZ4, Zstd, etc. (steps 1, 4, 6)
+
+2. Decode the page content into Arrow format (steps 2, 5, 7)
+
+3. Evaluate the filter over Arrow data (step 3)
+
+In the figure above, we can see that `location` is **decompressed and decoded twice**, first when building the filter mask (steps 1, 2), and second when building the output (steps 4, 5).
+This happens for all columns that appear both in the filter and output.
+
+The table below shows the corresponding CPU time o
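The two-phase flow quoted above (build a boolean mask from the filter columns, then selectively keep rows of the other columns) can be sketched with the arrow crate's compute kernels. This is a simplified illustration over in-memory arrays, not the Parquet reader's internal code; the column values are made up:

```rust
use arrow::array::{Float64Array, Scalar, StringArray};
use arrow::compute::filter;
use arrow::compute::kernels::cmp::eq;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Columns as they would look after decoding a Parquet page into Arrow.
    let location = StringArray::from(vec!["office", "home", "office"]);
    let val = Float64Array::from(vec![21.5, 19.0, 22.1]);

    // Phase 1: evaluate `location = 'office'` on the filter column only,
    // producing a boolean filter mask.
    let mask = eq(&location, &Scalar::new(StringArray::from(vec!["office"])))?;

    // Phase 2: apply the mask so only matching rows of the other columns
    // are materialized.
    let val_filtered = filter(&val, &mask)?;
    assert_eq!(val_filtered.len(), 2); // rows 0 and 2 survive

    Ok(())
}
```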
Re: [PR] fix: Redundant files spilled during external sort + introduce `SpillManager` [datafusion]
2010YOUY01 commented on code in PR #15355: URL: https://github.com/apache/datafusion/pull/15355#discussion_r2009004733

## datafusion/physical-plan/src/sorts/sort.rs: ##

```diff
@@ -230,9 +219,14 @@ struct ExternalSorter {
     /// if `Self::in_mem_batches` are sorted
     in_mem_batches_sorted: bool,
 
-    /// If data has previously been spilled, the locations of the
-    /// spill files (in Arrow IPC format)
-    spills: Vec<RefCountedTempFile>,
+    /// During external sorting, in-memory intermediate data will be appended to
+    /// this file incrementally. Once finished, this file will be moved to [`Self::finished_spill_files`].
+    in_progress_spill_file: Option<InProgressSpillFile>,
+    /// If data has previously been spilled, the locations of the spill files (in
+    /// Arrow IPC format)
+    /// Within the same spill file, the data might be chunked into multiple batches,
+    /// and ordered by sort keys.
+    finished_spill_files: Vec<RefCountedTempFile>,
```

Review Comment: I think it will be hard to define the semantics of those temp files if we put them inside `SpillManager`, because different operators will interpret those files differently:
- For `SortExec`, `vec` is representing multiple sorted runs on sort keys.
- For `ShuffleWriterExec` in `datafusion-comet`, since `Spark`'s shuffle operator is blocking (due to spark's staged execution design), it might want to keep `vec` instead.
- Similarly, if we want to spill `Row`s to accelerate `SortExec`, or we want to implement spilling hash join, the temp files will have very different logical meanings.

Overall, the `SpillManager` is designed only to do `RecordBatch <-> raw file` with different configurations and stat accounting. Operators have more flexibility to implement specific utilities for managing raw files, which have diverse semantics.

Do you see any potential issues or improvements?

## datafusion/physical-plan/src/sorts/sort.rs: ##

```diff
@@ -379,46 +382,64 @@ impl ExternalSorter {
     /// How many bytes have been spilled to disk?
     fn spilled_bytes(&self) -> usize {
-        self.metrics.spilled_bytes.value()
+        self.metrics.spill_metrics.spilled_bytes.value()
     }
 
     /// How many rows have been spilled to disk?
     fn spilled_rows(&self) -> usize {
-        self.metrics.spilled_rows.value()
+        self.metrics.spill_metrics.spilled_rows.value()
     }
 
     /// How many spill files have been created?
     fn spill_count(&self) -> usize {
-        self.metrics.spill_count.value()
+        self.metrics.spill_metrics.spill_file_count.value()
     }
 
-    /// Writes any `in_memory_batches` to a spill file and clears
-    /// the batches. The contents of the spill file are sorted.
-    ///
-    /// Returns the amount of memory freed.
-    async fn spill(&mut self) -> Result<usize> {
+    /// When calling, all `in_mem_batches` must be sorted, and then all of them will
+    /// be appended to the in-progress spill file.
+    async fn spill_append(&mut self) -> Result<()> {
         // we could always get a chance to free some memory as long as we are holding some
         if self.in_mem_batches.is_empty() {
-            return Ok(0);
+            return Ok(());
+        }
+
+        // Lazily initialize the in-progress spill file
+        if self.in_progress_spill_file.is_none() {
+            self.in_progress_spill_file =
+                Some(self.spill_manager.create_in_progress_file("Sorting")?);
         }
 
         self.organize_stringview_arrays()?;
 
         debug!("Spilling sort data of ExternalSorter to disk whilst inserting");
 
-        let spill_file = self.runtime.disk_manager.create_tmp_file("Sorting")?;
         let batches = std::mem::take(&mut self.in_mem_batches);
-        let (spilled_rows, spilled_bytes) = spill_record_batches(
-            &batches,
-            spill_file.path().into(),
-            Arc::clone(&self.schema),
-        )?;
-        let used = self.reservation.free();
-        self.metrics.spill_count.add(1);
-        self.metrics.spilled_bytes.add(spilled_bytes);
-        self.metrics.spilled_rows.add(spilled_rows);
-        self.spills.push(spill_file);
-        Ok(used)
+        self.reservation.free();
+
+        let in_progress_file = self.in_progress_spill_file.as_mut().ok_or_else(|| {
+            internal_datafusion_err!("In-progress spill file should be initialized")
+        })?;
+
+        for batch in batches {
```

Review Comment: No, they are globally sorted. In different stages, `in_mem_batches` can either represent unordered input, or globally sorted run (but chunked into smaller batches)

I agree this approach has poor understandability and is error-prone, I'll try to improve it.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
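The `RecordBatch <-> raw file` division of responsibility described above can be pictured with a minimal sketch over Arrow IPC streams. This is illustrative only and deliberately simpler than the `SpillManager` in the PR; metrics, temp-file management, and the in-progress file are omitted, and the struct here is hypothetical:

```rust
use std::fs::File;
use std::path::Path;
use std::sync::Arc;
use arrow::datatypes::Schema;
use arrow::error::Result;
use arrow::ipc::reader::StreamReader;
use arrow::ipc::writer::StreamWriter;
use arrow::record_batch::RecordBatch;

/// Sketch: the manager only moves batches to and from raw files; each operator
/// decides what a file *means* (a sorted run, a shuffle partition, ...).
struct SpillManager {
    schema: Arc<Schema>,
}

impl SpillManager {
    /// Write a sequence of batches to one raw spill file in Arrow IPC stream format.
    fn spill_batches(&self, batches: &[RecordBatch], path: &Path) -> Result<()> {
        let file = File::create(path)?;
        let mut writer = StreamWriter::try_new(file, &self.schema)?;
        for batch in batches {
            writer.write(batch)?;
        }
        writer.finish()
    }

    /// Read a raw spill file back into batches.
    fn read_spill(&self, path: &Path) -> Result<Vec<RecordBatch>> {
        let file = File::open(path)?;
        StreamReader::try_new(file, None)?.collect()
    }
}
```

With this split, a sort operator can keep its own `Vec` of finished sorted-run files while a shuffle operator keeps per-partition files, and both reuse the same read/write plumbing.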
Re: [PR] Fix array_has_all and array_has_any with empty array [datafusion]
LuQQiu commented on PR #15039: URL: https://github.com/apache/datafusion/pull/15039#issuecomment-2741622696 @alamb @Weijun-H Thanks for the review! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Fix parquet pruning blog post hyperlink [datafusion-site]
alamb merged PR #62: URL: https://github.com/apache/datafusion-site/pull/62 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Enhance Schema adapter to accommodate evolving struct [datafusion]
TheBuilderJR commented on PR #15295: URL: https://github.com/apache/datafusion/pull/15295#issuecomment-2745948813 Nice! FWIW, another edge case I found recently that's probably worth testing is a List whose inner Struct evolves. I ended up solving it by updating `list_coercion`, but I'm curious if you have a better way: https://github.com/apache/datafusion/pull/15259/files -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
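For reference, a union-of-fields merge for a `List` whose inner `Struct` evolves can be sketched directly over Arrow's `DataType`s. This is an illustrative approach, not necessarily what `list_coercion` or PR #15259 does; the function name is made up:

```rust
use std::sync::Arc;
use arrow::datatypes::{DataType, Field, Fields};

/// Merge two `List<Struct>` types by taking the union of the struct's fields,
/// so a file written before a column was added can still be read.
fn merge_list_of_struct(old: &DataType, new: &DataType) -> Option<DataType> {
    let (DataType::List(old_item), DataType::List(new_item)) = (old, new) else {
        return None;
    };
    let (DataType::Struct(old_fields), DataType::Struct(new_fields)) =
        (old_item.data_type(), new_item.data_type())
    else {
        return None;
    };

    // Start from the old fields, then append any field only present in the new schema.
    let mut merged: Vec<Arc<Field>> = old_fields.iter().cloned().collect();
    for f in new_fields.iter() {
        if old_fields.iter().all(|of| of.name() != f.name()) {
            merged.push(Arc::clone(f));
        }
    }
    Some(DataType::List(Arc::new(Field::new(
        new_item.name(),
        DataType::Struct(Fields::from(merged)),
        true, // rows from files that predate new fields need nullability
    ))))
}

fn main() {
    let old = DataType::List(Arc::new(Field::new(
        "item",
        DataType::Struct(Fields::from(vec![Field::new("a", DataType::Int32, true)])),
        true,
    )));
    let new = DataType::List(Arc::new(Field::new(
        "item",
        DataType::Struct(Fields::from(vec![
            Field::new("a", DataType::Int32, true),
            Field::new("b", DataType::Utf8, true),
        ])),
        true,
    )));
    println!("{:?}", merge_list_of_struct(&old, &new)); // List of Struct{a, b}
}
```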
[PR] Format `Date32` to string given timestamp specifiers [datafusion]
friendlymatthew opened a new pull request, #15361: URL: https://github.com/apache/datafusion/pull/15361

Closes https://github.com/apache/datafusion/issues/14536

## Rationale for this change

DataFusion currently errors when attempting to format a date using time-related specifiers.

```sql
> select to_char('2023-09-04'::date, '%Y-%m-%dT%H:%M:%S%.3f');
Execution error: Cast error: Format error
```

However, Postgres supports this feature, as it implicitly treats the date as a timestamp. Rather than eagerly casting every `Date32` to a `Date64` when calling `to_char`, this commit attempts to first format a `Date32` with the supplied format string. If the formatting fails, we retry formatting it as a `Date64`. This way, only format strings with time-related specifiers incur the intermediate cast.

## What changes are included in this PR?

All changes are tested, specifically for two different call sites: `_to_char_scalar` and `_to_char_array`.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
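The try-first, fall-back-on-failure idea from the rationale can be pictured with plain chrono. This is a simplified illustration, not the PR's code (which goes through Arrow's formatting machinery), and `to_char_date` is a hypothetical helper:

```rust
use std::fmt::Write;
use chrono::{NaiveDate, NaiveDateTime, NaiveTime};

/// Try to format a date with the given chrono format string; if the format
/// needs time fields (e.g. `%H:%M:%S`), fall back to formatting the date as
/// a timestamp at midnight, mirroring Postgres' implicit date -> timestamp cast.
fn to_char_date(date: NaiveDate, fmt: &str) -> String {
    let mut out = String::new();
    // chrono reports time specifiers on a date-only value as a fmt::Error.
    if write!(out, "{}", date.format(fmt)).is_ok() {
        return out;
    }
    // Date-only formatting failed: re-run with a midnight timestamp.
    let ts: NaiveDateTime = date.and_time(NaiveTime::MIN);
    format!("{}", ts.format(fmt))
}

fn main() {
    let d = NaiveDate::from_ymd_opt(2023, 9, 4).unwrap();
    assert_eq!(to_char_date(d, "%Y-%m-%d"), "2023-09-04");
    assert_eq!(
        to_char_date(d, "%Y-%m-%dT%H:%M:%S%.3f"),
        "2023-09-04T00:00:00.000"
    );
}
```

Only the second call pays for the fallback, which matches the PR's goal of keeping the common date-only path cheap.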
Re: [PR] Parse Postgres's LOCK TABLE statement [datafusion-sqlparser-rs]
github-actions[bot] commented on PR #1614: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1614#issuecomment-2745969727 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open for 60 days with no activity. Please remove the stale label or add a comment, or this will be closed in 7 days. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org