Re: [PR] feat: support merge for `Distribution` [datafusion]
xudong963 commented on PR #15296: URL: https://github.com/apache/datafusion/pull/15296#issuecomment-2743593315

> Do you know any use cases where this method would be especially useful? If so, maybe we can study one of those cases in more detail. That could help us understand the real need and guide us toward a more solid algorithm.

Yes, we're considering restarting the [work](https://github.com/apache/datafusion/pull/13296/files#diff-8d786f45bc2d5bf629754a119ed6fa7998dcff7faacd954c45945b7047b87fa1R498), and given that `Precision` will be replaced with `Distribution`, I opened the proposal to discuss how to do **merge** for `Distribution`.

> a) Mixture Model (Weighted Average of PDFs)
> This is a method for combining different probability distributions.
> p1(x) and p2(x) are PDFs, and we give them equal weight (0.5). The combined PDF would be:
> _p_mix(x) = 0.5 * p1(x) + 0.5 * p2(x)_
> This creates a probabilistic blend of the two. The result is still a valid PDF (non-negative and integrates to 1).

It seems that my current approach is close to the mixture model; a toy sketch of that merge follows below.
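For intuition, here is a tiny self-contained sketch of the mixture-model merge. The `Pdf` type and `merge_mixture` helper are hypothetical illustrations for discrete distributions, not the `Distribution` API from this PR:

```rust
/// Illustrative only: NOT DataFusion's `Distribution` type.
#[derive(Debug, Clone)]
struct Pdf {
    /// (value, probability) pairs; probabilities sum to 1.
    mass: Vec<(f64, f64)>,
}

/// Merge two PDFs as a mixture model: p_mix(x) = w * p1(x) + (1 - w) * p2(x).
/// The result is still a valid PDF (non-negative, sums to 1).
fn merge_mixture(p1: &Pdf, p2: &Pdf, w: f64) -> Pdf {
    assert!((0.0..=1.0).contains(&w));
    let mut mass: Vec<(f64, f64)> = p1.mass.iter().map(|&(x, p)| (x, w * p)).collect();
    for &(x, p) in &p2.mass {
        // Accumulate mass for values present in both distributions.
        match mass.iter_mut().find(|(y, _)| *y == x) {
            Some((_, q)) => *q += (1.0 - w) * p,
            None => mass.push((x, (1.0 - w) * p)),
        }
    }
    Pdf { mass }
}

fn main() {
    let p1 = Pdf { mass: vec![(1.0, 0.5), (2.0, 0.5)] };
    let p2 = Pdf { mass: vec![(2.0, 1.0)] };
    // Equal weights: value 1.0 keeps mass 0.25, value 2.0 accumulates 0.75.
    println!("{:?}", merge_mixture(&p1, &p2, 0.5));
}
```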
Re: [I] multiply overflow in stats.rs [datafusion]
LindaSummer commented on issue #13775: URL: https://github.com/apache/datafusion/issues/13775#issuecomment-2746049186

Hi @Speculative,

Thanks very much for your investigation! ❤ I'm sorry that last month, due to personal circumstances, I didn't have enough time to follow up on this issue. I will try to follow your reproduction steps for this problem, and I will update this thread or reach out to you when I have any new progress or problems reproducing, if you don't mind.

Best Regards,
Edward
Re: [I] Reduce number of tokio blocking threads in SortExec spill [datafusion]
alamb commented on issue #15323: URL: https://github.com/apache/datafusion/issues/15323#issuecomment-2744217807

Makes sense -- with 183 spill files, we probably would need to merge in stages (sketched below). For example, starting with 183 spill files:

1. run 10 jobs, each merging about 18 files into one (results in 10 files)
2. run the final merge of those 10 files

This results in 2x the IO (we have to read / write each row twice), but it would at least be possible to parallelize the merges of the earlier step.

I think @2010YOUY01 was starting to look into a SpillFileManager -- this is the kind of behavior I would imagine being part of such a thing
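To make the staging concrete, here is a minimal sketch of how such a two-stage plan could be computed. The function and file names are hypothetical, not part of any existing SpillFileManager API:

```rust
/// Split `files` into `num_jobs` groups that can be merged in parallel;
/// the final stage then merges the intermediate results. Returns the
/// groups for stage 1 and the fan-in of the final merge.
fn two_stage_merge_plan(files: Vec<String>, num_jobs: usize) -> (Vec<Vec<String>>, usize) {
    let group_size = files.len().div_ceil(num_jobs);
    let groups: Vec<Vec<String>> = files.chunks(group_size).map(|c| c.to_vec()).collect();
    let final_fan_in = groups.len(); // number of inputs to the final merge
    (groups, final_fan_in)
}

fn main() {
    let files: Vec<String> = (0..183).map(|i| format!("spill-{i}.arrow")).collect();
    let (groups, final_fan_in) = two_stage_merge_plan(files, 10);
    // 10 parallel jobs of ~18-19 files each, then one final 10-way merge:
    // each row is read and written twice (2x IO), as described above.
    assert_eq!(groups.len(), 10);
    assert_eq!(final_fan_in, 10);
}
```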
Re: [PR] Improve performance of `first_value` by implementing special `GroupsAccumulator` [datafusion]
Dandandan commented on code in PR #15266: URL: https://github.com/apache/datafusion/pull/15266#discussion_r2008942439

## datafusion/functions-aggregate/src/first_last.rs:

@@ -179,6 +292,420 @@ impl AggregateUDFImpl for FirstValue {

```rust
struct FirstPrimitiveGroupsAccumulator<T>
where
    T: ArrowPrimitiveType + Send,
{
    // ===== state =====
    vals: Vec<T::Native>,
    // Stores ordering values, of the aggregator requirement corresponding to first value
    // of the aggregator.
    // The `orderings` are stored row-wise, meaning that `orderings[group_idx]`
    // represents the ordering values corresponding to the `group_idx`-th group.
    orderings: Vec<Vec<ScalarValue>>,
    // At the beginning, `is_sets[group_idx]` is false, which means `first` is not seen yet.
    // Once we see the first value, we set the `is_sets[group_idx]` flag
    is_sets: BooleanBufferBuilder,
    // null_builder[group_idx] == false => vals[group_idx] is null
    null_builder: BooleanBufferBuilder,
    // size of `self.orderings`
    // Calculating the memory usage of `self.orderings` using `ScalarValue::size_of_vec` is quite costly.
    // Therefore, we cache it and compute `size_of` only after each update
    // to avoid calling `ScalarValue::size_of_vec` by Self.size.
    size_of_orderings: usize,

    // ===== options =====

    // Stores the applicable ordering requirement.
    ordering_req: LexOrdering,
    // derived from `ordering_req`.
    sort_options: Vec<SortOptions>,
    // Stores whether incoming data already satisfies the ordering requirement.
    input_requirement_satisfied: bool,
    // Ignore null values.
    ignore_nulls: bool,
    /// The output type
    data_type: DataType,
    default_orderings: Vec<ScalarValue>,
}

impl<T> FirstPrimitiveGroupsAccumulator<T>
where
    T: ArrowPrimitiveType + Send,
{
    fn try_new(
        ordering_req: LexOrdering,
        ignore_nulls: bool,
        data_type: &DataType,
        ordering_dtypes: &[DataType],
    ) -> Result<Self> {
        let requirement_satisfied = ordering_req.is_empty();

        let default_orderings = ordering_dtypes
            .iter()
            .map(ScalarValue::try_from)
            .collect::<Result<Vec<_>>>()?;

        let sort_options = get_sort_options(ordering_req.as_ref());

        Ok(Self {
            null_builder: BooleanBufferBuilder::new(0),
            ordering_req,
            sort_options,
            input_requirement_satisfied: requirement_satisfied,
            ignore_nulls,
            default_orderings,
            data_type: data_type.clone(),
            vals: Vec::new(),
            orderings: Vec::new(),
            is_sets: BooleanBufferBuilder::new(0),
            size_of_orderings: 0,
        })
    }

    fn need_update(&self, group_idx: usize) -> bool {
        if !self.is_sets.get_bit(group_idx) {
            return true;
        }

        if self.ignore_nulls && !self.null_builder.get_bit(group_idx) {
            return true;
        }

        !self.input_requirement_satisfied
    }

    fn should_update_state(
        &self,
        group_idx: usize,
        new_ordering_values: &[ScalarValue],
    ) -> Result<bool> {
        if !self.is_sets.get_bit(group_idx) {
            return Ok(true);
        }

        assert!(new_ordering_values.len() == self.ordering_req.len());
        let current_ordering = &self.orderings[group_idx];
        compare_rows(current_ordering, new_ordering_values, &self.sort_options)
            .map(|x| x.is_gt())
    }

    fn take_orderings(&mut self, emit_to: EmitTo) -> Vec<Vec<ScalarValue>> {
        let result = emit_to.take_needed(&mut self.orderings);

        match emit_to {
            EmitTo::All => self.size_of_orderings = 0,
            EmitTo::First(_) => {
                self.size_of_orderings -=
                    result.iter().map(ScalarValue::size_of_vec).sum::<usize>()
            }
        }

        result
    }

    fn take_need(
        bool_buf_builder: &mut BooleanBufferBuilder,
        emit_to: EmitTo,
    ) -> BooleanBuffer {
        let bool_buf = bool_buf_builder.finish();
        match emit_to {
            EmitTo::All => bool_buf,
            EmitTo::First(n) => {
                // split off the first N values in seen_values
                //
                // TODO make this more efficient rather than two
                // copies and bitwise manipulation
                let first_n: BooleanBuffer = bool_buf.iter().take(n).collect();
                // reset the existing buffer
                for b in bool_buf.iter().skip(n) {
                    bool_buf_builder.append(b);
                }
                first_n
            }
        }
    }

    fn resize_states(&mut self, new_size: usize) {
        self.vals.resize(new_size, T::default_value());

        self.null_builder.resize(new_size);

        if self.orderings.len() < new_size {
            let
```
Re: [PR] Always use `PartitionMode::Auto` in planner [datafusion]
ozankabak commented on code in PR #15339: URL: https://github.com/apache/datafusion/pull/15339#discussion_r2008738870

## datafusion/sqllogictest/test_files/explain_tree.slt:

@@ -345,63 +345,68 @@ FROM
 physical_plan
 01)┌───┐
-02)│CoalesceBatchesExec│
+02)│ ProjectionExec │

Review Comment:
Why do we have this projection here? Shouldn't the built-in projection of hash join handle this?
Re: [PR] Only unnest source for `EmptyRelation` [datafusion]
goldmedal commented on code in PR #15159: URL: https://github.com/apache/datafusion/pull/15159#discussion_r2008701304

## datafusion/sql/tests/cases/plan_to_sql.rs:

@@ -633,6 +633,18 @@ fn roundtrip_statement_with_dialect() -> Result<()> {

```rust
        TestStatementWithDialect {
            sql: "SELECT unnest([1, 2, 3, 4]) from unnest([1, 2, 3]);",
            expected: r#"SELECT UNNEST([1, 2, 3, 4]) AS UNNEST(make_array(Int64(1),Int64(2),Int64(3),Int64(4))) FROM UNNEST([1, 2, 3])"#,
            parser_dialect: Box::new(GenericDialect {}),
            unparser_dialect: Box::new(CustomDialectBuilder::default().with_unnest_as_table_factor(true).build()),
        },
        TestStatementWithDialect {
            sql: "SELECT unnest([1, 2, 3, 4]) from unnest([1, 2, 3]);",
            expected: r#"SELECT UNNEST([1, 2, 3, 4]) AS UNNEST(make_array(Int64(1),Int64(2),Int64(3),Int64(4))) FROM UNNEST([1, 2, 3])"#,
            parser_dialect: Box::new(GenericDialect {}),
            unparser_dialect: Box::new(CustomDialectBuilder::default().with_unnest_as_table_factor(true).build()),
        },
```

Review Comment:
These two look the same. 🤔 I think we can remove one of them.
Re: [PR] Triggering extended tests through PR comment [datafusion]
Omega359 commented on PR #15101: URL: https://github.com/apache/datafusion/pull/15101#issuecomment-2743066275

@alamb
Re: [PR] Improve collection during repr and repr_html [datafusion-python]
timsaucer commented on code in PR #1036: URL: https://github.com/apache/datafusion-python/pull/1036#discussion_r2008764438

## src/dataframe.rs:

@@ -771,3 +871,82 @@ fn record_batch_into_schema(
     RecordBatch::try_new(schema, data_arrays)
 }

```rust
/// This is a helper function to return the first non-empty record batch from executing a DataFrame.
/// It additionally returns a bool, which indicates if there are more record batches available.
/// We do this so we can determine if we should indicate to the user that the data has been
/// truncated. This collects until we have achieved both of these conditions:
///
/// - We have collected our minimum number of rows
/// - We have reached our limit, either data size or maximum number of rows
///
/// Otherwise it will return when the stream has been exhausted. If you want a specific number of
/// rows, set min_rows == max_rows.
async fn collect_record_batches_to_display(
    df: DataFrame,
    min_rows: usize,
    max_rows: usize,
) -> Result<(Vec<RecordBatch>, bool), DataFusionError> {
    let mut stream = df.execute_stream().await?;
    let mut size_estimate_so_far = 0;
    let mut rows_so_far = 0;
    let mut record_batches = Vec::default();
    let mut has_more = false;

    while (size_estimate_so_far < MAX_TABLE_BYTES_TO_DISPLAY && rows_so_far < max_rows)
        || rows_so_far < min_rows
    {
        let mut rb = match stream.next().await {
            None => {
                break;
            }
            Some(Ok(r)) => r,
            Some(Err(e)) => return Err(e),
        };

        let mut rows_in_rb = rb.num_rows();
        if rows_in_rb > 0 {
            size_estimate_so_far += rb.get_array_memory_size();

            if size_estimate_so_far > MAX_TABLE_BYTES_TO_DISPLAY {
                let ratio = MAX_TABLE_BYTES_TO_DISPLAY as f32 / size_estimate_so_far as f32;
                let total_rows = rows_in_rb + rows_so_far;

                let mut reduced_row_num = (total_rows as f32 * ratio).round() as usize;
```

Review Comment:
Yes. And the data size is an estimate as well. The point is to get a general ballpark and not necessarily an exact measure. It should still indicate that the data have been truncated.
Re: [PR] Support Avg distinct [datafusion]
qazxcdswe123 commented on code in PR #15356: URL: https://github.com/apache/datafusion/pull/15356#discussion_r2008792246

## datafusion/functions-aggregate-common/src/aggregate/avg_distinct/decimal.rs:

@@ -0,0 +1,133 @@

```rust
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements.  See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership.  The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License.  You may obtain a copy of the License at
//
//   http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied.  See the License for the
// specific language governing permissions and limitations
// under the License.

use ahash::RandomState;
use arrow::{
    array::{ArrayRef, ArrowNativeTypeOp, PrimitiveArray},
    datatypes::{ArrowNativeType, DecimalType},
};
use datafusion_common::{
    cast::{as_list_array, as_primitive_array},
    utils::{memory::estimate_memory_size, SingleRowListArrayBuilder},
    HashSet, ScalarValue,
};
use datafusion_expr_common::accumulator::Accumulator;
use std::{fmt::Debug, sync::Arc};

use crate::utils::{DecimalAverager, Hashable};

/// Generic implementation of `AVG DISTINCT` for Decimal types.
/// Handles both Decimal128Type and Decimal256Type.
#[derive(Debug)]
pub struct DecimalDistinctAvgAccumulator<T: DecimalType> {
    values: HashSet<Hashable<T::Native>, RandomState>,
    sum_scale: i8,
    target_precision: u8,
    target_scale: i8,
}

impl<T: DecimalType> DecimalDistinctAvgAccumulator<T> {
    pub fn with_decimal_params(
        sum_scale: i8,
        target_precision: u8,
        target_scale: i8,
    ) -> Self {
        Self {
            values: HashSet::default(),
            sum_scale,
            target_precision,
            target_scale,
        }
    }
}

impl<T: DecimalType> Accumulator for DecimalDistinctAvgAccumulator<T> {
    fn state(&mut self) -> datafusion_common::Result<Vec<ScalarValue>> {
        let arr = Arc::new(
            PrimitiveArray::<T>::from_iter_values(self.values.iter().map(|v| v.0))
                .with_data_type(T::TYPE_CONSTRUCTOR(T::MAX_PRECISION, self.sum_scale)),
```

Review Comment:
I just use MAX_PRECISION and I think it should be fine?
[I] Add "end to end parquet reading test" for WASM [datafusion]
alamb opened a new issue, #15357: URL: https://github.com/apache/datafusion/issues/15357

I think the current code tests the re-exported Parquet functionalities, not touching the DataFusion-related code. Ideally, we should test the end-to-end Parquet reading process. The process roughly looks like this (a sketch follows below):

1. Create an [in-memory object_store](https://docs.rs/object_store/latest/object_store/memory/struct.InMemory.html), and put the Parquet data you generated into the object_store.
2. [Register the object_store](https://github.com/apache/datafusion/blob/3269f01b42021cfab181577d579b0544808b4fca/datafusion/core/src/execution/context/mod.rs#L494) along with the path to DataFusion.
3. Run a SQL query from the DataFusion side to see if the results can be read back.

A loosely related test can be found here: https://github.com/XiangpengHao/parquet-viewer/blob/main/src/tests.rs#L9

_Originally posted by @XiangpengHao in https://github.com/apache/datafusion/pull/15325#discussion_r2008190901_
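A rough sketch of what such a test could look like, assuming the `datafusion`, `object_store`, `url`, and `tokio` crates. The `mem://bucket` scheme is an arbitrary choice, and the exact `put` payload conversion varies across object_store versions:

```rust
use std::sync::Arc;

use datafusion::arrow::array::Int32Array;
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::parquet::arrow::ArrowWriter;
use datafusion::prelude::*;
use object_store::{memory::InMemory, path::Path, ObjectStore};
use url::Url;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Generate a small Parquet file entirely in memory.
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        Arc::clone(&schema),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
    )?;
    let mut parquet_bytes = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut parquet_bytes, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;

    // 1. Put the Parquet bytes into an in-memory object store.
    let store = Arc::new(InMemory::new());
    store
        .put(&Path::from("data/test.parquet"), parquet_bytes.into())
        .await?;

    // 2. Register the object store with DataFusion under a custom scheme.
    let ctx = SessionContext::new();
    ctx.register_object_store(&Url::parse("mem://bucket").unwrap(), store);

    // 3. Read the data back through the full DataFusion SQL path.
    ctx.register_parquet(
        "t",
        "mem://bucket/data/test.parquet",
        ParquetReadOptions::default(),
    )
    .await?;
    let results = ctx.sql("SELECT count(*) FROM t").await?.collect().await?;
    assert!(!results.is_empty());
    Ok(())
}
```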
Re: [I] Add "end to end parquet reading test" for WASM [datafusion]
jsai28 commented on issue #15357: URL: https://github.com/apache/datafusion/issues/15357#issuecomment-2745315413

take
Re: [I] Improve Spill Performance: `mmap` the spill files [datafusion]
alamb commented on issue #15321: URL: https://github.com/apache/datafusion/issues/15321#issuecomment-2745343112

BTW I am not sure the code is really in a great position to do this one yet -- it might help to wait for @2010YOUY01 (or help him) to pull some of the spilling code into its own structure -- see https://github.com/apache/datafusion/pull/15355
Re: [PR] fix type coercion for uint/int's [datafusion]
Omega359 commented on PR #15341: URL: https://github.com/apache/datafusion/pull/15341#issuecomment-2745343678

> However, I am struggling to understand the implications of this change to a user. Like for example, if we were going to add a note about this in the upgrade / release notes, what would it say?

An issue was fixed where type coercion between expressions using certain mathematical operations on mixed unsigned / signed types could produce values with an incorrect type, one not large enough to encompass all the possible values of both input types. For example, comparing an unsigned int32 with a signed int32 could result in values having the int32 type (where it should be int64) and could produce a "Can't cast .." error for any unsigned value larger than the maximum int32 value. This change may result in expressions unexpectedly having a 'larger' output type than they would have had in previous releases.

> Or put another way, what problem is this PR solving (the ticket just describes what the code does as wrong but it doesn't say why) 🤔

```Rust
let df = df
    .select_columns(&[P_ID, IDENTITY_KEY_VALUE])?
    .with_column(
        IDENTITY_KEY_VALUE,
        cast(hex_to_u64.call(vec![col(IDENTITY_KEY_VALUE)]), DataType::UInt64),
    )?
    .with_column(PARTITION_COLUMN, col(IDENTITY_KEY_VALUE).rem(lit(64)))?;
```

That currently throws `Cast error: Can't cast value 16858640775341098663 to type Int32`. Is it easy to work around? Yes. Should it happen? No.
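For intuition on why the coercion has to widen, here is a tiny pure-Rust illustration (not DataFusion code) of the ranges involved:

```rust
// Neither 32-bit type can represent the other's full range, but i64 holds both,
// which is why UInt32 vs. Int32 must coerce to Int64.
fn main() {
    let unsigned: u32 = 4_294_967_295; // u32::MAX, not representable as i32
    let signed: i32 = -1;              // negative, not representable as u32

    // Coercing both sides to i64 compares the true values losslessly.
    assert!(i64::from(unsigned) > i64::from(signed));

    // Coercing to i32 instead would wrap u32::MAX around to -1 and compare equal.
    assert_eq!(unsigned as i32, signed);
}
```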
Re: [I] Support `merge` for `Distribution` [datafusion]
xudong963 closed issue #15290: Support `merge` for `Distribution`
URL: https://github.com/apache/datafusion/issues/15290
Re: [PR] feat: support merge for `Distribution` [datafusion]
xudong963 closed pull request #15296: feat: support merge for `Distribution`
URL: https://github.com/apache/datafusion/pull/15296
Re: [PR] 1064/enhancement/add functions to Expr class [datafusion-python]
timsaucer commented on PR #1074: URL: https://github.com/apache/datafusion-python/pull/1074#issuecomment-2745345231

Let me know when it's ready for review
Re: [PR] feat: support merge for `Distribution` [datafusion]
xudong963 commented on PR #15296: URL: https://github.com/apache/datafusion/pull/15296#issuecomment-2745345742

Thanks for your suggestions!! @alamb @ozankabak @berkaysynnada and @kosiew

I'll continue this work after the `Migrate to Distribution from Precision` effort is done. I think once that lands, the new statistics framework will be relatively stable and we can see how `Distribution` is integrated into the DataFusion core. It will definitely help me do further work around the new statistics framework.

Again, I sincerely appreciate you taking the time to review and discuss ❤️
Re: [PR] 1064/enhancement/add functions to Expr class [datafusion-python]
timsaucer commented on PR #1074: URL: https://github.com/apache/datafusion-python/pull/1074#issuecomment-2745345097

I do love this PR. I hadn't looked at it since it's in draft, but I fully endorse.
Re: [I] DeltaLake integration not working (Python) (FFI Table providers not working) [datafusion-python]
riziles closed issue #1077: DeltaLake integration not working (Python) (FFI Table providers not working)
URL: https://github.com/apache/datafusion-python/issues/1077
Re: [PR] feat: add test to check for `ctx.read_json()` [datafusion-ballista]
westhide commented on PR #1212: URL: https://github.com/apache/datafusion-ballista/pull/1212#issuecomment-2743805128

> I believe the whole job should be cancelled

Yes, working on it.
Re: [PR] Blog post on Parquet filter pushdown [datafusion-site]
XiangpengHao commented on PR #61: URL: https://github.com/apache/datafusion-site/pull/61#issuecomment-2745401638

Thank you @kevinjqliu
Re: [PR] build(deps): bump datafusion-substrait from 45.0.0 to 46.0.0 [datafusion-python]
dependabot[bot] closed pull request #1048: build(deps): bump datafusion-substrait from 45.0.0 to 46.0.0
URL: https://github.com/apache/datafusion-python/pull/1048
Re: [PR] build(deps): bump datafusion from 45.0.0 to 46.0.0 [datafusion-python]
dependabot[bot] commented on PR #1051: URL: https://github.com/apache/datafusion-python/pull/1051#issuecomment-2745403955

Looks like datafusion is up-to-date now, so this is no longer needed.
Re: [PR] build(deps): bump datafusion-proto from 45.0.0 to 46.0.0 [datafusion-python]
dependabot[bot] commented on PR #1049: URL: https://github.com/apache/datafusion-python/pull/1049#issuecomment-2745403935

Looks like datafusion-proto is up-to-date now, so this is no longer needed.
Re: [PR] build(deps): bump datafusion-ffi from 45.0.0 to 46.0.0 [datafusion-python]
dependabot[bot] closed pull request #1050: build(deps): bump datafusion-ffi from 45.0.0 to 46.0.0
URL: https://github.com/apache/datafusion-python/pull/1050
Re: [PR] build(deps): bump datafusion-proto from 45.0.0 to 46.0.0 [datafusion-python]
dependabot[bot] closed pull request #1049: build(deps): bump datafusion-proto from 45.0.0 to 46.0.0
URL: https://github.com/apache/datafusion-python/pull/1049
Re: [PR] build(deps): bump uuid from 1.13.1 to 1.16.0 [datafusion-python]
dependabot[bot] commented on PR #1068: URL: https://github.com/apache/datafusion-python/pull/1068#issuecomment-2745403994

Looks like uuid is up-to-date now, so this is no longer needed.
Re: [PR] build(deps): bump datafusion-substrait from 45.0.0 to 46.0.0 [datafusion-python]
dependabot[bot] commented on PR #1048: URL: https://github.com/apache/datafusion-python/pull/1048#issuecomment-2745403927

Looks like datafusion-substrait is up-to-date now, so this is no longer needed.
Re: [PR] build(deps): bump arrow from 54.2.0 to 54.2.1 [datafusion-python]
dependabot[bot] closed pull request #1038: build(deps): bump arrow from 54.2.0 to 54.2.1
URL: https://github.com/apache/datafusion-python/pull/1038
Re: [PR] feat: Update DataFusion dependency to 46 [datafusion-python]
timsaucer merged PR #1079: URL: https://github.com/apache/datafusion-python/pull/1079
Re: [PR] Parse `SUBSTR` as alias for `SUBSTRING` [datafusion-sqlparser-rs]
iffyio merged PR #1769: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1769
[PR] feat: Update DataFusion dependency to 46 [datafusion-python]
timsaucer opened a new pull request, #1079: URL: https://github.com/apache/datafusion-python/pull/1079

# Which issue does this PR close?

None

# Rationale for this change

In preparing for the next release, update the datafusion dependency.

# What changes are included in this PR?

Corresponding changes in the Expr representation due to upstream changes. Window and Aggregate functions have a slight change to their structs.

# Are there any user-facing changes?

None. These are all internal changes.
Re: [PR] Documentation: Plan custom expressions [datafusion]
alamb commented on code in PR #15353: URL: https://github.com/apache/datafusion/pull/15353#discussion_r2008792240

## docs/source/library-user-guide/adding-udfs.md:

@@ -1160,6 +1160,89 @@ async fn main() -> Result<()> {

## Custom Expression Planning

DataFusion provides native support for a limited set of SQL operators by default. For operators not natively defined, developers can extend DataFusion's functionality by implementing custom expression planning. This extensibility is a core feature of DataFusion, allowing it to be customized for particular workloads and requirements.

### Implementing Custom Expression Planning

To extend DataFusion with support for custom operators not natively available, you need to:

1. Implement the `ExprPlanner` trait: This allows you to define custom logic for planning expressions that DataFusion doesn't natively recognize. The trait provides the necessary interface to translate logical expressions into physical execution plans.

   For a detailed documentation please see: [Trait ExprPlanner](https://docs.rs/datafusion/latest/datafusion/logical_expr/planner/trait.ExprPlanner.html)

2. Register your custom planner: Integrate your implementation with DataFusion's `SessionContext` to ensure your custom planning logic is invoked during the query optimization and execution planning phase.

   For a detailed documentation please see: [fn register_expr_planner](https://docs.rs/datafusion/latest/datafusion/execution/trait.FunctionRegistry.html#method.register_expr_planner)

Review Comment:
```suggestion
   For a detailed documentation see: [fn register_expr_planner](https://docs.rs/datafusion/latest/datafusion/execution/trait.FunctionRegistry.html#method.register_expr_planner)
```

## docs/source/library-user-guide/adding-udfs.md:

(same hunk as above, continuing)

See example below:

```rust
# use arrow::array::RecordBatch;
# use std::sync::Arc;

# use datafusion::common::{assert_batches_eq, DFSchema};
# use datafusion::error::Result;
# use datafusion::execution::FunctionRegistry;
# use datafusion::logical_expr::Operator;
# use datafusion::prelude::*;
# use datafusion::sql::sqlparser::ast::BinaryOperator;
# use datafusion_common::ScalarValue;
# use datafusion_expr::expr::Alias;
# use datafusion_expr::planner::{ExprPlanner, PlannerResult, RawBinaryExpr};
# use datafusion_expr::BinaryExpr;

# #[derive(Debug)]
# // Define the custom planner
# struct MyCustomPlanner;

// Implement ExprPlanner for custom operator logic
impl ExprPlanner for MyCustomPlanner {
    fn plan_binary_op(
        &self,
        expr: RawBinaryExpr,
        _schema: &DFSchema,
    ) -> Result<PlannerResult<RawBinaryExpr>> {
        match &expr.op {
            // Map `->` to string concatenation
            BinaryOperator::Arrow => {
                // Rewrite `->` as a string concatenation operation
                // - `left` and `right` are the operands (e.g., 'hello' and 'world')
                // - `Operator::StringConcat` tells DataFusion to concatenate them
                Ok(PlannerResult::Planned(Expr::BinaryExpr(BinaryExpr {
                    left: Box::new(expr.left.clone()),
                    right: Box::new(expr.right.clone()),
                    op: Operator::StringConcat,
                })))
            }
            _ => Ok(PlannerResult::Original(expr)),
        }
    }
}

use datafusion::execution::context::SessionContext;
use datafusion::arrow::util::pretty;

#[tokio::main]
async fn main() -> Result<()> {
    let config =
```
Re: [I] DeltaLake integration not working (Python) (FFI Table providers not working) [datafusion-python]
riziles commented on issue #1077: URL: https://github.com/apache/datafusion-python/issues/1077#issuecomment-2745324305

@timsaucer , I upgraded deltalake to 0.25.4 and datafusion to 45.2.0, and the example above now works fine. Thank you!
Re: [PR] feat: Add `datafusion-spark` crate [datafusion]
alamb commented on code in PR #15168: URL: https://github.com/apache/datafusion/pull/15168#discussion_r2008804385

## datafusion/spark/src/function/math/expm1.rs:

@@ -0,0 +1,169 @@

```rust
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements.  See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership.  The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License.  You may obtain a copy of the License at
//
//   http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied.  See the License for the
// specific language governing permissions and limitations
// under the License.

use crate::function::error_utils::{
    invalid_arg_count_exec_err, unsupported_data_type_exec_err,
};
use arrow::array::{ArrayRef, AsArray};
use arrow::datatypes::{DataType, Float64Type};
use datafusion_common::{Result, ScalarValue};
use datafusion_expr::{
    ColumnarValue, ScalarFunctionArgs, ScalarUDFImpl, Signature, Volatility,
};
use datafusion_macros::user_doc;
use std::any::Any;
use std::sync::Arc;

#[user_doc(
    doc_section(label = "Spark Math Functions"),
    description = "Returns exp(expr) - 1 as a Float64.",
    syntax_example = "expm1(expr)",
    sql_example = r#"```sql
> select expm1(0);
+----------+
| expm1(0) |
+----------+
| 0.0      |
+----------+
> select expm1(1);
+-------------------+
| expm1(1)          |
+-------------------+
| 1.718281828459045 |
+-------------------+
```"#,
    argument(
        name = "expr",
        description = "An expression that evaluates to a numeric."
    )
)]
#[derive(Debug)]
pub struct SparkExpm1 {
    signature: Signature,
    aliases: Vec<String>,
}

impl Default for SparkExpm1 {
    fn default() -> Self {
        Self::new()
    }
}

impl SparkExpm1 {
    pub fn new() -> Self {
        Self {
            signature: Signature::user_defined(Volatility::Immutable),
            aliases: vec!["spark_expm1".to_string()],
        }
    }
}

impl ScalarUDFImpl for SparkExpm1 {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn name(&self) -> &str {
        "expm1"
    }

    fn signature(&self) -> &Signature {
        &self.signature
    }

    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
        Ok(DataType::Float64)
    }

    fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
        if args.args.len() != 1 {
            return Err(invalid_arg_count_exec_err("expm1", (1, 1), args.args.len()));
        }
        match &args.args[0] {
            ColumnarValue::Scalar(ScalarValue::Float64(value)) => Ok(
                ColumnarValue::Scalar(ScalarValue::Float64(value.map(|x| x.exp_m1()))),
            ),
            ColumnarValue::Array(array) => match array.data_type() {
                DataType::Float64 => Ok(ColumnarValue::Array(Arc::new(
                    array
                        .as_primitive::<Float64Type>()
                        .unary::<_, Float64Type>(|x| x.exp_m1()),
                ) as ArrayRef)),
                other => Err(unsupported_data_type_exec_err(
                    "expm1",
                    format!("{}", DataType::Float64).as_str(),
                    other,
                )),
            },
            other => Err(unsupported_data_type_exec_err(
                "expm1",
                format!("{}", DataType::Float64).as_str(),
                &other.data_type(),
            )),
        }
    }

    fn aliases(&self) -> &[String] {
        &self.aliases
    }

    fn coerce_types(&self, arg_types: &[DataType]) -> Result<Vec<DataType>> {
        if arg_types.len() != 1 {
            return Err(invalid_arg_count_exec_err("expm1", (1, 1), arg_types.len()));
        }
        if arg_types[0].is_numeric() {
            Ok(vec![DataType::Float64])
        } else {
            Err(unsupported_data_type_exec_err(
                "expm1",
                "Numeric Type",
                &arg_types[0],
            ))
        }
    }
}

#[cfg(test)]
mod tests {
    use crate::function::math::expm1::SparkExpm1;
    use crate::function::utils::test::test_scalar_function;
    use arrow::array::{Array, Float64Array};
    use arrow::datatypes::DataType::Float64;
    use datafusion_common::{Result, ScalarValue};
    use datafusion_expr::{ColumnarValue, ScalarUDFImpl};

    macro_rules! test_expm1_float64_invoke {
        ($INPUT:expr, $EXPECTED:expr) => {
            test_scalar_function!(
                SparkExpm1::new(),
```
Re: [PR] fix type coercion for uint/int's [datafusion]
alamb commented on code in PR #15341: URL: https://github.com/apache/datafusion/pull/15341#discussion_r2008804758

## datafusion/optimizer/tests/optimizer_integration.rs:

@@ -267,8 +267,8 @@ fn push_down_filter_groupby_expr_contains_alias() {
     let sql = "SELECT * FROM (SELECT (col_int32 + col_uint32) AS c, count(*) FROM test GROUP BY 1) where c > 3";
     let plan = test_sql(sql).unwrap();
     let expected = "Projection: test.col_int32 + test.col_uint32 AS c, count(Int64(1)) AS count(*)\
-    \n  Aggregate: groupBy=[[test.col_int32 + CAST(test.col_uint32 AS Int32)]], aggr=[[count(Int64(1))]]\
-    \n    Filter: test.col_int32 + CAST(test.col_uint32 AS Int32) > Int32(3)\
+    \n  Aggregate: groupBy=[[CAST(test.col_int32 AS Int64) + CAST(test.col_uint32 AS Int64)]], aggr=[[count(Int64(1))]]\

Review Comment:
this might be slower -- as now the larger column type is used (so it needs to do a 64 bit comparison rather than 32 bit) 🤔

Review Comment:
but it probably also doesn't lose precision
Re: [PR] fix type coercion for uint/int's [datafusion]
Omega359 commented on code in PR #15341: URL: https://github.com/apache/datafusion/pull/15341#discussion_r2008808511

## datafusion/optimizer/tests/optimizer_integration.rs:

(same hunk as in the previous comment)

Review Comment:
Correctness > performance.
Re: [I] Extended tests failing on main [datafusion]
alamb commented on issue #15359: URL: https://github.com/apache/datafusion/issues/15359#issuecomment-2745343956

Instructions to update the expected results are in https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/README.md#running-tests-sqlite
[I] [DISCUSS] In-house implementation of certain dependency package. [datafusion]
logan-keede opened a new issue, #15360: URL: https://github.com/apache/datafusion/issues/15360

### Is your feature request related to a problem or challenge?

DataFusion has many dependencies, which increase binary size and compile time while saving us the trouble of maintaining those implementations in our own codebase.

### Describe the solution you'd like

This issue is to identify any crate/dependency `shallow` and `stable` enough to justify an in-house implementation.

'shallow': We do not use much of the original crate, i.e. we just use a small part that can be implemented in-house without having to implement/port the massive dependency tree that crate has. The shallower the dependency, the more beneficial an in-house implementation would be.

'stable': The code is mostly stable and will not require much active maintenance after an in-house implementation lands. (While all code eventually requires some changes, it is about how frequently and disruptively it will give us a headache.)

### Describe alternatives you've considered

Do not have any in-house implementations. Pay the price of increased binary size and compile time.

### Additional context

_Originally suggested by @ozankabak while discussing my GSoC 2025 Proposal for "Optimizing compile time and binary size" over Discord, and before that in #14478._
Re: [PR] 1064/enhancement/add functions to Expr class [datafusion-python]
deanm commented on PR #1074: URL: https://github.com/apache/datafusion-python/pull/1074#issuecomment-2745349544

I'll mark it ready and just do a second PR for more methods.
Re: [PR] Support Avg distinct [datafusion]
qazxcdswe123 commented on code in PR #15356: URL: https://github.com/apache/datafusion/pull/15356#discussion_r2008777574

## datafusion/sqllogictest/test_files/aggregate.slt:

@@ -6686,3 +6688,35 @@ SELECT a, median(b), arrow_typeof(median(b)) FROM group_median_all_nulls GROUP B
 group0 NULL Int32
 group1 NULL Int32
+
+statement ok
+create table t_decimal (c decimal(10, 4)) as values (100.00), (125.00), (175.00), (200.00), (200.00), (300.00), (null), (null);
+
+# Test avg_distinct for Decimal128
+query RT
+select avg(distinct c), arrow_typeof(avg(distinct c)) from t_decimal;
+----
+180 Decimal128(14, 8)

Review Comment:
I'm not sure whether the types `Decimal128(14, 8)` here and `Decimal256(54, 6)` below are correct
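For what it's worth, assuming DataFusion's usual decimal `avg` return-type rule (precision and scale each grow by 4, capped at the type maximum -- an assumption worth verifying against `avg_return_type`), `Decimal128(14, 8)` is exactly what a `decimal(10, 4)` input would produce:

```rust
// Hedged sketch of the assumed avg return-type rule for Decimal128.
fn avg_decimal128_type(precision: u8, scale: i8) -> (u8, i8) {
    const DECIMAL128_MAX_PRECISION: u8 = 38;
    const DECIMAL128_MAX_SCALE: i8 = 38;
    (
        DECIMAL128_MAX_PRECISION.min(precision + 4),
        DECIMAL128_MAX_SCALE.min(scale + 4),
    )
}

fn main() {
    // decimal(10, 4) input -> Decimal128(14, 8), matching the test output above.
    assert_eq!(avg_decimal128_type(10, 4), (14, 8));
}
```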
[I] Extended tests failing on main [datafusion]
alamb opened a new issue, #15359: URL: https://github.com/apache/datafusion/issues/15359

### Describe the bug

example https://github.com/apache/datafusion/actions/runs/14005503215/job/39218865563

```
External error: query is expected to fail with error:
    (regex) DataFusion error: Error during planning: Projection references non\-aggregate values: Expression cor0\.col1 could not be resolved from available columns: cor0\.col2
but got error:
    DataFusion error: Error during planning: Column in SELECT must be in GROUP BY or an aggregate function: While expanding wildcard, column "cor0.col1" must appear in the GROUP BY clause or must be part of an aggregate function, currently only "cor0.col2" appears in the SELECT clause satisfies this requirement
[SQL] SELECT - COALESCE ( - 54, + cor0.col1 * + NULLIF ( + cor0.col1, cor0.col0 ), - cor0.col1 ) AS col0 FROM tab1 cor0 GROUP BY cor0.col2
at ../../datafusion-testing/data/sqlite/random/groupby/slt_good_7.slt:14159
```

Happened first with this PR:
- https://github.com/apache/datafusion/pull/15287

I think the expected messages just need updating

### To Reproduce

_No response_

### Expected behavior

_No response_

### Additional context

_No response_
Re: [PR] SET statements: scope modifier for multiple assignments [datafusion-sqlparser-rs]
alamb commented on PR #1772: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1772#issuecomment-2745343279

Thank you all for keeping this train (of PRs) rolling. Very impressive I must say
Re: [PR] 1064/enhancement/add functions to Expr class [datafusion-python]
timsaucer commented on PR #1074: URL: https://github.com/apache/datafusion-python/pull/1074#issuecomment-2745373049

At a high level I like some aspects of a namespace. It would especially be nice to use them to clean up our documentation in `functions`. I haven't thought through what this would mean for organizing both `functions` and `Expr`. We would at least need to mark some helper functions as deprecated for a few releases if we did this. It's a really good suggestion, but I just want to make sure we have a consistent story across the repo. I haven't looked into how we could do this in PyO3 or whether it would just be in the wrappers.
Re: [I] [DISCUSS] In-house implementation of certain dependencies. [datafusion]
ozankabak commented on issue #15360: URL: https://github.com/apache/datafusion/issues/15360#issuecomment-2745373075

If we can write a script to approximately estimate "how much" of an external crate we are using, we can create a starter list of candidates for this
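As a rough starting point (not an existing DataFusion tool), a script could approximate usage by counting the distinct `use` paths a codebase pulls from a crate. Grouped imports like `use chrono::{a, b}` are counted as one entry here, so this is only a ball-park proxy:

```rust
use std::collections::BTreeSet;
use std::fs;
use std::path::Path;

/// Recursively collect the distinct `use <krate>::...` paths in `.rs` files under `dir`.
fn visit(dir: &Path, krate: &str, items: &mut BTreeSet<String>) {
    let Ok(entries) = fs::read_dir(dir) else { return };
    for entry in entries.flatten() {
        let path = entry.path();
        if path.is_dir() {
            visit(&path, krate, items);
        } else if path.extension().is_some_and(|e| e == "rs") {
            let Ok(src) = fs::read_to_string(&path) else { continue };
            let prefix = format!("use {krate}::");
            for line in src.lines().map(str::trim) {
                if let Some(rest) = line.strip_prefix(&prefix) {
                    items.insert(rest.trim_end_matches(';').to_string());
                }
            }
        }
    }
}

fn main() {
    let mut items = BTreeSet::new();
    // Example: how many distinct chrono paths does the `datafusion` directory import?
    visit(Path::new("datafusion"), "chrono", &mut items);
    println!("distinct `use chrono::...` paths: {}", items.len());
}
```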
Re: [PR] build(deps): bump datafusion-ffi from 45.0.0 to 46.0.0 [datafusion-python]
dependabot[bot] commented on PR #1050: URL: https://github.com/apache/datafusion-python/pull/1050#issuecomment-2745404001 Looks like datafusion-ffi is up-to-date now, so this is no longer needed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] build(deps): bump datafusion from 45.0.0 to 46.0.0 [datafusion-python]
dependabot[bot] closed pull request #1051: build(deps): bump datafusion from 45.0.0 to 46.0.0 URL: https://github.com/apache/datafusion-python/pull/1051 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] build(deps): bump arrow from 54.2.0 to 54.2.1 [datafusion-python]
dependabot[bot] commented on PR #1038: URL: https://github.com/apache/datafusion-python/pull/1038#issuecomment-2745403943 Looks like arrow is up-to-date now, so this is no longer needed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] simplify `array_has` UDF to `InList` expr when haystack is constant [datafusion]
alamb commented on code in PR #15354: URL: https://github.com/apache/datafusion/pull/15354#discussion_r2008786953 ## datafusion/functions-nested/src/array_has.rs: ## @@ -121,6 +123,43 @@ impl ScalarUDFImpl for ArrayHas { Ok(DataType::Boolean) } +fn simplify( +&self, +mut args: Vec<Expr>, +_info: &dyn datafusion_expr::simplify::SimplifyInfo, +) -> Result<ExprSimplifyResult> { +let [haystack, _needle] = take_function_args(self.name(), &args)?; + +// if the haystack is a constant list, we can use an inlist expression which is more +// efficient because the haystack is not varying per-row +if let Expr::Literal(ScalarValue::List(array)) = haystack { +assert_eq!(array.len(), 1); // guarantee of ScalarValue +if let Ok(scalar_values) = +ScalarValue::convert_array_to_scalar_vec(array.as_ref()) +{ +assert_eq!(scalar_values.len(), 1); +let list = scalar_values +.into_iter() +.flatten() +.map(Expr::Literal) +.collect(); + +// ok to pop here, we will not use args again +let needle = args.pop().unwrap(); +return Ok(ExprSimplifyResult::Simplified(Expr::InList(InList { +expr: Box::new(needle), +list, +negated: false, +}))); +} +} + +// TODO: support LargeList / FixedSizeList? +// (not supported by `convert_array_to_scalar_vec`) Review Comment: I suggest doing the work in a follow-on ticket / PR. So that would mean: 1. Filing a ticket with a reproducer (perhaps modified from https://github.com/apache/datafusion/issues/14533) 2. leaving a TODO with a link to that ticket here -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
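For reference, the rewrite this simplification produces is the same one you could build by hand with the `Expr` API. A tiny sketch (the column name and literal values are made up; `col`, `lit`, and `Expr::in_list` are public `datafusion_expr` helpers):

```rust
use datafusion_expr::{col, lit, Expr};

// array_has(make_array(1, 2, 3), x) with a constant haystack is equivalent
// to the InList expression `x IN (1, 2, 3)`, which avoids re-evaluating the
// haystack for every row.
fn inlist_equivalent() -> Expr {
    col("x").in_list(vec![lit(1), lit(2), lit(3)], /* negated = */ false)
}
```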
[PR] Support Avg distinct [datafusion]
qazxcdswe123 opened a new pull request, #15356: URL: https://github.com/apache/datafusion/pull/15356 ## Which issue does this PR close? - Closes #2408 ## Rationale for this change Related: https://github.com/apache/datafusion/pull/15099 ## What changes are included in this PR? 1. implemented a separate accumulator type for `Float64`, `Decimal128`, and `Decimal256` 2. tests ## Are these changes tested? Existing tests pass and new ones were added ## Are there any user-facing changes? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
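For intuition, the core idea behind a distinct-average accumulator can be sketched in a few lines. This is an illustration only, not the PR's implementation: it covers the `f64` case by deduplicating on the bit pattern (since `f64` is not `Hash`/`Eq`), whereas the PR integrates with DataFusion's `Accumulator` trait and also covers the decimal types.

```rust
use std::collections::HashSet;

// Sketch of AVG(DISTINCT) for f64: store each distinct value once (keyed by
// its bit pattern), then average the survivors at evaluation time.
#[derive(Default)]
struct DistinctAvgF64 {
    seen: HashSet<u64>,
}

impl DistinctAvgF64 {
    fn update(&mut self, value: f64) {
        self.seen.insert(value.to_bits());
    }

    fn evaluate(&self) -> Option<f64> {
        if self.seen.is_empty() {
            return None; // AVG of an empty input is NULL
        }
        let sum: f64 = self.seen.iter().copied().map(f64::from_bits).sum();
        Some(sum / self.seen.len() as f64)
    }
}

fn main() {
    let mut acc = DistinctAvgF64::default();
    for v in [1.0, 2.0, 2.0, 3.0] {
        acc.update(v);
    }
    assert_eq!(acc.evaluate(), Some(2.0)); // distinct values: 1, 2, 3
}
```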
Re: [PR] 1064/enhancement/add functions to Expr class [datafusion-python]
deanm commented on PR #1074: URL: https://github.com/apache/datafusion-python/pull/1074#issuecomment-2745350211 @timsaucer did you have any thoughts on the namespaces? Going from this to that would wind up being a breaking change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Blog post on Parquet filter pushdown [datafusion-site]
XiangpengHao commented on PR #61: URL: https://github.com/apache/datafusion-site/pull/61#issuecomment-2745351963 Thank you @alamb @comphead @Omega359 @kevinjqliu I believe all comments have been addressed! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Blog post on Parquet filter pushdown [datafusion-site]
kevinjqliu commented on code in PR #61: URL: https://github.com/apache/datafusion-site/pull/61#discussion_r2008823867 ## content/blog/2025-03-21-parquet-pushdown.md: ## @@ -0,0 +1,312 @@ +--- +layout: post +title: Efficient Filter Pushdown in Parquet +date: 2025-03-21 +author: Xiangpeng Hao +categories: [performance] +--- + + +figure { + margin: 20px 0; +} + +figure img { + display: block; + max-width: 80%; +} + +figcaption { + font-style: italic; + margin-top: 10px; + color: #555; + font-size: 0.9em; + max-width: 80%; +} + + + + +_Editor's Note: This blog was first published on [Xiangpeng Hao's blog]. Thanks to [InfluxData] for sponsoring this work as part of his PhD funding._ + +[Xiangpeng Hao's blog]: https://blog.xiangpeng.systems/posts/parquet-pushdown/ +[InfluxData]: https://www.influxdata.com/ + + + +In the [previous post], we discussed how [Apache DataFusion] prunes [Apache Parquet] files to skip irrelevant **files/row_groups** (sometimes also [pages](https://parquet.apache.org/docs/file-format/pageindex/)). + +This post discusses how Parquet readers skip irrelevant **rows** while scanning data, +leveraging Parquet's columnar layout by first reading only filter columns, +and then selectively reading other columns only for matching rows. + + +[previous post]: https://datafusion.apache.org/blog/2025/03/20/parquet-pruning +[Apache DataFusion]: https://datafusion.apache.org/ +[Apache Parquet]: https://parquet.apache.org/ + +## Why filter pushdown in Parquet? + +Below is an example query that reads sensor data with filters on `date_time` and `location`. +Without filter pushdown, all rows from location, val, and date_time columns are decoded before `location='office'` is evaluated. Filter pushdown is especially useful when the filter is selective, i.e., removes many rows. + + +```sql +SELECT val, location +FROM sensor_data +WHERE date_time > '2025-03-11' AND location = 'office'; +``` + + + + +Parquet pruning skips irrelevant files/row_groups, while filter pushdown skips irrelevant rows. Without filter pushdown, all rows from location, val, and date_time columns are decoded before `location='office'` is evaluated. Filter pushdown is especially useful when the filter is selective, i.e., removes many rows. + + + + +In our setup, sensor data is aggregated by date — each day has its own Parquet file. +At planning time, DataFusion prunes the unneeded Parquet files, i.e., `2025-03-10.parquet` and `2025-03-11.parquet`. + +Once the files to read are located, the [*DataFusion's current default implementation*](https://github.com/apache/datafusion/issues/3463) reads all the projected columns (`sensor_id`, `val`, and `location`) into Arrow RecordBatches, then applies the filters over `location` to get the final set of rows. + +A better approach is called **filter pushdown**, which evaluates filter conditions first and only decodes data that passes these conditions. +In practice, this works by first processing only the filter columns (`date_time` and `location`), building a boolean mask of rows that satisfy our conditions, then using this mask to selectively decode only the relevant rows from other columns (`sensor_id`, `val`). +This eliminates the waste of decoding rows that will be immediately filtered out. + +While simple in theory, practical implementations often make performance worse. + +## How can filter pushdown be slower? 
+ +At a high level, the Parquet reader first builds a filter mask -- essentially a boolean array indicating which rows meet the filter criteria -- and then uses this mask to selectively decode only the needed rows from the remaining columns in the projection. + +Let's dig into details of [how filter pushdown is implemented](https://github.com/apache/arrow-rs/blob/d5339f31a60a4bd8a4256e7120fe32603249d88e/parquet/src/arrow/async_reader/mod.rs#L618-L712) in the current Rust Parquet reader implementation, illustrated in the following figure. + + + + +Implementation of filter pushdown in Rust Parquet readers -- the first phase builds the filter mask, the second phase applies the filter mask to the other columns + + + +The filter pushdown has two phases: + +1. Build the filter mask (steps 1-3) + +2. Use the filter mask to selectively decode other columns (steps 4-7), e.g., output step 3 is used as input for step 5 and 7. + +Within each phase, it takes three steps from Parquet to Arrow: + +1. Decompress the Parquet pages using generic decompression algorithms like LZ4, Zstd, etc. (steps 1, 4, 6) + +2. Decode the page content into Arrow format (steps 2, 5, 7) + +3. Evaluate the filter over Arrow data (step 3) + +In the figure above, we can see that `location` is **decompressed and decoded twice**, first when building the filter mask (steps 1, 2), and second when building the output (steps 4, 5). +This happens for all columns that appear both in the filter and output. + +The table below shows the corresponding CPU
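The two-phase flow described in the quoted post maps directly onto Arrow's public kernels. A toy, self-contained illustration using `filter_record_batch` (this is not the Parquet reader's internal code, which works on encoded pages rather than already-decoded batches):

```rust
use std::sync::Arc;

use datafusion::arrow::array::{BooleanArray, Int32Array, StringArray};
use datafusion::arrow::compute::filter_record_batch;
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::error::ArrowError;
use datafusion::arrow::record_batch::RecordBatch;

fn main() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("location", DataType::Utf8, false),
        Field::new("val", DataType::Int32, false),
    ]));
    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(StringArray::from(vec!["office", "lab", "office"])),
            Arc::new(Int32Array::from(vec![1, 2, 3])),
        ],
    )?;

    // Phase 1: evaluate the predicate on the filter column only,
    // producing a boolean mask
    let locations = batch
        .column(0)
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("location is a StringArray");
    let mask: BooleanArray = locations
        .iter()
        .map(|v| Some(v == Some("office")))
        .collect();

    // Phase 2: apply the mask so only matching rows of the other
    // columns are materialized
    let filtered = filter_record_batch(&batch, &mask)?;
    assert_eq!(filtered.num_rows(), 2);
    Ok(())
}
```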
[PR] build(deps): bump mimalloc from 0.1.43 to 0.1.44 [datafusion-python]
dependabot[bot] opened a new pull request, #1080: URL: https://github.com/apache/datafusion-python/pull/1080 Bumps [mimalloc](https://github.com/purpleprotocol/mimalloc_rust) from 0.1.43 to 0.1.44. Release notes Sourced from [mimalloc's releases](https://github.com/purpleprotocol/mimalloc_rust/releases): Version 0.1.44 Changes: Mimalloc v2.2.2 Commits - [bbf6130](https://github.com/purpleprotocol/mimalloc_rust/commit/bbf61305ad2d9d1c4b417bd277e36caec31d21b7) v0.1.44 - [3e7e321](https://github.com/purpleprotocol/mimalloc_rust/commit/3e7e3214ea3397ffc922baaa42933e5091482073) Patch windows - [f37c56a](https://github.com/purpleprotocol/mimalloc_rust/commit/f37c56ad7c3ed13904afd272e59f3ca80f7b02b8) Clippy - [c00538a](https://github.com/purpleprotocol/mimalloc_rust/commit/c00538aa0b553e7eecab40b114fedb56f414ddc7) Update to mimalloc v2.2.2 - [cc1c72a](https://github.com/purpleprotocol/mimalloc_rust/commit/cc1c72a62bf3edefe5abf842b60092f088740860) Merge pull request [#126](https://redirect.github.com/purpleprotocol/mimalloc_rust/issues/126) from richerfu/master - [e638bd6](https://github.com/purpleprotocol/mimalloc_rust/commit/e638bd6055ad4df3aa4edbfddfc1d0a09d073692) feat: use env to adapt to more general scenarios for atomic See full diff in [compare view](https://github.com/purpleprotocol/mimalloc_rust/compare/v0.1.43...v0.1.44) [Dependabot compatibility score](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: add test to check for `ctx.read_json()` [datafusion-ballista]
westhide commented on PR #1212: URL: https://github.com/apache/datafusion-ballista/pull/1212#issuecomment-2745126578 > ah no, sorry for misunderstanding please do not cancel this PR. what I meant, in case of this type of error, ballista job should be cancelled. Hello @milenkovicm, the change to cancel the job after task-definition preparation fails is ready for review. Thanks~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Migrate physical plan tests to `insta` (Part-1) [datafusion]
Shreyaskr1409 commented on PR #15313: URL: https://github.com/apache/datafusion/pull/15313#issuecomment-2745028266 @alamb @blaginin I have resolved all the conversations. Please take a look. Also, my bad for a few unnecessary commits here and there; I am still getting acquainted with the workflow. Thank you :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Ballista client keep blocking when prepare_task_definition or prepare_multi_task_definition fail [datafusion-ballista]
westhide opened a new issue, #1214: URL: https://github.com/apache/datafusion-ballista/issues/1214 **Describe the bug** The Ballista client keeps blocking when `prepare_task_definition` or `prepare_multi_task_definition` fails. **To Reproduce** (no steps provided) **Expected behavior** The job should be cancelled and the error surfaced to the client instead of blocking indefinitely. **Additional context** (none) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [DISCUSS] Switch to `tree` explain by default [datafusion]
alamb commented on issue #15343: URL: https://github.com/apache/datafusion/issues/15343#issuecomment-2745139051 I wonder if we could just change the default format for `datafusion-cli` (it is a config setting) 🤔 Downstream projects could also then "opt-in" if they wanted nicer default explain plans -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Perf: Support Utf8View datatype single column comparisons for SortPre… [datafusion]
zhuqi-lucas commented on PR #15348: URL: https://github.com/apache/datafusion/pull/15348#issuecomment-2745186535 Updated the results for short-string sort, which benefits a lot from the StringView type; added Q11 to the sort benchmark:

```diff
-const SORT_QUERIES: [&'static str; 10] = [
+const SORT_QUERIES: [&'static str; 11] = [
     // Q1: 1 sort key (type: INTEGER, cardinality: 7) + 1 payload column
     r#"
     SELECT l_linenumber, l_partkey
@@ -159,6 +159,12 @@ impl RunOpt {
     FROM lineitem
     ORDER BY l_orderkey, l_suppkey, l_linenumber, l_comment
     "#,
+    // Q11: 1 sort key (type: VARCHAR, cardinality: 4.5M) + 1 payload column
+    r#"
+    SELECT l_shipmode, l_comment, l_partkey
+    FROM lineitem
+    ORDER BY l_shipmode;
+    "#,
 ];
```

This PR:

```
Q11 iteration 0 took 5645.3 ms and returned 59986052 rows
Q11 iteration 1 took 5641.1 ms and returned 59986052 rows
Q11 iteration 2 took 5520.6 ms and returned 59986052 rows
Q11 avg time: 5602.33 ms
```

The main:

```
Q11 iteration 0 took 6687.5 ms and returned 59986052 rows
Q11 iteration 1 took 6504.5 ms and returned 59986052 rows
Q11 iteration 2 took 6544.6 ms and returned 59986052 rows
Q11 avg time: 6578.87 ms
```

About 20% performance improvement. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Improved error for expand wildcard rule [datafusion]
ozankabak commented on PR #15287: URL: https://github.com/apache/datafusion/pull/15287#issuecomment-2745187508 This PR seems to have broken `main`. @alamb -- Extended tests break very frequently these days. We should prioritize completing the work on running them before merge. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Comet 0.7.0 [datafusion-site]
alamb commented on code in PR #63: URL: https://github.com/apache/datafusion-site/pull/63#discussion_r2008719590 ## content/blog/2025-03-20-datafusion-comet-0.7.0.md: ## @@ -0,0 +1,134 @@ +--- +layout: post +title: Apache DataFusion Comet 0.7.0 Release +date: 2025-03-20 +author: pmc +categories: [subprojects] +--- + + + +The Apache DataFusion PMC is pleased to announce version 0.7.0 of the [Comet](https://datafusion.apache.org/comet/) subproject. + +Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for +improved performance and efficiency without requiring any code changes. + +Comet runs on commodity hardware and aims to provide 100% compatibility with Apache Spark. Any operators or +expressions that are not fully compatible will fall back to Spark unless explicitly enabled by the user. Refer +to the [compatibility guide] for more information. + +[compatibility guide]: https://datafusion.apache.org/comet/user-guide/compatibility.html + +This release covers approximately four weeks of development work and is the result of merging 46 PRs from 11 +contributors. See the [change log] for more information. + +[change log]: https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.7.0.md + +## Release Highlights + +### Performance + +Comet 0.7.0 has improved performance compared to the previous release due to improvements in the native shuffle +implementation and performance improvements in DataFusion 46. + +For single-node TPC-H at 100 GB, Comet now delivers a **greater than 2x speedup** compared to Spark using the same +CPU and RAM. Even with **half the resources**, Comet still provides a measurable performance improvement.
+ + + +*These benchmarks were performed on a Linux workstation with PCIe 5, AMD 7950X CPU (16 cores), 128 GB RAM, and data +stored locally in Parquet format on NVMe storage. Spark was running in Kubernetes with hard memory limits.* + +## Shuffle Improvements + +There are several improvements to shuffle in this release: + +- When running in off-heap mode (which is the recommended approach), Comet was using the wrong memory allocator + implementation for some types of shuffle operation, which could result in OOM rather than spilling to disk. +- The number of spill files is drastically reduced. In previous releases, each instance of ShuffleMapTask could + potentially create a new spill file for each output partition each time that spill was invoked. Comet now creates + a maximum of one spill file per output partition per instance of ShuffleMapTask, which is appended to in subsequent + spills. +- There was a flaw in the memory accounting which resulted in Comet requesting approximately twice the amount of + memory that was needed, resulting in premature spilling. This is now resolved. +- The metric for number of spilled bytes is now accurate. It was previously reporting invalid information. + +## Improved Hash Join Performance + +When using the `spark.comet.exec.replaceSortMergeJoin` setting to replace sort-merge joins with hash joins, Comet +will now do a better job of picking the optimal build side. Thanks to [@hayman
Re: [PR] Comet 0.7.0 [datafusion-site]
alamb commented on PR #63: URL: https://github.com/apache/datafusion-site/pull/63#issuecomment-2745149502 Released site: https://datafusion.apache.org/blog/2025/03/20/datafusion-comet-0.7.0/ 👍 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] DeltaLake integration not working (Python) (FFI Table providers not working) [datafusion-python]
timsaucer commented on issue #1077: URL: https://github.com/apache/datafusion-python/issues/1077#issuecomment-2745248393 Which version of datafusion-python are you using? If you have 45.2.0 can you please try downgrading to 44.0.0? If that solves it, I know what the problem is. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Improve collection during repr and repr_html [datafusion-python]
timsaucer commented on code in PR #1036: URL: https://github.com/apache/datafusion-python/pull/1036#discussion_r2008766096 ## src/dataframe.rs: ## @@ -111,56 +116,151 @@ impl PyDataFrame { } fn __repr__(&self, py: Python) -> PyDataFusionResult { -let df = self.df.as_ref().clone().limit(0, Some(10))?; -let batches = wait_for_future(py, df.collect())?; -let batches_as_string = pretty::pretty_format_batches(&batches); -match batches_as_string { -Ok(batch) => Ok(format!("DataFrame()\n{batch}")), -Err(err) => Ok(format!("Error: {:?}", err.to_string())), +let (batches, has_more) = wait_for_future( +py, +collect_record_batches_to_display(self.df.as_ref().clone(), 10, 10), +)?; +if batches.is_empty() { +// This should not be reached, but do it for safety since we index into the vector below +return Ok("No data to display".to_string()); } -} -fn _repr_html_(&self, py: Python) -> PyDataFusionResult { -let mut html_str = "\n".to_string(); +let batches_as_displ = + pretty::pretty_format_batches(&batches).map_err(py_datafusion_err)?; + +let additional_str = match has_more { +true => "\nData truncated.", +false => "", +}; -let df = self.df.as_ref().clone().limit(0, Some(10))?; -let batches = wait_for_future(py, df.collect())?; +Ok(format!("DataFrame()\n{batches_as_displ}{additional_str}")) +} +fn _repr_html_(&self, py: Python) -> PyDataFusionResult { +let (batches, has_more) = wait_for_future( +py, +collect_record_batches_to_display( +self.df.as_ref().clone(), +MIN_TABLE_ROWS_TO_DISPLAY, +usize::MAX, +), +)?; Review Comment: Added to follow on issue https://github.com/apache/datafusion-python/issues/1078 ## src/dataframe.rs: ## @@ -70,6 +72,9 @@ impl PyTableProvider { PyTable::new(table_provider) } } +const MAX_TABLE_BYTES_TO_DISPLAY: usize = 2 * 1024 * 1024; // 2 MB Review Comment: Added to issue https://github.com/apache/datafusion-python/issues/1078 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Improve collection during repr and repr_html [datafusion-python]
timsaucer commented on code in PR #1036: URL: https://github.com/apache/datafusion-python/pull/1036#discussion_r2008764218 ## src/dataframe.rs: ## @@ -771,3 +871,82 @@ fn record_batch_into_schema( RecordBatch::try_new(schema, data_arrays) } + +/// This is a helper function to return the first non-empty record batch from executing a DataFrame. +/// It additionally returns a bool, which indicates if there are more record batches available. +/// We do this so we can determine if we should indicate to the user that the data has been +/// truncated. This collects until we have achieved both of these two conditions +/// +/// - We have collected our minimum number of rows +/// - We have reached our limit, either data size or maximum number of rows +/// +/// Otherwise it will return when the stream is exhausted. If you want a specific number of +/// rows, set min_rows == max_rows. +async fn collect_record_batches_to_display( +df: DataFrame, +min_rows: usize, +max_rows: usize, +) -> Result<(Vec<RecordBatch>, bool), DataFusionError> { +let mut stream = df.execute_stream().await?; Review Comment: I'll switch to your approach -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
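For readers following along, a simplified sketch of this collection strategy is below. It assumes the DataFusion `DataFrame` stream API; error handling is minimal, the byte-size limit from the real function is omitted, and the final "peek" discards one batch, which is acceptable here because it is only used to set the truncation flag.

```rust
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::dataframe::DataFrame;
use datafusion::error::DataFusionError;
use futures::StreamExt;

// Collect batches until at least `min_rows` are buffered (or `max_rows` is
// exceeded); the returned bool reports whether more data remained, so the
// caller can render a "Data truncated" note.
async fn collect_for_display(
    df: DataFrame,
    min_rows: usize,
    max_rows: usize,
) -> Result<(Vec<RecordBatch>, bool), DataFusionError> {
    let mut stream = df.execute_stream().await?;
    let mut batches = Vec::new();
    let mut rows = 0;
    while rows < min_rows {
        match stream.next().await {
            Some(batch) => {
                let batch = batch?;
                rows += batch.num_rows();
                batches.push(batch);
            }
            // Stream exhausted before the minimum: nothing was truncated
            None => return Ok((batches, false)),
        }
        if rows >= max_rows {
            // Hit the hard cap; treat anything beyond it as truncated
            return Ok((batches, true));
        }
    }
    // Peek once more to learn whether the output was truncated
    let has_more = stream.next().await.is_some();
    Ok((batches, has_more))
}
```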
[I] Improve html table rendering formatting [datafusion-python]
timsaucer opened a new issue, #1078: URL: https://github.com/apache/datafusion-python/issues/1078 **Is your feature request related to a problem or challenge? Please describe what you are trying to do.** This is a follow-on to https://github.com/apache/datafusion-python/pull/1036 There is a suggestion that we should enable a feature to customize the html table rendering. **Describe the solution you'd like** Allow the user to set the amount of data returned. Right now it's hard coded to 2 MB. The actual data size is approximated. We would like this to be customizable. Allow the user to disable the styling that is applied to the html output. Optional: Add a way for a user to specify a formatting function to replace the default `ArrayFormatter`. This would enable users to change the data view when they have custom metadata Nice to have: Split up some of the html generation into a set of helper functions. **Describe alternatives you've considered** None **Additional context** If this code is getting too large, it may make sense to split it off into a separate file. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] fix: Redundant files spilled during external sort + introduce `SpillManager` [datafusion]
2010YOUY01 opened a new pull request, #15355: URL: https://github.com/apache/datafusion/pull/15355 ## Which issue does this PR close? Related to https://github.com/apache/datafusion/pull/14975 ## Rationale for this change ### What's the inefficiency Let's walk through an example: there is one external sort query with 1 partition, and the sort exec has: - 10MB memory limit - 1MB `sort_spill_reservation_bytes` (see configuration explanation in https://datafusion.apache.org/user-guide/configs.html) During the execution: 1. `SortExec` will read in batches, and the 10MB memory limit is reached 2. It will sort all buffered batches in place, and merge them at once. Note that only the 1MB buffer from the 10MB total limit is pre-reserved for merging, so there is only 1MB available to store the merged output. 3. After we have collected 1MB of merged batches, one spill will be triggered; this 1MB space will be cleared, and the merging can continue. **Inefficiency:** Now `ExternalSorter` will create a new spill file for each 1MB of merged batches; after spilling all intermediates, all spilled files will be merged at once, so there are too many files to merge. **Ideal case:** All batches in a single sorted run can be incrementally appended to a single file. Reproducer: Execute datafusion-cli with `cargo run --profile release-nonlto -- --mem-pool-type fair -m 10M` ```sql set datafusion.execution.sort_spill_reservation_bytes = 100; set datafusion.execution.target_partitions = 4; explain analyze select * from generate_series(1, 100) as t1(v1) order by v1; ``` Main: 10 spills PR: 2 spills ### Rationale for the fix Introduced a new spill interface `SpillManager` with the ability to incrementally append batches to a written file. 1. `SpillManager` is designed to do `RecordBatch <---> raw file`, and configurations can be put inside `SpillManager` to control how the serialization is done physically, for future optimizations. Example configurations: - General-purpose compression like `lz4` - Specialized encoding other than the current `Arrow IPC`, or configurations to change the `IPC Writer` behavior - See `datafusion-comet`'s proprietary serde implementation in https://github.com/apache/datafusion-comet/pull/1190 - One example of an extra configuration can be https://github.com/apache/datafusion/issues/15320 2. `SpillManager` is not responsible for holding spilled files inside, because the logical representation of those files can vary; it's clearer to place those raw files inside the spilling operators. For example, `vec` is managed inside `SortExec`, where the implicit rule is that within each file all entries are sorted by the sort keys; also, in `Comet`'s `ShuffleWriterExec`, each partition should maintain one in-progress file. If we kept those tempfiles inside `SpillManager`, it would be hard to clearly define those implicit requirements. 3. Additionally, `SpillManager` is responsible for updating related statistics; the spill-related metrics should be the same across operators, so this part of the code can also be reused. Also, a total disk usage limit for spilled files can be easily implemented upon it. ### Why refactor and introduce `SpillManager` This fix can be implemented without a major refactor. However, this change is included to prepare for supporting disk limits for spilling queries, as described in https://github.com/apache/datafusion/pull/14975 ## What changes are included in this PR? 1. Group spilling-related metrics into one struct 2. Introduce `SpillManager` 3.
Update `SortExec` to use the new `SpillManager` interface ### TODO: - [ ] There are two extra operators that can be changed to this new interface (`Aggregate` and `SortMergeJoin`); they're planned to be included in this PR, and I plan to do that after getting some review feedback. ## Are these changes tested? For the too-many-spills issue: one test case is updated, and more comments were added above the assertion to prevent regression. For `SpillManager`: unit tests are included. ## Are there any user-facing changes? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
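To make the incremental-append idea concrete, here is a minimal, self-contained sketch using the plain Arrow IPC `FileWriter` directly; the actual `SpillManager` wraps this pattern with metrics and `DiskManager` integration, so the names and error handling here are illustrative only.

```rust
use std::{fs::File, sync::Arc};

use datafusion::arrow::array::Int32Array;
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::error::ArrowError;
use datafusion::arrow::ipc::writer::FileWriter;
use datafusion::arrow::record_batch::RecordBatch;

// Keep one IPC writer open per sorted run and append each merged batch to
// it, so a run produces a single spill file instead of one file per spill.
fn spill_sorted_run(batches: &[RecordBatch], path: &str) -> Result<(), ArrowError> {
    let schema = batches[0].schema();
    let mut writer = FileWriter::try_new(File::create(path)?, schema.as_ref())?;
    for batch in batches {
        // All batches belong to the same sorted run, so appending them in
        // order keeps the file globally ordered on the sort keys
        writer.write(batch)?;
    }
    writer.finish()
}

fn main() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int32, false)]));
    let batch =
        RecordBatch::try_new(schema, vec![Arc::new(Int32Array::from(vec![1, 2, 3]))])?;
    spill_sorted_run(&[batch], "/tmp/run0.spill")
}
```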
Re: [PR] Improve collection during repr and repr_html [datafusion-python]
timsaucer commented on code in PR #1036: URL: https://github.com/apache/datafusion-python/pull/1036#discussion_r2008764621 ## src/dataframe.rs: ## @@ -70,6 +72,9 @@ impl PyTableProvider { PyTable::new(table_provider) } } +const MAX_TABLE_BYTES_TO_DISPLAY: usize = 2 * 1024 * 1024; // 2 MB Review Comment: How about I open an issue to enhance this to be configurable as well as the follow on part about disabling the styling? I'd like to get this in so we fix explain and add some useful functionality now and then we can get these things tightened up in the next iteration. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Improve collection during repr and repr_html [datafusion-python]
timsaucer commented on code in PR #1036: URL: https://github.com/apache/datafusion-python/pull/1036#discussion_r2008766157 ## src/dataframe.rs: ## @@ -111,56 +116,151 @@ impl PyDataFrame { } fn __repr__(&self, py: Python) -> PyDataFusionResult { -let df = self.df.as_ref().clone().limit(0, Some(10))?; -let batches = wait_for_future(py, df.collect())?; -let batches_as_string = pretty::pretty_format_batches(&batches); -match batches_as_string { -Ok(batch) => Ok(format!("DataFrame()\n{batch}")), -Err(err) => Ok(format!("Error: {:?}", err.to_string())), +let (batches, has_more) = wait_for_future( +py, +collect_record_batches_to_display(self.df.as_ref().clone(), 10, 10), +)?; +if batches.is_empty() { +// This should not be reached, but do it for safety since we index into the vector below +return Ok("No data to display".to_string()); } -} -fn _repr_html_(&self, py: Python) -> PyDataFusionResult { -let mut html_str = "\n".to_string(); +let batches_as_displ = + pretty::pretty_format_batches(&batches).map_err(py_datafusion_err)?; + +let additional_str = match has_more { +true => "\nData truncated.", +false => "", +}; -let df = self.df.as_ref().clone().limit(0, Some(10))?; -let batches = wait_for_future(py, df.collect())?; +Ok(format!("DataFrame()\n{batches_as_displ}{additional_str}")) +} +fn _repr_html_(&self, py: Python) -> PyDataFusionResult { +let (batches, has_more) = wait_for_future( +py, +collect_record_batches_to_display( +self.df.as_ref().clone(), +MIN_TABLE_ROWS_TO_DISPLAY, +usize::MAX, +), +)?; if batches.is_empty() { -html_str.push_str("\n"); -return Ok(html_str); +// This should not be reached, but do it for safety since we index into the vector below +return Ok("No data to display".to_string()); } +let table_uuid = uuid::Uuid::new_v4().to_string(); + +let mut html_str = " + +.expandable-container { +display: inline-block; +max-width: 200px; +} +.expandable { +white-space: nowrap; +overflow: hidden; +text-overflow: ellipsis; +display: block; +} +.full-text { +display: none; +white-space: normal; +} +.expand-btn { +cursor: pointer; +color: blue; +text-decoration: underline; +border: none; +background: none; +font-size: inherit; +display: block; +margin-top: 5px; +} + + + Review Comment: Added to issue https://github.com/apache/datafusion-python/issues/1078 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Enforce JOIN plan to require condition [datafusion]
goldmedal commented on code in PR #15334: URL: https://github.com/apache/datafusion/pull/15334#discussion_r2008714660 ## datafusion/sqllogictest/test_files/tpch/plans/q22.slt.part: ## @@ -65,7 +65,7 @@ logical_plan 07)Projection: customer.c_phone, customer.c_acctbal 08)--LeftAnti Join: customer.c_custkey = __correlated_sq_1.o_custkey 09)Filter: substr(customer.c_phone, Int64(1), Int64(2)) IN ([Utf8View("13"), Utf8View("31"), Utf8View("23"), Utf8View("29"), Utf8View("30"), Utf8View("18"), Utf8View("17")]) -10)--TableScan: customer projection=[c_custkey, c_phone, c_acctbal], partial_filters=[substr(customer.c_phone, Int64(1), Int64(2)) IN ([Utf8View("13"), Utf8View("31"), Utf8View("23"), Utf8View("29"), Utf8View("30"), Utf8View("18"), Utf8View("17")])] +10)--TableScan: customer projection=[c_custkey, c_phone, c_acctbal], partial_filters=[substr(customer.c_phone, Int64(1), Int64(2)) IN ([Utf8View("13"), Utf8View("31"), Utf8View("23"), Utf8View("29"), Utf8View("30"), Utf8View("18"), Utf8View("17")]), Boolean(true)] Review Comment: I benchmarked to ensure the plan change won't impact the performance. ``` Benchmark tpch_sf1.json ┏━━┳━┳┳━━┓ ┃ Query┃main ┃ fix_join-check ┃ Change ┃ ┡━━╇━╇╇━━┩ │ QQuery 1 │ 68.97ms │70.14ms │no change │ │ QQuery 2 │ 18.72ms │18.55ms │no change │ │ QQuery 3 │ 30.47ms │29.46ms │no change │ │ QQuery 4 │ 21.45ms │21.77ms │no change │ │ QQuery 5 │ 43.53ms │41.56ms │no change │ │ QQuery 6 │ 14.39ms │14.88ms │no change │ │ QQuery 7 │ 55.03ms │55.30ms │no change │ │ QQuery 8 │ 41.24ms │39.45ms │no change │ │ QQuery 9 │ 50.84ms │53.19ms │no change │ │ QQuery 10│ 44.85ms │44.83ms │no change │ │ QQuery 11│ 13.42ms │13.22ms │no change │ │ QQuery 12│ 28.49ms │28.20ms │no change │ │ QQuery 13│ 29.56ms │29.38ms │no change │ │ QQuery 14│ 23.03ms │24.40ms │ 1.06x slower │ │ QQuery 15│ 33.59ms │34.98ms │no change │ │ QQuery 16│ 13.07ms │13.08ms │no change │ │ QQuery 17│ 58.00ms │58.67ms │no change │ │ QQuery 18│ 76.00ms │75.06ms │no change │ │ QQuery 19│ 39.99ms │38.40ms │no change │ │ QQuery 20│ 29.04ms │29.79ms │no change │ │ QQuery 21│ 62.14ms │64.14ms │no change │ │ QQuery 22│ 13.82ms │14.01ms │no change │ └──┴─┴┴──┘ ┏━━━┳━━┓ ┃ Benchmark Summary ┃ ┃ ┡━━━╇━━┩ │ Total Time (main) │ 809.65ms │ │ Total Time (fix_join-check) │ 812.45ms │ │ Average Time (main) │ 36.80ms │ │ Average Time (fix_join-check) │ 36.93ms │ │ Queries Faster│0 │ │ Queries Slower│1 │ │ Queries with No Change│ 21 │ └───┴──┘ ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [DISCUSS] Switch to `tree` explain by default [datafusion]
ozankabak commented on issue #15343: URL: https://github.com/apache/datafusion/issues/15343#issuecomment-2745139974 > I wonder if we could just change the default format for `datafusion-cli` (it is a config setting) 🤔 Downstream projects could also then "opt-in" if they wanted nicer default explain plans Makes sense to change the default format for `datafusion-cli` 👍 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
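If the explain format is indeed exposed as a session config, the opt-in could be as small as one `SET` statement. A sketch, assuming the key is named `datafusion.explain.format` (the exact key name and accepted values should be checked against the config docs):

```rust
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

// Opt a session into tree-formatted EXPLAIN output, assuming the
// hypothesized `datafusion.explain.format` config key exists.
async fn opt_in_tree_explain(ctx: &SessionContext) -> Result<()> {
    ctx.sql("SET datafusion.explain.format = 'tree'")
        .await?
        .collect()
        .await?;
    Ok(())
}
```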
Re: [I] `avg(distinct)` support [datafusion]
qazxcdswe123 commented on issue #2408: URL: https://github.com/apache/datafusion/issues/2408#issuecomment-2745155964 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Added tests with are writing into parquet files in memory for issue #… [datafusion]
alamb commented on code in PR #15325: URL: https://github.com/apache/datafusion/pull/15325#discussion_r2008783615 ## datafusion/wasmtest/src/lib.rs: ## @@ -182,4 +182,29 @@ mod test { let task_ctx = ctx.task_ctx(); let _ = collect(physical_plan, task_ctx).await.unwrap(); } + +#[wasm_bindgen_test(unsupported = tokio::test)] +async fn test_parquet_write() { +let schema = Arc::new(Schema::new(vec![ +Field::new("id", DataType::Int32, false), +Field::new("value", DataType::Utf8, false), +])); + +let data: Vec = vec![ +Arc::new(Int32Array::from(vec![1])), +Arc::new(StringArray::from(vec!["a"])), +]; + +let batch = RecordBatch::try_new(schema.clone(), data).unwrap(); +let mut buffer = Vec::new(); +let mut writer = datafusion::parquet::arrow::ArrowWriter::try_new( +&mut buffer, +schema.clone(), +None, +) +.unwrap(); + +writer.write(&batch).unwrap(); Review Comment: I filed a ticket to track this work: - https://github.com/apache/datafusion/issues/15357 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Migrate physical plan tests to `insta` (Part-1) [datafusion]
alamb commented on PR #15313: URL: https://github.com/apache/datafusion/pull/15313#issuecomment-2745294370 > @alamb @blaginin I have resolved all the conversations. Please take a look. Also, my bad for a few unnecessary commits here and there; I am still getting acquainted with the workflow. Thank you :) No worries -- thanks @Shreyaskr1409. Please don't worry about extra commits -- they are all squashed when merged to main. Multiple commits make the history of a PR much easier to see -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Added tests with are writing into parquet files in memory for issue #… [datafusion]
alamb commented on PR #15325: URL: https://github.com/apache/datafusion/pull/15325#issuecomment-2745294019 Thanks again @pranavJibhakate -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Improve collection during repr and repr_html [datafusion-python]
timsaucer merged PR #1036: URL: https://github.com/apache/datafusion-python/pull/1036 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Added tests with are writing into parquet files in memory for issue #… [datafusion]
alamb merged PR #15325: URL: https://github.com/apache/datafusion/pull/15325 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Migrate physical plan tests to `insta` (Part-1) [datafusion]
alamb commented on PR #15313: URL: https://github.com/apache/datafusion/pull/15313#issuecomment-2745294680 I took a look at the most recent commits and they look good to me. Thanks again @Shreyaskr1409 ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Migrate physical plan tests to `insta` (Part-1) [datafusion]
alamb merged PR #15313: URL: https://github.com/apache/datafusion/pull/15313 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Perf: Support Utf8View datatype single column comparisons for SortPre… [datafusion]
alamb commented on code in PR #15348: URL: https://github.com/apache/datafusion/pull/15348#discussion_r2008785345 ## datafusion/physical-plan/src/sorts/cursor.rs: ## @@ -281,6 +281,33 @@ impl CursorArray for GenericByteArray { } } +impl CursorArray for StringViewArray { +type Values = StringViewArray; +fn values(&self) -> Self { +self.clone() +} +} + +impl CursorValues for StringViewArray { +fn len(&self) -> usize { +self.views().len() +} + +fn eq(l: &Self, l_idx: usize, r: &Self, r_idx: usize) -> bool { +unsafe { GenericByteViewArray::compare_unchecked(l, l_idx, r, r_idx).is_eq() } Review Comment: I agree it would be good to justify the use of unchecked (which I think is ok here). The docs say https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html#method.compare_unchecked So maybe the safety argument is mostly "The left/right_idx must be within range of each array". It also seems like we need to be comparing the null masks too 🤔 -- like checking if the values are null before comparing. Given that this comparison is typically *the* hottest part of a merge operation, maybe we should try using unchecked comparisons elsewhere -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
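For illustration, the null handling being discussed could look roughly like the sketch below; this is not the PR's code, and it assumes nulls sort first. The `unsafe` block is justified by the explicit bounds check, matching the documented contract that both indices must be in range.

```rust
use std::cmp::Ordering;

use datafusion::arrow::array::{Array, StringViewArray};

// Null-aware wrapper over the unchecked byte-view comparison: nulls sort
// first, and both indices are bounds-checked before the unsafe call.
fn cmp_nulls_first(
    l: &StringViewArray,
    l_idx: usize,
    r: &StringViewArray,
    r_idx: usize,
) -> Ordering {
    assert!(l_idx < l.len() && r_idx < r.len(), "indices must be in bounds");
    match (l.is_null(l_idx), r.is_null(r_idx)) {
        (true, true) => Ordering::Equal,
        (true, false) => Ordering::Less,
        (false, true) => Ordering::Greater,
        // SAFETY: both indices were bounds-checked above
        (false, false) => unsafe {
            StringViewArray::compare_unchecked(l, l_idx, r, r_idx)
        },
    }
}

fn main() {
    let arr: StringViewArray = vec![Some("apple"), None].into_iter().collect();
    assert_eq!(cmp_nulls_first(&arr, 1, &arr, 0), Ordering::Less); // null < "apple"
}
```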
[I] limit max disk usage for spilling queries [datafusion]
alamb opened a new issue, #15358: URL: https://github.com/apache/datafusion/issues/15358 ### Is your feature request related to a problem or challenge? Breaking rationale from https://github.com/apache/datafusion/pull/14975#issue-2890626662 into its own ticket: For memory-limited queries, executors might write temporary results to disk to reduce memory load. It's important to have a configuration option to limit the max disk usage, in case some query would bloat the disk and cause other issues. DuckDB provides a similar configuration: ``` max_temp_directory_size | The maximum amount of data stored inside the 'temp_directory' (when set) (e.g., 1GB) ``` ### Describe the solution you'd like Provide a way to limit the total disk usage of queries that spill to disk ### Describe alternatives you've considered Add a configuration option max_temp_directory_size to the disk manager (defaulting to 100GB); if the limit is reached across all spilled files, an error will be returned. ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
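One way the check could be layered on, assuming spill writers report their bytes to a shared counter; the type and method names below are illustrative, not the eventual DataFusion API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Shared accounting for all spill files: writers reserve bytes as they
// flush, and an error surfaces once the configured limit would be exceeded.
struct TempDirBudget {
    max_temp_directory_size: usize,
    used: AtomicUsize,
}

impl TempDirBudget {
    fn try_reserve(&self, bytes: usize) -> Result<(), String> {
        let prev = self.used.fetch_add(bytes, Ordering::SeqCst);
        if prev + bytes > self.max_temp_directory_size {
            // Roll back the reservation and report the failure
            self.used.fetch_sub(bytes, Ordering::SeqCst);
            return Err(format!(
                "temp directory limit of {} bytes exceeded",
                self.max_temp_directory_size
            ));
        }
        Ok(())
    }
}

fn main() {
    let budget = TempDirBudget {
        max_temp_directory_size: 100 * 1024 * 1024 * 1024, // e.g. a 100GB default
        used: AtomicUsize::new(0),
    };
    assert!(budget.try_reserve(1024).is_ok());
}
```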
Re: [I] Add test coverage for wasm32 + parquet build [datafusion]
alamb closed issue #15158: Add test coverage for wasm32 + parquet build URL: https://github.com/apache/datafusion/issues/15158 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Redundant files spilled during external sort + introduce `SpillManager` [datafusion]
alamb commented on PR #15355: URL: https://github.com/apache/datafusion/pull/15355#issuecomment-2745303386 - I filed https://github.com/apache/datafusion/issues/15358 to track the feature request, and linked this ticket -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Make ClickBench Q23 Go Faster [datafusion]
alamb commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740983293 BTW combined with @adriangb's PR here - https://github.com/apache/datafusion/pull/15301 It will likely go crazy fast 🚀 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Redundant files spilled during external sort + introduce `SpillManager` [datafusion]
alamb commented on code in PR #15355: URL: https://github.com/apache/datafusion/pull/15355#discussion_r2008788363

## datafusion/physical-plan/src/sorts/sort.rs: ##

```diff
@@ -230,9 +219,14 @@ struct ExternalSorter {
     /// if `Self::in_mem_batches` are sorted
     in_mem_batches_sorted: bool,
 
-    /// If data has previously been spilled, the locations of the
-    /// spill files (in Arrow IPC format)
-    spills: Vec<RefCountedTempFile>,
+    /// During external sorting, in-memory intermediate data will be appended to
+    /// this file incrementally. Once finished, this file will be moved to [`Self::finished_spill_files`].
+    in_progress_spill_file: Option<InProgressSpillFile>,
+    /// If data has previously been spilled, the locations of the spill files (in
+    /// Arrow IPC format)
+    /// Within the same spill file, the data might be chunked into multiple batches,
+    /// and ordered by sort keys.
+    finished_spill_files: Vec<RefCountedTempFile>,
```

Review Comment: It might make more sense to have the `SpillManager` own these files so there can't be different sets of references

## datafusion/physical-plan/src/sorts/sort.rs: ##

```diff
@@ -65,23 +63,14 @@ struct ExternalSorterMetrics {
     /// metrics
     baseline: BaselineMetrics,
 
-    /// count of spills during the execution of the operator
```

Review Comment: Nice

## datafusion/physical-plan/src/spill.rs: ##

```diff
@@ -223,25 +229,182 @@ impl IPCStreamWriter {
     }
 }
 
+/// The `SpillManager` is responsible for the following tasks:
```

Review Comment: Love the spill manager 👍

## datafusion/physical-plan/src/sorts/sort.rs: ##

```diff
@@ -379,46 +382,64 @@ impl ExternalSorter {
     /// How many bytes have been spilled to disk?
     fn spilled_bytes(&self) -> usize {
-        self.metrics.spilled_bytes.value()
+        self.metrics.spill_metrics.spilled_bytes.value()
     }
 
     /// How many rows have been spilled to disk?
     fn spilled_rows(&self) -> usize {
-        self.metrics.spilled_rows.value()
+        self.metrics.spill_metrics.spilled_rows.value()
     }
 
     /// How many spill files have been created?
     fn spill_count(&self) -> usize {
-        self.metrics.spill_count.value()
+        self.metrics.spill_metrics.spill_file_count.value()
     }
 
-    /// Writes any `in_memory_batches` to a spill file and clears
-    /// the batches. The contents of the spill file are sorted.
-    ///
-    /// Returns the amount of memory freed.
-    async fn spill(&mut self) -> Result<usize> {
+    /// When calling, all `in_mem_batches` must be sorted, and then all of them will
+    /// be appended to the in-progress spill file.
+    async fn spill_append(&mut self) -> Result<()> {
         // we could always get a chance to free some memory as long as we are holding some
         if self.in_mem_batches.is_empty() {
-            return Ok(0);
+            return Ok(());
+        }
+
+        // Lazily initialize the in-progress spill file
+        if self.in_progress_spill_file.is_none() {
+            self.in_progress_spill_file =
+                Some(self.spill_manager.create_in_progress_file("Sorting")?);
         }
 
         self.organize_stringview_arrays()?;
 
         debug!("Spilling sort data of ExternalSorter to disk whilst inserting");
 
-        let spill_file = self.runtime.disk_manager.create_tmp_file("Sorting")?;
         let batches = std::mem::take(&mut self.in_mem_batches);
-        let (spilled_rows, spilled_bytes) = spill_record_batches(
-            &batches,
-            spill_file.path().into(),
-            Arc::clone(&self.schema),
-        )?;
-        let used = self.reservation.free();
-        self.metrics.spill_count.add(1);
-        self.metrics.spilled_bytes.add(spilled_bytes);
-        self.metrics.spilled_rows.add(spilled_rows);
-        self.spills.push(spill_file);
-        Ok(used)
+        self.reservation.free();
+
+        let in_progress_file = self.in_progress_spill_file.as_mut().ok_or_else(|| {
+            internal_datafusion_err!("In-progress spill file should be initialized")
+        })?;
+
+        for batch in batches {
```

Review Comment: I don't understand this logic -- I thought that each individual `self.in_mem_batches` was sorted but they aren't sorted overall. Thus if we write them back to back to the same spill file, the spill file itself won't be sorted.

Like if the two in memory batches are

| A | B |
|---|---|
| 1 | 10 |
| 2 | 10 |
| 2 | 10 |

| A | B |
|---|---|
| 1 | 10 |
| 2 | 10 |
| 2 | 10 |

I think this code would produce a single spill file like

| A | B |
|---|---|
| 1 | 10 |
| 2 | 10 |
| 2 | 10 |
| 1 | 10 |
| 2 | 10 |
| 2 | 10 |

Which is not sorted 🤔

On the other hand all the tests are passing so maybe I misunderstand what this is doing (or we have a testing gap)

## datafusion/physical-plan/src/sorts/sort.rs: ##
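To make the sortedness concern in the review above concrete: appending individually sorted runs back to back does not yield a globally sorted sequence, while a 2-way merge of the same runs does. A minimal, self-contained Rust sketch, independent of DataFusion's types:

```rust
// Two individually sorted runs, mirroring the review example above.
fn main() {
    let run1 = vec![1, 2, 2];
    let run2 = vec![1, 2, 2];

    // Appending the runs back to back gives 1, 2, 2, 1, 2, 2 -- not globally sorted.
    let concatenated: Vec<i32> = run1.iter().chain(run2.iter()).copied().collect();
    assert!(!concatenated.windows(2).all(|w| w[0] <= w[1]));

    // A 2-way merge of the same runs gives 1, 1, 2, 2, 2, 2 -- globally sorted.
    let mut merged = Vec::with_capacity(run1.len() + run2.len());
    let (mut i, mut j) = (0, 0);
    while i < run1.len() && j < run2.len() {
        if run1[i] <= run2[j] {
            merged.push(run1[i]);
            i += 1;
        } else {
            merged.push(run2[j]);
            j += 1;
        }
    }
    // Drain whichever run still has elements left.
    merged.extend_from_slice(&run1[i..]);
    merged.extend_from_slice(&run2[j..]);
    assert!(merged.windows(2).all(|w| w[0] <= w[1]));
}
```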
Re: [PR] fix: Redundant files spilled during external sort + introduce `SpillManager` [datafusion]
alamb commented on PR #15355: URL: https://github.com/apache/datafusion/pull/15355#issuecomment-2745309065

> There are two extra operators that can be changed to this new interface (Aggregate and SortMergeJoin), they're planned to be included in this PR. I plan to do it after getting some review feedback.

I request that we do this feature in multiple smaller PRs, which will be easier to review / understand.

BTW I think this PR may address some of this issue too:
- https://github.com/apache/datafusion/issues/15323

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Blog post on Parquet filter pushdown [datafusion-site]
omar commented on code in PR #61: URL: https://github.com/apache/datafusion-site/pull/61#discussion_r2008990238

## content/blog/2025-03-21-parquet-pushdown.md: ##

@@ -0,0 +1,312 @@
+---
+layout: post
+title: Efficient Filter Pushdown in Parquet
+date: 2025-03-21
+author: Xiangpeng Hao
+categories: [performance]
+---
+
+
+figure {
+  margin: 20px 0;
+}
+
+figure img {
+  display: block;
+  max-width: 80%;
+}
+
+figcaption {
+  font-style: italic;
+  margin-top: 10px;
+  color: #555;
+  font-size: 0.9em;
+  max-width: 80%;
+}
+
+
+
+_Editor's Note: This blog was first published on [Xiangpeng Hao's blog]. Thanks to [InfluxData] for sponsoring this work as part of his PhD funding._
+
+[Xiangpeng Hao's blog]: https://blog.xiangpeng.systems/posts/parquet-pushdown/
+[InfluxData]: https://www.influxdata.com/
+
+
+
+In the [previous post], we discussed how [Apache DataFusion] prunes [Apache Parquet] files to skip irrelevant **files/row_groups** (sometimes also [pages](https://parquet.apache.org/docs/file-format/pageindex/)).
+
+This post discusses how Parquet readers skip irrelevant **rows** while scanning data,
+leveraging Parquet's columnar layout by first reading only filter columns,
+and then selectively reading other columns only for matching rows.
+
+
+[previous post]: https://datafusion.apache.org/blog/2025/03/20/parquet-pruning
+[Apache DataFusion]: https://datafusion.apache.org/
+[Apache Parquet]: https://parquet.apache.org/
+
+## Why filter pushdown in Parquet?
+
+Below is an example query that reads sensor data with filters on `date_time` and `location`.
+Without filter pushdown, all rows from location, val, and date_time columns are decoded before `location='office'` is evaluated. Filter pushdown is especially useful when the filter is selective, i.e., removes many rows.
+
+
+```sql
+SELECT val, location
+FROM sensor_data
+WHERE date_time > '2025-03-11' AND location = 'office';
+```
+
+
+
+
+Parquet pruning skips irrelevant files/row_groups, while filter pushdown skips irrelevant rows. Without filter pushdown, all rows from location, val, and date_time columns are decoded before `location='office'` is evaluated. Filter pushdown is especially useful when the filter is selective, i.e., removes many rows.
+
+
+
+
+In our setup, sensor data is aggregated by date — each day has its own Parquet file.
+At planning time, DataFusion prunes the unneeded Parquet files, i.e., `2025-03-10.parquet` and `2025-03-11.parquet`.
+
+Once the files to read are located, [*DataFusion's current default implementation*](https://github.com/apache/datafusion/issues/3463) reads all the projected columns (`sensor_id`, `val`, and `location`) into Arrow RecordBatches, then applies the filters over `location` to get the final set of rows.
+
+A better approach is called **filter pushdown**, which evaluates filter conditions first and only decodes data that passes these conditions.
+In practice, this works by first processing only the filter columns (`date_time` and `location`), building a boolean mask of rows that satisfy our conditions, then using this mask to selectively decode only the relevant rows from other columns (`sensor_id`, `val`).
+This eliminates the waste of decoding rows that will be immediately filtered out.
+
+While simple in theory, practical implementations often make performance worse.
+
+## How can filter pushdown be slower?
+
+At a high level, the Parquet reader first builds a filter mask -- essentially a boolean array indicating which rows meet the filter criteria -- and then uses this mask to selectively decode only the needed rows from the remaining columns in the projection.
+
+Let's dig into details of [how filter pushdown is implemented](https://github.com/apache/arrow-rs/blob/d5339f31a60a4bd8a4256e7120fe32603249d88e/parquet/src/arrow/async_reader/mod.rs#L618-L712) in the current Rust Parquet reader implementation, illustrated in the following figure.
+
+
+
+
+Implementation of filter pushdown in Rust Parquet readers -- the first phase builds the filter mask, the second phase applies the filter mask to the other columns
+
+
+
+The filter pushdown has two phases:
+
+1. Build the filter mask (steps 1-3)
+
+2. Use the filter mask to selectively decode other columns (steps 4-7), e.g., the output of step 3 is used as input for steps 5 and 7.
+
+Within each phase, it takes three steps from Parquet to Arrow:
+
+1. Decompress the Parquet pages using generic decompression algorithms like LZ4, Zstd, etc. (steps 1, 4, 6)
+
+2. Decode the page content into Arrow format (steps 2, 5, 7)
+
+3. Evaluate the filter over Arrow data (step 3)
+
+In the figure above, we can see that `location` is **decompressed and decoded twice**, first when building the filter mask (steps 1, 2), and second when building the output (steps 4, 5).
+This happens for all columns that appear both in the filter and output.
+
+The table below shows the corresponding CPU time o
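The two-phase flow quoted above (build a boolean mask from the filter columns, then selectively keep rows of the other columns) can be sketched with the arrow crate's compute kernels. This is a simplified illustration over in-memory arrays, not the Parquet reader's internal code; the column values are made up:

```rust
use arrow::array::{Float64Array, Scalar, StringArray};
use arrow::compute::filter;
use arrow::compute::kernels::cmp::eq;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Columns as they would look after decoding a Parquet page into Arrow.
    let location = StringArray::from(vec!["office", "home", "office"]);
    let val = Float64Array::from(vec![21.5, 19.0, 22.1]);

    // Phase 1: evaluate `location = 'office'` on the filter column only,
    // producing a boolean filter mask.
    let mask = eq(&location, &Scalar::new(StringArray::from(vec!["office"])))?;

    // Phase 2: apply the mask so only matching rows of the other columns
    // are materialized.
    let val_filtered = filter(&val, &mask)?;
    assert_eq!(val_filtered.len(), 2); // rows 0 and 2 survive

    Ok(())
}
```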
Re: [PR] fix: Redundant files spilled during external sort + introduce `SpillManager` [datafusion]
2010YOUY01 commented on code in PR #15355: URL: https://github.com/apache/datafusion/pull/15355#discussion_r2009004733

## datafusion/physical-plan/src/sorts/sort.rs: ##

```diff
@@ -230,9 +219,14 @@ struct ExternalSorter {
     /// if `Self::in_mem_batches` are sorted
     in_mem_batches_sorted: bool,
 
-    /// If data has previously been spilled, the locations of the
-    /// spill files (in Arrow IPC format)
-    spills: Vec<RefCountedTempFile>,
+    /// During external sorting, in-memory intermediate data will be appended to
+    /// this file incrementally. Once finished, this file will be moved to [`Self::finished_spill_files`].
+    in_progress_spill_file: Option<InProgressSpillFile>,
+    /// If data has previously been spilled, the locations of the spill files (in
+    /// Arrow IPC format)
+    /// Within the same spill file, the data might be chunked into multiple batches,
+    /// and ordered by sort keys.
+    finished_spill_files: Vec<RefCountedTempFile>,
```

Review Comment: I think it will be hard to define the semantics of those temp files if we put them inside `SpillManager`, because different operators will interpret those files differently:
- For `SortExec`, `vec` is representing multiple sorted runs on sort keys.
- For `ShuffleWriterExec` in `datafusion-comet`, since `Spark`'s shuffle operator is blocking (due to spark's staged execution design), it might want to keep `vec` instead.
- Similarly, if we want to spill `Row`s to accelerate `SortExec`, or we want to implement spilling hash join, the temp files will have very different logical meanings.

Overall, the `SpillManager` is designed only to do `RecordBatch <-> raw file` with different configurations and stat accounting. Operators have more flexibility to implement specific utilities for managing raw files, which have diverse semantics.

Do you see any potential issues or improvements?

## datafusion/physical-plan/src/sorts/sort.rs: ##

```diff
@@ -379,46 +382,64 @@ impl ExternalSorter {
     /// How many bytes have been spilled to disk?
     fn spilled_bytes(&self) -> usize {
-        self.metrics.spilled_bytes.value()
+        self.metrics.spill_metrics.spilled_bytes.value()
     }
 
     /// How many rows have been spilled to disk?
     fn spilled_rows(&self) -> usize {
-        self.metrics.spilled_rows.value()
+        self.metrics.spill_metrics.spilled_rows.value()
     }
 
     /// How many spill files have been created?
     fn spill_count(&self) -> usize {
-        self.metrics.spill_count.value()
+        self.metrics.spill_metrics.spill_file_count.value()
     }
 
-    /// Writes any `in_memory_batches` to a spill file and clears
-    /// the batches. The contents of the spill file are sorted.
-    ///
-    /// Returns the amount of memory freed.
-    async fn spill(&mut self) -> Result<usize> {
+    /// When calling, all `in_mem_batches` must be sorted, and then all of them will
+    /// be appended to the in-progress spill file.
+    async fn spill_append(&mut self) -> Result<()> {
         // we could always get a chance to free some memory as long as we are holding some
         if self.in_mem_batches.is_empty() {
-            return Ok(0);
+            return Ok(());
+        }
+
+        // Lazily initialize the in-progress spill file
+        if self.in_progress_spill_file.is_none() {
+            self.in_progress_spill_file =
+                Some(self.spill_manager.create_in_progress_file("Sorting")?);
         }
 
         self.organize_stringview_arrays()?;
 
         debug!("Spilling sort data of ExternalSorter to disk whilst inserting");
 
-        let spill_file = self.runtime.disk_manager.create_tmp_file("Sorting")?;
         let batches = std::mem::take(&mut self.in_mem_batches);
-        let (spilled_rows, spilled_bytes) = spill_record_batches(
-            &batches,
-            spill_file.path().into(),
-            Arc::clone(&self.schema),
-        )?;
-        let used = self.reservation.free();
-        self.metrics.spill_count.add(1);
-        self.metrics.spilled_bytes.add(spilled_bytes);
-        self.metrics.spilled_rows.add(spilled_rows);
-        self.spills.push(spill_file);
-        Ok(used)
+        self.reservation.free();
+
+        let in_progress_file = self.in_progress_spill_file.as_mut().ok_or_else(|| {
+            internal_datafusion_err!("In-progress spill file should be initialized")
+        })?;
+
+        for batch in batches {
```

Review Comment: No, they are globally sorted. In different stages, `in_mem_batches` can either represent unordered input, or globally sorted run (but chunked into smaller batches)

I agree this approach has poor understandability and is error-prone, I'll try to improve it.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
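The `RecordBatch <-> raw file` division of responsibility described above can be pictured with a minimal sketch over Arrow IPC streams. This is illustrative only and deliberately simpler than the `SpillManager` in the PR; metrics, temp-file management, and the in-progress file are omitted, and the struct here is hypothetical:

```rust
use std::fs::File;
use std::path::Path;
use std::sync::Arc;
use arrow::datatypes::Schema;
use arrow::error::Result;
use arrow::ipc::reader::StreamReader;
use arrow::ipc::writer::StreamWriter;
use arrow::record_batch::RecordBatch;

/// Sketch: the manager only moves batches to and from raw files; each operator
/// decides what a file *means* (a sorted run, a shuffle partition, ...).
struct SpillManager {
    schema: Arc<Schema>,
}

impl SpillManager {
    /// Write a sequence of batches to one raw spill file in Arrow IPC stream format.
    fn spill_batches(&self, batches: &[RecordBatch], path: &Path) -> Result<()> {
        let file = File::create(path)?;
        let mut writer = StreamWriter::try_new(file, &self.schema)?;
        for batch in batches {
            writer.write(batch)?;
        }
        writer.finish()
    }

    /// Read a raw spill file back into batches.
    fn read_spill(&self, path: &Path) -> Result<Vec<RecordBatch>> {
        let file = File::open(path)?;
        StreamReader::try_new(file, None)?.collect()
    }
}
```

With this split, a sort operator can keep its own `Vec` of finished sorted-run files while a shuffle operator keeps per-partition files, and both reuse the same read/write plumbing.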
Re: [PR] Fix array_has_all and array_has_any with empty array [datafusion]
LuQQiu commented on PR #15039: URL: https://github.com/apache/datafusion/pull/15039#issuecomment-2741622696 @alamb @Weijun-H Thanks for the review! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Fix parquet pruning blog post hyperlink [datafusion-site]
alamb merged PR #62: URL: https://github.com/apache/datafusion-site/pull/62 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Enhance Schema adapter to accommodate evolving struct [datafusion]
TheBuilderJR commented on PR #15295: URL: https://github.com/apache/datafusion/pull/15295#issuecomment-2745948813 Nice! FWIW, another edge case I found recently that's probably worth testing is a List whose inner Struct evolves. I ended up solving it by updating `list_coercion`, but I'm curious if you have a better way: https://github.com/apache/datafusion/pull/15259/files -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
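For reference, a union-of-fields merge for a `List` whose inner `Struct` evolves can be sketched directly over Arrow's `DataType`s. This is an illustrative approach, not necessarily what `list_coercion` or PR #15259 does; the function name is made up:

```rust
use std::sync::Arc;
use arrow::datatypes::{DataType, Field, Fields};

/// Merge two `List<Struct>` types by taking the union of the struct's fields,
/// so a file written before a column was added can still be read.
fn merge_list_of_struct(old: &DataType, new: &DataType) -> Option<DataType> {
    let (DataType::List(old_item), DataType::List(new_item)) = (old, new) else {
        return None;
    };
    let (DataType::Struct(old_fields), DataType::Struct(new_fields)) =
        (old_item.data_type(), new_item.data_type())
    else {
        return None;
    };

    // Start from the old fields, then append any field only present in the new schema.
    let mut merged: Vec<Arc<Field>> = old_fields.iter().cloned().collect();
    for f in new_fields.iter() {
        if old_fields.iter().all(|of| of.name() != f.name()) {
            merged.push(Arc::clone(f));
        }
    }
    Some(DataType::List(Arc::new(Field::new(
        new_item.name(),
        DataType::Struct(Fields::from(merged)),
        true, // rows from files that predate new fields need nullability
    ))))
}

fn main() {
    let old = DataType::List(Arc::new(Field::new(
        "item",
        DataType::Struct(Fields::from(vec![Field::new("a", DataType::Int32, true)])),
        true,
    )));
    let new = DataType::List(Arc::new(Field::new(
        "item",
        DataType::Struct(Fields::from(vec![
            Field::new("a", DataType::Int32, true),
            Field::new("b", DataType::Utf8, true),
        ])),
        true,
    )));
    println!("{:?}", merge_list_of_struct(&old, &new)); // List of Struct{a, b}
}
```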
[PR] Format `Date32` to string given timestamp specifiers [datafusion]
friendlymatthew opened a new pull request, #15361: URL: https://github.com/apache/datafusion/pull/15361

Closes https://github.com/apache/datafusion/issues/14536

## Rationale for this change

DataFusion currently errors when attempting to format a date using time-related specifiers.

```sql
> select to_char('2023-09-04'::date, '%Y-%m-%dT%H:%M:%S%.3f');
Execution error: Cast error: Format error
```

However, Postgres supports this feature, as it implicitly treats the date as a timestamp. Rather than eagerly casting every `Date32` to a `Date64` when calling `to_char`, this commit attempts to first format a `Date32` with the supplied format string. If the formatting fails, we retry formatting it as a `Date64`. This way, only format strings with time-related specifiers incur the intermediate cast.

## What changes are included in this PR?

All changes are tested, specifically for two different call sites: `_to_char_scalar` and `_to_char_array`.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
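The try-first, fall-back-on-failure idea from the rationale can be pictured with plain chrono. This is a simplified illustration, not the PR's code (which goes through Arrow's formatting machinery), and `to_char_date` is a hypothetical helper:

```rust
use std::fmt::Write;
use chrono::{NaiveDate, NaiveDateTime, NaiveTime};

/// Try to format a date with the given chrono format string; if the format
/// needs time fields (e.g. `%H:%M:%S`), fall back to formatting the date as
/// a timestamp at midnight, mirroring Postgres' implicit date -> timestamp cast.
fn to_char_date(date: NaiveDate, fmt: &str) -> String {
    let mut out = String::new();
    // chrono reports time specifiers on a date-only value as a fmt::Error.
    if write!(out, "{}", date.format(fmt)).is_ok() {
        return out;
    }
    // Date-only formatting failed: re-run with a midnight timestamp.
    let ts: NaiveDateTime = date.and_time(NaiveTime::MIN);
    format!("{}", ts.format(fmt))
}

fn main() {
    let d = NaiveDate::from_ymd_opt(2023, 9, 4).unwrap();
    assert_eq!(to_char_date(d, "%Y-%m-%d"), "2023-09-04");
    assert_eq!(
        to_char_date(d, "%Y-%m-%dT%H:%M:%S%.3f"),
        "2023-09-04T00:00:00.000"
    );
}
```

Only the second call pays for the fallback, which matches the PR's goal of keeping the common date-only path cheap.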
Re: [PR] Parse Postgres's LOCK TABLE statement [datafusion-sqlparser-rs]
github-actions[bot] commented on PR #1614: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1614#issuecomment-2745969727 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open for 60 days with no activity. Please remove the stale label or add a comment, or this will be closed in 7 days. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org