Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]
shehabgamin commented on issue #15072: URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2735261314 I feel like this may be important enough to try to get into the release. Does anyone else have thoughts? https://github.com/apache/datafusion/issues/15174 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add dynamic pruning filters from TopK state [datafusion]
adriangb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2735401711 Inspired by discussion in https://github.com/apache/datafusion/pull/13054 I went with adding this to `ExecutionPlan`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add dynamic pruning filters from TopK state [datafusion]
adriangb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2735408031 Tomorrow I plan on doing some tracer bullet testing to see if this approach works at all. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add dynamic pruning filters from TopK state [datafusion]
adriangb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2735413403 cc @alamb -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: `core_expressions` feature flag broken, move `overlay` into `core` functions [datafusion]
alamb commented on PR #15217: URL: https://github.com/apache/datafusion/pull/15217#issuecomment-2734530255 > hey @alamb, I have already added a re-export at the end of `datafusion/functions/src/string/overlay.rs` like this Thanks @shruti2522 - that looks good to me I double checked and it is appearing as we would expect  Thank you 🙏 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: add read array support [datafusion-comet]
comphead commented on code in PR #1456: URL: https://github.com/apache/datafusion-comet/pull/1456#discussion_r2001739415 ## native/core/Cargo.toml: ## @@ -77,6 +77,7 @@ jni = { version = "0.21", features = ["invocation"] } lazy_static = "1.4" assertables = "7" hex = "0.4.3" +datafusion-functions-nested = "46.0.0" Review Comment: I'm not sure if it is possible to make it `workspace = true`, this package is optional for test only and optional packages are not allowed on workspace level. I can include it in workspace as non optional but it will be compiled into the target binary which is probably not expected -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add WITH ORDER example to blog post [datafusion-site]
akurmustafa commented on PR #59: URL: https://github.com/apache/datafusion-site/pull/59#issuecomment-2734756045 > > Thanks @alamb, I was working on to add the example you gave ("DataFusion can find / use orderings based on query intermediates"). Should we add this to the document what do you think? We can leave the document in current form also? > > I think this would be great! > > We could also write an entire other blog post about it too -- whatever you prefer / have time for! I think, for the completeness we can add this section to the current post. However, independent of this I think we can write a dedicated blog post for this definitely (with lots of examples, and with a focus on queries and plans ). However, before that post I want to migrate following [post](https://akurmustafa.github.io/blogs/query_optimization_filter_pushdown/query_optimization_filter_pushdown.html), to the Datafusion website. This post is an analysis of filter pushdown with a more theoretical focus: "when pushing down filter makes sense? (e.g. always:))". During migration I plan to update that post a lot, however main approach and content will be very similar. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Attach Diagnostic to "incompatible type in unary expression" error [datafusion]
alamb merged PR #15209: URL: https://github.com/apache/datafusion/pull/15209 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore(deps): bump uuid from 1.15.1 to 1.16.0 [datafusion]
xudong963 merged PR #15292: URL: https://github.com/apache/datafusion/pull/15292 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Add `datafusion-spark` crate [datafusion]
shehabgamin commented on code in PR #15168: URL: https://github.com/apache/datafusion/pull/15168#discussion_r2002067067 ## datafusion/spark/src/function/math/expm1.rs: ## @@ -0,0 +1,169 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use crate::function::error_utils::{ +invalid_arg_count_exec_err, unsupported_data_type_exec_err, +}; +use arrow::array::{ArrayRef, AsArray}; +use arrow::datatypes::{DataType, Float64Type}; +use datafusion_common::{Result, ScalarValue}; +use datafusion_expr::{ +ColumnarValue, ScalarFunctionArgs, ScalarUDFImpl, Signature, Volatility, +}; +use datafusion_macros::user_doc; +use std::any::Any; +use std::sync::Arc; + +#[user_doc( +doc_section(label = "Spark Math Functions"), +description = "Returns exp(expr) - 1 as a Float64.", +syntax_example = "expm1(expr)", +sql_example = r#"```sql +> select expm1(0); +++ +| expm1(0) | +++ +| 0.0| +++ +> select expm1(1); +++ +| expm1(1) | +++ +| 50 | +++ +```"#, +argument( +name = "expr", +description = "An expression that evaluates to a numeric." +) +)] +#[derive(Debug)] +pub struct SparkExpm1 { +signature: Signature, +aliases: Vec, +} + +impl Default for SparkExpm1 { +fn default() -> Self { +Self::new() +} +} + +impl SparkExpm1 { +pub fn new() -> Self { +Self { +signature: Signature::user_defined(Volatility::Immutable), +aliases: vec!["spark_expm1".to_string()], +} +} +} + +impl ScalarUDFImpl for SparkExpm1 { +fn as_any(&self) -> &dyn Any { +self +} + +fn name(&self) -> &str { +"expm1" +} + +fn signature(&self) -> &Signature { +&self.signature +} + +fn return_type(&self, _arg_types: &[DataType]) -> Result { +Ok(DataType::Float64) +} + +fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result { +if args.args.len() != 1 { +return Err(invalid_arg_count_exec_err("expm1", (1, 1), args.args.len())); +} +match &args.args[0] { +ColumnarValue::Scalar(ScalarValue::Float64(value)) => Ok( +ColumnarValue::Scalar(ScalarValue::Float64(value.map(|x| x.exp_m1(, +), +ColumnarValue::Array(array) => match array.data_type() { +DataType::Float64 => Ok(ColumnarValue::Array(Arc::new( +array +.as_primitive::() +.unary::<_, Float64Type>(|x| x.exp_m1()), +) +as ArrayRef)), +other => Err(unsupported_data_type_exec_err( +"expm1", +format!("{}", DataType::Float64).as_str(), +other, +)), +}, +other => Err(unsupported_data_type_exec_err( +"expm1", +format!("{}", DataType::Float64).as_str(), +&other.data_type(), +)), +} +} + +fn aliases(&self) -> &[String] { +&self.aliases +} + +fn coerce_types(&self, arg_types: &[DataType]) -> Result> { +if arg_types.len() != 1 { +return Err(invalid_arg_count_exec_err("expm1", (1, 1), arg_types.len())); +} +if arg_types[0].is_numeric() { +Ok(vec![DataType::Float64]) +} else { +Err(unsupported_data_type_exec_err( +"expm1", +"Numeric Type", +&arg_types[0], +)) +} +} +} + +#[cfg(test)] +mod tests { +use crate::function::math::expm1::SparkExpm1; +use crate::function::utils::test::test_scalar_function; +use arrow::array::{Array, Float64Array}; +use arrow::datatypes::DataType::Float64; +use datafusion_common::{Result, ScalarValue}; +use datafusion_expr::{ColumnarValue, ScalarUDFImpl}; + +macro_rules! test_expm1_float64_invoke { +($INPUT:expr, $EXPECTED:expr) => { +test_scalar_function!( +SparkExpm1::new(), +
Re: [PR] Blog post on Parquet pruning in datafusion [datafusion-site]
comphead commented on code in PR #60: URL: https://github.com/apache/datafusion-site/pull/60#discussion_r2002069929 ## content/blog/2025-03-18-parquet-pruning.md: ## @@ -0,0 +1,111 @@ +--- +layout: post +title: Parquet pruning in DataFusion: Read Only What Matters +date: 2025-03-18 +author: Xiangpeng Hao +categories: [performance] +--- + + + + +_Editor's Note: This blog was first published on [Xiangpeng Hao's blog]. Thanks to [InfluxData] for sponsoring this work as part of his PhD funding._ + +[Xiangpeng Hao's blog]: https://blog.xiangpeng.systems/posts/parquet-to-arrow/ +[InfluxData]: https://www.influxdata.com/ + + +Parquet has become the industry standard for storing columnar data, and reading Parquet efficiently is crucial for query performance. + +To optimize this, DataFusion implements advanced Parquet support for effective data pruning and decoding. Review Comment: I think we need to specify what exactly optimized? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Improvement/improve wildcard error 15004 [datafusion]
Jiashu-Hu commented on code in PR #15287: URL: https://github.com/apache/datafusion/pull/15287#discussion_r2001662346 ## datafusion/sql/src/select.rs: ## @@ -826,6 +827,13 @@ impl SqlToRel<'_, S> { .map(|expr| rebase_expr(expr, &aggr_projection_exprs, input)) .collect::>>()?; +// check if the columns in the SELECT list are in the GROUP BY clause +// or are part of an aggregate function, if not, throw an error. +validate_columns_in_group_by_or_aggregate( Review Comment: hey jay, thanks for your review! I've unified these 2 also updated error message in the test file. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Parse `SUBSTR` as alias for `SUBSTRING` [datafusion-sqlparser-rs]
mvzink opened a new pull request, #1769: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1769 (no comment) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Migrate tests to insta [datafusion]
jsai28 commented on code in PR #15288: URL: https://github.com/apache/datafusion/pull/15288#discussion_r2001376856 ## datafusion/core/tests/parquet/custom_reader.rs: ## @@ -96,17 +97,15 @@ async fn route_data_access_ops_to_parquet_file_reader_factory() { let task_ctx = session_ctx.task_ctx(); let read = collect(parquet_exec, task_ctx).await.unwrap(); -let expected = [ -"+-+++", -"| c1 | c2 | c3 |", -"+-+++", -"| Foo | 1 | 10 |", -"| | 2 | 20 |", -"| bar |||", -"+-+++", -]; - -assert_batches_sorted_eq!(expected, &read); +assert_snapshot!(batches_to_sort_string(&read), @r" Review Comment: change to batch_to_string -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add CatalogProvider and SchemaProvider to FFI Crate [datafusion]
timsaucer merged PR #15280: URL: https://github.com/apache/datafusion/pull/15280 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add WITH ORDER example to blog post [datafusion-site]
akurmustafa commented on PR #59: URL: https://github.com/apache/datafusion-site/pull/59#issuecomment-2734859868 With the [commit](https://github.com/apache/datafusion-site/pull/59/commits/85eea6a572f95972a155ee9926319112e7149ce8), I have added the @alamb's suggestion to the post. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: add read array support [datafusion-comet]
andygrove commented on code in PR #1456: URL: https://github.com/apache/datafusion-comet/pull/1456#discussion_r2002089283 ## native/core/src/execution/planner.rs: ## @@ -3004,4 +3006,130 @@ mod tests { type_info: None, } } + +#[test] +fn test_create_array() { +let session_ctx = SessionContext::new(); +session_ctx.register_udf(ScalarUDF::from( +datafusion_functions_nested::make_array::MakeArray::new(), +)); +let task_ctx = session_ctx.task_ctx(); +let planner = PhysicalPlanner::new(Arc::from(session_ctx)); + +// Create a plan for +// ProjectionExec: expr=[make_array(col_0@0) as col_0] +// ScanExec: source=[CometScan parquet (unknown)], schema=[col_0: Int32] +let op_scan = Operator { +plan_id: 0, +children: vec![], +op_struct: Some(OpStruct::Scan(spark_operator::Scan { +fields: vec![ +spark_expression::DataType { +type_id: 3, // Int32 +type_info: None, +}, +spark_expression::DataType { +type_id: 3, // Int32 +type_info: None, +}, +spark_expression::DataType { +type_id: 3, // Int32 +type_info: None, +}, +], +source: "".to_string(), +})), +}; + +let array_col = spark_expression::Expr { +expr_struct: Some(Bound(spark_expression::BoundReference { +index: 0, +datatype: Some(spark_expression::DataType { +type_id: 3, +type_info: None, +}), +})), +}; + +let array_col_1 = spark_expression::Expr { +expr_struct: Some(Bound(spark_expression::BoundReference { +index: 1, +datatype: Some(spark_expression::DataType { +type_id: 3, +type_info: None, +}), +})), +}; + +let projection = Operator { +children: vec![op_scan], +plan_id: 0, +op_struct: Some(OpStruct::Projection(spark_operator::Projection { +project_list: vec![spark_expression::Expr { +expr_struct: Some(ExprStruct::ScalarFunc(spark_expression::ScalarFunc { +func: "make_array".to_string(), +args: vec![array_col, array_col_1], +return_type: None, +})), +}], +})), +}; + +let a = Int32Array::from(vec![0, 3]); +let b = Int32Array::from(vec![1, 4]); +let c = Int32Array::from(vec![2, 5]); +let input_batch = InputBatch::Batch(vec![Arc::new(a), Arc::new(b), Arc::new(c)], 2); + +let (mut scans, datafusion_plan) = +planner.create_plan(&projection, &mut vec![], 1).unwrap(); +scans[0].set_input_batch(input_batch); + +let mut stream = datafusion_plan.native_plan.execute(0, task_ctx).unwrap(); + +let runtime = tokio::runtime::Runtime::new().unwrap(); +let (tx, mut rx) = mpsc::channel(1); + +// Separate thread to send the EOF signal once we've processed the only input batch +runtime.spawn(async move { +// Create a dictionary array with 100 values, and use it as input to the execution. +let a = Int32Array::from(vec![0, 3]); +let b = Int32Array::from(vec![1, 4]); +let c = Int32Array::from(vec![2, 5]); +let input_batch1 = InputBatch::Batch(vec![Arc::new(a), Arc::new(b), Arc::new(c)], 2); +let input_batch2 = InputBatch::EOF; + +let batches = vec![input_batch1, input_batch2]; + +for batch in batches.into_iter() { +tx.send(batch).await.unwrap(); +} +}); + +runtime.block_on(async move { Review Comment: It's nice to see an end-to-end test like this in the Rust project -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add WITH ORDER example to blog post [datafusion-site]
Omega359 commented on code in PR #59: URL: https://github.com/apache/datafusion-site/pull/59#discussion_r2002099018 ## content/images/ordering_analysis/query_window_plan.png: ## Review Comment: At the output of the window function the table has the ordering: -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: add read array support [datafusion-comet]
comphead commented on PR #1456: URL: https://github.com/apache/datafusion-comet/pull/1456#issuecomment-2734862629 Thanks everyone -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: add read array support [datafusion-comet]
comphead merged PR #1456: URL: https://github.com/apache/datafusion-comet/pull/1456 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] docs: Use a shallow clone for Spark SQL test instructions [datafusion-comet]
andygrove merged PR #1547: URL: https://github.com/apache/datafusion-comet/pull/1547 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Blog for DataFusion 46.0.0 [datafusion]
alamb commented on issue #15053: URL: https://github.com/apache/datafusion/issues/15053#issuecomment-2734564305 Thanks @berkaysynnada In general I suggest emphasizing things that many users of the crate will see / appreciate and mentioning, but not too deeply, things that developers of the crate care more about I also suggest that until we enable a feature by default we should not make a big deal about it. The rationale is that most users won't use it until it is on by default, and we can make a big deal about it when it is available I suggest emphasizing: - https://github.com/apache/datafusion/pull/14579 - GSOC for sure I would also mention that we are working on statistics / monotonicity, but until there is something major we can point at that is user facing (e.g. an important SQL query that now goes faster with the monotonicity changes) I probably wouldn't emphasize them I think these ones are worth mentioning but don't make a huge deal: - https://github.com/apache/datafusion/pull/14224 These ones are in progress so I would suggest we wait t: - https://github.com/apache/datafusion/pull/13664 - https://github.com/apache/datafusion/pull/14699 I would suggesting highlighting the performance improvements - perf: Improve median with no grouping by 2X https://github.com/apache/datafusion/pull/14399 (2010YOUY01) - Improve performance 10%-100% in FIRST_VALUE / LAST_VALUE by not sort rows in FirstValueAccumulator https://github.com/apache/datafusion/pull/14402 (blaginin) - Speed up uuid UDF (40x faster) https://github.com/apache/datafusion/pull/14675 (simonvandel) - perf: Drop RowConverter from GroupOrderingPartial https://github.com/apache/datafusion/pull/14566 (ctsk) Speedup to_hex (~2x faster) https://github.com/apache/datafusion/pull/14686 (simonvandel) - Implement predicate pruning for not like expressions https://github.com/apache/datafusion/pull/14567 (UBarney) - Speed up chr UDF (~4x faster) https://github.com/apache/datafusion/pull/14700 (simonvandel) Also I suggest pointing people at the new upgrade guide: https://github.com/apache/datafusion/blob/main/docs/source/library-user-guide/upgrading.md#datafusion-4600 And maybe we can refer people to https://github.com/apache/datafusion/blob/main/dev/changelog/46.0.0.md for more details -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: support merge for `Distribution` [datafusion]
xudong963 commented on code in PR #15296: URL: https://github.com/apache/datafusion/pull/15296#discussion_r2002299377 ## datafusion/expr-common/src/statistics.rs: ## @@ -857,6 +857,143 @@ pub fn compute_variance( ScalarValue::try_from(target_type) } +/// Merges two distributions into a single distribution that represents their combined statistics. +/// This creates a more general distribution that approximates the mixture of the input distributions. +pub fn merge_distributions(a: &Distribution, b: &Distribution) -> Result { +let range_a = a.range()?; +let range_b = b.range()?; + +// Determine data type and create combined range +let combined_range = if range_a.is_unbounded() || range_b.is_unbounded() { Review Comment: Great, I found the `Interval::union` works with intervals of the same data type. I seems that we can loose the requirement, such as, `Int64` with `Int32`, `int` with `float`, etc also can be unioned. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: support merge for `Distribution` [datafusion]
xudong963 commented on code in PR #15296: URL: https://github.com/apache/datafusion/pull/15296#discussion_r2002309255 ## datafusion/expr-common/src/statistics.rs: ## @@ -857,6 +857,143 @@ pub fn compute_variance( ScalarValue::try_from(target_type) } +/// Merges two distributions into a single distribution that represents their combined statistics. +/// This creates a more general distribution that approximates the mixture of the input distributions. Review Comment: Maybe we can add some notes to `GenericDistribution` ``` /// # Range Guarantees /// The provided range is assumed to be conservative - all possible values of the /// distribution must lie within this range. However, the range itself might be /// an approximation, as the actual distribution could occupy only a subset of the range. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Support `merge` for `Distribution` [datafusion]
xudong963 commented on issue #15290: URL: https://github.com/apache/datafusion/issues/15290#issuecomment-2732391652 There is a proposal: https://github.com/apache/datafusion/pull/15296 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Improve performance of `first_value` by implementing special `GroupsAccumulator` [datafusion]
UBarney commented on code in PR #15266: URL: https://github.com/apache/datafusion/pull/15266#discussion_r2000992780 ## datafusion/functions-aggregate/src/first_last.rs: ## @@ -179,6 +292,423 @@ impl AggregateUDFImpl for FirstValue { } } +struct FirstGroupsAccumulator +where +T: ArrowPrimitiveType + Send, +{ +// state === +vals: Vec, +// Stores ordering values, of the aggregator requirement corresponding to first value +// of the aggregator. +// The `orderings` are stored row-wise, meaning that `orderings[group_idx]` +// represents the ordering values corresponding to the `group_idx`-th group. +orderings: Vec>, +// At the beginning, `is_sets[group_idx]` is false, which means `first` is not seen yet. +// Once we see the first value, we set the `is_sets[group_idx]` flag +is_sets: BooleanBufferBuilder, +// null_builder[group_idx] == false => vals[group_idx] is null +null_builder: BooleanBufferBuilder, +// size of `self.orderings` +// Calculating the memory usage of `self.orderings` using `ScalarValue::size_of_vec` is quite costly. +// Therefore, we cache it and compute `size_of` only after each update +// to avoid calling `ScalarValue::size_of_vec` by Self.size. +size_of_orderings: usize, + +// === option + +// Stores the applicable ordering requirement. +ordering_req: LexOrdering, +// derived from `ordering_req`. +sort_options: Vec, +// Stores whether incoming data already satisfies the ordering requirement. +input_requirement_satisfied: bool, +// Ignore null values. +ignore_nulls: bool, +/// The output type +data_type: DataType, +default_orderings: Vec, +} + +impl FirstGroupsAccumulator +where +T: ArrowPrimitiveType + Send, +{ +fn try_new( +ordering_req: LexOrdering, +ignore_nulls: bool, +data_type: &DataType, +ordering_dtypes: &[DataType], +) -> Result { +let requirement_satisfied = ordering_req.is_empty(); + +let default_orderings = ordering_dtypes +.iter() +.map(ScalarValue::try_from) +.collect::>>()?; + +let sort_options = get_sort_options(ordering_req.as_ref()); + +Ok(Self { +null_builder: BooleanBufferBuilder::new(0), +ordering_req, +sort_options, +input_requirement_satisfied: requirement_satisfied, +ignore_nulls, +default_orderings, +data_type: data_type.clone(), +vals: Vec::new(), +orderings: Vec::new(), +is_sets: BooleanBufferBuilder::new(0), +size_of_orderings: 0, +}) +} + +fn need_update(&self, group_idx: usize) -> bool { +if !self.is_sets.get_bit(group_idx) { +return true; +} + +if self.ignore_nulls && !self.null_builder.get_bit(group_idx) { +return true; +} + +!self.input_requirement_satisfied +} + +fn should_update_state( +&self, +group_idx: usize, +new_ordering_values: &[ScalarValue], +) -> Result { +if !self.is_sets.get_bit(group_idx) { +return Ok(true); +} + +assert!(new_ordering_values.len() == self.ordering_req.len()); +let current_ordering = &self.orderings[group_idx]; +compare_rows(current_ordering, new_ordering_values, &self.sort_options) +.map(|x| x.is_gt()) +} + +fn take_orderings(&mut self, emit_to: EmitTo) -> Vec> { +let result = emit_to.take_needed(&mut self.orderings); + +match emit_to { +EmitTo::All => self.size_of_orderings = 0, +EmitTo::First(_) => { +self.size_of_orderings -= +result.iter().map(ScalarValue::size_of_vec).sum::() +} +} + +result +} + +fn take_need( +bool_buf_builder: &mut BooleanBufferBuilder, +emit_to: EmitTo, +) -> BooleanBuffer { +let bool_buf = bool_buf_builder.finish(); +match emit_to { +EmitTo::All => bool_buf, +EmitTo::First(n) => { +// split off the first N values in seen_values +// +// TODO make this more efficient rather than two +// copies and bitwise manipulation +let first_n: BooleanBuffer = bool_buf.iter().take(n).collect(); +// reset the existing buffer +for b in bool_buf.iter().skip(n) { +bool_buf_builder.append(b); +} +first_n +} +} +} + +fn resize_states(&mut self, new_size: usize) { +self.vals.resize(new_size, T::default_value()); + +if self.null_builder.len() < new_size { +self.null_builder +.append_n(new_size - self.null_bu
Re: [I] Failed optimizations with Int64 type [datafusion]
alamb commented on issue #15291: URL: https://github.com/apache/datafusion/issues/15291#issuecomment-2732103208 Thanks @aectaan -- what is the error message that you get? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Improve performance of `first_value` by implementing special `GroupsAccumulator` [datafusion]
UBarney commented on code in PR #15266: URL: https://github.com/apache/datafusion/pull/15266#discussion_r2000992780 ## datafusion/functions-aggregate/src/first_last.rs: ## @@ -179,6 +292,423 @@ impl AggregateUDFImpl for FirstValue { } } +struct FirstGroupsAccumulator +where +T: ArrowPrimitiveType + Send, +{ +// state === +vals: Vec, +// Stores ordering values, of the aggregator requirement corresponding to first value +// of the aggregator. +// The `orderings` are stored row-wise, meaning that `orderings[group_idx]` +// represents the ordering values corresponding to the `group_idx`-th group. +orderings: Vec>, +// At the beginning, `is_sets[group_idx]` is false, which means `first` is not seen yet. +// Once we see the first value, we set the `is_sets[group_idx]` flag +is_sets: BooleanBufferBuilder, +// null_builder[group_idx] == false => vals[group_idx] is null +null_builder: BooleanBufferBuilder, +// size of `self.orderings` +// Calculating the memory usage of `self.orderings` using `ScalarValue::size_of_vec` is quite costly. +// Therefore, we cache it and compute `size_of` only after each update +// to avoid calling `ScalarValue::size_of_vec` by Self.size. +size_of_orderings: usize, + +// === option + +// Stores the applicable ordering requirement. +ordering_req: LexOrdering, +// derived from `ordering_req`. +sort_options: Vec, +// Stores whether incoming data already satisfies the ordering requirement. +input_requirement_satisfied: bool, +// Ignore null values. +ignore_nulls: bool, +/// The output type +data_type: DataType, +default_orderings: Vec, +} + +impl FirstGroupsAccumulator +where +T: ArrowPrimitiveType + Send, +{ +fn try_new( +ordering_req: LexOrdering, +ignore_nulls: bool, +data_type: &DataType, +ordering_dtypes: &[DataType], +) -> Result { +let requirement_satisfied = ordering_req.is_empty(); + +let default_orderings = ordering_dtypes +.iter() +.map(ScalarValue::try_from) +.collect::>>()?; + +let sort_options = get_sort_options(ordering_req.as_ref()); + +Ok(Self { +null_builder: BooleanBufferBuilder::new(0), +ordering_req, +sort_options, +input_requirement_satisfied: requirement_satisfied, +ignore_nulls, +default_orderings, +data_type: data_type.clone(), +vals: Vec::new(), +orderings: Vec::new(), +is_sets: BooleanBufferBuilder::new(0), +size_of_orderings: 0, +}) +} + +fn need_update(&self, group_idx: usize) -> bool { +if !self.is_sets.get_bit(group_idx) { +return true; +} + +if self.ignore_nulls && !self.null_builder.get_bit(group_idx) { +return true; +} + +!self.input_requirement_satisfied +} + +fn should_update_state( +&self, +group_idx: usize, +new_ordering_values: &[ScalarValue], +) -> Result { +if !self.is_sets.get_bit(group_idx) { +return Ok(true); +} + +assert!(new_ordering_values.len() == self.ordering_req.len()); +let current_ordering = &self.orderings[group_idx]; +compare_rows(current_ordering, new_ordering_values, &self.sort_options) +.map(|x| x.is_gt()) +} + +fn take_orderings(&mut self, emit_to: EmitTo) -> Vec> { +let result = emit_to.take_needed(&mut self.orderings); + +match emit_to { +EmitTo::All => self.size_of_orderings = 0, +EmitTo::First(_) => { +self.size_of_orderings -= +result.iter().map(ScalarValue::size_of_vec).sum::() +} +} + +result +} + +fn take_need( +bool_buf_builder: &mut BooleanBufferBuilder, +emit_to: EmitTo, +) -> BooleanBuffer { +let bool_buf = bool_buf_builder.finish(); +match emit_to { +EmitTo::All => bool_buf, +EmitTo::First(n) => { +// split off the first N values in seen_values +// +// TODO make this more efficient rather than two +// copies and bitwise manipulation +let first_n: BooleanBuffer = bool_buf.iter().take(n).collect(); +// reset the existing buffer +for b in bool_buf.iter().skip(n) { +bool_buf_builder.append(b); +} +first_n +} +} +} + +fn resize_states(&mut self, new_size: usize) { +self.vals.resize(new_size, T::default_value()); + +if self.null_builder.len() < new_size { +self.null_builder +.append_n(new_size - self.null_bu
Re: [PR] fix: Unconditionally wrap UNION BY NAME input nodes w/ `Projection` [datafusion]
Omega359 commented on code in PR #15242: URL: https://github.com/apache/datafusion/pull/15242#discussion_r2001009802 ## datafusion/sqllogictest/test_files/union_by_name.slt: ## @@ -287,3 +287,137 @@ SELECT '0' as c UNION ALL BY NAME SELECT 0 as c; 0 0 + +# Regression tests for https://github.com/apache/datafusion/issues/15236 +# Ensure that the correct output is produced even if the width of an input node's +# schema is the same as the resulting schema width after the union is applied. + +statement ok +create table t3 (x varchar(255), y varchar(255), z varchar(255)); + +statement ok +create table t4 (x varchar(255), y varchar(255), z varchar(255)); + +statement ok +insert into t3 values ('a', 'b', 'c'); + +statement ok +insert into t4 values ('a', 'b', 'c'); + +query rowsort +select t3.x, t3.y, t3.z from t3 union by name select t3.z, t3.y, t3.x, 'd' as zz from t3; + +a b c NULL +a b c d + +query rowsort +select t3.x, t3.y, t3.z from t3 union by name select t4.z, t4.y, t4.x, 'd' as zz from t4; + +a b c NULL +a b c d + +query TTT rowsort +select x, y, z from t3 union all by name select z, y, x from t3; + +a b c +a b c + +query TTT rowsort +select x, y, z from t3 union all by name select z, y, x from t4; + +a b c +a b c + +query TTT +select x, y, z from t3 union all by name select z, y, x from t4 order by x; + +a b c +a b c + + +# FIXME: The following should pass without error, but currently it is failing +# due to differing record batch schemas when the SLT runner collects results in +# normalize::convert_batches. +# +# More context can be found here: https://github.com/apache/datafusion/pull/15242#issuecomment-2730402547 +query error +select x, y, z from t3 union all by name select z, y, x, 'd' as zz from t3; + +DataFusion error: Internal error: Schema mismatch. Previously had +Schema { +fields: [ +Field { +name: "x", +data_type: Utf8, +nullable: true, +dict_id: 0, +dict_is_ordered: false, +metadata: {}, +}, +Field { +name: "y", +data_type: Utf8, +nullable: true, +dict_id: 0, +dict_is_ordered: false, +metadata: {}, +}, +Field { +name: "z", +data_type: Utf8, +nullable: true, +dict_id: 0, +dict_is_ordered: false, +metadata: {}, +}, +Field { +name: "zz", +data_type: Utf8, +nullable: false, +dict_id: 0, +dict_is_ordered: false, +metadata: {}, +}, +], +metadata: {}, +} + +Got: +Schema { +fields: [ +Field { +name: "x", +data_type: Utf8, +nullable: true, +dict_id: 0, +dict_is_ordered: false, +metadata: {}, +}, +Field { +name: "y", +data_type: Utf8, +nullable: true, +dict_id: 0, +dict_is_ordered: false, +metadata: {}, +}, +Field { +name: "z", +data_type: Utf8, +nullable: true, +dict_id: 0, +dict_is_ordered: false, +metadata: {}, +}, +Field { +name: "zz", +data_type: Utf8, +nullable: true, Review Comment: Interesting that the nullable changed from false to true here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Migrate datasource tests to insta [datafusion]
blaginin commented on code in PR #15258: URL: https://github.com/apache/datafusion/pull/15258#discussion_r2001416176 ## datafusion/core/Cargo.toml: ## @@ -126,6 +126,7 @@ datafusion-physical-plan = { workspace = true } datafusion-sql = { workspace = true } flate2 = { version = "1.1.0", optional = true } futures = { workspace = true } +insta = "1.42.2" Review Comment: you should be use to use the workspace version -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Migrate tests to insta [datafusion]
jsai28 commented on code in PR #15288: URL: https://github.com/apache/datafusion/pull/15288#discussion_r2001379803 ## datafusion/core/tests/parquet/schema.rs: ## @@ -153,7 +151,15 @@ async fn schema_merge_can_preserve_metadata() { let actual = df.collect().await.unwrap(); -assert_batches_sorted_eq!(expected, &actual); +assert_snapshot!(batches_to_sort_string(&actual), @r" Review Comment: change to batches_to_string -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Migrate tests to insta [datafusion]
jsai28 commented on code in PR #15288: URL: https://github.com/apache/datafusion/pull/15288#discussion_r2001384434 ## datafusion/core/tests/sql/path_partition.rs: ## @@ -145,16 +146,7 @@ async fn parquet_distinct_partition_col() -> Result<()> { .collect() .await?; -let expected = [ -"+--+---+-+", -"| year | month | day |", -"+--+---+-+", -"| 2021 | 09| 09 |", -"| 2021 | 10| 09 |", -"| 2021 | 10| 28 |", -"+--+---+-+", -]; -assert_batches_sorted_eq!(expected, &result); +assert_snapshot!(batches_to_sort_string(&result)); Review Comment: missing inline snapshot ## datafusion/core/tests/sql/path_partition.rs: ## @@ -275,18 +267,7 @@ async fn csv_filter_with_file_col() -> Result<()> { .collect() .await?; -let expected = [ -"+++", -"| c1 | c2 |", -"+++", -"| a | 1 |", -"| b | 1 |", -"| b | 5 |", -"| c | 2 |", -"| d | 5 |", -"+++", -]; -assert_batches_sorted_eq!(expected, &result); +assert_snapshot!(batches_to_sort_string(&result)); Review Comment: missing inline snapshot -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] minor: fix `data/sqlite` link [datafusion]
sdht0 commented on PR #15286: URL: https://github.com/apache/datafusion/pull/15286#issuecomment-2733877823 Ah fixed it thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Simplify display format of `AggregateFunctionExpr`, add `Expr::sql_name` [datafusion]
xudong963 commented on code in PR #15253: URL: https://github.com/apache/datafusion/pull/15253#discussion_r2001108185 ## datafusion/sqllogictest/test_files/explain_tree.slt: ## @@ -202,53 +202,48 @@ physical_plan 02)│ AggregateExec │ 03)│ │ 04)│ aggr: sum(bigint_col) │ -05)│ │ -06)│ group_by: │ -07)│ string_col@0 as string_col│ Review Comment: got it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Refactor file schema type coercions [datafusion]
xudong963 commented on PR #15268: URL: https://github.com/apache/datafusion/pull/15268#issuecomment-2733344214 Thank you all! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Refactor file schema type coercions [datafusion]
xudong963 merged PR #15268: URL: https://github.com/apache/datafusion/pull/15268 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Support `merge` for `Distribution` [datafusion]
xudong963 commented on issue #15290: URL: https://github.com/apache/datafusion/issues/15290#issuecomment-2732132955 > Will require an accurate distribution (not just an approximation Yes, it depends on whether each distribution is accurate, if they're, the merged distribution should be accurate, or we should merge them conservatively -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Improve performance of `first_value` by implementing special `GroupsAccumulator` [datafusion]
Dandandan commented on code in PR #15266: URL: https://github.com/apache/datafusion/pull/15266#discussion_r2000574008 ## datafusion/functions-aggregate/src/first_last.rs: ## @@ -179,6 +292,423 @@ impl AggregateUDFImpl for FirstValue { } } +struct FirstGroupsAccumulator Review Comment: ```suggestion struct FirstPrimitiveGroupsAccumulator ``` ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Simplify display format of `AggregateFunctionExpr`, add `Expr::sql_name` [datafusion]
xudong963 commented on code in PR #15253: URL: https://github.com/apache/datafusion/pull/15253#discussion_r2000832270 ## datafusion/sqllogictest/test_files/explain_tree.slt: ## @@ -202,53 +202,48 @@ physical_plan 02)│ AggregateExec │ 03)│ │ 04)│ aggr: sum(bigint_col) │ -05)│ │ -06)│ group_by: │ -07)│ string_col@0 as string_col│ Review Comment: I think it'll be helpful to keep the index `@0`, what do you think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Example for using a separate threadpool for CPU bound work (try 2) [datafusion]
alamb commented on PR #14286: URL: https://github.com/apache/datafusion/pull/14286#issuecomment-2733529188 Converting to a draft until we hav spawn service - https://github.com/apache/arrow-rs/pull/7253 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Remove inline table scan analyzer rule [datafusion]
jayzhan211 commented on code in PR #15201: URL: https://github.com/apache/datafusion/pull/15201#discussion_r2000904176 ## datafusion/core/tests/dataframe/mod.rs: ## @@ -1571,14 +1571,18 @@ async fn with_column_join_same_columns() -> Result<()> { assert_snapshot!( df_with_column.logical_plan(), -@r###" +@r" Projection: t1.c1, t2.c1, Boolean(true) AS new_column Limit: skip=0, fetch=1 Sort: t1.c1 ASC NULLS FIRST Inner Join: t1.c1 = t2.c1 -TableScan: t1 -TableScan: t2 -"### +SubqueryAlias: t1 Review Comment: I explain it in the above https://github.com/apache/datafusion/pull/15201#discussion_r1996487546 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Triggering extended tests through PR comment [datafusion]
Omega359 commented on PR #15101: URL: https://github.com/apache/datafusion/pull/15101#issuecomment-2733251482 Is this ready for review or is there something outstanding for it to be still in draft? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Simplify display format of `AggregateFunctionExpr`, add `Expr::sql_name` [datafusion]
irenjj commented on PR #15253: URL: https://github.com/apache/datafusion/pull/15253#issuecomment-2733266739 Thanks @alamb, @jayzhan211 and @xudong963 for your review, here are two points that remain unclear: 1. For GROUP BY, is it necessary to preserve the row index -- for more information, additional functionality, or just better aesthetics? 2. Should the default Display implementation of Expr be modified to replace SqlDisplay? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] minor: fix `data/sqlite` link [datafusion]
sdht0 commented on PR #15286: URL: https://github.com/apache/datafusion/pull/15286#issuecomment-2733900724 Weird that I received an email about another comment from Weijun-H but can't see it here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Simplify display format of `AggregateFunctionExpr`, add `Expr::sql_name` [datafusion]
irenjj commented on PR #15253: URL: https://github.com/apache/datafusion/pull/15253#issuecomment-2732945218 > Is it possible to modify `Display` for Expr for explain statement? I haven't tried it, not sure how it will affect other logic. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Fix multi-lines printing issue for datafusion-cli and add the streaming printing feature back [datafusion]
blaginin commented on code in PR #14954: URL: https://github.com/apache/datafusion/pull/14954#discussion_r2001448048 ## datafusion-cli/tests/cli_integration.rs: ## @@ -51,6 +51,163 @@ fn init() { ["--command", "show datafusion.execution.batch_size", "--format", "json", "-q", "-b", "1"], "[{\"name\":\"datafusion.execution.batch_size\",\"value\":\"1\"}]\n" )] + Review Comment: you should be able to use snapshots now (merge latest main to see the changes in that file). lmk if you need help -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Require Comet 0.6 Docker image for Spark 3.5.5 [datafusion-comet]
RaghavendraGanesh commented on issue #1509: URL: https://github.com/apache/datafusion-comet/issues/1509#issuecomment-2733053103 Thanks @andygrove , will give it a try. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [D] enum Expr extension on logical level [datafusion]
GitHub user bertvermeiren edited a discussion: enum Expr extension on logical level Hi, In order to write some additional logical and physical plan implementations, we do have to create some kind of "composite" logical expression. Nowadays in the code base you have already existing expression implementations with such composite behaviour like the `ScalarFunction` expression containing other expressions (the arguments) ; the `AggregateFunction` containing _filtering_ and _order by_ expressions, etc. Lot's of interfaces/API/traits are working on the `enum Expr` directly : `UserLogicalNode::with_exprs_and_inputs` and related `UserLogicalNode::expressions` Applying optimisation rules on those logical plans by analysing and potentially rewriting expressions. By not having some customisation, we loose the internal semantics of “composite” logical expressions if those have to be “decomposed” again into existing expression implementations to be used throughout the codebase. Would it be opportune the have an extension point on logical expression as well, like it has been introduced on the `LogicalPlan::Extension(Extension)`? What are the opinions on this ? Are there other alternatives ? How would such an extension look like ? Regards, Bert. GitHub link: https://github.com/apache/datafusion/discussions/15297 This is an automatically sent email for github@datafusion.apache.org. To unsubscribe, please send an email to: github-unsubscr...@datafusion.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Migrate tests to insta [datafusion]
jsai28 commented on code in PR #15288: URL: https://github.com/apache/datafusion/pull/15288#discussion_r2001378037 ## datafusion/core/tests/parquet/schema.rs: ## @@ -82,7 +69,18 @@ async fn schema_merge_ignores_metadata_by_default() { .unwrap(); let actual = df.collect().await.unwrap(); -assert_batches_sorted_eq!(expected, &actual); +assert_snapshot!(batches_to_sort_string(&actual), @r" Review Comment: change to batches_to_string -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Migrate tests to insta [datafusion]
jsai28 commented on code in PR #15288: URL: https://github.com/apache/datafusion/pull/15288#discussion_r2001380462 ## datafusion/core/tests/parquet/schema.rs: ## @@ -167,7 +173,15 @@ async fn schema_merge_can_preserve_metadata() { assert_eq!(actual.clone(), expected_metadata); let actual = df.collect().await.unwrap(); -assert_batches_sorted_eq!(expected, &actual); +assert_snapshot!(batches_to_sort_string(&actual), @r" Review Comment: change to batch_to_string -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Migrate tests to insta [datafusion]
jsai28 commented on code in PR #15288: URL: https://github.com/apache/datafusion/pull/15288#discussion_r2001378793 ## datafusion/core/tests/parquet/schema.rs: ## @@ -97,7 +95,18 @@ async fn schema_merge_ignores_metadata_by_default() { .collect() .await .unwrap(); -assert_batches_sorted_eq!(expected, &actual); +assert_snapshot!(batches_to_sort_string(&actual), @r" Review Comment: change to batches_to_string -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Migrate tests to insta [datafusion]
jsai28 commented on code in PR #15288: URL: https://github.com/apache/datafusion/pull/15288#discussion_r2001386538 ## datafusion/core/tests/sql/path_partition.rs: ## @@ -430,21 +381,7 @@ async fn parquet_multiple_partitions() -> Result<()> { .collect() .await?; -let expected = [ -"++-+", -"| id | day |", -"++-+", -"| 0 | 09 |", -"| 1 | 09 |", -"| 2 | 09 |", -"| 3 | 09 |", -"| 4 | 09 |", -"| 5 | 09 |", -"| 6 | 09 |", -"| 7 | 09 |", -"++-+", -]; -assert_batches_sorted_eq!(expected, &result); +assert_snapshot!(batches_to_sort_string(&result)); Review Comment: missing inline snapshot ## datafusion/core/tests/sql/path_partition.rs: ## @@ -476,21 +413,7 @@ async fn parquet_multiple_nonstring_partitions() -> Result<()> { .collect() .await?; -let expected = [ -"++-+", -"| id | day |", -"++-+", -"| 0 | 9 |", -"| 1 | 9 |", -"| 2 | 9 |", -"| 3 | 9 |", -"| 4 | 9 |", -"| 5 | 9 |", -"| 6 | 9 |", -"| 7 | 9 |", -"++-+", -]; -assert_batches_sorted_eq!(expected, &result); +assert_snapshot!(batches_to_sort_string(&result)); Review Comment: missing inline snapshot -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Migrate tests to insta [datafusion]
jsai28 commented on code in PR #15288: URL: https://github.com/apache/datafusion/pull/15288#discussion_r2001387282 ## datafusion/core/tests/sql/select.rs: ## @@ -30,23 +30,7 @@ async fn test_list_query_parameters() -> Result<()> { .with_param_values(vec![ScalarValue::from(3i32)])? .collect() .await?; -let expected = vec![ -"+++---+", -"| c1 | c2 | c3|", -"+++---+", -"| 3 | 1 | false |", -"| 3 | 10 | true |", -"| 3 | 2 | true |", -"| 3 | 3 | false |", -"| 3 | 4 | true |", -"| 3 | 5 | false |", -"| 3 | 6 | true |", -"| 3 | 7 | false |", -"| 3 | 8 | true |", -"| 3 | 9 | false |", -"+++---+", -]; -assert_batches_sorted_eq!(expected, &results); +assert_snapshot!(batches_to_sort_string(&results)); Review Comment: missing inline snapshot ## datafusion/core/tests/sql/select.rs: ## @@ -66,33 +50,7 @@ async fn test_named_query_parameters() -> Result<()> { ])? .collect() .await?; -let expected = vec![ -"+++", -"| c1 | c2 |", -"+++", -"| 1 | 1 |", -"| 1 | 2 |", -"| 1 | 3 |", -"| 1 | 4 |", -"| 1 | 5 |", -"| 1 | 6 |", -"| 1 | 7 |", -"| 1 | 8 |", -"| 1 | 9 |", -"| 1 | 10 |", -"| 2 | 1 |", -"| 2 | 2 |", -"| 2 | 3 |", -"| 2 | 4 |", -"| 2 | 5 |", -"| 2 | 6 |", -"| 2 | 7 |", -"| 2 | 8 |", -"| 2 | 9 |", -"| 2 | 10 |", -"+++", -]; -assert_batches_sorted_eq!(expected, &results); +assert_snapshot!(batches_to_sort_string(&results)); Review Comment: missing inline snapshot -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Migrate tests to insta [datafusion]
jsai28 commented on code in PR #15288: URL: https://github.com/apache/datafusion/pull/15288#discussion_r2001385323 ## datafusion/core/tests/sql/path_partition.rs: ## @@ -313,18 +294,7 @@ async fn csv_filter_with_file_nonstring_col() -> Result<()> { .collect() .await?; -let expected = [ -"++++", -"| c1 | c2 | date |", -"++++", -"| a | 1 | 2021-10-28 |", -"| b | 1 | 2021-10-28 |", -"| b | 5 | 2021-10-28 |", -"| c | 2 | 2021-10-28 |", -"| d | 5 | 2021-10-28 |", -"++++", -]; -assert_batches_sorted_eq!(expected, &result); +assert_snapshot!(batches_to_sort_string(&result)); Review Comment: missing inline snapshot -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Migrate tests to insta [datafusion]
jsai28 commented on code in PR #15288: URL: https://github.com/apache/datafusion/pull/15288#discussion_r2001388172 ## datafusion/core/tests/sql/select.rs: ## @@ -114,33 +72,7 @@ async fn test_prepare_statement() -> Result<()> { let dataframe = dataframe.with_param_values(param_values)?; let results = dataframe.collect().await?; -let expected = vec![ -"+++", -"| c1 | c2 |", -"+++", -"| 1 | 1 |", -"| 1 | 10 |", -"| 1 | 2 |", -"| 1 | 3 |", -"| 1 | 4 |", -"| 1 | 5 |", -"| 1 | 6 |", -"| 1 | 7 |", -"| 1 | 8 |", -"| 1 | 9 |", -"| 2 | 1 |", -"| 2 | 10 |", -"| 2 | 2 |", -"| 2 | 3 |", -"| 2 | 4 |", -"| 2 | 5 |", -"| 2 | 6 |", -"| 2 | 7 |", -"| 2 | 8 |", -"| 2 | 9 |", -"+++", -]; -assert_batches_sorted_eq!(expected, &results); +assert_snapshot!(batches_to_sort_string(&results)); Review Comment: missing inline snapshot -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Migrate tests to insta [datafusion]
jsai28 commented on code in PR #15288: URL: https://github.com/apache/datafusion/pull/15288#discussion_r2001376856 ## datafusion/core/tests/parquet/custom_reader.rs: ## @@ -96,17 +97,15 @@ async fn route_data_access_ops_to_parquet_file_reader_factory() { let task_ctx = session_ctx.task_ctx(); let read = collect(parquet_exec, task_ctx).await.unwrap(); -let expected = [ -"+-+++", -"| c1 | c2 | c3 |", -"+-+++", -"| Foo | 1 | 10 |", -"| | 2 | 20 |", -"| bar |||", -"+-+++", -]; - -assert_batches_sorted_eq!(expected, &read); +assert_snapshot!(batches_to_sort_string(&read), @r" Review Comment: double check this -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Migrate user_defined tests to insta [datafusion]
blaginin commented on code in PR #15255: URL: https://github.com/apache/datafusion/pull/15255#discussion_r2000955601 ## datafusion/core/tests/user_defined/expr_planner.rs: ## @@ -73,52 +73,62 @@ async fn plan_and_collect(sql: &str) -> Result> { ctx.sql(sql).await?.collect().await } +fn fmt_batches(batches: &[RecordBatch]) -> String { +use arrow::util::pretty::pretty_format_batches; +match pretty_format_batches(batches) { +Ok(formatted) => formatted.to_string(), +Err(e) => format!("Error formatting record batches: {}", e), +} +} Review Comment: can we use `batches_to_string`Â for that? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add CatalogProvider and SchemaProvider to FFI Crate [datafusion]
alamb commented on PR #15280: URL: https://github.com/apache/datafusion/pull/15280#issuecomment-2732820890 🎉 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Migrate user_defined tests to insta [datafusion]
blaginin commented on code in PR #15255: URL: https://github.com/apache/datafusion/pull/15255#discussion_r2000968071 ## datafusion/core/tests/user_defined/user_defined_table_functions.rs: ## @@ -34,11 +34,19 @@ use datafusion::physical_plan::{collect, ExecutionPlan}; use datafusion::prelude::SessionContext; use datafusion_catalog::Session; use datafusion_catalog::TableFunctionImpl; -use datafusion_common::{assert_batches_eq, DFSchema, ScalarValue}; +use datafusion_common::{DFSchema, ScalarValue}; use datafusion_expr::{EmptyRelation, Expr, LogicalPlan, Projection, TableType}; use async_trait::async_trait; +fn fmt_batches(batches: &[RecordBatch]) -> String { +use arrow::util::pretty::pretty_format_batches; +match pretty_format_batches(batches) { +Ok(formatted) => formatted.to_string(), +Err(e) => format!("Error formatting record batches: {}", e), +} +} + Review Comment: and this -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Migrate user_defined tests to insta [datafusion]
blaginin commented on code in PR #15255: URL: https://github.com/apache/datafusion/pull/15255#discussion_r2000967075 ## datafusion/core/tests/user_defined/user_defined_window_functions.rs: ## @@ -57,30 +57,38 @@ const BOUNDED_WINDOW_QUERY: &str = odd_counter(val) OVER (PARTITION BY x ORDER BY y ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) \ from t ORDER BY x, y"; -/// Test to show the contents of the setup +fn fmt_batches(batches: &[RecordBatch]) -> String { +use arrow::util::pretty::pretty_format_batches; +match pretty_format_batches(batches) { +Ok(formatted) => formatted.to_string(), +Err(e) => format!("Error formatting record batches: {}", e), +} +} + Review Comment: i think this can be removed as well -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Update all github workflow to use actions tied to sha hashes [datafusion]
Omega359 opened a new issue, #15298: URL: https://github.com/apache/datafusion/issues/15298 ### Is your feature request related to a problem or challenge? A recent [supply chain attack](https://arstechnica.com/information-technology/2025/03/supply-chain-attack-exposing-credentials-affects-23k-users-of-tj-actions/) has made it extremely apparent that github workflows should only use actions that are tied to a specific hash, not a version. This applies to any non-github, non-apache action of which there seems to be a few: - [dev.yml](https://github.com/apache/datafusion/blob/main/.github/workflows/dev.yml) -> - uses: korandoru/hawkeye@v6 - [rust.yml](https://github.com/apache/datafusion/blob/main/.github/workflows/rust.yml) -> - uses: korandoru/hawkeye@v6 - [setup-macos-aarch64-builder/action.yaml](https://github.com/apache/datafusion/blob/main/.github/actions/setup-macos-aarch64-builder/action.yaml) -> uses: Swatinem/rust-cache@v2 - [setup-rust-runtime/action.yaml](https://github.com/apache/datafusion/blob/main/.github/actions/setup-rust-runtime/action.yaml) -> uses: mozilla-actions/sccache-action@v0.0.4 an example of how to use a sha hash instead of a version can be seen in the extended.yml file: `uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be` ### Describe the solution you'd like _No response_ ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add GLOBAL context/modifier to SET statements [datafusion-sqlparser-rs]
MohamedAbdeen21 commented on code in PR #1767: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1767#discussion_r2000753514 ## src/ast/mod.rs: ## @@ -7919,11 +7921,28 @@ impl fmt::Display for ContextModifier { write!(f, "") } Self::Local => { -write!(f, " LOCAL") +write!(f, "LOCAL ") } Self::Session => { -write!(f, " SESSION") +write!(f, "SESSION ") } +Self::Global => { +write!(f, "GLOBAL ") +} +} +} +} + +impl From> for ContextModifier { +fn from(kw: Option) -> Self { +match kw { +Some(kw) => match kw { +Keyword::LOCAL => Self::Local, +Keyword::SESSION => Self::Session, +Keyword::GLOBAL => Self::Global, +_ => Self::None, +}, +None => Self::None, } } Review Comment: I'm not following. Instead of an `impl From> for ContextModifier` you'd prefer a `fn into_modifier(k: Option) -> ContextModifier`?e -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Update all github workflow to use actions tied to sha hashes [datafusion]
alamb commented on issue #15298: URL: https://github.com/apache/datafusion/issues/15298#issuecomment-2734055350 Thank you @Omega359 -- I agree this is very important -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Fix predicate pushdown for custom SchemaAdapters [datafusion]
alamb commented on code in PR #15263: URL: https://github.com/apache/datafusion/pull/15263#discussion_r200126 ## datafusion/core/src/datasource/physical_plan/parquet.rs: ## @@ -224,6 +224,327 @@ mod tests { ) } +#[tokio::test] +async fn test_pushdown_with_missing_column_in_file() { +let c1: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3])); + +let file_schema = +Arc::new(Schema::new(vec![Field::new("c1", DataType::Int32, true)])); + +let table_schema = Arc::new(Schema::new(vec![ +Field::new("c1", DataType::Int32, true), +Field::new("c2", DataType::Int32, true), +])); + +let batch = RecordBatch::try_new(file_schema.clone(), vec![c1]).unwrap(); + +// Since c2 is missing from the file and we didn't supply a custom `SchemaAdapterFactory`, +// the default behavior is to fill in missing columns with nulls. +// Thus this predicate will come back as false. +let filter = col("c2").eq(lit(1_i32)); +let rt = RoundTrip::new() +.with_schema(table_schema.clone()) +.with_predicate(filter.clone()) +.with_pushdown_predicate() +.round_trip(vec![batch.clone()]) +.await; +let total_rows = rt +.batches +.unwrap() +.iter() +.map(|b| b.num_rows()) +.sum::(); +assert_eq!(total_rows, 0, "Expected no rows to match the predicate"); +let metrics = rt.parquet_exec.metrics().unwrap(); +let metric = get_value(&metrics, "pushdown_rows_pruned"); +assert_eq!(metric, 3, "Expected all rows to be pruned"); + +// If we excplicitly allow nulls the rest of the predicate should work +let filter = col("c2").is_null().and(col("c1").eq(lit(1_i32))); +let rt = RoundTrip::new() +.with_schema(table_schema.clone()) +.with_predicate(filter.clone()) +.with_pushdown_predicate() +.round_trip(vec![batch.clone()]) +.await; +let batches = rt.batches.unwrap(); +#[rustfmt::skip] +let expected = [ +"+++", +"| c1 | c2 |", +"+++", +"| 1 ||", +"+++", +]; +assert_batches_sorted_eq!(expected, &batches); +let metrics = rt.parquet_exec.metrics().unwrap(); +let metric = get_value(&metrics, "pushdown_rows_pruned"); +assert_eq!(metric, 2, "Expected all rows to be pruned"); +} + +#[tokio::test] +async fn test_pushdown_with_missing_column_in_file_multiple_types() { +let c1: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3])); + +let file_schema = +Arc::new(Schema::new(vec![Field::new("c1", DataType::Int32, true)])); + +let table_schema = Arc::new(Schema::new(vec![ +Field::new("c1", DataType::Int32, true), +Field::new("c2", DataType::Utf8, true), +])); + +let batch = RecordBatch::try_new(file_schema.clone(), vec![c1]).unwrap(); + +// Since c2 is missing from the file and we didn't supply a custom `SchemaAdapterFactory`, +// the default behavior is to fill in missing columns with nulls. +// Thus this predicate will come back as false. +let filter = col("c2").eq(lit("abc")); +let rt = RoundTrip::new() +.with_schema(table_schema.clone()) +.with_predicate(filter.clone()) +.with_pushdown_predicate() +.round_trip(vec![batch.clone()]) +.await; +let total_rows = rt +.batches +.unwrap() +.iter() +.map(|b| b.num_rows()) +.sum::(); +assert_eq!(total_rows, 0, "Expected no rows to match the predicate"); +let metrics = rt.parquet_exec.metrics().unwrap(); +let metric = get_value(&metrics, "pushdown_rows_pruned"); +assert_eq!(metric, 3, "Expected all rows to be pruned"); + +// If we excplicitly allow nulls the rest of the predicate should work +let filter = col("c2").is_null().and(col("c1").eq(lit(1_i32))); +let rt = RoundTrip::new() +.with_schema(table_schema.clone()) +.with_predicate(filter.clone()) +.with_pushdown_predicate() +.round_trip(vec![batch.clone()]) +.await; +let batches = rt.batches.unwrap(); +#[rustfmt::skip] +let expected = [ +"+++", +"| c1 | c2 |", +"+++", +"| 1 ||", +"+++", +]; +assert_batches_sorted_eq!(expected, &batches); +let metrics = rt.parquet_exec.metrics().unwrap(); +let metric = get_value(&metrics, "pushdown_rows_pruned"); +assert_eq!(metric, 2, "Expected all rows to be pruned"); +} +
Re: [I] Add support for S3 Object Store in default binaries [datafusion-ballista]
milenkovicm commented on issue #1205: URL: https://github.com/apache/datafusion-ballista/issues/1205#issuecomment-2734056479 once we have s3 support it should work with minio. would you be interested in contributing @fithisux ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Update all github workflow to use actions tied to sha hashes [datafusion]
alamb commented on issue #15298: URL: https://github.com/apache/datafusion/issues/15298#issuecomment-2734058015 I think this is a good first issue as the write up is clear and there is an example to follow -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Failed optimizations with Int64 type [datafusion]
alamb commented on issue #15291: URL: https://github.com/apache/datafusion/issues/15291#issuecomment-2734059396 Thanks @aectaan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Failed optimizations with Int64 type [datafusion]
alamb commented on issue #15291: URL: https://github.com/apache/datafusion/issues/15291#issuecomment-2734063672 I wonder if you can get a pure SQL (`datafusion-cli` based) reproducer? Or does it require creating and configuring a custom context / optimizer rules 🤔 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] feat: support merge for `Distribution` [datafusion]
xudong963 opened a new pull request, #15296: URL: https://github.com/apache/datafusion/pull/15296 ## Which issue does this PR close? - Closes https://github.com/apache/datafusion/issues/15290 ## Rationale for this change See issue #15290 ## What changes are included in this PR? ## Are these changes tested? I'll do it after we are consistent. ## Are there any user-facing changes? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Improve performance of `first_value` by implementing special `GroupsAccumulator` [datafusion]
2010YOUY01 commented on code in PR #15266: URL: https://github.com/apache/datafusion/pull/15266#discussion_r2000593852 ## datafusion/functions-aggregate/src/first_last.rs: ## @@ -179,6 +292,423 @@ impl AggregateUDFImpl for FirstValue { } } +struct FirstGroupsAccumulator +where +T: ArrowPrimitiveType + Send, +{ +// state === +vals: Vec, +// Stores ordering values, of the aggregator requirement corresponding to first value +// of the aggregator. +// The `orderings` are stored row-wise, meaning that `orderings[group_idx]` +// represents the ordering values corresponding to the `group_idx`-th group. +orderings: Vec>, +// At the beginning, `is_sets[group_idx]` is false, which means `first` is not seen yet. +// Once we see the first value, we set the `is_sets[group_idx]` flag +is_sets: BooleanBufferBuilder, +// null_builder[group_idx] == false => vals[group_idx] is null +null_builder: BooleanBufferBuilder, +// size of `self.orderings` +// Calculating the memory usage of `self.orderings` using `ScalarValue::size_of_vec` is quite costly. +// Therefore, we cache it and compute `size_of` only after each update +// to avoid calling `ScalarValue::size_of_vec` by Self.size. +size_of_orderings: usize, + +// === option + +// Stores the applicable ordering requirement. +ordering_req: LexOrdering, +// derived from `ordering_req`. +sort_options: Vec, +// Stores whether incoming data already satisfies the ordering requirement. +input_requirement_satisfied: bool, +// Ignore null values. +ignore_nulls: bool, +/// The output type +data_type: DataType, +default_orderings: Vec, +} + +impl FirstGroupsAccumulator +where +T: ArrowPrimitiveType + Send, +{ +fn try_new( +ordering_req: LexOrdering, +ignore_nulls: bool, +data_type: &DataType, +ordering_dtypes: &[DataType], +) -> Result { +let requirement_satisfied = ordering_req.is_empty(); + +let default_orderings = ordering_dtypes +.iter() +.map(ScalarValue::try_from) +.collect::>>()?; + +let sort_options = get_sort_options(ordering_req.as_ref()); + +Ok(Self { +null_builder: BooleanBufferBuilder::new(0), +ordering_req, +sort_options, +input_requirement_satisfied: requirement_satisfied, +ignore_nulls, +default_orderings, +data_type: data_type.clone(), +vals: Vec::new(), +orderings: Vec::new(), +is_sets: BooleanBufferBuilder::new(0), +size_of_orderings: 0, +}) +} + +fn need_update(&self, group_idx: usize) -> bool { +if !self.is_sets.get_bit(group_idx) { +return true; +} + +if self.ignore_nulls && !self.null_builder.get_bit(group_idx) { +return true; +} + +!self.input_requirement_satisfied +} + +fn should_update_state( +&self, +group_idx: usize, +new_ordering_values: &[ScalarValue], +) -> Result { +if !self.is_sets.get_bit(group_idx) { +return Ok(true); +} + +assert!(new_ordering_values.len() == self.ordering_req.len()); +let current_ordering = &self.orderings[group_idx]; +compare_rows(current_ordering, new_ordering_values, &self.sort_options) +.map(|x| x.is_gt()) +} + +fn take_orderings(&mut self, emit_to: EmitTo) -> Vec> { +let result = emit_to.take_needed(&mut self.orderings); + +match emit_to { +EmitTo::All => self.size_of_orderings = 0, +EmitTo::First(_) => { +self.size_of_orderings -= +result.iter().map(ScalarValue::size_of_vec).sum::() +} +} + +result +} + +fn take_need( +bool_buf_builder: &mut BooleanBufferBuilder, +emit_to: EmitTo, +) -> BooleanBuffer { +let bool_buf = bool_buf_builder.finish(); +match emit_to { +EmitTo::All => bool_buf, +EmitTo::First(n) => { +// split off the first N values in seen_values +// +// TODO make this more efficient rather than two +// copies and bitwise manipulation +let first_n: BooleanBuffer = bool_buf.iter().take(n).collect(); +// reset the existing buffer +for b in bool_buf.iter().skip(n) { +bool_buf_builder.append(b); +} +first_n +} +} +} + +fn resize_states(&mut self, new_size: usize) { +self.vals.resize(new_size, T::default_value()); + +if self.null_builder.len() < new_size { +self.null_builder +.append_n(new_size - self.null
Re: [PR] Simplify display format of `AggregateFunctionExpr`, add `Expr::sql_name` [datafusion]
irenjj commented on code in PR #15253: URL: https://github.com/apache/datafusion/pull/15253#discussion_r2000877388 ## datafusion/sqllogictest/test_files/explain_tree.slt: ## @@ -202,53 +202,48 @@ physical_plan 02)│ AggregateExec │ 03)│ │ 04)│ aggr: sum(bigint_col) │ -05)│ │ -06)│ group_by: │ -07)│ string_col@0 as string_col│ Review Comment: It's true that column indices provide precise tracking and tracing capabilities for complex query analysis, but I think tree explain should be as simple as possible🤔, as doc says: ``` /// TreeRender, displayed in the `tree` explain type. /// /// This format is inspired by DuckDB's explain plans. The information /// presented should be "user friendly", and contain only the most relevant /// information for understanding a plan. It should NOT contain the same level /// of detail information as the [`Self::Default`] format. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore(deps): bump async-trait from 0.1.87 to 0.1.88 [datafusion]
xudong963 merged PR #15294: URL: https://github.com/apache/datafusion/pull/15294 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [D] Does DataFusion Support JSON Path Filtering Like `jsonb_path_exists` in PostgreSQL? [datafusion]
GitHub user dadepo added a comment to the discussion: Does DataFusion Support JSON Path Filtering Like `jsonb_path_exists` in PostgreSQL? > You could create a function called `jsonb_path_exists` that takes a binary > column and a json path string perhaps? I think what I am missing is how this can be used via the expression API. Using sql, I know I can do something like: ``` SessionContext.sql("select jsonb_path_exists(d_col, path)") ``` But via the expression API, how will that look? ie ``` df.filter(col('d_col')...) ``` How will that look? GitHub link: https://github.com/apache/datafusion/discussions/15264#discussioncomment-12534943 This is an automatically sent email for github@datafusion.apache.org. To unsubscribe, please send an email to: github-unsubscr...@datafusion.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [EPIC] A collection of tickets for improved WASM support in DataFusion [datafusion]
savaliyabhargav commented on issue #13815: URL: https://github.com/apache/datafusion/issues/13815#issuecomment-2732178094 @matthewmturner yes sure i am interested can you please give me more detail about it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Support `merge` for `Distribution` [datafusion]
xudong963 opened a new issue, #15290: URL: https://github.com/apache/datafusion/issues/15290 ### Is your feature request related to a problem or challenge? I'm working on the ticket: https://github.com/apache/datafusion/issues/10316. Given that, we'll replace all `Precision` with `Distribution`: https://github.com/synnada-ai/datafusion-upstream/pull/63. So, while I make the design for #10316, I presumably use `Distribution` in statistics. There is a spot where I'll do the `merge` for statistics, and it'll be spread to the `Distribution`. The specific case is that I need to compute the partition-level statistics, aka, files will be grouped as the filegroup, each file group will be treated as a partition, and different partitions will be processed in parallel. So, the partition-level statistics will be from the merge of the files in a filegroup. ### Describe the solution you'd like Create a function that combines their statistical properties into a new distribution. The most appropriate approach is to create a GenericDistribution that approximates the mixture of the two input distributions. ```rust pub fn merge_distributions(a: &Distribution, b: &Distribution) -> Result { ... } ``` --- I'll open a PR and we can do more discussions based on the PR. ### Describe alternatives you've considered No ### Additional context No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Update links for released version [datafusion-comet]
andygrove commented on code in PR #1540: URL: https://github.com/apache/datafusion-comet/pull/1540#discussion_r2001302504 ## docs/source/user-guide/kubernetes.md: ## @@ -65,31 +65,31 @@ metadata: spec: type: Scala mode: cluster - image: ghcr.io/apache/datafusion-comet:spark-3.4-scala-2.12-0.5.0 + image: apache/datafusion-comet:0.7.0-spark3.5.4-scala2.12-java11 Review Comment: I have not actually tested the changes in this doc. @comphead does this look correct to you? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Add support for S3 Object Store in default binaries [datafusion-ballista]
fithisux commented on issue #1205: URL: https://github.com/apache/datafusion-ballista/issues/1205#issuecomment-2733646600 I would rather have minio support because they provide full working docker images that you incorporate in a full fledge docker compose educational stack. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Failed optimizations with Int64 type [datafusion]
aectaan commented on issue #15291: URL: https://github.com/apache/datafusion/issues/15291#issuecomment-2732112476 @alamb it's related to common subexpr eliminate: ``` called `Result::unwrap()` on an `Err` value: Context("Optimizer rule 'common_sub_expression_eliminate' failed", SchemaError(FieldNotFound { field: Column { relation: None, name: "count(test.col_utf8) FILTER (WHERE $1 - Int64(1) <= test.col_int64)" }, valid_fields: [Column { relation: None, name: "count(test.col_utf8) FILTER (WHERE __common_expr_1 AS $1 - Int64(1) <= test.col_int64)" }, Column { relation: None, name: "count(test.col_utf8) FILTER (WHERE $1 - Int64(2) <= test.col_int64 AND test.col_uint32 >= Int64(0))" }, Column { relation: None, name: "count(test.col_utf8) FILTER (WHERE $1 - Int64(2) <= test.col_int64 AND test.col_uint32 >= Int64(0) AND test.col_uint32 >= Int64(0))" }] }, Some(""))) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Enhance Schema adapter to accommodate evolving struct [datafusion]
kosiew opened a new pull request, #15295: URL: https://github.com/apache/datafusion/pull/15295 ## Which issue does this PR close? - Closes #14757. ## Rationale for this change This PR introduces a `NestedStructSchemaAdapter` to improve schema evolution handling in DataFusion when dealing with nested struct types. Currently, schema evolution primarily supports flat schemas, but evolving nested structures (such as adding new fields to existing structs) requires special handling. This change ensures better compatibility and adaptability for evolving datasets. ## What changes are included in this PR? - Introduces `NestedStructSchemaAdapter` to handle schema evolution for nested struct fields. - Implements `NestedStructSchemaAdapterFactory` to determine whether the specialized adapter is needed based on schema characteristics. - Enhances `SchemaMapping` with a new constructor for improved usability. - Updates `schema_adapter.rs` and integrates the new adapter into the `datafusion_datasource` module. - Adds comprehensive unit tests to verify the correctness of schema adaptation, including nested struct evolution scenarios. ## Are these changes tested? Yes, extensive unit tests have been added to verify: - Proper mapping of fields, including added and missing nested struct fields. - Correct adaptation from flat schemas to nested schemas. - Validation of different adapter selection logic based on schema characteristics. ## Are there any user-facing changes? No breaking changes. However, users working with evolving nested struct schemas will benefit from improved support for automatic schema adaptation. This enhances compatibility with sources like Parquet, where schemas may change over time. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Enhance Schema adapter to accommodate evolving struct [datafusion]
kosiew commented on code in PR #15295: URL: https://github.com/apache/datafusion/pull/15295#discussion_r2000486806 ## datafusion/datasource/src/nested_schema_adapter.rs: ## @@ -0,0 +1,582 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +//! [`SchemaAdapter`] and [`SchemaAdapterFactory`] to adapt file-level record batches to a table schema. +//! +//! Adapter provides a method of translating the RecordBatches that come out of the +//! physical format into how they should be used by DataFusion. For instance, a schema +//! can be stored external to a parquet file that maps parquet logical types to arrow types. + +use arrow::datatypes::{DataType, Field, Fields, Schema, SchemaRef}; +use datafusion_common::Result; +use std::collections::HashMap; +use std::sync::Arc; + +use crate::schema_adapter::DefaultSchemaAdapterFactory; +use crate::schema_adapter::SchemaAdapter; +use crate::schema_adapter::SchemaAdapterFactory; +use crate::schema_adapter::SchemaMapper; +use crate::schema_adapter::SchemaMapping; + +/// Factory for creating [`NestedStructSchemaAdapter`] +/// +/// This factory creates schema adapters that properly handle schema evolution +/// for nested struct fields, allowing new fields to be added to struct columns +/// over time. +#[derive(Debug, Clone, Default)] +pub struct NestedStructSchemaAdapterFactory; + +impl SchemaAdapterFactory for NestedStructSchemaAdapterFactory { +fn create( +&self, +projected_table_schema: SchemaRef, +table_schema: SchemaRef, +) -> Box { +Box::new(NestedStructSchemaAdapter::new( +projected_table_schema, +table_schema, +)) +} +} + +impl NestedStructSchemaAdapterFactory { +/// Create a new factory for mapping batches from a file schema to a table +/// schema with support for nested struct evolution. +/// +/// This is a convenience method that handles nested struct fields properly. +pub fn from_schema(table_schema: SchemaRef) -> Box { +Self.create(Arc::clone(&table_schema), table_schema) +} + +/// Determines if a schema contains nested struct fields that would benefit +/// from special handling during schema evolution +pub fn has_nested_structs(schema: &Schema) -> bool { +schema +.fields() +.iter() +.any(|field| matches!(field.data_type(), DataType::Struct(_))) +} + +/// Create an appropriate schema adapter based on schema characteristics. +/// Returns a NestedStructSchemaAdapter if the projected schema contains nested structs, +/// otherwise returns a DefaultSchemaAdapter. +pub fn create_appropriate_adapter( +projected_table_schema: SchemaRef, +table_schema: SchemaRef, +) -> Box { +// Use nested adapter if target has nested structs +if Self::has_nested_structs(table_schema.as_ref()) { +NestedStructSchemaAdapterFactory.create(projected_table_schema, table_schema) +} else { +// Default case for simple schemas +DefaultSchemaAdapterFactory.create(projected_table_schema, table_schema) +} +} +} + +/// A SchemaAdapter that handles schema evolution for nested struct types +#[derive(Debug, Clone)] +pub struct NestedStructSchemaAdapter { +/// The schema for the table, projected to include only the fields being output (projected) by the +/// associated ParquetSource +projected_table_schema: SchemaRef, +/// The entire table schema for the table we're using this to adapt. +/// +/// This is used to evaluate any filters pushed down into the scan +/// which may refer to columns that are not referred to anywhere +/// else in the plan. +table_schema: SchemaRef, +} + +impl NestedStructSchemaAdapter { +/// Create a new NestedStructSchemaAdapter with the target schema +pub fn new(projected_table_schema: SchemaRef, table_schema: SchemaRef) -> Self { +Self { +projected_table_schema, +table_schema, +} +} + +pub fn projected_table_schema(&self) -> &Schema { +self.projected_table_schema.as_ref() +} + +pub fn table_schema(&self) -> &Schema {
Re: [PR] Enhance Schema adapter to accommodate evolving struct [datafusion]
kosiew commented on code in PR #15295: URL: https://github.com/apache/datafusion/pull/15295#discussion_r2000486806 ## datafusion/datasource/src/nested_schema_adapter.rs: ## @@ -0,0 +1,582 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +//! [`SchemaAdapter`] and [`SchemaAdapterFactory`] to adapt file-level record batches to a table schema. +//! +//! Adapter provides a method of translating the RecordBatches that come out of the +//! physical format into how they should be used by DataFusion. For instance, a schema +//! can be stored external to a parquet file that maps parquet logical types to arrow types. + +use arrow::datatypes::{DataType, Field, Fields, Schema, SchemaRef}; +use datafusion_common::Result; +use std::collections::HashMap; +use std::sync::Arc; + +use crate::schema_adapter::DefaultSchemaAdapterFactory; +use crate::schema_adapter::SchemaAdapter; +use crate::schema_adapter::SchemaAdapterFactory; +use crate::schema_adapter::SchemaMapper; +use crate::schema_adapter::SchemaMapping; + +/// Factory for creating [`NestedStructSchemaAdapter`] +/// +/// This factory creates schema adapters that properly handle schema evolution +/// for nested struct fields, allowing new fields to be added to struct columns +/// over time. +#[derive(Debug, Clone, Default)] +pub struct NestedStructSchemaAdapterFactory; + +impl SchemaAdapterFactory for NestedStructSchemaAdapterFactory { +fn create( +&self, +projected_table_schema: SchemaRef, +table_schema: SchemaRef, +) -> Box { +Box::new(NestedStructSchemaAdapter::new( +projected_table_schema, +table_schema, +)) +} +} + +impl NestedStructSchemaAdapterFactory { +/// Create a new factory for mapping batches from a file schema to a table +/// schema with support for nested struct evolution. +/// +/// This is a convenience method that handles nested struct fields properly. +pub fn from_schema(table_schema: SchemaRef) -> Box { +Self.create(Arc::clone(&table_schema), table_schema) +} + +/// Determines if a schema contains nested struct fields that would benefit +/// from special handling during schema evolution +pub fn has_nested_structs(schema: &Schema) -> bool { +schema +.fields() +.iter() +.any(|field| matches!(field.data_type(), DataType::Struct(_))) +} + +/// Create an appropriate schema adapter based on schema characteristics. +/// Returns a NestedStructSchemaAdapter if the projected schema contains nested structs, +/// otherwise returns a DefaultSchemaAdapter. +pub fn create_appropriate_adapter( +projected_table_schema: SchemaRef, +table_schema: SchemaRef, +) -> Box { +// Use nested adapter if target has nested structs +if Self::has_nested_structs(table_schema.as_ref()) { +NestedStructSchemaAdapterFactory.create(projected_table_schema, table_schema) +} else { +// Default case for simple schemas +DefaultSchemaAdapterFactory.create(projected_table_schema, table_schema) +} +} +} + +/// A SchemaAdapter that handles schema evolution for nested struct types +#[derive(Debug, Clone)] +pub struct NestedStructSchemaAdapter { +/// The schema for the table, projected to include only the fields being output (projected) by the +/// associated ParquetSource +projected_table_schema: SchemaRef, +/// The entire table schema for the table we're using this to adapt. +/// +/// This is used to evaluate any filters pushed down into the scan +/// which may refer to columns that are not referred to anywhere +/// else in the plan. +table_schema: SchemaRef, +} + +impl NestedStructSchemaAdapter { +/// Create a new NestedStructSchemaAdapter with the target schema +pub fn new(projected_table_schema: SchemaRef, table_schema: SchemaRef) -> Self { +Self { +projected_table_schema, +table_schema, +} +} + +pub fn projected_table_schema(&self) -> &Schema { +self.projected_table_schema.as_ref() +} + +pub fn table_schema(&self) -> &Schema {
Re: [PR] Simplify display format of `AggregateFunctionExpr`, add `Expr::sql_name` [datafusion]
irenjj commented on code in PR #15253: URL: https://github.com/apache/datafusion/pull/15253#discussion_r2001153997 ## datafusion/physical-plan/src/aggregates/mod.rs: ## @@ -801,6 +803,16 @@ impl DisplayAs for AggregateExec { } } DisplayFormatType::TreeRender => { +let format_expr_with_alias = +|(e, alias): &(Arc, String)| -> String { +let expr_sql = fmt_sql(e.as_ref()).to_string(); Review Comment: We have to construct a string to compare with alias. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Attach Diagnostic to "incompatible type in unary expression" error [datafusion]
onlyjackfrost commented on code in PR #15209: URL: https://github.com/apache/datafusion/pull/15209#discussion_r2001158621 ## datafusion/sql/src/expr/unary_op.rs: ## @@ -45,7 +45,18 @@ impl SqlToRel<'_, S> { { Ok(operand) } else { -plan_err!("Unary operator '+' only supports numeric, interval and timestamp types") +plan_err!("Unary operator '+' only supports numeric, interval and timestamp types").map_err(|e| { +let span = operand.spans().and_then(|s| s.first()); +let mut diagnostic = Diagnostic::new_error( +format!("+ cannot be used with {data_type}"), +span +); +if span.is_none() { +diagnostic.add_note("+ can only be used with numbers, intervals, and timestamps", None); +diagnostic.add_help(format!("perhaps you need to cast {operand}"), None); +} Review Comment: ``` # from the reply above note (without span) = "+ can only be used with numbers, intervals, and timestamps" help (without span) = "perhaps you need to cast {expression}" ``` oops... I might have misunderstood your reply. I'll remove the condition from here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add support for `RAISE` statement [datafusion-sqlparser-rs]
iffyio commented on code in PR #1766: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1766#discussion_r2001170116 ## src/ast/mod.rs: ## @@ -2256,6 +2256,57 @@ impl fmt::Display for ConditionalStatements { } } +/// A `RAISE` statement. +/// +/// Examples: +/// ```sql +/// RAISE USING MESSAGE = 'error'; +/// +/// RAISE myerror; +/// ``` +/// +/// [BigQuery](https://cloud.google.com/bigquery/docs/reference/standard-sql/procedural-language#raise) +/// [Snowflake](https://docs.snowflake.com/en/sql-reference/snowflake-scripting/raise) +#[derive(Debug, Clone, PartialEq, PartialOrd, Eq, Ord, Hash)] +#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))] +#[cfg_attr(feature = "visitor", derive(Visit, VisitMut))] +pub struct RaiseStatement { +pub value: Option, +} + +impl fmt::Display for RaiseStatement { +fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { +let RaiseStatement { value } = self; + +write!(f, "RAISE")?; +if let Some(value) = value { +write!(f, " {value}")?; +} + +Ok(()) +} +} + +/// Represents the error value of a [RaiseStatement]. +#[derive(Debug, Clone, PartialEq, PartialOrd, Eq, Ord, Hash)] +#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))] +#[cfg_attr(feature = "visitor", derive(Visit, VisitMut))] +pub enum RaiseStatementValue { +/// `RAISE USING MESSAGE = 'error'` +UsingMessage(Expr), Review Comment: Yeah they can be arbitrary expressions, bigquery for example would accept this sql ```sql begin select 1; exception when error then raise using message = 2+3; end; ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Simplify display format of `AggregateFunctionExpr`, add `Expr::sql_name` [datafusion]
irenjj commented on code in PR #15253: URL: https://github.com/apache/datafusion/pull/15253#discussion_r2001176277 ## datafusion/expr/src/expr.rs: ## @@ -2596,6 +2612,176 @@ impl Display for SchemaDisplay<'_> { } } +struct SqlDisplay<'a>(&'a Expr); +impl Display for SqlDisplay<'_> { Review Comment: Maybe we should take nested expr into consideration, like `aggr(case aggr() when...)` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Attach Diagnostic to "incompatible type in unary expression" error [datafusion]
onlyjackfrost commented on code in PR #15209: URL: https://github.com/apache/datafusion/pull/15209#discussion_r2001158621 ## datafusion/sql/src/expr/unary_op.rs: ## @@ -45,7 +45,18 @@ impl SqlToRel<'_, S> { { Ok(operand) } else { -plan_err!("Unary operator '+' only supports numeric, interval and timestamp types") +plan_err!("Unary operator '+' only supports numeric, interval and timestamp types").map_err(|e| { +let span = operand.spans().and_then(|s| s.first()); +let mut diagnostic = Diagnostic::new_error( +format!("+ cannot be used with {data_type}"), +span +); +if span.is_none() { +diagnostic.add_note("+ can only be used with numbers, intervals, and timestamps", None); +diagnostic.add_help(format!("perhaps you need to cast {operand}"), None); +} Review Comment: # from the reply above > note (without span) = "+ can only be used with numbers, intervals, and timestamps" > help (without span) = "perhaps you need to cast {expression}" oops... I might have misunderstood your reply. I'll remove the condition from here. ## datafusion/sql/src/expr/unary_op.rs: ## @@ -45,7 +45,18 @@ impl SqlToRel<'_, S> { { Ok(operand) } else { -plan_err!("Unary operator '+' only supports numeric, interval and timestamp types") +plan_err!("Unary operator '+' only supports numeric, interval and timestamp types").map_err(|e| { +let span = operand.spans().and_then(|s| s.first()); +let mut diagnostic = Diagnostic::new_error( +format!("+ cannot be used with {data_type}"), +span +); +if span.is_none() { +diagnostic.add_note("+ can only be used with numbers, intervals, and timestamps", None); +diagnostic.add_help(format!("perhaps you need to cast {operand}"), None); +} Review Comment: from the reply above > note (without span) = "+ can only be used with numbers, intervals, and timestamps" > help (without span) = "perhaps you need to cast {expression}" oops... I might have misunderstood your reply. I'll remove the condition from here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add support for `RAISE` statement [datafusion-sqlparser-rs]
iffyio merged PR #1766: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1766 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Attach Diagnostic to "incompatible type in unary expression" error [datafusion]
eliaperantoni commented on code in PR #15209: URL: https://github.com/apache/datafusion/pull/15209#discussion_r2001166961 ## datafusion/sql/src/expr/unary_op.rs: ## @@ -45,7 +45,18 @@ impl SqlToRel<'_, S> { { Ok(operand) } else { -plan_err!("Unary operator '+' only supports numeric, interval and timestamp types") +plan_err!("Unary operator '+' only supports numeric, interval and timestamp types").map_err(|e| { +let span = operand.spans().and_then(|s| s.first()); +let mut diagnostic = Diagnostic::new_error( +format!("+ cannot be used with {data_type}"), +span +); +if span.is_none() { +diagnostic.add_note("+ can only be used with numbers, intervals, and timestamps", None); +diagnostic.add_help(format!("perhaps you need to cast {operand}"), None); +} Review Comment: Ahh sorry! Hahah. Yeah I meant to say that the note and the help should have no `Span` attached to them, in my opinion. But they should always be there. ## datafusion/sql/src/expr/unary_op.rs: ## @@ -45,7 +45,18 @@ impl SqlToRel<'_, S> { { Ok(operand) } else { -plan_err!("Unary operator '+' only supports numeric, interval and timestamp types") +plan_err!("Unary operator '+' only supports numeric, interval and timestamp types").map_err(|e| { +let span = operand.spans().and_then(|s| s.first()); +let mut diagnostic = Diagnostic::new_error( +format!("+ cannot be used with {data_type}"), +span +); +if span.is_none() { +diagnostic.add_note("+ can only be used with numbers, intervals, and timestamps", None); +diagnostic.add_help(format!("perhaps you need to cast {operand}"), None); +} Review Comment: Ahh sorry! Hahah. Yeah I meant to say that the note and the help should have no `Span` attached to them, in my opinion. But they (the note and the help) should always be there. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Attach Diagnostic to "incompatible type in unary expression" error [datafusion]
onlyjackfrost commented on code in PR #15209: URL: https://github.com/apache/datafusion/pull/15209#discussion_r2001158621 ## datafusion/sql/src/expr/unary_op.rs: ## @@ -45,7 +45,18 @@ impl SqlToRel<'_, S> { { Ok(operand) } else { -plan_err!("Unary operator '+' only supports numeric, interval and timestamp types") +plan_err!("Unary operator '+' only supports numeric, interval and timestamp types").map_err(|e| { +let span = operand.spans().and_then(|s| s.first()); +let mut diagnostic = Diagnostic::new_error( +format!("+ cannot be used with {data_type}"), +span +); +if span.is_none() { +diagnostic.add_note("+ can only be used with numbers, intervals, and timestamps", None); +diagnostic.add_help(format!("perhaps you need to cast {operand}"), None); +} Review Comment: > note (without span) = "+ can only be used with numbers, intervals, and timestamps" > help (without span) = "perhaps you need to cast {expression}" oops... I might have misunderstood your reply. I'll remove the condition from here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Improve performance of `first_value` by implementing special `GroupsAccumulator` [datafusion]
blaginin commented on code in PR #15266: URL: https://github.com/apache/datafusion/pull/15266#discussion_r2001206529 ## datafusion/functions-aggregate/src/first_last.rs: ## @@ -179,6 +292,424 @@ impl AggregateUDFImpl for FirstValue { } } +struct FirstPrimitiveGroupsAccumulator +where +T: ArrowPrimitiveType + Send, +{ +// state === +vals: Vec, +// Stores ordering values, of the aggregator requirement corresponding to first value +// of the aggregator. +// The `orderings` are stored row-wise, meaning that `orderings[group_idx]` +// represents the ordering values corresponding to the `group_idx`-th group. +orderings: Vec>, +// At the beginning, `is_sets[group_idx]` is false, which means `first` is not seen yet. +// Once we see the first value, we set the `is_sets[group_idx]` flag +is_sets: BooleanBufferBuilder, +// null_builder[group_idx] == false => vals[group_idx] is null +null_builder: BooleanBufferBuilder, Review Comment: should we use `NullState` for that? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Improve performance of `first_value` by implementing special `GroupsAccumulator` [datafusion]
blaginin commented on code in PR #15266: URL: https://github.com/apache/datafusion/pull/15266#discussion_r2001210201 ## datafusion/functions-aggregate/src/first_last.rs: ## @@ -179,6 +292,424 @@ impl AggregateUDFImpl for FirstValue { } } +struct FirstPrimitiveGroupsAccumulator +where +T: ArrowPrimitiveType + Send, +{ +// state === +vals: Vec, +// Stores ordering values, of the aggregator requirement corresponding to first value +// of the aggregator. +// The `orderings` are stored row-wise, meaning that `orderings[group_idx]` +// represents the ordering values corresponding to the `group_idx`-th group. +orderings: Vec>, +// At the beginning, `is_sets[group_idx]` is false, which means `first` is not seen yet. +// Once we see the first value, we set the `is_sets[group_idx]` flag +is_sets: BooleanBufferBuilder, +// null_builder[group_idx] == false => vals[group_idx] is null +null_builder: BooleanBufferBuilder, +// size of `self.orderings` +// Calculating the memory usage of `self.orderings` using `ScalarValue::size_of_vec` is quite costly. +// Therefore, we cache it and compute `size_of` only after each update +// to avoid calling `ScalarValue::size_of_vec` by Self.size. +size_of_orderings: usize, + +// === option + +// Stores the applicable ordering requirement. +ordering_req: LexOrdering, +// derived from `ordering_req`. +sort_options: Vec, +// Stores whether incoming data already satisfies the ordering requirement. +input_requirement_satisfied: bool, +// Ignore null values. +ignore_nulls: bool, +/// The output type +data_type: DataType, +default_orderings: Vec, +} + +impl FirstPrimitiveGroupsAccumulator +where +T: ArrowPrimitiveType + Send, +{ +fn try_new( +ordering_req: LexOrdering, +ignore_nulls: bool, +data_type: &DataType, +ordering_dtypes: &[DataType], +) -> Result { +let requirement_satisfied = ordering_req.is_empty(); + +let default_orderings = ordering_dtypes +.iter() +.map(ScalarValue::try_from) +.collect::>>()?; + +let sort_options = get_sort_options(ordering_req.as_ref()); + +Ok(Self { +null_builder: BooleanBufferBuilder::new(0), +ordering_req, +sort_options, +input_requirement_satisfied: requirement_satisfied, +ignore_nulls, +default_orderings, +data_type: data_type.clone(), +vals: Vec::new(), +orderings: Vec::new(), +is_sets: BooleanBufferBuilder::new(0), +size_of_orderings: 0, +}) +} + +fn need_update(&self, group_idx: usize) -> bool { +if !self.is_sets.get_bit(group_idx) { +return true; +} + +if self.ignore_nulls && !self.null_builder.get_bit(group_idx) { +return true; +} + +!self.input_requirement_satisfied +} + +fn should_update_state( +&self, +group_idx: usize, +new_ordering_values: &[ScalarValue], +) -> Result { +if !self.is_sets.get_bit(group_idx) { +return Ok(true); +} + +assert!(new_ordering_values.len() == self.ordering_req.len()); +let current_ordering = &self.orderings[group_idx]; +compare_rows(current_ordering, new_ordering_values, &self.sort_options) +.map(|x| x.is_gt()) +} + +fn take_orderings(&mut self, emit_to: EmitTo) -> Vec> { +let result = emit_to.take_needed(&mut self.orderings); + +match emit_to { +EmitTo::All => self.size_of_orderings = 0, +EmitTo::First(_) => { +self.size_of_orderings -= +result.iter().map(ScalarValue::size_of_vec).sum::() +} +} + +result +} + +fn take_need( +bool_buf_builder: &mut BooleanBufferBuilder, +emit_to: EmitTo, +) -> BooleanBuffer { +let bool_buf = bool_buf_builder.finish(); +match emit_to { +EmitTo::All => bool_buf, +EmitTo::First(n) => { +// split off the first N values in seen_values +// +// TODO make this more efficient rather than two +// copies and bitwise manipulation +let first_n: BooleanBuffer = bool_buf.iter().take(n).collect(); +// reset the existing buffer +for b in bool_buf.iter().skip(n) { +bool_buf_builder.append(b); +} +first_n +} +} +} + +fn resize_states(&mut self, new_size: usize) { +self.vals.resize(new_size, T::default_value()); + +if self.null_builder.len() < new_size { +self.null_builder +.append_n(new_
Re: [I] Timeouts reading "large" files from object stores over "slow" connections [datafusion]
alamb commented on issue #15067: URL: https://github.com/apache/datafusion/issues/15067#issuecomment-2733546979 I am convinced this issue would be solved with automatic retries - https://github.com/apache/arrow-rs/issues/7242 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add GLOBAL context/modifier to SET statements [datafusion-sqlparser-rs]
iffyio commented on code in PR #1767: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1767#discussion_r2000290321 ## src/ast/mod.rs: ## @@ -7919,11 +7921,28 @@ impl fmt::Display for ContextModifier { write!(f, "") } Self::Local => { -write!(f, " LOCAL") +write!(f, "LOCAL ") } Self::Session => { -write!(f, " SESSION") +write!(f, "SESSION ") } +Self::Global => { +write!(f, "GLOBAL ") +} +} +} +} + +impl From> for ContextModifier { +fn from(kw: Option) -> Self { +match kw { +Some(kw) => match kw { +Keyword::LOCAL => Self::Local, +Keyword::SESSION => Self::Session, +Keyword::GLOBAL => Self::Global, +_ => Self::None, +}, +None => Self::None, } } Review Comment: I think we can instead turn this into a regular function if the goal is to reuse it, since its not expected to be able to turn an arbitrary keyword into a ContextModifier. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] chore(deps): bump uuid from 1.15.1 to 1.16.0 [datafusion]
dependabot[bot] opened a new pull request, #15292: URL: https://github.com/apache/datafusion/pull/15292 Bumps [uuid](https://github.com/uuid-rs/uuid) from 1.15.1 to 1.16.0. Release notes Sourced from https://github.com/uuid-rs/uuid/releases";>uuid's releases. v1.16.0 What's Changed Mark Uuid::new_v8 const by https://github.com/tguichaoua";>@​tguichaoua in https://redirect.github.com/uuid-rs/uuid/pull/815";>uuid-rs/uuid#815 Prepare for 1.16.0 release by https://github.com/KodrAus";>@​KodrAus in https://redirect.github.com/uuid-rs/uuid/pull/817";>uuid-rs/uuid#817 New Contributors https://github.com/tguichaoua";>@​tguichaoua made their first contribution in https://redirect.github.com/uuid-rs/uuid/pull/815";>uuid-rs/uuid#815 Full Changelog: https://github.com/uuid-rs/uuid/compare/v1.15.1...v1.16.0";>https://github.com/uuid-rs/uuid/compare/v1.15.1...v1.16.0 Commits https://github.com/uuid-rs/uuid/commit/c36beb14d50f835c1f1220117ca51aae64860a3e";>c36beb1 Merge pull request https://redirect.github.com/uuid-rs/uuid/issues/817";>#817 from uuid-rs/cargo/v1.16.0 https://github.com/uuid-rs/uuid/commit/5338b246b7a8244cab3cfaa85b14fe1d1bcdcd96";>5338b24 prepare for 1.16.0 release https://github.com/uuid-rs/uuid/commit/420f6279aeff48f0e12b0b39af43a5c149963382";>420f627 Merge pull request https://redirect.github.com/uuid-rs/uuid/issues/815";>#815 from tguichaoua/new_v8_const https://github.com/uuid-rs/uuid/commit/254258c8c7c7d6c41aaf6f573dc1731549d519b2";>254258c mark Uuid::new_v8 const See full diff in https://github.com/uuid-rs/uuid/compare/v1.15.1...v1.16.0";>compare view [](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] chore(deps): bump rust_decimal from 1.36.0 to 1.37.0 [datafusion]
dependabot[bot] opened a new pull request, #15293: URL: https://github.com/apache/datafusion/pull/15293 Bumps [rust_decimal](https://github.com/paupino/rust-decimal) from 1.36.0 to 1.37.0. Release notes Sourced from https://github.com/paupino/rust-decimal/releases";>rust_decimal's releases. 1.37.0 What's Changed Bumps diesel version by https://github.com/Dylan-DPC";>@​Dylan-DPC in https://redirect.github.com/paupino/rust-decimal/pull/677";>paupino/rust-decimal#677 Fix f64 dec limit edge case by https://github.com/finnbear";>@​finnbear in https://redirect.github.com/paupino/rust-decimal/pull/678";>paupino/rust-decimal#678 Rkyv 0.8 example by https://github.com/Tony-Samuels";>@​Tony-Samuels in https://redirect.github.com/paupino/rust-decimal/pull/683";>paupino/rust-decimal#683 Add explanation of how to enable maths feature by https://github.com/olgabot";>@​olgabot in https://redirect.github.com/paupino/rust-decimal/pull/681";>paupino/rust-decimal#681 Create pub constant Decimal::MAX_SCALE by https://github.com/Tony-Samuels";>@​Tony-Samuels in https://redirect.github.com/paupino/rust-decimal/pull/685";>paupino/rust-decimal#685 Fix panic when parsing improper scientific notation. by https://github.com/schungx";>@​schungx in https://redirect.github.com/paupino/rust-decimal/pull/700";>paupino/rust-decimal#700 Clean up tests to use Ok/Err format by https://github.com/paupino";>@​paupino in https://redirect.github.com/paupino/rust-decimal/pull/703";>paupino/rust-decimal#703 Add dec!() macro by example by https://github.com/daniel-pfeiffer";>@​daniel-pfeiffer in https://redirect.github.com/paupino/rust-decimal/pull/692";>paupino/rust-decimal#692 feat: add rand-0_9 crate feature by https://github.com/robjtede";>@​robjtede in https://redirect.github.com/paupino/rust-decimal/pull/702";>paupino/rust-decimal#702 Cargo.toml: Only enable mysql_backend on diesel, not the whole mysql feature by https://github.com/mcronce";>@​mcronce in https://redirect.github.com/paupino/rust-decimal/pull/707";>paupino/rust-decimal#707 Backwards compatibility for dec!() macro by https://github.com/paupino";>@​paupino in https://redirect.github.com/paupino/rust-decimal/pull/711";>paupino/rust-decimal#711 Specify fixed version for macros dependency by https://github.com/paupino";>@​paupino in https://redirect.github.com/paupino/rust-decimal/pull/712";>paupino/rust-decimal#712 New Contributors https://github.com/Dylan-DPC";>@​Dylan-DPC made their first contribution in https://redirect.github.com/paupino/rust-decimal/pull/677";>paupino/rust-decimal#677 https://github.com/finnbear";>@​finnbear made their first contribution in https://redirect.github.com/paupino/rust-decimal/pull/678";>paupino/rust-decimal#678 https://github.com/Tony-Samuels";>@​Tony-Samuels made their first contribution in https://redirect.github.com/paupino/rust-decimal/pull/683";>paupino/rust-decimal#683 https://github.com/olgabot";>@​olgabot made their first contribution in https://redirect.github.com/paupino/rust-decimal/pull/681";>paupino/rust-decimal#681 https://github.com/daniel-pfeiffer";>@​daniel-pfeiffer made their first contribution in https://redirect.github.com/paupino/rust-decimal/pull/692";>paupino/rust-decimal#692 https://github.com/mcronce";>@​mcronce made their first contribution in https://redirect.github.com/paupino/rust-decimal/pull/707";>paupino/rust-decimal#707 Full Changelog: https://github.com/paupino/rust-decimal/compare/1.36.0...1.37.0";>https://github.com/paupino/rust-decimal/compare/1.36.0...1.37.0 Changelog Sourced from https://github.com/paupino/rust-decimal/blob/master/CHANGELOG.md";>rust_decimal's changelog. Version History Please see https://github.com/paupino/rust-decimal/releases";>Github Releases for version history going forward. Commits https://github.com/paupino/rust-decimal/commit/ffe9495d7734d8021718050f98e3f48f39c428b2";>ffe9495 Specify fixed version for macros dependency https://github.com/paupino/rust-decimal/commit/39fe37eca9d78aad65d9be2b1bdb74c2ad037c37";>39fe37e Specify fixed version for macros dependency (https://redirect.github.com/paupino/rust-decimal/issues/712";>#712) https://github.com/paupino/rust-decimal/commit/ad974543184f7047fa97acae8a4a15201d3bafdf";>ad97454 Refactor macro for backwards compatability (https://redirect.github.com/paupino/rust-decimal/issues/711";>#711) https://github.com/paupino/rust-decimal/commit/f34cefc1127d358acbd6eb73c3bdd1de088e552a";>f34cefc Reduce required mysql dependencies for diesel (https://redirect.github.com/paupino/rust-decimal/issues/707";>#707) https://github.com/paupino/rust-decimal/commit/41ce632f0f84a62f187359f9dacafa86c0b223bd";>41ce632 feat: add rand-0_9 crate feature (https://redirect.github.com/paupino/rust-decimal/issues/702";>#702) https://github.com/paupino/rust-decimal/commit/968997115a1d031e2
[PR] chore(deps): bump async-trait from 0.1.87 to 0.1.88 [datafusion]
dependabot[bot] opened a new pull request, #15294: URL: https://github.com/apache/datafusion/pull/15294 Bumps [async-trait](https://github.com/dtolnay/async-trait) from 0.1.87 to 0.1.88. Release notes Sourced from https://github.com/dtolnay/async-trait/releases";>async-trait's releases. 0.1.88 Fix lifetime bounding on generic parameters that have cfg (https://redirect.github.com/dtolnay/async-trait/issues/289";>#289) Commits https://github.com/dtolnay/async-trait/commit/b3a59195c29c5b336490cec1bac23cff8d3e4483";>b3a5919 Release 0.1.88 https://github.com/dtolnay/async-trait/commit/a306be84ec998f46acc700e8b24a3b68b77a873a";>a306be8 Merge pull request https://redirect.github.com/dtolnay/async-trait/issues/289";>#289 from dtolnay/cfg https://github.com/dtolnay/async-trait/commit/d3059849a4024425f80f0713bc802d8959290d96";>d305984 Fix lifetime bounding on generic parameters that have cfg https://github.com/dtolnay/async-trait/commit/78506f17149e08594c1a120f1df828411772a0b8";>78506f1 Add regression test for issue 288 https://github.com/dtolnay/async-trait/commit/a11384eec60634098f66a3d6ac89c23beccdbbc8";>a11384e Add issue 283 link in test See full diff in https://github.com/dtolnay/async-trait/compare/0.1.87...0.1.88";>compare view [](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Queries similar to `count-bug` produce incorrect results [datafusion]
suibianwanwank commented on PR #15281: URL: https://github.com/apache/datafusion/pull/15281#issuecomment-2731893665 @alamb Hi, I have updated the PR title to start with "fix," but it seems that the "bug" label has not been added. And the PR is now ready for review, Please take a look at your convenience~ Thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Minor: consistently apply `clippy::clone_on_ref_ptr` in all crates [datafusion]
berkaysynnada merged PR #15284: URL: https://github.com/apache/datafusion/pull/15284 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Minor: consistently apply `clippy::clone_on_ref_ptr` in all crates [datafusion]
berkaysynnada commented on PR #15284: URL: https://github.com/apache/datafusion/pull/15284#issuecomment-2731952749 Thank you @alamb and @comphead -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Upgrade Guide for DataFusion 46 does not include the array signatures change [datafusion]
alamb closed issue #15105: Upgrade Guide for DataFusion 46 does not include the array signatures change URL: https://github.com/apache/datafusion/issues/15105 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org