Re: [PR] Update version to 46.0.1, add CHANGELOG [datafusion]
xudong963 merged PR #15243: URL: https://github.com/apache/datafusion/pull/15243 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Saner handling of nulls inside arrays [datafusion]
joroKr21 commented on code in PR #15149: URL: https://github.com/apache/datafusion/pull/15149#discussion_r1996627243 ## datafusion/sqllogictest/test_files/array.slt: ## @@ -6232,12 +6244,12 @@ select array_intersect(arrow_cast([1, 1, 2, 2, 3, 3], 'LargeList(Int64)'), null) query ? select array_intersect(null, [1, 1, 2, 2, 3, 3]); -NULL +[] Review Comment: Above we have ``` query ? select array_intersect([1, 1, 2, 2, 3, 3], null); [] ``` So I think it's better that `array_intersect` is commutative. I can make both `null` or both empty list but I think it should be the same if we flip the arguments. Which one should it be? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Saner handling of nulls inside arrays [datafusion]
joroKr21 commented on code in PR #15149: URL: https://github.com/apache/datafusion/pull/15149#discussion_r1996627020 ## datafusion/sqllogictest/test_files/array.slt: ## @@ -4408,12 +4422,10 @@ select array_union(arrow_cast([], 'LargeList(Int64)'), arrow_cast([], 'LargeList query ? select array_union([[null]], []); -[[NULL]] +[[]] -query ? +query error DataFusion error: Error during planning: Failed to unify argument types of array_union: Review Comment: The problem is that the argument types don't make sense. How are we supposed to union `LargeList(List(Int64))` and `LargeList(Int64)`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Saner handling of nulls inside arrays [datafusion]
joroKr21 commented on code in PR #15149: URL: https://github.com/apache/datafusion/pull/15149#discussion_r1996628218 ## datafusion/functions-nested/src/extract.rs: ## @@ -200,6 +199,7 @@ fn array_element_inner(args: &[ArrayRef]) -> Result { let [array, indexes] = take_function_args("array_element", args)?; match &array.data_type() { +Null => Ok(Arc::new(NullArray::new(array.len(, Review Comment: I will write a test that shows the issue. But basically something like `array_union(make_array(), ['x', 'y', 'z'])` will be an issue because we can't union a list of ints and a list of strings. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Saner handling of nulls inside arrays [datafusion]
joroKr21 commented on code in PR #15149: URL: https://github.com/apache/datafusion/pull/15149#discussion_r1996628327 ## datafusion/functions-nested/src/extract.rs: ## @@ -200,6 +199,7 @@ fn array_element_inner(args: &[ArrayRef]) -> Result { let [array, indexes] = take_function_args("array_element", args)?; match &array.data_type() { +Null => Ok(Arc::new(NullArray::new(array.len(, Review Comment: Or in this case similar nested expression with array element -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] `FROM` first in `SELECT` statements [datafusion-sqlparser-rs]
iffyio commented on issue #1400: URL: https://github.com/apache/datafusion-sqlparser-rs/issues/1400#issuecomment-2726264373 added in https://github.com/apache/datafusion-sqlparser-rs/pull/1713 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Re-enable github discussion [datafusion]
2010YOUY01 merged PR #15241: URL: https://github.com/apache/datafusion/pull/15241 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Saner handling of nulls inside arrays [datafusion]
joroKr21 commented on code in PR #15149: URL: https://github.com/apache/datafusion/pull/15149#discussion_r1996627420 ## datafusion/sqllogictest/test_files/array.slt: ## @@ -6232,12 +6244,12 @@ select array_intersect(arrow_cast([1, 1, 2, 2, 3, 3], 'LargeList(Int64)'), null) query ? select array_intersect(null, [1, 1, 2, 2, 3, 3]); -NULL +[] Review Comment: Also, below we have: ``` query ? select array_intersect([], null); [] ``` ## datafusion/sqllogictest/test_files/array.slt: ## @@ -6232,12 +6244,12 @@ select array_intersect(arrow_cast([1, 1, 2, 2, 3, 3], 'LargeList(Int64)'), null) query ? select array_intersect(null, [1, 1, 2, 2, 3, 3]); -NULL +[] Review Comment: Also, below we have: ``` query ? select array_intersect([], null); [] ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Use insta for `DataFrame` tests [datafusion]
alamb merged PR #15165: URL: https://github.com/apache/datafusion/pull/15165 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Update version to 46.0.1, add CHANGELOG (#15243) [datafusion]
xudong963 opened a new pull request, #15244: URL: https://github.com/apache/datafusion/pull/15244 ## Which issue does this PR close? - part of #https://github.com/apache/datafusion/issues/15151 ## Rationale for this change ## What changes are included in this PR? Port change from branch-46 ## Are these changes tested? ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: remove code duplication in native_datafusion and native_iceberg_compat implementations [datafusion-comet]
andygrove commented on code in PR #1443: URL: https://github.com/apache/datafusion-comet/pull/1443#discussion_r1996054834 ## native/core/src/parquet/parquet_exec.rs: ## @@ -0,0 +1,141 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use crate::execution::operators::ExecutionError; +use crate::parquet::parquet_support::SparkParquetOptions; +use crate::parquet::schema_adapter::SparkSchemaAdapterFactory; +use arrow_schema::{Field, SchemaRef}; +use datafusion::datasource::listing::PartitionedFile; +use datafusion::datasource::physical_plan::{FileScanConfig, FileSource, ParquetSource}; +use datafusion::datasource::source::DataSourceExec; +use datafusion::physical_expr::PhysicalExpr; +use datafusion_comet_spark_expr::EvalMode; +use datafusion_common::config::TableParquetOptions; +use datafusion_execution::object_store::ObjectStoreUrl; +use datafusion_physical_expr::expressions::BinaryExpr; +use itertools::Itertools; +use std::sync::Arc; + +/// Initializes a DataSourceExec plan with a ParquetSource. This may be used by either the +/// `native_datafusion` scan or the `native_iceberg_compat` scan. +/// +/// `required_schema`: Schema to be projected by the scan. +/// +/// `data_schema`: Schema of the underlying data. It is optional and, if provided, is used +/// instead of `required_schema` to initialize the file scan +/// +/// `partition_schema` and `partition_fields` are optional. If `partition_schema` is specified, +/// then `partition_fields` must also be specified +/// +/// `object_store_url`: Url to read data from +/// +/// `file_groups`: A collection of groups of `PartitionedFiles` that are to be read by the scan +/// +/// `projection_vector`: A vector of the indexes in the schema of the fields to be projected +/// +/// `data_filters`: Any predicate that must be applied to the data returned by the scan. If +/// specified, then `data_schema` must also be specified. +#[allow(clippy::too_many_arguments)] +pub(crate) fn init_parquet_exec( +required_schema: SchemaRef, +data_schema: Option, +partition_schema: Option, +partition_fields: Option>, +object_store_url: ObjectStoreUrl, +file_groups: Vec>, +projection_vector: Option>, +data_filters: Option>>, +session_timezone: &str, +) -> Result, ExecutionError> { +let (table_parquet_options, spark_parquet_options) = get_options(session_timezone); +let mut parquet_source = ParquetSource::new(table_parquet_options).with_schema_adapter_factory( +Arc::new(SparkSchemaAdapterFactory::new(spark_parquet_options)), +); +// Create a conjunctive form of the vector because ParquetExecBuilder takes +// a single expression +if let Some(data_filters) = data_filters { +let cnf_data_filters = data_filters.clone().into_iter().reduce(|left, right| { +Arc::new(BinaryExpr::new( +left, +datafusion::logical_expr::Operator::And, +right, +)) +}); + +if let (Some(filter), Some(data_schema)) = (cnf_data_filters, &data_schema) { +parquet_source = parquet_source.with_predicate(Arc::clone(data_schema), filter); +} +} +let file_scan_config = match (data_schema, projection_vector, partition_fields) { +(Some(data_schema), Some(projection_vector), Some(partition_fields)) => get_file_config( +data_schema, +partition_schema, +file_groups, +object_store_url, +Arc::new(parquet_source), +) +.with_projection(Some(projection_vector)) +.with_table_partition_cols(partition_fields), +_ => get_file_config( +required_schema, +partition_schema, +file_groups, +object_store_url, +Arc::new(parquet_source), +), +}; + +let scan = Arc::new(DataSourceExec::new(Arc::new(file_scan_config))); + +Ok(scan) Review Comment: The variable `scan` isn't really required here. ```suggestion Ok(Arc::new(DataSourceExec::new(Arc::new(file_scan_config ``` -- This is an automated message from the Apache Git Service. To respond to the message,
Re: [PR] Snowflake: Support dollar quoted comment when creating tables, views, and their fields [datafusion-sqlparser-rs]
iffyio commented on code in PR #1755: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1755#discussion_r1995819665 ## src/parser/mod.rs: ## @@ -6926,14 +6927,16 @@ impl<'a> Parser<'a> { let comment = if self.parse_keyword(Keyword::COMMENT) { let has_eq = self.consume_token(&Token::Eq); let next_token = self.next_token(); -match next_token.token { -Token::SingleQuotedString(str) => Some(if has_eq { -CommentDef::WithEq(str) -} else { -CommentDef::WithoutEq(str) -}), +let comment = match next_token.token { +Token::SingleQuotedString(str) => str, +Token::DollarQuotedString(str) => str.value, _ => self.expected("comment", next_token)?, -} +}; Review Comment: can we move this part to be reusable? something like ```rust fn parse_comment_value() -> Result { let value = match next_token.token { Token::SingleQuotedString(str) => str, Token::DollarQuotedString(str) => str.value, _ => self.expected("comment", next_token)?, }; Ok(value) } ``` and then to call the function from both places -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Add all functions to the Expr class so that they're chainable. [datafusion-python]
deanm opened a new issue, #1064: URL: https://github.com/apache/datafusion-python/issues/1064 **Is your feature request related to a problem or challenge? Please describe what you are trying to do.** Instead of doing ```python df.select( F.abs( col("id1") ) .alias("id1") ) ``` It would be nicer to do ```python df.select( col("id1") .abs() .alias("id1") ) ``` **Describe the solution you'd like** This is already partly there, for example, `alias` is already in the Expr class. It would be a bit tedious but easy to add under `class Expr`, for example: ```python def abs(self) -> datafusion.Expr: """Return the absolute value of a given number. Returns: Expr A new expression representing the absolute value of the input expression. """ return F.abs(self) ``` **Describe alternatives you've considered** if it weren't for the type hinter, monkey patching. **Additional context** There will still be functions that don't make sense to chain off of a call to `col` such as `when` since it doesn't return an Expr. But, even functions that take multiple inputs can have this for instance `col("a").atan2("b")`. Additionally, this is completely backwards compatible since I'm not proposing eliminating the functions module. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] fix: re-enable GitHub discussions [datafusion-comet]
andygrove opened a new pull request, #1532: URL: https://github.com/apache/datafusion-comet/pull/1532 ## Which issue does this PR close? N/A ## Rationale for this change Our GitHub discussions disappeared! Possibly related to an update in ASF tooling mentioned in https://infra.apache.org/blog/newsletter_02_25.html ? ## What changes are included in this PR? Enable discussions via asf.yaml ## How are these changes tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] docs: various improvements to tuning guide [datafusion-comet]
andygrove commented on code in PR #1525: URL: https://github.com/apache/datafusion-comet/pull/1525#discussion_r1996138171 ## docs/source/user-guide/tuning.md: ## @@ -23,12 +23,84 @@ Comet provides some tuning options to help you get the best performance from you ## Memory Tuning -### Unified Memory Management with Off-Heap Memory +It is necessary to specify how much memory Comet can use in addition to memory already allocated to Spark. In some +cases, it may be possible to reduce the amount of memory allocated to Spark so that overall memory allocation is +the same or lower than the original configuration. In other cases, enabling Comet may require allocating more memory +than before. See the `Determining How Much Memory to Allocate` section for more details. -The recommended way to share memory between Spark and Comet is to set `spark.memory.offHeap.enabled=true`. This allows -Comet to share an off-heap memory pool with Spark. The size of the pool is specified by `spark.memory.offHeap.size`. For more details about Spark off-heap memory mode, please refer to Spark documentation: https://spark.apache.org/docs/latest/configuration.html. +Comet supports Spark's on-heap (the default) and off-heap mode for allocating memory. However, we strongly recommend +using off-heap mode. Comet has some limitations when running in on-heap mode, such as requiring more memory overall, +and requiring shuffle memory to be separately configured. -The type of pool can be specified with `spark.comet.exec.memoryPool`. +### Configuring Comet Memory in Off-Heap Mode + +The recommended way to allocate memory for Comet is to set `spark.memory.offHeap.enabled=true`. This allows +Comet to share an off-heap memory pool with Spark, reducing the overall memory overhead. The size of the pool is +specified by `spark.memory.offHeap.size`. For more details about Spark off-heap memory mode, please refer to +Spark documentation: https://spark.apache.org/docs/latest/configuration.html. + +### Configuring Comet Memory in On-Heap Mode + +When running in on-heap mode, Comet memory can be allocated by setting `spark.comet.memoryOverhead`. If this setting +is not provided, it will be calculated by multiplying the current Spark executor memory by +`spark.comet.memory.overhead.factor` (default value is `1.0`). This is a conservative default that provides Comet +with the same amount of memory that Spark was originally using. + +Shuffle memory must be separately allocated using `spark.comet.columnar.shuffle.memorySize`. If this setting is not +provided, it will be calculated by multiplying `spark.comet.memoryOverhead` by +`spark.comet.columnar.shuffle.memory.factor` (default value is `1.0`). + +### Determining How Much Memory to Allocate + +Generally, increasing the amount of memory allocated to Comet will improve query performance by reducing the +amount of time spent spilling to disk, especially for aggregate, join, and shuffle operations. Allocating insufficient +memory can result in out-of-memory errors. This is no different from allocating memory in Spark and the amount of +memory will vary for different workloads, so some experimentation will be required. + +Here is a real-world example, based on running benchmarks derived from TPC-H. + +**TODO: this section is a work-in-progress** + +The following table shows performance of Spark compared to Comet in both Off-Heap and On-Heap modes. The table shows +total query time for TPC-H @ 100GB. Smaller is better. + +| Total Executor Memory (GB) | Spark | Comet Off-Heap | Comet On-Heap | +| -- | - | -- | - | +| 1 | OOM | OOM| OOM | +| 2 | OOM | OOM| OOM | +| 3 | 744 | OOM| OOM | +| 4 | 739 | OOM| OOM | +| 5 | 681 | 342| OOM | +| 6 | 665 || 344 | +| 7 | 657 || 340 | +| 8 | 632 | 295| 334 | +| 9 | 623 || | +| 10 | 622 || | + +TODO: WIP conclusions: + +- Spark can complete the benchmark with as little as 3GB but shows best performance at 9-10 GB +- When Comet is enabled, Spark needs at least 5 GB of memory but provides a ~2x improvement in performance for that level of memory allocation +- With Comet enabled, performance with 5 GB is 1.8x faster than Spark with 9-10 GB +- TODO run Comet with half the CPUs and show same performance? i.e. demonstrate same performance for half the cost + +## Advanced Memory Tuning + +## Configuring spark.executor.memoryOverhead + +In some environments, such as Kubernetes a
[I] Unparse of logical plans with `LEFT ANTI` and `LEFT SEMI` joins generate invalid SQL [datafusion]
nuno-faria opened a new issue, #15127: URL: https://github.com/apache/datafusion/issues/15127 ### Describe the bug When unparsing a logical plan containing a `LeftAnti Join` or a `LeftSemi Join` operator with the `unparser::dialect::PostgreSqlDialect`, the resulting unparsed SQL contains `LEFT ANTI JOIN` or `LEFT SEMI JOIN`, respectively, which are not supported by Postgres. The same thing is true for other dialects such as `unparser::dialect::MySqlDialect` and `unparser::dialect::SqliteDialect`. This causes issues when attempting to run federated queries in Datafusion over systems such as Postgres (namely, in the [datafusion-table-providers](https://github.com/datafusion-contrib/datafusion-table-providers) repo). I'm not sure if this is a bug or if the responsibility to generate valid SQL in a specific dialect should be delegated downstream. However, since there appears to be a previous PR with a similar issue (https://github.com/apache/datafusion/pull/10625), I leave this one here just in case. ### To Reproduce ```rust use datafusion::{ error::Result, prelude::SessionContext, sql::unparser::{self, Unparser}, }; #[tokio::main] async fn main() -> Result<()> { let ctx = SessionContext::new(); ctx.sql("create table t1 (c int)").await?.collect().await?; ctx.sql("create table t2 (c int)").await?.collect().await?; let df = ctx .sql("select * from t1 where c not in (select c from t2)") .await?; let plan = df.into_optimized_plan()?; // optimizing the plan will introduct the LeftAnti Join println!("{}", plan); let sql = Unparser::new(&unparser::dialect::PostgreSqlDialect {}).plan_to_sql(&plan)?; println!("{}", sql); Ok(()) } ``` Which returns the (invalid in Postgres) query: ```sql SELECT * FROM "t1" LEFT ANTI JOIN "t2" AS "__correlated_sq_1" ON "t1"."c" = "__correlated_sq_1"."c" ``` ### Expected behavior Generate an equivalent valid query, such as: ```sql SELECT * FROM t1 WHERE c NOT IN (SELECT c FROM t2) ``` ### Additional context ```toml datafusion = "45.0.0" ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: remove code duplication in native_datafusion and native_iceberg_compat implementations [datafusion-comet]
parthchandra commented on PR #1443: URL: https://github.com/apache/datafusion-comet/pull/1443#issuecomment-2725354473 @mbutrovich, @comphead I've rebased this on main (after factoring out the object store related changes) and force pushed. If you could please take a look again. @andygrove I ran into conflicts because of the change from `ParquetExec` to `DataSourceExec`. Would appreciate it if you could double check that I resolved the conflicts correctly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Simplify display format of `AggregateFunctionExpr` [datafusion]
irenjj commented on code in PR #15253: URL: https://github.com/apache/datafusion/pull/15253#discussion_r1997046455 ## datafusion/expr/src/expr.rs: ## @@ -2596,6 +2600,43 @@ impl Display for SchemaDisplay<'_> { } } +struct SqlDisplay<'a>(&'a Expr); +impl Display for SqlDisplay<'_> { +fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result { +match self.0 { +Expr::Column(c) => { +write!(f, "{}", c.name) +} +Expr::Literal(_) => { +write!(f, "aa") +} +Expr::ScalarVariable(..) => { +write!(f, "bb") +} +Expr::OuterReferenceColumn(..) => { +write!(f, "cc") +} +Expr::Placeholder(_) => { +write!(f, "dd") +} +Expr::Wildcard { .. } => { +write!(f, "ee") +} +Expr::AggregateFunction(AggregateFunction { func, params }) => { +match func.sql_name(params) { +Ok(name) => { +write!(f, "{name}") +} +Err(e) => { +write!(f, "got error from schema_name {}", e) +} +} +} +_ => write!(f, "{}", self.0.schema_name()), +} +} +} + Review Comment: Just a demo here. `AggregateFunctionExpr`'s `name` is constructed at logical expr phase, we need a new member `sql_name` for tree explain, and a new constructor `SqlDisplay` to handle the inner `Expr` recursively. cc @alamb 👀 ## datafusion/sqllogictest/test_files/explain_tree.slt: ## @@ -1338,7 +1336,7 @@ physical_plan 12)-┌─┴─┐ 13)-│ AggregateExec │ 14)-│ │ -15)-│ aggr: count(Int64(1)) │ +15)-│ aggr: count(aa) │ Review Comment: We can see the final display changed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Attach Diagnostic to "incompatible type in unary expression" error [datafusion]
onlyjackfrost commented on PR #15209: URL: https://github.com/apache/datafusion/pull/15209#issuecomment-2721725989 @eliaperantoni could you help review? :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Change in behavior for deep structure columns with the latest sql parser upgrade [datafusion]
adragomir opened a new issue, #15118: URL: https://github.com/apache/datafusion/issues/15118 ### Describe the bug with SQLParser 0.53, this works: ``` SELECT * FROM (meta_asset_featurization AS asset_meta INNER JOIN meta_asset_summary_metrics AS asset_metrics ON ( asset_meta.struct_field['substruct']['substruct'] = asset_metrics.struct_field['substruct']['substruct'] )) ``` with SQLParser 0.54, this no longer works, the newly added `sql_compound_field_access_to_expr` interprets the `root` (in this case for example `asset_meta`) as a column name instead of a table. The solution (workaround) for SQLParser 0.54 is to do ``` SELECT * FROM (meta_asset_featurization AS asset_meta INNER JOIN meta_asset_summary_metrics AS asset_metrics ON ( asset_meta.struct_field.substruct.substruct = asset_metrics.struct_field.substruct.substruct )) ``` Is this intended ? I have tried to see the history, but unsure which behavior is correct. Since we do interpret the `[]` index access in the case of NOT having to specify the table correctly (so, if I have a single table and I specify directly the column `col['substruct']['substruct']` it works), why wouldn't this case work ? ### To Reproduce _No response_ ### Expected behavior Structure field access with either dot acess `.subfield` or subscript access`['subfield']` should work in the same way and yield the same result in all conditions ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: topk functionality for aggregates should support utf8view and largeutf8 [datafusion]
alamb commented on PR #15152: URL: https://github.com/apache/datafusion/pull/15152#issuecomment-2721226874 FYI @avantgardnerio -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [EPIC] A collection of tickets for improved WASM support in DataFusion [datafusion]
savaliyabhargav commented on issue #13815: URL: https://github.com/apache/datafusion/issues/13815#issuecomment-2724623356 Dear @alamb , I hope you’re doing well. I’m very interested in contributing to the WASM support for DataFusion project as part of GSoC 2025. Enhancing embeddability and ensuring robust WASM integration aligns with my passion for building reliable and maintainable software. I’d greatly appreciate it if you could share more details on the project’s current challenges and key milestones. Understanding these aspects will help me prepare a thoughtful and well-structured proposal. Thank you for your time and guidance. Best regards, bhargav -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] [substrait] Build basic test suite to validate produced Substrait plans [datafusion]
amoeba opened a new issue, #15069: URL: https://github.com/apache/datafusion/issues/15069 ### Is your feature request related to a problem or challenge? In https://github.com/apache/datafusion/issues/12244 we found that DataFusion was producing invalid Substrait plans. The `datafusion-substrait` crate has quite a bit of tests but none that actually verify plans valid according to the https://github.com/substrait-io/substrait-validator. ### Describe the solution you'd like Add a test suite to the `datafusion-substrait` crate that uses https://github.com/substrait-io/substrait-validator to validate plans. We can start small and just add a test that would've failed prior to https://github.com/apache/datafusion/pull/15011. ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Upgrade Guide for DataFusion 46 does not include the array signatures change [datafusion]
jkosh44 commented on issue #15105: URL: https://github.com/apache/datafusion/issues/15105#issuecomment-2725983472 > [@jkosh44](https://github.com/jkosh44) Would you be able to add a note about array signature changes to https://github.com/apache/datafusion/blob/main/docs/source/library-user-guide/upgrading.md ? Yes, I'll do that tonight. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Expose global context [datafusion-python]
timsaucer commented on issue #1045: URL: https://github.com/apache/datafusion-python/issues/1045#issuecomment-2710856896 I understand what you mean by "global context is already exposed" but I mean that it should be treated in the wrapper functions as well so that users get a nice python interface and don't need to jump into the `_internal` package. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Use Datafusion's existing empty stream [datafusion-comet]
codecov-commenter commented on PR #1517: URL: https://github.com/apache/datafusion-comet/pull/1517#issuecomment-2719120442 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1517?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report All modified and coverable lines are covered by tests :white_check_mark: > Project coverage is 57.84%. Comparing base [(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`56806da`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/56806daa9b256ef01f565504a7e2669bdf6abe0a?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). > Report is 74 commits behind head on main. Additional details and impacted files ```diff @@ Coverage Diff @@ ## main#1517 +/- ## + Coverage 56.12% 57.84% +1.71% - Complexity 976 985 +9 Files 119 122 +3 Lines 1174312195 +452 Branches 2251 2303 +52 + Hits 6591 7054 +463 + Misses 4012 3958 -54 - Partials 1140 1183 +43 ``` [:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1517?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). 🚀 New features to boost your workflow: - ❄ [Test Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, report on failures, and find test suite problems. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] datafusion-cli: add streaming state struct [datafusion]
shruti2522 opened a new pull request, #15234: URL: https://github.com/apache/datafusion/pull/15234 ## Which issue does this PR close? - Closes #14886 . ## Rationale for this change ## What changes are included in this PR? add `OutputStreamStruct` to avoid dead code in datafusion-cli printing logic ## Are these changes tested? ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: compound_field_access doesn't identifier qualifier. [datafusion]
jonahgao commented on code in PR #15153: URL: https://github.com/apache/datafusion/pull/15153#discussion_r1995820915 ## datafusion/core/tests/sql/select.rs: ## @@ -350,3 +350,45 @@ async fn test_version_function() { assert_eq!(version.value(0), expected_version); } + +#[tokio::test] +async fn test_subscript() -> Result<()> { Review Comment: This is more suitable as a sqllogictest test. Maybe we can move it to [struct.slt](https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/struct.slt) or [identifiers.slt](https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/identifiers.slt) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Renaming Internal Structs [datafusion-python]
Spaarsh commented on PR #1059: URL: https://github.com/apache/datafusion-python/pull/1059#issuecomment-2717683480 Now the `test_wrapper_coverage.py` checks for modules in this way: Skip checks if it is global_context (No changes made in this part) For classes started with `Raw`, it breaks down the class name to extract the wrapper class and checks if that class has the given `Raw` class Later, when the conditional statement of `isinstance` is checked for, the `Raw` classes are skipped since they have been checked for in the earlier part of the test and, more importantly, unlike other classes that return true for `isinstance`, `Raw` classes are only internally exposed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Simpler to see expressions in explain `tree` mode [datafusion]
alamb commented on code in PR #15163: URL: https://github.com/apache/datafusion/pull/15163#discussion_r1996232099 ## datafusion/sqllogictest/test_files/explain_tree.slt: ## @@ -739,43 +736,42 @@ physical_plan 01)┌───┐ 02)│ ProjectionExec │ 03)│ │ -04)│rolling_sum: │ -05)│ sum(t1.v1) ORDER BY [t1.v1│ -06)│ASC NULLS LAST] ROWS │ -07)│ BETWEEN 1 PRECEDING │ -08)│ AND CURRENT ROW@1│ -09)│ │ -10)│ v1: v1@0 │ -11)└─┬─┘ -12)┌─┴─┐ -13)│BoundedWindowAggExec │ -14)│ │ -15)│mode: Sorted │ -16)│ │ -17)│select_list: │ -18)│ sum(t1.v1) ORDER BY [t1.v1│ -19)│ASC NULLS LAST] ROWS │ -20)│ BETWEEN 1 PRECEDING │ -21)│ AND CURRENT ROW │ -22)└─┬─┘ -23)┌─┴─┐ -24)│ SortExec │ -25)│ │ -26)│v1@0 ASC NULLS LAST│ -27)└─┬─┘ -28)┌─┴─┐ -29)│ ProjectionExec │ -30)│ │ -31)│v1: value@0│ -32)└─┬─┘ -33)┌─┴─┐ -34)│ LazyMemoryExec │ -35)│ │ -36)│ batch_generators: │ -37)│ generate_series: start=1, │ -38)│end=1000, batch_size │ -39)│ =8192 │ -40)└───┘ +04)│ sum(t1.v1) ORDER BY [t1.v1│ +05)│ASC NULLS LAST] ROWS │ +06)│ BETWEEN 1 PRECEDING │ +07)│ AND CURRENT ROW │ +08)│ │ +09)│ v1│ +10)└─┬─┘ +11)┌─┴─┐ +12)│BoundedWindowAggExec │ +13)│ │ +14)│mode: Sorted │ +15)│ │ +16)│select_list: │ +17)│ sum(t1.v1) ORDER BY [t1.v1│ +18)│ASC NULLS LAST] ROWS │ +19)│ BETWEEN 1 PRECEDING │ +20)│ AND CURRENT ROW │ +21)└─┬─┘ +22)┌─┴─┐ +23)│ SortExec │ +24)│ │ +25)│v1@0 ASC NULLS LAST│ Review Comment: Maybe we can file a follow on PR to improve other plans 🤔 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: topk functionality for aggregates should support utf8view and largeutf8 [datafusion]
alamb commented on PR #15152: URL: https://github.com/apache/datafusion/pull/15152#issuecomment-2725521365 Thanks again @zhuqi-lucas -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] [branch-46] Fix wasm32 build on version 46 [datafusion]
alamb closed pull request #15229: [branch-46] Fix wasm32 build on version 46 URL: https://github.com/apache/datafusion/pull/15229 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [DISCUSS] Release DataFusion `46.0.1` Patch or `46.1.0` minor release (March 2025) [datafusion]
alamb commented on issue #15151: URL: https://github.com/apache/datafusion/issues/15151#issuecomment-2724618093 - Unfortunately https://github.com/apache/datafusion/issues/15114 doesn't seem likely to make it into this release. I'll proceed with making backport PRs now -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Saner handling of nulls inside arrays [datafusion]
jayzhan211 commented on code in PR #15149: URL: https://github.com/apache/datafusion/pull/15149#discussion_r1996471597 ## datafusion/expr-common/src/signature.rs: ## @@ -865,6 +867,39 @@ impl Signature { volatility, } } + +/// Specialized Signature for ArrayPrepend and similar functions +pub fn element_and_array(volatility: Volatility) -> Self { +Signature { +type_signature: TypeSignature::ArraySignature( +ArrayFunctionSignature::Array { +arguments: vec![ +ArrayFunctionArgument::Element, +ArrayFunctionArgument::Array, +], +array_coercion: Some(ListCoercion::FixedSizedListToList), Review Comment: What we need is to provide customizable signature, so if they want List for append, they can add `array_coercion: Some(ListCoercion::FixedSizedListToList)`, if they want to keep fixed-size-list they can add `array_coercion: None`. Both cases should be possible. For datafusion implementation, we choose to convert to List by default -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: remove code duplication in native_datafusion and native_iceberg_compat implementations [datafusion-comet]
mbutrovich commented on code in PR #1443: URL: https://github.com/apache/datafusion-comet/pull/1443#discussion_r1996167458 ## native/core/src/parquet/mod.rs: ## @@ -620,12 +619,21 @@ fn get_batch_context<'a>(handle: jlong) -> Result<&'a mut BatchContext, CometErr } } -/* -#[inline] -fn get_batch_reader<'a>(handle: jlong) -> Result<&'a mut ParquetRecordBatchReader, CometError> { -Ok(&mut get_batch_context(handle)?.batch_reader.unwrap()) +fn get_file_groups_single_file( +path: &Path, +file_size: u64, +start: i64, +length: i64, +) -> Vec> { Review Comment: Outer vector is partitions, inner vector is the files belonging to that partition, I believe: https://github.com/apache/datafusion/blob/f8828abb46d6f9280f4de146896cd59a740e06ed/datafusion/datasource/src/file_scan_config.rs#L145 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] WIP: User defined sorting [datafusion]
tobixdev commented on PR #15106: URL: https://github.com/apache/datafusion/pull/15106#issuecomment-2725048476 Happy to hear any kind of feedback on that @paleolimbot . So take a look if you've time. Also good to hear that others also have these requirements in their projects :). Could you think of a way to implement your use case by extending the proposed solution (to get a feeling whether this idea generalizes to other use cases)? As this is quite a fundamental change that can influence many more features to come, we would definitely need some input from someone more familiar with the project. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Remove all subdependencies [datafusion-comet]
andygrove merged PR #1514: URL: https://github.com/apache/datafusion-comet/pull/1514 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] chore: Minor code cleanup in native scan type-checking [datafusion-comet]
andygrove opened a new pull request, #1537: URL: https://github.com/apache/datafusion-comet/pull/1537 ## Which issue does this PR close? N/A ## Rationale for this change Unify code for `native_datafusion` and `native_iceberg_compat` when checking for supported types ## What changes are included in this PR? ## How are these changes tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Minor code cleanup in native scan type-checking [datafusion-comet]
andygrove commented on code in PR #1537: URL: https://github.com/apache/datafusion-comet/pull/1537#discussion_r1997045440 ## spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala: ## @@ -199,15 +199,15 @@ class CometSparkSessionExtensions _, _, _) - if CometScanExec.isFileFormatSupported(fileFormat) + if COMET_EXEC_ENABLED.get() +&& CometConf.isExperimentalNativeScan Review Comment: I moved this check earlier because it is much cheaper than checking if the schema is supported (especially if the schema is deeply nested) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Attach Diagnostic to "incompatible type in unary expression" error [datafusion]
onlyjackfrost commented on PR #15209: URL: https://github.com/apache/datafusion/pull/15209#issuecomment-2725191499 @eliaperantoni could I raise another PR for the others unary expressions and keep this PR for the PLUS unary expression? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Implement tree explain for `LocalLimitExec` [datafusion]
alamb commented on PR #15232: URL: https://github.com/apache/datafusion/pull/15232#issuecomment-2725553415 Thanks again @shruti2522 and @comphead -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Document guidelines for physical operator yielding [datafusion]
ozankabak commented on code in PR #15030: URL: https://github.com/apache/datafusion/pull/15030#discussion_r1984828017 ## datafusion/physical-plan/src/execution_plan.rs: ## @@ -260,13 +260,30 @@ pub trait ExecutionPlan: Debug + DisplayAs + Send + Sync { /// used. /// Thus, [`spawn`] is disallowed, and instead use [`SpawnedTask`]. /// +/// To enable timely cancellation, the [`Stream`] that is returned must not +/// pin the CPU and must yield back to the tokio runtime regularly. This can Review Comment: Right - but that is the only place AFAICT we do this. So it is (and should be) very rare - which the docs ought to mention. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] fix: use common implementation of handling object store and hdfs urls for native_datafusion and native_iceberg_compat [datafusion-comet]
parthchandra opened a new pull request, #1494: URL: https://github.com/apache/datafusion-comet/pull/1494 Addresses part of comments in https://github.com/apache/datafusion-comet/pull/1443 Replaces the `register_object_store` method that provided hdfs support with a unified `prepare_object_store` method. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] docs: various improvements to tuning guide [datafusion-comet]
andygrove commented on code in PR #1525: URL: https://github.com/apache/datafusion-comet/pull/1525#discussion_r1996139134 ## docs/source/user-guide/tuning.md: ## @@ -17,18 +17,96 @@ specific language governing permissions and limitations under the License. --> -# Tuning Guide +# Comet Tuning Guide Comet provides some tuning options to help you get the best performance from your queries. ## Memory Tuning -### Unified Memory Management with Off-Heap Memory +It is necessary to specify how much memory Comet can use in addition to memory already allocated to Spark. In some +cases, it may be possible to reduce the amount of memory allocated to Spark so that overall memory allocation is +the same or lower than the original configuration. In other cases, enabling Comet may require allocating more memory +than before. See the [Determining How Much Memory to Allocate] section for more details. -The recommended way to share memory between Spark and Comet is to set `spark.memory.offHeap.enabled=true`. This allows -Comet to share an off-heap memory pool with Spark. The size of the pool is specified by `spark.memory.offHeap.size`. For more details about Spark off-heap memory mode, please refer to Spark documentation: https://spark.apache.org/docs/latest/configuration.html. +[Determining How Much Memory to Allocate]: #determining-how-much-memory-to-allocate -The type of pool can be specified with `spark.comet.exec.memoryPool`. +Comet supports Spark's on-heap (the default) and off-heap mode for allocating memory. However, we strongly recommend +using off-heap mode. Comet has some limitations when running in on-heap mode, such as requiring more memory overall, +and requiring shuffle memory to be separately configured. + +### Configuring Comet Memory in Off-Heap Mode + +The recommended way to allocate memory for Comet is to set `spark.memory.offHeap.enabled=true`. This allows +Comet to share an off-heap memory pool with Spark, reducing the overall memory overhead. The size of the pool is +specified by `spark.memory.offHeap.size`. For more details about Spark off-heap memory mode, please refer to +Spark documentation: https://spark.apache.org/docs/latest/configuration.html. + +### Configuring Comet Memory in On-Heap Mode + +When running in on-heap mode, Comet memory can be allocated by setting `spark.comet.memoryOverhead`. If this setting +is not provided, it will be calculated by multiplying the current Spark executor memory by +`spark.comet.memory.overhead.factor` (default value is `0.2`) which may or may not result in enough memory for +Comet to operate. It is not recommended to rely on this behavior. It is better to specify `spark.comet.memoryOverhead` +explicitly. + +Comet supports native shuffle and columnar shuffle (these terms are explained in the [shuffle] section below). +In on-heap mode, columnar shuffle memory must be separately allocated using `spark.comet.columnar.shuffle.memorySize`. +If this setting is not provided, it will be calculated by multiplying `spark.comet.memoryOverhead` by +`spark.comet.columnar.shuffle.memory.factor` (default value is `1.0`). If a shuffle exceeds this amount of memory +then the query will fail. + +[shuffle]: #shuffle + +### Determining How Much Memory to Allocate + +Generally, increasing the amount of memory allocated to Comet will improve query performance by reducing the +amount of time spent spilling to disk, especially for aggregate, join, and shuffle operations. Allocating insufficient +memory can result in out-of-memory errors. This is no different from allocating memory in Spark and the amount of +memory will vary for different workloads, so some experimentation will be required. + +Here is a real-world example, based on running benchmarks derived from TPC-H, running on a single executor against +local Parquet files using the 100 GB data set. + +Baseline Spark Performance + +- Spark completes the benchmark in 632 seconds with 8 cores and 8 GB RAM +- With less than 8 GB RAM, performance degrades due to spilling +- Spark can complete the benchmark with as little as 3 GB of RAM, but with worse performance (744 seconds) + +Comet Performance + +- Comet requires at least 5 GB of RAM in off-heap mode and 6 GB RAM in On-Heap mode, but performance at this level + is around 340 seconds, which is significantly faster than Spark with any amount of RAM +- Comet running in off-heap with 8 cores completes the benchmark in 295 seconds, more than 2x faster than Spark +- It is worth noting that running Comet with only 4 cores and 4 GB RAM completes the benchmark in 520 seconds, + providing better performance than Spark for half the resource + +It may be possible to reduce Comet's memory overhead by reducing batch sizes or increasing number of partitions. + +### SortExec + +Comet's SortExec implementation spills to disk when under memory pressure, but there are some known issues in the +underlying DataFusion Sor
[PR] Implement tree explain for CoalescePartitionsExec [datafusion]
Shreyaskr1409 opened a new pull request, #15225: URL: https://github.com/apache/datafusion/pull/15225 ## Which issue does this PR close? - Closes #15195 . ## Rationale for this change ## What changes are included in this PR? Changes explain_tree.slt and coalesce_partitions.rs ## Are these changes tested? Yes ## Are there any user-facing changes? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Change mapping of SQL `VARCHAR` from `Utf8` to `Utf8View` [datafusion]
zhuqi-lucas commented on issue #15096: URL: https://github.com/apache/datafusion/issues/15096#issuecomment-2727226225 New subt_task: - [ ] Support Utf8View datatype for range queries -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Simplify display format of `AggregateFunctionExpr` [datafusion]
irenjj opened a new pull request, #15253: URL: https://github.com/apache/datafusion/pull/15253 ## Which issue does this PR close? - Closes #15252 ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested? ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Update version to 46.0.1, add CHANGELOG (#15243) [datafusion]
xudong963 commented on PR #15244: URL: https://github.com/apache/datafusion/pull/15244#issuecomment-2726776760 > 🤔 something seems to be wrong here yeah, missed something, fixed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Minor code cleanup in native scan type checking [datafusion-comet]
codecov-commenter commented on PR #1537: URL: https://github.com/apache/datafusion-comet/pull/1537#issuecomment-2726800741 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1537?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report Attention: Patch coverage is `0%` with `4 lines` in your changes missing coverage. Please review. > Project coverage is 57.49%. Comparing base [(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`4416704`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/4416704138345fbe5aca3a3937ce205107c119d1?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). > Report is 83 commits behind head on main. | [Files with missing lines](https://app.codecov.io/gh/apache/datafusion-comet/pull/1537?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Patch % | Lines | |---|---|---| | [...org/apache/comet/CometSparkSessionExtensions.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/1537?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2FCometSparkSessionExtensions.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9Db21ldFNwYXJrU2Vzc2lvbkV4dGVuc2lvbnMuc2NhbGE=) | 0.00% | [0 Missing and 4 partials :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/1537?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Additional details and impacted files ```diff @@ Coverage Diff @@ ## main#1537 +/- ## + Coverage 56.12% 57.49% +1.36% + Complexity 976 971 -5 Files 119 123 +4 Lines 1174312169 +426 Branches 2251 2282 +31 + Hits 6591 6996 +405 + Misses 4012 4002 -10 - Partials 1140 1171 +31 ``` [:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1537?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). 🚀 New features to boost your workflow: - ❄ [Test Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, report on failures, and find test suite problems. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Publish official Docker images to Docker Hub under Apache account [datafusion-comet]
comphead commented on issue #1510: URL: https://github.com/apache/datafusion-comet/issues/1510#issuecomment-2726807145 thanks @andygrove I think it makes sense to me, Apache Spark also includes the platform and Comet has the platform variety as well. Although I'm not sure if platforms other than Linux are in demand. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Minor code cleanup in native scan type checking [datafusion-comet]
andygrove commented on code in PR #1537: URL: https://github.com/apache/datafusion-comet/pull/1537#discussion_r1997192540 ## spark/src/test/scala/org/apache/comet/parquet/ParquetReadSuite.scala: ## @@ -82,11 +82,7 @@ abstract class ParquetReadSuite extends CometTestBase { } test("unsupported Spark types") { -// for native iceberg compat, CometScanExec supports some types that native_comet does not. Review Comment: this is no longer true -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] `native_datafusion` scan is only enabled when `spark.comet.exec.enabled` is set [datafusion-comet]
viirya commented on issue #1536: URL: https://github.com/apache/datafusion-comet/issues/1536#issuecomment-2726904269 Hmm, I think it is a mistake. In the commit 1cca8d6f7bd2dfb8e1996bdd55ebe09d08eb8221, I transformed `CometScanExec` added for fully native scan in `CometScanRule` to `CometNativeScanExec` in `CometExecRule`. Maybe at the moment, I thought that this fully native scan is only used with native execution. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Publish official Docker images to Docker Hub under Apache account [datafusion-comet]
andygrove commented on issue #1510: URL: https://github.com/apache/datafusion-comet/issues/1510#issuecomment-2726915643 The first image is live. There are a number of security warnings that we need to look into. https://hub.docker.com/r/apache/datafusion-comet/tags -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Support Push down expression evaluation in `TableProviders` [datafusion]
adriangb commented on issue #14993: URL: https://github.com/apache/datafusion/issues/14993#issuecomment-2704601029 > But roughly, DataFusion asks the table provider which expressions it can push-down, and the node is configured with both a projection expression and a filter expression. Exact same mechanism as filter expressions. @gatesn I'm in agreement with you, what you are proposing makes total sense -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Release DataFusion `46.0.0` [datafusion]
alamb commented on issue #14123: URL: https://github.com/apache/datafusion/issues/14123#issuecomment-2703664954 > Should [ScalarUDFImpl::invoke_batch](https://github.com/apache/datafusion/blob/43ecd9b807877946706628633308f73a4645de1f/datafusion/expr/src/udf.rs#L616) be marked as deprecated? (There is already a comment saying that in the docstring, but the compiler doesn't catch that 😅 ) That way we'd at least get some compile-time warning that we're doing something bad, now there was only runtime failures. I think we should mark it deprecated (though we had bad luck in the past with the compiler not complaining about implementing a deprecated trait method 🤔 ) We also tried to document this in the (new for the first time) upgrade guide: https://datafusion.apache.org/library-user-guide/upgrading.html#use-invoke-with-args-instead-of-invoke-and-invoke-batch I think we may simply want to remove `invoke_with_batch` / other older methods ahead of deprecation schedule to avoid the confusion -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Per file filter evaluation [datafusion]
adriangb commented on code in PR #15057: URL: https://github.com/apache/datafusion/pull/15057#discussion_r1988141252 ## datafusion/datasource-parquet/src/source.rs: ## @@ -559,24 +556,8 @@ impl FileSource for ParquetSource { .predicate() .map(|p| format!(", predicate={p}")) .unwrap_or_default(); -let pruning_predicate_string = self -.pruning_predicate() -.map(|pre| { -let mut guarantees = pre -.literal_guarantees() -.iter() -.map(|item| format!("{}", item)) -.collect_vec(); -guarantees.sort(); -format!( -", pruning_predicate={}, required_guarantees=[{}]", -pre.predicate_expr(), -guarantees.join(", ") -) -}) -.unwrap_or_default(); -write!(f, "{}{}", predicate_string, pruning_predicate_string) Review Comment: This is what's causing tests to fail. Tests assert against the formatting of a ParquetSource and accessing it's pruning predicate method. I'm not sure if we should rewrite the tests and somehow make the generated predicates accessible, bin them, etc. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: `INSERT INTO` support [datafusion-ballista]
milenkovicm closed pull request #1177: feat: `INSERT INTO` support URL: https://github.com/apache/datafusion-ballista/pull/1177 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Minor: More comment to aggregation fuzzer [datafusion]
2010YOUY01 opened a new pull request, #15048: URL: https://github.com/apache/datafusion/pull/15048 ## Which issue does this PR close? - Closes #. ## Rationale for this change I'm thinking about enhancing the sort fuzzer, so I checked our nice aggregate fuzzer to find reusable parts. While doing so, I also took the opportunity to add more comments to it. ## What changes are included in this PR? More comment to the aggregation fuzzer. ## Are these changes tested? NA ## Are there any user-facing changes? NO -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Remove inline table scan analyzer rule [datafusion]
alamb commented on code in PR #15201: URL: https://github.com/apache/datafusion/pull/15201#discussion_r1996808102 ## datafusion/core/tests/dataframe/mod.rs: ## @@ -1571,14 +1571,18 @@ async fn with_column_join_same_columns() -> Result<()> { assert_snapshot!( df_with_column.logical_plan(), -@r###" +@r" Projection: t1.c1, t2.c1, Boolean(true) AS new_column Limit: skip=0, fetch=1 Sort: t1.c1 ASC NULLS FIRST Inner Join: t1.c1 = t2.c1 -TableScan: t1 -TableScan: t2 -"### +SubqueryAlias: t1 Review Comment: I am concerned about this change -- it seems like a regression to me. Maybe it is ok, but I don't understand the difference/ why there are a bunch of projections / SubqueryAlias nows If these are fixed by physical plan time it is fine, but from here it looks like something is wrong -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Migrate user_defined tests to insta [datafusion]
shruti2522 opened a new pull request, #15255: URL: https://github.com/apache/datafusion/pull/15255 ## Which issue does this PR close? - Closes #15247 . ## Rationale for this change ## What changes are included in this PR? migrate tests in `datafusion/core/tests/user_defined` to insta ## Are these changes tested? ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Reanimate Code Coverage [datafusion]
blaginin opened a new pull request, #15256: URL: https://github.com/apache/datafusion/pull/15256 ## Which issue does this PR close? Related to https://github.com/apache/datafusion/issues/15155#issuecomment-2715406304 ## Rationale for this change We have quite a lot of code that isn't covered in tests. And it's actually easy to miss covering some case when you submit a big PR ## What changes are included in this PR? Now, when a new PR introduces untested code, a warning message will be posted. Example: [here](https://github.com/blaginin/datafusion/pull/4#issuecomment-2726555358). https://github.com/user-attachments/assets/4ccdbc37-d0d1-4d4c-bd9e-74b275454799"; /> Coverage data is aggregated from: - `linux-test` - `linux-test-datafusion-cli` - `sqllogictest-postgres` - `macos-aarch64` - `test-datafusion-pyarrow` Excluded from coverage: - `datafusion-examples` — not technically tests. - `verify-benchmark-results` — IMO we shouldn't rely on them for functionality tests, those should be covered on the unit test level - `doctests` — because they requite nightly rust and I [couldn't](https://github.com/blaginin/datafusion/actions/runs/13876294634) (yet) make it work in the CI (although will try more) There are ways to make checks even more annoying (e.g. putting ❌ in CI checks). I want to do this later, but first, I want to make sure we're measuring coverage properly and others are okay with it. If that’s the case, we’ll tighten CI checks in future PRs. Absolute coverage example: https://app.codecov.io/gh/blaginin/datafusion/pull/4/tree ## Are there any user-facing changes? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Reanimate Code Coverage [datafusion]
blaginin commented on PR #15256: URL: https://github.com/apache/datafusion/pull/15256#issuecomment-2727009428 To make this work, we'll need to [add](https://docs.codecov.com/docs/adding-the-codecov-token) `CODECOV_TOKEN` and add the [app](https://github.com/apps/codecov) to the repo (may alreacy be there since we used it before) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Improved CI test coverage for rust features [datafusion]
blaginin commented on issue #15155: URL: https://github.com/apache/datafusion/issues/15155#issuecomment-2727008592 WIP PoC is [here](https://github.com/apache/datafusion/pull/15256). The way I think this should work is here: https://github.com/blaginin/datafusion/pull/4#issuecomment-2726555358 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add additional ruff suggestions [datafusion-python]
timsaucer commented on PR #1062: URL: https://github.com/apache/datafusion-python/pull/1062#issuecomment-2727021187 I’m not sure then. I’m away for the weekend and won’t be able to test until Monday. I can see then if I can reproduce. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add additional ruff suggestions [datafusion-python]
Spaarsh commented on PR #1062: URL: https://github.com/apache/datafusion-python/pull/1062#issuecomment-2727025175 The errors are due to one of the rules I enabled in one of my commits `SIM102`. Since I have some 10 other ruff rules to enable anyway, should I disable this one for now? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Expose to `GroupsAccumulator` whether all the groups are sorted [datafusion]
akurmustafa commented on issue #14991: URL: https://github.com/apache/datafusion/issues/14991#issuecomment-2727036019 > [@akurmustafa](https://github.com/akurmustafa) we can add it to the post now. Or we could always make another post "Using `WITH ORDER` with DataFusion..." > > I would be happy to help write it :) I can add this to the original post as a new section, will let you know when it is ready for review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add additional ruff suggestions [datafusion-python]
timsaucer commented on PR #1062: URL: https://github.com/apache/datafusion-python/pull/1062#issuecomment-2727017583 You may be using a different version of ruff. These do change from time to time as new lints get added in. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add additional ruff suggestions [datafusion-python]
Spaarsh commented on PR #1062: URL: https://github.com/apache/datafusion-python/pull/1062#issuecomment-2727017123 I don't understand why the ruff tests are failing. My local ruff check shows no errors. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add additional ruff suggestions [datafusion-python]
Spaarsh commented on PR #1062: URL: https://github.com/apache/datafusion-python/pull/1062#issuecomment-2727019630 The workflow is using `0.9.1` and I `pip install` the same. Still I don't see any errors on my local tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] `native_datafusion` scan is only enabled when `spark.comet.exec.enabled` is set [datafusion-comet]
andygrove commented on issue #1536: URL: https://github.com/apache/datafusion-comet/issues/1536#issuecomment-2727044778 Thanks @viirya. Yes, fully native scan is only used with native execution. Maybe the design is correct as it is. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Enable `used_underscore_binding` clippy lint [datafusion]
alamb merged PR #15189: URL: https://github.com/apache/datafusion/pull/15189 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Expose to `GroupsAccumulator` whether all the groups are sorted [datafusion]
alamb commented on issue #14991: URL: https://github.com/apache/datafusion/issues/14991#issuecomment-2726553041 Here is the example / test case from @asubiotto in https://github.com/apache/datafusion/issues/14991#issuecomment-2720197770 In the following plan, both `Aggregate` should have `ordering_mode=Sorted ` So instead of ``` Aggregate: groupBy=[[unnested.generated_id]], aggr=[[array_agg(unnested.ar)]] ``` It should be ``` AggregateExec: mode=FinalPartitioned, gby=[generated_id@0 as generated_id], aggr=[sum(unnested.ar)], ordering_mode=Sorted ``` Here is the full query ```sql EXPLAIN WITH unnested AS (SELECT ROW_NUMBER() OVER () AS generated_id, unnest(array[value]) as ar FROM range(1,5)) SELECT array_agg(ar) FROM unnested group by generated_id; +---+---+ | plan_type | plan | +---+---+ | logical_plan | Projection: array_agg(unnested.ar) | | | Aggregate: groupBy=[[unnested.generated_id]], aggr=[[array_agg(unnested.ar)]] | | | SubqueryAlias: unnested | | | Projection: generated_id, __unnest_placeholder(make_array(tmp_table.value),depth=1) AS UNNEST(make_array(tmp_table.value)) AS ar | | | Unnest: lists[__unnest_placeholder(make_array(tmp_table.value))|depth=1] structs[]
Re: [PR] chore: revert "Upgrade to Spark 3.5.4 (#1471)" [datafusion-comet]
andygrove commented on PR #1493: URL: https://github.com/apache/datafusion-comet/pull/1493#issuecomment-2711566564 > @andygrove I was going to put a rpad fix in https://github.com/apache/datafusion-comet/pull/1482/files but if we decide to revert, I can separate the PR. Let me know. I am fine with either approach. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Only unnest source for `EmptyRelation` [datafusion]
blaginin commented on code in PR #15159: URL: https://github.com/apache/datafusion/pull/15159#discussion_r1997328648 ## datafusion/sql/src/unparser/plan.rs: ## @@ -377,8 +377,17 @@ impl Unparser<'_> { }; if self.dialect.unnest_as_table_factor() && unnest_input_type.is_some() { if let LogicalPlan::Unnest(unnest) = &p.input.as_ref() { -return self -.unnest_to_table_factor_sql(unnest, query, select, relation); +if let LogicalPlan::Projection(projection) = unnest.input.as_ref() +{ +if matches!( +projection.input.as_ref(), +LogicalPlan::EmptyRelation(_) Review Comment: Thank you, added in https://github.com/apache/datafusion/pull/15159/commits/28f53014482399c6f74ac2bfda464de712705663 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Different unnests on `plan_to_sql` are merged [datafusion]
blaginin commented on issue #15128: URL: https://github.com/apache/datafusion/issues/15128#issuecomment-2727002296 I also don't see other cases... They are possible in theory, and I guess by the time we get to them, we'll need to refactor the unparser quite thoroughly 🥲 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] build(deps): bump uuid from 1.13.1 to 1.16.0 [datafusion-python]
dependabot[bot] opened a new pull request, #1068: URL: https://github.com/apache/datafusion-python/pull/1068 Bumps [uuid](https://github.com/uuid-rs/uuid) from 1.13.1 to 1.16.0. Release notes Sourced from https://github.com/uuid-rs/uuid/releases";>uuid's releases. v1.16.0 What's Changed Mark Uuid::new_v8 const by https://github.com/tguichaoua";>@tguichaoua in https://redirect.github.com/uuid-rs/uuid/pull/815";>uuid-rs/uuid#815 Prepare for 1.16.0 release by https://github.com/KodrAus";>@KodrAus in https://redirect.github.com/uuid-rs/uuid/pull/817";>uuid-rs/uuid#817 New Contributors https://github.com/tguichaoua";>@tguichaoua made their first contribution in https://redirect.github.com/uuid-rs/uuid/pull/815";>uuid-rs/uuid#815 Full Changelog: https://github.com/uuid-rs/uuid/compare/v1.15.1...v1.16.0";>https://github.com/uuid-rs/uuid/compare/v1.15.1...v1.16.0 v1.15.1 What's Changed Guarantee v7 timestamp will never overflow by https://github.com/KodrAus";>@KodrAus in https://redirect.github.com/uuid-rs/uuid/pull/811";>uuid-rs/uuid#811 Prepare for 1.15.1 release by https://github.com/KodrAus";>@KodrAus in https://redirect.github.com/uuid-rs/uuid/pull/812";>uuid-rs/uuid#812 Full Changelog: https://github.com/uuid-rs/uuid/compare/v1.15.0...v1.15.1";>https://github.com/uuid-rs/uuid/compare/v1.15.0...v1.15.1 v1.15.0 What's Changed Add a manual Debug implementation for NonNilUUid by https://github.com/rick-de-water";>@rick-de-water in https://redirect.github.com/uuid-rs/uuid/pull/808";>uuid-rs/uuid#808 Support higher precision, shiftable timestamps in V7 UUIDs by https://github.com/KodrAus";>@KodrAus in https://redirect.github.com/uuid-rs/uuid/pull/809";>uuid-rs/uuid#809 Prepare for 1.15.0 release by https://github.com/KodrAus";>@KodrAus in https://redirect.github.com/uuid-rs/uuid/pull/810";>uuid-rs/uuid#810 New Contributors https://github.com/rick-de-water";>@rick-de-water made their first contribution in https://redirect.github.com/uuid-rs/uuid/pull/808";>uuid-rs/uuid#808 Full Changelog: https://github.com/uuid-rs/uuid/compare/v1.14.0...v1.15.0";>https://github.com/uuid-rs/uuid/compare/v1.14.0...v1.15.0 v1.14.0 What's Changed Add FromStr impls to the fmt structs by https://github.com/tysen";>@tysen in https://redirect.github.com/uuid-rs/uuid/pull/806";>uuid-rs/uuid#806 Prepare for 1.14.0 release by https://github.com/KodrAus";>@KodrAus in https://redirect.github.com/uuid-rs/uuid/pull/807";>uuid-rs/uuid#807 New Contributors https://github.com/tysen";>@tysen made their first contribution in https://redirect.github.com/uuid-rs/uuid/pull/806";>uuid-rs/uuid#806 Full Changelog: https://github.com/uuid-rs/uuid/compare/v1.13.2...v1.14.0";>https://github.com/uuid-rs/uuid/compare/v1.13.2...v1.14.0 v1.13.2 What's Changed Add a compile_error when no source of randomness is available on wasm32-unknown-unknown by https://github.com/KodrAus";>@KodrAus in https://redirect.github.com/uuid-rs/uuid/pull/804";>uuid-rs/uuid#804 Prepare for 1.13.2 release by https://github.com/KodrAus";>@KodrAus in https://redirect.github.com/uuid-rs/uuid/pull/805";>uuid-rs/uuid#805 Full Changelog: https://github.com/uuid-rs/uuid/compare/1.13.1...v1.13.2";>https://github.com/uuid-rs/uuid/compare/1.13.1...v1.13.2 Commits https://github.com/uuid-rs/uuid/commit/c36beb14d50f835c1f1220117ca51aae64860a3e";>c36beb1 Merge pull request https://redirect.github.com/uuid-rs/uuid/issues/817";>#817 from uuid-rs/cargo/v1.16.0 https://github.com/uuid-rs/uuid/commit/5338b246b7a8244cab3cfaa85b14fe1d1bcdcd96";>5338b24 prepare for 1.16.0 release https://github.com/uuid-rs/uuid/commit/420f6279aeff48f0e12b0b39af43a5c149963382";>420f627 Merge pull request https://redirect.github.com/uuid-rs/uuid/issues/815";>#815 from tguichaoua/new_v8_const https://github.com/uuid-rs/uuid/commit/254258c8c7c7d6c41aaf6f573dc1731549d519b2";>254258c mark Uuid::new_v8 const https://github.com/uuid-rs/uuid/commit/4e5b88e7af12f06ea526088506752c450dc991e3";>4e5b88e Merge pull request https://redirect.github.com/uuid-rs/uuid/issues/812";>#812 from uuid-rs/cargo/v1.15.1 https://github.com/uuid-rs/uuid/commit/7fb64f78c745fe46e209f0f5a50883b711e25f04";>7fb64f7 prepare for 1.15.1 release https://github.com/uuid-rs/uuid/commit/f05b6df98e8d521eecc80dc7923c9b38e2dff634";>f05b6df Merge pull request https://redirect.github.com/uuid-rs/uuid/issues/811";>#811 from uuid-rs/fix/v7-overflow https://github.com/uuid-rs/uuid/commit/c2d313fbbb3157c186f0511c9e2a914d174f258f";>c2d313f guarantee v7 timestamp will never overflow https://github.com/uuid-rs/uuid/commit/56ba68ff13983f3917263b86e06a81c00ee97a3d";>56ba68f Merge pull request https://redirect.github.com/uuid-rs/uuid/issues/810";>#810 from uuid-rs/cargo/v1.15.0 https://github.com/uuid-rs/uuid/commit/26c8a9
[PR] build(deps): bump pyo3-build-config from 0.23.4 to 0.24.0 [datafusion-python]
dependabot[bot] opened a new pull request, #1067: URL: https://github.com/apache/datafusion-python/pull/1067 Bumps [pyo3-build-config](https://github.com/pyo3/pyo3) from 0.23.4 to 0.24.0. Release notes Sourced from https://github.com/pyo3/pyo3/releases";>pyo3-build-config's releases. PyO3 0.24.0 This release is an incremental improvement of refinements and optimizations following the new APIs established in PyO3's last few releases. Support for jiff datetime conversions have been added, and also UUID conversions. The FromPyObject derive macro has gained new #[pyo3(default = ...)] and #[pyo3(rename_all = ...)] options, and the IntoPyObject derive macro has gained a new #[pyo3(into_py_with = ...)] option. PyO3 will now pass positional arguments to Python functions using the "vectorcall" protocol in many cases, which should be an optimization over the previous behaviour (of creating a Python tuple of positional arguments). Many methods on iterators of Python collections have been optimized. There are also many other incremental improvements, bug fixes and smaller features. Thank you to everyone who contributed code, documentation, design ideas, bug reports, and feedback. The following contributors' commits are included in this release: https://github.com/0x676e67";>@0x676e67 https://github.com/alex";>@alex https://github.com/arielb1";>@arielb1 https://github.com/bschoenmaeckers";>@bschoenmaeckers https://github.com/davidhewitt";>@davidhewitt https://github.com/dependabot";>@dependabot[bot] https://github.com/eltociear";>@eltociear https://github.com/Icxolu";>@Icxolu https://github.com/IvanIsCoding";>@IvanIsCoding https://github.com/JeanArhancet";>@JeanArhancet https://github.com/kahojyun";>@kahojyun https://github.com/kemingy";>@kemingy https://github.com/kylebarron";>@kylebarron https://github.com/LifeLex";>@LifeLex https://github.com/LilyFoote";>@LilyFoote https://github.com/lundybernard";>@lundybernard https://github.com/matt-codecov";>@matt-codecov https://github.com/mattip";>@mattip https://github.com/mejrs";>@mejrs https://github.com/messense";>@messense https://github.com/msimacek";>@msimacek https://github.com/ngoldbaum";>@ngoldbaum https://github.com/nicolasavru";>@nicolasavru https://github.com/Owen-CH-Leung";>@Owen-CH-Leung https://github.com/peterjoel";>@peterjoel https://github.com/SilverBzH";>@SilverBzH https://github.com/Tpt";>@Tpt https://github.com/yoav-orca";>@yoav-orca https://github.com/yotamofek";>@yotamofek PyO3 0.23.5 This release is a final set of backports onto the PyO3 0.23 series: PyPy 3.11 support Fixes to #[pyclass(freelist)] on free-threaded Python 3.13 Fix to Python::run for a case when __builtins__ is not loaded correctly on Python 3.10+ ... (truncated) Changelog Sourced from https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md";>pyo3-build-config's changelog. [0.24.0] - 2025-03-09 Packaging Add supported CPython/PyPy versions to cargo package metadata. https://redirect.github.com/PyO3/pyo3/pull/4756";>#4756 Bump target-lexicon dependency to 0.13. https://redirect.github.com/PyO3/pyo3/pull/4822";>#4822 Add optional jiff dependency to add conversions for jiff datetime types. https://redirect.github.com/PyO3/pyo3/pull/4823";>#4823 Bump minimum supported inventory version to 0.3.5. https://redirect.github.com/PyO3/pyo3/pull/4954";>#4954 Added Add PyIterator::send method to allow sending values into a python generator. https://redirect.github.com/PyO3/pyo3/pull/4746";>#4746 Add PyCallArgs trait for passing arguments into the Python calling protocol. This enabled using a faster calling convention for certain types, improving performance. https://redirect.github.com/PyO3/pyo3/pull/4768";>#4768 Add #[pyo3(default = ...'] option for #[derive(FromPyObject)] to set a default value for extracted fields of named structs. https://redirect.github.com/PyO3/pyo3/pull/4829";>#4829 Add #[pyo3(into_py_with = ...)] option for #[derive(IntoPyObject, IntoPyObjectRef)]. https://redirect.github.com/PyO3/pyo3/pull/4850";>#4850 Add uuid to/from python conversions. https://redirect.github.com/PyO3/pyo3/pull/4864";>#4864 Add FFI definitions PyThreadState_GetFrame and PyFrame_GetBack. https://redirect.github.com/PyO3/pyo3/pull/4866";>#4866 Optimize last for BoundListIterator, BoundTupleIterator and BorrowedTupleIterator. https://redirect.github.com/PyO3/pyo3/pull/4878";>#4878 Optimize Iterator::count() for PyDict, PyList, PyTuple & PySet. https://redirect.github.com/PyO3/pyo3/pull/4878";>#4878 Optimize nth, nth_back, advance_by and advance_back_by for BoundTupleIterator https://redirect.github.com/PyO3/pyo3/pull/4897";>#4897 Add support for types.GenericAlias as pyo3::types::PyGenericAlias. https://redirect.
[PR] build(deps): bump tokio from 1.43.0 to 1.44.1 [datafusion-python]
dependabot[bot] opened a new pull request, #1069: URL: https://github.com/apache/datafusion-python/pull/1069 Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.43.0 to 1.44.1. Release notes Sourced from https://github.com/tokio-rs/tokio/releases";>tokio's releases. Tokio v1.44.1 1.44.1 (March 13th, 2025) Fixed rt: skip defer queue in block_in_place context (https://redirect.github.com/tokio-rs/tokio/issues/7216";>#7216) https://redirect.github.com/tokio-rs/tokio/issues/7216";>#7216: https://redirect.github.com/tokio-rs/tokio/pull/7216";>tokio-rs/tokio#7216 Tokio v1.44.0 1.44.0 (March 7th, 2025) This release changes the from_std method on sockets to panic if a blocking socket is provided. We determined this change is not a breaking change as Tokio is not intended to operate using blocking sockets. Doing so results in runtime hangs and should be considered a bug. Accidentally passing a blocking socket to Tokio is one of the most common user mistakes. If this change causes an issue for you, please comment on https://redirect.github.com/tokio-rs/tokio/issues/7172";>#7172. Added coop: add task::coop module (https://redirect.github.com/tokio-rs/tokio/issues/7116";>#7116) process: add Command::get_kill_on_drop() (https://redirect.github.com/tokio-rs/tokio/issues/7086";>#7086) sync: add broadcast::Sender::closed (https://redirect.github.com/tokio-rs/tokio/issues/6685";>#6685, https://redirect.github.com/tokio-rs/tokio/issues/7090";>#7090) sync: add broadcast::WeakSender (https://redirect.github.com/tokio-rs/tokio/issues/7100";>#7100) sync: add oneshot::Receiver::is_empty() (https://redirect.github.com/tokio-rs/tokio/issues/7153";>#7153) sync: add oneshot::Receiver::is_terminated() (https://redirect.github.com/tokio-rs/tokio/issues/7152";>#7152) Fixed fs: empty reads on File should not start a background read (https://redirect.github.com/tokio-rs/tokio/issues/7139";>#7139) process: calling start_kill on exited child should not fail (https://redirect.github.com/tokio-rs/tokio/issues/7160";>#7160) signal: fix CTRL_CLOSE, CTRL_LOGOFF, CTRL_SHUTDOWN on windows (https://redirect.github.com/tokio-rs/tokio/issues/7122";>#7122) sync: properly handle panic during mpsc drop (https://redirect.github.com/tokio-rs/tokio/issues/7094";>#7094) Changes runtime: clean up magic number in registration set (https://redirect.github.com/tokio-rs/tokio/issues/7112";>#7112) coop: make coop yield using waker defer strategy (https://redirect.github.com/tokio-rs/tokio/issues/7185";>#7185) macros: make select! budget-aware (https://redirect.github.com/tokio-rs/tokio/issues/7164";>#7164) net: panic when passing a blocking socket to from_std (https://redirect.github.com/tokio-rs/tokio/issues/7166";>#7166) io: clean up buffer casts (https://redirect.github.com/tokio-rs/tokio/issues/7142";>#7142) Changes to unstable APIs rt: add before and after task poll callbacks (https://redirect.github.com/tokio-rs/tokio/issues/7120";>#7120) tracing: make the task tracing API unstable public (https://redirect.github.com/tokio-rs/tokio/issues/6972";>#6972) Documented docs: fix nesting of sections in top-level docs (https://redirect.github.com/tokio-rs/tokio/issues/7159";>#7159) fs: rename symlink and hardlink parameter names (https://redirect.github.com/tokio-rs/tokio/issues/7143";>#7143) io: swap reader/writer in simplex doc test (https://redirect.github.com/tokio-rs/tokio/issues/7176";>#7176) macros: docs about select! alternatives (https://redirect.github.com/tokio-rs/tokio/issues/7110";>#7110) net: rename the argument for send_to (https://redirect.github.com/tokio-rs/tokio/issues/7146";>#7146) ... (truncated) Commits https://github.com/tokio-rs/tokio/commit/d413c9c02af8f2b4fea14b769b86484b12f46595";>d413c9c chore: prepare Tokio v1.44.1 (https://redirect.github.com/tokio-rs/tokio/issues/7217";>#7217) https://github.com/tokio-rs/tokio/commit/addbfb9204be25a8621feb3f20b44a7c1f00edbd";>addbfb9 rt: skip defer queue in block_in_place context (https://redirect.github.com/tokio-rs/tokio/issues/7216";>#7216) https://github.com/tokio-rs/tokio/commit/8182ecf2628d5e80dac52b8ed1ea466dbb0925b9";>8182ecf chore: prepare Tokio v1.44.0 (https://redirect.github.com/tokio-rs/tokio/issues/7202";>#7202) https://github.com/tokio-rs/tokio/commit/a258bff7018940b438e5de3fb846588454df4e4d";>a258bff ci: enable printing in multi thread loom tests (https://redirect.github.com/tokio-rs/tokio/issues/7200";>#7200) https://github.com/tokio-rs/tokio/commit/e076d21f679a35ae2697165d46d111285d09e3b4";>e076d21 process: clarify Child::kill behavior (https://redirect.github.com/tokio-rs/tokio/issues/7162";>#7162) https://github.com/tokio-rs/tokio/commit/042433cdccdf0dd33408c1751a80ddd50a077872";>042433c net: debug_assert on creating a tokio socket from a blocking on
[PR] build(deps): bump async-trait from 0.1.86 to 0.1.88 [datafusion-python]
dependabot[bot] opened a new pull request, #1070: URL: https://github.com/apache/datafusion-python/pull/1070 Bumps [async-trait](https://github.com/dtolnay/async-trait) from 0.1.86 to 0.1.88. Release notes Sourced from https://github.com/dtolnay/async-trait/releases";>async-trait's releases. 0.1.88 Fix lifetime bounding on generic parameters that have cfg (https://redirect.github.com/dtolnay/async-trait/issues/289";>#289) 0.1.87 Documentation improvements Commits https://github.com/dtolnay/async-trait/commit/b3a59195c29c5b336490cec1bac23cff8d3e4483";>b3a5919 Release 0.1.88 https://github.com/dtolnay/async-trait/commit/a306be84ec998f46acc700e8b24a3b68b77a873a";>a306be8 Merge pull request https://redirect.github.com/dtolnay/async-trait/issues/289";>#289 from dtolnay/cfg https://github.com/dtolnay/async-trait/commit/d3059849a4024425f80f0713bc802d8959290d96";>d305984 Fix lifetime bounding on generic parameters that have cfg https://github.com/dtolnay/async-trait/commit/78506f17149e08594c1a120f1df828411772a0b8";>78506f1 Add regression test for issue 288 https://github.com/dtolnay/async-trait/commit/a11384eec60634098f66a3d6ac89c23beccdbbc8";>a11384e Add issue 283 link in test https://github.com/dtolnay/async-trait/commit/32540aadec088a87eeb88f7688ec5b211a0d5167";>32540aa Release 0.1.87 https://github.com/dtolnay/async-trait/commit/137d14caf3fda9b066384e32e883f2c9bddcbf63";>137d14c Resolve mem_replace_with_default clippy lint https://github.com/dtolnay/async-trait/commit/45fd82a71eca47ddbd0627546fb1ba7489e51712";>45fd82a Ignore elidable_lifetime_names pedantic clippy lint https://github.com/dtolnay/async-trait/commit/ea2f2a29a29b71f36479ee99d3d51953f80e44d8";>ea2f2a2 Point standard library links to stable https://github.com/dtolnay/async-trait/commit/3b78161de94175ffa72c6d3a5451415b70cace51";>3b78161 Update ui test suite to nightly-2025-02-12 See full diff in https://github.com/dtolnay/async-trait/compare/0.1.86...0.1.88";>compare view [](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] build(deps): bump async-trait from 0.1.86 to 0.1.87 [datafusion-python]
dependabot[bot] closed pull request #1046: build(deps): bump async-trait from 0.1.86 to 0.1.87 URL: https://github.com/apache/datafusion-python/pull/1046 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] build(deps): bump object_store from 0.11.2 to 0.12.0 [datafusion-python]
dependabot[bot] opened a new pull request, #1071: URL: https://github.com/apache/datafusion-python/pull/1071 Bumps [object_store](https://github.com/apache/arrow-rs) from 0.11.2 to 0.12.0. Changelog Sourced from https://github.com/apache/arrow-rs/blob/main/CHANGELOG-old.md";>object_store's changelog. Historical Changelog https://github.com/apache/arrow-rs/tree/54.2.0";>54.2.0 (2025-02-12) https://github.com/apache/arrow-rs/compare/54.1.0...54.2.0";>Full Changelog Implemented enhancements: Casting from Utf8View to Dict(k, Utf8View) https://redirect.github.com/apache/arrow-rs/issues/7114";>#7114 Support creating map arrays with key metadata https://redirect.github.com/apache/arrow-rs/issues/7100";>#7100 [https://github.com/apache/arrow-rs/labels/arrow";>arrow] [parquet] Print Parquet BasicTypeInfo id when present https://redirect.github.com/apache/arrow-rs/issues/7081";>#7081 [https://github.com/apache/arrow-rs/labels/parquet";>parquet] Add arrow-ipc benchmarks for the IPC reader and writer https://redirect.github.com/apache/arrow-rs/issues/6968";>#6968 [https://github.com/apache/arrow-rs/labels/arrow";>arrow] Fixed bugs: NullBufferBuilder::allocated_size Returns Size in Bits https://redirect.github.com/apache/arrow-rs/issues/7121";>#7121 [https://github.com/apache/arrow-rs/labels/arrow";>arrow] [Regression in 54.0.0]. Decimal cast to smaller precision gives invalid (off-by-one) result in some cases https://redirect.github.com/apache/arrow-rs/issues/7069";>#7069 [https://github.com/apache/arrow-rs/labels/arrow";>arrow] Minor: Fix deprecated note to point to the correct const https://redirect.github.com/apache/arrow-rs/issues/7067";>#7067 [https://github.com/apache/arrow-rs/labels/arrow";>arrow] incorrect error message for reading definition levels https://redirect.github.com/apache/arrow-rs/issues/7056";>#7056 [https://github.com/apache/arrow-rs/labels/parquet";>parquet] First None in ListArray panics in cast_with_options https://redirect.github.com/apache/arrow-rs/issues/7043";>#7043 [https://github.com/apache/arrow-rs/labels/arrow";>arrow] Documentation updates: Minor: Clarify documentation on NullBufferBuilder::allocated_size https://redirect.github.com/apache/arrow-rs/pull/7089";>#7089 [https://github.com/apache/arrow-rs/labels/arrow";>arrow] (https://github.com/alamb";>alamb) Minor: Update release schedule https://redirect.github.com/apache/arrow-rs/pull/7086";>#7086 (https://github.com/alamb";>alamb) Improve ListArray documentation for slices https://redirect.github.com/apache/arrow-rs/pull/7039";>#7039 [https://github.com/apache/arrow-rs/labels/arrow";>arrow] (https://github.com/alamb";>alamb) Merged pull requests: fix: NullBufferBuilder::allocated_size should return Size in Bytes https://redirect.github.com/apache/arrow-rs/pull/7122";>#7122 [https://github.com/apache/arrow-rs/labels/arrow";>arrow] (https://github.com/shuozel";>shuozel) minor: fix deprecated_note https://redirect.github.com/apache/arrow-rs/pull/7105";>#7105 [https://github.com/apache/arrow-rs/labels/arrow";>arrow] (https://github.com/Chen-Yuan-Lai";>Chen-Yuan-Lai) ... (truncated) Commits See full diff in https://github.com/apache/arrow-rs/commits";>compare view [](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore
Re: [PR] build(deps): bump async-trait from 0.1.86 to 0.1.87 [datafusion-python]
dependabot[bot] commented on PR #1046: URL: https://github.com/apache/datafusion-python/pull/1046#issuecomment-2726951093 Superseded by #1070. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] build(deps): bump uuid from 1.13.1 to 1.15.1 [datafusion-python]
dependabot[bot] commented on PR #1039: URL: https://github.com/apache/datafusion-python/pull/1039#issuecomment-2726950949 Superseded by #1068. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] build(deps): bump tokio from 1.43.0 to 1.44.0 [datafusion-python]
dependabot[bot] commented on PR #1047: URL: https://github.com/apache/datafusion-python/pull/1047#issuecomment-2726951020 Superseded by #1069. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] build(deps): bump tokio from 1.43.0 to 1.44.0 [datafusion-python]
dependabot[bot] closed pull request #1047: build(deps): bump tokio from 1.43.0 to 1.44.0 URL: https://github.com/apache/datafusion-python/pull/1047 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] build(deps): bump uuid from 1.13.1 to 1.15.1 [datafusion-python]
dependabot[bot] closed pull request #1039: build(deps): bump uuid from 1.13.1 to 1.15.1 URL: https://github.com/apache/datafusion-python/pull/1039 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [DISCUSSION] physical-plan-common crate ~and Revert the datasource - physical-plan Dependency~ [datafusion]
berkaysynnada commented on issue #15111: URL: https://github.com/apache/datafusion/issues/15111#issuecomment-2726959050 > physical-plan is not possible to import `datasource`, you would end up moving everything inside physical-plan. Why are you thinking so? If you worry about the catalog, when we have physical-plan-common, we can easily convert physical-plan dependency of catalog crate to physical-plan-common. ```mermaid flowchart TD physical-plan["physical-plan"] --> physical-plan-common["physical-plan-common"] & datasource["datasource"] datasource --> physical-plan-common n1["catalog"] --> physical-plan-common datasource --> n1 n1@{ shape: rect} ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Order Requirement Analysis [datafusion-site]
Omega359 commented on code in PR #58: URL: https://github.com/apache/datafusion-site/pull/58#discussion_r1987631033 ## content/blog/2025-03-05-ordering-analysis.md: ## @@ -0,0 +1,353 @@ +--- +layout: post +title: Analysis of Ordering for Better Plans +date: 2025-03-05 +author: Mustafa Akur, Andrew Lamb +categories: [tutorial] +--- + + + + + +## Introduction +In this blog post, we will explore how to determine whether an ordering requirement of an operator is satisfied by its input data. This analysis is essential for order-based optimizations and is often more complex than one might initially think. + +Ordering Requirement for an operator refers to the condition that input data must be sorted in a certain way for the operator to function as intended. If this condition is not met, the operator may not perform as expected. It is the job of the planner to make sure that the requirements - such as specific ordering, specific distribution, etc. - of all operators are satisfied during execution. + + +There are various use cases, where this type of analysis can be useful. +### Removing Unnecessary Sorts +Imagine a user wants to execute the following query: +```SQL +SELECT hostname, log_line +FROM telemetry ORDER BY time ASC limit 10 +``` +If we don't know anything about the `telemetry` table, we need to sort it by `time ASC` and then retrieve the first 10 rows to get the correct result. However, if the table is already ordered by `time ASC`, simply retrieving the first 10 rows is sufficient. This approach executes much faster and uses less memory compared to the first version. + +The key is that the query optimizer needs to know the data is already sorted. For simple queries that is likely simple, but it gets complicated fast, like for example, what if your data is sorted by `[hostname, time ASC]` and your query is +```sql +SELECT hostname, log_line +FROM telemetry WHERE hostname = 'app.example.com' ORDER BY time ASC; +``` +In this case, the system still doesn't have to do any sorting -- but only if it has enough analysis to be able to reason about the sortedness of the stream when we know `hostname` has a single value. + +### Optimizing Execution Modes Using Ordering Information +As another use case, some operators can utilize the ordering information to change its underlying algorithm to execute more efficiently. Consider the following query: +```SQL +SELECT COUNT(log_line) +FROM telemetry GROUP BY hostname; +``` +when `telemetry` is sorted by `hostname`, aggregation doesn't need to hash the entire data at its input. It can use a much more efficient algorithm for grouping the data according to the `hostname` values. Failure to detect the ordering can result in choosing the sub-optimal algorithm variant for the operator. To see this in practice, check out the [source](https://github.com/apache/datafusion/tree/main/datafusion/physical-plan/src/aggregates/order) for ordered variant of the `Aggregation` in `Datafusion`. Review Comment: `input is sorted corectly` ... -> input is sorted correctly -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat/improve ruff test coverage [datafusion-python]
timsaucer commented on PR #1055: URL: https://github.com/apache/datafusion-python/pull/1055#issuecomment-2710849717 #1056 contains the follow on work to apply additional rules -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Dropping Spark 3.3 support [datafusion-comet]
andygrove closed issue #646: Dropping Spark 3.3 support URL: https://github.com/apache/datafusion-comet/issues/646 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Use insta for `DataFrame` tests [datafusion]
alamb commented on PR #15165: URL: https://github.com/apache/datafusion/pull/15165#issuecomment-2725669379 🚀 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Improve eliminate_outer_join rule [datafusion]
suibianwanwank opened a new pull request, #15254: URL: https://github.com/apache/datafusion/pull/15254 ## Which issue does this PR close? - Closes #13232 ## Rationale for this change Inspired by the approach in #13249. This PR explores an alternative that does not require simplify and covering more expressions. Additionally, I believe `evaluates_to_null` can be useful in more optimization rules ## What changes are included in this PR? 1. add `evaluates_to_not_true` and `evaluates_to_null` utility func for optimizing expressions. 2. Improve eliminate_outer_join rule to support more expr. ## Are these changes tested? ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Re-enable GitHub discussions [datafusion-comet]
codecov-commenter commented on PR #1535: URL: https://github.com/apache/datafusion-comet/pull/1535#issuecomment-2726766217 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1535?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report All modified and coverable lines are covered by tests :white_check_mark: > Project coverage is 57.48%. Comparing base [(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`8c9ed74`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/8c9ed74c75ae6899adadbcc82f520f3e9bda4713?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). > Report is 83 commits behind head on main. Additional details and impacted files ```diff @@ Coverage Diff @@ ## main#1535 +/- ## + Coverage 56.12% 57.48% +1.35% + Complexity 976 971 -5 Files 119 123 +4 Lines 1174312174 +431 Branches 2251 2284 +33 + Hits 6591 6998 +407 + Misses 4012 4004 -8 - Partials 1140 1172 +32 ``` [:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1535?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). 🚀 New features to boost your workflow: - ❄ [Test Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, report on failures, and find test suite problems. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Fix wildcard dataframe case [datafusion]
jayzhan211 commented on code in PR #15230: URL: https://github.com/apache/datafusion/pull/15230#discussion_r1995599910 ## datafusion/sql/src/select.rs: ## @@ -741,8 +722,17 @@ impl SqlToRel<'_, S> { } /// Wrap a plan in a projection -fn project(&self, input: LogicalPlan, expr: Vec) -> Result { -self.validate_schema_satisfies_exprs(input.schema(), &expr)?; +fn project(&self, input: LogicalPlan, expr: Vec) -> Result { +// convert to Expr for validate_schema_satisfies_exprs +let exprs = expr +.iter() +.filter_map(|e| match e { +SelectExpr::Expression(expr) => Some(expr.to_owned()), +_ => None, +}) +.collect::>(); +self.validate_schema_satisfies_exprs(input.schema(), &exprs)?; Review Comment: maybe we can change the params to SelectExpr? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Attach `Diagnostic` to more than one column errors in scalar_subquery and in_subquery [datafusion]
changsun20 commented on PR #15143: URL: https://github.com/apache/datafusion/pull/15143#issuecomment-2725384326 Thank you all for your help! Looking forward to contributing more to the project soon. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Add `ctx = SessionContext()` to __init__ [datafusion-python]
deanm opened a new issue, #1065: URL: https://github.com/apache/datafusion-python/issues/1065 That way there'd just be one line ```python from datafusion import ctx, col, lit, functions as F ```  -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] [branch-46] Update ring to v0.17.13 (#15063) [datafusion]
alamb opened a new pull request, #15228: URL: https://github.com/apache/datafusion/pull/15228 - Part of https://github.com/apache/datafusion/issues/15151 Rationale: get a clean CI run on 46 Changes: - Backport https://github.com/apache/datafusion/pull/15063 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] distinct_query_sql benchmark is failing [datafusion]
getChan commented on issue #15213: URL: https://github.com/apache/datafusion/issues/15213#issuecomment-2724220964 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Implement tree explain for InterleaveExec [datafusion]
alamb merged PR #15219: URL: https://github.com/apache/datafusion/pull/15219 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: remove code duplication in native_datafusion and native_iceberg_compat implementations [datafusion-comet]
andygrove commented on code in PR #1443: URL: https://github.com/apache/datafusion-comet/pull/1443#discussion_r1996051635 ## native/core/src/parquet/mod.rs: ## @@ -645,65 +653,42 @@ pub unsafe extern "system" fn Java_org_apache_comet_parquet_Native_initRecordBat .get_string(&JString::from_raw(file_path)) .unwrap() .into(); -let batch_stream: Option; -// TODO: (ARROW NATIVE) Use the common global runtime + let runtime = tokio::runtime::Builder::new_multi_thread() .enable_all() .build()?; let session_ctx = SessionContext::new(); + let (object_store_url, object_store_path) = -prepare_object_store(session_ctx.runtime_env(), path.clone())?; +prepare_object_store(session_ctx.runtime_env(), path.clone()).unwrap(); Review Comment: This should be `?` and the function should return a `Result` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org