[PR] chore(deps): bump blake3 from 1.6.1 to 1.7.0 [datafusion]

2025-03-20 Thread via GitHub
dependabot[bot] opened a new pull request, #15331: URL: https://github.com/apache/datafusion/pull/15331 Bumps [blake3](https://github.com/BLAKE3-team/BLAKE3) from 1.6.1 to 1.7.0. Release notes Sourced from [blake3's releases](https://github.com/BLAKE3-team/BLAKE3/releases). 1

[PR] chore(deps): bump indexmap from 2.7.1 to 2.8.0 [datafusion]

2025-03-20 Thread via GitHub
dependabot[bot] opened a new pull request, #15333: URL: https://github.com/apache/datafusion/pull/15333 Bumps [indexmap](https://github.com/indexmap-rs/indexmap) from 2.7.1 to 2.8.0. Changelog Sourced from https://github.com/indexmap-rs/indexmap/blob/main/RELEASES.md indexmap's

Re: [PR] Saner handling of nulls inside arrays [datafusion]

2025-03-20 Thread via GitHub
joroKr21 commented on PR #15149: URL: https://github.com/apache/datafusion/pull/15149#issuecomment-2739679149 > I recommend following whatever DuckDB (or postgres do) -- there is not much value in DataFusion having different semantics from other systems * DuckDB doesn't have union for

Re: [I] Failed optimizations with Int64 type [datafusion]

2025-03-20 Thread via GitHub
aectaan commented on issue #15291: URL: https://github.com/apache/datafusion/issues/15291#issuecomment-2739694259 Ok, probably it's related to `Analyzer`. After disabling it optimisations are ok -- This is an automated message from the Apache Git Service. To respond to the message, please

Re: [PR] feat: Add `datafusion-spark` crate [datafusion]

2025-03-20 Thread via GitHub
alamb commented on code in PR #15168: URL: https://github.com/apache/datafusion/pull/15168#discussion_r2004518126 ## datafusion/spark/src/function/math/expm1.rs: ## @@ -0,0 +1,169 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license a

Re: [PR] feat: implement GroupsAccumulator for `count(DISTINCT)` aggr [datafusion]

2025-03-20 Thread via GitHub
korowa commented on PR #15324: URL: https://github.com/apache/datafusion/pull/15324#issuecomment-2739393073 Thank you @waynexia, I'm planning to check it out by tomorrow at the latest. I have a question in advance before reviewing -- have you considered implementing a groups accumulator fo

Re: [I] Native scan panic with native_iceberg_compat on hdfs [datafusion-comet]

2025-03-20 Thread via GitHub
comphead commented on issue #1553: URL: https://github.com/apache/datafusion-comet/issues/1553#issuecomment-2737306709 Interesting that the panic is non-unwinding; by default, `panic` in Rust release builds should be unwinding

Re: [I] Allow UDFs to return custom `Diagnostic` [datafusion]

2025-03-20 Thread via GitHub
eliaperantoni commented on issue #15276: URL: https://github.com/apache/datafusion/issues/15276#issuecomment-2739463296 Hey @jsai28, that looks good! I have some points that I'd like to hear your opinion on: 1. I think `FnCallSpans.args` should have exactly as many elements as the ar

Re: [I] [EPIC] A collection of tickets for improving sorting larger than memory datasets / spilling sorts [datafusion]

2025-03-20 Thread via GitHub
xudong963 commented on issue #15271: URL: https://github.com/apache/datafusion/issues/15271#issuecomment-2739420835 @alamb Thank you for summarizing, I'm also interested in this topic and may have more time to join the game in May, but I will keep an eye on the progress.

Re: [PR] Add doc for the `statistics_from_parquet_meta_calc method` [datafusion]

2025-03-20 Thread via GitHub
xudong963 commented on code in PR #15330: URL: https://github.com/apache/datafusion/pull/15330#discussion_r2005013876 ## datafusion/datasource-parquet/src/file_format.rs: ## @@ -797,10 +797,34 @@ pub async fn fetch_statistics( statistics_from_parquet_meta_calc(&metadata, ta

[PR] Add doc for the `statistics_from_parquet_meta_calc method` [datafusion]

2025-03-20 Thread via GitHub
xudong963 opened a new pull request, #15330: URL: https://github.com/apache/datafusion/pull/15330 ## Which issue does this PR close? - Part of https://github.com/apache/datafusion/pull/15289 ## Rationale for this change I'm refactoring the method `statistics_from

Re: [PR] docs: various improvements to tuning guide [datafusion-comet]

2025-03-20 Thread via GitHub
andygrove commented on code in PR #1525: URL: https://github.com/apache/datafusion-comet/pull/1525#discussion_r2004547775 ## docs/source/user-guide/tuning.md: ## @@ -17,18 +17,96 @@ specific language governing permissions and limitations under the License. --> -# Tuning Guid

Re: [PR] chore(deps): Update sqlparser to 0.55.0 [datafusion]

2025-03-20 Thread via GitHub
PokIsemaine commented on code in PR #15183: URL: https://github.com/apache/datafusion/pull/15183#discussion_r2005632036 ## datafusion/sql/src/planner.rs: ## @@ -560,11 +558,11 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> { SQLDataType::SmallInt(_) | SQLDataType::

Re: [PR] chore(deps): bump quote from 1.0.38 to 1.0.40 [datafusion]

2025-03-20 Thread via GitHub
xudong963 merged PR #15332: URL: https://github.com/apache/datafusion/pull/15332

Re: [PR] fix: make register_object_store use same session_env as file scan [datafusion-comet]

2025-03-20 Thread via GitHub
wForget commented on code in PR #1555: URL: https://github.com/apache/datafusion-comet/pull/1555#discussion_r2005775054 ## native/core/src/parquet/mod.rs: ## @@ -641,6 +640,8 @@ pub unsafe extern "system" fn Java_org_apache_comet_parquet_Native_initRecordBat session_timezo

Re: [PR] feat: Support serde for JsonSource PhysicalPlan [datafusion]

2025-03-20 Thread via GitHub
westhide commented on code in PR #15311: URL: https://github.com/apache/datafusion/pull/15311#discussion_r2005735108 ## datafusion/proto/src/physical_plan/mod.rs: ## @@ -247,6 +247,15 @@ impl AsExecutionPlan for protobuf::PhysicalPlanNode { .with_file_compressio

[PR] Prep for 0.1.0rc2 [datafusion-ray]

2025-03-20 Thread via GitHub
robtandy opened a new pull request, #86: URL: https://github.com/apache/datafusion-ray/pull/86 This PR is long but it does not affect the core functionality of DataFusion for Ray, and does not differ from `0.1.0rc1` which has been extensively used by me in benchmarking from `test.pypi`.

Re: [PR] Prep for 0.1.0rc2 [datafusion-ray]

2025-03-20 Thread via GitHub
robtandy commented on PR #86: URL: https://github.com/apache/datafusion-ray/pull/86#issuecomment-2740336109 @andygrove Here is the PR I mentioned to you that I would submit with benchmarking code and results. I have some good graphs of the results, but I'll submit them in a subseque

Re: [I] Add most functions to the Expr class so that they're chainable. [datafusion-python]

2025-03-20 Thread via GitHub
deanm commented on issue #1064: URL: https://github.com/apache/datafusion-python/issues/1064#issuecomment-2740490227 Is this something a PR would be accepted for or no?

Re: [PR] perf: unwrap cast for comparing ints =/!= strings [datafusion]

2025-03-20 Thread via GitHub
alan910127 commented on code in PR #15110: URL: https://github.com/apache/datafusion/pull/15110#discussion_r2005666997 ## datafusion/optimizer/src/analyzer/type_coercion.rs: ## @@ -290,19 +290,72 @@ impl<'a> TypeCoercionRewriter<'a> { right: Expr, right_schema:

Re: [PR] perf: unwrap cast for comparing ints =/!= strings [datafusion]

2025-03-20 Thread via GitHub
alan910127 commented on code in PR #15110: URL: https://github.com/apache/datafusion/pull/15110#discussion_r2005677168 ## datafusion/optimizer/src/analyzer/type_coercion.rs: ## @@ -290,19 +290,72 @@ impl<'a> TypeCoercionRewriter<'a> { right: Expr, right_schema:

Re: [PR] feat: implement GroupsAccumulator for `count(DISTINCT)` aggr [datafusion]

2025-03-20 Thread via GitHub
Dandandan commented on code in PR #15324: URL: https://github.com/apache/datafusion/pull/15324#discussion_r2005787338 ## datafusion/functions-aggregate/src/count.rs: ## @@ -752,10 +761,245 @@ impl Accumulator for DistinctCountAccumulator { } } +/// GroupsAccumulator for

Re: [PR] fix: check if handle has been initialized before closing [datafusion-comet]

2025-03-20 Thread via GitHub
wForget commented on PR #1554: URL: https://github.com/apache/datafusion-comet/pull/1554#issuecomment-2740681759 > I just wanted to see in what condition the NativeBatchReader can be called after close has been called. The scenario I encountered was not NativeBatchReader called afte

Re: [PR] feat: Support serde for FileScanConfig `batch_size` [datafusion]

2025-03-20 Thread via GitHub
westhide commented on code in PR #15335: URL: https://github.com/apache/datafusion/pull/15335#discussion_r2005989730 ## datafusion/proto/proto/datafusion.proto: ## @@ -997,6 +997,7 @@ message FileScanExecConf { reserved 10; datafusion_common.Constraints constraints = 11;

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-20 Thread via GitHub
alamb commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740980237 > Thanks for checking [@alamb](https://github.com/alamb) ! > > I think a large portion is spent in the hash join (repartitioning the right side input) - I think because it r

Re: [PR] refactor: move `CteWorkTable`, `default_table_source` a bunch of files out of core [datafusion]

2025-03-20 Thread via GitHub
alamb commented on PR #15316: URL: https://github.com/apache/datafusion/pull/15316#issuecomment-2741010424 > Where does Memtable belong: datasource or catalog? It is a TableProvider implementation, so I thought it was going to be in catalog, but I'm not so sure anymore, as it has a dependency on d

Re: [PR] Migrate physical plan tests to `insta` (Part-1) [datafusion]

2025-03-20 Thread via GitHub
alamb commented on PR #15313: URL: https://github.com/apache/datafusion/pull/15313#issuecomment-2740902341 FYI @blaginin

Re: [PR] refactor: move `CteWorkTable`, `default_table_source` a bunch of files out of core [datafusion]

2025-03-20 Thread via GitHub
logan-keede commented on PR #15316: URL: https://github.com/apache/datafusion/pull/15316#issuecomment-2740991793 Where does Memtable belong: datasource or catalog? It is a TableProvider implementation, so I thought it was going to be in catalog, but I'm not so sure anymore, as it has a dependency

Re: [PR] [WIP] chore: Fix some inconsistencies in memory pool configuration [datafusion-comet]

2025-03-20 Thread via GitHub
andygrove commented on code in PR #1561: URL: https://github.com/apache/datafusion-comet/pull/1561#discussion_r2005998677 ## spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala: ## @@ -1334,26 +1334,46 @@ object CometSparkSessionExtensions extends Logging {

Re: [PR] refactor: move `CteWorkTable`, `default_table_source` a bunch of files out of core [datafusion]

2025-03-20 Thread via GitHub
logan-keede commented on PR #15316: URL: https://github.com/apache/datafusion/pull/15316#issuecomment-2741075889 I thought it mattered because `datasource` has a dependency on `catalog`, but on a second look it is only `Session`. Any plans on pulling `Session` out? also corresponding `

Re: [PR] Blog post on Parquet pruning in datafusion [datafusion-site]

2025-03-20 Thread via GitHub
alamb commented on PR #60: URL: https://github.com/apache/datafusion-site/pull/60#issuecomment-2741065799 And it is live: https://datafusion.apache.org/blog/2025/03/20/parquet-pruning/

Re: [PR] Add hooks to `SchemaAdapter` to add custom column generators [datafusion]

2025-03-20 Thread via GitHub
adriangb commented on PR #15261: URL: https://github.com/apache/datafusion/pull/15261#issuecomment-2741174091 Marking as ready for review. The main TODO is an API for transmitting statistics information for generated columns before they get generated, but that can even be a followup PR. -

Re: [PR] Blog post on Parquet filter pushdown [datafusion-site]

2025-03-20 Thread via GitHub
Omega359 commented on code in PR #61: URL: https://github.com/apache/datafusion-site/pull/61#discussion_r2006154315 ## content/blog/2025-03-21-parquet-pushdown.md: ## @@ -0,0 +1,259 @@ +--- +layout: post +title: Efficient Filter Pushdown in Parquet +date: 2025-03-21 +author: Xia

Re: [PR] Fix parquet pruning blog post hyperlink [datafusion-site]

2025-03-20 Thread via GitHub
XiangpengHao commented on PR #62: URL: https://github.com/apache/datafusion-site/pull/62#issuecomment-2741287848 Thank you @kevinjqliu

Re: [I] [Rust] [datafusion] Allow integration in non libc environments [datafusion]

2025-03-20 Thread via GitHub
arpity22 commented on issue #102: URL: https://github.com/apache/datafusion/issues/102#issuecomment-2741307866 Since this issue was opened a while ago, has it been resolved but not updated here?

Re: [PR] Simplify display format of `AggregateFunctionExpr`, add `Expr::sql_name` [datafusion]

2025-03-20 Thread via GitHub
alamb merged PR #15253: URL: https://github.com/apache/datafusion/pull/15253

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-20 Thread via GitHub
Dandandan commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740739642 > Note that late materialization (the join / semi join rewrite) needs join operator support that DataFusion doesn't yet have (we could add it but it will take non trivial effo

Re: [PR] Migrate tests to insta [datafusion]

2025-03-20 Thread via GitHub
blaginin commented on code in PR #15288: URL: https://github.com/apache/datafusion/pull/15288#discussion_r2004184344 ## datafusion/core/tests/parquet/custom_reader.rs: ## @@ -96,17 +97,15 @@ async fn route_data_access_ops_to_parquet_file_reader_factory() { let task_ctx = s

Re: [PR] Simplify display format of `AggregateFunctionExpr`, add `Expr::sql_name` [datafusion]

2025-03-20 Thread via GitHub
alamb commented on PR #15253: URL: https://github.com/apache/datafusion/pull/15253#issuecomment-2740745436 Thank you @irenjj @jayzhan211 and @xudong963 🙏

Re: [I] [tree explain] Simplify display format of `AggregateFunctionExpr` [datafusion]

2025-03-20 Thread via GitHub
alamb closed issue #15252: [tree explain] Simplify display format of `AggregateFunctionExpr` URL: https://github.com/apache/datafusion/issues/15252

Re: [I] Building project takes a *long* time (esp compilation time for `datafusion` core crate) [datafusion]

2025-03-20 Thread via GitHub
findepi commented on issue #13814: URL: https://github.com/apache/datafusion/issues/13814#issuecomment-2740776452 > I was wondering recently if the problem could be related to all the re-exports (`pub use `) we do in DataFusion. Maybe we could see if reducing them helped 🤔 i

Re: [PR] feat: introduce `JoinSetTracer` trait for tracing context propagation in spawned tasks [datafusion]

2025-03-20 Thread via GitHub
alamb merged PR #14547: URL: https://github.com/apache/datafusion/pull/14547

Re: [PR] feat: introduce `JoinSetTracer` trait for tracing context propagation in spawned tasks [datafusion]

2025-03-20 Thread via GitHub
alamb commented on PR #14547: URL: https://github.com/apache/datafusion/pull/14547#issuecomment-2740784949 I also added this feature to the list of things we should document with the next release - https://github.com/apache/datafusion/issues/15072 Thanks again @geoffreyclaude and @

[PR] Enforce JOIN plan to require condition [datafusion]

2025-03-20 Thread via GitHub
goldmedal opened a new pull request, #15334: URL: https://github.com/apache/datafusion/pull/15334 ## Which issue does this PR close? - Closes #13486 ## Rationale for this change When working on unparsing the plan optimized by `ScalarSubqueryToJoin`, I notice

Re: [PR] feat: Support serde for FileScanConfig `batch_size` [datafusion]

2025-03-20 Thread via GitHub
alamb commented on code in PR #15335: URL: https://github.com/apache/datafusion/pull/15335#discussion_r2005949069 ## datafusion/proto/proto/datafusion.proto: ## @@ -997,6 +997,7 @@ message FileScanExecConf { reserved 10; datafusion_common.Constraints constraints = 11; +

Re: [I] Unsupported NdJsonExec plan and extension codec [datafusion-ballista]

2025-03-20 Thread via GitHub
alamb closed issue #1209: Unsupported NdJsonExec plan and extension codec URL: https://github.com/apache/datafusion-ballista/issues/1209

Re: [I] Dynamic pruning filters from TopK state [datafusion]

2025-03-20 Thread via GitHub
alamb commented on issue #15037: URL: https://github.com/apache/datafusion/issues/15037#issuecomment-2740932019 Thanks @adriangb -- I will try and review it asap (hopefully tomorrow afternoon or tomorrow)

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-20 Thread via GitHub
Dandandan commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740936826 Thanks for checking @alamb ! I think a large portion is spent in the hash join (repartitioning the right input) - I think because it runs as `Partitioned` hash join, instea

Re: [PR] feat: Support serde for JsonSource PhysicalPlan [datafusion]

2025-03-20 Thread via GitHub
alamb merged PR #15311: URL: https://github.com/apache/datafusion/pull/15311

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-20 Thread via GitHub
alamb commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740888007 I tried the rewrite into a Semi join and indeed it is over 2x slower (5.3sec vs 12sec) ```sql > SELECT * from 'hits_partitioned' WHERE "URL" LIKE '%google%' ORDER BY "Ev

Re: [I] Make ClickBench Q23 Go Faster [datafusion]

2025-03-20 Thread via GitHub
alamb commented on issue #15177: URL: https://github.com/apache/datafusion/issues/15177#issuecomment-2740900315 I am not really sure where the time is going 🤔 output of explain analyze: [explain.txt](https://github.com/user-attachments/files/19370532/explain.txt)

Re: [PR] include some BinaryOperator from sqlparser [datafusion]

2025-03-20 Thread via GitHub
alamb commented on code in PR #15327: URL: https://github.com/apache/datafusion/pull/15327#discussion_r2005971323 ## datafusion/physical-expr/src/expressions/binary.rs: ## @@ -793,8 +793,10 @@ impl BinaryExpr { BitwiseShiftRight => bitwise_shift_right_dyn(left, righ

Re: [PR] refactor: move `CteWorkTable`, `default_table_source` a bunch of files out of core [datafusion]

2025-03-20 Thread via GitHub
alamb commented on code in PR #15316: URL: https://github.com/apache/datafusion/pull/15316#discussion_r2005982815 ## datafusion/physical-expr/src/physical_expr.rs: ## @@ -146,6 +148,38 @@ pub fn create_ordering( Ok(all_sort_orders) } +/// Create a physical sort expressio

Re: [PR] chore: Fix some inconsistencies in memory pool configuration [datafusion-comet]

2025-03-20 Thread via GitHub
andygrove commented on code in PR #1561: URL: https://github.com/apache/datafusion-comet/pull/1561#discussion_r2006287257 ## spark/src/main/scala/org/apache/comet/CometExecIterator.scala: ## @@ -63,9 +64,28 @@ class CometExecIterator( }.toArray private val plan = { va

Re: [PR] feat: introduce hadoop mini cluster to test native scan on hdfs [datafusion-comet]

2025-03-20 Thread via GitHub
kazuyukitanimura commented on code in PR #1556: URL: https://github.com/apache/datafusion-comet/pull/1556#discussion_r2006238491 ## spark/src/test/scala/org/apache/comet/WithHdfsCluster.scala: ## @@ -0,0 +1,103 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under on

[PR] Always use `PartitionMode::Auto` in planner [datafusion]

2025-03-20 Thread via GitHub
Dandandan opened a new pull request, #15339: URL: https://github.com/apache/datafusion/pull/15339 ## Which issue does this PR close? - Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes teste
