Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-03 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2935052670 > We know exactly when this sort of a yielding will be needed thanks to the information exposed to the planner by the ExecutionPlan APIs > We know exactly when this sort of a

Re: [PR] [MAJOR] Equivalence System Overhaul [datafusion]

2025-06-03 Thread via GitHub
alamb commented on code in PR #16217: URL: https://github.com/apache/datafusion/pull/16217#discussion_r2123545316 ## datafusion/physical-expr/src/equivalence/properties/dependency.rs: ## @@ -907,72 +796,13 @@ mod tests { for (exprs, expected) in test_cases {

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-03 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2935084653 In the meantime, I've been experimenting with factoring out the poll/yield budget part inspired a bit by `tokio::task::coop`. The `maybe_poll` macro is not as elegant as `consume_bu

[I] Perf : read from csv default datatype setting to utf8view [datafusion]

2025-06-03 Thread via GitHub
zhuqi-lucas opened a new issue, #16241: URL: https://github.com/apache/datafusion/issues/16241 ### Is your feature request related to a problem or challenge? Currently, reading from CSV defaults to Utf8; when set to Utf8View, the performance improves a lot. See the result: `

Re: [I] Perf : read from csv default datatype setting to utf8view [datafusion]

2025-06-03 Thread via GitHub
zhuqi-lucas commented on issue #16241: URL: https://github.com/apache/datafusion/issues/16241#issuecomment-2935157081 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. T

Re: [PR] Minor: Print cargo command in bench script [datafusion]

2025-06-03 Thread via GitHub
alamb commented on PR #16236: URL: https://github.com/apache/datafusion/pull/16236#issuecomment-2935110790 Thank you!

Re: [PR] Introduce Async User Defined Functions [datafusion]

2025-06-03 Thread via GitHub
alamb commented on PR #14837: URL: https://github.com/apache/datafusion/pull/14837#issuecomment-2935116007 🤖 `./gh_compare_branch_bench.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch_bench.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~

Re: [PR] [MAJOR] Equivalence System Overhaul [datafusion]

2025-06-03 Thread via GitHub
alamb commented on code in PR #16217: URL: https://github.com/apache/datafusion/pull/16217#discussion_r2123722886 ## datafusion/catalog/src/listing_schema.rs: ## @@ -143,7 +141,7 @@ impl ListingSchemaProvider { order_exprs: vec![],

Re: [PR] Minor: Print cargo command in bench script [datafusion]

2025-06-03 Thread via GitHub
alamb merged PR #16236: URL: https://github.com/apache/datafusion/pull/16236

[I] Improve sql planing performance (optimize `try_process_unnest`) [datafusion]

2025-06-03 Thread via GitHub
alamb opened a new issue, #16242: URL: https://github.com/apache/datafusion/issues/16242 ### Is your feature request related to a problem or challenge? While I was reviewing https://github.com/apache/datafusion/pull/16217 I did some profiling on the planning benchmarks It looks

Re: [I] Release DataFusion `48.0.0` (June 2025) [datafusion]

2025-06-03 Thread via GitHub
alamb commented on issue #15771: URL: https://github.com/apache/datafusion/issues/15771#issuecomment-2935196264 I hope to begin working on the delta-rs upgrade tomorrow as well

Re: [I] Read from csv default datatype setting to utf8view [datafusion]

2025-06-03 Thread via GitHub
zhuqi-lucas commented on issue #16241: URL: https://github.com/apache/datafusion/issues/16241#issuecomment-2935189750 I am not sure if the performance result is a noise, i rerun, it show different result sometimes.

[PR] Perf: load default Utf8View for CSV datatype [datafusion]

2025-06-03 Thread via GitHub
zhuqi-lucas opened a new pull request, #16243: URL: https://github.com/apache/datafusion/pull/16243 ## Which issue does this PR close? - Closes [#16241](https://github.com/apache/datafusion/issues/16241) ## Rationale for this change ## What changes are inc

Re: [PR] [MAJOR] Equivalence System Overhaul [datafusion]

2025-06-03 Thread via GitHub
alamb commented on PR #16217: URL: https://github.com/apache/datafusion/pull/16217#issuecomment-2934797274 > 🤖 `./gh_compare_branch_bench.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch_bench.sh) Running I think the planning slowdow

Re: [PR] feat: Support defining custom MetricValues in PhysicalPlans [datafusion]

2025-06-03 Thread via GitHub
sfluor commented on code in PR #16195: URL: https://github.com/apache/datafusion/pull/16195#discussion_r2123513896 ## datafusion/physical-plan/src/metrics/value.rs: ## @@ -516,6 +596,21 @@ impl MetricValue { (Self::EndTimestamp(timestamp), Self::EndTimestamp(other_

Re: [I] Blog post about parquet vs custom file formats [datafusion]

2025-06-03 Thread via GitHub
alamb commented on issue #16149: URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2934649736 I posted about this on twitter too in case anyone is interested: https://x.com/andrewlamb/status/1929852296323547273

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-03 Thread via GitHub
zhuqi-lucas commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2933950378 It's more complex than i expected, i need more time to investigate about the rule plan, because we add a new YieldStreamExec. Several exec will have specified rule to manag

[I] How to write csv file to disk from a empty dataframe? [datafusion]

2025-06-03 Thread via GitHub
mmooyyii opened a new issue, #16240: URL: https://github.com/apache/datafusion/issues/16240 I want to write a CSV file that includes only headers: ``` use datafusion::config::CsvOptions; use datafusion::dataframe::DataFrameWriteOptions; use datafusion::error::Result; use datafusion:

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-03 Thread via GitHub
ozankabak commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2934263065 @zhuqi-lucas, you may be having trouble with current rules because `YieldStreamExec` doesn't implement certain `ExecutionPlan` APIs. I would be surprised if the rules are that brit

Re: [I] How to write csv file to disk from a empty dataframe? [datafusion]

2025-06-03 Thread via GitHub
mmooyyii commented on issue #16240: URL: https://github.com/apache/datafusion/issues/16240#issuecomment-2934238303 I want to do something like a materialized view. I have to deal with this special case if DataFusion can't write an empty DataFrame.

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-03 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2934462871 > @pepijnve, an intrusive solution will be a hard sell for me. There are simply too many cases, each with their own context and somewhat specific details. @ozankabak could yo

[PR] Add documentation for native_datafusion Parquet scanner's S3 support [datafusion-comet]

2025-06-03 Thread via GitHub
Kontinuation opened a new pull request, #1832: URL: https://github.com/apache/datafusion-comet/pull/1832 ## Which issue does this PR close? Relates to #1829. ## Rationale for this change Document the S3 support added for `native_datafusion` Parquet scanner in https://g

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-03 Thread via GitHub
zhuqi-lucas commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2934552973 > > @pepijnve, an intrusive solution will be a hard sell for me. There are simply too many cases, each with their own context and somewhat specific details. > > @ozankabak

Re: [I] Snowflake: TIMESTAMP precision regression [datafusion-sqlparser-rs]

2025-06-03 Thread via GitHub
psukys commented on issue #1861: URL: https://github.com/apache/datafusion-sqlparser-rs/issues/1861#issuecomment-2934554910 Is TimestampNtz shared between Databricks and Snowflake? Checking on slightly older version (sqloxide==0.1.54), the parsed result is: ```json [ {

Re: [I] Support integration with Parquet modular encryption [datafusion]

2025-06-03 Thread via GitHub
adamreeve commented on issue #15216: URL: https://github.com/apache/datafusion/issues/15216#issuecomment-2933379249 I've created a draft PR with an example of what integration with a KMS could look like: https://github.com/apache/datafusion/pull/16237 Any feedback would be much apprec

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-03 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2934052479 > It's more complex than i expected, i need more time to investigate about the rule plan, because we add a new YieldStreamExec. > > Several exec will have specified rule to ma

Re: [PR] Remove use of deprecated dict_ordered in datafusion-proto (#16218) [datafusion]

2025-06-03 Thread via GitHub
xudong963 merged PR #16220: URL: https://github.com/apache/datafusion/pull/16220

Re: [I] Remove use of deprecated dict_ordered in datafusion-proto [datafusion]

2025-06-03 Thread via GitHub
xudong963 closed issue #16218: Remove use of deprecated dict_ordered in datafusion-proto URL: https://github.com/apache/datafusion/issues/16218

Re: [PR] Always add parentheses when formatting`BinaryExpr` with `SchemaDisplay` [datafusion]

2025-06-03 Thread via GitHub
xudong963 commented on PR #16209: URL: https://github.com/apache/datafusion/pull/16209#issuecomment-2934084892 @hendrikmakait Hi, there are still a few places that need to change

Re: [PR] Minor: update documentation for PrunableStatistics [datafusion]

2025-06-03 Thread via GitHub
xudong963 commented on code in PR #16213: URL: https://github.com/apache/datafusion/pull/16213#discussion_r2123103628 ## datafusion/common/src/pruning.rs: ## @@ -258,7 +261,16 @@ impl PruningStatistics for PartitionPruningStatistics { } /// Prune a set of containers represen

Re: [I] Migrate `core` tests to `insta` [datafusion]

2025-06-03 Thread via GitHub
Chen-Yuan-Lai commented on issue #15791: URL: https://github.com/apache/datafusion/issues/15791#issuecomment-2933783071 @blaginin Thank you for following up on this. I noticed some tests used a customized assertion macro, so I have two questions for this: 1. For `assert_metrics`: I'm

Re: [I] Release DataFusion `48.0.0` (June 2025) [datafusion]

2025-06-03 Thread via GitHub
shehabgamin commented on issue #15771: URL: https://github.com/apache/datafusion/issues/15771#issuecomment-2934117630 I plan to test over the next couple of days as well.

Re: [PR] feat: Support defining custom MetricValues in PhysicalPlans [datafusion]

2025-06-03 Thread via GitHub
gabotechs commented on code in PR #16195: URL: https://github.com/apache/datafusion/pull/16195#discussion_r2123429852 ## datafusion/physical-plan/src/metrics/value.rs: ## @@ -516,6 +596,21 @@ impl MetricValue { (Self::EndTimestamp(timestamp), Self::EndTimestamp(oth

Re: [I] Interuptable queries in jupyter notebooks [datafusion-python]

2025-06-03 Thread via GitHub
kosiew commented on issue #1136: URL: https://github.com/apache/datafusion-python/issues/1136#issuecomment-2934661968 take

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-03 Thread via GitHub
ozankabak commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2934960971 @pepijnve this is a good summary of why I am against changing each operator individually: > IIRC yes -- and of few flavors. Sorting unconditionally suffers from this problem

Re: [PR] fix: add missing row count limits to TPC-H queries [datafusion]

2025-06-03 Thread via GitHub
xudong963 merged PR #16230: URL: https://github.com/apache/datafusion/pull/16230

Re: [I] Enable merge queue in github to avoid commit confliction. [datafusion]

2025-06-03 Thread via GitHub
blaginin commented on issue #6880: URL: https://github.com/apache/datafusion/issues/6880#issuecomment-2934980680 Based on ASF Slack, I believe MQ aren't currently supported in `.asf.yaml` because there's no API support (https://github.com/orgs/community/discussions/50893)

Re: [PR] [MAJOR] Equivalence System Overhaul [datafusion]

2025-06-03 Thread via GitHub
ozankabak commented on PR #16217: URL: https://github.com/apache/datafusion/pull/16217#issuecomment-2934983867 Thanks for all the reviews. I will address them today 🚀 With the computational complexity issue removed, I agree that we can progressively reduce the "constant factor" in ru

Re: [PR] docs: Add documentation for native_datafusion Parquet scanner's S3 support [datafusion-comet]

2025-06-03 Thread via GitHub
codecov-commenter commented on PR #1832: URL: https://github.com/apache/datafusion-comet/pull/1832#issuecomment-2934593678 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1832?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] Improve performance of constant aggregate window expression [datafusion]

2025-06-03 Thread via GitHub
alamb commented on PR #16234: URL: https://github.com/apache/datafusion/pull/16234#issuecomment-2935269917 🤖 `./gh_compare_branch_bench.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch_bench.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~

Re: [PR] Improve performance of constant aggregate window expression [datafusion]

2025-06-03 Thread via GitHub
alamb commented on PR #16234: URL: https://github.com/apache/datafusion/pull/16234#issuecomment-2935270233 🤖 `./gh_compare_branch_bench.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch_bench.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~

Re: [PR] Improve performance of constant aggregate window expression [datafusion]

2025-06-03 Thread via GitHub
alamb commented on PR #16234: URL: https://github.com/apache/datafusion/pull/16234#issuecomment-2935270486 🤖 `./gh_compare_branch_bench.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch_bench.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~

[I] Iceberg integration - parquet-column version conflicts [datafusion-comet]

2025-06-03 Thread via GitHub
tglanz opened a new issue, #1833: URL: https://github.com/apache/datafusion-comet/issues/1833 ### Describe the bug Trying to insert/query from iceberg table according to https://github.com/apache/datafusion-comet/blob/main/docs/source/user-guide/iceberg.md and encountering the follow

Re: [PR] Improve performance of constant aggregate window expression [datafusion]

2025-06-03 Thread via GitHub
alamb commented on PR #16234: URL: https://github.com/apache/datafusion/pull/16234#issuecomment-2935279183 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubun

Re: [I] Read from csv default datatype setting to utf8view [datafusion]

2025-06-03 Thread via GitHub
alamb commented on issue #16241: URL: https://github.com/apache/datafusion/issues/16241#issuecomment-2935286362 > I am not sure if the performance result is a noise, i rerun, it show different result sometimes. I think it is noise -- I think the h2o benchmarks read data from CSV so th

[PR] fix: Enable InjectRuntimeFilterSuite Spark SQL tests [datafusion-comet]

2025-06-03 Thread via GitHub
andygrove opened a new pull request, #1834: URL: https://github.com/apache/datafusion-comet/pull/1834 ## Which issue does this PR close? Closes https://github.com/apache/datafusion-comet/issues/1831 ## Rationale for this change More tests can be enabled no

Re: [I] How to write csv file to disk from a empty dataframe? [datafusion]

2025-06-03 Thread via GitHub
alamb commented on issue #16240: URL: https://github.com/apache/datafusion/issues/16240#issuecomment-2935297472 I am not sure -- it seems like maybe we should have an option in DataFrameWriteOptions that will force an empty file to be written even if there are no rows 🤔

Re: [I] [EPIC] Complete `datafusion-spark` Spark Compatible Functions [datafusion]

2025-06-03 Thread via GitHub
linhr commented on issue #15914: URL: https://github.com/apache/datafusion/issues/15914#issuecomment-2935366003 > @linhr has some ideas around making sqllogictests easier to work with Here is my idea to automate the test setup, without bringing Spark as a hard dependency. 1. We cr

Re: [I] Read from csv default datatype setting to utf8view [datafusion]

2025-06-03 Thread via GitHub
zhuqi-lucas commented on issue #16241: URL: https://github.com/apache/datafusion/issues/16241#issuecomment-2935384294 > > I am not sure if the performance result is a noise, i rerun, it show different result sometimes. > > I think it is noise -- I think the h2o benchmarks read data fr

Re: [PR] Improve performance of constant aggregate window expression [datafusion]

2025-06-03 Thread via GitHub
alamb commented on PR #16234: URL: https://github.com/apache/datafusion/pull/16234#issuecomment-2935487885 🤖: Benchmark completed Details ``` Comparing HEAD and constant_agg_window Benchmark clickbench_extended.json

Re: [PR] Introduce Async User Defined Functions [datafusion]

2025-06-03 Thread via GitHub
alamb commented on PR #14837: URL: https://github.com/apache/datafusion/pull/14837#issuecomment-2935488103 🤖 `./gh_compare_branch_bench.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch_bench.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~

Re: [PR] fix: Enable more Spark SQL tests [datafusion-comet]

2025-06-03 Thread via GitHub
codecov-commenter commented on PR #1834: URL: https://github.com/apache/datafusion-comet/pull/1834#issuecomment-2935502474 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1834?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] Improve performance of constant aggregate window expression [datafusion]

2025-06-03 Thread via GitHub
alamb commented on PR #16234: URL: https://github.com/apache/datafusion/pull/16234#issuecomment-2935658767 > 🤖: Benchmark completed ran the wrong benchmark 😢 -- will fix

[PR] fix: support `map_values` [datafusion-comet]

2025-06-03 Thread via GitHub
comphead opened a new pull request, #1835: URL: https://github.com/apache/datafusion-comet/pull/1835 ## Which issue does this PR close? Closes #1789. ## Rationale for this change ## What changes are included in this PR? ## How are these chan

Re: [PR] Simplify FileSource / SchemaAdapterFactory API [datafusion]

2025-06-03 Thread via GitHub
alamb commented on PR #16214: URL: https://github.com/apache/datafusion/pull/16214#issuecomment-2935683108 @xudong963 would you have a moment to help review this PR? It is something I would like to get in before 48.0.0 is released (as it contains a change to an as yet unreleased API)

Re: [PR] fix: support `map_values` [datafusion-comet]

2025-06-03 Thread via GitHub
comphead commented on code in PR #1835: URL: https://github.com/apache/datafusion-comet/pull/1835#discussion_r2124107669 ## native/spark-expr/src/array_funcs/get_array_struct_fields.rs: ## @@ -150,11 +150,23 @@ fn get_array_struct_fields( .downcast_ref::() .exp

Re: [PR] fix: support `map_values` [datafusion-comet]

2025-06-03 Thread via GitHub
mbutrovich commented on PR #1835: URL: https://github.com/apache/datafusion-comet/pull/1835#issuecomment-2935815090 Can we generate new diffs that reenable the Spark SQL tests for this?

Re: [I] Iceberg integration - parquet-column version conflicts [datafusion-comet]

2025-06-03 Thread via GitHub
snmvaughan commented on issue #1833: URL: https://github.com/apache/datafusion-comet/issues/1833#issuecomment-2935862103 Removing the relocation results in a shaded jar that includes a specific embedded version of `parquet-column` with a `pom.xml` that does not express that dependency. Th

Re: [I] Iceberg integration - parquet-column version conflicts [datafusion-comet]

2025-06-03 Thread via GitHub
snmvaughan commented on issue #1833: URL: https://github.com/apache/datafusion-comet/issues/1833#issuecomment-2935873947 If it isn't going to be relocated, I'd suggest we shadow those dependencies and let Maven's dependency management handle them

Re: [PR] Introduce Async User Defined Functions [datafusion]

2025-06-03 Thread via GitHub
alamb commented on PR #14837: URL: https://github.com/apache/datafusion/pull/14837#issuecomment-2935875166 🤖: Benchmark completed Details ``` group epic_async-udf main -

Re: [PR] Improve performance of constant aggregate window expression [datafusion]

2025-06-03 Thread via GitHub
alamb commented on PR #16234: URL: https://github.com/apache/datafusion/pull/16234#issuecomment-2935875381 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubun

Re: [PR] fix: support `map_values` [datafusion-comet]

2025-06-03 Thread via GitHub
codecov-commenter commented on PR #1835: URL: https://github.com/apache/datafusion-comet/pull/1835#issuecomment-2935878217 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1835?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] fix: Enable more Spark SQL tests [datafusion-comet]

2025-06-03 Thread via GitHub
andygrove merged PR #1834: URL: https://github.com/apache/datafusion-comet/pull/1834

Re: [I] Enabling Test "Runtime bloom filter join: do not add bloom filter if dpp filter exists on the same column" fails with IllegalStateException in AdaptiveSparkPlanExec.newQueryStage [datafusion-c

2025-06-03 Thread via GitHub
andygrove closed issue #1831: Enabling Test "Runtime bloom filter join: do not add bloom filter if dpp filter exists on the same column" fails with IllegalStateException in AdaptiveSparkPlanExec.newQueryStage URL: https://github.com/apache/datafusion-comet/issues/1831

Re: [PR] feat: Add experimental auto mode for `COMET_PARQUET_SCAN_IMPL` [datafusion-comet]

2025-06-03 Thread via GitHub
andygrove commented on PR #1747: URL: https://github.com/apache/datafusion-comet/pull/1747#issuecomment-2935920067 @parthchandra @mbutrovich Could I get a review? I changed the scope to adding the "auto" option without changing the default. There is a manual workflow where we can run the S

Re: [PR] feat: Add experimental auto mode for `COMET_PARQUET_SCAN_IMPL` [datafusion-comet]

2025-06-03 Thread via GitHub
andygrove commented on code in PR #1747: URL: https://github.com/apache/datafusion-comet/pull/1747#discussion_r2124215025 ## spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala: ## @@ -93,21 +93,63 @@ case class CometScanRule(session: SparkSession) extends Rule[Spar

Re: [PR] feat: Add experimental auto mode for `COMET_PARQUET_SCAN_IMPL` [datafusion-comet]

2025-06-03 Thread via GitHub
andygrove commented on code in PR #1747: URL: https://github.com/apache/datafusion-comet/pull/1747#discussion_r2124216384 ## spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala: ## @@ -93,21 +93,63 @@ case class CometScanRule(session: SparkSession) extends Rule[Spar

Re: [PR] Improve performance of constant aggregate window expression [datafusion]

2025-06-03 Thread via GitHub
alamb commented on PR #16234: URL: https://github.com/apache/datafusion/pull/16234#issuecomment-2935960865 🤖: Benchmark completed Details ``` Comparing HEAD and constant_agg_window Benchmark h2o_window.json ┏━━

[I] Avoid unnecessary uses of CopyExec in native plan [datafusion-comet]

2025-06-03 Thread via GitHub
andygrove opened a new issue, #1836: URL: https://github.com/apache/datafusion-comet/issues/1836 ### What is the problem the feature request solves? As pointed out in https://github.com/apache/datafusion-comet/pull/1793#issuecomment-2916690342, we currently always wrap ScanExec in a

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-03 Thread via GitHub
zhuqi-lucas commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2936001738 Finally, the CI is green again; I think I fixed all the test cases. Next steps are: 1. Update the solution for the corner cases. 2. Add performance benchmark results.

[I] Add `output_bytes` metrics to Explain Analyze [datafusion]

2025-06-03 Thread via GitHub
PokIsemaine opened a new issue, #16244: URL: https://github.com/apache/datafusion/issues/16244 ### Is your feature request related to a problem or challenge? Currently, using `Explain Analyze` seems to only provide the metric for `output_rows`. Is it possible to add a metric for the n

Re: [PR] Improve performance of constant aggregate window expression [datafusion]

2025-06-03 Thread via GitHub
alamb commented on PR #16234: URL: https://github.com/apache/datafusion/pull/16234#issuecomment-2936099164 > │ QQuery 1 │ 1873.46ms │335.27ms │ +5.59x faster │ Quite nice!

Re: [PR] Improve performance of constant aggregate window expression [datafusion]

2025-06-03 Thread via GitHub
jonathanc-n commented on PR #16234: URL: https://github.com/apache/datafusion/pull/16234#issuecomment-2936099509 Those benchmarks look nice, seems to have been skewed on my computer for the other queries.

Re: [PR] Improve performance of constant aggregate window expression [datafusion]

2025-06-03 Thread via GitHub
alamb commented on PR #16234: URL: https://github.com/apache/datafusion/pull/16234#issuecomment-2936100737 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubun

Re: [PR] [MAJOR] Equivalence System Overhaul [datafusion]

2025-06-03 Thread via GitHub
viirya commented on code in PR #16217: URL: https://github.com/apache/datafusion/pull/16217#discussion_r2124345816 ## datafusion/catalog/src/listing_schema.rs: ## @@ -143,7 +141,7 @@ impl ListingSchemaProvider { order_exprs: vec![],

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-03 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2936193853 @zhuqi-lucas great work. I've continued playing around with alternative structures in the meantime, but I keep coming back to your `YieldStream` as the most elegant solution. It's s

Re: [PR] minor: Refactor PhysicalPlanner::default() to avoid duplicate code [datafusion-comet]

2025-06-03 Thread via GitHub
andygrove merged PR #1821: URL: https://github.com/apache/datafusion-comet/pull/1821

Re: [PR] Additional placeholder datatype inferencing [datafusion]

2025-06-03 Thread via GitHub
alamb commented on PR #15980: URL: https://github.com/apache/datafusion/pull/15980#issuecomment-2936183106 Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is read

Re: [PR] [MAJOR] Equivalence System Overhaul [datafusion]

2025-06-03 Thread via GitHub
viirya commented on code in PR #16217: URL: https://github.com/apache/datafusion/pull/16217#discussion_r2124375915 ## datafusion/physical-expr/src/equivalence/properties/mod.rs: ## @@ -190,382 +241,363 @@ impl EquivalenceProperties { &self.oeq_class } -/// Re

Re: [PR] [MAJOR] Equivalence System Overhaul [datafusion]

2025-06-03 Thread via GitHub
viirya commented on code in PR #16217: URL: https://github.com/apache/datafusion/pull/16217#discussion_r2124399512 ## datafusion/sqllogictest/test_files/topk.slt: ## @@ -370,7 +370,7 @@ query TT explain select number, letter, age, number as column4, letter as column5 from part

Re: [PR] feat: add metadata to physical literal expressions [datafusion]

2025-06-03 Thread via GitHub
timsaucer commented on PR #16053: URL: https://github.com/apache/datafusion/pull/16053#issuecomment-2936285535 Closing in favor of https://github.com/apache/datafusion/pull/16170

Re: [PR] [MAJOR] Equivalence System Overhaul [datafusion]

2025-06-03 Thread via GitHub
viirya commented on code in PR #16217: URL: https://github.com/apache/datafusion/pull/16217#discussion_r2124418429 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -845,7 +840,7 @@ pub struct SortExec { /// Fetch highest/lowest n results fetch: Option, /// Nor

Re: [PR] [MAJOR] Equivalence System Overhaul [datafusion]

2025-06-03 Thread via GitHub
viirya commented on code in PR #16217: URL: https://github.com/apache/datafusion/pull/16217#discussion_r2124440817 ## datafusion/physical-expr-common/src/sort_expr.rs: ## @@ -516,162 +460,240 @@ impl Display for LexOrdering { } } -impl FromIterator for LexOrdering { -

Re: [PR] chore: IgnoreCometNativeScan on a few more Spark SQL tests [datafusion-comet]

2025-06-03 Thread via GitHub
mbutrovich commented on PR #1837: URL: https://github.com/apache/datafusion-comet/pull/1837#issuecomment-2936363015 Draft while I do 3.4.3, 3.5.4, and 4.0.0-preview1.

Re: [PR] Feat: Support Spark 4.0.0 part1 [datafusion-comet]

2025-06-03 Thread via GitHub
huaxingao commented on code in PR #1830: URL: https://github.com/apache/datafusion-comet/pull/1830#discussion_r2124462679 ## spark/src/main/spark-3.5/org/apache/spark/sql/comet/shims/ShimCometTPCDSMicroBenchmark.scala: ## @@ -0,0 +1,41 @@ +/* + * Licensed to the Apache Software

Re: [PR] docs: Add documentation for native_datafusion Parquet scanner's S3 support [datafusion-comet]

2025-06-03 Thread via GitHub
parthchandra merged PR #1832: URL: https://github.com/apache/datafusion-comet/pull/1832

Re: [PR] Track peak_mem_used in ExternalSorter [datafusion]

2025-06-03 Thread via GitHub
ding-young commented on PR #16192: URL: https://github.com/apache/datafusion/pull/16192#issuecomment-2936483760 @2010YOUY01 Hi, I’ve been struggling a bit with tracking peak memory in the SPM step, and I was wondering if I could ask for some help. ### 1. Can we add the memory for conv

Re: [PR] Chore: implement bit_count as ScalarUDFImpl [datafusion-comet]

2025-06-03 Thread via GitHub
kazantsev-maksim commented on PR #1826: URL: https://github.com/apache/datafusion-comet/pull/1826#issuecomment-2936605214 Thanks @andygrove!

Re: [PR] Chore: implement bit_count as ScalarUDFImpl [datafusion-comet]

2025-06-03 Thread via GitHub
andygrove merged PR #1826: URL: https://github.com/apache/datafusion-comet/pull/1826

Re: [PR] Perf: load default Utf8View for CSV datatype [datafusion]

2025-06-03 Thread via GitHub
alamb commented on PR #16243: URL: https://github.com/apache/datafusion/pull/16243#issuecomment-2936634009 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubun

Re: [PR] Perf: load default Utf8View for CSV datatype [datafusion]

2025-06-03 Thread via GitHub
alamb commented on PR #16243: URL: https://github.com/apache/datafusion/pull/16243#issuecomment-2936699578 🤖: Benchmark completed Details ``` Comparing HEAD and default_utf8_for_unkown_type Benchmark h2o_window.json

Re: [I] Access Data from S3 in DeltaLake format using Ballista on Kubernetes [datafusion-ballista]

2025-06-03 Thread via GitHub
milenkovicm closed issue #1268: Access Data from S3 in DeltaLake format using Ballista on Kubernetes URL: https://github.com/apache/datafusion-ballista/issues/1268

Re: [I] Access Data from S3 in DeltaLake format using Ballista on Kubernetes [datafusion-ballista]

2025-06-03 Thread via GitHub
milenkovicm commented on issue #1268: URL: https://github.com/apache/datafusion-ballista/issues/1268#issuecomment-2936706643 As noted in #1241, there is no built-in, out-of-the-box support for the `deltalake` file format; it's up to users to integrate it if needed. I have updated https://githu

Re: [I] [EPIC] Spark SQL test failures when Comet JVM shuffle is used [datafusion-comet]

2025-06-03 Thread via GitHub
andygrove commented on issue #1254: URL: https://github.com/apache/datafusion-comet/issues/1254#issuecomment-2936747407 > Barring AQE and DPP tests (addressed in a different PR [#1811](https://github.com/apache/datafusion-comet/pull/1811)), I am able to run these tests successfully. Not

[PR] MySQL: `index_name` in FK constraints [datafusion-sqlparser-rs]

2025-06-03 Thread via GitHub
MohamedAbdeen21 opened a new pull request, #1871: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1871 Add support for `index_name` field in FK constraints in both CREATE and ALTER TABLE statements docs: https://dev.mysql.com/doc/refman/8.4/en/create-table-foreign-keys.htm

Re: [I] Access Data from S3 in DeltaLake format using Ballista on Kubernetes [datafusion-ballista]

2025-06-03 Thread via GitHub
milenkovicm commented on issue #1268: URL: https://github.com/apache/datafusion-ballista/issues/1268#issuecomment-2936939066 Let me know if you need more help @janbraunsdorff

Re: [PR] feat: add metadata to physical literal expressions [datafusion]

2025-06-03 Thread via GitHub
timsaucer closed pull request #16053: feat: add metadata to physical literal expressions URL: https://github.com/apache/datafusion/pull/16053

Re: [PR] chore: IgnoreCometNativeScan on a few more Spark SQL tests [datafusion-comet]

2025-06-03 Thread via GitHub
codecov-commenter commented on PR #1837: URL: https://github.com/apache/datafusion-comet/pull/1837#issuecomment-2936987610 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1837?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] Fix: Map functions crash on out of bounds cases [datafusion]

2025-06-03 Thread via GitHub
comphead commented on PR #16203: URL: https://github.com/apache/datafusion/pull/16203#issuecomment-2936960243 Thanks @krishvishal, the latest version has become much more complicated than the previous one, so it may be worth checking the performance. What is the reason for adding the sp
