Re: [PR] fix: Fix `EquivalenceClass` calculation for Union queries [datafusion]

2025-05-31 Thread via GitHub
chenkovsky closed pull request #16185: fix: Fix `EquivalenceClass` calculation for Union queries URL: https://github.com/apache/datafusion/pull/16185 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

Re: [I] Treat truncated parquet stats as inexact [datafusion]

2025-05-31 Thread via GitHub
CookiePieWw commented on issue #15976: URL: https://github.com/apache/datafusion/issues/15976#issuecomment-2925153931 Thanks for your feedback! I found `ValueStatistics` has already have [`max_is_exact`](https://docs.rs/parquet/latest/parquet/file/statistics/struct.ValueStatistics.html#metho

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-31 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2925152088 > This is exactly what I want to avoid -- there are too many execs and we shouldn't have a structure where this is repeated everywhere in code. Isn't that somewhat unavoidable

Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]

2025-05-31 Thread via GitHub
Dandandan commented on issue #16206: URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2924659989 So if we're to change `RecordBatch` to `Vec` in the hashjoin, I think this is what's roughly needed: * Change `batch` to `Vec` in `JoinLeftData` * Initialize `JoinLef

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-31 Thread via GitHub
zhuqi-lucas commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2924675924 > @ozankabak > > > Pipeline-breaking operators creating cancellation issues is a universal problem -- this is not an `AggregateExec` issue. Therefore, I firmly believe tha

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-31 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2924684413 Just as an FYI, I was reading through the tokio docs yesterday and found https://docs.rs/tokio/latest/tokio/task/coop/index.html#cooperative-scheduling which seems very similar to t

Re: [PR] Update tpch, clickbench, sort_tpch to mark failed queries [datafusion]

2025-05-31 Thread via GitHub
ding-young commented on PR #16182: URL: https://github.com/apache/datafusion/pull/16182#issuecomment-2924686568 Thank you for review. I’ve applied your suggestions, and I’ll run it a few more times and check the output display once again before requesting a final review. If there are any ot

Re: [PR] feat: Translate Hadoop S3A configurations to object_store configurations [datafusion-comet]

2025-05-31 Thread via GitHub
Kontinuation commented on code in PR #1817: URL: https://github.com/apache/datafusion-comet/pull/1817#discussion_r2117529719 ## spark/src/test/scala/org/apache/comet/parquet/ParquetReadFromS3Suite.scala: ## @@ -0,0 +1,126 @@ +/* + * Licensed to the Apache Software Foundation (AS

Re: [PR] Update tpch, clickbench, sort_tpch to mark failed queries [datafusion]

2025-05-31 Thread via GitHub
ding-young commented on code in PR #16182: URL: https://github.com/apache/datafusion/pull/16182#discussion_r2117532342 ## benchmarks/src/clickbench.rs: ## @@ -128,36 +128,70 @@ impl RunOpt { let ctx = SessionContext::new_with_config_rt(config, rt_builder.build_arc()?);

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-31 Thread via GitHub
zhuqi-lucas commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2924691085 > but it's kind of similar to the count-batches-and-yield solution. Thank you @pepijnve , this is really helpful, so our choice seems like good enough: 1. For performanc

Re: [PR] feat: Translate Hadoop S3A configurations to object_store configurations [datafusion-comet]

2025-05-31 Thread via GitHub
Kontinuation commented on PR #1817: URL: https://github.com/apache/datafusion-comet/pull/1817#issuecomment-2924690797 > I have another thought on this. Any number of users have developed custom `AWSCredentialsProvider`s in Java but we would not have corresponding implementations in Rust (t

Re: [PR] Update tpch, clickbench, sort_tpch to mark failed queries [datafusion]

2025-05-31 Thread via GitHub
ding-young commented on code in PR #16182: URL: https://github.com/apache/datafusion/pull/16182#discussion_r2117536617 ## benchmarks/compare.py: ## @@ -156,8 +173,8 @@ def compare( console.print(table) # Calculate averages -avg_baseline_time = total_baseline_time

Re: [PR] Update tpch, clickbench, sort_tpch to mark failed queries [datafusion]

2025-05-31 Thread via GitHub
ding-young commented on code in PR #16182: URL: https://github.com/apache/datafusion/pull/16182#discussion_r2117541964 ## benchmarks/src/sort_tpch.rs: ## @@ -294,7 +297,7 @@ impl RunOpt { let mut stream = execute_stream(physical_plan.clone(), state.task_ctx())?;

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-31 Thread via GitHub
zhuqi-lucas commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2924931929 > I am currently against merging this as is, unless we get stuck in our search to find a proper way to solve this. Let's keep hacking at finding a good solution -- we are making

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-31 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2924937197 > This is an interesting find, and I think it would be great if we could somehow utilize this (or something similar to this). I was just studying the implementation. It would

Re: [PR] Minor: remove unused IPCWriter [datafusion]

2025-05-31 Thread via GitHub
alamb commented on PR #16215: URL: https://github.com/apache/datafusion/pull/16215#issuecomment-2925498438 Thanks @comphead and @2010YOUY01 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the sp

[PR] Chore: implement bit_not as ScalarUDFImpl [datafusion-comet]

2025-05-31 Thread via GitHub
kazantsev-maksim opened a new pull request, #1825: URL: https://github.com/apache/datafusion-comet/pull/1825 ## Which issue does this PR close? Part of https://github.com/apache/datafusion-comet/issues/1819 Closes #. See https://github.com/apache/datafusion-comet/issues/1

Re: [PR] fix: Fix `EquivalenceClass` calculation for Union queries [datafusion]

2025-05-31 Thread via GitHub
alamb commented on PR #16185: URL: https://github.com/apache/datafusion/pull/16185#issuecomment-2925525241 - This PR might also be superceded by https://github.com/apache/datafusion/pull/16217 -- This is an automated message from the Apache Git Service. To respond to the message, please l

Re: [PR] Handle dicts for distinct count [datafusion]

2025-05-31 Thread via GitHub
alamb commented on code in PR #15871: URL: https://github.com/apache/datafusion/pull/15871#discussion_r2118139282 ## datafusion/functions-aggregate/src/count.rs: ## @@ -179,6 +179,107 @@ impl Count { } } } +fn get_primitive_type_accumulator(data_type: &DataType) -

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-31 Thread via GitHub
ozankabak commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2924752646 > https://docs.rs/tokio/latest/tokio/task/coop/fn.consume_budget.html This is an interesting find, and I think it would be great if we could somehow utilize this (or somethin

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-31 Thread via GitHub
ozankabak commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2924751830 I am currently against merging this as is, unless we get stuck in our search to find a proper way to solve this. Let's keep hacking at finding a good solution -- we are making prog

Re: [PR] Always add parentheses when formatting`BinaryExpr` with `SchemaDisplay` [datafusion]

2025-05-31 Thread via GitHub
hendrikmakait commented on code in PR #16209: URL: https://github.com/apache/datafusion/pull/16209#discussion_r2118214706 ## datafusion/expr/src/expr.rs: ## @@ -2500,7 +2500,7 @@ impl Display for SchemaDisplay<'_> { } } Expr::BinaryExpr

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-31 Thread via GitHub
ozankabak commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2925042496 > I just updated the PR also support sort exec, also added end to end testing. This is exactly what I want to avoid -- there are too many execs and we shouldn't have a struc

Re: [PR] Handle dicts for distinct count [datafusion]

2025-05-31 Thread via GitHub
blaginin commented on PR #15871: URL: https://github.com/apache/datafusion/pull/15871#issuecomment-2925530843 thank you for the review! i really like your fuzzy testing idea - will push soon (and respond to the new comments) -- This is an automated message from the Apache Git Service. To

Re: [PR] Add change to VARCHAR in the upgrade guide [datafusion]

2025-05-31 Thread via GitHub
comphead commented on code in PR #16216: URL: https://github.com/apache/datafusion/pull/16216#discussion_r2118165176 ## docs/source/library-user-guide/upgrading.md: ## @@ -21,6 +21,55 @@ ## DataFusion `48.0.0` +### `VARCHAR` SQL type changed to map to `Utf8View` Arrow type

Re: [PR] feat: Support RightMark join for NestedLoop and Hash join [datafusion]

2025-05-31 Thread via GitHub
Dandandan commented on code in PR #16083: URL: https://github.com/apache/datafusion/pull/16083#discussion_r2118781458 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -975,6 +996,12 @@ pub(crate) fn adjust_indices_by_join_type( // the left_indices will not be u