Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
ozankabak commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2924371687 Seems like folks are on the way to finding a better solution -- let's see where the discussion goes. -- This is an automated message from the Apache Git Service. To respond to th

Re: [PR] Update tpch, clickbench, sort_tpch to mark failed queries [datafusion]

2025-05-30 Thread via GitHub
2010YOUY01 commented on code in PR #16182: URL: https://github.com/apache/datafusion/pull/16182#discussion_r2117240352 ## benchmarks/src/util/run.rs: ## @@ -138,6 +144,13 @@ impl BenchmarkRun { } } +/// Mark current query +pub fn mark_failed(&mut self) {

Re: [I] Expr formatting missing parentheses [datafusion]

2025-05-30 Thread via GitHub
2010YOUY01 commented on issue #16054: URL: https://github.com/apache/datafusion/issues/16054#issuecomment-2924301139 > I've had a closer look, we simply wrap `- {foo}` in parentheses regardless of whether that's strictly necessary. Following the same principle for `BinaryExpr` won't look as

Re: [PR] Minor: remove unused IPCWriter [datafusion]

2025-05-30 Thread via GitHub
2010YOUY01 merged PR #16215: URL: https://github.com/apache/datafusion/pull/16215 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@dat

Re: [PR] Update tpch, clickbench, sort_tpch to mark failed queries [datafusion]

2025-05-30 Thread via GitHub
Copilot commented on code in PR #16182: URL: https://github.com/apache/datafusion/pull/16182#discussion_r2117256270 ## benchmarks/compare.py: ## @@ -156,8 +173,8 @@ def compare( console.print(table) # Calculate averages -avg_baseline_time = total_baseline_time /

Re: [I] union all +aggregate function in the recursive cte results an infinite loop [datafusion-python]

2025-05-30 Thread via GitHub
l1t1 closed issue #1131: union all +aggregate function in the recursive cte results an infinite loop URL: https://github.com/apache/datafusion-python/issues/1131 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL abo

Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-05-30 Thread via GitHub
2010YOUY01 commented on PR #16210: URL: https://github.com/apache/datafusion/pull/16210#issuecomment-2924177927 This feature is intended purely for performance, right? Not because there are cases that can only be run with NLJ. If that's the case, I suggest to do some benchmarks upfront to

Re: [PR] feat: Support RightMark join for NestedLoop and Hash join [datafusion]

2025-05-30 Thread via GitHub
comphead commented on code in PR #16083: URL: https://github.com/apache/datafusion/pull/16083#discussion_r2116905907 ## datafusion/physical-plan/src/joins/sort_merge_join.rs: ## @@ -221,9 +221,11 @@ impl SortMergeJoinExec { // When output schema contains only the right

Re: [PR] Parquet encryption [datafusion]

2025-05-30 Thread via GitHub
corwinjoy closed pull request #16219: Parquet encryption URL: https://github.com/apache/datafusion/pull/16219 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-

[PR] Parquet encryption [datafusion]

2025-05-30 Thread via GitHub
corwinjoy opened a new pull request, #16219: URL: https://github.com/apache/datafusion/pull/16219 Draft PR to add encryption to datafusion -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]

2025-05-30 Thread via GitHub
jonathanc-n commented on issue #16206: URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2923567825 I'll try to give that a try, thanks @Dandandan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the

Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]

2025-05-30 Thread via GitHub
Dandandan commented on issue #16206: URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2923538632 This might help here as well I think? https://github.com/apache/arrow-rs/pull/6427/files -- This is an automated message from the Apache Git Service. To respond to the

Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]

2025-05-30 Thread via GitHub
Dandandan commented on issue #16206: URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2923526973 Yeah - one of the options to optimize join is by avoiding the `take` + `concat` (which happens when generating fewer rows per input batch in join, so I think that could use a

Re: [PR] feat: Support RightMark join for NestedLoop and Hash join [datafusion]

2025-05-30 Thread via GitHub
Dandandan commented on code in PR #16083: URL: https://github.com/apache/datafusion/pull/16083#discussion_r2116661763 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -1126,6 +1153,28 @@ where .collect() } +pub(crate) fn get_mark_indices( +range: &Range, +

Re: [PR] feat: Support RightMark join for NestedLoop and Hash join [datafusion]

2025-05-30 Thread via GitHub
Dandandan commented on code in PR #16083: URL: https://github.com/apache/datafusion/pull/16083#discussion_r2116652162 ## datafusion/sql/src/unparser/plan.rs: ## @@ -738,21 +739,38 @@ impl Unparser<'_> { let negated = match join.join_type {

Re: [PR] feat: Support RightMark join for NestedLoop and Hash join [datafusion]

2025-05-30 Thread via GitHub
ctsk commented on PR #16083: URL: https://github.com/apache/datafusion/pull/16083#issuecomment-2923436003 After updating `JoinType::supports_swap` to include LeftMark/RightMark join, the join_selection rule *should* already plan right joins where appropriate. Subsequently running the sqllog

Re: [PR] build: Specify -Dsbt.log.noformat=true in sbt CI runs [datafusion-comet]

2025-05-30 Thread via GitHub
codecov-commenter commented on PR #1822: URL: https://github.com/apache/datafusion-comet/pull/1822#issuecomment-2923397979 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1822?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] feat: Support RightMark join for NestedLoop and Hash join [datafusion]

2025-05-30 Thread via GitHub
ctsk commented on PR #16083: URL: https://github.com/apache/datafusion/pull/16083#issuecomment-2923375501 Really cool work! I'll look over it in more detail over the weekend =) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub a

[I] Incorrect results with JVM shuffle: Spark SQL `- SPARK-32038: NormalizeFloatingNumbers should work on distinct aggregate` [datafusion-comet]

2025-05-30 Thread via GitHub
andygrove opened a new issue, #1824: URL: https://github.com/apache/datafusion-comet/issues/1824 ### Describe the bug ``` 2025-05-30T18:23:30.5178844Z 2025-05-30T18:23:30.6051928Z [info] - SPARK-32038: NormalizeFloatingNumbers should work on dis

Re: [PR] [WIP] Remove `COMET_SHUFFLE_FALLBACK_TO_COLUMNAR` config [datafusion-comet]

2025-05-30 Thread via GitHub
andygrove commented on PR #1736: URL: https://github.com/apache/datafusion-comet/pull/1736#issuecomment-2923366806 There are now many more test failures than when this PR was first created. -- This is an automated message from the Apache Git Service. To respond to the message, please log

[I] Panic in Comet shuffle when building structs [datafusion-comet]

2025-05-30 Thread via GitHub
andygrove opened a new issue, #1823: URL: https://github.com/apache/datafusion-comet/issues/1823 ### Describe the bug In PR https://github.com/apache/datafusion-comet/pull/1736 there is a panic during Comet shuffle when building structs. ``` 2025-05-30T18:36:16.9739600Z 

[PR] build: Specify -Dsbt.log.noformat=true in sbt CI runs [datafusion-comet]

2025-05-30 Thread via GitHub
andygrove opened a new pull request, #1822: URL: https://github.com/apache/datafusion-comet/pull/1822 ## Which issue does this PR close? N/A ## Rationale for this change Reduce verbose logging in CI ## What changes are included in this PR?

Re: [I] Support Push down expression evaluation in `TableProviders` [datafusion]

2025-05-30 Thread via GitHub
alamb commented on issue #14993: URL: https://github.com/apache/datafusion/issues/14993#issuecomment-2923352013 @adriangb and I spoke about this -- we think having some sort of motivating example where the difference between schemas in the table and the file leads to bad performance will he

Re: [PR] Feat: support bit_count function [datafusion-comet]

2025-05-30 Thread via GitHub
kazuyukitanimura commented on PR #1602: URL: https://github.com/apache/datafusion-comet/pull/1602#issuecomment-2923351378 Merged Thank you @kazantsev-maksim @parthchandra @mbutrovich -- This is an automated message from the Apache Git Service. To respond to the message, please log on to

Re: [PR] Feat: support bit_count function [datafusion-comet]

2025-05-30 Thread via GitHub
kazuyukitanimura merged PR #1602: URL: https://github.com/apache/datafusion-comet/pull/1602 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsub

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
zhuqi-lucas commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922200568 > > For example, FileStream could count how many batches it sends back-to-back without yielding, and after a certain threshold, it yields. WDYT? > > Perhaps DataSourceExec

Re: [PR] Reduce size of `Expr` struct [datafusion]

2025-05-30 Thread via GitHub
Dandandan commented on code in PR #16207: URL: https://github.com/apache/datafusion/pull/16207#discussion_r2116490428 ## datafusion/expr/src/expr.rs: ## @@ -330,7 +331,7 @@ pub enum Expr { /// [`ExprFunctionExt`]: crate::expr_fn::ExprFunctionExt AggregateFunction(Aggre

Re: [PR] minor: Refactor PhysicalPlanner::default() to avoid duplicate code [datafusion-comet]

2025-05-30 Thread via GitHub
codecov-commenter commented on PR #1821: URL: https://github.com/apache/datafusion-comet/pull/1821#issuecomment-2922995270 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1821?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] minor: Refactor PhysicalPlanner::default() to avoid duplicate code [datafusion-comet]

2025-05-30 Thread via GitHub
andygrove commented on code in PR #1821: URL: https://github.com/apache/datafusion-comet/pull/1821#discussion_r2116291551 ## native/core/src/execution/planner.rs: ## @@ -146,15 +146,7 @@ pub struct PhysicalPlanner { impl Default for PhysicalPlanner { fn default() -> Self

[PR] minore: Refactor PhysicalPlanner::default() to avoid duplicate code [datafusion-comet]

2025-05-30 Thread via GitHub
andygrove opened a new pull request, #1821: URL: https://github.com/apache/datafusion-comet/pull/1821 ## Which issue does this PR close? N/A ## Rationale for this change As we add more expressions from `datafusion-spark` we would need to update two identi

[I] Use sha2 implementation from datafusion-spark crate [datafusion-comet]

2025-05-30 Thread via GitHub
andygrove opened a new issue, #1820: URL: https://github.com/apache/datafusion-comet/issues/1820 ### What is the problem the feature request solves? The `datafusion-spark` crate now provides a sha2 implementation (https://github.com/apache/datafusion/pull/16168), so we should be able

Re: [PR] Reduce size of `Expr` struct [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16207: URL: https://github.com/apache/datafusion/pull/16207#issuecomment-2922908904 I'll leave this open for a few days to gather any remaining feedback -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub an

Re: [PR] Feat: support bit_count function [datafusion-comet]

2025-05-30 Thread via GitHub
kazantsev-maksim commented on PR #1602: URL: https://github.com/apache/datafusion-comet/pull/1602#issuecomment-2922907576 Thank you for the feedback! @kazuyukitanimura @parthchandra -- This is an automated message from the Apache Git Service. To respond to the message, please log on to G

Re: [PR] Minor: remove unused IPCWriter [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16215: URL: https://github.com/apache/datafusion/pull/16215#issuecomment-2922905121 > hope downstream users didn't use it If they need to write data to IPC files, I think either of the upstream arrow-rs choices is better: 1. https://docs.rs/arrow-ipc/55.1.0/a

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922901439 @ozankabak > Pipeline-breaking operators creating cancellation issues is a universal problem -- this is not an `AggregateExec` issue. Therefore, I firmly believe that the solut

Re: [I] docs: Instructions for running sbt fail due to OOM [datafusion-comet]

2025-05-30 Thread via GitHub
parthchandra commented on issue #1812: URL: https://github.com/apache/datafusion-comet/issues/1812#issuecomment-2922892698 Spark's instructions for using sbt have mostly be applicable to Comet for me https://spark.apache.org/docs/latest/building-spark.html#building-with-sbt -- This is

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922888436 I also agree we could make a query cancellation token work -- the potential challenges of that approach are: 1. We have to find some way to trigger the cancellation token when the s

Re: [PR] Reduce size of `Expr` struct [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16207: URL: https://github.com/apache/datafusion/pull/16207#issuecomment-2922659103 ๐Ÿค–: Benchmark completed Details ``` group main reduce-expr-struct-size -

[PR] chore(deps): bump testcontainers-modules from 0.12.0 to 0.12.1 [datafusion]

2025-05-30 Thread via GitHub
dependabot[bot] opened a new pull request, #16212: URL: https://github.com/apache/datafusion/pull/16212 Bumps [testcontainers-modules](https://github.com/testcontainers/testcontainers-rs-modules-community) from 0.12.0 to 0.12.1. Release notes Sourced from https://github.com/testco

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
zhuqi-lucas commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922813081 Drop off for today, i will start the work tomorrow. Thank you @pepijnve for good discussion, i think it deserve we have the cancellation solution. -- This is an automated messa

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
zhuqi-lucas commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922801898 > Would this require any changes to `SessionContext`? Cancellation is per query, so I think you would want to add the `cancelled: AtomicBoolean`, `fn cancel()` and `fn is_cancell

Re: [PR] feat: Translate Hadoop S3A configurations to object_store configurations [datafusion-comet]

2025-05-30 Thread via GitHub
parthchandra commented on PR #1817: URL: https://github.com/apache/datafusion-comet/pull/1817#issuecomment-2922781217 I have another thought on this. Any number of users have developed custom `AWSCredentialsProvider`s in Java but we would not have corresponding implementations in Rust (the

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922773141 Would this require any changes to `SessionContext`? Cancellation is per query, so I think you would want to add the `cancelled: AtomicBoolean`, `fn cancel()` and `fn is_cancelled()

Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-05-30 Thread via GitHub
jonathanc-n commented on PR #16210: URL: https://github.com/apache/datafusion/pull/16210#issuecomment-2922765656 I believe a follow up pull request can be made to actually have this be used in the physical plan based on some statistics. @comphead Does this look fine for now? -- This is a

Re: [PR] feat: Translate Hadoop S3A configurations to object_store configurations [datafusion-comet]

2025-05-30 Thread via GitHub
parthchandra commented on code in PR #1817: URL: https://github.com/apache/datafusion-comet/pull/1817#discussion_r2116175913 ## native/Cargo.lock: ## @@ -375,6 +375,353 @@ version = "1.4.0" source = "registry+https://github.com/rust-lang/crates.io-index"; checksum = "ace50bade

Re: [PR] Reduce size of `Expr` struct [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16207: URL: https://github.com/apache/datafusion/pull/16207#issuecomment-2922755179 ๐Ÿค–: Benchmark completed Details ``` Comparing HEAD and reduce-expr-struct-size Benchmark clickbench_extended.json --

Re: [I] Support `map_values` [datafusion-comet]

2025-05-30 Thread via GitHub
mbutrovich commented on issue #1789: URL: https://github.com/apache/datafusion-comet/issues/1789#issuecomment-2922749343 The investigation seems to be diverging from `map_values` and seems more focused on map value nullability, and in particular struct field nullability and how that's enco

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
zhuqi-lucas commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922745280 > After my previous comment I started thinking about how our Java code handles this kind of thing. Typically when we do a `Future#cancel(mayInterrupt: true)` we have `Thread#isIn

Re: [PR] Minor: remove unused IPCWriter [datafusion]

2025-05-30 Thread via GitHub
comphead commented on PR #16215: URL: https://github.com/apache/datafusion/pull/16215#issuecomment-2922744744 hope downstream users didn't use it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to t

Re: [PR] Minor: update documentation for PrunableStatistics [datafusion]

2025-05-30 Thread via GitHub
comphead merged PR #16213: URL: https://github.com/apache/datafusion/pull/16213 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@dataf

[PR] Add change to VARCHAR in the upgrade guide [datafusion]

2025-05-30 Thread via GitHub
alamb opened a new pull request, #16216: URL: https://github.com/apache/datafusion/pull/16216 ## Which issue does this PR close? - Related to https://github.com/apache/datafusion/issues/15096 - Follow on to https://github.com/apache/datafusion/pull/16142 ## Rationale for

Re: [PR] fix: map parquet field_id correctly (native_iceberg_compat) [datafusion-comet]

2025-05-30 Thread via GitHub
parthchandra commented on PR #1815: URL: https://github.com/apache/datafusion-comet/pull/1815#issuecomment-2922733351 Changing this to draft while I look into the CI failures -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub an

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922714438 After my previous comment I started thinking about how our Java code handles this kind of thing. Typically when we do a `Future#cancel(mayInterrupt: true)` we have `Thread#isInterru

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
zhuqi-lucas commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922689219 Thank you @pepijnve , i may got it now, it means: # Per-Batch Overhead Comparison ## 1. Counter + compare ```rust this.batches_processed += 1; // I

Re: [I] Spark-compatible CAST operation [datafusion]

2025-05-30 Thread via GitHub
Blizzara commented on issue #11201: URL: https://github.com/apache/datafusion/issues/11201#issuecomment-2922682982 Just fwiw, we already use Comet's Cast in our use of DF, through a rewrite rule: ``` impl FunctionRewrite for OurCastRewrite { fn name(&self) -> &str { "

Re: [PR] Reduce size of `Expr` struct [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16207: URL: https://github.com/apache/datafusion/pull/16207#issuecomment-2922659200 ๐Ÿค– `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubun

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922625826 > Pipeline-breaking operators creating cancellation issues is a universal problem -- this is not an AggregateExec issue. @ozankabak It's indeed a universal problem, but since

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
ozankabak commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922512158 Pipeline-breaking operators creating cancellation issues is a universal problem -- this is not an `AggregateExec` issue. Therefore, I firmly believe that the solution has to be uni

Re: [I] [EPIC] Complete `datafusion-spark` Spark Compatible Functions [datafusion]

2025-05-30 Thread via GitHub
andygrove commented on issue #15914: URL: https://github.com/apache/datafusion/issues/15914#issuecomment-2922618245 @shehabgamin @alamb I created an epic in Comet for implementing our current expressions as `ScalarUDFImpl` rather than `PhysicalExpr` as a first step towards contributing them

Re: [PR] chore: Use unique artifact names in Java test run [datafusion-comet]

2025-05-30 Thread via GitHub
codecov-commenter commented on PR #1818: URL: https://github.com/apache/datafusion-comet/pull/1818#issuecomment-2922612812 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1818?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

[I] [EPIC] Implement expressions as ScalarUDFImpl [datafusion-comet]

2025-05-30 Thread via GitHub
andygrove opened a new issue, #1819: URL: https://github.com/apache/datafusion-comet/issues/1819 ### What is the problem the feature request solves? Many of Comet's Spark-compatible DataFusion expressions are currently implemented as `PhysicalExpr`. We would like to update them to imp

Re: [PR] Add change to VARCHAR in the upgrade guide [datafusion]

2025-05-30 Thread via GitHub
alamb commented on code in PR #16216: URL: https://github.com/apache/datafusion/pull/16216#discussion_r2116071996 ## docs/source/library-user-guide/upgrading.md: ## @@ -21,6 +21,55 @@ ## DataFusion `48.0.0` +### `VARCHAR` SQL type changed to map to `Utf8View` Arrow type + +

[PR] chore: Use unique artifact names in Java test run [datafusion-comet]

2025-05-30 Thread via GitHub
andygrove opened a new pull request, #1818: URL: https://github.com/apache/datafusion-comet/pull/1818 ## Which issue does this PR close? Closes #. ## Rationale for this change ## What changes are included in this PR? ## How are these changes

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
berkaysynnada commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922128094 > > For example, FileStream could count how many batches it sends back-to-back without yielding, and after a certain threshold, it yields. WDYT? > > Perhaps DataSourceExe

Re: [PR] Add change to VARCHAR in the upgrade guide [datafusion]

2025-05-30 Thread via GitHub
zhuqi-lucas commented on code in PR #16216: URL: https://github.com/apache/datafusion/pull/16216#discussion_r2116036299 ## docs/source/library-user-guide/upgrading.md: ## @@ -21,6 +21,55 @@ ## DataFusion `48.0.0` +### `VARCHAR` SQL type changed to map to `Utf8View` Arrow ty

Re: [PR] [MAJOR] Equivalence System Overhaul [datafusion]

2025-05-30 Thread via GitHub
ozankabak commented on PR #16217: URL: https://github.com/apache/datafusion/pull/16217#issuecomment-2922519861 Sounds good -- let's go with the first choice (moving fast with the release and merging this right after) unless someone else brings a good argument to the contrary. -- This is

[PR] Minor: update documentation for PrunableStatistics [datafusion]

2025-05-30 Thread via GitHub
alamb opened a new pull request, #16213: URL: https://github.com/apache/datafusion/pull/16213 ## Which issue does this PR close? - Follow on to https://github.com/apache/datafusion/pull/16139 ## Rationale for this change Implement suggestions from @xudong963: https://git

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
ozankabak commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922549725 Makes sense, thanks @zhuqi-lucas. Both @berkaysynnada and I will be happy to help with brainstorming/searching as you search for a good solution to this problem ๐Ÿš€ -- This is an

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
zhuqi-lucas commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922500062 > I think the issue is for streams like AggregateExec that can always make progress. So while this PR wraps the input to the AggregateExecStream, I think logically the yielding i

Re: [I] Blog post about parquet vs custom file formats [datafusion]

2025-05-30 Thread via GitHub
zhuqi-lucas commented on issue #16149: URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2922543426 > > Do some changes in arrow-rs clickbench benchmark: > > Do I understand that the changes you report are due to simply rewriting the parquet files to have a page inde

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
zhuqi-lucas commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922532031 > > Even if I couldn't get consensus on that direction, this version could still be improved. For example, by inserting the YieldStream only when itโ€™s truly needed, such as in th

Re: [PR] Reduce size of `Expr` struct [datafusion]

2025-05-30 Thread via GitHub
alamb commented on code in PR #16207: URL: https://github.com/apache/datafusion/pull/16207#discussion_r2116011961 ## datafusion/expr/src/expr.rs: ## @@ -330,7 +331,7 @@ pub enum Expr { /// [`ExprFunctionExt`]: crate::expr_fn::ExprFunctionExt AggregateFunction(Aggregate

Re: [PR] Reduce size of `Expr` struct [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16207: URL: https://github.com/apache/datafusion/pull/16207#issuecomment-2922513729 I kicked off some benchmarks for this PR -- thank you @hendrikmakait ๐Ÿ™ -- This is an automated message from the Apache Git Service. To respond to the message, please log on t

Re: [PR] Reduce size of `Expr` struct [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16207: URL: https://github.com/apache/datafusion/pull/16207#issuecomment-2922512365 ๐Ÿค– `./gh_compare_branch_bench.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch_bench.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~

Re: [PR] [MAJOR] Equivalence System Overhaul [datafusion]

2025-05-30 Thread via GitHub
ozankabak commented on PR #16217: URL: https://github.com/apache/datafusion/pull/16217#issuecomment-2922481568 Let's definitely time the merge and DF48 release in a cooperative way. We have two competing realities at play: (1) The PR attracts a lot of conflicts and it is not easy to resolve

Re: [PR] Additional placeholder datatype inferencing [datafusion]

2025-05-30 Thread via GitHub
alamb commented on code in PR #15980: URL: https://github.com/apache/datafusion/pull/15980#discussion_r2115997415 ## datafusion/expr/src/logical_plan/plan.rs: ## @@ -1494,6 +1494,14 @@ impl LogicalPlan { let mut param_types: HashMap> = HashMap::new(); self.a

Re: [I] Blog post about parquet vs custom file formats [datafusion]

2025-05-30 Thread via GitHub
alamb commented on issue #16149: URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2922493420 > Do some changes in arrow-rs clickbench benchmark: Do I understand that the changes you report are due to simply rewriting the parquet files to have a page index and be unc

Re: [PR] [MAJOR] Equivalence System Overhaul [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16217: URL: https://github.com/apache/datafusion/pull/16217#issuecomment-2922489274 > Move fast with the release while this gets reviews, which has the advantage of making DF48 a "checkpoint" before this PR. The downside is that we will need to do extensive testing bo

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922479564 > Even if I couldn't get consensus on that direction, this version could still be improved. For example, by inserting the YieldStream only when itโ€™s truly needed, such as in the empty

Re: [PR] fix: Fix `EquivalenceClass` calculation for Union queries [datafusion]

2025-05-30 Thread via GitHub
alamb commented on code in PR #16185: URL: https://github.com/apache/datafusion/pull/16185#discussion_r2115978901 ## datafusion/physical-expr/src/equivalence/class.rs: ## @@ -422,6 +423,60 @@ impl EquivalenceGroup { self.bridge_classes() } +/// Returns a set

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922419073 > However, the same example can also consume the pending return generated by Aggregate, so the guarantee cannot be upheld in our model. Indeed, which is why I've introduced co

Re: [PR] fix: Fix `EquivalenceClass` calculation for Union queries [datafusion]

2025-05-30 Thread via GitHub
chenkovsky commented on code in PR #16185: URL: https://github.com/apache/datafusion/pull/16185#discussion_r2115970133 ## datafusion/physical-expr/src/equivalence/class.rs: ## @@ -422,6 +423,60 @@ impl EquivalenceGroup { self.bridge_classes() } +/// Returns a

Re: [PR] feat: improve with_constants performance [datafusion]

2025-05-30 Thread via GitHub
chenkovsky closed pull request #16211: feat: improve with_constants performance URL: https://github.com/apache/datafusion/pull/16211 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comme

Re: [PR] [MAJOR] Equivalence System Overhaul [datafusion]

2025-05-30 Thread via GitHub
xudong963 commented on PR #16217: URL: https://github.com/apache/datafusion/pull/16217#issuecomment-2922442995 One thing on my mind is that maybe it's better to hold up the PR until DF48 release, then we can have enough time to check and verify the PR after it merges. -- This is an autom

Re: [I] Remove use of deprecated dict_ordered in datafusion-proto [datafusion]

2025-05-30 Thread via GitHub
cj-zhukov commented on issue #16218: URL: https://github.com/apache/datafusion/issues/16218#issuecomment-2922435442 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To

[I] Remove use of deprecated dict_ordered in datafusion-proto [datafusion]

2025-05-30 Thread via GitHub
cj-zhukov opened a new issue, #16218: URL: https://github.com/apache/datafusion/issues/16218 ### Is your feature request related to a problem or challenge? Following the discussion in [PR #14227](https://github.com/apache/datafusion/pull/14227) about dict_ordered, Iโ€™m going to open a

Re: [PR] [MAJOR] Equivalence System Overhaul [datafusion]

2025-05-30 Thread via GitHub
ozankabak commented on PR #16217: URL: https://github.com/apache/datafusion/pull/16217#issuecomment-2922392615 Thanks for the quick response @alamb -- I added some reviewing tips and tricks to the PR body to help you guys out. -- This is an automated message from the Apache Git Service. T

Re: [PR] [MAJOR] Equivalence System Overhaul [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16217: URL: https://github.com/apache/datafusion/pull/16217#issuecomment-2922389594 @wiedld @xudong963 and @suremarc perhaps you are also interested in this PR given your past interest in the equivalence code. Also cc @chenkovsky -- This is an automated message f

Re: [PR] [MAJOR] Equivalence System Overhaul [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16217: URL: https://github.com/apache/datafusion/pull/16217#issuecomment-2922387217 ๐Ÿ˜ฎ I will put this on my list to review asap -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL abo

Re: [I] Consolidate schema adapter tests in schema_adapter_integration_tests.rs [datafusion]

2025-05-30 Thread via GitHub
alamb commented on issue #16202: URL: https://github.com/apache/datafusion/issues/16202#issuecomment-2922385651 Thanks @kosiew -- here is a PR: - https://github.com/apache/datafusion/pull/16214 -- This is an automated message from the Apache Git Service. To respond to the message, plea

Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]

2025-05-30 Thread via GitHub
alamb commented on issue #16206: URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2922383864 I have noticed something similar -- that the pattern of "take / concat" results in potentially non trival overhead. I have been recently working on some related code (the sa

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
berkaysynnada commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922380421 @pepijnve thanks for the example. However, the same example can also consume the pending return generated by Aggregate, so the guarantee cannot be upheld in our model. I

Re: [PR] Change default SQL mapping for `VARCAHR` from `Utf8` to `Utf8View` [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16142: URL: https://github.com/apache/datafusion/pull/16142#issuecomment-2922372187 Thanks again @zhuqi-lucas and @xudong963 I have also made a PR to add a mention of this change to the upgrade guide: - https://github.com/apache/datafusion/pull/16216 -- Th

[PR] [MAJOREquivalence System Overhaul [datafusion]

2025-05-30 Thread via GitHub
ozankabak opened a new pull request, #16217: URL: https://github.com/apache/datafusion/pull/16217 ## Which issue does this PR close? - Closes #13748. ## Rationale for this change DF's theoretical approach to representing orderings, equivalences and constants is sound, bu

[PR] Fix: GROUPING SETS accept values without parenthesis [datafusion-sqlparser-rs]

2025-05-30 Thread via GitHub
Vedin opened a new pull request, #1867: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1867 `GROUPING SETS` currently accepts only tuples. It can be single elements as well. At least, I can see this behaviour in Hive and Snowflake: https://hive.apache.org/docs/latest/3015

Re: [PR] Implement schema adapter support for FileSource and add integration tests [datafusion]

2025-05-30 Thread via GitHub
alamb merged PR #16148: URL: https://github.com/apache/datafusion/pull/16148 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922322640 I agree with @pepijnve that adding a yield to the producer is not clearly a better choice -- I think the issue is for streams like AggregateExec that can always make progress S

Re: [PR] Adds support for mysql's drop index [datafusion-sqlparser-rs]

2025-05-30 Thread via GitHub
alamb commented on PR #1864: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1864#issuecomment-2922335147 The code just keeps churning along in this repo. Thank you @iffyio and @dmzmk -- This is an automated message from the Apache Git Service. To respond to the message, pl

Re: [PR] chore(deps): bump testcontainers-modules from 0.12.0 to 0.12.1 [datafusion]

2025-05-30 Thread via GitHub
xudong963 merged PR #16212: URL: https://github.com/apache/datafusion/pull/16212 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@data

Re: [PR] Add new stats pruning helpers to allow combining partition values in file level stats [datafusion]

2025-05-30 Thread via GitHub
xudong963 commented on code in PR #16139: URL: https://github.com/apache/datafusion/pull/16139#discussion_r2115539220 ## datafusion/common/src/pruning.rs: ## @@ -122,3 +127,984 @@ pub trait PruningStatistics { values: &HashSet, ) -> Option; } + +/// Prune files ba

  1   2   >