[PR] Parquet encryption [datafusion]

2025-05-30 Thread via GitHub
corwinjoy opened a new pull request, #16219: URL: https://github.com/apache/datafusion/pull/16219 Draft PR to add encryption to datafusion -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

Re: [PR] feat: Support RightMark join for NestedLoop and Hash join [datafusion]

2025-05-30 Thread via GitHub
comphead commented on code in PR #16083: URL: https://github.com/apache/datafusion/pull/16083#discussion_r2116905907 ## datafusion/physical-plan/src/joins/sort_merge_join.rs: ## @@ -221,9 +221,11 @@ impl SortMergeJoinExec { // When output schema contains only the right

Re: [PR] Update tpch, clickbench, sort_tpch to mark failed queries [datafusion]

2025-05-30 Thread via GitHub
Copilot commented on code in PR #16182: URL: https://github.com/apache/datafusion/pull/16182#discussion_r2117256270 ## benchmarks/compare.py: ## @@ -156,8 +173,8 @@ def compare( console.print(table) # Calculate averages -avg_baseline_time = total_baseline_time /

Re: [PR] Minor: remove unused IPCWriter [datafusion]

2025-05-30 Thread via GitHub
2010YOUY01 merged PR #16215: URL: https://github.com/apache/datafusion/pull/16215 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@dat

Re: [I] Expr formatting missing parentheses [datafusion]

2025-05-30 Thread via GitHub
2010YOUY01 commented on issue #16054: URL: https://github.com/apache/datafusion/issues/16054#issuecomment-2924301139 > I've had a closer look, we simply wrap `- {foo}` in parentheses regardless of whether that's strictly necessary. Following the same principle for `BinaryExpr` won't look as

Re: [PR] Update tpch, clickbench, sort_tpch to mark failed queries [datafusion]

2025-05-30 Thread via GitHub
2010YOUY01 commented on code in PR #16182: URL: https://github.com/apache/datafusion/pull/16182#discussion_r2117240352 ## benchmarks/src/util/run.rs: ## @@ -138,6 +144,13 @@ impl BenchmarkRun { } } +/// Mark current query +pub fn mark_failed(&mut self) {

Re: [I] union all +aggregate function in the recursive cte results an infinite loop [datafusion-python]

2025-05-30 Thread via GitHub
l1t1 closed issue #1131: union all +aggregate function in the recursive cte results an infinite loop URL: https://github.com/apache/datafusion-python/issues/1131 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL abo

Re: [PR] Parquet encryption [datafusion]

2025-05-30 Thread via GitHub
corwinjoy closed pull request #16219: Parquet encryption URL: https://github.com/apache/datafusion/pull/16219 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-

Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-05-30 Thread via GitHub
2010YOUY01 commented on PR #16210: URL: https://github.com/apache/datafusion/pull/16210#issuecomment-2924177927 This feature is intended purely for performance, right? Not because there are cases that can only be run with NLJ. If that's the case, I suggest to do some benchmarks upfront to

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
ozankabak commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2924371687 Seems like folks are on the way to finding a better solution -- let's see where the discussion goes. -- This is an automated message from the Apache Git Service. To respond to th

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922057060 ๐Ÿค–: Benchmark completed Details ``` Comparing HEAD and issue_16193 Benchmark cancellation.json โ”

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
berkaysynnada commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922061847 Thanks @zhuqi-lucas. The problem is clearly visible here, and the solution makes sense. It doesn't sacrifice performance as seen in the benchmarks, and not introduce any comple

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922117491 > For example, FileStream could count how many batches it sends back-to-back without yielding, and after a certain threshold, it yields. WDYT? Perhaps DataSourceExec could wra

Re: [PR] WIP: Test DataFusion with experimental parquet pushdown [datafusion]

2025-05-30 Thread via GitHub
alamb commented on code in PR #16208: URL: https://github.com/apache/datafusion/pull/16208#discussion_r2115668031 ## datafusion/physical-plan/src/filter.rs: ## @@ -689,6 +702,46 @@ fn filter_and_project( }) } +impl FilterExecStream { +/// Evaluates the predicate

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
berkaysynnada commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922064218 > Thanks @zhuqi-lucas. The problem is clearly visible here, and the solution makes sense. It doesn't sacrifice performance as seen in the benchmarks, and not introduce any comp

Re: [PR] Consolidate feature flags into configuration guide [datafusion]

2025-05-30 Thread via GitHub
alamb closed pull request #14657: Consolidate feature flags into configuration guide URL: https://github.com/apache/datafusion/pull/14657 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific

[PR] Fix: GROUPING SETS accept values without parenthesis [datafusion-sqlparser-rs]

2025-05-30 Thread via GitHub
Vedin opened a new pull request, #1867: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1867 `GROUPING SETS` currently accepts only tuples. It can be single elements as well. At least, I can see this behaviour in Hive and Snowflake: https://hive.apache.org/docs/latest/3015

[PR] [MAJOREquivalence System Overhaul [datafusion]

2025-05-30 Thread via GitHub
ozankabak opened a new pull request, #16217: URL: https://github.com/apache/datafusion/pull/16217 ## Which issue does this PR close? - Closes #13748. ## Rationale for this change DF's theoretical approach to representing orderings, equivalences and constants is sound, bu

Re: [PR] Change default SQL mapping for `VARCAHR` from `Utf8` to `Utf8View` [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16142: URL: https://github.com/apache/datafusion/pull/16142#issuecomment-2922372187 Thanks again @zhuqi-lucas and @xudong963 I have also made a PR to add a mention of this change to the upgrade guide: - https://github.com/apache/datafusion/pull/16216 -- Th

Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]

2025-05-30 Thread via GitHub
alamb commented on issue #16206: URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2922383864 I have noticed something similar -- that the pattern of "take / concat" results in potentially non trival overhead. I have been recently working on some related code (the sa

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
berkaysynnada commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922380421 @pepijnve thanks for the example. However, the same example can also consume the pending return generated by Aggregate, so the guarantee cannot be upheld in our model. I

Re: [I] Consolidate schema adapter tests in schema_adapter_integration_tests.rs [datafusion]

2025-05-30 Thread via GitHub
alamb commented on issue #16202: URL: https://github.com/apache/datafusion/issues/16202#issuecomment-2922385651 Thanks @kosiew -- here is a PR: - https://github.com/apache/datafusion/pull/16214 -- This is an automated message from the Apache Git Service. To respond to the message, plea

Re: [PR] [MAJOR] Equivalence System Overhaul [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16217: URL: https://github.com/apache/datafusion/pull/16217#issuecomment-2922387217 ๐Ÿ˜ฎ I will put this on my list to review asap -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL abo

Re: [PR] [MAJOR] Equivalence System Overhaul [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16217: URL: https://github.com/apache/datafusion/pull/16217#issuecomment-2922389594 @wiedld @xudong963 and @suremarc perhaps you are also interested in this PR given your past interest in the equivalence code. Also cc @chenkovsky -- This is an automated message f

Re: [PR] [MAJOR] Equivalence System Overhaul [datafusion]

2025-05-30 Thread via GitHub
ozankabak commented on PR #16217: URL: https://github.com/apache/datafusion/pull/16217#issuecomment-2922392615 Thanks for the quick response @alamb -- I added some reviewing tips and tricks to the PR body to help you guys out. -- This is an automated message from the Apache Git Service. T

Re: [PR] Additional placeholder datatype inferencing [datafusion]

2025-05-30 Thread via GitHub
alamb commented on code in PR #15980: URL: https://github.com/apache/datafusion/pull/15980#discussion_r2115997415 ## datafusion/expr/src/logical_plan/plan.rs: ## @@ -1494,6 +1494,14 @@ impl LogicalPlan { let mut param_types: HashMap> = HashMap::new(); self.a

Re: [I] Blog post about parquet vs custom file formats [datafusion]

2025-05-30 Thread via GitHub
alamb commented on issue #16149: URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2922493420 > Do some changes in arrow-rs clickbench benchmark: Do I understand that the changes you report are due to simply rewriting the parquet files to have a page index and be unc

Re: [PR] [MAJOR] Equivalence System Overhaul [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16217: URL: https://github.com/apache/datafusion/pull/16217#issuecomment-2922489274 > Move fast with the release while this gets reviews, which has the advantage of making DF48 a "checkpoint" before this PR. The downside is that we will need to do extensive testing bo

Re: [PR] [MAJOR] Equivalence System Overhaul [datafusion]

2025-05-30 Thread via GitHub
ozankabak commented on PR #16217: URL: https://github.com/apache/datafusion/pull/16217#issuecomment-2922481568 Let's definitely time the merge and DF48 release in a cooperative way. We have two competing realities at play: (1) The PR attracts a lot of conflicts and it is not easy to resolve

Re: [PR] Reduce size of `Expr` struct [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16207: URL: https://github.com/apache/datafusion/pull/16207#issuecomment-2922512365 ๐Ÿค– `./gh_compare_branch_bench.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch_bench.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~

Re: [PR] Reduce size of `Expr` struct [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16207: URL: https://github.com/apache/datafusion/pull/16207#issuecomment-2922513729 I kicked off some benchmarks for this PR -- thank you @hendrikmakait ๐Ÿ™ -- This is an automated message from the Apache Git Service. To respond to the message, please log on t

Re: [PR] Set Formatted TableOptions Enum [datafusion]

2025-05-30 Thread via GitHub
alamb commented on code in PR #16166: URL: https://github.com/apache/datafusion/pull/16166#discussion_r2115541241 ## datafusion/common/src/config.rs: ## @@ -1566,8 +1517,68 @@ impl TableOptions { /// # Parameters /// /// * `format`: The file format to use (e.g., C

Re: [PR] Set Formatted TableOptions Enum [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16166: URL: https://github.com/apache/datafusion/pull/16166#issuecomment-2921854046 > > However, I didn't see any tests or examples that show a different behavior after this PR. > > In my initial quick read through it I would say it has made the APIs harder to

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
ozankabak commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922512158 Pipeline-breaking operators creating cancellation issues is a universal problem -- this is not an `AggregateExec` issue. Therefore, I firmly believe that the solution has to be uni

Re: [PR] Add change to VARCHAR in the upgrade guide [datafusion]

2025-05-30 Thread via GitHub
alamb commented on code in PR #16216: URL: https://github.com/apache/datafusion/pull/16216#discussion_r2116071996 ## docs/source/library-user-guide/upgrading.md: ## @@ -21,6 +21,55 @@ ## DataFusion `48.0.0` +### `VARCHAR` SQL type changed to map to `Utf8View` Arrow type + +

Re: [PR] chore: Use unique artifact names in Java test run [datafusion-comet]

2025-05-30 Thread via GitHub
codecov-commenter commented on PR #1818: URL: https://github.com/apache/datafusion-comet/pull/1818#issuecomment-2922612812 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1818?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

[I] [EPIC] Implement expressions as ScalarUDFImpl [datafusion-comet]

2025-05-30 Thread via GitHub
andygrove opened a new issue, #1819: URL: https://github.com/apache/datafusion-comet/issues/1819 ### What is the problem the feature request solves? Many of Comet's Spark-compatible DataFusion expressions are currently implemented as `PhysicalExpr`. We would like to update them to imp

Re: [I] [EPIC] Complete `datafusion-spark` Spark Compatible Functions [datafusion]

2025-05-30 Thread via GitHub
andygrove commented on issue #15914: URL: https://github.com/apache/datafusion/issues/15914#issuecomment-2922618245 @shehabgamin @alamb I created an epic in Comet for implementing our current expressions as `ScalarUDFImpl` rather than `PhysicalExpr` as a first step towards contributing them

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922625826 > Pipeline-breaking operators creating cancellation issues is a universal problem -- this is not an AggregateExec issue. @ozankabak It's indeed a universal problem, but since

Re: [PR] Reduce size of `Expr` struct [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16207: URL: https://github.com/apache/datafusion/pull/16207#issuecomment-2922659200 ๐Ÿค– `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubun

Re: [I] Spark-compatible CAST operation [datafusion]

2025-05-30 Thread via GitHub
Blizzara commented on issue #11201: URL: https://github.com/apache/datafusion/issues/11201#issuecomment-2922682982 Just fwiw, we already use Comet's Cast in our use of DF, through a rewrite rule: ``` impl FunctionRewrite for OurCastRewrite { fn name(&self) -> &str { "

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
zhuqi-lucas commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922689219 Thank you @pepijnve , i may got it now, it means: # Per-Batch Overhead Comparison ## 1. Counter + compare ```rust this.batches_processed += 1; // I

Re: [PR] WIP: Test DataFusion with experimental parquet pushdown [datafusion]

2025-05-30 Thread via GitHub
Dandandan commented on code in PR #16208: URL: https://github.com/apache/datafusion/pull/16208#discussion_r2115388263 ## datafusion/physical-plan/src/filter.rs: ## @@ -689,6 +702,46 @@ fn filter_and_project( }) } +impl FilterExecStream { +/// Evaluates the predic

Re: [PR] Adds support for mysql's drop index [datafusion-sqlparser-rs]

2025-05-30 Thread via GitHub
iffyio merged PR #1864: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1864 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2921910759 @zhuqi-lucas I took the liberty of moving the code again as I gave you a bad suggestion yesterday. I also updated the description as i think this PR now handles NoGrouping and

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
zhuqi-lucas commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2921914352 > @zhuqi-lucas I took the liberty of moving the code again as I gave you a bad suggestion yesterday. > > I also updated the description as i think this PR now handles NoGro

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
zhuqi-lucas commented on code in PR #16196: URL: https://github.com/apache/datafusion/pull/16196#discussion_r2115587079 ## datafusion/common-runtime/src/common.rs: ## @@ -98,6 +103,71 @@ impl Drop for SpawnedTask { } } +/// Number of batches to yield before voluntarily r

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2921920586 ๐Ÿค– `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubun

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2921921243 ๐Ÿค– `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubun

Re: [PR] feat: support inability to yield for loop when it's not using Tokio MPSC (RecordBatchReceiverStream) [datafusion]

2025-05-30 Thread via GitHub
alamb commented on code in PR #16196: URL: https://github.com/apache/datafusion/pull/16196#discussion_r2115561327 ## datafusion/common-runtime/src/common.rs: ## @@ -98,6 +103,71 @@ impl Drop for SpawnedTask { } } +/// Number of batches to yield before voluntarily returni

Re: [PR] Substrait: handle identical grouping expressions [datafusion]

2025-05-30 Thread via GitHub
alamb merged PR #16189: URL: https://github.com/apache/datafusion/pull/16189 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [I] Improve plan shutdown speed on cancel (improve performance on the cancellation benchmark) [datafusion]

2025-05-30 Thread via GitHub
alamb commented on issue #15314: URL: https://github.com/apache/datafusion/issues/15314#issuecomment-2921951978 There is a new proposed fix in - https://github.com/apache/datafusion/pull/16196 -- This is an automated message from the Apache Git Service. To respond to the message, pleas

Re: [PR] Always add parentheses when formatting`BinaryExpr` with `SchemaDisplay` [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16209: URL: https://github.com/apache/datafusion/pull/16209#issuecomment-2921965832 Given how widely used this code is I suspect this change will be quite invasive / require changing many tests -- This is an automated message from the Apache Git Service. To resp

Re: [I] Substrait: Reading a plan with aggregation with two identical grouping exprs fails to produce correct output columns [datafusion]

2025-05-30 Thread via GitHub
alamb closed issue #16140: Substrait: Reading a plan with aggregation with two identical grouping exprs fails to produce correct output columns URL: https://github.com/apache/datafusion/issues/16140 -- This is an automated message from the Apache Git Service. To respond to the message, please

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
zhuqi-lucas commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922813081 Drop off for today, i will start the work tomorrow. Thank you @pepijnve for good discussion, i think it deserve we have the cancellation solution. -- This is an automated messa

[PR] chore(deps): bump testcontainers-modules from 0.12.0 to 0.12.1 [datafusion]

2025-05-30 Thread via GitHub
dependabot[bot] opened a new pull request, #16212: URL: https://github.com/apache/datafusion/pull/16212 Bumps [testcontainers-modules](https://github.com/testcontainers/testcontainers-rs-modules-community) from 0.12.0 to 0.12.1. Release notes Sourced from https://github.com/testco

Re: [PR] Reduce size of `Expr` struct [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16207: URL: https://github.com/apache/datafusion/pull/16207#issuecomment-2922659103 ๐Ÿค–: Benchmark completed Details ``` group main reduce-expr-struct-size -

Re: [PR] Reduce size of `Expr` struct [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16207: URL: https://github.com/apache/datafusion/pull/16207#issuecomment-2922755179 ๐Ÿค–: Benchmark completed Details ``` Comparing HEAD and reduce-expr-struct-size Benchmark clickbench_extended.json --

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
zhuqi-lucas commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922801898 > Would this require any changes to `SessionContext`? Cancellation is per query, so I think you would want to add the `cancelled: AtomicBoolean`, `fn cancel()` and `fn is_cancell

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922888436 I also agree we could make a query cancellation token work -- the potential challenges of that approach are: 1. We have to find some way to trigger the cancellation token when the s

Re: [I] docs: Instructions for running sbt fail due to OOM [datafusion-comet]

2025-05-30 Thread via GitHub
parthchandra commented on issue #1812: URL: https://github.com/apache/datafusion-comet/issues/1812#issuecomment-2922892698 Spark's instructions for using sbt have mostly be applicable to Comet for me https://spark.apache.org/docs/latest/building-spark.html#building-with-sbt -- This is

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922714438 After my previous comment I started thinking about how our Java code handles this kind of thing. Typically when we do a `Future#cancel(mayInterrupt: true)` we have `Thread#isInterru

[PR] Add change to VARCHAR in the upgrade guide [datafusion]

2025-05-30 Thread via GitHub
alamb opened a new pull request, #16216: URL: https://github.com/apache/datafusion/pull/16216 ## Which issue does this PR close? - Related to https://github.com/apache/datafusion/issues/15096 - Follow on to https://github.com/apache/datafusion/pull/16142 ## Rationale for

Re: [PR] fix: map parquet field_id correctly (native_iceberg_compat) [datafusion-comet]

2025-05-30 Thread via GitHub
parthchandra commented on PR #1815: URL: https://github.com/apache/datafusion-comet/pull/1815#issuecomment-2922733351 Changing this to draft while I look into the CI failures -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub an

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922773141 Would this require any changes to `SessionContext`? Cancellation is per query, so I think you would want to add the `cancelled: AtomicBoolean`, `fn cancel()` and `fn is_cancelled()

Re: [PR] minor: Refactor PhysicalPlanner::default() to avoid duplicate code [datafusion-comet]

2025-05-30 Thread via GitHub
codecov-commenter commented on PR #1821: URL: https://github.com/apache/datafusion-comet/pull/1821#issuecomment-2922995270 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1821?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
zhuqi-lucas commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922745280 > After my previous comment I started thinking about how our Java code handles this kind of thing. Typically when we do a `Future#cancel(mayInterrupt: true)` we have `Thread#isIn

Re: [I] Support `map_values` [datafusion-comet]

2025-05-30 Thread via GitHub
mbutrovich commented on issue #1789: URL: https://github.com/apache/datafusion-comet/issues/1789#issuecomment-2922749343 The investigation seems to be diverging from `map_values` and seems more focused on map value nullability, and in particular struct field nullability and how that's enco

Re: [PR] feat: Translate Hadoop S3A configurations to object_store configurations [datafusion-comet]

2025-05-30 Thread via GitHub
parthchandra commented on PR #1817: URL: https://github.com/apache/datafusion-comet/pull/1817#issuecomment-2922781217 I have another thought on this. Any number of users have developed custom `AWSCredentialsProvider`s in Java but we would not have corresponding implementations in Rust (the

Re: [PR] Minor: update documentation for PrunableStatistics [datafusion]

2025-05-30 Thread via GitHub
comphead merged PR #16213: URL: https://github.com/apache/datafusion/pull/16213 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@dataf

Re: [PR] Minor: remove unused IPCWriter [datafusion]

2025-05-30 Thread via GitHub
comphead commented on PR #16215: URL: https://github.com/apache/datafusion/pull/16215#issuecomment-2922744744 hope downstream users didn't use it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to t

Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-05-30 Thread via GitHub
jonathanc-n commented on PR #16210: URL: https://github.com/apache/datafusion/pull/16210#issuecomment-2922765656 I believe a follow up pull request can be made to actually have this be used in the physical plan based on some statistics. @comphead Does this look fine for now? -- This is a

Re: [PR] feat: Translate Hadoop S3A configurations to object_store configurations [datafusion-comet]

2025-05-30 Thread via GitHub
parthchandra commented on code in PR #1817: URL: https://github.com/apache/datafusion-comet/pull/1817#discussion_r2116175913 ## native/Cargo.lock: ## @@ -375,6 +375,353 @@ version = "1.4.0" source = "registry+https://github.com/rust-lang/crates.io-index"; checksum = "ace50bade

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922901439 @ozankabak > Pipeline-breaking operators creating cancellation issues is a universal problem -- this is not an `AggregateExec` issue. Therefore, I firmly believe that the solut

Re: [PR] Minor: remove unused IPCWriter [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16215: URL: https://github.com/apache/datafusion/pull/16215#issuecomment-2922905121 > hope downstream users didn't use it If they need to write data to IPC files, I think either of the upstream arrow-rs choices is better: 1. https://docs.rs/arrow-ipc/55.1.0/a

Re: [PR] Reduce size of `Expr` struct [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16207: URL: https://github.com/apache/datafusion/pull/16207#issuecomment-2922908904 I'll leave this open for a few days to gather any remaining feedback -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub an

Re: [PR] Feat: support bit_count function [datafusion-comet]

2025-05-30 Thread via GitHub
kazantsev-maksim commented on PR #1602: URL: https://github.com/apache/datafusion-comet/pull/1602#issuecomment-2922907576 Thank you for the feedback! @kazuyukitanimura @parthchandra -- This is an automated message from the Apache Git Service. To respond to the message, please log on to G

[I] Use sha2 implementation from datafusion-spark crate [datafusion-comet]

2025-05-30 Thread via GitHub
andygrove opened a new issue, #1820: URL: https://github.com/apache/datafusion-comet/issues/1820 ### What is the problem the feature request solves? The `datafusion-spark` crate now provides a sha2 implementation (https://github.com/apache/datafusion/pull/16168), so we should be able

[PR] minore: Refactor PhysicalPlanner::default() to avoid duplicate code [datafusion-comet]

2025-05-30 Thread via GitHub
andygrove opened a new pull request, #1821: URL: https://github.com/apache/datafusion-comet/pull/1821 ## Which issue does this PR close? N/A ## Rationale for this change As we add more expressions from `datafusion-spark` we would need to update two identi

Re: [PR] minor: Refactor PhysicalPlanner::default() to avoid duplicate code [datafusion-comet]

2025-05-30 Thread via GitHub
andygrove commented on code in PR #1821: URL: https://github.com/apache/datafusion-comet/pull/1821#discussion_r2116291551 ## native/core/src/execution/planner.rs: ## @@ -146,15 +146,7 @@ pub struct PhysicalPlanner { impl Default for PhysicalPlanner { fn default() -> Self

Re: [PR] Add support for `TABLESAMPLE` pipe operator [datafusion-sqlparser-rs]

2025-05-30 Thread via GitHub
iffyio merged PR #1860: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1860 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr

Re: [PR] Minor: update documentation for PrunableStatistics [datafusion]

2025-05-30 Thread via GitHub
alamb commented on code in PR #16213: URL: https://github.com/apache/datafusion/pull/16213#discussion_r2115648865 ## datafusion/common/src/pruning.rs: ## @@ -258,7 +261,16 @@ impl PruningStatistics for PartitionPruningStatistics { } /// Prune a set of containers represented

Re: [PR] Minor: update documentation for PrunableStatistics [datafusion]

2025-05-30 Thread via GitHub
alamb commented on code in PR #16213: URL: https://github.com/apache/datafusion/pull/16213#discussion_r2115648305 ## datafusion/common/src/pruning.rs: ## @@ -137,19 +138,21 @@ pub trait PruningStatistics { #[derive(Clone)] pub struct PartitionPruningStatistics { /// Value

Re: [PR] Substrait: handle identical grouping expressions [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16189: URL: https://github.com/apache/datafusion/pull/16189#issuecomment-2921966282 Thanks everyone! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

Re: [PR] Add new stats pruning helpers to allow combining partition values in file level stats [datafusion]

2025-05-30 Thread via GitHub
alamb commented on code in PR #16139: URL: https://github.com/apache/datafusion/pull/16139#discussion_r2115649463 ## datafusion/common/src/pruning.rs: ## @@ -122,3 +127,984 @@ pub trait PruningStatistics { values: &HashSet, ) -> Option; } + +/// Prune files based

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922028579 ๐Ÿค–: Benchmark completed Details ``` Comparing HEAD and issue_16193 Benchmark clickbench_extended.json โ”โ”

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922028695 ๐Ÿค– `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubun

Re: [PR] Simplify FileSource / SchemaAdapterFactory API [datafusion]

2025-05-30 Thread via GitHub
alamb commented on code in PR #16214: URL: https://github.com/apache/datafusion/pull/16214#discussion_r2115790804 ## datafusion/core/tests/integration_tests/schema_adapter_integration_tests.rs: ## @@ -258,3 +259,187 @@ fn test_schema_adapter_preservation() { // Verify the s

Re: [PR] Implement schema adapter support for FileSource and add integration tests [datafusion]

2025-05-30 Thread via GitHub
alamb commented on code in PR #16148: URL: https://github.com/apache/datafusion/pull/16148#discussion_r2115794350 ## datafusion/datasource/src/test_util.rs: ## @@ -81,6 +83,8 @@ impl FileSource for MockSource { fn file_type(&self) -> &str { "mock" } + +imp

Re: [PR] Add new stats pruning helpers to allow combining partition values in file level stats [datafusion]

2025-05-30 Thread via GitHub
alamb merged PR #16139: URL: https://github.com/apache/datafusion/pull/16139 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

[PR] Simplify FileSource / SchemaAdapterFactory API [datafusion]

2025-05-30 Thread via GitHub
alamb opened a new pull request, #16214: URL: https://github.com/apache/datafusion/pull/16214 ## Which issue does this PR close? - Follow on for https://github.com/apache/datafusion/pull/16148 ## Rationale for this change During the review of https://github.co

Re: [PR] WIP: Test DataFusion with experimental parquet pushdown [datafusion]

2025-05-30 Thread via GitHub
Dandandan commented on code in PR #16208: URL: https://github.com/apache/datafusion/pull/16208#discussion_r2115799584 ## datafusion/physical-plan/src/filter.rs: ## @@ -689,6 +702,46 @@ fn filter_and_project( }) } +impl FilterExecStream { +/// Evaluates the predic

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-293237 > yes, it sounds better. It will bring less friction and will be more inclusive Think about this a bit more, it's an interesting design question whose responsibility it is to

Re: [PR] Adds support for mysql's drop index [datafusion-sqlparser-rs]

2025-05-30 Thread via GitHub
alamb commented on PR #1864: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1864#issuecomment-2922335147 The code just keeps churning along in this repo. Thank you @iffyio and @dmzmk -- This is an automated message from the Apache Git Service. To respond to the message, pl

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
alamb commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922322640 I agree with @pepijnve that adding a yield to the producer is not clearly a better choice -- I think the issue is for streams like AggregateExec that can always make progress S

Re: [PR] Implement schema adapter support for FileSource and add integration tests [datafusion]

2025-05-30 Thread via GitHub
alamb merged PR #16148: URL: https://github.com/apache/datafusion/pull/16148 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
zhuqi-lucas commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922532031 > > Even if I couldn't get consensus on that direction, this version could still be improved. For example, by inserting the YieldStream only when itโ€™s truly needed, such as in th

Re: [PR] Reduce size of `Expr` struct [datafusion]

2025-05-30 Thread via GitHub
alamb commented on code in PR #16207: URL: https://github.com/apache/datafusion/pull/16207#discussion_r2116011961 ## datafusion/expr/src/expr.rs: ## @@ -330,7 +331,7 @@ pub enum Expr { /// [`ExprFunctionExt`]: crate::expr_fn::ExprFunctionExt AggregateFunction(Aggregate

Re: [I] Blog post about parquet vs custom file formats [datafusion]

2025-05-30 Thread via GitHub
zhuqi-lucas commented on issue #16149: URL: https://github.com/apache/datafusion/issues/16149#issuecomment-2922543426 > > Do some changes in arrow-rs clickbench benchmark: > > Do I understand that the changes you report are due to simply rewriting the parquet files to have a page inde

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-30 Thread via GitHub
zhuqi-lucas commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922500062 > I think the issue is for streams like AggregateExec that can always make progress. So while this PR wraps the input to the AggregateExecStream, I think logically the yielding i

  1   2   >