Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-08 Thread via GitHub
ozankabak commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2954860574 Great, we will incorporate your feedback, answer your questions and write a list of follow-on work to serve as a basis for the tickets. I will only merge after these are done so we

Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]

2025-06-08 Thread via GitHub
2010YOUY01 commented on code in PR #16325: URL: https://github.com/apache/datafusion/pull/16325#discussion_r2135088072 ## datafusion/sql/src/select.rs: ## @@ -77,11 +82,29 @@ impl SqlToRel<'_, S> { } // Process `from` clause -let plan = self.plan_from

Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]

2025-06-08 Thread via GitHub
a-agmon commented on issue #16303: URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2954755281 Hi @alamb, @comphead raises a couple of good questions about the PR, so I'm linking it here to hear your thoughts. https://github.com/apache/datafusion/pull/16332#discuss

Re: [PR] Feat: Support Spark 4.0.0 part1 [datafusion-comet]

2025-06-08 Thread via GitHub
huaxingao commented on code in PR #1830: URL: https://github.com/apache/datafusion-comet/pull/1830#discussion_r2135066516 ## spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala: ## @@ -1437,7 +1437,7 @@ object QueryPlanSerde extends Logging with CometExprShim {

Re: [PR] Feat: Support Spark 4.0.0 part1 [datafusion-comet]

2025-06-08 Thread via GitHub
huaxingao commented on code in PR #1830: URL: https://github.com/apache/datafusion-comet/pull/1830#discussion_r2135053492 ## common/src/main/java/org/apache/comet/parquet/TypeUtil.java: ## @@ -74,7 +74,8 @@ public static ColumnDescriptor convertToParquet(StructField field) {

Re: [I] SparkSha2 is not compliant with Spark and does not support Int32 type [datafusion]

2025-06-08 Thread via GitHub
rishvin commented on issue #16336: URL: https://github.com/apache/datafusion/issues/16336#issuecomment-2954640277 I will create a PR for this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[I] SparkSha2 is not compliant with Spark and does not support Int32 type [datafusion]

2025-06-08 Thread via GitHub
rishvin opened a new issue, #16336: URL: https://github.com/apache/datafusion/issues/16336 ### Describe the bug This ticket is related to [#1820](https://github.com/apache/datafusion-comet/issues/1820) from Comet. We are working on using Datafusion's Sha2 (`SparkSha2`) implemen

[PR] Add support `UInt64` data type for `to_hex` [datafusion]

2025-06-08 Thread via GitHub
tlm365 opened a new pull request, #16335: URL: https://github.com/apache/datafusion/pull/16335 ## Which issue does this PR close? - Closes #16327 . ## Rationale for this change ## What changes are included in this PR? ## Are these changes te

Re: [PR] Add support for glob string in datafusion-cli query [datafusion]

2025-06-08 Thread via GitHub
a-agmon commented on code in PR #16332: URL: https://github.com/apache/datafusion/pull/16332#discussion_r2134804642 ## datafusion-cli/tests/sql/integration/glob_test.sql: ## @@ -0,0 +1,15 @@ +-- Test glob function with files available in CI +-- Test 1: Single CSV file - verify b

Re: [PR] Add support for glob string in datafusion-cli query [datafusion]

2025-06-08 Thread via GitHub
a-agmon commented on code in PR #16332: URL: https://github.com/apache/datafusion/pull/16332#discussion_r2134812154 ## datafusion-cli/tests/sql/integration/glob_test.sql: ## @@ -0,0 +1,15 @@ +-- Test glob function with files available in CI +-- Test 1: Single CSV file - verify b

Re: [I] to_hex cannot take UInt64 [datafusion]

2025-06-08 Thread via GitHub
chenkovsky commented on issue #16327: URL: https://github.com/apache/datafusion/issues/16327#issuecomment-2954547105 take

[PR] feat: support FixedSizeList for array_has [datafusion]

2025-06-08 Thread via GitHub
chenkovsky opened a new pull request, #16333: URL: https://github.com/apache/datafusion/pull/16333 ## Which issue does this PR close? ## Rationale for this change array_has doesn't support fixed size list currently. ## What changes are included in this PR? add

Re: [PR] Feature/parse float as decimal default true [datafusion]

2025-06-08 Thread via GitHub
github-actions[bot] commented on PR #14752: URL: https://github.com/apache/datafusion/pull/14752#issuecomment-2954457572 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or

Re: [PR] Fix inconsistent schema projection in ListingTable when file order varies by tracking schema source [datafusion]

2025-06-08 Thread via GitHub
kosiew commented on code in PR #16305: URL: https://github.com/apache/datafusion/pull/16305#discussion_r2134362818 ## datafusion/core/src/datasource/listing/table.rs: ## @@ -2452,4 +2178,381 @@ mod tests { Ok(()) } + +#[tokio::test] +async fn infer_preser

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2954400147 🤖: Benchmark completed Details ``` Comparing HEAD and issue_16193 Benchmark cancellation.json ┏

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2954399910 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubun

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2954399882 🤖: Benchmark completed Details ``` Comparing HEAD and issue_16193 Benchmark clickbench_extended.json ┏━

Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-08 Thread via GitHub
jonathanc-n closed pull request #16210: feat: Support null aware + equijoins for `NestedLoopJoin` URL: https://github.com/apache/datafusion/pull/16210

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2954372429 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubun

Re: [PR] feat: use spawned tasks to reduce call stack depth and avoid busy waiting [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16319: URL: https://github.com/apache/datafusion/pull/16319#issuecomment-2954372374 🤖: Benchmark completed Details ``` Comparing HEAD and issue_16318 Benchmark clickbench_extended.json ┏━

Re: [PR] feat: use spawned tasks to reduce call stack depth and avoid busy waiting [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16319: URL: https://github.com/apache/datafusion/pull/16319#issuecomment-2954341770 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubun

Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]

2025-06-08 Thread via GitHub
alamb commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2954343017 Looks great @zhuqi-lucas -- I will find time to review the arrow changes

Re: [PR] feat: use spawned tasks to reduce call stack depth and avoid busy waiting [datafusion]

2025-06-08 Thread via GitHub
alamb commented on code in PR #16319: URL: https://github.com/apache/datafusion/pull/16319#discussion_r2134861645 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -1126,14 +1127,20 @@ impl ExecutionPlan for SortExec { Ok(Box::pin(RecordBatchStreamAdapter::ne

Re: [PR] feat: mapping sql Char/Text/String default to Utf8View [datafusion]

2025-06-08 Thread via GitHub
alamb commented on code in PR #16290: URL: https://github.com/apache/datafusion/pull/16290#discussion_r2134860028 ## datafusion/sqllogictest/test_files/arrow_files.slt: ## @@ -61,22 +61,12 @@ LOCATION '../core/tests/data/partitioned_table_arrow/' PARTITIONED BY (part); # sel

Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-08 Thread via GitHub
jonathanc-n commented on PR #16210: URL: https://github.com/apache/datafusion/pull/16210#issuecomment-2954336669 @comphead Pointers have been resolved, good for another review. Thanks!

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2954329852 I am happy to wait for a bit more testing on this PR -- we have now about a month before the next release so there is no pressure from there. However, I do like a bias of action

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-08 Thread via GitHub
alamb commented on code in PR #16196: URL: https://github.com/apache/datafusion/pull/16196#discussion_r2134852899 ## datafusion/physical-optimizer/src/optimizer.rs: ## @@ -137,6 +138,7 @@ impl PhysicalOptimizer { // are not present, the load of executors such as joi

Re: [I] Iceberg integration - parquet-column version conflicts [datafusion-comet]

2025-06-08 Thread via GitHub
YanivKunda commented on issue #1833: URL: https://github.com/apache/datafusion-comet/issues/1833#issuecomment-2954321305 Does it make sense to decide that Iceberg integration is unsupported for Spark 3.x - and work on it only for Spark 4.0, which uses Parquet 1.15.2 (in the release vers

Re: [PR] Feat: Support Spark 4.0.0 part1 [datafusion-comet]

2025-06-08 Thread via GitHub
YanivKunda commented on code in PR #1830: URL: https://github.com/apache/datafusion-comet/pull/1830#discussion_r2134838674 ## pom.xml: ## @@ -600,7 +600,7 @@ under the License. 2.13.14 Review Comment: Scala was upgraded in the 4.0.0 release version: ```

Re: [PR] upgraded spark 3.5.5 to 3.5.6 [datafusion-comet]

2025-06-08 Thread via GitHub
YanivKunda commented on PR #1861: URL: https://github.com/apache/datafusion-comet/pull/1861#issuecomment-2954292860 Rebased latest changes and updated diff file (missing IgnoreComet)

Re: [I] Add `CoalesceBatchesExec` to `NestedLoopJoinExec` [datafusion]

2025-06-08 Thread via GitHub
jonathanc-n commented on issue #16328: URL: https://github.com/apache/datafusion/issues/16328#issuecomment-2954283635 This one would help with performance related to #16210; I would be willing to pick this up after we decide whether that PR should be merged.

Re: [PR] Metadata handling announcement [datafusion-site]

2025-06-08 Thread via GitHub
alamb commented on code in PR #73: URL: https://github.com/apache/datafusion-site/pull/73#discussion_r2134828346 ## content/blog/2025-06-09-metadata-handling.md: ## @@ -0,0 +1,98 @@ +--- +layout: post +title: Metadata handling in user defined functions Review Comment: I thin

[PR] Blog: Optimizing SQL and DataFrames [datafusion-site]

2025-06-08 Thread via GitHub
alamb opened a new pull request, #74: URL: https://github.com/apache/datafusion-site/pull/74 @akurmustafa and I wrote a piece on the InfluxData blog about optimizers in DataFusion that I have referred to several times when describing DataFusion Since all of the content is about Optimi

Re: [PR] WIP Blog post for Datafusion 47.0.0 [datafusion-site]

2025-06-08 Thread via GitHub
alamb closed pull request #70: WIP Blog post for Datafusion 47.0.0 URL: https://github.com/apache/datafusion-site/pull/70

[PR] Metadata handling announcement [datafusion-site]

2025-06-08 Thread via GitHub
timsaucer opened a new pull request, #73: URL: https://github.com/apache/datafusion-site/pull/73 (no comment)

Re: [PR] bug: remove busy-wait while sort is ongoing [datafusion]

2025-06-08 Thread via GitHub
pepijnve commented on code in PR #16322: URL: https://github.com/apache/datafusion/pull/16322#discussion_r2134795452 ## datafusion/physical-plan/src/sorts/merge.rs: ## @@ -216,36 +212,49 @@ impl SortPreservingMergeStream { // Once all partitions have set their correspon

Re: [PR] Add support for glob string in datafusion-cli query [datafusion]

2025-06-08 Thread via GitHub
comphead commented on code in PR #16332: URL: https://github.com/apache/datafusion/pull/16332#discussion_r2134795185 ## datafusion-cli/tests/sql/integration/glob_test.sql: ## @@ -0,0 +1,15 @@ +-- Test glob function with files available in CI +-- Test 1: Single CSV file - verify

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-08 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2954218719 > As time permits, we can explore alternate, more universal strategies for cancellation > 100% agree with not merging this until we are in agreement I can't help but feel t

Re: [PR] Add support for glob string in datafusion-cli query [datafusion]

2025-06-08 Thread via GitHub
comphead commented on code in PR #16332: URL: https://github.com/apache/datafusion/pull/16332#discussion_r2134794430 ## datafusion-cli/src/functions.rs: ## @@ -460,3 +473,94 @@ impl TableFunctionImpl for ParquetMetadataFunc { Ok(Arc::new(parquet_metadata)) } } + +

Re: [PR] Add support for glob string in datafusion-cli query [datafusion]

2025-06-08 Thread via GitHub
comphead commented on code in PR #16332: URL: https://github.com/apache/datafusion/pull/16332#discussion_r2134794387 ## datafusion-cli/src/functions.rs: ## @@ -460,3 +473,94 @@ impl TableFunctionImpl for ParquetMetadataFunc { Ok(Arc::new(parquet_metadata)) } } + +

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-06-08 Thread via GitHub
alamb commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2954201095 Sorry for not having a chance to test this work earlier @clflushopt I really look forward to checking it out and will try to do so later this week.

Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]

2025-06-08 Thread via GitHub
a-agmon commented on issue #16303: URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2954199489 I have added a draft for this PR. Would be happy for your comments.

[PR] Add support for glob string in datafusion-cli [datafusion]

2025-06-08 Thread via GitHub
a-agmon opened a new pull request, #16332: URL: https://github.com/apache/datafusion/pull/16332 Partly closes #16303 Introduces glob() table function that allows running queries on multiple files, like: ``` SELECT id FROM glob('s3://tests/data/file-a*.csv'); SELECT id FROM

Re: [PR] bug: remove busy-wait while sort is ongoing [datafusion]

2025-06-08 Thread via GitHub
Dandandan commented on PR #16322: URL: https://github.com/apache/datafusion/pull/16322#issuecomment-2954193366 > @Dandandan project newbie question, my daily practice at work is to handle code review comments using amend/force-push. Did so out of habit before thinking to ask. Is that ok

Re: [PR] bug: remove busy-wait while sort is ongoing [datafusion]

2025-06-08 Thread via GitHub
pepijnve commented on PR #16322: URL: https://github.com/apache/datafusion/pull/16322#issuecomment-2954189088 @Dandandan project newbie question, my daily practice at work is to handle code review comments using amend/force-push. Did so out of habit before thinking to ask. Is that ok in

Re: [PR] bug: remove busy-wait while sort is ongoing [datafusion]

2025-06-08 Thread via GitHub
pepijnve commented on code in PR #16322: URL: https://github.com/apache/datafusion/pull/16322#discussion_r2134778315 ## datafusion/physical-plan/src/sorts/merge.rs: ## @@ -216,36 +212,49 @@ impl SortPreservingMergeStream { // Once all partitions have set their correspon

Re: [PR] bug: remove busy-wait while sort is ongoing [datafusion]

2025-06-08 Thread via GitHub
Dandandan commented on code in PR #16322: URL: https://github.com/apache/datafusion/pull/16322#discussion_r2134777847 ## datafusion/physical-plan/src/sorts/merge.rs: ## @@ -216,36 +212,49 @@ impl SortPreservingMergeStream { // Once all partitions have set their correspo

Re: [PR] bug: remove busy-wait while sort is ongoing [datafusion]

2025-06-08 Thread via GitHub
pepijnve commented on code in PR #16322: URL: https://github.com/apache/datafusion/pull/16322#discussion_r2134773918 ## datafusion/physical-plan/src/sorts/merge.rs: ## @@ -216,36 +212,49 @@ impl SortPreservingMergeStream { // Once all partitions have set their correspon

Re: [PR] bug: remove busy-wait while sort is ongoing [datafusion]

2025-06-08 Thread via GitHub
pepijnve commented on code in PR #16322: URL: https://github.com/apache/datafusion/pull/16322#discussion_r2134773638 ## datafusion/physical-plan/src/sorts/merge.rs: ## @@ -216,36 +212,49 @@ impl SortPreservingMergeStream { // Once all partitions have set their correspon

Re: [PR] Add late pruning of file based on file level statistics [datafusion]

2025-06-08 Thread via GitHub
adriangb commented on PR #16014: URL: https://github.com/apache/datafusion/pull/16014#issuecomment-2954182220 It might just be that cheap 😃, I do expect it to be very cheap.

[I] panic `StructBuilder and field_builder with index 0 (Utf8) are of unequal lengths: (1 != 0)` when running with delta lake extension [datafusion-comet]

2025-06-08 Thread via GitHub
rluvaton opened a new issue, #1867: URL: https://github.com/apache/datafusion-comet/issues/1867 ### Describe the bug When I try to write to a delta table with the extension enabled, I get a panic. (I also sometimes had it when reading) ``` 25/06/08 20:01:28 ERROR Executor: Exce

Re: [PR] bug: remove busy-wait while sort is ongoing [datafusion]

2025-06-08 Thread via GitHub
Dandandan commented on PR #16322: URL: https://github.com/apache/datafusion/pull/16322#issuecomment-2954177103 This seems really nice 🚀 On my machine I get roughly 10% improvement on queries with SPM - which I think makes sense on a 10 core machine (with fewer cores it might busy-poll).

Re: [PR] bug: remove busy-wait while sort is ongoing [datafusion]

2025-06-08 Thread via GitHub
Dandandan commented on code in PR #16322: URL: https://github.com/apache/datafusion/pull/16322#discussion_r2134772014 ## datafusion/physical-plan/src/sorts/merge.rs: ## @@ -216,36 +212,49 @@ impl SortPreservingMergeStream { // Once all partitions have set their correspo

Re: [PR] Add late pruning of file based on file level statistics [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16014: URL: https://github.com/apache/datafusion/pull/16014#issuecomment-2954178899 > Do any of those benchmarks actually collect statistics or use partition pruning? If not I do expect this to essentially be a no-op. `clickbench_partitioned` have row group

Re: [PR] bug: remove busy-wait while sort is ongoing [datafusion]

2025-06-08 Thread via GitHub
Dandandan commented on code in PR #16322: URL: https://github.com/apache/datafusion/pull/16322#discussion_r2134771715 ## datafusion/physical-plan/src/sorts/merge.rs: ## @@ -216,36 +212,49 @@ impl SortPreservingMergeStream { // Once all partitions have set their correspo

Re: [PR] Draft: Use upstream arrow `coalesce` kernel in DataFusion [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16249: URL: https://github.com/apache/datafusion/pull/16249#issuecomment-2954153997 > This is similar to my results, some larger gains on TPC-H, smaller gains (spread out over different queries) for clickbench. SWEET! Just wait until we get rid of the

Re: [PR] Example for using a separate threadpool for CPU bound work (try 2) [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #14286: URL: https://github.com/apache/datafusion/pull/14286#issuecomment-2954152637 Here is an updated version of the example, using the new ObjectStore API: - https://github.com/apache/datafusion/pull/16331 Let's continue the conversation there

Re: [PR] Example for using a separate threadpool for CPU bound work (try 2) [datafusion]

2025-06-08 Thread via GitHub
alamb closed pull request #14286: Example for using a separate threadpool for CPU bound work (try 2) URL: https://github.com/apache/datafusion/pull/14286

[PR] Example for using a separate threadpool for CPU bound work (try 3) [datafusion]

2025-06-08 Thread via GitHub
alamb opened a new pull request, #16331: URL: https://github.com/apache/datafusion/pull/16331 Note: This PR contains an example and supporting code. It has no changes to the core. ## Which issue does this PR close? - Closes https://github.com/apache/datafusion/issues/12393

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-08 Thread via GitHub
ozankabak commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2954144239 @alamb FYI I plan to merge this soon. It is OK if you don't have the bandwidth to take a look, it is the first step towards the design we discussed before.

Re: [PR] Add late pruning of file based on file level statistics [datafusion]

2025-06-08 Thread via GitHub
adriangb commented on PR #16014: URL: https://github.com/apache/datafusion/pull/16014#issuecomment-2954139362 Do any of those benchmarks actually collect statistics or use partition pruning? If not I do expect this to essentially be a no-op.

Re: [PR] feat: upgrade df48 dependency [datafusion-python]

2025-06-08 Thread via GitHub
timsaucer commented on PR #1143: URL: https://github.com/apache/datafusion-python/pull/1143#issuecomment-2954138196 TODO: expose `lit_with_metadata`

[PR] feat: pass ignore_nulls flag to first and last [datafusion-comet]

2025-06-08 Thread via GitHub
rluvaton opened a new pull request, #1866: URL: https://github.com/apache/datafusion-comet/pull/1866 ## Which issue does this PR close? N/A ## Rationale for this change Actually use `ignore_nulls` that was used in: - #1626 ## What changes are included in this

Re: [PR] fix: Remove COMET_SHUFFLE_FALLBACK_TO_COLUMNAR hack [datafusion-comet]

2025-06-08 Thread via GitHub
codecov-commenter commented on PR #1865: URL: https://github.com/apache/datafusion-comet/pull/1865#issuecomment-2954133056 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1865?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] Add late pruning of file based on file level statistics [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16014: URL: https://github.com/apache/datafusion/pull/16014#issuecomment-2954128617 🤖: Benchmark completed Details ``` Comparing HEAD and late-pruning-files Benchmark clickbench_extended.json

Re: [PR] Draft: Use upstream arrow `coalesce` kernel in DataFusion [datafusion]

2025-06-08 Thread via GitHub
Dandandan commented on PR #16249: URL: https://github.com/apache/datafusion/pull/16249#issuecomment-2954130905 > 🤖: Benchmark completed > > Details This is similar to my results, some larger gains on TPC-H, smaller gains (spread out over different queries) for clickbench.

Re: [I] Incorrect results with JVM shuffle: Spark SQL `- SPARK-32038: NormalizeFloatingNumbers should work on distinct aggregate` [datafusion-comet]

2025-06-08 Thread via GitHub
andygrove commented on issue #1824: URL: https://github.com/apache/datafusion-comet/issues/1824#issuecomment-2954128840 Reopening since tests are still disabled - will be fixed when https://github.com/apache/datafusion-comet/pull/1865 merges

Re: [I] Incorrect results with JVM shuffle: Spark SQL `- SPARK-32038: NormalizeFloatingNumbers should work on distinct aggregate` [datafusion-comet]

2025-06-08 Thread via GitHub
andygrove closed issue #1824: Incorrect results with JVM shuffle: Spark SQL `- SPARK-32038: NormalizeFloatingNumbers should work on distinct aggregate` URL: https://github.com/apache/datafusion-comet/issues/1824

Re: [I] Incorrect results with JVM shuffle: Spark SQL `- SPARK-32038: NormalizeFloatingNumbers should work on distinct aggregate` [datafusion-comet]

2025-06-08 Thread via GitHub
andygrove commented on issue #1824: URL: https://github.com/apache/datafusion-comet/issues/1824#issuecomment-2954122677 Resolved by upgrading to DF 48 rc3 - tests now pass in https://github.com/apache/datafusion-comet/pull/1865

Re: [PR] feat: support RangePartitioning with native shuffle [datafusion-comet]

2025-06-08 Thread via GitHub
andygrove commented on PR #1862: URL: https://github.com/apache/datafusion-comet/pull/1862#issuecomment-2954116663 It would be interesting to use our new tracing feature to compare on-heap vs off-heap memory usage with range partitioning supported natively versus falling back to Spark.

Re: [PR] fix: Remove `COMET_SHUFFLE_FALLBACK_TO_COLUMNAR` hack [datafusion-comet]

2025-06-08 Thread via GitHub
andygrove commented on PR #1736: URL: https://github.com/apache/datafusion-comet/pull/1736#issuecomment-2954113662 replaced with https://github.com/apache/datafusion-comet/pull/1865

Re: [PR] fix: Remove `COMET_SHUFFLE_FALLBACK_TO_COLUMNAR` hack [datafusion-comet]

2025-06-08 Thread via GitHub
andygrove closed pull request #1736: fix: Remove `COMET_SHUFFLE_FALLBACK_TO_COLUMNAR` hack URL: https://github.com/apache/datafusion-comet/pull/1736

[PR] fix: Remove COMET_SHUFFLE_FALLBACK_TO_COLUMNAR hack [datafusion-comet]

2025-06-08 Thread via GitHub
andygrove opened a new pull request, #1865: URL: https://github.com/apache/datafusion-comet/pull/1865 ## Which issue does this PR close? Part of https://github.com/apache/datafusion-comet/issues/1254 Closes https://github.com/apache/datafusion-comet/issues/1252 ##

Re: [PR] Draft: Use upstream arrow `coalesce` kernel in DataFusion [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16249: URL: https://github.com/apache/datafusion/pull/16249#issuecomment-2954109280 🤖: Benchmark completed Details ``` Comparing HEAD and alamb_test_upstream_coalesce Benchmark clickbench_extended.json -

Re: [PR] Add late pruning of file based on file level statistics [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16014: URL: https://github.com/apache/datafusion/pull/16014#issuecomment-2954109301 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubun

Re: [PR] feat: mapping sql Char/Text/String default to Utf8View [datafusion]

2025-06-08 Thread via GitHub
zhuqi-lucas commented on code in PR #16290: URL: https://github.com/apache/datafusion/pull/16290#discussion_r2134715891 ## datafusion/sqllogictest/test_files/arrow_files.slt: ## @@ -61,22 +61,12 @@ LOCATION '../core/tests/data/partitioned_table_arrow/' PARTITIONED BY (part);

Re: [I] Use sha2 implementation from datafusion-spark crate [datafusion-comet]

2025-06-08 Thread via GitHub
andygrove commented on issue #1820: URL: https://github.com/apache/datafusion-comet/issues/1820#issuecomment-2954102662 Thanks for working on this @rishvin. You are correct that Spark doesn't support UInt32 because it is JVM-based. JVM only has signed ints. The upper-case output seem

Re: [PR] chore: Upgrade to DataFusion 48.0.0-rc3 [datafusion-comet]

2025-06-08 Thread via GitHub
andygrove merged PR #1863: URL: https://github.com/apache/datafusion-comet/pull/1863

Re: [PR] Add late pruning of file based on file level statistics [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16014: URL: https://github.com/apache/datafusion/pull/16014#issuecomment-2954095839 > I've rebased this and it's looking nice now. I think the main open question is the concern about performance / overhead: I'll fire up some benchmarks and see if we can see anyt

Re: [I] Request to update crates.io ownership [datafusion]

2025-06-08 Thread via GitHub
alamb commented on issue #16323: URL: https://github.com/apache/datafusion/issues/16323#issuecomment-2954090714 Also I have updated the crates to have @xudong963 as an owner

Re: [PR] Draft: Use upstream arrow `coalesce` kernel in DataFusion [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16249: URL: https://github.com/apache/datafusion/pull/16249#issuecomment-2954090213 > @alamb benchmark runs ok now Awesome -- thanks -- I restarted it now

Re: [PR] Draft: Use upstream arrow `coalesce` kernel in DataFusion [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16249: URL: https://github.com/apache/datafusion/pull/16249#issuecomment-2954090088 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubun

Re: [PR] Chore: implement predicate exprs as ScalarUDFImpl [datafusion-comet]

2025-06-08 Thread via GitHub
codecov-commenter commented on PR #1864: URL: https://github.com/apache/datafusion-comet/pull/1864#issuecomment-2954088958 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1864?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]

2025-06-08 Thread via GitHub
adriangb commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2954080922 > +1.98x faster 🚀

Re: [PR] [MAJOR] Equivalence System Overhaul [datafusion]

2025-06-08 Thread via GitHub
Omega359 commented on PR #16217: URL: https://github.com/apache/datafusion/pull/16217#issuecomment-2954068858 fyi I did run this through my test suite and it seemed to work correctly :)

[I] Optimize performance of `ByteViewGroupValueBuilder` on batches with inlined views [datafusion]

2025-06-08 Thread via GitHub
Dandandan opened a new issue, #16330: URL: https://github.com/apache/datafusion/issues/16330 ### Is your feature request related to a problem or challenge? Currently a large part of view performance `do_append_val_inner` `do_equal_to_inner` and others is spent

Re: [PR] Always add parentheses when formatting `BinaryExpr` with `SchemaDisplay` [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16209: URL: https://github.com/apache/datafusion/pull/16209#issuecomment-2953942139 > @alamb: That's a fair point, I actually brought it up in the issue ([#16054 (comment)](https://github.com/apache/datafusion/issues/16054#issuecomment-2924301139)). I'm happy to adju

Re: [PR] Fix `array_position` on empty list [datafusion]

2025-06-08 Thread via GitHub
alamb merged PR #16292: URL: https://github.com/apache/datafusion/pull/16292

Re: [PR] Fix `array_position` on empty list [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16292: URL: https://github.com/apache/datafusion/pull/16292#issuecomment-2953940424 Thanks again @Blizzara

Re: [PR] Minor: Add upgrade guide for `Expr::WindowFunction` [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16313: URL: https://github.com/apache/datafusion/pull/16313#issuecomment-2953936723 Thanks again for the review @andygrove

Re: [PR] Minor: Add upgrade guide for `Expr::WindowFunction` [datafusion]

2025-06-08 Thread via GitHub
alamb merged PR #16313: URL: https://github.com/apache/datafusion/pull/16313

Re: [PR] Encapsulate metadata for literals on to a `FieldMetadata` structure [datafusion]

2025-06-08 Thread via GitHub
alamb commented on code in PR #16317: URL: https://github.com/apache/datafusion/pull/16317#discussion_r2134620674 ## datafusion/expr/src/expr.rs: ## @@ -413,6 +413,162 @@ impl<'a> TreeNodeContainer<'a, Self> for Expr { } } +/// Literal metadata +/// +/// Stores metadata

Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]

2025-06-08 Thread via GitHub
alamb commented on issue #16303: URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2953904462 > I can see that ListingTableUrl::parse supports glob strings, so does it make sense to simply implement this as a listing table? Yes this is what I would expect

Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]

2025-06-08 Thread via GitHub
alamb commented on issue #16303: URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2953903299 Thanks @a-agmon -- maybe this example would help: https://docs.rs/datafusion/latest/datafusion/catalog/trait.AsyncSchemaProvider.html I agree the trick will be figuring out

Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]

2025-06-08 Thread via GitHub
a-agmon commented on issue #16303: URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2953899593 I gave it a shot but it ended up being somewhat messy. That's mostly due to the fact that on the one hand `TableFunctionImpl::call()` is synchronous, yet, on the other hand, it

Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]

2025-06-08 Thread via GitHub
zhuqi-lucas commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2953664649 Test experiment for datafusion: https://github.com/apache/arrow-rs/pull/7624 https://github.com/apache/datafusion/pull/16329 ```rust python3 ./compare

Re: [PR] feat: mapping sql Char/Text/String default to Utf8View [datafusion]

2025-06-08 Thread via GitHub
zhuqi-lucas commented on code in PR #16290: URL: https://github.com/apache/datafusion/pull/16290#discussion_r2134525536 ## datafusion/expr-common/src/type_coercion/binary.rs: ## @@ -1202,7 +1202,7 @@ pub fn string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option Op
