Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]

2025-06-08 Thread via GitHub
a-agmon commented on issue #16303: URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2953899593 I gave it a shot but it ended up being somewhat messy. That's mostly due to the fact that on the one hand `TableFunctionImpl::call()` is synchronous, yet, on the other hand, it

Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]

2025-06-08 Thread via GitHub
alamb commented on issue #16303: URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2953904462 > I can see that ListingTableUrl::parse supports glob strings, so does it make sense to simply implement this as a listing table? Yes this is what I would expect --

Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]

2025-06-08 Thread via GitHub
alamb commented on issue #16303: URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2953903299 Thanks @a-agmon -- maybe this example would help: https://docs.rs/datafusion/latest/datafusion/catalog/trait.AsyncSchemaProvider.html I agree the trick will be figuring out

Re: [PR] Chore: implement predicate exprs as ScalarUDFImpl [datafusion-comet]

2025-06-08 Thread via GitHub
codecov-commenter commented on PR #1864: URL: https://github.com/apache/datafusion-comet/pull/1864#issuecomment-2954088958 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1864?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] Draft: Use upstream arrow `coalesce` kernel in DataFusion [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16249: URL: https://github.com/apache/datafusion/pull/16249#issuecomment-2954090213 > @alamb benchmark runs ok now Awesome -- thanks -- I restarted it now -- This is an automated message from the Apache Git Service. To respond to the message, please log on to

Re: [PR] Draft: Use upstream arrow `coalesce` kernel in DataFusion [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16249: URL: https://github.com/apache/datafusion/pull/16249#issuecomment-2954090088 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubun

Re: [I] Request to update crates.io ownership [datafusion]

2025-06-08 Thread via GitHub
alamb commented on issue #16323: URL: https://github.com/apache/datafusion/issues/16323#issuecomment-2954090714 Also I have updated the crates to have @xudong963 as an owner

Re: [PR] Add late pruning of file based on file level statistics [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16014: URL: https://github.com/apache/datafusion/pull/16014#issuecomment-2954095839 > I've rebased this and it's looking nice now. I think the main open question is the concern about performance / overhead: I'll fire up some benchmarks and see if we can see anyt

Re: [PR] chore: Upgrade to DataFusion 48.0.0-rc3 [datafusion-comet]

2025-06-08 Thread via GitHub
andygrove merged PR #1863: URL: https://github.com/apache/datafusion-comet/pull/1863

Re: [PR] bug: remove busy-wait while sort is ongoing [datafusion]

2025-06-08 Thread via GitHub
Dandandan commented on code in PR #16322: URL: https://github.com/apache/datafusion/pull/16322#discussion_r2134771715 ## datafusion/physical-plan/src/sorts/merge.rs: ## @@ -216,36 +212,49 @@ impl SortPreservingMergeStream { // Once all partitions have set their correspo

Re: [PR] Add late pruning of file based on file level statistics [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16014: URL: https://github.com/apache/datafusion/pull/16014#issuecomment-2954178899 > Do any of those benchmarks actually collect statistics or use partition pruning? If not I do expect this to essentially be a no-op. `clickbench_partitioned` have row group

Re: [PR] bug: remove busy-wait while sort is ongoing [datafusion]

2025-06-08 Thread via GitHub
Dandandan commented on PR #16322: URL: https://github.com/apache/datafusion/pull/16322#issuecomment-2954177103 This seems really nice 🚀 On my machine I get roughly 10% improvement on queries with SPM - which I think makes sense on a 10 core machine (with fewer cores it might busy-poll).

Re: [PR] bug: remove busy-wait while sort is ongoing [datafusion]

2025-06-08 Thread via GitHub
Dandandan commented on code in PR #16322: URL: https://github.com/apache/datafusion/pull/16322#discussion_r2134772014 ## datafusion/physical-plan/src/sorts/merge.rs: ## @@ -216,36 +212,49 @@ impl SortPreservingMergeStream { // Once all partitions have set their correspo

[I] panic `StructBuilder and field_builder with index 0 (Utf8) are of unequal lengths: (1 != 0)` when running with delta lake extension [datafusion-comet]

2025-06-08 Thread via GitHub
rluvaton opened a new issue, #1867: URL: https://github.com/apache/datafusion-comet/issues/1867 ### Describe the bug When I try to write to a delta table when the extension enabled I get panic. (I also sometimes had it when reading) ``` 25/06/08 20:01:28 ERROR Executor: Exce

Re: [PR] bug: remove busy-wait while sort is ongoing [datafusion]

2025-06-08 Thread via GitHub
pepijnve commented on code in PR #16322: URL: https://github.com/apache/datafusion/pull/16322#discussion_r2134773638 ## datafusion/physical-plan/src/sorts/merge.rs: ## @@ -216,36 +212,49 @@ impl SortPreservingMergeStream { // Once all partitions have set their correspon

Re: [PR] bug: remove busy-wait while sort is ongoing [datafusion]

2025-06-08 Thread via GitHub
pepijnve commented on code in PR #16322: URL: https://github.com/apache/datafusion/pull/16322#discussion_r2134773918 ## datafusion/physical-plan/src/sorts/merge.rs: ## @@ -216,36 +212,49 @@ impl SortPreservingMergeStream { // Once all partitions have set their correspon

Re: [PR] Add late pruning of file based on file level statistics [datafusion]

2025-06-08 Thread via GitHub
adriangb commented on PR #16014: URL: https://github.com/apache/datafusion/pull/16014#issuecomment-2954182220 It might just be that cheap 😃, I do expect it to be very cheap.

Re: [PR] bug: remove busy-wait while sort is ongoing [datafusion]

2025-06-08 Thread via GitHub
Dandandan commented on code in PR #16322: URL: https://github.com/apache/datafusion/pull/16322#discussion_r2134777847 ## datafusion/physical-plan/src/sorts/merge.rs: ## @@ -216,36 +212,49 @@ impl SortPreservingMergeStream { // Once all partitions have set their correspo

Re: [PR] Encapsulate metadata for literals on to a `FieldMetadata` structure [datafusion]

2025-06-08 Thread via GitHub
alamb commented on code in PR #16317: URL: https://github.com/apache/datafusion/pull/16317#discussion_r2134620674 ## datafusion/expr/src/expr.rs: ## @@ -413,6 +413,162 @@ impl<'a> TreeNodeContainer<'a, Self> for Expr { } } +/// Literal metadata +/// +/// Stores metadata

Re: [PR] Minor: Add upgrade guide for `Expr::WindowFunction` [datafusion]

2025-06-08 Thread via GitHub
alamb merged PR #16313: URL: https://github.com/apache/datafusion/pull/16313

Re: [PR] Minor: Add upgrade guide for `Expr::WindowFunction` [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16313: URL: https://github.com/apache/datafusion/pull/16313#issuecomment-2953936723 Thanks again for the review @andygrove

Re: [PR] Fix `array_position` on empty list [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16292: URL: https://github.com/apache/datafusion/pull/16292#issuecomment-2953940424 Thanks again @Blizzara

Re: [PR] Fix `array_position` on empty list [datafusion]

2025-06-08 Thread via GitHub
alamb merged PR #16292: URL: https://github.com/apache/datafusion/pull/16292

Re: [PR] Always add parentheses when formatting `BinaryExpr` with `SchemaDisplay` [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16209: URL: https://github.com/apache/datafusion/pull/16209#issuecomment-2953942139 > @alamb: That's a fair point, I actually brought it up in the issue ([#16054 (comment)](https://github.com/apache/datafusion/issues/16054#issuecomment-2924301139)). I'm happy to adju

Re: [PR] Add support for mysql's drop index (`ALTER TABLE table_a DROP INDEX idx_a`) [datafusion-sqlparser-rs]

2025-06-08 Thread via GitHub
vimko commented on code in PR #1865: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1865#discussion_r2134494593 ## tests/sqlparser_common.rs: ## @@ -9132,7 +9132,9 @@ fn test_create_index_with_with_clause() { #[test] fn parse_drop_index() { let sql = "DROP I

Re: [PR] Add support for mysql's drop index (`ALTER TABLE table_a DROP INDEX idx_a`) [datafusion-sqlparser-rs]

2025-06-08 Thread via GitHub
vimko commented on code in PR #1865: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1865#discussion_r2134498562 ## src/ast/ddl.rs: ## @@ -187,6 +187,14 @@ pub enum AlterTableOperation { DropForeignKey { name: Ident, }, +/// `DROP INDEX ` +

Re: [PR] Test experiment load page index with columns [datafusion]

2025-06-08 Thread via GitHub
zhuqi-lucas commented on code in PR #16329: URL: https://github.com/apache/datafusion/pull/16329#discussion_r2134495342 ## Cargo.toml: ## @@ -88,19 +88,19 @@ ahash = { version = "0.8", default-features = false, features = [ "runtime-rng", ] } apache-avro = { version = "0

[I] Optimize performance of `ByteViewGroupValueBuilder` on batches with inlined views [datafusion]

2025-06-08 Thread via GitHub
Dandandan opened a new issue, #16330: URL: https://github.com/apache/datafusion/issues/16330 ### Is your feature request related to a problem or challenge? Currently a large part of view performance `do_append_val_inner` `do_equal_to_inner` and others is spent

Re: [PR] Draft: Use upstream arrow `coalesce` kernel in DataFusion [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16249: URL: https://github.com/apache/datafusion/pull/16249#issuecomment-2954109280 🤖: Benchmark completed Details ``` Comparing HEAD and alamb_test_upstream_coalesce Benchmark clickbench_extended.json -

Re: [PR] Add late pruning of file based on file level statistics [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16014: URL: https://github.com/apache/datafusion/pull/16014#issuecomment-2954109301 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubun

Re: [I] Use sha2 implementation from datafusion-spark crate [datafusion-comet]

2025-06-08 Thread via GitHub
andygrove commented on issue #1820: URL: https://github.com/apache/datafusion-comet/issues/1820#issuecomment-2954102662 Thanks for working on this @rishvin. You are correct that Spark doesn't support UInt32 because it is JVM-based. JVM only has signed ints. The upper-case output seem

Re: [PR] feat: mapping sql Char/Text/String default to Utf8View [datafusion]

2025-06-08 Thread via GitHub
zhuqi-lucas commented on code in PR #16290: URL: https://github.com/apache/datafusion/pull/16290#discussion_r2134715891 ## datafusion/sqllogictest/test_files/arrow_files.slt: ## @@ -61,22 +61,12 @@ LOCATION '../core/tests/data/partitioned_table_arrow/' PARTITIONED BY (part);

[PR] fix: Remove COMET_SHUFFLE_FALLBACK_TO_COLUMNAR hack [datafusion-comet]

2025-06-08 Thread via GitHub
andygrove opened a new pull request, #1865: URL: https://github.com/apache/datafusion-comet/pull/1865 ## Which issue does this PR close? Part of https://github.com/apache/datafusion-comet/issues/1254 Closes https://github.com/apache/datafusion-comet/issues/1252 ##

Re: [PR] fix: Remove `COMET_SHUFFLE_FALLBACK_TO_COLUMNAR` hack [datafusion-comet]

2025-06-08 Thread via GitHub
andygrove commented on PR #1736: URL: https://github.com/apache/datafusion-comet/pull/1736#issuecomment-2954113662 replaced with https://github.com/apache/datafusion-comet/pull/1865

Re: [PR] fix: Remove `COMET_SHUFFLE_FALLBACK_TO_COLUMNAR` hack [datafusion-comet]

2025-06-08 Thread via GitHub
andygrove closed pull request #1736: fix: Remove `COMET_SHUFFLE_FALLBACK_TO_COLUMNAR` hack URL: https://github.com/apache/datafusion-comet/pull/1736

Re: [PR] feat: support RangePartitioning with native shuffle [datafusion-comet]

2025-06-08 Thread via GitHub
andygrove commented on PR #1862: URL: https://github.com/apache/datafusion-comet/pull/1862#issuecomment-2954116663 It would be interesting to use our new tracing feature to compare on-heap vs off-heap memory usage with range partitioning supported natively versus falling back to Spark.

Re: [I] Incorrect results with JVM shuffle: Spark SQL `- SPARK-32038: NormalizeFloatingNumbers should work on distinct aggregate` [datafusion-comet]

2025-06-08 Thread via GitHub
andygrove commented on issue #1824: URL: https://github.com/apache/datafusion-comet/issues/1824#issuecomment-2954122677 Resolved by upgrading to DF 48 rc3 - tests now pass in https://github.com/apache/datafusion-comet/pull/1865

Re: [I] Incorrect results with JVM shuffle: Spark SQL `- SPARK-32038: NormalizeFloatingNumbers should work on distinct aggregate` [datafusion-comet]

2025-06-08 Thread via GitHub
andygrove closed issue #1824: Incorrect results with JVM shuffle: Spark SQL `- SPARK-32038: NormalizeFloatingNumbers should work on distinct aggregate` URL: https://github.com/apache/datafusion-comet/issues/1824

Re: [PR] Add support for glob string in datafusion-cli query [datafusion]

2025-06-08 Thread via GitHub
a-agmon commented on code in PR #16332: URL: https://github.com/apache/datafusion/pull/16332#discussion_r2134812154 ## datafusion-cli/tests/sql/integration/glob_test.sql: ## @@ -0,0 +1,15 @@ +-- Test glob function with files available in CI +-- Test 1: Single CSV file - verify b

[PR] Metadata handling announcement [datafusion-site]

2025-06-08 Thread via GitHub
timsaucer opened a new pull request, #73: URL: https://github.com/apache/datafusion-site/pull/73 (no comment)

Re: [PR] WIP Blog post for Datafusion 47.0.0 [datafusion-site]

2025-06-08 Thread via GitHub
alamb closed pull request #70: WIP Blog post for Datafusion 47.0.0 URL: https://github.com/apache/datafusion-site/pull/70

[PR] Blog: Optimizing SQL and DataFrames [datafusion-site]

2025-06-08 Thread via GitHub
alamb opened a new pull request, #74: URL: https://github.com/apache/datafusion-site/pull/74 @akurmustafa and I wrote a piece on the InfluxData blog about optimizers in DataFusion that I have referred to several times when describing DataFusion Since all of the content is about Optimi

Re: [PR] Metadata handling announcement [datafusion-site]

2025-06-08 Thread via GitHub
alamb commented on code in PR #73: URL: https://github.com/apache/datafusion-site/pull/73#discussion_r2134828346 ## content/blog/2025-06-09-metadata-handling.md: ## @@ -0,0 +1,98 @@ +--- +layout: post +title: Metadata handling in user defined functions Review Comment: I thin

Re: [I] Add `CoalesceBatchesExec` to `NestedLoopJoinExec` [datafusion]

2025-06-08 Thread via GitHub
jonathanc-n commented on issue #16328: URL: https://github.com/apache/datafusion/issues/16328#issuecomment-2954283635 This one would help with performance related to #16210; I would be willing to pick this up after we decide whether that PR should be merged.

Re: [PR] bug: remove busy-wait while sort is ongoing [datafusion]

2025-06-08 Thread via GitHub
pepijnve commented on PR #16322: URL: https://github.com/apache/datafusion/pull/16322#issuecomment-2954189088 @Dandandan project newbie question, my daily practice at work is to handle code review comments using amend/force-push. Did so out of habit before thinking to ask. Is that ok in

Re: [PR] bug: remove busy-wait while sort is ongoing [datafusion]

2025-06-08 Thread via GitHub
pepijnve commented on code in PR #16322: URL: https://github.com/apache/datafusion/pull/16322#discussion_r2134778315 ## datafusion/physical-plan/src/sorts/merge.rs: ## @@ -216,36 +212,49 @@ impl SortPreservingMergeStream { // Once all partitions have set their correspon

Re: [PR] bug: remove busy-wait while sort is ongoing [datafusion]

2025-06-08 Thread via GitHub
Dandandan commented on PR #16322: URL: https://github.com/apache/datafusion/pull/16322#issuecomment-2954193366 > @Dandandan project newbie question, my daily practice at work is to handle code review comments using amend/force-push. Did so out of habit before thinking to ask. Is that ok

Re: [I] Support reading multiple parquet files via `datafusion-cli` [datafusion]

2025-06-08 Thread via GitHub
a-agmon commented on issue #16303: URL: https://github.com/apache/datafusion/issues/16303#issuecomment-2954199489 I have added a draft for this PR. Would be happy for your comments.

[PR] Add support for glob string in datafusion-cli [datafusion]

2025-06-08 Thread via GitHub
a-agmon opened a new pull request, #16332: URL: https://github.com/apache/datafusion/pull/16332 Partly closes #16303 Introduces glob() table function that allows running queries on multiple files, like: ``` SELECT id FROM glob('s3://tests/data/file-a*.csv'); SELECT id FROM

Re: [PR] Add support for glob string in datafusion-cli query [datafusion]

2025-06-08 Thread via GitHub
comphead commented on code in PR #16332: URL: https://github.com/apache/datafusion/pull/16332#discussion_r2134794430 ## datafusion-cli/src/functions.rs: ## @@ -460,3 +473,94 @@ impl TableFunctionImpl for ParquetMetadataFunc { Ok(Arc::new(parquet_metadata)) } } + +

Re: [PR] Add support for glob string in datafusion-cli query [datafusion]

2025-06-08 Thread via GitHub
comphead commented on code in PR #16332: URL: https://github.com/apache/datafusion/pull/16332#discussion_r2134794387 ## datafusion-cli/src/functions.rs: ## @@ -460,3 +473,94 @@ impl TableFunctionImpl for ParquetMetadataFunc { Ok(Arc::new(parquet_metadata)) } } + +

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-08 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2954218719 > As time permits, we can explore alternate, more universal strategies for cancellation > 100% agree with not merging this until we are in agreement I can't help but feel t

Re: [PR] Add support for glob string in datafusion-cli query [datafusion]

2025-06-08 Thread via GitHub
comphead commented on code in PR #16332: URL: https://github.com/apache/datafusion/pull/16332#discussion_r2134795185 ## datafusion-cli/tests/sql/integration/glob_test.sql: ## @@ -0,0 +1,15 @@ +-- Test glob function with files available in CI +-- Test 1: Single CSV file - verify

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-06-08 Thread via GitHub
alamb commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2954201095 Sorry for not having a chance to test this work earlier @clflushopt I really look forward to checking it out and will try to do so later this week.

Re: [PR] bug: remove busy-wait while sort is ongoing [datafusion]

2025-06-08 Thread via GitHub
pepijnve commented on code in PR #16322: URL: https://github.com/apache/datafusion/pull/16322#discussion_r2134795452 ## datafusion/physical-plan/src/sorts/merge.rs: ## @@ -216,36 +212,49 @@ impl SortPreservingMergeStream { // Once all partitions have set their correspon

Re: [PR] Add support for glob string in datafusion-cli query [datafusion]

2025-06-08 Thread via GitHub
a-agmon commented on code in PR #16332: URL: https://github.com/apache/datafusion/pull/16332#discussion_r2134804642 ## datafusion-cli/tests/sql/integration/glob_test.sql: ## @@ -0,0 +1,15 @@ +-- Test glob function with files available in CI +-- Test 1: Single CSV file - verify b

Re: [PR] bug: remove busy-wait while sort is ongoing [datafusion]

2025-06-08 Thread via GitHub
pepijnve commented on PR #16322: URL: https://github.com/apache/datafusion/pull/16322#issuecomment-2953693505 @berkaysynnada I hope it's ok that I ping you directly; reaching out because I believe you are the author of the test case in question. I believe this PR surfaced a mistake in the `C

Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]

2025-06-08 Thread via GitHub
zhuqi-lucas commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2953664649 Test experiment for datafusion: https://github.com/apache/arrow-rs/pull/7624 https://github.com/apache/datafusion/pull/16329 ```rust python3 ./compare

[I] Add `CoalesceBatchesExec` to `NestedLoopJoinExec` [datafusion]

2025-06-08 Thread via GitHub
UBarney opened a new issue, #16328: URL: https://github.com/apache/datafusion/issues/16328 ### Is your feature request related to a problem or challenge? The `NestedLoopJoinExec` operator can produce output batches with fewer rows than the configured batch_size. To improve performance

Re: [PR] feat: mapping sql Char/Text/String default to Utf8View [datafusion]

2025-06-08 Thread via GitHub
zhuqi-lucas commented on code in PR #16290: URL: https://github.com/apache/datafusion/pull/16290#discussion_r2134525536 ## datafusion/expr-common/src/type_coercion/binary.rs: ## @@ -1202,7 +1202,7 @@ pub fn string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option Op

Re: [PR] [MAJOR] Equivalence System Overhaul [datafusion]

2025-06-08 Thread via GitHub
Omega359 commented on PR #16217: URL: https://github.com/apache/datafusion/pull/16217#issuecomment-2954068858 fyi I did run this through my test suite and it seemed to work correctly :)

Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]

2025-06-08 Thread via GitHub
adriangb commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2954080922 > +1.98x faster 🚀

Re: [I] Busy-waiting in SortPreservingMergeStream [datafusion]

2025-06-08 Thread via GitHub
pepijnve commented on issue #16321: URL: https://github.com/apache/datafusion/issues/16321#issuecomment-2953712501 Just to have a hyperlink between the issues: as far as I can tell the behavior was introduced by #12302 to resolve #12300.

[PR] Test experiment load page index with columns [datafusion]

2025-06-08 Thread via GitHub
zhuqi-lucas opened a new pull request, #16329: URL: https://github.com/apache/datafusion/pull/16329 ## Which issue does this PR close? - Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes tes

Re: [I] Incorrect results with JVM shuffle: Spark SQL `- SPARK-32038: NormalizeFloatingNumbers should work on distinct aggregate` [datafusion-comet]

2025-06-08 Thread via GitHub
andygrove commented on issue #1824: URL: https://github.com/apache/datafusion-comet/issues/1824#issuecomment-2954128840 Reopening since tests are still disabled - will be fixed when https://github.com/apache/datafusion-comet/pull/1865 merges

Re: [PR] Add late pruning of file based on file level statistics [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16014: URL: https://github.com/apache/datafusion/pull/16014#issuecomment-2954128617 🤖: Benchmark completed Details ``` Comparing HEAD and late-pruning-files Benchmark clickbench_extended.json

Re: [PR] fix: Remove COMET_SHUFFLE_FALLBACK_TO_COLUMNAR hack [datafusion-comet]

2025-06-08 Thread via GitHub
codecov-commenter commented on PR #1865: URL: https://github.com/apache/datafusion-comet/pull/1865#issuecomment-2954133056 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1865?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] Draft: Use upstream arrow `coalesce` kernel in DataFusion [datafusion]

2025-06-08 Thread via GitHub
Dandandan commented on PR #16249: URL: https://github.com/apache/datafusion/pull/16249#issuecomment-2954130905 > 🤖: Benchmark completed > > Details This is similar to my results, some larger gains on TPC-H, smaller gains (spread out over different queries) for clickbench.

[PR] feat: pass ignore_nulls flag to first and last [datafusion-comet]

2025-06-08 Thread via GitHub
rluvaton opened a new pull request, #1866: URL: https://github.com/apache/datafusion-comet/pull/1866 ## Which issue does this PR close? N/A ## Rationale for this change Actually use `ignore_nulls` that was used in: - #1626 ## What changes are included in this

Re: [PR] feat: upgrade df48 dependency [datafusion-python]

2025-06-08 Thread via GitHub
timsaucer commented on PR #1143: URL: https://github.com/apache/datafusion-python/pull/1143#issuecomment-2954138196 TODO: expose `lit_with_metadata`

Re: [PR] Add late pruning of file based on file level statistics [datafusion]

2025-06-08 Thread via GitHub
adriangb commented on PR #16014: URL: https://github.com/apache/datafusion/pull/16014#issuecomment-2954139362 Do any of those benchmarks actually collect statistics or use partition pruning? If not I do expect this to essentially be a no-op.

Re: [PR] Feature/parse float as decimal default true [datafusion]

2025-06-08 Thread via GitHub
github-actions[bot] commented on PR #14752: URL: https://github.com/apache/datafusion/pull/14752#issuecomment-2954457572 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or

[PR] feat: support FixedSizeList for array_has [datafusion]

2025-06-08 Thread via GitHub
chenkovsky opened a new pull request, #16333: URL: https://github.com/apache/datafusion/pull/16333 ## Which issue does this PR close? ## Rationale for this change array_has doesn't support fixed size list currently. ## What changes are included in this PR? add

Re: [I] to_hex cannot take UInt64 [datafusion]

2025-06-08 Thread via GitHub
chenkovsky commented on issue #16327: URL: https://github.com/apache/datafusion/issues/16327#issuecomment-2954547105 take

[PR] Add support `UInt64` data type for `to_hex` [datafusion]

2025-06-08 Thread via GitHub
tlm365 opened a new pull request, #16335: URL: https://github.com/apache/datafusion/pull/16335 ## Which issue does this PR close? - Closes #16327 . ## Rationale for this change ## What changes are included in this PR? ## Are these changes te

Re: [PR] Example for using a separate threadpool for CPU bound work (try 2) [datafusion]

2025-06-08 Thread via GitHub
alamb closed pull request #14286: Example for using a separate threadpool for CPU bound work (try 2) URL: https://github.com/apache/datafusion/pull/14286

[PR] Example for using a separate threadpool for CPU bound work (try 3) [datafusion]

2025-06-08 Thread via GitHub
alamb opened a new pull request, #16331: URL: https://github.com/apache/datafusion/pull/16331 Note: This PR contains an example and supporting code. It has no changes to the core. ## Which issue does this PR close? - Closes https://github.com/apache/datafusion/issues/12393

Re: [PR] Example for using a separate threadpool for CPU bound work (try 2) [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #14286: URL: https://github.com/apache/datafusion/pull/14286#issuecomment-2954152637 Here is an updated version of the example, using the new ObjectStore API: - https://github.com/apache/datafusion/pull/16331 Let's continue the conversation there

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-08 Thread via GitHub
ozankabak commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2954144239 @alamb FYI I plan to merge this soon. It is OK if you don't have the bandwidth to take a look, it is the first step towards the design we discussed before.

Re: [PR] Draft: Use upstream arrow `coalesce` kernel in DataFusion [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16249: URL: https://github.com/apache/datafusion/pull/16249#issuecomment-2954153997 > This is similar to my results, some larger gains on TPC-H, smaller gains (spread out over different queries) for clickbench. SWEET! Just wait until we get rid of the

Re: [PR] upgraded spark 3.5.5 to 3.5.6 [datafusion-comet]

2025-06-08 Thread via GitHub
YanivKunda commented on PR #1861: URL: https://github.com/apache/datafusion-comet/pull/1861#issuecomment-2954292860 Rebased latest changes and updated diff file (missing IgnoreComet)

Re: [PR] Feat: Support Spark 4.0.0 part1 [datafusion-comet]

2025-06-08 Thread via GitHub
YanivKunda commented on code in PR #1830: URL: https://github.com/apache/datafusion-comet/pull/1830#discussion_r2134838674 ## pom.xml: ## @@ -600,7 +600,7 @@ under the License. 2.13.14 Review Comment: Scala was upgraded in the 4.0.0 release version: ```

Re: [I] Iceberg integration - parquet-column version conflicts [datafusion-comet]

2025-06-08 Thread via GitHub
YanivKunda commented on issue #1833: URL: https://github.com/apache/datafusion-comet/issues/1833#issuecomment-2954321305 Does it make sense to decide that Iceberg integration is unsupported for Spark 3.x - and work on it only for Spark 4.0, which uses Parquet 1.15.2 (in the release vers

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-08 Thread via GitHub
alamb commented on code in PR #16196: URL: https://github.com/apache/datafusion/pull/16196#discussion_r2134852899 ## datafusion/physical-optimizer/src/optimizer.rs: ## @@ -137,6 +138,7 @@ impl PhysicalOptimizer { // are not present, the load of executors such as joi

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2954329852 I am happy to wait for a bit more testing on this PR -- we have now about a month before the next release so there is no pressure from there. However, I do like a bias of action

Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-08 Thread via GitHub
jonathanc-n commented on PR #16210: URL: https://github.com/apache/datafusion/pull/16210#issuecomment-2954336669 @comphead Pointers have been resolved, good for another review. Thanks!

Re: [PR] feat: mapping sql Char/Text/String default to Utf8View [datafusion]

2025-06-08 Thread via GitHub
alamb commented on code in PR #16290: URL: https://github.com/apache/datafusion/pull/16290#discussion_r2134860028 ## datafusion/sqllogictest/test_files/arrow_files.slt: ## @@ -61,22 +61,12 @@ LOCATION '../core/tests/data/partitioned_table_arrow/' PARTITIONED BY (part); # sel

Re: [PR] feat: use spawned tasks to reduce call stack depth and avoid busy waiting [datafusion]

2025-06-08 Thread via GitHub
alamb commented on code in PR #16319: URL: https://github.com/apache/datafusion/pull/16319#discussion_r2134861645 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -1126,14 +1127,20 @@ impl ExecutionPlan for SortExec { Ok(Box::pin(RecordBatchStreamAdapter::ne

Re: [I] Reduce page metadata loading to only what is necessary for query execution in ParquetOpen [datafusion]

2025-06-08 Thread via GitHub
alamb commented on issue #16200: URL: https://github.com/apache/datafusion/issues/16200#issuecomment-2954343017 Looks great @zhuqi-lucas -- I will find time to review the arrow changes

Re: [PR] feat: use spawned tasks to reduce call stack depth and avoid busy waiting [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16319: URL: https://github.com/apache/datafusion/pull/16319#issuecomment-2954341770 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubun

Re: [PR] feat: use spawned tasks to reduce call stack depth and avoid busy waiting [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16319: URL: https://github.com/apache/datafusion/pull/16319#issuecomment-2954372374 🤖: Benchmark completed Details ``` Comparing HEAD and issue_16318 Benchmark clickbench_extended.json ┏━

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2954372429 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubun

Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-08 Thread via GitHub
jonathanc-n closed pull request #16210: feat: Support null aware + equijoins for `NestedLoopJoin` URL: https://github.com/apache/datafusion/pull/16210

Re: [PR] Fix inconsistent schema projection in ListingTable when file order varies by tracking schema source [datafusion]

2025-06-08 Thread via GitHub
kosiew commented on code in PR #16305: URL: https://github.com/apache/datafusion/pull/16305#discussion_r2134362818 ## datafusion/core/src/datasource/listing/table.rs: ## @@ -2452,4 +2178,381 @@ mod tests { Ok(()) } + +#[tokio::test] +async fn infer_preser

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2954399910 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubun

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2954399882 🤖: Benchmark completed Details ``` Comparing HEAD and issue_16193 Benchmark clickbench_extended.json ┏━

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-08 Thread via GitHub
alamb commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2954400147 🤖: Benchmark completed Details ``` Comparing HEAD and issue_16193 Benchmark cancellation.json ┏
