Re: [PR] dynamic filter refactor [datafusion]

2025-04-11 Thread via GitHub
adriangb commented on PR #15685: URL: https://github.com/apache/datafusion/pull/15685#issuecomment-2798502900 Hey @jayzhan211 thank you for putting the work into trying to clarify this. At this point I think it would be best to wait for #15566 or a PR that replaces it to be merged so

Re: [I] Document `CREATE EXTERNAL TABLE ... OPTIONS` [datafusion]

2025-04-11 Thread via GitHub
marvelshan commented on issue #10451: URL: https://github.com/apache/datafusion/issues/10451#issuecomment-2798494312 Before proceeding with implementation, I'd like to confirm my approach is correct. I'm planning to create a new file named `options.md` dedicated to documenting the available

Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]

2025-04-11 Thread via GitHub
kosiew commented on issue #15631: URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2798494451 hi @Dandandan I am getting failed tests with ```rust #[test] fn test_all_one() -> Result<()> { // Helper function to run tests and repo

Re: [I] Change mapping of SQL `VARCHAR` from `Utf8` to `Utf8View` [datafusion]

2025-04-11 Thread via GitHub
getChan commented on issue #15096: URL: https://github.com/apache/datafusion/issues/15096#issuecomment-2798458529 Update subtasks list (maybe) - [X] Support approx_distinct for Utf8View is done by https://github.com/apache/datafusion/pull/15200 - [ ] approx_percentile_cont should supp

Re: [I] Unparse of Joins is ignoring projections [datafusion]

2025-04-11 Thread via GitHub
chenkovsky commented on issue #15688: URL: https://github.com/apache/datafusion/issues/15688#issuecomment-2798435085 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To

Re: [PR] dynamic filter refactor [datafusion]

2025-04-11 Thread via GitHub
jayzhan211 commented on PR #15685: URL: https://github.com/apache/datafusion/pull/15685#issuecomment-2798405570 https://github.com/apache/datafusion/pull/15568#discussion_r2038773841 # why the change is equivalent to your in the high level idea. > 1. DynamicFilterPhysicalExpr ge

Re: [I] java.lang.NoSuchMethodError: 'scala.collection.Seq org.apache.spark.sql.execution.PartitionedFileUtil$.splitFiles(org.apache.spark.sql.SparkSession [datafusion-comet]

2025-04-11 Thread via GitHub
Kontinuation commented on issue #1639: URL: https://github.com/apache/datafusion-comet/issues/1639#issuecomment-2798421233 This should have been addressed by https://github.com/apache/datafusion-comet/pull/1565 and https://github.com/apache/datafusion-comet/pull/1573.

Re: [PR] dynamic filter refactor [datafusion]

2025-04-11 Thread via GitHub
jayzhan211 commented on code in PR #15685: URL: https://github.com/apache/datafusion/pull/15685#discussion_r2040512866 ## datafusion/physical-expr/src/expressions/dynamic_filters.rs: ## @@ -159,35 +139,13 @@ impl DynamicFilterPhysicalExpr { ) })?

Re: [PR] dynamic filter refactor [datafusion]

2025-04-11 Thread via GitHub
jayzhan211 commented on code in PR #15685: URL: https://github.com/apache/datafusion/pull/15685#discussion_r2040512371 ## datafusion/physical-expr/src/expressions/dynamic_filters.rs: ## @@ -159,35 +139,13 @@ impl DynamicFilterPhysicalExpr { ) })?

Re: [PR] dynamic filter refactor [datafusion]

2025-04-11 Thread via GitHub
jayzhan211 commented on code in PR #15685: URL: https://github.com/apache/datafusion/pull/15685#discussion_r2040510160 ## datafusion/physical-expr/src/expressions/dynamic_filters.rs: ## @@ -36,16 +36,8 @@ use super::Column; /// A dynamic [`PhysicalExpr`] that can be updated by

Re: [D] Should ExecutionPlan spawn tasks in `execute` function [datafusion]

2025-04-11 Thread via GitHub
GitHub user westonpace added a comment to the discussion: Should ExecutionPlan spawn tasks in `execute` function This may just be a failure in my ability to read the manual. I now see this in the [docs](https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#tymet

Re: [PR] dynamic filter refactor [datafusion]

2025-04-11 Thread via GitHub
jayzhan211 commented on code in PR #15685: URL: https://github.com/apache/datafusion/pull/15685#discussion_r2040505924 ## datafusion/physical-expr/src/expressions/dynamic_filters.rs: ## @@ -335,22 +313,12 @@ mod test { ])); // Each ParquetExec calls `with_new_c

Re: [PR] dynamic filter refactor [datafusion]

2025-04-11 Thread via GitHub
jayzhan211 commented on code in PR #15685: URL: https://github.com/apache/datafusion/pull/15685#discussion_r2040504133 ## datafusion/physical-expr/src/expressions/dynamic_filters.rs: ## @@ -335,22 +313,12 @@ mod test { ])); // Each ParquetExec calls `with_new_c

Re: [PR] fix: Modify Spark SQL core 2 tests for `native_datafusion` reader, change 3.5.5 diff hash length to 11 [datafusion-comet]

2025-04-11 Thread via GitHub
codecov-commenter commented on PR #1641: URL: https://github.com/apache/datafusion-comet/pull/1641#issuecomment-2798374572 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1641?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] dynamic filter refactor [datafusion]

2025-04-11 Thread via GitHub
jayzhan211 commented on code in PR #15685: URL: https://github.com/apache/datafusion/pull/15685#discussion_r2040471645 ## datafusion/physical-expr/src/expressions/dynamic_filters.rs: ## @@ -335,22 +313,12 @@ mod test { ])); // Each ParquetExec calls `with_new_c

Re: [PR] dynamic filter refactor [datafusion]

2025-04-11 Thread via GitHub
adriangb commented on code in PR #15685: URL: https://github.com/apache/datafusion/pull/15685#discussion_r2040469037 ## datafusion/physical-expr/src/expressions/dynamic_filters.rs: ## @@ -36,16 +36,8 @@ use super::Column; /// A dynamic [`PhysicalExpr`] that can be updated by an

Re: [I] Internal error in ExternalSorter when running with memory limit [datafusion]

2025-04-11 Thread via GitHub
DerGut commented on issue #15675: URL: https://github.com/apache/datafusion/issues/15675#issuecomment-2798317659 You are right! With `v46.0.1`, the ExternalSorter estimates `35840` bytes for the first record batch. Running with `sort_spill_reservation_bytes + record batch size == m

Re: [PR] dynamic filter refactor [datafusion]

2025-04-11 Thread via GitHub
adriangb commented on code in PR #15685: URL: https://github.com/apache/datafusion/pull/15685#discussion_r2040467822 ## datafusion/physical-expr/src/expressions/dynamic_filters.rs: ## @@ -159,35 +139,13 @@ impl DynamicFilterPhysicalExpr { ) })?

Re: [PR] dynamic filter refactor [datafusion]

2025-04-11 Thread via GitHub
adriangb commented on code in PR #15685: URL: https://github.com/apache/datafusion/pull/15685#discussion_r2040466799 ## datafusion/physical-expr/src/expressions/dynamic_filters.rs: ## @@ -335,22 +313,12 @@ mod test { ])); // Each ParquetExec calls `with_new_chi

Re: [PR] dynamic filter refactor [datafusion]

2025-04-11 Thread via GitHub
jayzhan211 commented on code in PR #15685: URL: https://github.com/apache/datafusion/pull/15685#discussion_r2040452282 ## datafusion/physical-expr/src/expressions/dynamic_filters.rs: ## @@ -36,16 +36,8 @@ use super::Column; /// A dynamic [`PhysicalExpr`] that can be updated by

Re: [PR] dynamic filter refactor [datafusion]

2025-04-11 Thread via GitHub
jayzhan211 commented on code in PR #15685: URL: https://github.com/apache/datafusion/pull/15685#discussion_r2040451293 ## datafusion/physical-expr/src/expressions/dynamic_filters.rs: ## @@ -159,35 +139,13 @@ impl DynamicFilterPhysicalExpr { ) })?

Re: [PR] dynamic filter refactor [datafusion]

2025-04-11 Thread via GitHub
jayzhan211 commented on code in PR #15685: URL: https://github.com/apache/datafusion/pull/15685#discussion_r2040450375 ## datafusion/physical-expr/src/expressions/dynamic_filters.rs: ## @@ -335,22 +313,12 @@ mod test { ])); // Each ParquetExec calls `with_new_c

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-04-11 Thread via GitHub
clflushopt commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2798250520 @alamb Yes once I address the couple of prioritized issues I have open for `v1.0.0` the next step will be to work on the integration, I agree with having table functions but

Re: [I] Read STRUCT of MAP fields with datafusion reader fails with schema issue [datafusion-comet]

2025-04-11 Thread via GitHub
parthchandra commented on issue #1633: URL: https://github.com/apache/datafusion-comet/issues/1633#issuecomment-2798210998 To make it easier for the next person looking at this, the only difference in the types is that the `value` field in the expected schema is `nullable: false` while in

Re: [PR] Add Table Functions to FFI Crate [datafusion]

2025-04-11 Thread via GitHub
timsaucer commented on PR #15581: URL: https://github.com/apache/datafusion/pull/15581#issuecomment-2798145855 I’ll resolve those clippy warnings next time I’m at my computer

Re: [I] Spark executor fail to start occasionally with SIGILL [datafusion-comet]

2025-04-11 Thread via GitHub
parthchandra commented on issue #1598: URL: https://github.com/apache/datafusion-comet/issues/1598#issuecomment-2798109113 > > Wow that's phenomenal! Are you able to share some (vague if necessary) descriptions of your workload, cluster hardware, storage source, and what sort of tuning (if

Re: [I] Iceberg writes [datafusion-python]

2025-04-11 Thread via GitHub
kevinjqliu commented on issue #1097: URL: https://github.com/apache/datafusion-python/issues/1097#issuecomment-2798091391 This issue will be a good reference. I'll probably also start a tracking issue for iceberg integration

Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]

2025-04-11 Thread via GitHub
timsaucer commented on issue #15072: URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2798046180 Running CI on it now: https://github.com/apache/datafusion-python/pull/1104

Re: [PR] Add Table Functions to FFI Crate [datafusion]

2025-04-11 Thread via GitHub
timsaucer commented on code in PR #15581: URL: https://github.com/apache/datafusion/pull/15581#discussion_r2040329018 ## datafusion/ffi/tests/ffi_udtf.rs: ## @@ -0,0 +1,100 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreemen

[PR] Add support for `PRINT` statement for SQL Server [datafusion-sqlparser-rs]

2025-04-11 Thread via GitHub
aharpervc opened a new pull request, #1811: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1811 Reference: https://learn.microsoft.com/en-us/sql/t-sql/language-elements/print-transact-sql?view=sql-server-ver16 Making `message` a `Box` instead of an enum of (national) stri

Re: [I] java.lang.NoSuchMethodError: 'scala.collection.Seq org.apache.spark.sql.execution.PartitionedFileUtil$.splitFiles(org.apache.spark.sql.SparkSession [datafusion-comet]

2025-04-11 Thread via GitHub
parthchandra commented on issue #1639: URL: https://github.com/apache/datafusion-comet/issues/1639#issuecomment-2798047128 The third parameter to `PartitionedFileUtil.splitFiles` is a `Path` which your call seems to be missing. The full stack trace might show where this is being called fr

[PR] Testing DF 47 before release cut [datafusion-python]

2025-04-11 Thread via GitHub
timsaucer opened a new pull request, #1104: URL: https://github.com/apache/datafusion-python/pull/1104 This PR is just to test upstream datafusion 47 prior to cutting a release

Re: [PR] Specialize join matching when values in map are unique [datafusion]

2025-04-11 Thread via GitHub
Dandandan commented on PR #15690: URL: https://github.com/apache/datafusion/pull/15690#issuecomment-2798027465 This is promising, need to fix the test and make sure the limit is respected.

[PR] chore: reduce log levels for few log statements [datafusion-ballista]

2025-04-11 Thread via GitHub
milenkovicm opened a new pull request, #1237: URL: https://github.com/apache/datafusion-ballista/pull/1237 # Which issue does this PR close? Closes none. # Rationale for this change reduce log levels for few log statements, I would argue they do not need to be printe

Re: [PR] fix: executor can't read s3 config in push-staged mode [datafusion-ballista]

2025-04-11 Thread via GitHub
milenkovicm commented on PR #1236: URL: https://github.com/apache/datafusion-ballista/pull/1236#issuecomment-2797957413 thanks for patch @mmooyyii there is a test to test object store access but it does not cover all cases unfortunately, we definitely need to improve testing. jus

Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]

2025-04-11 Thread via GitHub
timsaucer commented on issue #15072: URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2797943836 > FYI [@timsaucer](https://github.com/timsaucer) we are getting ready to release datafusion 47 -- shall we test with datafusion-python before doing so? I've been using a

Re: [PR] Add `CREATE TRIGGER` support for SQL Server [datafusion-sqlparser-rs]

2025-04-11 Thread via GitHub
aharpervc commented on code in PR #1810: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1810#discussion_r2040246620 ## src/parser/mod.rs: ## @@ -5265,18 +5271,71 @@ impl<'a> Parser<'a> { trigger_object, include_each, condition

Re: [I] Unique identifier for MemoryConsumer [datafusion]

2025-04-11 Thread via GitHub
alamb closed issue #15126: Unique identifier for MemoryConsumer URL: https://github.com/apache/datafusion/issues/15126

Re: [PR] feat: Add unique id for every memory consumer [datafusion]

2025-04-11 Thread via GitHub
alamb commented on PR #15613: URL: https://github.com/apache/datafusion/pull/15613#issuecomment-2797940153 This is great -- thanks again for the work and contribution @EmilyMatt

Re: [PR] feat: Add unique id for every memory consumer [datafusion]

2025-04-11 Thread via GitHub
alamb merged PR #15613: URL: https://github.com/apache/datafusion/pull/15613

Re: [PR] Add Table Functions to FFI Crate [datafusion]

2025-04-11 Thread via GitHub
alamb commented on code in PR #15581: URL: https://github.com/apache/datafusion/pull/15581#discussion_r2040253122 ## datafusion/ffi/tests/ffi_udtf.rs: ## @@ -0,0 +1,100 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements.

[PR] Specialize join matching when values in map are unique [datafusion]

2025-04-11 Thread via GitHub
Dandandan opened a new pull request, #15690: URL: https://github.com/apache/datafusion/pull/15690 ## Which issue does this PR close? - Closes #. ## Rationale for this change Performance improvements for this case. ## What changes are included in

[PR] Add `CREATE TRIGGER` support for SQL Server [datafusion-sqlparser-rs]

2025-04-11 Thread via GitHub
aharpervc opened a new pull request, #1810: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1810 Adjacent to: https://github.com/apache/datafusion-sqlparser-rs/pull/1808 with similar considerations --- This PR introduces support for parsing `CREATE TRIGGER` for SQL

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-04-11 Thread via GitHub
alamb commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2797923658 > I just read your blogpost today, and I am really happy to have a faster generator. The post focussed on generating tpc-h to files, but I see you also discussed something like th

Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]

2025-04-11 Thread via GitHub
alamb commented on issue #15072: URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2797918175 FYI @timsaucer we are getting ready to release datafusion 47 -- shall we test with datafusion-python before doing so?

Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]

2025-04-11 Thread via GitHub
alamb commented on issue #15072: URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2797919224 I also tested the upgrade in delta.rs and it seems to have gone well for me - https://github.com/delta-io/delta-rs/pull/3378

Re: [PR] Optimize BinaryExpr Evaluation with Short-Circuiting for AND/OR Operators [datafusion]

2025-04-11 Thread via GitHub
alamb commented on PR #15648: URL: https://github.com/apache/datafusion/pull/15648#issuecomment-2797909596 @kosiew -- I wonder if you saw this post from @Dandandan : https://github.com/apache/datafusion/issues/15631#issuecomment-2796844672 It seems a simpler way to improve perfo

[PR] fix: Modify Spark SQL core 2 tests for `native_datafusion` reader, change 3.5.5 diff hash length to 11 [datafusion-comet]

2025-04-11 Thread via GitHub
mbutrovich opened a new pull request, #1641: URL: https://github.com/apache/datafusion-comet/pull/1641 ## Which issue does this PR close? Closes #1640. Partially address #1545 by reducing test failures. ## Rationale for this change ## What changes are incl

Re: [PR] Add `CREATE FUNCTION` support for SQL Server [datafusion-sqlparser-rs]

2025-04-11 Thread via GitHub
aharpervc commented on code in PR #1808: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1808#discussion_r2040196781 ## tests/sqlparser_mssql.rs: ## @@ -187,6 +187,145 @@ fn parse_mssql_create_procedure() { let _ = ms().verified_stmt("CREATE PROCEDURE [foo] AS

Re: [PR] Fix tokenization of qualified identifiers with numeric prefix. [datafusion-sqlparser-rs]

2025-04-11 Thread via GitHub
iffyio merged PR #1803: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1803

Re: [PR] Add `CREATE FUNCTION` support for SQL Server [datafusion-sqlparser-rs]

2025-04-11 Thread via GitHub
aharpervc commented on code in PR #1808: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1808#discussion_r2040136168 ## src/parser/mod.rs: ## @@ -5135,6 +5146,63 @@ impl<'a> Parser<'a> { })) } +/// Parse `CREATE FUNCTION` for [SQL Server] +//

Re: [PR] parquet reader: move pruning predicate creation from ParquetSource to ParquetOpener [datafusion]

2025-04-11 Thread via GitHub
alamb commented on PR #15561: URL: https://github.com/apache/datafusion/pull/15561#issuecomment-2797869301 This definitely is an API change -- I hit it in the delta-rs upgrade: - https://github.com/delta-io/delta-rs/pull/3378 I'll make a note to add it to the upgrade guide

Re: [I] Regression in `last_value` functionality [datafusion]

2025-04-11 Thread via GitHub
Dandandan commented on issue #15676: URL: https://github.com/apache/datafusion/issues/15676#issuecomment-2797856807 Sorry to have caused so much discussion. I'm totally in favor of keeping this open and have the function (without `order by`) for now matching the expectation of "first

Re: [I] Regression in `last_value` functionality [datafusion]

2025-04-11 Thread via GitHub
andygrove closed issue #15676: Regression in `last_value` functionality URL: https://github.com/apache/datafusion/issues/15676

[I] Standardize Spark diff hash length [datafusion-comet]

2025-04-11 Thread via GitHub
mbutrovich opened a new issue, #1640: URL: https://github.com/apache/datafusion-comet/issues/1640 ### Describe the bug We have diffs for Spark 3.4.3, 3.5.4, 3.5.5, and 4.0.0-preview1 for running Spark SQL tests. These were generated with different hash abbreviation lengths, so genera

Re: [PR] Add `CREATE FUNCTION` support for SQL Server [datafusion-sqlparser-rs]

2025-04-11 Thread via GitHub
aharpervc commented on code in PR #1808: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1808#discussion_r2038224067 ## src/ast/mod.rs: ## @@ -4050,6 +4051,16 @@ pub enum Statement { arguments: Vec, options: Vec, }, +/// Return (SQL Server

Re: [I] Regression in `last_value` functionality [datafusion]

2025-04-11 Thread via GitHub
andygrove commented on issue #15676: URL: https://github.com/apache/datafusion/issues/15676#issuecomment-2797840426 I'm ok with closing this issue since the behavior of first/last aggregates without explicit ordering is not generally deterministic. We will figure out an approach in Comet. T

Re: [I] Regression in `last_value` functionality [datafusion]

2025-04-11 Thread via GitHub
andygrove commented on issue #15676: URL: https://github.com/apache/datafusion/issues/15676#issuecomment-2797833569 > Perhaps we can add order by for the tests? Spark SQL doesn't seem to support an `ORDER BY` clause in this context.

Re: [PR] Add Extension Type / Metadata support for Scalar UDFs [datafusion]

2025-04-11 Thread via GitHub
timsaucer commented on PR #15646: URL: https://github.com/apache/datafusion/pull/15646#issuecomment-2797745297 I need to take some time to review these comments and think more about it, likely next week. Also I'm dropping a note for myself that the current implementation isn't sufficient fo

Re: [PR] fix: fix spark/sql test failures in native_iceberg_compat [datafusion-comet]

2025-04-11 Thread via GitHub
andygrove commented on code in PR #1593: URL: https://github.com/apache/datafusion-comet/pull/1593#discussion_r2040094303 ## common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java: ## @@ -263,111 +272,129 @@ public void init() throws URISyntaxException, IOExceptio

Re: [I] ListingTable statistics improperly merges statistics when files have different schemas [datafusion]

2025-04-11 Thread via GitHub
friendlymatthew commented on issue #15689: URL: https://github.com/apache/datafusion/issues/15689#issuecomment-2797733418 take

Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]

2025-04-11 Thread via GitHub
alamb commented on issue #15072: URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2797691953 The upgrade to arrow 55 is now ready for review too: - https://github.com/apache/datafusion/pull/15466

Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]

2025-04-11 Thread via GitHub
friendlymatthew commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-2797704916 _Note: I'm sorry about the super long write ups. I'm not trying to bike shed._ I was thinking about https://github.com/apache/datafusion/pull/15361/commits/f906af55df1

Re: [PR] Fix tokenization of qualified identifiers with numeric prefix. [datafusion-sqlparser-rs]

2025-04-11 Thread via GitHub
romanb commented on code in PR #1803: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1803#discussion_r2040073355 ## src/tokenizer.rs: ## @@ -895,7 +895,7 @@ impl<'a> Tokenizer<'a> { }; let mut location = state.location(); -while let Some

[PR] Fix typo in opening paragraph [datafusion-site]

2025-04-11 Thread via GitHub
alamb opened a new pull request, #68: URL: https://github.com/apache/datafusion-site/pull/68 Scale Factor 100 is 36GB not 3.6GB

Re: [PR] Consolidate statistics merging code (try 2) [datafusion]

2025-04-11 Thread via GitHub
alamb merged PR #15661: URL: https://github.com/apache/datafusion/pull/15661

Re: [PR] Add `statistics_by_partition API` to ExecutionPlan [datafusion]

2025-04-11 Thread via GitHub
alamb commented on code in PR #15503: URL: https://github.com/apache/datafusion/pull/15503#discussion_r2040065184 ## datafusion/physical-plan/src/statistics.rs: ## @@ -0,0 +1,196 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license ag

Re: [PR] Consolidate statistics merging code (try 2) [datafusion]

2025-04-11 Thread via GitHub
alamb commented on PR #15661: URL: https://github.com/apache/datafusion/pull/15661#issuecomment-2797689097 > It seems that the PR still has the issue that is mentioned here: https://github.com/xudong963/arrow-datafusion/pull/5#discussion_r2034641672. Yes I think you are right -- howev

[I] ListingTable statistics improperly merges statistics when files have different schemas [datafusion]

2025-04-11 Thread via GitHub
alamb opened a new issue, #15689: URL: https://github.com/apache/datafusion/issues/15689 ### Describe the bug As @xudong963 mentions in - https://github.com/xudong963/arrow-datafusion/pull/5#discussion_r2034641672. And also brought up again in - https://github.com/apac

Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]

2025-04-11 Thread via GitHub
Omega359 commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-2797640563 Thanks for the additional work on this @friendlymatthew ! I think this approach is solid - the overhead for the casting is limited to only the cases where the format string includes

Re: [PR] Upgrade to arrow/parquet 55, and `object_store` to `0.12.0` and pyo3 to `0.24.0` [datafusion]

2025-04-11 Thread via GitHub
mbutrovich commented on PR #15466: URL: https://github.com/apache/datafusion/pull/15466#issuecomment-2797625880 > I thought it might be related to improved pre-fetching / fewer IOs due to This should be easy to confirm with `dtruss`/`dtrace`/`bpftrace`. Let me see if I find a moment.

Re: [PR] Upgrade to arrow/parquet 55 [datafusion]

2025-04-11 Thread via GitHub
alamb commented on PR #15466: URL: https://github.com/apache/datafusion/pull/15466#issuecomment-2797605664 > Still seeing if this is just noise, but here are flame graphs for Q14 from my machine if anyone else wants to stare at them: My theory is that the improvement is due to @rluvat

Re: [PR] Fix typo in opening paragraph of tpch generator blog [datafusion-site]

2025-04-11 Thread via GitHub
alamb merged PR #68: URL: https://github.com/apache/datafusion-site/pull/68

Re: [I] Regression in `last_value` functionality [datafusion]

2025-04-11 Thread via GitHub
kazuyukitanimura commented on issue #15676: URL: https://github.com/apache/datafusion/issues/15676#issuecomment-2797612697 > Rewrite our tests for LAST to stop comparing to Spark and implement some other means to determine that the behavior is correct, and also document that Comet is not co

Re: [PR] Fix typo in opening paragraph of tpch generator blog [datafusion-site]

2025-04-11 Thread via GitHub
alamb commented on PR #68: URL: https://github.com/apache/datafusion-site/pull/68#issuecomment-2797615019 Thanks @timsaucer

Re: [I] Document `CREATE EXTERNAL TABLE ... OPTIONS` [datafusion]

2025-04-11 Thread via GitHub
marvelshan commented on issue #10451: URL: https://github.com/apache/datafusion/issues/10451#issuecomment-2797584011 take

Re: [PR] ExecutionPlan: add APIs for filter pushdown & optimizer rule to apply them [datafusion]

2025-04-11 Thread via GitHub
adriangb commented on PR #15566: URL: https://github.com/apache/datafusion/pull/15566#issuecomment-2797543695 Thank you as well!

Re: [PR] ExecutionPlan: add APIs for filter pushdown & optimizer rule to apply them [datafusion]

2025-04-11 Thread via GitHub
ozankabak commented on PR #15566: URL: https://github.com/apache/datafusion/pull/15566#issuecomment-2797532385 All right -- we will submit a PR early next week and get it merged ASAP to enable you to carry on. We will also keep on collaborating with you for subsequent PRs as this functional

Re: [PR] Add `CREATE FUNCTION` support for SQL Server [datafusion-sqlparser-rs]

2025-04-11 Thread via GitHub
aharpervc commented on code in PR #1808: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1808#discussion_r2039938786 ## src/ast/ddl.rs: ## @@ -2157,6 +2157,10 @@ impl fmt::Display for ClusteredBy { #[cfg_attr(feature = "serde", derive(Serialize, Deserialize))] #[c

Re: [PR] Introduce DynamicFilterSource and DynamicPhysicalExpr [datafusion]

2025-04-11 Thread via GitHub
jayzhan211 commented on PR #15568: URL: https://github.com/apache/datafusion/pull/15568#issuecomment-2796973499 https://github.com/apache/datafusion/pull/15685 @adriangb, the `snapshot` is different but I think the overall idea should be the same, while we avoid remapping each time we

[PR] add config parse_hex_as_fixed_size_binary [datafusion]

2025-04-11 Thread via GitHub
leoyvens opened a new pull request, #15687: URL: https://github.com/apache/datafusion/pull/15687 ## Which issue does this PR close? - Closes #15686. ## What changes are included in this PR? Adds `parse_hex_as_fixed_size_binary` parser option. ## Are these changes t

[PR] Add support for `GO` batch delimiter in SQL Server [datafusion-sqlparser-rs]

2025-04-11 Thread via GitHub
aharpervc opened a new pull request, #1809: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1809 Reference: https://learn.microsoft.com/en-us/sql/t-sql/language-elements/sql-server-utilities-statements-go Lots of conventional SQL Server tooling supports `GO`, so it seems r

[I] Unparse of Joins is ignoring projections [datafusion]

2025-04-11 Thread via GitHub
nuno-faria opened a new issue, #15688: URL: https://github.com/apache/datafusion/issues/15688 ### Describe the bug The unparse of Join operators is ignoring the projected columns, ending up projecting everything. Two conditions cause this to happen: - the final projected col

Re: [PR] perf: Respect Spark's PARQUET_FILTER_PUSHDOWN_ENABLED config [datafusion-comet]

2025-04-11 Thread via GitHub
andygrove commented on code in PR #1619: URL: https://github.com/apache/datafusion-comet/pull/1619#discussion_r2039804022 ## native/core/src/parquet/parquet_exec.rs: ## @@ -61,7 +61,12 @@ pub(crate) fn init_datasource_exec( data_filters: Option>>, session_timezone: &st

Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]

2025-04-11 Thread via GitHub
alamb commented on issue #15072: URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2797397320 Thanks -- I plan to make a test PR for delta.rs later this afternoon and will report back

Re: [PR] ExecutionPlan: add APIs for filter pushdown & optimizer rule to apply them [datafusion]

2025-04-11 Thread via GitHub
berkaysynnada commented on PR #15566: URL: https://github.com/apache/datafusion/pull/15566#issuecomment-2796969760 > I don't think dynamic vs. static is the right distinction to make here. I did it since your examples were on dynamic filters. I just wanted to show dynamic filters case

Re: [I] Questionable hash seed reuse between RepartitionExec and HashJoinExec [datafusion]

2025-04-11 Thread via GitHub
alamb commented on issue #15620: URL: https://github.com/apache/datafusion/issues/15620#issuecomment-2797395565 It seems like a reasonable proposal to me

[I] benchmarks: Read SessionConfig from Environment [datafusion]

2025-04-11 Thread via GitHub
ctsk opened a new issue, #15684: URL: https://github.com/apache/datafusion/issues/15684 ### Is your feature request related to a problem or challenge? I think it would be useful if the benchmarks would consider environment variables for the datafusion configuration. This would let dev

[I] Allow parsing byte literals as FixedSizeBinary [datafusion]

2025-04-11 Thread via GitHub
leoyvens opened a new issue, #15686: URL: https://github.com/apache/datafusion/issues/15686 ### Is your feature request related to a problem or challenge? Say you have a column `byte_column` of type `FixedSizeBinary`. Doing `where byte_column = x'deadbeef'` will fail, because the lite

Re: [PR] ExecutionPlan: add APIs for filter pushdown & optimizer rule to apply them [datafusion]

2025-04-11 Thread via GitHub
berkaysynnada commented on PR #15566: URL: https://github.com/apache/datafusion/pull/15566#issuecomment-2796862617 > Could you help clarify when the FilterExec nodes get inserted? Maybe some examples with DataSourceExecs that do not accept any filters would help. You can look at the n

Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]

2025-04-11 Thread via GitHub
Dandandan commented on issue #15631: URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2796844672 Btw, as a simple proof of concept, I tested this yesterday and it reduced the execution time of short-circuiting the all-false / all-true cases by 25% compared to `true_count` / `false_count`:

Re: [PR] ExecutionPlan: add APIs for filter pushdown & optimizer rule to apply them [datafusion]

2025-04-11 Thread via GitHub
adriangb commented on PR #15566: URL: https://github.com/apache/datafusion/pull/15566#issuecomment-2797315179 > @adriangb perhaps we can work on creating a new PR (stacked on this one) that hooks everything up for dynamic filter pushdown. That way we can have things ready to go once we get

Re: [PR] dynamic filter refactor [datafusion]

2025-04-11 Thread via GitHub
adriangb commented on code in PR #15685: URL: https://github.com/apache/datafusion/pull/15685#discussion_r2039832734 ## datafusion/physical-expr/src/expressions/dynamic_filters.rs: ## @@ -105,47 +97,44 @@ impl DynamicFilterPhysicalExpr { inner: Arc, ) -> Self {

Re: [PR] perf: Respect Spark's PARQUET_FILTER_PUSHDOWN_ENABLED config [datafusion-comet]

2025-04-11 Thread via GitHub
andygrove commented on code in PR #1619: URL: https://github.com/apache/datafusion-comet/pull/1619#discussion_r2039766214 ## native/core/src/parquet/parquet_exec.rs: ## @@ -61,7 +61,12 @@ pub(crate) fn init_datasource_exec( data_filters: Option>>, session_timezone: &st

Re: [PR] perf: Respect Spark's PARQUET_FILTER_PUSHDOWN_ENABLED config [datafusion-comet]

2025-04-11 Thread via GitHub
andygrove commented on code in PR #1619: URL: https://github.com/apache/datafusion-comet/pull/1619#discussion_r2039766851 ## native/core/src/parquet/parquet_exec.rs: ## @@ -61,7 +61,12 @@ pub(crate) fn init_datasource_exec( data_filters: Option>>, session_timezone: &st

Re: [PR] perf: Respect Spark's PARQUET_FILTER_PUSHDOWN_ENABLED config [datafusion-comet]

2025-04-11 Thread via GitHub
andygrove commented on PR #1619: URL: https://github.com/apache/datafusion-comet/pull/1619#issuecomment-2797232477 > I addressed the feedback but I no longer see a performance improvement with `native_datafusion` when disabling `PARQUET_FILTER_PUSHDOWN_ENABLED`, so I have moved this to dra

Re: [PR] perf: Respect Spark's PARQUET_FILTER_PUSHDOWN_ENABLED config [datafusion-comet]

2025-04-11 Thread via GitHub
andygrove commented on PR #1619: URL: https://github.com/apache/datafusion-comet/pull/1619#issuecomment-2797156817 I addressed the feedback but I no longer see a performance improvement with `native_datafusion` when disabling `PARQUET_FILTER_PUSHDOWN_ENABLED`, so I have moved this to draft

Re: [PR] ExecutionPlan: add APIs for filter pushdown & optimizer rule to apply them [datafusion]

2025-04-11 Thread via GitHub
berkaysynnada commented on PR #15566: URL: https://github.com/apache/datafusion/pull/15566#issuecomment-2796865438 > One more question: it seems like in all cases we end up eagerly cloning every node: `Join::try_new`. If I understand correctly this may even happen twice per node as we do th

Re: [PR] Support bounds evaluation for temporal data types [datafusion]

2025-04-11 Thread via GitHub
ch-sc commented on PR #14523: URL: https://github.com/apache/datafusion/pull/14523#issuecomment-2797103457 Sorry @berkaysynnada, I got side-tracked from this yesterday. The test is fixed.

Re: [I] java.lang.NoSuchMethodError: 'scala.collection.Seq org.apache.spark.sql.execution.PartitionedFileUtil$.splitFiles(org.apache.spark.sql.SparkSession [datafusion-comet]

2025-04-11 Thread via GitHub
andygrove commented on issue #1639: URL: https://github.com/apache/datafusion-comet/issues/1639#issuecomment-2797061885 It looks like this error is happening in Spark code and not in Comet code? It is difficult to know how to help with this since you have a custom Spark image.
