Re: [D] Is HashAggregation spilling enabled by default? [datafusion]

2025-04-03 Thread via GitHub
GitHub user 2010YOUY01 added a comment to the discussion: Is HashAggregation spilling enabled by default? Spilling is always enabled, the difference is: - If no memory limit is specified (default), memory accounting inside operators never fails. - If a memory limit is specified, memory account
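The comment above is cut off in this archive, so the exact behavior under a memory limit is not shown here. As a rough illustration only, the toy model below assumes that with a limit set, a failed reservation makes the operator spill its buffered data and retry. `ToyPool` and `run_operator` are hypothetical names for this sketch and are not DataFusion APIs.

```rust
/// Toy stand-in for a memory pool (hypothetical, not the DataFusion type).
/// `limit: None` models the default case where no memory limit is configured.
struct ToyPool {
    limit: Option<usize>,
    used: usize,
}

impl ToyPool {
    fn new(limit: Option<usize>) -> Self {
        Self { limit, used: 0 }
    }

    /// Account for `bytes` more memory. This can only fail when a limit is set.
    fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        if let Some(limit) = self.limit {
            if self.used + bytes > limit {
                return Err(format!("cannot reserve {bytes} bytes, limit is {limit}"));
            }
        }
        self.used += bytes;
        Ok(())
    }

    /// Release previously accounted memory.
    fn shrink(&mut self, bytes: usize) {
        self.used = self.used.saturating_sub(bytes);
    }
}

/// Toy operator: buffers batches in memory and "spills" (here: just releases
/// the buffered bytes) whenever accounting fails. Returns the spill count.
/// Assumption for this sketch: a failed reservation triggers a spill.
fn run_operator(pool: &mut ToyPool, batch_sizes: &[usize]) -> usize {
    let mut buffered = 0usize;
    let mut spills = 0usize;
    for &size in batch_sizes {
        if pool.try_grow(size).is_err() {
            pool.shrink(buffered);
            buffered = 0;
            spills += 1;
            pool.try_grow(size)
                .expect("a single batch must fit within the limit");
        }
        buffered += size;
    }
    spills
}

fn main() {
    let batches = [40, 40, 40, 40];
    let no_limit = run_operator(&mut ToyPool::new(None), &batches);
    let with_limit = run_operator(&mut ToyPool::new(Some(100)), &batches);
    println!("spills without a limit: {no_limit}");
    println!("spills with a 100-byte limit: {with_limit}");
}
```

In this model the spill code path always exists, but it is only reached once a limit makes accounting fail, which matches the first bullet of the comment.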

Re: [PR] added functionality to handle MSSQL output statement [datafusion-sqlparser-rs]

2025-04-03 Thread via GitHub
dilovancelik commented on PR #1790: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1790#issuecomment-2774659454 I updated the old usage to use the new parse_select_into function and ran all tests :) -- This is an automated message from the Apache Git Service. To respond to t

Re: [I] Ballista: Partition columns are duplicated in protobuf decoding. [datafusion-ballista]

2025-04-03 Thread via GitHub
milenkovicm commented on issue #484: URL: https://github.com/apache/datafusion-ballista/issues/484#issuecomment-2774686851 If this is still an issue, it's yours @iho

Re: [I] Spark executor fail to start occasionally with SIGILL [datafusion-comet]

2025-04-03 Thread via GitHub
mixermt commented on issue #1598: URL: https://github.com/apache/datafusion-comet/issues/1598#issuecomment-2774706637 We compiled Comet ourselves (spark 3.5, scala 2.12), without any visible issues. We have a wide variety of servers in our K8s cluster (all kinds of Dell's PowerEdge

Re: [PR] Respect ignore_nulls in array_agg [datafusion]

2025-04-03 Thread via GitHub
gabotechs commented on code in PR #15544: URL: https://github.com/apache/datafusion/pull/15544#discussion_r2026317591 ## datafusion/sqllogictest/test_files/aggregate.slt: ## @@ -3194,6 +3196,28 @@ select array_agg(column1) from t; statement ok drop table t; +# array_agg_igno

Re: [I] Audit-check fails in main branch [datafusion]

2025-04-03 Thread via GitHub
xudong963 commented on issue #15554: URL: https://github.com/apache/datafusion/issues/15554#issuecomment-2774749408 If we upgrade `pyo3` to `0.24.1`, more info: ``` cargo build Updating crates.io index error: failed to select a version for `pyo3-ffi`. ... required by packa

Re: [I] Regression with compound field access and join schema [datafusion]

2025-04-03 Thread via GitHub
kosiew commented on issue #15549: URL: https://github.com/apache/datafusion/issues/15549#issuecomment-2774976608 @alexwilcoxson-rel I could not reproduce this with commit `28451b5` on the `main` branch: ``` DataFusion CLI v46.0.1 > create table u as values({r: 'a', c: 1})

Re: [I] Add documentation for benchmarking Comet in AWS with S3 data source [datafusion-comet]

2025-04-03 Thread via GitHub
andygrove commented on issue #1583: URL: https://github.com/apache/datafusion-comet/issues/1583#issuecomment-2772820500 Here are some rough notes on creating the dataset in S3. - create s3 bucket - create `m7i.4xlarge` ec2 instance with 2 TB EBS volume ``` sudo yum instal

[PR] Add tables with compound fields and JOIN condition tests for compound… [datafusion]

2025-04-03 Thread via GitHub
kosiew opened a new pull request, #15556: URL: https://github.com/apache/datafusion/pull/15556 ## Which issue does this PR close? - I added some slt tests to investigate the issue #15549 ## Rationale for this change This change adds SQL logic tests to validate support for

Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]

2025-04-03 Thread via GitHub
xudong963 commented on issue #15072: URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2775012298 > Would really appreciate if could add the following PR to the release as well: > > * [fix: update group by columns for merge phase after spill  #15531](https://github.c

Re: [PR] STRING_AGG missing functionality [datafusion]

2025-04-03 Thread via GitHub
gabotechs commented on code in PR #14412: URL: https://github.com/apache/datafusion/pull/14412#discussion_r2026415116 ## datafusion/functions-aggregate/src/string_agg.rs: ## @@ -65,6 +83,7 @@ make_udaf_expr_and_func!( #[derive(Debug)] pub struct StringAgg { signature: Sig

Re: [I] Spark executor fail to start occasionally with SIGILL [datafusion-comet]

2025-04-03 Thread via GitHub
mixermt commented on issue #1598: URL: https://github.com/apache/datafusion-comet/issues/1598#issuecomment-2774725949 Comet was built on server with following lscpu ``` Architecture:x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s):

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-04-03 Thread via GitHub
jayzhan211 commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2026105522 ## datafusion/physical-plan/src/filter.rs: ## @@ -433,6 +437,80 @@ impl ExecutionPlan for FilterExec { } try_embed_projection(projection, self

Re: [I] Run all benchmarks on merge to main branch [datafusion]

2025-04-03 Thread via GitHub
berkaysynnada commented on issue #15511: URL: https://github.com/apache/datafusion/issues/15511#issuecomment-2774860687 > Instead of running the benchmark, how about adding those benchmark query to tests, I don't think we need to actually "benchmark" the code for each merge. Keeping a

Re: [PR] STRING_AGG missing functionality [datafusion]

2025-04-03 Thread via GitHub
gabotechs commented on code in PR #14412: URL: https://github.com/apache/datafusion/pull/14412#discussion_r2026461993 ## datafusion/functions-aggregate/src/string_agg.rs: ## @@ -129,52 +172,326 @@ impl AggregateUDFImpl for StringAgg { #[derive(Debug)] pub(crate) struct Strin

Re: [PR] feat: Improve fetch partition performance, support skip validation arrow ipc files [datafusion-ballista]

2025-04-03 Thread via GitHub
westhide commented on PR #1216: URL: https://github.com/apache/datafusion-ballista/pull/1216#issuecomment-2777671990 > > Q1: As the `BallistaFlightService` keeps listening on each Executor, writing it allows a client to send a `do_get` request, and without check `FetchPartition` action's `pat

Re: [PR] feat: add test to check for `ctx.read_json()` [datafusion-ballista]

2025-04-03 Thread via GitHub
westhide commented on code in PR #1212: URL: https://github.com/apache/datafusion-ballista/pull/1212#discussion_r2028175201 ## ballista/scheduler/src/scheduler_server/grpc.rs: ## @@ -125,10 +125,15 @@ impl SchedulerGrpc let mut tasks = vec![]; for (_

Re: [PR] Migrate datafusion/sql tests to insta, part4 [datafusion]

2025-04-03 Thread via GitHub
alamb commented on PR #15548: URL: https://github.com/apache/datafusion/pull/15548#issuecomment-2776376489 Thanks again! This is looking really nice now

Re: [PR] Docs : Added Sql examples for window Functions : nth_val , etc [datafusion]

2025-04-03 Thread via GitHub
Adez017 commented on PR #1: URL: https://github.com/apache/datafusion/pull/1#issuecomment-2776400110 Hi @alamb, could you trigger the CI for this?

Re: [PR] Migrate datafusion/sql tests to insta, part4 [datafusion]

2025-04-03 Thread via GitHub
alamb merged PR #15548: URL: https://github.com/apache/datafusion/pull/15548

Re: [PR] Add `statistics_by_partition API` to ExecutionPlan [datafusion]

2025-04-03 Thread via GitHub
xudong963 commented on code in PR #15503: URL: https://github.com/apache/datafusion/pull/15503#discussion_r2027374194 ## datafusion/core/tests/physical_optimizer/partition_statistics.rs: ## @@ -0,0 +1,317 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or m

Re: [PR] fix: avoid panic caused by close null handle of parquet reader [datafusion-comet]

2025-04-03 Thread via GitHub
andygrove merged PR #1604: URL: https://github.com/apache/datafusion-comet/pull/1604

Re: [I] Native scan panic with native_iceberg_compat on hdfs [datafusion-comet]

2025-04-03 Thread via GitHub
andygrove closed issue #1553: Native scan panic with native_iceberg_compat on hdfs URL: https://github.com/apache/datafusion-comet/issues/1553

Re: [PR] Fix Possible Congestion Scenario in `SortPreservingMergeExec` [datafusion]

2025-04-03 Thread via GitHub
alamb commented on code in PR #12302: URL: https://github.com/apache/datafusion/pull/12302#discussion_r2027651936 ## datafusion/physical-plan/src/sorts/merge.rs: ## @@ -154,14 +165,36 @@ impl SortPreservingMergeStream { if self.aborted { return Poll::Ready(

Re: [PR] Migrate datafusion/sql tests to insta, part5 [datafusion]

2025-04-03 Thread via GitHub
blaginin commented on code in PR #15567: URL: https://github.com/apache/datafusion/pull/15567#discussion_r2027650462 ## datafusion/sql/tests/sql_integration.rs: ## @@ -3388,26 +3389,15 @@ fn ident_normalization_parser_options_ident_normalization() -> ParserOptions { } }

[I] Improve time for SortPreservingMerge stream / uninitiated_partitions VecDeque [datafusion]

2025-04-03 Thread via GitHub
alamb opened a new issue, #15573: URL: https://github.com/apache/datafusion/issues/15573 ### Is your feature request related to a problem or challenge? Both @rluvaton and I have seen https://github.com/user-attachments/assets/cd91c702-51fa-45b7-9214-7913c1281161

Re: [PR] perf: replace `merge` `uninitiated_partitions` `VecDeque` with custom fixed size queue [datafusion]

2025-04-03 Thread via GitHub
alamb commented on PR #15562: URL: https://github.com/apache/datafusion/pull/15562#issuecomment-2776787483 BTW I filed a ticket to track this as I have seen it too - https://github.com/apache/datafusion/issues/15573

Re: [PR] Minor: add Arc for statistics in FileGroup [datafusion]

2025-04-03 Thread via GitHub
alamb commented on PR #15564: URL: https://github.com/apache/datafusion/pull/15564#issuecomment-2776794019 Thank you @xudong963

Re: [PR] tpcbench.py add --query support to run custom query [datafusion-ray]

2025-04-03 Thread via GitHub
jazracherif commented on code in PR #84: URL: https://github.com/apache/datafusion-ray/pull/84#discussion_r2027648678 ## tpch/tpcbench.py: ## @@ -186,8 +186,28 @@ def main( args = parser.parse_args() +if (args.qnum != -1 and args.query is not None): +print("

Re: [I] Reduce number of tokio blocking threads in SortExec spill [datafusion]

2025-04-03 Thread via GitHub
alamb commented on issue #15323: URL: https://github.com/apache/datafusion/issues/15323#issuecomment-2776820608 > I think I have the same problem but in `AggregateExec` when using `row_hash`, as it spills as well and uses `SortPreservingMergeStream`. > > I think the solution should

Re: [I] Blog post about user defined window functions [datafusion]

2025-04-03 Thread via GitHub
alamb commented on issue #6781: URL: https://github.com/apache/datafusion/issues/6781#issuecomment-2776822081 Thanks @Adez017 -- You can make a blog post for the DataFusion blog by making a PR to this repo: https://github.com/alamb/datafusion-site

Re: [I] Add `topk` information into `tree` explain plans [datafusion]

2025-04-03 Thread via GitHub
alamb closed issue #15546: Add `topk` information into `tree` explain plans URL: https://github.com/apache/datafusion/issues/15546

Re: [PR] Add topk information into tree explain plans [datafusion]

2025-04-03 Thread via GitHub
alamb commented on PR #15547: URL: https://github.com/apache/datafusion/pull/15547#issuecomment-2776824435 Thanks again @kumarlokesh

Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]

2025-04-03 Thread via GitHub
friendlymatthew commented on PR #15361: URL: https://github.com/apache/datafusion/pull/15361#issuecomment-2776384329 > @friendlymatthew - Running the benchmarks between main and the sidebranch would be very helpful - it should give us an idea as to the overhead of the format parsing for dat

Re: [I] Add more developer documentation [datafusion-comet]

2025-04-03 Thread via GitHub
psvri commented on issue #230: URL: https://github.com/apache/datafusion-comet/issues/230#issuecomment-2776417665 Thanks for this @andygrove. Compared to when I raised the PR, I see that a lot of much-needed information has been added.

[PR] ExecutionPlan: add APIs for filter pushdown & optimizer rule to apply them [datafusion]

2025-04-03 Thread via GitHub
adriangb opened a new pull request, #15566: URL: https://github.com/apache/datafusion/pull/15566 Work towards #15037

Re: [PR] parquet reader: move pruning predicate creation from ParquetSource to ParquetOpener [datafusion]

2025-04-03 Thread via GitHub
adriangb commented on code in PR #15561: URL: https://github.com/apache/datafusion/pull/15561#discussion_r2027429964 ## datafusion/core/src/datasource/file_format/parquet.rs: ## @@ -67,13 +67,13 @@ pub(crate) mod test_util { .into_iter() .zip(tmp_files.

[PR] chore: return `404` for api requests if path does not exist [datafusion-ballista]

2025-04-03 Thread via GitHub
milenkovicm opened a new pull request, #1224: URL: https://github.com/apache/datafusion-ballista/pull/1224 # Which issue does this PR close? Closes #1223 . # Rationale for this change # What changes are included in this PR? # Are there any user-facing changes?

Re: [I] Reduce number of tokio blocking threads in SortExec spill [datafusion]

2025-04-03 Thread via GitHub
rluvaton commented on issue #15323: URL: https://github.com/apache/datafusion/issues/15323#issuecomment-2776605858 I think I have the same problem but in `AggregateExec` when using `row_hash`, as it spills as well and uses `SortPreservingMergeStream`. I think the solution should ac

[PR] chore: fix clippy issues after update to rust 1.86 [datafusion-ballista]

2025-04-03 Thread via GitHub
milenkovicm opened a new pull request, #1225: URL: https://github.com/apache/datafusion-ballista/pull/1225 # Which issue does this PR close? Closes #. # Rationale for this change # What changes are included in this PR? - fix clippy issues after rust 1.86 u

Re: [PR] Add topk information into tree explain plans [datafusion]

2025-04-03 Thread via GitHub
alamb commented on code in PR #15547: URL: https://github.com/apache/datafusion/pull/15547#discussion_r2027592484 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -1112,7 +1112,10 @@ impl DisplayAs for SortExec { impl ExecutionPlan for SortExec { fn name(&self) -> &'

Re: [I] Spark executor fail to start occasionally with SIGILL [datafusion-comet]

2025-04-03 Thread via GitHub
mixermt commented on issue #1598: URL: https://github.com/apache/datafusion-comet/issues/1598#issuecomment-2776287792 While it's possible to execute on specific machines, I prefer not to break our K8s resource pool. So in order to compile to broadwell, should I just concat it to target

Re: [PR] Fix duplicate unqualified Field name (schema error) on join queries [datafusion]

2025-04-03 Thread via GitHub
LiaCastaneda commented on PR #15438: URL: https://github.com/apache/datafusion/pull/15438#issuecomment-2771896879 Thanks for the review @alamb

Re: [I] Spark executor fail to start occasionally with SIGILL [datafusion-comet]

2025-04-03 Thread via GitHub
mbutrovich commented on issue #1598: URL: https://github.com/apache/datafusion-comet/issues/1598#issuecomment-2776293399 Just `-Ctarget-cpu=broadwell`. Sorry I should have been more specific.

Re: [PR] Add `statistics_by_partition API` to ExecutionPlan [datafusion]

2025-04-03 Thread via GitHub
xudong963 commented on code in PR #15503: URL: https://github.com/apache/datafusion/pull/15503#discussion_r2027363781 ## datafusion/physical-plan/src/joins/cross_join.rs: ## @@ -344,6 +345,26 @@ impl ExecutionPlan for CrossJoinExec { )) } +fn statistics_by_pa

Re: [PR] Add `statistics_by_partition API` to ExecutionPlan [datafusion]

2025-04-03 Thread via GitHub
xudong963 commented on code in PR #15503: URL: https://github.com/apache/datafusion/pull/15503#discussion_r2027384075 ## datafusion/physical-plan/src/statistics.rs: ## @@ -0,0 +1,151 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor licens

Re: [I] Decorrelate scalar subqueries with more complex filter expressions [datafusion]

2025-04-03 Thread via GitHub
duongcongtoai commented on issue #14554: URL: https://github.com/apache/datafusion/issues/14554#issuecomment-2776463863 Thank you, I'll take a look at the PR

Re: [I] address failure caused by method signature change in SPARK-48791 [datafusion-comet]

2025-04-03 Thread via GitHub
parthchandra commented on issue #692: URL: https://github.com/apache/datafusion-comet/issues/692#issuecomment-2776495289 Seems like a perennial issue. This signature changes in every release it appears (it is private after all). https://github.com/apache/datafusion-comet/issues/1576

Re: [PR] feat: Add config `max_temp_directory_size` to limit max disk usage for spilling queries [datafusion]

2025-04-03 Thread via GitHub
alamb commented on PR #15520: URL: https://github.com/apache/datafusion/pull/15520#issuecomment-2776693763 Thanks again @2010YOUY01

Re: [PR] feat: Add config `max_temp_directory_size` to limit max disk usage for spilling queries [datafusion]

2025-04-03 Thread via GitHub
alamb merged PR #15520: URL: https://github.com/apache/datafusion/pull/15520

Re: [PR] fix: update group by columns for merge phase after spill [datafusion]

2025-04-03 Thread via GitHub
alamb merged PR #15531: URL: https://github.com/apache/datafusion/pull/15531

Re: [PR] fix: update group by columns for merge phase after spill [datafusion]

2025-04-03 Thread via GitHub
alamb commented on PR #15531: URL: https://github.com/apache/datafusion/pull/15531#issuecomment-2776696614 Thank you @rluvaton

Re: [I] Internal error: PhysicalExpr Column references bound error, Failure in spilling for `AggregateMode::Single` [datafusion]

2025-04-03 Thread via GitHub
alamb closed issue #15530: Internal error: PhysicalExpr Column references bound error, Failure in spilling for `AggregateMode::Single` URL: https://github.com/apache/datafusion/issues/15530

Re: [I] [comet-parquet-exec] Track remaining test failures in POC 1 & 2 [datafusion-comet]

2025-04-03 Thread via GitHub
comphead commented on issue #1228: URL: https://github.com/apache/datafusion-comet/issues/1228#issuecomment-2776744809 Closing this in favor of #1441

Re: [PR] Migrate datafusion/sql tests to insta, part5 [datafusion]

2025-04-03 Thread via GitHub
qstommyshu commented on PR #15567: URL: https://github.com/apache/datafusion/pull/15567#issuecomment-2776882350 > is this the last one? This is the last one for migrating tests in `tests/sql_integration.rs`. There are still some cases in `tests/cases/plan_to_sql.rs` and `tests

Re: [I] [comet-parquet-exec] Track remaining test failures in POC 1 & 2 [datafusion-comet]

2025-04-03 Thread via GitHub
mkgada commented on issue #1228: URL: https://github.com/apache/datafusion-comet/issues/1228#issuecomment-2776590893 Has there been an update on this? My workflow specifies a schema to my read function through a JSON file, date-related fields are specified to be timestamp type.

Re: [PR] Migrate datafusion/sql tests to insta, part5 [datafusion]

2025-04-03 Thread via GitHub
alamb commented on PR #15567: URL: https://github.com/apache/datafusion/pull/15567#issuecomment-2776729981 Is this the last one?

Re: [I] [EPIC] A collection of tickets for improving sorting larger than memory datasets / spilling sorts [datafusion]

2025-04-03 Thread via GitHub
alamb commented on issue #15271: URL: https://github.com/apache/datafusion/issues/15271#issuecomment-2776837729 > so If I have a lot of spill files or if every batch is really huge (contains very large lists - like result for array_agg on large dataset) we have all of this in memory.

Re: [PR] Migrate datafusion/sql tests to insta, part5 [datafusion]

2025-04-03 Thread via GitHub
qstommyshu commented on code in PR #15567: URL: https://github.com/apache/datafusion/pull/15567#discussion_r2027712552 ## datafusion/sql/tests/sql_integration.rs: ## @@ -3388,26 +3389,15 @@ fn ident_normalization_parser_options_ident_normalization() -> ParserOptions { } }

[PR] Introduce DynamicFilterSource and DynamicPhysicalExpr [datafusion]

2025-04-03 Thread via GitHub
adriangb opened a new pull request, #15568: URL: https://github.com/apache/datafusion/pull/15568 Work towards #15512

Re: [PR] Migrate datafusion/sql tests to insta, part4 [datafusion]

2025-04-03 Thread via GitHub
blaginin commented on code in PR #15548: URL: https://github.com/apache/datafusion/pull/15548#discussion_r2026909887 ## datafusion/sql/tests/sql_integration.rs: ## @@ -77,9 +77,8 @@ fn parse_decimals() { for (a, b) in test_data { let sql = format!("SELECT {a}");

Re: [I] Dynamic pruning filters from TopK state (optimize `ORDER BY LIMIT` queries) [datafusion]

2025-04-03 Thread via GitHub
adriangb closed issue #15037: Dynamic pruning filters from TopK state (optimize `ORDER BY LIMIT` queries) URL: https://github.com/apache/datafusion/issues/15037

Re: [I] Dynamic pruning filters from TopK state (optimize `ORDER BY LIMIT` queries) [datafusion]

2025-04-03 Thread via GitHub
adriangb commented on issue #15037: URL: https://github.com/apache/datafusion/issues/15037#issuecomment-2776667178 Sorry, I closed the issue instead of the PR, my bad!

Re: [PR] Migrate datafusion/sql tests to insta, part5 [datafusion]

2025-04-03 Thread via GitHub
qstommyshu commented on code in PR #15567: URL: https://github.com/apache/datafusion/pull/15567#discussion_r2027702690 ## datafusion/sql/tests/sql_integration.rs: ## @@ -4665,18 +4674,18 @@ Projection: person.id, person.age } #[test] -fn test_prepare_statement_infer_types_fr

Re: [I] Remove unwraps in `hash_array_small_decimal` [datafusion-comet]

2025-04-03 Thread via GitHub
andygrove closed issue #1599: Remove unwraps in `hash_array_small_decimal` URL: https://github.com/apache/datafusion-comet/issues/1599

Re: [PR] Add topk information into tree explain plans [datafusion]

2025-04-03 Thread via GitHub
alamb merged PR #15547: URL: https://github.com/apache/datafusion/pull/15547

Re: [I] [substrait] Build basic test suite to validate produced Substrait plans [datafusion]

2025-04-03 Thread via GitHub
alamb commented on issue #15069: URL: https://github.com/apache/datafusion/issues/15069#issuecomment-2776844057 > So I don't believe automatically running the full sqllogictests through Substrait is feasible today. A more realistic goal would be to have a few hand-crafted queries (TPCH quer

Re: [PR] Update concepts-readings-events.md [datafusion]

2025-04-03 Thread via GitHub
alamb commented on PR #15541: URL: https://github.com/apache/datafusion/pull/15541#issuecomment-2776844631 Thanks @oznur-synnada and @berkaysynnada

Re: [PR] perf: replace `merge` `uninitiated_partitions` `VecDeque` with custom fixed size queue [datafusion]

2025-04-03 Thread via GitHub
rluvaton commented on code in PR #15562: URL: https://github.com/apache/datafusion/pull/15562#discussion_r2027711140 ## datafusion/physical-plan/src/sorts/merge.rs: ## @@ -241,10 +239,13 @@ impl SortPreservingMergeStream { _ => { //

[I] `cargo audit` is failing on main [datafusion]

2025-04-03 Thread via GitHub
alamb opened a new issue, #15571: URL: https://github.com/apache/datafusion/issues/15571 ### Describe the bug We are seeing a cargo audit failure on @zebsme 's PR: https://github.com/apache/datafusion/pull/15454 ``` Crate: proc-macro-error Version: 1.0.4 Warn

Re: [PR] perf: replace `merge` `uninitiated_partitions` `VecDeque` with custom fixed size queue [datafusion]

2025-04-03 Thread via GitHub
rluvaton commented on PR #15562: URL: https://github.com/apache/datafusion/pull/15562#issuecomment-2776872261 I removed the custom fixed-size queue and replaced it with the matching `VecDeque` functions, which are zero-cost because they only manipulate indices, like my previous implementation
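The comment does not say which `VecDeque` functions were used. As one possible illustration of the idea, the sketch below moves the front partition to the back with `pop_front` followed by `push_back`; since the length never exceeds its previous value, no reallocation is needed. The `defer_front` helper and the `uninitiated` queue are hypothetical names for this example, not code from the PR.

```rust
use std::collections::VecDeque;

/// Move the partition at the front of the queue to the back, e.g. because it
/// is not ready yet. Returns the deferred partition index, or `None` if the
/// queue is empty. Hypothetical helper for illustration only.
fn defer_front(queue: &mut VecDeque<usize>) -> Option<usize> {
    let partition = queue.pop_front()?;
    // The slot freed by `pop_front` guarantees room for this push,
    // so the ring buffer does not grow.
    queue.push_back(partition);
    Some(partition)
}

fn main() {
    // Partitions that have not produced their first batch yet.
    let mut uninitiated: VecDeque<usize> = (0..4).collect();
    let capacity_before = uninitiated.capacity();

    // Partition 0 is still pending: rotate it to the back.
    let deferred = defer_front(&mut uninitiated);
    println!("deferred {:?}, queue is now {:?}", deferred, uninitiated);

    // Partition 1 is ready: drop it from the queue entirely.
    let ready = uninitiated.pop_front();
    println!("ready {:?}, queue is now {:?}", ready, uninitiated);

    // Neither operation changed the allocation.
    assert_eq!(uninitiated.capacity(), capacity_before);
}
```

Both operations run in constant time on a `VecDeque`, which is the property the comment appears to rely on.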

Re: [PR] parquet reader: move pruning predicate creation from ParquetSource to ParquetOpener [datafusion]

2025-04-03 Thread via GitHub
alamb commented on code in PR #15561: URL: https://github.com/apache/datafusion/pull/15561#discussion_r2027656995 ## datafusion/core/src/datasource/physical_plan/parquet.rs: ## @@ -1445,119 +1465,26 @@ mod tests { // batch1: c1(string) let batch1 = string_batch

Re: [PR] Fix Possible Congestion Scenario in `SortPreservingMergeExec` [datafusion]

2025-04-03 Thread via GitHub
alamb commented on PR #12302: URL: https://github.com/apache/datafusion/pull/12302#issuecomment-2776791139 FYI @rluvaton and I have noticed a `clone()` introduced in this PR appearing in some traces: - https://github.com/apache/datafusion/issues/15573

Re: [PR] Docs : Added Sql examples for window Functions : `nth_val` , etc [datafusion]

2025-04-03 Thread via GitHub
alamb commented on PR #1: URL: https://github.com/apache/datafusion/pull/1#issuecomment-2776823505 It looks like there are a few small errors to fix to get a clean CI run

Re: [PR] parquet reader: move pruning predicate creation from ParquetSource to ParquetOpener [datafusion]

2025-04-03 Thread via GitHub
adriangb commented on code in PR #15561: URL: https://github.com/apache/datafusion/pull/15561#discussion_r2027772249 ## datafusion/sqllogictest/test_files/parquet.slt: ## @@ -625,7 +625,7 @@ physical_plan 01)CoalesceBatchesExec: target_batch_size=8192 02)--FilterExec: column1@

[PR] Fix clippy lint on rust 1.86 [datafusion-sqlparser-rs]

2025-04-03 Thread via GitHub
iffyio opened a new pull request, #1796: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1796 Fixes [lint failures in CI](https://github.com/apache/datafusion-sqlparser-rs/actions/runs/14236730212/job/39938621832?pr=1790) from the latest rust release

Re: [I] 【TPCH】Comet do not show performance advantages over native Spark? [datafusion-comet]

2025-04-03 Thread via GitHub
andygrove commented on issue #1450: URL: https://github.com/apache/datafusion-comet/issues/1450#issuecomment-2775796160 I'll close this issue for now since I cannot reproduce. Please feel free to reopen it if you'd like to continue with this.

Re: [I] Add more developer documentation [datafusion-comet]

2025-04-03 Thread via GitHub
andygrove closed issue #230: Add more developer documentation URL: https://github.com/apache/datafusion-comet/issues/230

Re: [PR] parquet reader: move pruning predicate creation from ParquetSource to ParquetOpener [datafusion]

2025-04-03 Thread via GitHub
adriangb commented on code in PR #15561: URL: https://github.com/apache/datafusion/pull/15561#discussion_r2027851655 ## datafusion/datasource-parquet/src/opener.rs: ## @@ -295,3 +312,89 @@ fn create_initial_plan( // default to scanning all row groups Ok(ParquetAccessPl

Re: [I] `cargo audit` is failing on main [datafusion]

2025-04-03 Thread via GitHub
Jiashu-Hu commented on issue #15571: URL: https://github.com/apache/datafusion/issues/15571#issuecomment-2777150559 take

Re: [I] `cargo audit` is failing on main [datafusion]

2025-04-03 Thread via GitHub
Jiashu-Hu commented on issue #15571: URL: https://github.com/apache/datafusion/issues/15571#issuecomment-2777171963 For now, it might be best to just allow this warning, since everything is working fine as it is. Alternatively, we could try to use 'proc-macro-error2' instead, but that might

Re: [I] Audit-check fails in main branch [datafusion]

2025-04-03 Thread via GitHub
xudong963 closed issue #15554: Audit-check fails in main branch URL: https://github.com/apache/datafusion/issues/15554

Re: [I] Run all benchmarks on merge to main branch [datafusion]

2025-04-03 Thread via GitHub
jayzhan211 commented on issue #15511: URL: https://github.com/apache/datafusion/issues/15511#issuecomment-2777328776 > You are also right here about the cost, but what if we can have 2 modes for benchmarks, one for the actual benchmarking purpose, and one with just to validate. If it is in

Re: [I] Blog post about user defined window functions [datafusion]

2025-04-03 Thread via GitHub
Adez017 commented on issue #6781: URL: https://github.com/apache/datafusion/issues/6781#issuecomment-2776464810 take

Re: [I] Remove record_batch! macro once upstream updates [datafusion]

2025-04-03 Thread via GitHub
alamb commented on issue #13037: URL: https://github.com/apache/datafusion/issues/13037#issuecomment-2776834231 Thanks @ByteBaker -- I don't have any strong opinion / advice here unfortunately

Re: [I] TPCH DataGen Not working [datafusion-comet]

2025-04-03 Thread via GitHub
andygrove commented on issue #1157: URL: https://github.com/apache/datafusion-comet/issues/1157#issuecomment-2775885159 Closing this issue since it is for an issue with a Databricks repo that we do not control

Re: [I] 【TPCH】Comet do not show performance advantages over native Spark? [datafusion-comet]

2025-04-03 Thread via GitHub
andygrove closed issue #1450: 【TPCH】Comet do not show performance advantages over native Spark? URL: https://github.com/apache/datafusion-comet/issues/1450

Re: [PR] Migrate datafusion/sql tests to insta, part5 [datafusion]

2025-04-03 Thread via GitHub
qstommyshu commented on code in PR #15567: URL: https://github.com/apache/datafusion/pull/15567#discussion_r2027712552 ## datafusion/sql/tests/sql_integration.rs: ## @@ -3388,26 +3389,15 @@ fn ident_normalization_parser_options_ident_normalization() -> ParserOptions { } }

Re: [I] Include Apple macOS support in jars in Maven central [datafusion-comet]

2025-04-03 Thread via GitHub
andygrove commented on issue #1010: URL: https://github.com/apache/datafusion-comet/issues/1010#issuecomment-2775877457 Closing as duplicate of https://github.com/apache/datafusion-comet/issues/947

Re: [PR] Add `statistics_by_partition API` to ExecutionPlan [datafusion]

2025-04-03 Thread via GitHub
xudong963 commented on code in PR #15503: URL: https://github.com/apache/datafusion/pull/15503#discussion_r2027370640 ## datafusion/core/tests/physical_optimizer/partition_statistics.rs: ## @@ -0,0 +1,317 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or m

Re: [PR] Add `statistics_by_partition API` to ExecutionPlan [datafusion]

2025-04-03 Thread via GitHub
xudong963 commented on code in PR #15503: URL: https://github.com/apache/datafusion/pull/15503#discussion_r2027374194 ## datafusion/core/tests/physical_optimizer/partition_statistics.rs: ## @@ -0,0 +1,317 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or m

Re: [I] Different semantics of casting from int64 to timestamp between Comet and Spark [datafusion-comet]

2025-04-03 Thread via GitHub
parthchandra commented on issue #146: URL: https://github.com/apache/datafusion-comet/issues/146#issuecomment-2776426708 > the min value here is actually causing an overflow in Spark Probably because Spark is converting the value from millis to micros?

Re: [PR] chore: Remove some unwraps in hashing code [datafusion-comet]

2025-04-03 Thread via GitHub
andygrove merged PR #1600: URL: https://github.com/apache/datafusion-comet/pull/1600

[I] Erroneous warning on unset options during FFI table operation [datafusion]

2025-04-03 Thread via GitHub
timsaucer opened a new issue, #15565: URL: https://github.com/apache/datafusion/issues/15565 ### Describe the bug When you use an FFI Table Provider, we create a session context by parsing a string representation of a session config. This happens in [this line](https://github.com/apa

Re: [I] Different semantics of casting from int64 to timestamp between Comet and Spark [datafusion-comet]

2025-04-03 Thread via GitHub
andygrove commented on issue #146: URL: https://github.com/apache/datafusion-comet/issues/146#issuecomment-2776435399 > > the min value here is actually causing an overflow in Spark > > Probably because Spark is converting the value from millis to micros? Right, my understandin

Re: [PR] Respect ignore_nulls in array_agg [datafusion]

2025-04-03 Thread via GitHub
Dandandan commented on code in PR #15544: URL: https://github.com/apache/datafusion/pull/15544#discussion_r2027495310 ## datafusion/functions-aggregate/src/array_agg.rs: ## @@ -500,11 +548,30 @@ impl Accumulator for OrderSensitiveArrayAggAccumulator { return Ok(());
