Re: [PR] fix: overcounting of memory in first/last. [datafusion]

2025-05-07 Thread via GitHub
ashdnazg commented on code in PR #15924: URL: https://github.com/apache/datafusion/pull/15924#discussion_r2077195109 ## datafusion/common/src/scalar/mod.rs: ## @@ -3415,6 +3415,100 @@ impl ScalarValue { .map(|sv| sv.size() - size_of_val(sv)) .su

Re: [PR] fix: overcounting of memory in first/last. [datafusion]

2025-05-07 Thread via GitHub
ashdnazg commented on code in PR #15924: URL: https://github.com/apache/datafusion/pull/15924#discussion_r2077189430 ## datafusion/functions-aggregate/src/min_max.rs: ## @@ -645,19 +645,29 @@ fn min_max_batch_struct(array: &ArrayRef, ordering: Ordering) -> Result {{ if

Re: [PR] feat: add macros for DataFusionError variants [datafusion]

2025-05-07 Thread via GitHub
Chen-Yuan-Lai commented on code in PR #15946: URL: https://github.com/apache/datafusion/pull/15946#discussion_r2077245042 ## datafusion/common/src/error.rs: ## @@ -808,12 +808,18 @@ make_error!(plan_err, plan_datafusion_err, Plan); // Exposes a macro to create `DataFusionError:

Re: [D] DISCUSSION: Anyone around for the Databricks Data & AI Summit in San Francisco June 9–12 2025 [datafusion]

2025-05-07 Thread via GitHub
GitHub user alamb added a comment to the discussion: DISCUSSION: Anyone around for the Databricks Data & AI Summit in San Francisco June 9–12 2025 BTW we have planned a meetup for June 9, 2025 in San Francisco. Signup link is here: https://lu.ma/uuxd443e GitHub link: https://github.com/apach

[PR] Implement RightSemi join for SortMergeJoin [datafusion]

2025-05-07 Thread via GitHub
irenjj opened a new pull request, #15972: URL: https://github.com/apache/datafusion/pull/15972 ## Which issue does this PR close? - Closes #13471 ## Rationale for this change ## What changes are included in this PR? ## Are these changes test

Re: [PR] perf: Add performance tracing capability [datafusion-comet]

2025-05-07 Thread via GitHub
andygrove commented on PR #1706: URL: https://github.com/apache/datafusion-comet/pull/1706#issuecomment-2858740190 I moved this to draft because it is still quite experimental. I am now working on adding tracing for JVM off-heap usage in `CometUnifiedShuffleMemoryAllocator` and `CometTaskM

Re: [I] Support metadata columns (`location`, `size`, `last_modified`) in `ListingTableProvider` [datafusion]

2025-05-07 Thread via GitHub
adriangb commented on issue #15173: URL: https://github.com/apache/datafusion/issues/15173#issuecomment-2858707706 I'll point out that I was testing DuckDB and they have this very nice feature: ``` D select filename, sum(row_count) as row_count from read_parquet('/Users/adriangb/D

[PR] Demonstrate wrong statistics reported from parquet [datafusion]

2025-05-07 Thread via GitHub
robert3005 opened a new pull request, #15977: URL: https://github.com/apache/datafusion/pull/15977 ## Which issue does this PR close? - Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes test

Re: [PR] fix: Allow ORDER BY aggregates not present in SELECT list [datafusion]

2025-05-07 Thread via GitHub
UBarney commented on code in PR #15876: URL: https://github.com/apache/datafusion/pull/15876#discussion_r2077729484 ## datafusion/expr/src/logical_plan/builder.rs: ## @@ -797,26 +807,146 @@ impl LogicalPlanBuilder { } // remove pushed down sort columns -

Re: [I] Cache Parquet Metadata [datafusion]

2025-05-07 Thread via GitHub
adriangb commented on issue #15582: URL: https://github.com/apache/datafusion/issues/15582#issuecomment-2858701825 I'll mention that we now avoid reading metadata entirely for a lot of queries using an approach along the lines of https://github.com/apache/datafusion/issues/15585 -- This

Re: [PR] Demonstrate wrong statistics reported from parquet [datafusion]

2025-05-07 Thread via GitHub
robert3005 commented on PR #15977: URL: https://github.com/apache/datafusion/pull/15977#issuecomment-2858755012 The file on this branch can be used for parquet testing to reproduce the issue. Looks like it should go to https://github.com/apache/parquet-testing -- This is an automated mess

Re: [PR] perf: Add memory profiling [datafusion-comet]

2025-05-07 Thread via GitHub
mbutrovich commented on PR #1702: URL: https://github.com/apache/datafusion-comet/pull/1702#issuecomment-2858775592 Not the end of the world, but this introduces warnings on macOS when using jemalloc: ``` | use tikv_jemalloc_ctl::{epoch, stats}; | ^ ^^^

Re: [PR] feat: Set/cancel with job tag and make max broadcast table size configurable [datafusion-comet]

2025-05-07 Thread via GitHub
parthchandra commented on code in PR #1693: URL: https://github.com/apache/datafusion-comet/pull/1693#discussion_r2078146339 ## spark/src/main/spark-3.4/org/apache/comet/shims/ShimCometBroadcastExchangeExec.scala: ## @@ -0,0 +1,49 @@ +/* + * Licensed to the Apache Software Found

Re: [PR] Make Expr::alias and alias_qualified smarter by calling unalias [datafusion]

2025-05-07 Thread via GitHub
github-actions[bot] closed pull request #14749: Make Expr::alias and alias_qualified smarter by calling unalias URL: https://github.com/apache/datafusion/pull/14749 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL

Re: [PR] Show LogicalType name for `INFORMATION_SCHEMA` [datafusion]

2025-05-07 Thread via GitHub
goldmedal commented on PR #15965: URL: https://github.com/apache/datafusion/pull/15965#issuecomment-2861534893 Thanks @alamb

Re: [PR] Re-Add CodeCov [datafusion]

2025-05-07 Thread via GitHub
2010YOUY01 commented on PR #15256: URL: https://github.com/apache/datafusion/pull/15256#issuecomment-2861663728 > Basically my opinion here is that the few times I tried to review the codecov reports, I found them useless. Maybe it has gotten better since they were originally > > Whe

[PR] Support Schema Evolution in iceberg [datafusion-comet]

2025-05-07 Thread via GitHub
huaxingao opened a new pull request, #1723: URL: https://github.com/apache/datafusion-comet/pull/1723 ## Which issue does this PR close? We originally have CometConf.COMET_SCHEMA_EVOLUTION_ENABLED to set schema evolution to true in the Scan rule if the scan is an Iceberg table scan. However, i

Re: [PR] Enhance Schema adapter to accommodate evolving struct [datafusion]

2025-05-07 Thread via GitHub
TheBuilderJR commented on PR #15295: URL: https://github.com/apache/datafusion/pull/15295#issuecomment-2861173475 @kosiew so I think the tricky part is that there are actually multiple evolutions. Basically my code currently looks like this ``` let con

Re: [PR] [wip] feat: Add framework for supporting multiple telemetry providers [datafusion-comet]

2025-05-07 Thread via GitHub
codecov-commenter commented on PR #1722: URL: https://github.com/apache/datafusion-comet/pull/1722#issuecomment-2861237379 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1722?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] [datafusion-spark] Implement ceil&floor function for spark [datafusion]

2025-05-07 Thread via GitHub
andygrove commented on code in PR #15958: URL: https://github.com/apache/datafusion/pull/15958#discussion_r2078769590 ## datafusion/spark/src/function/math/ceil_floor.rs: ## @@ -0,0 +1,720 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor

Re: [PR] [datafusion-spark] Implement ceil&floor function for spark [datafusion]

2025-05-07 Thread via GitHub
andygrove commented on code in PR #15958: URL: https://github.com/apache/datafusion/pull/15958#discussion_r2078772494 ## datafusion/spark/src/function/math/ceil_floor.rs: ## @@ -0,0 +1,720 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor

Re: [PR] [datafusion-spark] Implement ceil&floor function for spark [datafusion]

2025-05-07 Thread via GitHub
irenjj commented on code in PR #15958: URL: https://github.com/apache/datafusion/pull/15958#discussion_r2078862746 ## datafusion/spark/src/function/math/ceil_floor.rs: ## @@ -0,0 +1,720 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor lic

Re: [D] Multiple 'group by's, one scan [datafusion]

2025-05-07 Thread via GitHub
GitHub user pepijnve edited a discussion: Multiple 'group by's, one scan In the system I'm working on I want to perform multiple aggregates using different group by criteria over large data sets. I don't think grouping sets are an option since those support computing a single set of aggregates

Re: [PR] Feat: support bit_count function [datafusion-comet]

2025-05-07 Thread via GitHub
kazuyukitanimura commented on code in PR #1602: URL: https://github.com/apache/datafusion-comet/pull/1602#discussion_r2078943165 ## native/spark-expr/src/bitwise_funcs/bitwise_count.rs: ## @@ -0,0 +1,177 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or mo

[I] [EPIC] Observability [datafusion-comet]

2025-05-07 Thread via GitHub
andygrove opened a new issue, #1718: URL: https://github.com/apache/datafusion-comet/issues/1718 ### What is the problem the feature request solves? This epic is to track observability work to make it easier to monitor and debug issues with Comet in a distributed environment, such as

Re: [PR] perf: Add performance tracing capability [datafusion-comet]

2025-05-07 Thread via GitHub
andygrove commented on code in PR #1706: URL: https://github.com/apache/datafusion-comet/pull/1706#discussion_r2077706523 ## native/core/src/execution/tracing.rs: ## @@ -0,0 +1,111 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license

[PR] refactor: remove deprecated `ParquetExec` [datafusion]

2025-05-07 Thread via GitHub
miroim opened a new pull request, #15973: URL: https://github.com/apache/datafusion/pull/15973 ## Which issue does this PR close? Part of #15950 . ## Rationale for this change Remove deprecated `ParquetExec` and `ParquetExecBuilder` ## What changes are inc

Re: [PR] Fix `datafusion-cli` memory leak by using `snmalloc` [datafusion]

2025-05-07 Thread via GitHub
Dandandan commented on PR #15963: URL: https://github.com/apache/datafusion/pull/15963#issuecomment-2857791560 I think it makes sense to run some benchmarks again, since jemalloc has also implemented some optimizations over time.

Re: [I] Vectorize window functions [datafusion]

2025-05-07 Thread via GitHub
2010YOUY01 commented on issue #15607: URL: https://github.com/apache/datafusion/issues/15607#issuecomment-2857786990 > [@Dandandan](https://github.com/Dandandan) I've reviewed the current implementation of window functions in DataFusion and studied some related concepts. This DuckDB blog po

Re: [I] Vectorize window functions [datafusion]

2025-05-07 Thread via GitHub
Dandandan commented on issue #15607: URL: https://github.com/apache/datafusion/issues/15607#issuecomment-2857839305 My idea for the scope of this issue wasn't to support arbitrary expressions in the window frames / implement `Segment Tree`, but rather vectorize the existing implementation -

[PR] Enhance Schema adapter to accommodate evolving struct [datafusion]

2025-05-07 Thread via GitHub
kosiew opened a new pull request, #15295: URL: https://github.com/apache/datafusion/pull/15295 ## Which issue does this PR close? - Closes #14757. ## Rationale for this change [arrow-rs suggests that SchemaAdapter is better approach for handling evolving struct](https:/

Re: [PR] make can_expr_be_pushed_down_with_schemas public again [datafusion]

2025-05-07 Thread via GitHub
berkaysynnada merged PR #15971: URL: https://github.com/apache/datafusion/pull/15971

Re: [PR] perf: Add performance tracing capability [datafusion-comet]

2025-05-07 Thread via GitHub
andygrove commented on PR #1706: URL: https://github.com/apache/datafusion-comet/pull/1706#issuecomment-2859201881 I can now see the off-heap memory being used by `CometUnsafeShuffleWriter`, which was 4GB in this case (500MB per concurrent task). ![2025-05-07_10-15](https://github.c

Re: [PR] Enable repartitioning on MemTable. [datafusion]

2025-05-07 Thread via GitHub
wiedld commented on PR #15409: URL: https://github.com/apache/datafusion/pull/15409#issuecomment-2858928895 @2010YOUY01 -- comments addressed, and a conditional removed due to this reason: https://github.com/apache/datafusion/pull/15409#discussion_r2076020887

Re: [PR] Add support for table valued functions for SQL Server [datafusion-sqlparser-rs]

2025-05-07 Thread via GitHub
aharpervc commented on code in PR #1839: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1839#discussion_r2078030410 ## tests/sqlparser_mssql.rs: ## @@ -283,6 +294,50 @@ fn parse_create_function() { END\ "; let _ = ms().verified_stmt(create_functi

[PR] Additional placeholder datatype inferencing [datafusion]

2025-05-07 Thread via GitHub
kczimm opened a new pull request, #15980: URL: https://github.com/apache/datafusion/pull/15980 ## Which issue does this PR close? - Closes https://github.com/apache/datafusion/issues/15978 - Closes https://github.com/apache/datafusion/issues/15979 ## Rationale for thi

[PR] Add support for parsing with semicolons optional [datafusion-sqlparser-rs]

2025-05-07 Thread via GitHub
aharpervc opened a new pull request, #1843: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1843 Note: this PR is temporarily rebased on https://github.com/apache/datafusion-sqlparser-rs/pull/1834 before that is merged --- This PR introduces support for parsing SQL

Re: [PR] fix: overcounting of memory in first/last. [datafusion]

2025-05-07 Thread via GitHub
chenkovsky commented on code in PR #15924: URL: https://github.com/apache/datafusion/pull/15924#discussion_r2077187981 ## datafusion/common/src/scalar/mod.rs: ## @@ -3415,6 +3415,100 @@ impl ScalarValue { .map(|sv| sv.size() - size_of_val(sv)) .

Re: [PR] fix: overcounting of memory in first/last. [datafusion]

2025-05-07 Thread via GitHub
ashdnazg commented on PR #15924: URL: https://github.com/apache/datafusion/pull/15924#issuecomment-2857822728 @alamb @chenkovsky I pushed a version that uses `copy_array_data` to make sure `take` isn't doing anything funny (looking at the code, at the moment it doesn't, but the docs pro

[PR] re-export can_expr_be_pushed_down_with_schemas to be public [datafusion]

2025-05-07 Thread via GitHub
adriangb opened a new pull request, #15974: URL: https://github.com/apache/datafusion/pull/15974 Followup to #15971. I forgot to re-export it 🤦🏻.

Re: [PR] re-export can_expr_be_pushed_down_with_schemas to be public [datafusion]

2025-05-07 Thread via GitHub
berkaysynnada merged PR #15974: URL: https://github.com/apache/datafusion/pull/15974

[PR] fix: allow arbitrary operators with ANY and ALL on Postgres [datafusion-sqlparser-rs]

2025-05-07 Thread via GitHub
freshtonic opened a new pull request, #1842: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1842 In sqlparser PR #963 a check was introduced which limits which operators can be used with `ANY` and `ALL` expressions. Postgres can parse more (possibly _all_ binary operators

Re: [PR] feat: create builder for disk manager [datafusion]

2025-05-07 Thread via GitHub
jdrouet closed pull request #15975: feat: create builder for disk manager URL: https://github.com/apache/datafusion/pull/15975

Re: [I] Postgres does not limit which operators can be used with `ANY` and `ALL` [datafusion-sqlparser-rs]

2025-05-07 Thread via GitHub
freshtonic commented on issue #1841: URL: https://github.com/apache/datafusion-sqlparser-rs/issues/1841#issuecomment-2858434626 Now with PR: #1842

[PR] feat: create builder for disk manager [datafusion]

2025-05-07 Thread via GitHub
jdrouet opened a new pull request, #15975: URL: https://github.com/apache/datafusion/pull/15975 ## Which issue does this PR close? - Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested?

Re: [D] DISCUSSION: Anyone around for the Databricks Data & AI Summit in San Francisco June 9–12 2025 [datafusion]

2025-05-07 Thread via GitHub
GitHub user phillipleblanc edited a discussion: DISCUSSION: Anyone around for the Databricks Data & AI Summit in San Francisco June 9–12 2025 I'll be traveling to San Francisco to attend the Databricks Data & AI Summit in San Francisco this June. I can't commit to hosting a full meetup, but i

[PR] Optimize hash partitioning for cache friendliness [datafusion]

2025-05-07 Thread via GitHub
ctsk opened a new pull request, #15981: URL: https://github.com/apache/datafusion/pull/15981 ## Which issue does this PR close? Helps https://github.com/apache/datafusion/issues/6822 a bit. ## Rationale for this change Before this PR, hash partitioning worked roughly like

Re: [PR] Migrate Optimizer tests to insta, part5 [datafusion]

2025-05-07 Thread via GitHub
qstommyshu commented on PR #15945: URL: https://github.com/apache/datafusion/pull/15945#issuecomment-2859773824 > I wonder if we are actually done now (❤️ thanks again @qstommyshu ) Not yet 😅. I just checked, there are at least 5 files in optimizer tests that still need to be migrated.

Re: [PR] Optimize hash partitioning for cache friendliness [datafusion]

2025-05-07 Thread via GitHub
Dandandan commented on PR #15981: URL: https://github.com/apache/datafusion/pull/15981#issuecomment-2859776487 Nice, that seems like a great result! I think the main improvement after this would be using the `take_in` API you proposed in arrow-rs (mainly to avoid `concat`)

Re: [PR] Migrate Optimizer tests to insta, part5 [datafusion]

2025-05-07 Thread via GitHub
alamb commented on PR #15945: URL: https://github.com/apache/datafusion/pull/15945#issuecomment-2859832899 > > I wonder if we are actually done now (❤️ thanks again @qstommyshu ) > > Not yet 😅. I just checked, there are at least 5 files in optimizer tests that still need to be migrated.

Re: [PR] Add support for parsing with semicolons optional [datafusion-sqlparser-rs]

2025-05-07 Thread via GitHub
aharpervc commented on PR #1843: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1843#issuecomment-2859830502 Keeping this in draft at the moment to work out the "compile-no-std" ci job, and possibly some more test cases. However, anyone feel free to post a review/thoughts

Re: [PR] chore: Comet + Iceberg (1.8.1) CI [datafusion-comet]

2025-05-07 Thread via GitHub
huaxingao commented on code in PR #1715: URL: https://github.com/apache/datafusion-comet/pull/1715#discussion_r2078282350 ## .github/workflows/iceberg_spark_test_native_datafusion.yml: ## @@ -0,0 +1,80 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more c

[I] feat: bucketed scan for native_datafusion Parquet scan [datafusion-comet]

2025-05-07 Thread via GitHub
mbutrovich opened a new issue, #1719: URL: https://github.com/apache/datafusion-comet/issues/1719 ### What is the problem the feature request solves? The native_datafusion Parquet scan does not support bucketed scan, and fails most of the tests in Spark's BucketedReadSuite without a f

[PR] fix: Bucketed scan fallback for native_datafusion Parquet scan [datafusion-comet]

2025-05-07 Thread via GitHub
mbutrovich opened a new pull request, #1720: URL: https://github.com/apache/datafusion-comet/pull/1720 ## Which issue does this PR close? Closes #. ## Rationale for this change See https://github.com/apache/datafusion-comet/issues/1719 ## What chang

Re: [PR] Map file-level column statistics to the table-level [datafusion]

2025-05-07 Thread via GitHub
adriangb commented on code in PR #15865: URL: https://github.com/apache/datafusion/pull/15865#discussion_r2077215405 ## datafusion/datasource/src/schema_adapter.rs: ## @@ -334,4 +340,126 @@ impl SchemaMapper for SchemaMapping { let record_batch = RecordBatch::try_new_wi

[PR] make can_expr_be_pushed_down_with_schemas public again [datafusion]

2025-05-07 Thread via GitHub
adriangb opened a new pull request, #15971: URL: https://github.com/apache/datafusion/pull/15971 I changed this from `pub fn` to `pub(crate) fn` in https://github.com/apache/datafusion/pull/15769 because it was no longer being used outside of the crate. However, then I went to update o

Re: [I] Vectorize window functions [datafusion]

2025-05-07 Thread via GitHub
suibianwanwank commented on issue #15607: URL: https://github.com/apache/datafusion/issues/15607#issuecomment-2858132562 Sorry for mixing up the two issues and causing some confusion. I agree that vectorizing the current implementation and supporting expr in frame are orthogonal.

Re: [PR] Enhance Schema adapter to accommodate evolving struct [datafusion]

2025-05-07 Thread via GitHub
kosiew commented on PR #15295: URL: https://github.com/apache/datafusion/pull/15295#issuecomment-2857995748 hi @TheBuilderJR , In the [2 schemas you quoted](https://github.com/apache/datafusion/pull/15295#issuecomment-2851094063), there are: 1. file_schema (from the parquet files)

Re: [PR] fix: overcounting of memory in first/last. [datafusion]

2025-05-07 Thread via GitHub
ashdnazg commented on code in PR #15924: URL: https://github.com/apache/datafusion/pull/15924#discussion_r2077160318 ## datafusion/common/src/scalar/mod.rs: ## @@ -3415,6 +3415,100 @@ impl ScalarValue { .map(|sv| sv.size() - size_of_val(sv)) .su

Re: [PR] PERF : modify SMJ shuffle file reader to skip validation [datafusion]

2025-05-07 Thread via GitHub
getChan commented on code in PR #15948: URL: https://github.com/apache/datafusion/pull/15948#discussion_r2077971001 ## datafusion/physical-plan/benches/sort_merge_join.rs: ## @@ -0,0 +1,117 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor

Re: [PR] PERF : modify SMJ shuffle file reader to skip validation [datafusion]

2025-05-07 Thread via GitHub
getChan commented on code in PR #15948: URL: https://github.com/apache/datafusion/pull/15948#discussion_r2077984399 ## datafusion/physical-plan/benches/sort_merge_join.rs: ## @@ -0,0 +1,117 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor

Re: [PR] Optimize hash partitioning for cache friendliness [datafusion]

2025-05-07 Thread via GitHub
ctsk commented on PR #15981: URL: https://github.com/apache/datafusion/pull/15981#issuecomment-2859158487 Another tried-and-true strategy for this kind of problem is to partition in multiple stages: Instead of having a "wide" fanout partitioning to, for instance 256 partitions, it is prefer

Re: [PR] Migrate Optimizer tests to insta, part5 [datafusion]

2025-05-07 Thread via GitHub
alamb merged PR #15945: URL: https://github.com/apache/datafusion/pull/15945

Re: [PR] Migrate Optimizer tests to insta, part5 [datafusion]

2025-05-07 Thread via GitHub
alamb commented on PR #15945: URL: https://github.com/apache/datafusion/pull/15945#issuecomment-2858885708 I wonder if we are actually done now (❤️ thanks again @qstommyshu )

[I] Placeholder datatype not inferred for `Expr::InSubquery` [datafusion]

2025-05-07 Thread via GitHub
kczimm opened a new issue, #15979: URL: https://github.com/apache/datafusion/issues/15979 ### Describe the bug Parameterized queries with placeholders before `IN ` are not inferred. For example, ```sql SELECT * FROM my_table WHERE $1 IN (SELECT A FROM my_table WHERE B > 3);

Re: [I] Reuse Rows allocation in SortPreservingMergeStream / `RowCursorStream` [datafusion]

2025-05-07 Thread via GitHub
acking-you commented on issue #15720: URL: https://github.com/apache/datafusion/issues/15720#issuecomment-2859288602 After implementing reuse Rows, it was found that there was no improvement in the overall execution of `SortPreservingMergeExec`. @Dandandan Therefore, I measured the r

Re: [PR] Optimize hash partitioning for cache friendliness [datafusion]

2025-05-07 Thread via GitHub
Dandandan commented on PR #15981: URL: https://github.com/apache/datafusion/pull/15981#issuecomment-2859289524 nice, could you share some perf numbers of this approach?

Re: [PR] chore: Comet + Iceberg (1.8.1) CI [datafusion-comet]

2025-05-07 Thread via GitHub
hsiang-c commented on code in PR #1715: URL: https://github.com/apache/datafusion-comet/pull/1715#discussion_r2078058620 ## dev/diffs/iceberg/1.8.1.diff: ## @@ -0,0 +1,179 @@ +diff --git a/spark/v3.4/build.gradle b/spark/v3.4/build.gradle +index 6eb26e8..90d848d 100644 +--- a/sp

Re: [PR] implement `AggregateExec.partition_statistics` [datafusion]

2025-05-07 Thread via GitHub
xudong963 commented on PR #15954: URL: https://github.com/apache/datafusion/pull/15954#issuecomment-2858784424 @UBarney Thank you, I'll review it in two days

[I] Replace `ObjectStoreRegistry` with `object_store`'s new `ObjectStoreRegistry` [datafusion]

2025-05-07 Thread via GitHub
criccomini opened a new issue, #15983: URL: https://github.com/apache/datafusion/issues/15983 ### Is your feature request related to a problem or challenge? I'm working with the `object_store` folks on upstreaming DataFusion's [`ObjectStoreRegistry`](https://docs.rs/datafusion/latest/

Re: [PR] Allow stored procedures to be defined without `BEGIN`/`END` [datafusion-sqlparser-rs]

2025-05-07 Thread via GitHub
aharpervc commented on code in PR #1834: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1834#discussion_r2078041514 ## tests/sqlparser_mssql.rs: ## @@ -100,48 +100,52 @@ fn parse_mssql_delimited_identifiers() { #[test] fn parse_create_procedure() { -let sql

Re: [I] Postgres does not limit which operators can be used with `ANY` and `ALL` [datafusion-sqlparser-rs]

2025-05-07 Thread via GitHub
mvzink commented on issue #1841: URL: https://github.com/apache/datafusion-sqlparser-rs/issues/1841#issuecomment-2859241001 Note that there are some other things Postgres supports for `ANY`/`ALL` which our current parsing approach doesn't; e.g. it treats `LIKE` as a valid operator in that

[I] Treat truncated parquet stats as inexact [datafusion]

2025-05-07 Thread via GitHub
robert3005 opened a new issue, #15976: URL: https://github.com/apache/datafusion/issues/15976 ### Describe the bug When reading parquet files with truncated stats datafusion will report the min/max as exact even though metadata in the file indicates that min/max has been truncated

Re: [PR] Optimize hash partitioning for cache friendliness [datafusion]

2025-05-07 Thread via GitHub
ctsk commented on PR #15981: URL: https://github.com/apache/datafusion/pull/15981#issuecomment-2859359952 I've run clickbench_partitioned and tpch_mem10 - on a machine with 16 cores. The clickbench results are pretty much the same, tpch_mem10 ran significantly faster. data

Re: [I] Reuse Rows allocation in SortPreservingMergeStream / `RowCursorStream` [datafusion]

2025-05-07 Thread via GitHub
Dandandan commented on issue #15720: URL: https://github.com/apache/datafusion/issues/15720#issuecomment-2859376503 Interesting @acking-you, do you have some code/branch you could share? I wonder if you tried out using `RowConverter::append` (e.g. clear + append) https://docs.rs/a

Re: [I] Reuse Rows allocation in SortPreservingMergeStream / `RowCursorStream` [datafusion]

2025-05-07 Thread via GitHub
acking-you commented on issue #15720: URL: https://github.com/apache/datafusion/issues/15720#issuecomment-2859387925 > Interesting @acking-you, do you have some code/branch you could share? is here: https://github.com/acking-you/arrow-datafusion/commit/f020522eab82f1ff8a7b42b97b1c9

Re: [PR] perf: Add performance tracing capability [datafusion-comet]

2025-05-07 Thread via GitHub
parthchandra commented on code in PR #1706: URL: https://github.com/apache/datafusion-comet/pull/1706#discussion_r2078115426 ## native/core/src/execution/tracing.rs: ## @@ -0,0 +1,111 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor licen

[I] Placeholder datatype not inferred after `LIMIT` clause [datafusion]

2025-05-07 Thread via GitHub
kczimm opened a new issue, #15978: URL: https://github.com/apache/datafusion/issues/15978 ### Describe the bug When using a parameterized query with a placeholder indicating the value in the `LIMIT` clause, the datatype is not inferred. ### To Reproduce ```rust let sc

[PR] [wip] feat: Add framework for supporting multiple telemetry providers [datafusion-comet]

2025-05-07 Thread via GitHub
andygrove opened a new pull request, #1722: URL: https://github.com/apache/datafusion-comet/pull/1722 ## Which issue does this PR close? Part of https://github.com/apache/datafusion-comet/issues/1718 ## Rationale for this change Experimenting with supporti

Re: [PR] [datafusion-spark] Implement ceil&floor function for spark [datafusion]

2025-05-07 Thread via GitHub
andygrove commented on code in PR #15958: URL: https://github.com/apache/datafusion/pull/15958#discussion_r2078773309 ## datafusion/spark/src/function/math/ceil_floor.rs: ## @@ -0,0 +1,720 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor

Re: [PR] Introducing mutation testing [datafusion]

2025-05-07 Thread via GitHub
github-actions[bot] commented on PR #14590: URL: https://github.com/apache/datafusion/pull/14590#issuecomment-2861252887 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or

Re: [PR] BUG: schema_force_view_type configuration not working for CREATE EXTERNAL TABLE [datafusion]

2025-05-07 Thread via GitHub
github-actions[bot] commented on PR #14922: URL: https://github.com/apache/datafusion/pull/14922#issuecomment-2861252728 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or

Re: [PR] [wip] attach diagnostic to duplicate table name error [datafusion]

2025-05-07 Thread via GitHub
github-actions[bot] closed pull request #14767: [wip] attach diagnostic to duplicate table name error URL: https://github.com/apache/datafusion/pull/14767 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to g

Re: [I] `native_datafusion/native_iceberg_compat` scans case sensitive [datafusion-comet]

2025-05-07 Thread via GitHub
wForget commented on issue #1574: URL: https://github.com/apache/datafusion-comet/issues/1574#issuecomment-2861237181 > Isn't this addressed by [#1575](https://github.com/apache/datafusion-comet/pull/1575) ? As commented in https://github.com/apache/datafusion-comet/pull/1575#discus

Re: [PR] [datafusion-spark] Implement ceil&floor function for spark [datafusion]

2025-05-07 Thread via GitHub
andygrove commented on code in PR #15958: URL: https://github.com/apache/datafusion/pull/15958#discussion_r2078776570 ## datafusion/sqllogictest/test_files/spark/math/ceil.slt: ## @@ -0,0 +1,141 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contribu

Re: [PR] chore: Comet + Iceberg (1.8.1) CI [datafusion-comet]

2025-05-07 Thread via GitHub
kazuyukitanimura commented on code in PR #1715: URL: https://github.com/apache/datafusion-comet/pull/1715#discussion_r2078969812 ## .github/workflows/iceberg_spark_test.yml: ## @@ -0,0 +1,80 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor

[PR] refactor: remove deprecated `AvroExec` [datafusion]

2025-05-07 Thread via GitHub
miroim opened a new pull request, #15987: URL: https://github.com/apache/datafusion/pull/15987 ## Which issue does this PR close? Part of #15950. ## Rationale for this change The `AvroExec` structure was deprecated in DataFusion 46 and is scheduled for removal. Dev

[PR] Postgresql ALTER TABLE operation: REPLICA IDENTITY [datafusion-sqlparser-rs]

2025-05-07 Thread via GitHub
MohamedAbdeen21 opened a new pull request, #1844: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1844 Add support for the psql-specific ALTER TABLE operation REPLICA IDENTITY Docs: https://www.postgresql.org/docs/current/sql-altertable.html

Re: [PR] fix: Bucketed scan fallback for native_datafusion Parquet scan [datafusion-comet]

2025-05-07 Thread via GitHub
andygrove commented on code in PR #1720: URL: https://github.com/apache/datafusion-comet/pull/1720#discussion_r2078441658 ## spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala: ## @@ -96,6 +96,13 @@ case class CometScanRule(session: SparkSession) extends Rule[Spark

Re: [PR] Feat: support bit_count function [datafusion-comet]

2025-05-07 Thread via GitHub
parthchandra commented on code in PR #1602: URL: https://github.com/apache/datafusion-comet/pull/1602#discussion_r2078437197 ## native/spark-expr/src/bitwise_funcs/bitwise_count.rs: ## @@ -0,0 +1,103 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more c

Re: [PR] Add support for table valued functions for SQL Server [datafusion-sqlparser-rs]

2025-05-07 Thread via GitHub
aharpervc commented on code in PR #1839: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1839#discussion_r2078461916 ## tests/sqlparser_mssql.rs: ## @@ -283,6 +294,50 @@ fn parse_create_function() { END\ "; let _ = ms().verified_stmt(create_functi

Re: [PR] WIP: Testing parquet page cache reader [datafusion]

2025-05-07 Thread via GitHub
alamb closed pull request #15903: WIP: Testing parquet page cache reader URL: https://github.com/apache/datafusion/pull/15903

Re: [PR] chore(deps): bump sha2 from 0.10.8 to 0.10.9 [datafusion]

2025-05-07 Thread via GitHub
alamb merged PR #15970: URL: https://github.com/apache/datafusion/pull/15970

Re: [PR] Support inferring new predicates to push down [datafusion]

2025-05-07 Thread via GitHub
alamb commented on PR #15906: URL: https://github.com/apache/datafusion/pull/15906#issuecomment-2860308364 Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is read

Re: [PR] minor: Warn if memory pool is dropped with bytes still reserved [datafusion-comet]

2025-05-07 Thread via GitHub
codecov-commenter commented on PR #1721: URL: https://github.com/apache/datafusion-comet/pull/1721#issuecomment-2860400666 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1721?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [I] `native_datafusion/native_iceberg_compat` scans case sensitive [datafusion-comet]

2025-05-07 Thread via GitHub
parthchandra commented on issue #1574: URL: https://github.com/apache/datafusion-comet/issues/1574#issuecomment-2860407714 Isn't this addressed by #1575 ?
