Re: [PR] Blog post for DataFusion 46.0.0 [datafusion-site]

2025-03-24 Thread via GitHub
berkaysynnada commented on PR #64: URL: https://github.com/apache/datafusion-site/pull/64#issuecomment-274826 This looks good to me, WDYT @alamb @ozankabak ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL

Re: [PR] fix: Unconditionally wrap UNION BY NAME input nodes w/ `Projection` [datafusion]

2025-03-24 Thread via GitHub
rkrishn7 commented on code in PR #15242: URL: https://github.com/apache/datafusion/pull/15242#discussion_r2010510700 ## datafusion/sql/tests/sql_integration.rs: ## @@ -1901,8 +1901,9 @@ fn union_by_name_different_columns() { \nProjection: NULL AS Int64(1), order_id\

Re: [I] March 17, 2025: This week(s) in DataFusion [datafusion]

2025-03-24 Thread via GitHub
alamb commented on issue #15269: URL: https://github.com/apache/datafusion/issues/15269#issuecomment-2748698965 Another blog post by @XiangpengHao about how to build S3 select in 400 lines of Rust (and FDAP) https://blog.xiangpeng.systems/posts/build-s3-select/

Re: [I] Unsupported OS/arch [datafusion-comet]

2025-03-24 Thread via GitHub
parthchandra commented on issue #1552: URL: https://github.com/apache/datafusion-comet/issues/1552#issuecomment-2748709450 > `org.apache.comet.serde.QueryPlanSerde Comet native execution is disabled due to: unsupported Spark partitioning: org.apache.spark.sql.catalyst.plans.physical.RangeP

Re: [PR] Add support for DISTINCT + ORDER BY in `ARRAY_AGG` [datafusion]

2025-03-24 Thread via GitHub
gabotechs commented on PR #14413: URL: https://github.com/apache/datafusion/pull/14413#issuecomment-2748975545 > I also encourage you to help review other PRs that are flowing into the repo -- this will likely make it easier for you to get people to review your PRs 👍 Sounds fair, I'

Re: [I] Change mapping of SQL `VARCHAR` from `Utf8` to `Utf8View` [datafusion]

2025-03-24 Thread via GitHub
zhuqi-lucas commented on issue #15096: URL: https://github.com/apache/datafusion/issues/15096#issuecomment-2742822184 Update: most of the tasks are resolved. I am doing more performance investigation and testing of changing the default to Utf8View for all varchar.

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-24 Thread via GitHub
suibianwanwank commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2006221885 ## datafusion/physical-plan/src/topk/mod.rs: ## @@ -644,10 +737,72 @@ impl RecordBatchStore { } } +struct TopKDynamicFilterSource { +/// The Top

Re: [PR] Always use `PartitionMode::Auto` in planner [datafusion]

2025-03-24 Thread via GitHub
Dandandan commented on PR #15339: URL: https://github.com/apache/datafusion/pull/15339#issuecomment-2749311747 FYI @ozankabak this is now ready

Re: [PR] Optimize CASE expression for "expr or expr" usage. [datafusion]

2025-03-24 Thread via GitHub
alamb commented on PR #13953: URL: https://github.com/apache/datafusion/pull/13953#issuecomment-2749108694 @findepi has located an issue in this PR -- please see - https://github.com/apache/datafusion/issues/15384 - https://github.com/apache/datafusion/pull/15390

Re: [PR] Restore lazy evaluation of fallible CASE [datafusion]

2025-03-24 Thread via GitHub
alamb commented on code in PR #15390: URL: https://github.com/apache/datafusion/pull/15390#discussion_r2010747252 ## datafusion/sqllogictest/test_files/case.slt: ## @@ -467,7 +467,18 @@ FROM t; [{foo: blarg}] +query II +SELECT v, CASE WHEN v != 0 THEN 10/v ELSE 42 END F

Re: [PR] Migrate physical plan tests to `insta` (Part-2) [datafusion]

2025-03-24 Thread via GitHub
alamb commented on PR #15364: URL: https://github.com/apache/datafusion/pull/15364#issuecomment-2749132020 Thanks again @Shreyaskr1409 and @xudong963

Re: [PR] Documentation: Plan custom expressions [datafusion]

2025-03-24 Thread via GitHub
alamb merged PR #15353: URL: https://github.com/apache/datafusion/pull/15353

Re: [I] Migrate physical plan tests to `insta` [datafusion]

2025-03-24 Thread via GitHub
blaginin commented on issue #15248: URL: https://github.com/apache/datafusion/issues/15248#issuecomment-2749195939 oh, also joins I think (e.g. `join_splitted_batch`)

Re: [D] More thorough contribution guideline [datafusion]

2025-03-24 Thread via GitHub
GitHub user alamb added a comment to the discussion: More thorough contribution guideline We do have this guide for API stability (aka trying to reduce breaking changes) https://datafusion.apache.org/contributor-guide/api-health.html > Use 'cargo-semver-checks' to detect unintentional API bre

Re: [PR] Perf: Support Utf8View datatype single column comparisons for SortPreservingMergeStream [datafusion]

2025-03-24 Thread via GitHub
alamb commented on PR #15348: URL: https://github.com/apache/datafusion/pull/15348#issuecomment-2749309575 > utf8_view performs worse for high cardinality cases: I think it would be a great project to improve the performance of utf8_view for sorting high cardinality - maybe we can add

[I] Use spill manager in row hasher [datafusion]

2025-03-24 Thread via GitHub
comphead opened a new issue, #15401: URL: https://github.com/apache/datafusion/issues/15401 Follow-up on #15355. In #15355 the spill manager was introduced, unifying the spill API and metrics. It would be good to reuse the API in the row hasher. _Originally posted by @comphead in

Re: [PR] Add all missing table options to be handled in any order [datafusion-sqlparser-rs]

2025-03-24 Thread via GitHub
mvzink commented on code in PR #1747: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1747#discussion_r2011090576 ## src/dialect/mysql.rs: ## @@ -145,6 +153,280 @@ impl Dialect for MySqlDialect { fn supports_comma_separated_set_assignments(&self) -> bool {

[PR] Blog post for DataFusion 46.0.0 [datafusion-site]

2025-03-24 Thread via GitHub
oznur-synnada opened a new pull request, #64: URL: https://github.com/apache/datafusion-site/pull/64 (no comment)

Re: [PR] Support Avg distinct [datafusion]

2025-03-24 Thread via GitHub
qazxcdswe123 commented on code in PR #15356: URL: https://github.com/apache/datafusion/pull/15356#discussion_r2010196071 ## datafusion/functions-aggregate-common/src/aggregate/avg_distinct/numeric.rs: ## @@ -0,0 +1,109 @@ +// Licensed to the Apache Software Foundation (ASF) unde

Re: [PR] chore(deps): bump rust_decimal from 1.37.0 to 1.37.1 [datafusion]

2025-03-24 Thread via GitHub
findepi merged PR #15378: URL: https://github.com/apache/datafusion/pull/15378

Re: [PR] fix: Unconditionally wrap UNION BY NAME input nodes w/ `Projection` [datafusion]

2025-03-24 Thread via GitHub
Omega359 commented on code in PR #15242: URL: https://github.com/apache/datafusion/pull/15242#discussion_r2010222655 ## datafusion/sql/tests/sql_integration.rs: ## @@ -1901,8 +1901,9 @@ fn union_by_name_different_columns() { \nProjection: NULL AS Int64(1), order_id\

Re: [PR] Only unnest source for `EmptyRelation` [datafusion]

2025-03-24 Thread via GitHub
goldmedal merged PR #15159: URL: https://github.com/apache/datafusion/pull/15159

Re: [PR] remove the duplicate test for unparser [datafusion]

2025-03-24 Thread via GitHub
goldmedal commented on PR #15385: URL: https://github.com/apache/datafusion/pull/15385#issuecomment-2748283824 Thanks @blaginin @findepi @xudong963 👍

Re: [PR] Migrate datasource tests to insta [datafusion]

2025-03-24 Thread via GitHub
blaginin commented on code in PR #15258: URL: https://github.com/apache/datafusion/pull/15258#discussion_r2010279421 ## datafusion/core/src/datasource/physical_plan/parquet.rs: ## @@ -608,19 +609,19 @@ mod tests { .round_trip_to_batches(vec![batch1, batch2])

Re: [PR] fix: `core_expressions` feature flag broken, move `overlay` into `core` functions [datafusion]

2025-03-24 Thread via GitHub
alamb commented on PR #15217: URL: https://github.com/apache/datafusion/pull/15217#issuecomment-2749053698 I merged up from main and removed a reference in the CI tests to the now removed `core_expressions` feature

Re: [PR] Triggering extended tests through PR comment [datafusion]

2025-03-24 Thread via GitHub
alamb commented on PR #15101: URL: https://github.com/apache/datafusion/pull/15101#issuecomment-2749403724 When I tested this on my fork, it seems to have run extended tests for my PR without any prompting: - https://github.com/alamb/datafusion/pull/32#issuecomment-2749402970

Re: [PR] fix: write hive partitions for any int/uint/float [datafusion]

2025-03-24 Thread via GitHub
alamb commented on PR #15337: URL: https://github.com/apache/datafusion/pull/15337#issuecomment-2749136273 Thank you @xudong963 and @Omega359 for the review @christophermcdermott thank you for the very nice first contribution

Re: [PR] fix: `core_expressions` feature flag broken, move `overlay` into `core` functions [datafusion]

2025-03-24 Thread via GitHub
alamb merged PR #15217: URL: https://github.com/apache/datafusion/pull/15217

[I] Migrate subtrait tests to insta [datafusion]

2025-03-24 Thread via GitHub
blaginin opened a new issue, #15398: URL: https://github.com/apache/datafusion/issues/15398 In https://github.com/apache/datafusion/issues/15178, we're switching hard-coded constants in tests to `insta`. This issue targets updating **Substrait tests** (`datafusion/substrait`).

Re: [PR] Migrate physical plan tests to `insta` (Part-2) [datafusion]

2025-03-24 Thread via GitHub
alamb commented on PR #15364: URL: https://github.com/apache/datafusion/pull/15364#issuecomment-2749131666 Passed on rerun. 🚀

Re: [I] Migrate physical plan tests to `insta` [datafusion]

2025-03-24 Thread via GitHub
Shreyaskr1409 commented on issue #15248: URL: https://github.com/apache/datafusion/issues/15248#issuecomment-2749142273 > [@Shreyaskr1409](https://github.com/Shreyaskr1409) are there any more tests to migrate? Or is this ticket complete? @alamb tests for `sort` are still left. The

Re: [PR] Migrate physical plan tests to `insta` (Part-3 / Final) [datafusion]

2025-03-24 Thread via GitHub
jayzhan211 merged PR #15399: URL: https://github.com/apache/datafusion/pull/15399

Re: [PR] Refactor: add `FileGroup` structure for `Vec` [datafusion]

2025-03-24 Thread via GitHub
xudong963 commented on code in PR #15379: URL: https://github.com/apache/datafusion/pull/15379#discussion_r2011237231 ## datafusion/datasource/src/file_groups.rs: ## @@ -354,6 +361,115 @@ impl FileGroupPartitioner { } } +/// Represents a group of partitioned files that'l

Re: [PR] Improve performance of `first_value` by implementing special `GroupsAccumulator` [datafusion]

2025-03-24 Thread via GitHub
2010YOUY01 commented on PR #15266: URL: https://github.com/apache/datafusion/pull/15266#issuecomment-2749981776 I haven't been following the recent conversations regarding hashmap optimization, but I also feel, if it needs pre-aggregate to make the low-cardinality case run faster, there mig

Re: [PR] Triggering extended tests through PR comment [datafusion]

2025-03-24 Thread via GitHub
alamb commented on PR #15101: URL: https://github.com/apache/datafusion/pull/15101#issuecomment-2749093157 Thanks @danila-b and @Omega359 🙏 -- I started testing this on my fork here: - https://github.com/alamb/datafusion/pull/32

Re: [PR] fix: `core_expressions` feature flag broken, move `overlay` into `core` functions [datafusion]

2025-03-24 Thread via GitHub
alamb commented on PR #15217: URL: https://github.com/apache/datafusion/pull/15217#issuecomment-2749136863 Thanks again @shruti2522

Re: [I] Migrate physical plan tests to `insta` [datafusion]

2025-03-24 Thread via GitHub
blaginin commented on issue #15248: URL: https://github.com/apache/datafusion/issues/15248#issuecomment-2749139298 I think there are a bit more to update, e.g. https://github.com/apache/datafusion/blob/0ff89844a2f2c7a3bbcaff995db7e464deaeb997/datafusion/physical-plan/src/windows/boun

Re: [PR] Always use `PartitionMode::Auto` in planner [datafusion]

2025-03-24 Thread via GitHub
Dandandan commented on code in PR #15339: URL: https://github.com/apache/datafusion/pull/15339#discussion_r2010754139 ## datafusion/sqllogictest/test_files/explain_tree.slt: ## @@ -345,63 +345,68 @@ FROM physical_plan 01)┌───┐ -02)│CoalesceBat

Re: [PR] Blog post on Parquet filter pushdown [datafusion-site]

2025-03-24 Thread via GitHub
alamb commented on PR #61: URL: https://github.com/apache/datafusion-site/pull/61#issuecomment-2749290808 I am hoping to publish this tomorrow or Wednesday -- to try and draw out the writing 🚋 from @XiangpengHao 😆 We need to let this one settle for a day https://blog.xiangpeng.syste

Re: [PR] fix: handle duplicate WindowFunction expressions in Substrait consumer [datafusion]

2025-03-24 Thread via GitHub
Blizzara commented on PR #15211: URL: https://github.com/apache/datafusion/pull/15211#issuecomment-273805 > I merged this PR up from main to resolve a conflict and plan to merge it when the CI passes Thanks @alamb , should be good to go!

Re: [PR] feat(datafusion-functions-aggregate): add support for lists and other nested types in `min` and `max` [datafusion]

2025-03-24 Thread via GitHub
github-actions[bot] closed pull request #13991: feat(datafusion-functions-aggregate): add support for lists and other nested types in `min` and `max` URL: https://github.com/apache/datafusion/pull/13991

Re: [I] Speed up hash partitioning [datafusion]

2025-03-24 Thread via GitHub
alamb commented on issue #6822: URL: https://github.com/apache/datafusion/issues/6822#issuecomment-2749338072 I believe the plans also effectively hash the group keys three times for aggregate plans: 1. initial hash to find the group in the initial aggregate phase 2. hash to compute th

Re: [PR] fix: Redundant files spilled during external sort + introduce `SpillManager` [datafusion]

2025-03-24 Thread via GitHub
2010YOUY01 commented on PR #15355: URL: https://github.com/apache/datafusion/pull/15355#issuecomment-2749933166 > > > > 3. After we have collected 1MB of merged batch, one spill will be triggered. And this 1MB space will be cleared, the merging can continue. > > > >**Inefficiency:** No

Re: [PR] Perf: Support Utf8View datatype single column comparisons for SortPreservingMergeStream [datafusion]

2025-03-24 Thread via GitHub
zhuqi-lucas commented on PR #15348: URL: https://github.com/apache/datafusion/pull/15348#issuecomment-2749930210 > Thank you for the work on better Utf8View support. I tried one sort benchmark with sort-preserving merging on a single `Utf8View` column, but it gets slower: > > Reprodu

Re: [PR] fix: Redundant files spilled during external sort + introduce `SpillManager` [datafusion]

2025-03-24 Thread via GitHub
2010YOUY01 commented on PR #15355: URL: https://github.com/apache/datafusion/pull/15355#issuecomment-2749930341 > Thanks @2010YOUY01 love tests. I think we need to move other ops like sort_merge_join or row hash to spilled manager? > > I'll create tickets if so > > #15401 #15

[I] Investigate why Sort-preserving merging on a single `Utf8View` column will cause sort_tpch q3 slow [datafusion]

2025-03-24 Thread via GitHub
zhuqi-lucas opened a new issue, #15403: URL: https://github.com/apache/datafusion/issues/15403 ### Is your feature request related to a problem or challenge? See the comments, https://github.com/apache/datafusion/pull/15348#issuecomment-2743150385 Sort-preserving mergin

Re: [PR] fix: Unconditionally wrap UNION BY NAME input nodes w/ `Projection` [datafusion]

2025-03-24 Thread via GitHub
rkrishn7 commented on code in PR #15242: URL: https://github.com/apache/datafusion/pull/15242#discussion_r2011222623 ## datafusion/sql/tests/sql_integration.rs: ## @@ -1901,8 +1901,9 @@ fn union_by_name_different_columns() { \nProjection: NULL AS Int64(1), order_id\

Re: [PR] fix: Unconditionally wrap UNION BY NAME input nodes w/ `Projection` [datafusion]

2025-03-24 Thread via GitHub
rkrishn7 commented on code in PR #15242: URL: https://github.com/apache/datafusion/pull/15242#discussion_r2011223553 ## datafusion/sqllogictest/test_files/union_by_name.slt: ## @@ -152,22 +152,22 @@ NULL 4 # Limit query III -SELECT 1 UNION BY NAME SELECT * FROM unnest(range(

Re: [PR] fix: Unconditionally wrap UNION BY NAME input nodes w/ `Projection` [datafusion]

2025-03-24 Thread via GitHub
rkrishn7 commented on code in PR #15242: URL: https://github.com/apache/datafusion/pull/15242#discussion_r2011223823 ## datafusion/sqllogictest/test_files/union_by_name.slt: ## @@ -152,22 +152,22 @@ NULL 4 # Limit query III -SELECT 1 UNION BY NAME SELECT * FROM unnest(range(

[PR] FIX : some benchmarks are failing [datafusion]

2025-03-24 Thread via GitHub
getChan opened a new pull request, #15367: URL: https://github.com/apache/datafusion/pull/15367 ## Which issue does this PR close? - Closes #15213 . ## Rationale for this change It is not certain, but it seems that the tokio runtime creation code should be includ

Re: [I] Migrate `datafusion/sql` tests to `insta` [datafusion]

2025-03-24 Thread via GitHub
blaginin commented on issue #15397: URL: https://github.com/apache/datafusion/issues/15397#issuecomment-2749168478 This ticket is a bit more tricky than the others, because there are several places where direct swap to `assert_snapshot` won't work - I listed them above. For `quick_te

Re: [I] Migrate physical plan tests to `insta` [datafusion]

2025-03-24 Thread via GitHub
Shreyaskr1409 commented on issue #15248: URL: https://github.com/apache/datafusion/issues/15248#issuecomment-2749205655 > oh, also joins I think (e.g. `join_splitted_batch`) I will wrap up everything and cross-check in the next PR. We can proceed with any changes to be made there.

[PR] Migrate physical plan tests to `insta` (Part-3 / Final) [datafusion]

2025-03-24 Thread via GitHub
Shreyaskr1409 opened a new pull request, #15399: URL: https://github.com/apache/datafusion/pull/15399 ## Which issue does this PR close? - Closes #15248 . ## Rationale for this change Migrate physical plan tests to insta ## What changes are included in this PR?

Re: [PR] Restore lazy evaluation of fallible CASE [datafusion]

2025-03-24 Thread via GitHub
alamb commented on PR #15390: URL: https://github.com/apache/datafusion/pull/15390#issuecomment-2749299083 Here are my benchmark results. My conclusion is no-discernable-difference ``` group findepi_lazy-case main -

Re: [PR] Chore: simplify array related functions impl [datafusion-comet]

2025-03-24 Thread via GitHub
codecov-commenter commented on PR #1490: URL: https://github.com/apache/datafusion-comet/pull/1490#issuecomment-2749198942 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1490?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] Perf: Support Utf8View datatype single column comparisons for SortPreservingMergeStream [datafusion]

2025-03-24 Thread via GitHub
alamb commented on code in PR #15348: URL: https://github.com/apache/datafusion/pull/15348#discussion_r2010872746 ## datafusion/physical-plan/src/sorts/cursor.rs: ## @@ -294,16 +294,44 @@ impl CursorValues for StringViewArray { } fn eq(l: &Self, l_idx: usize, r: &Sel

Re: [I] `core_expressions` feature is broken in the `datafusion-functions` [datafusion]

2025-03-24 Thread via GitHub
alamb closed issue #15207: `core_expressions` feature is broken in the `datafusion-functions` URL: https://github.com/apache/datafusion/issues/15207

Re: [PR] fix: make register_object_store use same session_env as file scan [datafusion-comet]

2025-03-24 Thread via GitHub
kazuyukitanimura commented on PR #1555: URL: https://github.com/apache/datafusion-comet/pull/1555#issuecomment-2749132027 @parthchandra https://github.com/apache/datafusion-comet/pull/1555#discussion_r2006926052

Re: [PR] fix: Redundant files spilled during external sort + introduce `SpillManager` [datafusion]

2025-03-24 Thread via GitHub
alamb merged PR #15355: URL: https://github.com/apache/datafusion/pull/15355

Re: [PR] fix: Redundant files spilled during external sort + introduce `SpillManager` [datafusion]

2025-03-24 Thread via GitHub
alamb commented on PR #15355: URL: https://github.com/apache/datafusion/pull/15355#issuecomment-2749165158 > > > 3. After we have collected 1MB of merged batch, one spill will be triggered. And this 1MB space will be cleared, the merging can continue. > > >**Inefficiency:** Now `Extern

Re: [PR] fix: Redundant files spilled during external sort + introduce `SpillManager` [datafusion]

2025-03-24 Thread via GitHub
alamb commented on PR #15355: URL: https://github.com/apache/datafusion/pull/15355#issuecomment-2749167836 It looks to me like there are 4 approvals of this PR and a bunch of potential work stacked up on it, so let's merge it to keep the code flowing Thank you everyone for the reviews

Re: [PR] Refactor: add `FileGroup` structure [datafusion]

2025-03-24 Thread via GitHub
alamb commented on code in PR #15379: URL: https://github.com/apache/datafusion/pull/15379#discussion_r2010801289 ## datafusion/datasource/src/file_groups.rs: ## @@ -354,6 +361,115 @@ impl FileGroupPartitioner { } } +/// Represents a group of partitioned files that'll be

Re: [PR] Documentation: Plan custom expressions [datafusion]

2025-03-24 Thread via GitHub
Jiashu-Hu commented on code in PR #15353: URL: https://github.com/apache/datafusion/pull/15353#discussion_r2010728423 ## docs/source/library-user-guide/adding-udfs.md: ## @@ -1160,6 +1160,89 @@ async fn main() -> Result<()> { // +---+ ``` +## Custom Expression Planning + +Da

Re: [PR] Add "end to end parquet reading test" for WASM [datafusion]

2025-03-24 Thread via GitHub
alamb merged PR #15362: URL: https://github.com/apache/datafusion/pull/15362

[I] Postgres dialect fails to parse "~ any(...)" [datafusion-sqlparser-rs]

2025-03-24 Thread via GitHub
romanb opened a new issue, #1776: URL: https://github.com/apache/datafusion-sqlparser-rs/issues/1776 The following query ```sql select a from aa where a ~ any(array['x']) ``` yields ``` Expected one of [=, >, <, =>, =<, !=] as comparison operator, found: ~ at Line: 1, Colu

Re: [D] More thorough contribution guideline [datafusion]

2025-03-24 Thread via GitHub
GitHub user alamb added a comment to the discussion: More thorough contribution guideline > Use 'cargo-semver-checks' to detect unintentional API breakages. Smallest > things can break APIs in ways we can not predict. > [Here](https://predr.ag/blog/semver-in-rust-tooling-breakage-and-edge-cas

Re: [D] More thorough contribution guideline [datafusion]

2025-03-24 Thread via GitHub
GitHub user alamb added a comment to the discussion: More thorough contribution guideline In general I think there will always be a tension between: 1. What is good for people developing DataFusion (aka making improvements as few constraints as possible) 2. What is good for specific projects (

Re: [I] Scalars are too verbose [datafusion]

2025-03-24 Thread via GitHub
blaginin commented on issue #15395: URL: https://github.com/apache/datafusion/issues/15395#issuecomment-2749417038 Thank you Bruce!!! Btw adding _column type_ (second line of the first row) can actually be a good thing to implement separately

Re: [PR] Move `DataSink` to `datasource` and add session crate [datafusion]

2025-03-24 Thread via GitHub
berkaysynnada commented on PR #15371: URL: https://github.com/apache/datafusion/pull/15371#issuecomment-2749421023 I have some work on this (I'll try not to separate the session and to make datasourceExec independent of catalog)

Re: [PR] Support Avg distinct [datafusion]

2025-03-24 Thread via GitHub
jayzhan211 commented on code in PR #15356: URL: https://github.com/apache/datafusion/pull/15356#discussion_r2011076901 ## datafusion/functions-aggregate/src/average.rs: ## @@ -60,6 +64,17 @@ make_udaf_expr_and_func!( avg_udaf ); +pub fn avg_distinct(expr: Expr) -> Expr {

Re: [PR] Use `equals_datatype` to compare type when type coercion [datafusion]

2025-03-24 Thread via GitHub
jayzhan211 commented on code in PR #15366: URL: https://github.com/apache/datafusion/pull/15366#discussion_r2011064209 ## datafusion/expr-common/src/type_coercion/binary.rs: ## @@ -708,7 +708,7 @@ pub fn try_type_union_resolution_with_struct( /// strings. For example when comp

Re: [I] Sort query won't get round-robin repartitioned if input is `MemTable` [datafusion]

2025-03-24 Thread via GitHub
wiedld commented on issue #15088: URL: https://github.com/apache/datafusion/issues/15088#issuecomment-2749619788 I have a fix. Should be up a bit later today.

Re: [PR] fix: isCometEnabled name conflict [datafusion-comet]

2025-03-24 Thread via GitHub
codecov-commenter commented on PR #1569: URL: https://github.com/apache/datafusion-comet/pull/1569#issuecomment-2749898991 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1569?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] Add substrait tpch round trip tests from sql query [datafusion]

2025-03-24 Thread via GitHub
github-actions[bot] closed pull request #13888: Add substrait tpch round trip tests from sql query URL: https://github.com/apache/datafusion/pull/13888

Re: [I] Perf: Support automatically concat_batches for sort which will improve performance [datafusion]

2025-03-24 Thread via GitHub
zhuqi-lucas commented on issue #15375: URL: https://github.com/apache/datafusion/issues/15375#issuecomment-2749856474 Thank you @2010YOUY01 for the review and good suggestion; I will improve my POC code and add more testing.

[PR] refactor: consistent null handling in coercible signatures [datafusion]

2025-03-24 Thread via GitHub
alan910127 opened a new pull request, #15404: URL: https://github.com/apache/datafusion/pull/15404 ## Which issue does this PR close? - Closes #15013. ## Rationale for this change Currently, there is a special case for null handling in `TypeSignatureClass::m

Re: [PR] upgraded spark 3.5.4 to 3.5.5 [datafusion-comet]

2025-03-24 Thread via GitHub
wForget commented on code in PR #1565: URL: https://github.com/apache/datafusion-comet/pull/1565#discussion_r2011249870 ## spark/src/main/spark-3.5/org/apache/spark/sql/comet/shims/ShimCometScanExec.scala: ## @@ -55,15 +55,15 @@ trait ShimCometScanExec { protected def isNeede

Re: [PR] refactor: consistent null handling in coercible signatures [datafusion]

2025-03-24 Thread via GitHub
alan910127 commented on code in PR #15404: URL: https://github.com/apache/datafusion/pull/15404#discussion_r2011249477 ## datafusion/sqllogictest/test_files/information_schema.slt: ## @@ -736,8 +736,11 @@ select specific_name, data_type, ordinal_position, parameter_mode, rid fr

Re: [PR] fix: Unconditionally wrap UNION BY NAME input nodes w/ `Projection` [datafusion]

2025-03-24 Thread via GitHub
Omega359 commented on PR #15242: URL: https://github.com/apache/datafusion/pull/15242#issuecomment-2748440730 > Unfortunately I don't think the problem is solved there. Upon further investigation, it seems like another problem exists with there being information loss between the logical and

Re: [PR] Blog post for DataFusion 46.0.0 [datafusion-site]

2025-03-24 Thread via GitHub
ozankabak commented on code in PR #64: URL: https://github.com/apache/datafusion-site/pull/64#discussion_r2010334544 ## content/blog/2025-03-24-datafusion-46.0.0.md: ## @@ -0,0 +1,92 @@ +--- +layout: post +title: Apache DataFusion 46.0.0 Released +date: 2025-03-24 +author: oznur

[I] Weekly Plan (Andrew Lamb) March 24, 2025 [datafusion]

2025-03-24 Thread via GitHub
alamb opened a new issue, #15393: URL: https://github.com/apache/datafusion/issues/15393 This is an attempt to organize myself and make what I plan to work on more visible ## Weekly High Level Goals - [ ] Complete making tpch data generator screaming fast to generate parquet with

Re: [I] Weekly Plan (Andrew Lamb) March 17, 2025 [datafusion]

2025-03-24 Thread via GitHub
alamb commented on issue #15274: URL: https://github.com/apache/datafusion/issues/15274#issuecomment-2748496834 Update here is we completed the initial version of `tree` explain plan! - https://github.com/apache/datafusion/issues/14914 Also the arrow release is out Next week

Re: [PR] Reuse alias if possible [datafusion]

2025-03-24 Thread via GitHub
blaginin commented on code in PR #14781: URL: https://github.com/apache/datafusion/pull/14781#discussion_r2010529772 ## datafusion/sql/src/unparser/plan.rs: ## @@ -860,8 +877,12 @@ impl Unparser<'_> { query: &mut Option, select: &mut SelectBuilder, rel

Re: [PR] feat: pushdown filter for native_iceberg_compat [datafusion-comet]

2025-03-24 Thread via GitHub
mbutrovich commented on PR #1566: URL: https://github.com/apache/datafusion-comet/pull/1566#issuecomment-2748737890 Note that we actually expect performance to be worse with pushdown with the `DataSourceExec`-driven native decoders at the moment. @andygrove tested `native_datafusion` with

Re: [PR] fix: Unconditionally wrap UNION BY NAME input nodes w/ `Projection` [datafusion]

2025-03-24 Thread via GitHub
rkrishn7 commented on code in PR #15242: URL: https://github.com/apache/datafusion/pull/15242#discussion_r2010555721 ## datafusion/sql/tests/sql_integration.rs: ## @@ -1901,8 +1901,9 @@ fn union_by_name_different_columns() { \nProjection: NULL AS Int64(1), order_id\

Re: [PR] Migrate datasource tests to insta [datafusion]

2025-03-24 Thread via GitHub
xudong963 commented on code in PR #15258: URL: https://github.com/apache/datafusion/pull/15258#discussion_r2010229831 ## datafusion/core/src/datasource/file_format/csv.rs: ## @@ -708,13 +717,17 @@ mod tests { let query_result = ctx.sql(query).await?.collect().await?;

Re: [PR] Blog post on Parquet filter pushdown [datafusion-site]

2025-03-24 Thread via GitHub
XiangpengHao commented on code in PR #61: URL: https://github.com/apache/datafusion-site/pull/61#discussion_r2010253442 ## content/blog/2025-03-21-parquet-pushdown.md: ## @@ -0,0 +1,312 @@ +--- +layout: post +title: Efficient Filter Pushdown in Parquet +date: 2025-03-21 +author:

[PR] Add "end to end parquet reading test" for WASM [datafusion]

2025-03-24 Thread via GitHub
jsai28 opened a new pull request, #15362: URL: https://github.com/apache/datafusion/pull/15362 ## Which issue does this PR close? - Closes #15357. ## Rationale for this change End to end parquet reading test for WASM. ## What changes are included in this PR? ## Are thes

[I] Shuffle files written by native CometExchange operator cannot be cleaned [datafusion-comet]

2025-03-24 Thread via GitHub
Kontinuation opened a new issue, #1567: URL: https://github.com/apache/datafusion-comet/issues/1567 ### Describe the bug Running TPC-H SF=100 on a single node repeatedly will eventually run out of disk when native or auto shuffle mode is enabled. The shuffle files generated when runn

Re: [I] Weekly Plan (Andrew Lamb) March 17, 2025 [datafusion]

2025-03-24 Thread via GitHub
alamb closed issue #15274: Weekly Plan (Andrew Lamb) March 17, 2025 URL: https://github.com/apache/datafusion/issues/15274

Re: [I] Push Dynamic Join Predicates into Scan ("Sideways Information Passing", etc) [datafusion]

2025-03-24 Thread via GitHub
adriangb commented on issue #7955: URL: https://github.com/apache/datafusion/issues/7955#issuecomment-2748278512 I think that since #15301 pushes an arbitrary `Arc` down, min/max, inlist, etc. should all be doable 😄

Re: [PR] Fix array_has_all and array_has_any with empty array [datafusion]

2025-03-24 Thread via GitHub
LuQQiu commented on PR #15039: URL: https://github.com/apache/datafusion/pull/15039#issuecomment-2748650838 Thanks for all the reviews and suggestions!

Re: [PR] Fix empty aggregation function count() in Substrait [datafusion]

2025-03-24 Thread via GitHub
gabotechs commented on code in PR #15345: URL: https://github.com/apache/datafusion/pull/15345#discussion_r2007643274 ## datafusion/substrait/src/logical_plan/consumer.rs: ## @@ -1975,6 +1975,13 @@ pub async fn from_substrait_agg_func( let args = from_substrait_func_args(

[PR] Minor: Keep debug symbols for `release-nonlto` build [datafusion]

2025-03-24 Thread via GitHub
2010YOUY01 opened a new pull request, #15350: URL: https://github.com/apache/datafusion/pull/15350 ## Which issue does this PR close? - Closes #. ## Rationale for this change `release-nonlto` build produces binaries with performance close to `--release` build

Re: [I] Make `DiskManagerBuilder` to construct DiskManagers [datafusion]

2025-03-24 Thread via GitHub
alamb commented on issue #15319: URL: https://github.com/apache/datafusion/issues/15319#issuecomment-2740835113 Agreed -- I think this API change is beneficial regardless

Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]

2025-03-24 Thread via GitHub
friendlymatthew commented on code in PR #15361: URL: https://github.com/apache/datafusion/pull/15361#discussion_r2009208756 ## datafusion/functions/src/datetime/to_char.rs: ## @@ -277,7 +282,25 @@ fn _to_char_array(args: &[ColumnarValue]) -> Result { let result = forma

Re: [PR] simplify `array_has` UDF to `InList` expr when haystack is constant [datafusion]

2025-03-24 Thread via GitHub
davidhewitt commented on code in PR #15354: URL: https://github.com/apache/datafusion/pull/15354#discussion_r2010026615 ## datafusion/functions-nested/src/array_has.rs: ## @@ -121,6 +123,43 @@ impl ScalarUDFImpl for ArrayHas { Ok(DataType::Boolean) } +fn simp

Re: [PR] Restore lazy evaluation of fallible CASE [datafusion]

2025-03-24 Thread via GitHub
findepi commented on PR #15390: URL: https://github.com/apache/datafusion/pull/15390#issuecomment-2748049344 Alternatively, we could change `Literal::evaluate_selection` to return null scalar when selection is empty. However, for this to be effective, it would require adding evaluate_se
