[PR] chore(deps): bump aws-config from 1.6.0 to 1.6.1 [datafusion]
dependabot[bot] opened a new pull request, #15470: URL: https://github.com/apache/datafusion/pull/15470

Bumps [aws-config](https://github.com/smithy-lang/smithy-rs) from 1.6.0 to 1.6.1. See the full diff in the [compare view](https://github.com/smithy-lang/smithy-rs/commits).

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

---

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] The average time compute for clickbench query should not inside the query iterator [datafusion]
zhuqi-lucas opened a new issue, #15471: URL: https://github.com/apache/datafusion/issues/15471

### Describe the bug

The average time computation for each ClickBench query should not be inside the query iterator; it was mistakenly added inside the iterator.

### To Reproduce

_No response_

### Expected behavior

_No response_

### Additional context

_No response_
Re: [I] Enable `split_file_groups_by_statistics` by default [datafusion]
xudong963 commented on issue #10336: URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2760506147

Fyi, I'm working on it.
Re: [PR] fix: the average time for clickbench query compute should outside the iterator loop [datafusion]
zhuqi-lucas commented on PR #15472: URL: https://github.com/apache/datafusion/pull/15472#issuecomment-2760512944

cc @xudong963 @2010YOUY01
Re: [I] The average time compute for clickbench query should not inside the query iterator [datafusion]
zhuqi-lucas commented on issue #15471: URL: https://github.com/apache/datafusion/issues/15471#issuecomment-2760505849

take
[PR] Improve `split_groups_by_statistics` method [datafusion]
xudong963 opened a new pull request, #15473: URL: https://github.com/apache/datafusion/pull/15473

## Which issue does this PR close?

- Closes https://github.com/apache/datafusion/issues/10336#issuecomment-2758082825

## Rationale for this change

As @suremarc and @leoyvens said, we should make the method load-balanced to preserve parallelism.

## What changes are included in this PR?

Introduced a new method that splits file groups into new groups based on statistics to enable efficient parallel processing.

## Are these changes tested?

TBD

## Are there any user-facing changes?

Users need to choose the proper `split_groups_by_statistics` method for their use cases.
Re: [PR] Improve `split_groups_by_statistics` method [datafusion]
xudong963 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018151341

## datafusion/datasource/src/file_scan_config.rs:

```diff
@@ -575,6 +575,95 @@ impl FileScanConfig {
         })
     }

+    /// Splits file groups into new groups based on statistics to enable efficient parallel processing.
+    ///
+    /// The method distributes files across a target number of partitions while ensuring
+    /// files within each partition maintain sort order based on their min/max statistics.
+    ///
+    /// The algorithm works by:
+    /// 1. Sorting all files by their minimum values
+    /// 2. Trying to place each file into an existing group where it can maintain sort order
+    /// 3. Creating new groups when necessary if a file cannot fit into existing groups
+    /// 4. Prioritizing smaller groups when multiple suitable groups exist (for load balancing)
+    ///
+    /// # Parameters
+    /// * `table_schema`: Schema containing information about the columns
+    /// * `file_groups`: The original file groups to split
+    /// * `sort_order`: The lexicographical ordering to maintain within each group
+    /// * `target_partitions`: The desired number of output partitions
+    ///
+    /// # Returns
+    /// A new set of file groups, where files within each group are non-overlapping with respect to
+    /// their min/max statistics and maintain the specified sort order.
+    pub fn split_groups_by_statistics_v2(
```

Review Comment: The name is TBD. I'm wondering if we need to keep the old method because I think both of them have applicable scenarios.
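The four steps in the doc comment above can be sketched as follows. This is an illustrative Rust outline only, not the actual DataFusion implementation: `FileStat` is a stand-in for a file's min/max statistics on the sort column, and the sketch uses `target_partitions` only as a capacity hint rather than capping the number of groups.

```rust
// Illustrative sketch only; not the DataFusion implementation.
// `FileStat` stands in for a file's min/max statistics on the sort column.
struct FileStat {
    min: i64,
    max: i64,
}

fn split_by_stats(mut files: Vec<FileStat>, target_partitions: usize) -> Vec<Vec<FileStat>> {
    // 1. Sort all files by their minimum values
    files.sort_by_key(|f| f.min);
    let mut groups: Vec<Vec<FileStat>> = Vec::with_capacity(target_partitions);
    for file in files {
        // 2. A group can accept the file if sort order is maintained, i.e. the
        //    group's current max does not exceed the file's min.
        // 4. Among suitable groups, prefer the smallest one (load balancing).
        let target = groups
            .iter()
            .enumerate()
            .filter(|(_, g)| g.last().map_or(true, |last| last.max <= file.min))
            .min_by_key(|(_, g)| g.len())
            .map(|(i, _)| i);
        match target {
            Some(i) => groups[i].push(file),
            // 3. Create a new group when the file fits into no existing group
            None => groups.push(vec![file]),
        }
    }
    groups
}
```

For example, files with (min, max) ranges (0, 10), (5, 15), (10, 20) split into two groups: the overlapping (5, 15) gets its own group, while (10, 20) appends after (0, 10).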
[PR] Partial fix for #1078 — [Add Dataframe display config] [datafusion-python]
kosiew opened a new pull request, #1086: URL: https://github.com/apache/datafusion-python/pull/1086

## Which issue does this PR close?

Partial fix for #1078

## Rationale for this change

This PR adds configurable display settings for `DataFrame` representations in the Python bindings of DataFusion. These changes improve flexibility and control over how DataFrames are visualized, especially in interactive environments like Jupyter notebooks. Users can now fine-tune memory usage, cell truncation behavior, and row limits for both string and HTML representations.

## What changes are included in this PR?

- Introduced a new `DisplayConfig` Python class to encapsulate display settings.
- Extended `PyDataFrame` to store and expose `DisplayConfig` via a new `display_config` property.
- Added `configure_display()` and `reset_display_config()` methods to allow users to modify or reset configuration.
- Updated DataFrame display logic (`__repr__` and `_repr_html_`) to respect the active configuration.
- Modified Rust backend to pass `DisplayConfig` to relevant formatting functions.
- Added extensive unit tests in `test_dataframe.py` to validate:
  - Custom configuration values
  - Invalid input handling
  - Display behavior for large cells, large tables, and row limits

## Are there any user-facing changes?

Yes. This PR introduces:
- `df.display_config`: Property to access current display configuration.
- `df.configure_display(...)`: Method to customize display settings.
- `df.reset_display_config()`: Method to reset to default configuration.
- More intuitive and memory-conscious display formatting in notebook and terminal environments.

> These changes enhance user control and improve display ergonomics for large or complex DataFrames.
Re: [PR] Add short circuit [datafusion]
ctsk commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2760574284

I think one issue is that the short-circuit logic is not handling cases where the `rhs` contains NULLs. E.g. `false OR NULL` needs to evaluate to `NULL`.
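For reference, SQL's three-valued (Kleene) OR semantics can be sketched with `Option<bool>` standing in for a nullable boolean. This is an illustrative sketch, not DataFusion's actual kernel: a short-circuit that simply returns the `lhs` whenever it is known would miss the case where a false `lhs` meets a NULL `rhs`.

```rust
// Kleene OR over nullable booleans, with None standing in for SQL NULL.
fn kleene_or(l: Option<bool>, r: Option<bool>) -> Option<bool> {
    match (l, r) {
        // A known true on either side dominates, even against NULL.
        (Some(true), _) | (_, Some(true)) => Some(true),
        // Both sides known false: the result is false.
        (Some(false), Some(false)) => Some(false),
        // Any remaining combination involves a NULL that cannot be resolved.
        _ => None,
    }
}
```

So `true OR NULL` is `true` (safe to short-circuit), but `false OR NULL` is `NULL`, which is the case a naive short-circuit on the `lhs` gets wrong.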
[PR] fix: the average time for clickbench query compute should outside the iterator loop [datafusion]
zhuqi-lucas opened a new pull request, #15472: URL: https://github.com/apache/datafusion/pull/15472

## Which issue does this PR close?

- Closes [#15471](https://github.com/apache/datafusion/issues/15471)

## Rationale for this change

The average time for each ClickBench query should be computed outside the iterator loop.

## What changes are included in this PR?

Moves the average-time computation outside the iterator loop.

## Are these changes tested?

Yes

## Are there any user-facing changes?

Fixes the wrong average time.
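The bug class can be illustrated with a minimal sketch (not the benchmark's actual code): collect all per-iteration times first, then compute the mean once, after the loop finishes, rather than reporting intermediate averages over a partially-filled vector.

```rust
// Illustrative sketch, not the benchmark's actual code: the average is
// computed once, over the completed slice of iteration times.
fn average_ms(iteration_times_ms: &[f64]) -> f64 {
    let total: f64 = iteration_times_ms.iter().sum();
    total / iteration_times_ms.len() as f64
}
```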
Re: [PR] chore(deps): bump aws-config from 1.6.0 to 1.6.1 [datafusion]
xudong963 merged PR #15470: URL: https://github.com/apache/datafusion/pull/15470
Re: [PR] fix: the average time for clickbench query compute should use new vec to make it compute for each query [datafusion]
2010YOUY01 merged PR #15472: URL: https://github.com/apache/datafusion/pull/15472
Re: [I] The average time compute for clickbench query is wrong [datafusion]
2010YOUY01 closed issue #15471: The average time compute for clickbench query is wrong URL: https://github.com/apache/datafusion/issues/15471
[PR] Partial fix for 1078: [refactor: Simplify HTML generation in PyDataFrame by extracting helper functions] [datafusion-python]
kosiew opened a new pull request, #1087: URL: https://github.com/apache/datafusion-python/pull/1087

# Which issue does this PR close?

Partial fix for #1078

# Rationale for this change

> Split up some of the html generation into a set of helper functions.

The rendering logic for DataFrame HTML output was previously monolithic. By modularizing the generation of HTML structure, styles, and table rows into clearly named helper functions, we improve readability, testability, and long-term maintainability. This refactor lays the foundation for easier customization and potential future features like theming or enhanced interactivity.

# What changes are included in this PR?

- Extracted style definitions into `get_html_style_definitions()`
- Moved the opening table tag into `get_html_table_opening()`
- Separated table header generation into `get_html_table_header()`
- Encapsulated array formatter creation logic in `create_batch_formatters()`
- Delegated row and cell HTML generation to `get_html_table_rows()` and `format_table_cell()`
- Moved JavaScript logic for expandable cells into `get_html_js_functions()`
- Preserved original behavior with no changes to the rendered output

# Are there any user-facing changes?

No functional changes for end users. The generated HTML and behavior remain the same, but the underlying implementation is now modular and easier to maintain.
Re: [I] [Bug] datafusion-cli may fail to read csv files [datafusion]
niebayes commented on issue #15456: URL: https://github.com/apache/datafusion/issues/15456#issuecomment-2760761409

The line number in the error message is the row index of a certain record batch, not the line number in the csv file. I have filed an issue to arrow-rs for making this error message more user-friendly.
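As a hypothetical illustration of the mismatch (the function name, batch size, and header handling are assumptions for the sketch, not datafusion-cli internals): a batch-relative row index only maps back to an absolute 1-based CSV line if you also know how many batches were already read and how large each batch is.

```rust
// Hypothetical arithmetic: recover an absolute 1-based CSV line number from a
// per-batch row index, given the number of fully-read batches, a fixed batch
// size, and whether the file has a header line.
fn absolute_csv_line(
    batches_read: usize,
    batch_size: usize,
    row_in_batch: usize,
    has_header: bool,
) -> usize {
    let header_lines = if has_header { 1 } else { 0 };
    // Rows consumed by earlier batches, plus the offset in the current batch,
    // plus the header, converted to a 1-based line number.
    batches_read * batch_size + row_in_batch + header_lines + 1
}
```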
Re: [I] [Bug] datafusion-cli may fail to read csv files [datafusion]
niebayes commented on issue #15456: URL: https://github.com/apache/datafusion/issues/15456#issuecomment-2760766577

> why there are two head rows

I didn't look into this. You might find the cause.
[PR] Update ClickBench queries to avoid to_timestamp_seconds [datafusion]
acking-you opened a new pull request, #15475: URL: https://github.com/apache/datafusion/pull/15475

## Which issue does this PR close?

- Closes #15465

## Rationale for this change

## What changes are included in this PR?

## Are these changes tested?

## Are there any user-facing changes?
Re: [PR] Add short circuit [datafusion]
acking-you commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2761137454

> I think one issue is that the short-circuit logic is not handling cases where the `rhs` contains NULLs. E.g. `false OR NULL` needs to evaluate to `NULL`

Thank you very much for your hint, it will be very helpful for me to fix these tests!
Re: [I] DeltaLake integration not working (Python) (FFI Table providers not working) [datafusion-python]
alamb commented on issue #1077: URL: https://github.com/apache/datafusion-python/issues/1077#issuecomment-2742876701

FYI @timsaucer
Re: [PR] Remove CoalescePartitions insertion from HashJoinExec [datafusion]
comphead commented on PR #15476: URL: https://github.com/apache/datafusion/pull/15476#issuecomment-2762263935

> Note that this does break for users of HashJoinExec that
>
> * Use the CollectLeft mode, with >1 partition on the build side AND
> * Construct their physical plan without running EnforceDistribution

Thanks @ctsk, what exactly is broken?
Re: [PR] chore: Override node name for CometSparkToColumnar [datafusion-comet]
codecov-commenter commented on PR #1577: URL: https://github.com/apache/datafusion-comet/pull/1577#issuecomment-2762138134

## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1577?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report

All modified and coverable lines are covered by tests :white_check_mark:

> Project coverage is 56.16%. Comparing base [(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`6f83dd8`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/6f83dd84d02fa77ae196684bf35adc4326bd97df?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
> Report is 105 commits behind head on main.

Additional details and impacted files

```diff
@@             Coverage Diff              @@
##               main    #1577      +/-  ##
===========================================
+ Coverage     56.12%   56.16%   +0.03%
+ Complexity      976      909      -67
===========================================
  Files           119      122       +3
  Lines         11743    12228     +485
  Branches       2251     2262      +11
===========================================
+ Hits           6591     6868     +277
- Misses         4012     4246     +234
+ Partials       1140     1114      -26
```

[:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1577?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).

:loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).

:rocket: New features to boost your workflow:
- :snowflake: [Test Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, report on failures, and find test suite problems.
Re: [PR] Remove CoalescePartitions insertion from HashJoinExec [datafusion]
ctsk commented on PR #15476: URL: https://github.com/apache/datafusion/pull/15476#issuecomment-2762302514

Before this PR, if someone hand-wired a CollectLeft HashJoin where the left child has more than one output partition, the HashJoin would automatically add a CoalescePartitions exec. This behaviour never triggers for plans constructed by DataFusion, because the EnforceDistribution pass makes sure that CoalescePartitions exists.

After this PR, if someone hand-wires a CollectLeft HashJoin, does not run the EnforceDistribution pass, and provides a left child that has more than one output partition, the resulting plan will return a wrong result when executed (because the hash table will only be built on partition 0).

Now that I've written it out, I believe it is strictly better to return a plan_err! if one constructs such a plan rather than return a wrong result.
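The `plan_err!` idea can be sketched as a validation step when the plan is built. The types and error message below are illustrative stand-ins, not the actual HashJoinExec API:

```rust
// Illustrative sketch: reject a hand-wired CollectLeft hash join whose build
// side has more than one partition instead of silently producing wrong results.
#[derive(PartialEq)]
enum PartitionMode {
    CollectLeft,
    Partitioned,
}

fn validate_collect_left(mode: &PartitionMode, left_partitions: usize) -> Result<(), String> {
    if *mode == PartitionMode::CollectLeft && left_partitions > 1 {
        return Err(
            "CollectLeft join expects a single build-side partition; \
             run EnforceDistribution or insert a coalesce first"
                .to_string(),
        );
    }
    Ok(())
}
```

Failing fast here turns a silent wrong-result bug into an explicit planning error for plans built without the optimizer passes.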
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
friendlymatthew commented on code in PR #15361: URL: https://github.com/apache/datafusion/pull/15361#discussion_r2019270254

## datafusion/functions/src/datetime/to_char.rs:

```diff
@@ -277,7 +282,25 @@ fn _to_char_array(args: &[ColumnarValue]) -> Result {
         let result = formatter.value(idx).try_to_string();
         match result {
             Ok(value) => results.push(Some(value)),
-            Err(e) => return exec_err!("{}", e),
+            Err(e) => {
+                if data_type == &Date32 {
```

Review Comment:

> Casting Date32 to Date64 should be incredibly cheap

I think I agree, converting involves a cast between primitives and a mult. https://github.com/apache/arrow-rs/blob/206fe024cdd9b9de770714baa27b58a1442fcc75/arrow-cast/src/cast/mod.rs#L1423-L1427

> Yeah, this is a limitation of arrow-rs currently. I never got around to pushing a PR to arrow-rs to change that.

I'd be happy to work on this next, if you'd like.
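The "cast between primitives and a mult" is just the following: Arrow's Date32 stores days since the Unix epoch as an `i32`, while Date64 stores milliseconds since the epoch as an `i64`, so the conversion is a widening cast plus one multiplication.

```rust
// Date32 (days since epoch, i32) -> Date64 (milliseconds since epoch, i64).
const MILLIS_PER_DAY: i64 = 86_400_000;

fn date32_to_date64(days: i32) -> i64 {
    days as i64 * MILLIS_PER_DAY
}
```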
Re: [PR] feat: enable iceberg compat tests, more tests for complex types [datafusion-comet]
parthchandra commented on code in PR #1550: URL: https://github.com/apache/datafusion-comet/pull/1550#discussion_r2019242229

## spark/src/main/scala/org/apache/spark/sql/comet/CometScanExec.scala:

```diff
@@ -490,8 +490,7 @@ object CometScanExec extends DataTypeSupport {
       // TODO add map
       dt match {
         case s: StructType => s.fields.map(_.dataType).forall(isTypeSupported)
-        // TODO: Add nested array and iceberg compat support
-        // case a: ArrayType => isTypeSupported(a.elementType)
+        case a: ArrayType => isTypeSupported(a.elementType)
```

Review Comment: I'm not sure myself how well arrays are supported by the scans. I suppose we need to add more tests for arrays?
Re: [I] Dynamic pruning filters from TopK state [datafusion]
alamb commented on issue #15037: URL: https://github.com/apache/datafusion/issues/15037#issuecomment-2762326990

@adriangb and I had a discussion about https://github.com/apache/datafusion/pull/15301 here are some notes:

## Usecases

- TopK dynamic filter pushdown
- Prune files with dynamic filter based on statistics
- Prune row groups with dynamic filter based on statistics
- Prune row pages with dynamic filter based on statistics
- Apply during filtering when pushdown enabled
- Join SIPs

## Pros / Cons

The pros for merging this PR are:

- We already have benchmarks that show some performance improvement

The cons:

- It requires a special implementation in any operators (like FileOpener) to take advantage of such filters. This is not a blocker in my mind, but I do think implementing a PhysicalExpr is a cleaner design. As Adrian says, we can refactor it over time if/when PhysicalExpr gets more sophisticated.
- We will get even more performance when filter_pushdown is enabled (again, maybe this is just follow-on work).

## Nice to haves

- For a plan with multiple partitions (e.g. for 16 input partitions, we end up with 17 TopK heaps, one for each partition and then a global one), this PR can only apply the per-partition TopK value.
  - It would be nice to somehow be able to use all the top values (aka pick the smallest one) when filtering.
- This PR takes a snapshot of the contents of the TopK heap when a file is opened and never changes it.
  - This is good for pruning, as all the pruning (file, row group and page) happens on file opening.
  - It is not as good for filter_pushdown, where the values in the TopK heap can change over the course of the query, so using the snapshot means the dynamic filter doesn't improve over time.

I believe Adrian is going to look into these, but I also think they could easily be done as a follow-on PR.
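As a minimal illustration of the TopK-snapshot idea discussed above (assuming `ORDER BY x ASC LIMIT k` with a max-heap holding the current k smallest values; this is a sketch, not the actual DataFusion code): once the heap is full, its largest element is a threshold, and any file, row group, or page whose minimum `x` is at or above it cannot change the result.

```rust
use std::collections::BinaryHeap;

// BinaryHeap is a max-heap, so for the k smallest values seen so far,
// peek() is the current "worst" value still in the top-k. Any row with
// x >= this threshold cannot enter the result and can be pruned.
fn topk_threshold(heap: &BinaryHeap<i64>, k: usize) -> Option<i64> {
    if heap.len() >= k {
        heap.peek().copied()
    } else {
        None // heap not yet full: no pruning possible
    }
}
```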
Re: [PR] feat: pushdown filter for native_iceberg_compat [datafusion-comet]
mbutrovich commented on PR #1566: URL: https://github.com/apache/datafusion-comet/pull/1566#issuecomment-2762276704

> Looks good, do we still need to wait for arrow-rs based on [#1566 (comment)](https://github.com/apache/datafusion-comet/pull/1566#issuecomment-2748737890)?

We can merge it. I'd rather be testing the pushdown logic as the default, even if it's a little bit slower right now.
Re: [PR] minor: Allow to run TPCH bench for a specific query [datafusion]
comphead commented on PR #15467: URL: https://github.com/apache/datafusion/pull/15467#issuecomment-276152

Thanks @xudong963
Re: [PR] Add `FileScanConfigBuilder` [datafusion]
alamb merged PR #15352: URL: https://github.com/apache/datafusion/pull/15352
Re: [PR] feat: pushdown filter for native_iceberg_compat [datafusion-comet]
kazuyukitanimura commented on PR #1566: URL: https://github.com/apache/datafusion-comet/pull/1566#issuecomment-2761992952 Looks good, do we still need to wait for arrow-rs based on https://github.com/apache/datafusion-comet/pull/1566#issuecomment-2748737890 ?
Re: [I] Organize fields inside `SortMergeJoinStream` [datafusion]
suibianwanwank commented on issue #15406: URL: https://github.com/apache/datafusion/issues/15406#issuecomment-2762027162 Hi, @2010YOUY01. I have read and tried to understand both SortMergeJoinStream and GroupedHashAggregateStream (though I still have some uncertainties). I have some initial thoughts on organizing the structure, but I'm not sure about the right level of granularity. One idea is to introduce a `SortMergeState`, which would roughly comprise:

```rust
struct SortMergeState {
    /// State of streamed
    pub streamed_state: StreamedState,
    /// State of buffered
    pub buffered_state: BufferedState,
    /// Staging output size, including output batches and staging joined results.
    /// Increased when we put rows into buffer and decreased after we actually output batches.
    /// Used to trigger output when sufficient rows are ready
    pub output_size: usize,
    /// Target output batch size
    pub batch_size: usize,
    /// Current processing record batch of streamed
    pub streamed_batch: StreamedBatch,
    /// Current buffered data
    pub buffered_data: BufferedData,
    /// (used in outer join) Is current streamed row joined at least once?
    pub streamed_joined: bool,
    /// (used in outer join) Is current buffered batches joined at least once?
    pub buffered_joined: bool,
    /// A unique number for each batch
    pub streamed_batch_counter: AtomicUsize,
    /// Staging output array builders
    pub staging_output_record_batches: JoinedRecordBatches,
    /// Output buffer. Currently used by filtering as it requires double buffering
    /// to avoid small/empty batches. Non-filtered join outputs directly from
    /// `staging_output_record_batches.batches`
    pub output: RecordBatch,
}
```

These contain the parts of the stream, buffer, and output that change during the merge process. Another idea is to organize the contents of the buffer, stream, and output separately, which would roughly comprise:

```rust
struct StreamContext {
    /// State of streamed
    pub streamed_state: StreamedState,
    /// Current processing record batch of streamed
    pub streamed_batch: StreamedBatch,
    /// (used in outer join) Is current streamed row joined at least once?
    pub streamed_joined: bool,
    /// Join key columns of streamed
    pub on_streamed: Vec<Arc<dyn PhysicalExpr>>,
    /// Streamed data stream
    pub streamed: SendableRecordBatchStream,
    /// Input schema of streamed
    pub streamed_schema: SchemaRef,
}

struct BufferContext {
    // ...
}

struct OutputState {
    // ...
}
```

Do you have any suggestions on which approach would be clearer and more readable? :)
Re: [I] Support RangePartitioning with native shuffle [datafusion-comet]
jinwenjie123 commented on issue #458: URL: https://github.com/apache/datafusion-comet/issues/458#issuecomment-2762497216 Hi @andygrove, may I ask why it was decided not to support RangePartitioning, and will it be supported in the near future? Thanks
Re: [I] Feature is not implemeneted: Unsupported cast with list of structs [datafusion]
ion-elgreco commented on issue #15338: URL: https://github.com/apache/datafusion/issues/15338#issuecomment-2762411530
> - Related discussion: https://github.com/apache/arrow-rs/issues/7176
>
> I think @kosiew may have a PR up that is related
> - https://github.com/apache/datafusion/pull/15295

I don't think that PR solves it for us; to my understanding, it helps when you read Parquet data, but we are evolving the schema within a dataframe cast.
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
Omega359 commented on code in PR #15361: URL: https://github.com/apache/datafusion/pull/15361#discussion_r2019195514

## datafusion/functions/src/datetime/to_char.rs ##

```diff
@@ -277,7 +282,25 @@ fn _to_char_array(args: &[ColumnarValue]) -> Result {
     let result = formatter.value(idx).try_to_string();
     match result {
         Ok(value) => results.push(Some(value)),
-        Err(e) => return exec_err!("{}", e),
+        Err(e) => {
+            if data_type == &Date32 {
```

Review Comment: I thought about this a bit and was mentally toying with the idea of testing the format string(s) for time formats vs just always casting ... and generally I don't think it would be worth it at all. Casting Date32 to Date64 should be incredibly cheap, I think.
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
Omega359 commented on code in PR #15361: URL: https://github.com/apache/datafusion/pull/15361#discussion_r2019198451

## datafusion/functions/src/datetime/to_char.rs ##

```diff
@@ -277,7 +282,25 @@ fn _to_char_array(args: &[ColumnarValue]) -> Result {
     let result = formatter.value(idx).try_to_string();
     match result {
         Ok(value) => results.push(Some(value)),
-        Err(e) => return exec_err!("{}", e),
+        Err(e) => {
+            if data_type == &Date32 {
```

Review Comment:
> The entire `Date32` array, because for every format string we create a new `ArrayFormatter` by supplying the entire array and specifying the index for the formatter to format:
>
> https://github.com/apache/datafusion/blob/fdb4e848b65c001dd3f65b477296e07cbe8e0b07/datafusion/functions/src/datetime/to_char.rs#L274-L277

Yeah, this is a limitation of arrow-rs currently. I never got around to pushing a PR to arrow-rs to change that.
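The "incredibly cheap" cast discussed above is just a per-value widening multiply: Date32 stores days since the UNIX epoch as `i32`, while Date64 stores milliseconds as `i64`. A minimal std-only sketch of the arithmetic (arrow's `cast` kernel does the equivalent over the whole array):

```rust
// Date32 stores i32 days since 1970-01-01; Date64 stores i64 milliseconds.
// The cast is one multiply per value: millis = days * 86_400_000.
fn date32_to_date64(days: &[i32]) -> Vec<i64> {
    days.iter().map(|&d| d as i64 * 86_400_000).collect()
}

fn main() {
    // 0 days -> epoch start; 1 day -> 86.4 million milliseconds.
    assert_eq!(date32_to_date64(&[0, 1]), vec![0, 86_400_000]);
    println!("{:?}", date32_to_date64(&[0, 1, 19_000]));
}
```

This is why casting the whole column before formatting is cheap relative to inspecting format strings for time specifiers.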
Re: [PR] Add dynamic pruning filters from TopK state [datafusion]
adriangb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2762301099 From discussion with Andrew, here are a couple of notes:
- The most granular update frequency of the filters is when the TopK itself updates, so we should switch from polling to pushing: `TopKDynamicFilterSource` should just hold a reference to its current filter, and the TopK should update `TopKDynamicFilterSource` when it gets updated, not the other way around.
- When you have, say, 10 partitions you end up with 11 TopK operators: one per partition plus a global one on top. Can we push down filters from the global TopK? My suggestion was that yes, we can, if we have each TopK accept filters from above it and forward them on.
- Can we update the dynamic filters for each batch that gets read inside of the openers? I think we might be able to; it might require some trickery though.
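The first bullet (push instead of poll) can be sketched as a shared slot that the TopK writes into whenever its heap threshold tightens, and that scan openers snapshot before reading a batch. All names and the `String` filter stand-in here are hypothetical, not DataFusion's actual `TopKDynamicFilterSource` API:

```rust
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for a serialized filter predicate such as `price < 42`.
type Filter = String;

/// Shared slot: the TopK pushes its latest threshold filter in, and scan
/// openers take a snapshot before reading each batch, so nothing polls the
/// TopK itself.
#[derive(Clone, Default)]
struct DynamicFilterSource {
    current: Arc<RwLock<Option<Filter>>>,
}

impl DynamicFilterSource {
    /// Called by the TopK whenever its heap threshold tightens.
    fn push(&self, filter: Filter) {
        *self.current.write().unwrap() = Some(filter);
    }

    /// Called by scan openers; cheap read of the latest pushed filter.
    fn snapshot(&self) -> Option<Filter> {
        self.current.read().unwrap().clone()
    }
}

fn main() {
    let source = DynamicFilterSource::default();
    assert!(source.snapshot().is_none());

    // The TopK found a tighter threshold -> push the new filter.
    source.push("price < 42".to_string());
    assert_eq!(source.snapshot().as_deref(), Some("price < 42"));
    println!("filter: {:?}", source.snapshot());
}
```

The same slot could also receive filters forwarded from a global TopK above, per the second bullet.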
Re: [PR] refactor: Move `Memtable` to catalog [datafusion]
logan-keede commented on code in PR #15459: URL: https://github.com/apache/datafusion/pull/15459#discussion_r2019291249

## datafusion/catalog/src/memory/table.rs ##

@@ -0,0 +1,377 @@

```rust
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

//! [`MemTable`] for querying `Vec<RecordBatch>` by DataFusion.

use std::any::Any;
use std::collections::HashMap;
use std::fmt::{self, Debug};
use std::sync::Arc;

use crate::TableProvider;
use datafusion_common::error::Result;
use datafusion_expr::Expr;
use datafusion_expr::TableType;
use datafusion_physical_expr::create_physical_sort_exprs;
use datafusion_physical_plan::repartition::RepartitionExec;
use datafusion_physical_plan::{
    common, DisplayAs, DisplayFormatType, ExecutionPlan, ExecutionPlanProperties,
    Partitioning, SendableRecordBatchStream,
};

use arrow::datatypes::SchemaRef;
use arrow::record_batch::RecordBatch;
use datafusion_common::{not_impl_err, plan_err, Constraints, DFSchema, SchemaExt};
use datafusion_common_runtime::JoinSet;
use datafusion_datasource::memory::MemorySourceConfig;
use datafusion_datasource::sink::{DataSink, DataSinkExec};
use datafusion_datasource::source::DataSourceExec;
use datafusion_execution::TaskContext;
use datafusion_expr::dml::InsertOp;
use datafusion_expr::SortExpr;
use datafusion_session::Session;

use async_trait::async_trait;
use futures::StreamExt;
use log::debug;
use parking_lot::Mutex;
use tokio::sync::RwLock;

/// Type alias for partition data
pub type PartitionData = Arc<RwLock<Vec<RecordBatch>>>;

/// In-memory data source for presenting a `Vec<RecordBatch>` as a
/// data source that can be queried by DataFusion. This allows data to
/// be pre-loaded into memory and then repeatedly queried without
/// incurring additional file I/O overhead.
#[derive(Debug)]
pub struct MemTable {
    schema: SchemaRef,
    // make it pub(crate) when possible
    pub batches: Vec<PartitionData>,
    constraints: Constraints,
    column_defaults: HashMap<String, Expr>,
    /// Optional pre-known sort order(s). Must be `SortExpr`s.
    /// inserting data into this table removes the order
    pub sort_order: Arc<Mutex<Vec<Vec<SortExpr>>>>,
}

impl MemTable {
    /// Create a new in-memory table from the provided schema and record batches
    pub fn try_new(schema: SchemaRef, partitions: Vec<Vec<RecordBatch>>) -> Result<Self> {
        for batches in partitions.iter().flatten() {
            let batches_schema = batches.schema();
            if !schema.contains(&batches_schema) {
                debug!(
                    "mem table schema does not contain batches schema. \
                     Target_schema: {schema:?}. Batches Schema: {batches_schema:?}"
                );
                return plan_err!("Mismatch between schema and batches");
            }
        }

        Ok(Self {
            schema,
            batches: partitions
                .into_iter()
                .map(|e| Arc::new(RwLock::new(e)))
                .collect::<Vec<_>>(),
            constraints: Constraints::empty(),
            column_defaults: HashMap::new(),
            sort_order: Arc::new(Mutex::new(vec![])),
        })
    }

    /// Assign constraints
    pub fn with_constraints(mut self, constraints: Constraints) -> Self {
        self.constraints = constraints;
        self
    }

    /// Assign column defaults
    pub fn with_column_defaults(
        mut self,
        column_defaults: HashMap<String, Expr>,
    ) -> Self {
        self.column_defaults = column_defaults;
        self
    }

    /// Specify an optional pre-known sort order(s). Must be `SortExpr`s.
    ///
    /// If the data is not sorted by this order, DataFusion may produce
    /// incorrect results.
    ///
    /// DataFusion may take advantage of this ordering to omit sorts
    /// or use more efficient algorithms.
    ///
    /// Note that multiple sort orders are supported, if some are known to be
    /// equivalent,
    pub fn with_sort_order(self, mut sort_order: Vec<Vec<SortExpr>>) -> Self {
        std::mem::swap(self.sort_order.lock().as_mut(), &mut sort_order);
        self
    }

    /// Create a mem table by reading from another data source
    pub async fn load(
        t: Arc<dyn TableProvider>,
        output_partitions: Option<usize>,
        state:
```
Re: [PR] Update ClickBench queries to avoid to_timestamp_seconds [datafusion]
acking-you commented on PR #15475: URL: https://github.com/apache/datafusion/pull/15475#issuecomment-2761888661 > Looks good to me. Since we're only ordering by this it shouldn't matter that we order by an integer instead of a proper timestamp, ordering is equivalent. Thank you very much for your reply, but I have a question: why can't the following `order by "EventTime"` be modified? It seems that they are equivalent (since the default is UTC timestamps).
Re: [I] [Bug] datafusion-cli may fail to read csv files [datafusion]
alamb closed issue #15456: [Bug] datafusion-cli may fail to read csv files URL: https://github.com/apache/datafusion/issues/15456
Re: [PR] bench: Introduce cross platform Samply profiler [datafusion]
parthchandra commented on PR #15481: URL: https://github.com/apache/datafusion/pull/15481#issuecomment-2761890159 There is also some profiling information in https://datafusion.apache.org/comet/contributor-guide/profiling_native_code.html I'm presuming this will replace that? How does samply compare with cargo flamegraph?
[PR] feat: Improve fetch partition performance, support skip validation arrow ipc files [datafusion-ballista]
westhide opened a new pull request, #1216: URL: https://github.com/apache/datafusion-ballista/pull/1216

# Which issue does this PR close?
Closes #1189.

# Rationale for this change

# What changes are included in this PR?

# Are there any user-facing changes?
Re: [PR] Remove CoalescePartitions insertion from HashJoinExec [datafusion]
ctsk commented on PR #15476: URL: https://github.com/apache/datafusion/pull/15476#issuecomment-2761891586 Note that this does break for users of HashJoinExec that:
- use the CollectLeft mode with >1 partition on the build side, AND
- construct their physical plan without running EnforceDistribution
Re: [PR] experiment: Selectively remove CoalesceBatchesExec [datafusion]
Dandandan commented on code in PR #15479: URL: https://github.com/apache/datafusion/pull/15479#discussion_r2019008151

## datafusion/physical-optimizer/src/coalesce_batches.rs ##

```diff
@@ -92,3 +92,73 @@ impl PhysicalOptimizerRule for CoalesceBatches {
         true
     }
 }
+
+/// Remove CoalesceBatchesExec that are in front of a AggregateExec
+#[derive(Default, Debug)]
+pub struct UnCoalesceBatches {}
```

Review Comment: I am wondering if we can instead avoid adding them in the `CoalesceBatches` optimizer.
Re: [PR] experiment: Selectively remove CoalesceBatchesExec [datafusion]
Dandandan commented on PR #15479: URL: https://github.com/apache/datafusion/pull/15479#issuecomment-2761889602 That makes a lot of sense!
Re: [I] External sort failing with modest memory limit when writing parquet files [datafusion]
ivankelly commented on issue #15028: URL: https://github.com/apache/datafusion/issues/15028#issuecomment-276185 Excellent analysis, folks. Parquet row group size makes a lot of sense since the rows are large. We can tune that way down since our use case isn't columnar. How do I get the good errors? I see that was merged last year, but I was using the latest release at the time of repro.
Re: [I] [Bug] datafusion-cli may fail to read csv files [datafusion]
alamb commented on issue #15456: URL: https://github.com/apache/datafusion/issues/15456#issuecomment-2761888196 Turns out this is a bug in the generator -- https://github.com/clflushopt/tpchgen-rs/issues/73#issuecomment-2761885245
Re: [PR] Update ClickBench queries to avoid to_timestamp_seconds [datafusion]
adriangb commented on PR #15475: URL: https://github.com/apache/datafusion/pull/15475#issuecomment-2761917991
> > Looks good to me. Since we're only ordering by this it shouldn't matter that we order by an integer instead of a proper timestamp, ordering is equivalent.
>
> Thank you very much for your reply, but I have a question: why can't the following `order by "EventTime"` be modified? It seems that they are equivalent (since the default is UTC timestamps). And both ClickHouse and duckdb do it this way: [duckdb](https://github.com/ClickHouse/ClickBench/blob/main/duckdb/queries.sql#L24-L27) [clickhouse](https://github.com/ClickHouse/ClickBench/blob/main/clickhouse/queries.sql#L24-L27)

I'm agreeing with you! I think this change is good 😄
Re: [PR] bench: Introduce cross platform Samply profiler [datafusion]
comphead commented on PR #15481: URL: https://github.com/apache/datafusion/pull/15481#issuecomment-2761920912
> There is also some profiling information in https://datafusion.apache.org/comet/contributor-guide/profiling_native_code.html I'm presuming this will replace that? How does samply compare with cargo flamegraph?

The tool also builds a flame graph; it can be found in a separate tab. For Comet, I think we can refer to this section to provide alternative profiling options.
Re: [PR] fix: Assertion fail in external sort [datafusion]
comphead commented on code in PR #15469: URL: https://github.com/apache/datafusion/pull/15469#discussion_r2019237566

## datafusion/physical-plan/src/sorts/sort.rs ##

```diff
@@ -416,21 +409,23 @@ impl ExternalSorter {
             Some(self.spill_manager.create_in_progress_file("Sorting")?);
         }

-        self.organize_stringview_arrays()?;
+        Self::organize_stringview_arrays(globally_sorted_batches)?;

         debug!("Spilling sort data of ExternalSorter to disk whilst inserting");

-        let batches = std::mem::take(&mut self.in_mem_batches);
+        let batches_to_spill = std::mem::take(globally_sorted_batches);
         self.reservation.free();

         let in_progress_file = self.in_progress_spill_file.as_mut().ok_or_else(|| {
             internal_datafusion_err!("In-progress spill file should be initialized")
         })?;

-        for batch in batches {
+        for batch in batches_to_spill {
             in_progress_file.append_batch(&batch)?;
         }
+        assert!(globally_sorted_batches.is_empty());
```

Review Comment: perhaps we can get rid of the assert and return `Err`? If the code regresses, users could face an app crash.
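The reviewer's suggestion (return an `Err` rather than crashing on a hard `assert!`) can be sketched like this; the error type and function body are illustrative stand-ins, not the actual `ExternalSorter` internals:

```rust
// Illustrative stand-in for DataFusion's error type.
#[derive(Debug)]
struct DataFusionError(String);

/// Drain the batches into the spill file, then report a regression as a
/// recoverable `Err` instead of aborting the process via `assert!`.
fn spill_batches(globally_sorted_batches: &mut Vec<u32>) -> Result<(), DataFusionError> {
    let batches_to_spill = std::mem::take(globally_sorted_batches);
    for _batch in batches_to_spill {
        // in the real code: in_progress_file.append_batch(&batch)?;
    }

    // Was: assert!(globally_sorted_batches.is_empty());
    if !globally_sorted_batches.is_empty() {
        return Err(DataFusionError(
            "globally_sorted_batches should be drained after spilling".to_string(),
        ));
    }
    Ok(())
}

fn main() {
    let mut batches = vec![1, 2, 3];
    assert!(spill_batches(&mut batches).is_ok());
    assert!(batches.is_empty());
    println!("spill ok");
}
```

With this shape, a future regression surfaces to callers as a planning/execution error they can handle, instead of a panic.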
Re: [PR] fix!: incorrect coercion when comparing with string literals [datafusion]
alan910127 commented on code in PR #15482: URL: https://github.com/apache/datafusion/pull/15482#discussion_r2019310584

## datafusion/core/tests/expr_api/mod.rs ##

```diff
@@ -330,12 +330,12 @@ async fn test_create_physical_expr_coercion() {
     create_expr_test(lit(1i32).eq(col("id")), "CAST(1 AS Utf8) = id@0");
     // compare int col to string literal `i = '202410'`
     // Note this casts the column (not the field)
-    create_expr_test(col("i").eq(lit("202410")), "CAST(i@1 AS Utf8) = 202410");
-    create_expr_test(lit("202410").eq(col("i")), "202410 = CAST(i@1 AS Utf8)");
+    create_expr_test(col("i").eq(lit("202410")), "i@1 = 202410");
+    create_expr_test(lit("202410").eq(col("i")), "202410 = i@1");
     // however, when simplified the casts on i should removed
     // https://github.com/apache/datafusion/issues/14944
-    create_simplified_expr_test(col("i").eq(lit("202410")), "CAST(i@1 AS Utf8) = 202410");
-    create_simplified_expr_test(lit("202410").eq(col("i")), "CAST(i@1 AS Utf8) = 202410");
+    create_simplified_expr_test(col("i").eq(lit("202410")), "i@1 = 202410");
+    create_simplified_expr_test(lit("202410").eq(col("i")), "i@1 = 202410");
```

Review Comment: not sure if this test is still needed since the literal casting behavior is not considered an "optimization"
[PR] fix!: incorrect coercion when comparing with string literals [datafusion]
alan910127 opened a new pull request, #15482: URL: https://github.com/apache/datafusion/pull/15482

## Which issue does this PR close?
- Closes #15161.

## Rationale for this change
Currently, DataFusion handles comparisons between numbers and string literals differently from a number of databases. It coerces the number to a string, whereas other databases cast the literal to the column type and emit an error if the cast fails. This behavior can be unintuitive.

## What changes are included in this PR?
Updated [`TypeCoercionRewriter::coerce_binary_op`](https://github.com/apache/datafusion/blob/b4b8f6489c4a61ae2d27e99a2c897c9c594a15fb/datafusion/optimizer/src/analyzer/type_coercion.rs#L285) to cast string literals to the column type if one is present on either side of a comparison expression.

## Are these changes tested?
- Updated existing tests to reflect the new type coercion behavior.
- In `push_down_filter.slt`, some `explain` tests now produce no output when queries fail due to invalid casts. For now, I have updated these tests to expect empty output, but further adjustments may be needed.

## Are there any user-facing changes?
Yes. Queries that previously coerced numbers into strings will now fail if the string literal cannot be cast to the column type.

### Example
Before this change (success):
```sql
> CREATE TABLE t AS SELECT CAST(123 AS int) a;
> SELECT * FROM t WHERE a = '2147483648'; -- Not a valid i32
+---+
| a |
+---+
+---+
```
After this change (error):
```sql
> CREATE TABLE t AS SELECT CAST(123 AS int) a;
> SELECT * FROM t WHERE a = '2147483648'; -- Not a valid i32
type_coercion caused by Error during planning: Cannot coerce '2147483648' to type 'Int32'
```
Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]
alamb commented on issue #1576: URL: https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2762075206
> @andygrove @alamb would y'all be able to help here? I saw that the prebuilt JARs were tested for 3.5.4 and upwards. Are they backwards compatible?

Sorry, I don't know much about Comet or Java errors.
Re: [PR] Support Avg distinct for `float64` type [datafusion]
alamb commented on PR #15413: URL: https://github.com/apache/datafusion/pull/15413#issuecomment-2762447424 Run extended tests
[I] [DISCUSS] Data quality framework using DataFusion [datafusion]
jsai28 opened a new issue, #15483: URL: https://github.com/apache/datafusion/issues/15483 Would there be any interest in building a data quality framework like [Great Expectations](https://github.com/great-expectations/great_expectations) or [Deequ](https://github.com/awslabs/deequ) (built on Spark), except in Rust using DataFusion? As far as I am aware, there is nothing like this in Rust, let alone built on DataFusion. The idea is essentially a Rust-based tool to specify unit-like tests for your data. Users would specify tests (called expectations in Great Expectations) and then DataFusion could be used for the underlying metric computation. Essentially something like this:

```rust
fn main() {
    let validator = create_validator("example.csv");
    validator.is_not_null("id");     // specify column for null check
    validator.min_value("price", 0); // specify column and minimum value
    validator.validate();
}
```

Which could return an output like this:

```
✅ id: Passed (All values not null)
❌ price: Failed (2 values below 0)
```

It would be a pretty niche tool that could be a part of a larger data pipeline. I was thinking it could be a good project to work on for GSoC. What do you think?
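A std-only sketch of how the proposed expectation API could evaluate to pass/fail verdicts. The `Validator` shape and the in-memory `(id, price)` rows are hypothetical stand-ins; in the real tool the metric computation (null counts, range checks) would be delegated to DataFusion queries rather than the manual scan here:

```rust
/// Hypothetical validator: collects expectation results over a tiny
/// in-memory "table" of (id, price) rows.
struct Validator {
    rows: Vec<(Option<i64>, i64)>,
    results: Vec<String>,
}

impl Validator {
    /// Expectation: column `id` contains no nulls.
    fn is_not_null_id(&mut self) {
        let nulls = self.rows.iter().filter(|(id, _)| id.is_none()).count();
        self.results.push(if nulls == 0 {
            "✅ id: Passed (All values not null)".to_string()
        } else {
            format!("❌ id: Failed ({nulls} null values)")
        });
    }

    /// Expectation: column `price` has no values below `min`.
    fn min_value_price(&mut self, min: i64) {
        let below = self.rows.iter().filter(|(_, p)| *p < min).count();
        self.results.push(if below == 0 {
            format!("✅ price: Passed (All values >= {min})")
        } else {
            format!("❌ price: Failed ({below} values below {min})")
        });
    }
}

fn main() {
    let mut validator = Validator {
        rows: vec![(Some(1), 10), (Some(2), -3), (None, 5)],
        results: Vec::new(),
    };
    validator.is_not_null_id();
    validator.min_value_price(0);
    for line in &validator.results {
        println!("{line}");
    }
}
```

Swapping the manual scans for DataFusion `DataFrame` aggregations (e.g. counting rows where a column is null) would give the same verdicts while scaling to real datasets.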
Re: [PR] chore(deps): update rand requirement from 0.8 to 0.9 [datafusion]
comphead commented on PR #14333: URL: https://github.com/apache/datafusion/pull/14333#issuecomment-2762462451 depends on https://github.com/apache/datafusion/pull/14967
Re: [I] Consider using `with_skip_validation` for shuffle file reading [datafusion-ballista]
westhide commented on issue #1189: URL: https://github.com/apache/datafusion-ballista/issues/1189#issuecomment-2761307615 References: [Benchmarks for Arrow IPC reader](https://github.com/apache/arrow-rs/pull/7091)
Re: [PR] Simplify display format of `AggregateFunctionExpr`, add `Expr::sql_name` [datafusion]
berkaysynnada commented on PR #15253: URL: https://github.com/apache/datafusion/pull/15253#issuecomment-2761294822 @irenjj `AggregateFunctionExpr` has a `with_new_expressions()` API. Since DataFusion hasn't implemented it anywhere yet, you didn't run into the difficulty of rewriting `human_display` according to the new expressions. Could you suggest a way to update the `human_display` field for implementers of `with_new_expressions()`?
Re: [PR] Introduce selection vector repartitioning [datafusion]
goldmedal commented on PR #15423: URL: https://github.com/apache/datafusion/pull/15423#issuecomment-2761416899 > You should be able to get the test back by also setting `datafusion.optimizer.hash_join_single_partition_threshold` to `0` / a low value. Thanks. It works. I also added the test for the normal hash partition for RepartitionExec.
Re: [I] Support Min/Max accumulator for type List [datafusion]
LiaCastaneda commented on issue #15477: URL: https://github.com/apache/datafusion/issues/15477#issuecomment-2761931268 I think @ologlogn will take it
Re: [I] External sort failing with modest memory limit when writing parquet files [datafusion]
Kontinuation commented on issue #15028: URL: https://github.com/apache/datafusion/issues/15028#issuecomment-2761940822 > Excellent analysis folks. Parquet row group size makes a lot of sense since the rows are large. We can tune that way down since our use case isn't columnar. How do I get the good errors? I see that was merged last year, but I was using the latest release at time of repro I made some modifications to the code to use `TrackConsumersPool`. The error message will contain the amount of memory consumed by the top 10 consumers when a memory reservation failure happens.
```diff
diff --git a/src/main.rs b/src/main.rs
index 4413046..71af6d6 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -1,5 +1,6 @@
 use std::sync::Arc;
-use datafusion::execution::memory_pool::FairSpillPool;
+use std::num::NonZeroUsize;
+use datafusion::execution::memory_pool::{FairSpillPool, TrackConsumersPool};
 use datafusion::execution::runtime_env::RuntimeEnvBuilder;
 use datafusion::prelude::SessionConfig;
 use datafusion::prelude::SessionContext;
@@ -12,7 +13,11 @@ use datafusion::logical_expr::col;
 #[tokio::main]
 async fn main() -> anyhow::Result<()> {
     env_logger::init();
-    let pool = Arc::new(FairSpillPool::new(100 * 1024 * 1024));
+    let pool = Arc::new(
+        TrackConsumersPool::new(
+            FairSpillPool::new(100 * 1024 * 1024),
+            NonZeroUsize::new(10).unwrap()
+        ));
     let runtime_env = RuntimeEnvBuilder::new() // TODO: add disk
         .with_memory_pool(pool.clone()) // TODO: from config
         .build_arc()
@@ -33,6 +38,7 @@ async fn main() -> anyhow::Result<()> {
     table_opts.global.dictionary_enabled = Some(false);
     table_opts.global.statistics_enabled = Some("none".to_string());
     table_opts.global.bloom_filter_on_write = false;
+    // table_opts.global.max_row_group_size = 1000;
     table_opts.column_specific_options.insert(
         "id".to_string(),
```
[PR] doc: fix quick-start executor command [datafusion-ballista]
westhide opened a new pull request, #1217: URL: https://github.com/apache/datafusion-ballista/pull/1217
# Which issue does this PR close?
Closes N/A.
# Rationale for this change
# What changes are included in this PR?
# Are there any user-facing changes?
Re: [I] Consider using `with_skip_validation` for shuffle file reading [datafusion-ballista]
westhide commented on issue #1189: URL: https://github.com/apache/datafusion-ballista/issues/1189#issuecomment-2761293050 /take
Re: [PR] Simplify display format of `AggregateFunctionExpr`, add `Expr::sql_name` [datafusion]
irenjj commented on PR #15253: URL: https://github.com/apache/datafusion/pull/15253#issuecomment-2761472370 > @irenjj `AggregateFunctionExpr` has a `with_new_expressions()` API. Since DataFusion hasn't implemented it anywhere yet, you didn't run into the difficulty of rewriting `human_display` according to the new expressions. Could you suggest a way to update the `human_display` field for implementers of `with_new_expressions()`? I believe that the `human_display` in `AggregateFunctionExpr` is constructed during the logical planning phase based on the information available at that time (and the `name` in `AggregateFunctionExpr` is similar). On the other hand, `with_new_expressions()` replaces the `args` in `AggregateFunctionExpr` during the physical planning phase (perhaps related to the optimizer?). It's perfectly fine to keep `human_display` unchanged when implementing `with_new_expressions()`, because the `name` used in the original `indented explain` is actually quite similar to `human_display`. Perhaps what we should really focus on is ensuring that the explain output correctly displays the `name` of the `expressions` after `with_new_expressions()` has been applied.
Re: [PR] chore: Reimplement ShuffleWriterExec using interleave_record_batch [datafusion-comet]
mbutrovich commented on code in PR #1511: URL: https://github.com/apache/datafusion-comet/pull/1511#discussion_r2018762152
## native/core/src/execution/shuffle/shuffle_writer.rs:
```diff
@@ -422,27 +432,29 @@ impl ShuffleRepartitioner {
             .collect::>>()?;

         // use identical seed as spark hash partition
-        let hashes_buf = &mut self.hashes_buf[..arrays[0].len()];
+        let hashes_buf = &mut scratch.hashes_buf[..arrays[0].len()];
         hashes_buf.fill(42_u32);

         // Hash arrays and compute buckets based on number of partitions
-        let partition_ids = &mut self.partition_ids[..arrays[0].len()];
+        let partition_ids = &mut scratch.partition_ids[..arrays[0].len()];
         create_murmur3_hashes(&arrays, hashes_buf)?
             .iter()
             .enumerate()
             .for_each(|(idx, hash)| {
-                partition_ids[idx] = pmod(*hash, *num_output_partitions) as u64
+                partition_ids[idx] = pmod(*hash, *num_output_partitions) as u32;
             });

         // count each partition size
-        let mut partition_counters = vec![0usize; *num_output_partitions];
+        let partition_counters = &mut scratch.partition_starts;
+        partition_counters.resize(num_output_partitions + 1, 0);
+        partition_counters.fill(0);
         partition_ids
             .iter()
             .for_each(|partition_id| partition_counters[*partition_id as usize] += 1);

         // accumulate partition counters into partition ends
         // e.g. partition counter: [1, 3, 2, 1] => [1, 4, 6, 7]
```
Review Comment: Agreed. I don't think that is any more readable. Thank you for the explanation!
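The `[1, 3, 2, 1] => [1, 4, 6, 7]` accumulation discussed in the review is an inclusive prefix sum over the per-partition counts; a minimal stand-alone sketch (the function name is illustrative, not the PR's code):

```rust
// Inclusive prefix sum: per-partition counts -> partition end offsets.
fn partition_ends(counts: &[u32]) -> Vec<u32> {
    counts
        .iter()
        .scan(0u32, |acc, c| {
            *acc += c;     // running total so far
            Some(*acc)     // emit the end offset for this partition
        })
        .collect()
}

fn main() {
    // e.g. partition counters [1, 3, 2, 1] => ends [1, 4, 6, 7]
    assert_eq!(partition_ends(&[1, 3, 2, 1]), vec![1, 4, 6, 7]);
}
```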
Re: [PR] Update ClickBench queries to avoid to_timestamp_seconds [datafusion]
Dandandan commented on PR #15475: URL: https://github.com/apache/datafusion/pull/15475#issuecomment-2761610618 > Should we update the same query in the clickbench repo as well? Yes, and we might rerun the queries as well (as `to_timestamp_seconds` takes some time itself as well).
Re: [PR] Migrate-substrait-tests-to-insta, part2 [datafusion]
blaginin commented on PR #15480: URL: https://github.com/apache/datafusion/pull/15480#issuecomment-2762056261 Can you merge main into this branch please, to remove the diff?
Re: [PR] fix: make register_object_store use same session_env as file scan [datafusion-comet]
kazuyukitanimura commented on code in PR #1555: URL: https://github.com/apache/datafusion-comet/pull/1555#discussion_r2019103734
## native/core/src/parquet/mod.rs:
```diff
@@ -641,6 +640,8 @@
     session_timezone: jstring,
 ) -> jlong {
     try_unwrap_or_throw(&e, |mut env| unsafe {
+        let task_ctx = TaskContext::default();
```
Review Comment: @parthchandra Is this good for merging?
Re: [PR] Migrate-substrait-tests-to-insta, part2 [datafusion]
qstommyshu commented on PR #15480: URL: https://github.com/apache/datafusion/pull/15480#issuecomment-2762061935 I'll do a last commit to resolve those comments
Re: [PR] Add `FileScanConfigBuilder` [datafusion]
alamb commented on PR #15352: URL: https://github.com/apache/datafusion/pull/15352#issuecomment-2762070065 Thanks again @blaginin @mertak-synnada
Re: [PR] Add documentation for `Run extended tests` command [datafusion]
alamb commented on PR #15463: URL: https://github.com/apache/datafusion/pull/15463#issuecomment-2762078550 Thanks @comphead and @xudong963
Re: [PR] Add documentation for `Run extended tests` command [datafusion]
alamb merged PR #15463: URL: https://github.com/apache/datafusion/pull/15463
Re: [PR] Docs: Formatting and Added Extra resources [datafusion]
alamb commented on PR #15450: URL: https://github.com/apache/datafusion/pull/15450#issuecomment-2762082875 > > BTW @oznur-synnada I wonder if you have time to update the page with other recent blog content 🤔 > > You mean this? #15440 Yes! As well as https://datafusion.apache.org/blog/2025/03/21/parquet-pushdown https://datafusion.apache.org/blog/2025/03/20/parquet-pruning
Re: [PR] Migrate-substrait-tests-to-insta, part2 [datafusion]
alamb commented on PR #15480: URL: https://github.com/apache/datafusion/pull/15480#issuecomment-2762080799 🌶️
Re: [PR] Add dynamic pruning filters from TopK state [datafusion]
adriangb commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2006767851
## datafusion/core/src/datasource/physical_plan/parquet.rs:
```diff
@@ -1655,4 +1656,46 @@ mod tests {
         assert_eq!(calls.len(), 2);
         assert_eq!(calls, vec![Some(123), Some(456)]);
     }
+
+    #[tokio::test]
+    async fn test_topk_predicate_pushdown() {
+        let ctx = SessionContext::new();
+        let opt = ListingOptions::new(Arc::new(ParquetFormat::default()))
+            // We need to force 1 partition because TopK predicate pushdown happens on a per-partition basis
+            // If we had 1 file per partition (as an example) no pushdown would happen
+            .with_target_partitions(1);
+
+        let tmp_dir = TempDir::new().unwrap();
+        let path = tmp_dir.path().to_str().unwrap().to_string();
+        // The point here is that we write many, many files.
+        // So when we scan after we processed the first one we should be able to skip the rest
+        // because of the TopK predicate pushdown.
+        for file in 0..100 {
+            let name = format!("test{:02}.parquet", file);
+            write_file(&format!("{path}/{name}"));
+        }
+        ctx.register_listing_table("base_table", path, opt, None, None)
+            .await
+            .unwrap();
+
+        let query = "select name from base_table order by id desc limit 3";
+
+        let batches = ctx.sql(query).await.unwrap().collect().await.unwrap();
+        #[rustfmt::skip]
+        let expected = [
+            "++",
+            "| name |",
+            "++",
+            "| test02 |",
+            "| test02 |",
+            "| test02 |",
+            "++",
+        ];
+        assert_batches_eq!(expected, &batches);
+
+        let sql = format!("explain analyze {query}");
+        let batches = ctx.sql(&sql).await.unwrap().collect().await.unwrap();
+        let explain_plan = format!("{}", pretty_format_batches(&batches).unwrap());
+        assert_contains!(explain_plan, "row_groups_pruned_statistics=96");
```
Review Comment: Yes! More tests! I just tried this in my full system and found a bug w/ hive partition columns. Making a note to add a test and fix.
[PR] bench: Introduce cross platform Samply profiler [datafusion]
comphead opened a new pull request, #15481: URL: https://github.com/apache/datafusion/pull/15481
## Which issue does this PR close?
- Closes #.
## Rationale for this change
Introduce a cross-platform Samply profiler for DataFusion and benchmarks.
## What changes are included in this PR?
## Are these changes tested?
## Are there any user-facing changes?
Re: [PR] Add short circuit [datafusion]
acking-you commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2761813612 > I think one issue is that the short-circuit logic is not handling cases where the `rhs` contains NULLs. E.g. `true OR NULL` needs to evaluate to `NULL` After taking a closer look, the situation you mentioned does not actually reach the short-circuit optimization logic. The error I'm currently seeing is the use of `true_count() == 0` to determine whether the result is false, when in reality it could also be null.
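For reference, standard SQL three-valued `OR` semantics (which any short-circuit check has to preserve) can be sketched with `Option<bool>`, where `None` models NULL. This also illustrates why `true_count() == 0` alone is insufficient: an array with no `true` values may still contain NULLs, so the result is not necessarily `false`:

```rust
// SQL three-valued OR: TRUE OR NULL = TRUE, but FALSE OR NULL = NULL.
fn sql_or(lhs: Option<bool>, rhs: Option<bool>) -> Option<bool> {
    match (lhs, rhs) {
        (Some(true), _) | (_, Some(true)) => Some(true),
        (Some(false), Some(false)) => Some(false),
        _ => None, // at least one NULL and no TRUE
    }
}

fn main() {
    assert_eq!(sql_or(Some(true), None), Some(true));
    assert_eq!(sql_or(Some(false), None), None); // not false!
    assert_eq!(sql_or(None, None), None);
    assert_eq!(sql_or(Some(false), Some(false)), Some(false));
}
```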
Re: [I] Idea: Avoid planning CoalesceBatches in front of blocking operators. [datafusion]
Dandandan commented on issue #15478: URL: https://github.com/apache/datafusion/issues/15478#issuecomment-2762997345 > One downside: Increased memory usage. > > The hash join build side stores the RecordBatches in a vector before building the hash table. This vector will grow larger. In addition, we might delay e.g. garbage collection for StringViews, which also increases memory usage. Perhaps instead of removing the CoalesceBatches, we can change the threshold ratio to be much smaller (e.g. 1/10th of the size instead of 1/2, or something based on some benchmarking). That way the overhead of `Vec` is minimal while probably still avoiding redundant copying in many cases.
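The proposed knob can be sketched as follows (the helper name and the copy-on-small-batch rule are assumptions for illustration, not DataFusion's actual API): coalesce only when an input batch is below `ratio * target_batch_size`, with `ratio` lowered from 1/2 to something like 1/10:

```rust
// Hypothetical decision rule: copy (coalesce) a batch only when it is
// much smaller than the target batch size.
fn should_coalesce(batch_rows: usize, target_batch_size: usize, ratio: f64) -> bool {
    (batch_rows as f64) < ratio * (target_batch_size as f64)
}

fn main() {
    // With a ratio of 0.5, a 4000-row batch (target 8192) is copied...
    assert!(should_coalesce(4000, 8192, 0.5));
    // ...but with the proposed smaller ratio it is passed through as-is,
    // avoiding the copy at the cost of a slightly longer Vec of batches.
    assert!(!should_coalesce(4000, 8192, 0.1));
}
```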
Re: [PR] Support Avg distinct for `float64` type [datafusion]
Omega359 commented on PR #15413: URL: https://github.com/apache/datafusion/pull/15413#issuecomment-2762816764 Looks like it failed? https://github.com/apache/datafusion/actions/runs/14139465370/job/39618247236
Re: [I] Update ClickBench queries to avoid `to_timestamp_seconds` [datafusion]
Dandandan commented on issue #15465: URL: https://github.com/apache/datafusion/issues/15465#issuecomment-2762989314 Another discrepancy I found in the queries is the `"EventDate"::INT::DATE` casting. Is this something we could remove as well? Maybe it would be good to look at all other modifications that were applied to the queries and undo them if possible (or file issues when the planner fails to plan them).
Re: [I] Support for user defined FFI functions [datafusion-ballista]
unknown-no commented on issue #1215: URL: https://github.com/apache/datafusion-ballista/issues/1215#issuecomment-2763011354 Related [WASM UDFs](https://github.com/apache/datafusion/issues/9326)
Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]
parthchandra commented on issue #1576: URL: https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2762922710 This particular API is not a public API, and we use it so we can verify the metrics in tests. Maybe we can disable its use in non-test environments?
Re: [PR] Add dynamic pruning filters from TopK state [datafusion]
adriangb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2763043386 @alamb I've achieved 2/3 goals: - I added wrapping of a `DynamicFilterSource` in a `PhysicalExpr` such that it can dynamically update itself to prune rows using filter pushdown _even on a single file_. After various experiments I went with https://github.com/apache/datafusion/pull/15301/commits/1f2fcd832ce7432e9ecb23d698ebf83823f48405#diff-cb5bce9bce1aacd37171f2501f660cd564eb83dac28d331e39bef98aa227 which is a hybrid of explicitly passing around a `DynamicFilterSource` and an opaque `PhysicalExpr`: the idea is to explicitly pass around the dynamic filter source so that operators can opt into it explicitly and do special handling, such as taking a snapshot for serializing across the wire or for `PruningPredicate`, but are still able to convert it to a `PhysicalExpr` when needed. It was necessary to take quite a bit of care with the `PhysicalExpr` wrapper implementation because e.g. filter pushdown remaps column indices by rewriting children, so we need to do dynamic rewriting of children. But after it was all wired up I think it is pretty nice. - I switched the dynamic aspect from polling (you ask `DynamicFilterSource` for new filters) to a push model (when the TopK updates, it pushes the new filters into the shared state). I think this should be more performant, and it completely removes locks on the TopK heap. This could be made even more efficient if we bring back the `supports_dynamic_filter_pushdown() -> bool` method, since we could avoid doing some work and setting up references / locks if we knew ahead of time whether anything will need it. I was not able to get the global TopK thing working because it's not actually a global TopK doing the work at the top level: it's a `SortPreservingMergeExec`, which I think in theory we can implement this optimization for, but this PR is large enough as is.
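The push model described in the comment above can be illustrated with a stdlib-only sketch (the type and method names here are hypothetical; the actual PR wires this through `PhysicalExpr` and filter pushdown): the TopK side pushes its current cut-off key into shared state, and the scan side consults it to skip rows that cannot enter the top k.

```rust
use std::sync::RwLock;

// Hypothetical shared state for `ORDER BY key DESC LIMIT k`:
// once the heap is full, `threshold` holds the smallest key in the top k.
#[derive(Default)]
struct DynamicFilter {
    threshold: RwLock<Option<i64>>,
}

impl DynamicFilter {
    // TopK side: push a new threshold whenever the heap updates.
    fn push_threshold(&self, t: i64) {
        *self.threshold.write().unwrap() = Some(t);
    }

    // Scan side: a row can be skipped once its key cannot beat the heap.
    fn can_skip(&self, key: i64) -> bool {
        match *self.threshold.read().unwrap() {
            Some(t) => key <= t,
            None => false, // heap not yet full: keep everything
        }
    }
}

fn main() {
    let filter = DynamicFilter::default();
    assert!(!filter.can_skip(10)); // nothing pushed yet
    filter.push_threshold(100);
    assert!(filter.can_skip(50));   // cannot enter the top k
    assert!(!filter.can_skip(150)); // might enter the top k
}
```

The push direction means readers only take a cheap read lock, and the TopK heap itself never has to expose a lock to consumers.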
Re: [PR] fix!: incorrect coercion when comparing with string literals [datafusion]
jayzhan211 commented on code in PR #15482: URL: https://github.com/apache/datafusion/pull/15482#discussion_r2019668440
## datafusion/sqllogictest/test_files/push_down_filter.slt:
```diff
@@ -230,19 +230,19 @@
 logical_plan TableScan: t projection=[a], full_filters=[t.a != Int32(100)]

 query TT
 explain select a from t where a = '999';
-logical_plan TableScan: t projection=[a], full_filters=[CAST(t.a AS Utf8) = Utf8("999")]
+
```
Review Comment: why is there no plan?
Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]
xudong963 commented on issue #15072: URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2763074215 > For your planning purposes I will be away the week of April 21 -- so perhaps we can start testing a week earlier (week of April 7 so we have time to complete / fix issues prior to April 14) That sounds good!
Re: [PR] Support computing statistics for FileGroup [datafusion]
xudong963 commented on code in PR #15432: URL: https://github.com/apache/datafusion/pull/15432#discussion_r2019681642
## datafusion/core/src/datasource/statistics.rs:
```diff
@@ -145,7 +147,142 @@
     Ok((result_files, statistics))
 }

-fn add_row_stats(
+/// Generic function to compute statistics across multiple items that have statistics
+fn compute_summary_statistics<T, I>(
+    items: I,
+    file_schema: &SchemaRef,
+    stats_extractor: impl Fn(&T) -> Option<&Statistics>,
+) -> Statistics
+where
+    I: IntoIterator<Item = T>,
+{
+    let size = file_schema.fields().len();
+    let mut col_stats_set = vec![ColumnStatistics::default(); size];
+    let mut num_rows = Precision::Absent;
+    let mut total_byte_size = Precision::Absent;
+
+    for (idx, item) in items.into_iter().enumerate() {
+        if let Some(item_stats) = stats_extractor(&item) {
+            if idx == 0 {
+                // First item, set values directly
+                num_rows = item_stats.num_rows;
+                total_byte_size = item_stats.total_byte_size;
+                for (index, column_stats) in
+                    item_stats.column_statistics.iter().enumerate()
+                {
+                    col_stats_set[index].null_count = column_stats.null_count;
```
Review Comment: After trying, maybe can't avoid 😢
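The folding in the snippet above can be illustrated with a simplified stdlib-only sketch (the types here are hypothetical; DataFusion's `Precision` additionally tracks whether each value is exact or inexact): null counts add up across files, while min/max only widen.

```rust
// Simplified per-column statistics for one file.
#[derive(Debug, PartialEq, Clone)]
struct ColStats {
    null_count: u64,
    min: i64,
    max: i64,
}

// Fold per-file column statistics into a single summary:
// null counts are summed, min/max bounds are widened.
fn summarize(per_file: &[ColStats]) -> Option<ColStats> {
    per_file.iter().cloned().reduce(|acc, s| ColStats {
        null_count: acc.null_count + s.null_count,
        min: acc.min.min(s.min),
        max: acc.max.max(s.max),
    })
}

fn main() {
    let files = [
        ColStats { null_count: 1, min: 0, max: 10 },
        ColStats { null_count: 2, min: -5, max: 7 },
    ];
    assert_eq!(
        summarize(&files),
        Some(ColStats { null_count: 3, min: -5, max: 10 })
    );
}
```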
Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]
shehabgamin commented on issue #15072: URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2762517221

Happy to test whenever!
Re: [PR] feat: Add regexp_substr function [datafusion]
github-actions[bot] commented on PR #14323: URL: https://github.com/apache/datafusion/pull/14323#issuecomment-2763013966

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
Re: [PR] fix: aggregation corner case [datafusion]
jayzhan211 commented on PR #15457: URL: https://github.com/apache/datafusion/pull/15457#issuecomment-2763109194

> count(*) actually doesnt depend on any column on input

Logically, `count(*)` needs to know the number of rows of the input.
Re: [PR] Add short circuit [datafusion]
alamb commented on code in PR #15462: URL: https://github.com/apache/datafusion/pull/15462#discussion_r2019123741

## benchmarks/queries/clickbench/README.md: ##

@@ -120,13 +122,42 @@
 LIMIT 10;
 ```
 Results look like
-
+```
++-------------+---------------------+---+------+------+------+
+| ClientIP    | WatchID             | c | tmin | tp95 | tmax |
++-------------+---------------------+---+------+------+------+
+| 1611957945  | 6655575552203051303 | 2 | 0    | 0    | 0    |
+| -1402644643 | 8566928176839891583 | 2 | 0    | 0    | 0    |
++-------------+---------------------+---+------+------+------+
+```
+
+### Q6: How many social shares meet complex multi-stage filtering criteria?
+**Question**: What is the count of sharing actions from iPhone mobile users on specific social networks, within common timezones, participating in seasonal campaigns, with high screen resolutions and closely matched UTM parameters?
+**Important Query Properties**: Simple filter with high selectivity, costly string matching, a large number of high-overhead filters positioned relatively late in the process

Review Comment: 👍

## datafusion/physical-expr/src/expressions/binary.rs: ##

```rust
@@ -358,7 +358,50 @@ impl PhysicalExpr for BinaryExpr {
 fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
     use arrow::compute::kernels::numeric::*;

+    fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
+        let data_type = arg.data_type();
+        match (data_type, op) {
+            (DataType::Boolean, Operator::And) => {
+                match arg {
+                    ColumnarValue::Array(array) => {
+                        if let Ok(array) = as_boolean_array(&array) {
+                            return array.false_count() == array.len();
+                        }
+                    }
+                    ColumnarValue::Scalar(scalar) => {
+                        if let ScalarValue::Boolean(Some(value)) = scalar {
+                            return !value;
+                        }
+                    }
+                }
+                false
+            }
+            (DataType::Boolean, Operator::Or) => {
+                match arg {
+                    ColumnarValue::Array(array) => {
+                        if let Ok(array) = as_boolean_array(&array) {
+                            return array.true_count() == array.len();
+                        }
+                    }
+                    ColumnarValue::Scalar(scalar) => {
+                        if let ScalarValue::Boolean(Some(value)) = scalar {
+                            return *value;
+                        }
+                    }
+                }
+                false
+            }
+            _ => false,
+        }
+    }
```

Review Comment: Could you please also add some basic sqllogictests to ensure these code paths are hit? Or show that they are covered by existing tests?

## datafusion/physical-expr/src/expressions/binary.rs: ##

```rust
@@ -358,7 +358,50 @@ impl PhysicalExpr for BinaryExpr {
+    fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
```

Review Comment: If we find that this slows down some other performance we could also add some sort of heuristic check before calling `false_count` / `true_count` -- like for example if the rhs arg is "complex" (not a Column, for example)

## datafusion/physical-expr/src/expressions/binary.rs: ##

```rust
@@ -358,7 +358,50 @@ impl PhysicalExpr for BinaryExpr {
+    fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
```

Review Comment: Thanks @acking-you -- this looks great. Is there any reason to have this function defined in the `evaluate` method? I think we could just make it a normal function and reduce the nesting level
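The short-circuit idea in this diff — skip evaluating the right side of `AND`/`OR` when the left side alone decides the result — can be sketched on a plain `&[bool]` instead of Arrow's `BooleanArray` (the names below are illustrative, not the PR's API):

```rust
#[derive(Clone, Copy)]
enum Op {
    And,
    Or,
}

/// Returns true when the left-hand side alone determines the result:
/// all-false for AND (the result must be all false), all-true for OR
/// (the result must be all true). Mirrors check_short_circuit, with a
/// bool slice standing in for a null-free BooleanArray.
fn short_circuits(lhs: &[bool], op: Op) -> bool {
    match op {
        Op::And => lhs.iter().all(|v| !v), // false_count == len
        Op::Or => lhs.iter().all(|v| *v),  // true_count == len
    }
}
```

The trade-off alamb raises is visible here: `all` itself is an O(n) scan, so the check only pays off when evaluating the right-hand expression costs more than one pass over the batch.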
Re: [PR] Update ClickBench queries to avoid to_timestamp_seconds [datafusion]
Dandandan merged PR #15475: URL: https://github.com/apache/datafusion/pull/15475
Re: [I] Support RangePartitioning with native shuffle [datafusion-comet]
andygrove commented on issue #458: URL: https://github.com/apache/datafusion-comet/issues/458#issuecomment-2762962087

I discussed this feature with @mbutrovich recently and he may have additional thoughts on this topic.
Re: [I] Support RangePartitioning with native shuffle [datafusion-comet]
viirya commented on issue #458: URL: https://github.com/apache/datafusion-comet/issues/458#issuecomment-2763002946

The implementation issue for RangePartitioning, compared with other partitioning schemes like HashPartitioning, is that it involves sampling operations that Spark performs with an RDD. That is to say, the Comet library at each executor cannot determine the sampling results independently, as we do for general query execution.
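For background on why sampling must be coordinated: range partitioning derives partition boundaries from a global sample of the data, then routes each row to the partition whose boundary range contains it. A minimal sketch of just the boundary math (hypothetical helper names; Spark's actual `RangePartitioner` uses reservoir sampling per RDD partition and weighted candidates):

```rust
/// Compute `num_partitions - 1` boundary values as evenly spaced
/// quantiles of a sorted sample. Assumption: the sample approximates
/// the global distribution -- which is exactly why each executor
/// cannot compute its boundaries independently.
fn range_bounds(mut sample: Vec<i64>, num_partitions: usize) -> Vec<i64> {
    sample.sort_unstable();
    (1..num_partitions)
        .map(|i| sample[i * sample.len() / num_partitions])
        .collect()
}

/// Route a value to the first partition whose upper bound exceeds it;
/// values past the last bound go to the final partition.
fn partition_for(value: i64, bounds: &[i64]) -> usize {
    bounds.iter().position(|b| value < *b).unwrap_or(bounds.len())
}
```

If each executor sampled only its local data, two executors could compute different `bounds` and route equal keys to different partitions, breaking the shuffle contract.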
Re: [PR] feat: pushdown filter for native_iceberg_compat [datafusion-comet]
parthchandra commented on code in PR #1566: URL: https://github.com/apache/datafusion-comet/pull/1566#discussion_r2019634486

## spark/src/test/scala/org/apache/comet/parquet/ParquetReadSuite.scala: ##

```scala
@@ -1460,6 +1460,59 @@ class ParquetReadV1Suite extends ParquetReadSuite with AdaptiveSparkPlanHelper {
       v1 = Some("parquet"))
   }
 }
+
+  test("test V1 parquet scan filter pushdown of primitive types uses native_iceberg_compat") {
```

Review Comment: Can we enable this test for both native_datafusion and native_iceberg_compat?
Re: [PR] Clean up hash_join's ExecutionPlan::execute [datafusion]
ctsk commented on PR #15418: URL: https://github.com/apache/datafusion/pull/15418#issuecomment-2761412282

> You mean coalesce_partitions_if_needed() call is redundant in datafusion? I don't think that's the case, but if it is so, why don't we remove that line?

I wanted to keep the PR simple and *just* refactor. Here is a follow-up that removes it: https://github.com/apache/datafusion/pull/15476. Let's see if the tests pass :)

EnforceDistribution adds it here: https://github.com/apache/datafusion/blob/fa452e6789194e6da7c3aaf5e2f1ffa5679aacf2/datafusion/physical-optimizer/src/enforce_distribution.rs#L940-L968

`add_spm_on_top` gets called when the child operator has the `SinglePartition` requirement: https://github.com/apache/datafusion/blob/fa452e6789194e6da7c3aaf5e2f1ffa5679aacf2/datafusion/physical-optimizer/src/enforce_distribution.rs#L1256-L1259

This applies to a CollectLeft hash join's left child: https://github.com/apache/datafusion/blob/fa452e6789194e6da7c3aaf5e2f1ffa5679aacf2/datafusion/physical-plan/src/joins/hash_join.rs#L705-L710
Re: [I] [Bug] datafusion-cli may fail to read csv files [datafusion]
alamb commented on issue #15456: URL: https://github.com/apache/datafusion/issues/15456#issuecomment-2761675489

Nice find @chenkovsky -- so it looks like there is some bug in the data generator after all.
Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]
Dandandan commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018823379

## datafusion/datasource/benches/split_groups_by_statistics.rs: ##

```rust
@@ -0,0 +1,178 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::datatypes::{DataType, Field, Schema};
+use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
+use datafusion_common::stats::Precision;
+use datafusion_common::{ColumnStatistics, ScalarValue, Statistics};
+use datafusion_datasource::file_groups::FileGroup;
+use datafusion_datasource::file_scan_config::FileScanConfig;
+use datafusion_datasource::PartitionedFile;
+use datafusion_physical_expr::PhysicalSortExpr;
+use datafusion_physical_expr_common::sort_expr::LexOrdering;
+use object_store::{path::Path, ObjectMeta};
+use std::sync::Arc;
+use std::time::Duration;
+
+/// Generates test files with min-max statistics in different overlap patterns
+fn generate_test_files(num_files: usize, overlap_factor: f64) -> Vec<FileGroup> {
+    let mut files = Vec::with_capacity(num_files);
+    let range_size = if overlap_factor == 0.0 {
+        100 / num_files as i64
+    } else {
+        (100.0 / (overlap_factor * num_files as f64)).max(1.0) as i64
+    };
+
+    for i in 0..num_files {
+        let base = (i as f64 * range_size as f64 * (1.0 - overlap_factor)) as i64;
+        let min = base as f64;
+        let max = (base + range_size) as f64;
+
+        let file = PartitionedFile {
+            object_meta: ObjectMeta {
+                location: Path::from(format!("file_{}.parquet", i)),
+                last_modified: chrono::Utc::now(),
+                size: 1000,
+                e_tag: None,
+                version: None,
+            },
+            partition_values: vec![],
+            range: None,
+            statistics: Some(Statistics {
+                num_rows: Precision::Exact(100),
+                total_byte_size: Precision::Exact(1000),
+                column_statistics: vec![ColumnStatistics {
+                    null_count: Precision::Exact(0),
+                    max_value: Precision::Exact(ScalarValue::Float64(Some(max))),
+                    min_value: Precision::Exact(ScalarValue::Float64(Some(min))),
+                    sum_value: Precision::Absent,
+                    distinct_count: Precision::Absent,
+                }],
+            }),
+            extensions: None,
+            metadata_size_hint: None,
+        };
+        files.push(file);
+    }
+
+    vec![FileGroup::new(files)]
+}
+
+pub fn compare_split_groups_by_statistics_algorithms(c: &mut Criterion) {
```

Review Comment:

> Execution phase

Would be great to have some example of this. The impact on planning time seems not super high.
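The overlap pattern driving this benchmark can be checked in isolation: with `overlap_factor = 0.0` consecutive file ranges abut, and larger factors shift the bases closer together so ranges overlap. A standalone sketch of just the range arithmetic (same formulas as `generate_test_files`, without the Arrow/DataFusion scaffolding):

```rust
/// Compute (min, max) value ranges for `num_files` files whose ranges
/// overlap according to `overlap_factor` in [0.0, 1.0).
/// Mirrors the range arithmetic in the benchmark's generate_test_files.
fn file_ranges(num_files: usize, overlap_factor: f64) -> Vec<(f64, f64)> {
    // Width of each file's value range over a fixed [0, 100] domain.
    let range_size = if overlap_factor == 0.0 {
        100 / num_files as i64
    } else {
        (100.0 / (overlap_factor * num_files as f64)).max(1.0) as i64
    };
    (0..num_files)
        .map(|i| {
            // Shrinking the stride by (1 - overlap_factor) makes
            // neighboring ranges overlap by roughly that fraction.
            let base = (i as f64 * range_size as f64 * (1.0 - overlap_factor)) as i64;
            (base as f64, (base + range_size) as f64)
        })
        .collect()
}
```

More overlap means fewer non-overlapping groups `split_groups_by_statistics` can form, which is what the benchmark varies.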
Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]
mkgada commented on issue #1576: URL: https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2761689234

@jinwenjie123 appreciate your response! I am using one of the pre-built JARs. I will not be able to switch to Spark 3.4.x since our cluster was recently upgraded to 3.5.x and we would like to stick with it.

@andygrove @alamb would y'all be able to help here? I saw that the prebuilt JARs were tested for 3.5.4 and upwards. Are they backwards compatible?
[PR] Migrate-substrait-tests-to-insta, part2 [datafusion]
qstommyshu opened a new pull request, #15480: URL: https://github.com/apache/datafusion/pull/15480

## Which issue does this PR close?

- Closes #15398. Related #15444

## Rationale for this change

## What changes are included in this PR?

Migrated tests in datafusion/core/tests/substrait to use insta.

## Are these changes tested?

Yes, I manually tested the before/after changes.

## Are there any user-facing changes?

No.
Re: [PR] Migrate-substrait-tests-to-insta, part2 [datafusion]
qstommyshu commented on PR #15480: URL: https://github.com/apache/datafusion/pull/15480#issuecomment-2761715713

Hi @alamb and @blaginin, part 2 of the substrait test migration is done as well. Please take a look when you have time :)

The only tests that cannot be changed to `insta` are those in `simple_intersect_table_reuse()` and `simple_intersect()` in [this file](https://github.com/apache/datafusion/blob/main/datafusion/substrait/tests/cases/roundtrip_logical_plan.rs). I don't think it is possible to migrate these tests to `insta` because their `expected_plan_str` requires formatting a variable into the string. Rust cannot interpolate variables into a raw string literal, and a raw string literal is the required argument for an `insta` inline snapshot test. So I left those tests out.

Thanks
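The limitation described above comes down to a language rule: a raw string literal is fixed at compile time and does not interpolate, while interpolation via `format!` produces a runtime `String` rather than a literal. A quick illustration in plain Rust (insta itself is not involved; the function name is just for the example):

```rust
/// Returns (raw-literal text, runtime-formatted text) for comparison.
fn snapshot_strings(table: &str) -> (String, String) {
    // Raw string literal: `{table}` stays literal text, no interpolation.
    let raw = r"TableScan: {table}".to_string();
    // Interpolation requires format!, which yields a runtime String --
    // not the string literal an inline snapshot macro expects.
    let formatted = format!("TableScan: {table}");
    (raw, formatted)
}
```

This is why expected plans that embed a variable (e.g. a generated table name) cannot be expressed as inline snapshots.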
[PR] chore: Override node name for CometSparkToColumnar [datafusion-comet]
l0kr opened a new pull request, #1577: URL: https://github.com/apache/datafusion-comet/pull/1577

## Which issue does this PR close?

Closes #936.

## Rationale for this change

Previous PR went stale so I wanted to move it forward.

## What changes are included in this PR?

## How are these changes tested?

Added unit tests for row and columnar input.
Re: [PR] Add dynamic pruning filters from TopK state [datafusion]
adriangb commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2004247883

## datafusion/common/src/config.rs: ##

```rust
@@ -590,6 +590,9 @@ config_namespace! {
     /// during aggregations, if possible
     pub enable_topk_aggregation: bool, default = true

+    /// When set to true attempts to push down dynamic filters from TopK operations into file scans
+    pub enable_dynamic_filter_pushdown: bool, default = true
```

Review Comment: Probably should be false by default

## datafusion/datasource-parquet/src/opener.rs: ##

```rust
@@ -294,3 +345,127 @@ fn create_initial_plan(
     // default to scanning all row groups
     Ok(ParquetAccessPlan::new_all(row_group_count))
 }
+
+/// Build a pruning predicate from an optional predicate expression.
+/// If the predicate is None or the predicate cannot be converted to a pruning
+/// predicate, return None.
+/// If there is an error creating the pruning predicate it is recorded by incrementing
+/// the `predicate_creation_errors` counter.
+pub(crate) fn build_pruning_predicate(
+    predicate: Arc<dyn PhysicalExpr>,
+    file_schema: &SchemaRef,
+    predicate_creation_errors: &Count,
+) -> Option<Arc<PruningPredicate>> {
+    match PruningPredicate::try_new(predicate.clone(), Arc::clone(file_schema)) {
+        Ok(pruning_predicate) => {
+            if !pruning_predicate.always_true() {
+                return Some(Arc::new(pruning_predicate));
+            }
+        }
+        Err(e) => {
+            debug!("Could not create pruning predicate for: {e}");
+            predicate_creation_errors.add(1);
+        }
+    }
+    None
+}
+
+/// Build a page pruning predicate from an optional predicate expression.
+/// If the predicate is None or the predicate cannot be converted to a page pruning
+/// predicate, return None.
+pub(crate) fn build_page_pruning_predicate(
+    predicate: &Arc<dyn PhysicalExpr>,
+    file_schema: &SchemaRef,
+) -> Arc<PagePruningAccessPlanFilter> {
+    Arc::new(PagePruningAccessPlanFilter::new(
+        predicate,
+        Arc::clone(file_schema),
+    ))
+}
+
+/// A visitor for a PhysicalExpr that collects all column references to determine
+/// what columns the expression needs to be evaluated.
+struct FilterSchemaBuilder<'schema> {
+    filter_schema_fields: BTreeSet<Arc<Field>>,
+    file_schema: &'schema Schema,
+    table_schema: &'schema Schema,
+}
+
+impl<'schema> FilterSchemaBuilder<'schema> {
+    fn new(file_schema: &'schema Schema, table_schema: &'schema Schema) -> Self {
+        Self {
+            filter_schema_fields: BTreeSet::new(),
+            file_schema,
+            table_schema,
+        }
+    }
+
+    fn sort_fields(
+        fields: &mut Vec<Arc<Field>>,
+        table_schema: &Schema,
+        file_schema: &Schema,
+    ) {
+        fields.sort_by_key(|f| f.name().to_string());
+        fields.dedup_by_key(|f| f.name().to_string());
+        fields.sort_by_key(|f| {
+            let table_schema_index =
+                table_schema.index_of(f.name()).unwrap_or(usize::MAX);
+            let file_schema_index = file_schema.index_of(f.name()).unwrap_or(usize::MAX);
+            (table_schema_index, file_schema_index)
+        });
+    }
+
+    fn build(self) -> SchemaRef {
+        let mut fields = self.filter_schema_fields.into_iter().collect::<Vec<_>>();
+        FilterSchemaBuilder::sort_fields(
+            &mut fields,
+            self.table_schema,
+            self.file_schema,
+        );
+        Arc::new(Schema::new(fields))
+    }
+}
+
+impl<'node> TreeNodeVisitor<'node> for FilterSchemaBuilder<'_> {
+    type Node = Arc<dyn PhysicalExpr>;
+
+    fn f_down(
+        &mut self,
+        node: &'node Arc<dyn PhysicalExpr>,
+    ) -> Result<TreeNodeRecursion> {
+        if let Some(column) = node.as_any().downcast_ref::<Column>() {
+            if let Ok(field) = self.table_schema.field_with_name(column.name()) {
+                self.filter_schema_fields.insert(Arc::new(field.clone()));
+            } else if let Ok(field) = self.file_schema.field_with_name(column.name()) {
+                self.filter_schema_fields.insert(Arc::new(field.clone()));
+            } else {
+                // valid fields are the table schema's fields + the file schema's fields,
+                // preferring the table schema's fields when there is a conflict
+                let mut valid_fields = self
+                    .table_schema
+                    .fields()
+                    .iter()
+                    .chain(self.file_schema.fields().iter())
+                    .cloned()
+                    .collect::<Vec<_>>();
+                FilterSchemaBuilder::sort_fields(
+                    &mut valid_fields,
+                    self.table_schema,
+                    self.file_schema,
+                );
+                let valid_fields = valid_fields
+                    .into_iter()
+                    .map(|f| datafusion_common::Column::new_unqualified(f.name()))
+                    .collect();
+                let field = datafusion_common::Column::new_unqualified(column.name());
+                return Err(datafu
```