[PR] chore(deps): bump aws-config from 1.6.0 to 1.6.1 [datafusion]
dependabot[bot] opened a new pull request, #15470: URL: https://github.com/apache/datafusion/pull/15470

Bumps [aws-config](https://github.com/smithy-lang/smithy-rs) from 1.6.0 to 1.6.1. See the full diff in the [compare view](https://github.com/smithy-lang/smithy-rs/commits).

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

---

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] The average time compute for clickbench query should not inside the query iterator [datafusion]
zhuqi-lucas opened a new issue, #15471: URL: https://github.com/apache/datafusion/issues/15471

### Describe the bug

The average time computation for each ClickBench query should not be inside the query iterator; it was mistakenly added inside the iterator.

### To Reproduce

_No response_

### Expected behavior

_No response_

### Additional context

_No response_
Re: [I] Enable `split_file_groups_by_statistics` by default [datafusion]
xudong963 commented on issue #10336: URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2760506147

Fyi, I'm working on it.
Re: [PR] fix: the average time for clickbench query compute should outside the iterator loop [datafusion]
zhuqi-lucas commented on PR #15472: URL: https://github.com/apache/datafusion/pull/15472#issuecomment-2760512944

cc @xudong963 @2010YOUY01
Re: [I] The average time compute for clickbench query should not inside the query iterator [datafusion]
zhuqi-lucas commented on issue #15471: URL: https://github.com/apache/datafusion/issues/15471#issuecomment-2760505849

take
[PR] Improve `split_groups_by_statistics` method [datafusion]
xudong963 opened a new pull request, #15473: URL: https://github.com/apache/datafusion/pull/15473

## Which issue does this PR close?

- Closes https://github.com/apache/datafusion/issues/10336#issuecomment-2758082825

## Rationale for this change

As @suremarc and @leoyvens said, we should make the method load-balanced to preserve parallelism.

## What changes are included in this PR?

Introduced a new method that splits file groups into new groups based on statistics to enable efficient parallel processing.

## Are these changes tested?

TBD

## Are there any user-facing changes?

Users need to choose the proper `split_groups_by_statistics` method for their use cases.
Re: [PR] Improve `split_groups_by_statistics` method [datafusion]
xudong963 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018151341

## datafusion/datasource/src/file_scan_config.rs:

```diff
@@ -575,6 +575,95 @@ impl FileScanConfig {
         })
     }

+    /// Splits file groups into new groups based on statistics to enable efficient parallel processing.
+    ///
+    /// The method distributes files across a target number of partitions while ensuring
+    /// files within each partition maintain sort order based on their min/max statistics.
+    ///
+    /// The algorithm works by:
+    /// 1. Sorting all files by their minimum values
+    /// 2. Trying to place each file into an existing group where it can maintain sort order
+    /// 3. Creating new groups when necessary if a file cannot fit into existing groups
+    /// 4. Prioritizing smaller groups when multiple suitable groups exist (for load balancing)
+    ///
+    /// # Parameters
+    /// * `table_schema`: Schema containing information about the columns
+    /// * `file_groups`: The original file groups to split
+    /// * `sort_order`: The lexicographical ordering to maintain within each group
+    /// * `target_partitions`: The desired number of output partitions
+    ///
+    /// # Returns
+    /// A new set of file groups, where files within each group are non-overlapping with respect to
+    /// their min/max statistics and maintain the specified sort order.
+    pub fn split_groups_by_statistics_v2(
```

Review Comment: The name is TBD. I'm wondering if we need to keep the old method because I think both of them have applicable scenarios.
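The four steps in the doc comment above can be sketched as follows. This is an illustrative Rust outline only, not the actual DataFusion implementation: `FileStat` is a stand-in for a file's min/max statistics on the sort column, and the sketch uses `target_partitions` only as a capacity hint rather than capping the number of groups.

```rust
// Illustrative sketch only; not the DataFusion implementation.
// `FileStat` stands in for a file's min/max statistics on the sort column.
struct FileStat {
    min: i64,
    max: i64,
}

fn split_by_stats(mut files: Vec<FileStat>, target_partitions: usize) -> Vec<Vec<FileStat>> {
    // 1. Sort all files by their minimum values
    files.sort_by_key(|f| f.min);
    let mut groups: Vec<Vec<FileStat>> = Vec::with_capacity(target_partitions);
    for file in files {
        // 2. A group can accept the file if sort order is maintained, i.e. the
        //    group's current max does not exceed the file's min.
        // 4. Among suitable groups, prefer the smallest one (load balancing).
        let target = groups
            .iter()
            .enumerate()
            .filter(|(_, g)| g.last().map_or(true, |last| last.max <= file.min))
            .min_by_key(|(_, g)| g.len())
            .map(|(i, _)| i);
        match target {
            Some(i) => groups[i].push(file),
            // 3. Create a new group when the file fits into no existing group
            None => groups.push(vec![file]),
        }
    }
    groups
}
```

For example, files with (min, max) ranges (0, 10), (5, 15), (10, 20) split into two groups: the overlapping (5, 15) gets its own group, while (10, 20) appends after (0, 10).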
[PR] Partial fix for #1078 — [Add Dataframe display config] [datafusion-python]
kosiew opened a new pull request, #1086: URL: https://github.com/apache/datafusion-python/pull/1086

## Which issue does this PR close?

Partial fix for #1078

## Rationale for this change

This PR adds configurable display settings for `DataFrame` representations in the Python bindings of DataFusion. These changes improve flexibility and control over how DataFrames are visualized, especially in interactive environments like Jupyter notebooks. Users can now fine-tune memory usage, cell truncation behavior, and row limits for both string and HTML representations.

## What changes are included in this PR?

- Introduced a new `DisplayConfig` Python class to encapsulate display settings.
- Extended `PyDataFrame` to store and expose `DisplayConfig` via a new `display_config` property.
- Added `configure_display()` and `reset_display_config()` methods to allow users to modify or reset configuration.
- Updated DataFrame display logic (`__repr__` and `_repr_html_`) to respect the active configuration.
- Modified Rust backend to pass `DisplayConfig` to relevant formatting functions.
- Added extensive unit tests in `test_dataframe.py` to validate:
  - Custom configuration values
  - Invalid input handling
  - Display behavior for large cells, large tables, and row limits

## Are there any user-facing changes?

Yes. This PR introduces:
- `df.display_config`: Property to access current display configuration.
- `df.configure_display(...)`: Method to customize display settings.
- `df.reset_display_config()`: Method to reset to default configuration.
- More intuitive and memory-conscious display formatting in notebook and terminal environments.

> These changes enhance user control and improve display ergonomics for large or complex DataFrames.
Re: [PR] Add short circuit [datafusion]
ctsk commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2760574284

I think one issue is that the short-circuit logic is not handling cases where the `rhs` contains NULLs. E.g. `false OR NULL` needs to evaluate to `NULL`.
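For reference, SQL's three-valued (Kleene) OR semantics can be sketched with `Option<bool>` standing in for a nullable boolean. This is an illustrative sketch, not DataFusion's actual kernel: a short-circuit that simply returns the `lhs` whenever it is known would miss the case where a false `lhs` meets a NULL `rhs`.

```rust
// Kleene OR over nullable booleans, with None standing in for SQL NULL.
fn kleene_or(l: Option<bool>, r: Option<bool>) -> Option<bool> {
    match (l, r) {
        // A known true on either side dominates, even against NULL.
        (Some(true), _) | (_, Some(true)) => Some(true),
        // Both sides known false: the result is false.
        (Some(false), Some(false)) => Some(false),
        // Any remaining combination involves a NULL that cannot be resolved.
        _ => None,
    }
}
```

So `true OR NULL` is `true` (safe to short-circuit), but `false OR NULL` is `NULL`, which is the case a naive short-circuit on the `lhs` gets wrong.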
[PR] fix: the average time for clickbench query compute should outside the iterator loop [datafusion]
zhuqi-lucas opened a new pull request, #15472: URL: https://github.com/apache/datafusion/pull/15472

## Which issue does this PR close?

- Closes [#15471](https://github.com/apache/datafusion/issues/15471)

## Rationale for this change

The average time for each ClickBench query should be computed outside the iterator loop.

## What changes are included in this PR?

Moves the average-time computation outside the iterator loop.

## Are these changes tested?

Yes

## Are there any user-facing changes?

Fixes the wrong average time.
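The bug class can be illustrated with a minimal sketch (not the benchmark's actual code): collect all per-iteration times first, then compute the mean once, after the loop finishes, rather than reporting intermediate averages over a partially-filled vector.

```rust
// Illustrative sketch, not the benchmark's actual code: the average is
// computed once, over the completed slice of iteration times.
fn average_ms(iteration_times_ms: &[f64]) -> f64 {
    let total: f64 = iteration_times_ms.iter().sum();
    total / iteration_times_ms.len() as f64
}
```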
Re: [PR] chore(deps): bump aws-config from 1.6.0 to 1.6.1 [datafusion]
xudong963 merged PR #15470: URL: https://github.com/apache/datafusion/pull/15470
Re: [PR] fix: the average time for clickbench query compute should use new vec to make it compute for each query [datafusion]
2010YOUY01 merged PR #15472: URL: https://github.com/apache/datafusion/pull/15472
Re: [I] The average time compute for clickbench query is wrong [datafusion]
2010YOUY01 closed issue #15471: The average time compute for clickbench query is wrong URL: https://github.com/apache/datafusion/issues/15471
[PR] Partial fix for 1078: [refactor: Simplify HTML generation in PyDataFrame by extracting helper functions] [datafusion-python]
kosiew opened a new pull request, #1087: URL: https://github.com/apache/datafusion-python/pull/1087

# Which issue does this PR close?

Partial fix for #1078

# Rationale for this change

> Split up some of the html generation into a set of helper functions.

The rendering logic for DataFrame HTML output was previously monolithic. By modularizing the generation of HTML structure, styles, and table rows into clearly named helper functions, we improve readability, testability, and long-term maintainability. This refactor lays the foundation for easier customization and potential future features like theming or enhanced interactivity.

# What changes are included in this PR?

- Extracted style definitions into `get_html_style_definitions()`
- Moved the opening table tag into `get_html_table_opening()`
- Separated table header generation into `get_html_table_header()`
- Encapsulated array formatter creation logic in `create_batch_formatters()`
- Delegated row and cell HTML generation to `get_html_table_rows()` and `format_table_cell()`
- Moved JavaScript logic for expandable cells into `get_html_js_functions()`
- Preserved original behavior with no changes to the rendered output

# Are there any user-facing changes?

No functional changes for end users. The generated HTML and behavior remain the same, but the underlying implementation is now modular and easier to maintain.
Re: [I] [Bug] datafusion-cli may fail to read csv files [datafusion]
niebayes commented on issue #15456: URL: https://github.com/apache/datafusion/issues/15456#issuecomment-2760761409

The line number in the error message is the row index of a certain record batch, not the line number in the csv file. I have filed an issue to arrow-rs for making this error message more user-friendly.
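As a hypothetical illustration of the mismatch (the function name, batch size, and header handling are assumptions for the sketch, not datafusion-cli internals): a batch-relative row index only maps back to an absolute 1-based CSV line if you also know how many batches were already read and how large each batch is.

```rust
// Hypothetical arithmetic: recover an absolute 1-based CSV line number from a
// per-batch row index, given the number of fully-read batches, a fixed batch
// size, and whether the file has a header line.
fn absolute_csv_line(
    batches_read: usize,
    batch_size: usize,
    row_in_batch: usize,
    has_header: bool,
) -> usize {
    let header_lines = if has_header { 1 } else { 0 };
    // Rows consumed by earlier batches, plus the offset in the current batch,
    // plus the header, converted to a 1-based line number.
    batches_read * batch_size + row_in_batch + header_lines + 1
}
```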
Re: [I] [Bug] datafusion-cli may fail to read csv files [datafusion]
niebayes commented on issue #15456: URL: https://github.com/apache/datafusion/issues/15456#issuecomment-2760766577

> why there are two head rows

I didn't look into this. You might find the cause.
[PR] Update ClickBench queries to avoid to_timestamp_seconds [datafusion]
acking-you opened a new pull request, #15475: URL: https://github.com/apache/datafusion/pull/15475

## Which issue does this PR close?

- Closes #15465

## Rationale for this change

## What changes are included in this PR?

## Are these changes tested?

## Are there any user-facing changes?
Re: [PR] Add short circuit [datafusion]
acking-you commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2761137454

> I think one issue is that the short-circuit logic is not handling cases where the `rhs` contains NULLs. E.g. `false OR NULL` needs to evaluate to `NULL`

Thank you very much for your hint, it will be very helpful for me to fix these tests!
Re: [I] DeltaLake integration not working (Python) (FFI Table providers not working) [datafusion-python]
alamb commented on issue #1077: URL: https://github.com/apache/datafusion-python/issues/1077#issuecomment-2742876701

FYI @timsaucer
Re: [PR] Remove CoalescePartitions insertion from HashJoinExec [datafusion]
comphead commented on PR #15476: URL: https://github.com/apache/datafusion/pull/15476#issuecomment-2762263935

> Note that this does break for users of HashJoinExec that
>
> * Use the CollectLeft mode, with >1 partition on the build side AND
> * Construct their physical plan without running EnforceDistribution

Thanks @ctsk, what exactly is broken?
Re: [PR] chore: Override node name for CometSparkToColumnar [datafusion-comet]
codecov-commenter commented on PR #1577: URL: https://github.com/apache/datafusion-comet/pull/1577#issuecomment-2762138134

## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1577?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report

All modified and coverable lines are covered by tests :white_check_mark:

> Project coverage is 56.16%. Comparing base [(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`6f83dd8`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/6f83dd84d02fa77ae196684bf35adc4326bd97df?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
> Report is 105 commits behind head on main.

Additional details and impacted files

```diff
@@             Coverage Diff              @@
##               main    #1577      +/-  ##
===========================================
+ Coverage     56.12%   56.16%   +0.03%
+ Complexity      976      909      -67
===========================================
  Files           119      122       +3
  Lines         11743    12228     +485
  Branches       2251     2262      +11
===========================================
+ Hits           6591     6868     +277
- Misses         4012     4246     +234
+ Partials       1140     1114      -26
```

[:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1577?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).

:loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).

:rocket: New features to boost your workflow:
- :snowflake: [Test Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, report on failures, and find test suite problems.
Re: [PR] Remove CoalescePartitions insertion from HashJoinExec [datafusion]
ctsk commented on PR #15476: URL: https://github.com/apache/datafusion/pull/15476#issuecomment-2762302514

Before this PR, if someone hand-wired a CollectLeft HashJoin where the left child has more than one output partition, the HashJoin would automatically add a CoalescePartitions exec. This behaviour never triggers for plans constructed by DataFusion, because the EnforceDistribution pass makes sure that CoalescePartitions exists.

After this PR, if someone hand-wires a CollectLeft HashJoin, does not run the EnforceDistribution pass, and provides a left child that has more than one output partition, the resulting plan will return a wrong result when executed (because the hash table will only be built on partition 0).

Now that I've written it out, I believe it is strictly better to return a plan_err! if one constructs such a plan rather than return a wrong result.
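The `plan_err!` idea can be sketched as a validation step when the plan is built. The types and error message below are illustrative stand-ins, not the actual HashJoinExec API:

```rust
// Illustrative sketch: reject a hand-wired CollectLeft hash join whose build
// side has more than one partition instead of silently producing wrong results.
#[derive(PartialEq)]
enum PartitionMode {
    CollectLeft,
    Partitioned,
}

fn validate_collect_left(mode: &PartitionMode, left_partitions: usize) -> Result<(), String> {
    if *mode == PartitionMode::CollectLeft && left_partitions > 1 {
        return Err(
            "CollectLeft join expects a single build-side partition; \
             run EnforceDistribution or insert a coalesce first"
                .to_string(),
        );
    }
    Ok(())
}
```

Failing fast here turns a silent wrong-result bug into an explicit planning error for plans built without the optimizer passes.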
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
friendlymatthew commented on code in PR #15361: URL: https://github.com/apache/datafusion/pull/15361#discussion_r2019270254

## datafusion/functions/src/datetime/to_char.rs:

```diff
@@ -277,7 +282,25 @@ fn _to_char_array(args: &[ColumnarValue]) -> Result {
         let result = formatter.value(idx).try_to_string();
         match result {
             Ok(value) => results.push(Some(value)),
-            Err(e) => return exec_err!("{}", e),
+            Err(e) => {
+                if data_type == &Date32 {
```

Review Comment:

> Casting Date32 to Date64 should be incredibly cheap

I think I agree, converting involves a cast between primitives and a mult. https://github.com/apache/arrow-rs/blob/206fe024cdd9b9de770714baa27b58a1442fcc75/arrow-cast/src/cast/mod.rs#L1423-L1427

> Yeah, this is a limitation of arrow-rs currently. I never got around to pushing a PR to arrow-rs to change that.

I'd be happy to work on this next, if you'd like.
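The "cast between primitives and a mult" is just the following: Arrow's Date32 stores days since the Unix epoch as an `i32`, while Date64 stores milliseconds since the epoch as an `i64`, so the conversion is a widening cast plus one multiplication.

```rust
// Date32 (days since epoch, i32) -> Date64 (milliseconds since epoch, i64).
const MILLIS_PER_DAY: i64 = 86_400_000;

fn date32_to_date64(days: i32) -> i64 {
    days as i64 * MILLIS_PER_DAY
}
```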
Re: [PR] feat: enable iceberg compat tests, more tests for complex types [datafusion-comet]
parthchandra commented on code in PR #1550: URL: https://github.com/apache/datafusion-comet/pull/1550#discussion_r2019242229

## spark/src/main/scala/org/apache/spark/sql/comet/CometScanExec.scala:

```diff
@@ -490,8 +490,7 @@ object CometScanExec extends DataTypeSupport {
       // TODO add map
       dt match {
         case s: StructType => s.fields.map(_.dataType).forall(isTypeSupported)
-        // TODO: Add nested array and iceberg compat support
-        // case a: ArrayType => isTypeSupported(a.elementType)
+        case a: ArrayType => isTypeSupported(a.elementType)
```

Review Comment: I'm not sure myself how well arrays are supported by the scans. I suppose we need to add more tests for arrays?
Re: [I] Dynamic pruning filters from TopK state [datafusion]
alamb commented on issue #15037: URL: https://github.com/apache/datafusion/issues/15037#issuecomment-2762326990

@adriangb and I had a discussion about https://github.com/apache/datafusion/pull/15301 here are some notes:

## Usecases

- TopK dynamic filter pushdown
- Prune files with dynamic filter based on statistics
- Prune row groups with dynamic filter based on statistics
- Prune row pages with dynamic filter based on statistics
- Apply during filtering when pushdown enabled
- Join SIPs

## Pros / Cons

The pros for merging this PR are:

- We already have benchmarks that show some performance improvement

The cons:

- It requires a special implementation in any operators (like FileOpener) to take advantage of such filters. This is not a blocker in my mind, but I do think implementing a PhysicalExpr is a cleaner design. As Adrian says, we can refactor it over time if/when PhysicalExpr gets more sophisticated.
- We will get even more performance when filter_pushdown is enabled (again, maybe this is just follow-on work).

## Nice to haves

- For a plan with multiple partitions (e.g. for 16 input partitions, we end up with 17 TopK heaps, one for each partition and then a global one), this PR can only apply the per-partition TopK value.
  - It would be nice to somehow be able to use all the top values (aka pick the smallest one) when filtering.
- This PR takes a snapshot of the contents of the TopK heap when a file is opened and never changes it.
  - This is good for pruning, as all the pruning (file, row group and page) happens on file opening.
  - It is not as good for filter_pushdown, where the values in the TopK heap can change over the course of the query, so using the snapshot means the dynamic filter doesn't improve over time.

I believe Adrian is going to look into these, but I also think they could easily be done as a follow-on PR.
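As a minimal illustration of the TopK-snapshot idea discussed above (assuming `ORDER BY x ASC LIMIT k` with a max-heap holding the current k smallest values; this is a sketch, not the actual DataFusion code): once the heap is full, its largest element is a threshold, and any file, row group, or page whose minimum `x` is at or above it cannot change the result.

```rust
use std::collections::BinaryHeap;

// BinaryHeap is a max-heap, so for the k smallest values seen so far,
// peek() is the current "worst" value still in the top-k. Any row with
// x >= this threshold cannot enter the result and can be pruned.
fn topk_threshold(heap: &BinaryHeap<i64>, k: usize) -> Option<i64> {
    if heap.len() >= k {
        heap.peek().copied()
    } else {
        None // heap not yet full: no pruning possible
    }
}
```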
Re: [PR] feat: pushdown filter for native_iceberg_compat [datafusion-comet]
mbutrovich commented on PR #1566: URL: https://github.com/apache/datafusion-comet/pull/1566#issuecomment-2762276704

> Looks good, do we still need to wait for arrow-rs based on [#1566 (comment)](https://github.com/apache/datafusion-comet/pull/1566#issuecomment-2748737890)?

We can merge it. I'd rather be testing the pushdown logic as the default, even if it's a little bit slower right now.
Re: [PR] minor: Allow to run TPCH bench for a specific query [datafusion]
comphead commented on PR #15467: URL: https://github.com/apache/datafusion/pull/15467#issuecomment-276152

Thanks @xudong963
Re: [PR] Add `FileScanConfigBuilder` [datafusion]
alamb merged PR #15352: URL: https://github.com/apache/datafusion/pull/15352
Re: [PR] feat: pushdown filter for native_iceberg_compat [datafusion-comet]
kazuyukitanimura commented on PR #1566: URL: https://github.com/apache/datafusion-comet/pull/1566#issuecomment-2761992952 Looks good, do we still need to wait for arrow-rs based on https://github.com/apache/datafusion-comet/pull/1566#issuecomment-2748737890 ?
Re: [I] Organize fields inside `SortMergeJoinStream` [datafusion]
suibianwanwank commented on issue #15406: URL: https://github.com/apache/datafusion/issues/15406#issuecomment-2762027162 Hi, @2010YOUY01. I have read and tried to understand both SortMergeJoinStream and GroupedHashAggregateStream (though I still have some uncertainties). I have some initial thoughts on organizing the structure, but I'm not sure about the right level of granularity. One idea is to introduce a `SortMergeState`, which would roughly comprise:

```rust
struct SortMergeState {
    /// State of streamed
    pub streamed_state: StreamedState,
    /// State of buffered
    pub buffered_state: BufferedState,
    /// Staging output size, including output batches and staging joined results.
    /// Increased when we put rows into buffer and decreased after we actually output batches.
    /// Used to trigger output when sufficient rows are ready
    pub output_size: usize,
    /// Target output batch size
    pub batch_size: usize,
    /// Current processing record batch of streamed
    pub streamed_batch: StreamedBatch,
    /// Current buffered data
    pub buffered_data: BufferedData,
    /// (used in outer join) Is current streamed row joined at least once?
    pub streamed_joined: bool,
    /// (used in outer join) Is current buffered batches joined at least once?
    pub buffered_joined: bool,
    /// A unique number for each batch
    pub streamed_batch_counter: AtomicUsize,
    /// Staging output array builders
    pub staging_output_record_batches: JoinedRecordBatches,
    /// Output buffer. Currently used by filtering as it requires double buffering
    /// to avoid small/empty batches. Non-filtered join outputs directly from
    /// `staging_output_record_batches.batches`
    pub output: RecordBatch,
}
```

These contain the parts of the stream, buffer, and output that change during the merge process. Another idea is to organize the contents of the buffer, stream, and output separately, which would roughly comprise:

```rust
struct StreamContext {
    /// State of streamed
    pub streamed_state: StreamedState,
    /// Current processing record batch of streamed
    pub streamed_batch: StreamedBatch,
    /// (used in outer join) Is current streamed row joined at least once?
    pub streamed_joined: bool,
    /// Join key columns of streamed
    pub on_streamed: Vec<Arc<dyn PhysicalExpr>>,
    /// Streamed data stream
    pub streamed: SendableRecordBatchStream,
    /// Input schema of streamed
    pub streamed_schema: SchemaRef,
}

struct BufferContext {
    // ...
}

struct OutputState {
    // ...
}
```

Do you have any suggestions on which approach would be clearer and more readable? :)
Re: [I] Support RangePartitioning with native shuffle [datafusion-comet]
jinwenjie123 commented on issue #458: URL: https://github.com/apache/datafusion-comet/issues/458#issuecomment-2762497216 Hi @andygrove, may I ask why it was decided not to support RangePartitioning, and will it be supported in the near future? Thanks
Re: [I] Feature is not implemeneted: Unsupported cast with list of structs [datafusion]
ion-elgreco commented on issue #15338: URL: https://github.com/apache/datafusion/issues/15338#issuecomment-2762411530
> - Related discussion: https://github.com/apache/arrow-rs/issues/7176
>
> I think @kosiew may have a PR up that is related
> - https://github.com/apache/datafusion/pull/15295

I don't think that PR solves it for us; to my understanding, it helps when you read Parquet data, but we are evolving the schema within a dataframe cast.
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
Omega359 commented on code in PR #15361: URL: https://github.com/apache/datafusion/pull/15361#discussion_r2019195514

## datafusion/functions/src/datetime/to_char.rs ##

```diff
@@ -277,7 +282,25 @@ fn _to_char_array(args: &[ColumnarValue]) -> Result {
     let result = formatter.value(idx).try_to_string();
     match result {
         Ok(value) => results.push(Some(value)),
-        Err(e) => return exec_err!("{}", e),
+        Err(e) => {
+            if data_type == &Date32 {
```

Review Comment: I thought about this a bit and was mentally toying with the idea of testing the format string(s) for time formats vs just always casting ... and generally I don't think it would be worth it at all. Casting Date32 to Date64 should be incredibly cheap, I think.
Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]
Omega359 commented on code in PR #15361: URL: https://github.com/apache/datafusion/pull/15361#discussion_r2019198451

## datafusion/functions/src/datetime/to_char.rs ##

```diff
@@ -277,7 +282,25 @@ fn _to_char_array(args: &[ColumnarValue]) -> Result {
     let result = formatter.value(idx).try_to_string();
     match result {
         Ok(value) => results.push(Some(value)),
-        Err(e) => return exec_err!("{}", e),
+        Err(e) => {
+            if data_type == &Date32 {
```

Review Comment:
> The entire `Date32` array, because for every format string we create a new `ArrayFormatter` by supplying the entire array and specifying the index for the formatter to format:
>
> https://github.com/apache/datafusion/blob/fdb4e848b65c001dd3f65b477296e07cbe8e0b07/datafusion/functions/src/datetime/to_char.rs#L274-L277

Yeah, this is a limitation of arrow-rs currently. I never got around to pushing a PR to arrow-rs to change that.
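The "incredibly cheap" cast discussed above is just a per-value widening multiply: Date32 stores days since the UNIX epoch as `i32`, while Date64 stores milliseconds as `i64`. A minimal std-only sketch of the arithmetic (arrow's `cast` kernel does the equivalent over the whole array):

```rust
// Date32 stores i32 days since 1970-01-01; Date64 stores i64 milliseconds.
// The cast is one multiply per value: millis = days * 86_400_000.
fn date32_to_date64(days: &[i32]) -> Vec<i64> {
    days.iter().map(|&d| d as i64 * 86_400_000).collect()
}

fn main() {
    // 0 days -> epoch start; 1 day -> 86.4 million milliseconds.
    assert_eq!(date32_to_date64(&[0, 1]), vec![0, 86_400_000]);
    println!("{:?}", date32_to_date64(&[0, 1, 19_000]));
}
```

This is why casting the whole column before formatting is cheap relative to inspecting format strings for time specifiers.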
Re: [PR] Add dynamic pruning filters from TopK state [datafusion]
adriangb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2762301099 From discussion with Andrew, here are a couple of notes:
- The most granular update frequency of the filters is when the TopK itself updates, so we should switch from polling to pushing: `TopKDynamicFilterSource` should just hold a reference to its current filter, and the TopK should update `TopKDynamicFilterSource` when it gets updated, not the other way around.
- When you have, say, 10 partitions you end up with 11 TopK operators: one per partition plus a global one on top. Can we push down filters from the global TopK? My suggestion was that yes, we can, if we have each TopK accept filters from above it and forward them on.
- Can we update the dynamic filters for each batch that gets read inside of the openers? I think we might be able to; it might require some trickery though.
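The first bullet (push instead of poll) can be sketched as a shared slot that the TopK writes into whenever its heap threshold tightens, and that scan openers snapshot before reading a batch. All names and the `String` filter stand-in here are hypothetical, not DataFusion's actual `TopKDynamicFilterSource` API:

```rust
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for a serialized filter predicate such as `price < 42`.
type Filter = String;

/// Shared slot: the TopK pushes its latest threshold filter in, and scan
/// openers take a snapshot before reading each batch, so nothing polls the
/// TopK itself.
#[derive(Clone, Default)]
struct DynamicFilterSource {
    current: Arc<RwLock<Option<Filter>>>,
}

impl DynamicFilterSource {
    /// Called by the TopK whenever its heap threshold tightens.
    fn push(&self, filter: Filter) {
        *self.current.write().unwrap() = Some(filter);
    }

    /// Called by scan openers; cheap read of the latest pushed filter.
    fn snapshot(&self) -> Option<Filter> {
        self.current.read().unwrap().clone()
    }
}

fn main() {
    let source = DynamicFilterSource::default();
    assert!(source.snapshot().is_none());

    // The TopK found a tighter threshold -> push the new filter.
    source.push("price < 42".to_string());
    assert_eq!(source.snapshot().as_deref(), Some("price < 42"));
    println!("filter: {:?}", source.snapshot());
}
```

The same slot could also receive filters forwarded from a global TopK above, per the second bullet.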
Re: [PR] refactor: Move `Memtable` to catalog [datafusion]
logan-keede commented on code in PR #15459: URL: https://github.com/apache/datafusion/pull/15459#discussion_r2019291249

## datafusion/catalog/src/memory/table.rs ##

@@ -0,0 +1,377 @@

```rust
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

//! [`MemTable`] for querying `Vec<RecordBatch>` by DataFusion.

use std::any::Any;
use std::collections::HashMap;
use std::fmt::{self, Debug};
use std::sync::Arc;

use crate::TableProvider;
use datafusion_common::error::Result;
use datafusion_expr::Expr;
use datafusion_expr::TableType;
use datafusion_physical_expr::create_physical_sort_exprs;
use datafusion_physical_plan::repartition::RepartitionExec;
use datafusion_physical_plan::{
    common, DisplayAs, DisplayFormatType, ExecutionPlan, ExecutionPlanProperties,
    Partitioning, SendableRecordBatchStream,
};

use arrow::datatypes::SchemaRef;
use arrow::record_batch::RecordBatch;
use datafusion_common::{not_impl_err, plan_err, Constraints, DFSchema, SchemaExt};
use datafusion_common_runtime::JoinSet;
use datafusion_datasource::memory::MemorySourceConfig;
use datafusion_datasource::sink::{DataSink, DataSinkExec};
use datafusion_datasource::source::DataSourceExec;
use datafusion_execution::TaskContext;
use datafusion_expr::dml::InsertOp;
use datafusion_expr::SortExpr;
use datafusion_session::Session;

use async_trait::async_trait;
use futures::StreamExt;
use log::debug;
use parking_lot::Mutex;
use tokio::sync::RwLock;

/// Type alias for partition data
pub type PartitionData = Arc<RwLock<Vec<RecordBatch>>>;

/// In-memory data source for presenting a `Vec<RecordBatch>` as a
/// data source that can be queried by DataFusion. This allows data to
/// be pre-loaded into memory and then repeatedly queried without
/// incurring additional file I/O overhead.
#[derive(Debug)]
pub struct MemTable {
    schema: SchemaRef,
    // make it pub(crate) when possible
    pub batches: Vec<PartitionData>,
    constraints: Constraints,
    column_defaults: HashMap<String, Expr>,
    /// Optional pre-known sort order(s). Must be `SortExpr`s.
    /// inserting data into this table removes the order
    pub sort_order: Arc<Mutex<Vec<Vec<SortExpr>>>>,
}

impl MemTable {
    /// Create a new in-memory table from the provided schema and record batches
    pub fn try_new(schema: SchemaRef, partitions: Vec<Vec<RecordBatch>>) -> Result<Self> {
        for batches in partitions.iter().flatten() {
            let batches_schema = batches.schema();
            if !schema.contains(&batches_schema) {
                debug!(
                    "mem table schema does not contain batches schema. \
                     Target_schema: {schema:?}. Batches Schema: {batches_schema:?}"
                );
                return plan_err!("Mismatch between schema and batches");
            }
        }

        Ok(Self {
            schema,
            batches: partitions
                .into_iter()
                .map(|e| Arc::new(RwLock::new(e)))
                .collect::<Vec<_>>(),
            constraints: Constraints::empty(),
            column_defaults: HashMap::new(),
            sort_order: Arc::new(Mutex::new(vec![])),
        })
    }

    /// Assign constraints
    pub fn with_constraints(mut self, constraints: Constraints) -> Self {
        self.constraints = constraints;
        self
    }

    /// Assign column defaults
    pub fn with_column_defaults(
        mut self,
        column_defaults: HashMap<String, Expr>,
    ) -> Self {
        self.column_defaults = column_defaults;
        self
    }

    /// Specify an optional pre-known sort order(s). Must be `SortExpr`s.
    ///
    /// If the data is not sorted by this order, DataFusion may produce
    /// incorrect results.
    ///
    /// DataFusion may take advantage of this ordering to omit sorts
    /// or use more efficient algorithms.
    ///
    /// Note that multiple sort orders are supported, if some are known to be
    /// equivalent,
    pub fn with_sort_order(self, mut sort_order: Vec<Vec<SortExpr>>) -> Self {
        std::mem::swap(self.sort_order.lock().as_mut(), &mut sort_order);
        self
    }

    /// Create a mem table by reading from another data source
    pub async fn load(
        t: Arc<dyn TableProvider>,
        output_partitions: Option<usize>,
        state:
```
Re: [PR] Update ClickBench queries to avoid to_timestamp_seconds [datafusion]
acking-you commented on PR #15475: URL: https://github.com/apache/datafusion/pull/15475#issuecomment-2761888661 > Looks good to me. Since we're only ordering by this it shouldn't matter that we order by an integer instead of a proper timestamp, ordering is equivalent. Thank you very much for your reply, but I have a question: why can't the following `order by "EventTime"` be modified? It seems that they are equivalent (since the default is UTC timestamps).
Re: [I] [Bug] datafusion-cli may fail to read csv files [datafusion]
alamb closed issue #15456: [Bug] datafusion-cli may fail to read csv files URL: https://github.com/apache/datafusion/issues/15456
Re: [PR] bench: Introduce cross platform Samply profiler [datafusion]
parthchandra commented on PR #15481: URL: https://github.com/apache/datafusion/pull/15481#issuecomment-2761890159 There is also some profiling information in https://datafusion.apache.org/comet/contributor-guide/profiling_native_code.html I'm presuming this will replace that? How does samply compare with cargo flamegraph?
[PR] feat: Improve fetch partition performance, support skip validation arrow ipc files [datafusion-ballista]
westhide opened a new pull request, #1216: URL: https://github.com/apache/datafusion-ballista/pull/1216

# Which issue does this PR close?
Closes #1189.

# Rationale for this change

# What changes are included in this PR?

# Are there any user-facing changes?
Re: [PR] Remove CoalescePartitions insertion from HashJoinExec [datafusion]
ctsk commented on PR #15476: URL: https://github.com/apache/datafusion/pull/15476#issuecomment-2761891586 Note that this does break for users of HashJoinExec that:
- use the CollectLeft mode with >1 partition on the build side, AND
- construct their physical plan without running EnforceDistribution
Re: [PR] experiment: Selectively remove CoalesceBatchesExec [datafusion]
Dandandan commented on code in PR #15479: URL: https://github.com/apache/datafusion/pull/15479#discussion_r2019008151

## datafusion/physical-optimizer/src/coalesce_batches.rs ##

```diff
@@ -92,3 +92,73 @@ impl PhysicalOptimizerRule for CoalesceBatches {
         true
     }
 }
+
+/// Remove CoalesceBatchesExec that are in front of a AggregateExec
+#[derive(Default, Debug)]
+pub struct UnCoalesceBatches {}
```

Review Comment: I am wondering if we can instead avoid adding them in the `CoalesceBatches` optimizer.
Re: [PR] experiment: Selectively remove CoalesceBatchesExec [datafusion]
Dandandan commented on PR #15479: URL: https://github.com/apache/datafusion/pull/15479#issuecomment-2761889602 That makes a lot of sense!
Re: [I] External sort failing with modest memory limit when writing parquet files [datafusion]
ivankelly commented on issue #15028: URL: https://github.com/apache/datafusion/issues/15028#issuecomment-276185 Excellent analysis, folks. Parquet row group size makes a lot of sense since the rows are large. We can tune that way down since our use case isn't columnar. How do I get the good errors? I see that was merged last year, but I was using the latest release at the time of repro.
Re: [I] [Bug] datafusion-cli may fail to read csv files [datafusion]
alamb commented on issue #15456: URL: https://github.com/apache/datafusion/issues/15456#issuecomment-2761888196 Turns out this is a bug in the generator -- https://github.com/clflushopt/tpchgen-rs/issues/73#issuecomment-2761885245
Re: [PR] Update ClickBench queries to avoid to_timestamp_seconds [datafusion]
adriangb commented on PR #15475: URL: https://github.com/apache/datafusion/pull/15475#issuecomment-2761917991
> > Looks good to me. Since we're only ordering by this it shouldn't matter that we order by an integer instead of a proper timestamp, ordering is equivalent.
>
> Thank you very much for your reply, but I have a question: why can't the following `order by "EventTime"` be modified? It seems that they are equivalent (since the default is UTC timestamps). And both ClickHouse and duckdb do it this way: [duckdb](https://github.com/ClickHouse/ClickBench/blob/main/duckdb/queries.sql#L24-L27) [clickhouse](https://github.com/ClickHouse/ClickBench/blob/main/clickhouse/queries.sql#L24-L27)

I'm agreeing with you! I think this change is good 😄
Re: [PR] bench: Introduce cross platform Samply profiler [datafusion]
comphead commented on PR #15481: URL: https://github.com/apache/datafusion/pull/15481#issuecomment-2761920912
> There is also some profiling information in https://datafusion.apache.org/comet/contributor-guide/profiling_native_code.html I'm presuming this will replace that? How does samply compare with cargo flamegraph?

The tool also builds a flame graph; it can be found in a separate tab. For Comet, I think we can refer to this section to provide alternative profiling options.
Re: [PR] fix: Assertion fail in external sort [datafusion]
comphead commented on code in PR #15469: URL: https://github.com/apache/datafusion/pull/15469#discussion_r2019237566

## datafusion/physical-plan/src/sorts/sort.rs ##

```diff
@@ -416,21 +409,23 @@ impl ExternalSorter {
             Some(self.spill_manager.create_in_progress_file("Sorting")?);
         }

-        self.organize_stringview_arrays()?;
+        Self::organize_stringview_arrays(globally_sorted_batches)?;

         debug!("Spilling sort data of ExternalSorter to disk whilst inserting");

-        let batches = std::mem::take(&mut self.in_mem_batches);
+        let batches_to_spill = std::mem::take(globally_sorted_batches);
         self.reservation.free();

         let in_progress_file = self.in_progress_spill_file.as_mut().ok_or_else(|| {
             internal_datafusion_err!("In-progress spill file should be initialized")
         })?;

-        for batch in batches {
+        for batch in batches_to_spill {
             in_progress_file.append_batch(&batch)?;
         }
+        assert!(globally_sorted_batches.is_empty());
```

Review Comment: perhaps we can get rid of the assert and return `Err`? If the code regresses, users could face an app crash.
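The reviewer's suggestion (return an `Err` rather than crashing on a hard `assert!`) can be sketched like this; the error type and function body are illustrative stand-ins, not the actual `ExternalSorter` internals:

```rust
// Illustrative stand-in for DataFusion's error type.
#[derive(Debug)]
struct DataFusionError(String);

/// Drain the batches into the spill file, then report a regression as a
/// recoverable `Err` instead of aborting the process via `assert!`.
fn spill_batches(globally_sorted_batches: &mut Vec<u32>) -> Result<(), DataFusionError> {
    let batches_to_spill = std::mem::take(globally_sorted_batches);
    for _batch in batches_to_spill {
        // in the real code: in_progress_file.append_batch(&batch)?;
    }

    // Was: assert!(globally_sorted_batches.is_empty());
    if !globally_sorted_batches.is_empty() {
        return Err(DataFusionError(
            "globally_sorted_batches should be drained after spilling".to_string(),
        ));
    }
    Ok(())
}

fn main() {
    let mut batches = vec![1, 2, 3];
    assert!(spill_batches(&mut batches).is_ok());
    assert!(batches.is_empty());
    println!("spill ok");
}
```

With this shape, a future regression surfaces to callers as a planning/execution error they can handle, instead of a panic.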
Re: [PR] fix!: incorrect coercion when comparing with string literals [datafusion]
alan910127 commented on code in PR #15482: URL: https://github.com/apache/datafusion/pull/15482#discussion_r2019310584

## datafusion/core/tests/expr_api/mod.rs ##

```diff
@@ -330,12 +330,12 @@ async fn test_create_physical_expr_coercion() {
     create_expr_test(lit(1i32).eq(col("id")), "CAST(1 AS Utf8) = id@0");
     // compare int col to string literal `i = '202410'`
     // Note this casts the column (not the field)
-    create_expr_test(col("i").eq(lit("202410")), "CAST(i@1 AS Utf8) = 202410");
-    create_expr_test(lit("202410").eq(col("i")), "202410 = CAST(i@1 AS Utf8)");
+    create_expr_test(col("i").eq(lit("202410")), "i@1 = 202410");
+    create_expr_test(lit("202410").eq(col("i")), "202410 = i@1");
     // however, when simplified the casts on i should removed
     // https://github.com/apache/datafusion/issues/14944
-    create_simplified_expr_test(col("i").eq(lit("202410")), "CAST(i@1 AS Utf8) = 202410");
-    create_simplified_expr_test(lit("202410").eq(col("i")), "CAST(i@1 AS Utf8) = 202410");
+    create_simplified_expr_test(col("i").eq(lit("202410")), "i@1 = 202410");
+    create_simplified_expr_test(lit("202410").eq(col("i")), "i@1 = 202410");
```

Review Comment: not sure if this test is still needed since the literal casting behavior is not considered an "optimization"
[PR] fix!: incorrect coercion when comparing with string literals [datafusion]
alan910127 opened a new pull request, #15482: URL: https://github.com/apache/datafusion/pull/15482

## Which issue does this PR close?
- Closes #15161.

## Rationale for this change
Currently, DataFusion handles comparisons between numbers and string literals differently from a number of databases. It coerces the number to a string, whereas other databases cast the literal to the column type and emit an error if the cast fails. This behavior can be unintuitive.

## What changes are included in this PR?
Updated [`TypeCoercionRewriter::coerce_binary_op`](https://github.com/apache/datafusion/blob/b4b8f6489c4a61ae2d27e99a2c897c9c594a15fb/datafusion/optimizer/src/analyzer/type_coercion.rs#L285) to cast string literals to the column type if one is present on either side of a comparison expression.

## Are these changes tested?
- Updated existing tests to reflect the new type coercion behavior.
- In `push_down_filter.slt`, some `explain` tests now produce no output when queries fail due to invalid casts. For now, I have updated these tests to expect empty output, but further adjustments may be needed.

## Are there any user-facing changes?
Yes. Queries that previously coerced numbers into strings will now fail if the string literal cannot be cast to the column type.

### Example
Before this change (success):
```sql
> CREATE TABLE t AS SELECT CAST(123 AS int) a;
> SELECT * FROM t WHERE a = '2147483648'; -- Not a valid i32
+---+
| a |
+---+
+---+
```
After this change (error):
```sql
> CREATE TABLE t AS SELECT CAST(123 AS int) a;
> SELECT * FROM t WHERE a = '2147483648'; -- Not a valid i32
type_coercion caused by Error during planning: Cannot coerce '2147483648' to type 'Int32'
```
Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]
alamb commented on issue #1576: URL: https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2762075206
> @andygrove @alamb would y'all be able to help here? I saw that the prebuilt JARs were tested for 3.5.4 and upwards. Are they backwards compatible?

Sorry, I don't know much about Comet or Java errors.
Re: [PR] Support Avg distinct for `float64` type [datafusion]
alamb commented on PR #15413: URL: https://github.com/apache/datafusion/pull/15413#issuecomment-2762447424 Run extended tests
[I] [DISCUSS] Data quality framework using DataFusion [datafusion]
jsai28 opened a new issue, #15483: URL: https://github.com/apache/datafusion/issues/15483 Would there be any interest in building a data quality framework like [Great Expectations](https://github.com/great-expectations/great_expectations) or [Deequ](https://github.com/awslabs/deequ) (built on Spark), except in Rust using DataFusion? As far as I am aware, there is nothing like this in Rust, let alone built on DataFusion. The idea is essentially a Rust-based tool to specify unit-like tests for your data. Users would specify tests (called expectations in Great Expectations) and then DataFusion could be used for the underlying metric computation. Essentially something like this:

```rust
fn main() {
    let validator = create_validator("example.csv");
    validator.is_not_null("id");     // specify column for null check
    validator.min_value("price", 0); // specify column and minimum value
    validator.validate();
}
```

Which could return an output like this:

```
✅ id: Passed (All values not null)
❌ price: Failed (2 values below 0)
```

It would be a pretty niche tool that could be a part of a larger data pipeline. I was thinking it could be a good project to work on for GSoC. What do you think?
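A std-only sketch of how the proposed expectation API could evaluate to pass/fail verdicts. The `Validator` shape and the in-memory `(id, price)` rows are hypothetical stand-ins; in the real tool the metric computation (null counts, range checks) would be delegated to DataFusion queries rather than the manual scan here:

```rust
/// Hypothetical validator: collects expectation results over a tiny
/// in-memory "table" of (id, price) rows.
struct Validator {
    rows: Vec<(Option<i64>, i64)>,
    results: Vec<String>,
}

impl Validator {
    /// Expectation: column `id` contains no nulls.
    fn is_not_null_id(&mut self) {
        let nulls = self.rows.iter().filter(|(id, _)| id.is_none()).count();
        self.results.push(if nulls == 0 {
            "✅ id: Passed (All values not null)".to_string()
        } else {
            format!("❌ id: Failed ({nulls} null values)")
        });
    }

    /// Expectation: column `price` has no values below `min`.
    fn min_value_price(&mut self, min: i64) {
        let below = self.rows.iter().filter(|(_, p)| *p < min).count();
        self.results.push(if below == 0 {
            format!("✅ price: Passed (All values >= {min})")
        } else {
            format!("❌ price: Failed ({below} values below {min})")
        });
    }
}

fn main() {
    let mut validator = Validator {
        rows: vec![(Some(1), 10), (Some(2), -3), (None, 5)],
        results: Vec::new(),
    };
    validator.is_not_null_id();
    validator.min_value_price(0);
    for line in &validator.results {
        println!("{line}");
    }
}
```

Swapping the manual scans for DataFusion `DataFrame` aggregations (e.g. counting rows where a column is null) would give the same verdicts while scaling to real datasets.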
Re: [PR] chore(deps): update rand requirement from 0.8 to 0.9 [datafusion]
comphead commented on PR #14333: URL: https://github.com/apache/datafusion/pull/14333#issuecomment-2762462451 depends on https://github.com/apache/datafusion/pull/14967
Re: [I] Consider using `with_skip_validation` for shuffle file reading [datafusion-ballista]
westhide commented on issue #1189: URL: https://github.com/apache/datafusion-ballista/issues/1189#issuecomment-2761307615 References: [Benchmarks for Arrow IPC reader](https://github.com/apache/arrow-rs/pull/7091)
Re: [PR] Simplify display format of `AggregateFunctionExpr`, add `Expr::sql_name` [datafusion]
berkaysynnada commented on PR #15253: URL: https://github.com/apache/datafusion/pull/15253#issuecomment-2761294822 @irenjj `AggregateFunctionExpr` has a `with_new_expressions()` API. Since DataFusion hasn't implemented it anywhere yet, you didn't run into the difficulty of rewriting `human_display` according to the new expressions. Could you suggest a way to update the `human_display` field for implementers of `with_new_expressions()`?
Re: [PR] Introduce selection vector repartitioning [datafusion]
goldmedal commented on PR #15423: URL: https://github.com/apache/datafusion/pull/15423#issuecomment-2761416899 > You should be able to get the test back by also setting `datafusion.optimizer.hash_join_single_partition_threshold` to `0` / a low value. Thanks. It works. I also added the test for the normal hash partition for RepartitionExec.
Re: [I] Support Min/Max accumulator for type List [datafusion]
LiaCastaneda commented on issue #15477: URL: https://github.com/apache/datafusion/issues/15477#issuecomment-2761931268 I think @ologlogn will take it
Re: [I] External sort failing with modest memory limit when writing parquet files [datafusion]
Kontinuation commented on issue #15028: URL: https://github.com/apache/datafusion/issues/15028#issuecomment-2761940822 > Excellent analysis folks. Parquet row group size makes a lot of sense since the rows are large. We can tune that way down since our use case isn't columnar. How do I get the good errors? I see that was merged last year, but I was using the latest release at time of repro I made some modifications to the code to use `TrackConsumersPool`. The error message will contain the amount of memory consumed by the top 10 consumers when a memory reservation failure happens.
```diff
diff --git a/src/main.rs b/src/main.rs
index 4413046..71af6d6 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -1,5 +1,6 @@
 use std::sync::Arc;
-use datafusion::execution::memory_pool::FairSpillPool;
+use std::num::NonZeroUsize;
+use datafusion::execution::memory_pool::{FairSpillPool, TrackConsumersPool};
 use datafusion::execution::runtime_env::RuntimeEnvBuilder;
 use datafusion::prelude::SessionConfig;
 use datafusion::prelude::SessionContext;
@@ -12,7 +13,11 @@ use datafusion::logical_expr::col;
 #[tokio::main]
 async fn main() -> anyhow::Result<()> {
     env_logger::init();
-    let pool = Arc::new(FairSpillPool::new(100 * 1024 * 1024));
+    let pool = Arc::new(
+        TrackConsumersPool::new(
+            FairSpillPool::new(100 * 1024 * 1024),
+            NonZeroUsize::new(10).unwrap()
+        ));
     let runtime_env = RuntimeEnvBuilder::new() // TODO: add disk
         .with_memory_pool(pool.clone()) // TODO: from config
         .build_arc()
@@ -33,6 +38,7 @@ async fn main() -> anyhow::Result<()> {
     table_opts.global.dictionary_enabled = Some(false);
     table_opts.global.statistics_enabled = Some("none".to_string());
     table_opts.global.bloom_filter_on_write = false;
+    // table_opts.global.max_row_group_size = 1000;
     table_opts.column_specific_options.insert(
         "id".to_string(),
```
[PR] doc: fix quick-start executor command [datafusion-ballista]
westhide opened a new pull request, #1217: URL: https://github.com/apache/datafusion-ballista/pull/1217
# Which issue does this PR close?
Closes N/A.
# Rationale for this change
# What changes are included in this PR?
# Are there any user-facing changes?
Re: [I] Consider using `with_skip_validation` for shuffle file reading [datafusion-ballista]
westhide commented on issue #1189: URL: https://github.com/apache/datafusion-ballista/issues/1189#issuecomment-2761293050 /take
Re: [PR] Simplify display format of `AggregateFunctionExpr`, add `Expr::sql_name` [datafusion]
irenjj commented on PR #15253: URL: https://github.com/apache/datafusion/pull/15253#issuecomment-2761472370 > @irenjj `AggregateFunctionExpr` has a `with_new_expressions()` API. Since DataFusion hasn't implemented it anywhere yet, you didn't run into the difficulty of rewriting `human_display` according to the new expressions. Could you suggest a way to update the `human_display` field for implementers of `with_new_expressions()`? I believe that the `human_display` in `AggregateFunctionExpr` is constructed during the logical planning phase based on the information available at that time (and the `name` in `AggregateFunctionExpr` is similar). On the other hand, `with_new_expressions()` replaces the `args` in `AggregateFunctionExpr` during the physical planning phase (perhaps related to the optimizer?). It's perfectly fine to keep `human_display` unchanged when implementing `with_new_expressions()`, because the `name` used in the original `indented explain` is actually quite similar to `human_display`. Perhaps what we should really focus on is ensuring that the explain output correctly displays the `name` of the `expressions` after `with_new_expressions()` has been applied.
Re: [PR] chore: Reimplement ShuffleWriterExec using interleave_record_batch [datafusion-comet]
mbutrovich commented on code in PR #1511: URL: https://github.com/apache/datafusion-comet/pull/1511#discussion_r2018762152
## native/core/src/execution/shuffle/shuffle_writer.rs:
```diff
@@ -422,27 +432,29 @@ impl ShuffleRepartitioner {
             .collect::>>()?;

         // use identical seed as spark hash partition
-        let hashes_buf = &mut self.hashes_buf[..arrays[0].len()];
+        let hashes_buf = &mut scratch.hashes_buf[..arrays[0].len()];
         hashes_buf.fill(42_u32);

         // Hash arrays and compute buckets based on number of partitions
-        let partition_ids = &mut self.partition_ids[..arrays[0].len()];
+        let partition_ids = &mut scratch.partition_ids[..arrays[0].len()];
         create_murmur3_hashes(&arrays, hashes_buf)?
             .iter()
             .enumerate()
             .for_each(|(idx, hash)| {
-                partition_ids[idx] = pmod(*hash, *num_output_partitions) as u64
+                partition_ids[idx] = pmod(*hash, *num_output_partitions) as u32;
             });

         // count each partition size
-        let mut partition_counters = vec![0usize; *num_output_partitions];
+        let partition_counters = &mut scratch.partition_starts;
+        partition_counters.resize(num_output_partitions + 1, 0);
+        partition_counters.fill(0);
         partition_ids
             .iter()
             .for_each(|partition_id| partition_counters[*partition_id as usize] += 1);

         // accumulate partition counters into partition ends
         // e.g. partition counter: [1, 3, 2, 1] => [1, 4, 6, 7]
```
Review Comment: Agreed. I don't think that is any more readable. Thank you for the explanation!
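The `[1, 3, 2, 1] => [1, 4, 6, 7]` accumulation discussed in the review is an inclusive prefix sum over the per-partition counts; a minimal stand-alone sketch (the function name is illustrative, not the PR's code):

```rust
// Inclusive prefix sum: per-partition counts -> partition end offsets.
fn partition_ends(counts: &[u32]) -> Vec<u32> {
    counts
        .iter()
        .scan(0u32, |acc, c| {
            *acc += c;     // running total so far
            Some(*acc)     // emit the end offset for this partition
        })
        .collect()
}

fn main() {
    // e.g. partition counters [1, 3, 2, 1] => ends [1, 4, 6, 7]
    assert_eq!(partition_ends(&[1, 3, 2, 1]), vec![1, 4, 6, 7]);
}
```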
Re: [PR] Update ClickBench queries to avoid to_timestamp_seconds [datafusion]
Dandandan commented on PR #15475: URL: https://github.com/apache/datafusion/pull/15475#issuecomment-2761610618 > Should we update the same query in the clickbench repo as well? Yes, and we might rerun the queries as well (as `to_timestamp_seconds` takes some time itself as well).
Re: [PR] Migrate-substrait-tests-to-insta, part2 [datafusion]
blaginin commented on PR #15480: URL: https://github.com/apache/datafusion/pull/15480#issuecomment-2762056261 Can you merge main into this branch please, to remove the diff?
Re: [PR] fix: make register_object_store use same session_env as file scan [datafusion-comet]
kazuyukitanimura commented on code in PR #1555: URL: https://github.com/apache/datafusion-comet/pull/1555#discussion_r2019103734
## native/core/src/parquet/mod.rs:
```diff
@@ -641,6 +640,8 @@
     session_timezone: jstring,
 ) -> jlong {
     try_unwrap_or_throw(&e, |mut env| unsafe {
+        let task_ctx = TaskContext::default();
```
Review Comment: @parthchandra Is this good for merging?
Re: [PR] Migrate-substrait-tests-to-insta, part2 [datafusion]
qstommyshu commented on PR #15480: URL: https://github.com/apache/datafusion/pull/15480#issuecomment-2762061935 I'll do a last commit to resolve those comments
Re: [PR] Add `FileScanConfigBuilder` [datafusion]
alamb commented on PR #15352: URL: https://github.com/apache/datafusion/pull/15352#issuecomment-2762070065 Thanks again @blaginin @mertak-synnada
Re: [PR] Add documentation for `Run extended tests` command [datafusion]
alamb commented on PR #15463: URL: https://github.com/apache/datafusion/pull/15463#issuecomment-2762078550 Thanks @comphead and @xudong963
Re: [PR] Add documentation for `Run extended tests` command [datafusion]
alamb merged PR #15463: URL: https://github.com/apache/datafusion/pull/15463
Re: [PR] Docs: Formatting and Added Extra resources [datafusion]
alamb commented on PR #15450: URL: https://github.com/apache/datafusion/pull/15450#issuecomment-2762082875 > > BTW @oznur-synnada I wonder if you have time to update the page with other recent blog content 🤔 > > You mean this? #15440 Yes! As well as https://datafusion.apache.org/blog/2025/03/21/parquet-pushdown https://datafusion.apache.org/blog/2025/03/20/parquet-pruning
Re: [PR] Migrate-substrait-tests-to-insta, part2 [datafusion]
alamb commented on PR #15480: URL: https://github.com/apache/datafusion/pull/15480#issuecomment-2762080799 🌶️
Re: [PR] Add dynamic pruning filters from TopK state [datafusion]
adriangb commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2006767851
## datafusion/core/src/datasource/physical_plan/parquet.rs:
```diff
@@ -1655,4 +1656,46 @@ mod tests {
         assert_eq!(calls.len(), 2);
         assert_eq!(calls, vec![Some(123), Some(456)]);
     }
+
+    #[tokio::test]
+    async fn test_topk_predicate_pushdown() {
+        let ctx = SessionContext::new();
+        let opt = ListingOptions::new(Arc::new(ParquetFormat::default()))
+            // We need to force 1 partition because TopK predicate pushdown happens on a per-partition basis
+            // If we had 1 file per partition (as an example) no pushdown would happen
+            .with_target_partitions(1);
+
+        let tmp_dir = TempDir::new().unwrap();
+        let path = tmp_dir.path().to_str().unwrap().to_string();
+        // The point here is that we write many, many files.
+        // So when we scan after we processed the first one we should be able to skip the rest
+        // because of the TopK predicate pushdown.
+        for file in 0..100 {
+            let name = format!("test{:02}.parquet", file);
+            write_file(&format!("{path}/{name}"));
+        }
+        ctx.register_listing_table("base_table", path, opt, None, None)
+            .await
+            .unwrap();
+
+        let query = "select name from base_table order by id desc limit 3";
+
+        let batches = ctx.sql(query).await.unwrap().collect().await.unwrap();
+        #[rustfmt::skip]
+        let expected = [
+            "++",
+            "| name |",
+            "++",
+            "| test02 |",
+            "| test02 |",
+            "| test02 |",
+            "++",
+        ];
+        assert_batches_eq!(expected, &batches);
+
+        let sql = format!("explain analyze {query}");
+        let batches = ctx.sql(&sql).await.unwrap().collect().await.unwrap();
+        let explain_plan = format!("{}", pretty_format_batches(&batches).unwrap());
+        assert_contains!(explain_plan, "row_groups_pruned_statistics=96");
```
Review Comment: Yes! More tests! I just tried this in my full system and found a bug w/ hive partition columns. Making a note to add a test and fix.
[PR] bench: Introduce cross platform Samply profiler [datafusion]
comphead opened a new pull request, #15481: URL: https://github.com/apache/datafusion/pull/15481
## Which issue does this PR close?
- Closes #.
## Rationale for this change
Introduce a cross-platform Samply profiler for DataFusion and benchmarks.
## What changes are included in this PR?
## Are these changes tested?
## Are there any user-facing changes?
Re: [PR] Add short circuit [datafusion]
acking-you commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2761813612 > I think one issue is that the short-circuit logic is not handling cases where the `rhs` contains NULLs. E.g. `true OR NULL` needs to evaluate to `NULL` After taking a closer look, the situation you mentioned does not actually reach the short-circuit optimization logic. The error I'm currently seeing is the use of `true_count() == 0` to determine whether the result is false, when in reality it could also be null.
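For reference, standard SQL three-valued `OR` semantics (which any short-circuit check has to preserve) can be sketched with `Option<bool>`, where `None` models NULL. This also illustrates why `true_count() == 0` alone is insufficient: an array with no `true` values may still contain NULLs, so the result is not necessarily `false`:

```rust
// SQL three-valued OR: TRUE OR NULL = TRUE, but FALSE OR NULL = NULL.
fn sql_or(lhs: Option<bool>, rhs: Option<bool>) -> Option<bool> {
    match (lhs, rhs) {
        (Some(true), _) | (_, Some(true)) => Some(true),
        (Some(false), Some(false)) => Some(false),
        _ => None, // at least one NULL and no TRUE
    }
}

fn main() {
    assert_eq!(sql_or(Some(true), None), Some(true));
    assert_eq!(sql_or(Some(false), None), None); // not false!
    assert_eq!(sql_or(None, None), None);
    assert_eq!(sql_or(Some(false), Some(false)), Some(false));
}
```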
Re: [I] Idea: Avoid planning CoalesceBatches in front of blocking operators. [datafusion]
Dandandan commented on issue #15478: URL: https://github.com/apache/datafusion/issues/15478#issuecomment-2762997345 > One downside: Increased memory usage. > > The hash join build side stores the RecordBatches in a vector before building the hash table. This vector will grow larger. In addition, we might delay e.g. garbage collection for StringViews, which also increases memory usage. Perhaps instead of removing the CoalesceBatches, we can change the threshold ratio to be much smaller (e.g. 1/10th of the size instead of 1/2, or something based on some benchmarking). That way the overhead of `Vec` is minimal while probably still avoiding redundant copying in many cases.
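The proposed knob can be sketched as follows (the helper name and the copy-on-small-batch rule are assumptions for illustration, not DataFusion's actual API): coalesce only when an input batch is below `ratio * target_batch_size`, with `ratio` lowered from 1/2 to something like 1/10:

```rust
// Hypothetical decision rule: copy (coalesce) a batch only when it is
// much smaller than the target batch size.
fn should_coalesce(batch_rows: usize, target_batch_size: usize, ratio: f64) -> bool {
    (batch_rows as f64) < ratio * (target_batch_size as f64)
}

fn main() {
    // With a ratio of 0.5, a 4000-row batch (target 8192) is copied...
    assert!(should_coalesce(4000, 8192, 0.5));
    // ...but with the proposed smaller ratio it is passed through as-is,
    // avoiding the copy at the cost of a slightly longer Vec of batches.
    assert!(!should_coalesce(4000, 8192, 0.1));
}
```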
Re: [PR] Support Avg distinct for `float64` type [datafusion]
Omega359 commented on PR #15413: URL: https://github.com/apache/datafusion/pull/15413#issuecomment-2762816764 Looks like it failed? https://github.com/apache/datafusion/actions/runs/14139465370/job/39618247236
Re: [I] Update ClickBench queries to avoid `to_timestamp_seconds` [datafusion]
Dandandan commented on issue #15465: URL: https://github.com/apache/datafusion/issues/15465#issuecomment-2762989314 Another discrepancy I found in the queries is the `"EventDate"::INT::DATE` casting. Is this something we could remove as well? Maybe it would be good to look at all other modifications that were applied to the queries and undo them if possible (or file issues when the planner fails to plan them).
Re: [I] Support for user defined FFI functions [datafusion-ballista]
unknown-no commented on issue #1215: URL: https://github.com/apache/datafusion-ballista/issues/1215#issuecomment-2763011354 Related [WASM UDFs](https://github.com/apache/datafusion/issues/9326)
Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]
parthchandra commented on issue #1576: URL: https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2762922710 This particular API is not a public API, and we use it so we can verify the metrics in tests. Maybe we can disable its use in non-test environments?
Re: [PR] Add dynamic pruning filters from TopK state [datafusion]
adriangb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2763043386 @alamb I've achieved 2/3 goals: - I added wrapping of a `DynamicFilterSource` in a `PhysicalExpr` such that it can dynamically update itself to prune rows using filter pushdown _even on a single file_. After various experiments I went with https://github.com/apache/datafusion/pull/15301/commits/1f2fcd832ce7432e9ecb23d698ebf83823f48405#diff-cb5bce9bce1aacd37171f2501f660cd564eb83dac28d331e39bef98aa227 which is a hybrid of explicitly passing around a `DynamicFilterSource` and an opaque `PhysicalExpr`: the idea is to explicitly pass around the dynamic filter source so that operators can opt into it explicitly and do special handling, such as taking a snapshot for serializing across the wire or for `PruningPredicate`, but are still able to convert it to a `PhysicalExpr` when needed. It was necessary to take quite a bit of care with the `PhysicalExpr` wrapper implementation because e.g. filter pushdown remaps column indices by rewriting children, so we need to do dynamic rewriting of children. But after it was all wired up I think it is pretty nice. - I switched the dynamic aspect from polling (you ask `DynamicFilterSource` for new filters) to a push model (when the TopK updates, it pushes the new filters into the shared state). I think this should be more performant, and it completely removes locks on the TopK heap. This could be made even more efficient if we bring back the `supports_dynamic_filter_pushdown() -> bool` method, since we could avoid doing some work and setting up references / locks if we knew ahead of time whether anything will need it. I was not able to get the global TopK thing working because it's not actually a global TopK doing the work at the top level: it's a `SortPreservingMergeExec`, which I think in theory we can implement this optimization for, but this PR is large enough as is.
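The push model described in the comment above can be illustrated with a stdlib-only sketch (the type and method names here are hypothetical; the actual PR wires this through `PhysicalExpr` and filter pushdown): the TopK side pushes its current cut-off key into shared state, and the scan side consults it to skip rows that cannot enter the top k.

```rust
use std::sync::RwLock;

// Hypothetical shared state for `ORDER BY key DESC LIMIT k`:
// once the heap is full, `threshold` holds the smallest key in the top k.
#[derive(Default)]
struct DynamicFilter {
    threshold: RwLock<Option<i64>>,
}

impl DynamicFilter {
    // TopK side: push a new threshold whenever the heap updates.
    fn push_threshold(&self, t: i64) {
        *self.threshold.write().unwrap() = Some(t);
    }

    // Scan side: a row can be skipped once its key cannot beat the heap.
    fn can_skip(&self, key: i64) -> bool {
        match *self.threshold.read().unwrap() {
            Some(t) => key <= t,
            None => false, // heap not yet full: keep everything
        }
    }
}

fn main() {
    let filter = DynamicFilter::default();
    assert!(!filter.can_skip(10)); // nothing pushed yet
    filter.push_threshold(100);
    assert!(filter.can_skip(50));   // cannot enter the top k
    assert!(!filter.can_skip(150)); // might enter the top k
}
```

The push direction means readers only take a cheap read lock, and the TopK heap itself never has to expose a lock to consumers.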
Re: [PR] fix!: incorrect coercion when comparing with string literals [datafusion]
jayzhan211 commented on code in PR #15482: URL: https://github.com/apache/datafusion/pull/15482#discussion_r2019668440
## datafusion/sqllogictest/test_files/push_down_filter.slt:
```diff
@@ -230,19 +230,19 @@
 logical_plan TableScan: t projection=[a], full_filters=[t.a != Int32(100)]

 query TT
 explain select a from t where a = '999';
-logical_plan TableScan: t projection=[a], full_filters=[CAST(t.a AS Utf8) = Utf8("999")]
+
```
Review Comment: why is there no plan?
Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]
xudong963 commented on issue #15072: URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2763074215 > For your planning purposes I will be away the week of April 21 -- so perhaps we can start testing a week earlier (week of April 7 so we have time to complete / fix issues prior to April 14) That sounds good!
Re: [PR] Support computing statistics for FileGroup [datafusion]
xudong963 commented on code in PR #15432: URL: https://github.com/apache/datafusion/pull/15432#discussion_r2019681642
## datafusion/core/src/datasource/statistics.rs:
```diff
@@ -145,7 +147,142 @@
     Ok((result_files, statistics))
 }

-fn add_row_stats(
+/// Generic function to compute statistics across multiple items that have statistics
+fn compute_summary_statistics<T, I>(
+    items: I,
+    file_schema: &SchemaRef,
+    stats_extractor: impl Fn(&T) -> Option<&Statistics>,
+) -> Statistics
+where
+    I: IntoIterator<Item = T>,
+{
+    let size = file_schema.fields().len();
+    let mut col_stats_set = vec![ColumnStatistics::default(); size];
+    let mut num_rows = Precision::Absent;
+    let mut total_byte_size = Precision::Absent;
+
+    for (idx, item) in items.into_iter().enumerate() {
+        if let Some(item_stats) = stats_extractor(&item) {
+            if idx == 0 {
+                // First item, set values directly
+                num_rows = item_stats.num_rows;
+                total_byte_size = item_stats.total_byte_size;
+                for (index, column_stats) in
+                    item_stats.column_statistics.iter().enumerate()
+                {
+                    col_stats_set[index].null_count = column_stats.null_count;
```
Review Comment: After trying, maybe can't avoid 😢
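The folding in the snippet above can be illustrated with a simplified stdlib-only sketch (the types here are hypothetical; DataFusion's `Precision` additionally tracks whether each value is exact or inexact): null counts add up across files, while min/max only widen.

```rust
// Simplified per-column statistics for one file.
#[derive(Debug, PartialEq, Clone)]
struct ColStats {
    null_count: u64,
    min: i64,
    max: i64,
}

// Fold per-file column statistics into a single summary:
// null counts are summed, min/max bounds are widened.
fn summarize(per_file: &[ColStats]) -> Option<ColStats> {
    per_file.iter().cloned().reduce(|acc, s| ColStats {
        null_count: acc.null_count + s.null_count,
        min: acc.min.min(s.min),
        max: acc.max.max(s.max),
    })
}

fn main() {
    let files = [
        ColStats { null_count: 1, min: 0, max: 10 },
        ColStats { null_count: 2, min: -5, max: 7 },
    ];
    assert_eq!(
        summarize(&files),
        Some(ColStats { null_count: 3, min: -5, max: 10 })
    );
}
```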
Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]
shehabgamin commented on issue #15072: URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2762517221

Happy to test whenever!
Re: [PR] feat: Add regexp_substr function [datafusion]
github-actions[bot] commented on PR #14323: URL: https://github.com/apache/datafusion/pull/14323#issuecomment-2763013966

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
Re: [PR] fix: aggregation corner case [datafusion]
jayzhan211 commented on PR #15457: URL: https://github.com/apache/datafusion/pull/15457#issuecomment-2763109194

> count(*) actually doesnt depend on any column on input

Logically, `count(*)` needs to know the number of rows of the input.
Re: [PR] Add short circuit [datafusion]
alamb commented on code in PR #15462: URL: https://github.com/apache/datafusion/pull/15462#discussion_r2019123741

## benchmarks/queries/clickbench/README.md: ##

@@ -120,13 +122,42 @@
 LIMIT 10;
 ```
 Results look like
-
+```
++-------------+---------------------+---+------+------+------+
+| ClientIP    | WatchID             | c | tmin | tp95 | tmax |
++-------------+---------------------+---+------+------+------+
+| 1611957945  | 6655575552203051303 | 2 | 0    | 0    | 0    |
+| -1402644643 | 8566928176839891583 | 2 | 0    | 0    | 0    |
++-------------+---------------------+---+------+------+------+
+```
+
+### Q6: How many social shares meet complex multi-stage filtering criteria?
+**Question**: What is the count of sharing actions from iPhone mobile users on specific social networks, within common timezones, participating in seasonal campaigns, with high screen resolutions and closely matched UTM parameters?
+**Important Query Properties**: Simple filter with high selectivity, costly string matching, a large number of high-overhead filters positioned relatively late in the process

Review Comment: 👍

## datafusion/physical-expr/src/expressions/binary.rs: ##

```rust
@@ -358,7 +358,50 @@ impl PhysicalExpr for BinaryExpr {
 fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
     use arrow::compute::kernels::numeric::*;

+    fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
+        let data_type = arg.data_type();
+        match (data_type, op) {
+            (DataType::Boolean, Operator::And) => {
+                match arg {
+                    ColumnarValue::Array(array) => {
+                        if let Ok(array) = as_boolean_array(&array) {
+                            return array.false_count() == array.len();
+                        }
+                    }
+                    ColumnarValue::Scalar(scalar) => {
+                        if let ScalarValue::Boolean(Some(value)) = scalar {
+                            return !value;
+                        }
+                    }
+                }
+                false
+            }
+            (DataType::Boolean, Operator::Or) => {
+                match arg {
+                    ColumnarValue::Array(array) => {
+                        if let Ok(array) = as_boolean_array(&array) {
+                            return array.true_count() == array.len();
+                        }
+                    }
+                    ColumnarValue::Scalar(scalar) => {
+                        if let ScalarValue::Boolean(Some(value)) = scalar {
+                            return *value;
+                        }
+                    }
+                }
+                false
+            }
+            _ => false,
+        }
+    }
```

Review Comment: Could you please also add some basic sqllogictests to ensure these code paths are hit? Or show that they are covered by existing tests?

## datafusion/physical-expr/src/expressions/binary.rs: ##

```rust
@@ -358,7 +358,50 @@ impl PhysicalExpr for BinaryExpr {
+    fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
```

Review Comment: If we find that this slows down some other performance we could also add some sort of heuristic check before calling `false_count` / `true_count` -- like for example if the rhs arg is "complex" (not a Column, for example)

## datafusion/physical-expr/src/expressions/binary.rs: ##

```rust
@@ -358,7 +358,50 @@ impl PhysicalExpr for BinaryExpr {
+    fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
```

Review Comment: Thanks @acking-you -- this looks great. Is there any reason to have this function defined in the `evaluate` method? I think we could just make it a normal function and reduce the nesting level
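The short-circuit idea in this diff — skip evaluating the right side of `AND`/`OR` when the left side alone decides the result — can be sketched on a plain `&[bool]` instead of Arrow's `BooleanArray` (the names below are illustrative, not the PR's API):

```rust
#[derive(Clone, Copy)]
enum Op {
    And,
    Or,
}

/// Returns true when the left-hand side alone determines the result:
/// all-false for AND (the result must be all false), all-true for OR
/// (the result must be all true). Mirrors check_short_circuit, with a
/// bool slice standing in for a null-free BooleanArray.
fn short_circuits(lhs: &[bool], op: Op) -> bool {
    match op {
        Op::And => lhs.iter().all(|v| !v), // false_count == len
        Op::Or => lhs.iter().all(|v| *v),  // true_count == len
    }
}
```

The trade-off alamb raises is visible here: `all` itself is an O(n) scan, so the check only pays off when evaluating the right-hand expression costs more than one pass over the batch.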
Re: [PR] Update ClickBench queries to avoid to_timestamp_seconds [datafusion]
Dandandan merged PR #15475: URL: https://github.com/apache/datafusion/pull/15475
Re: [I] Support RangePartitioning with native shuffle [datafusion-comet]
andygrove commented on issue #458: URL: https://github.com/apache/datafusion-comet/issues/458#issuecomment-2762962087

I discussed this feature with @mbutrovich recently and he may have additional thoughts on this topic.
Re: [I] Support RangePartitioning with native shuffle [datafusion-comet]
viirya commented on issue #458: URL: https://github.com/apache/datafusion-comet/issues/458#issuecomment-2763002946

The implementation issue for RangePartitioning, compared with other partitioning schemes like HashPartitioning, is that it involves sampling operations that Spark performs with an RDD. That is to say, the Comet library at each executor cannot determine the sampling results independently, as we do for general query execution.
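For background on why sampling must be coordinated: range partitioning derives partition boundaries from a global sample of the data, then routes each row to the partition whose boundary range contains it. A minimal sketch of just the boundary math (hypothetical helper names; Spark's actual `RangePartitioner` uses reservoir sampling per RDD partition and weighted candidates):

```rust
/// Compute `num_partitions - 1` boundary values as evenly spaced
/// quantiles of a sorted sample. Assumption: the sample approximates
/// the global distribution -- which is exactly why each executor
/// cannot compute its boundaries independently.
fn range_bounds(mut sample: Vec<i64>, num_partitions: usize) -> Vec<i64> {
    sample.sort_unstable();
    (1..num_partitions)
        .map(|i| sample[i * sample.len() / num_partitions])
        .collect()
}

/// Route a value to the first partition whose upper bound exceeds it;
/// values past the last bound go to the final partition.
fn partition_for(value: i64, bounds: &[i64]) -> usize {
    bounds.iter().position(|b| value < *b).unwrap_or(bounds.len())
}
```

If each executor sampled only its local data, two executors could compute different `bounds` and route equal keys to different partitions, breaking the shuffle contract.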
Re: [PR] feat: pushdown filter for native_iceberg_compat [datafusion-comet]
parthchandra commented on code in PR #1566: URL: https://github.com/apache/datafusion-comet/pull/1566#discussion_r2019634486

## spark/src/test/scala/org/apache/comet/parquet/ParquetReadSuite.scala: ##

```scala
@@ -1460,6 +1460,59 @@ class ParquetReadV1Suite extends ParquetReadSuite with AdaptiveSparkPlanHelper {
       v1 = Some("parquet"))
   }
 }
+
+  test("test V1 parquet scan filter pushdown of primitive types uses native_iceberg_compat") {
```

Review Comment: Can we enable this test for both native_datafusion and native_iceberg_compat?
Re: [PR] Clean up hash_join's ExecutionPlan::execute [datafusion]
ctsk commented on PR #15418: URL: https://github.com/apache/datafusion/pull/15418#issuecomment-2761412282

> You mean coalesce_partitions_if_needed() call is redundant in datafusion? I don't think that's the case, but if it is so, why don't we remove that line?

I wanted to keep the PR simple and *just* refactor. Here is a follow-up that removes it: https://github.com/apache/datafusion/pull/15476. Let's see if the tests pass :)

EnforceDistribution adds it here: https://github.com/apache/datafusion/blob/fa452e6789194e6da7c3aaf5e2f1ffa5679aacf2/datafusion/physical-optimizer/src/enforce_distribution.rs#L940-L968

`add_spm_on_top` gets called when the child operator has the `SinglePartition` requirement: https://github.com/apache/datafusion/blob/fa452e6789194e6da7c3aaf5e2f1ffa5679aacf2/datafusion/physical-optimizer/src/enforce_distribution.rs#L1256-L1259

This applies to a CollectLeft hash join's left child: https://github.com/apache/datafusion/blob/fa452e6789194e6da7c3aaf5e2f1ffa5679aacf2/datafusion/physical-plan/src/joins/hash_join.rs#L705-L710
Re: [I] [Bug] datafusion-cli may fail to read csv files [datafusion]
alamb commented on issue #15456: URL: https://github.com/apache/datafusion/issues/15456#issuecomment-2761675489

Nice find @chenkovsky -- so it looks like there is some bug in the data generator after all.
Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]
Dandandan commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018823379

## datafusion/datasource/benches/split_groups_by_statistics.rs: ##

```rust
@@ -0,0 +1,178 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::datatypes::{DataType, Field, Schema};
+use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
+use datafusion_common::stats::Precision;
+use datafusion_common::{ColumnStatistics, ScalarValue, Statistics};
+use datafusion_datasource::file_groups::FileGroup;
+use datafusion_datasource::file_scan_config::FileScanConfig;
+use datafusion_datasource::PartitionedFile;
+use datafusion_physical_expr::PhysicalSortExpr;
+use datafusion_physical_expr_common::sort_expr::LexOrdering;
+use object_store::{path::Path, ObjectMeta};
+use std::sync::Arc;
+use std::time::Duration;
+
+/// Generates test files with min-max statistics in different overlap patterns
+fn generate_test_files(num_files: usize, overlap_factor: f64) -> Vec<FileGroup> {
+    let mut files = Vec::with_capacity(num_files);
+    let range_size = if overlap_factor == 0.0 {
+        100 / num_files as i64
+    } else {
+        (100.0 / (overlap_factor * num_files as f64)).max(1.0) as i64
+    };
+
+    for i in 0..num_files {
+        let base = (i as f64 * range_size as f64 * (1.0 - overlap_factor)) as i64;
+        let min = base as f64;
+        let max = (base + range_size) as f64;
+
+        let file = PartitionedFile {
+            object_meta: ObjectMeta {
+                location: Path::from(format!("file_{}.parquet", i)),
+                last_modified: chrono::Utc::now(),
+                size: 1000,
+                e_tag: None,
+                version: None,
+            },
+            partition_values: vec![],
+            range: None,
+            statistics: Some(Statistics {
+                num_rows: Precision::Exact(100),
+                total_byte_size: Precision::Exact(1000),
+                column_statistics: vec![ColumnStatistics {
+                    null_count: Precision::Exact(0),
+                    max_value: Precision::Exact(ScalarValue::Float64(Some(max))),
+                    min_value: Precision::Exact(ScalarValue::Float64(Some(min))),
+                    sum_value: Precision::Absent,
+                    distinct_count: Precision::Absent,
+                }],
+            }),
+            extensions: None,
+            metadata_size_hint: None,
+        };
+        files.push(file);
+    }
+
+    vec![FileGroup::new(files)]
+}
+
+pub fn compare_split_groups_by_statistics_algorithms(c: &mut Criterion) {
```

Review Comment:

> Execution phase

Would be great to have some example of this. The impact on planning time seems not super high.
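The overlap pattern driving this benchmark can be checked in isolation: with `overlap_factor = 0.0` consecutive file ranges abut, and larger factors shift the bases closer together so ranges overlap. A standalone sketch of just the range arithmetic (same formulas as `generate_test_files`, without the Arrow/DataFusion scaffolding):

```rust
/// Compute (min, max) value ranges for `num_files` files whose ranges
/// overlap according to `overlap_factor` in [0.0, 1.0).
/// Mirrors the range arithmetic in the benchmark's generate_test_files.
fn file_ranges(num_files: usize, overlap_factor: f64) -> Vec<(f64, f64)> {
    // Width of each file's value range over a fixed [0, 100] domain.
    let range_size = if overlap_factor == 0.0 {
        100 / num_files as i64
    } else {
        (100.0 / (overlap_factor * num_files as f64)).max(1.0) as i64
    };
    (0..num_files)
        .map(|i| {
            // Shrinking the stride by (1 - overlap_factor) makes
            // neighboring ranges overlap by roughly that fraction.
            let base = (i as f64 * range_size as f64 * (1.0 - overlap_factor)) as i64;
            (base as f64, (base + range_size) as f64)
        })
        .collect()
}
```

More overlap means fewer non-overlapping groups `split_groups_by_statistics` can form, which is what the benchmark varies.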
Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]
mkgada commented on issue #1576: URL: https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2761689234

@jinwenjie123 appreciate your response! I am using one of the pre-built JARs. I will not be able to switch to Spark 3.4.x since our cluster was recently upgraded to 3.5.x and we would like to stick with it.

@andygrove @alamb would y'all be able to help here? I saw that the prebuilt JARs were tested for 3.5.4 and upwards. Are they backwards compatible?
[PR] Migrate-substrait-tests-to-insta, part2 [datafusion]
qstommyshu opened a new pull request, #15480: URL: https://github.com/apache/datafusion/pull/15480

## Which issue does this PR close?

- Closes #15398. Related #15444

## Rationale for this change

## What changes are included in this PR?

Migrated tests in datafusion/core/tests/substrait to use insta.

## Are these changes tested?

Yes, I manually tested the before/after changes.

## Are there any user-facing changes?

No.
Re: [PR] Migrate-substrait-tests-to-insta, part2 [datafusion]
qstommyshu commented on PR #15480: URL: https://github.com/apache/datafusion/pull/15480#issuecomment-2761715713

Hi @alamb and @blaginin, part 2 of the substrait test migration is done as well. Please take a look when you have time :)

The only tests that cannot be changed to `insta` are those in `simple_intersect_table_reuse()` and `simple_intersect()` in [this file](https://github.com/apache/datafusion/blob/main/datafusion/substrait/tests/cases/roundtrip_logical_plan.rs). I don't think it is possible to migrate these tests to `insta` because their `expected_plan_str` requires formatting a variable into the string. Rust cannot interpolate variables into a raw string literal, and a raw string literal is the required argument for an `insta` inline snapshot test. So I left those tests out.

Thanks
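The limitation described above comes down to a language rule: a raw string literal is fixed at compile time and does not interpolate, while interpolation via `format!` produces a runtime `String` rather than a literal. A quick illustration in plain Rust (insta itself is not involved; the function name is just for the example):

```rust
/// Returns (raw-literal text, runtime-formatted text) for comparison.
fn snapshot_strings(table: &str) -> (String, String) {
    // Raw string literal: `{table}` stays literal text, no interpolation.
    let raw = r"TableScan: {table}".to_string();
    // Interpolation requires format!, which yields a runtime String --
    // not the string literal an inline snapshot macro expects.
    let formatted = format!("TableScan: {table}");
    (raw, formatted)
}
```

This is why expected plans that embed a variable (e.g. a generated table name) cannot be expressed as inline snapshots.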
[PR] chore: Override node name for CometSparkToColumnar [datafusion-comet]
l0kr opened a new pull request, #1577: URL: https://github.com/apache/datafusion-comet/pull/1577

## Which issue does this PR close?

Closes #936.

## Rationale for this change

Previous PR went stale so I wanted to move it forward.

## What changes are included in this PR?

## How are these changes tested?

Added unit tests for row and columnar input.
Re: [PR] Add dynamic pruning filters from TopK state [datafusion]
adriangb commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2004247883

## datafusion/common/src/config.rs: ##

```rust
@@ -590,6 +590,9 @@ config_namespace! {
     /// during aggregations, if possible
     pub enable_topk_aggregation: bool, default = true

+    /// When set to true attempts to push down dynamic filters from TopK operations into file scans
+    pub enable_dynamic_filter_pushdown: bool, default = true
```

Review Comment: Probably should be false by default

## datafusion/datasource-parquet/src/opener.rs: ##

```rust
@@ -294,3 +345,127 @@ fn create_initial_plan(
     // default to scanning all row groups
     Ok(ParquetAccessPlan::new_all(row_group_count))
 }
+
+/// Build a pruning predicate from an optional predicate expression.
+/// If the predicate is None or the predicate cannot be converted to a pruning
+/// predicate, return None.
+/// If there is an error creating the pruning predicate it is recorded by incrementing
+/// the `predicate_creation_errors` counter.
+pub(crate) fn build_pruning_predicate(
+    predicate: Arc<dyn PhysicalExpr>,
+    file_schema: &SchemaRef,
+    predicate_creation_errors: &Count,
+) -> Option<Arc<PruningPredicate>> {
+    match PruningPredicate::try_new(predicate.clone(), Arc::clone(file_schema)) {
+        Ok(pruning_predicate) => {
+            if !pruning_predicate.always_true() {
+                return Some(Arc::new(pruning_predicate));
+            }
+        }
+        Err(e) => {
+            debug!("Could not create pruning predicate for: {e}");
+            predicate_creation_errors.add(1);
+        }
+    }
+    None
+}
+
+/// Build a page pruning predicate from an optional predicate expression.
+/// If the predicate is None or the predicate cannot be converted to a page pruning
+/// predicate, return None.
+pub(crate) fn build_page_pruning_predicate(
+    predicate: &Arc<dyn PhysicalExpr>,
+    file_schema: &SchemaRef,
+) -> Arc<PagePruningAccessPlanFilter> {
+    Arc::new(PagePruningAccessPlanFilter::new(
+        predicate,
+        Arc::clone(file_schema),
+    ))
+}
+
+/// A visitor for a PhysicalExpr that collects all column references to determine
+/// what columns the expression needs to be evaluated.
+struct FilterSchemaBuilder<'schema> {
+    filter_schema_fields: BTreeSet<Arc<Field>>,
+    file_schema: &'schema Schema,
+    table_schema: &'schema Schema,
+}
+
+impl<'schema> FilterSchemaBuilder<'schema> {
+    fn new(file_schema: &'schema Schema, table_schema: &'schema Schema) -> Self {
+        Self {
+            filter_schema_fields: BTreeSet::new(),
+            file_schema,
+            table_schema,
+        }
+    }
+
+    fn sort_fields(
+        fields: &mut Vec<Arc<Field>>,
+        table_schema: &Schema,
+        file_schema: &Schema,
+    ) {
+        fields.sort_by_key(|f| f.name().to_string());
+        fields.dedup_by_key(|f| f.name().to_string());
+        fields.sort_by_key(|f| {
+            let table_schema_index =
+                table_schema.index_of(f.name()).unwrap_or(usize::MAX);
+            let file_schema_index = file_schema.index_of(f.name()).unwrap_or(usize::MAX);
+            (table_schema_index, file_schema_index)
+        });
+    }
+
+    fn build(self) -> SchemaRef {
+        let mut fields = self.filter_schema_fields.into_iter().collect::<Vec<_>>();
+        FilterSchemaBuilder::sort_fields(
+            &mut fields,
+            self.table_schema,
+            self.file_schema,
+        );
+        Arc::new(Schema::new(fields))
+    }
+}
+
+impl<'node> TreeNodeVisitor<'node> for FilterSchemaBuilder<'_> {
+    type Node = Arc<dyn PhysicalExpr>;
+
+    fn f_down(
+        &mut self,
+        node: &'node Arc<dyn PhysicalExpr>,
+    ) -> Result<TreeNodeRecursion> {
+        if let Some(column) = node.as_any().downcast_ref::<Column>() {
+            if let Ok(field) = self.table_schema.field_with_name(column.name()) {
+                self.filter_schema_fields.insert(Arc::new(field.clone()));
+            } else if let Ok(field) = self.file_schema.field_with_name(column.name()) {
+                self.filter_schema_fields.insert(Arc::new(field.clone()));
+            } else {
+                // valid fields are the table schema's fields + the file schema's fields,
+                // preferring the table schema's fields when there is a conflict
+                let mut valid_fields = self
+                    .table_schema
+                    .fields()
+                    .iter()
+                    .chain(self.file_schema.fields().iter())
+                    .cloned()
+                    .collect::<Vec<_>>();
+                FilterSchemaBuilder::sort_fields(
+                    &mut valid_fields,
+                    self.table_schema,
+                    self.file_schema,
+                );
+                let valid_fields = valid_fields
+                    .into_iter()
+                    .map(|f| datafusion_common::Column::new_unqualified(f.name()))
+                    .collect();
+                let field = datafusion_common::Column::new_unqualified(column.name());
+                return Err(datafu
```