Re: [PR] Remove waits from blocking threads reading spill files. [datafusion]

2025-04-13 Thread via GitHub
ashdnazg commented on PR #15654: URL: https://github.com/apache/datafusion/pull/15654#issuecomment-2800542302 I do reproduce it here on ubuntu - when I run the test through the runner it takes much more time (or hangs entirely) than without. Just to see what happens, I tried to run th

Re: [PR] fix: miss output ordering during projection [datafusion]

2025-04-13 Thread via GitHub
suremarc commented on PR #15683: URL: https://github.com/apache/datafusion/pull/15683#issuecomment-2800524158 > That is, users should ensure that the output ordering is correct. One of the users as of now is `ListingTable`, which I don't believe makes any such guarantees, so we would

Re: [PR] Set DataFusion runtime configurations through SQL interface [datafusion]

2025-04-13 Thread via GitHub
2010YOUY01 commented on code in PR #15594: URL: https://github.com/apache/datafusion/pull/15594#discussion_r2041433804 ## datafusion/core/src/execution/context/mod.rs: ## @@ -1036,13 +1040,73 @@ impl SessionContext { variable, value, .. } = stmt; -

Re: [PR] Optimize TopK with threshold filter ~1.4x speedup [datafusion]

2025-04-13 Thread via GitHub
2010YOUY01 commented on code in PR #15697: URL: https://github.com/apache/datafusion/pull/15697#discussion_r2041421543 ## datafusion/physical-plan/src/topk/mod.rs: ## @@ -202,27 +204,99 @@ impl TopK { }) .collect::>>()?; +// selected indices +

Re: [D] DISCUSSION: Anyone around for the Databricks Data & AI Summit in San Francisco June 9–12? [datafusion]

2025-04-13 Thread via GitHub
GitHub user camuel added a comment to the discussion: DISCUSSION: Anyone around for the Databricks Data & AI Summit in San Francisco June 9–12? I am local and willing to help organize, let me know how to be useful, also attending Databricks Data & AI Summit GitHub link: https://github.com/ap

Re: [PR] Upgrade to arrow/parquet 55, and `object_store` to `0.12.0` and pyo3 to `0.24.0` [datafusion]

2025-04-13 Thread via GitHub
rluvaton commented on PR #15466: URL: https://github.com/apache/datafusion/pull/15466#issuecomment-2800169545 Happy to improve performance 😄 I got more in my chamber -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the

Re: [PR] Fix internal error in sort when hitting memory limit [datafusion]

2025-04-13 Thread via GitHub
DerGut commented on code in PR #15692: URL: https://github.com/apache/datafusion/pull/15692#discussion_r2041251862 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -759,12 +761,51 @@ impl ExternalSorter { if self.runtime.disk_manager.tmp_files_enabled() {

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-04-13 Thread via GitHub
Dandandan commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2041413480 ## datafusion/physical-plan/src/sorts/sort_filters.rs: ## @@ -0,0 +1,297 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributo

Re: [PR] Remove waits from blocking threads reading spill files. [datafusion]

2025-04-13 Thread via GitHub
2010YOUY01 commented on PR #15654: URL: https://github.com/apache/datafusion/pull/15654#issuecomment-2800461046 > Extended test takes longer time and couldn't finish in 6hr after this change > > https://github.com/apache/datafusion/actions/runs/14419458859/job/40440288212 I fo

Re: [PR] Optimize TopK with threshold filter ~1.4x speedup [datafusion]

2025-04-13 Thread via GitHub
Dandandan commented on PR #15697: URL: https://github.com/apache/datafusion/pull/15697#issuecomment-2800459456 > I'll take a look tomorrow! Why do we have to use only the first column? Is it just to break up the change into smaller units? We had multi-column support working in the now close

Re: [PR] Remove waits from blocking threads reading spill files. [datafusion]

2025-04-13 Thread via GitHub
ashdnazg commented on PR #15654: URL: https://github.com/apache/datafusion/pull/15654#issuecomment-2800458013 @jayzhan211 :hankey: :frowning_face: On it.

Re: [PR] Optimize BinaryExpr Evaluation with Short-Circuiting for AND/OR Operators [datafusion]

2025-04-13 Thread via GitHub
kosiew commented on PR #15648: URL: https://github.com/apache/datafusion/pull/15648#issuecomment-2800444256 Closing this. @acking-you improves this significantly in #15694

Re: [PR] Optimize BinaryExpr Evaluation with Short-Circuiting for AND/OR Operators [datafusion]

2025-04-13 Thread via GitHub
kosiew closed pull request #15648: Optimize BinaryExpr Evaluation with Short-Circuiting for AND/OR Operators URL: https://github.com/apache/datafusion/pull/15648

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-13 Thread via GitHub
kosiew commented on code in PR #15694: URL: https://github.com/apache/datafusion/pull/15694#discussion_r2041391176 ## datafusion/physical-expr/src/expressions/binary.rs: ## @@ -811,58 +822,164 @@ impl BinaryExpr { } } +enum ShortCircuitStrategy<'a> { +None, +Retu

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-13 Thread via GitHub
kosiew commented on code in PR #15694: URL: https://github.com/apache/datafusion/pull/15694#discussion_r2041328132 ## datafusion/physical-expr/src/expressions/binary.rs: ## @@ -811,58 +822,164 @@ impl BinaryExpr { } } +enum ShortCircuitStrategy<'a> { +None, +Retu

Re: [PR] Fix internal error in sort when hitting memory limit [datafusion]

2025-04-13 Thread via GitHub
2010YOUY01 commented on code in PR #15692: URL: https://github.com/apache/datafusion/pull/15692#discussion_r2041385875 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -765,6 +765,25 @@ impl ExternalSorter { Ok(()) } + +/// Reserves memory to be able to a

Re: [PR] Fix internal error in sort when hitting memory limit [datafusion]

2025-04-13 Thread via GitHub
2010YOUY01 commented on PR #15692: URL: https://github.com/apache/datafusion/pull/15692#issuecomment-2800424077 Thank you, it looks good to me 🙏🏼 Let's make the CI pass, I think we can change the assertion type for simplicity here, and do a separate PR for this utility `DataFusionError:

Re: [PR] Fix internal error in sort when hitting memory limit [datafusion]

2025-04-13 Thread via GitHub
2010YOUY01 commented on code in PR #15692: URL: https://github.com/apache/datafusion/pull/15692#discussion_r2041381039 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -759,12 +761,51 @@ impl ExternalSorter { if self.runtime.disk_manager.tmp_files_enabled() {

Re: [PR] Cascaded spill merge and re-spill [datafusion]

2025-04-13 Thread via GitHub
2010YOUY01 commented on PR #15610: URL: https://github.com/apache/datafusion/pull/15610#issuecomment-2800408050 > > > Also, to have a fully working larger than memory sort, you need to spill in > > > https://github.com/apache/datafusion/blob/362fcdfc7b9e00cb6126a0cbc41c9abb2637c563/dataf

Re: [PR] feat: add multi level merge sort that will always fit in memory [datafusion]

2025-04-13 Thread via GitHub
2010YOUY01 commented on code in PR #15700: URL: https://github.com/apache/datafusion/pull/15700#discussion_r2041372025 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -431,12 +422,16 @@ impl ExternalSorter { let batches_to_spill = std::mem::take(globally_sorted_bat

Re: [PR] Cascaded spill merge and re-spill [datafusion]

2025-04-13 Thread via GitHub
2010YOUY01 commented on code in PR #15610: URL: https://github.com/apache/datafusion/pull/15610#discussion_r2041368520 ## datafusion/common/src/config.rs: ## @@ -337,6 +337,13 @@ config_namespace! { /// batches and merged. pub sort_in_place_threshold_bytes: usi

Re: [PR] Per file filter evaluation [datafusion]

2025-04-13 Thread via GitHub
adriangb commented on PR #15057: URL: https://github.com/apache/datafusion/pull/15057#issuecomment-2800379042 > Another question is, isn't the filter created based on table schema? And then the batch is read as file schema and casted to table schema and is evaluated by filter. Yes th

Re: [PR] Per file filter evaluation [datafusion]

2025-04-13 Thread via GitHub
jayzhan211 commented on PR #15057: URL: https://github.com/apache/datafusion/pull/15057#issuecomment-2800373423 > PhysicalExpr::with_schema This method is too general and it is unclear what we need to do with the provided schema for each PhysicalExpr, it is not a good idea. > I

Re: [I] Add CatalogProvider API [datafusion-python]

2025-04-13 Thread via GitHub
tespent commented on issue #1103: URL: https://github.com/apache/datafusion-python/issues/1103#issuecomment-2800371392 > I am concerned about the table providers, though. I think any implementation will need to get the table provider to provide record batches efficiently. A small co

Re: [PR] Fix internal error in sort when hitting memory limit [datafusion]

2025-04-13 Thread via GitHub
DerGut commented on code in PR #15692: URL: https://github.com/apache/datafusion/pull/15692#discussion_r2041234887 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -765,6 +765,25 @@ impl ExternalSorter { Ok(()) } + +/// Reserves memory to be able to accom

Re: [D] DISCUSSION: Anyone around for the Databricks Data & AI Summit in San Francisco June 9–12? [datafusion]

2025-04-13 Thread via GitHub
GitHub user camuel edited a comment on the discussion: DISCUSSION: Anyone around for the Databricks Data & AI Summit in San Francisco June 9–12? I am local and willing to help organize, let me know how to be useful, also attending Databricks Data & AI Summit. GitHub link: https://github.com

Re: [PR] Consolidate statistics merging code (try 2) [datafusion]

2025-04-13 Thread via GitHub
xudong963 commented on PR #15661: URL: https://github.com/apache/datafusion/pull/15661#issuecomment-2800239553 > do you have time to look at that one? Sure, sorry for the late reply, I had a headache this weekend, so off my computer.

Re: [D] DISCUSSION: Anyone around for the Databricks Data & AI Summit in San Francisco June 9–12? [datafusion]

2025-04-13 Thread via GitHub
GitHub user camuel edited a comment on the discussion: DISCUSSION: Anyone around for the Databricks Data & AI Summit in San Francisco June 9–12? I believe most, if not all, DataFusion meetups in San Francisco have been kindly hosted by Jeff Huber at the Chroma offices. This time might be simil

Re: [D] DISCUSSION: Anyone around for the Databricks Data & AI Summit in San Francisco June 9–12? [datafusion]

2025-04-13 Thread via GitHub
GitHub user camuel added a comment to the discussion: DISCUSSION: Anyone around for the Databricks Data & AI Summit in San Francisco June 9–12? I believe most, if not all, DataFusion meetups in San Francisco have been kindly hosted by Jeff Huber at the Chrome offices. This time might be simila

Re: [PR] Fix: after repartitioning, the `PartitionedFile` and `FileGroup` statistics should be inexact/recomputed [datafusion]

2025-04-13 Thread via GitHub
xudong963 commented on PR #15539: URL: https://github.com/apache/datafusion/pull/15539#issuecomment-2800254350 @berkaysynnada Thanks for your review, I'll continue it this week.

Re: [PR] Remove waits from blocking threads reading spill files. [datafusion]

2025-04-13 Thread via GitHub
jayzhan211 commented on PR #15654: URL: https://github.com/apache/datafusion/pull/15654#issuecomment-2800243988 Extended test takes longer time and couldn't finish in 6hr after this change https://github.com/apache/datafusion/actions/runs/14419458859/job/40440288212

Re: [I] ListingTable statistics improperly merges statistics when files have different schemas [datafusion]

2025-04-13 Thread via GitHub
xudong963 commented on issue #15689: URL: https://github.com/apache/datafusion/issues/15689#issuecomment-2800240401 @alamb Thank you for writing the issue in detail @friendlymatthew Thank you for taking it

Re: [PR] Fix internal error in sort when hitting memory limit [datafusion]

2025-04-13 Thread via GitHub
DerGut commented on code in PR #15692: URL: https://github.com/apache/datafusion/pull/15692#discussion_r2041254566 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -759,12 +761,51 @@ impl ExternalSorter { if self.runtime.disk_manager.tmp_files_enabled() {

Re: [PR] Fix internal error in sort when hitting memory limit [datafusion]

2025-04-13 Thread via GitHub
DerGut commented on code in PR #15692: URL: https://github.com/apache/datafusion/pull/15692#discussion_r2041251481 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -1552,6 +1593,62 @@ mod tests { Ok(()) } +#[tokio::test] +async fn test_batch_reservati

Re: [PR] Fix internal error in sort when hitting memory limit [datafusion]

2025-04-13 Thread via GitHub
rluvaton commented on code in PR #15692: URL: https://github.com/apache/datafusion/pull/15692#discussion_r2041239380 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -759,12 +761,51 @@ impl ExternalSorter { if self.runtime.disk_manager.tmp_files_enabled() {

Re: [PR] Optimize TopK with threshold filter ~1.4x speedup [datafusion]

2025-04-13 Thread via GitHub
adriangb commented on PR #15697: URL: https://github.com/apache/datafusion/pull/15697#issuecomment-2800153336 I'll take a look tomorrow! Why do we have to use only the first column? Is it just to break up the change into smaller units? We had multi-column support working in the now closed P

Re: [PR] feat: add multi level merge sort that will always fit in memory [datafusion]

2025-04-13 Thread via GitHub
rluvaton commented on code in PR #15700: URL: https://github.com/apache/datafusion/pull/15700#discussion_r2041222937 ## datafusion/core/tests/fuzz_cases/aggregate_fuzz.rs: ## @@ -753,3 +765,226 @@ async fn test_single_mode_aggregate_with_spill() -> Result<()> { Ok(()) }

Re: [PR] Fix internal error in sort when hitting memory limit [datafusion]

2025-04-13 Thread via GitHub
DerGut commented on code in PR #15692: URL: https://github.com/apache/datafusion/pull/15692#discussion_r2041222385 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -765,6 +765,25 @@ impl ExternalSorter { Ok(()) } + +/// Reserves memory to be able to accom

Re: [PR] feat: Emit warning with Diagnostic when doing = Null [datafusion]

2025-04-13 Thread via GitHub
comphead commented on PR #15696: URL: https://github.com/apache/datafusion/pull/15696#issuecomment-2800125353 Yeah, that was actually my question having the warnings without being returned to the end user, who is supposed to react on the warnings? 🤔

Re: [PR] Fix internal error in sort when hitting memory limit [datafusion]

2025-04-13 Thread via GitHub
DerGut commented on code in PR #15692: URL: https://github.com/apache/datafusion/pull/15692#discussion_r2041181467 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -529,6 +523,12 @@ impl ExternalSorter { /// Sorts the in-memory batches and merges them into a single sort

Re: [I] Update regexp slt tests to refactor out string type tests, cleanup [datafusion]

2025-04-13 Thread via GitHub
kumarlokesh commented on issue #14452: URL: https://github.com/apache/datafusion/issues/14452#issuecomment-2800107028 take

[PR] feat: add multi level merge sort that will always fit in memory [datafusion]

2025-04-13 Thread via GitHub
rluvaton opened a new pull request, #15700: URL: https://github.com/apache/datafusion/pull/15700 ## Which issue does this PR close? - Closes #14692. ## Rationale for this change We need merge sort that does not fail with out of memory ## What changes are included in th

Re: [PR] Cascaded spill merge and re-spill [datafusion]

2025-04-13 Thread via GitHub
rluvaton commented on PR #15610: URL: https://github.com/apache/datafusion/pull/15610#issuecomment-2800097814 > > Also, to have a fully working larger than memory sort, you need to spill in > > https://github.com/apache/datafusion/blob/362fcdfc7b9e00cb6126a0cbc41c9abb2637c563/datafusion/

Re: [PR] Cascaded spill merge and re-spill [datafusion]

2025-04-13 Thread via GitHub
rluvaton commented on code in PR #15610: URL: https://github.com/apache/datafusion/pull/15610#discussion_r2041196922 ## datafusion/common/src/config.rs: ## @@ -337,6 +337,13 @@ config_namespace! { /// batches and merged. pub sort_in_place_threshold_bytes: usize

Re: [I] Question: is there a way to get the current catalog or database? [datafusion-python]

2025-04-13 Thread via GitHub
NickCrews commented on issue #1106: URL: https://github.com/apache/datafusion-python/issues/1106#issuecomment-2800092336 I don't super understand #1103, but I think that is maybe the inverse: that issue is about allowing devs to provide datafusion with ways to access their custom databases/

Re: [I] Question: is there a way to get the current catalog or database? [datafusion-python]

2025-04-13 Thread via GitHub
NickCrews closed issue #1106: Question: is there a way to get the current catalog or database? URL: https://github.com/apache/datafusion-python/issues/1106

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-13 Thread via GitHub
acking-you commented on PR #15694: URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2800087698 The relevant bug fixes have been completed, and corresponding performance tests have been conducted. The results show that pre-selection has achieved significant gains! @Dandandan

Re: [PR] feat: Override MapBuilder values field with expected schema [datafusion-comet]

2025-04-13 Thread via GitHub
comphead commented on code in PR #1643: URL: https://github.com/apache/datafusion-comet/pull/1643#discussion_r2041190729 ## native/core/src/execution/shuffle/map.rs: ## @@ -2832,13 +2833,13 @@ pub fn append_map_elements( } #[allow(clippy::field_reassign_with_default)] -pub f

Re: [PR] feat: Override MapBuilder values field with expected schema [datafusion-comet]

2025-04-13 Thread via GitHub
codecov-commenter commented on PR #1643: URL: https://github.com/apache/datafusion-comet/pull/1643#issuecomment-2800074535 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1643?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] Optimize TopK with filter ~1.4x faster [datafusion]

2025-04-13 Thread via GitHub
Dandandan commented on PR #15697: URL: https://github.com/apache/datafusion/pull/15697#issuecomment-2800082794 @adriangb FYI CI is passing, it's ready for review. I had to make some changes to the filter that is applied to respect lexicographic ordering (which made Q7 lose the speedup), b

Re: [PR] feat: Emit warning with Diagnostic when doing = Null [datafusion]

2025-04-13 Thread via GitHub
changsun20 commented on PR #15696: URL: https://github.com/apache/datafusion/pull/15696#issuecomment-2800081010 > Thanks @changsun20 wondering if its possible to test those warnings in integration slt test files? Thank you for the thoughtful question, @comphead. I appreciate your focu

Re: [PR] WIP: Attach Diagnostic to syntax errors [datafusion]

2025-04-13 Thread via GitHub
logan-keede commented on PR #15680: URL: https://github.com/apache/datafusion/pull/15680#issuecomment-2800055667 cc @eliaperantoni

[I] Optimize TopK with filter [datafusion]

2025-04-13 Thread via GitHub
Dandandan opened a new issue, #15699: URL: https://github.com/apache/datafusion/issues/15699 ### Is your feature request related to a problem or challenge? TopK can be optimized by filtering on the max value before converting the arrays to row-format (which is slow). ##

[I] chore: Move Map branches to separate file [datafusion-comet]

2025-04-13 Thread via GitHub
comphead opened a new issue, #1645: URL: https://github.com/apache/datafusion-comet/issues/1645 Map arm branches takes too much of space in this file, proposing to move Map arm branches into a separate file. Ideally to investigate how those branches can be rewritten in macros _

Re: [I] GlobalLimitExec execution offset pagination query results in internal error [datafusion]

2025-04-13 Thread via GitHub
akurmustafa commented on issue #15665: URL: https://github.com/apache/datafusion/issues/15665#issuecomment-2800060351 Hi @lalaorya I didn't reproduce the plans you generated locally, so my thoughts might be wrong or misleading. However, here are my thoughts regarding your problem: > What

Re: [PR] feat: Override MapBuilder values field with expected schema [datafusion-comet]

2025-04-13 Thread via GitHub
comphead commented on code in PR #1643: URL: https://github.com/apache/datafusion-comet/pull/1643#discussion_r2041175431 ## native/core/src/execution/shuffle/row.rs: ## @@ -904,7 +904,7 @@ pub(crate) fn append_field( append_map_element!(StringBuilder, Decima

[I] chore: Improve crates.io download stability [datafusion-comet]

2025-04-13 Thread via GitHub
comphead opened a new issue, #1644: URL: https://github.com/apache/datafusion-comet/issues/1644 ### Describe the bug Sometimes build failed with ``` warning: spurious network error (3 tries remaining): [7] Could not connect to server (Failed to connect to index.crates.io port 4

[PR] feat: Override MapBuilder values field with expected schema [datafusion-comet]

2025-04-13 Thread via GitHub
comphead opened a new pull request, #1643: URL: https://github.com/apache/datafusion-comet/pull/1643 ## Which issue does this PR close? Closes #1633 . ## Rationale for this change MapBuilder by default uses nullable columns to represent Map entries. Overriding th

Re: [PR] Cascaded spill merge and re-spill [datafusion]

2025-04-13 Thread via GitHub
rluvaton commented on code in PR #15610: URL: https://github.com/apache/datafusion/pull/15610#discussion_r2041170126 ## datafusion/common/src/config.rs: ## @@ -337,6 +337,13 @@ config_namespace! { /// batches and merged. pub sort_in_place_threshold_bytes: usize

Re: [PR] Minor: add order by arg for last value [datafusion]

2025-04-13 Thread via GitHub
comphead merged PR #15695: URL: https://github.com/apache/datafusion/pull/15695

Re: [PR] Minor: add order by arg for last value [datafusion]

2025-04-13 Thread via GitHub
comphead commented on PR #15695: URL: https://github.com/apache/datafusion/pull/15695#issuecomment-2800045143 Thanks @jayzhan211 and @berkaysynnada

Re: [I] first_value and last_value should have identical signatures [datafusion]

2025-04-13 Thread via GitHub
comphead closed issue #12376: first_value and last_value should have identical signatures URL: https://github.com/apache/datafusion/issues/12376

[I] Filter multiple columns from TopK using Lexicographical ordering [datafusion]

2025-04-13 Thread via GitHub
Dandandan opened a new issue, #15698: URL: https://github.com/apache/datafusion/issues/15698 ### Is your feature request related to a problem or challenge? In the PR https://github.com/apache/datafusion/pull/15697 we added support for filtering input values early on to speed up TopK e

Re: [PR] Optimize TopK with filter ~1.4x faster [datafusion]

2025-04-13 Thread via GitHub
adriangb commented on PR #15697: URL: https://github.com/apache/datafusion/pull/15697#issuecomment-2800033871 @Dandandan will be happy to review once CI is passing 😄

Re: [I] Question: is there a way to get the current catalog or database? [datafusion-python]

2025-04-13 Thread via GitHub
timsaucer commented on issue #1106: URL: https://github.com/apache/datafusion-python/issues/1106#issuecomment-2800014421 Probably a duplicate for https://github.com/apache/datafusion-python/issues/1103 @NickCrews please let me know if that issue doesn’t answer your needs

Re: [PR] feat: Add ConfigOptions to ScalarFunctionArgs [datafusion]

2025-04-13 Thread via GitHub
Omega359 commented on PR #13527: URL: https://github.com/apache/datafusion/pull/13527#issuecomment-2800015976 I've spent some time looking at using ExecutionProps for this and while I think it'll work it's still a lot of churn. That churn is largely because of two reasons: 1. We woul

[I] extendable task distribution api [datafusion-ballista]

2025-04-13 Thread via GitHub
milenkovicm opened a new issue, #1238: URL: https://github.com/apache/datafusion-ballista/issues/1238 At the moment we have three different task distribution strategies - binding - round robin - consistent hashing I believe we should open scheduler interface exposing p

[I] Question: is there a way to get the current catalog or database? [datafusion-python]

2025-04-13 Thread via GitHub
NickCrews opened a new issue, #1106: URL: https://github.com/apache/datafusion-python/issues/1106 Hi! I'm working on the datafusion backend for ibis. Specifically, I'm working on PR https://github.com/ibis-project/ibis/pull/2. Most of the backends for ibis, such as postgres, duckdb, sql

Re: [PR] Per file filter evaluation [datafusion]

2025-04-13 Thread via GitHub
adriangb commented on PR #15057: URL: https://github.com/apache/datafusion/pull/15057#issuecomment-282196 I would like to resume this work. Some thoughts should the rewrite happen via a new trait as I'm currently doing, or should we add a method `PhysicalExpr::with_schema`? If

Re: [I] Improve the way to pass through configurations to datafusion [datafusion-ballista]

2025-04-13 Thread via GitHub
milenkovicm closed issue #579: Improve the way to pass through configurations to datafusion URL: https://github.com/apache/datafusion-ballista/issues/579

Re: [I] starts_with function is serialised as UDF [datafusion-ballista]

2025-04-13 Thread via GitHub
milenkovicm closed issue #578: starts_with function is serialised as UDF URL: https://github.com/apache/datafusion-ballista/issues/578

Re: [I] starts_with function is serialised as UDF [datafusion-ballista]

2025-04-13 Thread via GitHub
milenkovicm commented on issue #578: URL: https://github.com/apache/datafusion-ballista/issues/578#issuecomment-2799989526 I believe this is not the case anymore, will close it. Please re-open if it is still an issue

Re: [I] Remove `flight-sql` from ballista in 46.0.0 [datafusion-ballista]

2025-04-13 Thread via GitHub
milenkovicm commented on issue #1227: URL: https://github.com/apache/datafusion-ballista/issues/1227#issuecomment-2799989081 issues which may be relevant to `flight-sql`: - #941 - #839

Re: [I] add create_dataframe method to BallistaContest [datafusion-ballista]

2025-04-13 Thread via GitHub
milenkovicm closed issue #633: add create_dataframe method to BallistaContest URL: https://github.com/apache/datafusion-ballista/issues/633

Re: [I] add create_dataframe method to BallistaContest [datafusion-ballista]

2025-04-13 Thread via GitHub
milenkovicm commented on issue #633: URL: https://github.com/apache/datafusion-ballista/issues/633#issuecomment-2799988635 Ballista uses `SessionContext` from datafusion; most methods exposed by `SessionContext` should be supported. Closing this as outdated. Please re-open if still needed

Re: [I] Deployment on AWS [datafusion-ballista]

2025-04-13 Thread via GitHub
milenkovicm commented on issue #886: URL: https://github.com/apache/datafusion-ballista/issues/886#issuecomment-2799986505 will close this issue, ballista does not provide deployment scripts anymore

Re: [I] Deployment on AWS [datafusion-ballista]

2025-04-13 Thread via GitHub
milenkovicm closed issue #886: Deployment on AWS URL: https://github.com/apache/datafusion-ballista/issues/886

Re: [PR] Add S3 object store support to executor and scheduler [datafusion-ballista]

2025-04-13 Thread via GitHub
milenkovicm commented on PR #1230: URL: https://github.com/apache/datafusion-ballista/pull/1230#issuecomment-2799983961 @mmooyyii & @joaoferrao would you mind having a look at this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to G

Re: [PR] Optimize TopK with filter ~1.4x faster [datafusion]

2025-04-13 Thread via GitHub
Dandandan commented on PR #15697: URL: https://github.com/apache/datafusion/pull/15697#issuecomment-2799975208 > Nice! We can even wire it up with the filter pushdown so that if an operator under us "absorbs" the filter (eg it got pushed down to the scan) we skip doing this internally.

Re: [PR] Optimize TopK with filter ~1.4x faster [datafusion]

2025-04-13 Thread via GitHub
adriangb commented on PR #15697: URL: https://github.com/apache/datafusion/pull/15697#issuecomment-2799971429 Nice! We can even wire it up with the filter pushdown so that if an operator under us "absorbs" the filter (eg it got pushed down to the scan) we skip doing this internally. -- T

Re: [PR] Optimize TopK with filter ~1.4x faster [datafusion]

2025-04-13 Thread via GitHub
Dandandan commented on PR #15697: URL: https://github.com/apache/datafusion/pull/15697#issuecomment-2799968795 > If I understand correctly, the idea is to basically do the same thing we're going to do for the dynamic filters but essentially do the filtering inside of top K to avoid some extra

Re: [D] San Francisco DataFusion Meetup scheduled for 9/25 [datafusion]

2025-04-13 Thread via GitHub
GitHub user alamb added a comment to the discussion: San Francisco DataFusion Meetup scheduled for 9/25 We are organizing another one here: https://github.com/apache/datafusion/discussions/15657 GitHub link: https://github.com/apache/datafusion/discussions/11972#discussioncomment-12819584 -

Re: [D] DISCUSSION: Anyone around for the Databricks Data & AI Summit in San Francisco June 9–12? [datafusion]

2025-04-13 Thread via GitHub
GitHub user alamb added a comment to the discussion: DISCUSSION: Anyone around for the Databricks Data & AI Summit in San Francisco June 9–12? Timing Thread: * maybe we could shoot for Monday June 9 as that will be before the Data and AI summit events get going full steam GitHub link: https:

Re: [D] DISCUSSION: Anyone around for the Databricks Data & AI Summit in San Francisco June 9–12? [datafusion]

2025-04-13 Thread via GitHub
GitHub user alamb added a comment to the discussion: DISCUSSION: Anyone around for the Databricks Data & AI Summit in San Francisco June 9–12? I will also be traveling to attend and would love to help make it happen! cc @mwlyed @ameyc @emgeee @mwylde maybe you would be around and interested i

Re: [I] Add CatalogProvider API [datafusion-python]

2025-04-13 Thread via GitHub
timsaucer commented on issue #1103: URL: https://github.com/apache/datafusion-python/issues/1103#issuecomment-2799942428 This is *very* good feedback. I think the catalog provider and schema provider will be relatively easy to do to provide both pure python and rust-ffi versions. I am conc

Re: [PR] Optimize TopK with filter [datafusion]

2025-04-13 Thread via GitHub
Dandandan commented on code in PR #15697: URL: https://github.com/apache/datafusion/pull/15697#discussion_r2041118426 ## datafusion/physical-plan/src/topk/mod.rs: ## @@ -202,24 +204,93 @@ impl TopK { }) .collect::>>()?; +// selected indices +

Re: [PR] Optimize TopK with filter [datafusion]

2025-04-13 Thread via GitHub
Dandandan commented on code in PR #15697: URL: https://github.com/apache/datafusion/pull/15697#discussion_r2041118282 ## datafusion/physical-plan/src/topk/mod.rs: ## @@ -202,24 +204,93 @@ impl TopK { }) .collect::>>()?; +// selected indices +

[PR] Optimize TopK with filter [datafusion]

2025-04-13 Thread via GitHub
Dandandan opened a new pull request, #15697: URL: https://github.com/apache/datafusion/pull/15697 ## Which issue does this PR close? - Closes #. ## Rationale for this change This optimizes our TopK by filtering early based on the threshold values, avoidin

Re: [I] Support Rust UDF [datafusion-ballista]

2025-04-13 Thread via GitHub
yongda-fan commented on issue #993: URL: https://github.com/apache/datafusion-ballista/issues/993#issuecomment-2799909943 Yes, I agree: with the example https://github.com/apache/datafusion-ballista/blob/main/examples/examples/custom-executor.rs, one can easily inject Rust UDFs. -- This is an

Re: [I] Support Rust UDF [datafusion-ballista]

2025-04-13 Thread via GitHub
yongda-fan closed issue #993: Support Rust UDF URL: https://github.com/apache/datafusion-ballista/issues/993 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-m

Re: [PR] Cascaded spill merge and re-spill [datafusion]

2025-04-13 Thread via GitHub
alamb commented on PR #15610: URL: https://github.com/apache/datafusion/pull/15610#issuecomment-2799903999 I plan to re-review this tomorrow -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the sp

Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]

2025-04-13 Thread via GitHub
alamb commented on issue #15072: URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2799902720 Thanks @jayzhan211 for the approval and for the discussion. I'll plan to merge https://github.com/apache/datafusion/pull/15466 tomorrow then unless we want to discuss it further.

Re: [PR] Specialize join matching when values in map are unique [datafusion]

2025-04-13 Thread via GitHub
Dandandan closed pull request #15690: Specialize join matching when values in map are unique URL: https://github.com/apache/datafusion/pull/15690 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

Re: [I] Add CatalogProvider API [datafusion-python]

2025-04-13 Thread via GitHub
tespent commented on issue #1103: URL: https://github.com/apache/datafusion-python/issues/1103#issuecomment-2799841561 @timsaucer This is wonderful! However, I think FFI CatalogProvider is not enough for my needs, since I'm looking for *pure python-written* CatalogProvider and SchemaProvid
