Re: [PR] [datafusion-spark] Implement `factorical` function [datafusion]

2025-06-17 Thread via GitHub
tlm365 commented on PR #16125: URL: https://github.com/apache/datafusion/pull/16125#issuecomment-2982855702 > @tlm365 - I wonder if you would be willing to help pick this PR back up now that we have merged a PR with a bunch of tests from @shehabgamin here: > > * [chore: generate b

Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-17 Thread via GitHub
dharanad commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2982806130 @adriangb I would like to work on this -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

Re: [I] signum(0) returns incorrect result [datafusion-comet]

2025-06-17 Thread via GitHub
andygrove closed issue #664: signum(0) returns incorrect result URL: https://github.com/apache/datafusion-comet/issues/664 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To un

Re: [PR] Add support of parsing struct field's options in BigQuery [datafusion-sqlparser-rs]

2025-06-17 Thread via GitHub
iffyio merged PR #1890: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1890 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr

Re: [PR] feat: Add support for signum expression [datafusion-comet]

2025-06-17 Thread via GitHub
andygrove merged PR #1889: URL: https://github.com/apache/datafusion-comet/pull/1889 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@

Re: [PR] Fix CI Failure: replace false with NullEqualsNothing [datafusion]

2025-06-17 Thread via GitHub
ding-young commented on PR #16437: URL: https://github.com/apache/datafusion/pull/16437#issuecomment-2982671518 @2010YOUY01 Could you take a quick look? It’s a small fix that should help with running CI for other PRs. -- This is an automated message from the Apache Git Service. To respon

Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-06-17 Thread via GitHub
zhuqi-lucas commented on code in PR #16395: URL: https://github.com/apache/datafusion/pull/16395#discussion_r2153573700 ## datafusion-examples/examples/embedding_parquet_indexes.rs: ## @@ -0,0 +1,243 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more c

Re: [PR] Fix constant window for evaluate stateful [datafusion]

2025-06-17 Thread via GitHub
suibianwanwank commented on PR #16430: URL: https://github.com/apache/datafusion/pull/16430#issuecomment-2982535228 > If you can't find something suitable I can try and find time to help over the next few days Thanks @alamb, as mentioned, the PhysicalPlan generated by default `planning`

Re: [PR] feat: support array_max [datafusion-comet]

2025-06-17 Thread via GitHub
mbutrovich commented on code in PR #1892: URL: https://github.com/apache/datafusion-comet/pull/1892#discussion_r2153511180 ## spark/src/test/scala/org/apache/comet/CometArrayExpressionSuite.scala: ## @@ -232,6 +232,21 @@ class CometArrayExpressionSuite extends CometTestBase with

Re: [PR] feat: Add support to lookup map by key [datafusion-comet]

2025-06-17 Thread via GitHub
codecov-commenter commented on PR #1898: URL: https://github.com/apache/datafusion-comet/pull/1898#issuecomment-2982432145 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1898?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] dynamic filter refactor [datafusion]

2025-06-17 Thread via GitHub
github-actions[bot] commented on PR #15685: URL: https://github.com/apache/datafusion/pull/15685#issuecomment-2982375299 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or

Re: [PR] minor: Avoid rewriting join to unsupported join [datafusion-comet]

2025-06-17 Thread via GitHub
comphead commented on PR #1888: URL: https://github.com/apache/datafusion-comet/pull/1888#issuecomment-2982368402 btw is it still an issue? I think LeftAnti with SMJ has been fixed a while ago in DF -- This is an automated message from the Apache Git Service. To respond to the message, p

[PR] feat: support lookup map by key [datafusion-comet]

2025-06-17 Thread via GitHub
comphead opened a new pull request, #1898: URL: https://github.com/apache/datafusion-comet/pull/1898 ## Which issue does this PR close? Closes #1884 . ## Rationale for this change ## What changes are included in this PR? ## How are these cha

Re: [PR] Add support of parsing struct field's options in BigQuery [datafusion-sqlparser-rs]

2025-06-17 Thread via GitHub
git-hulk commented on PR #1890: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1890#issuecomment-2982349184 @iffyio, I've changed the link to the Markdown style now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHu

[PR] Fix CI Failure: replace false with NullEqualsNothing [datafusion]

2025-06-17 Thread via GitHub
ding-young opened a new pull request, #16437: URL: https://github.com/apache/datafusion/pull/16437 ## Which issue does this PR close? ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested?

Re: [PR] feat: add SchemaProvider::table_type(table_name: &str) [datafusion]

2025-06-17 Thread via GitHub
comphead commented on code in PR #16401: URL: https://github.com/apache/datafusion/pull/16401#discussion_r2153457367 ## datafusion/catalog/src/schema.rs: ## @@ -54,6 +55,14 @@ pub trait SchemaProvider: Debug + Sync + Send { name: &str, ) -> Result>, DataFusionError

Re: [PR] Add compression option to SpillManager [datafusion]

2025-06-17 Thread via GitHub
ding-young commented on PR #16268: URL: https://github.com/apache/datafusion/pull/16268#issuecomment-2982326469 Currently CI fails, but I think that is due to change introduced in another pr. -- This is an automated message from the Apache Git Service. To respond to the message, please l

Re: [PR] Add compression option to SpillManager [datafusion]

2025-06-17 Thread via GitHub
ding-young commented on code in PR #16268: URL: https://github.com/apache/datafusion/pull/16268#discussion_r2153450196 ## datafusion/common/src/config.rs: ## @@ -274,6 +276,61 @@ config_namespace! { } } +#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)] +pub enum Spi

Re: [PR] Add compression option to SpillManager [datafusion]

2025-06-17 Thread via GitHub
ding-young commented on code in PR #16268: URL: https://github.com/apache/datafusion/pull/16268#discussion_r2153447729 ## datafusion/physical-plan/src/spill/spill_manager.rs: ## @@ -44,16 +44,23 @@ pub struct SpillManager { schema: SchemaRef, /// Number of batches to b

Re: [PR] feat: support array_max [datafusion-comet]

2025-06-17 Thread via GitHub
parthchandra commented on code in PR #1892: URL: https://github.com/apache/datafusion-comet/pull/1892#discussion_r2153446795 ## spark/src/test/scala/org/apache/comet/CometArrayExpressionSuite.scala: ## @@ -232,6 +232,21 @@ class CometArrayExpressionSuite extends CometTestBase wi

[PR] doc: Add comments to clarify algorithm for `MarkJoin`s [datafusion]

2025-06-17 Thread via GitHub
jonathanc-n opened a new pull request, #16436: URL: https://github.com/apache/datafusion/pull/16436 ## Which issue does this PR close? - Closes #16415. ## Rationale for this change The algorithm that is used for mark joins in `calculate_indices_by_join_type` might be

Re: [PR] Add compression option to SpillManager [datafusion]

2025-06-17 Thread via GitHub
ding-young commented on code in PR #16268: URL: https://github.com/apache/datafusion/pull/16268#discussion_r2153442938 ## datafusion/core/tests/memory_limit/mod.rs: ## @@ -630,6 +635,77 @@ async fn test_disk_spill_limit_not_reached() -> Result<()> { Ok(()) } +/// Extern

Re: [I] Push Dynamic Join Predicates into Scan ("Sideways Information Passing", etc) [datafusion]

2025-06-17 Thread via GitHub
adriangb commented on issue #7955: URL: https://github.com/apache/datafusion/issues/7955#issuecomment-2982304653 I opened #16435 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comme

Re: [I] [Epic] A collection of dynamic filtering related items [datafusion]

2025-06-17 Thread via GitHub
adriangb commented on issue #15512: URL: https://github.com/apache/datafusion/issues/15512#issuecomment-2982302447 I added #16435 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comm

Re: [I] Comet cannot read decimals with physical type BINARY [datafusion-comet]

2025-06-17 Thread via GitHub
andygrove closed issue #567: Comet cannot read decimals with physical type BINARY URL: https://github.com/apache/datafusion-comet/issues/567 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the speci

[I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-17 Thread via GitHub
adriangb opened a new issue, #16435: URL: https://github.com/apache/datafusion/issues/16435 ### Is your feature request related to a problem or challenge? Related to #15512 I think this is a first step towards HashJoinExec pushdown. I think we should model that as `col >= hash

Re: [PR] docs: Add docs stating that Comet does not support reading decimals encoded in Parquet BINARY format [datafusion-comet]

2025-06-17 Thread via GitHub
andygrove merged PR #1895: URL: https://github.com/apache/datafusion-comet/pull/1895 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@

Re: [PR] feat: Support `u32` indices for `HashJoinExec` [datafusion]

2025-06-17 Thread via GitHub
jonathanc-n commented on PR #16434: URL: https://github.com/apache/datafusion/pull/16434#issuecomment-2982300855 cc @Dandandan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific commen

[PR] feat: Support `u32` indices for `HashJoinExec` [datafusion]

2025-06-17 Thread via GitHub
jonathanc-n opened a new pull request, #16434: URL: https://github.com/apache/datafusion/pull/16434 ## Which issue does this PR close? - Closes #16179 . ## Rationale for this change We can use `u32` indices instead of `u64` indices when there are less than `u32::MAX`

Re: [I] Blog post about TopK filter pushdown [datafusion]

2025-06-17 Thread via GitHub
adriangb commented on issue #15513: URL: https://github.com/apache/datafusion/issues/15513#issuecomment-2982293966 I think we're ready to publish a blogpost! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL abov

Re: [I] [Epic] A collection of dynamic filtering related items [datafusion]

2025-06-17 Thread via GitHub
adriangb commented on issue #15512: URL: https://github.com/apache/datafusion/issues/15512#issuecomment-2982286049 Exciting news! We've merged TopK filter pushdown. As we've moved along we've found a lot of neat little optimizations - I think a blog post is in order. -- This is an automat

Re: [PR] feat: support RangePartitioning with native shuffle [datafusion-comet]

2025-06-17 Thread via GitHub
mbutrovich merged PR #1862: URL: https://github.com/apache/datafusion-comet/pull/1862 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...

Re: [I] Support RangePartitioning with native shuffle [datafusion-comet]

2025-06-17 Thread via GitHub
mbutrovich closed issue #458: Support RangePartitioning with native shuffle URL: https://github.com/apache/datafusion-comet/issues/458 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific com

Re: [PR] feat: add SchemaProvider::table_type(table_name: &str) [datafusion]

2025-06-17 Thread via GitHub
epgif commented on PR #16401: URL: https://github.com/apache/datafusion/pull/16401#issuecomment-2982140980 > @alamb > > > I wonder if there is some way we can write a test for it (mostly to prevent it from being accidentally broken/changed in the future) > > I looked around for

Re: [PR] feat: add SchemaProvider::table_type(table_name: &str) [datafusion]

2025-06-17 Thread via GitHub
epgif commented on code in PR #16401: URL: https://github.com/apache/datafusion/pull/16401#discussion_r2153345514 ## datafusion/catalog/src/schema.rs: ## @@ -54,6 +55,14 @@ pub trait SchemaProvider: Debug + Sync + Send { name: &str, ) -> Result>, DataFusionError>;

Re: [PR] [datafusion-spark] Implement ceil&floor function for spark [datafusion]

2025-06-17 Thread via GitHub
irenjj commented on PR #15958: URL: https://github.com/apache/datafusion/pull/15958#issuecomment-2982098992 > @irenjj - I wonder if you would be willing to help pick this PR back up now that we have merged a PR with a bunch of tests from @shehabgamin here: > > * [chore: generate basic

Re: [PR] minor: Improve testing of math scalar functions [datafusion-comet]

2025-06-17 Thread via GitHub
andygrove merged PR #1896: URL: https://github.com/apache/datafusion-comet/pull/1896 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@

Re: [PR] Improved experience when remote object store URL does not end in / [datafusion]

2025-06-17 Thread via GitHub
alamb commented on PR #16386: URL: https://github.com/apache/datafusion/pull/16386#issuecomment-2982071425 @blaginin also did an end to end test with S3 in the CI tests. The instructions are here: - https://github.com/apache/datafusion/blob/main/datafusion-cli/CONTRIBUTING.md#L30-L

Re: [PR] Improved experience when remote object store URL does not end in / [datafusion]

2025-06-17 Thread via GitHub
alamb commented on PR #16386: URL: https://github.com/apache/datafusion/pull/16386#issuecomment-2982068035 > > > @alamb Could you help review this PR? > > > > > > Thanks @xiedeyantu ! We normally need to add tests as part of any code PR -- could you look into adding some tests and

Re: [I] Access Data from S3 in DeltaLake format using Ballista on Kubernetes [datafusion-ballista]

2025-06-17 Thread via GitHub
janbraunsdorff commented on issue #1268: URL: https://github.com/apache/datafusion-ballista/issues/1268#issuecomment-2982066537 Hey @milenkovicm Thanks for your help in the best case, this is what i am looking for: ```rust let table = open_table("s3://bucket/source").awa

Re: [PR] Prune files during streams and avoid additional pruning if there are no dynamic filters [datafusion]

2025-06-17 Thread via GitHub
adriangb commented on PR #16424: URL: https://github.com/apache/datafusion/pull/16424#issuecomment-2982065398 I think this will require @Dandandan 's suggestion of only updating the filters if the new ones are more selective: #16433. Right now since we always update the filters -> it

Re: [PR] Fix constant window for evaluate stateful [datafusion]

2025-06-17 Thread via GitHub
alamb commented on PR #16430: URL: https://github.com/apache/datafusion/pull/16430#issuecomment-2982059379 Thanks @suibianwanwank - I think it would be great if we could use .slt tests to write a reproducer Here are the instructions: https://github.com/apache/datafusion/tree/main/da

Re: [PR] [datafusion-spark] Implement ceil&floor function for spark [datafusion]

2025-06-17 Thread via GitHub
alamb commented on PR #15958: URL: https://github.com/apache/datafusion/pull/15958#issuecomment-2982055315 @irenjj - I wonder if you would be willing to help pick this PR back up now that we have merged a PR with a bunch of tests from @shehabgamin here: - https://github.com/apache/data

Re: [PR] [datafusion-spark] Implement `factorical` function [datafusion]

2025-06-17 Thread via GitHub
alamb commented on PR #16125: URL: https://github.com/apache/datafusion/pull/16125#issuecomment-2982055127 @tlm365 - I wonder if you would be willing to help pick this PR back up now that we have merged a PR with a bunch of tests from @shehabgamin here: - https://github.com/apache/dataf

Re: [PR] chore: generate basic spark function tests [datafusion]

2025-06-17 Thread via GitHub
alamb commented on PR #16409: URL: https://github.com/apache/datafusion/pull/16409#issuecomment-2982052754 Thanks again @shehabgamin and @comphead -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

Re: [PR] chore: generate basic spark function tests [datafusion]

2025-06-17 Thread via GitHub
alamb merged PR #16409: URL: https://github.com/apache/datafusion/pull/16409 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [PR] Use dedicated NullEquality enum instead of null_equals_null boolean [datafusion]

2025-06-17 Thread via GitHub
alamb merged PR #16419: URL: https://github.com/apache/datafusion/pull/16419 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [PR] Use dedicated NullEquality enum instead of null_equals_null boolean [datafusion]

2025-06-17 Thread via GitHub
alamb commented on PR #16419: URL: https://github.com/apache/datafusion/pull/16419#issuecomment-2982051679 🚀 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe

Re: [PR] fix: Enable WASM compilation by making sqlparser's recursive-protection optional [datafusion]

2025-06-17 Thread via GitHub
alamb commented on PR #16418: URL: https://github.com/apache/datafusion/pull/16418#issuecomment-2982051446 Thanks again @jonmmease -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific co

Re: [PR] fix: Enable WASM compilation by making sqlparser's recursive-protection optional [datafusion]

2025-06-17 Thread via GitHub
alamb merged PR #16418: URL: https://github.com/apache/datafusion/pull/16418 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [PR] Only update TopK dynamic filters if the new ones are more selective [datafusion]

2025-06-17 Thread via GitHub
adriangb commented on code in PR #16433: URL: https://github.com/apache/datafusion/pull/16433#discussion_r2153281386 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -846,8 +846,10 @@ pub struct SortExec { common_sort_prefix: Vec, /// Cache holding plan properties

[PR] Only update TopK dynamic filters if the new ones are more selective [datafusion]

2025-06-17 Thread via GitHub
adriangb opened a new pull request, #16433: URL: https://github.com/apache/datafusion/pull/16433 Closes #16432 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscrib

Re: [I] Use sha2 implementation from datafusion-spark crate [datafusion-comet]

2025-06-17 Thread via GitHub
rishvin commented on issue #1820: URL: https://github.com/apache/datafusion-comet/issues/1820#issuecomment-2982012440 I should be able to resume working on this later this week. I might have to file another ticket to first upgrade the datafusion dependency before making Comet changes. --

Re: [PR] Prune files during streams and avoid additional pruning if there are no dynamic filters [datafusion]

2025-06-17 Thread via GitHub
adriangb commented on PR #16424: URL: https://github.com/apache/datafusion/pull/16424#issuecomment-2981923921 @Dandandan @alamb I pushed [ebe4196](https://github.com/apache/datafusion/pull/16424/commits/ebe41962f285a9f746b030a7242d122d00a8b0df) which adds a very cheap way to track changes t

Re: [PR] Blog post on query cancellation [datafusion-site]

2025-06-17 Thread via GitHub
alamb commented on code in PR #75: URL: https://github.com/apache/datafusion-site/pull/75#discussion_r2153183826 ## content/blog/2025-06-15-cancellation.md: ## @@ -0,0 +1,353 @@ +--- Review Comment: Yes of course -- we have done this on other blogs as well Something

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-17 Thread via GitHub
alamb commented on PR #15770: URL: https://github.com/apache/datafusion/pull/15770#issuecomment-2981852178 woohoo! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubs

Re: [PR] feat: support array_max [datafusion-comet]

2025-06-17 Thread via GitHub
mbutrovich commented on code in PR #1892: URL: https://github.com/apache/datafusion-comet/pull/1892#discussion_r2153165641 ## spark/src/test/scala/org/apache/comet/CometArrayExpressionSuite.scala: ## @@ -232,6 +232,21 @@ class CometArrayExpressionSuite extends CometTestBase with

Re: [PR] Use Tokio's task budget consistently, better APIs to support task cancellation [datafusion]

2025-06-17 Thread via GitHub
alamb commented on PR #16398: URL: https://github.com/apache/datafusion/pull/16398#issuecomment-2981793916 🤖: Benchmark completed Details ``` Comparing HEAD and task_budget Benchmark clickbench_1.json ┏

Re: [PR] Use Tokio's task budget consistently, better APIs to support task cancellation [datafusion]

2025-06-17 Thread via GitHub
alamb commented on PR #16398: URL: https://github.com/apache/datafusion/pull/16398#issuecomment-2981771620 🤖: Benchmark completed Details ``` Comparing HEAD and task_budget Benchmark clickbench_extended.json ┏━

Re: [PR] Use Tokio's task budget consistently, better APIs to support task cancellation [datafusion]

2025-06-17 Thread via GitHub
alamb commented on PR #16398: URL: https://github.com/apache/datafusion/pull/16398#issuecomment-2981771719 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubun

Re: [PR] Implementation for regex_instr [datafusion]

2025-06-17 Thread via GitHub
blaginin commented on code in PR #15928: URL: https://github.com/apache/datafusion/pull/15928#discussion_r2153128215 ## datafusion/functions/src/regex/regexpinstr.rs: ## @@ -0,0 +1,804 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor lice

Re: [PR] minor: Improve testing of math scalar functions [datafusion-comet]

2025-06-17 Thread via GitHub
andygrove commented on code in PR #1896: URL: https://github.com/apache/datafusion-comet/pull/1896#discussion_r2153127591 ## spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala: ## @@ -1217,38 +1217,54 @@ class CometExpressionSuite extends CometTestBase with Adapti

Re: [PR] Implementation for regex_instr [datafusion]

2025-06-17 Thread via GitHub
Copilot commented on code in PR #15928: URL: https://github.com/apache/datafusion/pull/15928#discussion_r2153125391 ## datafusion/functions/benches/regx.rs: ## @@ -127,6 +128,46 @@ fn criterion_benchmark(c: &mut Criterion) { }) }); +c.bench_function("regexp_i

Re: [PR] minor: Improve testing of math scalar functions [datafusion-comet]

2025-06-17 Thread via GitHub
mbutrovich commented on code in PR #1896: URL: https://github.com/apache/datafusion-comet/pull/1896#discussion_r2153121127 ## spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala: ## @@ -1217,38 +1217,54 @@ class CometExpressionSuite extends CometTestBase with Adapt

Re: [PR] Implementation for regex_instr [datafusion]

2025-06-17 Thread via GitHub
blaginin commented on code in PR #15928: URL: https://github.com/apache/datafusion/pull/15928#discussion_r2153122396 ## datafusion/functions/src/regex/regexpcount.rs: ## @@ -550,7 +550,7 @@ where } } -fn compile_and_cache_regex<'strings, 'cache>( +pub fn compile_and_cach

Re: [PR] minor: Improve testing of math scalar functions [datafusion-comet]

2025-06-17 Thread via GitHub
mbutrovich commented on code in PR #1896: URL: https://github.com/apache/datafusion-comet/pull/1896#discussion_r2153121127 ## spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala: ## @@ -1217,38 +1217,54 @@ class CometExpressionSuite extends CometTestBase with Adapt

Re: [I] Streamline github actions [datafusion-ballista]

2025-06-17 Thread via GitHub
milenkovicm commented on issue #1128: URL: https://github.com/apache/datafusion-ballista/issues/1128#issuecomment-2981745228 DataFusion has many more tests. I'm open to suggestions; my main concern with the current state of git actions is that there may be quite a few repeated steps. -- This

Re: [PR] Chore: implement datetime funcs as ScalarUDFImpl [datafusion-comet]

2025-06-17 Thread via GitHub
mbutrovich merged PR #1874: URL: https://github.com/apache/datafusion-comet/pull/1874 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...

Re: [PR] Fix: Map functions crash on out of bounds cases [datafusion]

2025-06-17 Thread via GitHub
comphead commented on PR #16203: URL: https://github.com/apache/datafusion/pull/16203#issuecomment-2981716899 > @comphead, can you please tell if there is something I can do to move this forward? Are there any relevant benches I could either run or adapt for this case? Sorry I didn't

Re: [PR] Prune files during streams and avoid additional pruning if there are no dynamic filters [datafusion]

2025-06-17 Thread via GitHub
adriangb commented on PR #16424: URL: https://github.com/apache/datafusion/pull/16424#issuecomment-2981674932 > 🤖: Benchmark completed > > Details Interesting results. I'm inclined to believe that the speedups and slowdowns are both real. We'll have to think about this a bit mo

Re: [PR] Prune files during streams and avoid additional pruning if there are no dynamic filters [datafusion]

2025-06-17 Thread via GitHub
adriangb commented on PR #16424: URL: https://github.com/apache/datafusion/pull/16424#issuecomment-2981671395 > > > @alamb sorry for the ping but would you mind running `topk_tpch` on here? > > > > > > LOL I need to make a webpage (or give you access to the server to queue the job

Re: [PR] Use Tokio's task budget consistently, better APIs to support task cancellation [datafusion]

2025-06-17 Thread via GitHub
alamb commented on PR #16398: URL: https://github.com/apache/datafusion/pull/16398#issuecomment-2981666369 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubun

Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-06-17 Thread via GitHub
alamb commented on code in PR #16395: URL: https://github.com/apache/datafusion/pull/16395#discussion_r2153064233 ## datafusion-examples/examples/embedding_parquet_indexes.rs: ## @@ -0,0 +1,243 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contrib

Re: [PR] Use Tokio's task budget consistently, better APIs to support task cancellation [datafusion]

2025-06-17 Thread via GitHub
pepijnve commented on PR #16398: URL: https://github.com/apache/datafusion/pull/16398#issuecomment-2981642290 Just saw the merge conflict. I'll rebase. @alamb I've changed the default to the tokio-based temporary placeholder. The version that's not using `poll_proceed` yet. Would you

Re: [I] Only update topk filter when updated filter is more selective [datafusion]

2025-06-17 Thread via GitHub
Dandandan commented on issue #16432: URL: https://github.com/apache/datafusion/issues/16432#issuecomment-2981628184 The filter is shared between `TopK` instances, so it would benefit from a higher selectivity from other partitions and being earlier to filter out more rows. -- This is an

[I] Only update topk filter when updated filter is more selective [datafusion]

2025-06-17 Thread via GitHub
Dandandan opened a new issue, #16432: URL: https://github.com/apache/datafusion/issues/16432 Hm @adriangb another thing I wondered is `update_filter` does seem to take only the heap of the current partition into account, as in TopK (currently at least) each partition has its own heap (of k

Re: [PR] chore: Enable Spark SQL tests for auto scan mode [WIP] [datafusion-comet]

2025-06-17 Thread via GitHub
andygrove commented on PR #1885: URL: https://github.com/apache/datafusion-comet/pull/1885#issuecomment-2981585283 The failing tests are already ignored when the scan is explicitly set to native, but this does not work with `auto`. -- This is an automated message from the Apache Git Serv

Re: [PR] Prune files during streams and avoid additional pruning if there are no dynamic filters [datafusion]

2025-06-17 Thread via GitHub
alamb commented on PR #16424: URL: https://github.com/apache/datafusion/pull/16424#issuecomment-2981561219 🤖: Benchmark completed Details ``` Comparing HEAD and prune-rg Benchmark run_topk_tpch.json ┏━━

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-17 Thread via GitHub
adriangb commented on PR #15770: URL: https://github.com/apache/datafusion/pull/15770#issuecomment-2981552270 > Perhaps we can compare against the current filter and only update the expression if it is greater / more selective? Yeah I think that would be good. For context (I had to rem

Re: [PR] Prune files during streams and avoid additional pruning if there are no dynamic filters [datafusion]

2025-06-17 Thread via GitHub
Dandandan commented on PR #16424: URL: https://github.com/apache/datafusion/pull/16424#issuecomment-2981532775 > > @alamb sorry for the ping but would you mind running `topk_tpch` on here? > > LOL I need to make a webpage (or give you access to the server to queue the jobs yourself)

Re: [PR] Blog post on query cancellation [datafusion-site]

2025-06-17 Thread via GitHub
pepijnve commented on code in PR #75: URL: https://github.com/apache/datafusion-site/pull/75#discussion_r2152981299 ## content/blog/2025-06-15-cancellation.md: ## @@ -0,0 +1,353 @@ +--- Review Comment: Since Datadobi is paying for my time while I work on this stuff, would it

Re: [PR] Prune files during streams and avoid additional pruning if there are no dynamic filters [datafusion]

2025-06-17 Thread via GitHub
alamb commented on PR #16424: URL: https://github.com/apache/datafusion/pull/16424#issuecomment-2981525856 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubun

Re: [PR] Blog post on query cancellation [datafusion-site]

2025-06-17 Thread via GitHub
pepijnve commented on code in PR #75: URL: https://github.com/apache/datafusion-site/pull/75#discussion_r2152978616 ## content/blog/2025-06-15-cancellation.md: ## @@ -0,0 +1,353 @@ +--- +layout: post +title: Query Cancellation +date: 2025-06-27 +author: Pepijn Van Eeckhoudt +cat

Re: [PR] Blog post on query cancellation [datafusion-site]

2025-06-17 Thread via GitHub
pepijnve commented on code in PR #75: URL: https://github.com/apache/datafusion-site/pull/75#discussion_r2152976180 ## content/blog/2025-06-15-cancellation.md: ## @@ -0,0 +1,353 @@ +--- +layout: post +title: Query Cancellation +date: 2025-06-27 +author: Pepijn Van Eeckhoudt +cat

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-17 Thread via GitHub
Dandandan commented on PR #15770: URL: https://github.com/apache/datafusion/pull/15770#issuecomment-2981524048 Hm @adriangb another thing I wondered is `update_filter` does seem to take only the heap of the current partition into account, as in TopK (currently at least) each partition has i

Re: [PR] Prune files during streams and avoid additional pruning if there are no dynamic filters [datafusion]

2025-06-17 Thread via GitHub
alamb commented on PR #16424: URL: https://github.com/apache/datafusion/pull/16424#issuecomment-2981522651 > @alamb sorry for the ping but would you mind running `topk_tpch` on here? LOL I need to make a webpage (or give you access to the server to queue the jobs yourself) -- This i

Re: [PR] Use Tokio's task budget consistently, better APIs to support task cancellation [datafusion]

2025-06-17 Thread via GitHub
alamb commented on PR #16398: URL: https://github.com/apache/datafusion/pull/16398#issuecomment-2981516505 > I think we can go ahead with the task budget -- we can revert to the alternative if we discover a (surprising) performance issue Yeah, I agree In general I think default

Re: [PR] Prune files during streams and avoid additional pruning if there are no dynamic filters [datafusion]

2025-06-17 Thread via GitHub
adriangb commented on PR #16424: URL: https://github.com/apache/datafusion/pull/16424#issuecomment-2981455523 @alamb sorry for the ping but would you mind running `topk_tpch` on here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to G

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-17 Thread via GitHub
adriangb merged PR #15770: URL: https://github.com/apache/datafusion/pull/15770 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@dataf

Re: [I] Dynamic pruning filters from TopK state (optimize `ORDER BY LIMIT` queries) [datafusion]

2025-06-17 Thread via GitHub
adriangb closed issue #15037: Dynamic pruning filters from TopK state (optimize `ORDER BY LIMIT` queries) URL: https://github.com/apache/datafusion/issues/15037 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL abov

[I] interleave_views is slow [datafusion]

2025-06-17 Thread via GitHub
Dandandan opened a new issue, #16431: URL: https://github.com/apache/datafusion/issues/16431 ### Is your feature request related to a problem or challenge? I ran some benchmarks in DataFusion (sort_tpch) and I saw that `interleave_views` takes up a large amount of time for the sorting

Re: [I] Optimize TopK with filter [datafusion]

2025-06-17 Thread via GitHub
adriangb closed issue #15699: Optimize TopK with filter URL: https://github.com/apache/datafusion/issues/15699 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-17 Thread via GitHub
Dandandan commented on PR #15770: URL: https://github.com/apache/datafusion/pull/15770#issuecomment-2981406974 https://github.com/user-attachments/assets/b96882b9-f3da-4d9a-8635-ba77bdf8fbf3 Well, it looks like for these benchmarks 95% is now spent on just scanning the data and only

Re: [PR] feat: optimize and unparse grouping [datafusion]

2025-06-17 Thread via GitHub
eejbyfeldt commented on PR #16161: URL: https://github.com/apache/datafusion/pull/16161#issuecomment-2981365745 I did not look closely at it yet since I have not really contributed here in months. > there are three projections. for bitwise operation, there's no benefit for extra projec

Re: [PR] Use Tokio's task budget consistently, better APIs to support task cancellation [datafusion]

2025-06-17 Thread via GitHub
ozankabak commented on PR #16398: URL: https://github.com/apache/datafusion/pull/16398#issuecomment-2981363644 I think we can go ahead with the task budget -- we can revert to the alternative if we discover a (surprising) performance issue -- This is an automated message from the Apache G

Re: [PR] Use Tokio's task budget consistently, better APIs to support task cancellation [datafusion]

2025-06-17 Thread via GitHub
pepijnve commented on PR #16398: URL: https://github.com/apache/datafusion/pull/16398#issuecomment-2981356224 > What do you suggest? @ozankabak I'm not sure. I've been trying to convince myself that the task budget variant is good enough for now and doesn't incur a performance penalty

Re: [PR] Use Tokio's task budget consistently, better APIs to support task cancellation [datafusion]

2025-06-17 Thread via GitHub
ozankabak commented on PR #16398: URL: https://github.com/apache/datafusion/pull/16398#issuecomment-2981349176 > @alamb the remaining open question I have related to this PR is what the default implementation should be for the time being. Do we use the tokio has_budget_remaining/consume_bud

Re: [PR] Chore: implement datetime funcs as ScalarUDFImpl [datafusion-comet]

2025-06-17 Thread via GitHub
trompa commented on code in PR #1874: URL: https://github.com/apache/datafusion-comet/pull/1874#discussion_r2152871276 ## native/core/src/execution/planner.rs: ## @@ -460,22 +459,40 @@ impl PhysicalPlanner { ))) } ExprStruct::Hour(expr)

Re: [PR] Use Tokio's task budget consistently, better APIs to support task cancellation [datafusion]

2025-06-17 Thread via GitHub
pepijnve commented on code in PR #16398: URL: https://github.com/apache/datafusion/pull/16398#discussion_r2152869442 ## datafusion/physical-plan/src/coop.rs: ## @@ -0,0 +1,325 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agree

Re: [PR] Use Tokio's task budget consistently, better APIs to support task cancellation [datafusion]

2025-06-17 Thread via GitHub
pepijnve commented on PR #16398: URL: https://github.com/apache/datafusion/pull/16398#issuecomment-2981329400 @alamb the remaining open question I have related to this PR is what the default implementation should be for the time being. Do we use the tokio `has_budget_remaining`/`consume_bud
