Re: [PR] chore: refactor `BuildProbeJoinMetrics` to use `BaselineMetrics` [datafusion]

2025-07-05 Thread via GitHub
Samyak2 commented on PR #16500: URL: https://github.com/apache/datafusion/pull/16500#issuecomment-3039543288 I have added asserts for metrics in the existing join tests. The ones in hash_join and cross_join are working. The asserts in nested_loop_join are currently failing due to a mismatch

[I] Panic happens when adding a decimal256 to a float (SQLancer) [datafusion]

2025-07-05 Thread via GitHub
2010YOUY01 opened a new issue, #16689: URL: https://github.com/apache/datafusion/issues/16689 ### Describe the bug datafusion-cli is compiled from the latest main (commit 25c2a079fc) ``` > create table tt(v1 decimal(50,2)); 0 row(s) fetched. Elapsed 0.002 seconds.

Re: [I] Datafusion can't seem to cast evolving structs [datafusion]

2025-07-05 Thread via GitHub
alamb commented on issue #14757: URL: https://github.com/apache/datafusion/issues/14757#issuecomment-3039034005 🚀 -- thank you for your attention to this @kosiew -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the UR

Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-05 Thread via GitHub
alamb commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3039045610 Thanks @zhuqi-lucas -- I will keep looking at this later today

Re: [I] feature: adapt predicate pushdown for mismatched nested/struct schemas [datafusion]

2025-07-05 Thread via GitHub
alamb commented on issue #16565: URL: https://github.com/apache/datafusion/issues/16565#issuecomment-3039045185 > Implementation considerations: We would need to extend CastExpr and TryCastExpr to detect when the input/output types are structs and invoke the struct-aware casting logic appro

Re: [I] Improve performance of `datafusion-cli` when reading from remote storage [datafusion]

2025-07-05 Thread via GitHub
swaingotnochill commented on issue #16365: URL: https://github.com/apache/datafusion/issues/16365#issuecomment-3039046493 I observed a significant performance improvement just testing on the master branch without any metadata caching as compared to before. ``` ➜ datafusion git:(m

Re: [I] feature: adapt predicate pushdown for mismatched nested/struct schemas [datafusion]

2025-07-05 Thread via GitHub
adriangb commented on issue #16565: URL: https://github.com/apache/datafusion/issues/16565#issuecomment-3039050784 I agree with you @alamb. The issue I see with both approaches is going to be nested types: once you call the arrow cast kernel you can't take back control, so this approa

Re: [PR] feat: Support `PiecewiseMergeJoin` to speed up single range predicate joins [datafusion]

2025-07-05 Thread via GitHub
jonathanc-n commented on PR #16660: URL: https://github.com/apache/datafusion/pull/16660#issuecomment-3039228697 cc @alamb @ozankabak Seems like you guys were part of the discussion for range joins, this is a nice start to it? @Dandandan @comphead might be interested?

Re: [PR] Auto start testcontainers for `datafusion-cli` [datafusion]

2025-07-05 Thread via GitHub
blaginin commented on PR #16644: URL: https://github.com/apache/datafusion/pull/16644#issuecomment-3039402927 @findepi hey 👋 FYI as you suggested the change - in case you want to approve / merge 👀

Re: [PR] Blog post on query cancellation [datafusion-site]

2025-07-05 Thread via GitHub
kevinjqliu commented on PR #75: URL: https://github.com/apache/datafusion-site/pull/75#issuecomment-3039741826 Just read the post. Coming back here to thank everyone for the contribution! I always learn so much from these. I think it would be great to have a comment section for the b

[I] Enable comments on datafusion-site via giscus [datafusion-site]

2025-07-05 Thread via GitHub
kevinjqliu opened a new issue, #80: URL: https://github.com/apache/datafusion-site/issues/80 I think it's beneficial to the community to add a comment section to the datafusion blogs. Personally, I would like to have follow-up discussions on the blog post topic. I found a possible so

Re: [PR] chore: refactor `BuildProbeJoinMetrics` to use `BaselineMetrics` [datafusion]

2025-07-05 Thread via GitHub
Samyak2 commented on PR #16500: URL: https://github.com/apache/datafusion/pull/16500#issuecomment-3039652143 Actually, I see the same behavior on latest main. Looks like the output_rows metric in nested loop join is currently wrong?

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-05 Thread via GitHub
jonathanc-n commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2181119355 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -828,13 +833,127 @@ impl NestedLoopJoinStream { handle_state!(self.proces

Re: [PR] chore(devcontainer): use debian's `protobuf-compiler` package [datafusion]

2025-07-05 Thread via GitHub
fvj commented on PR #16687: URL: https://github.com/apache/datafusion/pull/16687#issuecomment-3038866616 Weird, it worked on my machine (tm). Let me investigate! Should I close the PR for now and reopen once I've fixed it?

Re: [PR] DRAFT: Update arrow/parquet to 56.0.0 [datafusion]

2025-07-05 Thread via GitHub
alamb commented on PR #16690: URL: https://github.com/apache/datafusion/pull/16690#issuecomment-3038917238 I see some failures in the row_group_pruning tests ``` $ cargo test --package datafusion --test parquet_config parquet::row_group_pruning ``` ``` failures:

Re: [PR] chore(devcontainer): use debian's `protobuf-compiler` package [datafusion]

2025-07-05 Thread via GitHub
fvj commented on PR #16687: URL: https://github.com/apache/datafusion/pull/16687#issuecomment-3038925425 latest patch should include the missing files: ``` $ docker build . -t datafusion-devcontainer:latest &>/dev/null && docker run --rm -it datafusion-devcontainer:latest bash -c "

Re: [PR] Refactor StreamJoinMetrics to reuse BaselineMetrics [datafusion]

2025-07-05 Thread via GitHub
alamb commented on PR #16674: URL: https://github.com/apache/datafusion/pull/16674#issuecomment-3038992838 Thank you @Standing-Man and @2010YOUY01

Re: [PR] Implementation for regex_instr [datafusion]

2025-07-05 Thread via GitHub
alamb commented on PR #15928: URL: https://github.com/apache/datafusion/pull/15928#issuecomment-3039007195 🚀

Re: [PR] Change tag and policy names to `ObjectName` [datafusion-sqlparser-rs]

2025-07-05 Thread via GitHub
alamb commented on PR #1892: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1892#issuecomment-3039005788 👍

Re: [PR] Update to Rust 1.88 [datafusion]

2025-07-05 Thread via GitHub
alamb commented on PR #16663: URL: https://github.com/apache/datafusion/pull/16663#issuecomment-3038886263 > > Thanks @melroy12 -- I started the CI tests > > I suspect this PR will fail at least `clippy` -- you'll perhaps have to run Clippy and make the changes it suggests (note you can a

Re: [PR] Refactor StreamJoinMetrics to reuse BaselineMetrics [datafusion]

2025-07-05 Thread via GitHub
alamb merged PR #16674: URL: https://github.com/apache/datafusion/pull/16674

Re: [I] Refactor `StreamJoinMetrics` to reuse `BaselineMetrics` [datafusion]

2025-07-05 Thread via GitHub
alamb closed issue #16496: Refactor `StreamJoinMetrics` to reuse `BaselineMetrics` URL: https://github.com/apache/datafusion/issues/16496

Re: [I] Add an example of embedding indexes *inside* a parquet file [datafusion]

2025-07-05 Thread via GitHub
JigaoLuo commented on issue #16374: URL: https://github.com/apache/datafusion/issues/16374#issuecomment-3039204719 Hi @zhuqi-lucas, While proofreading the blog, I had one major general question: **What are the limitations of such an embedded index?** - Is it limited to just one emb

Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-05 Thread via GitHub
zhuqi-lucas commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3039224050 > Hi @zhuqi-lucas, I've gone through the blog twice, and it looks great overall. I just have one very small nitpick above. > > Regarding the content: One suggestion would be

Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-05 Thread via GitHub
zhuqi-lucas commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3039222981 > Thanks @zhuqi-lucas -- I will keep looking at this later today Thank you @alamb!

Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-05 Thread via GitHub
JigaoLuo commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3039990579 This also ties into my initial impression that "the Embedded Index is just a hashset to speed up scans, which adds overhead to Parquet." as mentioned as a follow-up here: https://gi

Re: [I] Enable comments on datafusion-site via giscus [datafusion-site]

2025-07-05 Thread via GitHub
kevinjqliu commented on issue #80: URL: https://github.com/apache/datafusion-site/issues/80#issuecomment-3040039493 > I only worry that we won't see the comments and therefore won't respond. That's a good point. I think subscribing to the "Discussions" events on the GitHub repo will he

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-05 Thread via GitHub
UBarney commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-3038952572 > Thanks @UBarney, just some comments Thanks @jonathanc-n for reviewing. I have addressed all of your comments.

Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-05 Thread via GitHub
JigaoLuo commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3039088620 Hi @zhuqi-lucas, I've gone through the blog twice, and it looks great overall. I just have one very small nitpick above. Regarding the content: One suggestion would be to inclu

Re: [PR] Partially implement MATCH_RECOGNIZE for Advanced Pattern Matching [datafusion]

2025-07-05 Thread via GitHub
geoffreyclaude commented on code in PR #16685: URL: https://github.com/apache/datafusion/pull/16685#discussion_r2187358106 ## datafusion/sql/src/relation/mod.rs: ## @@ -281,3 +796,49 @@ fn optimize_subquery_sort(plan: LogicalPlan) -> Result> }); new_plan } + +/// Hel

Re: [I] 1000x slowdown opening parquet file due to partitions [datafusion]

2025-07-05 Thread via GitHub
jatin510 commented on issue #16676: URL: https://github.com/apache/datafusion/issues/16676#issuecomment-3039286131 take

Re: [I] Add an example of embedding indexes *inside* a parquet file [datafusion]

2025-07-05 Thread via GitHub
adriangb commented on issue #16374: URL: https://github.com/apache/datafusion/issues/16374#issuecomment-3039806254 Index suggestion: a tablesample index. And a general thought: exploring these sorts of indexes could do very cool stuff for DataFusion in general in terms of pushing us t

Re: [PR] Make `GenericDialect` support trailing commas in projections [datafusion-sqlparser-rs]

2025-07-05 Thread via GitHub
simonvandel commented on code in PR #1921: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1921#discussion_r2187644576 ## tests/sqlparser_common.rs: ## @@ -11125,16 +11125,9 @@ fn parse_trailing_comma() { ); trailing_commas.verified_stmt(r#"SELECT "from" F

Re: [I] Enable comments on datafusion-site via giscus [datafusion-site]

2025-07-05 Thread via GitHub
alamb commented on issue #80: URL: https://github.com/apache/datafusion-site/issues/80#issuecomment-3039788072 Adding comments to the blog would be great -- I only worry that we won't see the comments and therefore won't respond. Maybe we could make a GitHub issue or something for each blo

Re: [I] Add an example of embedding indexes *inside* a parquet file [datafusion]

2025-07-05 Thread via GitHub
alamb commented on issue #16374: URL: https://github.com/apache/datafusion/issues/16374#issuecomment-3039796047 > Hi [@zhuqi-lucas](https://github.com/zhuqi-lucas), > > While proofreading the blog, I had one major general question: **What are the limitations of such an embedded index?

[PR] build(deps): bump tokio from 1.45.0 to 1.46.1 [datafusion-python]

2025-07-05 Thread via GitHub
dependabot[bot] opened a new pull request, #1180: URL: https://github.com/apache/datafusion-python/pull/1180 Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.45.0 to 1.46.1. Release notes Sourced from tokio's releases (https://github.com/tokio-rs/tokio/releases). Tokio

Re: [PR] Perf: fast CursorValues compare for StringViewArray using inline_key_… [datafusion]

2025-07-05 Thread via GitHub
alamb commented on code in PR #16630: URL: https://github.com/apache/datafusion/pull/16630#discussion_r2187647772 ## datafusion/physical-plan/src/sorts/cursor.rs: ## @@ -288,6 +288,64 @@ impl CursorArray for StringViewArray { } } +/// Todo use arrow-rs side api after: <

Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-05 Thread via GitHub
alamb commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3039820680 Thanks -- I am going to spend an hour or so taking a pass through this blog trying to get the formatting to work out. So exciting

Re: [I] Add an example of embedding indexes *inside* a parquet file [datafusion]

2025-07-05 Thread via GitHub
JigaoLuo commented on issue #16374: URL: https://github.com/apache/datafusion/issues/16374#issuecomment-3039853245 > > Hi [@zhuqi-lucas](https://github.com/zhuqi-lucas), > > While proofreading the blog, I had one major general question: **What are the limitations of such an embedded index

[PR] DRAFT: Update arrow/parquet to 56.0.0 [datafusion]

2025-07-05 Thread via GitHub
alamb opened a new pull request, #16690: URL: https://github.com/apache/datafusion/pull/16690 ## Which issue does this PR close? - Related to https://github.com/apache/arrow-rs/issues/7395 ## Rationale for this change There are several non trivial changes in arrow

Re: [PR] Add the missing equivalence info for filter pushdown [datafusion]

2025-07-05 Thread via GitHub
adriangb commented on code in PR #16686: URL: https://github.com/apache/datafusion/pull/16686#discussion_r2187232804 ## datafusion/datasource/src/source.rs: ## @@ -325,6 +328,9 @@ impl ExecutionPlan for DataSourceExec { new_node.data_source = data_source;

Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-05 Thread via GitHub
alamb commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3040245413 I just pushed a commit that reworked the intro a bit and started filling out the background https://github.com/user-attachments/assets/efe36816-7fed-44d7-9158-1b2fc19ffb19

Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-05 Thread via GitHub
alamb commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3040245879 I need to run now to attend to some family matters. I'll be back tomorrow

[PR] clean up unused templates [datafusion-site]

2025-07-05 Thread via GitHub
kevinjqliu opened a new pull request, #81: URL: https://github.com/apache/datafusion-site/pull/81 I was trying to make changes to the templates and noticed that there are some unused ones. This is confusing so I'm cleaning it up. - Removed `frontpage.html`, this is unused and referenc

[PR] add infrastructure-actions to gitignore [datafusion-site]

2025-07-05 Thread via GitHub
kevinjqliu opened a new pull request, #82: URL: https://github.com/apache/datafusion-site/pull/82 Local setup clones the `apache/infrastructure-actions` repo into the `infrastructure-actions` local directory. See https://github.com/apache/datafusion-site?tab=readme-ov-file#setup-for-doc

[PR] Add Snowflake COPY/REVOKE CURRENT GRANTS option [datafusion-sqlparser-rs]

2025-07-05 Thread via GitHub
yoavcloud opened a new pull request, #1926: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1926 (no comment)

[I] Add support for registering files in the Arrow IPC stream format as tables using `register_arrow` or similar [datafusion]

2025-07-05 Thread via GitHub
corasaurus-hex opened a new issue, #16688: URL: https://github.com/apache/datafusion/issues/16688 ### Is your feature request related to a problem or challenge? Datafusion currently supports [registering files in the Arrow IPC file format as tables](https://gist.github.com/corasaurus

Re: [I] Release DataFusion `48.0.1` [datafusion]

2025-07-05 Thread via GitHub
alamb commented on issue #16486: URL: https://github.com/apache/datafusion/issues/16486#issuecomment-3038693992 We have begun voting: https://lists.apache.org/thread/xobyfmpcvtcclcbgo0wz6c74b35mxcdx

Re: [PR] Add the missing equivalence info for filter pushdown [datafusion]

2025-07-05 Thread via GitHub
alamb commented on PR #16686: URL: https://github.com/apache/datafusion/pull/16686#issuecomment-3038695163 Thank you @liamzwbao @xudong963 or @adriangb do you have time to review this PR?

Re: [PR] fix: create file for empty stream [datafusion]

2025-07-05 Thread via GitHub
brunal commented on code in PR #16342: URL: https://github.com/apache/datafusion/pull/16342#discussion_r2187061361 ## datafusion/datasource/src/file_sink_config.rs: ## @@ -77,13 +79,34 @@ pub trait FileSink: DataSink { .runtime_env() .object_store(&conf

[PR] DuckDB, Postgres, SQLite: NOT NULL and NOTNULL expressions [datafusion-sqlparser-rs]

2025-07-05 Thread via GitHub
ryanschneider opened a new pull request, #1927: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1927 Fixes #1920 by adding `NOT NULL` expression support to DuckDB and SQLite, and `NOTNULL` support to DuckDB, SQLite, and Postgres. Since this is my first non-trivial PR pleas

Re: [PR] Draft : Accelerating Query Processing in DataFusion with Embedded Parquet Indexes [datafusion-site]

2025-07-05 Thread via GitHub
zhuqi-lucas commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3040695911 > I just pushed a commit that reworked the intro a bit and started filling out the background > > https://private-user-images.githubusercontent.com/490673/462836291-efe36816

Re: [I] Add an example of embedding indexes *inside* a parquet file [datafusion]

2025-07-05 Thread via GitHub
zhuqi-lucas commented on issue #16374: URL: https://github.com/apache/datafusion/issues/16374#issuecomment-3040702910 Thank you @alamb @JigaoLuo @adriangb, I agree the current example is the start; we can further add more advanced examples!

Re: [PR] chore: refactor `BuildProbeJoinMetrics` to use `BaselineMetrics` [datafusion]

2025-07-05 Thread via GitHub
2010YOUY01 commented on code in PR #16500: URL: https://github.com/apache/datafusion/pull/16500#discussion_r2187952197 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -825,7 +825,8 @@ impl NestedLoopJoinStream { handle_state!(ready!(self.fet

Re: [PR] chore: refactor `BuildProbeJoinMetrics` to use `BaselineMetrics` [datafusion]

2025-07-05 Thread via GitHub
2010YOUY01 commented on PR #16500: URL: https://github.com/apache/datafusion/pull/16500#issuecomment-3040715483 > Actually, I see the same behavior on latest main. Looks like the output_rows metric in nested loop join is currently wrong? Thank you for the catch! Here might also need `

Re: [PR] chore: refactor `BuildProbeJoinMetrics` to use `BaselineMetrics` [datafusion]

2025-07-05 Thread via GitHub
Samyak2 commented on PR #16500: URL: https://github.com/apache/datafusion/pull/16500#issuecomment-3040935741 > > Actually, I see the same behavior on latest main. Looks like the output_rows metric in nested loop join is currently wrong? > > Thank you for the catch! Here might also nee

Re: [PR] Make `GenericDialect` support trailing commas in projections [datafusion-sqlparser-rs]

2025-07-05 Thread via GitHub
iffyio merged PR #1921: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1921

Re: [PR] Add support for several Snowflake grant statements [datafusion-sqlparser-rs]

2025-07-05 Thread via GitHub
iffyio merged PR #1922: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1922

Re: [PR] DuckDB, Postgres, SQLite: NOT NULL and NOTNULL expressions [datafusion-sqlparser-rs]

2025-07-05 Thread via GitHub
ryanschneider commented on code in PR #1927: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1927#discussion_r2187875896 ## src/parser/mod.rs: ## @@ -3571,6 +3572,11 @@ impl<'a> Parser<'a> { ), regexp,

Re: [PR] DuckDB, Postgres, SQLite: NOT NULL and NOTNULL expressions [datafusion-sqlparser-rs]

2025-07-05 Thread via GitHub
ryanschneider commented on code in PR #1927: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1927#discussion_r2187875712 ## src/dialect/duckdb.rs: ## @@ -94,4 +94,12 @@ impl Dialect for DuckDbDialect { fn supports_order_by_all(&self) -> bool { true