Re: [I] Support file row index / row id for each file in a `ListingTableProvider` [datafusion]

2025-05-01 Thread via GitHub


alamb commented on issue #15892:
URL: https://github.com/apache/datafusion/issues/15892#issuecomment-2845830552

   Related discussion: 
   - https://github.com/apache/datafusion/issues/15173


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Support metadata columns (`location`, `size`, `last_modified`) in `ListingTableProvider` [datafusion]

2025-05-01 Thread via GitHub


phillipleblanc commented on issue #15173:
URL: https://github.com/apache/datafusion/issues/15173#issuecomment-2846043563

   Yeah that makes sense. Part of the complexity here is that several of the 
features needed to make the ListingTableProvider work today (i.e. partition 
columns) are actually implemented in the core FileStream. And in order to add 
the metadata columns I want, I need to pipe through the ObjectMeta in the 
FileStream as well.
   
   So it would end up being quite a lot of code duplication to fork out of the 
core repo to get this working as is. I do think there is opportunity for some 
better extensibility/abstractions that would allow people to implement 
something like this on their own - but I'm not sure what that is yet.
   
   I think a good first step would be figuring out how to abstract out the 
partition columns so that the existing ListingTable can work without any 
special code in core.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] [datafusion-spark] Example of using Spark compatible function library [datafusion]

2025-05-01 Thread via GitHub


alamb commented on issue #15915:
URL: https://github.com/apache/datafusion/issues/15915#issuecomment-2846046740

   I think this is a good first issue as there is a clear request of what is 
desired and examples to follow


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] [datafusion-spark] Test integrating datafusion-spark code into comet [datafusion-comet]

2025-05-01 Thread via GitHub


alamb opened a new issue, #1704:
URL: https://github.com/apache/datafusion-comet/issues/1704

   ### What is the problem the feature request solves?
   
   - Part of https://github.com/apache/datafusion/issues/15914
   
   @shehabgamin added the `datafusion-spark` crate in 
https://github.com/apache/datafusion/pull/15168
   
   The goal is to help centralize the development of this function library in 
the core repository rather than duplicated effort
   
   To verify this is feasible I would like to verify that the setup created in 
`datafusion-spark`  / https://github.com/apache/datafusion/pull/15168 can be 
used by downstram crates like comet before we go too far
   
   ### Describe the potential solution
   
   I would like someone to make a PR that shows we can remove one or more of 
the spark function in datafusion-spark can be used in comet (and reduce the 
code here)
   
   ### Additional context
   
   cc @comphead @andygrove @mbutrovich 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Add memory profiling / logging [datafusion-comet]

2025-05-01 Thread via GitHub


alamb commented on issue #1701:
URL: 
https://github.com/apache/datafusion-comet/issues/1701#issuecomment-2846047457

   It would be amazing to have memory monitoring of native code in datafusion 
too -- it is an important feature that is currently hard for downstream crates


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: Add `datafusion-spark` crate [datafusion]

2025-05-01 Thread via GitHub


xudong963 commented on PR #15168:
URL: https://github.com/apache/datafusion/pull/15168#issuecomment-2846060139

   Fyi, the main CI has failed since the PR


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Fix typo in introduction.md [datafusion]

2025-05-01 Thread via GitHub


xudong963 merged PR #15910:
URL: https://github.com/apache/datafusion/pull/15910


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: regenerate builtin functions coverage [datafusion-comet]

2025-05-01 Thread via GitHub


comphead closed pull request #1698: chore: regenerate builtin functions coverage
URL: https://github.com/apache/datafusion-comet/pull/1698


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] [datafusion-spark] Implement `ceil` function [datafusion]

2025-05-01 Thread via GitHub


alamb opened a new issue, #15916:
URL: https://github.com/apache/datafusion/issues/15916

   ### Is your feature request related to a problem or challenge?
   
   Part of https://github.com/apache/datafusion/issues/15914
   
   Given the importantance of spark functions in general, we are consolidating 
/ adding spark compatible functions in the `datafusion-spark` crate in 
DataFusion: https://github.com/apache/datafusion/tree/main/datafusion/spark
   
   This ticket is part of a series of ticket to fill out the set of spark 
compatible functions in DataFusion
   
   ### Describe the solution you'd like
   
   Implement the function listed above in the `datafusion-spark` crate
   
   ### Describe alternatives you've considered
   
   1. Add the relevant function here: 
https://github.com/apache/datafusion/tree/main/datafusion/spark/src/function/math
   2. Add tests. Instructions are here: 
https://github.com/apache/datafusion/blob/main/datafusion/spark/README.md
   
   ### Examples to follow: 
   * The existing `expm1` implementation : 
https://github.com/apache/datafusion/blob/main/datafusion/spark/src/function/math/expm1.rs
   * COmet implementation here: 
https://github.com/apache/datafusion-comet/blob/main/native/spark-expr/src/math_funcs
   * Sail implementation here: 
https://github.com/lakehq/sail/blob/4144d9a53d3f1ef447b3d2ce9afa9030906de640/crates/sail-plan/src/function/scalar/math.rs#L404
   
   You can also find the corresponding datafusion implementations in the 
corresponding file here: 
https://github.com/apache/datafusion/tree/main/datafusion/functions/src (it 
would be ok to call into these implementations if they are the same as the 
spark implementation)
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] [datafusion-spark] Implement `ceil` function [datafusion]

2025-05-01 Thread via GitHub


alamb commented on issue #15916:
URL: https://github.com/apache/datafusion/issues/15916#issuecomment-2846091416

   @shehabgamin and @andygrove  -- here is a ticket for another spark function. 
I am hoping we can do one or two of these functions to set a pattern, and then 
we will be able to basically copy the pattern for the rest of the library
   
   Questions for you:
   1. Is there any other function(s) that you would recommend as one of the 
first to try and consolidate?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] [experiment] Generate event log in Chrome tracing format [datafusion-comet]

2025-05-01 Thread via GitHub


andygrove opened a new pull request, #1706:
URL: https://github.com/apache/datafusion-comet/pull/1706

   ## Which issue does this PR close?
   
   
   
   Part of https://github.com/apache/datafusion-comet/issues/1705
   
   ## Rationale for this change
   
   
   
   This is a quick POC of generating an event log in Chrome tracing format so 
that we can use existing tools to visualize native execution flow.
   
   
![2025-05-01_19-46](https://github.com/user-attachments/assets/d25c5f68-c74b-4443-b1e9-f6650540f71f)
   
   ## What changes are included in this PR?
   
   
   
   ## How are these changes tested?
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Add support for event tracing for visualizing where time is spent during execution [datafusion-comet]

2025-05-01 Thread via GitHub


andygrove commented on issue #1705:
URL: 
https://github.com/apache/datafusion-comet/issues/1705#issuecomment-2846126081

   @alamb This may also be interesting to explore for DataFusion


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] support simple/cross lateral joins [datafusion]

2025-05-01 Thread via GitHub


github-actions[bot] closed pull request #14595: support simple/cross lateral 
joins
URL: https://github.com/apache/datafusion/pull/14595


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: implement contextualized ObjectStore [datafusion]

2025-05-01 Thread via GitHub


github-actions[bot] closed pull request #14805: feat: implement contextualized 
ObjectStore
URL: https://github.com/apache/datafusion/pull/14805


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Reuse Rows allocation in SortPreservingMergeStream / `RowCursorStream` [datafusion]

2025-05-01 Thread via GitHub


acking-you commented on issue #15720:
URL: https://github.com/apache/datafusion/issues/15720#issuecomment-2844531606

   I think I need to submit two PRs to complete this issue:  
   1. Add benchmark code for SortPreservingMergeStream  
   2. Implement the reuse of Rows


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] chore(deps): bump assert_cmd from 2.0.16 to 2.0.17 [datafusion]

2025-05-01 Thread via GitHub


dependabot[bot] opened a new pull request, #15909:
URL: https://github.com/apache/datafusion/pull/15909

   Bumps [assert_cmd](https://github.com/assert-rs/assert_cmd) from 2.0.16 to 
2.0.17.
   
   Changelog
   Sourced from https://github.com/assert-rs/assert_cmd/blob/master/CHANGELOG.md";>assert_cmd's
 changelog.
   
   [2.0.17] - 2025-04-16
   Features
   
   Add cargo::cargo_bin! which will work with Cargo's 
build-dir
   
   
   
   
   Commits
   
   https://github.com/assert-rs/assert_cmd/commit/c5c7f0f63bc70a412be41f721e72c0a465c89541";>c5c7f0f
 chore: Release assert_cmd version 2.0.17
   https://github.com/assert-rs/assert_cmd/commit/569854e8a63eafeda86374bf58fbec9751605421";>569854e
 docs: Update changelog
   https://github.com/assert-rs/assert_cmd/commit/26cddcf53bdc74743fdaf43dbc2b8d90475088e0";>26cddcf
 Merge pull request https://redirect.github.com/assert-rs/assert_cmd/issues/234";>#234 
from epage/macro
   https://github.com/assert-rs/assert_cmd/commit/abe77aa131d775a7ba1f8e0e44b0ef8f5805ecb4";>abe77aa
 feat(cargo): Add cargo_bin!
   https://github.com/assert-rs/assert_cmd/commit/148c026b1966bc2a398ad3466b10dd041d424539";>148c026
 refactor: Move crate_name out into a file
   https://github.com/assert-rs/assert_cmd/commit/a096861a8e3eb31c35d9b3ce888e072d5e39f7b7";>a096861
 Merge pull request https://redirect.github.com/assert-rs/assert_cmd/issues/233";>#233 
from xixishidibei/master
   https://github.com/assert-rs/assert_cmd/commit/9b88b5a1811f7ee5e1a606a2b86ac14b95a6be98";>9b88b5a
 chore:fix cargo clippy warning
   https://github.com/assert-rs/assert_cmd/commit/1bc3b1240fb201684e4a4b687aaea18867acf96d";>1bc3b12
 chore(deps): Update Rust Stable to v1.86 (https://redirect.github.com/assert-rs/assert_cmd/issues/232";>#232)
   https://github.com/assert-rs/assert_cmd/commit/b177879ea58482558d7416773acc1ed53e6b7002";>b177879
 chore(deps): Update Rust crate automod to v1.0.15 (https://redirect.github.com/assert-rs/assert_cmd/issues/231";>#231)
   https://github.com/assert-rs/assert_cmd/commit/3d914b52dcf02529daeab971c467ca7424984964";>3d914b5
 chore(deps): Update Rust Stable to v1.85 (https://redirect.github.com/assert-rs/assert_cmd/issues/229";>#229)
   Additional commits viewable in https://github.com/assert-rs/assert_cmd/compare/v2.0.16...v2.0.17";>compare
 view
   
   
   
   
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=assert_cmd&package-manager=cargo&previous-version=2.0.16&new-version=2.0.17)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot show  ignore conditions` will show all of 
the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Implement Parquet filter pushdown via new filter pushdown APIs [datafusion]

2025-05-01 Thread via GitHub


alamb commented on code in PR #15769:
URL: https://github.com/apache/datafusion/pull/15769#discussion_r2070243952


##
datafusion/core/src/datasource/listing/table.rs:
##
@@ -982,18 +980,6 @@ impl TableProvider for ListingTable {
 return Ok(TableProviderFilterPushDown::Exact);
 }
 
-// if we can't push it down completely with only the 
filename-based/path-based

Review Comment:
   This change makes sense to me -- when @itsjunetime originally implemented 
this code, there was some complexity because there was no way to do filter 
pushdown in ExecutionPlans so in my mind this approach was a (clever) workaround
   
   The comments even hint that this is a parquet specific special case
   
   I think the new pattern of handling predicates more generally in this PR is 
cleaner and will support more cases. Since this code is only currently executed 
   
   Perhaps  @cisaacson has some other thoughts



##
datafusion/datasource-parquet/src/mod.rs:
##
@@ -244,7 +242,7 @@ impl ParquetExecBuilder {
 inner: DataSourceExec::new(Arc::new(base_config.clone())),
 base_config,
 predicate,
-pruning_predicate: parquet.pruning_predicate,
+pruning_predicate: None, // for backwards compat since 
`ParquetExec` is only for backwards compat anyway

Review Comment:
   yeah this is fine too in my opinion. It is almost time to remove ParquetExec 
anyways -- maybe we should just do it in this release 🤔 



##
datafusion/core/src/datasource/listing/table.rs:
##
@@ -982,18 +980,6 @@ impl TableProvider for ListingTable {
 return Ok(TableProviderFilterPushDown::Exact);
 }
 
-// if we can't push it down completely with only the 
filename-based/path-based
-// column names, then we should check if we can do parquet 
predicate pushdown
-let supports_pushdown = 
self.options.format.supports_filters_pushdown(
-&self.file_schema,
-&self.table_schema,
-&[filter],
-)?;
-
-if supports_pushdown == FilePushdownSupport::Supported {
-return Ok(TableProviderFilterPushDown::Exact);
-}

Review Comment:
   > We can justify implementing other TableProviders for Parquet, but still I 
cannot understand why we need to degrade the capabilities of our ListingTable. 
Is't it always better pruning/simplifying things at the higher levels as 
possible?
   
   I don't think this degrades the capabilities of the current listing table. I 
think the only implications are for anyone who used a custom `FileFormat` and 
impleented `supports_filters_pushdown` -- I suspect this is not very common and 
we can likely avoid consternation by mentioning it in the upgrade guide (see 
comment below)



##
datafusion/datasource-parquet/src/source.rs:
##
@@ -559,25 +549,8 @@ impl FileSource for ParquetSource {
 .predicate()
 .map(|p| format!(", predicate={p}"))
 .unwrap_or_default();
-let pruning_predicate_string = self
-.pruning_predicate
-.as_ref()
-.map(|pre| {
-let mut guarantees = pre
-.literal_guarantees()
-.iter()
-.map(|item| format!("{}", item))
-.collect_vec();
-guarantees.sort();
-format!(
-", pruning_predicate={}, required_guarantees=[{}]",
-pre.predicate_expr(),
-guarantees.join(", ")
-)
-})
-.unwrap_or_default();

Review Comment:
   I think it is important to keep these in the physical plans -- in particular 
what I think is important is to be able to check via the explain plan if 
pruning is happening by looking at the explain plan



##
datafusion/datasource/src/file_format.rs:
##
@@ -109,37 +108,10 @@ pub trait FileFormat: Send + Sync + fmt::Debug {
 not_impl_err!("Writer not implemented for this format")
 }
 
-/// Check if the specified file format has support for pushing down the 
provided filters within
-/// the given schemas. Added initially to support the Parquet file 
format's ability to do this.
-fn supports_filters_pushdown(

Review Comment:
   yes I agree 
   
   Since 
[`FileFormat`](https://docs.rs/datafusion/latest/datafusion/datasource/file_format/trait.FileFormat.html)
 is a pub trait, this is technically a breaking API change, but I do think it 
was a parquet specific optimization
   
   I recommend we mark this PR as an API change and add a note to the upgrade 
guide 
h

Re: [PR] feat: decode() expression when using 'utf-8' encoding [datafusion-comet]

2025-05-01 Thread via GitHub


mbutrovich commented on PR #1697:
URL: 
https://github.com/apache/datafusion-comet/pull/1697#issuecomment-2844679657

   Linux is failing in CometFuzzTestSuite due to failing to cast a Dictionary 
to utf-8. RNG there must be generating different values. I'll look into that 
failure.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Reuse Rows allocation in SortPreservingMergeStream / `RowCursorStream` [datafusion]

2025-05-01 Thread via GitHub


acking-you commented on issue #15720:
URL: https://github.com/apache/datafusion/issues/15720#issuecomment-2844672108

   > # Overall Implementation
   > Adjust `RowCursorStream` to become the owner of `Rows` with continuous 
reuse, requiring each partition to maintain two `Rows` instances (subsequent 
sorting algorithms need to maintain `prev` and `cur` states for comparison). 
The key challenge lies in passing a reference to `Rows` via `poll_next` to 
`SortPreservingMergeStream`, which is difficult to achieve under Rust's 
lifetime annotation constraints.
   > 
   > ## Possible Implementation Approaches:
   > 1. ​**Lifetime Annotations**: Attempt to pass a reference to `Rows` using 
Rust's lifetime annotations (tested and found nearly infeasible).
   > 2. ​**Box Abstraction**: Wrap `Rows` in a `Box`, returning a `const*` to 
`Rows` during `poll_next`. This pointer must be re-encapsulated into a 
structure resembling `Rows`. This approach is currently under exploration. 
Since the `SortPreservingMergeStream` invokes `RowCursorStream::poll_next` 
synchronously, there are no concurrency safety concerns (yet to be tested).
   > 
   > ## Optimization Opportunities:
   > When the `ORDER BY` clause contains only a single column, the current 
implementation of `SortPreservingMergeExec` still copies and generates `Rows` 
structures for comparison. Consider abstracting a unified structure to handle 
this scenario while fulfilling the requirements.
   
   After reviewing the specific code, it was found that optimization for a 
single column is not possible because SortPreservingMergeExec will only choose 
to create a 
[SortPreservingMergeStream](https://docs.rs/datafusion-physical-plan/47.0.0/src/datafusion_physical_plan/sorts/sort_preserving_merge.rs.html#291-337)
 when sorting involves more than one column.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: Add memory profiling [datafusion-comet]

2025-05-01 Thread via GitHub


andygrove commented on code in PR #1702:
URL: https://github.com/apache/datafusion-comet/pull/1702#discussion_r2070469650


##
native/core/src/execution/jni_api.rs:
##
@@ -359,6 +365,21 @@ pub unsafe extern "system" fn 
Java_org_apache_comet_Native_executePlan(
 // Retrieve the query
 let exec_context = get_execution_context(exec_context);
 
+// memory profiling is only available on linux
+if exec_context.memory_profiling_enabled {
+#[cfg(target_os = "linux")]

Review Comment:
   I did that originally, but the compiler complained about the unused variable 
`memory_profiling_enabled` on non-Linux platforms.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: decode() expression when using 'utf-8' encoding [datafusion-comet]

2025-05-01 Thread via GitHub


mbutrovich commented on PR #1697:
URL: 
https://github.com/apache/datafusion-comet/pull/1697#issuecomment-2845174358

   It's interesting to me that `native_comet` (CometScan) fails, but 
`native_datafusion` (DataSourceExec) succeeds. I wonder if DataSourceExec is 
already unpacking dictionaries on read.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: Add memory profiling [datafusion-comet]

2025-05-01 Thread via GitHub


kazuyukitanimura commented on code in PR #1702:
URL: https://github.com/apache/datafusion-comet/pull/1702#discussion_r2070493011


##
spark/src/main/scala/org/apache/comet/CometExecIterator.scala:
##
@@ -130,6 +134,21 @@ class CometExecIterator(
   def getNextBatch(): Option[ColumnarBatch] = {
 assert(partitionIndex >= 0 && partitionIndex < numParts)
 
+if (memoryProfilingEnabled) {
+  val memoryMXBean = ManagementFactory.getMemoryMXBean
+  val heap = memoryMXBean.getHeapMemoryUsage
+  val nonHeap = memoryMXBean.getNonHeapMemoryUsage
+
+  def mb(n: Long) = n / 1024 / 1024
+
+  // scalastyle:off println
+  println(
+"JVM_MEMORY: { " +
+  s"heapUsed: ${mb(heap.getUsed)}, heapCommitted: 
${mb(heap.getCommitted)}, " +
+  s"nonHeapUsed: ${mb(nonHeap.getUsed)}, nonHeapCommitted: 
${mb(nonHeap.getCommitted)} " +
+  "}")

Review Comment:
   Perhaps `logInfo` ?
   If println is preferred, we may have to add back `// scalastyle:on println`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Added SQL Example for `Aggregate Functions` [datafusion]

2025-05-01 Thread via GitHub


Adez017 commented on PR #15778:
URL: https://github.com/apache/datafusion/pull/15778#issuecomment-2844743481

   > This PR appears to hve no changes https://private-user-images.githubusercontent.com/490673/438510565-a8b12e9d-739c-4b18-9df2-69d64572667e.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NDYxMDI1NDYsIm5iZiI6MTc0NjEwMjI0NiwicGF0aCI6Ii80OTA2NzMvNDM4NTEwNTY1LWE4YjEyZTlkLTczOWMtNGIxOC05ZGYyLTY5ZDY0NTcyNjY3ZS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwNTAxJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDUwMVQxMjI0MDZaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT01ZjhhZTE3MjYxY2RkZTA1N2E1MzEwYjhlMGJiNjhjODQzOGYwMzIzYzM5NDEyY2UyMDlhYzZjOGExZGY0Mjk2JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.ggsQD5ALD3chY-BGKpWKjAKnkyvEIged_anosxg16G0";>
   > 
   > I may be missing something 🤔
   
   i actually closed this PR accidently and afterwards i reopened it 😁


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] refactor: replace `unwrap_or` with `unwrap_or_else` for improved lazy… [datafusion]

2025-05-01 Thread via GitHub


NevroHelios commented on code in PR #15841:
URL: https://github.com/apache/datafusion/pull/15841#discussion_r2069898066


##
datafusion/common/src/column.rs:
##
@@ -130,8 +130,8 @@ impl Column {
 /// where `"foo.BAR"` would be parsed to a reference to column named 
`foo.BAR`
 pub fn from_qualified_name(flat_name: impl Into) -> Self {
 let flat_name = flat_name.into();
-Self::from_idents(parse_identifiers_normalized(&flat_name, 
false)).unwrap_or(
-Self {
+Self::from_idents(parse_identifiers_normalized(&flat_name, 
false)).unwrap_or_else(

Review Comment:
   thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: Add memory profiling [datafusion-comet]

2025-05-01 Thread via GitHub


codecov-commenter commented on PR #1702:
URL: 
https://github.com/apache/datafusion-comet/pull/1702#issuecomment-2845028742

   ## 
[Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1702?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   Attention: Patch coverage is `42.85714%` with `8 lines` in your changes 
missing coverage. Please review.
   > Project coverage is 57.31%. Comparing base 
[(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 to head 
[(`2c17a71`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/2c17a7129ea62918b8fc9d82ab3a36fc9be64993?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   > Report is 174 commits behind head on main.
   
   | [Files with missing 
lines](https://app.codecov.io/gh/apache/datafusion-comet/pull/1702?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 | Patch % | Lines |
   |---|---|---|
   | 
[...ain/scala/org/apache/comet/CometExecIterator.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/1702?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2FCometExecIterator.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9Db21ldEV4ZWNJdGVyYXRvci5zY2FsYQ==)
 | 27.27% | [7 Missing and 1 partial :warning: 
](https://app.codecov.io/gh/apache/datafusion-comet/pull/1702?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 |
   
   Additional details and impacted files
   
   
   ```diff
   @@ Coverage Diff  @@
   ##   main#1702  +/-   ##
   
   + Coverage 56.12%   57.31%   +1.18% 
   - Complexity  976 1073  +97 
   
 Files   119  129  +10 
 Lines 1174312496 +753 
 Branches   2251 2341  +90 
   
   + Hits   6591 7162 +571 
   - Misses 4012 4153 +141 
   - Partials   1140 1181  +41 
   ```
   
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1702?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
:rocket: New features to boost your workflow: 
   
   - :snowflake: [Test 
Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, 
report on failures, and find test suite problems.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: More warning info for users [datafusion-comet]

2025-05-01 Thread via GitHub


andygrove merged PR #1667:
URL: https://github.com/apache/datafusion-comet/pull/1667


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: regexp_replace() expression with no starting offset [datafusion-comet]

2025-05-01 Thread via GitHub


andygrove commented on code in PR #1700:
URL: https://github.com/apache/datafusion-comet/pull/1700#discussion_r2070408978


##
spark/src/test/scala/org/apache/comet/CometFuzzTestSuite.scala:
##
@@ -188,6 +188,22 @@ class CometFuzzTestSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
 }
   }
 
+  test("regexp_replace") {
+withSQLConf(CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key -> "true") {
+  val df = spark.read.parquet(filename)
+  df.createOrReplaceTempView("t1")
+  // We want to make sure that the schema generator wasn't modified to 
accidentally omit
+  // StringType, since then this test would not run any queries and 
silently pass.
+  var testedString = false
+  for (field <- df.schema.fields if field.dataType == StringType) {
+testedString = true
+val sql = s"SELECT regexp_replace(${field.name}, 'a', 'b') FROM t1"

Review Comment:
   I'm not sure how likely it is that the input strings will contain the 
character `a`, so it is possible that this test isn't really testing anything. 
I modified the test locally to add:
   
   ```
   val sql = s"SELECT count(*) FROM t1 WHERE ${field.name} LIKE '%a%'"
   spark.sql(sql).show()
   ```
   
   this produced
   
   ```
   ++
   |count(1)|
   ++
   |   0|
   ++
   ```
   
   This is a general problem with just comparing results between Spark and 
Comet and it would be nice to have a way to also ensure that we see expected 
results.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Implement Parquet filter pushdown via new filter pushdown APIs [datafusion]

2025-05-01 Thread via GitHub


adriangb commented on code in PR #15769:
URL: https://github.com/apache/datafusion/pull/15769#discussion_r2052746932


##
datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt:
##
@@ -81,11 +81,15 @@ EXPLAIN select a from t_pushdown where b > 2 ORDER BY a;
 
 logical_plan
 01)Sort: t_pushdown.a ASC NULLS LAST
-02)--TableScan: t_pushdown projection=[a], full_filters=[t_pushdown.b > 
Int32(2)]
+02)--Projection: t_pushdown.a
+03)Filter: t_pushdown.b > Int32(2)
+04)--TableScan: t_pushdown projection=[a, b], 
partial_filters=[t_pushdown.b > Int32(2)]
 physical_plan
 01)SortPreservingMergeExec: [a@0 ASC NULLS LAST]
 02)--SortExec: expr=[a@0 ASC NULLS LAST], preserve_partitioning=[true]
-03)DataSourceExec: file_groups={2 groups: 
[[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet],
 
[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/2.parquet]]},
 projection=[a], file_type=parquet, predicate=b@1 > 2, 
pruning_predicate=b_null_count@1 != row_count@2 AND b_max@0 > 2, 
required_guarantees=[]
+03)CoalesceBatchesExec: target_batch_size=8192
+04)--RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=2
+05)DataSourceExec: file_groups={2 groups: 
[[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet],
 
[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/2.parquet]]},
 projection=[a], file_type=parquet, predicate=b@1 > 2 AND b@1 > 2

Review Comment:
   > PushdownFilter should probably work before distribution&order satisfiers
   
   That makes sense to me. It does more "invasive" re-arranging of plans than 
those do.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] refactor filter pushdown apis [datafusion]

2025-05-01 Thread via GitHub


adriangb commented on code in PR #15801:
URL: https://github.com/apache/datafusion/pull/15801#discussion_r2070247089


##
datafusion/physical-plan/src/execution_plan.rs:
##
@@ -471,39 +471,53 @@ pub trait ExecutionPlan: Debug + DisplayAs + Send + Sync {
 Ok(None)
 }
 
-/// Attempts to recursively push given filters from the top of the tree 
into leafs.
-///
-/// This is used for various optimizations, such as:
-///
-/// * Pushing down filters into scans in general to minimize the amount of 
data that needs to be materialzied.
-/// * Pushing down dynamic filters from operators like TopK and Joins into 
scans.
-///
-/// Generally the further down (closer to leaf nodes) that filters can be 
pushed, the better.
-///
-/// Consider the case of a query such as `SELECT * FROM t WHERE a = 1 AND 
b = 2`.
-/// With no filter pushdown the scan needs to read and materialize all the 
data from `t` and then filter based on `a` and `b`.
-/// With filter pushdown into the scan it can first read only `a`, then 
`b` and keep track of
-/// which rows match the filter.
-/// Then only for rows that match the filter does it have to materialize 
the rest of the columns.
-///
-/// # Default Implementation
-///
-/// The default implementation assumes:
-/// * Parent filters can't be passed onto children.
-/// * This node has no filters to contribute.
-///
-/// # Implementation Notes
-///
-/// Most of the actual logic is implemented as a Physical Optimizer rule.
-/// See [`PushdownFilter`] for more details.
-///
-/// [`PushdownFilter`]: 
https://docs.rs/datafusion/latest/datafusion/physical_optimizer/filter_pushdown/struct.PushdownFilter.html
-fn try_pushdown_filters(

Review Comment:
   Agreed I'd prefer a single method as well - but as per 
https://github.com/apache/datafusion/pull/15770#discussion_r2052311257 we 
probably would have had to add a new method to the existing implementation 
anyway. I don't see a way to have a single method without making it absurdly 
convoluted just to avoid 2 methods. I'll also point out that there are multiple 
dimensions of filter pushdown for each node:
   - does the node allow parent filters to be pushed down to it's children?
   - does the node want to add any filters to be passed to it's children?
   - does the node need to handle the result of pushdown?
   Having two methods makes it easy to have canned implementations for each one 
of these independently and combine/compose them. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] refactor filter pushdown apis [datafusion]

2025-05-01 Thread via GitHub


adriangb commented on code in PR #15801:
URL: https://github.com/apache/datafusion/pull/15801#discussion_r2070247089


##
datafusion/physical-plan/src/execution_plan.rs:
##
@@ -471,39 +471,53 @@ pub trait ExecutionPlan: Debug + DisplayAs + Send + Sync {
 Ok(None)
 }
 
-/// Attempts to recursively push given filters from the top of the tree 
into leafs.
-///
-/// This is used for various optimizations, such as:
-///
-/// * Pushing down filters into scans in general to minimize the amount of 
data that needs to be materialzied.
-/// * Pushing down dynamic filters from operators like TopK and Joins into 
scans.
-///
-/// Generally the further down (closer to leaf nodes) that filters can be 
pushed, the better.
-///
-/// Consider the case of a query such as `SELECT * FROM t WHERE a = 1 AND 
b = 2`.
-/// With no filter pushdown the scan needs to read and materialize all the 
data from `t` and then filter based on `a` and `b`.
-/// With filter pushdown into the scan it can first read only `a`, then 
`b` and keep track of
-/// which rows match the filter.
-/// Then only for rows that match the filter does it have to materialize 
the rest of the columns.
-///
-/// # Default Implementation
-///
-/// The default implementation assumes:
-/// * Parent filters can't be passed onto children.
-/// * This node has no filters to contribute.
-///
-/// # Implementation Notes
-///
-/// Most of the actual logic is implemented as a Physical Optimizer rule.
-/// See [`PushdownFilter`] for more details.
-///
-/// [`PushdownFilter`]: 
https://docs.rs/datafusion/latest/datafusion/physical_optimizer/filter_pushdown/struct.PushdownFilter.html
-fn try_pushdown_filters(

Review Comment:
   Agreed I'd prefer a single method as well - but as per 
https://github.com/apache/datafusion/pull/15770#discussion_r2052311257 we 
probably would have had to add a new method to the existing implementation 
anyway. I don't see a way to have a single method without making it absurdly 
convoluted just to avoid 2 methods.
   
   I'll also point out that there are multiple dimensions of filter pushdown 
for each node:
   - does the node allow parent filters to be pushed down to it's children?
   - does the node want to add any filters to be passed to it's children?
   - does the node need to handle the result of pushdown?
   
   Having two methods makes it easy to have canned implementations for each one 
of these independently and combine/compose them.
   
   E.g.:
   - TopK or HashJoin will override `gather_filters_for_pushdown` but not 
`handle_child_pushdown_result`
   - FilterExec will override both
   - DataSourceExec will override only `handle_child_pushdown_result`
   - ProjectionExec will override only `gather_filters_for_pushdown`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: Add memory profiling [datafusion-comet]

2025-05-01 Thread via GitHub


andygrove commented on PR #1702:
URL: 
https://github.com/apache/datafusion-comet/pull/1702#issuecomment-2844943324

   @mbutrovich I'd like to get your feedback on this PR. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] refactor: replace `unwrap_or` with `unwrap_or_else` for improved lazy… [datafusion]

2025-05-01 Thread via GitHub


NevroHelios commented on PR #15841:
URL: https://github.com/apache/datafusion/pull/15841#issuecomment-2844942814

   I pushed the updates. Could you please run the ci again? @alamb 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Implement Parquet filter pushdown via new filter pushdown APIs [datafusion]

2025-05-01 Thread via GitHub


adriangb commented on code in PR #15769:
URL: https://github.com/apache/datafusion/pull/15769#discussion_r2070335984


##
datafusion/datasource-parquet/src/source.rs:
##
@@ -559,25 +549,8 @@ impl FileSource for ParquetSource {
 .predicate()
 .map(|p| format!(", predicate={p}"))
 .unwrap_or_default();
-let pruning_predicate_string = self
-.pruning_predicate
-.as_ref()
-.map(|pre| {
-let mut guarantees = pre
-.literal_guarantees()
-.iter()
-.map(|item| format!("{}", item))
-.collect_vec();
-guarantees.sort();
-format!(
-", pruning_predicate={}, required_guarantees=[{}]",
-pre.predicate_expr(),
-guarantees.join(", ")
-)
-})
-.unwrap_or_default();

Review Comment:
   Hmm okay. I'll see if I can make it happen...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Feat: support bit_count function [datafusion-comet]

2025-05-01 Thread via GitHub


kazantsev-maksim commented on PR #1602:
URL: 
https://github.com/apache/datafusion-comet/pull/1602#issuecomment-2844940964

   @mbutrovich I couldn't find a built-in implementation of bit_count in the 
DataFusion project, but i rewrote it using scalarFunc without adding a new 
proto expr.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add `CREATE TRIGGER` support for SQL Server [datafusion-sqlparser-rs]

2025-05-01 Thread via GitHub


aharpervc commented on code in PR #1810:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1810#discussion_r2070337010


##
src/dialect/mssql.rs:
##
@@ -215,6 +225,59 @@ impl MsSqlDialect {
 }))
 }
 
+/// Parse `CREATE TRIGGER` for [MsSql]
+///
+/// [MsSql]: 
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-trigger-transact-sql
+fn parse_create_trigger(
+&self,
+parser: &mut Parser,
+or_alter: bool,
+) -> Result {
+let name = parser.parse_object_name(false)?;
+parser.expect_keyword_is(Keyword::ON)?;
+let table_name = parser.parse_object_name(false)?;
+let period = parser.parse_trigger_period()?;
+let events = 
parser.parse_comma_separated(Parser::parse_trigger_event)?;
+
+parser.expect_keyword_is(Keyword::AS)?;
+
+let trigger_statements_body = if parser.peek_keyword(Keyword::BEGIN) {

Review Comment:
   Similar approach in the other PR for stored procedures that may/not have 
begin/end tokens 
https://github.com/apache/datafusion-sqlparser-rs/pull/1834/files#r2069458988



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] refactor filter pushdown apis [datafusion]

2025-05-01 Thread via GitHub


adriangb commented on code in PR #15801:
URL: https://github.com/apache/datafusion/pull/15801#discussion_r2070340727


##
datafusion/core/tests/physical_optimizer/push_down_filter.rs:
##
@@ -154,29 +153,25 @@ impl FileSource for TestSource {
 
 fn try_pushdown_filters(
 &self,
-mut fd: FilterDescription,
+filters: &[Arc],
 config: &ConfigOptions,
-) -> Result>> {
+) -> Result>> {
+let mut all_filters = 
filters.iter().map(Arc::clone).collect::>();

Review Comment:
   Wouldn't the same be true if it was just `Vec>`?
   
   I tried implementing a struct with some helper methods but the thing is that 
for `ExecutionPlan` we have a slightly different flow than `DataSource` and 
`FileSource` (which are their own traits and hence need their own method) so 
re-using the struct becomes a bit wonky.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: ORDER BY ALL [datafusion]

2025-05-01 Thread via GitHub


PokIsemaine commented on code in PR #15772:
URL: https://github.com/apache/datafusion/pull/15772#discussion_r2070204781


##
datafusion/expr/src/expr.rs:
##
@@ -701,6 +701,24 @@ impl TryCast {
 }
 }
 
+/// OrderBy Expressions
+pub enum OrderByExprs {
+OrderByExprVec(Vec),
+All {
+exprs: Vec,
+options: OrderByOptions,
+},
+}

Review Comment:
   Okay, I’ll try to move it this weekend



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore(deps): bump assert_cmd from 2.0.16 to 2.0.17 [datafusion]

2025-05-01 Thread via GitHub


xudong963 merged PR #15909:
URL: https://github.com/apache/datafusion/pull/15909


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: regexp_replace() expression with no starting offset [datafusion-comet]

2025-05-01 Thread via GitHub


mbutrovich commented on code in PR #1700:
URL: https://github.com/apache/datafusion-comet/pull/1700#discussion_r2070460938


##
spark/src/test/scala/org/apache/comet/CometFuzzTestSuite.scala:
##
@@ -188,6 +188,22 @@ class CometFuzzTestSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
 }
   }
 
+  test("regexp_replace") {
+withSQLConf(CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key -> "true") {
+  val df = spark.read.parquet(filename)
+  df.createOrReplaceTempView("t1")
+  // We want to make sure that the schema generator wasn't modified to 
accidentally omit
+  // StringType, since then this test would not run any queries and 
silently pass.
+  var testedString = false
+  for (field <- df.schema.fields if field.dataType == StringType) {
+testedString = true
+val sql = s"SELECT regexp_replace(${field.name}, 'a', 'b') FROM t1"

Review Comment:
   Makes sense. I find myself expanding the generator for #1697 to create 
dictionaries, so let me sync up with you for good solutions to expand the 
schemas and values in specific types.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] ci: require approving review [datafusion-python]

2025-05-01 Thread via GitHub


timsaucer merged PR #1122:
URL: https://github.com/apache/datafusion-python/pull/1122


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: Add memory profiling [datafusion-comet]

2025-05-01 Thread via GitHub


comphead commented on code in PR #1702:
URL: https://github.com/apache/datafusion-comet/pull/1702#discussion_r2070464759


##
native/core/src/execution/jni_api.rs:
##
@@ -359,6 +365,21 @@ pub unsafe extern "system" fn 
Java_org_apache_comet_Native_executePlan(
 // Retrieve the query
 let exec_context = get_execution_context(exec_context);
 
+// memory profiling is only available on linux
+if exec_context.memory_profiling_enabled {
+#[cfg(target_os = "linux")]

Review Comment:
   should the if statement be inside cfg? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Update extending-operators.md [datafusion]

2025-05-01 Thread via GitHub


xudong963 commented on PR #15832:
URL: https://github.com/apache/datafusion/pull/15832#issuecomment-2844855985

   > > > > You can rebase with main
   > > > 
   > > > 
   > > > doe this solve the issue ?
   > > 
   > > 
   > > You can open the failed CI and see what's wrong:
   > > ```
   > > error[E0599]: no method named `unwrap` found for struct 
`CoalescePartitionsExec` in the current scope
   > >--> 
datafusion/physical-optimizer/src/enforce_sorting/replace_with_order_preserving_variants.rs:210:14
   > > |
   > > 208 |   let coalesce = CoalescePartitionsExec::new(child)
   > > |  -
   > > 209 | | .with_fetch(plan.fetch())
   > > 210 | | .unwrap();
   > > | | -^^ method not found in `CoalescePartitionsExec`
   > > | |_|
   > > |
   > > 
   > > warning: unused import: `ExecutionPlan`
   > >   --> 
datafusion/physical-optimizer/src/enforce_sorting/replace_with_order_preserving_variants.rs:37:32
   > >|
   > > 37 | use datafusion_physical_plan::{ExecutionPlan, 
ExecutionPlanProperties};
   > >|^
   > >|
   > >= note: `#[warn(unused_imports)]` on by default
   > > ```
   > > 
   > > 
   > > 
   > >   
   > > 
   > > 
   > >   
   > > 
   > > 
   > > 
   > >   
   > > The error is fixed in main, so rebasing your branch with main will fix 
the error
   > 
   > Thank for your help @xudong963 but I think it didn't work for the failing 
workflow
   
   Sorry I am on vacation , you can try to fix by the error hints in ci


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Allow stored procedures to be defined without `BEGIN`/`END` [datafusion-sqlparser-rs]

2025-05-01 Thread via GitHub


aharpervc commented on code in PR #1834:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1834#discussion_r2070278308


##
src/ast/mod.rs:
##
@@ -3744,7 +3750,7 @@ pub enum Statement {
 or_alter: bool,
 name: ObjectName,
 params: Option>,
-body: Vec,
+body: BeginEndStatements,

Review Comment:
   It it an abuse of BeginEndStatements to potentially have the begin/end token 
be empty? This could alternately be an enum of BeginEndStatements and Sequence, 
similar to elsewhere



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Fix typo in introduction.md [datafusion]

2025-05-01 Thread via GitHub


tom-mont opened a new pull request, #15910:
URL: https://github.com/apache/datafusion/pull/15910

   - Fix typo in introduction.md
   - Remove period from end of bullet point to maintain consistency with other 
bullet points
   
   ## Which issue does this PR close?
   
   
   
   - No issue: simply noticed a typo when reading documentation
   
   ## Rationale for this change
   
   
   
   Maintain good spelling and consistency in formatting
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   No tests - just a change to .md file
   
   ## Are there any user-facing changes?
   
   
   
   
   
   Changes are user-facing but in a .md file


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] ci: require approving review [datafusion-python]

2025-05-01 Thread via GitHub


timsaucer opened a new pull request, #1122:
URL: https://github.com/apache/datafusion-python/pull/1122

   This is a change to the `.asf.yaml` file that requires an approving review 
to merge to `main`. I copied the configuration from our upstream `datafusion` 
repository.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] feat: metadata handling for aggregates and window functions [datafusion]

2025-05-01 Thread via GitHub


timsaucer opened a new pull request, #15911:
URL: https://github.com/apache/datafusion/pull/15911

   ## Which issue does this PR close?
   
   Closes https://github.com/apache/datafusion/issues/15902
   
   ## Rationale for this change
   
   This change is a follow on to 
https://github.com/apache/datafusion/pull/15646. With this change we can now 
handle metadata for both window and aggregate functions. It enables the use of 
extension types via this metadata handling.
   
   ## What changes are included in this PR?
   
   Instead of passing `DataType` to aggregates and windows we now pass `Field` 
in their arguments. `return_type` has been replaced with `return_field` so we 
can get metadata out of these functions as well.
   
   ## Are these changes tested?
   
   All existing unit tests pass. New unit tests are added.
   
   ## Are there any user-facing changes?
   
   Yes, the migration guide contains information on the updates that the user 
will need to make for their user defined functions.
   
   ## TODO before moving to ready to review
   
   - [ ] Add unit test that is a window of an aggregate function
   - [ ] Verify migration guide covers both aggregate and window functions
   - [ ] Find remaining places where the function name doesn't match the 
arguments (ie: `fn return_type()` that returns a `Field`)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] Wrong query results for filters that involve partition columns and data file columns [datafusion]

2025-05-01 Thread via GitHub


adriangb opened a new issue, #15912:
URL: https://github.com/apache/datafusion/issues/15912

   ### Describe the bug
   
   Filters such as `partition_col = col_from_file` are never applied if 
`datafusion.execution.parquet.pushdown_filters = true`
   
   ### To Reproduce
   
   With `datafusion-cli`:
   
   ```sql
   COPY  (
 SELECT arrow_cast('a', 'Utf8') AS val
   )  TO 'test_files/scratch/test/part=a/123.parquet'
   STORED AS PARQUET;
   COPY  (
 SELECT arrow_cast('b', 'Utf8') AS val
   )  TO 'test_files/scratch/test/part=b/123.parquet'
   STORED AS PARQUET;
   COPY  (
 SELECT arrow_cast('xyz', 'Utf8') AS val
   )  TO 'test_files/scratch/test/part=c/123.parquet'
   STORED AS PARQUET;
   
   set datafusion.execution.parquet.pushdown_filters = true;
   
   CREATE EXTERNAL TABLE test(part text, val text)
   STORED AS PARQUET
   PARTITIONED BY (part)
   LOCATION 'test_files/scratch/test/';
   
   SELECT * FROM test;
   
   explain analyze
   select * from test
   where part != val;
   ```
   
   ```
   > select * from test
   where part != val;
   +-+--+
   | val | part |
   +-+--+
   | a   | a|
   | xyz | c|
   | b   | b|
   +-+--+
   3 row(s) fetched. 
   ```
   
   Which is clearly wrong.
   
   ### Expected behavior
   
   ```
   > select * from test
   where part != val;
   +-+--+
   | val | part |
   +-+--+
   | xyz | c|
   +-+--+
   ```
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] feat: Add memory profiling [datafusion-comet]

2025-05-01 Thread via GitHub


andygrove opened a new pull request, #1702:
URL: https://github.com/apache/datafusion-comet/pull/1702

   ## Which issue does this PR close?
   
   
   
   Closes https://github.com/apache/datafusion-comet/issues/1701
   
   ## Rationale for this change
   
   
   
   We need a way to profile memory.
   
   ## What changes are included in this PR?
   
   
   
   ## How are these changes tested?
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Spark SQL test failures in native_iceberg_compat mode [datafusion-comet]

2025-05-01 Thread via GitHub


parthchandra commented on issue #1542:
URL: 
https://github.com/apache/datafusion-comet/issues/1542#issuecomment-2845276570

   Failure count : 
   ```
   Core 1: Tests: succeeded 9113, failed 25, canceled 6, ignored 292, pending 0
   Core 2: Tests: succeeded 2636, failed 19, canceled 0, ignored 387, pending 0
   Hive 1: Tests: succeeded 2144, failed  4, canceled 4, ignored  40, pending 0
   ```
   Total failures: 48 (down from 176)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] chore: Remove fast encoding option [datafusion-comet]

2025-05-01 Thread via GitHub


andygrove opened a new pull request, #1703:
URL: https://github.com/apache/datafusion-comet/pull/1703

   ## Which issue does this PR close?
   
   
   
   N/A
   
   ## Rationale for this change
   
   
   
   Now that we are using a version of Arrow that lets us skip validation on IPC 
reads, there is little justification in continuing to maintain our "fast 
encoding" option, since there is no longer any noticeable improvement in 
performance using this option.
   
   ## What changes are included in this PR?
   
   
   
   ## How are these changes tested?
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] [DISCUSSION] DataFusion Road Map: Q3-Q4 2025 [datafusion]

2025-05-01 Thread via GitHub


Dandandan commented on issue #15878:
URL: https://github.com/apache/datafusion/issues/15878#issuecomment-2845332274

   I am currently interested in the following subjects where I'll probably 
experiment with some things or help out others.
   
   - [ ] Window Functions (profiling, implementing improvements / optimizations)
   1. ** https://github.com/apache/datafusion/issues/15607
   2. Possible (logical plan) optimizations - limit pushdown?
   - [ ] Sorting performance ideas / helping out, examples:
   1. https://github.com/apache/datafusion/issues/15720
   2. 
https://github.com/apache/datafusion/pull/15380#pullrequestreview-2810308155
   - [ ] Streamlining arrow-rs kernels in terms of speed, consistency, reduce 
use of unsafe, etc.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Improve sqllogictest error reporting [datafusion]

2025-05-01 Thread via GitHub


gabotechs commented on PR #15905:
URL: https://github.com/apache/datafusion/pull/15905#issuecomment-2844394504

   👍 Done! also added some failure indexes to help our eyes parse the different 
errors quickly


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: regexp_replace() expression with no starting offset [datafusion-comet]

2025-05-01 Thread via GitHub


andygrove commented on code in PR #1700:
URL: https://github.com/apache/datafusion-comet/pull/1700#discussion_r2070420243


##
spark/src/test/scala/org/apache/comet/CometFuzzTestSuite.scala:
##
@@ -188,6 +188,22 @@ class CometFuzzTestSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
 }
   }
 
+  test("regexp_replace") {
+withSQLConf(CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key -> "true") {
+  val df = spark.read.parquet(filename)
+  df.createOrReplaceTempView("t1")
+  // We want to make sure that the schema generator wasn't modified to 
accidentally omit
+  // StringType, since then this test would not run any queries and 
silently pass.
+  var testedString = false
+  for (field <- df.schema.fields if field.dataType == StringType) {
+testedString = true
+val sql = s"SELECT regexp_replace(${field.name}, 'a', 'b') FROM t1"

Review Comment:
   Perhaps we should improve the logic in `ParquetGenerator` to also generate 
ASCII strings:
   
   ```scala
 case DataTypes.StringType =>
   Range(0, numRows).map(_ => {
 r.nextInt(10) match {
   case 0 if options.allowNull => null
   case 1 => r.nextInt().toByte.toString
   case 2 => r.nextLong().toString
   case 3 => r.nextDouble().toString
   case _ => r.nextString(8)
 }
   })
   ```
   
   The call to ` r.nextString(8)` generates strings containing characters in 
the range 0 through 0xD800, so while it could theoretically generate `a`, it 
will be rare.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] refactor filter pushdown apis [datafusion]

2025-05-01 Thread via GitHub


alamb commented on code in PR #15801:
URL: https://github.com/apache/datafusion/pull/15801#discussion_r2070220686


##
datafusion/core/tests/physical_optimizer/push_down_filter.rs:
##
@@ -154,29 +153,25 @@ impl FileSource for TestSource {
 
 fn try_pushdown_filters(
 &self,
-mut fd: FilterDescription,
+filters: &[Arc],

Review Comment:
   I would personally prefer keeping a struct as an argument as it
   1. Is easier to document / explain what the structures mean via comments
   2. it is clearer in the struct definition what types of operations (e.g. 
append only vs swap, etc) are done
   3. It is easier to add new fields if needed
   
   Basically,  something like
   
   ```rust
   pub struct FilterDescription {
   filters: &[Arc],
   }
   ```
   ...
   
   This is a preference thing and i can easily see alternative opinions



##
datafusion/core/tests/physical_optimizer/push_down_filter.rs:
##
@@ -278,7 +273,7 @@ fn test_filter_collapse() {
 - DataSourceExec: file_groups={0 groups: []}, projection=[a, b, 
c], file_type=test, pushdown_supported=true
   output:
 Ok:
-  - DataSourceExec: file_groups={0 groups: []}, projection=[a, b, c], 
file_type=test, pushdown_supported=true, predicate=b@1 = bar AND a@0 = foo
+  - DataSourceExec: file_groups={0 groups: []}, projection=[a, b, c], 
file_type=test, pushdown_supported=true, predicate=a@0 = foo AND b@1 = bar

Review Comment:
   This actually looks like an improvement to me as now `a = foo` will be 
evaluated before `b=bar` as was done in the input plan. This might be important 
for short circuiting, perhaps
   
   The prior version of this optimization seems to have reordered them



##
datafusion/physical-plan/src/filter_pushdown.rs:
##
@@ -17,79 +17,207 @@
 
 use std::sync::Arc;
 
-use crate::ExecutionPlan;
 use datafusion_physical_expr_common::physical_expr::PhysicalExpr;
 
-#[derive(Clone, Debug)]
-pub struct FilterDescription {
-/// Expressions coming from the parent nodes
-pub filters: Vec>,
+/// The result of or a plan for pushing down a filter into a child node.
+/// This contains references to filters so that nodes can mutate a filter
+/// before pushing it down to a child node (e.g. to adjust a projection)
+/// or can directly take ownership of `Unsupported` filters that their children
+/// could not handle.
+#[derive(Debug, Clone)]
+pub enum FilterPushdown {
+Supported(Arc),
+Unsupported(Arc),
 }
 
-impl Default for FilterDescription {
-fn default() -> Self {
-Self::empty()
+/// A thin wrapper around [`FilterPushdown`]s that allows for easy collection 
of
+/// supported and unsupported filters.
+#[derive(Debug, Clone)]
+pub struct FilterPushdowns(Vec);
+
+impl FilterPushdowns {
+/// Create a new FilterPushdowns with the given filters and their pushdown 
status.
+pub fn new(pushdowns: Vec) -> Self {
+Self(pushdowns)
+}
+
+/// Create a new FilterPushdowns with all filters as supported.
+pub fn all_supported(filters: &[Arc]) -> Self {
+let pushdowns = filters
+.iter()
+.map(|f| FilterPushdown::Supported(Arc::clone(f)))
+.collect();
+Self::new(pushdowns)
+}
+
+/// Create a new FilterPushdowns with all filters as unsupported.
+pub fn all_unsupported(filters: &[Arc]) -> Self {
+let pushdowns = filters
+.iter()
+.map(|f| FilterPushdown::Unsupported(Arc::clone(f)))
+.collect();
+Self::new(pushdowns)
+}
+
+/// Transform all filters to supported, returning a new FilterPushdowns.
+/// This does not modify the original FilterPushdowns.
+pub fn as_supported(&self) -> Self {
+let pushdowns = self
+.0
+.iter()
+.map(|f| match f {
+FilterPushdown::Supported(expr) => {
+FilterPushdown::Supported(Arc::clone(expr))
+}
+FilterPushdown::Unsupported(expr) => {
+FilterPushdown::Supported(Arc::clone(expr))
+}
+})
+.collect();
+Self::new(pushdowns)
+}
+
+/// Collect unsupported filters into a Vec, without removing them from the 
original
+/// FilterPushdowns.
+pub fn collect_unsupported(&self) -> Vec> {
+self.0
+.iter()
+.filter_map(|f| match f {
+FilterPushdown::Unsupported(expr) => Some(Arc::clone(expr)),
+FilterPushdown::Supported(_) => None,
+})
+.collect()
+}
+
+/// Collect all filters as PhysicalExprs into a Vec, without removing them 
from the original
+/// FilterPushdowns.
+pub fn into_inner_filters(&self) -> Vec> {
+self.0
+.iter()
+.map(|f| match f {
+FilterPushdown::Supported(expr) => Arc::clone(expr),
+  

Re: [PR] Update extending-operators.md [datafusion]

2025-05-01 Thread via GitHub


Adez017 commented on PR #15832:
URL: https://github.com/apache/datafusion/pull/15832#issuecomment-2844762089

   > > > You can rebase with main
   > > 
   > > 
   > > doe this solve the issue ?
   > 
   > You can open the failed CI and see what's wrong:
   > 
   > ```
   > error[E0599]: no method named `unwrap` found for struct 
`CoalescePartitionsExec` in the current scope
   >--> 
datafusion/physical-optimizer/src/enforce_sorting/replace_with_order_preserving_variants.rs:210:14
   > |
   > 208 |   let coalesce = CoalescePartitionsExec::new(child)
   > |  -
   > 209 | | .with_fetch(plan.fetch())
   > 210 | | .unwrap();
   > | | -^^ method not found in `CoalescePartitionsExec`
   > | |_|
   > |
   > 
   > warning: unused import: `ExecutionPlan`
   >   --> 
datafusion/physical-optimizer/src/enforce_sorting/replace_with_order_preserving_variants.rs:37:32
   >|
   > 37 | use datafusion_physical_plan::{ExecutionPlan, 
ExecutionPlanProperties};
   >|^
   >|
   >= note: `#[warn(unused_imports)]` on by default
   > ```
   > 
   > The error is fixed in main, so rebasing your branch with main will fix the 
error
   
   Thank for your help @xudong963  but I think it didn't  work for the failing 
workflow 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] refactor filter pushdown apis [datafusion]

2025-05-01 Thread via GitHub


adriangb commented on code in PR #15801:
URL: https://github.com/apache/datafusion/pull/15801#discussion_r2070234428


##
datafusion/core/tests/physical_optimizer/push_down_filter.rs:
##
@@ -154,29 +153,25 @@ impl FileSource for TestSource {
 
 fn try_pushdown_filters(
 &self,
-mut fd: FilterDescription,
+filters: &[Arc],

Review Comment:
   Happy to do that 👍🏻 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] refactor filter pushdown apis [datafusion]

2025-05-01 Thread via GitHub


berkaysynnada commented on PR #15801:
URL: https://github.com/apache/datafusion/pull/15801#issuecomment-2844916340

   Hi @alamb. I've a WIP commit for this PR. I'm trying to both address 
@adriangb concerns and needs, and at the same time trying to keep complexity at 
minimum and trying to make things similar to other DF patterns. I think I can 
finish it at the end of today. Thank you for the patience


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] add benchmark code for `Reuse rows in row cursor stream` [datafusion]

2025-05-01 Thread via GitHub


acking-you opened a new pull request, #15913:
URL: https://github.com/apache/datafusion/pull/15913

   ## Which issue does this PR close?
   
   
   
   - Closes #.
   
   ## Rationale for this change
   
   
   You can 
see:https://github.com/apache/datafusion/issues/15720#issuecomment-2844531606
   
   ## What changes are included in this PR?
   
   
   
   Add a benchmark to test the optimization effect of reusing Rows.
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Remove fast encoding option [datafusion-comet]

2025-05-01 Thread via GitHub


codecov-commenter commented on PR #1703:
URL: 
https://github.com/apache/datafusion-comet/pull/1703#issuecomment-2845566547

   ## 
[Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1703?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   All modified and coverable lines are covered by tests :white_check_mark:
   > Project coverage is 56.51%. Comparing base 
[(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 to head 
[(`48c4377`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/48c4377592b8c56303c9e7d6dd6f5e4dc3a6fd46?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   > Report is 175 commits behind head on main.
   
   Additional details and impacted files
   
   
   ```diff
   @@ Coverage Diff  @@
   ##   main#1703  +/-   ##
   
   + Coverage 56.12%   56.51%   +0.39% 
   - Complexity  976 1065  +89 
   
 Files   119  129  +10 
 Lines 1174312582 +839 
 Branches   2251 2339  +88 
   
   + Hits   6591 7111 +520 
   - Misses 4012 4294 +282 
   - Partials   1140 1177  +37 
   ```
   
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1703?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
:rocket: New features to boost your workflow: 
   
   - :snowflake: [Test 
Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, 
report on failures, and find test suite problems.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Improve sqllogictest error reporting [datafusion]

2025-05-01 Thread via GitHub


berkaysynnada commented on code in PR #15905:
URL: https://github.com/apache/datafusion/pull/15905#discussion_r2070694736


##
datafusion/sqllogictest/bin/sqllogictests.rs:
##
@@ -235,14 +235,38 @@ async fn run_test_file(
 runner.with_normalizer(value_normalizer);
 runner.with_validator(validator);
 
-let res = runner
-.run_file_async(path)
-.await
-.map_err(|e| DataFusionError::External(Box::new(e)));
+let path = path.canonicalize()?;
+let records =
+parse_file(&path).map_err(|e| DataFusionError::External(Box::new(e)))?;
+let mut errs = vec![];
+for record in records.into_iter() {
+if let Record::Halt { .. } = record {
+break;
+}
+if let Err(err) = runner.run_async(record).await {
+errs.push(format!("{err}"));
+}
+}
 
 pb.finish_and_clear();
 
-res
+if errs.is_empty() {
+return Ok(());
+}
+const ERR_LIMIT: usize = 10;

Review Comment:
   Better to define at the top where other constant are defined ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Remove fast encoding option [datafusion-comet]

2025-05-01 Thread via GitHub


andygrove commented on PR #1703:
URL: 
https://github.com/apache/datafusion-comet/pull/1703#issuecomment-2845639468

   > I love a good negative line count. Thanks @andygrove! Do we anticipate any 
performance changes?
   
   Based on local testing, TPC-H performance with fast encoding took 276s 
compared to 278s with Arrow IPC, and the performance does vary slightly on each 
run, so this could just be noise.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: regenerate builtin functions coverage [datafusion-comet]

2025-05-01 Thread via GitHub


comphead commented on PR #1698:
URL: 
https://github.com/apache/datafusion-comet/pull/1698#issuecomment-2845671771

   > Interesting we getting regressions for some string functions like
   > 
   > * initcap
   > * lower
   > * upper
   > * abs
   >   ...
   > 
   > @mbutrovich ^^
   
   Might be related to this param
   
   ```
   val COMET_CASE_CONVERSION_ENABLED: ConfigEntry[Boolean] =
 conf("spark.comet.caseConversion.enabled")
   .doc(
 "Java uses locale-specific rules when converting strings to upper or 
lower case and " +
   "Rust does not, so we disable upper and lower by default.")
   .booleanConf
   .createWithDefault(false)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-05-01 Thread via GitHub


alamb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2845795660

   Thank you @chenkovsky and @adriangb  and everyone else who worked on this PR.
   
   I think the idea of additional metadata columns for LIstingTable provider is 
super valuable but it seems like this particular PR adds more complexity than 
we are willing to handle in the core. 
   
   I wonder if we could/should create a more full featured equivalent of 
LIstingTable in another repo / project -- I think there are many interesting 
ideas / potential things to add -- for example 
   - https://github.com/apache/datafusion/issues/15892
   - https://github.com/apache/datafusion/pull/15181 from @phillipleblanc 
   
   I apologize for letting it sit for so long. This is non ideal to be sure. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Support marking columns as system columns via Field's metadata [datafusion]

2025-05-01 Thread via GitHub


alamb commented on PR #14362:
URL: https://github.com/apache/datafusion/pull/14362#issuecomment-2845809583

   See discussion on https://github.com/apache/datafusion/issues/15173


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add `FormatOptions` to Config [datafusion]

2025-05-01 Thread via GitHub


Omega359 commented on code in PR #15793:
URL: https://github.com/apache/datafusion/pull/15793#discussion_r2070825920


##
datafusion/sqllogictest/test_files/information_schema.slt:
##
@@ -372,6 +381,15 @@ datafusion.explain.physical_plan_only false When set to 
true, the explain statem
 datafusion.explain.show_schema false When set to true, the explain statement 
will print schema information
 datafusion.explain.show_sizes true When set to true, the explain statement 
will print the partition sizes
 datafusion.explain.show_statistics false When set to true, the explain 
statement will print operator statistics for physical plans
+datafusion.format.date_format %Y-%m-%d Date format for date arrays
+datafusion.format.datetime_format %Y-%m-%dT%H:%M:%S%.f Format for DateTime 
arrays
+datafusion.format.duration_format pretty Duration format. Can be either 
`"pretty"` or `"ISO8601"`

Review Comment:
   I think case may matter here for ISO8601 vs iso8601 given the impl in 
config.rs. I would suggest that config would use .to_lowercase() to make it not 
matter.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Unparse `UNNEST` projection with the table column alias [datafusion]

2025-05-01 Thread via GitHub


alamb commented on PR #15879:
URL: https://github.com/apache/datafusion/pull/15879#issuecomment-2845811697

   Thank you @xudong963 and @goldmedal 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Unparse `UNNEST` projection with the table column alias [datafusion]

2025-05-01 Thread via GitHub


alamb merged PR #15879:
URL: https://github.com/apache/datafusion/pull/15879


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Generate the common SQL for the unparsing result of the unnest [datafusion]

2025-05-01 Thread via GitHub


alamb closed issue #15233: Generate the common SQL for the unparsing result of 
the unnest
URL: https://github.com/apache/datafusion/issues/15233


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] refactor: replace `unwrap_or` with `unwrap_or_else` for improved lazy… [datafusion]

2025-05-01 Thread via GitHub


xudong963 commented on code in PR #15841:
URL: https://github.com/apache/datafusion/pull/15841#discussion_r2070946442


##
benchmarks/src/util/options.rs:
##
@@ -72,16 +72,11 @@ impl CommonOpt {
 /// Modify the existing config appropriately
 pub fn update_config(&self, mut config: SessionConfig) -> SessionConfig {
 if let Some(batch_size) = self.batch_size {
-config = config.with_batch_size(batch_size)
+config = config.with_batch_size(batch_size);
 }
 
 if let Some(partitions) = self.partitions {
-config = config.with_target_partitions(partitions)
-}
-
-if let Some(sort_spill_reservation_bytes) = 
self.sort_spill_reservation_bytes {
-config =
-
config.with_sort_spill_reservation_bytes(sort_spill_reservation_bytes);

Review Comment:
   Why was the part removed?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] [datafusion-spark] Example of using Spark compatible function library [datafusion]

2025-05-01 Thread via GitHub


alamb opened a new issue, #15915:
URL: https://github.com/apache/datafusion/issues/15915

   ### Is your feature request related to a problem or challenge?
   
   - Part of https://github.com/apache/datafusion/issues/15914
   
   @shehabgamin added the `datafusion-spark` crate in 
https://github.com/apache/datafusion/pull/15168
   
   The idea is that using the functions in this crate you can get a 
`SessionContext` that executes sql using Spark semantics. However, there is no 
user facing documentation that shows how to do this
   
   ### Describe the solution you'd like
   
   Add an example somewhere showing how to configure and use the spark 
functions in a SessionContext. I can help with this
   
   
   ### Describe alternatives you've considered
   
   I personally suggest adding a new page to the website: 
https://datafusion.apache.org/
   
   Specifically, I suggest
   1. Add a new page in the "Library User Guide" called "Spark Compatible 
Functions"
   2. Add a preamble explaining what the datafusion-spark crate is (contains a 
list of spark compatible functions)
   3. Add examples
   
   For example we should show how to run sql using a "spark compatible" frame:
   ```rust
   let ctx = SessionContext::new();
   datafusion_spark::register_all(&ctx)?;
   
   // TODO run an example SQL query here that uses a function from 
   // the datafusion spark crate
   ctx.sql("select ... ")
   
   // also add an example for DataFrame API
   ```
   
   
   In order to run the example code as part of CI, you will have to add an 
entry such as this:
   
https://github.com/apache/datafusion/blob/81b4c074886e604277cc303986113ca12f9ac77d/datafusion/core/src/lib.rs#L928-L932
   
   
   to the [datafusion-spark 
lib.rs](https://github.com/apache/datafusion/blob/6bda4796a3c8142b87d0fad072bdbe25f9d12934/datafusion/spark/src/lib.rs#L16-L15)
 file (it can't go in the datafusion/core/lib.rs because the core crate doesn't 
bring in datafusion-spark)
   
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add `FormatOptions` to Config [datafusion]

2025-05-01 Thread via GitHub


alamb commented on PR #15793:
URL: https://github.com/apache/datafusion/pull/15793#issuecomment-2845786865

   Are we happy with this PR? Shall we merge it in?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Factor out Substrait consumers into separate files [datafusion]

2025-05-01 Thread via GitHub


alamb commented on PR #15794:
URL: https://github.com/apache/datafusion/pull/15794#issuecomment-2845786048

   Thanks again @vbarua and @gabotechs 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Factor out Substrait consumers into separate files [datafusion]

2025-05-01 Thread via GitHub


alamb merged PR #15794:
URL: https://github.com/apache/datafusion/pull/15794


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] [substrait] refactor consumer.rs [datafusion]

2025-05-01 Thread via GitHub


alamb closed issue #13864: [substrait] refactor consumer.rs
URL: https://github.com/apache/datafusion/issues/13864


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add `FormatOptions` to Config [datafusion]

2025-05-01 Thread via GitHub


blaginin commented on PR #15793:
URL: https://github.com/apache/datafusion/pull/15793#issuecomment-2845789235

   Asked Bruce to re-review, and then I'm ready to merge 🤗


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Support file row index / row id for each file in a `ListingTableProvider` [datafusion]

2025-05-01 Thread via GitHub


acking-you commented on issue #15892:
URL: https://github.com/apache/datafusion/issues/15892#issuecomment-2845864345

   nice feature👍


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Implement Parquet filter pushdown via new filter pushdown APIs [datafusion]

2025-05-01 Thread via GitHub


adriangb commented on code in PR #15769:
URL: https://github.com/apache/datafusion/pull/15769#discussion_r2070868393


##
datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt:
##
@@ -81,11 +81,15 @@ EXPLAIN select a from t_pushdown where b > 2 ORDER BY a;
 
 logical_plan
 01)Sort: t_pushdown.a ASC NULLS LAST
-02)--TableScan: t_pushdown projection=[a], full_filters=[t_pushdown.b > 
Int32(2)]
+02)--Projection: t_pushdown.a
+03)Filter: t_pushdown.b > Int32(2)
+04)--TableScan: t_pushdown projection=[a, b], 
partial_filters=[t_pushdown.b > Int32(2)]
 physical_plan
 01)SortPreservingMergeExec: [a@0 ASC NULLS LAST]
 02)--SortExec: expr=[a@0 ASC NULLS LAST], preserve_partitioning=[true]
-03)DataSourceExec: file_groups={2 groups: 
[[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet],
 
[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/2.parquet]]},
 projection=[a], file_type=parquet, predicate=b@1 > 2, 
pruning_predicate=b_null_count@1 != row_count@2 AND b_max@0 > 2, 
required_guarantees=[]
+03)CoalesceBatchesExec: target_batch_size=8192
+04)--RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=2
+05)DataSourceExec: file_groups={2 groups: 
[[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet],
 
[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/2.parquet]]},
 projection=[a], file_type=parquet, predicate=b@1 > 2 AND b@1 > 2

Review Comment:
   This would also be fixed by https://github.com/apache/datafusion/pull/15801



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: Add `datafusion-spark` crate [datafusion]

2025-05-01 Thread via GitHub


alamb commented on PR #15168:
URL: https://github.com/apache/datafusion/pull/15168#issuecomment-2846025123

   I have filed an epic to track filling out the datafusion-spark crate:
   - https://github.com/apache/datafusion/issues/15914  
   
   I will file some subtickets for follow on work as well (e.g. what is in 
https://github.com/apache/datafusion/pull/15168#pullrequestreview-2801437803)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: Add `datafusion-spark` crate [datafusion]

2025-05-01 Thread via GitHub


alamb merged PR #15168:
URL: https://github.com/apache/datafusion/pull/15168


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] [EPIC] Complete `datafusion-spark` Spark Compatible Functions [datafusion]

2025-05-01 Thread via GitHub


alamb opened a new issue, #15914:
URL: https://github.com/apache/datafusion/issues/15914

   ### Is your feature request related to a problem or challenge?
   
   Many DataFusion users are using DataFusion to execution workloads originally 
developed for Apache Spark. Examples include
   - [DataFusion Comet](https://datafusion.apache.org/comet/) (@andygrove 
@comphead , etc) 
   - [LakeHQ / Sail](https://github.com/lakehq/sail) (@shehabgamin )
   - Various internal pileines / engines  (e.g. that @Omega359 and I think 
@Blizzara use)
   
   They often do this for superior performance
   
   * Part of running Spark workloads is emulating Spark sematics
   * Emulating Spark semantics requires (among other things) functions 
compatible with Spark  (which differs in semantics to the functions included in 
DataFusion)
   
   Several projects are in the process of implementing Spark compatible 
function libraries using DataFusion's extension APIs. However. we concluded in 
https://github.com/apache/datafusion/issues/5600 that we could join forces and 
maintain a spark compatible funciton library in the core datafusion repo. 
@shehabgamin  has implemented the first step in 
https://github.com/apache/datafusion/pull/15168 🙏 
   
   ### Describe the solution you'd like
   
   This ticket tracks "completing" the spark function library started in 
https://github.com/apache/datafusion/pull/15168
   
   ### Describe alternatives you've considered
   
   Related Issues
   - [ ] https://github.com/apache/datafusion/issues/5600
   - [ ] https://github.com/apache/datafusion/pull/15168
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: Add `datafusion-spark` crate [datafusion]

2025-05-01 Thread via GitHub


alamb commented on PR #15168:
URL: https://github.com/apache/datafusion/pull/15168#issuecomment-2846025618

   Onward!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] [DISCUSSION] Add separate crate to cover spark builtin functions [datafusion]

2025-05-01 Thread via GitHub


alamb closed issue #5600: [DISCUSSION] Add separate crate to cover spark 
builtin functions
URL: https://github.com/apache/datafusion/issues/5600


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] [DISCUSSION] Add separate crate to cover spark builtin functions [datafusion]

2025-05-01 Thread via GitHub


alamb commented on issue #5600:
URL: https://github.com/apache/datafusion/issues/5600#issuecomment-2846025987

   We are tracking follow on work in 
- https://github.com/apache/datafusion/issues/15914


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Fix CI in main [datafusion]

2025-05-01 Thread via GitHub


blaginin opened a new pull request, #15917:
URL: https://github.com/apache/datafusion/pull/15917

   ## Which issue does this PR close?
   
   Fixes CI after https://github.com/apache/datafusion/pull/15168
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add `CREATE TRIGGER` support for SQL Server [datafusion-sqlparser-rs]

2025-05-01 Thread via GitHub


iffyio commented on code in PR #1810:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1810#discussion_r2071038220


##
src/dialect/mssql.rs:
##
@@ -215,6 +225,59 @@ impl MsSqlDialect {
 }))
 }
 
+/// Parse `CREATE TRIGGER` for [MsSql]
+///
+/// [MsSql]: 
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-trigger-transact-sql
+fn parse_create_trigger(
+&self,
+parser: &mut Parser,
+or_alter: bool,
+) -> Result {
+let name = parser.parse_object_name(false)?;
+parser.expect_keyword_is(Keyword::ON)?;
+let table_name = parser.parse_object_name(false)?;
+let period = parser.parse_trigger_period()?;
+let events = 
parser.parse_comma_separated(Parser::parse_trigger_event)?;
+
+parser.expect_keyword_is(Keyword::AS)?;
+
+let trigger_statements_body = if parser.peek_keyword(Keyword::BEGIN) {
+let begin_token = parser.expect_keyword(Keyword::BEGIN)?;
+let statements = parser.parse_statement_list(&[Keyword::END])?;
+let end_token = parser.expect_keyword(Keyword::END)?;
+
+BeginEndStatements {
+begin_token: AttachedToken(begin_token),
+statements,
+end_token: AttachedToken(end_token),
+}
+} else {
+BeginEndStatements {
+begin_token: AttachedToken::empty(),
+statements: parser.parse_statements()?,
+end_token: AttachedToken::empty(),
+}

Review Comment:
   Oh I think instead of representing the begin optionality using tokens, we 
can use an enum, something similar to 
[ConditionalStatements](https://github.com/apache/datafusion-sqlparser-rs/blob/main/src/ast/mod.rs#L2294-L2299)
 where we have an explicit `BeginEnd` enum variant



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Support metadata columns (`location`, `size`, `last_modified`) in `ListingTableProvider` [datafusion]

2025-05-01 Thread via GitHub


alamb commented on issue #15173:
URL: https://github.com/apache/datafusion/issues/15173#issuecomment-2845807026

   There is another request here that is related I think (basically "row number 
within the file"):
   - https://github.com/apache/datafusion/issues/15892
   
   It seems to me what has happened is that the existing ListingTableProvider 
in DataFusion has to walk a fine balance between eing relatively "simple" and 
additional functionality like this while widely useful I think also could be 
implemented outside the core with the existing APIs. 
   
   As we add more features to ListingTable it will look more and more like a 
full featured Table Format, but there are already many implementations of such 
formats (for example Iceberg, Delta, etc)
   
   Thus I think it might be good to propose we creating a `datafusion-contrib` 
project with make a more full featured "TableProvider" that adds new features 
like:
   1. Additional metadata clumns
   2. https://github.com/apache/datafusion/issues/15892
   3. etc


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] metadata column support [datafusion]

2025-05-01 Thread via GitHub


alamb commented on issue #13975:
URL: https://github.com/apache/datafusion/issues/13975#issuecomment-2845808404

   I think this is duplicated by 
https://github.com/apache/datafusion/issues/15173 so let's continue the 
discussion there


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] metadata column support [datafusion]

2025-05-01 Thread via GitHub


alamb closed issue #13975: metadata column support
URL: https://github.com/apache/datafusion/issues/13975


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add `FormatOptions` to Config [datafusion]

2025-05-01 Thread via GitHub


blaginin commented on code in PR #15793:
URL: https://github.com/apache/datafusion/pull/15793#discussion_r2070830060


##
datafusion/sqllogictest/test_files/information_schema.slt:
##
@@ -372,6 +381,15 @@ datafusion.explain.physical_plan_only false When set to 
true, the explain statem
 datafusion.explain.show_schema false When set to true, the explain statement 
will print schema information
 datafusion.explain.show_sizes true When set to true, the explain statement 
will print the partition sizes
 datafusion.explain.show_statistics false When set to true, the explain 
statement will print operator statistics for physical plans
+datafusion.format.date_format %Y-%m-%d Date format for date arrays
+datafusion.format.datetime_format %Y-%m-%dT%H:%M:%S%.f Format for DateTime 
arrays
+datafusion.format.duration_format pretty Duration format. Can be either 
`"pretty"` or `"ISO8601"`

Review Comment:
   great catch! fixed in 
https://github.com/apache/datafusion/pull/15793/commits/83748836c64aecdd7d63dc63e48b1fc1bb4efc69



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add `FormatOptions` to Config [datafusion]

2025-05-01 Thread via GitHub


Omega359 commented on PR #15793:
URL: https://github.com/apache/datafusion/pull/15793#issuecomment-2845818047

   Beyond the config try_into looking like it's case sensitive for 
duration_format I think this is good. There is a breaking change with the 
rename of FormatOptions to OutputFormat that should be noted in the upgrade 
guide for the release this goes into.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Add support for `DENY` statements [datafusion-sqlparser-rs]

2025-05-01 Thread via GitHub


aharpervc opened a new pull request, #1836:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1836

   This is another statement for SQL Server: 
https://learn.microsoft.com/en-us/sql/t-sql/statements/deny-transact-sql, but 
implemented in a common way. Similar to GRANT & REVOKE, so using a lot of those 
patterns with a similar test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Implement Parquet filter pushdown via new filter pushdown APIs [datafusion]

2025-05-01 Thread via GitHub


adriangb commented on code in PR #15769:
URL: https://github.com/apache/datafusion/pull/15769#discussion_r2070868393


##
datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt:
##
@@ -81,11 +81,15 @@ EXPLAIN select a from t_pushdown where b > 2 ORDER BY a;
 
 logical_plan
 01)Sort: t_pushdown.a ASC NULLS LAST
-02)--TableScan: t_pushdown projection=[a], full_filters=[t_pushdown.b > 
Int32(2)]
+02)--Projection: t_pushdown.a
+03)Filter: t_pushdown.b > Int32(2)
+04)--TableScan: t_pushdown projection=[a, b], 
partial_filters=[t_pushdown.b > Int32(2)]
 physical_plan
 01)SortPreservingMergeExec: [a@0 ASC NULLS LAST]
 02)--SortExec: expr=[a@0 ASC NULLS LAST], preserve_partitioning=[true]
-03)DataSourceExec: file_groups={2 groups: 
[[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet],
 
[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/2.parquet]]},
 projection=[a], file_type=parquet, predicate=b@1 > 2, 
pruning_predicate=b_null_count@1 != row_count@2 AND b_max@0 > 2, 
required_guarantees=[]
+03)CoalesceBatchesExec: target_batch_size=8192
+04)--RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=2
+05)DataSourceExec: file_groups={2 groups: 
[[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet],
 
[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/2.parquet]]},
 projection=[a], file_type=parquet, predicate=b@1 > 2 AND b@1 > 2

Review Comment:
   This would also be fixed by https://github.com/apache/datafusion/pull/15801



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Resolved bug in `parse_function_arg` [datafusion-sqlparser-rs]

2025-05-01 Thread via GitHub


LucaCappelletti94 commented on code in PR #1826:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1826#discussion_r2071126270


##
src/parser/mod.rs:
##
@@ -5199,13 +5199,20 @@ impl<'a> Parser<'a> {
 
 // parse: [ argname ] argtype
 let mut name = None;
+let next_token = self.peek_token();

Review Comment:
   The argument named `Int2` (as described in the 
[issue](https://github.com/apache/datafusion-sqlparser-rs/issues/1825)) is not 
parsed as `DataType::Custom`, but as a 
[`DataType::Int2`](https://github.com/apache/datafusion-sqlparser-rs/blob/a464f8e8d7a5057e4e5b8046b0f619acdf7fce74/src/ast/data_type.rs#L150).
 Analogously, any other such argument names that collides with data types from 
other SQL engines would be parsed into a type.
   
   Now, if I were to convert back to string `DataType::Int2` I would get some 
arbitrary capitalization which in this case is `INT2` - without the 
`peek_token`, I am unsure how we can preserve the initial token from being lost.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Update extending-operators.md [datafusion]

2025-05-01 Thread via GitHub


Adez017 commented on PR #15832:
URL: https://github.com/apache/datafusion/pull/15832#issuecomment-2846154397

   i Think ww need @alamb  help now . could you help ?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] [datafusion-spark] Example of using Spark compatible function library [datafusion]

2025-05-01 Thread via GitHub


Adez017 commented on issue #15915:
URL: https://github.com/apache/datafusion/issues/15915#issuecomment-2846193611

   hi @alamb  . it sounds interesting , i would love to work in this .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] [datafusion-spark] Example of using Spark compatible function library [datafusion]

2025-05-01 Thread via GitHub


Adez017 commented on issue #15915:
URL: https://github.com/apache/datafusion/issues/15915#issuecomment-2846193703

   take
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Resolved bug in `parse_function_arg` [datafusion-sqlparser-rs]

2025-05-01 Thread via GitHub


iffyio commented on code in PR #1826:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1826#discussion_r2071042324


##
src/parser/mod.rs:
##
@@ -5199,13 +5199,20 @@ impl<'a> Parser<'a> {
 
 // parse: [ argname ] argtype
 let mut name = None;
+let next_token = self.peek_token();

Review Comment:
   > The code you proposed does not work as Int2 (or any analogous such type) 
does not fall in if let DataType::Custom(n, _) = &data_type {
   
   Oh how did you mean here by `Int2` in this example not being parsed as a 
custom datatype, do we get back a different type or does `parse_data_type` fail 
in that scenario?
   
   I think ideally we will want to do without this `self.peek_token()` to avoid 
the cloning that it includes



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Support some of pipe operators [datafusion-sqlparser-rs]

2025-05-01 Thread via GitHub


iffyio merged PR #1759:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1759


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] [experiment] Generate event log in Chrome tracing format [datafusion-comet]

2025-05-01 Thread via GitHub


codecov-commenter commented on PR #1706:
URL: 
https://github.com/apache/datafusion-comet/pull/1706#issuecomment-2846219034

   ## 
[Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1706?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   All modified and coverable lines are covered by tests :white_check_mark:
   > Project coverage is 57.37%. Comparing base 
[(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 to head 
[(`14cd1c5`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/14cd1c5d90b7b1c17a81d23490ac05cd56221251?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   > Report is 175 commits behind head on main.
   
   Additional details and impacted files
   
   
   ```diff
   @@ Coverage Diff  @@
   ##   main#1706  +/-   ##
   
   + Coverage 56.12%   57.37%   +1.24% 
   - Complexity  976 1072  +96 
   
 Files   119  129  +10 
 Lines 1174312515 +772 
 Branches   2251 2347  +96 
   
   + Hits   6591 7180 +589 
   - Misses 4012 4149 +137 
   - Partials   1140 1186  +46 
   ```
   
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1706?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
:rocket: New features to boost your workflow: 
   
   - :snowflake: [Test 
Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, 
report on failures, and find test suite problems.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Added support for `DROP DOMAIN` [datafusion-sqlparser-rs]

2025-05-01 Thread via GitHub


iffyio merged PR #1828:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1828


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



  1   2   >