Re: [I] Support file row index / row id for each file in a `ListingTableProvider` [datafusion]
alamb commented on issue #15892: URL: https://github.com/apache/datafusion/issues/15892#issuecomment-2845830552 Related discussion: - https://github.com/apache/datafusion/issues/15173
Re: [I] Support metadata columns (`location`, `size`, `last_modified`) in `ListingTableProvider` [datafusion]
phillipleblanc commented on issue #15173: URL: https://github.com/apache/datafusion/issues/15173#issuecomment-2846043563 Yeah that makes sense. Part of the complexity here is that several of the features needed to make the ListingTableProvider work today (i.e. partition columns) are actually implemented in the core FileStream. And in order to add the metadata columns I want, I need to pipe through the ObjectMeta in the FileStream as well. So it would end up being quite a lot of code duplication to fork out of the core repo to get this working as is. I do think there is opportunity for some better extensibility/abstractions that would allow people to implement something like this on their own - but I'm not sure what that is yet. I think a good first step would be figuring out how to abstract out the partition columns so that the existing ListingTable can work without any special code in core.
Re: [I] [datafusion-spark] Example of using Spark compatible function library [datafusion]
alamb commented on issue #15915: URL: https://github.com/apache/datafusion/issues/15915#issuecomment-2846046740 I think this is a good first issue as there is a clear request of what is desired and examples to follow
[I] [datafusion-spark] Test integrating datafusion-spark code into comet [datafusion-comet]
alamb opened a new issue, #1704: URL: https://github.com/apache/datafusion-comet/issues/1704

### What is the problem the feature request solves?

Part of https://github.com/apache/datafusion/issues/15914 @shehabgamin added the `datafusion-spark` crate in https://github.com/apache/datafusion/pull/15168 The goal is to help centralize the development of this function library in the core repository rather than duplicating effort. To verify this is feasible, I would like to verify that the setup created in `datafusion-spark` / https://github.com/apache/datafusion/pull/15168 can be used by downstream crates like Comet before we go too far.

### Describe the potential solution

I would like someone to make a PR that shows we can remove one or more of the Spark functions in Comet by using the `datafusion-spark` implementations (and reduce the code here).

### Additional context

cc @comphead @andygrove @mbutrovich
Re: [I] Add memory profiling / logging [datafusion-comet]
alamb commented on issue #1701: URL: https://github.com/apache/datafusion-comet/issues/1701#issuecomment-2846047457 It would be amazing to have memory monitoring of native code in datafusion too -- it is an important feature that is currently hard for downstream crates
Re: [PR] feat: Add `datafusion-spark` crate [datafusion]
xudong963 commented on PR #15168: URL: https://github.com/apache/datafusion/pull/15168#issuecomment-2846060139 FYI, the main CI has been failing since this PR
Re: [PR] Fix typo in introduction.md [datafusion]
xudong963 merged PR #15910: URL: https://github.com/apache/datafusion/pull/15910
Re: [PR] chore: regenerate builtin functions coverage [datafusion-comet]
comphead closed pull request #1698: chore: regenerate builtin functions coverage URL: https://github.com/apache/datafusion-comet/pull/1698
[I] [datafusion-spark] Implement `ceil` function [datafusion]
alamb opened a new issue, #15916: URL: https://github.com/apache/datafusion/issues/15916

### Is your feature request related to a problem or challenge?

Part of https://github.com/apache/datafusion/issues/15914 Given the importance of Spark functions in general, we are consolidating / adding Spark compatible functions in the `datafusion-spark` crate in DataFusion: https://github.com/apache/datafusion/tree/main/datafusion/spark This ticket is part of a series of tickets to fill out the set of Spark compatible functions in DataFusion.

### Describe the solution you'd like

Implement the function listed above in the `datafusion-spark` crate

### Describe alternatives you've considered

1. Add the relevant function here: https://github.com/apache/datafusion/tree/main/datafusion/spark/src/function/math
2. Add tests. Instructions are here: https://github.com/apache/datafusion/blob/main/datafusion/spark/README.md

### Examples to follow:

* The existing `expm1` implementation: https://github.com/apache/datafusion/blob/main/datafusion/spark/src/function/math/expm1.rs
* Comet implementation here: https://github.com/apache/datafusion-comet/blob/main/native/spark-expr/src/math_funcs
* Sail implementation here: https://github.com/lakehq/sail/blob/4144d9a53d3f1ef447b3d2ce9afa9030906de640/crates/sail-plan/src/function/scalar/math.rs#L404

You can also find the corresponding DataFusion implementations in the matching file here: https://github.com/apache/datafusion/tree/main/datafusion/functions/src (it would be ok to call into these implementations if they are the same as the spark implementation)

### Additional context

_No response_
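For orientation, here is a minimal Rust kernel sketch of the Spark-compatible behavior this ticket asks for. It is not the project's actual implementation; the function shape and the assumption that Spark's `ceil(DOUBLE)` returns BIGINT (unlike DataFusion's built-in `ceil`, which returns DOUBLE) should be verified against the Spark docs and the `expm1.rs` example linked above.

```rust
// Minimal kernel sketch, not the official implementation. Assumption to verify:
// Spark's ceil(DOUBLE) returns a BIGINT, unlike DataFusion's built-in `ceil`
// (which returns DOUBLE), so the Spark version is not just a re-export.
use arrow::array::{Array, Float64Array, Int64Array};

/// Apply Spark-style ceil to a Float64 column, producing an Int64 column.
/// Overflow and NaN handling are omitted for brevity and would need to match Spark.
fn spark_ceil_f64(input: &Float64Array) -> Int64Array {
    input.iter().map(|v| v.map(|x| x.ceil() as i64)).collect()
}

fn main() {
    let input = Float64Array::from(vec![Some(1.1), Some(-2.7), None]);
    let result = spark_ceil_f64(&input);
    assert_eq!(result.value(0), 2);
    assert_eq!(result.value(1), -2);
    assert!(result.is_null(2));
}
```

Wrapping a kernel like this in a `ScalarUDFImpl`, registering it, and adding tests would follow the same pattern as the existing `expm1` implementation and the README instructions already linked in the ticket.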
Re: [I] [datafusion-spark] Implement `ceil` function [datafusion]
alamb commented on issue #15916: URL: https://github.com/apache/datafusion/issues/15916#issuecomment-2846091416 @shehabgamin and @andygrove -- here is a ticket for another Spark function. I am hoping we can do one or two of these functions to set a pattern, and then we will be able to basically copy the pattern for the rest of the library. Questions for you: 1. Is there any other function(s) that you would recommend as one of the first to try and consolidate?
[PR] [experiment] Generate event log in Chrome tracing format [datafusion-comet]
andygrove opened a new pull request, #1706: URL: https://github.com/apache/datafusion-comet/pull/1706

## Which issue does this PR close?

Part of https://github.com/apache/datafusion-comet/issues/1705

## Rationale for this change

This is a quick POC of generating an event log in Chrome tracing format so that we can use existing tools to visualize native execution flow.

## What changes are included in this PR?

## How are these changes tested?
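For context on the format mentioned above: the Chrome trace event format is a JSON array of event objects with `name`, `ph`, `ts`/`dur` (microseconds), `pid`, and `tid` fields, which chrome://tracing and Perfetto can load. The sketch below is illustrative only and is not the code in this PR; the `TraceWriter` type and the `ScanExec::execute` label are made up for the example.

```rust
use std::fs::File;
use std::io::{BufWriter, Write};
use std::time::Instant;

/// Writes events in the Chrome trace event format: a JSON array of objects with
/// "name", "ph", "ts", "dur", "pid", "tid", loadable in chrome://tracing or Perfetto.
struct TraceWriter {
    out: BufWriter<File>,
    start: Instant,
    first: bool,
}

impl TraceWriter {
    fn new(path: &str) -> std::io::Result<Self> {
        let mut out = BufWriter::new(File::create(path)?);
        // The closing "]" is optional in this format, which makes appending events easy.
        out.write_all(b"[\n")?;
        Ok(Self { out, start: Instant::now(), first: true })
    }

    /// Record a "complete" event ("ph": "X"): a named span with a start timestamp
    /// and a duration, both in microseconds since the trace started.
    fn complete_event(&mut self, name: &str, ts_us: u128, dur_us: u128) -> std::io::Result<()> {
        if !self.first {
            self.out.write_all(b",\n")?;
        }
        self.first = false;
        write!(
            self.out,
            r#"{{"name":"{}","cat":"native","ph":"X","ts":{},"dur":{},"pid":1,"tid":1}}"#,
            name, ts_us, dur_us
        )
    }
}

fn main() -> std::io::Result<()> {
    let mut tracer = TraceWriter::new("trace.json")?;
    let t0 = tracer.start.elapsed().as_micros();
    std::thread::sleep(std::time::Duration::from_millis(5)); // stand-in for operator work
    let dur = tracer.start.elapsed().as_micros() - t0;
    tracer.complete_event("ScanExec::execute", t0, dur)?;
    Ok(())
}
```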
Re: [I] Add support for event tracing for visualizing where time is spent during execution [datafusion-comet]
andygrove commented on issue #1705: URL: https://github.com/apache/datafusion-comet/issues/1705#issuecomment-2846126081 @alamb This may also be interesting to explore for DataFusion
Re: [PR] support simple/cross lateral joins [datafusion]
github-actions[bot] closed pull request #14595: support simple/cross lateral joins URL: https://github.com/apache/datafusion/pull/14595
Re: [PR] feat: implement contextualized ObjectStore [datafusion]
github-actions[bot] closed pull request #14805: feat: implement contextualized ObjectStore URL: https://github.com/apache/datafusion/pull/14805
Re: [I] Reuse Rows allocation in SortPreservingMergeStream / `RowCursorStream` [datafusion]
acking-you commented on issue #15720: URL: https://github.com/apache/datafusion/issues/15720#issuecomment-2844531606 I think I need to submit two PRs to complete this issue: 1. Add benchmark code for SortPreservingMergeStream 2. Implement the reuse of Rows
[PR] chore(deps): bump assert_cmd from 2.0.16 to 2.0.17 [datafusion]
dependabot[bot] opened a new pull request, #15909: URL: https://github.com/apache/datafusion/pull/15909 Bumps [assert_cmd](https://github.com/assert-rs/assert_cmd) from 2.0.16 to 2.0.17.

Changelog (sourced from [assert_cmd's changelog](https://github.com/assert-rs/assert_cmd/blob/master/CHANGELOG.md)):
- [2.0.17] - 2025-04-16: Features: Add `cargo::cargo_bin!` which will work with Cargo's build-dir

Commits (viewable in the [compare view](https://github.com/assert-rs/assert_cmd/compare/v2.0.16...v2.0.17)):
- c5c7f0f chore: Release assert_cmd version 2.0.17
- 569854e docs: Update changelog
- 26cddcf Merge pull request #234 from epage/macro
- abe77aa feat(cargo): Add cargo_bin!
- 148c026 refactor: Move crate_name out into a file
- a096861 Merge pull request #233 from xixishidibei/master
- 9b88b5a chore: fix cargo clippy warning
- 1bc3b12 chore(deps): Update Rust Stable to v1.86 (#232)
- b177879 chore(deps): Update Rust crate automod to v1.0.15 (#231)
- 3d914b5 chore(deps): Update Rust Stable to v1.85 (#229)

[Dependabot compatibility score](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

Dependabot commands and options: you can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
Re: [PR] Implement Parquet filter pushdown via new filter pushdown APIs [datafusion]
alamb commented on code in PR #15769: URL: https://github.com/apache/datafusion/pull/15769#discussion_r2070243952

## datafusion/core/src/datasource/listing/table.rs: ## @@ -982,18 +980,6 @@ impl TableProvider for ListingTable { return Ok(TableProviderFilterPushDown::Exact); } -// if we can't push it down completely with only the filename-based/path-based

Review Comment: This change makes sense to me -- when @itsjunetime originally implemented this code, there was some complexity because there was no way to do filter pushdown in ExecutionPlans, so in my mind this approach was a (clever) workaround. The comments even hint that this is a parquet-specific special case. I think the new pattern of handling predicates more generally in this PR is cleaner and will support more cases. Since this code is only currently executed... Perhaps @cisaacson has some other thoughts.

## datafusion/datasource-parquet/src/mod.rs: ## @@ -244,7 +242,7 @@ impl ParquetExecBuilder { inner: DataSourceExec::new(Arc::new(base_config.clone())), base_config, predicate, -pruning_predicate: parquet.pruning_predicate, +pruning_predicate: None, // for backwards compat since `ParquetExec` is only for backwards compat anyway

Review Comment: yeah this is fine too in my opinion. It is almost time to remove ParquetExec anyways -- maybe we should just do it in this release 🤔

## datafusion/core/src/datasource/listing/table.rs: ## @@ -982,18 +980,6 @@ impl TableProvider for ListingTable { return Ok(TableProviderFilterPushDown::Exact); } -// if we can't push it down completely with only the filename-based/path-based -// column names, then we should check if we can do parquet predicate pushdown -let supports_pushdown = self.options.format.supports_filters_pushdown( -&self.file_schema, -&self.table_schema, -&[filter], -)?; - -if supports_pushdown == FilePushdownSupport::Supported { -return Ok(TableProviderFilterPushDown::Exact); -}

Review Comment: > We can justify implementing other TableProviders for Parquet, but still I cannot understand why we need to degrade the capabilities of our ListingTable. Isn't it always better pruning/simplifying things at the higher levels as possible?

I don't think this degrades the capabilities of the current listing table. I think the only implications are for anyone who used a custom `FileFormat` and implemented `supports_filters_pushdown` -- I suspect this is not very common and we can likely avoid consternation by mentioning it in the upgrade guide (see comment below)

## datafusion/datasource-parquet/src/source.rs: ## @@ -559,25 +549,8 @@ impl FileSource for ParquetSource { .predicate() .map(|p| format!(", predicate={p}")) .unwrap_or_default(); -let pruning_predicate_string = self -.pruning_predicate -.as_ref() -.map(|pre| { -let mut guarantees = pre -.literal_guarantees() -.iter() -.map(|item| format!("{}", item)) -.collect_vec(); -guarantees.sort(); -format!( -", pruning_predicate={}, required_guarantees=[{}]", -pre.predicate_expr(), -guarantees.join(", ") -) -}) -.unwrap_or_default();

Review Comment: I think it is important to keep these in the physical plans -- in particular, it is important to be able to check whether pruning is happening by looking at the explain plan.

## datafusion/datasource/src/file_format.rs: ## @@ -109,37 +108,10 @@ pub trait FileFormat: Send + Sync + fmt::Debug { not_impl_err!("Writer not implemented for this format") } -/// Check if the specified file format has support for pushing down the provided filters within -/// the given schemas. Added initially to support the Parquet file format's ability to do this. -fn supports_filters_pushdown(

Review Comment: yes, I agree. Since [`FileFormat`](https://docs.rs/datafusion/latest/datafusion/datasource/file_format/trait.FileFormat.html) is a pub trait, this is technically a breaking API change, but I do think it was a parquet-specific optimization. I recommend we mark this PR as an API change and add a note to the upgrade guide.
Re: [PR] feat: decode() expression when using 'utf-8' encoding [datafusion-comet]
mbutrovich commented on PR #1697: URL: https://github.com/apache/datafusion-comet/pull/1697#issuecomment-2844679657 Linux is failing in CometFuzzTestSuite due to failing to cast a Dictionary to utf-8. RNG there must be generating different values. I'll look into that failure.
Re: [I] Reuse Rows allocation in SortPreservingMergeStream / `RowCursorStream` [datafusion]
acking-you commented on issue #15720: URL: https://github.com/apache/datafusion/issues/15720#issuecomment-2844672108

> # Overall Implementation
> Adjust `RowCursorStream` to become the owner of `Rows` with continuous reuse, requiring each partition to maintain two `Rows` instances (subsequent sorting algorithms need to maintain `prev` and `cur` states for comparison). The key challenge lies in passing a reference to `Rows` via `poll_next` to `SortPreservingMergeStream`, which is difficult to achieve under Rust's lifetime annotation constraints.
>
> ## Possible Implementation Approaches:
> 1. **Lifetime Annotations**: Attempt to pass a reference to `Rows` using Rust's lifetime annotations (tested and found nearly infeasible).
> 2. **Box Abstraction**: Wrap `Rows` in a `Box`, returning a `const*` to `Rows` during `poll_next`. This pointer must be re-encapsulated into a structure resembling `Rows`. This approach is currently under exploration. Since the `SortPreservingMergeStream` invokes `RowCursorStream::poll_next` synchronously, there are no concurrency safety concerns (yet to be tested).
>
> ## Optimization Opportunities:
> When the `ORDER BY` clause contains only a single column, the current implementation of `SortPreservingMergeExec` still copies and generates `Rows` structures for comparison. Consider abstracting a unified structure to handle this scenario while fulfilling the requirements.

After reviewing the specific code, it was found that optimization for a single column is not possible because SortPreservingMergeExec will only choose to create a [SortPreservingMergeStream](https://docs.rs/datafusion-physical-plan/47.0.0/src/datafusion_physical_plan/sorts/sort_preserving_merge.rs.html#291-337) when sorting involves more than one column.
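To make the "Box Abstraction" idea above concrete, here is a self-contained Rust sketch of the ownership pattern being described, using a stand-in `Rows` type rather than the real `arrow-row` API. The type and method names are illustrative, and the safety argument mirrors the comment's assumptions (synchronous `poll_next`, buffers owned by the stream); this is not DataFusion code.

```rust
// Stand-alone sketch of approach 2 ("Box Abstraction") with a stand-in `Rows` type;
// the real `Rows` lives in arrow-row and this is not DataFusion code.
struct Rows {
    buffer: Vec<u8>, // stand-in for the encoded row data
}

struct RowCursorStream {
    cur: Box<Rows>,
    prev: Box<Rows>, // previous batch kept around for cursor comparisons
}

impl RowCursorStream {
    fn new() -> Self {
        Self {
            cur: Box::new(Rows { buffer: Vec::new() }),
            prev: Box::new(Rows { buffer: Vec::new() }),
        }
    }

    /// Re-encode the next batch into `cur`, reusing its allocation, and hand back a raw
    /// pointer instead of a lifetime-bound reference. The safety argument from the
    /// comment: the merge stream drives `poll_next` synchronously and never outlives
    /// the cursor stream, so the pointee stays valid while it is read.
    fn next_rows(&mut self, batch: &[u8]) -> *const Rows {
        std::mem::swap(&mut self.cur, &mut self.prev); // keep the previous rows alive
        self.cur.buffer.clear();
        self.cur.buffer.extend_from_slice(batch); // reuse the existing allocation
        &*self.cur as *const Rows
    }
}

fn main() {
    let mut stream = RowCursorStream::new();
    let ptr = stream.next_rows(b"encoded row data");
    // SAFETY: `stream` is still alive and is not mutated again before this read.
    let rows = unsafe { &*ptr };
    assert_eq!(&rows.buffer[..], b"encoded row data");
}
```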
Re: [PR] feat: Add memory profiling [datafusion-comet]
andygrove commented on code in PR #1702: URL: https://github.com/apache/datafusion-comet/pull/1702#discussion_r2070469650 ## native/core/src/execution/jni_api.rs: ## @@ -359,6 +365,21 @@ pub unsafe extern "system" fn Java_org_apache_comet_Native_executePlan( // Retrieve the query let exec_context = get_execution_context(exec_context); +// memory profiling is only available on linux +if exec_context.memory_profiling_enabled { +#[cfg(target_os = "linux")] Review Comment: I did that originally, but the compiler complained about the unused variable `memory_profiling_enabled` on non-Linux platforms.
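A small illustration of the trade-off discussed here (not Comet's actual code; `start_native_memory_profiling` is a placeholder): putting the whole `if` inside `#[cfg(target_os = "linux")]` means non-Linux builds never read the flag and warn about it being unused, while gating only the body keeps the read on every platform.

```rust
// Sketch of the trade-off discussed above, not Comet's actual code.
// `start_native_memory_profiling` is a placeholder for the real profiling hook.

// Variant A (gate the whole `if`): on non-Linux targets the flag is never read,
// so rustc warns that `memory_profiling_enabled` is unused.
//
//     #[cfg(target_os = "linux")]
//     if memory_profiling_enabled {
//         start_native_memory_profiling();
//     }

// Variant B (roughly the shape in the diff above): read the flag on every platform
// and gate only the platform-specific body, so there is no unused warning.
fn execute_plan(memory_profiling_enabled: bool) {
    if memory_profiling_enabled {
        #[cfg(target_os = "linux")]
        start_native_memory_profiling();

        #[cfg(not(target_os = "linux"))]
        eprintln!("memory profiling is only supported on Linux");
    }
}

#[cfg(target_os = "linux")]
fn start_native_memory_profiling() {
    // placeholder: hook up the real profiler here
}

fn main() {
    execute_plan(true);
}
```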
Re: [PR] feat: decode() expression when using 'utf-8' encoding [datafusion-comet]
mbutrovich commented on PR #1697: URL: https://github.com/apache/datafusion-comet/pull/1697#issuecomment-2845174358 It's interesting to me that `native_comet` (CometScan) fails, but `native_datafusion` (DataSourceExec) succeeds. I wonder if DataSourceExec is already unpacking dictionaries on read.
Re: [PR] feat: Add memory profiling [datafusion-comet]
kazuyukitanimura commented on code in PR #1702: URL: https://github.com/apache/datafusion-comet/pull/1702#discussion_r2070493011 ## spark/src/main/scala/org/apache/comet/CometExecIterator.scala: ## @@ -130,6 +134,21 @@ class CometExecIterator( def getNextBatch(): Option[ColumnarBatch] = { assert(partitionIndex >= 0 && partitionIndex < numParts) +if (memoryProfilingEnabled) { + val memoryMXBean = ManagementFactory.getMemoryMXBean + val heap = memoryMXBean.getHeapMemoryUsage + val nonHeap = memoryMXBean.getNonHeapMemoryUsage + + def mb(n: Long) = n / 1024 / 1024 + + // scalastyle:off println + println( +"JVM_MEMORY: { " + + s"heapUsed: ${mb(heap.getUsed)}, heapCommitted: ${mb(heap.getCommitted)}, " + + s"nonHeapUsed: ${mb(nonHeap.getUsed)}, nonHeapCommitted: ${mb(nonHeap.getCommitted)} " + + "}") Review Comment: Perhaps `logInfo` ? If println is preferred, we may have to add back `// scalastyle:on println`
Re: [PR] Added SQL Example for `Aggregate Functions` [datafusion]
Adez017 commented on PR #15778: URL: https://github.com/apache/datafusion/pull/15778#issuecomment-2844743481

> This PR appears to have no changes [screenshot]
>
> I may be missing something 🤔

I actually closed this PR accidentally and afterwards I reopened it 😁
Re: [PR] refactor: replace `unwrap_or` with `unwrap_or_else` for improved lazy… [datafusion]
NevroHelios commented on code in PR #15841: URL: https://github.com/apache/datafusion/pull/15841#discussion_r2069898066 ## datafusion/common/src/column.rs: ## @@ -130,8 +130,8 @@ impl Column { /// where `"foo.BAR"` would be parsed to a reference to column named `foo.BAR` pub fn from_qualified_name(flat_name: impl Into) -> Self { let flat_name = flat_name.into(); -Self::from_idents(parse_identifiers_normalized(&flat_name, false)).unwrap_or( -Self { +Self::from_idents(parse_identifiers_normalized(&flat_name, false)).unwrap_or_else( Review Comment: thanks!
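For readers wondering why this swap matters: `unwrap_or` evaluates its fallback eagerly even when the value is present, while `unwrap_or_else` takes a closure that only runs on the `None`/`Err` path. A tiny standalone example, unrelated to the DataFusion code above:

```rust
// Why the swap matters: `unwrap_or` builds its fallback even when the value is
// already `Some`/`Ok`; `unwrap_or_else` defers that work behind a closure.
fn expensive_default() -> String {
    println!("building fallback...");
    "fallback".repeat(1000)
}

fn main() {
    let present: Option<String> = Some("value".to_string());

    // Eager: `expensive_default()` runs even though the Option is Some.
    let _a = present.clone().unwrap_or(expensive_default());

    // Lazy: the closure never runs here, so the extra allocation is skipped.
    let _b = present.unwrap_or_else(expensive_default);
}
```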
Re: [PR] feat: Add memory profiling [datafusion-comet]
codecov-commenter commented on PR #1702: URL: https://github.com/apache/datafusion-comet/pull/1702#issuecomment-2845028742

## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1702?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report

Attention: Patch coverage is `42.85714%` with `8 lines` in your changes missing coverage. Please review.

> Project coverage is 57.31%. Comparing base [(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`2c17a71`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/2c17a7129ea62918b8fc9d82ab3a36fc9be64993?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
> Report is 174 commits behind head on main.

| [Files with missing lines](https://app.codecov.io/gh/apache/datafusion-comet/pull/1702?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Patch % | Lines |
|---|---|---|
| [...ain/scala/org/apache/comet/CometExecIterator.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/1702?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2FCometExecIterator.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9Db21ldEV4ZWNJdGVyYXRvci5zY2FsYQ==) | 27.27% | [7 Missing and 1 partial :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/1702?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |

Additional details and impacted files

```diff
@@             Coverage Diff              @@
##               main    #1702      +/-   ##
+ Coverage     56.12%   57.31%    +1.18%
- Complexity      976     1073      +97
  Files           119      129      +10
  Lines         11743    12496     +753
  Branches       2251     2341      +90
+ Hits           6591     7162     +571
- Misses         4012     4153     +141
- Partials       1140     1181      +41
```

[:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1702?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).

:loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).

:rocket: New features to boost your workflow:
- :snowflake: [Test Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, report on failures, and find test suite problems.
Re: [PR] feat: More warning info for users [datafusion-comet]
andygrove merged PR #1667: URL: https://github.com/apache/datafusion-comet/pull/1667
Re: [PR] feat: regexp_replace() expression with no starting offset [datafusion-comet]
andygrove commented on code in PR #1700: URL: https://github.com/apache/datafusion-comet/pull/1700#discussion_r2070408978 ## spark/src/test/scala/org/apache/comet/CometFuzzTestSuite.scala: ## @@ -188,6 +188,22 @@ class CometFuzzTestSuite extends CometTestBase with AdaptiveSparkPlanHelper { } } + test("regexp_replace") { +withSQLConf(CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key -> "true") { + val df = spark.read.parquet(filename) + df.createOrReplaceTempView("t1") + // We want to make sure that the schema generator wasn't modified to accidentally omit + // StringType, since then this test would not run any queries and silently pass. + var testedString = false + for (field <- df.schema.fields if field.dataType == StringType) { +testedString = true +val sql = s"SELECT regexp_replace(${field.name}, 'a', 'b') FROM t1"

Review Comment: I'm not sure how likely it is that the input strings will contain the character `a`, so it is possible that this test isn't really testing anything. I modified the test locally to add:

```
val sql = s"SELECT count(*) FROM t1 WHERE ${field.name} LIKE '%a%'"
spark.sql(sql).show()
```

this produced

```
++
|count(1)|
++
| 0|
++
```

This is a general problem with just comparing results between Spark and Comet and it would be nice to have a way to also ensure that we see expected results.
Re: [PR] Implement Parquet filter pushdown via new filter pushdown APIs [datafusion]
adriangb commented on code in PR #15769: URL: https://github.com/apache/datafusion/pull/15769#discussion_r2052746932 ## datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt: ## @@ -81,11 +81,15 @@ EXPLAIN select a from t_pushdown where b > 2 ORDER BY a; logical_plan 01)Sort: t_pushdown.a ASC NULLS LAST -02)--TableScan: t_pushdown projection=[a], full_filters=[t_pushdown.b > Int32(2)] +02)--Projection: t_pushdown.a +03)Filter: t_pushdown.b > Int32(2) +04)--TableScan: t_pushdown projection=[a, b], partial_filters=[t_pushdown.b > Int32(2)] physical_plan 01)SortPreservingMergeExec: [a@0 ASC NULLS LAST] 02)--SortExec: expr=[a@0 ASC NULLS LAST], preserve_partitioning=[true] -03)DataSourceExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/2.parquet]]}, projection=[a], file_type=parquet, predicate=b@1 > 2, pruning_predicate=b_null_count@1 != row_count@2 AND b_max@0 > 2, required_guarantees=[] +03)CoalesceBatchesExec: target_batch_size=8192 +04)--RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=2 +05)DataSourceExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/2.parquet]]}, projection=[a], file_type=parquet, predicate=b@1 > 2 AND b@1 > 2

Review Comment:

> PushdownFilter should probably work before distribution&order satisfiers

That makes sense to me. It does more "invasive" re-arranging of plans than those do.
Re: [PR] refactor filter pushdown apis [datafusion]
adriangb commented on code in PR #15801: URL: https://github.com/apache/datafusion/pull/15801#discussion_r2070247089 ## datafusion/physical-plan/src/execution_plan.rs: ## @@ -471,39 +471,53 @@ pub trait ExecutionPlan: Debug + DisplayAs + Send + Sync { Ok(None) } -/// Attempts to recursively push given filters from the top of the tree into leafs. -/// -/// This is used for various optimizations, such as: -/// -/// * Pushing down filters into scans in general to minimize the amount of data that needs to be materialzied. -/// * Pushing down dynamic filters from operators like TopK and Joins into scans. -/// -/// Generally the further down (closer to leaf nodes) that filters can be pushed, the better. -/// -/// Consider the case of a query such as `SELECT * FROM t WHERE a = 1 AND b = 2`. -/// With no filter pushdown the scan needs to read and materialize all the data from `t` and then filter based on `a` and `b`. -/// With filter pushdown into the scan it can first read only `a`, then `b` and keep track of -/// which rows match the filter. -/// Then only for rows that match the filter does it have to materialize the rest of the columns. -/// -/// # Default Implementation -/// -/// The default implementation assumes: -/// * Parent filters can't be passed onto children. -/// * This node has no filters to contribute. -/// -/// # Implementation Notes -/// -/// Most of the actual logic is implemented as a Physical Optimizer rule. -/// See [`PushdownFilter`] for more details. -/// -/// [`PushdownFilter`]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/filter_pushdown/struct.PushdownFilter.html -fn try_pushdown_filters(

Review Comment: Agreed I'd prefer a single method as well - but as per https://github.com/apache/datafusion/pull/15770#discussion_r2052311257 we probably would have had to add a new method to the existing implementation anyway. I don't see a way to have a single method without making it absurdly convoluted just to avoid 2 methods.

I'll also point out that there are multiple dimensions of filter pushdown for each node:
- does the node allow parent filters to be pushed down to its children?
- does the node want to add any filters to be passed to its children?
- does the node need to handle the result of pushdown?

Having two methods makes it easy to have canned implementations for each one of these independently and combine/compose them. E.g.:
- TopK or HashJoin will override `gather_filters_for_pushdown` but not `handle_child_pushdown_result`
- FilterExec will override both
- DataSourceExec will override only `handle_child_pushdown_result`
- ProjectionExec will override only `gather_filters_for_pushdown`
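A simplified, self-contained sketch of the two-method composition being argued for here; the trait, types, and semantics below are illustrative stand-ins rather than DataFusion's actual `ExecutionPlan` API, and only the two method names come from the discussion above.

```rust
// Self-contained sketch of the "two methods with defaults" argument.
// Everything here is a stand-in; only the two method names come from the thread.

#[derive(Clone, Debug)]
struct Filter(String); // stand-in for a physical filter expression

#[derive(Debug)]
enum PushdownOutcome {
    Exact,       // the filter was fully absorbed below this node
    Unsupported, // the parent still has to apply the filter itself
}

trait ExecNode {
    /// Which parent filters may flow to children, plus any filters this node adds.
    /// Default: nothing flows down and nothing is added.
    fn gather_filters_for_pushdown(&self, parent: &[Filter]) -> (Vec<Filter>, Vec<Filter>) {
        let _ = parent;
        (Vec::new(), Vec::new())
    }

    /// React to what happened below. Default: nothing to do.
    fn handle_child_pushdown_result(&mut self, results: &[PushdownOutcome]) {
        let _ = results;
    }
}

/// TopK-like node: contributes a dynamic filter, never needs to react.
struct TopKNode {
    dynamic_filter: Filter,
}

impl ExecNode for TopKNode {
    fn gather_filters_for_pushdown(&self, parent: &[Filter]) -> (Vec<Filter>, Vec<Filter>) {
        (parent.to_vec(), vec![self.dynamic_filter.clone()])
    }
}

/// Scan-like node: only needs to react to the final outcome.
struct ScanNode {
    all_exact: bool,
}

impl ExecNode for ScanNode {
    fn handle_child_pushdown_result(&mut self, results: &[PushdownOutcome]) {
        self.all_exact = results.iter().all(|r| matches!(r, PushdownOutcome::Exact));
    }
}

fn main() {
    let topk = TopKNode { dynamic_filter: Filter("a < 10".into()) };
    let (forwarded, added) = topk.gather_filters_for_pushdown(&[Filter("b = 2".into())]);
    println!("forwarded={forwarded:?} added={added:?}");

    let mut scan = ScanNode { all_exact: false };
    scan.handle_child_pushdown_result(&[PushdownOutcome::Exact]);
    assert!(scan.all_exact);
    scan.handle_child_pushdown_result(&[PushdownOutcome::Exact, PushdownOutcome::Unsupported]);
    assert!(!scan.all_exact);
}
```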
Re: [PR] feat: Add memory profiling [datafusion-comet]
andygrove commented on PR #1702: URL: https://github.com/apache/datafusion-comet/pull/1702#issuecomment-2844943324 @mbutrovich I'd like to get your feedback on this PR.
Re: [PR] refactor: replace `unwrap_or` with `unwrap_or_else` for improved lazy… [datafusion]
NevroHelios commented on PR #15841: URL: https://github.com/apache/datafusion/pull/15841#issuecomment-2844942814 I pushed the updates. Could you please run the CI again? @alamb
Re: [PR] Implement Parquet filter pushdown via new filter pushdown APIs [datafusion]
adriangb commented on code in PR #15769: URL: https://github.com/apache/datafusion/pull/15769#discussion_r2070335984 ## datafusion/datasource-parquet/src/source.rs: ## @@ -559,25 +549,8 @@ impl FileSource for ParquetSource { .predicate() .map(|p| format!(", predicate={p}")) .unwrap_or_default(); -let pruning_predicate_string = self -.pruning_predicate -.as_ref() -.map(|pre| { -let mut guarantees = pre -.literal_guarantees() -.iter() -.map(|item| format!("{}", item)) -.collect_vec(); -guarantees.sort(); -format!( -", pruning_predicate={}, required_guarantees=[{}]", -pre.predicate_expr(), -guarantees.join(", ") -) -}) -.unwrap_or_default(); Review Comment: Hmm okay. I'll see if I can make it happen...
Re: [PR] Feat: support bit_count function [datafusion-comet]
kazantsev-maksim commented on PR #1602: URL: https://github.com/apache/datafusion-comet/pull/1602#issuecomment-2844940964 @mbutrovich I couldn't find a built-in implementation of bit_count in the DataFusion project, but I rewrote it using scalarFunc without adding a new proto expr.
Re: [PR] Add `CREATE TRIGGER` support for SQL Server [datafusion-sqlparser-rs]
aharpervc commented on code in PR #1810: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1810#discussion_r2070337010 ## src/dialect/mssql.rs: ## @@ -215,6 +225,59 @@ impl MsSqlDialect { })) } +/// Parse `CREATE TRIGGER` for [MsSql] +/// +/// [MsSql]: https://learn.microsoft.com/en-us/sql/t-sql/statements/create-trigger-transact-sql +fn parse_create_trigger( +&self, +parser: &mut Parser, +or_alter: bool, +) -> Result { +let name = parser.parse_object_name(false)?; +parser.expect_keyword_is(Keyword::ON)?; +let table_name = parser.parse_object_name(false)?; +let period = parser.parse_trigger_period()?; +let events = parser.parse_comma_separated(Parser::parse_trigger_event)?; + +parser.expect_keyword_is(Keyword::AS)?; + +let trigger_statements_body = if parser.peek_keyword(Keyword::BEGIN) { Review Comment: Similar approach in the other PR for stored procedures that may/not have begin/end tokens https://github.com/apache/datafusion-sqlparser-rs/pull/1834/files#r2069458988
Re: [PR] refactor filter pushdown apis [datafusion]
adriangb commented on code in PR #15801: URL: https://github.com/apache/datafusion/pull/15801#discussion_r2070340727 ## datafusion/core/tests/physical_optimizer/push_down_filter.rs: ## @@ -154,29 +153,25 @@ impl FileSource for TestSource { fn try_pushdown_filters( &self, -mut fd: FilterDescription, +filters: &[Arc], config: &ConfigOptions, -) -> Result>> { +) -> Result>> { +let mut all_filters = filters.iter().map(Arc::clone).collect::>(); Review Comment: Wouldn't the same be true if it was just `Vec>`? I tried implementing a struct with some helper methods but the thing is that for `ExecutionPlan` we have a slightly different flow than `DataSource` and `FileSource` (which are their own traits and hence need their own method) so re-using the struct becomes a bit wonky.
Re: [PR] feat: ORDER BY ALL [datafusion]
PokIsemaine commented on code in PR #15772: URL: https://github.com/apache/datafusion/pull/15772#discussion_r2070204781 ## datafusion/expr/src/expr.rs: ## @@ -701,6 +701,24 @@ impl TryCast { } } +/// OrderBy Expressions +pub enum OrderByExprs { +OrderByExprVec(Vec), +All { +exprs: Vec, +options: OrderByOptions, +}, +} Review Comment: Okay, I'll try to move it this weekend
Re: [PR] chore(deps): bump assert_cmd from 2.0.16 to 2.0.17 [datafusion]
xudong963 merged PR #15909: URL: https://github.com/apache/datafusion/pull/15909
Re: [PR] feat: regexp_replace() expression with no starting offset [datafusion-comet]
mbutrovich commented on code in PR #1700: URL: https://github.com/apache/datafusion-comet/pull/1700#discussion_r2070460938 ## spark/src/test/scala/org/apache/comet/CometFuzzTestSuite.scala: ## @@ -188,6 +188,22 @@ class CometFuzzTestSuite extends CometTestBase with AdaptiveSparkPlanHelper { } } + test("regexp_replace") { +withSQLConf(CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key -> "true") { + val df = spark.read.parquet(filename) + df.createOrReplaceTempView("t1") + // We want to make sure that the schema generator wasn't modified to accidentally omit + // StringType, since then this test would not run any queries and silently pass. + var testedString = false + for (field <- df.schema.fields if field.dataType == StringType) { +testedString = true +val sql = s"SELECT regexp_replace(${field.name}, 'a', 'b') FROM t1" Review Comment: Makes sense. I find myself expanding the generator for #1697 to create dictionaries, so let me sync up with you for good solutions to expand the schemas and values in specific types.
Re: [PR] ci: require approving review [datafusion-python]
timsaucer merged PR #1122: URL: https://github.com/apache/datafusion-python/pull/1122
Re: [PR] feat: Add memory profiling [datafusion-comet]
comphead commented on code in PR #1702: URL: https://github.com/apache/datafusion-comet/pull/1702#discussion_r2070464759 ## native/core/src/execution/jni_api.rs: ## @@ -359,6 +365,21 @@ pub unsafe extern "system" fn Java_org_apache_comet_Native_executePlan( // Retrieve the query let exec_context = get_execution_context(exec_context); +// memory profiling is only available on linux +if exec_context.memory_profiling_enabled { +#[cfg(target_os = "linux")] Review Comment: should the if statement be inside cfg?
Re: [PR] Update extending-operators.md [datafusion]
xudong963 commented on PR #15832: URL: https://github.com/apache/datafusion/pull/15832#issuecomment-2844855985

> > > > You can rebase with main
> > >
> > > does this solve the issue?
> >
> > You can open the failed CI and see what's wrong:
> > ```
> > error[E0599]: no method named `unwrap` found for struct `CoalescePartitionsExec` in the current scope
> > --> datafusion/physical-optimizer/src/enforce_sorting/replace_with_order_preserving_variants.rs:210:14
> > |
> > 208 | let coalesce = CoalescePartitionsExec::new(child)
> > | -
> > 209 | | .with_fetch(plan.fetch())
> > 210 | | .unwrap();
> > | | -^^ method not found in `CoalescePartitionsExec`
> > | |_|
> > |
> >
> > warning: unused import: `ExecutionPlan`
> > --> datafusion/physical-optimizer/src/enforce_sorting/replace_with_order_preserving_variants.rs:37:32
> > |
> > 37 | use datafusion_physical_plan::{ExecutionPlan, ExecutionPlanProperties};
> > |^
> > |
> > = note: `#[warn(unused_imports)]` on by default
> > ```
> >
> > The error is fixed in main, so rebasing your branch with main will fix the error
>
> Thanks for your help @xudong963 but I think it didn't work for the failing workflow

Sorry, I am on vacation; you can try to fix it by following the error hints in CI
Re: [PR] Allow stored procedures to be defined without `BEGIN`/`END` [datafusion-sqlparser-rs]
aharpervc commented on code in PR #1834: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1834#discussion_r2070278308

## src/ast/mod.rs:

```diff
@@ -3744,7 +3750,7 @@ pub enum Statement {
     or_alter: bool,
     name: ObjectName,
     params: Option>,
-    body: Vec,
+    body: BeginEndStatements,
```

Review Comment:
Is it an abuse of BeginEndStatements to potentially have the begin/end token be empty? This could alternately be an enum of BeginEndStatements and Sequence, similar to elsewhere.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Fix typo in introduction.md [datafusion]
tom-mont opened a new pull request, #15910: URL: https://github.com/apache/datafusion/pull/15910 - Fix typo in introduction.md - Remove period from end of bullet point to maintain consistency with other bullet points ## Which issue does this PR close? - No issue: simply noticed a typo when reading documentation ## Rationale for this change Maintain good spelling and consistency in formatting ## What changes are included in this PR? ## Are these changes tested? No tests - just a change to .md file ## Are there any user-facing changes? Changes are user-facing but in a .md file -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] ci: require approving review [datafusion-python]
timsaucer opened a new pull request, #1122: URL: https://github.com/apache/datafusion-python/pull/1122 This is a change to the `.asf.yaml` file that requires an approving review to merge to `main`. I copied the configuration from our upstream `datafusion` repository. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] feat: metadata handling for aggregates and window functions [datafusion]
timsaucer opened a new pull request, #15911: URL: https://github.com/apache/datafusion/pull/15911 ## Which issue does this PR close? Closes https://github.com/apache/datafusion/issues/15902 ## Rationale for this change This change is a follow on to https://github.com/apache/datafusion/pull/15646. With this change we can now handle metadata for both window and aggregate functions. It enables the use of extension types via this metadata handling. ## What changes are included in this PR? Instead of passing `DataType` to aggregates and windows we now pass `Field` in their arguments. `return_type` has been replaced with `return_field` so we can get metadata out of these functions as well. ## Are these changes tested? All existing unit tests pass. New unit tests are added. ## Are there any user-facing changes? Yes, the migration guide contains information on the updates that the user will need to make for their user defined functions. ## TODO before moving to ready to review - [ ] Add unit test that is a window of an aggregate function - [ ] Verify migration guide covers both aggregate and window functions - [ ] Find remaining places where the function name doesn't match the arguments (ie: `fn return_type()` that returns a `Field`) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Wrong query results for filters that involve partition columns and data file columns [datafusion]
adriangb opened a new issue, #15912: URL: https://github.com/apache/datafusion/issues/15912

### Describe the bug

Filters such as `partition_col = col_from_file` are never applied if `datafusion.execution.parquet.pushdown_filters = true`

### To Reproduce

With `datafusion-cli`:

```sql
COPY (
  SELECT arrow_cast('a', 'Utf8') AS val
) TO 'test_files/scratch/test/part=a/123.parquet'
STORED AS PARQUET;

COPY (
  SELECT arrow_cast('b', 'Utf8') AS val
) TO 'test_files/scratch/test/part=b/123.parquet'
STORED AS PARQUET;

COPY (
  SELECT arrow_cast('xyz', 'Utf8') AS val
) TO 'test_files/scratch/test/part=c/123.parquet'
STORED AS PARQUET;

set datafusion.execution.parquet.pushdown_filters = true;

CREATE EXTERNAL TABLE test(part text, val text)
STORED AS PARQUET
PARTITIONED BY (part)
LOCATION 'test_files/scratch/test/';

SELECT * FROM test;

explain analyze select * from test where part != val;
```

```
> select * from test where part != val;
+-----+------+
| val | part |
+-----+------+
| a   | a    |
| xyz | c    |
| b   | b    |
+-----+------+
3 row(s) fetched.
```

Which is clearly wrong.

### Expected behavior

```
> select * from test where part != val;
+-----+------+
| val | part |
+-----+------+
| xyz | c    |
+-----+------+
```

### Additional context

_No response_

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] feat: Add memory profiling [datafusion-comet]
andygrove opened a new pull request, #1702: URL: https://github.com/apache/datafusion-comet/pull/1702 ## Which issue does this PR close? Closes https://github.com/apache/datafusion-comet/issues/1701 ## Rationale for this change We need a way to profile memory. ## What changes are included in this PR? ## How are these changes tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Spark SQL test failures in native_iceberg_compat mode [datafusion-comet]
parthchandra commented on issue #1542: URL: https://github.com/apache/datafusion-comet/issues/1542#issuecomment-2845276570

Failure count:

```
Core 1: Tests: succeeded 9113, failed 25, canceled 6, ignored 292, pending 0
Core 2: Tests: succeeded 2636, failed 19, canceled 0, ignored 387, pending 0
Hive 1: Tests: succeeded 2144, failed  4, canceled 4, ignored  40, pending 0
```

Total failures: 48 (down from 176)

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] chore: Remove fast encoding option [datafusion-comet]
andygrove opened a new pull request, #1703: URL: https://github.com/apache/datafusion-comet/pull/1703 ## Which issue does this PR close? N/A ## Rationale for this change Now that we are using a version of Arrow that lets us skip validation on IPC reads, there is little justification in continuing to maintain our "fast encoding" option, since there is no longer any noticeable improvement in performance using this option. ## What changes are included in this PR? ## How are these changes tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [DISCUSSION] DataFusion Road Map: Q3-Q4 2025 [datafusion]
Dandandan commented on issue #15878: URL: https://github.com/apache/datafusion/issues/15878#issuecomment-2845332274

I am currently interested in the following subjects, where I'll probably experiment with some things or help out others.

- [ ] Window Functions (profiling, implementing improvements / optimizations)
  1. https://github.com/apache/datafusion/issues/15607
  2. Possible (logical plan) optimizations - limit pushdown?
- [ ] Sorting performance ideas / helping out, examples:
  1. https://github.com/apache/datafusion/issues/15720
  2. https://github.com/apache/datafusion/pull/15380#pullrequestreview-2810308155
- [ ] Streamlining arrow-rs kernels in terms of speed, consistency, reduced use of unsafe, etc.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Improve sqllogictest error reporting [datafusion]
gabotechs commented on PR #15905: URL: https://github.com/apache/datafusion/pull/15905#issuecomment-2844394504 👍 Done! also added some failure indexes to help our eyes parse the different errors quickly -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: regexp_replace() expression with no starting offset [datafusion-comet]
andygrove commented on code in PR #1700: URL: https://github.com/apache/datafusion-comet/pull/1700#discussion_r2070420243

## spark/src/test/scala/org/apache/comet/CometFuzzTestSuite.scala:

```diff
@@ -188,6 +188,22 @@ class CometFuzzTestSuite extends CometTestBase with AdaptiveSparkPlanHelper {
     }
   }

+  test("regexp_replace") {
+    withSQLConf(CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key -> "true") {
+      val df = spark.read.parquet(filename)
+      df.createOrReplaceTempView("t1")
+      // We want to make sure that the schema generator wasn't modified to accidentally omit
+      // StringType, since then this test would not run any queries and silently pass.
+      var testedString = false
+      for (field <- df.schema.fields if field.dataType == StringType) {
+        testedString = true
+        val sql = s"SELECT regexp_replace(${field.name}, 'a', 'b') FROM t1"
```

Review Comment:
Perhaps we should improve the logic in `ParquetGenerator` to also generate ASCII strings:

```scala
case DataTypes.StringType =>
  Range(0, numRows).map(_ => {
    r.nextInt(10) match {
      case 0 if options.allowNull => null
      case 1 => r.nextInt().toByte.toString
      case 2 => r.nextLong().toString
      case 3 => r.nextDouble().toString
      case _ => r.nextString(8)
    }
  })
```

The call to `r.nextString(8)` generates strings containing characters in the range 0 through 0xD800, so while it could theoretically generate `a`, it will be rare.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] refactor filter pushdown apis [datafusion]
alamb commented on code in PR #15801: URL: https://github.com/apache/datafusion/pull/15801#discussion_r2070220686

## datafusion/core/tests/physical_optimizer/push_down_filter.rs:

```diff
@@ -154,29 +153,25 @@ impl FileSource for TestSource {
     fn try_pushdown_filters(
         &self,
-        mut fd: FilterDescription,
+        filters: &[Arc<dyn PhysicalExpr>],
```

Review Comment:
I would personally prefer keeping a struct as an argument as it
1. Is easier to document / explain what the structures mean via comments
2. It is clearer in the struct definition what types of operations (e.g. append only vs swap, etc) are done
3. It is easier to add new fields if needed

Basically, something like

```rust
pub struct FilterDescription {
    filters: &[Arc<dyn PhysicalExpr>],
}
```

... This is a preference thing and I can easily see alternative opinions

## datafusion/core/tests/physical_optimizer/push_down_filter.rs:

```diff
@@ -278,7 +273,7 @@ fn test_filter_collapse() {
     - DataSourceExec: file_groups={0 groups: []}, projection=[a, b, c], file_type=test, pushdown_supported=true
   output:
     Ok:
-      - DataSourceExec: file_groups={0 groups: []}, projection=[a, b, c], file_type=test, pushdown_supported=true, predicate=b@1 = bar AND a@0 = foo
+      - DataSourceExec: file_groups={0 groups: []}, projection=[a, b, c], file_type=test, pushdown_supported=true, predicate=a@0 = foo AND b@1 = bar
```

Review Comment:
This actually looks like an improvement to me, as now `a = foo` will be evaluated before `b = bar`, as was done in the input plan. This might be important for short circuiting, perhaps. The prior version of this optimization seems to have reordered them.

## datafusion/physical-plan/src/filter_pushdown.rs:

```diff
@@ -17,79 +17,207 @@
 use std::sync::Arc;

-use crate::ExecutionPlan;
 use datafusion_physical_expr_common::physical_expr::PhysicalExpr;

-#[derive(Clone, Debug)]
-pub struct FilterDescription {
-    /// Expressions coming from the parent nodes
-    pub filters: Vec<Arc<dyn PhysicalExpr>>,
+/// The result of or a plan for pushing down a filter into a child node.
+/// This contains references to filters so that nodes can mutate a filter
+/// before pushing it down to a child node (e.g. to adjust a projection)
+/// or can directly take ownership of `Unsupported` filters that their children
+/// could not handle.
+#[derive(Debug, Clone)]
+pub enum FilterPushdown {
+    Supported(Arc<dyn PhysicalExpr>),
+    Unsupported(Arc<dyn PhysicalExpr>),
 }

-impl Default for FilterDescription {
-    fn default() -> Self {
-        Self::empty()
+/// A thin wrapper around [`FilterPushdown`]s that allows for easy collection of
+/// supported and unsupported filters.
+#[derive(Debug, Clone)]
+pub struct FilterPushdowns(Vec<FilterPushdown>);
+
+impl FilterPushdowns {
+    /// Create a new FilterPushdowns with the given filters and their pushdown status.
+    pub fn new(pushdowns: Vec<FilterPushdown>) -> Self {
+        Self(pushdowns)
+    }
+
+    /// Create a new FilterPushdowns with all filters as supported.
+    pub fn all_supported(filters: &[Arc<dyn PhysicalExpr>]) -> Self {
+        let pushdowns = filters
+            .iter()
+            .map(|f| FilterPushdown::Supported(Arc::clone(f)))
+            .collect();
+        Self::new(pushdowns)
+    }
+
+    /// Create a new FilterPushdowns with all filters as unsupported.
+    pub fn all_unsupported(filters: &[Arc<dyn PhysicalExpr>]) -> Self {
+        let pushdowns = filters
+            .iter()
+            .map(|f| FilterPushdown::Unsupported(Arc::clone(f)))
+            .collect();
+        Self::new(pushdowns)
+    }
+
+    /// Transform all filters to supported, returning a new FilterPushdowns.
+    /// This does not modify the original FilterPushdowns.
+    pub fn as_supported(&self) -> Self {
+        let pushdowns = self
+            .0
+            .iter()
+            .map(|f| match f {
+                FilterPushdown::Supported(expr) => {
+                    FilterPushdown::Supported(Arc::clone(expr))
+                }
+                FilterPushdown::Unsupported(expr) => {
+                    FilterPushdown::Supported(Arc::clone(expr))
+                }
+            })
+            .collect();
+        Self::new(pushdowns)
+    }
+
+    /// Collect unsupported filters into a Vec, without removing them from the original
+    /// FilterPushdowns.
+    pub fn collect_unsupported(&self) -> Vec<Arc<dyn PhysicalExpr>> {
+        self.0
+            .iter()
+            .filter_map(|f| match f {
+                FilterPushdown::Unsupported(expr) => Some(Arc::clone(expr)),
+                FilterPushdown::Supported(_) => None,
+            })
+            .collect()
+    }
+
+    /// Collect all filters as PhysicalExprs into a Vec, without removing them from the original
+    /// FilterPushdowns.
+    pub fn into_inner_filters(&self) -> Vec<Arc<dyn PhysicalExpr>> {
+        self.0
+            .iter()
+            .map(|f| match f {
+                FilterPushdown::Supported(expr) => Arc::clone(expr),
```
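For context, a hedged usage sketch (not part of the PR) of how a source might answer a pushdown request with the `FilterPushdowns` helpers shown in the diff above; the function name is illustrative and the types are assumed from that diff.

```rust
// Hedged usage sketch: a source that can evaluate every pushed-down filter
// reports them all as supported, otherwise the parent keeps a FilterExec.
fn respond_to_pushdown(
    filters: &[Arc<dyn PhysicalExpr>],
    supports_filter_pushdown: bool,
) -> FilterPushdowns {
    if supports_filter_pushdown {
        // e.g. a Parquet source that applies predicates during the scan
        FilterPushdowns::all_supported(filters)
    } else {
        // e.g. a source with no predicate evaluation of its own
        FilterPushdowns::all_unsupported(filters)
    }
}
```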
Re: [PR] Update extending-operators.md [datafusion]
Adez017 commented on PR #15832: URL: https://github.com/apache/datafusion/pull/15832#issuecomment-2844762089

> > > You can rebase with main
> >
> > does this solve the issue?
>
> You can open the failed CI and see what's wrong:
>
> ```
> error[E0599]: no method named `unwrap` found for struct `CoalescePartitionsExec` in the current scope
>    --> datafusion/physical-optimizer/src/enforce_sorting/replace_with_order_preserving_variants.rs:210:14
>     |
> 208 |           let coalesce = CoalescePartitionsExec::new(child)
>     |                          -
> 209 | |             .with_fetch(plan.fetch())
> 210 | |             .unwrap();
>     | |              -^^ method not found in `CoalescePartitionsExec`
>     | |_|
>     |
>
> warning: unused import: `ExecutionPlan`
>   --> datafusion/physical-optimizer/src/enforce_sorting/replace_with_order_preserving_variants.rs:37:32
>    |
> 37 | use datafusion_physical_plan::{ExecutionPlan, ExecutionPlanProperties};
>    |     ^
>    |
>    = note: `#[warn(unused_imports)]` on by default
> ```
>
> The error is fixed in main, so rebasing your branch with main will fix the error

Thanks for your help @xudong963 but I think it didn't work for the failing workflow

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] refactor filter pushdown apis [datafusion]
adriangb commented on code in PR #15801: URL: https://github.com/apache/datafusion/pull/15801#discussion_r2070234428 ## datafusion/core/tests/physical_optimizer/push_down_filter.rs: ## @@ -154,29 +153,25 @@ impl FileSource for TestSource { fn try_pushdown_filters( &self, -mut fd: FilterDescription, +filters: &[Arc], Review Comment: Happy to do that 👍🏻 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] refactor filter pushdown apis [datafusion]
berkaysynnada commented on PR #15801: URL: https://github.com/apache/datafusion/pull/15801#issuecomment-2844916340

Hi @alamb. I have a WIP commit for this PR. I'm trying both to address @adriangb's concerns and needs, and to keep complexity to a minimum by making things similar to other DF patterns. I think I can finish it by the end of today. Thank you for the patience.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] add benchmark code for `Reuse rows in row cursor stream` [datafusion]
acking-you opened a new pull request, #15913: URL: https://github.com/apache/datafusion/pull/15913 ## Which issue does this PR close? - Closes #. ## Rationale for this change You can see:https://github.com/apache/datafusion/issues/15720#issuecomment-2844531606 ## What changes are included in this PR? Add a benchmark to test the optimization effect of reusing Rows. ## Are these changes tested? ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Remove fast encoding option [datafusion-comet]
codecov-commenter commented on PR #1703: URL: https://github.com/apache/datafusion-comet/pull/1703#issuecomment-2845566547

## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1703?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report

All modified and coverable lines are covered by tests :white_check_mark:

> Project coverage is 56.51%. Comparing base [(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`48c4377`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/48c4377592b8c56303c9e7d6dd6f5e4dc3a6fd46?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
> Report is 175 commits behind head on main.

Additional details and impacted files

```diff
@@             Coverage Diff              @@
##              main    #1703      +/-   ##
+ Coverage     56.12%   56.51%    +0.39%
- Complexity      976     1065       +89
  Files           119      129       +10
  Lines         11743    12582      +839
  Branches       2251     2339       +88
+ Hits           6591     7111      +520
- Misses         4012     4294      +282
- Partials       1140     1177       +37
```

[:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1703?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
:loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).

:rocket: New features to boost your workflow:
- :snowflake: [Test Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, report on failures, and find test suite problems.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Improve sqllogictest error reporting [datafusion]
berkaysynnada commented on code in PR #15905: URL: https://github.com/apache/datafusion/pull/15905#discussion_r2070694736

## datafusion/sqllogictest/bin/sqllogictests.rs:

```diff
@@ -235,14 +235,38 @@ async fn run_test_file(
     runner.with_normalizer(value_normalizer);
     runner.with_validator(validator);

-    let res = runner
-        .run_file_async(path)
-        .await
-        .map_err(|e| DataFusionError::External(Box::new(e)));
+    let path = path.canonicalize()?;
+    let records =
+        parse_file(&path).map_err(|e| DataFusionError::External(Box::new(e)))?;
+    let mut errs = vec![];
+    for record in records.into_iter() {
+        if let Record::Halt { .. } = record {
+            break;
+        }
+        if let Err(err) = runner.run_async(record).await {
+            errs.push(format!("{err}"));
+        }
+    }

     pb.finish_and_clear();

-    res
+    if errs.is_empty() {
+        return Ok(());
+    }
+    const ERR_LIMIT: usize = 10;
```

Review Comment:
Better to define it at the top, where the other constants are defined?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Remove fast encoding option [datafusion-comet]
andygrove commented on PR #1703: URL: https://github.com/apache/datafusion-comet/pull/1703#issuecomment-2845639468

> I love a good negative line count. Thanks @andygrove! Do we anticipate any performance changes?

Based on local testing, TPC-H performance with fast encoding took 276s compared to 278s with Arrow IPC, and the performance does vary slightly on each run, so this could just be noise.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: regenerate builtin functions coverage [datafusion-comet]
comphead commented on PR #1698: URL: https://github.com/apache/datafusion-comet/pull/1698#issuecomment-2845671771

> Interesting we getting regressions for some string functions like
>
> * initcap
> * lower
> * upper
> * abs
> ...
>
> @mbutrovich ^^

Might be related to this param

```scala
val COMET_CASE_CONVERSION_ENABLED: ConfigEntry[Boolean] =
  conf("spark.comet.caseConversion.enabled")
    .doc(
      "Java uses locale-specific rules when converting strings to upper or lower case and " +
        "Rust does not, so we disable upper and lower by default.")
    .booleanConf
    .createWithDefault(false)
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: metadata columns [datafusion]
alamb commented on PR #14057: URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2845795660

Thank you @chenkovsky and @adriangb and everyone else who worked on this PR. I think the idea of additional metadata columns for the ListingTable provider is super valuable, but it seems like this particular PR adds more complexity than we are willing to handle in the core.

I wonder if we could/should create a more full featured equivalent of ListingTable in another repo / project -- I think there are many interesting ideas / potential things to add -- for example
- https://github.com/apache/datafusion/issues/15892
- https://github.com/apache/datafusion/pull/15181 from @phillipleblanc

I apologize for letting it sit for so long. This is non ideal to be sure.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Support marking columns as system columns via Field's metadata [datafusion]
alamb commented on PR #14362: URL: https://github.com/apache/datafusion/pull/14362#issuecomment-2845809583 See discussion on https://github.com/apache/datafusion/issues/15173 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add `FormatOptions` to Config [datafusion]
Omega359 commented on code in PR #15793: URL: https://github.com/apache/datafusion/pull/15793#discussion_r2070825920 ## datafusion/sqllogictest/test_files/information_schema.slt: ## @@ -372,6 +381,15 @@ datafusion.explain.physical_plan_only false When set to true, the explain statem datafusion.explain.show_schema false When set to true, the explain statement will print schema information datafusion.explain.show_sizes true When set to true, the explain statement will print the partition sizes datafusion.explain.show_statistics false When set to true, the explain statement will print operator statistics for physical plans +datafusion.format.date_format %Y-%m-%d Date format for date arrays +datafusion.format.datetime_format %Y-%m-%dT%H:%M:%S%.f Format for DateTime arrays +datafusion.format.duration_format pretty Duration format. Can be either `"pretty"` or `"ISO8601"` Review Comment: I think case may matter here for ISO8601 vs iso8601 given the impl in config.rs. I would suggest that config would use .to_lowercase() to make it not matter. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
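A minimal sketch of the `.to_lowercase()` suggestion above, assuming a config enum with `Pretty` and `Iso8601` variants; the enum, function name, and error type are illustrative stand-ins rather than the actual types in `config.rs`.

```rust
// Illustrative only: stand-in for whatever config.rs actually uses.
#[derive(Debug, PartialEq)]
enum DurationFormat {
    Pretty,
    Iso8601,
}

fn parse_duration_format(value: &str) -> Result<DurationFormat, String> {
    // Normalize case so "ISO8601", "iso8601", and "Iso8601" are all accepted.
    match value.to_lowercase().as_str() {
        "pretty" => Ok(DurationFormat::Pretty),
        "iso8601" => Ok(DurationFormat::Iso8601),
        other => Err(format!("unknown duration format: {other}")),
    }
}
```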
Re: [PR] Unparse `UNNEST` projection with the table column alias [datafusion]
alamb commented on PR #15879: URL: https://github.com/apache/datafusion/pull/15879#issuecomment-2845811697 Thank you @xudong963 and @goldmedal -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Unparse `UNNEST` projection with the table column alias [datafusion]
alamb merged PR #15879: URL: https://github.com/apache/datafusion/pull/15879 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Generate the common SQL for the unparsing result of the unnest [datafusion]
alamb closed issue #15233: Generate the common SQL for the unparsing result of the unnest URL: https://github.com/apache/datafusion/issues/15233 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] refactor: replace `unwrap_or` with `unwrap_or_else` for improved lazy… [datafusion]
xudong963 commented on code in PR #15841: URL: https://github.com/apache/datafusion/pull/15841#discussion_r2070946442 ## benchmarks/src/util/options.rs: ## @@ -72,16 +72,11 @@ impl CommonOpt { /// Modify the existing config appropriately pub fn update_config(&self, mut config: SessionConfig) -> SessionConfig { if let Some(batch_size) = self.batch_size { -config = config.with_batch_size(batch_size) +config = config.with_batch_size(batch_size); } if let Some(partitions) = self.partitions { -config = config.with_target_partitions(partitions) -} - -if let Some(sort_spill_reservation_bytes) = self.sort_spill_reservation_bytes { -config = - config.with_sort_spill_reservation_bytes(sort_spill_reservation_bytes); Review Comment: Why was the part removed? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] [datafusion-spark] Example of using Spark compatible function library [datafusion]
alamb opened a new issue, #15915: URL: https://github.com/apache/datafusion/issues/15915

### Is your feature request related to a problem or challenge?

- Part of https://github.com/apache/datafusion/issues/15914

@shehabgamin added the `datafusion-spark` crate in https://github.com/apache/datafusion/pull/15168

The idea is that using the functions in this crate you can get a `SessionContext` that executes sql using Spark semantics. However, there is no user facing documentation that shows how to do this.

### Describe the solution you'd like

Add an example somewhere showing how to configure and use the spark functions in a SessionContext. I can help with this.

### Describe alternatives you've considered

I personally suggest adding a new page to the website: https://datafusion.apache.org/

Specifically, I suggest:
1. Add a new page in the "Library User Guide" called "Spark Compatible Functions"
2. Add a preamble explaining what the datafusion-spark crate is (contains a list of spark compatible functions)
3. Add examples

For example we should show how to run sql using a "spark compatible" frame:

```rust
let ctx = SessionContext::new();
datafusion_spark::register_all(&ctx)?;
// TODO run an example SQL query here that uses a function from
// the datafusion spark crate
ctx.sql("select ... ")
// also add an example for DataFrame API
```

In order to run the example code as part of CI, you will have to add an entry such as this:
https://github.com/apache/datafusion/blob/81b4c074886e604277cc303986113ca12f9ac77d/datafusion/core/src/lib.rs#L928-L932
to the [datafusion-spark lib.rs](https://github.com/apache/datafusion/blob/6bda4796a3c8142b87d0fad072bdbe25f9d12934/datafusion/spark/src/lib.rs#L16-L15) file (it can't go in the datafusion/core/lib.rs because the core crate doesn't bring in datafusion-spark)

### Additional context

_No response_

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
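A hedged sketch of what such an example might look like, building on the snippet in the issue above. The exact signature of `datafusion_spark::register_all` and the availability of the `sha2` function used in the query are assumptions, not confirmed API.

```rust
// Hedged sketch only: `register_all` is taken from the issue text above and its
// signature is assumed; `sha2` is a placeholder Spark-compatible function.
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Register the Spark-compatible function library with this context
    datafusion_spark::register_all(&ctx)?;

    // Any registered Spark-compatible function can now be called from SQL
    let df = ctx.sql("SELECT sha2('spark', 256)").await?;
    df.show().await?;
    Ok(())
}
```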
Re: [PR] Add `FormatOptions` to Config [datafusion]
alamb commented on PR #15793: URL: https://github.com/apache/datafusion/pull/15793#issuecomment-2845786865 Are we happy with this PR? Shall we merge it in? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Factor out Substrait consumers into separate files [datafusion]
alamb commented on PR #15794: URL: https://github.com/apache/datafusion/pull/15794#issuecomment-2845786048 Thanks again @vbarua and @gabotechs -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Factor out Substrait consumers into separate files [datafusion]
alamb merged PR #15794: URL: https://github.com/apache/datafusion/pull/15794 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [substrait] refactor consumer.rs [datafusion]
alamb closed issue #13864: [substrait] refactor consumer.rs URL: https://github.com/apache/datafusion/issues/13864 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add `FormatOptions` to Config [datafusion]
blaginin commented on PR #15793: URL: https://github.com/apache/datafusion/pull/15793#issuecomment-2845789235 Asked Bruce to re-review, and then I'm ready to merge 🤗 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Support file row index / row id for each file in a `ListingTableProvider` [datafusion]
acking-you commented on issue #15892: URL: https://github.com/apache/datafusion/issues/15892#issuecomment-2845864345 nice feature👍 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Implement Parquet filter pushdown via new filter pushdown APIs [datafusion]
adriangb commented on code in PR #15769: URL: https://github.com/apache/datafusion/pull/15769#discussion_r2070868393 ## datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt: ## @@ -81,11 +81,15 @@ EXPLAIN select a from t_pushdown where b > 2 ORDER BY a; logical_plan 01)Sort: t_pushdown.a ASC NULLS LAST -02)--TableScan: t_pushdown projection=[a], full_filters=[t_pushdown.b > Int32(2)] +02)--Projection: t_pushdown.a +03)Filter: t_pushdown.b > Int32(2) +04)--TableScan: t_pushdown projection=[a, b], partial_filters=[t_pushdown.b > Int32(2)] physical_plan 01)SortPreservingMergeExec: [a@0 ASC NULLS LAST] 02)--SortExec: expr=[a@0 ASC NULLS LAST], preserve_partitioning=[true] -03)DataSourceExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/2.parquet]]}, projection=[a], file_type=parquet, predicate=b@1 > 2, pruning_predicate=b_null_count@1 != row_count@2 AND b_max@0 > 2, required_guarantees=[] +03)CoalesceBatchesExec: target_batch_size=8192 +04)--RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=2 +05)DataSourceExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/2.parquet]]}, projection=[a], file_type=parquet, predicate=b@1 > 2 AND b@1 > 2 Review Comment: This would also be fixed by https://github.com/apache/datafusion/pull/15801 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Add `datafusion-spark` crate [datafusion]
alamb commented on PR #15168: URL: https://github.com/apache/datafusion/pull/15168#issuecomment-2846025123 I have filed an epic to track filling out the datafusion-spark crate: - https://github.com/apache/datafusion/issues/15914 I will file some subtickets for follow on work as well (e.g. what is in https://github.com/apache/datafusion/pull/15168#pullrequestreview-2801437803) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Add `datafusion-spark` crate [datafusion]
alamb merged PR #15168: URL: https://github.com/apache/datafusion/pull/15168 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] [EPIC] Complete `datafusion-spark` Spark Compatible Functions [datafusion]
alamb opened a new issue, #15914: URL: https://github.com/apache/datafusion/issues/15914

### Is your feature request related to a problem or challenge?

Many DataFusion users are using DataFusion to execute workloads originally developed for Apache Spark. Examples include
- [DataFusion Comet](https://datafusion.apache.org/comet/) (@andygrove @comphead, etc)
- [LakeHQ / Sail](https://github.com/lakehq/sail) (@shehabgamin)
- Various internal pipelines / engines (e.g. that @Omega359 and I think @Blizzara use)

They often do this for superior performance.
* Part of running Spark workloads is emulating Spark semantics
* Emulating Spark semantics requires (among other things) functions compatible with Spark (which differ in semantics from the functions included in DataFusion)

Several projects are in the process of implementing Spark compatible function libraries using DataFusion's extension APIs. However, we concluded in https://github.com/apache/datafusion/issues/5600 that we could join forces and maintain a Spark compatible function library in the core datafusion repo. @shehabgamin has implemented the first step in https://github.com/apache/datafusion/pull/15168 🙏

### Describe the solution you'd like

This ticket tracks "completing" the spark function library started in https://github.com/apache/datafusion/pull/15168

### Describe alternatives you've considered

Related Issues
- [ ] https://github.com/apache/datafusion/issues/5600
- [ ] https://github.com/apache/datafusion/pull/15168

### Additional context

_No response_

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Add `datafusion-spark` crate [datafusion]
alamb commented on PR #15168: URL: https://github.com/apache/datafusion/pull/15168#issuecomment-2846025618 Onward! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [DISCUSSION] Add separate crate to cover spark builtin functions [datafusion]
alamb closed issue #5600: [DISCUSSION] Add separate crate to cover spark builtin functions URL: https://github.com/apache/datafusion/issues/5600 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [DISCUSSION] Add separate crate to cover spark builtin functions [datafusion]
alamb commented on issue #5600: URL: https://github.com/apache/datafusion/issues/5600#issuecomment-2846025987 We are tracking follow on work in - https://github.com/apache/datafusion/issues/15914 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Fix CI in main [datafusion]
blaginin opened a new pull request, #15917: URL: https://github.com/apache/datafusion/pull/15917 ## Which issue does this PR close? Fixes CI after https://github.com/apache/datafusion/pull/15168 ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested? ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add `CREATE TRIGGER` support for SQL Server [datafusion-sqlparser-rs]
iffyio commented on code in PR #1810: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1810#discussion_r2071038220

## src/dialect/mssql.rs:

```diff
@@ -215,6 +225,59 @@ impl MsSqlDialect {
         }))
     }

+    /// Parse `CREATE TRIGGER` for [MsSql]
+    ///
+    /// [MsSql]: https://learn.microsoft.com/en-us/sql/t-sql/statements/create-trigger-transact-sql
+    fn parse_create_trigger(
+        &self,
+        parser: &mut Parser,
+        or_alter: bool,
+    ) -> Result {
+        let name = parser.parse_object_name(false)?;
+        parser.expect_keyword_is(Keyword::ON)?;
+        let table_name = parser.parse_object_name(false)?;
+        let period = parser.parse_trigger_period()?;
+        let events = parser.parse_comma_separated(Parser::parse_trigger_event)?;
+
+        parser.expect_keyword_is(Keyword::AS)?;
+
+        let trigger_statements_body = if parser.peek_keyword(Keyword::BEGIN) {
+            let begin_token = parser.expect_keyword(Keyword::BEGIN)?;
+            let statements = parser.parse_statement_list(&[Keyword::END])?;
+            let end_token = parser.expect_keyword(Keyword::END)?;
+
+            BeginEndStatements {
+                begin_token: AttachedToken(begin_token),
+                statements,
+                end_token: AttachedToken(end_token),
+            }
+        } else {
+            BeginEndStatements {
+                begin_token: AttachedToken::empty(),
+                statements: parser.parse_statements()?,
+                end_token: AttachedToken::empty(),
+            }
```

Review Comment:
Oh I think instead of representing the begin optionality using tokens, we can use an enum, something similar to [ConditionalStatements](https://github.com/apache/datafusion-sqlparser-rs/blob/main/src/ast/mod.rs#L2294-L2299) where we have an explicit `BeginEnd` enum variant

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
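A rough sketch of the enum shape suggested above; `Statement` and `BeginEndStatements` are stand-in types for the real sqlparser-rs AST, so this is illustrative rather than the actual change.

```rust
// Stand-ins for the real sqlparser-rs AST types; illustrative only.
struct Statement;
struct BeginEndStatements {
    statements: Vec<Statement>,
}

/// One explicit variant per shape, instead of a BeginEndStatements with empty
/// begin/end tokens (mirrors the existing `ConditionalStatements` pattern).
enum TriggerBody {
    /// `AS BEGIN ... END`
    BeginEnd(BeginEndStatements),
    /// `AS <statement>; <statement>; ...` with no BEGIN/END wrapper
    Sequence { statements: Vec<Statement> },
}
```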
Re: [I] Support metadata columns (`location`, `size`, `last_modified`) in `ListingTableProvider` [datafusion]
alamb commented on issue #15173: URL: https://github.com/apache/datafusion/issues/15173#issuecomment-2845807026

There is another request here that is related, I think (basically "row number within the file"):
- https://github.com/apache/datafusion/issues/15892

It seems to me what has happened is that the existing ListingTableProvider in DataFusion has to walk a fine balance between being relatively "simple" and offering additional functionality like this, which, while widely useful, I think could also be implemented outside the core with the existing APIs.

As we add more features to ListingTable it will look more and more like a full featured table format, but there are already many implementations of such formats (for example Iceberg, Delta, etc).

Thus I think it might be good to propose creating a `datafusion-contrib` project with a more full featured "TableProvider" that adds new features like:
1. Additional metadata columns
2. https://github.com/apache/datafusion/issues/15892
3. etc

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] metadata column support [datafusion]
alamb commented on issue #13975: URL: https://github.com/apache/datafusion/issues/13975#issuecomment-2845808404 I think this is duplicated by https://github.com/apache/datafusion/issues/15173 so let's continue the discussion there -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] metadata column support [datafusion]
alamb closed issue #13975: metadata column support URL: https://github.com/apache/datafusion/issues/13975 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add `FormatOptions` to Config [datafusion]
blaginin commented on code in PR #15793: URL: https://github.com/apache/datafusion/pull/15793#discussion_r2070830060 ## datafusion/sqllogictest/test_files/information_schema.slt: ## @@ -372,6 +381,15 @@ datafusion.explain.physical_plan_only false When set to true, the explain statem datafusion.explain.show_schema false When set to true, the explain statement will print schema information datafusion.explain.show_sizes true When set to true, the explain statement will print the partition sizes datafusion.explain.show_statistics false When set to true, the explain statement will print operator statistics for physical plans +datafusion.format.date_format %Y-%m-%d Date format for date arrays +datafusion.format.datetime_format %Y-%m-%dT%H:%M:%S%.f Format for DateTime arrays +datafusion.format.duration_format pretty Duration format. Can be either `"pretty"` or `"ISO8601"` Review Comment: great catch! fixed in https://github.com/apache/datafusion/pull/15793/commits/83748836c64aecdd7d63dc63e48b1fc1bb4efc69 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add `FormatOptions` to Config [datafusion]
Omega359 commented on PR #15793: URL: https://github.com/apache/datafusion/pull/15793#issuecomment-2845818047 Beyond the config try_into looking like it's case sensitive for duration_format I think this is good. There is a breaking change with the rename of FormatOptions to OutputFormat that should be noted in the upgrade guide for the release this goes into. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Add support for `DENY` statements [datafusion-sqlparser-rs]
aharpervc opened a new pull request, #1836: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1836 This is another statement for SQL Server: https://learn.microsoft.com/en-us/sql/t-sql/statements/deny-transact-sql, but implemented in a common way. Similar to GRANT & REVOKE, so using a lot of those patterns with a similar test. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Implement Parquet filter pushdown via new filter pushdown APIs [datafusion]
adriangb commented on code in PR #15769: URL: https://github.com/apache/datafusion/pull/15769#discussion_r2070868393

## datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt: ##

```diff
@@ -81,11 +81,15 @@ EXPLAIN select a from t_pushdown where b > 2 ORDER BY a;
 logical_plan
 01)Sort: t_pushdown.a ASC NULLS LAST
-02)--TableScan: t_pushdown projection=[a], full_filters=[t_pushdown.b > Int32(2)]
+02)--Projection: t_pushdown.a
+03)Filter: t_pushdown.b > Int32(2)
+04)--TableScan: t_pushdown projection=[a, b], partial_filters=[t_pushdown.b > Int32(2)]
 physical_plan
 01)SortPreservingMergeExec: [a@0 ASC NULLS LAST]
 02)--SortExec: expr=[a@0 ASC NULLS LAST], preserve_partitioning=[true]
-03)DataSourceExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/2.parquet]]}, projection=[a], file_type=parquet, predicate=b@1 > 2, pruning_predicate=b_null_count@1 != row_count@2 AND b_max@0 > 2, required_guarantees=[]
+03)CoalesceBatchesExec: target_batch_size=8192
+04)--RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=2
+05)DataSourceExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/2.parquet]]}, projection=[a], file_type=parquet, predicate=b@1 > 2 AND b@1 > 2
```

Review Comment:
   This would also be fixed by https://github.com/apache/datafusion/pull/15801

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
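For context, here is a minimal sketch of a session where parquet filter pushdown is enabled so a predicate like `b > 2` is evaluated during the scan. The table name and file path are placeholders, and the config key shown is the existing `datafusion.execution.parquet.pushdown_filters` option rather than anything introduced by PR #15769:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Enable pushing filters into the parquet scan.
    let config = SessionConfig::new()
        .set_bool("datafusion.execution.parquet.pushdown_filters", true);
    let ctx = SessionContext::new_with_config(config);

    // Placeholder path; point this at a real parquet file to run the example.
    ctx.register_parquet("t_pushdown", "data/t.parquet", ParquetReadOptions::default())
        .await?;

    // The resulting plan should show the predicate attached to the DataSourceExec.
    ctx.sql("EXPLAIN SELECT a FROM t_pushdown WHERE b > 2 ORDER BY a")
        .await?
        .show()
        .await?;
    Ok(())
}
```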
Re: [PR] Resolved bug in `parse_function_arg` [datafusion-sqlparser-rs]
LucaCappelletti94 commented on code in PR #1826: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1826#discussion_r2071126270

## src/parser/mod.rs: ##

```diff
@@ -5199,13 +5199,20 @@ impl<'a> Parser<'a> {
         // parse: [ argname ] argtype
         let mut name = None;
+        let next_token = self.peek_token();
```

Review Comment:
   The argument named `Int2` (as described in the [issue](https://github.com/apache/datafusion-sqlparser-rs/issues/1825)) is not parsed as `DataType::Custom`, but as a [`DataType::Int2`](https://github.com/apache/datafusion-sqlparser-rs/blob/a464f8e8d7a5057e4e5b8046b0f619acdf7fce74/src/ast/data_type.rs#L150). Analogously, any other argument name that collides with a data type from another SQL engine would be parsed into a type. Now, if I were to convert `DataType::Int2` back to a string, I would get some arbitrary capitalization, which in this case is `INT2`; without the `peek_token`, I am unsure how we can preserve the initial token from being lost.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
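To make the collision concrete, here is an illustrative snippet (the statement text, function body, and dialect are placeholders, not taken from the PR's tests) showing the kind of input where an argument name such as `Int2` gets consumed as a data type rather than kept as a name:

```rust
use sqlparser::dialect::PostgreSqlDialect;
use sqlparser::parser::Parser;

fn main() {
    // "Int2" is intended as an argument *name* here, but it also matches a known
    // data type, so the parser may treat it as DataType::Int2 and lose the
    // original capitalization when the AST is rendered back to SQL.
    let sql = "CREATE FUNCTION f(Int2 INTEGER) RETURNS INTEGER AS 'SELECT 1' LANGUAGE SQL";
    match Parser::parse_sql(&PostgreSqlDialect {}, sql) {
        Ok(ast) => println!("round-trips as: {}", ast[0]),
        Err(e) => println!("parse error: {e}"),
    }
}
```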
Re: [PR] Update extending-operators.md [datafusion]
Adez017 commented on PR #15832: URL: https://github.com/apache/datafusion/pull/15832#issuecomment-2846154397 I think we need @alamb's help now. Could you help? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [datafusion-spark] Example of using Spark compatible function library [datafusion]
Adez017 commented on issue #15915: URL: https://github.com/apache/datafusion/issues/15915#issuecomment-2846193611 Hi @alamb, this sounds interesting; I would love to work on this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [datafusion-spark] Example of using Spark compatible function library [datafusion]
Adez017 commented on issue #15915: URL: https://github.com/apache/datafusion/issues/15915#issuecomment-2846193703 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Resolved bug in `parse_function_arg` [datafusion-sqlparser-rs]
iffyio commented on code in PR #1826: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1826#discussion_r2071042324

## src/parser/mod.rs: ##

```diff
@@ -5199,13 +5199,20 @@ impl<'a> Parser<'a> {
         // parse: [ argname ] argtype
         let mut name = None;
+        let next_token = self.peek_token();
```

Review Comment:
   > The code you proposed does not work as Int2 (or any analogous such type) does not fall in `if let DataType::Custom(n, _) = &data_type {`

   Oh, what did you mean by `Int2` in this example not being parsed as a custom datatype? Do we get back a different type, or does `parse_data_type` fail in that scenario? I think ideally we will want to do without this `self.peek_token()` to avoid the cloning that it involves.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Support some of pipe operators [datafusion-sqlparser-rs]
iffyio merged PR #1759: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1759 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
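For readers wanting to try the newly merged pipe operator support, here is a hedged sketch; whether this particular `|>` chain and dialect combination is among the operators covered by PR #1759 is an assumption, so the example simply reports whatever the parser does:

```rust
use sqlparser::dialect::BigQueryDialect;
use sqlparser::parser::Parser;

fn main() {
    // BigQuery-style pipe syntax chaining WHERE and SELECT (illustrative query).
    let sql = "FROM orders |> WHERE amount > 100 |> SELECT customer_id, amount";
    match Parser::parse_sql(&BigQueryDialect {}, sql) {
        Ok(ast) => println!("parsed: {ast:?}"),
        Err(e) => println!("not supported in this build: {e}"),
    }
}
```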
Re: [PR] [experiment] Generate event log in Chrome tracing format [datafusion-comet]
codecov-commenter commented on PR #1706: URL: https://github.com/apache/datafusion-comet/pull/1706#issuecomment-2846219034

## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1706?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report

All modified and coverable lines are covered by tests :white_check_mark:

> Project coverage is 57.37%. Comparing base [(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`14cd1c5`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/14cd1c5d90b7b1c17a81d23490ac05cd56221251?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
> Report is 175 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##              main    #1706     +/-   ##
+ Coverage     56.12%   57.37%   +1.24%
- Complexity      976     1072      +96
  Files           119      129      +10
  Lines         11743    12515     +772
  Branches       2251     2347      +96
+ Hits           6591     7180     +589
- Misses         4012     4149     +137
- Partials       1140     1186      +46
```

[:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1706?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).

:loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).

:rocket: New features to boost your workflow:
- :snowflake: [Test Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, report on failures, and find test suite problems.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Added support for `DROP DOMAIN` [datafusion-sqlparser-rs]
iffyio merged PR #1828: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1828 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
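With DROP DOMAIN support merged, here is a short usage sketch; the statement text is an arbitrary PostgreSQL-style example, not something taken from the PR's test suite:

```rust
use sqlparser::dialect::PostgreSqlDialect;
use sqlparser::parser::Parser;

fn main() {
    // DROP DOMAIN with the usual optional clauses (illustrative statement).
    let sql = "DROP DOMAIN IF EXISTS us_postal_code CASCADE";
    match Parser::parse_sql(&PostgreSqlDialect {}, sql) {
        Ok(ast) => println!("parsed: {ast:?}"),
        Err(e) => println!("parse error: {e}"),
    }
}
```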