Re: [PR] feat: metadata columns [datafusion]

2025-01-29 Thread via GitHub


chenkovsky commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2621372943

   I also have another approach.
   
   ```rust
   struct ColumnIndex {
       pub index: usize,
       pub is_metadata_column: bool,
   }

   impl From<ColumnIndex> for usize {
       fn from(c: ColumnIndex) -> Self {
           // encode the METADATA_OFFSET logic here, hidden from callers
           if c.is_metadata_column { METADATA_OFFSET + c.index } else { c.index }
       }
   }

   impl From<usize> for ColumnIndex {
       fn from(i: usize) -> Self {
           // ...and decode it here (METADATA_OFFSET as defined in the PR)
           let is_meta = i >= METADATA_OFFSET;
           ColumnIndex { index: if is_meta { i - METADATA_OFFSET } else { i }, is_metadata_column: is_meta }
       }
   }
   ```
   
   Then we can hide the METADATA_OFFSET logic without performance concerns. What do you think about this?
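   
   For example, the conversions would round-trip like this (a hypothetical usage of the sketch above):
   
   ```rust
   // hypothetical round trip through the From/Into conversions sketched above
   let idx: usize = ColumnIndex { index: 3, is_metadata_column: true }.into();
   let back = ColumnIndex::from(idx);
   assert!(back.is_metadata_column && back.index == 3);
   ```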
   





[PR] Fix UNION field nullability tracking [datafusion]

2025-01-29 Thread via GitHub


findepi opened a new pull request, #14356:
URL: https://github.com/apache/datafusion/pull/14356

   This commit fixes two bugs related to UNION handling:
   
   - when constructing a union plan, the nullability of the other union branch
     was ignored, so the resulting field could easily have incorrect nullability
   - when pruning/simplifying projections, the `recompute_schema` function
     contained similar logic, thus losing nullability information even for a
     correctly constructed Union plan node
   
   As a result, other optimizer logic (e.g. `expr_simplifier.rs`) could draw
   incorrect conclusions and thus lead to incorrect query results, as
   demonstrated by the attached SLT test.
   
   - fixes https://github.com/apache/datafusion/issues/14352
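   
   For intuition, the intended rule is that a UNION output field is nullable whenever either input field is; a minimal sketch using arrow's `Field` (an illustration, not the PR's actual code):
   
   ```rust
   use arrow::datatypes::Field;
   
   // a UNION output field must be nullable if either branch's field is nullable
   fn union_field_nullable(lhs: &Field, rhs: &Field) -> bool {
       lhs.is_nullable() || rhs.is_nullable()
   }
   ```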





Re: [PR] Feature: Monotonic Sets [datafusion]

2025-01-29 Thread via GitHub


ozankabak commented on code in PR #14271:
URL: https://github.com/apache/datafusion/pull/14271#discussion_r1933977485


##
datafusion/physical-expr/src/window/standard.rs:
##
@@ -65,33 +65,19 @@ impl StandardWindowExpr {
 &self.expr
 }
 
-/// Adds any equivalent orderings generated by the `self.expr`
-/// to `builder`.
+/// Adds any equivalent orderings generated by the `self.expr` to `builder`.

Review Comment:
   ```suggestion
   /// Adds any equivalent orderings generated by `self.expr` to `builder`.
   ```



##
datafusion/physical-expr/src/window/aggregate.rs:
##
@@ -21,16 +21,17 @@ use std::any::Any;
 use std::ops::Range;
 use std::sync::Arc;
 
-use arrow::array::Array;
-use arrow::record_batch::RecordBatch;
-use arrow::{array::ArrayRef, datatypes::Field};
-
 use crate::aggregate::AggregateFunctionExpr;
+use crate::window::standard::add_new_ordering_expr_with_partition_by;
 use crate::window::window_expr::AggregateWindowExpr;
 use crate::window::{
 PartitionBatches, PartitionWindowAggStates, SlidingAggregateWindowExpr, 
WindowExpr,
 };
-use crate::{reverse_order_bys, PhysicalExpr};
+use crate::{reverse_order_bys, EquivalenceProperties, PhysicalExpr};
+
+use arrow::array::Array;
+use arrow::record_batch::RecordBatch;
+use arrow::{array::ArrayRef, datatypes::Field};
 use datafusion_common::ScalarValue;
 use datafusion_common::{DataFusionError, Result};

Review Comment:
   ```suggestion
   use datafusion_common::{DataFusionError, ScalarValue, Result};
   ```



##
datafusion/physical-expr/src/aggregate.rs:
##
@@ -533,6 +536,29 @@ impl AggregateFunctionExpr {
 pub fn default_value(&self, data_type: &DataType) -> Result<ScalarValue> {
 self.fun.default_value(data_type)
 }
+
+/// Indicates whether the aggregation function is monotonic as a set function. A set
+/// function is monotonically increasing if its value increases as its argument grows
+/// (as a set). Formally, `f` is a monotonically increasing set function if `f(S) >= f(T)`
+/// whenever `S` is a superset of `T`.
+pub fn set_monotonicity(&self) -> AggregateExprSetMonotonicity {
+let field = self.field();
+let data_type = field.data_type();
+self.fun.inner().set_monotonicity(data_type)
+}
+
+/// Returns PhysicalSortExpr based on monotonicity of the function

Review Comment:
   ```suggestion
   /// Returns `PhysicalSortExpr` based on the set monotonicity of the function.
   ```
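
   For intuition, `count` and `max` can only grow as rows are added to the input set, while `avg` can move either way; a hedged sketch with hypothetical names (not the PR's actual enum):

   ```rust
   enum SetMonotonicity {
       Increasing,
       Decreasing,
       NotMonotonic,
   }

   // illustrative mapping only; the real trait method dispatches per UDAF
   fn set_monotonicity(fun: &str) -> SetMonotonicity {
       match fun {
           "count" | "max" => SetMonotonicity::Increasing, // more rows can only grow the result
           "min" => SetMonotonicity::Decreasing,           // more rows can only shrink it
           _ => SetMonotonicity::NotMonotonic,             // e.g. `avg` can move either way
       }
   }
   ```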



##
datafusion/physical-expr/src/aggregate.rs:
##
@@ -35,24 +35,27 @@ pub mod utils {
 };
 }
 
+use std::fmt::Debug;
+use std::sync::Arc;
+
+use crate::expressions::Column;
+
 use arrow::datatypes::{DataType, Field, Schema, SchemaRef};
+use arrow_schema::SortOptions;
 use datafusion_common::ScalarValue;
 use datafusion_common::{internal_err, not_impl_err, Result};

Review Comment:
   ```suggestion
   use datafusion_common::{internal_err, not_impl_err, ScalarValue, Result};
   ```



##
datafusion/physical-expr/src/window/standard.rs:
##
@@ -283,3 +269,27 @@ impl WindowExpr for StandardWindowExpr {
 }
 }
 }
+
+/// Adds new ordering expression into the existing ordering equivalence class based on partition by information.

Review Comment:
   ```suggestion
   /// Adds a new ordering expression into existing ordering equivalence class(es) based on
   /// PARTITION BY information (if it exists).
   ```



##
datafusion/physical-expr/src/aggregate.rs:
##
@@ -35,24 +35,27 @@ pub mod utils {
 };
 }
 
+use std::fmt::Debug;
+use std::sync::Arc;
+
+use crate::expressions::Column;
+
 use arrow::datatypes::{DataType, Field, Schema, SchemaRef};
+use arrow_schema::SortOptions;
 use datafusion_common::ScalarValue;
 use datafusion_common::{internal_err, not_impl_err, Result};
-use datafusion_expr::AggregateUDF;
 use datafusion_expr::ReversedUDAF;
+use datafusion_expr::{AggregateExprSetMonotonicity, AggregateUDF};

Review Comment:
   ```suggestion
   use datafusion_expr::{AggregateExprSetMonotonicity, AggregateUDF, ReversedUDAF};
   ```



##
datafusion/physical-plan/src/aggregates/mod.rs:
##
@@ -648,11 +649,32 @@ impl AggregateExec {
 group_expr_mapping: &ProjectionMapping,
 mode: &AggregateMode,
 input_order_mode: &InputOrderMode,
+aggr_exprs: Vec<Arc<AggregateFunctionExpr>>,

Review Comment:
   A cursory look suggests this can be a slice. If that is true, let's avoid 
copying.
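   
   Something like the following, so callers can lend the expressions instead of moving a `Vec` (a sketch, not the PR's actual signature):
   
   ```rust
   use std::sync::Arc;
   
   // borrowing a slice avoids cloning the Vec at every call site
   fn example<T>(aggr_exprs: &[Arc<T>]) -> usize {
       aggr_exprs.len()
   }
   ```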




Re: [PR] Fix incorrect searched CASE optimization [datafusion]

2025-01-29 Thread via GitHub


findepi closed pull request #14349: Fix incorrect searched CASE optimization
URL: https://github.com/apache/datafusion/pull/14349





Re: [I] Implement physical optimizer rule to apply type coercion to physical plans [datafusion]

2025-01-29 Thread via GitHub


andygrove commented on issue #14324:
URL: https://github.com/apache/datafusion/issues/14324#issuecomment-2621810595

   > BTW this might be fairly straightforward to add as an extension in comet 
(maybe it doesn't have to be in the datafusion core if it is only used by 
systems that use the physical plan)
   
   We will probably do this in the short term, but don't we want DataFusion to support the execution engine use case, similar to Velox?
   
   





Re: [PR] add manual trigger for extended tests in pull requests [datafusion]

2025-01-29 Thread via GitHub


Omega359 commented on code in PR #14331:
URL: https://github.com/apache/datafusion/pull/14331#discussion_r1934000529


##
.github/workflows/extended.yml:
##
@@ -31,14 +31,38 @@ on:
   push:
 branches:
   - main
+  issue_comment:
+types: [created]
+
+permissions:
+  pull-requests: write
 
 jobs:
+  # Check issue comment and notify that extended tests are running
+  check_issue_comment:
+name: Check issue comment
+runs-on: ubuntu-latest
+if: github.event.issue.pull_request && github.event.comment.body == 'run extended tests'
+steps:
+  - uses: actions/github-script@v7
+with:
+  github-token: ${{secrets.GITHUB_TOKEN}}
+  script: |
+github.rest.issues.createComment({

Review Comment:
   Using other actions in an apache repo is problematic as I previously found 
out. However, it's not hard to have an action post a comment and then edit it 
to show progress.






[I] Decimal promotion is not applied consistently [datafusion-comet]

2025-01-29 Thread via GitHub


andygrove opened a new issue, #1350:
URL: https://github.com/apache/datafusion-comet/issues/1350

   ### Describe the bug
   
   In `QueryPlanSerde`, we have code in `exprToProto` to ensure the decimal 
precision is the same for both children of a binary expression.
   
   ```scala
    def exprToProto(
        expr: Expression,
        inputs: Seq[Attribute],
        binding: Boolean = true): Option[Expr] = {

      val conf = SQLConf.get
      val newExpr =
        DecimalPrecision.promote(conf.decimalOperationsAllowPrecisionLoss, expr, !conf.ansiEnabled)
      exprToProtoInternal(newExpr, inputs, binding)
    }
   ```
   
   However, this calls `exprToProtoInternal` which is a recursive method. For 
example, when creating a binary expression:
   
   ```scala
   val leftExpr = exprToProtoInternal(left, inputs, binding)
   val rightExpr = exprToProtoInternal(right, inputs, binding)
   ```
   
   The decimal promotion logic is not applied in this case.
   
   Given that `exprToProto` contains a single line of code and then delegates 
to `exprToProtoInternal`, I propose that we simply combine these two methods.
   
   ### Steps to reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_





Re: [PR] Add related source code locations to errors [datafusion]

2025-01-29 Thread via GitHub


alamb commented on code in PR #13664:
URL: https://github.com/apache/datafusion/pull/13664#discussion_r1933723286


##
datafusion/common/src/diagnostic.rs:
##
@@ -0,0 +1,112 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use sqlparser::tokenizer::Span;
+
+/// Additional contextual information intended for end users, to help them
+/// understand what went wrong by providing human-readable messages, and
+/// locations in the source query that relate to the error in some way.
+///
+/// You can think of a single [`Diagnostic`] as a single "block" of output from
+/// rustc. i.e. either an error or a warning, optionally with some notes and
+/// help messages.
+///
+/// If the diagnostic, a note, or a help message doesn't need to point to a
+/// specific location in the original SQL query (or the [`Span`] is not
+/// available), use [`Span::empty`].
+///
+/// Example:
+///
+/// ```rust
+/// # use datafusion_common::Diagnostic;
+/// # use sqlparser::tokenizer::Span;
+/// # let span = Span::empty();
+/// let diagnostic = Diagnostic::new_error("Something went wrong", span)
+/// .with_help("Have you tried turning it on and off again?", Span::empty());
+/// ```
+#[derive(Debug, Clone)]
+pub struct Diagnostic {
+pub kind: DiagnosticKind,
+pub message: String,
+pub span: Span,
+pub notes: Vec<DiagnosticNote>,
+pub helps: Vec<DiagnosticHelp>,
+}
+
+#[derive(Debug, Clone)]
+pub struct DiagnosticNote {
+pub message: String,
+pub span: Span,
+}
+
+#[derive(Debug, Clone)]
+pub struct DiagnosticHelp {

Review Comment:
   perhaps that could be added in comments as it may be non-obvious to other readers
   
   ```suggestion
   /// Suggestions as to how to solve the error
   ///
   /// This is different than [`DiagnosticNote`], which explains *why* the error occurred.
   /// For example, that a non-aggregate expression is not in the `GROUP BY` clause.
   /// `DiagnosticHelp` contains a suggestion to solve the error, for example
   /// add the expression to the `GROUP BY`.
   pub struct DiagnosticHelp {
   ```






Re: [PR] Add related source code locations to errors [datafusion]

2025-01-29 Thread via GitHub


alamb commented on code in PR #13664:
URL: https://github.com/apache/datafusion/pull/13664#discussion_r1933725543


##
datafusion/common/src/diagnostic.rs:
##
@@ -0,0 +1,112 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use sqlparser::tokenizer::Span;
+
+/// Additional contextual information intended for end users, to help them
+/// understand what went wrong by providing human-readable messages, and
+/// locations in the source query that relate to the error in some way.
+///
+/// You can think of a single [`Diagnostic`] as a single "block" of output from
+/// rustc. i.e. either an error or a warning, optionally with some notes and
+/// help messages.
+///
+/// If the diagnostic, a note, or a help message doesn't need to point to a
+/// specific location in the original SQL query (or the [`Span`] is not
+/// available), use [`Span::empty`].
+///
+/// Example:
+///
+/// ```rust
+/// # use datafusion_common::Diagnostic;
+/// # use sqlparser::tokenizer::Span;
+/// # let span = Span::empty();
+/// let diagnostic = Diagnostic::new_error("Something went wrong", span)
+/// .with_help("Have you tried turning it on and off again?", Span::empty());
+/// ```
+#[derive(Debug, Clone)]
+pub struct Diagnostic {
+pub kind: DiagnosticKind,
+pub message: String,
+pub span: Span,
+pub notes: Vec<DiagnosticNote>,
+pub helps: Vec<DiagnosticHelp>,
+}
+
+#[derive(Debug, Clone)]
+pub struct DiagnosticNote {
+pub message: String,
+pub span: Span,
+}
+
+#[derive(Debug, Clone)]
+pub struct DiagnosticHelp {

Review Comment:
   NM -- I see you already did that






Re: [PR] add manual trigger for extended tests in pull requests [datafusion]

2025-01-29 Thread via GitHub


edmondop commented on code in PR #14331:
URL: https://github.com/apache/datafusion/pull/14331#discussion_r1933841345


##
.github/workflows/extended.yml:
##
@@ -31,14 +31,38 @@ on:
   push:
 branches:
   - main
+  issue_comment:
+types: [created]
+
+permissions:
+  pull-requests: write
 
 jobs:
+  # Check issue comment and notify that extended tests are running
+  check_issue_comment:
+name: Check issue comment
+runs-on: ubuntu-latest
+if: github.event.issue.pull_request && github.event.comment.body == 'run extended tests'
+steps:
+  - uses: actions/github-script@v7
+with:
+  github-token: ${{secrets.GITHUB_TOKEN}}
+  script: |
+github.rest.issues.createComment({

Review Comment:
   I wonder if we can create a check on the PR to show that running the tests is in progress.
   
   We can use https://octokit.github.io/rest.js/v21/#checks
   
   to create a check when the comment is posted, and when the pipeline runs we can update it via the "Update a check run" endpoint:
   
   ```js
   // sketch: mark the previously created check run as in progress
   octokit.rest.checks.update({
     owner,
     repo,
     check_run_id,
     status: "in_progress",
     output: {
       title: "Extended tests",
       summary: "Extended tests are running...",
     },
   });
   ```
   
   see the "Update a check run" section under the checks API in the docs I have linked
   
   wdyt @alamb ?






Re: [I] Implement physical optimizer rule to apply type coercion to physical plans [datafusion]

2025-01-29 Thread via GitHub


edmondop commented on issue #14324:
URL: https://github.com/apache/datafusion/issues/14324#issuecomment-2621615427

   Is it possible to do type coercion only at the physical layer? Or does it bring benefits at the logical layer, so that we need to do type coercion twice: once at the logical layer for users of the "datafusion frontend" and once at the physical layer for applications like Comet that "skip" the datafusion frontend?





Re: [PR] fix: LimitPushdown rule uncorrect remove some GlobalLimitExec [datafusion]

2025-01-29 Thread via GitHub


zhuqi-lucas commented on code in PR #14245:
URL: https://github.com/apache/datafusion/pull/14245#discussion_r1934090147


##
datafusion/physical-optimizer/src/limit_pushdown.rs:
##
@@ -248,7 +247,15 @@ pub fn pushdown_limit_helper(
 }
 } else {
 // Add fetch or a `LimitExec`:
-global_state.satisfied = true;
+// If the plan's children have limit and the child's limit < parent's limit, we shouldn't change the global state to true,
+// because the children limit will be overridden if the global state is changed.
+if pushdown_plan
+.children()
+.iter()
+.any(|&child| extract_limit(child).is_some())
+{
+global_state.satisfied = false;
+}

Review Comment:
   Got it @xudong963, the previous logic would always set `global_state.satisfied` to true when it was false. This change keeps it false for some cases. I reworked the logic to be clearer: we only set it to true in a way that excludes the above case.
   Thanks!
   
   






Re: [PR] fix: LimitPushdown rule uncorrect remove some GlobalLimitExec [datafusion]

2025-01-29 Thread via GitHub


zhuqi-lucas commented on code in PR #14245:
URL: https://github.com/apache/datafusion/pull/14245#discussion_r1934092407


##
datafusion/physical-optimizer/src/limit_pushdown.rs:
##
@@ -248,7 +247,15 @@ pub fn pushdown_limit_helper(
 }
 } else {
 // Add fetch or a `LimitExec`:
-global_state.satisfied = true;

Review Comment:
   Here is the original logic for setting back to true. 






Re: [I] Implement Common Subexpression Elimination optimizer rule [datafusion-comet]

2025-01-29 Thread via GitHub


andygrove commented on issue #942:
URL: https://github.com/apache/datafusion-comet/issues/942#issuecomment-2621952623

   The DataFusion PR https://github.com/apache/datafusion/pull/13046 is still 
waiting for a review. I am adding this issue back onto the 0.6 milestone as a 
reminder.





[PR] Fix DDL generation in case of an empty arguments FUNCTION. [datafusion-sqlparser-rs]

2025-01-29 Thread via GitHub


remysaissy opened a new pull request, #1690:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1690

   Functions with an empty argument list are properly parsed into the AST, but the string representation of such a function omits the parentheses required after the function name, causing an error at the database level.
   
   For example:
   CREATE OR REPLACE FUNCTION no_arg() RETURNS VOID LANGUAGE plpgsql AS $$ BEGIN DELETE FROM my_table; END; $$
   Becomes:
   CREATE OR REPLACE FUNCTION no_arg RETURNS VOID LANGUAGE plpgsql AS $$ BEGIN DELETE FROM my_table; END; $$
   
   And causes Postgres to reject the query.
   
   This PR fixes the string representation in this specific case.
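   
   A minimal sketch of the Display-side behavior the fix needs (illustrative types, not the sqlparser AST):
   
   ```rust
   use std::fmt;
   
   struct FunctionDecl {
       name: String,
       args: Vec<String>,
   }
   
   impl fmt::Display for FunctionDecl {
       fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
           // always emit the parentheses, even for an empty argument list,
           // so `no_arg()` does not round-trip as `no_arg`
           write!(f, "{}({})", self.name, self.args.join(", "))
       }
   }
   ```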





Re: [PR] fix: fetch is missed during EnforceDistribution [datafusion]

2025-01-29 Thread via GitHub


alamb commented on PR #14207:
URL: https://github.com/apache/datafusion/pull/14207#issuecomment-2621967843

   I merged up from main to resolve a conflict on this PR as part of my review





[PR] refactor: switch `BooleanBufferBuilder` to `NullBufferBuilder` in single_group_by [datafusion]

2025-01-29 Thread via GitHub


Chen-Yuan-Lai opened a new pull request, #14360:
URL: https://github.com/apache/datafusion/pull/14360

   
   ## Which issue does this PR close?
   
   
   
   Closes #14115 .
   
   ## Rationale for this change
   
   
   As mentioned in #14115, several places in the DataFusion codebase still use `BooleanBufferBuilder` rather than `NullBufferBuilder`; they should be replaced with `NullBufferBuilder` as an optimization.
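   
   A minimal sketch of the optimization, assuming arrow-buffer's `NullBufferBuilder` (the import path may vary by arrow version):
   
   ```rust
   use arrow_buffer::NullBufferBuilder;
   
   fn main() {
       // NullBufferBuilder can avoid allocating a validity buffer entirely when
       // no nulls are appended: the optimization over BooleanBufferBuilder
       let mut builder = NullBufferBuilder::new(3);
       builder.append_non_null();
       builder.append_null();
       builder.append_non_null();
       let nulls = builder.finish(); // Some(NullBuffer) because a null was appended
       assert_eq!(nulls.map(|n| n.null_count()), Some(1));
   }
   ```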
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   





Re: [I] Decimal promotion is not applied consistently [datafusion-comet]

2025-01-29 Thread via GitHub


andygrove closed issue #1350: Decimal promotion is not applied consistently
URL: https://github.com/apache/datafusion-comet/issues/1350





Re: [PR] bug: Fix NULL handling in array_slice [datafusion]

2025-01-29 Thread via GitHub


jkosh44 commented on code in PR #14289:
URL: https://github.com/apache/datafusion/pull/14289#discussion_r1933570686


##
datafusion/physical-expr/src/scalar_function.rs:
##
@@ -186,6 +187,15 @@ impl PhysicalExpr for ScalarFunctionExpr {
 .map(|e| e.evaluate(batch))
 .collect::<Result<Vec<_>>>()?;
 
+if self.fun.signature().null_handling == NullHandling::Propagate
+&& args.iter().any(
+|arg| matches!(arg, ColumnarValue::Scalar(scalar) if scalar.is_null()),

Review Comment:
   I'm not super confident about this check; how should `ColumnarValue::Array`s be treated?






Re: [PR] Fix Float and Decimal coercion [datafusion]

2025-01-29 Thread via GitHub


findepi commented on PR #14273:
URL: https://github.com/apache/datafusion/pull/14273#issuecomment-2621162538

   I am OK with having DF coercions extensible (https://github.com/apache/datafusion/issues/14296), but first and foremost I want them to be **_optional_**. At ~SDF~ dbt, we create fully resolved and coerced plans, and so DF's coercion rules cannot do any good.
   For our needs, the best would be a clear separation of the SQL analysis layer, where coercions should exist, from the formal plan processing layer, where coercions should no longer exist. This was outlined in https://github.com/apache/datafusion/issues/12723.
   
   Regardless of future apt plans and potential customizability via new or existing extension points, we need to decide what the "out of the box" behavior is. We have a rule to follow PostgreSQL in principle, and this is what this PR does. I consider it non-controversial. The existing comments revolve around "let's make sure it's properly reviewed", and by now it obviously is. Will merge tomorrow unless there are actionable objections.
   
   





Re: [PR] bug: Fix NULL handling in array_slice [datafusion]

2025-01-29 Thread via GitHub


jkosh44 commented on code in PR #14289:
URL: https://github.com/apache/datafusion/pull/14289#discussion_r1933573232


##
datafusion/functions-nested/src/extract.rs:
##
@@ -330,7 +330,8 @@ pub(super) struct ArraySlice {
 impl ArraySlice {
 pub fn new() -> Self {
 Self {
-signature: Signature::variadic_any(Volatility::Immutable),
+// TODO: This signature should use the actual accepted types, not variadic_any.

Review Comment:
   Just pushed an update to do this.






[PR] Add hook for sharing join state in distributed execution [datafusion]

2025-01-29 Thread via GitHub


thinkharderdev opened a new pull request, #12523:
URL: https://github.com/apache/datafusion/pull/12523

   ## Which issue does this PR close?
   
   
   
   Closes #12454 
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   





Re: [I] support: Date +/plus Int or date_add function [datafusion]

2025-01-29 Thread via GitHub


DanCodedThis commented on issue #6876:
URL: https://github.com/apache/datafusion/issues/6876#issuecomment-2621860544

   @edmondop I think I agree with your statement.
   - Also, does the new `TypeSignatureClass` + `logical_string()` allow parsing `month` not as a column/keyword, but as `Utf8` without the single quotes? In our own implementation there is preprocessing that fixes that in a hacky way. If we later add the Snowflake way to DF, I don't think I will be able to reproduce it here.
   - I saw the new `date_part` a week and a half or two weeks ago in the Apache docs, and they were using `month`, not `'month'`, in the example. Perhaps it was a mistake in the docs that has since been fixed, or I didn't perceive it correctly, but I could swear I saw it.
   - I tried doing something similar for the aforementioned `date_diff`, but the example I was referring to was gone by the time I wrote and tested it (it didn't work).





Re: [PR] equivalence classes: use normalized mapping for projection [datafusion]

2025-01-29 Thread via GitHub


ozankabak commented on PR #14327:
URL: https://github.com/apache/datafusion/pull/14327#issuecomment-2621860123

   I am guessing what was meant was writing an SLT test whose output plan would change if this logic/fix weren't there. Maybe an optimization that wouldn't take place because the normalization was not done properly.





[PR] Minor: include the number of files run in sqllogictest display [datafusion]

2025-01-29 Thread via GitHub


alamb opened a new pull request, #14359:
URL: https://github.com/apache/datafusion/pull/14359

   ## Which issue does this PR close?
   - Follow on to https://github.com/apache/datafusion/pull/14355
   - Follow on to 
   
   ## Rationale for this change
   
   While testing https://github.com/apache/datafusion/pull/14355 from @findepi  
I found it hard to see how many files were run
   
   ## What changes are included in this PR?
   
   Before this PR:
   ```shell
   andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion$ cargo test --test sqllogictests -- union
       Finished `test` profile [unoptimized + debuginfo] target(s) in 0.16s
        Running bin/sqllogictests.rs (target/debug/deps/sqllogictests-3d953efd822e1e41)
   Completed in 0 seconds
   ```
   
   After this PR, it reports `Completed 2 tests`:
   
   ```shell
   andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion$ cargo test --test sqllogictests -- union
       Finished `test` profile [unoptimized + debuginfo] target(s) in 0.16s
        Running bin/sqllogictests.rs (target/debug/deps/sqllogictests-3d953efd822e1e41)
   Completed 2 tests in 0 seconds
   ```
   
   ## Are these changes tested?
   
   I tested them manually
   ## Are there any user-facing changes?
   
   
   
   
   





Re: [PR] Minor: include the number of files run in sqllogictest display [datafusion]

2025-01-29 Thread via GitHub


alamb commented on code in PR #14359:
URL: https://github.com/apache/datafusion/pull/14359#discussion_r1934030696


##
datafusion/sqllogictest/bin/sqllogictests.rs:
##
@@ -491,9 +497,7 @@ impl TestFile {
 }
 }
 
-fn read_test_files<'a>(
-    options: &'a Options,
-) -> Result<Box<dyn Iterator<Item = TestFile> + 'a>> {
+fn read_test_files(options: &Options) -> Result<Vec<TestFile>> {

Review Comment:
   This function was already basically returning a `Vec` so there didn't seem 
to be any need to have the iterator indirection






Re: [PR] Restore ability to run single SLT file [datafusion]

2025-01-29 Thread via GitHub


alamb commented on PR #14355:
URL: https://github.com/apache/datafusion/pull/14355#issuecomment-2621874234

   > The functionality was probably accidentally lost in #13936
   
   BTW I think it is currently possible to run a single file. For example this 
will run both `union.slt` and `pg_compat_union.slt`.
   
   ```shell
   cargo test --test sqllogictests -- union.slt
   ```
   
   However, it only matches on the file name (not the entire path) so using 
`test_files/union.slt` does not work as you observe
   
   
   > I made a small PR to your branch to show the # of files tested, verifying 
that
   
   LOL we did the same thing @Omega359 
   - https://github.com/apache/datafusion/pull/14359





Re: [PR] test: attempt to analyze boundaries for select columns [datafusion]

2025-01-29 Thread via GitHub


ozankabak commented on PR #14308:
URL: https://github.com/apache/datafusion/pull/14308#issuecomment-2621875650

   Seems like we need to make certain view types comparable with their parent 
types.





Re: [PR] chore(deps): bump serde_json from 1.0.137 to 1.0.138 in /datafusion-cli [datafusion]

2025-01-29 Thread via GitHub


alamb merged PR #14351:
URL: https://github.com/apache/datafusion/pull/14351





Re: [PR] Fix UNION field nullability tracking [datafusion]

2025-01-29 Thread via GitHub


alamb commented on PR #14356:
URL: https://github.com/apache/datafusion/pull/14356#issuecomment-2621877094

   @wiedld  can you also please review this PR?





Re: [PR] Feature Unifying source execution plans [datafusion]

2025-01-29 Thread via GitHub


ozankabak commented on PR #14224:
URL: https://github.com/apache/datafusion/pull/14224#issuecomment-2621880797

   Thanks for all the review. We will focus on this again next week, improve 
the PR, and ask for more thoughts from the community.





Re: [PR] chore(deps): bump tempfile from 3.15.0 to 3.16.0 in /datafusion-cli [datafusion]

2025-01-29 Thread via GitHub


alamb merged PR #14350:
URL: https://github.com/apache/datafusion/pull/14350





Re: [I] Decimal promotion is not applied consistently [datafusion-comet]

2025-01-29 Thread via GitHub


andygrove commented on issue #1350:
URL: https://github.com/apache/datafusion-comet/issues/1350#issuecomment-2621887848

   nm, I see now that `DecimalPrecision.promote` calls `expr.transformUp` and 
transforms the whole expression tree.





Re: [PR] fix: LimitPushdown rule uncorrect remove some GlobalLimitExec [datafusion]

2025-01-29 Thread via GitHub


zhuqi-lucas commented on code in PR #14245:
URL: https://github.com/apache/datafusion/pull/14245#discussion_r1933509296


##
datafusion/core/tests/dataframe/mod.rs:
##
@@ -5182,3 +5177,32 @@ async fn register_non_parquet_file() {
 "1.json' does not match the expected extension '.parquet'"
 );
 }
+
+// Test issue: https://github.com/apache/datafusion/issues/14204

Review Comment:
   Thank you @xudong963 for the review; it makes sense to remove this UT.
   
   Addressed in the latest PR.






Re: [PR] bug: Fix NULL handling in array_slice [datafusion]

2025-01-29 Thread via GitHub


jkosh44 commented on PR #14289:
URL: https://github.com/apache/datafusion/pull/14289#issuecomment-2621158643

   > ```rust
   > /// Any Null input causes the function to return Null.
   > Propagate,
   > ```
   
   I've updated the PR to use this. I just realized though that window and 
aggregate functions (and maybe other function types?) completely ignore this 
field in the `Signature`. That feels wrong, but I'll have to think about what 
the right behavior should be.





Re: [I] Add `DataFrame::map` utility .map function for DataFrame for modifying internal LogicalPlan [datafusion]

2025-01-29 Thread via GitHub


phisn commented on issue #14317:
URL: https://github.com/apache/datafusion/issues/14317#issuecomment-2621740237

   @timsaucer My specific use case is what @alamb described. The problem with the transform approach is that I am forced to call `.logical_plan().clone()` as well as create a new `DataFrame` each time, since it's about directly manipulating the `LogicalPlan`.
   
   The problem arises with applying extensions and using functions that are not 
always called using a `DataFrame`.





Re: [I] Ensure `to_timestamp` behaves consistently with PostgreSQL [datafusion]

2025-01-29 Thread via GitHub


logan-keede commented on issue #13351:
URL: https://github.com/apache/datafusion/issues/13351#issuecomment-2621744869

   > That is a very good question. There must have been a reason I didn't do that when I coded this up, but right now I can't recall why it would have been. Let me think about that for a day and see if I can recall a good reason.
   
   I think I found the reason.
   According to the documentation:
   > `Optional Chrono format strings to use to parse the expression. Formats will be tried in the order they appear with the first successful one being returned. If none of the formats successfully parse the expression an error will be returned.`
   
   So even if a format is wrong, it should not throw an error until it has tried all the formats, and only then does it give a "no format matches" error. So I think we should leave that as it is; however, if we really want to improve UX we could generate a warning that `xyz` is not a valid chrono format.
   Let me know if that makes sense.
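   
   A minimal sketch of those "try each format in order" semantics using chrono (illustrative, not the actual `to_timestamp` code):
   
   ```rust
   use chrono::NaiveDateTime;
   
   fn parse_with_formats(s: &str, formats: &[&str]) -> Result<NaiveDateTime, String> {
       for fmt in formats {
           // an invalid or non-matching format is not an error by itself...
           if let Ok(ts) = NaiveDateTime::parse_from_str(s, fmt) {
               return Ok(ts);
           }
       }
       // ...only exhausting every format is
       Err(format!("Error parsing '{s}': no format matches"))
   }
   ```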





[I] Enable TableScan to return multiple arbitrary table references [datafusion]

2025-01-29 Thread via GitHub


phisn opened a new issue, #14358:
URL: https://github.com/apache/datafusion/issues/14358

   ### Is your feature request related to a problem or challenge?
   
   `TableProvider` has a type which indicates which kind of table it contains. This can be a view. Therefore it makes sense to encode materialised views using `TableProvider` and use `TableScan`.
   
   Currently `TableScan` forces the use of one or no `TableReference`. Since a view can be derived from multiple tables having the same column names, it can happen that two columns can only be distinguished by their respective `TableReference`.
   
   ### Describe the solution you'd like
   
   `TableScan` should contain a unique `table_name: TableReference` for each column.
   
   ### Describe alternatives you've considered
   
   Even though `TableScan` contains a `DFSchema` which contains a list of `TableReference`s, it is not possible to use it, since some optimisation rules recreate the `TableScan` with the wrong schema.
   
   1. We change the schema in `TableProvider` to `DFSchema` instead of `Schema`. I did not test whether having multiple identical column names in `TableProvider` causes issues that would force this option. As only the column position matters during execution, this seems unlikely.
   2. In my current implementation I am inlining table names using `.flat_name()` and outlining them again using an analyser rule. Since I can't use `TableScan`, I am using a custom Extension which is resolved by the analyser too.
   3. Leaving this to the user or creating some new alternative `LogicalPlan` node is also possible.
   
   ### Additional context
   
   This feature is slightly related to #14310.





Re: [PR] Fix UNION field nullability tracking [datafusion]

2025-01-29 Thread via GitHub


findepi commented on PR #14356:
URL: https://github.com/apache/datafusion/pull/14356#issuecomment-2621752435

   cc @felipecrv, @Omega359





Re: [PR] chore: Move all array_* serde to new framework, use correct INCOMPAT config [datafusion-comet]

2025-01-29 Thread via GitHub


andygrove commented on code in PR #1349:
URL: https://github.com/apache/datafusion-comet/pull/1349#discussion_r1933951486


##
spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala:
##
@@ -3490,3 +3437,6 @@ trait CometExpressionSerde {
   inputs: Seq[Attribute],
   binding: Boolean): Option[ExprOuterClass.Expr]
 }
+
+/** Marker trait for an expression that is not guaranteed to be 100% compatible with Spark */
+trait IncompatExpr {}

Review Comment:
   Later, we could add a method to this trait to provide the reason why an 
expression is incompatible and this could be used to generate documentation.






Re: [PR] Restore ability to run single SLT file [datafusion]

2025-01-29 Thread via GitHub


Omega359 commented on PR #14355:
URL: https://github.com/apache/datafusion/pull/14355#issuecomment-2621782574

   I made a small PR to your branch to show the # of files tested, verifying 
that 
   
   `cargo test --test sqllogictests -- union.slt`
   
   runs 2 files whereas
   
   `cargo test --test sqllogictests -- test_files/union.slt`
   
   runs 1 file.
   
   I also verified that the sqlite tests ran without issues.





Re: [PR] feat: metadata columns [datafusion]

2025-01-29 Thread via GitHub


chenkovsky commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2621231579

   > > Then we cannot make sure metadata columns always have same column index in different tables
   > 
   > Not quite understand the reason, with modified metadata, meta column could be considered as the same column like the normal one inside `Schema`. **Field index** for accessing a field in `Schema` is contiguous like before, indices are changed if the schema is changed.
   
   On the question of whether to treat metadata columns like normal columns, I want to hear more ideas.
   
   By the way, at the beginning I didn't use an offset to separate metadata columns from normal columns, and you said that was error prone. I think that when we do schema migration, a non-stable column index is really error prone. Did I misunderstand something?
   
   





[PR] Add RETURNS TABLE() support for CREATE FUNCTION in Postgresql [datafusion-sqlparser-rs]

2025-01-29 Thread via GitHub


remysaissy opened a new pull request, #1687:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1687

   Currently CREATE FUNCTION doesn't support the RETURNS TABLE data type (https://www.postgresql.org/docs/15/sql-createfunction.html).
   
   This PR adds it for the PostgreSQL and Generic dialects.





Re: [PR] BigQuery: Fix column identifier reserved keywords list [datafusion-sqlparser-rs]

2025-01-29 Thread via GitHub


alamb merged PR #1678:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1678





Re: [I] support: Date +/plus Int or date_add function [datafusion]

2025-01-29 Thread via GitHub


edmondop commented on issue #6876:
URL: https://github.com/apache/datafusion/issues/6876#issuecomment-2621668103

   @DanCodedThis and @alamb I looked at [snowflake 
documentation](https://docs.snowflake.com/en/sql-reference/functions/dateadd), 
[Spark 
documentation](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.date_add.html)
 and [postgres 
documentation](https://www.postgresql.org/docs/current/functions-datetime.html)
   
   - Postgres provides a generic API that's based on intervals
   - Snowflake provides a simpler API where you provide the interval as a 
combination of a time unit and a number of units
   - Spark supports only days as a time unit
   
   Implementing the Postgres approach seems like the most flexible solution; we 
could then perhaps implement a Snowflake dialect that parses a Snowflake-like 
UDF invocation into the Postgres style.
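   
   For reference, a minimal sketch of the Postgres-style interval arithmetic 
that DataFusion already accepts (the query string is just an example):
   
   ```rust
   use datafusion::error::Result;
   use datafusion::prelude::*;
   
   #[tokio::main]
   async fn main() -> Result<()> {
       let ctx = SessionContext::new();
       // Postgres-style: add an interval to a date, rather than a bare integer.
       let df = ctx
           .sql("SELECT DATE '2025-01-29' + INTERVAL '3 days' AS d")
           .await?;
       df.show().await?;
       Ok(())
   }
   ```
   
   A Snowflake-style `dateadd(day, 3, d)` could then desugar to exactly this 
kind of interval addition.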


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Make TypedString contain Value instead of String to support and preserve other quote styles [datafusion-sqlparser-rs]

2025-01-29 Thread via GitHub


graup commented on code in PR #1679:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1679#discussion_r1932661673


##
tests/sqlparser_bigquery.rs:
##
@@ -2214,6 +2214,30 @@ fn test_select_as_value() {
 assert_eq!(Some(ValueTableMode::AsValue), select.value_table_mode);
 }
 
+#[test]
+fn test_typed_strings() {
+let expr = bigquery().verified_expr(r#"JSON """{"foo":"bar's"}""""#);
+assert_eq!(
+Expr::TypedString {
+data_type: DataType::JSON,
+value: Value::TripleDoubleQuotedString(r#"{"foo":"bar's"}"#.into())
+},
+expr
+);
+
+let expr = bigquery().verified_expr(r#"JSON '''{"foo":"bar's"}'''"#);
+if let Expr::TypedString { data_type, value } = expr {
+assert_eq!(DataType::JSON, data_type);
+let string_value: String = value.into();
+assert_eq!(r#"{"foo":"bar's"}"#, string_value);
+}
+
+// SingleQuotedString and DoubleQuotedString are currently not correctly 
formatted by `fmt::Display for Value`.
+// BigQuery does not support double escaping, should be \' or \" instead.
+//bigquery().verified_expr(r#"JSON '{"foo":"bar\'s"}'"#);
+//bigquery().verified_expr(r#"JSON "{\"foo\":\"bar's\"}""#);

Review Comment:
   The issue I ran into is that this SQL `SELECT JSON '{"foo":"bar\'s"}'` gets 
parsed and then re-printed as `SELECT JSON '{"foo":"bar''s"}'` (note the `''` 
instead of `\'`, which is a syntax error in BQ). But let me do some more tests 
– for now I'll take this out of this PR and create a separate issue/PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: LimitPushdown rule uncorrect remove some GlobalLimitExec [datafusion]

2025-01-29 Thread via GitHub


xudong963 commented on code in PR #14245:
URL: https://github.com/apache/datafusion/pull/14245#discussion_r1933897333


##
datafusion/physical-optimizer/src/limit_pushdown.rs:
##
@@ -248,7 +247,15 @@ pub fn pushdown_limit_helper(
 }
 } else {
 // Add fetch or a `LimitExec`:
-global_state.satisfied = true;
+// If the plan's children have limit and the child's limit < 
parent's limit, we shouldn't change the global state to true,
+// because the children limit will be overridden if the global 
state is changed.
+if pushdown_plan
+.children()
+.iter()
+.any(|&child| extract_limit(child).is_some())
+{
+global_state.satisfied = false;
+}

Review Comment:
   I think the logic is strange: if we reach the else branch (line 248), it 
means `global_state.satisfied == false` already, yet here (line 257) we set 
`global_state.satisfied = false;` again.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: pass scale to DF round in spark_round [datafusion-comet]

2025-01-29 Thread via GitHub


cht42 commented on code in PR #1341:
URL: https://github.com/apache/datafusion-comet/pull/1341#discussion_r1933876335


##
native/spark-expr/src/math_funcs/round.rs:
##
@@ -85,9 +85,10 @@ pub fn spark_round(
 let (precision, scale) = get_precision_scale(data_type);
 make_decimal_array(array, precision, scale, &f)
 }
-DataType::Float32 | DataType::Float64 => {
-Ok(ColumnarValue::Array(round(&[Arc::clone(array)])?))
-}
+DataType::Float32 | DataType::Float64 => 
Ok(ColumnarValue::Array(round(&[

Review Comment:
   added tests for floats



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Remove redundant processing from exprToProtoInternal [datafusion-comet]

2025-01-29 Thread via GitHub


codecov-commenter commented on PR #1351:
URL: 
https://github.com/apache/datafusion-comet/pull/1351#issuecomment-2622069957

   ## 
[Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1351?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   All modified and coverable lines are covered by tests :white_check_mark:
   > Project coverage is 57.29%. Comparing base 
[(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 to head 
[(`a51967a`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/a51967aab4d3c322f1aa0fb97e395fe83dbc2ff2?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   > Report is 11 commits behind head on main.
   
   Additional details and impacted files
   
   
   ```diff
   @@ Coverage Diff  @@
   ##   main#1351  +/-   ##
   
   + Coverage 56.12%   57.29%   +1.16% 
 Complexity  976  976  
   
 Files   119  120   +1 
 Lines 1174311896 +153 
 Branches   2251 2252   +1 
   
   + Hits   6591 6816 +225 
   + Misses 4012 3918  -94 
   - Partials   1140 1162  +22 
   ```
   
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1351?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] moving memory.rs out of datafusion/core [datafusion]

2025-01-29 Thread via GitHub


comphead commented on PR #14332:
URL: https://github.com/apache/datafusion/pull/14332#issuecomment-2622072090

   > if we go with that we should change name of every similar file in a 
similar manner (atleast `async.rs`).
   
   That is true; to be consistent it would need to be fixed for all such files, 
which is certainly not part of this PR.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: fetch is missed during EnforceDistribution [datafusion]

2025-01-29 Thread via GitHub


alamb commented on code in PR #14207:
URL: https://github.com/apache/datafusion/pull/14207#discussion_r1934089417


##
datafusion/physical-optimizer/src/enforce_distribution.rs:
##
@@ -932,12 +932,16 @@ fn add_hash_on_top(
 /// # Arguments
 ///
 /// * `input`: Current node.
+/// * `fetch`: Possible fetch value

Review Comment:
   I think it would help here to describe what is happening with the `fetch` 
argument as well, and when it is updated.
   
   Maybe something like the following:
   
   ```suggestion
   /// * `fetch`: Possible fetch value. If a `SortPreservingMerge` is created
   ///   its fetch is set to this value and `fetch` is set to `None`
   ```



##
datafusion/core/tests/physical_optimizer/enforce_distribution.rs:
##
@@ -3173,3 +3182,78 @@ fn optimize_away_unnecessary_repartition2() -> 
Result<()> {
 
 Ok(())
 }
+
+#[tokio::test]
+async fn apply_enforce_distribution_multiple_times() -> Result<()> {
+// Create a configuration
+let config = SessionConfig::new();
+let ctx = SessionContext::new_with_config(config);
+
+// Create table schema and data
+let sql = "CREATE EXTERNAL TABLE aggregate_test_100 (
+c1  VARCHAR NOT NULL,
+c2  TINYINT NOT NULL,
+c3  SMALLINT NOT NULL,
+c4  SMALLINT,
+c5  INT,
+c6  BIGINT NOT NULL,
+c7  SMALLINT NOT NULL,
+c8  INT NOT NULL,
+c9  BIGINT UNSIGNED NOT NULL,
+c10 VARCHAR NOT NULL,
+c11 FLOAT NOT NULL,
+c12 DOUBLE NOT NULL,
+c13 VARCHAR NOT NULL
+)
+STORED AS CSV
+LOCATION '../../testing/data/csv/aggregate_test_100.csv'
+OPTIONS ('format.has_header' 'true')";
+
+ctx.sql(sql).await?;
+
+let df = ctx.sql("SELECT * FROM(SELECT * FROM aggregate_test_100 UNION ALL 
SELECT * FROM aggregate_test_100) ORDER BY c13 LIMIT 5").await?;
+let logical_plan = df.logical_plan().clone();
+let analyzed_logical_plan = ctx.state().analyzer().execute_and_check(
+logical_plan,
+ctx.state().config_options(),
+|_, _| (),
+)?;
+
+let optimized_logical_plan = ctx.state().optimizer().optimize(
+analyzed_logical_plan,
+&ctx.state(),
+|_, _| (),
+)?;
+
+let optimizers: Vec<Arc<dyn PhysicalOptimizerRule + Send + Sync>> = vec![
+Arc::new(OutputRequirements::new_add_mode()),
+Arc::new(EnforceDistribution::new()),
+Arc::new(EnforceSorting::new()),
+Arc::new(ProjectionPushdown::new()),
+Arc::new(CoalesceBatches::new()),
+Arc::new(EnforceDistribution::new()), // -- Add enforce distribution 
rule again
+Arc::new(OutputRequirements::new_remove_mode()),
+Arc::new(ProjectionPushdown::new()),
+Arc::new(LimitPushdown::new()),
+Arc::new(SanityCheckPlan::new()),
+];
+
+let planner = DefaultPhysicalPlanner::default();
+let session_state = SessionStateBuilder::new()
+.with_config(ctx.copied_config())
+.with_default_features()
+.with_physical_optimizer_rules(optimizers)
+.build();
+let optimized_physical_plan = planner
+.create_physical_plan(&optimized_logical_plan, &session_state)
+.await?;
+
+let mut results = optimized_physical_plan
+.execute(0, ctx.task_ctx().clone())
+.unwrap();
+
+let batch = results.next().await.unwrap()?;
+// With the fix of https://github.com/apache/datafusion/pull/14207, the 
number of rows will be 10

Review Comment:
   ```suggestion
   // Without the fix of https://github.com/apache/datafusion/pull/14207, 
the number of rows will be 10
   ```



##
datafusion/physical-optimizer/src/enforce_distribution.rs:
##
@@ -1362,6 +1383,21 @@ pub fn ensure_distribution(
 plan.with_new_children(children_plans)?
 };
 
+// If `fetch` was not consumed, it means that there was 
`SortPreservingMergeExec` with fetch before
+// It was removed by `remove_dist_changing_operators`
+// and we need to add it back.
+if fetch.is_some() {
+plan = Arc::new(
+SortPreservingMergeExec::new(
+plan.output_ordering()
+.unwrap_or(&LexOrdering::default())
+.clone(),
+plan,
+)
+.with_fetch(fetch.take()),
+)

Review Comment:
   Why does this add back a SortPreservingMerge without sort exprs? Wouldn't it 
be better to use a `GlobalLimitExec` or something?
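   
   For reference, a rough standalone sketch of that alternative (`EmptyExec` 
stands in for the actual input plan; this is not the PR's code):
   
   ```rust
   use std::sync::Arc;
   
   use datafusion::arrow::datatypes::{DataType, Field, Schema};
   use datafusion::physical_plan::empty::EmptyExec;
   use datafusion::physical_plan::limit::GlobalLimitExec;
   use datafusion::physical_plan::ExecutionPlan;
   
   fn main() {
       let schema = Arc::new(Schema::new(vec![Field::new("c1", DataType::Int32, true)]));
       // Stand-in for the repartitioned `plan` in the diff above.
       let plan: Arc<dyn ExecutionPlan> = Arc::new(EmptyExec::new(schema));
       // skip = 0, fetch = Some(5): keep the first 5 rows without claiming
       // to preserve any particular ordering.
       let limited = GlobalLimitExec::new(plan, 0, Some(5));
       assert_eq!(limited.fetch(), Some(5));
   }
   ```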



##
datafusion/core/tests/physical_optimizer/enforce_distribution.rs:
##
@@ -3172,3 +3181,78 @@ fn optimize_away_unnecessary_repartition2() -> 
Result<()> {
 
 Ok(())
 }
+
+#[tokio::test]
+async fn apply_enforce_distribution_multiple_times() -> Result<()> {

Review Comment:
   I think it is nice to have this "end to end" style test, but given the 
amount of code changed I think it is important to have more "unit style" tests, 
otherwise it is hard to understand how general this fix is (or if it just works 
for this particular plan).

Re: [PR] moving memory.rs out of datafusion/core [datafusion]

2025-01-29 Thread via GitHub


comphead merged PR #14332:
URL: https://github.com/apache/datafusion/pull/14332


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Restore ability to run single SLT file [datafusion]

2025-01-29 Thread via GitHub


alamb merged PR #14355:
URL: https://github.com/apache/datafusion/pull/14355


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Restore ability to run single SLT file [datafusion]

2025-01-29 Thread via GitHub


alamb commented on PR #14355:
URL: https://github.com/apache/datafusion/pull/14355#issuecomment-2622078034

   Thanks again @findepi 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] refactor: switch `BooleanBufferBuilder` to `NullBufferBuilder` in binary_map [datafusion]

2025-01-29 Thread via GitHub


comphead merged PR #14341:
URL: https://github.com/apache/datafusion/pull/14341


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore(deps): bump home from 0.5.9 to 0.5.11 in /datafusion-cli [datafusion]

2025-01-29 Thread via GitHub


alamb merged PR #14257:
URL: https://github.com/apache/datafusion/pull/14257


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] moving memory.rs out of datafusion/core [datafusion]

2025-01-29 Thread via GitHub


alamb commented on PR #14332:
URL: https://github.com/apache/datafusion/pull/14332#issuecomment-2622083005

   😍 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Move all array_* serde to new framework, use correct INCOMPAT config [datafusion-comet]

2025-01-29 Thread via GitHub


comphead commented on code in PR #1349:
URL: https://github.com/apache/datafusion-comet/pull/1349#discussion_r1934180368


##
common/src/main/scala/org/apache/comet/CometConf.scala:
##
@@ -605,6 +605,15 @@ object CometConf extends ShimCometConf {
   .booleanConf
   .createWithDefault(false)
 
+  val COMET_EXPR_ALLOW_INCOMPATIBLE: ConfigEntry[Boolean] =
+conf("spark.comet.expression.allowIncompatible")
+  .doc(
+"Comet is not currently fully compatible with Spark for all 
expressions. " +
+  "Set this config to true to allow them anyway. See compatibility 
guide " +

Review Comment:
   ```suggestion
 "Set this config to true to allow them anyway. See Compatibility 
Guide " +
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Move all array_* serde to new framework, use correct INCOMPAT config [datafusion-comet]

2025-01-29 Thread via GitHub


comphead commented on code in PR #1349:
URL: https://github.com/apache/datafusion-comet/pull/1349#discussion_r1934182431


##
docs/source/user-guide/configs.md:
##
@@ -64,6 +64,7 @@ Comet provides the following configuration settings.
 | spark.comet.explain.native.enabled | When this setting is enabled, Comet 
will provide a tree representation of the native query plan before execution 
and again after execution, with metrics. | false |
 | spark.comet.explain.verbose.enabled | When this setting is enabled, Comet 
will provide a verbose tree representation of the extended information. | false 
|
 | spark.comet.explainFallback.enabled | When this setting is enabled, Comet 
will provide logging explaining the reason(s) why a query stage cannot be 
executed natively. Set this to false to reduce the amount of logging. | false |
+| spark.comet.expression.allowIncompatible | Comet is not currently fully 
compatible with Spark for all expressions. Set this config to true to allow 
them anyway. See compatibility guide for more information. | false |

Review Comment:
   ```suggestion
   | spark.comet.expression.allowIncompatible | Comet is not currently fully 
compatible with Spark for all expressions. Set this config to true to allow 
them anyway. See Compatibility Guide for more information. 
(docs/user-guide/compatibility.md) | false |
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Move all array_* serde to new framework, use correct INCOMPAT config [datafusion-comet]

2025-01-29 Thread via GitHub


comphead commented on code in PR #1349:
URL: https://github.com/apache/datafusion-comet/pull/1349#discussion_r1934181473


##
common/src/main/scala/org/apache/comet/CometConf.scala:
##
@@ -605,6 +605,15 @@ object CometConf extends ShimCometConf {
   .booleanConf
   .createWithDefault(false)
 
+  val COMET_EXPR_ALLOW_INCOMPATIBLE: ConfigEntry[Boolean] =
+conf("spark.comet.expression.allowIncompatible")
+  .doc(
+"Comet is not currently fully compatible with Spark for all 
expressions. " +
+  "Set this config to true to allow them anyway. See compatibility 
guide " +
+  "for more information.")

Review Comment:
   ```suggestion
 "for more information. (docs/user-guide/compatibility.md)")
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] moving memory.rs out of datafusion/core [datafusion]

2025-01-29 Thread via GitHub


alamb commented on code in PR #14332:
URL: https://github.com/apache/datafusion/pull/14332#discussion_r1934183938


##
datafusion/catalog/src/lib.rs:
##
@@ -15,6 +15,16 @@
 // specific language governing permissions and limitations
 // under the License.
 
+//! Interfaces and default implementations of catalogs and schemas.
+//!
+//! Implementations
+//! * Simple memory based catalog: [`MemoryCatalogProviderList`], 
[`MemoryCatalogProvider`], [`MemorySchemaProvider`]

Review Comment:
   I think this looks good
   
   One way to review this is to run `cargo doc --open` and then look at how it 
is rendered



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Move all array_* serde to new framework, use correct INCOMPAT config [datafusion-comet]

2025-01-29 Thread via GitHub


comphead commented on code in PR #1349:
URL: https://github.com/apache/datafusion-comet/pull/1349#discussion_r1934186350


##
spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala:
##
@@ -929,6 +929,19 @@ object QueryPlanSerde extends Logging with 
ShimQueryPlanSerde with CometExprShim
   binding: Boolean): Option[Expr] = {
 SQLConf.get
 
+def convert(handler: CometExpressionSerde): Option[Expr] = {
+  handler match {
+case _: IncompatExpr if !CometConf.COMET_EXPR_ALLOW_INCOMPATIBLE.get() 
=>
+  withInfo(
+expr,
+s"$expr is not fully compatible with Spark. To enable it anyway, 
set " +

Review Comment:
   ```suggestion
   s"$expr is not fully compatible with Spark, see details in 
Compatibility Guide. To enable it anyway, set " +
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Move all array_* serde to new framework, use correct INCOMPAT config [datafusion-comet]

2025-01-29 Thread via GitHub


comphead commented on code in PR #1349:
URL: https://github.com/apache/datafusion-comet/pull/1349#discussion_r1934192543


##
spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala:
##
@@ -2371,83 +2384,17 @@ object QueryPlanSerde extends Logging with 
ShimQueryPlanSerde with CometExprShim
   withInfo(expr, "unsupported arguments for GetArrayStructFields", 
child)
   None
 }
-  case expr: ArrayRemove => CometArrayRemove.convert(expr, inputs, binding)
-  case expr if expr.prettyName == "array_contains" =>
-createBinaryExpr(
-  expr,
-  expr.children(0),
-  expr.children(1),
-  inputs,
-  binding,
-  (builder, binaryExpr) => builder.setArrayContains(binaryExpr))
-  case _ if expr.prettyName == "array_append" =>
-createBinaryExpr(
-  expr,
-  expr.children(0),
-  expr.children(1),
-  inputs,
-  binding,
-  (builder, binaryExpr) => builder.setArrayAppend(binaryExpr))
-  case _ if expr.prettyName == "array_intersect" =>
-createBinaryExpr(
-  expr,
-  expr.children(0),
-  expr.children(1),
-  inputs,
-  binding,
-  (builder, binaryExpr) => builder.setArrayIntersect(binaryExpr))
-  case ArrayJoin(arrayExpr, delimiterExpr, nullReplacementExpr) =>
-val arrayExprProto = exprToProto(arrayExpr, inputs, binding)
-val delimiterExprProto = exprToProto(delimiterExpr, inputs, binding)
-
-if (arrayExprProto.isDefined && delimiterExprProto.isDefined) {
-  val arrayJoinBuilder = nullReplacementExpr match {
-case Some(nrExpr) =>
-  val nullReplacementExprProto = exprToProto(nrExpr, inputs, 
binding)
-  ExprOuterClass.ArrayJoin
-.newBuilder()
-.setArrayExpr(arrayExprProto.get)
-.setDelimiterExpr(delimiterExprProto.get)
-.setNullReplacementExpr(nullReplacementExprProto.get)
-case None =>
-  ExprOuterClass.ArrayJoin
-.newBuilder()
-.setArrayExpr(arrayExprProto.get)
-.setDelimiterExpr(delimiterExprProto.get)
-  }
-  Some(
-ExprOuterClass.Expr
-  .newBuilder()
-  .setArrayJoin(arrayJoinBuilder)
-  .build())
-} else {
-  val exprs: List[Expression] = nullReplacementExpr match {
-case Some(nrExpr) => List(arrayExpr, delimiterExpr, nrExpr)
-case None => List(arrayExpr, delimiterExpr)
-  }
-  withInfo(expr, "unsupported arguments for ArrayJoin", exprs: _*)
-  None
-}
-  case ArraysOverlap(leftArrayExpr, rightArrayExpr) =>
-if (CometConf.COMET_CAST_ALLOW_INCOMPATIBLE.get()) {
-  createBinaryExpr(
-expr,
-leftArrayExpr,
-rightArrayExpr,
-inputs,
-binding,
-(builder, binaryExpr) => builder.setArraysOverlap(binaryExpr))
-} else {
-  withInfo(
-expr,
-s"$expr is not supported yet. To enable all incompatible casts, 
set " +
-  s"${CometConf.COMET_CAST_ALLOW_INCOMPATIBLE.key}=true")
-  None
-}
+  case _: ArrayRemove => convert(CometArrayRemove)
+  case _: ArrayContains => convert(CometArrayContains)
+  case _ if expr.prettyName == "array_append" => convert(CometArrayAppend)

Review Comment:
   ```suggestion
 // Function introduced in 3.4.0. Refer by name to provide 
compatibility with older Spark builds 
 case _ if expr.prettyName == "array_append" => 
convert(CometArrayAppend)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-29 Thread via GitHub


adriangb commented on PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#issuecomment-2620931040

   Yep, from an initial impression I agree with you @jayzhan211 that this does 
seem like it should work. Then you just need to hook into wherever selects / 
projections are evaluated and make sure that if a column isn't explicitly 
mentioned by name and it has that metadata, it is not included in the output.
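   
   A rough sketch of that filtering step (the metadata key is an assumption 
for illustration, not a decided name):
   
   ```rust
   use std::collections::HashMap;
   
   use datafusion::arrow::datatypes::{DataType, Field, Schema};
   
   /// Hypothetical marker key; the real mechanism would be whatever this PR
   /// settles on.
   const SYSTEM_COLUMN_KEY: &str = "datafusion.system_column";
   
   /// Expands `SELECT *` to only the fields not marked as system columns.
   fn wildcard_fields(schema: &Schema) -> Vec<&Field> {
       schema
           .fields()
           .iter()
           .filter(|f| f.metadata().get(SYSTEM_COLUMN_KEY).map(String::as_str) != Some("true"))
           .map(|f| f.as_ref())
           .collect()
   }
   
   fn main() {
       let visible = Field::new("c1", DataType::Int32, true);
       let hidden = Field::new("_row_id", DataType::UInt64, false).with_metadata(
           HashMap::from([(SYSTEM_COLUMN_KEY.to_string(), "true".to_string())]),
       );
       let schema = Schema::new(vec![visible, hidden]);
       let names: Vec<String> = wildcard_fields(&schema)
           .iter()
           .map(|f| f.name().to_string())
           .collect();
       // `_row_id` is hidden from `SELECT *` but still addressable by name.
       assert_eq!(names, vec!["c1".to_string()]);
   }
   ```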


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Fix incorrect searched CASE optimization [datafusion]

2025-01-29 Thread via GitHub


findepi commented on code in PR #14349:
URL: https://github.com/apache/datafusion/pull/14349#discussion_r1933432132


##
datafusion/sqllogictest/test_files/case.slt:
##
@@ -289,12 +289,22 @@ query B
 select case when a=1 then false end from foo;
 
 false
-false
-false
-false
-false
-false

Review Comment:
   These results were not correct.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Improve speed of `median` by implementing special `GroupsAccumulator` [datafusion]

2025-01-29 Thread via GitHub


Rachelint commented on code in PR #13681:
URL: https://github.com/apache/datafusion/pull/13681#discussion_r1933435620


##
datafusion/functions-aggregate/src/median.rs:
##
@@ -230,6 +276,212 @@ impl<T: ArrowNumericType> Accumulator for MedianAccumulator<T> {
 }
 }
 
+/// The median groups accumulator accumulates the raw input values
+///
+/// For calculating the accurate medians of groups, we need to store all values
+/// of groups before final evaluation.
+/// So values in each group will be stored in a `Vec<T>`, and the total group
+/// values will be actually organized as a `Vec<Vec<T>>`.
+///
+#[derive(Debug)]
+struct MedianGroupsAccumulator<T: ArrowNumericType + Send> {
+    data_type: DataType,
+    group_values: Vec<Vec<T::Native>>,
+}
+
+impl<T: ArrowNumericType + Send> MedianGroupsAccumulator<T> {
+    pub fn new(data_type: DataType) -> Self {
+        Self {
+            data_type,
+            group_values: Vec::new(),
+        }
+    }
+}
+
+impl<T: ArrowNumericType + Send> GroupsAccumulator for MedianGroupsAccumulator<T> {
+    fn update_batch(
+        &mut self,
+        values: &[ArrayRef],
+        group_indices: &[usize],
+        opt_filter: Option<&BooleanArray>,
+        total_num_groups: usize,
+    ) -> Result<()> {
+        assert_eq!(values.len(), 1, "single argument to update_batch");
+        let values = values[0].as_primitive::<T>();
+
+        // Push the `not nulls + not filtered` row into its group
+        self.group_values.resize(total_num_groups, Vec::new());
+        accumulate(
+            group_indices,
+            values,
+            opt_filter,
+            |group_index, new_value| {
+                self.group_values[group_index].push(new_value);
+            },
+        );
+
+        Ok(())
+    }
+
+    fn merge_batch(
+        &mut self,
+        values: &[ArrayRef],
+        group_indices: &[usize],
+        // Since aggregate filter should be applied in partial stage, in final stage there should be no filter
+        _opt_filter: Option<&BooleanArray>,
+        total_num_groups: usize,
+    ) -> Result<()> {
+        assert_eq!(values.len(), 1, "one argument to merge_batch");
+
+        // The merged values should be organized like as a `ListArray` which is nullable
+        // (input with nulls usually generated from `convert_to_state`), but `inner array` of
+        // `ListArray` is `non-nullable`.
+        //
+        // Following is the possible and impossible input `values`:
+        //
+        // # Possible values
+        // ```text
+        //   group 0: [1, 2, 3]
+        //   group 1: null (list array is nullable)
+        //   group 2: [6, 7, 8]
+        //   ...
+        //   group n: [...]
+        // ```
+        //
+        // # Impossible values
+        // ```text
+        //   group x: [1, 2, null] (values in list array is non-nullable)
+        // ```
+        //
+        let input_group_values = values[0].as_list::<i32>();
+
+        // Ensure group values big enough
+        self.group_values.resize(total_num_groups, Vec::new());
+
+        // Extend values to related groups
+        // TODO: avoid using iterator of the `ListArray`, this will lead to
+        // many calls of `slice` of its `inner array`, and `slice` is not
+        // so efficient (due to the calculation of `null_count` for each `slice`).
+        group_indices
+            .iter()
+            .zip(input_group_values.iter())
+            .for_each(|(&group_index, values_opt)| {
+                if let Some(values) = values_opt {
+                    let values = values.as_primitive::<T>();
+                    self.group_values[group_index].extend(values.values().iter());
+                }
+            });
+
+        Ok(())
+    }
+
+    fn state(&mut self, emit_to: EmitTo) -> Result<Vec<ArrayRef>> {
+        // Emit values
+        let emit_group_values = emit_to.take_needed(&mut self.group_values);
+
+        // Build offsets
+        let mut offsets = Vec::with_capacity(self.group_values.len() + 1);
+        offsets.push(0);
+        let mut cur_len = 0_i32;
+        for group_value in &emit_group_values {
+            cur_len += group_value.len() as i32;
+            offsets.push(cur_len);
+        }
+        // TODO: maybe we can use `OffsetBuffer::new_unchecked` like what in `convert_to_state`,
+        // but safety should be considered more carefully here (and I am not sure if it can get
+        // performance improvement when we introduce checks to keep the safety...).
+        //
+        // Can see more details in:
+        // https://github.com/apache/datafusion/pull/13681#discussion_r1931209791
+        //
+        let offsets = OffsetBuffer::new(ScalarBuffer::from(offsets));
+
+        // Build inner array
+        let flatten_group_values =
+            emit_group_values.into_iter().flatten().collect::<Vec<_>>();
+        let group_values_array =
+            PrimitiveArray::<T>::new(ScalarBuffer::from(flatten_group_values), None)
+                .with_data_type(self.data_type.clone());
+
+        // Build the result list array
+        let result_list_array = ListArray::new(
+

Re: [PR] Extend lambda support for ClickHouse, DuckDB and Generic dialects [datafusion-sqlparser-rs]

2025-01-29 Thread via GitHub


gstvg commented on code in PR #1686:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1686#discussion_r1933397382


##
src/dialect/mod.rs:
##
@@ -340,12 +340,21 @@ pub trait Dialect: Debug + Any {
 /// Returns true if the dialect supports lambda functions, for example:
 ///
 /// ```sql
-/// SELECT transform(array(1, 2, 3), x -> x + 1); -- returns [2,3,4]
+/// SELECT transform(array(1, 2, 3), (x) -> x + 1); -- returns [2,3,4]
 /// ```
 fn supports_lambda_functions(&self) -> bool {
 false
 }
 
+/// Returns true if the dialect supports lambda functions without 
parentheses for a single argument, for example:

Review Comment:
   cc @jmhain 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] chore(deps): bump tempfile from 3.15.0 to 3.16.0 in /datafusion-cli [datafusion]

2025-01-29 Thread via GitHub


dependabot[bot] opened a new pull request, #14350:
URL: https://github.com/apache/datafusion/pull/14350

   Bumps [tempfile](https://github.com/Stebalien/tempfile) from 3.15.0 to 
3.16.0.
   
   Changelog
   Sourced from tempfile's changelog 
(https://github.com/Stebalien/tempfile/blob/master/CHANGELOG.md):
   
   3.16.0
   
   - Update getrandom to 0.3.0 (thanks to @paolobarbolini).
   - Allow windows-sys versions 0.59.x in addition to 0.59.0 (thanks 
@ErichDonGubler).
   - Improved security documentation (thanks to @n0toose for collaborating 
with me on this).
   
   Commits
   
   - 6ecd9f1 chore: release 3.16.0
   - 3693c01 build: upgrade getrandom to v0.3.0 (#320)
   - a4596e5 build: allow windows-sys 0.59 (#319)
   - 3ac6d4e chore: fix clippy lifetime warning (#318)
   - 3a722e8 docs: improve & expand security documentation (#317)
   - See full diff: https://github.com/Stebalien/tempfile/compare/v3.15.0...v3.16.0
   
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=tempfile&package-manager=cargo&previous-version=3.15.0&new-version=3.16.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot show  ignore conditions` will show all of 
the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] chore(deps): bump serde_json from 1.0.137 to 1.0.138 in /datafusion-cli [datafusion]

2025-01-29 Thread via GitHub


dependabot[bot] opened a new pull request, #14351:
URL: https://github.com/apache/datafusion/pull/14351

   Bumps [serde_json](https://github.com/serde-rs/json) from 1.0.137 to 1.0.138.
   
   Release notes
   Sourced from serde_json's releases 
(https://github.com/serde-rs/json/releases):
   
   v1.0.138
   
   - Documentation improvements
   
   Commits
   
   - c916099 Release 1.0.138
   - dc29e48 Move BufReader to caller
   - 29122f9 Sort imports from PR 1237
   - d33c1b5 Merge pull request #1237 from JonathanBrouwer/master
   - 8c2d800 Add more warnings to apply buffering on docs of affected functions
   - 65bbd1a Fix example of from_reader not applying buffering when it should
   - 87f78da More precise gitignore patterns
   - 4134f11 Remove *.sw[po] from gitignore
   - c7626db Remove **/*.rs.bk from project-specific gitignore
   - See full diff: https://github.com/serde-rs/json/compare/v1.0.137...v1.0.138
   
   
   
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=serde_json&package-manager=cargo&previous-version=1.0.137&new-version=1.0.138)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot show  ignore conditions` will show all of 
the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] Incorrect result for IS NOT NULL predicate over UNION ALL query [datafusion]

2025-01-29 Thread via GitHub


findepi opened a new issue, #14352:
URL: https://github.com/apache/datafusion/issues/14352

   ### Describe the bug
   
   Incorrect result for IS NOT NULL predicate over UNION ALL query
   
   ### To Reproduce
   
   ```sql
   SELECT
   a,
   a IS NOT NULL
   FROM (
   SELECT 'foo' AS a, 3 AS b
   UNION ALL
   SELECT NULL AS a, 4 AS b
   )
   ```
   
   DF
   
   ```
   +--+---+
   | a| a IS NOT NULL |
   +--+---+
   | NULL | true  |   😱
   | foo  | true  |
   +--+---+
   ```
   
   
   ### Expected behavior
   
   A `NULL` value checked with `IS NOT NULL` should return `false`
   
   ```
   trino> SELECT
   -> a,
   -> a IS NOT NULL
   -> FROM (
   -> SELECT 'foo' AS a, 3 AS b
   -> UNION ALL
   -> SELECT NULL AS a, 4 AS b
   -> );
 a   | _col1
   --+---
foo  | true
NULL | false
   ```
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] chore(deps): update getrandom requirement from 0.2.8 to 0.3.1 [datafusion]

2025-01-29 Thread via GitHub


dependabot[bot] opened a new pull request, #14353:
URL: https://github.com/apache/datafusion/pull/14353

   Updates the requirements on 
[getrandom](https://github.com/rust-random/getrandom) to permit the latest 
version.
   
   Changelog
   Sourced from getrandom's changelog 
(https://github.com/rust-random/getrandom/blob/master/CHANGELOG.md):
   
   [0.3.1] - 2025-01-28
   
   Fixed
   - Build error on Android (#588)
   
   [0.3.0] - 2025-01-25
   
   Breaking Changes
   
   Changed
   - Bump MSRV to 1.63 (#542)
   - Rename `getrandom` and `getrandom_uninit` functions to `fill` and 
`fill_uninit` respectively (#532)
   
   Removed
   - `wasm32-wasi` target support (use `wasm32-wasip1` or `wasm32-wasip2` 
instead) (#499)
   - `linux_disable_fallback`, `rdrand`, `js`, `test-in-browser`, and `custom` 
crate features in favor of configuration flags (#504)
   - `register_custom_getrandom!` macro (#504)
   - Implementation of `From` for `Error` and the `Error::code` method (#507)
   - Internet Explorer 11 support (#554)
   - Target-specific associated `Error` constants (#562)
   
   Changed
   - Use `ProcessPrng` on Windows 10 and up, and use `RtlGenRandom` on older 
Windows versions (#415)
   - Do not use locale-specific `strerror_r` for retrieving error code 
descriptions (#440)
   - Avoid assuming `usize` is the native word size in the `rdrand` backend 
(#442)
   - Do not read from `errno` when `libc` did not indicate error on Solaris 
(#448)
   - Switch from libpthread's mutex to futex on Linux and to nanosleep-based 
wait loop on other targets in the `use_file` backend (#490)
   - Do not retry on `EAGAIN` while polling `/dev/random` on Linux (#522)
   - Remove separate codepath for Node.js in the `wasm_js` backend (bumps 
minimum supported Node.js version to v19) (#557)
   - Use `js_namespace` in the `wasm_js` backend (#559)
   
   Added
   - `wasm32-wasip1` and `wasm32-wasip2` support (#499)
   - `getrandom_backend` configuration flag for selection of opt-in backends 
(#504)
   - `Error::new_custom` method (#507)
   - `rndr` opt-in backend (#512)
   - Automatic MemorySanitizer support (#521) (#571)
   - `u32` and `u64` functions for generating random values of the respective 
type (#544)
   - `wasm32v1-none` support in the `wasm_js` backend (#560)
   - `wasm_js` crate feature which allows users to enable the `wasm_js` opt-in 
backend (#574)
   
   Fixed
   - NetBSD fallback code based on `KERN_ARND` (#555)
   
   ... (truncated)
   
   Commits
   
   - 2bb3959 Release v0.3.1 (#589)
   - ac379da Fix build error on Android (#588)
   - aa96363 Add benchmarks for u32/u64 functions (#582)
   - ... (truncated)

Re: [PR] chore(deps): update getrandom requirement from 0.2.8 to 0.3.0 [datafusion]

2025-01-29 Thread via GitHub


dependabot[bot] closed pull request #14315: chore(deps): update getrandom 
requirement from 0.2.8 to 0.3.0
URL: https://github.com/apache/datafusion/pull/14315


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore(deps): update getrandom requirement from 0.2.8 to 0.3.0 [datafusion]

2025-01-29 Thread via GitHub


dependabot[bot] commented on PR #14315:
URL: https://github.com/apache/datafusion/pull/14315#issuecomment-2621010404

   Superseded by #14353.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: FULL OUTER JOIN and LIMIT produces wrong results [datafusion]

2025-01-29 Thread via GitHub


zhuqi-lucas commented on PR #14338:
URL: https://github.com/apache/datafusion/pull/14338#issuecomment-2621010452

   Thank you @alamb and @xudong963!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] chore(deps): bump korandoru/hawkeye from 5 to 6 [datafusion]

2025-01-29 Thread via GitHub


dependabot[bot] opened a new pull request, #14354:
URL: https://github.com/apache/datafusion/pull/14354

   Bumps [korandoru/hawkeye](https://github.com/korandoru/hawkeye) from 5 to 6.
   
   Release notes
   Sourced from korandoru/hawkeye's releases
   (https://github.com/korandoru/hawkeye/releases).
   
   6.0.0 2025-01-28
   Release Notes
   Breaking changes
   Now, HawkEye uses MiniJinja as the template engine.
   All the properties configured will be passed to the template 
engine as the props value, and thus:
   
   - Previous ${property} should be replaced with {{ props["property"] }}.
   - The previous built-in variable hawkeye.core.filename is now attrs.filename.
   - The previous built-in variable hawkeye.git.fileCreatedYear is now 
attrs.git_file_created_year.
   - The previous built-in variable hawkeye.git.fileModifiedYear is now 
attrs.git_file_modified_year.
   New properties:
   
   - attrs.git_authors is a collection of authors of the file. You can join 
them with ", " to get a string by {{ attrs.git_authors | join(", ") }}.
   
   Notable changes
   Now, HawkEye would detect a leading BOM (Byte Order Mark) and remove it 
if it exists (#166). I tend to treat this as a bug fix, but it may affect 
the output of the header.
   Install hawkeye 6.0.0
   Install prebuilt binaries via shell script
   curl --proto '=https' --tlsv1.2 -LsSf https://github.com/korandoru/hawkeye/releases/download/v6.0.0/hawkeye-installer.sh | sh
   
   Download hawkeye 6.0.0
   
   
   
   File | Platform | Checksum
   hawkeye-aarch64-apple-darwin.tar.xz | Apple Silicon macOS | hawkeye-aarch64-apple-darwin.tar.xz.sha256
   hawkeye-x86_64-apple-darwin.tar.xz | Intel macOS | hawkeye-x86_64-apple-darwin.tar.xz.sha256
   hawkeye-x86_64-pc-windows-msvc.zip | x64 Windows | hawkeye-x86_64-pc-windows-msvc.zip.sha256
   hawkeye-x86_64-unknown-linux-gnu.tar.xz | x64 Linux | hawkeye-x86_64-unknown-linux-gnu.tar.xz.sha256
   (all files under https://github.com/korandoru/hawkeye/releases/download/v6.0.0/)
   
   
   
   v5.9.1
   Install hawkeye 5.9.1
   Install prebuilt binaries via shell script
   curl --proto '=https' --tlsv1.2 -LsSf https://github.com/korandoru/hawkeye/releases/download/v5.9.1/hawkeye-installer.sh | sh
   
   Download hawkeye 5.9.1
   
   
   ... (truncated)
   
   
   Changelog
   Sourced from korandoru/hawkeye's changelog
   (https://github.com/korandoru/hawkeye/blob/main/CHANGELOG.md).
   
   CHANGELOG
   All notable changes to this project will be documented in this file.
   [6.0.0] 2025-01-28
   Breaking changes
   Now, HawkEye uses MiniJinja as the template engine.
   All the properties configured will be passed to the template 
engine as the props value, and thus:
   
   - Previous ${property} should be replaced with {{ props["property"] }}.
   - The previous built-in variable hawkeye.core.filename is now attrs.filename.
   - The previous built-in variable hawkeye.git.fileCreatedYear is now 
attrs.git_file_created_year.
   - The previous built-in variable hawkeye.git.fileModifiedYear is now 
attrs.git_file_modified_year.
   New properties:
   
   - attrs.git_authors is a collection of authors of the file. You can join 
them with ", " to get a string by {{ attrs.git_authors | join(", ") }}.
   
   Notable changes
   Now, HawkEye would detect a leading BOM (Byte Order Mark) and remove it 
if it exists (#166). I tend to treat this as a bug fix, but it may affect 
the output of the header.
   
   
   
   Commits
   
   - 479470f chore: release 6.0.0
   - 1467c23 feat: render by minijnja and support git authors (#169)
   - 223367e fix: skip leading bom if exist (#166)
   - See full diff in compare view 
(https://github.com/korandoru/hawkeye/compare/v5...v6)
   
   
   
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=korandor

Re: [I] Add support for lambda/higher order functions [datafusion]

2025-01-29 Thread via GitHub


gstvg commented on issue #14205:
URL: https://github.com/apache/datafusion/issues/14205#issuecomment-2621052052

   Hi @rkrishn7, have you already started any work?
   
   Since creating this issue, I’ve been working on a PoC to refine the 
proposal, and I managed to get it working. I’d like to continue with it if 
you’re OK with that; I probably should have assigned the issue to myself 
earlier, my bad 🤦🏻‍♂️! Anyway, many thanks for your interest and for your 
questions, and I would really appreciate your input on future questions (I 
expect many to arise!)
   
   > How do implementers define a signature for a UDF that takes one or more 
lambdas? Do we just skip type checking of lambda arguments? How does this 
affect the ordering?
   
   I added a new [ScalarUDF method 
invoke_with_lambda_args](https://github.com/apache/datafusion/compare/main...gstvg:arrow-datafusion:lambda#diff-ce24057bb984fe97516c33a9b1e690e20c85a5943503caa65db40d78ca3fdc7aR128)
 that receives a lambda-aware variant of ScalarFunctionArgs instead of the 
plain ScalarFunctionArgs, and has a [default 
implementation](https://github.com/apache/datafusion/compare/main...gstvg:arrow-datafusion:lambda#diff-8a327db2db945bcf6ca2b4229885532feae127e94a450600d3fac6ecdc0eeb3fR702)
 calling invoke_with_args, returning an error if any arg is a lambda.
   This should preserve the ordering of any arg.
   But maybe we could add a `lambda_arguments` field to `ScalarFunctionArgs`, 
similar to 
[here](https://github.com/apache/datafusion/compare/main...gstvg:arrow-datafusion:lambda#diff-8a327db2db945bcf6ca2b4229885532feae127e94a450600d3fac6ecdc0eeb3fR413).
 What do you think?
   
   > How should one map lambda function arguments to concrete values in the 
body? If the function only operates on a list, this becomes easier to express.
   
   The lambda implementation should [build a 
RecordBatch](https://github.com/apache/datafusion/compare/main...gstvg:arrow-datafusion:lambda#diff-ce24057bb984fe97516c33a9b1e690e20c85a5943503caa65db40d78ca3fdc7aR147),
 likely using some other concrete argument, and pass it to the lambda 
[PhysicalExpr 
evaluate](https://github.com/apache/datafusion/compare/main...gstvg:arrow-datafusion:lambda#diff-ce24057bb984fe97516c33a9b1e690e20c85a5943503caa65db40d78ca3fdc7aR147)
 method
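   
   To make this concrete, here is a minimal sketch (not the PoC code; the 
helper name and signature are made up for illustration) of mapping a lambda 
over a list by binding its parameter to the list's flattened child values:
   
   ```rust
   use std::sync::Arc;
   
   use arrow::array::{Array, ArrayRef, ListArray};
   use arrow::datatypes::{Field, Schema};
   use arrow::record_batch::RecordBatch;
   use datafusion_common::Result;
   use datafusion_physical_expr::PhysicalExpr;
   
   /// Hypothetical helper: evaluate `lambda` (e.g. the body of `x -> x * 2`,
   /// compiled to a PhysicalExpr that references `x` by column name) once per
   /// element of `list`.
   fn eval_lambda_over_list(
       list: &ListArray,
       lambda_param: &str,
       lambda: &Arc<dyn PhysicalExpr>,
   ) -> Result<ArrayRef> {
       // Bind the lambda parameter to the flattened child values.
       let schema = Arc::new(Schema::new(vec![Field::new(
           lambda_param,
           list.values().data_type().clone(),
           true,
       )]));
       let batch = RecordBatch::try_new(schema, vec![Arc::clone(list.values())])?;
       // Evaluate the lambda body like any other expression; the caller would
       // re-wrap the result with the original list offsets to rebuild the list.
       lambda.evaluate(&batch)?.into_array(batch.num_rows())
   }
   ```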
   
   
   
   That said, if you have any thoughts on the [current 
interface](https://github.com/apache/datafusion/compare/main...gstvg:arrow-datafusion:lambda#diff-ce24057bb984fe97516c33a9b1e690e20c85a5943503caa65db40d78ca3fdc7aR128),
 I’d love to hear your feedback. It's a `list_map([1,2], x -> x*2)` udf, you 
can see the [executed output 
here](https://github.com/apache/datafusion/compare/main...gstvg:arrow-datafusion:lambda#diff-53634af9d48017f60cb545dfe2aaab20d1f7fa3435947c161ef22515fa62ed8dR26).
 Thanks! 🙏


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Add support for lambda/higher order functions [datafusion]

2025-01-29 Thread via GitHub


gstvg commented on issue #14205:
URL: https://github.com/apache/datafusion/issues/14205#issuecomment-2621070075

   Lambda syntax only works with the Databricks dialect; proposed fix: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1686


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: LimitPushdown rule uncorrect remove some GlobalLimitExec [datafusion]

2025-01-29 Thread via GitHub


zhuqi-lucas commented on code in PR #14245:
URL: https://github.com/apache/datafusion/pull/14245#discussion_r1933513054


##
datafusion/physical-optimizer/src/limit_pushdown.rs:
##
@@ -247,7 +246,15 @@ pub fn pushdown_limit_helper(
 }
 } else {
 // Add fetch or a `LimitExec`:
-global_state.satisfied = true;
+// If the plan's children have limit, we shouldn't change the 
global state to true,

Review Comment:
   Good question @xudong963. I added an slt test for the case where the 
children's limit is >= the global limit; the limit should still be pushed 
down, consistent with the current behaviour. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Script and documentation for regenerating sqlite test files [datafusion]

2025-01-29 Thread via GitHub


Omega359 commented on code in PR #14290:
URL: https://github.com/apache/datafusion/pull/14290#discussion_r1934713735


##
datafusion/sqllogictest/regenerate_sqlite_files.sh:
##
@@ -0,0 +1,179 @@
+#!/bin/bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+echo "This script is experimental! Please read the following completely to 
understand
+what this script does before running it.
+
+This script is designed to regenerate the .slt files in 
datafusion-testing/data/sqlite/
+from source files obtained from a git repository. To do that the following 
steps are
+performed:
+
+- Verify required commands are installed
+- Clone the remote git repository into /tmp/sqlitetesting
+- Delete all existing .slt files in datafusion-testing/data/sqlite/ folder
+- Copy the .test files from the cloned git repo into datafusion-testing
+- Remove a few directories and files from the copied files (not relevant to 
DataFusion)
+- Rename the .test files to .slt and cleanse the files. Cleansing involves:
+ - dos2unix
+ - removing all references to mysql, mssql and postgresql
+ - adds in a new 'control resultmode valuewise' statement at the top of all 
files
+ - updates the files to change 'AS REAL' to 'AS FLOAT8'
+- Replace the sqllogictest-rs dependency in the Cargo.toml with a version to
+  a git repository that has been custom written to properly complete the files
+  with comparison of datafusion results with postgresql
+- Replace the sqllogictest.rs file with a customized version from a github gist
+  that will work with the customized sqllogictest-rs dependency
+- Run the sqlite test with completion (takes > 1 hr)
+- Update a few results to ignore known issues
+- Run sqlite test to verify results
+- Perform cleanup to revert changes to the Cargo.toml file/sqllogictest.rs file
+- Delete backup files and the /tmp/sqlitetesting directory
+"
+read -r -p "Do you understand and accept the risk? (yes/no): " acknowledgement
+
+if [ "${acknowledgement,,}" != "yes" ]; then
+  exit 0
+else
+  echo "Ok, Proceeding..."
+fi
+
+if [ ! -x "$(command -v sd)" ]; then
+  echo "This script required 'sd' to be installed. Install via 'cargo install 
sd' or using your local package manager"
+  exit 0
+else
+  echo "sd command is installed, proceeding..."
+fi
+
+DF_HOME=$(realpath "../../")
+
+# location where we'll clone sql-logic-test repos
+if [ ! -d "/tmp/sqlitetesting" ]; then
+  mkdir /tmp/sqlitetesting
+fi
+
+if [ ! -d "/tmp/sqlitetesting/sql-logic-test" ]; then
+  echo "Cloning sql-logic-test into /tmp/sqlitetesting/"
+  cd /tmp/sqlitetesting/ || exit;
+  git clone https://github.com/hydromatic/sql-logic-test.git
+fi
+
+echo "Removing all existing .slt files from datafusion-testing/data/sqlite/ 
directory"
+
+cd "$DF_HOME/datafusion-testing/data/sqlite/" || exit;
+find ./ -type f -name "*.slt" -exec rm {} \;
+
+echo "Copying .test files from sql-logic-test to 
datafusion-testing/data/sqlite/"
+
+cp -r /tmp/sqlitetesting/sql-logic-test/src/main/resources/test/* ./
+
+echo "Removing 'evidence/*' and 'index/delete/*' directories from 
datafusion-testing"
+
+find ./evidence/ -type f -name "*.test" -exec rm {} \;
+rm -rf ./index/delete/1
+rm -rf ./index/delete/10
+rm -rf ./index/delete/100
+rm -rf ./index/delete/1000
+# this file is empty and causes the sqllogictest-rs code to fail
+rm ./index/view/10/slt_good_0.test
+
+echo "Renaming .test files to .slt and cleansing the files ..."
+
+# add hash-threshold lines into these 3 files as they were missing
+sed -i '1s/^/hash-threshold 8\n\n/' ./select1.test
+sed -i '1s/^/hash-threshold 8\n\n/' ./select4.test
+sed -i '1s/^/hash-threshold 8\n\n/' ./select5.test
+# rename
+find ./ -type f -name "*.test" -exec rename -f 's/\.test/\.slt/' {} \;
+# gotta love windows :/
+find ./ -type f -name "*.slt" -exec dos2unix --quiet {} \;
+# add in control resultmode
+find ./ -type f -name "*.slt" -exec sd -f i 'hash-threshold 8\n' 
'hash-threshold 8\ncontrol resultmode valuewise\n' {} \;
+# remove mysql tests and skipif lines
+find ./ -type f -name "*.slt" -exec sd -f i 'onlyif 
mysql.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
+find ./ -type f -name "*.slt" -exec sd -f i 'onlyif 
mysql.+\n.+\n.+\n.+\n.+\

Re: [I] Type Coercion fails for List with inner type struct which has large/view types [datafusion]

2025-01-29 Thread via GitHub


alamb commented on issue #14154:
URL: https://github.com/apache/datafusion/issues/14154#issuecomment-2622957118

   Possibly related: 
   - https://github.com/apache/datafusion/pull/12490 (released in DataFusion 
44.0.0)
   - https://github.com/apache/datafusion/pull/13452 (also released in 
DataFusion 44.0.0)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Type Coercion fails for List with inner type struct which has large/view types [datafusion]

2025-01-29 Thread via GitHub


alamb commented on issue #14154:
URL: https://github.com/apache/datafusion/issues/14154#issuecomment-2622959604

   So the next step for this PR is to find a DataFusion-only reproducer that 
works in DF 43 but not in DF 44. I will try to do so tomorrow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Support marking columns as system columns via Field's metadata [datafusion]

2025-01-29 Thread via GitHub


chenkovsky commented on PR #14362:
URL: https://github.com/apache/datafusion/pull/14362#issuecomment-2622977789

   As described in the document 
https://spark.apache.org/docs/3.5.1/api/java/org/apache/spark/sql/connector/catalog/SupportsMetadataColumns.html:
   
   If a table column and a metadata column have the same name, the conflict is 
resolved by either renaming or suppressing the metadata column. I think we 
should also guarantee that the metadata column is suppressed regardless of the 
field order.
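   
   For illustration only (not code from this PR; the metadata key and the 
helper are made up), the suppression could be a schema rewrite that drops a 
system column whenever a regular column with the same name exists, wherever 
it appears in the field order:
   
   ```rust
   use arrow::datatypes::{Field, Schema};
   
   /// Hypothetical helper: keep a system column only if no regular column
   /// shares its name, mirroring Spark's "suppress the metadata column" rule.
   fn suppress_conflicting_system_columns(schema: &Schema) -> Schema {
       let is_system =
           |f: &Field| f.metadata().contains_key("datafusion.system_column");
       let kept: Vec<Field> = schema
           .fields()
           .iter()
           .map(|f| f.as_ref())
           .filter(|&f| {
               !is_system(f)
                   || !schema
                       .fields()
                       .iter()
                       .any(|o| !is_system(o.as_ref()) && o.name() == f.name())
           })
           .cloned()
           .collect();
       Schema::new(kept)
   }
   ```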
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Prepare for DataFusion 45 (bump to DataFusion rev 5592834 + Arrow 54.0.0) [datafusion-comet]

2025-01-29 Thread via GitHub


andygrove commented on code in PR #1332:
URL: https://github.com/apache/datafusion-comet/pull/1332#discussion_r1934746178


##
native/core/src/execution/operators/filter.rs:
##
@@ -62,6 +65,8 @@ pub struct FilterExec {
 default_selectivity: u8,
 /// Properties equivalence properties, partitioning, etc.
 cache: PlanProperties,
+/// The projection indices of the columns in the output schema of join
+projection: Option<Vec<usize>>,

Review Comment:
   We currently maintain a copy of DataFusion's `FilterExec` with one small 
change, so I copied over the latest to keep in sync and then re-applied the 
change that we need (for memory safety because of the way we re-use buffers).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Example for using a separate threadpool for CPU bound work (try 2) [datafusion]

2025-01-29 Thread via GitHub


rohitrastogi commented on code in PR #14286:
URL: https://github.com/apache/datafusion/pull/14286#discussion_r1934929386


##
datafusion-examples/examples/thread_pools_lib/dedicated_executor.rs:
##
@@ -0,0 +1,1778 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! [DedicatedExecutor] for running CPU-bound tasks on a separate tokio 
runtime.
+
+use crate::SendableRecordBatchStream;
+use async_trait::async_trait;
+use datafusion::physical_plan::stream::RecordBatchStreamAdapter;
+use datafusion_common::error::GenericError;
+use datafusion_common::DataFusionError;
+use futures::stream::BoxStream;
+use futures::{
+future::{BoxFuture, Shared},
+Future, FutureExt, Stream, StreamExt, TryFutureExt,
+};
+use log::{info, warn};
+use object_store::path::Path;
+use object_store::{
+GetOptions, GetResult, GetResultPayload, ListResult, MultipartUpload, 
ObjectMeta,
+ObjectStore, PutMultipartOpts, PutOptions, PutPayload, PutResult, 
UploadPart,
+};
+use std::cell::RefCell;
+use std::pin::Pin;
+use std::sync::RwLock;
+use std::task::{Context, Poll};
+use std::{fmt::Display, sync::Arc, time::Duration};
+use tokio::runtime::Builder;
+use tokio::task::JoinHandle;
+use tokio::{
+runtime::Handle,
+sync::{oneshot::error::RecvError, Notify},
+task::JoinSet,
+};
+use tokio_stream::wrappers::ReceiverStream;
+
+/// Create a [`DedicatedExecutorBuilder`] from a tokio [`Builder`]
+impl From<Builder> for DedicatedExecutorBuilder {
+fn from(value: Builder) -> Self {
+Self::new_from_builder(value)
+}
+}
+
+/// Manages a separate tokio [`Runtime`] (thread pool) for executing CPU bound
+/// tasks such as DataFusion `ExecutionPlans`.
+///
+/// See [`DedicatedExecutorBuilder`] for creating a new instance.
+///
+/// A `DedicatedExecutor` helps avoid issues when running IO and CPU bound 
tasks on the
+/// same thread pool by running futures (and any `tasks` that are
+/// `tokio::task::spawned` by them) on a separate tokio [`Executor`].
+///
+/// `DedicatedExecutor`s can be `clone`ed and all clones share the same thread 
pool.
+///
+/// Since the primary use for a `DedicatedExecutor` is offloading CPU bound
+/// work, IO work can not be performed on tasks launched in the Executor.
+///
+/// To perform IO, see:
+/// - [`Self::spawn_io`]
+/// - [`Self::wrap_object_store`]
+///
+/// When [`DedicatedExecutorBuilder::build`] is called, a reference to the
+/// "current" tokio runtime will be stored and used, via 
[`register_io_runtime`] by all
+/// threads spawned by the executor. Any I/O done by threads in this
+/// [`DedicatedExecutor`] should use [`spawn_io`], which will run them on the
+/// I/O runtime.
+///
+/// # TODO examples
+///
+/// # Background
+///
+/// Tokio has the notion of the "current" runtime, which runs the current 
future
+/// and any tasks spawned by it. Typically, this is the runtime created by
+/// `tokio::main` and is used for the main application logic and I/O handling
+///
+/// For CPU bound work, such as DataFusion plan execution, it is important to
+/// run on a separate thread pool to avoid blocking the I/O handling for 
extended
+/// periods of time in order to avoid long poll latencies (which decreases the
+/// throughput of small requests under concurrent load).
+///
+/// # IO Scheduling
+///
+/// I/O, such as network calls, should not be performed on the runtime managed
+/// by [`DedicatedExecutor`]. As tokio is a cooperative scheduler, long-running
+/// CPU tasks will not be preempted and can therefore starve servicing of other
+/// tasks. This manifests in long poll-latencies, where a task is ready to run
+/// but isn't being scheduled to run. For CPU-bound work this isn't a problem 
as
+/// there is no external party waiting on a response, however, for I/O tasks,
+/// long poll latencies can prevent timely servicing of IO, which can have a
+/// significant detrimental effect.
+///
+/// # Details
+///
+/// The worker thread priority is set to low so that such tasks do
+/// not starve other more important tasks (such as answering health checks)
+///
+/// Follows the example from stack overflow and spawns a new
+/// thread to install a Tokio runtime "context"
+/// 
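   
   Since the doc comment above still says "TODO examples", a rough usage 
sketch (hypothetical: the constructor, `spawn` return type, and error 
handling below are assumptions based on the builder shown in this diff):
   
   ```rust
   // Rough sketch, not from the PR: run CPU-bound work on the dedicated
   // runtime; IO keeps running on the runtime that built the executor.
   async fn example() {
       // Assumed constructor; the diff above shows DedicatedExecutorBuilder.
       let cpu_exec = DedicatedExecutorBuilder::new().build();
       let answer = cpu_exec
           .spawn(async move {
               // CPU-heavy work; any IO in here should go through `spawn_io`.
               (0..1_000_000u64).sum::<u64>()
           })
           .await
           .expect("dedicated executor task failed");
       assert_eq!(answer, 499_999_500_000);
   }
   ```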


Re: [PR] move information_schema to datafusion-catalog [datafusion]

2025-01-29 Thread via GitHub


alamb commented on code in PR #14364:
URL: https://github.com/apache/datafusion/pull/14364#discussion_r1934520683


##
datafusion/substrait/Cargo.toml:
##
@@ -46,6 +46,7 @@ url = { workspace = true }
 
 [dev-dependencies]
 datafusion = { workspace = true, features = ["nested_expressions"] }
+datafusion-catalog = { workspace = true }

Review Comment:
   I think since datafusion re-exports `datafusion_catalog` we should be able 
to avoid this dependency:
   
   
https://github.com/apache/datafusion/blob/9ca6c2557488c1c608182542c1be96889b64fe29/datafusion/core/src/lib.rs#L731-L730



##
datafusion/catalog/src/lib.rs:
##
@@ -18,23 +18,264 @@
 //! Interfaces and default implementations of catalogs and schemas.
 //!
 //! Implementations
+//! * Information schema: [`information_schema`]
 //! * Simple memory based catalog: [`MemoryCatalogProviderList`], 
[`MemoryCatalogProvider`], [`MemorySchemaProvider`]
 
 pub mod memory;
+pub use datafusion_sql::{ResolvedTableReference, TableReference};
 pub use memory::{
 MemoryCatalogProvider, MemoryCatalogProviderList, MemorySchemaProvider,
 };
+use std::collections::BTreeSet;
+use std::ops::ControlFlow;
 
 mod r#async;
 mod catalog;
 mod dynamic_file;
+pub mod information_schema;
 mod schema;
 mod session;
 mod table;
-
 pub use catalog::*;
 pub use dynamic_file::catalog::*;
 pub use r#async::*;
 pub use schema::*;
 pub use session::*;
 pub use table::*;
+pub mod streaming;
+
+/// Collects all tables and views referenced in the SQL statement. CTEs are 
collected separately.
+/// This can be used to determine which tables need to be in the catalog for a 
query to be planned.
+///
+/// # Returns
+///
+/// A `(table_refs, ctes)` tuple, the first element contains table and view 
references and the second
+/// element contains any CTE aliases that were defined and possibly referenced.
+///
+/// ## Example
+///
+/// ```
+/// # use datafusion_sql::parser::DFParser;
+/// # use datafusion_catalog::resolve_table_references;
+/// let query = "SELECT a FROM foo where x IN (SELECT y FROM bar)";
+/// let statement = DFParser::parse_sql(query).unwrap().pop_back().unwrap();
+/// let (table_refs, ctes) = resolve_table_references(&statement, 
true).unwrap();
+/// assert_eq!(table_refs.len(), 2);
+/// assert_eq!(table_refs[0].to_string(), "bar");
+/// assert_eq!(table_refs[1].to_string(), "foo");
+/// assert_eq!(ctes.len(), 0);
+/// ```
+///
+/// ## Example with CTEs  
+///  
+/// ```  
+/// # use datafusion_sql::parser::DFParser;  
+/// # use datafusion_catalog::resolve_table_references;
+/// let query = "with my_cte as (values (1), (2)) SELECT * from my_cte;";  
+/// let statement = DFParser::parse_sql(query).unwrap().pop_back().unwrap();  
+/// let (table_refs, ctes) = resolve_table_references(&statement, 
true).unwrap();  
+/// assert_eq!(table_refs.len(), 0);
+/// assert_eq!(ctes.len(), 1);  
+/// assert_eq!(ctes[0].to_string(), "my_cte");  
+/// ```
+pub fn resolve_table_references(
+statement: &datafusion_sql::parser::Statement,

Review Comment:
   I think this function belongs in `datafusion-sql`, as it is walking 
over the SQL parse tree.
   
   perhaps we can put it in 
`https://github.com/apache/datafusion/tree/main/datafusion/sql/src/resolve.rs` 
or something and then remove the dependency of `datafusion-catalog` on 
`datafusion-sql`
   
   I think we could do this in a follow on PR as well



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] move information_schema to datafusion-catalog [datafusion]

2025-01-29 Thread via GitHub


alamb commented on PR #14364:
URL: https://github.com/apache/datafusion/pull/14364#issuecomment-2622752446

   I am checking this one out


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Build time regression [datafusion]

2025-01-29 Thread via GitHub


alamb commented on issue #14256:
URL: https://github.com/apache/datafusion/issues/14256#issuecomment-2622826545

   > > After removing the WildcardOptions (by replacing it with an empty 
structure) I can see the build time drops. Removing the rule itself and the 
change in core doesn't help. It looks like the change to Expr is the root 
cause, though I don't know the reason...
   > 
   > I wonder if the issue is that `WildCardOptions` is a large struct that is 
not `Box`ed -- so it means that every `Expr` needs to be large enough to hold 
it which maybe increased the size of all `Expr`s 🤔
   
   🤔  it seems we have already Box'd it:
   
https://github.com/apache/datafusion/blob/2797cf7221e5492f14157e461d8066f5223650ce/datafusion/expr/src/expr.rs#L316-L315
   
   Let me see if I can find ways to make Expr smaller though
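   
   As a quick illustration of why Box'ing helps (a standalone sketch, not 
DataFusion code): an enum is as large as its largest variant, so one unboxed 
payload inflates every value of the type.
   
   ```rust
   use std::mem::size_of;
   
   #[allow(dead_code)]
   enum Unboxed { Big([u8; 128]), Small(u8) }
   #[allow(dead_code)]
   enum Boxed   { Big(Box<[u8; 128]>), Small(u8) }
   
   fn main() {
       // The unboxed enum must hold the 128-byte payload inline.
       assert!(size_of::<Unboxed>() >= 128);
       // Boxing moves the payload behind a pointer, shrinking every variant.
       assert!(size_of::<Boxed>() <= 16);
       println!("{} vs {}", size_of::<Unboxed>(), size_of::<Boxed>());
   }
   ```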


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] move information_schema to datafusion-catalog [datafusion]

2025-01-29 Thread via GitHub


logan-keede commented on code in PR #14364:
URL: https://github.com/apache/datafusion/pull/14364#discussion_r1934571539


##
datafusion/substrait/Cargo.toml:
##
@@ -46,6 +46,7 @@ url = { workspace = true }
 
 [dev-dependencies]
 datafusion = { workspace = true, features = ["nested_expressions"] }
+datafusion-catalog = { workspace = true }

Review Comment:
   I knew there was some chance of removing this dependency, but wasn't sure 
how to go about it. So, thanks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] move information_schema to datafusion-catalog [datafusion]

2025-01-29 Thread via GitHub


logan-keede commented on PR #14364:
URL: https://github.com/apache/datafusion/pull/14364#issuecomment-2622850087

   > If you are feeling like some more refactoring projects, any chance you are 
interested in working to split out the data sources (aka make 
`datafusion-datasource-parquet`, `datafusion-datasource-csv`, etc)?
   
   Well, I guess I know what I might be doing when I get back. I will 
definitely look into it. Please let me know if there has been any discussion 
(or any relevant resource) on it in the past, if that's not too much trouble.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Simple Functions [datafusion]

2025-01-29 Thread via GitHub


Omega359 commented on issue #12635:
URL: https://github.com/apache/datafusion/issues/12635#issuecomment-2622912627

   
https://github.com/apache/datafusion/blob/main/datafusion/functions-nested/src/string.rs
 is a good example of what can result from supporting many types and args.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Reduce size of `Expr` struct [datafusion]

2025-01-29 Thread via GitHub


findepi commented on code in PR #14366:
URL: https://github.com/apache/datafusion/pull/14366#discussion_r1934669162


##
datafusion/expr/src/expr.rs:
##
@@ -3067,4 +3078,19 @@ mod test {
 rename: opt_rename,
 }
 }
+
+#[test]
+fn test_size_of_expr() {
+// because Expr is such a widely used struct in DataFusion
+// it is important to keep its size as small as possible
+//
+// If this test fails when you change `Expr`, please try
+// `Box`ing the fields to make `Expr` smaller
+// See https://github.com/apache/datafusion/issues/14256 for details
+assert_eq!(size_of::<Expr>(), 112);
+assert_eq!(size_of::<ScalarValue>(), 64);
+assert_eq!(size_of::<DataType>(), 24); // 3 ptrs

Review Comment:
   does this test pass in debug and release modes?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Support marking columns as system columns via Field's metadata [datafusion]

2025-01-29 Thread via GitHub


chenkovsky commented on PR #14362:
URL: https://github.com/apache/datafusion/pull/14362#issuecomment-2623096983

   Something like this:
   
   ```rust
   #[tokio::test]
   async fn test_name_conflict() {
   let batch = record_batch!(
   ("_rowid", UInt32, [0, 1, 2]),
   ("_rowid", Utf8, ["file-0", "file-1", "file-2"])
   )
   .unwrap();
   let batch = batch
   .with_schema(Arc::new(Schema::new(vec![
   Field::new("_rowid", DataType::UInt32, true).to_system_column(),
   Field::new("_rowid", DataType::Utf8, true),
   ])))
   .unwrap();
   
   let ctx = SessionContext::new_with_config(
   SessionConfig::new().with_information_schema(true),
   );
   let _ = ctx.register_batch("test", batch);
   let select = "SELECT _rowid FROM test";
   let df = ctx.sql(select).await.unwrap();
    let batches = df.collect().await.unwrap();
   let expected = [
   "++",
   "| _rowid |",
   "++",
   "| file-0 |",
   "| file-1 |",
   "| file-2 |",
   "++",
   ];
    assert_batches_sorted_eq!(expected, &batches);
   }
   ```
   
   Currently a DuplicateQualifiedField error is thrown, but it's valid in Spark.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Support marking columns as system columns via Field's metadata [datafusion]

2025-01-29 Thread via GitHub


adriangb commented on PR #14362:
URL: https://github.com/apache/datafusion/pull/14362#issuecomment-2623148014

   I will still give a shot at adding that feature tomorrow. But I'm not sold 
on the behavior being ideal even if that's what Spark does. Besides if it's an 
error now we can always make it not error in the future.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: add expression array_size [datafusion-comet]

2025-01-29 Thread via GitHub


parthchandra commented on code in PR #1122:
URL: https://github.com/apache/datafusion-comet/pull/1122#discussion_r1934793909


##
native/spark-expr/src/list.rs:
##
@@ -708,6 +708,92 @@ impl PartialEq<dyn Any> for ArrayInsert {
 }
 }
 
+#[derive(Debug, Hash)]
+pub struct ArraySize {
+src_array_expr: Arc<dyn PhysicalExpr>,
+}
+
+impl ArraySize {
+pub fn new(src_array_expr: Arc<dyn PhysicalExpr>) -> Self {
+Self { src_array_expr }
+}
+}
+
+impl Display for ArraySize {
+fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
+write!(f, "ArraySize [array: {:?}]", self.src_array_expr)
+}
+}
+
+impl PartialEq<dyn Any> for ArraySize {
+fn eq(&self, other: &dyn Any) -> bool {
+down_cast_any_ref(other)
+.downcast_ref::<Self>()
+.map(|x| self.src_array_expr.eq(&x.src_array_expr))
+.unwrap_or(false)
+}
+}
+
+impl PhysicalExpr for ArraySize {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn data_type(&self, _input_schema: &Schema) -> DataFusionResult<DataType> {
+Ok(DataType::Int32)
+}
+
+fn nullable(&self, input_schema: &Schema) -> DataFusionResult<bool> {
+self.src_array_expr.nullable(input_schema)
+}
+
+fn evaluate(&self, batch: &RecordBatch) -> DataFusionResult<ColumnarValue> {
+let array_value = self
+.src_array_expr
+.evaluate(batch)?
+.into_array(batch.num_rows())?;
+match array_value.data_type() {
+DataType::List(_) => {
+let list_array = as_list_array(&array_value)?;
+let mut builder = Int32Array::builder(list_array.len());
+for i in 0..list_array.len() {
+if list_array.is_null(i) {
+builder.append_null();
+} else {
+builder.append_value(list_array.value_length(i));

Review Comment:
   Will `value_length(i)` include nulls in the array at index `i`?



##
native/spark-expr/src/list.rs:
##
@@ -708,6 +708,92 @@ impl PartialEq<dyn Any> for ArrayInsert {
 }
 }
 
+#[derive(Debug, Hash)]
+pub struct ArraySize {
+src_array_expr: Arc<dyn PhysicalExpr>,
+}
+
+impl ArraySize {
+pub fn new(src_array_expr: Arc<dyn PhysicalExpr>) -> Self {
+Self { src_array_expr }
+}
+}
+
+impl Display for ArraySize {
+fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
+write!(f, "ArraySize [array: {:?}]", self.src_array_expr)
+}
+}
+
+impl PartialEq<dyn Any> for ArraySize {
+fn eq(&self, other: &dyn Any) -> bool {
+down_cast_any_ref(other)
+.downcast_ref::<Self>()
+.map(|x| self.src_array_expr.eq(&x.src_array_expr))
+.unwrap_or(false)
+}
+}
+
+impl PhysicalExpr for ArraySize {
+fn as_any(&self) -> &dyn Any {
+self
+}
+
+fn data_type(&self, _input_schema: &Schema) -> DataFusionResult<DataType> {
+Ok(DataType::Int32)
+}
+
+fn nullable(&self, input_schema: &Schema) -> DataFusionResult<bool> {
+self.src_array_expr.nullable(input_schema)
+}
+
+fn evaluate(&self, batch: &RecordBatch) -> DataFusionResult<ColumnarValue> {
+let array_value = self
+.src_array_expr
+.evaluate(batch)?
+.into_array(batch.num_rows())?;
+match array_value.data_type() {
+DataType::List(_) => {
+let list_array = as_list_array(&array_value)?;
+let mut builder = Int32Array::builder(list_array.len());
+for i in 0..list_array.len() {
+if list_array.is_null(i) {

Review Comment:
   You might want to handle `legacySizeOfNull` here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Support marking columns as system columns via Field's metadata [datafusion]

2025-01-29 Thread via GitHub


chenkovsky commented on PR #14362:
URL: https://github.com/apache/datafusion/pull/14362#issuecomment-2623174593

   > Is this that important to support? The example seems a bit contrived, I 
think it'd be more reasonable if it occurred naturally as part of a join or 
something where a user could unexpectedly run into it. Otherwise it seems to me 
like something very spark specific.
   > 
   > Postgres for example doesn't let you duplicate system columns:
   > 
   > ```sql
   > create table t (ctid int);
   > ERROR:  column name "ctid" conflicts with a system column name
   > ```
   
   Because Spark is not a database but a compute engine. A database such as 
Postgres manages its schema and data by itself, so it's fine to forbid name 
conflicts between system columns and normal columns. But for Spark, the data 
comes from other systems, and it cannot guarantee the data doesn't contain 
such conflicting fields.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Implement xxhash algorithms as part of the expression API [datafusion]

2025-01-29 Thread via GitHub


Spaarsh commented on issue #14044:
URL: https://github.com/apache/datafusion/issues/14044#issuecomment-2623464462

   Thanks @HectorPascual! I'm opening PRs from here on for transparency's sake!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] 14044/enhancement/add xxhash algorithms in expression api [datafusion]

2025-01-29 Thread via GitHub


Spaarsh opened a new pull request, #14367:
URL: https://github.com/apache/datafusion/pull/14367

   ## Which issue does this PR close?
   
   
   
   Closes #14044.
   
   ## Rationale for this change
   Lack of xxhash (a fast, non-cryptographic hashing algorithm) functions in 
the expression API.
   
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   Yes.
   
   
   ## Are there any user-facing changes?
   Users can now use the xxhash32 and xxhash64 functions.
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Allow plain JOIN without turning it into INNER [datafusion-sqlparser-rs]

2025-01-29 Thread via GitHub


iffyio merged PR #1692:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1692


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org


