date:20250224

Re: [PR] Fixed Migrate Datetime functions to invoke_with_args Issue 14705 [datafusion]

2025-02-24 Thread via GitHub



varun-bhardwaj-sde commented on PR #14792:
URL: https://github.com/apache/datafusion/pull/14792#issuecomment-2680505232

   it was closed by misktake i raised another PR 
https://github.com/apache/datafusion/pull/14864 for this task


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Fixed Migrate Datetime functions to invoke_with_args Issue 14705 [datafusion]

2025-02-24 Thread via GitHub



varun-bhardwaj-sde commented on PR #14864:
URL: https://github.com/apache/datafusion/pull/14864#issuecomment-2680507929

   @niebayes can you please review this once and guide me if I am doing any 
mistake here


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[PR] Fixed Migrate Datetime functions to invoke_with_args Issue 14705 [datafusion]

2025-02-24 Thread via GitHub



varun-bhardwaj-sde opened a new pull request, #14864:
URL: https://github.com/apache/datafusion/pull/14864

   ## Which issue does this PR close?
   
   
   
   - Closes #14705.
   
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] [Discussion] Efficient Row Selection for Multi-Engine Support [datafusion]

2025-02-24 Thread via GitHub



bharath-techie commented on issue #14816:
URL: https://github.com/apache/datafusion/issues/14816#issuecomment-2680694026

   Thanks @chenkovsky for confirming. 
   
   We are new to datafusion , but at high level looks like this feature will 
need a deeper integration in the ParquetExec flow and we might need changes in 
`ParquetRecordBatchStream` in `arrow-rs` as it performs pruning and at 
datafusion layer we might not be able to figure out the actual row ids because 
of it.
   
   Experts can comment on this / see if there are any other ways that they can 
think of. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Examples: boundary analysis example for `AND/OR` conjunctions [datafusion]

2025-02-24 Thread via GitHub



ozankabak commented on PR #14735:
URL: https://github.com/apache/datafusion/pull/14735#issuecomment-2680697590

   Indeed we will open an epic to plan the remaining statistics infrastructure 
work. I plan to put something together today/tomorrow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[PR] replace TypeSignature::String with TypeSignature::Coercible for trim functions [datafusion]

2025-02-24 Thread via GitHub



zjregee opened a new pull request, #14865:
URL: https://github.com/apache/datafusion/pull/14865

   ## Which issue does this PR close?
   - Part of #14759.
   ## Rationale for this change
   ## What changes are included in this PR?
   ## Are these changes tested?
   Yes.
   ## Are there any user-facing changes?
   None.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] replace TypeSignature::String with TypeSignature::Coercible for trim functions [datafusion]

2025-02-24 Thread via GitHub



zjregee commented on PR #14865:
URL: https://github.com/apache/datafusion/pull/14865#issuecomment-2680958642

   I think this PR is ready for review now. cc: @jayzhan211 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] replace TypeSignature::String with TypeSignature::Coercible for starts_with [datafusion]

2025-02-24 Thread via GitHub



zjregee commented on PR #14812:
URL: https://github.com/apache/datafusion/pull/14812#issuecomment-2680962268

   I think this PR is ready for review now. cc: @jayzhan211 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Add support column prefix index for MySQL [datafusion-sqlparser-rs]

2025-02-24 Thread via GitHub



zzzdong closed pull request #1732: Add support column prefix index for MySQL
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1732


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Add support column prefix index for MySQL [datafusion-sqlparser-rs]

2025-02-24 Thread via GitHub



zzzdong commented on PR #1732:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1732#issuecomment-2680992482

   Considering the work on the index parsing in #1707, this will be closed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] fix: graceful NULL and type error handling in array functions [datafusion]

2025-02-24 Thread via GitHub



jayzhan211 commented on code in PR #14737:
URL: https://github.com/apache/datafusion/pull/14737#discussion_r1969168798


##
datafusion/functions-nested/src/sort.rs:
##
@@ -143,6 +168,10 @@ pub fn array_sort_inner(args: &[ArrayRef]) -> 
Result {
 return exec_err!("array_sort expects one to three arguments");
 }
 
+if args.iter().any(|array| array.logical_null_count() > 0) {

Review Comment:
   ```
   D select array_sort([1,2,3], null);
   ┌┐
   │ array_sort(main.list_value(1, 2, 3), NULL) │
   │   int32│
   ├┤
   ││
   └┘
   ```
   
   We need to check 2nd arg



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Examples: boundary analysis example for `AND/OR` conjunctions [datafusion]

2025-02-24 Thread via GitHub



clflushopt commented on PR #14735:
URL: https://github.com/apache/datafusion/pull/14735#issuecomment-2680365175

   @alamb any suggestions on what to improve here vis-a-vis documentation or 
the example since #14699 has been merged ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Migrate Datetime functions to `invoke_with_args` [datafusion]

2025-02-24 Thread via GitHub



varun-bhardwaj-sde commented on issue #14705:
URL: https://github.com/apache/datafusion/issues/14705#issuecomment-2680482983

   take


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Examples: boundary analysis example for `AND/OR` conjunctions [datafusion]

2025-02-24 Thread via GitHub



clflushopt commented on PR #14735:
URL: https://github.com/apache/datafusion/pull/14735#issuecomment-2680374034

   @alamb @ozankabak do we plan to have open issues for the follow up changes 
described in #14699 ? I am specifically trying to figure out whether we should 
address those items as part of #3929 ?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] ExternalSorter Fails to Spill Dictionaries [datafusion]

2025-02-24 Thread via GitHub



davidhewitt commented on issue #4658:
URL: https://github.com/apache/datafusion/issues/4658#issuecomment-2680773931

   Ok great, I'll work on that today 👍


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] [Discussion] Efficient Row Selection for Multi-Engine Support [datafusion]

2025-02-24 Thread via GitHub



Arpit-Bandejiya commented on issue #14816:
URL: https://github.com/apache/datafusion/issues/14816#issuecomment-2680785661

   Found this PR in arrow-rs : https://github.com/apache/arrow-rs/pull/6624 . 
@XiangpengHao I see the PR is in draft from sometime. Is there any other we are 
trying to do it? Can you please share it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Store spans for Value expressions [datafusion-sqlparser-rs]

2025-02-24 Thread via GitHub



iffyio commented on code in PR #1738:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1738#discussion_r1969034324


##
tests/sqlparser_bigquery.rs:
##
@@ -29,19 +29,19 @@ use test_utils::*;
 #[test]
 fn parse_literal_string() {
 let sql = concat!(
-"SELECT ",
-"'single', ",
-r#""double", "#,
-"'''triple-single''', ",
-r#triple-double""", "#,
-r#"'single\'escaped', "#,
-r#"'''triple-single\'escaped''', "#,
-r#"'''triple-single'unescaped''', "#,
-r#""double\"escaped", "#,
-r#triple-double\"escaped""", "#,
-r#triple-double"unescaped""", "#,
-r#triple-double'unescaped""", "#,
-r#"'''triple-single"unescaped'''"#,
+"SELECT ",// line 1, column 1
+"'single', ", // line 1, column 7
+r#""double", "#,  // line 1, column 14
+"'''triple-single''', ",  // line 1, column 22
+r#triple-double""", "#,   // line 1, column 33
+r#"'single\'escaped', "#, // line 1, column 43
+r#"'''triple-single\'escaped''', "#,  // line 1, column 55
+r#"'''triple-single'unescaped''', "#, // line 1, column 68
+r#""double\"escaped", "#, // line 1, column 83
+r#triple-double\"escaped""", "#,  // line 1, column 92
+r#triple-double"unescaped""", "#, // line 1, column 105
+r#triple-double'unescaped""", "#, // line 1, column 118
+r#"'''triple-single"unescaped'''"#,   // line 1, column 131

Review Comment:
   do the added comments here need cleanup or do they serve a purpose?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Store spans for Value expressions [datafusion-sqlparser-rs]

2025-02-24 Thread via GitHub



iffyio merged PR #1738:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1738


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] fix: graceful NULL and type error handling in array functions [datafusion]

2025-02-24 Thread via GitHub



alan910127 commented on code in PR #14737:
URL: https://github.com/apache/datafusion/pull/14737#discussion_r1969021807


##
datafusion/functions-nested/src/sort.rs:
##
@@ -143,6 +168,10 @@ pub fn array_sort_inner(args: &[ArrayRef]) -> 
Result {
 return exec_err!("array_sort expects one to three arguments");
 }
 
+if args.iter().any(|array| array.logical_null_count() > 0) {

Review Comment:
   I think we still need to check for `args[1..]` for null `desc` or 
`nulls_first`? but to be align with the sort options, I think checking 
`is_null(0)` should be enough.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] feat: adjust create and drop trigger for mysql dialect [datafusion-sqlparser-rs]

2025-02-24 Thread via GitHub



iffyio merged PR #1734:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1734


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Add support column prefix index for MySQL [datafusion-sqlparser-rs]

2025-02-24 Thread via GitHub



iffyio commented on code in PR #1732:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1732#discussion_r1969053806


##
src/parser/mod.rs:
##
@@ -7646,6 +7646,39 @@ impl<'a> Parser<'a> {
 }
 }
 
+pub fn parse_index_exprs(&mut self) -> Result, ParserError> 
{
+self.parse_parenthesized(|p| 
p.parse_comma_separated(Parser::parse_index_expr))
+}
+
+pub fn parse_index_expr(&mut self) -> Result {
+let expr = self.parse_expr()?;
+let collation = if self.parse_keyword(Keyword::COLLATE) {
+Some(self.parse_object_name(false)?)
+} else {
+None
+};
+
+let (operator_class, order_options) = if 
self.peek_keyword(Keyword::ASC)
+|| self.peek_keyword(Keyword::DESC)
+|| self.peek_keyword(Keyword::NULLS)
+{
+let order_options = self.parse_order_by_options()?;
+(None, order_options)
+} else {
+let operator_class = self.maybe_parse(|p| p.parse_expr())?;

Review Comment:
   hmm I couldn't see `operator_class` being used in the introduced tests or 
documented so its unclear what the syntax is?



##
src/parser/mod.rs:
##
@@ -7646,6 +7646,39 @@ impl<'a> Parser<'a> {
 }
 }
 
+pub fn parse_index_exprs(&mut self) -> Result, ParserError> 
{
+self.parse_parenthesized(|p| 
p.parse_comma_separated(Parser::parse_index_expr))
+}
+
+pub fn parse_index_expr(&mut self) -> Result {

Review Comment:
   Since this is flagged as a public function can we add a short description 
mentioning what it does?



##
src/parser/mod.rs:
##
@@ -7646,6 +7646,39 @@ impl<'a> Parser<'a> {
 }
 }
 
+pub fn parse_index_exprs(&mut self) -> Result, ParserError> 
{
+self.parse_parenthesized(|p| 
p.parse_comma_separated(Parser::parse_index_expr))
+}
+
+pub fn parse_index_expr(&mut self) -> Result {
+let expr = self.parse_expr()?;
+let collation = if self.parse_keyword(Keyword::COLLATE) {
+Some(self.parse_object_name(false)?)
+} else {
+None
+};
+
+let (operator_class, order_options) = if 
self.peek_keyword(Keyword::ASC)
+|| self.peek_keyword(Keyword::DESC)
+|| self.peek_keyword(Keyword::NULLS)

Review Comment:
   can we introduce a test case that covers the `ASC/DESC` and `nulls_first` 
syntax?



##
src/parser/mod.rs:
##
@@ -14688,6 +14721,8 @@ impl Word {
 
 #[cfg(test)]
 mod tests {
+use std::vec;
+

Review Comment:
   hmm is this needed?



##
src/parser/mod.rs:
##
@@ -7646,6 +7646,39 @@ impl<'a> Parser<'a> {
 }
 }
 
+pub fn parse_index_exprs(&mut self) -> Result, ParserError> 
{
+self.parse_parenthesized(|p| 
p.parse_comma_separated(Parser::parse_index_expr))
+}
+
+pub fn parse_index_expr(&mut self) -> Result {
+let expr = self.parse_expr()?;
+let collation = if self.parse_keyword(Keyword::COLLATE) {
+Some(self.parse_object_name(false)?)
+} else {
+None
+};
+
+let (operator_class, order_options) = if 
self.peek_keyword(Keyword::ASC)
+|| self.peek_keyword(Keyword::DESC)
+|| self.peek_keyword(Keyword::NULLS)

Review Comment:
   Instead of using `peek`, we can probably use `self.maybe_parse()` so that we 
don't need to maintain this list of keywords in sync with the 
`parse_order_by_options` function e.g.
   ```rust
   let (operator_class, order_options) = if let Some(order_options ) = 
self.maybe_parse(|p| p.parse_order_by_options())? {
 (None, order_options)
   } else {
 let operator_class = self.maybe_parse(|p| p.parse_expr())?;
 let order_options = self.parse_order_by_options()?;
 (operator_class, order_options)
   };
   ```
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Parse signed/unsigned integer data type in MySQL CAST [datafusion-sqlparser-rs]

2025-02-24 Thread via GitHub



iffyio commented on code in PR #1739:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1739#discussion_r1969077053


##
src/ast/data_type.rs:
##
@@ -238,6 +238,26 @@ pub enum DataType {
 UnsignedBigInt(Option),
 /// Unsigned Int8 with optional display width e.g. INT8 UNSIGNED or 
INT8(11) UNSIGNED
 UnsignedInt8(Option),
+/// Signed integer as used in [MySQL CAST] target types, with optional 
`INTEGER` suffix:
+/// `SIGNED [INTEGER]`
+///
+/// Note that this doesn't accept a display width and is reversed from the 
syntax used in column
+/// definitions ([`DataType::Int`]): `INTEGER [SIGNED]`
+///
+/// Semantically equivalent to `BIGINT`.
+///
+/// [MySQL CAST]: 
https://dev.mysql.com/doc/refman/8.4/en/cast-functions.html
+Signed(bool),

Review Comment:
   thinking it feels a unexpected to use `Signed(true)` to represent `SIGNED 
INTEGER`? and the repr assumes  a mysql-only type which might not be the case 
going forward. Feels like it'd be clear to represent them explicitly as 
`SIGNED` and `SIGNED INTEGER` where both can be reused by other dialects later 
if needed?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Enable Dataframe to be converted into views which can be used in register_table [datafusion-python]

2025-02-24 Thread via GitHub



kosiew commented on code in PR #1016:
URL: 
https://github.com/apache/datafusion-python/pull/1016#discussion_r1969081210


##
src/dataframe.rs:
##
@@ -156,6 +174,22 @@ impl PyDataFrame {
 PyArrowType(self.df.schema().into())
 }
 
+/// Convert this DataFrame into a Table that can be used in register_table
+/// By convention, into_... methods consume self and return the new object.
+/// Disabling the clippy lint, so we can use &self
+/// because we're working with Python bindings
+/// where objects are shared
+#[allow(clippy::wrong_self_convention)]
+fn into_view(&self) -> PyDataFusionResult {
+// Call the underlying Rust DataFrame::into_view method.
+// Note that the Rust method consumes self; here we clone the inner 
Arc
+// so that we don’t invalidate this PyDataFrame.
+let table_provider = self.df.as_ref().clone().into_view();
+let table_provider = PyTableProvider::new(table_provider);

Review Comment:
   
   ```suggestion
   let table_provider = PyTableProvider::new(table_provider);
   ```



##
src/dataframe.rs:
##
@@ -156,6 +174,22 @@ impl PyDataFrame {
 PyArrowType(self.df.schema().into())
 }
 
+/// Convert this DataFrame into a Table that can be used in register_table
+/// By convention, into_... methods consume self and return the new object.
+/// Disabling the clippy lint, so we can use &self
+/// because we're working with Python bindings
+/// where objects are shared
+#[allow(clippy::wrong_self_convention)]
+fn into_view(&self) -> PyDataFusionResult {
+// Call the underlying Rust DataFrame::into_view method.
+// Note that the Rust method consumes self; here we clone the inner 
Arc
+// so that we don’t invalidate this PyDataFrame.
+let table_provider = self.df.as_ref().clone().into_view();
+let table_provider = PyTableProvider::new(table_provider);

Review Comment:
   
   ```suggestion
   let table_provider = PyTableProvider::new(table_provider);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] ExternalSorter Fails to Spill Dictionaries [datafusion]

2025-02-24 Thread via GitHub



tustvold commented on issue #4658:
URL: https://github.com/apache/datafusion/issues/4658#issuecomment-2679171994

   > I assume by row format you mean 
[arrow-row](https://arrow.apache.org/rust/arrow_row/index.html), however it's 
not clear to me if there's a standard way to serialize these to a file. I could 
create something simple which just writes the streams of rows as binary 
(probably a length then the bytes, repeated?).
   
   I suspect this is what I was going for, although it was 2 years ago so can't 
confess to really remembering and a lot has likely changed since then.
   
   > It looks like an alternative to using the row format in datafusion might 
be to support delta-encoded dictionaries in arrow-rs. 
https://github.com/apache/arrow-rs/issues/6783
   
   This is likely an approach, although it would involve re-encoding the 
dictionaries as there is no mechanism to remap an existing key.
   
   IMO I'd be tempted to start a discussion on the mailing list about why this 
constraint exists, it seems somewhat nonsensical to me given that the footer 
identifies where the dictionary blocks are, and therefore it is trivial to 
determine which dictionary to use. If anything the support for delta 
dictionaries is more surprising...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] chore: Re-organize shuffle writer code [datafusion-comet]

2025-02-24 Thread via GitHub



mbutrovich commented on PR #1439:
URL: 
https://github.com/apache/datafusion-comet/pull/1439#issuecomment-2679374261

   This is a great help as we look to improve the shuffle performance. Thanks 
@andygrove!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] chore: migrate invoke_batch to invoke_with_args for unicode function [datafusion]

2025-02-24 Thread via GitHub



alamb commented on PR #14856:
URL: https://github.com/apache/datafusion/pull/14856#issuecomment-2679749126

   🚀 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Create an MSRV policy in this crate [datafusion-sqlparser-rs]

2025-02-24 Thread via GitHub



mvzink commented on issue #1744:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/issues/1744#issuecomment-2679756348

   From looking into #1612, some test code uses associated type bounds 
([stabilized in 1.79](https://github.com/rust-lang/rust/pull/122055/)). It 
could be rewritten if an earlier MSRV is desired (1.75 worked without it). 
That's moot if following a more progressive policy like DataFusion though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Add `range` table function [datafusion]

2025-02-24 Thread via GitHub



alamb commented on PR #14830:
URL: https://github.com/apache/datafusion/pull/14830#issuecomment-2679748456

   🚀 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] chore: migrate invoke_batch to invoke_with_args for unicode function [datafusion]

2025-02-24 Thread via GitHub



alamb merged PR #14856:
URL: https://github.com/apache/datafusion/pull/14856


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Add `range` table function [datafusion]

2025-02-24 Thread via GitHub



alamb merged PR #14830:
URL: https://github.com/apache/datafusion/pull/14830


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Migrate Unicode function to `invoke_with_args` [datafusion]

2025-02-24 Thread via GitHub



alamb closed issue #14709: Migrate Unicode function to `invoke_with_args`
URL: https://github.com/apache/datafusion/issues/14709


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Refactor SortPushdown using the standard top-down visitor. [datafusion]

2025-02-24 Thread via GitHub



alamb commented on code in PR #14821:
URL: https://github.com/apache/datafusion/pull/14821#discussion_r1968480652


##
datafusion/physical-optimizer/src/enforce_sorting/sort_pushdown.rs:
##
@@ -98,66 +98,91 @@ fn pushdown_sorts_helper(
 .ordering_satisfy_requirement(&parent_reqs);
 
 if is_sort(plan) {
-let sort_fetch = plan.fetch();
-let required_ordering = plan
+let current_sort_fetch = plan.fetch();
+let parent_req_fetch = requirements.data.fetch;
+
+let child_reqs = plan

Review Comment:
   I found the overloading of "requirements" confusing here -- there are 
`LexRequirements` and `requirements: SortPushdown`
   
   Perhaps  we  could rename `requirements` to `sort_pushdown` to mirror the 
argument



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Refactor SortPushdown using the standard top-down visitor. [datafusion]

2025-02-24 Thread via GitHub



alamb commented on code in PR #14821:
URL: https://github.com/apache/datafusion/pull/14821#discussion_r1968482276


##
datafusion/physical-optimizer/src/enforce_sorting/sort_pushdown.rs:
##
@@ -98,66 +98,91 @@ fn pushdown_sorts_helper(
 .ordering_satisfy_requirement(&parent_reqs);
 
 if is_sort(plan) {
-let sort_fetch = plan.fetch();
-let required_ordering = plan
+let current_sort_fetch = plan.fetch();
+let parent_req_fetch = requirements.data.fetch;
+
+let child_reqs = plan
 .output_ordering()
 .cloned()
 .map(LexRequirement::from)
 .unwrap_or_default();
-if !satisfy_parent {
-// Make sure this `SortExec` satisfies parent requirements:
-let sort_reqs = 
requirements.data.ordering_requirement.unwrap_or_default();
-// It's possible current plan (`SortExec`) has a fetch value.
-// And if both of them have fetch values, we should use the 
minimum one.
-if let Some(fetch) = sort_fetch {
-if let Some(requirement_fetch) = requirements.data.fetch {
-requirements.data.fetch = 
Some(fetch.min(requirement_fetch));
-}
-}
-let fetch = requirements.data.fetch.or(sort_fetch);
+let parent_is_stricter = plan
+.equivalence_properties()
+.requirements_compatible(&parent_reqs, &child_reqs);

Review Comment:
   I don't see where these `requirements_compatible` calls come from. Is there 
an existing location?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] DeltaLake integration not working (Python) [datafusion]

2025-02-24 Thread via GitHub



mag1cfrog commented on issue #14842:
URL: https://github.com/apache/datafusion/issues/14842#issuecomment-2679777967

   not sure really relevant, but I'm also having some compilation issue when 
trying to use deltalake with datafusion in Rust.
   
   Here's what I get:
   ```bash
  Compiling datafusion v44.0.0
  Compiling datafusion-proto v44.0.0
  Compiling deltalake-core v0.24.0
   error[E0308]: mismatched types
  --> 
/home/hanbo/.cargo/registry/src/index.crates.io-6f17d22bba15001f/deltalake-core-0.24.0/src/kernel/models/actions.rs:567:9
   |
   566 | fn try_from(value: &TableFeatures) -> Result {
   |   - 
expected `Result` because of return type
   567 | ReaderFeatures::try_from(value.as_ref())
   |  expected 
`strum::ParseError`, found a different `strum::ParseError`
   |
   = note: `strum::ParseError` and `strum::ParseError` have similar names, 
but are actually distinct types
   note: `strum::ParseError` is defined in crate `strum`
  --> 
/home/hanbo/.cargo/registry/src/index.crates.io-6f17d22bba15001f/strum-0.26.3/src/lib.rs:42:1
   |
   42  | pub enum ParseError {
   | ^^^
   note: `strum::ParseError` is defined in crate `strum`
  --> 
/home/hanbo/.cargo/git/checkouts/strum-222d21164b266718/cc240c3/strum/src/lib.rs:42:1
   |
   42  | pub enum ParseError {
   | ^^^
   = note: perhaps two different versions of crate `strum` are being used?
   
   error[E0308]: mismatched types
  --> 
/home/hanbo/.cargo/registry/src/index.crates.io-6f17d22bba15001f/deltalake-core-0.24.0/src/kernel/models/actions.rs:575:9
   |
   574 | fn try_from(value: &TableFeatures) -> Result {
   |   - 
expected `Result` because of return type
   575 | WriterFeatures::try_from(value.as_ref())
   |  expected 
`strum::ParseError`, found a different `strum::ParseError`
   |
   = note: `strum::ParseError` and `strum::ParseError` have similar names, 
but are actually distinct types
   note: `strum::ParseError` is defined in crate `strum`
  --> 
/home/hanbo/.cargo/registry/src/index.crates.io-6f17d22bba15001f/strum-0.26.3/src/lib.rs:42:1
   |
   42  | pub enum ParseError {
   | ^^^
   note: `strum::ParseError` is defined in crate `strum`
  --> 
/home/hanbo/.cargo/git/checkouts/strum-222d21164b266718/cc240c3/strum/src/lib.rs:42:1
   |
   42  | pub enum ParseError {
   | ^^^
   = note: perhaps two different versions of crate `strum` are being used?
   
   For more information about this error, try `rustc --explain E0308`.
   error: could not compile `deltalake-core` (lib) due to 2 previous errors
   
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Remove the need for registering an ObjectStore for remote files [datafusion-python]

2025-02-24 Thread via GitHub



robtandy commented on issue #899:
URL: 
https://github.com/apache/datafusion-python/issues/899#issuecomment-2679361030

   @kylebarron What are your thoughts on how to approach this?  I'm happy to 
try to address it and submit a PR for it.
   
   I'd like the same thing for github.com/apache/datafusion-ray, as after the 
rewrite, I need to add support for object stores back in and I'd love it to be 
consistent with how it will work in datafusion-python.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] chore: Strip debuginfo symbols for release [datafusion]

2025-02-24 Thread via GitHub



comphead commented on code in PR #14843:
URL: https://github.com/apache/datafusion/pull/14843#discussion_r1968211931


##
Cargo.toml:
##
@@ -159,19 +159,20 @@ url = "2.5.4"
 [profile.release]
 codegen-units = 1
 lto = true
+debug = false

Review Comment:
   Agree



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Parse signed/unsigned integer data types correctly in MySQL CAST [datafusion-sqlparser-rs]

2025-02-24 Thread via GitHub



mvzink commented on code in PR #1739:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1739#discussion_r1968070815


##
src/ast/mod.rs:
##
@@ -798,9 +798,15 @@ pub enum Expr {
 kind: CastKind,
 expr: Box,
 data_type: DataType,
-// Optional CAST(string_expression AS type FORMAT 
format_string_expression) as used by BigQuery
-// 
https://cloud.google.com/bigquery/docs/reference/standard-sql/format-elements#formatting_syntax
+/// Optional CAST(string_expression AS type FORMAT 
format_string_expression) as used by [BigQuery]
+///
+/// [BigQuery]: 
https://cloud.google.com/bigquery/docs/reference/standard-sql/format-elements#formatting_syntax
 format: Option,
+/// Whether this was parsed as a [MySQL]-style cast, which has a 
different syntax for

Review Comment:
   I simplified this patch to take that approach and not introduce the error 
path for MySQL (leaving existing permissive parsing in place, since I noticed 
that's already the approach taken with other datatypes and dialects).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] chore: Strip debuginfo symbols for release [datafusion]

2025-02-24 Thread via GitHub



comphead commented on PR #14843:
URL: https://github.com/apache/datafusion/pull/14843#issuecomment-2679541024

   Hey @rkrishn7 this is a good call yes, although DF does not catch panics, we 
cannot say the for the dependencies. I'll experiment later with panic mode for 
DF crate level only, leaving other dependencies compile flags intact.
   
   But for this PR it is safer to remove abort on panics


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[PR] Improve performance of join building performance (up to 1.28x speedup) [datafusion]

2025-02-24 Thread via GitHub



Dandandan opened a new pull request, #14861:
URL: https://github.com/apache/datafusion/pull/14861

   ## Which issue does this PR close?
   
   
   
   - Closes #14859
   
   ## Rationale for this change
   
   
   
   Performance improvements for join (up to 1.28x)
   
   
   
   ```
   
   Benchmark tpch_mem_sf10.json
   
   ┏━━┳━━━┳━┳━━━┓
   ┃ Query┃  main ┃ join_vectorization3 ┃Change ┃
   ┡━━╇━━━╇━╇━━━┩
   │ QQuery 1 │  770.46ms │796.52ms │ no change │
   │ QQuery 2 │  125.30ms │106.59ms │ +1.18x faster │
   │ QQuery 3 │  259.14ms │220.22ms │ +1.18x faster │
   │ QQuery 4 │  134.04ms │122.23ms │ +1.10x faster │
   │ QQuery 5 │  525.49ms │456.57ms │ +1.15x faster │
   │ QQuery 6 │   45.93ms │ 42.05ms │ +1.09x faster │
   │ QQuery 7 │ 1079.09ms │905.24ms │ +1.19x faster │
   │ QQuery 8 │  336.17ms │296.16ms │ +1.14x faster │
   │ QQuery 9 │  881.10ms │687.21ms │ +1.28x faster │
   │ QQuery 10│  406.44ms │353.14ms │ +1.15x faster │
   │ QQuery 11│   94.73ms │ 82.00ms │ +1.16x faster │
   │ QQuery 12│  280.55ms │250.40ms │ +1.12x faster │
   │ QQuery 13│  292.29ms │256.12ms │ +1.14x faster │
   │ QQuery 14│   47.03ms │ 43.25ms │ +1.09x faster │
   │ QQuery 15│  137.05ms │117.04ms │ +1.17x faster │
   │ QQuery 16│   94.72ms │ 82.67ms │ +1.15x faster │
   │ QQuery 17│  910.07ms │717.32ms │ +1.27x faster │
   │ QQuery 18│ 3759.48ms │   3025.62ms │ +1.24x faster │
   │ QQuery 19│  183.15ms │167.47ms │ +1.09x faster │
   │ QQuery 20│  242.99ms │195.70ms │ +1.24x faster │
   │ QQuery 21│ 1582.68ms │   1292.53ms │ +1.22x faster │
   │ QQuery 22│   95.75ms │ 97.61ms │ no change │
   └──┴───┴─┴───┘
   ┏┳┓
   ┃ Benchmark Summary  ┃┃
   ┡╇┩
   │ Total Time (main)  │ 12283.64ms │
   │ Total Time (join_vectorization3)   │ 10313.65ms │
   │ Average Time (main)│   558.35ms │
   │ Average Time (join_vectorization3) │   468.80ms │
   │ Queries Faster │ 20 │
   │ Queries Slower │  0 │
   │ Queries with No Change │  2 │
   └┴┘
   Note: Skipping 
/Users/danielheres/Code/arrow-datafusion-personal/benchmarks/results/main/tpch_sf1.json
 as 
/Users/danielheres/Code/arrow-datafusion-personal/benchmarks/results/join_vectorization3/tpch_sf1.json
 does not exist
   
   Benchmark tpch_sf10.json
   
   ┏━━┳━━━┳━┳━━━┓
   ┃ Query┃  main ┃ join_vectorization3 ┃Change ┃
   ┡━━╇━━━╇━╇━━━┩
   │ QQuery 1 │  981.31ms │946.03ms │ no change │
   │ QQuery 2 │  157.88ms │144.81ms │ +1.09x faster │
   │ QQuery 3 │  475.15ms │409.54ms │ +1.16x faster │
   │ QQuery 4 │  240.46ms │219.94ms │ +1.09x faster │
   │ QQuery 5 │  700.03ms │604.97ms │ +1.16x faster │
   │ QQuery 6 │  157.46ms │149.52ms │ +1.05x faster │
   │ QQuery 7 │ 1072.19ms │881.00ms │ +1.22x faster │
   │ QQuery 8 │  740.30ms │663.77ms │ +1.12x faster │
   │ QQuery 9 │ 1189.08ms │984.28ms │ +1.21x faster │
   │ QQuery 10│  666.24ms │605.10ms │ +1.10x faster │
   │ QQuery 11│  101.91ms │ 98.58ms │ no change │
   │ QQuery 12│  338.89ms │317.19ms │ +1.07x faster │
   │ QQuery 13│  475.00ms │418.44ms │ +1.14x faster │
   │ QQuery 14│  266.15ms │245.19ms │ +1.09x faster │
   │ QQuery 15│  442.98ms │403.32ms │ +1.10x faster │
   │ QQuery 16│  111.29ms │107.48ms │ no change │
   │ QQuery 17│ 1249.82ms │   1031.22ms │ +1.21x faster │
   │ QQuery 18│ 2052.52ms │   1672.16ms │ +1.23x faster │
   │ QQuery 19│  464.19ms │437.39ms │ +1.06x faster │
   │ QQuery 20│  470.89ms │391.63ms │ +1.20x faster │
   │ QQuery 21│ 1637.42ms │   1390.00ms │ +1.18x faster │
   │ QQuery 22│  156.41ms │148.43ms │ +1.05x faster │
   └──┴───┴─┴───┘
   ┏┳┓
   ┃ Benchmark Summary

Re: [PR] fix(substrait): Do not add implicit groupBy expressions when building logical plans from Substrait [datafusion]

2025-02-24 Thread via GitHub



anlinc commented on PR #14553:
URL: https://github.com/apache/datafusion/pull/14553#issuecomment-2679664474

   @Blizzara @alamb I am closing this in favor of the latest iteration here: 
https://github.com/apache/datafusion/pull/14860, which addresses the 
discussions in this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] fix(substrait): Do not add implicit groupBy expressions when building logical plans from Substrait [datafusion]

2025-02-24 Thread via GitHub



anlinc closed pull request #14553: fix(substrait): Do not add implicit groupBy 
expressions when building logical plans from Substrait
URL: https://github.com/apache/datafusion/pull/14553


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Support User-Defined Sorting [datafusion]

2025-02-24 Thread via GitHub



tobixdev commented on issue #14828:
URL: https://github.com/apache/datafusion/issues/14828#issuecomment-2679679211

   So i dug a little deeper on how we could implement this functionality. 
   
   ### arrow-rs
   
   Firstly, I think we require two "flavors" of sorting - one for `arrow-ord` 
and one for `arrow-row` as DataFusion uses both of these APIs. 
   
   AFAIK, in `arrow-rs` we don't have a "user defined type" registry that we 
could leverage. So when a user calls, for example, `lexsort`, there is also no 
way to "lookup" whether we have a user defined type that can be applied to a 
particular column. Therefore, we must pass in the information on how to sort a 
column to the sorting procedure. One possible way for getting this information 
into the called functions is to extend `SortField` (`arrow-row`) and 
`SortColumn` (`arrow-ord`).
   
   I think this extension could happen in one of two ways:
   1. *Provide an Implementation*: In this approach users directly provide an 
implementation. In `arrow-row` this could be a custom byte encoder, in 
`arrow-ord` this could be a function that maps to a `Ordering`.
   2. *Provide a User Defined Type*: In this approach users provide a user 
defined type. Somehow we would then need the ability to deduce the 
implementations from 1. based on the given type.
   
   Maybe one way to go forward is implement 1. and then implement 2. if we 
think this is sensible. I am a bit unsure on how we can directly use the arrow 
extension types for this use case. If anyone has a better idea here, I'd love 
to hear your take on that (@mbrobbel maybe you have an opinion if this is 
within the scope of arrow extension types). 
   
   ### DataFusion
   
   Here I think we should go with user defined types and attach the sorting 
information to that type. This should be possible as we can have a "central 
registry" (e.g., `SessionContext`) that holds all available user defined types 
and we can look up whether this particular column has a user defined type with 
a user defined ordering. If we have a use case, we could also think of adding 
the ability to override this ordering behavior. 
   
   I don't have more details on how we could implement this. I think trying to 
implement 1. in arrow-rs is the first step, and then we should check on how we 
can connect that to user defined types in DataFusion.
   
   If this sounds good, I can start to work on a draft. Maybe lmk if you think 
that extending `SortField` and `SortColumn` is a reasonable approach here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Store spans for Value expressions [datafusion-sqlparser-rs]

2025-02-24 Thread via GitHub



alamb commented on code in PR #1738:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1738#discussion_r1968440688


##
src/ast/mod.rs:
##
@@ -8789,9 +8796,9 @@ mod tests {
 #[test]
 fn test_interval_display() {
 let interval = Expr::Interval(Interval {
-value: Box::new(Expr::Value(Value::SingleQuotedString(String::from(
-"123:45.67",
-,
+value: Box::new(Expr::Value(

Review Comment:
   minor stuff we could use the `Expr::value` here too



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[I] Create an MSRV policy in this crate [datafusion-sqlparser-rs]

2025-02-24 Thread via GitHub



alamb opened a new issue, #1744:
URL: https://github.com/apache/datafusion-sqlparser-rs/issues/1744

   TLDR woudl be:
   1. Define a MSRV policy (maybe copy the existing DataFusion one)
   2. Implement some sort of CI check (again can copy / reuse the datafusion 
one if we want)
   3. 
   
 🤔  interestingly we don't seem to have an MSRV policy in this 
crate (at least not that I could find in 
https://github.com/apache/datafusion-sqlparser-rs/blob/main/README.md) 
   
   We do have such a thing in DataFusion here: 
https://github.com/apache/datafusion?tab=readme-ov-file#rust-version-compatibility-policy
   
   Maybe it is worth considering adding some documentation / policy in the 
sqlparser crate too 🤔
   
   _Originally posted by @alamb in 
https://github.com/apache/datafusion-sqlparser-rs/issues/1736#issuecomment-2676808035_
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Improve performance of join building performance (up to 1.28x speedup) [datafusion]

2025-02-24 Thread via GitHub



Dandandan closed pull request #14861: Improve performance of join building 
performance (up to 1.28x speedup)
URL: https://github.com/apache/datafusion/pull/14861


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Improve join building performance [datafusion]

2025-02-24 Thread via GitHub



Dandandan closed issue #14859: Improve join building performance
URL: https://github.com/apache/datafusion/issues/14859


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] [Discussion] Efficient Row Selection for Multi-Engine Support [datafusion]

2025-02-24 Thread via GitHub



bharath-techie commented on issue #14816:
URL: https://github.com/apache/datafusion/issues/14816#issuecomment-2679245307

   Hi @chenkovsky ,
   Thanks a ton for quick POC on this. :) 
   
   The row ids seems to be specific to each batch and not across the entire 
parquet file - is my understanding correct ? 
   
   Reason is our use case will mainly benefit from parquet file level row ids.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] [WIP] Move `FileSourceConfig` and `FileStream` to the new `datafusion-datasource` [datafusion]

2025-02-24 Thread via GitHub



AdamGS commented on code in PR #14838:
URL: https://github.com/apache/datafusion/pull/14838#discussion_r1968175868


##
datafusion/common/src/test_util.rs:
##
@@ -28,7 +28,7 @@ use std::{error::Error, path::PathBuf};
 ///
 /// Expects to be called about like this:
 ///
-/// `assert_batch_eq!(expected_lines: &[&str], batches: &[RecordBatch])`

Review Comment:
   this was just a typo



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Support bounds evaluation for temporal data types [datafusion]

2025-02-24 Thread via GitHub



ch-sc commented on code in PR #14523:
URL: https://github.com/apache/datafusion/pull/14523#discussion_r1968018999


##
datafusion/physical-expr/src/expressions/in_list.rs:
##
@@ -398,6 +399,51 @@ impl PhysicalExpr for InListExpr {
 self.static_filter.clone(),
 )))
 }
+
+/// The output interval is computed by checking if the list item intervals 
are
+/// a subset of, overlap, or are disjoint with the input expression's 
interval.
+///
+/// If [InListExpr::negated] is true, the output interval gets negated.
+///
+/// # Example:
+/// If the input expression's interval is a superset of the
+/// conjunction of the list items intervals, the output
+/// interval is [`Interval::CERTAINLY_TRUE`].
+///
+/// ```text
+/// interval of expr:   -
+/// Some list items:..|..|.|.|...
+///
+/// output interval:[`true`, `true`]
+/// ```
+fn evaluate_bounds(&self, children: &[&Interval]) -> Result {
+let expr_bounds = children[0];
+
+debug_assert!(
+children.len() >= 2,
+"InListExpr requires at least one list item"
+);
+
+// conjunction of list item intervals
+let list_bounds = children
+.iter()
+.skip(2)
+.try_fold(children[1].clone(), |acc, item| acc.union(*item))?;
+
+if self.negated {
+expr_bounds.contains(list_bounds)?.boolean_negate()
+} else {
+expr_bounds.contains(list_bounds)

Review Comment:
   `contains` for `[3,5]` and `[0,9]` should return `UNCERTAIN`. I added a test 
with your example in in_list.rs.
   
   `CERTAINLY_TRUE` should only be returned if expr_bounds is a superset of the 
union of the list bounds. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] [WIP] Move `FileSourceConfig` and `FileStream` to the new `datafusion-datasource` [datafusion]

2025-02-24 Thread via GitHub



AdamGS commented on code in PR #14838:
URL: https://github.com/apache/datafusion/pull/14838#discussion_r1968197144


##
datafusion/datasource/Cargo.toml:
##
@@ -69,6 +71,7 @@ xz2 = { version = "0.1", optional = true, features = 
["static"] }
 zstd = { version = "0.13", optional = true, default-features = false }
 
 [dev-dependencies]
+datafusion = { workspace = true, default-features = true }

Review Comment:
   seems like some was unnecessary, and some were already moved into 
`memory.rs`, so I just moved it up under `#[cfg(test)]`.



##
datafusion/datasource/Cargo.toml:
##
@@ -69,6 +71,7 @@ xz2 = { version = "0.1", optional = true, features = 
["static"] }
 zstd = { version = "0.13", optional = true, default-features = false }
 
 [dev-dependencies]
+datafusion = { workspace = true, default-features = true }

Review Comment:
   seems like some were unnecessary, and some were already moved into 
`memory.rs`, so I just moved it up under `#[cfg(test)]`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] chore: Strip debuginfo symbols for release [datafusion]

2025-02-24 Thread via GitHub



comphead commented on PR #14843:
URL: https://github.com/apache/datafusion/pull/14843#issuecomment-2679349895

   DF does not catch panics, so the process will crash anyway no matter what 
the setting is. 
   Although there are some test cases which does `catch_unwind` but we do run 
tests in debug mode.
   Another thing for abort it doesn't call object destructors if process 
crashed which for DF imho it is not an issue.
   
   Potentially DF even can gain some performance which I will check separately 
and share the results here
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[I] Experimental native scan test failures [datafusion-comet]

2025-02-24 Thread via GitHub



mbutrovich opened a new issue, #1441:
URL: https://github.com/apache/datafusion-comet/issues/1441

   We have two experimental native scans based on DataFusion's ParquetExec 
operator. This issue will track the remaining test failures and any related 
notes as we bring the test failures down to zero.
   
   **ParquetReadV1Suite:**
   - test multiple pages with different sizes and nulls (prefetch enabled)
   - test multiple pages with different sizes and nulls
   - scan metrics (prefetch enabled)
   - scan metrics
   
   **CometJoinSuite:**
   - HashJoin struct key (`native_datafusion` only)
   
   **CometCastSuite:**
   - cast TimestampType to LongType
   - cast TimestampType to StringType
   - cast TimestampType to DateType
   - cast StructType to StringType
   
   **CometArrayExpressionSuite:**
   - array_intersect
   
   **CometExecSuite:**
   - Comet native metrics: scan
   - explain native plan (`native_datafusion` only)
   
   **CometExpressionSuite:**
   - round
   - hex


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] chore: Strip debuginfo symbols for release [datafusion]

2025-02-24 Thread via GitHub



rkrishn7 commented on PR #14843:
URL: https://github.com/apache/datafusion/pull/14843#issuecomment-2679441117

   > DF does not catch panics, so the process will crash anyway no matter what 
the setting is.
   
   @comphead If I'm not mistaken, DF may catch panics in certain cases.
   
   For example, if a serialization task panics during writing to some object 
store (see 
[here](https://github.com/apache/datafusion/blob/e799097cbd2079d658ddc6243817ac529e2f9807/datafusion/datasource/src/write/orchestration.rs#L135)).
   
   I confirmed this by inserting a dummy panic in the serialization task.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Chore: Release datafusion-python 45 [datafusion-python]

2025-02-24 Thread via GitHub



kevinjqliu commented on PR #1024:
URL: 
https://github.com/apache/datafusion-python/pull/1024#issuecomment-2679610682

   woot! I see 45.2.0 on https://pypi.org/project/datafusion/
   thanks for working on the release!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] build(deps): bump prost-types from 0.13.4 to 0.13.5 [datafusion-python]

2025-02-24 Thread via GitHub



dependabot[bot] commented on PR #1021:
URL: 
https://github.com/apache/datafusion-python/pull/1021#issuecomment-2679578887

   Looks like prost-types is up-to-date now, so this is no longer needed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Release DataFusion `46.0.0` [datafusion]

2025-02-24 Thread via GitHub



alamb commented on issue #14123:
URL: https://github.com/apache/datafusion/issues/14123#issuecomment-2679454880

   I made a PR for testing in delta here: 
   - https://github.com/delta-io/delta-rs/pull/3261
   
   Still has some issues to work out


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] fix(substrait): Do not add implicit groupBy expressions when building logical plans from Substrait [datafusion]

2025-02-24 Thread via GitHub



alamb commented on PR #14860:
URL: https://github.com/apache/datafusion/pull/14860#issuecomment-2679711714

   Hi @anlinc  
   
   It looks like there is a small clippy failure with this PR:
   
   
https://github.com/apache/datafusion/actions/runs/13507870042/job/37742937287?pr=14860


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Move `FileSourceConfig` and `FileStream` to the new `datafusion-datasource` [datafusion]

2025-02-24 Thread via GitHub



alamb commented on code in PR #14838:
URL: https://github.com/apache/datafusion/pull/14838#discussion_r1968468146


##
datafusion/datasource/src/file_scan_config.rs:
##
@@ -15,19 +15,611 @@
 // specific language governing permissions and limitations
 // under the License.
 
-use std::{borrow::Cow, collections::HashMap, marker::PhantomData, sync::Arc};
+//! [`FileScanConfig`] to configure scanning of possibly partitioned
+//! file sources.
+
+use std::{
+any::Any, borrow::Cow, collections::HashMap, fmt::Debug, fmt::Formatter,
+fmt::Result as FmtResult, marker::PhantomData, sync::Arc,
+};
 
 use arrow::{
 array::{
 ArrayData, ArrayRef, BufferBuilder, DictionaryArray, RecordBatch,
 RecordBatchOptions,
 },
 buffer::Buffer,
-datatypes::{ArrowNativeType, DataType, SchemaRef, UInt16Type},
+datatypes::{ArrowNativeType, DataType, Field, Schema, SchemaRef, 
UInt16Type},
+};
+use datafusion_common::{
+exec_err, stats::Precision, ColumnStatistics, Constraints, Result, 
Statistics,
 };
-use datafusion_common::{exec_err, Result};
 use datafusion_common::{DataFusionError, ScalarValue};
-use log::warn;
+use datafusion_execution::{
+object_store::ObjectStoreUrl, SendableRecordBatchStream, TaskContext,
+};
+use datafusion_physical_expr::{
+expressions::Column, EquivalenceProperties, LexOrdering, Partitioning,
+PhysicalSortExpr,
+};
+use datafusion_physical_plan::{
+display::{display_orderings, ProjectSchemaDisplay},
+metrics::ExecutionPlanMetricsSet,
+projection::{all_alias_free_columns, new_projections_for_columns, 
ProjectionExec},
+DisplayAs, DisplayFormatType, ExecutionPlan,
+};
+use log::{debug, warn};
+
+use crate::{
+display::FileGroupsDisplay,
+file::FileSource,
+file_compression_type::FileCompressionType,
+file_stream::FileStream,
+source::{DataSource, DataSourceExec},
+statistics::MinMaxStatistics,
+PartitionedFile,
+};
+
+/// The base configurations for a [`DataSourceExec`], the a physical plan for
+/// any given file format.
+///
+/// Use [`Self::build`] to create a [`DataSourceExec`] from a 
``FileScanConfig`.
+///
+/// # Example
+/// ```ignore
+/// # use std::sync::Arc;
+/// # use arrow::datatypes::{Field, Fields, DataType, Schema};
+/// # use datafusion_datasource::PartitionedFile;
+/// # use datafusion_datasource::file_scan_config::FileScanConfig;
+/// # use datafusion_execution::object_store::ObjectStoreUrl;
+/// # use datafusion::datasource::physical_plan::ArrowSource;

Review Comment:
   This is a weird example anyways because it actually refers to `ParquetFiles`
   
   I took the liberty of pushing a commit to this branch to restore the doc 
test by mocking out a ParquetSource in 35e6ab727
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Store spans for Value expressions [datafusion-sqlparser-rs]

2025-02-24 Thread via GitHub



lovasoa commented on code in PR #1738:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1738#discussion_r1968471231


##
src/ast/mod.rs:
##
@@ -8789,9 +8796,9 @@ mod tests {
 #[test]
 fn test_interval_display() {
 let interval = Expr::Interval(Interval {
-value: Box::new(Expr::Value(Value::SingleQuotedString(String::from(
-"123:45.67",
-,
+value: Box::new(Expr::Value(

Review Comment:
   Yes! But I don't have the strength to fight ast-grep again 🙂‍↔️



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] fix(substrait): Do not add implicit groupBy expressions when building logical plans from Substrait [datafusion]

2025-02-24 Thread via GitHub



alamb commented on code in PR #14860:
URL: https://github.com/apache/datafusion/pull/14860#discussion_r1968445509


##
datafusion/expr/src/logical_plan/builder.rs:
##
@@ -63,6 +64,32 @@ use indexmap::IndexSet;
 /// Default table name for unnamed table
 pub const UNNAMED_TABLE: &str = "?table?";
 
+#[derive(Debug, Clone)]
+pub struct LogicalPlanBuilderOptions {
+/// Flag indicating whether the plan builder should add
+/// functionally dependent expressions as additional aggregation groupings.
+add_implicit_group_by_exprs: bool,
+}
+
+impl Default for LogicalPlanBuilderOptions {
+fn default() -> Self {
+Self {
+add_implicit_group_by_exprs: false,
+}
+}
+}
+
+impl LogicalPlanBuilderOptions {
+pub fn new() -> Self {
+Default::default()
+}
+
+pub fn with_add_implicit_group_by_exprs(mut self, add: bool) -> Self {

Review Comment:
   ```suggestion
   /// Should the builder add functionally dependent expressions as 
additional aggregation groupings.
   pub fn with_add_implicit_group_by_exprs(mut self, add: bool) -> Self {
   ```



##
datafusion/expr/src/logical_plan/builder.rs:
##
@@ -63,6 +64,32 @@ use indexmap::IndexSet;
 /// Default table name for unnamed table
 pub const UNNAMED_TABLE: &str = "?table?";
 
+#[derive(Debug, Clone)]

Review Comment:
   ```suggestion
   /// Options for [`LogicalPlanBuilder`]
   #[derive(Debug, Clone)]
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] [WIP] Move `FileSourceConfig` and `FileStream` to the new `datafusion-datasource` [datafusion]

2025-02-24 Thread via GitHub



AdamGS commented on code in PR #14838:
URL: https://github.com/apache/datafusion/pull/14838#discussion_r1968201066


##
datafusion/core/Cargo.toml:
##
@@ -40,7 +40,7 @@ nested_expressions = ["datafusion-functions-nested"]
 # This feature is deprecated. Use the `nested_expressions` feature instead.
 array_expressions = ["nested_expressions"]
 # Used to enable the avro format
-avro = ["apache-avro", "num-traits", "datafusion-common/avro"]
+avro = ["apache-avro", "num-traits", "datafusion-common/avro", 
"datafusion-datasource/avro"]

Review Comment:
   I think the next PR will try and do that, but some will always stay even if 
just for stuff like `AvroError`/`ParquetError`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] [WIP] Move `FileSourceConfig` and `FileStream` to the new `datafusion-datasource` [datafusion]

2025-02-24 Thread via GitHub



AdamGS commented on code in PR #14838:
URL: https://github.com/apache/datafusion/pull/14838#discussion_r1968368078


##
datafusion/datasource/src/file_scan_config.rs:
##
@@ -15,19 +15,611 @@
 // specific language governing permissions and limitations
 // under the License.
 
-use std::{borrow::Cow, collections::HashMap, marker::PhantomData, sync::Arc};
+//! [`FileScanConfig`] to configure scanning of possibly partitioned
+//! file sources.
+
+use std::{
+any::Any, borrow::Cow, collections::HashMap, fmt::Debug, fmt::Formatter,
+fmt::Result as FmtResult, marker::PhantomData, sync::Arc,
+};
 
 use arrow::{
 array::{
 ArrayData, ArrayRef, BufferBuilder, DictionaryArray, RecordBatch,
 RecordBatchOptions,
 },
 buffer::Buffer,
-datatypes::{ArrowNativeType, DataType, SchemaRef, UInt16Type},
+datatypes::{ArrowNativeType, DataType, Field, Schema, SchemaRef, 
UInt16Type},
+};
+use datafusion_common::{
+exec_err, stats::Precision, ColumnStatistics, Constraints, Result, 
Statistics,
 };
-use datafusion_common::{exec_err, Result};
 use datafusion_common::{DataFusionError, ScalarValue};
-use log::warn;
+use datafusion_execution::{
+object_store::ObjectStoreUrl, SendableRecordBatchStream, TaskContext,
+};
+use datafusion_physical_expr::{
+expressions::Column, EquivalenceProperties, LexOrdering, Partitioning,
+PhysicalSortExpr,
+};
+use datafusion_physical_plan::{
+display::{display_orderings, ProjectSchemaDisplay},
+metrics::ExecutionPlanMetricsSet,
+projection::{all_alias_free_columns, new_projections_for_columns, 
ProjectionExec},
+DisplayAs, DisplayFormatType, ExecutionPlan,
+};
+use log::{debug, warn};
+
+use crate::{
+display::FileGroupsDisplay,
+file::FileSource,
+file_compression_type::FileCompressionType,
+file_stream::FileStream,
+source::{DataSource, DataSourceExec},
+statistics::MinMaxStatistics,
+PartitionedFile,
+};
+
+/// The base configurations for a [`DataSourceExec`], the a physical plan for
+/// any given file format.
+///
+/// Use [`Self::build`] to create a [`DataSourceExec`] from a 
``FileScanConfig`.
+///
+/// # Example
+/// ```ignore
+/// # use std::sync::Arc;
+/// # use arrow::datatypes::{Field, Fields, DataType, Schema};
+/// # use datafusion_datasource::PartitionedFile;
+/// # use datafusion_datasource::file_scan_config::FileScanConfig;
+/// # use datafusion_execution::object_store::ObjectStoreUrl;
+/// # use datafusion::datasource::physical_plan::ArrowSource;

Review Comment:
   Had to ignore this very good rustdoc test because we don't have 
`ArrowSource` here. My plan is to only have that as a temporary thing between 
PRs, with the fix being either:
   1. Have some `InMemoryFileSource` here that just gets `Vec` and 
knows how to return it through `FileSource` (or maybe even the `MockSource` I 
added for tests somewhere in this PR).
   2. Potentially put `ArrowSource` in this crate as the "canonical" source, 
especially given how small it is at less than 200 lines of actual code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] build(deps): bump prost from 0.13.4 to 0.13.5 [datafusion-python]

2025-02-24 Thread via GitHub



dependabot[bot] closed pull request #1022: build(deps): bump prost from 0.13.4 
to 0.13.5
URL: https://github.com/apache/datafusion-python/pull/1022


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] build(deps): bump prost from 0.13.4 to 0.13.5 [datafusion-python]

2025-02-24 Thread via GitHub



dependabot[bot] commented on PR #1022:
URL: 
https://github.com/apache/datafusion-python/pull/1022#issuecomment-267957

   Looks like prost is up-to-date now, so this is no longer needed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] build(deps): bump prost-types from 0.13.4 to 0.13.5 [datafusion-python]

2025-02-24 Thread via GitHub



dependabot[bot] closed pull request #1021: build(deps): bump prost-types from 
0.13.4 to 0.13.5
URL: https://github.com/apache/datafusion-python/pull/1021


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Chore: Release datafusion-python 45 [datafusion-python]

2025-02-24 Thread via GitHub



timsaucer merged PR #1024:
URL: https://github.com/apache/datafusion-python/pull/1024


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Set projection before configuring the source [datafusion]

2025-02-24 Thread via GitHub



milenkovicm commented on PR #14685:
URL: https://github.com/apache/datafusion/pull/14685#issuecomment-2679834606

   Would it make sense to add a test which would cover plan round trip?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Add example to spark-expr crate [datafusion-comet]

2025-02-24 Thread via GitHub



andygrove closed issue #1365: Add example to spark-expr crate
URL: https://github.com/apache/datafusion-comet/issues/1365


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] [Discussion] Efficient Row Selection for Multi-Engine Support [datafusion]

2025-02-24 Thread via GitHub



chenkovsky commented on issue #14816:
URL: https://github.com/apache/datafusion/issues/14816#issuecomment-2679992458

   > Hi @chenkovsky ,
   > Thanks a ton for quick POC on this. :) 
   > 
   > The row ids seems to be specific to each batch and not across the entire 
parquet file - is my understanding correct ? 
   > 
   > Reason is our use case will mainly benefit from parquet file level row ids.
   
   @bharath-techie yes，This is just an example, not for real situation. If it 
needs to meet the actual requirements, I will need more time. i think i have to 
learn more about parquet.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] fix: enable full decimal to decimal support [datafusion-comet]

2025-02-24 Thread via GitHub



himadripal commented on code in PR #1385:
URL: https://github.com/apache/datafusion-comet/pull/1385#discussion_r1968654009


##
spark/src/test/scala/org/apache/comet/CometCastSuite.scala:
##
@@ -1126,27 +1129,33 @@ class CometCastSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
   val cometMessage =
 if (cometException.getCause != null) 
cometException.getCause.getMessage
 else cometException.getMessage
-  if (CometSparkSessionExtensions.isSpark40Plus) {
-// for Spark 4 we expect to sparkException carries the message
-assert(
-  sparkException.getMessage
-.replace(".WITH_SUGGESTION] ", "]")
-.startsWith(cometMessage))
-  } else if (CometSparkSessionExtensions.isSpark34Plus) {
-// for Spark 3.4 we expect to reproduce the error message 
exactly
-assert(cometMessage == sparkMessage)
+  // for comet decimal conversion throws ArrowError(string) from 
arrow - across spark versions the message dont match.
+  if (sparkMessage.contains("cannot be represented as")) {
+cometMessage.contains("cannot be represented as") || 
cometMessage.contains(
+  "too large to store")
   } else {

Review Comment:
   > Or you can move the message check below and switch the expected messages 
based on the fromType/toType.
   
   I'll try this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] DataFusion 45 blog post [datafusion-site]

2025-02-24 Thread via GitHub



phillipleblanc commented on code in PR #57:
URL: https://github.com/apache/datafusion-site/pull/57#discussion_r1968653103


##
content/blog/2025-02-20-datafusion-45.0.0.md:
##
@@ -0,0 +1,315 @@
+---
+layout: post
+title: Apache DataFusion 45.0.0 Released
+date: 2025-02-20
+author: pmc
+categories: [release]
+---
+
+
+
+
+
+## Introduction
+
+We are very proud to announce [DataFusion 45.0.0]. This blog highlights some 
of the
+many major improvements since we released [DataFusion 40.0.0] and a preview of
+what the community is thinking about in the next 6 months. It has been an 
exciting
+period of development for DataFusion!
+
+[DataFusion 40.0.0]: 
https://datafusion.apache.org/blog/2024/07/24/datafusion-40.0.0/
+[DataFusion 45.0.0]: https://crates.io/crates/datafusion/45.0.0
+
+[Apache DataFusion] is an extensible query engine, written in [Rust], that
+uses [Apache Arrow] as its in-memory format. DataFusion is used by developers 
to
+create new, fast data centric systems such as databases, dataframe libraries,
+machine learning and streaming applications. While [DataFusion’s primary design
+goal] is to accelerate the creation of other data centric systems, it has a
+reasonable experience directly out of the box as a [dataframe library],
+[python library] and [command line SQL tool].
+
+[apache datafusion]: https://datafusion.apache.org/
+[rust]: https://www.rust-lang.org/
+[apache arrow]: https://arrow.apache.org
+[DataFusion’s primary design goal]: 
https://datafusion.apache.org/user-guide/introduction.html#project-goals
+[dataframe library]: https://datafusion.apache.org/user-guide/dataframe.html
+[python library]: https://datafusion.apache.org/python/
+[command line SQL tool]: https://datafusion.apache.org/user-guide/cli/
+
+DataFusion's core thesis is that as a community, together we can build much 
more
+advanced technology than any of us as individuals or companies could do alone. 
+Without DataFusion, highly performant vectorized query engines would remain
+the domain of a few large companies and world-class research institutions. 
+With DataFusion, we can all build on top of a shared foundation, and focus on
+what makes our projects unique.
+
+
+## Community Growth  📈 
+
+In the last 6 months, between `40.0.0` and `45.0.0`, our community continues to
+grow in new and exciting ways.
+
+1. We added several PMC members and new committers: [@jayzhan211] and 
[@jonahgao] joined the PMC,
+   [@2010YOUY01], [@rachelint], [@findpi], [@iffyio], [@goldmedal], 
[@Weijun-H], [@Michael-J-Ward] and [@korowa]
+   joined as committers. See the [mailing list] for more details.
+2. In the [core DataFusion repo] alone we reviewed and accepted almost 1600 
PRs from 206 different
+   committers, created over 1100 issues and closed 751 of them 🚀. All changes 
are listed in the detailed
+   [changelogs].
+3. DataFusion focused meetups happened in multiple cities around the world: 
[Hangzhou], [Belgrade], [New York], 
+   [Seattle], [Chicago], [Boston] and [Amsterdam] as well as a Rust NYC meetup 
in NYC focussed on DataFusion.

Review Comment:
   ```suggestion
  [Seattle], [Chicago], [Boston] and [Amsterdam] as well as a Rust NYC 
meetup in NYC focused on DataFusion.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] fix: enable full decimal to decimal support [datafusion-comet]

2025-02-24 Thread via GitHub



himadripal commented on code in PR #1385:
URL: https://github.com/apache/datafusion-comet/pull/1385#discussion_r1968655592


##
spark/src/test/scala/org/apache/comet/CometCastSuite.scala:
##
@@ -1126,27 +1129,33 @@ class CometCastSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
   val cometMessage =
 if (cometException.getCause != null) 
cometException.getCause.getMessage
 else cometException.getMessage
-  if (CometSparkSessionExtensions.isSpark40Plus) {
-// for Spark 4 we expect to sparkException carries the message
-assert(
-  sparkException.getMessage
-.replace(".WITH_SUGGESTION] ", "]")
-.startsWith(cometMessage))
-  } else if (CometSparkSessionExtensions.isSpark34Plus) {
-// for Spark 3.4 we expect to reproduce the error message 
exactly
-assert(cometMessage == sparkMessage)
+  // for comet decimal conversion throws ArrowError(string) from 
arrow - across spark versions the message dont match.
+  if (sparkMessage.contains("cannot be represented as")) {
+cometMessage.contains("cannot be represented as") || 
cometMessage.contains(
+  "too large to store")
   } else {

Review Comment:
   > One or the other way. Otherwise we cannot confidently say that the test is 
passing due to the cannot be represented as message or the too large to store 
message.
   
   I'm contemplating creating a different test/check function for decimal to 
decimal test.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Browser-accessible official DataFusion playground / DataFusion fiddle [datafusion]

2025-02-24 Thread via GitHub



waynexia commented on issue #13818:
URL: https://github.com/apache/datafusion/issues/13818#issuecomment-2680102365

   I'm imaging a highly encapsulated JS component like
   
   ```tsx
   
   ```
   
   and the user who wants an interactive terminal can render a DataFusion 
webpage with one line, who wants to use DataFusion WASM to do computation can 
invoke it as a JS function


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Release DataFusion `46.0.0` [datafusion]

2025-02-24 Thread via GitHub



shehabgamin commented on issue #14123:
URL: https://github.com/apache/datafusion/issues/14123#issuecomment-2680109294

   I will test with Sail by Wednesday!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Add DataFrame fill_null [datafusion]

2025-02-24 Thread via GitHub



kosiew commented on PR #14769:
URL: https://github.com/apache/datafusion/pull/14769#issuecomment-2680284864

   > its documented in 
https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html 
not in the code itself
   
   I added fill_null example usage in dataframe/mod.rs for rustdoc to update 
https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] fix: fetch is missed during EnforceDistribution [datafusion]

2025-02-24 Thread via GitHub



xudong963 commented on code in PR #14207:
URL: https://github.com/apache/datafusion/pull/14207#discussion_r1968751449


##
datafusion/core/tests/physical_optimizer/enforce_distribution.rs:
##
@@ -3154,3 +3164,104 @@ fn optimize_away_unnecessary_repartition2() -> 
Result<()> {
 
 Ok(())
 }
+
+#[tokio::test]
+async fn apply_enforce_distribution_multiple_times() -> Result<()> {
+// Create a configuration
+let config = SessionConfig::new();
+let ctx = SessionContext::new_with_config(config);
+let testdata = datafusion::test_util::arrow_test_data();
+let csv_file = format!("{testdata}/csv/aggregate_test_100.csv");
+// Create table schema and data
+let sql = format!(
+"CREATE EXTERNAL TABLE aggregate_test_100 (
+c1  VARCHAR NOT NULL,
+c2  TINYINT NOT NULL,
+c3  SMALLINT NOT NULL,
+c4  SMALLINT,
+c5  INT,
+c6  BIGINT NOT NULL,
+c7  SMALLINT NOT NULL,
+c8  INT NOT NULL,
+c9  BIGINT UNSIGNED NOT NULL,
+c10 VARCHAR NOT NULL,
+c11 FLOAT NOT NULL,
+c12 DOUBLE NOT NULL,
+c13 VARCHAR NOT NULL
+)
+STORED AS CSV
+LOCATION '{csv_file}'
+OPTIONS ('format.has_header' 'true')"
+);
+
+ctx.sql(sql.as_str()).await?;
+
+let df = ctx.sql("SELECT * FROM(SELECT * FROM aggregate_test_100 UNION ALL 
SELECT * FROM aggregate_test_100) ORDER BY c13 LIMIT 5").await?;
+let logical_plan = df.logical_plan().clone();
+let analyzed_logical_plan = ctx.state().analyzer().execute_and_check(
+logical_plan,
+ctx.state().config_options(),
+|_, _| (),
+)?;
+let optimized_logical_plan = ctx.state().optimizer().optimize(
+analyzed_logical_plan,
+&ctx.state(),
+|_, _| (),
+)?;
+
+let optimizers: Vec> = vec![
+Arc::new(OutputRequirements::new_add_mode()),
+Arc::new(EnforceDistribution::new()),
+Arc::new(EnforceSorting::new()),
+Arc::new(ProjectionPushdown::new()),
+Arc::new(CoalesceBatches::new()),
+Arc::new(EnforceDistribution::new()), // -- Add enforce distribution 
rule again
+// The second `EnforceDistribution` should be run before removing 
`OutputRequirements` to reproduce the bug.
+Arc::new(OutputRequirements::new_remove_mode()),
+Arc::new(ProjectionPushdown::new()),
+Arc::new(LimitPushdown::new()),
+Arc::new(SanityCheckPlan::new()),

Review Comment:
   
https://github.com/apache/datafusion/pull/14207/commits/ffb1eb3d3545ff367bbfac715cb005651f433848
 done
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Add DataFrame fill_null [datafusion]

2025-02-24 Thread via GitHub



kosiew commented on code in PR #14769:
URL: https://github.com/apache/datafusion/pull/14769#discussion_r1968745726


##
datafusion/core/src/dataframe/mod.rs:
##
@@ -1926,6 +1930,71 @@ impl DataFrame {
 plan,
 })
 }
+
+/// Fill null values in specified columns with a given value
+/// If no columns are specified, applies to all columns
+/// Only fills if the value can be cast to the column's type
+///
+/// # Arguments
+/// * `value` - Value to fill nulls with
+/// * `columns` - Optional list of column names to fill. If None, fills 
all columns
+pub fn fill_null(
+&self,
+value: ScalarValue,
+columns: Option>,

Review Comment:
   Fixed.
   Thanks for spotting this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Add DataFrame fill_null [datafusion]

2025-02-24 Thread via GitHub



kosiew commented on code in PR #14769:
URL: https://github.com/apache/datafusion/pull/14769#discussion_r1968745726


##
datafusion/core/src/dataframe/mod.rs:
##
@@ -1926,6 +1930,71 @@ impl DataFrame {
 plan,
 })
 }
+
+/// Fill null values in specified columns with a given value
+/// If no columns are specified, applies to all columns
+/// Only fills if the value can be cast to the column's type
+///
+/// # Arguments
+/// * `value` - Value to fill nulls with
+/// * `columns` - Optional list of column names to fill. If None, fills 
all columns
+pub fn fill_null(
+&self,
+value: ScalarValue,
+columns: Option>,

Review Comment:
   Fixed/
   Thanks for spotting this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Add DataFrame fill_null [datafusion]

2025-02-24 Thread via GitHub



kosiew commented on code in PR #14769:
URL: https://github.com/apache/datafusion/pull/14769#discussion_r1967089979


##
datafusion/core/src/dataframe/mod.rs:
##
@@ -1926,6 +1930,71 @@ impl DataFrame {
 plan,
 })
 }
+
+/// Fill null values in specified columns with a given value
+/// If no columns are specified, applies to all columns
+/// Only fills if the value can be cast to the column's type
+///
+/// # Arguments
+/// * `value` - Value to fill nulls with
+/// * `columns` - Optional list of column names to fill. If None, fills 
all columns
+pub fn fill_null(
+&self,
+value: ScalarValue,
+columns: Option>,
+) -> Result {
+let cols = match columns {
+Some(names) => self.find_columns(&names)?,
+None => self

Review Comment:
   I was following the Pandas convention, where no limits (columns) means 
fill_null for all columns
   https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
   
   https://github.com/user-attachments/assets/c4eab1eb-f1d2-4133-90c3-0a2c57b0349d";
 />
   
   
   I could amend to no op as well if this is the preferred convention



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] fix: enable full decimal to decimal support [datafusion-comet]

2025-02-24 Thread via GitHub



kazuyukitanimura commented on code in PR #1385:
URL: https://github.com/apache/datafusion-comet/pull/1385#discussion_r1968649099


##
spark/src/test/scala/org/apache/comet/CometCastSuite.scala:
##
@@ -1126,27 +1129,33 @@ class CometCastSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
   val cometMessage =
 if (cometException.getCause != null) 
cometException.getCause.getMessage
 else cometException.getMessage
-  if (CometSparkSessionExtensions.isSpark40Plus) {
-// for Spark 4 we expect to sparkException carries the message
-assert(
-  sparkException.getMessage
-.replace(".WITH_SUGGESTION] ", "]")
-.startsWith(cometMessage))
-  } else if (CometSparkSessionExtensions.isSpark34Plus) {
-// for Spark 3.4 we expect to reproduce the error message 
exactly
-assert(cometMessage == sparkMessage)
+  // for comet decimal conversion throws ArrowError(string) from 
arrow - across spark versions the message dont match.
+  if (sparkMessage.contains("cannot be represented as")) {
+cometMessage.contains("cannot be represented as") || 
cometMessage.contains(
+  "too large to store")
   } else {

Review Comment:
   Another way to remove the `if` block is to convert the error message to make 
it similar to Spark
   This is the place that we are defining Spark error messages 
https://github.com/apache/datafusion-comet/blob/main/native/spark-expr/src/error.rs#L36
   You can check how these are used.
   
   Or you can move the message check below and switch the expected messages 
based on the fromType/toType.
   
   One or the other way.  Otherwise we cannot confidently say that the test is 
passing due to the `cannot be represented as` message or the `too large to 
store` message.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] refactor: use TypeSignature::Coercible for crypto functions [datafusion]

2025-02-24 Thread via GitHub



jayzhan211 commented on PR #14826:
URL: https://github.com/apache/datafusion/pull/14826#issuecomment-2680085289

   > [SQL] EXPLAIN SELECT
 digest(column1_utf8view, 'md5') as c
   FROM test;
   [Diff] (-expected|+actual)
   logical_plan
   -   01)Projection: digest(test.column1_utf8view, Utf8View("md5")) AS c
   +   01)Projection: digest(test.column1_utf8view, Utf8("md5")) AS c
   02)--TableScan: test projection=[column1_utf8view]
   
   This looks like an improvement, we don't need to cast it to utf8view


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] fix: enable full decimal to decimal support [datafusion-comet]

2025-02-24 Thread via GitHub



himadripal commented on code in PR #1385:
URL: https://github.com/apache/datafusion-comet/pull/1385#discussion_r1968652251


##
spark/src/test/scala/org/apache/comet/CometCastSuite.scala:
##
@@ -1126,27 +1129,33 @@ class CometCastSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
   val cometMessage =
 if (cometException.getCause != null) 
cometException.getCause.getMessage
 else cometException.getMessage
-  if (CometSparkSessionExtensions.isSpark40Plus) {
-// for Spark 4 we expect to sparkException carries the message
-assert(
-  sparkException.getMessage
-.replace(".WITH_SUGGESTION] ", "]")
-.startsWith(cometMessage))
-  } else if (CometSparkSessionExtensions.isSpark34Plus) {
-// for Spark 3.4 we expect to reproduce the error message 
exactly
-assert(cometMessage == sparkMessage)
+  // for comet decimal conversion throws ArrowError(string) from 
arrow - across spark versions the message dont match.
+  if (sparkMessage.contains("cannot be represented as")) {
+cometMessage.contains("cannot be represented as") || 
cometMessage.contains(
+  "too large to store")
   } else {

Review Comment:
   > Another way to remove the if block is to convert the error message to make 
it similar to Spark
   This is the place that we are defining Spark error messages 
https://github.com/apache/datafusion-comet/blob/main/native/spark-expr/src/error.rs#L36
   You can check how these are used
   
   This one I checked - problem here is from native execution, we get back 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] fix: enable full decimal to decimal support [datafusion-comet]

2025-02-24 Thread via GitHub



himadripal commented on code in PR #1385:
URL: https://github.com/apache/datafusion-comet/pull/1385#discussion_r1968652251


##
spark/src/test/scala/org/apache/comet/CometCastSuite.scala:
##
@@ -1126,27 +1129,33 @@ class CometCastSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
   val cometMessage =
 if (cometException.getCause != null) 
cometException.getCause.getMessage
 else cometException.getMessage
-  if (CometSparkSessionExtensions.isSpark40Plus) {
-// for Spark 4 we expect to sparkException carries the message
-assert(
-  sparkException.getMessage
-.replace(".WITH_SUGGESTION] ", "]")
-.startsWith(cometMessage))
-  } else if (CometSparkSessionExtensions.isSpark34Plus) {
-// for Spark 3.4 we expect to reproduce the error message 
exactly
-assert(cometMessage == sparkMessage)
+  // for comet decimal conversion throws ArrowError(string) from 
arrow - across spark versions the message dont match.
+  if (sparkMessage.contains("cannot be represented as")) {
+cometMessage.contains("cannot be represented as") || 
cometMessage.contains(
+  "too large to store")
   } else {

Review Comment:
   > Another way to remove the if block is to convert the error message to make 
it similar to Spark
   This is the place that we are defining Spark error messages 
https://github.com/apache/datafusion-comet/blob/main/native/spark-expr/src/error.rs#L36
   You can check how these are used
   
   This one I checked - problem here is from native execution, we get back 
`Arrow(ArrowError)` which only has a string, precision and scale information is 
not present. Also to construct an error message from a string - we need to 
check specific string in the message. 
   We can try to change the ArrowError to have parameter but that will be a big 
change. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[PR] chore: remove jackson dependency [datafusion-comet]

2025-02-24 Thread via GitHub



kazuyukitanimura opened a new pull request, #1442:
URL: https://github.com/apache/datafusion-comet/pull/1442

   ## Which issue does this PR close?
   
   Closes #.
   
   ## Rationale for this change
   
   ## What changes are included in this PR?
   
   Removed jackson dependency
   
   ## How are these changes tested?
   
   CI
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] chore: remove jackson dependency [datafusion-comet]

2025-02-24 Thread via GitHub



codecov-commenter commented on PR #1442:
URL: 
https://github.com/apache/datafusion-comet/pull/1442#issuecomment-2680235325

   ## 
[Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1442?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   All modified and coverable lines are covered by tests :white_check_mark:
   > Project coverage is 34.77%. Comparing base 
[(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 to head 
[(`1783e2f`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/1783e2f9222776be82c6805e5ac54d40fb871d7b?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   > Report is 47 commits behind head on main.
   
   Additional details and impacted files
   
   
   ```diff
   @@  Coverage Diff  @@
   ##   main#1442   +/-   ##
   =
   - Coverage 56.12%   34.77%   -21.36% 
   - Complexity  976 1071   +95 
   =
 Files   119  143   +24 
 Lines 1174349374+37631 
 Branches   225110804 +8553 
   =
   + Hits   659117168+10577 
   - Misses 401228777+24765 
   - Partials   1140 3429 +2289 
   ```
   
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1442?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Implement actual count wildcard in physical layer and fix duplicated schema name error from count wildcard [datafusion]

2025-02-24 Thread via GitHub



jayzhan211 commented on code in PR #14824:
URL: https://github.com/apache/datafusion/pull/14824#discussion_r1968726701


##
datafusion/core/tests/dataframe/dataframe_functions.rs:
##
@@ -1145,9 +1145,9 @@ async fn test_count_wildcard() -> Result<()> {
 .build()
 .unwrap();
 
-let expected = "Sort: count(*) ASC NULLS LAST [count(*):Int64]\
-\n  Projection: count(*) [count(*):Int64]\
-\nAggregate: groupBy=[[test.b]], aggr=[[count(*)]] [b:UInt32, 
count(*):Int64]\
+let expected = "Sort: count(Int64(1)) ASC NULLS LAST 
[count(Int64(1)):Int64]\

Review Comment:
   count_all() is count(1)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] feat: use edition 2024 [datafusion-sqlparser-rs]

2025-02-24 Thread via GitHub



alamb commented on PR #1736:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1736#issuecomment-2679691606

   > Adding an MSRV sounds good to me!
   
   Filed an issue to track:
   - https://github.com/apache/datafusion-sqlparser-rs/issues/1744


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[I] Compilation error due to rkyv/rkyv#434 [datafusion]

2025-02-24 Thread via GitHub



ryzhyk opened a new issue, #14862:
URL: https://github.com/apache/datafusion/issues/14862

   ### Describe the bug
   
   When datafusion is used in a workspace that enables the `rkyv-64` feature in 
the `chrono` crate, this triggers a Rust compilation error:
   
   ```
   error[E0277]: can't compare `Option<&std::string::String>` with `Option<&mut 
std::string::String>`
   --> datafusion/expr/src/logical_plan/plan.rs:2872:43
|
   2872 | merged.retain(|k, v| input.get(k) == Some(v));
|   ^^ no implementation for 
`Option<&std::string::String> == Option<&mut std::string::String>`
|
= help: the trait `PartialEq>` is not 
implemented for `Option<&std::string::String>`
= help: the following other types implement trait `PartialEq`:
  `Option` implements `PartialEq`
  `Option` implements 
`PartialEq>`
   ```
   
   The root cause of the error is incorrect type unification in the Rust 
compiler, as explained in https://github.com/rkyv/rkyv/issues/434.
   
   I will submit a PR with a workaround.
   
   ### To Reproduce
   
   Change `Cargo.toml` to enable the `rkyv-64` feature in chrono and try 
building the `datafusion-expr` crate:
   
   ```
   chrono = { version = "0.4.38", default-features = false, features = 
["rkyv-64"] }
   ```
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[PR] Workaround for compilation error due to rkyv#434. [datafusion]

2025-02-24 Thread via GitHub



ryzhyk opened a new pull request, #14863:
URL: https://github.com/apache/datafusion/pull/14863

   ## Which issue does this PR close?
   
   - Closes #14862
   
   ## Rationale for this change
   
   When datafusion is used in a workspace that enables the `rkyv-64` feature in 
the `chrono` crate, this triggered a Rust compilation error:
   
   ```
   error[E0277]: can't compare `Option<&std::string::String>` with `Option<&mut 
std::string::String>`.
   ```
   
   The root cause of the error is incorrect type unification in the Rust 
compiler, as explained in https://github.com/rkyv/rkyv/issues/434.
   
   ## What changes are included in this PR?
   
   The workaround pushes the compiler in the right direction by converting the 
mutable reference to an immutable one manually.
   
   ## Are these changes tested?
   
   I don't think this requires additional tests.
   
   ## Are there any user-facing changes?
   
   no


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] test: Register Spark-compatible expressions with a DataFusion context [datafusion-comet]

2025-02-24 Thread via GitHub



andygrove merged PR #1432:
URL: https://github.com/apache/datafusion-comet/pull/1432


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Blog / Example of how to compile DataFusion to WASM [datafusion]

2025-02-24 Thread via GitHub



waynexia commented on issue #13715:
URL: https://github.com/apache/datafusion/issues/13715#issuecomment-2680048213

   There is a developer guide for building, debugging, and publishing the WASM 
bindings: 
https://github.com/datafusion-contrib/datafusion-wasm-bindings/blob/main/CONTRIBUTING.md
 (forget to link 🙈)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Implement actual count wildcard in physical layer and fix duplicated schema name error from count wildcard [datafusion]

2025-02-24 Thread via GitHub



jayzhan211 commented on code in PR #14824:
URL: https://github.com/apache/datafusion/pull/14824#discussion_r1968712649


##
datafusion/core/tests/dataframe/mod.rs:
##
@@ -2455,7 +2455,7 @@ async fn test_count_wildcard_on_sort() -> Result<()> {
 let ctx = create_join_context()?;
 
 let sql_results = ctx
-.sql("select b,count(*) from t1 group by b order by count(*)")
+.sql("select b,count(1) from t1 group by b order by count(1)")

Review Comment:
   count_all() is count(1) now, so we need to change to count(1) to have a 
consistent name (count(*) is count(a) AS count(*) now). 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Example for using a separate threadpool for CPU bound work (try 2) [datafusion]

2025-02-24 Thread via GitHub



djanderson commented on code in PR #14286:
URL: https://github.com/apache/datafusion/pull/14286#discussion_r1968617896


##
datafusion-examples/examples/thread_pools_lib/dedicated_executor.rs:
##
@@ -0,0 +1,1778 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! [DedicatedExecutor] for running CPU-bound tasks on a separate tokio 
runtime.
+
+use crate::SendableRecordBatchStream;
+use async_trait::async_trait;
+use datafusion::physical_plan::stream::RecordBatchStreamAdapter;
+use datafusion_common::error::GenericError;
+use datafusion_common::DataFusionError;
+use futures::stream::BoxStream;
+use futures::{
+future::{BoxFuture, Shared},
+Future, FutureExt, Stream, StreamExt, TryFutureExt,
+};
+use log::{info, warn};
+use object_store::path::Path;
+use object_store::{
+GetOptions, GetResult, GetResultPayload, ListResult, MultipartUpload, 
ObjectMeta,
+ObjectStore, PutMultipartOpts, PutOptions, PutPayload, PutResult, 
UploadPart,
+};
+use std::cell::RefCell;
+use std::pin::Pin;
+use std::sync::RwLock;
+use std::task::{Context, Poll};
+use std::{fmt::Display, sync::Arc, time::Duration};
+use tokio::runtime::Builder;
+use tokio::task::JoinHandle;
+use tokio::{
+runtime::Handle,
+sync::{oneshot::error::RecvError, Notify},
+task::JoinSet,
+};
+use tokio_stream::wrappers::ReceiverStream;
+
+/// Create a [`DedicatedExecutorBuilder`] from a tokio [`Builder`]
+impl From for DedicatedExecutorBuilder {
+fn from(value: Builder) -> Self {
+Self::new_from_builder(value)
+}
+}
+
+/// Manages a separate tokio [`Runtime`] (thread pool) for executing CPU bound
+/// tasks such as DataFusion `ExecutionPlans`.
+///
+/// See [`DedicatedExecutorBuilder`] for creating a new instance.
+///
+/// A `DedicatedExecutor` can helps avoid issues when runnnig IO and CPU bound 
tasks on the
+/// same thread pool by running futures (and any `tasks` that are
+/// `tokio::task::spawned` by them) on a separate tokio [`Executor`].
+///
+/// `DedicatedExecutor`s can be `clone`ed and all clones share the same thread 
pool.
+///
+/// Since the primary use for a `DedicatedExecutor` is offloading CPU bound
+/// work, IO work can not be performed on tasks launched in the Executor.
+///
+/// To perform IO, see:
+/// - [`Self::spawn_io`]
+/// - [`Self::wrap_object_store`]
+///
+/// When [`DedicatedExecutorBuilder::build`] is called, a reference to the
+/// "current" tokio runtime will be stored and used, via 
[`register_io_runtime`] by all
+/// threads spawned by the executor. Any I/O done by threads in this
+/// [`DedicatedExecutor`] should use [`spawn_io`], which will run them on the
+/// I/O runtime.
+///
+/// # TODO examples
+///
+/// # Background
+///
+/// Tokio has the notion of the "current" runtime, which runs the current 
future
+/// and any tasks spawned by it. Typically, this is the runtime created by
+/// `tokio::main` and is used for the main application logic and I/O handling
+///
+/// For CPU bound work, such as DataFusion plan execution, it is important to
+/// run on a separate thread pool to avoid blocking the I/O handling for 
extended
+/// periods of time in order to avoid long poll latencies (which decreases the
+/// throughput of small requests under concurrent load).
+///
+/// # IO Scheduling
+///
+/// I/O, such as network calls, should not be performed on the runtime managed
+/// by [`DedicatedExecutor`]. As tokio is a cooperative scheduler, long-running
+/// CPU tasks will not be preempted and can therefore starve servicing of other
+/// tasks. This manifests in long poll-latencies, where a task is ready to run
+/// but isn't being scheduled to run. For CPU-bound work this isn't a problem 
as
+/// there is no external party waiting on a response, however, for I/O tasks,
+/// long poll latencies can prevent timely servicing of IO, which can have a
+/// significant detrimental effect.
+///
+/// # Details
+///
+/// The worker thread priority is set to low so that such tasks do
+/// not starve other more important tasks (such as answering health checks)
+///
+/// Follows the example from stack overflow and spawns a new
+/// thread to install a Tokio runtime "context"
+///

Re: [I] [Epic] Split datasources out from `datafusion` crate (`datafusion/core`) [datafusion]

2025-02-24 Thread via GitHub



AdamGS commented on issue #1:
URL: https://github.com/apache/datafusion/issues/1#issuecomment-2679630951

   https://github.com/apache/datafusion/pull/14838 turned out to be easier than 
I thought, mostly because @logan-keede did much of the work to move test 
infrastructure over.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] fix: enable full decimal to decimal support [datafusion-comet]

2025-02-24 Thread via GitHub



kazuyukitanimura commented on code in PR #1385:
URL: https://github.com/apache/datafusion-comet/pull/1385#discussion_r1968689241


##
spark/src/test/scala/org/apache/comet/CometCastSuite.scala:
##
@@ -1126,27 +1129,33 @@ class CometCastSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
   val cometMessage =
 if (cometException.getCause != null) 
cometException.getCause.getMessage
 else cometException.getMessage
-  if (CometSparkSessionExtensions.isSpark40Plus) {
-// for Spark 4 we expect to sparkException carries the message
-assert(
-  sparkException.getMessage
-.replace(".WITH_SUGGESTION] ", "]")
-.startsWith(cometMessage))
-  } else if (CometSparkSessionExtensions.isSpark34Plus) {
-// for Spark 3.4 we expect to reproduce the error message 
exactly
-assert(cometMessage == sparkMessage)
+  // for comet decimal conversion throws ArrowError(string) from 
arrow - across spark versions the message dont match.
+  if (sparkMessage.contains("cannot be represented as")) {
+cometMessage.contains("cannot be represented as") || 
cometMessage.contains(
+  "too large to store")
   } else {

Review Comment:
   Just a thought. For the former approach, can we get precision info similar 
to 
https://github.com/apache/datafusion-comet/blob/main/native/spark-expr/src/conversion_funcs/cast.rs#L932
 ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Feat: Implement hf:// / "hugging face" integration in datafusion-cli [datafusion]

2025-02-24 Thread via GitHub



github-actions[bot] closed pull request #10792: Feat: Implement hf:// / 
"hugging face" integration in datafusion-cli
URL: https://github.com/apache/datafusion/pull/10792


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Adding node_id to ExecutionPlanProperties [datafusion]

2025-02-24 Thread via GitHub



github-actions[bot] closed pull request #12186: Adding node_id to 
ExecutionPlanProperties
URL: https://github.com/apache/datafusion/pull/12186


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[PR] Add support for `CREATE SCHEMA` options [datafusion-sqlparser-rs]

2025-02-24 Thread via GitHub



iffyio opened a new pull request, #1742:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1742

   ```sql
   CREATE SCHEMA DEFAULT COLLATE 'foo' myschema OPTIONS(key='value');
   ```
   
   
[BigQuery](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_schema_statement)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[PR] Add support for `DROP MATERIALIZED VIEW` [datafusion-sqlparser-rs]

2025-02-24 Thread via GitHub



iffyio opened a new pull request, #1743:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1743

   Adds support for e.g.
   
   ```sql
   DROP MATERIALIZED VIEW IF EXISTS a.b.c
   ```
   
   
[BigQuery](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#drop_materialized_view_statement)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] [WIP] Store spans for Value expressions [datafusion-sqlparser-rs]

2025-02-24 Thread via GitHub



lovasoa commented on code in PR #1738:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1738#discussion_r1967169666


##
src/ast/value.rs:
##
@@ -26,10 +26,25 @@ use bigdecimal::BigDecimal;
 #[cfg(feature = "serde")]
 use serde::{Deserialize, Serialize};
 
-use crate::ast::Ident;
+use crate::{ast::Ident, tokenizer::Span};
 #[cfg(feature = "visitor")]
 use sqlparser_derive::{Visit, VisitMut};
 
+/// Primitive SQL values such as number and string
+#[derive(Debug, Clone, PartialEq, PartialOrd, Eq, Ord, Hash)]
+#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
+#[cfg_attr(feature = "visitor", derive(Visit, VisitMut))]
+pub struct ValueWrapper {
+pub value: Value,
+pub span: Span,
+}

Review Comment:
   I don't know. What do you think is best?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

1 2 3 >

1 - 100 of 226 matches

Mail list logo