Re: [I] supports_filters_pushdown is invoked more than once on a single Custom Data Source [datafusion]

2025-01-07 Thread via GitHub


cisaacson commented on issue #13994:
URL: https://github.com/apache/datafusion/issues/13994#issuecomment-2575344845

   > You can also declare that the data source does not support pushing down 
any filters, and then, within a custom optimization rule similar to 
PushDownFilter, push the filters to your own data source. If this is a common 
requirement from the community, we could try to integrate this logic into 
DataFusion, such as pushing down all supported filters to data sources at once 
only during the final optimization step.
   
   The issue now is that we want DataFusion to apply the `filters` that are not 
supported for pushdown, and to ignore the ones that do have `Exact` support. It 
seems that how it works now is the best way to do that.
   
   And I understand about `&mut self`; that would change a lot of things. 
   
   Unless I am missing something, short of building our own custom optimizer 
code, it is as good as it can be now. 
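For readers following along, the relevant hook is `TableProvider::supports_filters_pushdown`, which returns one pushdown classification per filter. A minimal self-contained sketch of the idea, using local stand-in types rather than DataFusion's real `TableProviderFilterPushDown` and `Expr`:

```rust
// Local stand-ins for illustration; the real API is
// `TableProvider::supports_filters_pushdown`, which takes `&[&Expr]` and
// returns `Vec<TableProviderFilterPushDown>`.
#[derive(Clone, Copy, Debug, PartialEq)]
enum FilterPushDown {
    Exact,       // the source applies this filter fully; DataFusion can drop it
    Unsupported, // DataFusion keeps evaluating this filter itself
}

// Hypothetical predicate type: just the column each filter touches.
struct Filter {
    column: String,
}

// Classify each filter independently: columns the source can filter on
// natively get `Exact`; everything else stays with DataFusion.
fn supports_filters_pushdown(filters: &[Filter], native_cols: &[&str]) -> Vec<FilterPushDown> {
    filters
        .iter()
        .map(|f| {
            if native_cols.contains(&f.column.as_str()) {
                FilterPushDown::Exact
            } else {
                FilterPushDown::Unsupported
            }
        })
        .collect()
}
```

As discussed above, DataFusion then keeps evaluating only the filters reported as `Unsupported` and trusts the source for the `Exact` ones.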


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Correctly look for end delimiter dollar quoted string [datafusion-sqlparser-rs]

2025-01-07 Thread via GitHub


hansott opened a new pull request, #1650:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1650

   Currently the tokenizer throws an error for
   
   ```sql
   SELECT $abc$x$ab$abc$
   ```
   
   The logic is also quite difficult to read, so I made it a bit simpler.
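For context, a dollar-quoted string `$tag$...$tag$` ends at the first occurrence of its exact opening tag, so the `$ab` inside `$abc$x$ab$abc$` must not terminate it. A rough self-contained sketch of that scan (not the actual sqlparser-rs tokenizer code):

```rust
// Sketch of the end-delimiter scan (illustrative, not the sqlparser-rs
// implementation). `input` is the text immediately after the opening `$tag$`;
// the string body runs up to the first occurrence of the exact closing
// delimiter `$tag$`.
fn read_dollar_quoted_body<'a>(input: &'a str, tag: &str) -> Option<&'a str> {
    let end_delim = format!("${tag}$");
    // A shorter `$...` run inside the body (like `$ab` in `$abc$x$ab$abc$`)
    // must not be mistaken for the terminator; only the full `$tag$` ends it.
    input.find(&end_delim).map(|pos| &input[..pos])
}
```

For `SELECT $abc$x$ab$abc$`, this yields the body `x$ab`, which is what the PR's fix makes the tokenizer accept.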





Re: [PR] chore: extract agg_funcs expressions to folders based on spark grouping [datafusion-comet]

2025-01-07 Thread via GitHub


andygrove commented on PR #1224:
URL: 
https://github.com/apache/datafusion-comet/pull/1224#issuecomment-2575518378

   > Done, I really think some contributor with access should merge those one 
by one and fix those conflicts, as otherwise it will take a lot of back and 
forth, and the conflicts are really basic
   
   Sure. I'll try and take care of these today.





Re: [PR] Correctly look for end delimiter dollar quoted string [datafusion-sqlparser-rs]

2025-01-07 Thread via GitHub


hansott commented on PR #1650:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1650#issuecomment-2575575853

   Adding some more tests first





Re: [PR] MsSQL SET for session params [datafusion-sqlparser-rs]

2025-01-07 Thread via GitHub


alamb commented on PR #1646:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1646#issuecomment-2575103207

   Hi @yoavcloud  -- it looks like this PR has some conflicts -- is there any 
chance you can resolve them?





Re: [I] sql result discrepancy with sqlite, postgres and duckdb [datafusion]

2025-01-07 Thread via GitHub


alamb commented on issue #13780:
URL: https://github.com/apache/datafusion/issues/13780#issuecomment-2575115298

   First of all, very nice 🕵️ work! 
   
   > 'AS REAL' -> 'AS DOUBLE'
   > 
   > This would better match the actual types being tested and would fix many 
of the failing results. Along with correcting the nullif type behavior would 
fix almost all the remaining tests that have result mismatches with 
sqlite/postgresql.
   
   As I understand it, this means using f64 for floating-point operations in 
the test. This sounds like a very good and pragmatic thing to do, in my mind.
   
   The other potential thing to do might simply be to map `AS REAL` to use 
`DataType::Float64`, but that is a larger and much more potentially disruptive 
change.
   
   Thanks again @Omega359 





Re: [PR] Unparsing optimized (> 2 inputs) unions [datafusion]

2025-01-07 Thread via GitHub


goldmedal commented on PR #14031:
URL: https://github.com/apache/datafusion/pull/14031#issuecomment-2575129897

   Thanks for working on this, @MohamedAbdeen21 
   Instead of applying optimization rules during roundtrip tests, I prefer to 
construct the expected plan manually, just like other test cases such as
   
https://github.com/apache/datafusion/blob/482b48926a871bf2c39d6808ca217e309c705b03/datafusion/sql/tests/cases/plan_to_sql.rs#L1221
   
   If we add the optimization to the testing, the optimizer's behavior will be 
constrained by the unparser, which we don't want. (See the previous discussion: 
https://github.com/apache/datafusion/pull/13267#issuecomment-2482503581)





[PR] enum DropBehavior [datafusion-sqlparser-rs]

2025-01-07 Thread via GitHub


stepancheg opened a new pull request, #1648:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1648

   There's a `<drop behavior>` definition [in the 
grammar](https://jakewheat.github.io/sql-overview/sql-2016-foundation-grammar.html#drop-behavior),
 so a corresponding enum seems appropriate:
   
   ```rust
   pub enum DropBehavior {
       Restrict,
       Cascade,
   }
   ```
   
   vs
   
   ```rust
   pub enum ReferentialAction {
       Restrict,
       Cascade,
       SetNull,
       NoAction,
       SetDefault,
   }
   ```
   ```
   
   Context: sqlparser does not support `RESTRICT` in `ALTER TABLE DROP COLUMN`. 
I wanted to add it, but reusing `ReferentialAction` there might be misleading, 
so here I introduce a dedicated enum.





Re: [PR] Improve performance of `reverse` function [datafusion]

2025-01-07 Thread via GitHub


tlm365 commented on code in PR #14025:
URL: https://github.com/apache/datafusion/pull/14025#discussion_r1905501646


##
datafusion/functions/src/unicode/reverse.rs:
##
@@ -116,14 +115,23 @@ pub fn reverse(args: &[ArrayRef]) -> Result<ArrayRef> {
     }
 }
 
-fn reverse_impl<'a, T: OffsetSizeTrait, V: ArrayAccessor<Item = &'a str>>(
+fn reverse_impl<'a, T: OffsetSizeTrait, V: StringArrayType<'a>>(
     string_array: V,
 ) -> Result<ArrayRef> {
-    let result = ArrayIter::new(string_array)
-        .map(|string| string.map(|string: &str| string.chars().rev().collect::<String>()))
-        .collect::<GenericStringArray<T>>();
+    let mut builder: GenericStringBuilder<T> =
+        GenericStringBuilder::with_capacity(string_array.len(), 1024);

Review Comment:
   > > I wonder if 
[this](https://github.com/apache/arrow-rs/blob/4f1f6e57c568fae8233ab9da7d7c7acdaea4112a/arrow-array/src/array/mod.rs#L335-L338)
 is a good alternative to `1024`? 🤔
   > 
   > Yes, I think so
   
   Unfortunately, it's a bit slower than the current implementation. It 
allocates more memory than I expected: `string_array.get_array_memory_size()` = 
`16632` when `str_len = 1024`, and = `66168` when `str_len = 4096` 🤔






Re: [PR] Add support for ClickHouse `FORMAT` on `INSERT` [datafusion-sqlparser-rs]

2025-01-07 Thread via GitHub


bombsimon commented on code in PR #1628:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1628#discussion_r1905449986


##
src/ast/query.rs:
##
@@ -2465,14 +2465,25 @@ impl fmt::Display for GroupByExpr {
 #[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
 #[cfg_attr(feature = "visitor", derive(Visit, VisitMut))]
 pub enum FormatClause {
-    Identifier(Ident),
+    Identifier {
+        ident: Ident,
+        expr: Option<Vec<Expr>>,
+    },
     Null,
 }

Review Comment:
   My thinking here was that they're referred to in the same way in the ClickHouse 
docs, which is the only dialect that uses them, so it would make sense to 
represent them the same way in the AST. 
   
   - 
[`FORMAT`](https://clickhouse.com/docs/en/sql-reference/statements/select/format)
 only has one documentation page referring to both use cases.
   - [The interface](https://clickhouse.com/docs/en/interfaces/formats) 
references both of them:
 > ClickHouse can accept and return data in various formats. A format 
supported for input can be used to parse the data provided to `INSERT`s, to 
perform `SELECT`s from a file-backed table such as File, URL or HDFS, or to 
read a dictionary. A format supported for output can be used to arrange the 
results of a `SELECT`, and to perform `INSERT`s into a file-backed table. All 
format names are case insensitive.
   
   So to me, re-using the same AST type with an optional expression for potential 
insert data still makes sense.
   
   Also, double-checking in case you still disagree: I guess you mean that the 
new type should be called `InputFormatClause`, since it's the input that holds 
values and not the output, right?






Re: [I] multiply overflow in stats.rs [datafusion]

2025-01-07 Thread via GitHub


LindaSummer commented on issue #13775:
URL: https://github.com/apache/datafusion/issues/13775#issuecomment-2575392114

   take





Re: [I] multiply overflow in stats.rs [datafusion]

2025-01-07 Thread via GitHub


LindaSummer commented on issue #13775:
URL: https://github.com/apache/datafusion/issues/13775#issuecomment-2575391891

   > Hi Edward - Thanks for taking a look at this! I do not have the ability to 
do that however if you add a comment here with just the word 'take' a github 
action will assign it to you
   
   Got it!
   
   Thanks very much! 😊
   
   Best Regards,
   Edward





Re: [PR] Fix error on `array_distinct` when input is empty #13810 [datafusion]

2025-01-07 Thread via GitHub


comphead commented on code in PR #14034:
URL: https://github.com/apache/datafusion/pull/14034#discussion_r1905726601


##
datafusion/functions-nested/src/set_ops.rs:
##
@@ -513,9 +513,6 @@ fn general_array_distinct<OffsetSize: OffsetSizeTrait>(
     array: &GenericListArray<OffsetSize>,
     field: &FieldRef,
 ) -> Result<ArrayRef> {
-    if array.len() == 0 {

Review Comment:
   do we really need to remove this check? 🤔 it prevents unnecessary 
work/allocations if nothing is in the `array`






[I] [comet-parquet-exec] Track remaining test failures in POC 1 & 2 [datafusion-comet]

2025-01-07 Thread via GitHub


andygrove opened a new issue, #1228:
URL: https://github.com/apache/datafusion-comet/issues/1228

   ### What is the problem the feature request solves?
   
   I thought it would be useful to have one issue to track all the test 
failures that we are currently working on resolving.
   
   ## POC 1
   
   - "timestamp" test fails with cannot cast Timestamp(Millisecond, None) to 
required schema field of type Timestamp(Microsecond, None)
   - "timestamp as int96" fails with cannot cast Timestamp(Nanosecond, None) to 
required schema field of type Timestamp(Microsecond, Some("UTC"))
   - java.lang.IllegalArgumentException: requirement failed: input[0, int, 
true] IN dynamicpruning#45042 has not finished
   - "columnar shuffle on struct including nulls" fails with 
java.lang.IllegalStateException: Complex type is not supported
"columnar shuffle on nested struct" fails with 
java.lang.IllegalStateException: Complex type is not supported
   
   ## POC 2
   
   
   
   
   ### Describe the potential solution
   
   _No response_
   
   ### Additional context
   
   _No response_





Re: [PR] fix: yield when the next file is ready to open to prevent CPU starvation [datafusion]

2025-01-07 Thread via GitHub


jeffreyssmith2nd commented on code in PR #14028:
URL: https://github.com/apache/datafusion/pull/14028#discussion_r1905736433


##
datafusion/core/src/datasource/physical_plan/file_stream.rs:
##
@@ -478,7 +478,12 @@ impl FileStream {
                         reader,
                     )),
                     partition_values,
-                }
+                };
+                // Return control to the runtime when we're ready to open the next file
+                // to prevent uncancellable queries in scenarios with many large files.
+                // This functions similarly to a `tokio::task::yield_now()`.
+                cx.waker().wake_by_ref();
+                return Poll::Pending;

Review Comment:
   I thought it was odd too (I had seen this pattern elsewhere) but looking at 
the implementation of `yield_now`, when you `poll` it does a `context::defer` 
and then returns `Pending` 
[(source)](https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/task/yield_now.rs#L57-L59).
 `context::defer` will directly call `wake_by_ref` if called from outside of 
the runtime 
[(source)](https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/runtime/context.rs#L167-177).
   
   I could be misunderstanding what "outside the runtime" means in this case, 
and how that will actually interact with the yield.






Re: [PR] chore: extract agg_funcs expressions to folders based on spark grouping [datafusion-comet]

2025-01-07 Thread via GitHub


andygrove merged PR #1224:
URL: https://github.com/apache/datafusion-comet/pull/1224





Re: [PR] [comet-parquet-exec] fix: fix various bugs in casting between struct types [datafusion-comet]

2025-01-07 Thread via GitHub


andygrove merged PR #1226:
URL: https://github.com/apache/datafusion-comet/pull/1226





Re: [PR] Support pluralized time units [datafusion-sqlparser-rs]

2025-01-07 Thread via GitHub


wugeer commented on code in PR #1630:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1630#discussion_r1905597631


##
src/parser/mod.rs:
##
@@ -2353,14 +2355,30 @@ impl<'a> Parser<'a> {
                 };
                 Ok(DateTimeField::Week(week_day))
             }
+            Keyword::WEEKS => {
+                let week_day = if dialect_of!(self is BigQueryDialect)
+                    && self.consume_token(&Token::LParen)
+                {
+                    let week_day = self.parse_identifier()?;
+                    self.expect_token(&Token::RParen)?;
+                    Some(week_day)
+                } else {
+                    None
+                };
+                Ok(DateTimeField::Weeks(week_day))
+            }

Review Comment:
   @iffyio I found that the following dialects support pluralized time units. 
But the interesting thing is that not all of them support `weeks`.
   snowflake 
https://docs.snowflake.com/en/sql-reference/data-types-datetime#interval-examples
   hive 
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=27838462#LanguageManualTypes-Intervals
   postgres 
https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-INTERVAL-INPUT
   redshift 
https://docs.aws.amazon.com/redshift/latest/dg/r_interval_data_types.html#r_interval_data_types-arguments
   duckdb https://duckdb.org/docs/sql/data_types/interval.html






Re: [PR] Default to ZSTD compression when writing Parquet [datafusion-python]

2025-01-07 Thread via GitHub


kevinjqliu commented on code in PR #981:
URL: https://github.com/apache/datafusion-python/pull/981#discussion_r1905611807


##
python/datafusion/dataframe.py:
##
@@ -620,16 +620,25 @@ def write_csv(self, path: str | pathlib.Path, with_header: bool = False) -> None
     def write_parquet(
         self,
         path: str | pathlib.Path,
-        compression: str = "uncompressed",
+        compression: str = "ZSTD",
         compression_level: int | None = None,

Review Comment:
   good point! 






Re: [PR] [comet-parquet-exec] fix: fix various bugs in casting between struct types [datafusion-comet]

2025-01-07 Thread via GitHub


mbutrovich commented on code in PR #1226:
URL: https://github.com/apache/datafusion-comet/pull/1226#discussion_r1905609592


##
native/spark-expr/src/cast.rs:
##
@@ -817,17 +818,28 @@ fn cast_struct_to_struct(
     cast_options: &SparkCastOptions,
 ) -> DataFusionResult<ArrayRef> {
     match (from_type, to_type) {
-        (DataType::Struct(_), DataType::Struct(to_fields)) => {
-            let mut cast_fields: Vec<(Arc<Field>, ArrayRef)> =
-                Vec::with_capacity(to_fields.len());
+        (DataType::Struct(from_fields), DataType::Struct(to_fields)) => {
+            // TODO some of this logic may be specific to converting Parquet to Spark
+            let mut field_name_to_index_map = HashMap::new();

Review Comment:
   I'm wondering if we'll end up adding a cache in the SchemaAdapter logic for 
these sorts of maps, since it looks like this would be generated for each batch?



##
native/spark-expr/src/schema_adapter.rs:
##
@@ -321,10 +322,26 @@ fn cast_supported(from_type: &DataType, to_type: &DataType, options: &SparkCastO
         (Timestamp(_, Some(_)), _) => can_cast_from_timestamp(to_type, options),
         (Utf8 | LargeUtf8, _) => can_cast_from_string(to_type, options),
         (_, Utf8 | LargeUtf8) => can_cast_to_string(from_type, options),
-        (Struct(from_fields), Struct(to_fields)) => from_fields
-            .iter()
-            .zip(to_fields.iter())
-            .all(|(a, b)| cast_supported(a.data_type(), b.data_type(), options)),
+        (Struct(from_fields), Struct(to_fields)) => {
+            // TODO some of this logic may be specific to converting Parquet to Spark
+            let mut field_types = HashMap::new();
+            for field in from_fields {
+                field_types.insert(field.name(), field.data_type());

Review Comment:
   Same caching comment as above, mostly as a reminder to our future selves.
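The caching idea being suggested could look roughly like this: build the field-name lookup once per schema and reuse it across record batches, instead of rebuilding it per batch. A self-contained sketch (`FieldIndexCache` is a hypothetical name, and plain strings stand in for Arrow's `Field` type):

```rust
use std::collections::HashMap;

// Sketch of the caching idea: compute the field-name -> index map once per
// schema and reuse it for every record batch, instead of rebuilding it each
// time a batch is cast.
struct FieldIndexCache {
    by_name: HashMap<String, usize>,
}

impl FieldIndexCache {
    // Build the map once, when the schema adapter is created.
    fn new(field_names: &[&str]) -> Self {
        let by_name = field_names
            .iter()
            .enumerate()
            .map(|(i, name)| (name.to_string(), i))
            .collect();
        Self { by_name }
    }

    // O(1) lookup per field per batch, instead of a fresh map per batch.
    fn index_of(&self, name: &str) -> Option<usize> {
        self.by_name.get(name).copied()
    }
}
```

The trade-off is holding the map alive for the life of the adapter, which is cheap relative to rebuilding it on every batch.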






Re: [PR] Default to ZSTD compression when writing Parquet [datafusion-python]

2025-01-07 Thread via GitHub


ion-elgreco commented on code in PR #981:
URL: https://github.com/apache/datafusion-python/pull/981#discussion_r1905619163


##
python/datafusion/dataframe.py:
##
@@ -620,16 +620,25 @@ def write_csv(self, path: str | pathlib.Path, with_header: bool = False) -> None
     def write_parquet(
         self,
         path: str | pathlib.Path,
-        compression: str = "uncompressed",
+        compression: str = "ZSTD",
         compression_level: int | None = None,
     ) -> None:
         """Execute the :py:class:`DataFrame` and write the results to a Parquet file.
 
         Args:
             path: Path of the Parquet file to write.
-            compression: Compression type to use.
-            compression_level: Compression level to use.
-            """
+            compression: Compression type to use. Default is "ZSTD".
+            compression_level: Compression level to use. For ZSTD, the
+                recommended range is 1 to 22, with the default being 4. Higher
+                levels provide better compression but slower speed.
+            """
+        if compression == "ZSTD":

Review Comment:
   You might want to reuse this code I added in deltalake a while ago: 
   
   
https://github.com/delta-io/delta-rs/blob/053601b72c2ee053f7ff0309c7d27e122c4e3852/python/deltalake/table.py#L79-L129






[I] Un-cancellable Query when hitting many large files. [datafusion]

2025-01-07 Thread via GitHub


jeffreyssmith2nd opened a new issue, #14036:
URL: https://github.com/apache/datafusion/issues/14036

   ### Describe the bug
   
   **TLDR; Reading many large Parquet files can prevent a query from being 
cancelled.**
   
   
   We have a customer that is running a query similar to the following (edited 
for privacy):
   ```
   SELECT DISTINCT "A","B","C","D","E" FROM table where "time" > now() - 
INTERVAL '5 days';
   ```
   
   This produces a fairly straightforward plan 
[explain.txt](https://github.com/user-attachments/files/18323208/explain.txt).
   
   Simplified Plan:
   ```
   AggregateExec mode=FinalPartitioned
 CoalesceBatchesExec target_batch_size=8192
   RepartitionExec input_partitions=4
 AggregateExec mode=Partial
   ParquetExec file_groups={4 groups}
   ```
   
   This will read ~85 Parquet files at ~100MB each. What we've seen is that 
even when a query is cancelled, the resources associated with that query 
(CPU/RAM) are still consumed for almost the same amount of time as a typical 
execution of the query.
   
   ### To Reproduce
   
   _No response_
   
   ### Expected behavior
   
   Cancelling a query should (within some reasonable amount of time) truly 
cancel the query, freeing up system resources for other query executions.
   
   ### Additional context
   
   This appears to be a problem with the interaction between the 
[`GroupedHashAggregateStream`](https://github.com/apache/arrow-datafusion/blob/9a808eb7f63385e3e3b02051ed30eee91daeb613/datafusion/physical-plan/src/aggregates/row_hash.rs#L620)
 and 
[`FileStream`](https://github.com/apache/arrow-datafusion/blob/9a808eb7f63385e3e3b02051ed30eee91daeb613/datafusion/core/src/datasource/physical_plan/file_stream.rs#L246)
 approaches to yielding.
   
   The `GroupedHashAggregateStream` will infinitely loop until its child stream 
(in this case a `FileStream`) is exhausted, errors, or returns `Pending`. 
   
   The `FileStream` loops while attempting to read `RecordBatch`es from the 
underlying file, while also doing some book-keeping to stage the next `File` 
for reading. This Stream will return `Ready` when a `RecordBatch` is processed, 
or an `Error` encountered. However, it **does not** return `Pending` at any 
point. When a File is finished reading, the next File up is swapped in and 
reading continues from the new File.
   
   The combination of these two behaviours means that if there are many large 
files being read by the `FileStream`, the `GroupedHashAggregateStream` doesn't 
effectively yield back to Tokio.
   
   My [PR](https://github.com/apache/datafusion/pull/14028) to fix this is to 
have the `FileStream` return `Pending` when swapping over to a new file. This 
seems like a natural yield point, and resolves the issue with cancellation.
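The yield pattern the PR relies on can be shown in miniature: a future that wakes its own waker and returns `Pending` exactly once, so the executor regains control before work resumes. A self-contained sketch (using a hand-rolled no-op waker so it runs without a Tokio runtime; `YieldOnce` and `poll_count` are illustrative names, not DataFusion code):

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A future that yields to the executor exactly once, mirroring the
// `cx.waker().wake_by_ref(); return Poll::Pending;` pattern in the PR.
struct YieldOnce {
    yielded: bool,
}

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            // Schedule ourselves to be polled again, then hand control back,
            // similar to `tokio::task::yield_now()`.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

// A no-op waker so we can poll the future by hand without a runtime.
fn noop_raw_waker() -> RawWaker {
    fn clone(_: *const ()) -> RawWaker {
        noop_raw_waker()
    }
    fn noop(_: *const ()) {}
    const VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    RawWaker::new(std::ptr::null(), &VTABLE)
}

// Poll `fut` to completion, counting how many polls it takes.
fn poll_count(mut fut: impl Future<Output = ()> + Unpin) -> u32 {
    let waker = unsafe { Waker::from_raw(noop_raw_waker()) };
    let mut cx = Context::from_waker(&waker);
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(()) = Pin::new(&mut fut).poll(&mut cx) {
            return polls;
        }
    }
}
```

The extra `Pending` is the yield point: a real executor sees the wake, reschedules the task, and can run other work (or observe cancellation) in between.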





Re: [PR] Improve performance of `reverse` function [datafusion]

2025-01-07 Thread via GitHub


2010YOUY01 commented on code in PR #14025:
URL: https://github.com/apache/datafusion/pull/14025#discussion_r1905442653


##
datafusion/functions/src/unicode/reverse.rs:
##
@@ -116,14 +115,23 @@ pub fn reverse(args: &[ArrayRef]) -> Result<ArrayRef> {
     }
 }
 
-fn reverse_impl<'a, T: OffsetSizeTrait, V: ArrayAccessor<Item = &'a str>>(
+fn reverse_impl<'a, T: OffsetSizeTrait, V: StringArrayType<'a>>(
     string_array: V,
 ) -> Result<ArrayRef> {
-    let result = ArrayIter::new(string_array)
-        .map(|string| string.map(|string: &str| string.chars().rev().collect::<String>()))
-        .collect::<GenericStringArray<T>>();
+    let mut builder: GenericStringBuilder<T> =
+        GenericStringBuilder::with_capacity(string_array.len(), 1024);

Review Comment:
   > I wonder if 
[this](https://github.com/apache/arrow-rs/blob/4f1f6e57c568fae8233ab9da7d7c7acdaea4112a/arrow-array/src/array/mod.rs#L335-L338)
 is a good alternative to `1024`? 🤔
   
   Yes, I think so






[PR] Chore/single source exec [datafusion]

2025-01-07 Thread via GitHub


mertak-synnada opened a new pull request, #14035:
URL: https://github.com/apache/datafusion/pull/14035

   ## Which issue does this PR close?
   
   
   
   Closes #13838.
   
   ## Rationale for this change
   
   This PR merges all data sources into one execution plan, named 
`DataSourceExec`, and a single trait named `DataSource`, which is inspired by 
`DataSink`.
   
   This version is not intended to be merged upstream, since it removes 
ParquetExec, CsvExec, etc., and changes all the tests, but I'm sharing it as 
is so that we can see all possible changes. Our main intention is to re-add 
the old execution plans as deprecated ones and implement this part by part to 
keep API stability.
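To make the described shape concrete, here is a minimal sketch of a single plan node wrapping format-specific sources behind one trait. All names and signatures here are hypothetical illustrations, not the actual DataFusion API:

```rust
// Hypothetical sketch (not the actual DataFusion API): one
// `DataSourceExec` plan node that wraps any format-specific source
// behind a single `DataSource` trait, in the spirit of `DataSink`.
trait DataSource {
    /// Human-readable name of the underlying source format.
    fn name(&self) -> &'static str;
}

struct CsvSource;
struct ParquetSource;

impl DataSource for CsvSource {
    fn name(&self) -> &'static str {
        "csv"
    }
}

impl DataSource for ParquetSource {
    fn name(&self) -> &'static str {
        "parquet"
    }
}

/// The single execution plan that would replace CsvExec, ParquetExec, etc.
struct DataSourceExec {
    source: Box<dyn DataSource>,
}

impl DataSourceExec {
    fn new(source: Box<dyn DataSource>) -> Self {
        Self { source }
    }

    fn describe(&self) -> String {
        format!("DataSourceExec: {}", self.source.name())
    }
}

fn main() {
    let csv = DataSourceExec::new(Box::new(CsvSource));
    let parquet = DataSourceExec::new(Box::new(ParquetSource));
    assert_eq!(csv.describe(), "DataSourceExec: csv");
    assert_eq!(parquet.describe(), "DataSourceExec: parquet");
    println!("ok");
}
```

The design choice sketched here is that adding a new file format means implementing one trait rather than writing a whole new `ExecutionPlan`.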
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   





Re: [PR] Add arrow cast [datafusion-python]

2025-01-07 Thread via GitHub


timsaucer merged PR #962:
URL: https://github.com/apache/datafusion-python/pull/962





Re: [PR] release datafusion-python 42.1.0 [datafusion-python]

2025-01-07 Thread via GitHub


timsaucer commented on PR #930:
URL: 
https://github.com/apache/datafusion-python/pull/930#issuecomment-2575300355

   I recommend closing this PR since we have already moved on to 43.





Re: [PR] Fix small issues in pyproject.toml [datafusion-python]

2025-01-07 Thread via GitHub


timsaucer merged PR #976:
URL: https://github.com/apache/datafusion-python/pull/976





Re: [I] Add type hints and capsule validation for FFI Table providers [datafusion-python]

2025-01-07 Thread via GitHub


timsaucer closed issue #950: Add type hints and capsule validation for FFI 
Table providers
URL: https://github.com/apache/datafusion-python/issues/950





Re: [PR] FLUP #13810 [datafusion]

2025-01-07 Thread via GitHub


alamb commented on PR #14034:
URL: https://github.com/apache/datafusion/pull/14034#issuecomment-2575369809

   Thanks @cht42 !
   
   What does FLUP stand for 🤔 My google fu doesn't seem to be able to find 
anything relevant:  https://www.google.com/search?q=flup&oq=flup 





Re: [PR] Improve performance of `reverse` function [datafusion]

2025-01-07 Thread via GitHub


tlm365 commented on code in PR #14025:
URL: https://github.com/apache/datafusion/pull/14025#discussion_r1905501646


##
datafusion/functions/src/unicode/reverse.rs:
##
@@ -116,14 +115,23 @@ pub fn reverse(args: &[ArrayRef]) -> Result<ArrayRef> {
     }
 }
 
-fn reverse_impl<'a, T: OffsetSizeTrait, V: ArrayAccessor<Item = &'a str>>(
+fn reverse_impl<'a, T: OffsetSizeTrait, V: StringArrayType<'a>>(
     string_array: V,
 ) -> Result<ArrayRef> {
-    let result = ArrayIter::new(string_array)
-        .map(|string| string.map(|string: &str| string.chars().rev().collect::<String>()))
-        .collect::<GenericStringArray<T>>();
+    let mut builder: GenericStringBuilder<T> =
+        GenericStringBuilder::with_capacity(string_array.len(), 1024);

Review Comment:
   > Yes, I think so
   
   Unfortunately, it's a bit slower than the current implementation. It 
allocates more mem than I expected, `string_array.get_array_memory_size()` = 
`16632` when `str_len = 1024`, and = `66168` when `str_len = 4096` 🤔






Re: [I] multiply overflow in stats.rs [datafusion]

2025-01-07 Thread via GitHub


Omega359 commented on issue #13775:
URL: https://github.com/apache/datafusion/issues/13775#issuecomment-2575373984

   Hi Edward - Thanks for taking a look at this! I do not have the ability to 
do that; however, if you add a comment here with just the word 'take', a GitHub 
action will assign it to you





[PR] Complete encapsulating `OrderingEquivalenceClass` (make fields non pub) [datafusion]

2025-01-07 Thread via GitHub


alamb opened a new pull request, #14037:
URL: https://github.com/apache/datafusion/pull/14037

   ## Which issue does this PR close?
   
   - Part of https://github.com/apache/datafusion/issues/13748
   
   
   
   ## Rationale for this change
   
   As a first part of optimizing equivalence / ordering calculations I need to 
ensure the internal invariants are clear so I don't break them during 
optimization
   
   
   
   
   ## What changes are included in this PR?
   
   This changes the members of the equivalent classes to be non pub and 
documents what they do better.
   
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   While this is a breaking API change (the field is no longer pub) I hope the 
additional documentation makes migration as painless as possible
   
   





Re: [PR] Complete encapsulating `OrderingEquivalenceClass` (make fields non pub) [datafusion]

2025-01-07 Thread via GitHub


alamb commented on code in PR #14037:
URL: https://github.com/apache/datafusion/pull/14037#discussion_r1905776819


##
datafusion/physical-expr/src/equivalence/ordering.rs:
##
@@ -39,7 +39,7 @@ use arrow_schema::SortOptions;
 /// ordering. In this case, we say that these orderings are equivalent.
 #[derive(Debug, Clone, Eq, PartialEq, Hash, Default)]
 pub struct OrderingEquivalenceClass {
-pub orderings: Vec,
+orderings: Vec,

Review Comment:
   This is the key change -- making this field non-pub -- the rest of the PR is 
derived from that and added comments






[PR] Add support for MS-SQL BEGIN/END TRY/CATCH [datafusion-sqlparser-rs]

2025-01-07 Thread via GitHub


yoavcloud opened a new pull request, #1649:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1649

   (no comment)





Re: [PR] feat: support enable_url_table config [datafusion-python]

2025-01-07 Thread via GitHub


chenkovsky commented on PR #980:
URL: 
https://github.com/apache/datafusion-python/pull/980#issuecomment-2575455417

   > Thank you for the addition! Would it be simpler to just expose the 
function on the session context instead of making it part of the 
initialization? Even if you do think it's important to have it in 
initialization, we probably should expose the function also to maintain parity 
with the rust version.
   
   updated





Re: [PR] Add support for ClickHouse `FORMAT` on `INSERT` [datafusion-sqlparser-rs]

2025-01-07 Thread via GitHub


bombsimon commented on code in PR #1628:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1628#discussion_r1905454728


##
src/keywords.rs:
##
@@ -931,7 +931,7 @@ pub const RESERVED_FOR_TABLE_ALIAS: &[Keyword] = &[
 Keyword::PREWHERE,
 // for ClickHouse SELECT * FROM t SETTINGS ...
 Keyword::SETTINGS,
-// for ClickHouse SELECT * FROM t FORMAT...
+// for ClickHouse SELECT * FROM t FORMAT... or INSERT INTO t FORMAT...

Review Comment:
   Removed them completely for both settings and format. 






Re: [PR] Feature/single source exec [datafusion]

2025-01-07 Thread via GitHub


mertak-synnada closed pull request #14035: Feature/single source exec
URL: https://github.com/apache/datafusion/pull/14035





Re: [PR] chore: deprecate `ValuesExec` in favour of `MemoryExec` [datafusion]

2025-01-07 Thread via GitHub


alamb commented on code in PR #14032:
URL: https://github.com/apache/datafusion/pull/14032#discussion_r1905509497


##
datafusion/physical-plan/src/values.rs:
##
@@ -34,6 +34,7 @@ use datafusion_execution::TaskContext;
 use datafusion_physical_expr::EquivalenceProperties;
 
 /// Execution plan for values list based relation (produces constant rows)
+#[deprecated(since = "44.0.0", note = "Use `MemoryExec::try_new_as_values` 
instead")]

Review Comment:
   Since we have already released 44, I think the correct version is the next 
unreleased one, 45
   
   ```suggestion
   #[deprecated(since = "45.0.0", note = "Use `MemoryExec::try_new_as_values` 
instead")]
   ```



##
datafusion/core/src/physical_planner.rs:
##
@@ -466,6 +467,7 @@ impl DefaultPhysicalPlanner {
 .collect::<Result<Vec<_>>>()
 })
 .collect::<Result<Vec<_>>>()?;
+#[allow(deprecated)] // TODO: Remove in favour of MemoryExec

Review Comment:
   In this PR I think it would be good to port this code (in the physical 
planner) to use `MemoryExec` -- that way queries will run through `MemoryExec` 
and we will have confidence that `ValuesExec` can really be removed






Re: [PR] Feature/single source exec [datafusion]

2025-01-07 Thread via GitHub


mertak-synnada commented on PR #14035:
URL: https://github.com/apache/datafusion/pull/14035#issuecomment-2575383133

   Closed since this PR was opened prematurely





Re: [PR] Fix error on `array_distinct` when input is empty #13810 [datafusion]

2025-01-07 Thread via GitHub


cht42 commented on PR #14034:
URL: https://github.com/apache/datafusion/pull/14034#issuecomment-2575530332

   > Thanks @cht42 !
   > 
   > What does FLUP stand for 🤔 My google fu doesn't seem to be able to find 
anything relevant: https://www.google.com/search?q=flup&oq=flup
   
   Means: **F**o**L**low **UP**. Maybe just used at my company internally 😅 





Re: [PR] Simplify error handling in case.rs (#13990) [datafusion]

2025-01-07 Thread via GitHub


alamb commented on PR #14033:
URL: https://github.com/apache/datafusion/pull/14033#issuecomment-2575793893

   Thank you @cj-zhukov  !
   
   It seems as if there is a compile error in this PR: 
https://github.com/apache/datafusion/actions/runs/12646530220/job/35237388507?pr=14033
   
   ```
   error[E0433]: failed to resolve: use of undeclared type `DataFusionError`
  --> datafusion/physical-expr/src/expressions/case.rs:407:13
   |
   407 | DataFusionError::Context(
   | ^^^ use of undeclared type `DataFusionError`
   |
   help: consider importing this enum
   |
   18  + use datafusion_common::DataFusionError;
   |
   
   Checking datafusion-sql v44.0.0 
(/__w/datafusion/datafusion/datafusion/sql)
   Checking datafusion-functions v44.0.0 
(/__w/datafusion/datafusion/datafusion/functions)
   For more information about this error, try `rustc --explain E0433`.
   error: could not compile `datafusion-physical-expr` (lib) due to 1 previous 
error
   warning: build failed, waiting for other jobs to finish...
   Error: Process completed with exit code 101.
   ```





Re: [PR] fix: yield when the next file is ready to open to prevent CPU starvation [datafusion]

2025-01-07 Thread via GitHub


crepererum commented on code in PR #14028:
URL: https://github.com/apache/datafusion/pull/14028#discussion_r1905787146


##
datafusion/core/src/datasource/physical_plan/file_stream.rs:
##
@@ -478,7 +478,12 @@ impl FileStream {
 reader,
 )),
 partition_values,
-}
+};
+// Return control to the runtime 
when we're ready to open the next file
+// to prevent uncancellable 
queries in scenarios with many large files.
+// This functions similarly to a 
`tokio::task::yield_now()`.
+cx.waker().wake_by_ref();
+return Poll::Pending;

Review Comment:
   For the multi-thread RT (which is the only thing we really care about), I 
think it's NOT calling `wake_by_ref` instantly:
   
   1. `yield_now`: 
https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/task/yield_now.rs#L57
   2. `context::defer`: 
https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/runtime/context.rs#L166-L177
   3. `with_scheduler`: 
https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/runtime/context.rs#L183-L195
   4. (via some indirection) `Context::defer`: 
https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/runtime/scheduler/mod.rs#L287-L289
   5. (via some indirection) `Context::defer` (different `Context` this time): 
https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/runtime/scheduler/multi_thread/worker.rs#L766-L768
   6. `Defer::defer`: 
https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/runtime/scheduler/defer.rs#L15-L26
   7. The deferred threads are woken in two places 
([1](https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/runtime/scheduler/multi_thread/worker.rs#L513-L516),
 
[2](https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/runtime/scheduler/multi_thread/worker.rs#L751))
 for which the 2nd one seems relevant. The actual waking will happen when the 
worker is -- I think -- unparked after a "park timeout". This is definitely 
long after `Pending` is returned, because the thread that is returning 
`Pending` is the very same worker, i.e. it cannot possibly be parked at this 
point in time.
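To make the ordering question in this thread concrete, below is a minimal hand-rolled yield point in the style of `tokio::task::yield_now()`, polled by hand with a no-op waker. This is a sketch for illustration only: unlike tokio's real implementation, which defers the wake as walked through above, it calls `wake_by_ref` directly before returning `Pending`:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

/// Minimal cooperative yield point: the first poll schedules a wake-up
/// and returns `Pending`; the second poll completes.
struct YieldNow {
    yielded: bool,
}

impl Future for YieldNow {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            // Ask the runtime to poll us again, then yield control.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// A no-op waker, sufficient for polling the future by hand.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

fn main() {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut fut = YieldNow { yielded: false };
    let mut pinned = Pin::new(&mut fut);
    // First poll yields back to the caller, second poll completes.
    assert!(matches!(pinned.as_mut().poll(&mut cx), Poll::Pending));
    assert!(matches!(pinned.as_mut().poll(&mut cx), Poll::Ready(())));
    println!("ok");
}
```

With a real multi-threaded runtime, the interesting difference is exactly when the wake lands relative to the `Pending` return, which is what the link trail above traces through tokio's `Defer` machinery.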






Re: [PR] Add support for ClickHouse `FORMAT` on `INSERT` [datafusion-sqlparser-rs]

2025-01-07 Thread via GitHub


bombsimon commented on code in PR #1628:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1628#discussion_r1905449986


##
src/ast/query.rs:
##
@@ -2465,14 +2465,25 @@ impl fmt::Display for GroupByExpr {
 #[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
 #[cfg_attr(feature = "visitor", derive(Visit, VisitMut))]
 pub enum FormatClause {
-Identifier(Ident),
+Identifier {
+ident: Ident,
+expr: Option<Vec<Expr>>,
+},
 Null,
 }

Review Comment:
   My thinking here was that they're referred to as the same in the ClickHouse 
docs, which is the only one that uses them, so it would make sense to represent 
it the same way in the AST. 
   
   - 
[`FORMAT`](https://clickhouse.com/docs/en/sql-reference/statements/select/format)
 only has one documentation page referring to both types
   - [The interface](https://clickhouse.com/docs/en/interfaces/formats) 
references both of them:
 > ClickHouse can accept and return data in various formats. A format 
supported for input can be used to parse the data provided to `INSERT`s, to 
perform `SELECT`s from a file-backed table such as File, URL or HDFS, or to 
read a dictionary. A format supported for output can be used to arrange the 
results of a `SELECT`, and to perform `INSERT`s into a file-backed table. All 
format names are case insensitive.
   
   So to me, re-using the same AST type with an optional expression for potential 
insert data still makes sense.
   
   Also double checking if you still don't agree with this, I guess you mean 
that the new type should be called `InputFormatClause` since it's the input 
that holds values and not the output, right?



##
src/ast/query.rs:
##
@@ -2465,14 +2465,25 @@ impl fmt::Display for GroupByExpr {
 #[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
 #[cfg_attr(feature = "visitor", derive(Visit, VisitMut))]
 pub enum FormatClause {
-Identifier(Ident),
+Identifier {
+ident: Ident,
+expr: Option<Vec<Expr>>,
+},
 Null,
 }

Review Comment:
   My thinking here was that they're referred to as the same in the ClickHouse 
docs, which is the only one that uses them, so it would make sense to represent 
it the same way in the AST. 
   
   - 
[`FORMAT`](https://clickhouse.com/docs/en/sql-reference/statements/select/format)
 only has one documentation page referring to both types
   - [The interface](https://clickhouse.com/docs/en/interfaces/formats) 
references both of them:
 > ClickHouse can accept and return data in various formats. A format 
supported for input can be used to parse the data provided to `INSERT`s, to 
perform `SELECT`s from a file-backed table such as File, URL or HDFS, or to 
read a dictionary. A format supported for output can be used to arrange the 
results of a `SELECT`, and to perform `INSERT`s into a file-backed table. All 
format names are case insensitive.
   
   So to me, re-using the same AST type with an optional expression for potential 
insert data still makes sense.
   
   Also double checking if you still don't agree with this, I guess you mean 
that the new type should be called `InputFormatClause` since it's the input 
that holds values and not the output, right?






Re: [PR] MsSQL SET for session params [datafusion-sqlparser-rs]

2025-01-07 Thread via GitHub


alamb commented on code in PR #1646:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1646#discussion_r1905447833


##
src/parser/mod.rs:
##
@@ -10307,11 +10307,67 @@ impl<'a> Parser<'a> {
 snapshot: None,
 session: false,
 })
+} else if self.dialect.supports_set_stmt_without_operator() {
+self.prev_token();
+self.parse_set_session_params()
 } else {
 self.expected("equals sign or TO", self.peek_token())
 }
 }
 
+pub fn parse_set_session_params(&mut self) -> Result<Statement, ParserError> {
+let names = 
self.parse_comma_separated(Parser::parse_set_session_param_name)?;
+
+let last_name = if let Some(last_name) = names.last() {
+last_name.to_uppercase()
+} else {
+return self.expected("Session param name", self.peek_token());
+};
+
+let identity_insert_obj = if last_name == "IDENTITY_INSERT" {
+Some(self.parse_object_name(false)?)
+} else {
+None
+};
+
+let offsets_keywords = if last_name == "OFFSETS" {
+Some(self.parse_comma_separated(|parser| {
+let next_token = parser.next_token();
+match &next_token.token {
+Token::Word(w) => Ok(w.to_string()),
+_ => parser.expected("SQL keyword", next_token),
+}
+})?)
+} else {
+None
+};
+
+let value = self.parse_expr()?.to_string();
+Ok(Statement::SetSessionParam {
+names,
+identity_insert_obj,
+offsets_keywords,
+value,
+})
+}
+
+pub fn parse_set_session_param_name(&mut self) -> Result<String, ParserError> {
+if self.parse_keywords(&[Keyword::STATISTICS, Keyword::IO]) {
+return Ok("STATISTICS IO".to_string());
+} else if self.parse_keywords(&[Keyword::STATISTICS, Keyword::XML]) {
+return Ok("STATISTICS XML".to_string());
+} else if self.parse_keywords(&[Keyword::STATISTICS, 
Keyword::PROFILE]) {
+return Ok("STATISTICS PROFILE".to_string());
+} else if self.parse_keywords(&[Keyword::STATISTICS, Keyword::TIME]) {
+return Ok("STATISTICS TIME".to_string());
+}
+let next_token = self.next_token();
+if let Token::Word(w) = next_token.token {
+return Ok(w.to_string());
+}
+self.expected("Session param name", next_token)

Review Comment:
   maybe it would be nice to list the expected keywords (IO/XML/PROFILE/TIME, 
etc) here



##
src/parser/mod.rs:
##
@@ -10307,11 +10307,67 @@ impl<'a> Parser<'a> {
 snapshot: None,
 session: false,
 })
+} else if self.dialect.supports_set_stmt_without_operator() {
+self.prev_token();
+self.parse_set_session_params()
 } else {
 self.expected("equals sign or TO", self.peek_token())
 }
 }
 
+pub fn parse_set_session_params(&mut self) -> Result<Statement, ParserError> {
+let names = 
self.parse_comma_separated(Parser::parse_set_session_param_name)?;
+
+let last_name = if let Some(last_name) = names.last() {
+last_name.to_uppercase()
+} else {
+return self.expected("Session param name", self.peek_token());
+};
+
+let identity_insert_obj = if last_name == "IDENTITY_INSERT" {
+Some(self.parse_object_name(false)?)
+} else {
+None
+};
+
+let offsets_keywords = if last_name == "OFFSETS" {
+Some(self.parse_comma_separated(|parser| {
+let next_token = parser.next_token();
+match &next_token.token {
+Token::Word(w) => Ok(w.to_string()),
+_ => parser.expected("SQL keyword", next_token),
+}
+})?)
+} else {
+None
+};
+
+let value = self.parse_expr()?.to_string();
+Ok(Statement::SetSessionParam {
+names,
+identity_insert_obj,
+offsets_keywords,
+value,
+})
+}
+
+pub fn parse_set_session_param_name(&mut self) -> Result<String, ParserError> {
+if self.parse_keywords(&[Keyword::STATISTICS, Keyword::IO]) {

Review Comment:
   Given these are hard coded it seems like it would be cleaner to eventually 
return an enum for what setting was set rather than arbitrary strings that the 
using library needs to parse / interpret
   
   However, I also think we could do that as a follow on PR
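A minimal sketch of that enum-based alternative might look like the following. The variant names here are hypothetical illustrations based on the hard-coded strings in this PR, not part of the sqlparser-rs API:

```rust
// Illustrative sketch of the enum suggested in the review: typed
// session parameter names instead of arbitrary strings, with a
// fallback variant for anything not explicitly recognized.
#[derive(Debug, PartialEq)]
enum SessionParamName {
    StatisticsIo,
    StatisticsXml,
    StatisticsProfile,
    StatisticsTime,
    IdentityInsert,
    Offsets,
    /// Any parameter the parser does not recognize explicitly.
    Other(String),
}

impl From<&str> for SessionParamName {
    fn from(name: &str) -> Self {
        match name.to_uppercase().as_str() {
            "STATISTICS IO" => Self::StatisticsIo,
            "STATISTICS XML" => Self::StatisticsXml,
            "STATISTICS PROFILE" => Self::StatisticsProfile,
            "STATISTICS TIME" => Self::StatisticsTime,
            "IDENTITY_INSERT" => Self::IdentityInsert,
            "OFFSETS" => Self::Offsets,
            other => Self::Other(other.to_string()),
        }
    }
}

fn main() {
    assert_eq!(
        SessionParamName::from("statistics io"),
        SessionParamName::StatisticsIo
    );
    // Unrecognized params survive as uppercased strings.
    assert_eq!(
        SessionParamName::from("ansi_nulls"),
        SessionParamName::Other("ANSI_NULLS".to_string())
    );
    println!("ok");
}
```

The `Other` fallback keeps the parser permissive while letting downstream code match on the known settings without string comparisons.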




Re: [PR] MsSQL SET for session params [datafusion-sqlparser-rs]

2025-01-07 Thread via GitHub


alamb commented on PR #1646:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1646#issuecomment-2575293097

   fyi @iffyio 





Re: [PR] Support async iteration of RecordBatchStream [datafusion-python]

2025-01-07 Thread via GitHub


timsaucer commented on code in PR #975:
URL: https://github.com/apache/datafusion-python/pull/975#discussion_r1905447211


##
python/datafusion/record_batch.py:
##
@@ -59,18 +59,22 @@ def __init__(self, record_batch_stream: 
df_internal.RecordBatchStream) -> None:
 
 def next(self) -> RecordBatch | None:

Review Comment:
   I think we can update the return type hint to just `RecordBatch` 






Re: [PR] Added references to IDE documentation for dev containers [datafusion]

2025-01-07 Thread via GitHub


alamb commented on PR #14014:
URL: https://github.com/apache/datafusion/pull/14014#issuecomment-2575659015

   Thanks @Omega359  and @goldmedal 





Re: [PR] Added references to IDE documentation for dev containers [datafusion]

2025-01-07 Thread via GitHub


alamb merged PR #14014:
URL: https://github.com/apache/datafusion/pull/14014





Re: [I] Document how to use `.devcontainer` [datafusion]

2025-01-07 Thread via GitHub


alamb closed issue #13969: Document how to use `.devcontainer`
URL: https://github.com/apache/datafusion/issues/13969





Re: [PR] fix: yield when the next file is ready to open to prevent CPU starvation [datafusion]

2025-01-07 Thread via GitHub


crepererum commented on code in PR #14028:
URL: https://github.com/apache/datafusion/pull/14028#discussion_r1905684304


##
datafusion/core/src/datasource/physical_plan/file_stream.rs:
##
@@ -478,7 +478,12 @@ impl FileStream {
 reader,
 )),
 partition_values,
-}
+};
+// Return control to the runtime when we're ready to open the next file
+// to prevent uncancellable queries in scenarios with many large files.
+// This functions similarly to a `tokio::task::yield_now()`.
+cx.waker().wake_by_ref();
+return Poll::Pending;

Review Comment:
   I think this may not be correct. The `wake_by_ref` should be called AFTER 
returning `Pending`. I think you may be lucky that this works, but it's 
undefined behavior. If I understand the `yield_now` impl. correctly, it tries 
to delay the "wake" call.
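
The yield pattern under discussion can be shown with a std-only sketch (this is an illustration of the cooperative-yield idea, not tokio's actual `yield_now` implementation): the first poll marks itself done, wakes the task, and returns `Pending`; the next poll returns `Ready`.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A future that yields exactly once: first poll requests a re-poll and
// returns Pending, second poll returns Ready.
struct YieldNow {
    yielded: bool,
}

impl Future for YieldNow {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            // Request a re-poll, then hand control back to the executor.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

// Count how many polls a YieldNow takes to finish, using a no-op waker
// and a busy poll loop in place of a real executor.
fn polls_until_ready() -> u32 {
    fn raw() -> RawWaker {
        fn clone(_: *const ()) -> RawWaker {
            raw()
        }
        fn noop(_: *const ()) {}
        RawWaker::new(
            std::ptr::null(),
            &RawWakerVTable::new(clone, noop, noop, noop),
        )
    }
    let waker = unsafe { Waker::from_raw(raw()) };
    let mut cx = Context::from_waker(&waker);
    let mut fut = YieldNow { yielded: false };
    // Safety: `fut` stays on this stack frame and is not moved again.
    let mut pinned = unsafe { Pin::new_unchecked(&mut fut) };
    let mut polls = 0;
    loop {
        polls += 1;
        if pinned.as_mut().poll(&mut cx).is_ready() {
            return polls;
        }
    }
}
```

Whether the wake should happen before or after returning `Pending` is exactly the point being debated above; the sketch only demonstrates the two-poll lifecycle.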






Re: [PR] Add support for ClickHouse `FORMAT` on `INSERT` [datafusion-sqlparser-rs]

2025-01-07 Thread via GitHub


bombsimon commented on code in PR #1628:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1628#discussion_r1905462357


##
src/ast/dml.rs:
##
@@ -547,7 +561,15 @@ impl Display for Insert {
 write!(f, "{source}")?;
 }
 
-if self.source.is_none() && self.columns.is_empty() {
+if let Some(settings) = &self.settings {
+write!(f, "SETTINGS {} ", display_comma_separated(settings))?;
+}
+
+if let Some(format_clause) = &self.format_clause {
+write!(f, "{format_clause}")?;
+}
+
+if self.source.is_none() && self.columns.is_empty() && self.format_clause.is_none() {

Review Comment:
   Well kinda, if we have a format clause we don't want to add `DEFAULT VALUES` 
which we would otherwise since `source` and `columns` are none.
   
   However after rebasing this on `main` we now have an if-else chain that we 
can use so we now have a separate branch for format clause being `Some` and 
another for columns. I also changed the branch for `columns` since it no longer 
needs to check `source` as well in e75aafe.






Re: [PR] feat: support enable_url_table config [datafusion-python]

2025-01-07 Thread via GitHub


timsaucer commented on PR #980:
URL: 
https://github.com/apache/datafusion-python/pull/980#issuecomment-2575307105

   Thank you for the addition! Would it be simpler to just expose the function 
on the session context instead of making it part of the initialization?





Re: [PR] chore: set validation and type hint for ffi tableprovider [datafusion-python]

2025-01-07 Thread via GitHub


timsaucer merged PR #983:
URL: https://github.com/apache/datafusion-python/pull/983





[PR] [comet-parquet-exec] Move type conversion logic for ParquetExec out of Cast expression. [datafusion-comet]

2025-01-07 Thread via GitHub


mbutrovich opened a new pull request, #1229:
URL: https://github.com/apache/datafusion-comet/pull/1229

   (no comment)





Re: [PR] Support pluralized time units [datafusion-sqlparser-rs]

2025-01-07 Thread via GitHub


iffyio commented on code in PR #1630:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1630#discussion_r1905821060


##
tests/sqlparser_common.rs:
##
@@ -5374,10 +5396,49 @@ fn parse_interval_all() {
 verified_only_select("SELECT INTERVAL '1' MINUTE TO SECOND");
 verified_only_select("SELECT INTERVAL 1 YEAR");
 verified_only_select("SELECT INTERVAL 1 MONTH");
+verified_only_select("SELECT INTERVAL 1 WEEK");
 verified_only_select("SELECT INTERVAL 1 DAY");
 verified_only_select("SELECT INTERVAL 1 HOUR");
 verified_only_select("SELECT INTERVAL 1 MINUTE");
 verified_only_select("SELECT INTERVAL 1 SECOND");
+verified_only_select("SELECT INTERVAL 1 YEARS");
+verified_only_select("SELECT INTERVAL 1 MONTHS");
+verified_only_select("SELECT INTERVAL 1 WEEKS");
+verified_only_select("SELECT INTERVAL 1 DAYS");
+verified_only_select("SELECT INTERVAL 1 HOURS");
+verified_only_select("SELECT INTERVAL 1 MINUTES");
+verified_only_select("SELECT INTERVAL 1 SECONDS");
+verified_only_select(
+"SELECT '2 years 15 months 100 weeks 99 hours 123456789 milliseconds'::INTERVAL",
+);
+
+// keep Generic/BigQuery extract week with weekday syntax success
+let supported_dialects = TestedDialects::new(vec![
+Box::new(GenericDialect {}),
+Box::new(BigQueryDialect {}),
+]);
+
+let sql = "SELECT EXTRACT(WEEK(a) FROM date)";
+supported_dialects.verified_stmt(sql);

Review Comment:
   Oh not sure we need to test this here, it's [already tested 
here](https://github.com/apache/datafusion-sqlparser-rs/blob/ba3acb243007555bfc90885fcc9aad52642c06ca/tests/sqlparser_bigquery.rs#L2181-L2193)
 for example



##
tests/sqlparser_common.rs:
##
@@ -5374,10 +5396,49 @@ fn parse_interval_all() {
 verified_only_select("SELECT INTERVAL '1' MINUTE TO SECOND");
 verified_only_select("SELECT INTERVAL 1 YEAR");
 verified_only_select("SELECT INTERVAL 1 MONTH");
+verified_only_select("SELECT INTERVAL 1 WEEK");
 verified_only_select("SELECT INTERVAL 1 DAY");
 verified_only_select("SELECT INTERVAL 1 HOUR");
 verified_only_select("SELECT INTERVAL 1 MINUTE");
 verified_only_select("SELECT INTERVAL 1 SECOND");
+verified_only_select("SELECT INTERVAL 1 YEARS");
+verified_only_select("SELECT INTERVAL 1 MONTHS");
+verified_only_select("SELECT INTERVAL 1 WEEKS");
+verified_only_select("SELECT INTERVAL 1 DAYS");
+verified_only_select("SELECT INTERVAL 1 HOURS");
+verified_only_select("SELECT INTERVAL 1 MINUTES");
+verified_only_select("SELECT INTERVAL 1 SECONDS");
+verified_only_select(
+"SELECT '2 years 15 months 100 weeks 99 hours 123456789 milliseconds'::INTERVAL",
+);
+
+// keep Generic/BigQuery extract week with weekday syntax success
+let supported_dialects = TestedDialects::new(vec![
+Box::new(GenericDialect {}),
+Box::new(BigQueryDialect {}),
+]);
+
+let sql = "SELECT EXTRACT(WEEK(a) FROM date)";
+supported_dialects.verified_stmt(sql);
+
+let all_other_dialects = TestedDialects::new(vec![
+Box::new(PostgreSqlDialect {}),
+Box::new(MsSqlDialect {}),
+Box::new(AnsiDialect {}),
+Box::new(SnowflakeDialect {}),
+Box::new(HiveDialect {}),
+Box::new(RedshiftSqlDialect {}),
+Box::new(MySqlDialect {}),
+Box::new(SQLiteDialect {}),
+Box::new(DuckDbDialect {}),
+]);
+
+assert_eq!(
+ParserError::ParserError("Expected 'FROM' or ','".to_owned()),
+all_other_dialects
+.parse_sql_statements("SELECT EXTRACT(WEEK(a) FROM date)")
+.unwrap_err()
+);

Review Comment:
   similarly I think we can skip this test; if a dialect doesn't support a 
feature there's no need to test that feature with the dialect



##
src/parser/mod.rs:
##
@@ -2353,14 +2355,30 @@ impl<'a> Parser<'a> {
 };
 Ok(DateTimeField::Week(week_day))
 }
+Keyword::WEEKS => {
+let week_day = if dialect_of!(self is BigQueryDialect)
+&& self.consume_token(&Token::LParen)
+{
+let week_day = self.parse_identifier()?;
+self.expect_token(&Token::RParen)?;
+Some(week_day)
+} else {
+None
+};
+Ok(DateTimeField::Weeks(week_day))
+}

Review Comment:
   That sounds reasonable, thanks for checking! It seems like 
`Weeks(Option<Ident>)` isn't supported by any dialect? I'm thinking if so we 
can change that to `Weeks`?
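
One alternative design to adding parallel plural AST variants would be to normalize pluralized unit keywords onto their singular forms at parse time. This is a hypothetical sketch for illustration, not the approach taken in the PR:

```rust
// Hypothetical normalization step: map pluralized interval unit keywords
// onto their singular counterparts so the AST only needs one variant per
// unit. Unknown inputs pass through unchanged.
fn normalize_unit(unit: &str) -> &str {
    match unit {
        "YEARS" => "YEAR",
        "MONTHS" => "MONTH",
        "WEEKS" => "WEEK",
        "DAYS" => "DAY",
        "HOURS" => "HOUR",
        "MINUTES" => "MINUTE",
        "SECONDS" => "SECOND",
        other => other,
    }
}
```

The trade-off is that round-tripping (`verified_only_select`) would then re-emit the singular spelling rather than the user's original plural, which is why distinct variants may still be preferable.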




Re: [PR] Start new line if \r in Postgres dialect [datafusion-sqlparser-rs]

2025-01-07 Thread via GitHub


iffyio merged PR #1647:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1647





Re: [PR] feat: add support for array_remove expression [datafusion-comet]

2025-01-07 Thread via GitHub


jatin510 commented on PR #1179:
URL: 
https://github.com/apache/datafusion-comet/pull/1179#issuecomment-2575891718

   > There seems to be a difference in null handling between DataFusion and 
Spark that is causing the tests to fail.
   > 
   > This test passes:
   > 
   > ```scala
   > checkSparkAnswerAndOperator(sql("SELECT array_remove(array(_2, _3,_4), _2) 
from t1 where _2 is not null"))
   > ```
   > 
   > This test fails for me locally and also in CI:
   > 
   > ```scala
   > checkSparkAnswerAndOperator(sql("SELECT array_remove(array(_2, _3,_4), _2) 
from t1"))
   > ```
   
   fixed it @andygrove 





Re: [PR] [comet-parquet-exec] Disable DPP in stability tests [datafusion-comet]

2025-01-07 Thread via GitHub


parthchandra commented on code in PR #1230:
URL: https://github.com/apache/datafusion-comet/pull/1230#discussion_r1905896252


##
spark/src/test/scala/org/apache/spark/sql/comet/CometPlanStabilitySuite.scala:
##
@@ -263,11 +263,13 @@ trait CometPlanStabilitySuite extends 
DisableAdaptiveExecutionSuite with TPCDSBa
   classLoader = Thread.currentThread().getContextClassLoader)
 // Disable char/varchar read-side handling for better performance.
 withSQLConf(
+  // Comet does not yet support DPP yet
+  // https://github.com/apache/datafusion-comet/issues/121
+  SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "false",

Review Comment:
   Should we do this only if the parquet implementation is `native_full` or 
`native_recordbatch` ?






[PR] [comet-parquet-exec] Disable DPP in stability tests [datafusion-comet]

2025-01-07 Thread via GitHub


andygrove opened a new pull request, #1230:
URL: https://github.com/apache/datafusion-comet/pull/1230

   ## Which issue does this PR close?
   
   
   
   Closes #.
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## How are these changes tested?
   
   
   





Re: [I] FFI Execution Plans that spawn threads panic [datafusion]

2025-01-07 Thread via GitHub


timsaucer commented on issue #13851:
URL: https://github.com/apache/datafusion/issues/13851#issuecomment-2575975404

   Hi Kevin, I'm back to work this week but have a bit of a backlog right now 
to get through. For the segmentation error above it is probably because you 
would need to build datafusion-python with that same fix to the FFI API - it is 
a breaking change.
   
   I have a dev setup on my machine where I should be able to test this. We can 
potentially get you a workaround depending on where you are spawning that task 
from. The ideal solution is to get my PR here merged in, upgrade 
datafusion-python, and then use that. It might be a month or so until the next 
datafusion version comes out though.





Re: [PR] Unparsing optimized (> 2 inputs) unions [datafusion]

2025-01-07 Thread via GitHub


MohamedAbdeen21 commented on PR #14031:
URL: https://github.com/apache/datafusion/pull/14031#issuecomment-2576003567

   Thanks for the suggestion @goldmedal, updated accordingly 
   





[PR] Encapsulate fields of `EquivalenceGroup` [datafusion]

2025-01-07 Thread via GitHub


alamb opened a new pull request, #14039:
URL: https://github.com/apache/datafusion/pull/14039

   ## Which issue does this PR close?
   
   - Part of https://github.com/apache/datafusion/issues/13748
   
   ## Rationale for this change
   
   As a first part of optimizing equivalence / ordering calculations I need to 
ensure the internal invariants are clear so I don't break them during 
optimization
   
   
   ## What changes are included in this PR?
   
   This changes the `classes` to be non pub and adds some conversion methods
   
   
   ## Are these changes tested?
   Existing CI
   
   
   ## Are there any user-facing changes?
   As in https://github.com/apache/datafusion/pull/14037, while this is a 
breaking API change (the field is no longer pub) I think disruptions will be 
minimal





Re: [PR] Encapsulate fields of `EquivalenceGroup` [datafusion]

2025-01-07 Thread via GitHub


alamb commented on code in PR #14039:
URL: https://github.com/apache/datafusion/pull/14039#discussion_r1906004274


##
datafusion/physical-expr/src/equivalence/class.rs:
##
@@ -323,11 +323,10 @@ impl Display for EquivalenceClass {
 }
 }
 
-/// An `EquivalenceGroup` is a collection of `EquivalenceClass`es where each
-/// class represents a distinct equivalence class in a relation.
+/// A collection of distinct `EquivalenceClass`es
 #[derive(Debug, Clone)]
 pub struct EquivalenceGroup {
-pub classes: Vec<EquivalenceClass>,
+classes: Vec<EquivalenceClass>,

Review Comment:
   the point of this PR is to remove this `pub`






Re: [PR] Feat/ffi enter tokio runtime [datafusion]

2025-01-07 Thread via GitHub


kevinjqliu commented on code in PR #13937:
URL: https://github.com/apache/datafusion/pull/13937#discussion_r1906007056


##
datafusion/ffi/src/lib.rs:
##
@@ -26,5 +26,14 @@ pub mod session_config;
 pub mod table_provider;
 pub mod table_source;
 
+/// Returns the major version of the FFI implementation. If the API evolves,
+/// we use the major version to identify compatibility over the unsafe
+/// boundary.
+pub extern "C" fn version() -> u64 {
+let version_str = env!("CARGO_PKG_VERSION");
+let version = semver::Version::parse(version_str).expect("Invalid version string");
+version.major
+}

Review Comment:
   i dont see this used anywhere in this PR to guard against unsafe boundary; 
please ignore if the PR is still WIP
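
The guard the comment asks about would amount to comparing major versions on both sides of the boundary. A std-only sketch of that idea (`major_version` is a hypothetical stand-in for `semver::Version::parse(..).major`; the actual check is not part of this PR):

```rust
// Extract the semver major component from a version string, e.g. "45.0.0" -> 45.
fn major_version(version: &str) -> Option<u64> {
    version.split('.').next()?.parse().ok()
}

// Hypothetical compatibility guard a consumer could run before trusting a
// foreign provider across the unsafe FFI boundary.
fn is_compatible(local: &str, remote: &str) -> bool {
    match (major_version(local), major_version(remote)) {
        (Some(a), Some(b)) => a == b,
        _ => false,
    }
}
```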






Re: [PR] Use partial aggregation schema for spilling to avoid column mismatch in GroupedHashAggregateStream [datafusion]

2025-01-07 Thread via GitHub


korowa commented on code in PR #13995:
URL: https://github.com/apache/datafusion/pull/13995#discussion_r1905846854


##
datafusion/core/src/dataframe/mod.rs:
##
@@ -43,6 +38,10 @@ use crate::physical_plan::{
 ExecutionPlan, SendableRecordBatchStream,
 };
 use crate::prelude::SessionContext;
+use std::any::Any;

Review Comment:
   minor: this import reordering can be reverted to leave the file unmodified






Re: [PR] feat: add support for array_remove expression [datafusion-comet]

2025-01-07 Thread via GitHub


andygrove commented on code in PR #1179:
URL: https://github.com/apache/datafusion-comet/pull/1179#discussion_r1905845993


##
native/core/src/execution/planner.rs:
##
@@ -735,6 +736,36 @@ impl PhysicalPlanner {
 ));
 Ok(array_has_expr)
 }
+ExprStruct::ArrayRemove(expr) => {
+    let src_array_expr =
+        self.create_expr(expr.left.as_ref().unwrap(), Arc::clone(&input_schema))?;
+    let key_expr =
+        self.create_expr(expr.right.as_ref().unwrap(), Arc::clone(&input_schema))?;
+    let args = vec![Arc::clone(&src_array_expr), Arc::clone(&key_expr)];
+    let return_type = src_array_expr.data_type(&input_schema)?;
+
+    let datafusion_array_remove = array_remove_all_udf();
+
+    let array_remove_expr: Arc<dyn PhysicalExpr> = Arc::new(ScalarFunctionExpr::new(
+        "array_remove",
+        datafusion_array_remove,
+        args,
+        return_type,
+    ));
+    let is_null_expr: Arc<dyn PhysicalExpr> = Arc::new(IsNullExpr::new(key_expr));
+
+    let null_literal_expr: Arc<dyn PhysicalExpr> =
+        Arc::new(Literal::new(ScalarValue::Null));
+
+    let case_expr = CaseExpr::try_new(
+        None,
+        vec![(is_null_expr, null_literal_expr)],
+        Some(array_remove_expr),
+    )?;
+
+    Ok(Arc::new(case_expr))
+    // Ok(array_remove_expr)

Review Comment:
   Could you remove this commented out code?
   
   ```suggestion
   ```






Re: [PR] feat: add support for array_remove expression [datafusion-comet]

2025-01-07 Thread via GitHub


jatin510 commented on code in PR #1179:
URL: https://github.com/apache/datafusion-comet/pull/1179#discussion_r1905885168


##
native/core/src/execution/planner.rs:
##
@@ -735,6 +736,36 @@ impl PhysicalPlanner {
 ));
 Ok(array_has_expr)
 }
+ExprStruct::ArrayRemove(expr) => {
+    let src_array_expr =
+        self.create_expr(expr.left.as_ref().unwrap(), Arc::clone(&input_schema))?;
+    let key_expr =
+        self.create_expr(expr.right.as_ref().unwrap(), Arc::clone(&input_schema))?;
+    let args = vec![Arc::clone(&src_array_expr), Arc::clone(&key_expr)];
+    let return_type = src_array_expr.data_type(&input_schema)?;
+
+    let datafusion_array_remove = array_remove_all_udf();
+
+    let array_remove_expr: Arc<dyn PhysicalExpr> = Arc::new(ScalarFunctionExpr::new(
+        "array_remove",
+        datafusion_array_remove,
+        args,
+        return_type,
+    ));
+    let is_null_expr: Arc<dyn PhysicalExpr> = Arc::new(IsNullExpr::new(key_expr));
+
+    let null_literal_expr: Arc<dyn PhysicalExpr> =
+        Arc::new(Literal::new(ScalarValue::Null));
+
+    let case_expr = CaseExpr::try_new(
+        None,
+        vec![(is_null_expr, null_literal_expr)],
+        Some(array_remove_expr),
+    )?;
+
+    Ok(Arc::new(case_expr))
+    // Ok(array_remove_expr)

Review Comment:
   fixed ! 😅






[PR] fix: Set scan implementation choice via environment variable [datafusion-comet]

2025-01-07 Thread via GitHub


parthchandra opened a new pull request, #1231:
URL: https://github.com/apache/datafusion-comet/pull/1231

   Makes it easier to switch scan implementation types during development by 
setting the environment variable `NATIVE_SCAN_IMPL={native | native_full | native_recordbatch}`.
   Default remains `native_full`.
   





Re: [I] FFI Execution Plans that spawn threads panic [datafusion]

2025-01-07 Thread via GitHub


kevinjqliu commented on issue #13851:
URL: https://github.com/apache/datafusion/issues/13851#issuecomment-2576127057

   > For the segmentation error above it is probably because you would need to 
build datafusion-python with that same fix to the FFI API - it is a breaking 
change.
   
   That's exactly what I was missing! I was using [`datafusion>=43.1.0` in the 
pytest env 
](https://github.com/kevinjqliu/iceberg-rust/pull/3/files#diff-bf903f17157a753e10c075d464383a2e36fc657641d17673f054239f8d1a3999R46).
   
   > The ideal solution is to get my PR here merged in, upgrade datafusion 
python, and then use that. 
   
   Looking forward to it! Please let me know if there's anything i can help out 
with. 
   
   Also do you think its a good idea to include Delta and Iceberg 
TableProviders in the `ffitest` modules of your PR? 





[PR] chore: Release Ballista 43.0.0 [datafusion-ballista]

2025-01-07 Thread via GitHub


milenkovicm opened a new pull request, #1156:
URL: https://github.com/apache/datafusion-ballista/pull/1156

   # Which issue does this PR close?
   
   
   
   Closes #974.
   
# Rationale for this change
   
   
   # What changes are included in this PR?
   
   
   # Are there any user-facing changes?
   
   
   
   





Re: [PR] Refactor into `LexOrdering::collapse`, `LexRequirement::collapse` avoid clone [datafusion]

2025-01-07 Thread via GitHub


alamb commented on code in PR #14038:
URL: https://github.com/apache/datafusion/pull/14038#discussion_r1905986687


##
datafusion/physical-expr/src/equivalence/ordering.rs:
##
@@ -207,19 +212,6 @@ impl IntoIterator for OrderingEquivalenceClass {
 }
 }
 
-/// This function constructs a duplicate-free `LexOrdering` by filtering out

Review Comment:
   Despite the function being `pub` it appears to be in a non pub module, so is 
not exposed. You can see that it is not present in the docs:
   
   
https://docs.rs/datafusion/latest/datafusion/index.html?search=collapse_lex_ordering
   
   (when I marked it deprecated then clippy complained it was dead code)






Re: [PR] Refactor into `LexOrdering::collapse`, `LexRequirement::collapse` avoid clone [datafusion]

2025-01-07 Thread via GitHub


alamb commented on code in PR #14038:
URL: https://github.com/apache/datafusion/pull/14038#discussion_r1905985754


##
datafusion/physical-expr/src/equivalence/mod.rs:
##
@@ -41,14 +41,9 @@ pub use properties::{
 /// It will also filter out entries that are ordered if the next entry is;
 /// for instance, `vec![floor(a) Some(ASC), a Some(ASC)]` will be collapsed to
 /// `vec![a Some(ASC)]`.
+#[deprecated(since = "45.0.0", note = "Use LexRequirement::collapse")]

Review Comment:
   This function is part of the public API: 
https://docs.rs/datafusion/latest/datafusion/physical_expr/equivalence/fn.collapse_lex_req.html






Re: [PR] Refactor into `LexOrdering::collapse`, `LexRequirement::collapse` avoid clone [datafusion]

2025-01-07 Thread via GitHub


alamb commented on code in PR #14038:
URL: https://github.com/apache/datafusion/pull/14038#discussion_r1905987391


##
datafusion/physical-expr-common/src/sort_expr.rs:
##
@@ -409,6 +409,22 @@ impl LexOrdering {
 .map(PhysicalSortExpr::from)
 .collect()
 }
+
+/// Collapse a `LexOrdering` into a new duplicate-free `LexOrdering` based on expression.
+///
+/// This function filters  duplicate entries that have same physical
+/// expression inside, ignoring [`SortOptions`]. For example:
+///
+/// `vec![a ASC, a DESC]` collapses to `vec![a ASC]`.
+pub fn collapse(self) -> Self {
+let mut output = LexOrdering::default();
+for item in self {
+if !output.iter().any(|req| req.expr.eq(&item.expr)) {
+output.push(item);

Review Comment:
   In the original code this was `item.clone()` -- though it is only cloning an 
Arc so the performance benefit is likely minimal
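
The dedup rule in `collapse` can be illustrated with a std-only sketch (expressions modeled as strings and sort options as a bool, `true` = ASC; this mirrors the structure of the code above, not DataFusion's actual types):

```rust
// Keep the first sort expression per underlying expression and drop later
// entries with the same expression, ignoring their sort options -- the
// same rule as `req.expr.eq(&item.expr)` in the diff above.
fn collapse(ordering: Vec<(String, bool)>) -> Vec<(String, bool)> {
    let mut output: Vec<(String, bool)> = Vec::new();
    for item in ordering {
        // Compare only the expression, not the sort direction.
        if !output.iter().any(|(expr, _)| *expr == item.0) {
            output.push(item);
        }
    }
    output
}
```

Because `collapse` takes the ordering by value, each surviving `item` is moved into `output` rather than cloned, which is the clone-avoidance point made above.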






Re: [PR] Use partial aggregation schema for spilling to avoid column mismatch in GroupedHashAggregateStream [datafusion]

2025-01-07 Thread via GitHub


korowa commented on code in PR #13995:
URL: https://github.com/apache/datafusion/pull/13995#discussion_r1905842942


##
datafusion/core/src/dataframe/mod.rs:
##
@@ -2743,6 +2754,143 @@ mod tests {
 Ok(())
 }
 
+// test for https://github.com/apache/datafusion/issues/13949
+async fn run_test_with_spill_pool_if_necessary(

Review Comment:
   My bad, yes, I meant aggregates/mod.rs






Re: [PR] [comet-parquet-exec] fix: fix various bugs in casting between struct types [datafusion-comet]

2025-01-07 Thread via GitHub


parthchandra commented on code in PR #1226:
URL: https://github.com/apache/datafusion-comet/pull/1226#discussion_r1905866905


##
native/spark-expr/src/cast.rs:
##
@@ -817,17 +818,28 @@ fn cast_struct_to_struct(
 cast_options: &SparkCastOptions,
 ) -> DataFusionResult {
 match (from_type, to_type) {
-(DataType::Struct(_), DataType::Struct(to_fields)) => {
-let mut cast_fields: Vec<(Arc, ArrayRef)> = 
Vec::with_capacity(to_fields.len());
+(DataType::Struct(from_fields), DataType::Struct(to_fields)) => {
+// TODO some of this logic may be specific to converting Parquet 
to Spark
+let mut field_name_to_index_map = HashMap::new();
+for (i, field) in from_fields.iter().enumerate() {
+field_name_to_index_map.insert(field.name(), i);
+}
+assert_eq!(field_name_to_index_map.len(), from_fields.len());
+let mut cast_fields: Vec = 
Vec::with_capacity(to_fields.len());

Review Comment:
   So this takes care of reordering of struct fields. Do we have a unit test 
for this?
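   The reordering pattern in the diff can be sketched standalone; `(&str, i32)` pairs stand in for `(Field, ArrayRef)` (an assumption for illustration):

   ```rust
   use std::collections::HashMap;

   // Sketch of the field-reordering logic: build a name -> index map over the
   // source fields, then resolve each target field by name instead of by position.
   fn reorder_by_name(from_fields: &[(&str, i32)], to_field_names: &[&str]) -> Vec<i32> {
       let mut field_name_to_index_map: HashMap<&str, usize> = HashMap::new();
       for (i, &(name, _)) in from_fields.iter().enumerate() {
           field_name_to_index_map.insert(name, i);
       }
       // As in the diff: duplicate source field names would silently drop entries.
       assert_eq!(field_name_to_index_map.len(), from_fields.len());
       to_field_names
           .iter()
           .map(|name| from_fields[field_name_to_index_map[name]].1)
           .collect()
   }
   ```

   A unit test would exercise exactly this: fields arriving as `a, b, c` but requested as `c, a` come back in target order.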






Re: [PR] [comet-parquet-exec] Move type conversion logic for ParquetExec out of Cast expression. [datafusion-comet]

2025-01-07 Thread via GitHub


andygrove commented on PR #1229:
URL: 
https://github.com/apache/datafusion-comet/pull/1229#issuecomment-2576060934

   > Also, can we (for the moment), simply call spark cast directly in parquet 
support instead of duplicating code. Then, we can override the cast operations 
that are parquet specific.
   
   One challenge is that we can't just override specific casts in the parquet 
code because of the recursive nature of casting with complex types, so we will 
end up adding a lot of specializations in the spark_cast code.  It seems better 
to keep spark cast and parquet-to-spark conversion separate. Once we have it 
working we can then review and move any common code out to shared code?





Re: [PR] test: Add plan execution during tests for bounded source [datafusion]

2025-01-07 Thread via GitHub


avkirilishin commented on PR #14013:
URL: https://github.com/apache/datafusion/pull/14013#issuecomment-2575981457

   Sorry for the confusion, that was my mistake. Of course, this PR only 
addresses bounded sources. I’ve updated the title and commit message 
accordingly.
   
   > Do you also plan to add support for unbounded sources?
   
   As mentioned in the 
[comment](https://github.com/apache/datafusion/issues/8230#issuecomment-2208185444),
 the changes for unbounded sources might be more complex, so no, I plan to add 
the test only for bounded sources.





Re: [PR] [comet-parquet-exec] Disable DPP in stability tests when full native scan is enabled [datafusion-comet]

2025-01-07 Thread via GitHub


andygrove commented on code in PR #1230:
URL: https://github.com/apache/datafusion-comet/pull/1230#discussion_r1905927899


##
spark/src/test/scala/org/apache/spark/sql/comet/CometPlanStabilitySuite.scala:
##
@@ -263,11 +263,13 @@ trait CometPlanStabilitySuite extends 
DisableAdaptiveExecutionSuite with TPCDSBa
   classLoader = Thread.currentThread().getContextClassLoader)
 // Disable char/varchar read-side handling for better performance.
 withSQLConf(
+  // Comet does not support DPP yet
+  // https://github.com/apache/datafusion-comet/issues/121
+  SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "false",

Review Comment:
   Updated






[PR] Refactor into `LexOrdering::collapse`, avoid clone [datafusion]

2025-01-07 Thread via GitHub


alamb opened a new pull request, #14038:
URL: https://github.com/apache/datafusion/pull/14038

   ## Which issue does this PR close?
   
   
   
   - Part of https://github.com/apache/datafusion/issues/13748
   
   ## Rationale for this change
   
   While working to encapsulate the sort order code more, I noticed the free 
function `collapse_lex_ordering`, which consumed a `LexOrdering` but wasn't easy 
to find. It also clones expressions unnecessarily. 
   
   ## What changes are included in this PR?
   1. Move code to `LexOrdering::collapse`
   2. Improve docs
   3. Avoid clone
   
   ## Are these changes tested?
   By existing CI
   
   
   ## Are there any user-facing changes?
   No. Despite the function being `pub`, it is in a non-pub module, so it is not 
exposed. You can see that it is not present in the docs:
   
   
https://docs.rs/datafusion/latest/datafusion/index.html?search=collapse_lex_ordering
   





Re: [I] Move CPU Bound Tasks off Tokio Threadpool [datafusion]

2025-01-07 Thread via GitHub


djanderson commented on issue #13692:
URL: https://github.com/apache/datafusion/issues/13692#issuecomment-2576419792

   @tustvold, I gave this a thorough read today and plan to test out the 
approach soon.
   
   The only thing I'm a little flaky on is the use of the call to 
`use_current_thread()` in the rayon ThreadPoolBuilder 
[here](https://github.com/tustvold/io_stall/blob/main/src/tokio.rs#L62). The 
[docs](https://docs.rs/rayon/latest/rayon/struct.ThreadPoolBuilder.html#method.use_current_thread)
 for that method seem pretty nuanced and don't make total sense to me, so I was 
wondering whether this was a small optimization (re-use the existing thread because 
why-not) or whether there was something deeper there with how the task spawned on 
Rayon is allowed to interact with the tokio runtime?
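   Setting the rayon nuances aside, the underlying pattern being discussed can be sketched with the standard library only — this is not the io_stall code, just a minimal illustration where a plain worker thread and channel stand in for a rayon pool thread and an async oneshot:

   ```rust
   use std::sync::mpsc;
   use std::thread;

   // CPU-bound work is handed to a dedicated worker thread; the result comes
   // back over a channel. With rayon/tokio, the worker would be a rayon pool
   // thread and the receive side an async oneshot that the tokio task awaits.
   fn offload_cpu_work(input: Vec<u64>) -> u64 {
       let (tx, rx) = mpsc::channel();
       thread::spawn(move || {
           // This runs off the "async" thread, so it cannot stall the runtime.
           let sum: u64 = input.iter().map(|x| x * x).sum();
           let _ = tx.send(sum);
       });
       // A tokio task would `.await` here instead of blocking.
       rx.recv().expect("worker thread panicked")
   }
   ```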





Re: [PR] fix: yield when the next file is ready to open to prevent CPU starvation [datafusion]

2025-01-07 Thread via GitHub


alamb commented on code in PR #14028:
URL: https://github.com/apache/datafusion/pull/14028#discussion_r1906059983


##
datafusion/core/src/datasource/physical_plan/file_stream.rs:
##
@@ -478,7 +478,12 @@ impl FileStream {
 reader,
 )),
 partition_values,
-}
+};
+// Return control to the runtime 
when we're ready to open the next file
+// to prevent uncancellable 
queries in scenarios with many large files.
+// This functions similarly to a 
`tokio::task::yield_now()`.
+cx.waker().wake_by_ref();
+return Poll::Pending;

Review Comment:
   I wonder if it would be possible to call `yield_now` directly rather than 
trying to insert waker manipulation directly into the file opener
   
   Like could we do something like
   ```rust
   let future = tokio::task::yield_now()
   ```
   
   And return `future.poll_next()` 🤔 
   
   
   For example, what if we made a RecordBatchStream that wrapped another stream 
and called `yield_now` after some number of batches returned?
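   The `wake_by_ref()` + `Poll::Pending` dance in the diff — and what `tokio::task::yield_now()` does under the hood — can be sketched std-only; the tiny busy-polling executor here is just scaffolding to drive the example:

   ```rust
   use std::future::Future;
   use std::pin::Pin;
   use std::ptr;
   use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

   // A future that wakes itself and returns Pending exactly once before
   // completing -- the same yield mechanism as in the PR.
   struct YieldOnce {
       yielded: bool,
   }

   impl Future for YieldOnce {
       type Output = u32;
       fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
           if self.yielded {
               Poll::Ready(42)
           } else {
               self.yielded = true;
               // Ask to be polled again, then return control to the runtime.
               cx.waker().wake_by_ref();
               Poll::Pending
           }
       }
   }

   // No-op waker + busy-polling executor, only to make the sketch runnable.
   fn noop_waker() -> Waker {
       fn clone(_: *const ()) -> RawWaker {
           RawWaker::new(ptr::null(), &VTABLE)
       }
       fn noop(_: *const ()) {}
       static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
       unsafe { Waker::from_raw(RawWaker::new(ptr::null(), &VTABLE)) }
   }

   fn block_on_busy<F: Future>(mut fut: F) -> F::Output {
       let waker = noop_waker();
       let mut cx = Context::from_waker(&waker);
       // Safety: `fut` is a local that is never moved after being pinned here.
       let mut fut = unsafe { Pin::new_unchecked(&mut fut) };
       loop {
           if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
               return v;
           }
       }
   }
   ```

   A stream wrapper along the lines suggested above would hold a counter and interpose this yield step every N batches before delegating to the inner stream's `poll_next`.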
   
   








Re: [PR] [comet-parquet-exec] Move type conversion logic for ParquetExec out of Cast expression. [datafusion-comet]

2025-01-07 Thread via GitHub


mbutrovich commented on PR #1229:
URL: 
https://github.com/apache/datafusion-comet/pull/1229#issuecomment-2576256559

   > Finally, can we include two more things (either in spark_parquet_options 
or in some parquet_conversion_context struct) which has the conversion and type 
promotion options that are also used in 
Java_org_apache_comet_parquet_Native_initColumnReader ?
   
   Could you clarify which fields you want? I looked at adding things like 
decimal precision, expected precision, scale, etc., but those seem like 
individual column properties (more like schema) than a property of the Parquet 
reader.





Re: [I] Add `InSubqueryExec` support [datafusion-comet]

2025-01-07 Thread via GitHub


andygrove commented on issue #121:
URL: 
https://github.com/apache/datafusion-comet/issues/121#issuecomment-2576247721

   I'm going to pick this issue up. I am currently studying Spark's code to 
understand how it works without Comet. 





Re: [PR] feat(optimizer): Enable filter pushdown on window functions [datafusion]

2025-01-07 Thread via GitHub


alamb commented on PR #14026:
URL: https://github.com/apache/datafusion/pull/14026#issuecomment-2576259000

   This looks amazing @nuno-faria  -- thanks you -- I plan to review it 
carefully tomorrow





Re: [PR] Complete encapsulating `OrderingEquivalenceClass` (make fields non pub) [datafusion]

2025-01-07 Thread via GitHub


alamb commented on code in PR #14037:
URL: https://github.com/apache/datafusion/pull/14037#discussion_r1905777670


##
datafusion/physical-expr/src/equivalence/ordering.rs:
##
@@ -53,24 +53,33 @@ impl OrderingEquivalenceClass {
 self.orderings.clear();
 }
 
-/// Creates new ordering equivalence class from the given orderings.
+/// Creates new ordering equivalence class from the given orderings
+///
+/// Any redundant entries are removed, as described on 
[`Self::remove_redundant_entries`].
 pub fn new(orderings: Vec<LexOrdering>) -> Self {
 let mut result = Self { orderings };
 result.remove_redundant_entries();
 result
 }
 
+/// Converts this OrderingEquivalenceClass to a vector of orderings.
+pub fn into_inner(self) -> Vec {
+self.orderings
+}
+
 /// Checks whether `ordering` is a member of this equivalence class.
 pub fn contains(&self, ordering: &LexOrdering) -> bool {
 self.orderings.contains(ordering)
 }
 
 /// Adds `ordering` to this equivalence class.
 #[allow(dead_code)]
+#[deprecated(

Review Comment:
   note this method is only used in tests (and thus needed "allow(dead_code)"). 
It is the same as `add_new_ordering` so I made that explicit and deprecated the 
method and updated the tests. 



##
datafusion/physical-expr/src/equivalence/properties.rs:
##
@@ -2009,16 +2009,8 @@ fn calculate_union_binary(
 // Next, calculate valid orderings for the union by searching for prefixes
 // in both sides.
 let mut orderings = UnionEquivalentOrderingBuilder::new();
-orderings.add_satisfied_orderings(
-lhs.normalized_oeq_class().orderings,
-lhs.constants(),
-&rhs,
-);
-orderings.add_satisfied_orderings(
-rhs.normalized_oeq_class().orderings,
-rhs.constants(),
-&lhs,
-);
+orderings.add_satisfied_orderings(lhs.normalized_oeq_class(), 
lhs.constants(), &rhs);

Review Comment:
   most of the test changes I think actually make the code easier to read -- 
for example `OrderingEquivalenceClass` already implements `IntoIterator` so 
there is no need to explicitly name the inner field






Re: [PR] [comet-parquet-exec] Disable DPP in stability tests when full native scan is enabled [datafusion-comet]

2025-01-07 Thread via GitHub


andygrove merged PR #1230:
URL: https://github.com/apache/datafusion-comet/pull/1230





Re: [PR] chore: extract datetime_funcs expressions to folders based on spark grouping [datafusion-comet]

2025-01-07 Thread via GitHub


andygrove merged PR #1222:
URL: https://github.com/apache/datafusion-comet/pull/1222





Re: [PR] [comet-parquet-exec] Move type conversion logic for ParquetExec out of Cast expression. [datafusion-comet]

2025-01-07 Thread via GitHub


parthchandra commented on PR #1229:
URL: 
https://github.com/apache/datafusion-comet/pull/1229#issuecomment-2576305496

   > > Also, can we (for the moment), simply call spark cast directly in 
parquet support instead of duplicating code. Then, we can override the cast 
operations that are parquet specific.
   > 
   > One challenge is that we can't just override specific casts in the parquet 
code because of the recursive nature of casting with complex types, so we will 
end up adding a lot of specializations in the spark_cast code. It seems better 
to keep spark cast and parquet-to-spark conversion separate. Once we have it 
working we can then review and move any common code out to shared code?
   
   Makes sense. 
   
   Also, `parquet/read/mod.rs` has traits called `PlainDecoding` and 
`PlainDictDecoding` which are implemented for all the combinations of parquet 
type - arrow type needed by Spark. All the implementations are in 
`parquet/read/values.rs` and use copious quantities of unsafe code encapsulated 
in macros. 
   Not sure if our Parquet-Spark cast functions can leverage the macros here 
but it would be nice if we could.





Re: [I] Exponential planning time (100s of seconds) with `UNION` and `ORDER BY` queries [datafusion]

2025-01-07 Thread via GitHub


alamb commented on issue #13748:
URL: https://github.com/apache/datafusion/issues/13748#issuecomment-2576212517

   I have a few PRs open to make the fields non pub. I have also been studying 
the code -- my first POC will be to figure out how to avoid calling 
`normalized_oeq_class()` so much as that is expensive and gets called many 
times. 





[PR] chore: use datafusion from crates.io [datafusion-comet]

2025-01-07 Thread via GitHub


rluvaton opened a new pull request, #1232:
URL: https://github.com/apache/datafusion-comet/pull/1232

   DataFusion 44.0.0 was released to crates.io, so I'm using it instead 
of the release candidate 





Re: [PR] Refactor into `LexOrdering::collapse`, `LexRequirement::collapse` avoid clone [datafusion]

2025-01-07 Thread via GitHub


alamb commented on PR #14038:
URL: https://github.com/apache/datafusion/pull/14038#issuecomment-2576231781

   I pushed 
https://github.com/apache/datafusion/pull/14038/commits/a44acfdb3af5bf0082c277de6ee7e09e92251a49
 to this PR which had the content of the suggestion from @akurmustafa  on 
https://github.com/alamb/datafusion/pull/26





Re: [PR] Encapsulate fields of `EquivalenceProperties` [datafusion]

2025-01-07 Thread via GitHub


alamb commented on code in PR #14040:
URL: https://github.com/apache/datafusion/pull/14040#discussion_r1906019498


##
datafusion/physical-expr/src/equivalence/properties.rs:
##
@@ -124,15 +124,15 @@ use itertools::Itertools;
 /// ```
 #[derive(Debug, Clone)]
 pub struct EquivalenceProperties {
-/// Collection of equivalence classes that store expressions with the same
-/// value.
-pub eq_group: EquivalenceGroup,
-/// Equivalent sort expressions for this table.
-pub oeq_class: OrderingEquivalenceClass,
-/// Expressions whose values are constant throughout the table.
+/// Distinct equivalence classes (exprs known to have the same value)
+eq_group: EquivalenceGroup,

Review Comment:
   here are the changes. I want to make sure that nothing externally can modify 
these fields without going through methods on EquivalenceProperties






[PR] Encapsulate fields of `EquivalenceProperties` [datafusion]

2025-01-07 Thread via GitHub


alamb opened a new pull request, #14040:
URL: https://github.com/apache/datafusion/pull/14040

   ## Which issue does this PR close?
   
   - Part of https://github.com/apache/datafusion/issues/13748
   
   ## Rationale for this change
   
   As a first part of optimizing equivalence / ordering calculations I need to 
ensure the internal invariants are clear do I don't break them during 
optimization
   
   
   ## What changes are included in this PR?
   
   This changes the various fields to be non `pub` 
   
   
   ## Are these changes tested?
   Existing CI
   
   
   ## Are there any user-facing changes?
   As in https://github.com/apache/datafusion/pull/14037, While this is a 
breaking API change (the fields are no longer pub) I think disruptions will be 
minimal





Re: [PR] chore: extract strings file to `strings_func` like in spark grouping [datafusion-comet]

2025-01-07 Thread via GitHub


codecov-commenter commented on PR #1215:
URL: 
https://github.com/apache/datafusion-comet/pull/1215#issuecomment-2576216434

   ## 
[Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1215?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   All modified and coverable lines are covered by tests :white_check_mark:
   > Project coverage is 57.25%. Comparing base 
[(`4cf840f`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/4cf840f571b916b399e6ca6708bf0de4c4c1ca95?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 to head 
[(`feac0b2`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/feac0b290f0c26e2ee710bc06062520238b4868d?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   Additional details and impacted files
   
   
   ```diff
   @@              Coverage Diff               @@
   ##               main     #1215       +/-   ##
   ==============================================
   + Coverage     34.88%    57.25%   +22.37%
   + Complexity      990       958       -32
   ==============================================
     Files           116       113        -3
     Lines         43755     11130    -32625
     Branches       9569      2130     -7439
   ==============================================
   - Hits          15263      6373     -8890
   + Misses        25513      3635    -21878
   + Partials       2979      1122     -1857
   ```
   
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1215?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   





Re: [PR] chore: extract strings file to `strings_func` like in spark grouping [datafusion-comet]

2025-01-07 Thread via GitHub


viirya commented on code in PR #1215:
URL: https://github.com/apache/datafusion-comet/pull/1215#discussion_r1906040640


##
native/spark-expr/src/string_funcs/string_space.rs:
##
@@ -0,0 +1,103 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#![allow(deprecated)]
+
+use crate::kernels::strings::string_space;
+use arrow::record_batch::RecordBatch;
+use arrow_schema::{DataType, Schema};
+use datafusion::logical_expr::ColumnarValue;
+use datafusion_common::DataFusionError;
+use datafusion_physical_expr::PhysicalExpr;
+use std::{
+any::Any,
+fmt::{Display, Formatter},
+hash::Hash,
+sync::Arc,
+};

Review Comment:
   ```suggestion
   };
   
   ```






Re: [PR] [comet-parquet-exec] Move type conversion logic for ParquetExec out of Cast expression. [datafusion-comet]

2025-01-07 Thread via GitHub


parthchandra commented on PR #1229:
URL: 
https://github.com/apache/datafusion-comet/pull/1229#issuecomment-2576342088

   > > Finally, can we include two more things (either in spark_parquet_options 
or in some parquet_conversion_context struct) which has the conversion and type 
promotion options that are also used in 
Java_org_apache_comet_parquet_Native_initColumnReader ?
   > 
   > Could you clarify which fields you have in mind? I looked at adding things 
like decimal precision, expected precision, scale, etc., but those seem like 
individual column properties (more like schema) than a property of the Parquet 
reader. The timestamp conversion options seem to apply to the entire query.
   
   I meant all of those, but I realize that this needs to be per column. We need 
to pass in `useDecimal128` and `useLegacyDateTimestampOrNTZ` at least, unless we 
can access the SQLConf. (The type promotion info can be derived in native, I 
think.)





Re: [PR] chore: Release Ballista 43.0.0 [datafusion-ballista]

2025-01-07 Thread via GitHub


andygrove merged PR #1156:
URL: https://github.com/apache/datafusion-ballista/pull/1156





Re: [I] Ballista 43.0.0 Release [datafusion-ballista]

2025-01-07 Thread via GitHub


andygrove closed issue #974: Ballista 43.0.0 Release
URL: https://github.com/apache/datafusion-ballista/issues/974





Re: [PR] chore: Upgrade to DataFusion 44.0.0 from 44.0.0 RC2 [datafusion-comet]

2025-01-07 Thread via GitHub


andygrove merged PR #1232:
URL: https://github.com/apache/datafusion-comet/pull/1232





Re: [I] Define extension API for user-defined invariants. [datafusion]

2025-01-07 Thread via GitHub


alamb commented on issue #14029:
URL: https://github.com/apache/datafusion/issues/14029#issuecomment-2575052672

   > But that would still leave the logical plan invariant extensions for 
consideration.
   
   I would recommend adding a method to: 
https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.UserDefinedLogicalNode.html
   
   Perhaps even the same API:
   
   ```rust
   pub trait UserDefinedLogicalNode {
       // ...
       fn check_invariants(&self, invariant_level: InvariantLevel) -> Result<()> {
           Ok(())
       }
       // ...
   }
   ```
   
   🤔 
   
   Then you could extend `LogicalPlan::check_invariants` to call that method 
for any `LogicalPlan::UserDefined`
   
   
https://github.com/apache/datafusion/blob/4e877a08d224d992a8cbcc9a14f59468e312b13f/datafusion/expr/src/logical_plan/plan.rs#L1134
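   A std-only sketch of that shape (illustrative names, not DataFusion's real API): a default trait method keeps existing user-defined nodes compiling, and the plan walk calls it on every user-defined node it visits.
   
   ```rust
   // Default trait method: nodes opt in to extra invariants by overriding it.
   trait UserNode {
       fn check_invariants(&self) -> Result<(), String> {
           Ok(()) // default: no extra invariants
       }
   }
   
   struct AlwaysOk;
   impl UserNode for AlwaysOk {}
   
   struct AlwaysBad;
   impl UserNode for AlwaysBad {
       fn check_invariants(&self) -> Result<(), String> {
           Err("invariant violated".to_string())
       }
   }
   
   // Stand-in for LogicalPlan::check_invariants visiting UserDefined nodes.
   fn check_plan(nodes: &[&dyn UserNode]) -> Result<(), String> {
       for node in nodes {
           node.check_invariants()?;
       }
       Ok(())
   }
   ```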




