Re: [I] supports_filters_pushdown is invoked more than once on a single Custom Data Source [datafusion]
cisaacson commented on issue #13994: URL: https://github.com/apache/datafusion/issues/13994#issuecomment-2575344845

> You can also declare that the data source does not support pushing down any filters, and then, within a custom optimization rule similar to PushDownFilter, push the filters to your own data source. If this is a common requirement from the community, we could try to integrate this logic into DataFusion, such as pushing down all supported filters to data sources at once only during the final optimization step.

The issue now is that we want DataFusion to apply the `filters` that are not supported in pushdown, and ignore the ones that have `Exact` support. It seems the current behavior is the best way to do that. And I understand about `&mut self`; that would change a lot of things. Unless I am missing something, short of building our own custom optimizer code it is as good as it can be now.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
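The split described in the comment above (DataFusion re-applies only the filters the source did not claim as `Exact`) can be sketched with a hypothetical stand-in for DataFusion's pushdown enum. The `PushDown` enum and `split_filters` helper below are illustrative only, not DataFusion's API, and the filters are plain strings for brevity:

```rust
// Hypothetical mirror of DataFusion's filter-pushdown variants.
#[derive(Clone, Copy, PartialEq, Debug)]
enum PushDown {
    Exact,       // source fully applies the filter; the plan can drop it
    Inexact,     // source pre-filters, but DataFusion must re-check rows
    Unsupported, // DataFusion evaluates the filter itself
}

// Partition filters into those the source fully handles (`Exact`) and
// those DataFusion must still evaluate in a filter node above the scan.
fn split_filters<'a>(
    filters: &[&'a str],
    support: &[PushDown],
) -> (Vec<&'a str>, Vec<&'a str>) {
    let mut handled_by_source = Vec::new();
    let mut kept_by_datafusion = Vec::new();
    for (filter, s) in filters.iter().zip(support) {
        match s {
            PushDown::Exact => handled_by_source.push(*filter),
            _ => kept_by_datafusion.push(*filter),
        }
    }
    (handled_by_source, kept_by_datafusion)
}

fn main() {
    let filters = ["a > 1", "b = 'x'", "c IS NULL"];
    let support = [PushDown::Exact, PushDown::Unsupported, PushDown::Inexact];
    let (source, df) = split_filters(&filters, &support);
    assert_eq!(source, vec!["a > 1"]);
    assert_eq!(df, vec!["b = 'x'", "c IS NULL"]);
    println!("source: {source:?}, datafusion: {df:?}");
}
```

Note that both `Inexact` and `Unsupported` filters end up re-evaluated by the engine; only `Exact` lets the planner elide the filter entirely.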
[PR] Correctly look for end delimiter dollar quoted string [datafusion-sqlparser-rs]
hansott opened a new pull request, #1650: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1650

Currently the tokenizer throws an error for

```sql
SELECT $abc$x$ab$abc$
```

The logic is also quite difficult to read, so I made it a bit simpler.
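The failing case is worth spelling out: the end delimiter of `$abc$…$abc$` is the complete `$abc$` tag, so the body may freely contain `$` and even prefixes of the tag such as `$ab`. A minimal, hypothetical scanner (not the sqlparser-rs implementation) that searches for the full closing delimiter handles this correctly:

```rust
// Hypothetical scanner for a dollar-quoted string; `input` starts at the
// opening '$'. Returns (tag, body), or None if the literal is unterminated.
fn read_dollar_quoted(input: &str) -> Option<(&str, &str)> {
    let rest = input.strip_prefix('$')?;
    let tag_end = rest.find('$')?; // tag is everything up to the next '$'
    let tag = &rest[..tag_end];
    let body = &rest[tag_end + 1..];
    // Search for the complete closing delimiter `$tag$`, so a body containing
    // `$ab` (a prefix of `$abc$`) does not terminate the literal early.
    let delim = format!("${tag}$");
    let end = body.find(&delim)?;
    Some((tag, &body[..end]))
}

fn main() {
    let (tag, body) = read_dollar_quoted("$abc$x$ab$abc$").unwrap();
    assert_eq!(tag, "abc");
    assert_eq!(body, "x$ab"); // the `$ab` run is part of the body
    println!("tag={tag} body={body}");
}
```

The same logic covers the anonymous form `$$…$$`, where the tag is empty and the delimiter is just `$$`.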
Re: [PR] chore: extract agg_funcs expressions to folders based on spark grouping [datafusion-comet]
andygrove commented on PR #1224: URL: https://github.com/apache/datafusion-comet/pull/1224#issuecomment-2575518378

> Done, I really think some contributor with an access should merge those one by one and fix those conflicts as otherwise it will take a lot of back and forth and the conflicts are really basic

Sure. I'll try and take care of these today.
Re: [PR] Correctly look for end delimiter dollar quoted string [datafusion-sqlparser-rs]
hansott commented on PR #1650: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1650#issuecomment-2575575853

Adding some more tests first.
Re: [PR] MsSQL SET for session params [datafusion-sqlparser-rs]
alamb commented on PR #1646: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1646#issuecomment-2575103207

Hi @yoavcloud -- it looks like this PR has some conflicts -- is there any chance you can resolve them?
Re: [I] sql result discrepancy with sqlite, postgres and duckdb [datafusion]
alamb commented on issue #13780: URL: https://github.com/apache/datafusion/issues/13780#issuecomment-2575115298

First of all, very nice 🕵️ work!

> 'AS REAL' -> 'AS DOUBLE'
>
> This would better match the actual types being tested and would fix many of the failing results. Along with correcting the nullif type behavior, this would fix almost all the remaining tests that have result mismatches with sqlite/postgresql.

As I understand it, this means using f64 for floating point operations in the test. This sounds like a very good and pragmatic thing to do in my mind.

The other potential thing to do might simply be to map `AS REAL` to use `DataType::Float64`, but that is a larger and much more potentially disruptive change.

Thanks again @Omega359
Re: [PR] Unparsing optimized (> 2 inputs) unions [datafusion]
goldmedal commented on PR #14031: URL: https://github.com/apache/datafusion/pull/14031#issuecomment-2575129897

Thanks for working on this, @MohamedAbdeen21. Instead of applying optimization rules in the roundtrip tests, I prefer to construct the expected plan manually, just like other test cases such as https://github.com/apache/datafusion/blob/482b48926a871bf2c39d6808ca217e309c705b03/datafusion/sql/tests/cases/plan_to_sql.rs#L1221. If we add the optimization to the testing, the optimizer's behavior will be limited by the unparser, which we don't expect. (See the previous discussion: https://github.com/apache/datafusion/pull/13267#issuecomment-2482503581)
[PR] enum DropBehavior [datafusion-sqlparser-rs]
stepancheg opened a new pull request, #1648: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1648

There's a `<drop behavior>` definition [in the grammar](https://jakewheat.github.io/sql-overview/sql-2016-foundation-grammar.html#drop-behavior), so a corresponding enum seems appropriate:

```rust
pub enum DropBehavior {
    Restrict,
    Cascade,
}
```

vs

```rust
pub enum ReferentialAction {
    Restrict,
    Cascade,
    SetNull,
    NoAction,
    SetDefault,
}
```

Context: sqlparser does not support `RESTRICT` in `ALTER TABLE DROP COLUMN`. I wanted to add it, but adding `ReferentialAction` there too might be misleading, so here I introduce a new enum.
Re: [PR] Improve performance of `reverse` function [datafusion]
tlm365 commented on code in PR #14025: URL: https://github.com/apache/datafusion/pull/14025#discussion_r1905501646

## datafusion/functions/src/unicode/reverse.rs:

```diff
@@ -116,14 +115,23 @@ pub fn reverse(args: &[ArrayRef]) -> Result<ArrayRef> {
     }
 }
 
-fn reverse_impl<'a, T: OffsetSizeTrait, V: ArrayAccessor<Item = &'a str>>(
+fn reverse_impl<'a, T: OffsetSizeTrait, V: StringArrayType<'a>>(
     string_array: V,
 ) -> Result<ArrayRef> {
-    let result = ArrayIter::new(string_array)
-        .map(|string| string.map(|string: &str| string.chars().rev().collect::<String>()))
-        .collect::<GenericStringArray<T>>();
+    let mut builder: GenericStringBuilder<T> =
+        GenericStringBuilder::with_capacity(string_array.len(), 1024);
```

Review Comment:

> > I wonder if [this](https://github.com/apache/arrow-rs/blob/4f1f6e57c568fae8233ab9da7d7c7acdaea4112a/arrow-array/src/array/mod.rs#L335-L338) is a good alternative to `1024`? 🤔
>
> Yes, I think so

Unfortunately, it's a bit slower than the current implementation. It allocates more memory than I expected: `string_array.get_array_memory_size()` = `16632` when `str_len = 1024`, and = `66168` when `str_len = 4096` 🤔
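As a rough illustration of the sizing question being discussed (using plain `std` strings rather than Arrow builders, so the numbers don't map one-to-one onto `GenericStringBuilder`): reversing by `chars().rev()` preserves each string's UTF-8 byte length, so the input's byte length is an exact output capacity, whereas a fixed guess like `1024` can over- or under-allocate:

```rust
// Reverse each string, preallocating the output buffer from the input's
// byte length; reversing char order keeps the UTF-8 byte length unchanged.
fn reverse_strings(input: &[&str]) -> Vec<String> {
    input
        .iter()
        .map(|s| {
            let mut out = String::with_capacity(s.len()); // exact, no regrow
            out.extend(s.chars().rev());
            out
        })
        .collect()
}

fn main() {
    let out = reverse_strings(&["héllo", "datafusion"]);
    assert_eq!(out, vec!["olléh", "noisufatad"]);
    // multi-byte 'é' survives the reversal with the same byte length
    assert_eq!(out[0].len(), "héllo".len());
    println!("{out:?}");
}
```

An Arrow builder amortizes growth across the whole column, so the trade-off there is between one exact upfront allocation and several geometric regrows, which is what the benchmark numbers above are probing.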
Re: [PR] Add support for ClickHouse `FORMAT` on `INSERT` [datafusion-sqlparser-rs]
bombsimon commented on code in PR #1628: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1628#discussion_r1905449986

## src/ast/query.rs:

```diff
@@ -2465,14 +2465,25 @@ impl fmt::Display for GroupByExpr {
 #[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
 #[cfg_attr(feature = "visitor", derive(Visit, VisitMut))]
 pub enum FormatClause {
-    Identifier(Ident),
+    Identifier {
+        ident: Ident,
+        expr: Option<Vec<Expr>>,
+    },
     Null,
 }
```

Review Comment: My thinking here was that they're referred to as the same in the ClickHouse docs, which is the only dialect that uses them, so it would make sense to represent them the same way in the AST.

- [`FORMAT`](https://clickhouse.com/docs/en/sql-reference/statements/select/format) only has one documentation page referring to both use cases.
- [The interface](https://clickhouse.com/docs/en/interfaces/formats) references both of them:

> ClickHouse can accept and return data in various formats. A format supported for input can be used to parse the data provided to `INSERT`s, to perform `SELECT`s from a file-backed table such as File, URL or HDFS, or to read a dictionary. A format supported for output can be used to arrange the results of a `SELECT`, and to perform `INSERT`s into a file-backed table. All format names are case insensitive.

So to me, re-using the same AST type with an optional expression for potential insert data still makes sense. Also, double checking if you still don't agree with this: I guess you mean the new type should be called `InputFormatClause`, since it's the input that holds values and not the output, right?
Re: [I] multiply overflow in stats.rs [datafusion]
LindaSummer commented on issue #13775: URL: https://github.com/apache/datafusion/issues/13775#issuecomment-2575392114

take
Re: [I] multiply overflow in stats.rs [datafusion]
LindaSummer commented on issue #13775: URL: https://github.com/apache/datafusion/issues/13775#issuecomment-2575391891

> Hi Edward - Thanks for taking a look at this! I do not have the ability to do that however if you add a comment here with just the word 'take' a github action will assign it to you

Got it! Thanks very much! 😊 Best Regards, Edward
Re: [PR] Fix error on `array_distinct` when input is empty #13810 [datafusion]
comphead commented on code in PR #14034: URL: https://github.com/apache/datafusion/pull/14034#discussion_r1905726601

## datafusion/functions-nested/src/set_ops.rs:

```diff
@@ -513,9 +513,6 @@ fn general_array_distinct<OffsetSize: OffsetSizeTrait>(
     array: &GenericListArray<OffsetSize>,
     field: &FieldRef,
 ) -> Result<ArrayRef> {
-    if array.len() == 0 {
```

Review Comment: do we really need to remove this check? 🤔 it prevents unnecessary work/allocations if nothing is in the `array`
[I] [comet-parquet-exec] Track remaining test failures in POC 1 & 2 [datafusion-comet]
andygrove opened a new issue, #1228: URL: https://github.com/apache/datafusion-comet/issues/1228

### What is the problem the feature request solves?

I thought it would be useful to have one issue to track all the test failures that we are currently working on resolving.

## POC 1

- "timestamp" test fails with: cannot cast Timestamp(Millisecond, None) to required schema field of type Timestamp(Microsecond, None)
- "timestamp as int96" fails with: cannot cast Timestamp(Nanosecond, None) to required schema field of type Timestamp(Microsecond, Some("UTC"))
- java.lang.IllegalArgumentException: requirement failed: input[0, int, true] IN dynamicpruning#45042 has not finished
- "columnar shuffle on struct including nulls" fails with java.lang.IllegalStateException: Complex type is not supported
- "columnar shuffle on nested struct" fails with java.lang.IllegalStateException: Complex type is not supported

## POC 2

### Describe the potential solution

_No response_

### Additional context

_No response_
Re: [PR] fix: yield when the next file is ready to open to prevent CPU starvation [datafusion]
jeffreyssmith2nd commented on code in PR #14028: URL: https://github.com/apache/datafusion/pull/14028#discussion_r1905736433

## datafusion/core/src/datasource/physical_plan/file_stream.rs:

```diff
@@ -478,7 +478,12 @@ impl FileStream {
                         reader,
                     )),
                     partition_values,
-                }
+                };
+                // Return control to the runtime when we're ready to open the next file
+                // to prevent uncancellable queries in scenarios with many large files.
+                // This functions similarly to a `tokio::task::yield_now()`.
+                cx.waker().wake_by_ref();
+                return Poll::Pending;
```

Review Comment: I thought it was odd too (I had seen this pattern elsewhere), but looking at the implementation of `yield_now`, when you `poll` it does a `context::defer` and then returns `Pending` [(source)](https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/task/yield_now.rs#L57-L59). `context::defer` will directly call `wake_by_ref` if called from outside of the runtime [(source)](https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/runtime/context.rs#L167-177). I could be misunderstanding what "outside the runtime" means in this case/how that will actually interact with the yield.
Re: [PR] chore: extract agg_funcs expressions to folders based on spark grouping [datafusion-comet]
andygrove merged PR #1224: URL: https://github.com/apache/datafusion-comet/pull/1224
Re: [PR] [comet-parquet-exec] fix: fix various bugs in casting between struct types [datafusion-comet]
andygrove merged PR #1226: URL: https://github.com/apache/datafusion-comet/pull/1226
Re: [PR] Support pluralized time units [datafusion-sqlparser-rs]
wugeer commented on code in PR #1630: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1630#discussion_r1905597631

## src/parser/mod.rs:

```diff
@@ -2353,14 +2355,30 @@ impl<'a> Parser<'a> {
             };
             Ok(DateTimeField::Week(week_day))
         }
+        Keyword::WEEKS => {
+            let week_day = if dialect_of!(self is BigQueryDialect)
+                && self.consume_token(&Token::LParen)
+            {
+                let week_day = self.parse_identifier()?;
+                self.expect_token(&Token::RParen)?;
+                Some(week_day)
+            } else {
+                None
+            };
+            Ok(DateTimeField::Weeks(week_day))
+        }
```

Review Comment: @iffyio I found the following dialects support pluralized time units. The interesting thing is that not all dialects support `weeks`.

- snowflake: https://docs.snowflake.com/en/sql-reference/data-types-datetime#interval-examples
- hive: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=27838462#LanguageManualTypes-Intervals
- postgres: https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-INTERVAL-INPUT
- redshift: https://docs.aws.amazon.com/redshift/latest/dg/r_interval_data_types.html#r_interval_data_types-arguments
- duckdb: https://duckdb.org/docs/sql/data_types/interval.html
Re: [PR] Default to ZSTD compression when writing Parquet [datafusion-python]
kevinjqliu commented on code in PR #981: URL: https://github.com/apache/datafusion-python/pull/981#discussion_r1905611807

## python/datafusion/dataframe.py:

```diff
@@ -620,16 +620,25 @@ def write_csv(self, path: str | pathlib.Path, with_header: bool = False) -> None
     def write_parquet(
         self,
         path: str | pathlib.Path,
-        compression: str = "uncompressed",
+        compression: str = "ZSTD",
         compression_level: int | None = None,
```

Review Comment: good point!
Re: [PR] [comet-parquet-exec] fix: fix various bugs in casting between struct types [datafusion-comet]
mbutrovich commented on code in PR #1226: URL: https://github.com/apache/datafusion-comet/pull/1226#discussion_r1905609592

## native/spark-expr/src/cast.rs:

```diff
@@ -817,17 +818,28 @@ fn cast_struct_to_struct(
     cast_options: &SparkCastOptions,
 ) -> DataFusionResult<ArrayRef> {
     match (from_type, to_type) {
-        (DataType::Struct(_), DataType::Struct(to_fields)) => {
-            let mut cast_fields: Vec<(Arc<Field>, ArrayRef)> = Vec::with_capacity(to_fields.len());
+        (DataType::Struct(from_fields), DataType::Struct(to_fields)) => {
+            // TODO some of this logic may be specific to converting Parquet to Spark
+            let mut field_name_to_index_map = HashMap::new();
```

Review Comment: I'm wondering if we'll end up adding a cache in the SchemaAdapter logic for these sorts of maps, since it looks like this would be generated for each batch?

## native/spark-expr/src/schema_adapter.rs:

```diff
@@ -321,10 +322,26 @@ fn cast_supported(from_type: &DataType, to_type: &DataType, options: &SparkCastO
         (Timestamp(_, Some(_)), _) => can_cast_from_timestamp(to_type, options),
         (Utf8 | LargeUtf8, _) => can_cast_from_string(to_type, options),
         (_, Utf8 | LargeUtf8) => can_cast_to_string(from_type, options),
-        (Struct(from_fields), Struct(to_fields)) => from_fields
-            .iter()
-            .zip(to_fields.iter())
-            .all(|(a, b)| cast_supported(a.data_type(), b.data_type(), options)),
+        (Struct(from_fields), Struct(to_fields)) => {
+            // TODO some of this logic may be specific to converting Parquet to Spark
+            let mut field_types = HashMap::new();
+            for field in from_fields {
+                field_types.insert(field.name(), field.data_type());
```

Review Comment: Same caching comment as above, mostly as a reminder to our future selves.
Re: [PR] Default to ZSTD compression when writing Parquet [datafusion-python]
ion-elgreco commented on code in PR #981: URL: https://github.com/apache/datafusion-python/pull/981#discussion_r1905619163

## python/datafusion/dataframe.py:

```diff
@@ -620,16 +620,25 @@ def write_csv(self, path: str | pathlib.Path, with_header: bool = False) -> None
     def write_parquet(
         self,
         path: str | pathlib.Path,
-        compression: str = "uncompressed",
+        compression: str = "ZSTD",
         compression_level: int | None = None,
     ) -> None:
         """Execute the :py:class:`DataFrame` and write the results to a Parquet file.
 
         Args:
             path: Path of the Parquet file to write.
-            compression: Compression type to use.
-            compression_level: Compression level to use.
-        """
+            compression: Compression type to use. Default is "ZSTD".
+            compression_level: Compression level to use. For ZSTD, the
+                recommended range is 1 to 22, with the default being 4. Higher levels
+                provide better compression but slower speed.
+        """
+        if compression == "ZSTD":
```

Review Comment: You might want to reuse this code I added in deltalake a while ago: https://github.com/delta-io/delta-rs/blob/053601b72c2ee053f7ff0309c7d27e122c4e3852/python/deltalake/table.py#L79-L129
[I] Un-cancellable Query when hitting many large files. [datafusion]
jeffreyssmith2nd opened a new issue, #14036: URL: https://github.com/apache/datafusion/issues/14036

### Describe the bug

**TLDR; Reading many large Parquet files can prevent a query from being cancelled.**

We have a customer that is running a query similar to the following (edited for privacy):

```
SELECT DISTINCT "A","B","C","D","E" FROM table where "time" > now() - INTERVAL '5 days';
```

This produces a fairly straightforward plan [explain.txt](https://github.com/user-attachments/files/18323208/explain.txt). Simplified plan:

```
AggregateExec mode=FinalPartitioned
  CoalesceBatchesExec target_batch_size=8192
    RepartitionExec input_partitions=4
      AggregateExec mode=Partial
        ParquetExec file_groups={4 groups}
```

This will read ~85 parquet files at ~100MB each. What we've seen is that even when a query is cancelled, the resources associated with that query (CPU/RAM) are still being utilized for almost the same amount of time as a typical execution of the query.

### To Reproduce

_No response_

### Expected behavior

Cancelling a query should (within some reasonable amount of time) truly cancel the query, freeing up system resources for other query executions.

### Additional context

This appears to be a problem with the interaction between the [`GroupedHashAggregateStream`](https://github.com/apache/arrow-datafusion/blob/9a808eb7f63385e3e3b02051ed30eee91daeb613/datafusion/physical-plan/src/aggregates/row_hash.rs#L620) and [`FileStream`](https://github.com/apache/arrow-datafusion/blob/9a808eb7f63385e3e3b02051ed30eee91daeb613/datafusion/core/src/datasource/physical_plan/file_stream.rs#L246) approaches to yielding.

The `GroupedHashAggregateStream` will loop until its child stream (in this case a `FileStream`) is exhausted, errors, or returns `Pending`. The `FileStream` loops while attempting to read `RecordBatch`es from the underlying file, while also doing some book-keeping to stage the next `File` for reading. This stream will return `Ready` when a `RecordBatch` is processed, or an `Error` is encountered. However, it **does not** return `Pending` at any point. When a file is finished being read, the next file is swapped in and reading continues from the new file.

The combination of these two behaviours means that if there are many large files being read by the `FileStream`, the `GroupedHashAggregateStream` doesn't effectively yield back to Tokio. My [PR](https://github.com/apache/datafusion/pull/14028) to fix this is to have the `FileStream` return `Pending` when swapping over to a new file. This seems like a natural yield point, and resolves the issue with cancellation.
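The core mechanism of the proposed fix (schedule a wake-up, then return `Pending` so the executor regains control and can observe cancellation) can be shown with a self-contained future and a tiny hand-rolled poll loop. This mirrors what `tokio::task::yield_now()` does internally, but none of the code below is DataFusion's or Tokio's:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A future that yields exactly once: on the first poll it wakes itself and
// returns Pending, so the executor regains control; the second poll completes.
struct YieldOnce {
    yielded: bool,
}

impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            cx.waker().wake_by_ref(); // reschedule ourselves before suspending
            Poll::Pending
        }
    }
}

// A no-op waker: sufficient here because our toy loop below simply re-polls.
// A real executor would park until the wake arrives.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// Poll the future to completion, counting how many polls it took.
fn poll_count(mut fut: impl Future<Output = ()> + Unpin) -> u32 {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(()) = Pin::new(&mut fut).poll(&mut cx) {
            return polls;
        }
    }
}

fn main() {
    let polls = poll_count(YieldOnce { yielded: false });
    assert_eq!(polls, 2); // first poll yields (Pending), second completes
    println!("completed after {polls} polls");
}
```

The key detail is calling `wake_by_ref()` *before* returning `Pending`: without it, an executor would never know to poll the stream again, and the task would hang rather than yield.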
Re: [PR] Improve performance of `reverse` function [datafusion]
2010YOUY01 commented on code in PR #14025: URL: https://github.com/apache/datafusion/pull/14025#discussion_r1905442653

## datafusion/functions/src/unicode/reverse.rs:

```diff
@@ -116,14 +115,23 @@ pub fn reverse(args: &[ArrayRef]) -> Result<ArrayRef> {
     }
 }
 
-fn reverse_impl<'a, T: OffsetSizeTrait, V: ArrayAccessor<Item = &'a str>>(
+fn reverse_impl<'a, T: OffsetSizeTrait, V: StringArrayType<'a>>(
     string_array: V,
 ) -> Result<ArrayRef> {
-    let result = ArrayIter::new(string_array)
-        .map(|string| string.map(|string: &str| string.chars().rev().collect::<String>()))
-        .collect::<GenericStringArray<T>>();
+    let mut builder: GenericStringBuilder<T> =
+        GenericStringBuilder::with_capacity(string_array.len(), 1024);
```

Review Comment:

> I wonder if [this](https://github.com/apache/arrow-rs/blob/4f1f6e57c568fae8233ab9da7d7c7acdaea4112a/arrow-array/src/array/mod.rs#L335-L338) is a good alternative to `1024`? 🤔

Yes, I think so
[PR] Chore/single source exec [datafusion]
mertak-synnada opened a new pull request, #14035: URL: https://github.com/apache/datafusion/pull/14035

## Which issue does this PR close?

Closes #13838.

## Rationale for this change

This PR merges all data sources into one execution plan, named `DataSourceExec`, and a single trait named `DataSource`, which is inspired by `DataSink`. This version is not intended to be merged upstream, since it removes ParquetExec, CsvExec, etc., and changes all the tests, but I'm sharing it as-is so that we can see all possible changes. Our main intention is to re-add the old execution plans as deprecated ones and implement this part by part to keep API stability.

## What changes are included in this PR?

## Are these changes tested?

## Are there any user-facing changes?
Re: [PR] Add arrow cast [datafusion-python]
timsaucer merged PR #962: URL: https://github.com/apache/datafusion-python/pull/962
Re: [PR] release datafusion-python 42.1.0 [datafusion-python]
timsaucer commented on PR #930: URL: https://github.com/apache/datafusion-python/pull/930#issuecomment-2575300355

I recommend closing this PR since we have already moved on to 43.
Re: [PR] Fix small issues in pyproject.toml [datafusion-python]
timsaucer merged PR #976: URL: https://github.com/apache/datafusion-python/pull/976
Re: [I] Add type hints and capsule validation for FFI Table providers [datafusion-python]
timsaucer closed issue #950: Add type hints and capsule validation for FFI Table providers URL: https://github.com/apache/datafusion-python/issues/950
Re: [PR] FLUP #13810 [datafusion]
alamb commented on PR #14034: URL: https://github.com/apache/datafusion/pull/14034#issuecomment-2575369809 Thanks @cht42 ! What does FLUP stand for 🤔 My google fu doesn't seem to be able to find anything relevant: https://www.google.com/search?q=flup&oq=flup
Re: [PR] Improve performance of `reverse` function [datafusion]
tlm365 commented on code in PR #14025: URL: https://github.com/apache/datafusion/pull/14025#discussion_r1905501646 ## datafusion/functions/src/unicode/reverse.rs: ## @@ -116,14 +115,23 @@ pub fn reverse(args: &[ArrayRef]) -> Result { } } -fn reverse_impl<'a, T: OffsetSizeTrait, V: ArrayAccessor>( +fn reverse_impl<'a, T: OffsetSizeTrait, V: StringArrayType<'a>>( string_array: V, ) -> Result { -let result = ArrayIter::new(string_array) -.map(|string| string.map(|string: &str| string.chars().rev().collect::())) -.collect::>(); +let mut builder: GenericStringBuilder = +GenericStringBuilder::with_capacity(string_array.len(), 1024); Review Comment: > Yes, I think so Unfortunately, it's a bit slower than the current implementation. It allocates more mem than I expected, `string_array.get_array_memory_size()` = `16632` when `str_len = 1024`, and = `66168` when `str_len = 4096` 🤔
Re: [I] multiply overflow in stats.rs [datafusion]
Omega359 commented on issue #13775: URL: https://github.com/apache/datafusion/issues/13775#issuecomment-2575373984 Hi Edward - Thanks for taking a look at this! I do not have the ability to do that; however, if you add a comment here with just the word 'take', a github action will assign it to you.
[PR] Complete encapsulating `OrderingEquivalenceClass` (make fields non pub) [datafusion]
alamb opened a new pull request, #14037: URL: https://github.com/apache/datafusion/pull/14037 ## Which issue does this PR close? - Part of https://github.com/apache/datafusion/issues/13748 ## Rationale for this change As a first part of optimizing equivalence / ordering calculations, I need to ensure the internal invariants are clear so I don't break them during optimization. ## What changes are included in this PR? This changes the members of the equivalence classes to be non-pub and documents what they do better. ## Are these changes tested? ## Are there any user-facing changes? While this is a breaking API change (the field is no longer pub), I hope the additional documentation makes migration as painless as possible.
Re: [PR] Complete encapsulating `OrderingEquivalenceClass` (make fields non pub) [datafusion]
alamb commented on code in PR #14037: URL: https://github.com/apache/datafusion/pull/14037#discussion_r1905776819 ## datafusion/physical-expr/src/equivalence/ordering.rs: ## @@ -39,7 +39,7 @@ use arrow_schema::SortOptions; /// ordering. In this case, we say that these orderings are equivalent. #[derive(Debug, Clone, Eq, PartialEq, Hash, Default)] pub struct OrderingEquivalenceClass { -pub orderings: Vec, +orderings: Vec, Review Comment: This is the key change -- making this field non-pub -- the rest of the PR is derived from that and added comments
[PR] Add support for MS-SQL BEGIN/END TRY/CATCH [datafusion-sqlparser-rs]
yoavcloud opened a new pull request, #1649: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1649 (no comment)
Re: [PR] feat: support enable_url_table config [datafusion-python]
chenkovsky commented on PR #980: URL: https://github.com/apache/datafusion-python/pull/980#issuecomment-2575455417 > Thank you for the addition! Would it be simpler to just expose the function on the session context instead of making it part of the initialization? Even if you do think it's important to have it in initialization, we probably should expose the function also to maintain parity with the rust version. updated
Re: [PR] Add support for ClickHouse `FORMAT` on `INSERT` [datafusion-sqlparser-rs]
bombsimon commented on code in PR #1628: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1628#discussion_r1905454728 ## src/keywords.rs: ## @@ -931,7 +931,7 @@ pub const RESERVED_FOR_TABLE_ALIAS: &[Keyword] = &[ Keyword::PREWHERE, // for ClickHouse SELECT * FROM t SETTINGS ... Keyword::SETTINGS, -// for ClickHouse SELECT * FROM t FORMAT... +// for ClickHouse SELECT * FROM t FORMAT... or INSERT INTO t FORMAT... Review Comment: Removed them completely for both settings and format.
Re: [PR] Feature/single source exec [datafusion]
mertak-synnada closed pull request #14035: Feature/single source exec URL: https://github.com/apache/datafusion/pull/14035
Re: [PR] chore: deprecate `ValuesExec` in favour of `MemoryExec` [datafusion]
alamb commented on code in PR #14032: URL: https://github.com/apache/datafusion/pull/14032#discussion_r1905509497 ## datafusion/physical-plan/src/values.rs: ## @@ -34,6 +34,7 @@ use datafusion_execution::TaskContext; use datafusion_physical_expr::EquivalenceProperties; /// Execution plan for values list based relation (produces constant rows) +#[deprecated(since = "44.0.0", note = "Use `MemoryExec::try_new_as_values` instead")] Review Comment: Since we have already released 44, I think the correct version is the next unreleased one, 45 ```suggestion #[deprecated(since = "45.0.0", note = "Use `MemoryExec::try_new_as_values` instead")] ``` ## datafusion/core/src/physical_planner.rs: ## @@ -466,6 +467,7 @@ impl DefaultPhysicalPlanner { .collect::>>>() }) .collect::>>()?; +#[allow(deprecated)] // TODO: Remove in favour of MemoryExec Review Comment: In this PR I think it would be good to port this code (in the physical planner) to use `MemoryExec` -- that way queries will run through `MemoryExec` and we will have confidence that `ValuesExec` can really be removed
Re: [PR] Feature/single source exec [datafusion]
mertak-synnada commented on PR #14035: URL: https://github.com/apache/datafusion/pull/14035#issuecomment-2575383133 Closed since this PR was opened prematurely
Re: [PR] Fix error on `array_distinct` when input is empty #13810 [datafusion]
cht42 commented on PR #14034: URL: https://github.com/apache/datafusion/pull/14034#issuecomment-2575530332 > Thanks @cht42 ! > > What does FLUP stand for 🤔 My google fu doesn't seem to be able to find anything relevant: https://www.google.com/search?q=flup&oq=flup Means: **F**o**L**low **UP**. Maybe just used at my company internally 😅
Re: [PR] Simplify error handling in case.rs (#13990) [datafusion]
alamb commented on PR #14033: URL: https://github.com/apache/datafusion/pull/14033#issuecomment-2575793893 Thank you @cj-zhukov ! It seems as if there is a compile error in this PR: https://github.com/apache/datafusion/actions/runs/12646530220/job/35237388507?pr=14033 ``` error[E0433]: failed to resolve: use of undeclared type `DataFusionError` --> datafusion/physical-expr/src/expressions/case.rs:407:13 | 407 | DataFusionError::Context( | ^^^ use of undeclared type `DataFusionError` | help: consider importing this enum | 18 + use datafusion_common::DataFusionError; | Checking datafusion-sql v44.0.0 (/__w/datafusion/datafusion/datafusion/sql) Checking datafusion-functions v44.0.0 (/__w/datafusion/datafusion/datafusion/functions) For more information about this error, try `rustc --explain E0433`. error: could not compile `datafusion-physical-expr` (lib) due to 1 previous error warning: build failed, waiting for other jobs to finish... Error: Process completed with exit code 101. ```
Re: [PR] fix: yield when the next file is ready to open to prevent CPU starvation [datafusion]
crepererum commented on code in PR #14028: URL: https://github.com/apache/datafusion/pull/14028#discussion_r1905787146 ## datafusion/core/src/datasource/physical_plan/file_stream.rs: ## @@ -478,7 +478,12 @@ impl FileStream { reader, )), partition_values, -} +}; +// Return control to the runtime when we're ready to open the next file +// to prevent uncancellable queries in scenarios with many large files. +// This functions similarly to a `tokio::task::yield_now()`. +cx.waker().wake_by_ref(); +return Poll::Pending; Review Comment: For the multi-thread RT (which is the only thing we really care about, I think it's NOT instantly calling `wake_by_ref` instantly: 1. `yield_now`: https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/task/yield_now.rs#L57 2. `context::defer`: https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/runtime/context.rs#L166-L177 3. `with_scheduler`: https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/runtime/context.rs#L183-L195 4. (via some indirection) `Context::defer`: https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/runtime/scheduler/mod.rs#L287-L289 5. (via some indirection) `Context::defer` (different `Context` this time): https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/runtime/scheduler/multi_thread/worker.rs#L766-L768 6. `Defer::defer`: https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/runtime/scheduler/defer.rs#L15-L26 7. The deferred threads are woken in two places ([1](https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/runtime/scheduler/multi_thread/worker.rs#L513-L516), [2](https://github.com/tokio-rs/tokio/blob/bd3e8577377a2b684b50fc0cb50d98f03ad09703/tokio/src/runtime/scheduler/multi_thread/worker.rs#L751)) for which the 2nd one seems relevant. 
The actual waking will happen when the worker is -- I think -- unparked after a "park timeout". This is definitely long after `Pending` is returned, because the thread that is returning `Pending` is the very same worker, i.e. it cannot possibly be parked at this point in time.
Re: [PR] Add support for ClickHouse `FORMAT` on `INSERT` [datafusion-sqlparser-rs]
bombsimon commented on code in PR #1628: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1628#discussion_r1905449986 ## src/ast/query.rs: ## @@ -2465,14 +2465,25 @@ impl fmt::Display for GroupByExpr { #[cfg_attr(feature = "serde", derive(Serialize, Deserialize))] #[cfg_attr(feature = "visitor", derive(Visit, VisitMut))] pub enum FormatClause { -Identifier(Ident), +Identifier { +ident: Ident, +expr: Option>, +}, Null, } Review Comment: My thinking here was that they're referred to as the same in the ClickHouse docs, which is the only dialect that uses them, so it would make sense to represent it the same way in the ast. - [`FORMAT`](https://clickhouse.com/docs/en/sql-reference/statements/select/format) only has one documentation page referring to both types - [The interface](https://clickhouse.com/docs/en/interfaces/formats) references both of them: > ClickHouse can accept and return data in various formats. A format supported for input can be used to parse the data provided to `INSERT`s, to perform `SELECT`s from a file-backed table such as File, URL or HDFS, or to read a dictionary. A format supported for output can be used to arrange the results of a `SELECT`, and to perform `INSERT`s into a file-backed table. All format names are case insensitive. So to me re-using the same ast type with an optional expression for potential insert data still makes sense. Also double checking if you still don't agree with this, I guess you mean that the new type should be called `InputFormatClause` since it's the input that holds values and not the output, right?
Re: [PR] MsSQL SET for session params [datafusion-sqlparser-rs]
alamb commented on code in PR #1646: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1646#discussion_r1905447833 ## src/parser/mod.rs: ## @@ -10307,11 +10307,67 @@ impl<'a> Parser<'a> { snapshot: None, session: false, }) +} else if self.dialect.supports_set_stmt_without_operator() { +self.prev_token(); +self.parse_set_session_params() } else { self.expected("equals sign or TO", self.peek_token()) } } +pub fn parse_set_session_params(&mut self) -> Result { +let names = self.parse_comma_separated(Parser::parse_set_session_param_name)?; + +let last_name = if let Some(last_name) = names.last() { +last_name.to_uppercase() +} else { +return self.expected("Session param name", self.peek_token()); +}; + +let identity_insert_obj = if last_name == "IDENTITY_INSERT" { +Some(self.parse_object_name(false)?) +} else { +None +}; + +let offsets_keywords = if last_name == "OFFSETS" { +Some(self.parse_comma_separated(|parser| { +let next_token = parser.next_token(); +match &next_token.token { +Token::Word(w) => Ok(w.to_string()), +_ => parser.expected("SQL keyword", next_token), +} +})?) 
+} else { +None +}; + +let value = self.parse_expr()?.to_string(); +Ok(Statement::SetSessionParam { +names, +identity_insert_obj, +offsets_keywords, +value, +}) +} + +pub fn parse_set_session_param_name(&mut self) -> Result { +if self.parse_keywords(&[Keyword::STATISTICS, Keyword::IO]) { +return Ok("STATISTICS IO".to_string()); +} else if self.parse_keywords(&[Keyword::STATISTICS, Keyword::XML]) { +return Ok("STATISTICS XML".to_string()); +} else if self.parse_keywords(&[Keyword::STATISTICS, Keyword::PROFILE]) { +return Ok("STATISTICS PROFILE".to_string()); +} else if self.parse_keywords(&[Keyword::STATISTICS, Keyword::TIME]) { +return Ok("STATISTICS TIME".to_string()); +} +let next_token = self.next_token(); +if let Token::Word(w) = next_token.token { +return Ok(w.to_string()); +} +self.expected("Session param name", next_token) Review Comment: maybe it would be nice to list the expected keywords (IO/XML/PROFILE/TIME, etc) here ## src/parser/mod.rs: ## @@ -10307,11 +10307,67 @@ impl<'a> Parser<'a> { snapshot: None, session: false, }) +} else if self.dialect.supports_set_stmt_without_operator() { +self.prev_token(); +self.parse_set_session_params() } else { self.expected("equals sign or TO", self.peek_token()) } } +pub fn parse_set_session_params(&mut self) -> Result { +let names = self.parse_comma_separated(Parser::parse_set_session_param_name)?; + +let last_name = if let Some(last_name) = names.last() { +last_name.to_uppercase() +} else { +return self.expected("Session param name", self.peek_token()); +}; + +let identity_insert_obj = if last_name == "IDENTITY_INSERT" { +Some(self.parse_object_name(false)?) +} else { +None +}; + +let offsets_keywords = if last_name == "OFFSETS" { +Some(self.parse_comma_separated(|parser| { +let next_token = parser.next_token(); +match &next_token.token { +Token::Word(w) => Ok(w.to_string()), +_ => parser.expected("SQL keyword", next_token), +} +})?) 
+} else { +None +}; + +let value = self.parse_expr()?.to_string(); +Ok(Statement::SetSessionParam { +names, +identity_insert_obj, +offsets_keywords, +value, +}) +} + +pub fn parse_set_session_param_name(&mut self) -> Result { +if self.parse_keywords(&[Keyword::STATISTICS, Keyword::IO]) { Review Comment: Given these are hard coded it seems like it would be cleaner to eventually return an enum for what setting was set rather than arbitrary strings that the using library needs to parse / interpret. However, I also think we could do that as a follow on PR
Re: [PR] MsSQL SET for session params [datafusion-sqlparser-rs]
alamb commented on PR #1646: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1646#issuecomment-2575293097 fyi @iffyio
Re: [PR] Support async iteration of RecordBatchStream [datafusion-python]
timsaucer commented on code in PR #975: URL: https://github.com/apache/datafusion-python/pull/975#discussion_r1905447211 ## python/datafusion/record_batch.py: ## @@ -59,18 +59,22 @@ def __init__(self, record_batch_stream: df_internal.RecordBatchStream) -> None: def next(self) -> RecordBatch | None: Review Comment: I think we can update the return type hint to just `RecordBatch`
Re: [PR] Added references to IDE documentation for dev containers [datafusion]
alamb commented on PR #14014: URL: https://github.com/apache/datafusion/pull/14014#issuecomment-2575659015 Thanks @Omega359 and @goldmedal
Re: [PR] Added references to IDE documentation for dev containers [datafusion]
alamb merged PR #14014: URL: https://github.com/apache/datafusion/pull/14014
Re: [I] Document how to use `.devcontainer` [datafusion]
alamb closed issue #13969: Document how to use `.devcontainer` URL: https://github.com/apache/datafusion/issues/13969
Re: [PR] fix: yield when the next file is ready to open to prevent CPU starvation [datafusion]
crepererum commented on code in PR #14028: URL: https://github.com/apache/datafusion/pull/14028#discussion_r1905684304 ## datafusion/core/src/datasource/physical_plan/file_stream.rs: ## @@ -478,7 +478,12 @@ impl FileStream { reader, )), partition_values, -} +}; +// Return control to the runtime when we're ready to open the next file +// to prevent uncancellable queries in scenarios with many large files. +// This functions similarly to a `tokio::task::yield_now()`. +cx.waker().wake_by_ref(); +return Poll::Pending; Review Comment: I think this may not be correct. The `wake_by_ref` should be called AFTER returning `Pending`. I think you may be lucky that this works, but it's undefined behavior. If I understand the `yield_now` impl. correctly, it tries to delay the "wake" call.
Re: [PR] Add support for ClickHouse `FORMAT` on `INSERT` [datafusion-sqlparser-rs]
bombsimon commented on code in PR #1628: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1628#discussion_r1905462357 ## src/ast/dml.rs: ## @@ -547,7 +561,15 @@ impl Display for Insert { write!(f, "{source}")?; } -if self.source.is_none() && self.columns.is_empty() { +if let Some(settings) = &self.settings { +write!(f, "SETTINGS {} ", display_comma_separated(settings))?; +} + +if let Some(format_clause) = &self.format_clause { +write!(f, "{format_clause}")?; +} + +if self.source.is_none() && self.columns.is_empty() && self.format_clause.is_none() { Review Comment: Well kinda, if we have a format clause we don't want to add `DEFAULT VALUES` which we would otherwise since `source` and `columns` are none. However after rebasing this on `main` we now have an if-else chain that we can use so we now have a separate branch for format clause being `Some` and another for columns. I also changed the branch for `columns` since it no longer needs to check `source` as well in e75aafe.
Re: [PR] feat: support enable_url_table config [datafusion-python]
timsaucer commented on PR #980: URL: https://github.com/apache/datafusion-python/pull/980#issuecomment-2575307105 Thank you for the addition! Would it be simpler to just expose the function on the session context instead of making it part of the initialization?
Re: [PR] chore: set validation and type hint for ffi tableprovider [datafusion-python]
timsaucer merged PR #983: URL: https://github.com/apache/datafusion-python/pull/983
[PR] [comet-parquet-exec] Move type conversion logic for ParquetExec out of Cast expression. [datafusion-comet]
mbutrovich opened a new pull request, #1229: URL: https://github.com/apache/datafusion-comet/pull/1229

(no comment)
Re: [PR] Support pluralized time units [datafusion-sqlparser-rs]
iffyio commented on code in PR #1630: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1630#discussion_r1905821060

## tests/sqlparser_common.rs:

```diff
@@ -5374,10 +5396,49 @@ fn parse_interval_all() {
     verified_only_select("SELECT INTERVAL '1' MINUTE TO SECOND");
     verified_only_select("SELECT INTERVAL 1 YEAR");
     verified_only_select("SELECT INTERVAL 1 MONTH");
+    verified_only_select("SELECT INTERVAL 1 WEEK");
     verified_only_select("SELECT INTERVAL 1 DAY");
     verified_only_select("SELECT INTERVAL 1 HOUR");
     verified_only_select("SELECT INTERVAL 1 MINUTE");
     verified_only_select("SELECT INTERVAL 1 SECOND");
+    verified_only_select("SELECT INTERVAL 1 YEARS");
+    verified_only_select("SELECT INTERVAL 1 MONTHS");
+    verified_only_select("SELECT INTERVAL 1 WEEKS");
+    verified_only_select("SELECT INTERVAL 1 DAYS");
+    verified_only_select("SELECT INTERVAL 1 HOURS");
+    verified_only_select("SELECT INTERVAL 1 MINUTES");
+    verified_only_select("SELECT INTERVAL 1 SECONDS");
+    verified_only_select(
+        "SELECT '2 years 15 months 100 weeks 99 hours 123456789 milliseconds'::INTERVAL",
+    );
+
+    // keep Generic/BigQuery extract week with weekday syntax success
+    let supported_dialects = TestedDialects::new(vec![
+        Box::new(GenericDialect {}),
+        Box::new(BigQueryDialect {}),
+    ]);
+
+    let sql = "SELECT EXTRACT(WEEK(a) FROM date)";
+    supported_dialects.verified_stmt(sql);
```

Review Comment: Oh, not sure we need to test this here; it's [already tested here](https://github.com/apache/datafusion-sqlparser-rs/blob/ba3acb243007555bfc90885fcc9aad52642c06ca/tests/sqlparser_common.rs#L2181-L2193) for example.

## tests/sqlparser_common.rs (continuation of the same hunk):

```diff
+    let all_other_dialects = TestedDialects::new(vec![
+        Box::new(PostgreSqlDialect {}),
+        Box::new(MsSqlDialect {}),
+        Box::new(AnsiDialect {}),
+        Box::new(SnowflakeDialect {}),
+        Box::new(HiveDialect {}),
+        Box::new(RedshiftSqlDialect {}),
+        Box::new(MySqlDialect {}),
+        Box::new(SQLiteDialect {}),
+        Box::new(DuckDbDialect {}),
+    ]);
+
+    assert_eq!(
+        ParserError::ParserError("Expected 'FROM' or ','".to_owned()),
+        all_other_dialects
+            .parse_sql_statements("SELECT EXTRACT(WEEK(a) FROM date)")
+            .unwrap_err()
+    );
```

Review Comment: Similarly, I think we can skip this test; if a dialect doesn't support a feature, no need to test that feature with the dialect.

## src/parser/mod.rs:

```diff
@@ -2353,14 +2355,30 @@ impl<'a> Parser<'a> {
     };
     Ok(DateTimeField::Week(week_day))
 }
+Keyword::WEEKS => {
+    let week_day = if dialect_of!(self is BigQueryDialect)
+        && self.consume_token(&Token::LParen)
+    {
+        let week_day = self.parse_identifier()?;
+        self.expect_token(&Token::RParen)?;
+        Some(week_day)
+    } else {
+        None
+    };
+    Ok(DateTimeField::Weeks(week_day))
+}
```

Review Comment: That sounds reasonable, thanks for checking! It seems like `Weeks(Option<Ident>)` isn't supported by any dialect? I'm thinking if so we can change that to `Weeks`?
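One way to think about pluralized units is as a normalization from plural keyword to the singular the parser already handles. This is an illustrative sketch only — the actual PR adds separate `Keyword::WEEKS`-style match arms rather than normalizing strings.

```rust
// Hypothetical helper: map plural interval-unit spellings to the
// singular forms the existing DateTimeField mapping understands.
fn normalize_unit(unit: &str) -> String {
    let u = unit.to_ascii_uppercase();
    match u.as_str() {
        "YEARS" => "YEAR".into(),
        "MONTHS" => "MONTH".into(),
        "WEEKS" => "WEEK".into(),
        "DAYS" => "DAY".into(),
        "HOURS" => "HOUR".into(),
        "MINUTES" => "MINUTE".into(),
        "SECONDS" => "SECOND".into(),
        _ => u, // already singular (or unknown): pass through
    }
}

fn main() {
    assert_eq!(normalize_unit("weeks"), "WEEK");
    assert_eq!(normalize_unit("DAY"), "DAY");
    println!("ok");
}
```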
Re: [PR] Start new line if \r in Postgres dialect [datafusion-sqlparser-rs]
iffyio merged PR #1647: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1647
Re: [PR] feat: add support for array_remove expression [datafusion-comet]
jatin510 commented on PR #1179: URL: https://github.com/apache/datafusion-comet/pull/1179#issuecomment-2575891718

> There seems to be a difference in null handling between DataFusion and Spark that is causing the tests to fail.
>
> This test passes:
>
> ```scala
> checkSparkAnswerAndOperator(sql("SELECT array_remove(array(_2, _3,_4), _2) from t1 where _2 is not null"))
> ```
>
> This test fails for me locally and also in CI:
>
> ```scala
> checkSparkAnswerAndOperator(sql("SELECT array_remove(array(_2, _3,_4), _2) from t1"))
> ```

fixed it @andygrove
Re: [PR] [comet-parquet-exec] Disable DPP in stability tests [datafusion-comet]
parthchandra commented on code in PR #1230: URL: https://github.com/apache/datafusion-comet/pull/1230#discussion_r1905896252

## spark/src/test/scala/org/apache/spark/sql/comet/CometPlanStabilitySuite.scala:

```diff
@@ -263,11 +263,13 @@ trait CometPlanStabilitySuite extends DisableAdaptiveExecutionSuite with TPCDSBa
     classLoader = Thread.currentThread().getContextClassLoader)
   // Disable char/varchar read-side handling for better performance.
   withSQLConf(
+    // Comet does not yet support DPP yet
+    // https://github.com/apache/datafusion-comet/issues/121
+    SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "false",
```

Review Comment: Should we do this only if the parquet implementation is `native_full` or `native_recordbatch`?
[PR] [comet-parquet-exec] Disable DPP in stability tests [datafusion-comet]
andygrove opened a new pull request, #1230: URL: https://github.com/apache/datafusion-comet/pull/1230

## Which issue does this PR close?

Closes #.

## Rationale for this change

## What changes are included in this PR?

## How are these changes tested?
Re: [I] FFI Execution Plans that spawn threads panic [datafusion]
timsaucer commented on issue #13851: URL: https://github.com/apache/datafusion/issues/13851#issuecomment-2575975404

Hi Kevin, I'm back to work this week but have a bit of a backlog right now to get through. For the segmentation error above, it is probably because you would need to build datafusion-python with that same fix to the FFI API - it is a breaking change. I have a dev setup on my machine where I should be able to test this. We can potentially get you a workaround depending on where you are spawning that task from. The ideal solution is to get my PR here merged in, upgrade datafusion-python, and then use that. It might be a month or so until the next datafusion version comes out, though.
Re: [PR] Unparsing optimized (> 2 inputs) unions [datafusion]
MohamedAbdeen21 commented on PR #14031: URL: https://github.com/apache/datafusion/pull/14031#issuecomment-2576003567

Thanks for the suggestion @goldmedal, updated accordingly
[PR] Encapsulate fields of `EquivalenceGroup` [datafusion]
alamb opened a new pull request, #14039: URL: https://github.com/apache/datafusion/pull/14039

## Which issue does this PR close?

- Part of https://github.com/apache/datafusion/issues/13748

## Rationale for this change

As a first part of optimizing equivalence / ordering calculations, I need to ensure the internal invariants are clear so I don't break them during optimization.

## What changes are included in this PR?

This changes `classes` to be non-pub and adds some conversion methods.

## Are these changes tested?

Existing CI.

## Are there any user-facing changes?

As in https://github.com/apache/datafusion/pull/14037, while this is a breaking API change (the field is no longer pub), I think disruptions will be minimal.
Re: [PR] Encapsulate fields of `EquivalenceGroup` [datafusion]
alamb commented on code in PR #14039: URL: https://github.com/apache/datafusion/pull/14039#discussion_r1906004274

## datafusion/physical-expr/src/equivalence/class.rs:

```diff
@@ -323,11 +323,10 @@ impl Display for EquivalenceClass {
     }
 }

-/// An `EquivalenceGroup` is a collection of `EquivalenceClass`es where each
-/// class represents a distinct equivalence class in a relation.
+/// A collection of distinct `EquivalenceClass`es
 #[derive(Debug, Clone)]
 pub struct EquivalenceGroup {
-    pub classes: Vec<EquivalenceClass>,
+    classes: Vec<EquivalenceClass>,
```

Review Comment: the point of this PR is to remove this `pub`
Re: [PR] Feat/ffi enter tokio runtime [datafusion]
kevinjqliu commented on code in PR #13937: URL: https://github.com/apache/datafusion/pull/13937#discussion_r1906007056

## datafusion/ffi/src/lib.rs:

```diff
@@ -26,5 +26,14 @@ pub mod session_config;
 pub mod table_provider;
 pub mod table_source;

+/// Returns the major version of the FFI implementation. If the API evolves,
+/// we use the major version to identify compatibility over the unsafe
+/// boundary.
+pub extern "C" fn version() -> u64 {
+    let version_str = env!("CARGO_PKG_VERSION");
+    let version = semver::Version::parse(version_str).expect("Invalid version string");
+    version.major
+}
```

Review Comment: I don't see this used anywhere in this PR to guard against the unsafe boundary; please ignore if the PR is still WIP.
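The compatibility check that `version()` enables can be sketched without the `semver` crate: compare only the major component across the FFI boundary. The helper below is hypothetical (the PR itself uses `semver::Version::parse`).

```rust
// Hypothetical stdlib-only stand-in for the semver parse in the diff:
// extract the semver major component from a version string.
fn major_version(version_str: &str) -> u64 {
    version_str
        .split('.')
        .next()
        .and_then(|m| m.parse().ok())
        .expect("invalid version string")
}

fn main() {
    // A host and a plugin would each report this value over the FFI
    // boundary and refuse to interoperate on a mismatch.
    assert_eq!(major_version("44.0.0"), 44);
    println!("ok");
}
```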
Re: [PR] Use partial aggregation schema for spilling to avoid column mismatch in GroupedHashAggregateStream [datafusion]
korowa commented on code in PR #13995: URL: https://github.com/apache/datafusion/pull/13995#discussion_r1905846854

## datafusion/core/src/dataframe/mod.rs:

```diff
@@ -43,6 +38,10 @@ use crate::physical_plan::{
     ExecutionPlan, SendableRecordBatchStream,
 };
 use crate::prelude::SessionContext;
+use std::any::Any;
```

Review Comment: minor: this import reordering can be reverted to leave the file unmodified
Re: [PR] feat: add support for array_remove expression [datafusion-comet]
andygrove commented on code in PR #1179: URL: https://github.com/apache/datafusion-comet/pull/1179#discussion_r1905845993

## native/core/src/execution/planner.rs:

```diff
@@ -735,6 +736,36 @@ impl PhysicalPlanner {
     ));
     Ok(array_has_expr)
 }
+ExprStruct::ArrayRemove(expr) => {
+    let src_array_expr =
+        self.create_expr(expr.left.as_ref().unwrap(), Arc::clone(&input_schema))?;
+    let key_expr =
+        self.create_expr(expr.right.as_ref().unwrap(), Arc::clone(&input_schema))?;
+    let args = vec![Arc::clone(&src_array_expr), Arc::clone(&key_expr)];
+    let return_type = src_array_expr.data_type(&input_schema)?;
+
+    let datafusion_array_remove = array_remove_all_udf();
+
+    let array_remove_expr: Arc<dyn PhysicalExpr> = Arc::new(ScalarFunctionExpr::new(
+        "array_remove",
+        datafusion_array_remove,
+        args,
+        return_type,
+    ));
+    let is_null_expr: Arc<dyn PhysicalExpr> = Arc::new(IsNullExpr::new(key_expr));
+
+    let null_literal_expr: Arc<dyn PhysicalExpr> =
+        Arc::new(Literal::new(ScalarValue::Null));
+
+    let case_expr = CaseExpr::try_new(
+        None,
+        vec![(is_null_expr, null_literal_expr)],
+        Some(array_remove_expr),
+    )?;
+
+    Ok(Arc::new(case_expr))
+    // Ok(array_remove_expr)
```

Review Comment: Could you remove this commented-out code?

```suggestion
```
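The null semantics that the `CaseExpr` in the quoted hunk encodes — Spark's `array_remove` returns NULL when the key is NULL, which the DataFusion UDF alone would not do — can be modeled with plain `Option` values. This is a simplified sketch over `i32` elements, not Arrow arrays.

```rust
// Sketch of the CASE WHEN key IS NULL THEN NULL ELSE array_remove(...) logic:
// a None key short-circuits to None, mirroring the null-literal branch.
fn spark_array_remove(array: &[i32], key: Option<i32>) -> Option<Vec<i32>> {
    let key = key?; // NULL key => NULL result
    Some(array.iter().copied().filter(|v| *v != key).collect())
}

fn main() {
    assert_eq!(spark_array_remove(&[1, 2, 1], Some(1)), Some(vec![2]));
    assert_eq!(spark_array_remove(&[1, 2], None), None);
    println!("ok");
}
```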
Re: [PR] feat: add support for array_remove expression [datafusion-comet]
jatin510 commented on code in PR #1179: URL: https://github.com/apache/datafusion-comet/pull/1179#discussion_r1905885168

(quotes the same `native/core/src/execution/planner.rs` hunk as the previous review comment)

Review Comment: fixed ! 😅
[PR] fix: Set scan implementation choice via environment variable [datafusion-comet]
parthchandra opened a new pull request, #1231: URL: https://github.com/apache/datafusion-comet/pull/1231

Makes it easier to switch scan implementation types during development by setting the environment variable `NATIVE_SCAN_IMPL={native | native_full | native_recordbatch}`. Default remains `native_full`.
Re: [I] FFI Execution Plans that spawn threads panic [datafusion]
kevinjqliu commented on issue #13851: URL: https://github.com/apache/datafusion/issues/13851#issuecomment-2576127057

> For the segmentation error above it is probably because you would need to build datafusion-python with that same fix to the FFI API - it is a breaking change.

That's exactly what I was missing! I was using [`datafusion>=43.1.0` in the pytest env](https://github.com/kevinjqliu/iceberg-rust/pull/3/files#diff-bf903f17157a753e10c075d464383a2e36fc657641d17673f054239f8d1a3999R46).

> The ideal solution is to get my PR here merged in, upgrade datafusion python, and then use that.

Looking forward to it! Please let me know if there's anything I can help out with. Also, do you think it's a good idea to include Delta and Iceberg TableProviders in the `ffitest` modules of your PR?
[PR] chore: Release Ballista 43.0.0 [datafusion-ballista]
milenkovicm opened a new pull request, #1156: URL: https://github.com/apache/datafusion-ballista/pull/1156

# Which issue does this PR close?

Closes #974.

# Rationale for this change

# What changes are included in this PR?

# Are there any user-facing changes?
Re: [PR] Refactor into `LexOrdering::collapse`, `LexRequirement::collapse` avoid clone [datafusion]
alamb commented on code in PR #14038: URL: https://github.com/apache/datafusion/pull/14038#discussion_r1905986687

## datafusion/physical-expr/src/equivalence/ordering.rs:

```diff
@@ -207,19 +212,6 @@ impl IntoIterator for OrderingEquivalenceClass {
     }
 }

-/// This function constructs a duplicate-free `LexOrdering` by filtering out
```

Review Comment: Despite the function being `pub`, it appears to be in a non-pub module, so it is not exposed. You can see that it is not present in the docs: https://docs.rs/datafusion/latest/datafusion/index.html?search=collapse_lex_ordering (when I marked it deprecated, clippy complained it was dead code)
Re: [PR] Refactor into `LexOrdering::collapse`, `LexRequirement::collapse` avoid clone [datafusion]
alamb commented on code in PR #14038: URL: https://github.com/apache/datafusion/pull/14038#discussion_r1905985754

## datafusion/physical-expr/src/equivalence/mod.rs:

```diff
@@ -41,14 +41,9 @@ pub use properties::{
 /// It will also filter out entries that are ordered if the next entry is;
 /// for instance, `vec![floor(a) Some(ASC), a Some(ASC)]` will be collapsed to
 /// `vec![a Some(ASC)]`.
+#[deprecated(since = "45.0.0", note = "Use LexRequirement::collapse")]
```

Review Comment: This function is part of the public API: https://docs.rs/datafusion/latest/datafusion/physical_expr/equivalence/fn.collapse_lex_req.html
Re: [PR] Refactor into `LexOrdering::collapse`, `LexRequirement::collapse` avoid clone [datafusion]
alamb commented on code in PR #14038: URL: https://github.com/apache/datafusion/pull/14038#discussion_r1905987391

## datafusion/physical-expr-common/src/sort_expr.rs:

```diff
@@ -409,6 +409,22 @@ impl LexOrdering {
         .map(PhysicalSortExpr::from)
         .collect()
     }
+
+    /// Collapse a `LexOrdering` into a new duplicate-free `LexOrdering` based on expression.
+    ///
+    /// This function filters duplicate entries that have same physical
+    /// expression inside, ignoring [`SortOptions`]. For example:
+    ///
+    /// `vec![a ASC, a DESC]` collapses to `vec![a ASC]`.
+    pub fn collapse(self) -> Self {
+        let mut output = LexOrdering::default();
+        for item in self {
+            if !output.iter().any(|req| req.expr.eq(&item.expr)) {
+                output.push(item);
```

Review Comment: In the original code this was `item.clone()` -- though it is only cloning an Arc, so the performance benefit is likely minimal
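The duplicate filtering performed by `collapse` can be sketched over plain `(expression, ascending)` pairs — an illustrative stand-in for `PhysicalSortExpr`, not the DataFusion type itself. Because the function consumes its input, each kept item is moved into the output with no clone, which is the point of the PR.

```rust
// Keep the first entry per expression, ignoring the sort option on
// later duplicates: `[a ASC, a DESC]` collapses to `[a ASC]`.
fn collapse(ordering: Vec<(String, bool)>) -> Vec<(String, bool)> {
    let mut output: Vec<(String, bool)> = Vec::new();
    for item in ordering {
        // Compare on the expression only, like `req.expr.eq(&item.expr)`.
        if !output.iter().any(|(expr, _)| *expr == item.0) {
            output.push(item); // moves the item: no clone needed
        }
    }
    output
}

fn main() {
    let collapsed = collapse(vec![("a".into(), true), ("a".into(), false)]);
    assert_eq!(collapsed, vec![("a".to_string(), true)]);
    println!("ok");
}
```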
Re: [PR] Use partial aggregation schema for spilling to avoid column mismatch in GroupedHashAggregateStream [datafusion]
korowa commented on code in PR #13995: URL: https://github.com/apache/datafusion/pull/13995#discussion_r1905842942

## datafusion/core/src/dataframe/mod.rs:

```diff
@@ -2743,6 +2754,143 @@ mod tests {
     Ok(())
 }

+// test for https://github.com/apache/datafusion/issues/13949
+async fn run_test_with_spill_pool_if_necessary(
```

Review Comment: My bad, yes, I meant aggregates/mod.rs
Re: [PR] [comet-parquet-exec] fix: fix various bugs in casting between struct types [datafusion-comet]
parthchandra commented on code in PR #1226: URL: https://github.com/apache/datafusion-comet/pull/1226#discussion_r1905866905

## native/spark-expr/src/cast.rs:

```diff
@@ -817,17 +818,28 @@ fn cast_struct_to_struct(
     cast_options: &SparkCastOptions,
 ) -> DataFusionResult<ArrayRef> {
     match (from_type, to_type) {
-        (DataType::Struct(_), DataType::Struct(to_fields)) => {
-            let mut cast_fields: Vec<(Arc<Field>, ArrayRef)> = Vec::with_capacity(to_fields.len());
+        (DataType::Struct(from_fields), DataType::Struct(to_fields)) => {
+            // TODO some of this logic may be specific to converting Parquet to Spark
+            let mut field_name_to_index_map = HashMap::new();
+            for (i, field) in from_fields.iter().enumerate() {
+                field_name_to_index_map.insert(field.name(), i);
+            }
+            assert_eq!(field_name_to_index_map.len(), from_fields.len());
+            let mut cast_fields: Vec<ArrayRef> = Vec::with_capacity(to_fields.len());
```

Review Comment: So this takes care of reordering of struct fields. Do we have a unit test for this?
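The name-based reordering in the quoted hunk can be sketched with plain values standing in for Arrow fields and arrays. This is a hypothetical helper, not the Comet implementation: target fields are resolved by name against a source-side name-to-index map, so a reordered source struct still casts correctly.

```rust
use std::collections::HashMap;

// Resolve each target field name against the source struct's fields.
// `from` pairs a field name with its (stand-in) column value; a missing
// name yields None, where the real code would fall back to null-filling.
fn reorder_by_name(from: &[(&str, i32)], to_names: &[&str]) -> Option<Vec<i32>> {
    let index: HashMap<&str, usize> =
        from.iter().enumerate().map(|(i, (n, _))| (*n, i)).collect();
    // Like the assert_eq! in the diff: duplicate field names would be a bug.
    assert_eq!(index.len(), from.len());
    to_names
        .iter()
        .map(|n| index.get(n).map(|&i| from[i].1))
        .collect()
}

fn main() {
    let from = [("a", 1), ("b", 2)];
    assert_eq!(reorder_by_name(&from, &["b", "a"]), Some(vec![2, 1]));
    println!("ok");
}
```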
Re: [PR] [comet-parquet-exec] Move type conversion logic for ParquetExec out of Cast expression. [datafusion-comet]
andygrove commented on PR #1229: URL: https://github.com/apache/datafusion-comet/pull/1229#issuecomment-2576060934

> Also, can we (for the moment), simply call spark cast directly in parquet support instead of duplicating code. Then, we can override the cast operations that are parquet specific.

One challenge is that we can't just override specific casts in the parquet code because of the recursive nature of casting with complex types, so we will end up adding a lot of specializations in the spark_cast code. It seems better to keep spark cast and parquet-to-spark conversion separate. Once we have it working we can then review and move any common code out to shared code?
Re: [PR] test: Add plan execution during tests for bounded source [datafusion]
avkirilishin commented on PR #14013: URL: https://github.com/apache/datafusion/pull/14013#issuecomment-2575981457

Sorry for the confusion, that was my mistake. Of course, this PR only addresses bounded sources. I've updated the title and commit message accordingly.

> Do you also plan to add support for unbounded sources?

As mentioned in the [comment](https://github.com/apache/datafusion/issues/8230#issuecomment-2208185444), the changes for unbounded sources might be more complex, so no, I plan to add the test only for bounded sources.
Re: [PR] [comet-parquet-exec] Disable DPP in stability tests when full native scan is enabled [datafusion-comet]
andygrove commented on code in PR #1230: URL: https://github.com/apache/datafusion-comet/pull/1230#discussion_r1905927899

## spark/src/test/scala/org/apache/spark/sql/comet/CometPlanStabilitySuite.scala:

(quotes the same `withSQLConf` hunk as the earlier review comment on this file)

Review Comment: Updated
[PR] Refactor into `LexOrdering::collapse`, avoid clone [datafusion]
alamb opened a new pull request, #14038: URL: https://github.com/apache/datafusion/pull/14038

## Which issue does this PR close?

- Part of https://github.com/apache/datafusion/issues/13748

## Rationale for this change

While working to encapsulate the sort order code more, I noticed the free function `collapse_lex_ordering`, which consumed a `LexOrdering` but wasn't easy to find. It also clones expressions unnecessarily.

## What changes are included in this PR?

1. Move code to `LexOrdering::collapse`
2. Improve docs
3. Avoid clone

## Are these changes tested?

By existing CI.

## Are there any user-facing changes?

No. Despite the function being `pub`, it is in a non-pub module, so it is not exposed. You can see that it is not present in the docs: https://docs.rs/datafusion/latest/datafusion/index.html?search=collapse_lex_ordering
Re: [I] Move CPU Bound Tasks off Tokio Threadpool [datafusion]
djanderson commented on issue #13692: URL: https://github.com/apache/datafusion/issues/13692#issuecomment-2576419792 @tustvold, I gave this a thorough read today and plan to test out the approach soon. The only thing I'm a little flaky on is the use of the call to `use_current_thread()` in the rayon ThreadPoolBuilder [here](https://github.com/tustvold/io_stall/blob/main/src/tokio.rs#L62). The [docs](https://docs.rs/rayon/latest/rayon/struct.ThreadPoolBuilder.html#method.use_current_thread) for that method seem pretty nuanced and don't make total sense to me, so I was wondering if this was a small optimization (re-use the existing thread because why-not) or if there was something deeper there with how the task spawned on Rayon is allowed to interact with the tokio runtime?
Re: [PR] fix: yield when the next file is ready to open to prevent CPU starvation [datafusion]
alamb commented on code in PR #14028: URL: https://github.com/apache/datafusion/pull/14028#discussion_r1906059983 ## datafusion/core/src/datasource/physical_plan/file_stream.rs: ## @@ -478,7 +478,12 @@ impl FileStream { reader, )), partition_values, -} +}; +// Return control to the runtime when we're ready to open the next file +// to prevent uncancellable queries in scenarios with many large files. +// This functions similarly to a `tokio::task::yield_now()`. +cx.waker().wake_by_ref(); +return Poll::Pending; Review Comment: I wonder if it would be possible to call `yield_now` directly rather than trying to insert waker manipulation directly into the file opener Like could we do something like ```rust let future = tokio::task::yield_now() ``` And return `future.poll_next()` 🤔 For example, what if we made a RecordBatchStream that wrapped another stream and called `yield_now` after some number of batches returned?
Re: [PR] [comet-parquet-exec] Move type conversion logic for ParquetExec out of Cast expression. [datafusion-comet]
mbutrovich commented on PR #1229: URL: https://github.com/apache/datafusion-comet/pull/1229#issuecomment-2576256559 > Finally, can we include two more things (either in spark_parquet_options or in some parquet_conversion_context struct) which has the conversion and type promotion options that are also used in Java_org_apache_comet_parquet_Native_initColumnReader ? Could you clarify which fields you want? I looked at adding things like decimal precision, expected precision, scale, etc. but those seem like individual column properties (more like schema) than a property of the Parquet reader.
Re: [I] Add `InSubqueryExec` support [datafusion-comet]
andygrove commented on issue #121: URL: https://github.com/apache/datafusion-comet/issues/121#issuecomment-2576247721 I'm going to pick this issue up. I am currently studying Spark's code to understand how it works without Comet.
Re: [PR] feat(optimizer): Enable filter pushdown on window functions [datafusion]
alamb commented on PR #14026: URL: https://github.com/apache/datafusion/pull/14026#issuecomment-2576259000 This looks amazing @nuno-faria -- thank you -- I plan to review it carefully tomorrow
Re: [PR] Complete encapsulating `OrderingEquivalenceClass` (make fields non pub) [datafusion]
alamb commented on code in PR #14037: URL: https://github.com/apache/datafusion/pull/14037#discussion_r1905777670 ## datafusion/physical-expr/src/equivalence/ordering.rs: ## @@ -53,24 +53,33 @@ impl OrderingEquivalenceClass { self.orderings.clear(); } -/// Creates new ordering equivalence class from the given orderings. +/// Creates new ordering equivalence class from the given orderings +/// +/// Any redundant entries are removed, as described on [`Self::remove_redundant_entries`]. pub fn new(orderings: Vec<LexOrdering>) -> Self { let mut result = Self { orderings }; result.remove_redundant_entries(); result } +/// Converts this OrderingEquivalenceClass to a vector of orderings. +pub fn into_inner(self) -> Vec<LexOrdering> { +self.orderings +} + /// Checks whether `ordering` is a member of this equivalence class. pub fn contains(&self, ordering: &LexOrdering) -> bool { self.orderings.contains(ordering) } /// Adds `ordering` to this equivalence class. #[allow(dead_code)] +#[deprecated( Review Comment: note this method is only used in tests (and thus needed "allow(dead_code)"). It is the same as `add_new_ordering` so I made that explicit and deprecated the method and updated the tests. ## datafusion/physical-expr/src/equivalence/properties.rs: ## @@ -2009,16 +2009,8 @@ fn calculate_union_binary( // Next, calculate valid orderings for the union by searching for prefixes // in both sides. 
let mut orderings = UnionEquivalentOrderingBuilder::new(); -orderings.add_satisfied_orderings( -lhs.normalized_oeq_class().orderings, -lhs.constants(), -&rhs, -); -orderings.add_satisfied_orderings( -rhs.normalized_oeq_class().orderings, -rhs.constants(), -&lhs, -); +orderings.add_satisfied_orderings(lhs.normalized_oeq_class(), lhs.constants(), &rhs); Review Comment: most of the test changes I think actually make the code easier to read -- for example `OrderingEquivalenceClass` already implements `IntoIterator` so there is no need to explicitly name the inner field
Re: [PR] [comet-parquet-exec] Disable DPP in stability tests when full native scan is enabled [datafusion-comet]
andygrove merged PR #1230: URL: https://github.com/apache/datafusion-comet/pull/1230
Re: [PR] chore: extract datetime_funcs expressions to folders based on spark grouping [datafusion-comet]
andygrove merged PR #1222: URL: https://github.com/apache/datafusion-comet/pull/1222
Re: [PR] [comet-parquet-exec] Move type conversion logic for ParquetExec out of Cast expression. [datafusion-comet]
parthchandra commented on PR #1229: URL: https://github.com/apache/datafusion-comet/pull/1229#issuecomment-2576305496 > > Also, can we (for the moment), simply call spark cast directly in parquet support instead of duplicating code. Then, we can override the cast operations that are parquet specific. > > One challenge is that we can't just override specific casts in the parquet code because of the recursive nature of casting with complex types, so we will end up adding a lot of specializations in the spark_cast code. It seems better to keep spark cast and parquet-to-spark conversion separate. Once we have it working we can then review and move any common code out to shared code? Makes sense. Also, `parquet/read/mod.rs` has traits called `PlainDecoding` and `PlainDictDecoding` which are implemented for all the combinations of parquet type - arrow type needed by Spark. All the implementations are in `parquet/read/values.rs` and use copious quantities of unsafe code encapsulated in macros. Not sure if our Parquet-Spark cast functions can leverage the macros here but it would be nice if we could.
Re: [I] Exponential planning time (100s of seconds) with `UNION` and `ORDER BY` queries [datafusion]
alamb commented on issue #13748: URL: https://github.com/apache/datafusion/issues/13748#issuecomment-2576212517 I have a few PRs open to make the fields non pub. I have also been studying the code -- my first POC will be to figure out how to avoid calling `normalized_oeq_class()` so much, as that is expensive and gets called many times.
[PR] chore: use datafusion from crates.io [datafusion-comet]
rluvaton opened a new pull request, #1232: URL: https://github.com/apache/datafusion-comet/pull/1232 DataFusion 44.0.0 was released to crates.io, therefore I'm using it instead of the release candidate
Re: [PR] Refactor into `LexOrdering::collapse`, `LexRequirement::collapse` avoid clone [datafusion]
alamb commented on PR #14038: URL: https://github.com/apache/datafusion/pull/14038#issuecomment-2576231781 I pushed https://github.com/apache/datafusion/pull/14038/commits/a44acfdb3af5bf0082c277de6ee7e09e92251a49 to this PR which had the content of the suggestion from @akurmustafa on https://github.com/alamb/datafusion/pull/26
Re: [PR] Encapsulate fields of `EquivalenceProperties` [datafusion]
alamb commented on code in PR #14040: URL: https://github.com/apache/datafusion/pull/14040#discussion_r1906019498 ## datafusion/physical-expr/src/equivalence/properties.rs: ## @@ -124,15 +124,15 @@ use itertools::Itertools; /// ``` #[derive(Debug, Clone)] pub struct EquivalenceProperties { -/// Collection of equivalence classes that store expressions with the same -/// value. -pub eq_group: EquivalenceGroup, -/// Equivalent sort expressions for this table. -pub oeq_class: OrderingEquivalenceClass, -/// Expressions whose values are constant throughout the table. +/// Distinct equivalence classes (exprs known to have the same expressions) +eq_group: EquivalenceGroup, Review Comment: here are the changes. I want to make sure that nothing externally can modify these fields without going through methods on EquivalenceProperties
[PR] Encapsulate fields of `EquivalenceProperties` [datafusion]
alamb opened a new pull request, #14040: URL: https://github.com/apache/datafusion/pull/14040 ## Which issue does this PR close? - Part of https://github.com/apache/datafusion/issues/13748 ## Rationale for this change As a first part of optimizing equivalence / ordering calculations I need to ensure the internal invariants are clear so I don't break them during optimization ## What changes are included in this PR? This changes the various fields to be non `pub` ## Are these changes tested? Existing CI ## Are there any user-facing changes? As in https://github.com/apache/datafusion/pull/14037, while this is a breaking API change (the fields are no longer pub) I think disruptions will be minimal
Re: [PR] chore: extract strings file to `strings_func` like in spark grouping [datafusion-comet]
codecov-commenter commented on PR #1215: URL: https://github.com/apache/datafusion-comet/pull/1215#issuecomment-2576216434 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1215?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report All modified and coverable lines are covered by tests :white_check_mark: > Project coverage is 57.25%. Comparing base [(`4cf840f`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/4cf840f571b916b399e6ca6708bf0de4c4c1ca95?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`feac0b2`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/feac0b290f0c26e2ee710bc06062520238b4868d?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). Additional details and impacted files

```diff
@@              Coverage Diff              @@
##               main    #1215       +/-   ##
=============================================
+ Coverage     34.88%   57.25%   +22.37%
+ Complexity      990      958       -32
=============================================
  Files           116      113        -3
  Lines         43755    11130    -32625
  Branches       9569     2130     -7439
=============================================
- Hits          15263     6373     -8890
+ Misses        25513     3635    -21878
+ Partials       2979     1122     -1857
```

[:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1215?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
Re: [PR] chore: extract strings file to `strings_func` like in spark grouping [datafusion-comet]
viirya commented on code in PR #1215: URL: https://github.com/apache/datafusion-comet/pull/1215#discussion_r1906040640 ## native/spark-expr/src/string_funcs/string_space.rs: ## @@ -0,0 +1,103 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#![allow(deprecated)] + +use crate::kernels::strings::string_space; +use arrow::record_batch::RecordBatch; +use arrow_schema::{DataType, Schema}; +use datafusion::logical_expr::ColumnarValue; +use datafusion_common::DataFusionError; +use datafusion_physical_expr::PhysicalExpr; +use std::{ +any::Any, +fmt::{Display, Formatter}, +hash::Hash, +sync::Arc, +}; Review Comment: ```suggestion }; ```
Re: [PR] [comet-parquet-exec] Move type conversion logic for ParquetExec out of Cast expression. [datafusion-comet]
parthchandra commented on PR #1229: URL: https://github.com/apache/datafusion-comet/pull/1229#issuecomment-2576342088 > > Finally, can we include two more things (either in spark_parquet_options or in some parquet_conversion_context struct) which has the conversion and type promotion options that are also used in Java_org_apache_comet_parquet_Native_initColumnReader ? > > Could you clarify which fields you have in mind? I looked at adding things like decimal precision, expected precision, scale, etc. but those seem like individual column properties (more like schema) than a property of the Parquet reader. The timestamp conversion options seem to apply to the entire query. I meant all of those but I realize that this needs to be per column. We need to pass in `useDecimal128` and `useLegacyDateTimestampOrNTZ` at least, unless we can access the SQLConf. (The type promotion info can be derived in native, I think.)
Re: [PR] chore: Release Ballista 43.0.0 [datafusion-ballista]
andygrove merged PR #1156: URL: https://github.com/apache/datafusion-ballista/pull/1156
Re: [I] Ballista 43.0.0 Release [datafusion-ballista]
andygrove closed issue #974: Ballista 43.0.0 Release URL: https://github.com/apache/datafusion-ballista/issues/974
Re: [PR] chore: Upgrade to DataFusion 44.0.0 from 44.0.0 RC2 [datafusion-comet]
andygrove merged PR #1232: URL: https://github.com/apache/datafusion-comet/pull/1232
Re: [I] Define extension API for user-defined invariants. [datafusion]
alamb commented on issue #14029: URL: https://github.com/apache/datafusion/issues/14029#issuecomment-2575052672 > But that would still leave the logical plan invariant extensions for consideration. I would recommend adding a method to: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.UserDefinedLogicalNode.html Perhaps even the same API: ```rust pub trait UserDefinedLogicalNode { ... fn check_invariants(&self, invariant_level: InvariantLevel) -> Result<()> { Ok(()) } } ... ``` 🤔 Then you could extend `LogicalPlan::check_invariants` to call that method for any `LogicalPlan::UserDefined` https://github.com/apache/datafusion/blob/4e877a08d224d992a8cbcc9a14f59468e312b13f/datafusion/expr/src/logical_plan/plan.rs#L1134