[PR] chore(deps): bump prost-derive from 0.13.4 to 0.13.5 [datafusion]

2025-02-12 Thread via GitHub


dependabot[bot] opened a new pull request, #14622:
URL: https://github.com/apache/datafusion/pull/14622

   Bumps [prost-derive](https://github.com/tokio-rs/prost) from 0.13.4 to 
0.13.5.
   
   Release notes
   Sourced from prost-derive's releases: https://github.com/tokio-rs/prost/releases
   
   Prost v0.13.5
   PROST! is a Protocol Buffers (https://developers.google.com/protocol-buffers/) implementation for the Rust Language (https://www.rust-lang.org/). prost generates simple, idiomatic Rust code from proto2 and proto3 files.
   
   Features
   
   prost-types: Derive Arbitrary (#1188, https://redirect.github.com/tokio-rs/prost/issues/1188)
   
   Documentation
   
   Use intra doc links instead of HTML tags (#1219, https://redirect.github.com/tokio-rs/prost/issues/1219)
   
   Dependencies
   
   Update pulldown-cmark-to-cmark requirement to >=16, <=20 (#1206, https://redirect.github.com/tokio-rs/prost/issues/1206)
   Update itertools requirement to >=0.10, <=0.14 (#1222, https://redirect.github.com/tokio-rs/prost/issues/1222)
   Update petgraph requirement to include 0.7 (#1226, https://redirect.github.com/tokio-rs/prost/issues/1226)
   Update rand requirement from 0.8 to 0.9 (#1233, https://redirect.github.com/tokio-rs/prost/issues/1233)
   Bump clippy to 1.83 (#1220, https://redirect.github.com/tokio-rs/prost/issues/1220)
   Update flake.lock (#1216, https://redirect.github.com/tokio-rs/prost/issues/1216)
   
   Styling
   
   Replace unnecessary map_or (#1221, https://redirect.github.com/tokio-rs/prost/issues/1221)
   prost-build: Use enum getter (#1238, https://redirect.github.com/tokio-rs/prost/issues/1238)
   
   Testing
   
   default_enum_value: Move tests to separate module (#1198, https://redirect.github.com/tokio-rs/prost/issues/1198)
   nesting: Move tests to separate module (#1218, https://redirect.github.com/tokio-rs/prost/issues/1218)
   recursive_oneof: Move tests to separate module (#1225, https://redirect.github.com/tokio-rs/prost/issues/1225)
   boxed_field: Box an oneof field (#1235, https://redirect.github.com/tokio-rs/prost/issues/1235)
   groups: Move tests to separate module (#1234, https://redirect.github.com/tokio-rs/prost/issues/1234)
   default_string_escape: Move tests to separate module (#1239, https://redirect.github.com/tokio-rs/prost/issues/1239)
   Move DecodeError tests closer to the implementation (#1227, https://redirect.github.com/tokio-rs/prost/issues/1227)
   
   
   
   
   Changelog
   Sourced from prost-derive's changelog: https://github.com/tokio-rs/prost/blob/master/CHANGELOG.md
   (The 0.13.5 changelog entries are identical to the release notes above.)
   
   
   
   
   Commits
   
   d505b18 chore: Release version 0.13.5 (https://github.com/tokio-rs/prost/commit/d505b184e933e1f9ff5679106ffc51b7e3c2755e)
   …

Re: [PR] equivalence classes: use normalized mapping for projection [datafusion]

2025-02-12 Thread via GitHub


berkaysynnada commented on PR #14327:
URL: https://github.com/apache/datafusion/pull/14327#issuecomment-2652990411

   and sorry for the late response 😞 





[PR] Add DataFrame fill_nan/fill_null [datafusion-python]

2025-02-12 Thread via GitHub


kosiew opened a new pull request, #1019:
URL: https://github.com/apache/datafusion-python/pull/1019

   # Which issue does this PR close?
   
   
   
   Closes #922.
   
   # Rationale for this change
   
   
   DataFusion currently lacks built-in methods for handling missing values 
(nulls and NaNs) in DataFrames. This functionality is commonly needed in data 
processing workflows and is available in other DataFrame libraries like pandas 
and PySpark.
   
   The changes add:
   - `fill_null()` method to replace NULL values with a specified value
   - `fill_nan()` method to replace NaN values with a numeric value in 
floating-point columns
   
   # What changes are included in this PR?
   
   
   1. Added `fill_null()` method to DataFrame class:
  - Replaces NULL values with a specified value
  - Handles type casting validation
  - Allows filling specific columns or entire DataFrame
  - Preserves columns where casting fails
   
   2. Added `fill_nan()` method to DataFrame class:
  - Replaces NaN values with numeric values
  - Only operates on floating-point columns
  - Validates input types and column types
  - Allows filling specific columns or all numeric columns
   
   3. Added comprehensive test cases for both methods covering:
  - Different data types
  - Column subsetting
  - Type validation
  - Error cases
   
   # Are there any user-facing changes?
   
   
   Yes, two new public methods are added to the DataFrame class:
   ```python
   # Fill nulls with a value
   df = df.fill_null(0)  # Fill all nulls with 0 where possible
   df = df.fill_null("missing", subset=["name"])  # Fill specific columns
   
   # Fill NaN values in numeric columns
   df = df.fill_nan(0)  # Fill all NaNs with 0
   df = df.fill_nan(99.9, subset=["price"])  # Fill specific columns
   ```
   





Re: [I] Ad-hoc or scheduled mutation based testing [datafusion]

2025-02-12 Thread via GitHub


2010YOUY01 commented on issue #14589:
URL: https://github.com/apache/datafusion/issues/14589#issuecomment-2653111614

   > [@alamb](https://github.com/alamb) they did, although they ran out of space! If you click on the "this" hyperlink in the text of the issue, you get here:
   > https://github.com/edmondop/arrow-datafusion/actions/runs/13248520757/job/36980712855
   >
   > before it ran out of memory, it output
   
   Looks like it's running out of disk space due to the large number of recompilations.
   The default CI runner's disk budget is < 20 GB.
   





Re: [PR] chore(deps): bump prost-build from 0.13.4 to 0.13.5 [datafusion]

2025-02-12 Thread via GitHub


xudong963 merged PR #14623:
URL: https://github.com/apache/datafusion/pull/14623





Re: [PR] Add DataFrame fill_nan/fill_null [datafusion-python]

2025-02-12 Thread via GitHub


kosiew commented on code in PR #1019:
URL: 
https://github.com/apache/datafusion-python/pull/1019#discussion_r1952245124


##
python/tests/test_functions.py:
##
@@ -1173,3 +1173,57 @@ def test_between_default(df):
 
 actual = df.collect()[0].to_pydict()
 assert actual == expected
+
+
+def test_coalesce(df):
+    # Create a DataFrame with null values
+    ctx = SessionContext()
+    batch = pa.RecordBatch.from_arrays(
+        [
+            pa.array(["Hello", None, "!"]),  # string column with null
+            pa.array([4, None, 6]),  # integer column with null
+            pa.array(["hello ", None, " !"]),  # string column with null
+            pa.array(
+                [datetime(2022, 12, 31), None, datetime(2020, 7, 2)]
+            ),  # datetime with null
+            pa.array([False, None, True]),  # boolean column with null
+        ],
+        names=["a", "b", "c", "d", "e"],
+    )
+    df_with_nulls = ctx.create_dataframe([[batch]])
+
+    # Test coalesce with different data types
+    result_df = df_with_nulls.select(
+        f.coalesce(column("a"), literal("default")).alias("a_coalesced"),
+        f.coalesce(column("b"), literal(0)).alias("b_coalesced"),
+        f.coalesce(column("c"), literal("default")).alias("c_coalesced"),
+        f.coalesce(column("d"), literal(datetime(2000, 1, 1))).alias("d_coalesced"),
+        f.coalesce(column("e"), literal(False)).alias("e_coalesced"),
+    )
+
+    result = result_df.collect()[0]
+
+    # Verify results
+    assert result.column(0) == pa.array(
+        ["Hello", "default", "!"], type=pa.string_view()
+    )
+    assert result.column(1) == pa.array([4, 0, 6], type=pa.int64())
+    assert result.column(2) == pa.array(
+        ["hello ", "default", " !"], type=pa.string_view()
+    )
+    assert result.column(3) == pa.array(
+        [datetime(2022, 12, 31), datetime(2000, 1, 1), datetime(2020, 7, 2)],
+        type=pa.timestamp("us"),
+    )
+    assert result.column(4) == pa.array([False, False, True], type=pa.bool_())
+
+    # Test multiple arguments
+    result_df = df_with_nulls.select(
+        f.coalesce(column("a"), literal(None), literal("fallback")).alias(
+            "multi_coalesce"
+        )

Review Comment:
   I could not find existing tests for `coalesce`, which I used to implement `fill_null`, so I added these.






Re: [PR] Signature::Coercible with user defined implicit casting [datafusion]

2025-02-12 Thread via GitHub


jayzhan211 commented on code in PR #14440:
URL: https://github.com/apache/datafusion/pull/14440#discussion_r1952259827


##
datafusion/expr-common/src/signature.rs:
##
@@ -460,6 +521,44 @@ fn get_data_types(native_type: &NativeType) -> Vec<DataType> {
 }
 }
 
+#[derive(Debug, Clone, Eq, PartialOrd)]
+pub struct Coercion {

Review Comment:
   This is a completely separate issue 😅






Re: [PR] Signature::Coercible with user defined implicit casting [datafusion]

2025-02-12 Thread via GitHub


jayzhan211 commented on code in PR #14440:
URL: https://github.com/apache/datafusion/pull/14440#discussion_r1952259827


##
datafusion/expr-common/src/signature.rs:
##
@@ -460,6 +521,44 @@ fn get_data_types(native_type: &NativeType) -> Vec<DataType> {
 }
 }
 
+#[derive(Debug, Clone, Eq, PartialOrd)]
+pub struct Coercion {

Review Comment:
   This is a completely separate issue 😅
   
   provide their own coercion rules, not their own signature






Re: [PR] Signature::Coercible with user defined implicit casting [datafusion]

2025-02-12 Thread via GitHub


jayzhan211 commented on code in PR #14440:
URL: https://github.com/apache/datafusion/pull/14440#discussion_r1952259827


##
datafusion/expr-common/src/signature.rs:
##
@@ -460,6 +521,44 @@ fn get_data_types(native_type: &NativeType) -> Vec<DataType> {
 }
 }
 
+#[derive(Debug, Clone, Eq, PartialOrd)]
+pub struct Coercion {

Review Comment:
   This is a completely separate issue 😅
   
   Having their own coercion rules is not the same as having their own signature.






Re: [PR] Signature::Coercible with user defined implicit casting [datafusion]

2025-02-12 Thread via GitHub


jayzhan211 commented on code in PR #14440:
URL: https://github.com/apache/datafusion/pull/14440#discussion_r1952259827


##
datafusion/expr-common/src/signature.rs:
##
@@ -460,6 +521,44 @@ fn get_data_types(native_type: &NativeType) -> Vec<DataType> {
 }
 }
 
+#[derive(Debug, Clone, Eq, PartialOrd)]
+pub struct Coercion {

Review Comment:
   This is a completely separate issue 😅; before that, we need to improve the existing function signatures.
   
   provide their own coercion rules, not their own signature






Re: [PR] bug: improve schema checking for `insert into` cases [datafusion]

2025-02-12 Thread via GitHub


zhuqi-lucas commented on code in PR #14572:
URL: https://github.com/apache/datafusion/pull/14572#discussion_r1952299587


##
datafusion/common/src/dfschema.rs:
##
@@ -1028,20 +1028,48 @@ impl SchemaExt for Schema {
 })
 }
 
-    fn logically_equivalent_names_and_types(&self, other: &Self) -> bool {
+    // There are three cases we need to check
+    // 1. The len of the schema of the plan and the schema of the table should be the same
+    // 2. The nullable flag of the schema of the plan and the schema of the table should be the same
+    // 3. The datatype of the schema of the plan and the schema of the table should be the same
+    fn logically_equivalent_names_and_types(&self, other: &Self) -> Result<(), String> {

Review Comment:
   Thank you @jayzhan211 for this good suggestion. I changed my code to use `Result<()>`; it seems to be enough for this case, similar to many other cases, for example:
   
   ```rust
   /// Check if the schema has some fields with the same name
   pub fn check_names(&self) -> Result<()> {
   }
   ```
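   As an illustration, here is a minimal sketch of how such a check can report a descriptive planning error instead of returning `bool`. The helper name and error text are hypothetical, not the actual DataFusion implementation:
   
   ```rust
   use arrow::datatypes::Schema;
   use datafusion_common::{plan_err, Result};
   
   /// Hypothetical helper: fail with a descriptive planning error when the
   /// inserting query relaxes the nullability of a table column.
   fn check_nullability(table: &Schema, query: &Schema) -> Result<()> {
       for (t, q) in table.fields().iter().zip(query.fields().iter()) {
           if !t.is_nullable() && q.is_nullable() {
               return plan_err!(
                   "Inserting query must have the same schema nullability as the table. \
                    Expected table field '{}' nullability: false, got nullability: true",
                   t.name()
               );
           }
       }
       Ok(())
   }
   ```
   
   The advantage over a `bool` is that the caller can simply propagate the error with `?` and the message already names the offending field.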






Re: [PR] chore(deps): bump prost from 0.13.4 to 0.13.5 [datafusion]

2025-02-12 Thread via GitHub


dependabot[bot] closed pull request #14621: chore(deps): bump prost from 0.13.4 
to 0.13.5
URL: https://github.com/apache/datafusion/pull/14621





Re: [PR] chore(deps): bump prost from 0.13.4 to 0.13.5 [datafusion]

2025-02-12 Thread via GitHub


dependabot[bot] commented on PR #14621:
URL: https://github.com/apache/datafusion/pull/14621#issuecomment-2653139423

   Looks like prost is up-to-date now, so this is no longer needed.





[PR] Fix ci test [datafusion]

2025-02-12 Thread via GitHub


xudong963 opened a new pull request, #14625:
URL: https://github.com/apache/datafusion/pull/14625

   ## Which issue does this PR close?
   
   
   
   - Closes #.
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   





Re: [PR] Fix ci test [datafusion]

2025-02-12 Thread via GitHub


xudong963 merged PR #14625:
URL: https://github.com/apache/datafusion/pull/14625





Re: [PR] Fix ci test [datafusion]

2025-02-12 Thread via GitHub


xudong963 commented on PR #14625:
URL: https://github.com/apache/datafusion/pull/14625#issuecomment-2653399521

   thanks all





Re: [PR] feat: Implement UNION ALL BY NAME [datafusion]

2025-02-12 Thread via GitHub


berkaysynnada commented on PR #14538:
URL: https://github.com/apache/datafusion/pull/14538#issuecomment-2653402466

   I'll take a look ASAP, but any additional reviewers are welcome





Re: [PR] bug: improve schema checking for `insert into` cases [datafusion]

2025-02-12 Thread via GitHub


jayzhan211 commented on code in PR #14572:
URL: https://github.com/apache/datafusion/pull/14572#discussion_r1952422238


##
datafusion/sqllogictest/test_files/insert_to_external.slt:
##
@@ -60,17 +60,16 @@ STORED AS parquet
 LOCATION 'test_files/scratch/insert_to_external/parquet_types_partitioned/'
 PARTITIONED BY (b);
 
-query I
+query error
 insert into dictionary_encoded_parquet_partitioned
 select * from dictionary_encoded_values
 
-2
+DataFusion error: Error during planning: Inserting query must have the same 
schema nullability as the table. Expected table field 'b' nullability: false, 
got field: 'b', nullability: true

Review Comment:
   Error should be after `query error` to pass CI



##
datafusion/sqllogictest/test_files/insert_to_external.slt:
##
@@ -60,17 +60,16 @@ STORED AS parquet
 LOCATION 'test_files/scratch/insert_to_external/parquet_types_partitioned/'
 PARTITIONED BY (b);
 
-query I
+query error
 insert into dictionary_encoded_parquet_partitioned
 select * from dictionary_encoded_values
 
-2
+DataFusion error: Error during planning: Inserting query must have the same 
schema nullability as the table. Expected table field 'b' nullability: false, 
got field: 'b', nullability: true

Review Comment:
   `because PARTITIONED BY (b), will make the b nullable to false` can be added as a comment






Re: [I] Optimize `repeat` function [datafusion]

2025-02-12 Thread via GitHub


zjregee commented on issue #14610:
URL: https://github.com/apache/datafusion/issues/14610#issuecomment-2653494265

   Hi, @alamb, I have a few questions and hope to get some help.
   
   Is the optimization mentioned here similar to using the following method 
instead?
   
   ```rust
   for _ in 0..number {
       write!(builder, "{}", string)?;
   }
   builder.append_value("")
   ```
   
   This seems to work well when the number of repetitions is small, because it 
reduces the number of memory copies, but when the number of repetitions is 
large, the repeated memory expansion will lead to performance degradation.
   
   The test results are as follows:
   ```text
   repeat 3 times/repeat_string_view [size=1024, repeat_times=3]
   time:   [26.203 µs 26.286 µs 26.390 µs]
   change: [-34.686% -34.382% -34.061%] (p = 0.00 < 0.05)
   Performance has improved.
   repeat 3 times/repeat_string [size=1024, repeat_times=3]
   time:   [26.313 µs 27.236 µs 28.189 µs]
   change: [-32.881% -30.398% -27.949%] (p = 0.00 < 0.05)
   Performance has improved.
   repeat 3 times/repeat_large_string [size=1024, repeat_times=3]
   time:   [26.785 µs 26.876 µs 26.970 µs]
   change: [-12.831% -11.753% -10.991%] (p = 0.00 < 0.05)
   Performance has improved.
   
   repeat 30 times/repeat_string_view [size=1024, repeat_times=30]
   time:   [255.89 µs 257.33 µs 258.78 µs]
   change: [+91.575% +103.40% +116.22%] (p = 0.00 < 0.05)
   Performance has regressed.
   repeat 30 times/repeat_string [size=1024, repeat_times=30]
   time:   [228.56 µs 235.81 µs 244.25 µs]
   change: [+64.653% +71.663% +80.334%] (p = 0.00 < 0.05)
   Performance has regressed.
   repeat 30 times/repeat_large_string [size=1024, repeat_times=30]
   time:   [238.48 µs 245.03 µs 252.40 µs]
   change: [+76.311% +86.412% +98.466%] (p = 0.00 < 0.05)
   Performance has regressed.
   ```
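   One possible way to avoid the repeated buffer growth for large repeat counts is to reserve the output capacity before writing. The sketch below is illustrative only; the helper name and the fixed repeat count are assumptions, not the existing `repeat` kernel:
   
   ```rust
   use arrow::array::{ArrayRef, StringBuilder};
   use std::fmt::Write;
   use std::sync::Arc;
   
   /// Hypothetical helper: repeat every input string `n` times, reserving the
   /// exact output size so the builder never reallocates inside the loop.
   fn repeat_strings(values: &[&str], n: usize) -> ArrayRef {
       let data_capacity: usize = values.iter().map(|s| s.len() * n).sum();
       let mut builder = StringBuilder::with_capacity(values.len(), data_capacity);
       for s in values {
           for _ in 0..n {
               // `StringBuilder` implements `std::fmt::Write`, so this appends
               // bytes in place without building an intermediate `String`.
               write!(builder, "{s}").unwrap();
           }
           builder.append_value(""); // finish the current row
       }
       Arc::new(builder.finish())
   }
   ```
   
   Reserving the full output size keeps the single-copy benefit of writing through the builder while avoiding incremental reallocation; whether it actually helps in practice would still need benchmarks like the ones above to confirm.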





[PR] chore(deps): bump substrait from 0.53.1 to 0.53.2 [datafusion]

2025-02-12 Thread via GitHub


dependabot[bot] opened a new pull request, #14627:
URL: https://github.com/apache/datafusion/pull/14627

   Bumps [substrait](https://github.com/substrait-io/substrait-rs) from 0.53.1 
to 0.53.2.
   
   Release notes
   Sourced from substrait's releases: https://github.com/substrait-io/substrait-rs/releases
   
   v0.53.2
   Other
   
   (deps,cargo) bump the proto group with 3 updates (#299, https://redirect.github.com/substrait-io/substrait-rs/issues/299)
   
   Changelog
   Sourced from substrait's changelog: https://github.com/substrait-io/substrait-rs/blob/main/CHANGELOG.md
   
   0.53.2 - 2025-02-12 (https://github.com/substrait-io/substrait-rs/compare/v0.53.1...v0.53.2)
   Other
   
   (deps,cargo) bump the proto group with 3 updates (#299, https://redirect.github.com/substrait-io/substrait-rs/issues/299)
   
   Commits
   
   5817f11 chore: release v0.53.2 (#300, https://github.com/substrait-io/substrait-rs/commit/5817f11af1a759dbc4d820c926f4e71794582f4c)
   4aebc85 chore(deps,cargo): bump the proto group with 3 updates (#299, https://github.com/substrait-io/substrait-rs/commit/4aebc85a7ec518c81abb0713d9e78c65f2be5e8c)
   See full diff in compare view: https://github.com/substrait-io/substrait-rs/compare/v0.53.1...v0.53.2
   
   
   
   
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=substrait&package-manager=cargo&previous-version=0.53.1&new-version=0.53.2)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   
   





Re: [PR] refactor: Move various parts of datasource out of core [datafusion]

2025-02-12 Thread via GitHub


jayzhan211 commented on PR #14616:
URL: https://github.com/apache/datafusion/pull/14616#issuecomment-2653488012

   Yes, and not only the 3 I mentioned: FileFormat, FileFormatFactory, etc. I think the `File`-related structures should live in their own crate, and then implementations like parquet, arrow, or listing table can import it.





Re: [PR] chore(deps): group `prost` and `pbjson` dependabot updates [datafusion]

2025-02-12 Thread via GitHub


alamb merged PR #14626:
URL: https://github.com/apache/datafusion/pull/14626





Re: [PR] refactor: Move various parts of datasource out of core [datafusion]

2025-02-12 Thread via GitHub


jayzhan211 commented on code in PR #14616:
URL: https://github.com/apache/datafusion/pull/14616#discussion_r1952536901


##
datafusion/catalog-listing/src/file_stream_part.rs:
##
@@ -0,0 +1,214 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! A generic stream over file format readers that can be used by
+//! any file format that read its files from start to end.
+//!
+//! Note: Most traits here need to be marked `Sync + Send` to be
+//! compliant with the `SendableRecordBatchStream` trait.
+
+use crate::file_meta::FileMeta;
+use datafusion_common::error::Result;
+use datafusion_physical_plan::metrics::{
+    Count, ExecutionPlanMetricsSet, MetricBuilder, Time,
+};
+
+use arrow::error::ArrowError;
+use arrow::record_batch::RecordBatch;
+use datafusion_common::instant::Instant;
+use datafusion_common::ScalarValue;
+
+use futures::future::BoxFuture;
+use futures::stream::BoxStream;
+
+/// A fallible future that resolves to a stream of [`RecordBatch`]
+pub type FileOpenFuture =
+    BoxFuture<'static, Result<BoxStream<'static, Result<RecordBatch, ArrowError>>>>;
+
+/// Describes the behavior of the `FileStream` if file opening or scanning fails
+pub enum OnError {
+    /// Fail the entire stream and return the underlying error
+    Fail,
+    /// Continue scanning, ignoring the failed file
+    Skip,
+}
+
+impl Default for OnError {
+    fn default() -> Self {
+        Self::Fail
+    }
+}
+
+/// Generic API for opening a file using an [`ObjectStore`] and resolving to a
+/// stream of [`RecordBatch`]
+///
+/// [`ObjectStore`]: object_store::ObjectStore
+pub trait FileOpener: Unpin + Send + Sync {
+    /// Asynchronously open the specified file and return a stream
+    /// of [`RecordBatch`]
+    fn open(&self, file_meta: FileMeta) -> Result<FileOpenFuture>;
+}
+
+/// Represents the state of the next `FileOpenFuture`. Since we need to poll
+/// this future while scanning the current file, we need to store the result if it
+/// is ready
+pub enum NextOpen {
+    Pending(FileOpenFuture),
+    Ready(Result<BoxStream<'static, Result<RecordBatch, ArrowError>>>),
+}
+
+pub enum FileStreamState {
+    /// The idle state, no file is currently being read
+    Idle,
+    /// Currently performing asynchronous IO to obtain a stream of RecordBatch
+    /// for a given file
+    Open {
+        /// A [`FileOpenFuture`] returned by [`FileOpener::open`]
+        future: FileOpenFuture,
+        /// The partition values for this file
+        partition_values: Vec<ScalarValue>,
+    },
+    /// Scanning the [`BoxStream`] returned by the completion of a [`FileOpenFuture`]
+    /// returned by [`FileOpener::open`]
+    Scan {
+        /// Partitioning column values for the current batch_iter
+        partition_values: Vec<ScalarValue>,
+        /// The reader instance
+        reader: BoxStream<'static, Result<RecordBatch, ArrowError>>,
+        /// A [`FileOpenFuture`] for the next file to be processed,
+        /// and its corresponding partition column values, if any.
+        /// This allows the next file to be opened in parallel while the
+        /// current file is read.
+        next: Option<(NextOpen, Vec<ScalarValue>)>,
+    },
+    /// Encountered an error
+    Error,
+    /// Reached the row limit
+    Limit,
+}
+
+/// A timer that can be started and stopped.
+pub struct StartableTime {
+    pub metrics: Time,
+    // use for record each part cost time, will eventually add into 'metrics'.
+    pub start: Option<Instant>,
+}
+
+impl StartableTime {
+    pub fn start(&mut self) {
+        assert!(self.start.is_none());
+        self.start = Some(Instant::now());
+    }
+
+    pub fn stop(&mut self) {
+        if let Some(start) = self.start.take() {
+            self.metrics.add_elapsed(start);
+        }
+    }
+}
+
+#[allow(rustdoc::broken_intra_doc_links)]
+/// Metrics for [`FileStream`]
+///
+/// Note that all of these metrics are in terms of wall clock time
+/// (not cpu time) so they include time spent waiting on I/O as well
+/// as other operators.
+///
+/// [`FileStream`]: 

+pub struct FileStreamMetrics {

Review Comment:
   I prefer `file_stream`




[PR] Add support for `ORDER BY ALL` [datafusion-sqlparser-rs]

2025-02-12 Thread via GitHub


PokIsemaine opened a new pull request, #1724:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1724

   Add support for `ORDER BY ALL`
   ```
   SELECT *
   FROM addresses
   ORDER BY ALL;
   ```
   https://duckdb.org/docs/sql/query_syntax/orderby.html#order-by-all-examples
   https://clickhouse.com/docs/en/sql-reference/statements/select/order-by





Re: [PR] bug: improve schema checking for `insert into` cases [datafusion]

2025-02-12 Thread via GitHub


jayzhan211 commented on code in PR #14572:
URL: https://github.com/apache/datafusion/pull/14572#discussion_r1952544758


##
datafusion/sqllogictest/test_files/insert_to_external.slt:
##
@@ -60,17 +60,16 @@ STORED AS parquet
 LOCATION 'test_files/scratch/insert_to_external/parquet_types_partitioned/'
 PARTITIONED BY (b);
 
-query I
+query error
 insert into dictionary_encoded_parquet_partitioned
 select * from dictionary_encoded_values
 
-2
+DataFusion error: Error during planning: Inserting query must have the same 
schema nullability as the table. Expected table field 'b' nullability: false, 
got field: 'b', nullability: true

Review Comment:
   But why is CI green 🤔 






Re: [PR] bug: improve schema checking for `insert into` cases [datafusion]

2025-02-12 Thread via GitHub


jayzhan211 commented on code in PR #14572:
URL: https://github.com/apache/datafusion/pull/14572#discussion_r1952543911


##
datafusion/sqllogictest/test_files/insert_to_external.slt:
##
@@ -60,17 +60,16 @@ STORED AS parquet
 LOCATION 'test_files/scratch/insert_to_external/parquet_types_partitioned/'
 PARTITIONED BY (b);
 
-query I
+query error
 insert into dictionary_encoded_parquet_partitioned
 select * from dictionary_encoded_values
 
-2
+DataFusion error: Error during planning: Inserting query must have the same 
schema nullability as the table. Expected table field 'b' nullability: false, 
got field: 'b', nullability: true

Review Comment:
   It was auto-generated a long time ago; we need to move it manually.






Re: [PR] refactor: Move various parts of datasource out of core [datafusion]

2025-02-12 Thread via GitHub


jayzhan211 commented on PR #14616:
URL: https://github.com/apache/datafusion/pull/14616#issuecomment-2653497557

   `Catalog -> Schema -> Table -> FileFormat -> QueryPlanner` is the dependency 
based on my previous research.
   
   
   
   





Re: [PR] refactor: Move various parts of datasource out of core [datafusion]

2025-02-12 Thread via GitHub


jayzhan211 commented on code in PR #14616:
URL: https://github.com/apache/datafusion/pull/14616#discussion_r1952536136


##
datafusion/core/src/datasource/physical_plan/file_stream.rs:
##
@@ -31,49 +31,17 @@ use crate::datasource::listing::PartitionedFile;
 use 
crate::datasource::physical_plan::file_scan_config::PartitionColumnProjector;
 use crate::datasource::physical_plan::{FileMeta, FileScanConfig};
 use crate::error::Result;
-use crate::physical_plan::metrics::{
-BaselineMetrics, Count, ExecutionPlanMetricsSet, MetricBuilder, Time,
-};
+use crate::physical_plan::metrics::{BaselineMetrics, ExecutionPlanMetricsSet};
 use crate::physical_plan::RecordBatchStream;
 
 use arrow::datatypes::SchemaRef;
 use arrow::error::ArrowError;
 use arrow::record_batch::RecordBatch;
-use datafusion_common::instant::Instant;
+pub use datafusion_catalog_listing::file_stream_part::*;

Review Comment:
   It's better to avoid importing everything at once so that the imported 
modules are explicitly controlled and managed.
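   For example, the glob re-export could be replaced with an explicit list (a sketch only; the final item list would follow whatever the new module actually exports):
   
   ```rust
   // Re-export only the items that callers actually need, instead of `*`.
   pub use datafusion_catalog_listing::file_stream_part::{
       FileOpenFuture, FileOpener, FileStreamMetrics, FileStreamState, NextOpen,
       OnError, StartableTime,
   };
   ```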






[I] AQE may materialize a non-supported Final-mode HashAggregate [datafusion-comet]

2025-02-12 Thread via GitHub


EmilyMatt opened a new issue, #1389:
URL: https://github.com/apache/datafusion-comet/issues/1389

   ### Describe the bug
   
   In cases where we support a HashAggregate's aggregate functions, we will 
convert the partial stage HashAggregate, execute it in DF, then use native 
shuffle to forward the results, however, the next stage will not materialize 
due to AQE waiting for the shuffle and its stats.
   Then, if the agg() function contained unsupported expressions wrapping the 
aggregates themselves, we cannot convert the Final mode.
   This causes the Spark HashAggregate to crash as it attempts to access fields 
that don't exist, and even if they do it expects its own aggregate buffer 
representation, which we don't have.
   
   I believe I've seen a few tests that are ignored because of this.
   I don't think this is a valid situation, We should not crash based on 
previously Comet-ran operators if they were successful.
   
   ### Steps to reproduce
   
   Have a group_by/aggregate that either needs a shuffle or an aggregate buffer, like:
   .agg(
     collect_set(col("my_column"))
   )
   
   but wrap that collect_set with an unsupported expression or cast or something.
   (I am not sure, but I believe I saw something about a similar behaviour that 
can be created using Decimal avg, as the Partial aggregate is supported but 
generates a Sum and Count, while generating results from the intermediate data 
is not supported natively.)
   
   For example:
   .agg(
     concat(flatten(collect_set(col("my_column"))))
   )
   can create this behaviour with AQE on.
   Not sure if this is datatype related.
   
   
   
   ### Expected behavior
   
   Comet should either not convert a Partial HashAggregate whose Final stage 
cannot be converted, or, if it already did, elegantly execute the Final 
aggregations and let Spark finish the work without breaking the plan.
   
   ### Additional context
   
   I believe I have a non-invasive solution where if the result expressions are 
not supported, we convert them into a separate ProjectExec, which will be the 
parent of the CometHashAggregateExec, which will not have result expressions 
(Like the Partial stage doesn't), and will have the grouping+aggregate 
attributes as its output.
   We then have a conversion to rows and run a ProjectExec with the unsupported 
expression, ensuring that even if the rest of the stage cannot be run using 
Comet, we don't break an already running workflow.
   
   Will open a PR shortly





Re: [I] Discuss: Check in Cargo.lock file? [datafusion]

2025-02-12 Thread via GitHub


timsaucer commented on issue #14135:
URL: https://github.com/apache/datafusion/issues/14135#issuecomment-2653606832

   When I needed to add a new third party github action to the Apache list of 
approved actions for `datafusion-python`, it was a very easy process. If that's 
a blocker I can file an issue.





[PR] WIP: Update to arrow 52.0.0 [datafusion]

2025-02-12 Thread via GitHub


alamb opened a new pull request, #14628:
URL: https://github.com/apache/datafusion/pull/14628

   ## Which issue does this PR close?
   
   - Part of https://github.com/apache/arrow-rs/issues/7083
   
   ## Rationale for this change
   Testing that DataFusion has no issues when upgrading to latest arrow
   
   
   ## What changes are included in this PR?
   
   1. Update the pin so I can run the tests
   
   ## Are these changes tested?
   Yes, by CI
   
   
   ## Are there any user-facing changes?
   
   
   
   
   





Re: [PR] bug: improve schema checking for `insert into` cases [datafusion]

2025-02-12 Thread via GitHub


zhuqi-lucas commented on code in PR #14572:
URL: https://github.com/apache/datafusion/pull/14572#discussion_r1952513001


##
datafusion/sqllogictest/test_files/insert_to_external.slt:
##
@@ -60,17 +60,16 @@ STORED AS parquet
 LOCATION 'test_files/scratch/insert_to_external/parquet_types_partitioned/'
 PARTITIONED BY (b);
 
-query I
+query error
 insert into dictionary_encoded_parquet_partitioned
 select * from dictionary_encoded_values
 
-2
+DataFusion error: Error during planning: Inserting query must have the same 
schema nullability as the table. Expected table field 'b' nullability: false, 
got field: 'b', nullability: true

Review Comment:
   Added comments in the latest PR, thanks @jayzhan211!
   
   The error should come after `query error` to pass CI. I think it has been 
auto-generated that way since this PR:
   
   https://github.com/apache/datafusion/pull/14439
   
   For example:
   
   
https://github.com/apache/datafusion/pull/14439/files#diff-51757b2b1d0a07b88551d88eabeba7f74e11b5217e44203ac7c6f613c0221196R273






Re: [I] Improve consistency and documentation on error handling in in UDFs [datafusion]

2025-02-12 Thread via GitHub


davidhewitt commented on issue #11618:
URL: https://github.com/apache/datafusion/issues/11618#issuecomment-2653788835

   I just got here after discovering that the `get_field` UDF returns 
`exec_err!` when passed an invalid argument type. (We treat `exec_err!` as 500 
and `plan_err!` as 400 in Pydantic Logfire's backend).
   
   Currently its signature is `Signature::any`. Based on the discussion here, 
it sounds like the correct solution would be to update its signature to a more 
precise one?
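   
   For context, here is a minimal sketch of the distinction being discussed; the helper is hypothetical and not the actual `get_field` code. Argument-type problems detected at planning time can be raised with `plan_err!`, leaving `exec_err!` for genuine runtime failures:
   
   ```rust
   use arrow::datatypes::DataType;
   use datafusion_common::{plan_err, Result};
   
   /// Hypothetical argument check: reject unsupported argument types at
   /// planning time with a plan error rather than failing during execution.
   fn check_get_field_arg(arg: &DataType) -> Result<()> {
       match arg {
           DataType::Struct(_) | DataType::Map(_, _) => Ok(()),
           other => plan_err!("get_field does not support argument type {other}"),
       }
   }
   ```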





[PR] Ok push for now until I understand how to regenerate golden files [datafusion-comet]

2025-02-12 Thread via GitHub


EmilyMatt opened a new pull request, #1390:
URL: https://github.com/apache/datafusion-comet/pull/1390

   ## What issue does this close?
   
   Closes #1389.
   
   ## Rationale for this change
   
   As described in the issue, we'd like to prevent situations where despite the 
Partial aggregate being supported and converted, and the shuffle being 
supported and converted, the Final would not be converted, because the result 
expressions were not supported.
   This leads to an unrecoverable state, where Spark expects an aggregate 
buffer to be created by the Partial HA and it doesn't exist.
   
   ## What changes are included in this PR?
   
   I've separated the conversion of the hash aggregate into a separate 
function (I believe everything should be separated, to be honest; it's very 
hard to manage right now), which also returns information about whether the 
result expressions were converted. When they are not, we create a new 
ProjectExec with those result expressions, convert the HashAggregate without 
them, and place a conversion between the two; that way we can ensure a valid 
state at all times.
   This feature can be bypassed by enforcing result conversion with 
"spark.comet.exec.aggregate.enforceResults=true"; result enforcing is disabled 
by default.
   
   ## How are these changes tested?
   Essentially, a lot of the stability tests will have a new plan where the 
aggregate is completed natively and the ProjectExec runs in Spark, instead of 
the current situation where the final stage of the HashAggregate is done 
entirely in Spark.
   Those tests currently fail because I am unable to run them with 
SPARK_GENERATE_GOLDEN_FILES; might be a skill issue.





Re: [PR] Support bounds evaluation for temporal data types [datafusion]

2025-02-12 Thread via GitHub


ch-sc commented on PR #14523:
URL: https://github.com/apache/datafusion/pull/14523#issuecomment-2653779520

   @berkaysynnada do you have time to take another look? :)
   
   `NotEq` leads to the removal of the sort operator. 
   
   I debugged into this and noticed that the `EnforceSorting` optimiser removes 
`SortExec`. [enforce_sorting/mod.rs:415][1]
   However, prior to that the order requirement is somehow lost, and I don't 
understand why that happens. 
   
   I removed `NotEq` again from the supported binary operators to move this 
forward. We might want to look into this in a follow-up PR. WDYT?
   
   [1]: 
https://github.com/apache/datafusion/blob/02cc22eebd7e422cbba6788971f2ffae103180e8/datafusion/physical-optimizer/src/enforce_sorting/mod.rs#L415





Re: [PR] fix: Reduce timestamp issues in native_datafusion and native_icerberg_compat Parquet modes [datafusion-comet]

2025-02-12 Thread via GitHub


parthchandra commented on code in PR #1387:
URL: https://github.com/apache/datafusion-comet/pull/1387#discussion_r1952694138


##
native/core/src/parquet/parquet_support.rs:
##
@@ -596,7 +595,10 @@ fn cast_array(
 parquet_options: &SparkParquetOptions,
 ) -> DataFusionResult {
 use DataType::*;
-let array = array_with_timezone(array, parquet_options.timezone.clone(), 
Some(to_type))?;
+let array = match to_type {
+Timestamp(_, None) => array, // array_with_timezone does not support 
to_type of NTZ.

Review Comment:
   `utils.rs` timezone functions are intended for use by `cast` and 
specifically convert between Spark representation and Arrow representation.  
Changing them is likely to break some timezone related unit tests. 






[PR] Update Community Events in concepts-readings-events.md [datafusion]

2025-02-12 Thread via GitHub


oznur-synnada opened a new pull request, #14629:
URL: https://github.com/apache/datafusion/pull/14629

   * Remove "(Upcoming)" from Community Events held in the past
   * Correct the date of Amsterdam Apache DataFusion Meetup (January 23, not 
January 25)
   
   ## Which issue does this PR close?
   
   
   
   
   ## Rationale for this change
   
   Updating the page with the most up-to-date information.
   
   ## What changes are included in this PR?
   
   Wording and date changes.
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   





Re: [PR] refactor: Move various parts of datasource out of core [datafusion]

2025-02-12 Thread via GitHub


logan-keede commented on PR #14616:
URL: https://github.com/apache/datafusion/pull/14616#issuecomment-2653644761

   
![image](https://github.com/user-attachments/assets/d4e27727-d734-4d6f-86bb-f86a845b1ecb)
   mostly for my clarity
   
   also, I am not suggesting that we keep the file_format and physical_plan 
folders in the new solution,
   though I was going to suggest a single folder for everything `file`-related if 
we moved forward with the old approach
   
   PS: @jayzhan211 can/should I tag you in related PRs?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] bug: improve schema checking for `insert into` cases [datafusion]

2025-02-12 Thread via GitHub


zhuqi-lucas commented on code in PR #14572:
URL: https://github.com/apache/datafusion/pull/14572#discussion_r1952620730


##
datafusion/sqllogictest/test_files/insert_to_external.slt:
##
@@ -60,17 +60,16 @@ STORED AS parquet
 LOCATION 'test_files/scratch/insert_to_external/parquet_types_partitioned/'
 PARTITIONED BY (b);
 
-query I
+query error
 insert into dictionary_encoded_parquet_partitioned
 select * from dictionary_encoded_values
 
-2
+DataFusion error: Error during planning: Inserting query must have the same 
schema nullability as the table. Expected table field 'b' nullability: false, 
got field: 'b', nullability: true

Review Comment:
   @jayzhan211  This 
https://github.com/apache/datafusion/pull/14439/files#diff-51757b2b1d0a07b88551d88eabeba7f74e11b5217e44203ac7c6f613c0221196R273
 was merged less than a week ago. I think it starts from there. I also ran the sql 
logic test locally with `-- --complete` and got the same result, so I guess the CI 
also uses it to generate and verify?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: Reduce timestamp issues in native_datafusion and native_icerberg_compat Parquet modes [datafusion-comet]

2025-02-12 Thread via GitHub


mbutrovich commented on code in PR #1387:
URL: https://github.com/apache/datafusion-comet/pull/1387#discussion_r1952620440


##
native/core/src/parquet/parquet_support.rs:
##
@@ -596,7 +595,10 @@ fn cast_array(
 parquet_options: &SparkParquetOptions,
 ) -> DataFusionResult {
 use DataType::*;
-let array = array_with_timezone(array, parquet_options.timezone.clone(), 
Some(to_type))?;
+let array = match to_type {
+Timestamp(_, None) => array, // array_with_timezone does not support 
to_type of NTZ.
+_ => array_with_timezone(array, parquet_options.timezone.clone(), 
Some(to_type))?,

Review Comment:
   I'm slowly going through all of the behavior that we hoisted from cast.rs to 
parquet_support.rs, but yep I can file an issue.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] bug: improve schema checking for `insert into` cases [datafusion]

2025-02-12 Thread via GitHub


jonahgao commented on code in PR #14572:
URL: https://github.com/apache/datafusion/pull/14572#discussion_r1952677078


##
datafusion/common/src/dfschema.rs:
##
@@ -1028,21 +1028,41 @@ impl SchemaExt for Schema {
 })
 }
 
-fn logically_equivalent_names_and_types(&self, other: &Self) -> bool {
+fn logically_equivalent_names_and_types(&self, other: &Self) -> Result<()> 
{
 if self.fields().len() != other.fields().len() {
-return false;
+_plan_err!(
+"Inserting query must have the same schema length as the 
table. \
+Expected table schema length: {}, got: {}",
+self.fields().len(),
+other.fields().len()
+)
+} else {
+self.fields()
+.iter()
+.zip(other.fields().iter())
+.try_for_each(|(f1, f2)| {
+// only check the case when the table field is not 
nullable and the insert data field is nullable
+if !f1.is_nullable() && f2.is_nullable() {

Review Comment:
   This condition would prevent the following query from executing, but it 
works on both the main branch and Postgres.
   ```sql
   create table t1(a int not null);
   create table t2(a int);
   insert into t2 values(100);
   insert into t1 select * from t2;
   ```
   As I mentioned earlier, we already have a check during execution called 
[check_not_null_constraints](https://github.com/apache/datafusion/blob/2c73fcd47ed7391c7fd34d6c593c6f0c455670d0/datafusion/physical-plan/src/execution_plan.rs#L1003),
 so I think we should not add this restriction here.
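   For illustration, a minimal sketch of what such an execution-time check does 
(this is not the actual `check_not_null_constraints` implementation; the helper 
name and signature below are assumptions): reject a batch only when a NOT NULL 
column actually contains nulls, instead of rejecting the plan up front.
   ```rust
   use arrow::record_batch::RecordBatch;
   use datafusion_common::{DataFusionError, Result};

   // Sketch of an execution-time NOT NULL check: for each column declared
   // non-nullable in the target table, fail only if the batch really has nulls.
   fn check_not_null_constraints_sketch(
       batch: &RecordBatch,
       non_nullable_columns: &[usize],
   ) -> Result<()> {
       for &i in non_nullable_columns {
           if batch.column(i).null_count() > 0 {
               return Err(DataFusionError::Execution(format!(
                   "Invalid batch: column {i} contains nulls but the table column is NOT NULL"
               )));
           }
       }
       Ok(())
   }
   ```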



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] WIP: Add LogicalScalar [datafusion]

2025-02-12 Thread via GitHub


jayzhan211 commented on code in PR #14617:
URL: https://github.com/apache/datafusion/pull/14617#discussion_r1952808036


##
datafusion/common/src/scalar/logical/mod.rs:
##
@@ -0,0 +1,400 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! TODO logical-types
+
+mod from_scalar_value;
+mod logical_date;
+mod logical_decimal;
+mod logical_duration;
+mod logical_fixed_size_binary;
+mod logical_fixed_size_list;
+mod logical_interval;
+mod logical_list;
+mod logical_map;
+mod logical_struct;
+mod logical_time;
+mod logical_timestamp;
+mod logical_union;
+
+use crate::types::{
+logical_binary, logical_boolean, logical_date, logical_float16, 
logical_float32,
+logical_float64, logical_int16, logical_int32, logical_int64, logical_int8,
+logical_null, logical_string, logical_uint16, logical_uint32, 
logical_uint64,
+logical_uint8, LogicalField, LogicalType, LogicalTypeRef,
+};
+use half::f16;
+use std::cmp::Ordering;
+use std::fmt::{Display, Formatter};
+use std::hash::{Hash, Hasher};
+
+pub use logical_date::LogicalDate;
+pub use logical_decimal::LogicalDecimal;
+pub use logical_duration::LogicalDuration;
+pub use logical_fixed_size_binary::LogicalFixedSizeBinary;
+pub use logical_fixed_size_list::LogicalFixedSizeList;
+pub use logical_interval::LogicalInterval;
+pub use logical_list::LogicalList;
+pub use logical_map::LogicalMap;
+pub use logical_struct::LogicalStruct;
+pub use logical_time::LogicalTime;
+pub use logical_timestamp::LogicalTimestamp;
+pub use logical_timestamp::LogicalTimestampValue;
+pub use logical_union::LogicalUnion;
+
+/// Representation of a logical scalar. Contrary to a physical 
[`ScalarValue`], a logical scalar
+/// is *not* coupled to a physical [`DataType`].
+///
+/// Most variants of this enum allow storing values for the variants in 
[`NativeType`]. Furthermore,
+/// [`LogicalScalar::Extension`] allows storing logical values for extension 
types.
+#[derive(Clone, Debug)]
+pub enum LogicalScalar {
+/// A null value
+Null,
+/// Stores a scalar for [`NativeType::Boolean`].
+Boolean(bool),
+/// Stores a scalar for [`NativeType::Int8`].
+Int8(i8),
+/// Stores a scalar for [`NativeType::Int16`].
+Int16(i16),
+/// Stores a scalar for [`NativeType::Int32`].
+Int32(i32),
+/// Stores a scalar for [`NativeType::Int64`].
+Int64(i64),
+/// Stores a scalar for [`NativeType::UInt8`].
+UInt8(u8),
+/// Stores a scalar for [`NativeType::UInt16`].
+UInt16(u16),
+/// Stores a scalar for [`NativeType::UInt32`].
+UInt32(u32),
+/// Stores a scalar for [`NativeType::UInt64`].
+UInt64(u64),
+/// Stores a scalar for [`NativeType::Float16`].
+Float16(f16),
+/// Stores a scalar for [`NativeType::Float32`].
+Float32(f32),
+/// Stores a scalar for [`NativeType::Float64`].
+Float64(f64),
+/// Stores a scalar for [`NativeType::Timestamp`].
+Timestamp(LogicalTimestamp),
+/// Stores a scalar for [NativeType::Date].
+Date(LogicalDate),
+/// Stores a scalar for [`NativeType::Time`].
+Time(LogicalTime),
+/// Stores a scalar for [`NativeType::Duration`].
+Duration(LogicalDuration),
+/// Stores a scalar for [`NativeType::Interval`].
+Interval(LogicalInterval),
+/// Stores a scalar for [`NativeType::Binary`].
+Binary(Vec),
+/// Stores a scalar for [`NativeType::FixedSizeBinary`].
+FixedSizeBinary(LogicalFixedSizeBinary),
+/// Stores a scalar for [`NativeType::String`].
+String(String),
+/// Stores a scalar for [`NativeType::List`].
+List(LogicalList),
+/// Stores a scalar for [`NativeType::FixedSizeList`].
+FixedSizeList(LogicalFixedSizeList),
+/// Stores a scalar for [`NativeType::Struct`].
+Struct(LogicalStruct),
+/// Stores a scalar for [`NativeType::Union`].
+Union(LogicalUnion),
+/// Stores a scalar for [`NativeType::Decimal`].
+Decimal(LogicalDecimal),
+/// Stores a scalar for [`NativeType::Map`].
+Map(LogicalMap),
+// TODO logical-types: Values for ExtensionTypes

Review Comment:
   I think we don't need a dedicated scalar for ExtensionTypes; they can be built 
on top of these native scalars instead.
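   As a rough illustration of that idea (names below are made-up stand-ins, not 
types from this PR): an extension-typed value can simply pair its extension type 
with one of the existing native scalar variants.
   ```rust
   // Illustration only: `LogicalScalar` here is a tiny stand-in for the enum in
   // this PR, and `ExtensionScalar` is a hypothetical wrapper, not proposed API.
   #[derive(Clone, Debug)]
   enum LogicalScalar {
       Int64(i64),
       String(String),
   }

   // An extension value: the extension's name plus its storage value, expressed
   // with the existing native scalar variants instead of a dedicated variant.
   #[derive(Clone, Debug)]
   struct ExtensionScalar {
       extension_type: String, // e.g. "uuid" or "json"
       storage: LogicalScalar, // underlying native representation
   }

   fn main() {
       let uuid = ExtensionScalar {
           extension_type: "uuid".to_string(),
           storage: LogicalScalar::String("5a8352e9".to_string()),
       };
       println!("{uuid:?}");
   }
   ```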




[PR] feat: add table source to DML proto to eliminate need for table lookup after deserialisation [datafusion]

2025-02-12 Thread via GitHub


milenkovicm opened a new pull request, #14631:
URL: https://github.com/apache/datafusion/pull/14631

   ## Which issue does this PR close?
   
   This is more of a request-for-comments change at the moment; 
no PR at the moment, I would open it after the discussion
   
   - Closes #.
   
   ## Rationale for this change
   
   As discussed in https://github.com/apache/datafusion-ballista/issues/1164, the 
DML statement has a table reference instead of a table source, which impacts 
usability when serialising/deserialising the logical plan. After deserialisation 
another table lookup needs to be performed, which hurts usability in the case of 
ballista (where we keep two separate session contexts and only one of them has 
the table definition).
   
   This PR adds the table source to LogicalPlan::DML as well, to improve usability.
   
   
   ## What changes are included in this PR?
   
   This PR changes: 
   
   - LogicalPlan::DML
   - proto LogicalPlanNode::DML
   
   ## Are these changes tested?
   
   Using existing tests.
   
   ## Are there any user-facing changes?
   
   The proto definition changed.
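   A rough sketch of the intended shape (all names and fields below are 
illustrative stand-ins, not the real DataFusion API or the final design): the DML 
node carries the resolved table source next to the table reference, so a 
deserialised plan does not need a second catalog lookup.
   ```rust
   // Illustration only: every type below is a placeholder stand-in.
   use std::sync::Arc;

   trait TableSource {
       fn table_name(&self) -> &str;
   }

   struct TableReference(String);
   enum WriteOp { Insert, Delete, Update }
   struct LogicalPlan; // placeholder for the DML input plan

   struct DmlStatement {
       table_name: TableReference,
       // Proposed addition: the resolved source, serialised together with the plan.
       table_source: Arc<dyn TableSource>,
       op: WriteOp,
       input: Arc<LogicalPlan>,
   }
   ```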


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: type checking [datafusion-python]

2025-02-12 Thread via GitHub


chenkovsky commented on PR #993:
URL: 
https://github.com/apache/datafusion-python/pull/993#issuecomment-2653946007

   @kylebarron should I remove type annotation, and review bug fix first?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix(substrait): Do not add implicit groupBy expressions when building logical plans from Substrait [datafusion]

2025-02-12 Thread via GitHub


Blizzara commented on code in PR #14553:
URL: https://github.com/apache/datafusion/pull/14553#discussion_r1952817778


##
datafusion/expr/src/logical_plan/builder.rs:
##
@@ -1090,11 +1090,31 @@ impl LogicalPlanBuilder {
 group_expr: impl IntoIterator>,
 aggr_expr: impl IntoIterator>,
 ) -> Result {
-let group_expr = normalize_cols(group_expr, &self.plan)?;
+self._aggregate(group_expr, aggr_expr, true)
+}
+
+pub fn aggregate_without_implicit_group_by_exprs(
+self,
+group_expr: impl IntoIterator>,
+aggr_expr: impl IntoIterator>,
+) -> Result {
+self._aggregate(group_expr, aggr_expr, false)
+}
+
+fn _aggregate(
+self,
+group_expr: impl IntoIterator>,
+aggr_expr: impl IntoIterator>,
+include_implicit_group_by_exprs: bool,
+) -> Result {
+let mut group_expr = normalize_cols(group_expr, &self.plan)?;
 let aggr_expr = normalize_cols(aggr_expr, &self.plan)?;
 
-let group_expr =
-add_group_by_exprs_from_dependencies(group_expr, 
self.plan.schema())?;
+if include_implicit_group_by_exprs {
+group_expr =
+add_group_by_exprs_from_dependencies(group_expr, 
self.plan.schema())?;
+}

Review Comment:
   nit: style-wise I'd prefer keeping `group_expr` non-mut and doing something 
like:
   ```suggestion
    let group_expr = if include_implicit_group_by_exprs {
            add_group_by_exprs_from_dependencies(group_expr, self.plan.schema())?
        } else {
            group_expr
        };
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix(substrait): Do not add implicit groupBy expressions when building logical plans from Substrait [datafusion]

2025-02-12 Thread via GitHub


Blizzara commented on code in PR #14553:
URL: https://github.com/apache/datafusion/pull/14553#discussion_r1952814846


##
datafusion/substrait/tests/cases/roundtrip_logical_plan.rs:
##
@@ -300,6 +300,17 @@ async fn aggregate_grouping_rollup() -> Result<()> {
 ).await
 }
 
+#[tokio::test]
+async fn multilayer_aggregate() -> Result<()> {

Review Comment:
   was this succeeding also before? (I'd guess so as it'd add the extra 
groupbys but take that into consideration while producing the plan, is that 
right?)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Minor: Add docs and examples for `DataFusionErrorBuilder` [datafusion]

2025-02-12 Thread via GitHub


jonahgao commented on code in PR #14551:
URL: https://github.com/apache/datafusion/pull/14551#discussion_r1952818429


##
datafusion/common/src/error.rs:
##
@@ -602,6 +607,9 @@ impl DataFusionError {
 DiagnosticsIterator { head: self }.next()
 }
 
+/// Return an iterator this [`DataFusionError`] and any other

Review Comment:
   Perhaps it should be "Return an iterator over this ..."?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Update Community Events in concepts-readings-events.md [datafusion]

2025-02-12 Thread via GitHub


berkaysynnada merged PR #14629:
URL: https://github.com/apache/datafusion/pull/14629


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] refactor: Move various parts of datasource out of core [datafusion]

2025-02-12 Thread via GitHub


alamb commented on PR #14616:
URL: https://github.com/apache/datafusion/pull/14616#issuecomment-2653971828

   Ok, I am going to merge this one in and we can keep working on this in 
follow on PRs. 
   
   @logan-keede can you update the plan on 
https://github.com/apache/datafusion/issues/1 and keep on hacking ?
   
   Thanks again!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] refactor: Move various parts of datasource out of core [datafusion]

2025-02-12 Thread via GitHub


alamb merged PR #14616:
URL: https://github.com/apache/datafusion/pull/14616


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Minor: Add docs and examples for `DataFusionErrorBuilder` [datafusion]

2025-02-12 Thread via GitHub


alamb commented on PR #14551:
URL: https://github.com/apache/datafusion/pull/14551#issuecomment-2653981501

   Thank you very much for the review @jonahgao 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Minor: Add docs and examples for `DataFusionErrorBuilder` [datafusion]

2025-02-12 Thread via GitHub


alamb commented on code in PR #14551:
URL: https://github.com/apache/datafusion/pull/14551#discussion_r1952833858


##
datafusion/common/src/error.rs:
##
@@ -602,6 +607,9 @@ impl DataFusionError {
 DiagnosticsIterator { head: self }.next()
 }
 
+/// Return an iterator this [`DataFusionError`] and any other

Review Comment:
   in 1da42a2b8



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] bug: improve schema checking for `insert into` cases [datafusion]

2025-02-12 Thread via GitHub


zhuqi-lucas commented on code in PR #14572:
URL: https://github.com/apache/datafusion/pull/14572#discussion_r1952620730


##
datafusion/sqllogictest/test_files/insert_to_external.slt:
##
@@ -60,17 +60,16 @@ STORED AS parquet
 LOCATION 'test_files/scratch/insert_to_external/parquet_types_partitioned/'
 PARTITIONED BY (b);
 
-query I
+query error
 insert into dictionary_encoded_parquet_partitioned
 select * from dictionary_encoded_values
 
-2
+DataFusion error: Error during planning: Inserting query must have the same 
schema nullability as the table. Expected table field 'b' nullability: false, 
got field: 'b', nullability: true

Review Comment:
   @jayzhan211  This 
https://github.com/apache/datafusion/pull/14439/files#diff-51757b2b1d0a07b88551d88eabeba7f74e11b5217e44203ac7c6f613c0221196R273
 was merged less than a week ago. I think it starts from there. I also ran the sql 
logic test locally with `-- --complete` and got the same result, so I guess the CI 
also uses it to generate and verify?
   
   
   We may need a follow-up issue to investigate it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: Reduce timestamp issues in native_datafusion and native_icerberg_compat Parquet modes [datafusion-comet]

2025-02-12 Thread via GitHub


mbutrovich commented on code in PR #1387:
URL: https://github.com/apache/datafusion-comet/pull/1387#discussion_r1952625044


##
native/core/src/parquet/parquet_support.rs:
##
@@ -596,7 +595,10 @@ fn cast_array(
 parquet_options: &SparkParquetOptions,
 ) -> DataFusionResult {
 use DataType::*;
-let array = array_with_timezone(array, parquet_options.timezone.clone(), 
Some(to_type))?;
+let array = match to_type {
+Timestamp(_, None) => array, // array_with_timezone does not support 
to_type of NTZ.

Review Comment:
   Yeah, it shouldn't be called with that type:
   
https://github.com/apache/datafusion-comet/blob/f099e6e40aa18441c7882e5bffd9d6dfb10c6c19/native/spark-expr/src/utils.rs#L59
   
   The condition that matches inside `array_with_timezone` was added by #1348 
and we likely didn't understand the blast radius at the time.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: Reduce timestamp issues in native_datafusion and native_icerberg_compat Parquet modes [datafusion-comet]

2025-02-12 Thread via GitHub


mbutrovich commented on code in PR #1387:
URL: https://github.com/apache/datafusion-comet/pull/1387#discussion_r1952626361


##
native/core/src/parquet/parquet_support.rs:
##
@@ -596,7 +595,10 @@ fn cast_array(
 parquet_options: &SparkParquetOptions,
 ) -> DataFusionResult {
 use DataType::*;
-let array = array_with_timezone(array, parquet_options.timezone.clone(), 
Some(to_type))?;
+let array = match to_type {
+Timestamp(_, None) => array, // array_with_timezone does not support 
to_type of NTZ.

Review Comment:
   In fact maybe I'll remove that in this PR if nothing breaks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] bug: improve schema checking for `insert into` cases [datafusion]

2025-02-12 Thread via GitHub


zhuqi-lucas commented on code in PR #14572:
URL: https://github.com/apache/datafusion/pull/14572#discussion_r1952513001


##
datafusion/sqllogictest/test_files/insert_to_external.slt:
##
@@ -60,17 +60,16 @@ STORED AS parquet
 LOCATION 'test_files/scratch/insert_to_external/parquet_types_partitioned/'
 PARTITIONED BY (b);
 
-query I
+query error
 insert into dictionary_encoded_parquet_partitioned
 select * from dictionary_encoded_values
 
-2
+DataFusion error: Error during planning: Inserting query must have the same 
schema nullability as the table. Expected table field 'b' nullability: false, 
got field: 'b', nullability: true

Review Comment:
   Added comments in the latest PR, thanks!
   
   The error output should come after `query error` to pass CI; I think it's 
auto-generated since this PR:
   
   https://github.com/apache/datafusion/pull/14439



##
datafusion/sqllogictest/test_files/insert_to_external.slt:
##
@@ -60,17 +60,16 @@ STORED AS parquet
 LOCATION 'test_files/scratch/insert_to_external/parquet_types_partitioned/'
 PARTITIONED BY (b);
 
-query I
+query error
 insert into dictionary_encoded_parquet_partitioned
 select * from dictionary_encoded_values
 
-2
+DataFusion error: Error during planning: Inserting query must have the same 
schema nullability as the table. Expected table field 'b' nullability: false, 
got field: 'b', nullability: true

Review Comment:
   Added comments in the latest PR, thanks! @jayzhan211 
   
   The error output should come after `query error` to pass CI; I think it's 
auto-generated since this PR:
   
   https://github.com/apache/datafusion/pull/14439



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Implement predicate pruning for not like expressions [datafusion]

2025-02-12 Thread via GitHub


UBarney commented on code in PR #14567:
URL: https://github.com/apache/datafusion/pull/14567#discussion_r1952625400


##
datafusion/physical-optimizer/src/pruning.rs:
##
@@ -1710,6 +1711,66 @@ fn build_like_match(
 Some(combined)
 }
 
+// For predicate `col NOT LIKE 'foo%'`, we rewrite it as `(col_min NOT LIKE 
'foo%' OR col_max NOT LIKE 'foo%')`. If both col_min and col_max have the 
prefix foo, we skip the entire row group (as we can be certain that all data in 
this row group has the prefix foo).
+fn build_not_like_match(
+expr_builder: &mut PruningExpressionBuilder<'_>,
+) -> Result> {
+// col NOT LIKE 'prefix%' ->  !(col_min LIKE 'prefix%' && col_max LIKE 
'prefix%') -> (col_min NOT LIKE 'prefix%' || col_max NOT LIKE 'prefix%')
+
+let min_column_expr = expr_builder.min_column_expr()?;
+let max_column_expr = expr_builder.max_column_expr()?;
+
+let scalar_expr = expr_builder.scalar_expr();
+
+let pattern = extract_string_literal(scalar_expr).ok_or_else(|| {
+plan_datafusion_err!("cannot extract literal from NOT LIKE expression")
+})?;
+
+let chars: Vec = pattern.chars().collect();
+for i in 0..chars.len() - 1 {
+// Check if current char is a wildcard and is not escaped with 
backslash
+if (chars[i] == '%' || chars[i] == '_') && (i == 0 || chars[i - 1] != 
'\\') {
+// Example: For pattern "foo%bar", the row group might include 
values like
+// ["foobar", "food", "foodbar"], making it unsafe to prune.
+// Even if the min/max values in the group (e.g., "foobar" and 
"foodbar")
+// match the pattern, intermediate values like "food" may not
+// match the full pattern "foo%bar", making pruning unsafe.
+// (truncate foo%bar to foo% have same problem)
+return Err(plan_datafusion_err!(
+"NOT LIKE expressions with unescaped wildcards ('%' or '_') at 
the beginning or middle of the pattern are not supported"
+));
+}
+}
+
+if chars.last() == Some(&'_') && (chars.len() > 1 && chars[chars.len() - 
2] != '\\') {

Review Comment:
   Now it only handles `constant-prefix%`. Added `split_constant_prefix`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] bug: improve schema checking for `insert into` cases [datafusion]

2025-02-12 Thread via GitHub


zhuqi-lucas commented on code in PR #14572:
URL: https://github.com/apache/datafusion/pull/14572#discussion_r1952620730


##
datafusion/sqllogictest/test_files/insert_to_external.slt:
##
@@ -60,17 +60,16 @@ STORED AS parquet
 LOCATION 'test_files/scratch/insert_to_external/parquet_types_partitioned/'
 PARTITIONED BY (b);
 
-query I
+query error
 insert into dictionary_encoded_parquet_partitioned
 select * from dictionary_encoded_values
 
-2
+DataFusion error: Error during planning: Inserting query must have the same 
schema nullability as the table. Expected table field 'b' nullability: false, 
got field: 'b', nullability: true

Review Comment:
   @jayzhan211  This 
https://github.com/apache/datafusion/pull/14439/files#diff-51757b2b1d0a07b88551d88eabeba7f74e11b5217e44203ac7c6f613c0221196R273
 was merged less than a week ago. I think it starts from there. I also ran the sql 
logic test locally with `-- --complete` and got the same result, so I guess the CI 
also uses it to generate and verify?
   
   
   We may need a follow-up issue to investigate it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: Reduce timestamp issues in native_datafusion and native_icerberg_compat Parquet modes [datafusion-comet]

2025-02-12 Thread via GitHub


mbutrovich commented on code in PR #1387:
URL: https://github.com/apache/datafusion-comet/pull/1387#discussion_r1952625044


##
native/core/src/parquet/parquet_support.rs:
##
@@ -596,7 +595,10 @@ fn cast_array(
 parquet_options: &SparkParquetOptions,
 ) -> DataFusionResult {
 use DataType::*;
-let array = array_with_timezone(array, parquet_options.timezone.clone(), 
Some(to_type))?;
+let array = match to_type {
+Timestamp(_, None) => array, // array_with_timezone does not support 
to_type of NTZ.

Review Comment:
   Yeah, it shouldn't be called with that type:
   
https://github.com/apache/datafusion-comet/blob/f099e6e40aa18441c7882e5bffd9d6dfb10c6c19/native/spark-expr/src/utils.rs#L59
   
   The condition that matches inside `array_with_timezone` was added by #1348...
   
https://github.com/apache/datafusion-comet/blob/f099e6e40aa18441c7882e5bffd9d6dfb10c6c19/native/spark-expr/src/utils.rs#L76
   ...and we likely didn't understand the blast radius at the time.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Implement predicate pruning for not like expressions [datafusion]

2025-02-12 Thread via GitHub


findepi commented on code in PR #14567:
URL: https://github.com/apache/datafusion/pull/14567#discussion_r1952641084


##
datafusion/physical-optimizer/src/pruning.rs:
##
@@ -1710,6 +1711,76 @@ fn build_like_match(
 Some(combined)
 }
 
+// For predicate `col NOT LIKE 'const_prefix%'`, we rewrite it as `(col_min 
NOT LIKE 'const_prefix%' OR col_max NOT LIKE 'const_prefix%')`. If both col_min 
and col_max have the prefix const_prefix, we skip the entire row group (as we 
can be certain that all data in this row group has the prefix const_prefix).
+fn build_not_like_match(
+expr_builder: &mut PruningExpressionBuilder<'_>,
+) -> Result> {
+// col NOT LIKE 'const_prefix%' -> !(col_min LIKE 'const_prefix%' && 
col_max LIKE 'const_prefix%') -> (col_min NOT LIKE 'const_prefix%' || col_max 
NOT LIKE 'const_prefix%')

Review Comment:
   this is not true either
   
   the truth is (or should be):
   
   ```
   col NOT LIKE 'const_prefix%' ->
  col_max < 'const_prefix' OR col_min >= 'const_prefiy'
  (notice the last character of the rightmost constant is different)
   ```



##
datafusion/physical-optimizer/src/pruning.rs:
##
@@ -1710,6 +1711,76 @@ fn build_like_match(
 Some(combined)
 }
 
+// For predicate `col NOT LIKE 'const_prefix%'`, we rewrite it as `(col_min 
NOT LIKE 'const_prefix%' OR col_max NOT LIKE 'const_prefix%')`. If both col_min 
and col_max have the prefix const_prefix, we skip the entire row group (as we 
can be certain that all data in this row group has the prefix const_prefix).

Review Comment:
   This is entirely not true. Please update to reflect actual logic



##
datafusion/physical-optimizer/src/pruning.rs:
##
@@ -1710,6 +1711,76 @@ fn build_like_match(
 Some(combined)
 }
 
+// For predicate `col NOT LIKE 'const_prefix%'`, we rewrite it as `(col_min 
NOT LIKE 'const_prefix%' OR col_max NOT LIKE 'const_prefix%')`. If both col_min 
and col_max have the prefix const_prefix, we skip the entire row group (as we 
can be certain that all data in this row group has the prefix const_prefix).
+fn build_not_like_match(
+expr_builder: &mut PruningExpressionBuilder<'_>,
+) -> Result> {
+// col NOT LIKE 'const_prefix%' -> !(col_min LIKE 'const_prefix%' && 
col_max LIKE 'const_prefix%') -> (col_min NOT LIKE 'const_prefix%' || col_max 
NOT LIKE 'const_prefix%')
+
+let min_column_expr = expr_builder.min_column_expr()?;
+let max_column_expr = expr_builder.max_column_expr()?;
+
+let scalar_expr = expr_builder.scalar_expr();
+
+let pattern = extract_string_literal(scalar_expr).ok_or_else(|| {
+plan_datafusion_err!("cannot extract literal from NOT LIKE expression")
+})?;
+
+let (const_prefix, remaining) = split_constant_prefix(pattern);
+if const_prefix.is_empty() || remaining != "%" {
+// we can not handle `%` at the beginning or in the middle of the 
pattern
+// Example: For pattern "foo%bar", the row group might include values 
like
+// ["foobar", "food", "foodbar"], making it unsafe to prune.
+// Even if the min/max values in the group (e.g., "foobar" and 
"foodbar")
+// match the pattern, intermediate values like "food" may not
+// match the full pattern "foo%bar", making pruning unsafe.
+// (truncate foo%bar to foo% have same problem)
+
+// we can not handle pattern containing `_`
+// Example: For pattern "foo_", row groups might contain ["fooa", 
"fooaa", "foob"],
+// which means not every row is guaranteed to match the pattern.
+return Err(plan_datafusion_err!(
+"NOT LIKE expressions only support constant_prefix+wildcard`%`"
+));
+}
+
+let min_col_not_like_epxr = Arc::new(phys_expr::LikeExpr::new(
+true,
+false,
+Arc::clone(&min_column_expr),
+Arc::clone(scalar_expr),

Review Comment:
   This should be
   
   col_min >= 'const_prefiy'
   
   (and renamed)



##
datafusion/physical-optimizer/src/pruning.rs:
##
@@ -1710,6 +1711,76 @@ fn build_like_match(
 Some(combined)
 }
 
+// For predicate `col NOT LIKE 'const_prefix%'`, we rewrite it as `(col_min 
NOT LIKE 'const_prefix%' OR col_max NOT LIKE 'const_prefix%')`. If both col_min 
and col_max have the prefix const_prefix, we skip the entire row group (as we 
can be certain that all data in this row group has the prefix const_prefix).
+fn build_not_like_match(
+expr_builder: &mut PruningExpressionBuilder<'_>,
+) -> Result> {
+// col NOT LIKE 'const_prefix%' -> !(col_min LIKE 'const_prefix%' && 
col_max LIKE 'const_prefix%') -> (col_min NOT LIKE 'const_prefix%' || col_max 
NOT LIKE 'const_prefix%')
+
+let min_column_expr = expr_builder.min_column_expr()?;
+let max_column_expr = expr_builder.max_column_expr()?;
+
+let scalar_expr = expr_builder.scalar_expr();
+
+let pattern = extract_string_literal(scalar_expr).ok_or_else(|| {
+plan_datafusion_e

Re: [PR] chore(deps): bump substrait from 0.53.1 to 0.53.2 [datafusion]

2025-02-12 Thread via GitHub


jonahgao commented on PR #14627:
URL: https://github.com/apache/datafusion/pull/14627#issuecomment-2653835578

   Thank you @mbrobbel 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore(deps): bump substrait from 0.53.1 to 0.53.2 [datafusion]

2025-02-12 Thread via GitHub


jonahgao merged PR #14627:
URL: https://github.com/apache/datafusion/pull/14627


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] StaticInvoke class checked with elided types [datafusion-comet]

2025-02-12 Thread via GitHub


EmilyMatt opened a new issue, #1391:
URL: https://github.com/apache/datafusion-comet/issues/1391

   ### Describe the bug
   
   In here
   
https://github.com/apache/datafusion-comet/blob/f099e6e40aa18441c7882e5bffd9d6dfb10c6c19/spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala#L2130
   
   We check whether staticInvoke's staticObject is an instance of 
Class[CharVarcharCodegenUtils] so that we don't enter the case block for other 
types. While this is currently saved by the various checks that follow, it is 
still a logic error, because Scala erases generic type information.
   I.e., at runtime the check that actually happens is not 
"s.staticObject.isInstanceOf[Class[CharVarcharCodegenUtils]]", which is the 
expected check; the check that runs is "s.staticObject.isInstanceOf[Class]", which 
will always evaluate to true (more specifically, it becomes 
isInstanceOf[Class[Object]] or something similar).
   
   This is easily fixed by comparing staticObject directly to the class of 
CharVarcharCodegenUtils, which is also of type Class at runtime, instead of using 
an isInstanceOf check.
   
   ### Steps to reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Minor: remove unused `AutoFinishBzEncoder` [datafusion]

2025-02-12 Thread via GitHub


jonahgao opened a new pull request, #14630:
URL: https://github.com/apache/datafusion/pull/14630

   ## Which issue does this PR close?
   
   N/A
   
   ## Rationale for this change
   
   Follow-up of [chore(deps): bump bzip2 from 0.5.0 to 0.5.1 
#14620](https://github.com/apache/datafusion/pull/14620)
   
   The auto-finish on drop feature has been added back in bzip2 0.5.1, so 
there's no need to keep our implementation.
   > 
[3f3bb54](https://github.com/trifectatechfoundation/bzip2-rs/commit/3f3bb54746ff1afee6a6039018dc9c6c2d832b7b)
 Add finisher drop implementation to BzEncoder
   
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   Yes
   
   
   ## Are there any user-facing changes?
   No
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Implement predicate pruning for not like expressions [datafusion]

2025-02-12 Thread via GitHub


UBarney commented on code in PR #14567:
URL: https://github.com/apache/datafusion/pull/14567#discussion_r1952755081


##
datafusion/physical-optimizer/src/pruning.rs:
##
@@ -1710,6 +1711,76 @@ fn build_like_match(
 Some(combined)
 }
 
+// For predicate `col NOT LIKE 'const_prefix%'`, we rewrite it as `(col_min 
NOT LIKE 'const_prefix%' OR col_max NOT LIKE 'const_prefix%')`. If both col_min 
and col_max have the prefix const_prefix, we skip the entire row group (as we 
can be certain that all data in this row group has the prefix const_prefix).
+fn build_not_like_match(
+expr_builder: &mut PruningExpressionBuilder<'_>,
+) -> Result> {
+// col NOT LIKE 'const_prefix%' -> !(col_min LIKE 'const_prefix%' && 
col_max LIKE 'const_prefix%') -> (col_min NOT LIKE 'const_prefix%' || col_max 
NOT LIKE 'const_prefix%')

Review Comment:
   > this is not true either
   
   Why? Does it prune row groups that contain data that does not match the 
pattern? Or is it just inefficient?  
   
   First, let's clarify one point: if a row group contains any data that does 
not match the pattern, then this row group **must not** be pruned. The expected 
behavior is to return all data that does not match. If the row group gets 
pruned, data loss will occur (see [this 
PR](https://github.com/apache/datafusion/pull/561)).  
   
   ```
   For `col NOT LIKE 'const_prefix%'`:  
   It applies the condition:  
   `col_max < 'const_prefix' OR col_min >= 'const_prefiy'`  
   (Note that the last character of the rightmost constant is different.)  
   ```
   
   I think this mistakenly prunes row groups that contain data not matching the 
pattern.  
   
   Consider this case: `col NOT LIKE 'const_prefix%'`. The row group might 
include values such as `["aaa", "b", "const_prefix"]`. The condition `col_max < 
'const_prefix' OR col_min >= 'const_prefiy'` evaluates to `false`, causing the 
row group to be pruned. However, this is incorrect because `"aaa"` does not 
match the pattern and should be included in the result.
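   
   To make the safety argument concrete, a minimal self-contained sketch 
(illustration only, not DataFusion's pruning code; `can_prune_not_like_prefix` and 
`next_prefix` are made-up helpers): a container may be pruned for 
`col NOT LIKE 'prefix%'` only when its min/max guarantee that every row starts 
with the prefix, i.e. NOT LIKE is false for every row.
   ```rust
   // Bump the last character of the prefix to get an exclusive upper bound.
   // Simplified: ignores char-boundary edge cases such as the last char being MAX.
   fn next_prefix(prefix: &str) -> Option<String> {
       let mut chars: Vec<char> = prefix.chars().collect();
       let last = chars.pop()?;
       let bumped = char::from_u32(last as u32 + 1)?;
       chars.push(bumped);
       Some(chars.into_iter().collect())
   }

   // Safe to prune only if every value v satisfies prefix <= v < next_prefix,
   // which means every v starts with the prefix and therefore matches LIKE.
   fn can_prune_not_like_prefix(min: &str, max: &str, prefix: &str) -> bool {
       match next_prefix(prefix) {
           Some(upper) => min >= prefix && max < upper.as_str(),
           None => false,
       }
   }

   fn main() {
       // The example above: ["aaa", "b", "const_prefix"] must NOT be pruned,
       // because "aaa" does not match the prefix and has to be returned.
       assert!(!can_prune_not_like_prefix("aaa", "const_prefix", "const_prefix"));
       // A container where every value starts with the prefix can be pruned.
       assert!(can_prune_not_like_prefix(
           "const_prefix_a",
           "const_prefix_z",
           "const_prefix"
       ));
   }
   ```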
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Implement predicate pruning for not like expressions [datafusion]

2025-02-12 Thread via GitHub


UBarney commented on code in PR #14567:
URL: https://github.com/apache/datafusion/pull/14567#discussion_r1952755081


##
datafusion/physical-optimizer/src/pruning.rs:
##
@@ -1710,6 +1711,76 @@ fn build_like_match(
 Some(combined)
 }
 
+// For predicate `col NOT LIKE 'const_prefix%'`, we rewrite it as `(col_min 
NOT LIKE 'const_prefix%' OR col_max NOT LIKE 'const_prefix%')`. If both col_min 
and col_max have the prefix const_prefix, we skip the entire row group (as we 
can be certain that all data in this row group has the prefix const_prefix).
+fn build_not_like_match(
+expr_builder: &mut PruningExpressionBuilder<'_>,
+) -> Result> {
+// col NOT LIKE 'const_prefix%' -> !(col_min LIKE 'const_prefix%' && 
col_max LIKE 'const_prefix%') -> (col_min NOT LIKE 'const_prefix%' || col_max 
NOT LIKE 'const_prefix%')

Review Comment:
   Why? Does it prune row groups that contain data that does not match the 
pattern? Or is it just inefficient?  
   
   First, let's clarify one point: if a row group contains any data that does 
not match the pattern, then this row group **must not** be pruned. The expected 
behavior is to return all data that does not match. If the row group gets 
pruned, data loss will occur (see [this 
PR](https://github.com/apache/datafusion/pull/561)).  
   
   ```
   For `col NOT LIKE 'const_prefix%'`:  
   It applies the condition:  
   `col_max < 'const_prefix' OR col_min >= 'const_prefiy'`  
   (Note that the last character of the rightmost constant is different.)  
   ```
   
   I think this mistakenly prunes row groups that contain data not matching the 
pattern.  
   
   Consider this case: `col NOT LIKE 'const_prefix%'`. The row group might 
include values such as `["aaa", "b", "const_prefix"]`. The condition `col_max < 
'const_prefix' OR col_min >= 'const_prefiy'` evaluates to `false`, causing the 
row group to be pruned. However, this is incorrect because `"aaa"` does not 
match the pattern and should be included in the result.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] fix: Passthrough condition in StaticInvoke case block [datafusion-comet]

2025-02-12 Thread via GitHub


EmilyMatt opened a new pull request, #1392:
URL: https://github.com/apache/datafusion-comet/pull/1392

   ## Which issue does this PR close?
   
   Closes #1391 .
   
   ## Rationale for this change
   
   It is a logic error.
   
   ## What changes are included in this PR?
   
   Moved to a direct comparison instead of checking whether Class isInstanceOf 
Class
   
   ## How are these changes tested?
   
   I believe this is an inherent language feature, it does not need a TestSuite.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] refactor: Move various parts of datasource out of core [datafusion]

2025-02-12 Thread via GitHub


alamb commented on PR #14616:
URL: https://github.com/apache/datafusion/pull/14616#issuecomment-2653893209

   > also, I am not suggesting that we keep the file_format and physical_plan 
folders in the new solution,
   > though I was going to suggest a single folder for everything file related if 
we moved forward with the old approach
   
   That is my understanding as well
   
   https://github.com/user-attachments/assets/2c1672c2-b41a-4878-8ce8-d4689e3d240e";
 />
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] DDL Statement Propagation (`INSERT INTO` support) [datafusion-ballista]

2025-02-12 Thread via GitHub


milenkovicm commented on issue #1164:
URL: 
https://github.com/apache/datafusion-ballista/issues/1164#issuecomment-2653893307

   hey @alamb, sorry to bother you again. I put some effort into implementing 
INSERT INTO; the implementation is close to option 1, `Replace TableReference with 
actual table in the LogicalPlan::DML`. 
   
   Right now I have an approach where I created my own logical extension node in 
ballista and serialise the TableSource there, as demonstrated in #1177.
   
   I also implemented it in datafusion's LogicalPlan::DML, adding a table source 
reference; more details at 
https://github.com/apache/datafusion/commit/02db5e56d8bda26c890265bb047d2591f05641b6
   which is where I believe it should go in the first place. I now see it as a bit 
of a "hacked" approach, but I don't see any better way to do it. 
   
   
   Any advice would be appreciated, thanks a lot 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] [Epic] Split datasources out from `datafusion` crate (`datafusion/core`) [datafusion]

2025-02-12 Thread via GitHub


logan-keede commented on issue #1:
URL: https://github.com/apache/datafusion/issues/1#issuecomment-2654016041

   There has been an update to the plan, specifically the addition of a 
`datafusion-datasource` crate:
   
   - `datafusion-catalog-listing`: `ListingTable` and associated types like 
`PartitionedFile`
   - `datafusion-datasource`: **NEW** crate that holds `FileCompressionType`, 
`PartitionedFile`, and `DataSource`
   - `datafusion-datasource-parquet`: `ParquetExec` and file format
   - `datafusion-datasource-avro`: `AvroExec` and file formats
   - `datafusion-datasource-arrow`
   - `datafusion-datasource-json`
   - `datafusion-datasource-csv`
   
   _Originally posted by @alamb and suggested by @jayzhan211 in 
https://github.com/apache/datafusion/issues/14616#issuecomment-2653423772_
   
   Please refer to the above-mentioned issue for more context.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Optimize `repeat` function [datafusion]

2025-02-12 Thread via GitHub


alamb commented on issue #14610:
URL: https://github.com/apache/datafusion/issues/14610#issuecomment-2654019734

   Hi @zjregee  -- would it be possible to iterate through the input once to 
calculate how much space is needed, and then create a StringBuilder with the 
appropriate capacity via 
[`StringBuilder::with_capacity`](https://docs.rs/arrow/latest/arrow/array/struct.GenericByteBuilder.html#tymethod.with_capacity)?
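   
   A minimal sketch of that approach (illustration only; `repeat_strings` is a 
made-up helper, not the actual DataFusion kernel): one pass to compute the 
required byte capacity, then a single pre-allocated `StringBuilder`.
   ```rust
   use arrow::array::{ArrayRef, Int64Array, StringArray, StringBuilder};
   use std::sync::Arc;

   // Repeat each string `n` times, sizing the builder up front.
   fn repeat_strings(values: &StringArray, counts: &Int64Array) -> ArrayRef {
       // First pass: compute the total output byte length.
       let data_capacity: usize = values
           .iter()
           .zip(counts.iter())
           .map(|(v, n)| match (v, n) {
               (Some(v), Some(n)) if n > 0 => v.len() * n as usize,
               _ => 0,
           })
           .sum();

       // Allocate once with the computed capacity, then fill.
       let mut builder = StringBuilder::with_capacity(values.len(), data_capacity);
       for (v, n) in values.iter().zip(counts.iter()) {
           match (v, n) {
               (Some(v), Some(n)) => builder.append_value(v.repeat(n.max(0) as usize)),
               _ => builder.append_null(),
           }
       }
       Arc::new(builder.finish())
   }

   fn main() {
       let values = StringArray::from(vec![Some("ab"), None, Some("xyz")]);
       let counts = Int64Array::from(vec![Some(3), Some(2), Some(0)]);
       let out = repeat_strings(&values, &counts);
       println!("{out:?}"); // ["ababab", null, ""]
   }
   ```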


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Feb 4, 2025: This week(s) in DataFusion [datafusion]

2025-02-12 Thread via GitHub


alamb commented on issue #14491:
URL: https://github.com/apache/datafusion/issues/14491#issuecomment-2654024105

   Great writeup of the Amsterdam meetup: 
   - 
https://github.com/apache/datafusion/discussions/12988#discussioncomment-12140634
   
   Thanks @oznur-synnada 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] [Epic] Split datasources out from `datafusion` crate (`datafusion/core`) [datafusion]

2025-02-12 Thread via GitHub


alamb commented on issue #1:
URL: https://github.com/apache/datafusion/issues/1#issuecomment-2654007044

   Update here is that @logan-keede  is cranking right along:
   - https://github.com/apache/datafusion/pull/14616
   
   After some discussion with @jayzhan211  I think we have a good plan going 
forward
   
   I feel like we are on the cusp of finally getting these things split from 
the core 🙏 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix(substrait): Do not add implicit groupBy expressions when building logical plans from Substrait [datafusion]

2025-02-12 Thread via GitHub


Blizzara commented on code in PR #14553:
URL: https://github.com/apache/datafusion/pull/14553#discussion_r1952853782


##
datafusion/expr/src/logical_plan/builder.rs:
##
@@ -1090,11 +1090,31 @@ impl LogicalPlanBuilder {
 group_expr: impl IntoIterator>,
 aggr_expr: impl IntoIterator>,
 ) -> Result {
-let group_expr = normalize_cols(group_expr, &self.plan)?;
+self._aggregate(group_expr, aggr_expr, true)
+}
+
+pub fn aggregate_without_implicit_group_by_exprs(
+self,
+group_expr: impl IntoIterator>,
+aggr_expr: impl IntoIterator>,
+) -> Result {
+self._aggregate(group_expr, aggr_expr, false)
+}
+
+fn _aggregate(

Review Comment:
   I don't think there's a need for the `_`, since the function is already private (by virtue of not being `pub fn`). A name like `aggregate_inner` is used quite a lot for this.
   
   Alternatively, given that the `LogicalPlanBuilder` aggregate method doesn't do that much, we could also just inline it into the Substrait consumer. That way it doesn't change the `LogicalPlanBuilder` API, which might be easier.
   
   Or maybe this whole `add_group_by_exprs_from_dependencies` thing should move from the plan builder into the analyzer/optimizer? Intuitively it feels like the constructed logical plan shouldn't do this kind of magic, but the analyzer/optimizer can if it makes things faster to execute. That might be a bigger undertaking, though, so I'd be quite fine with this PR or the alternative above first.
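
   As a rough sketch of that rename (stand-in types only, not the actual DataFusion definitions), the helper can simply drop the leading underscore and rely on ordinary privacy:
   
   ```rust
   // Minimal illustration with a stand-in builder type; the real
   // LogicalPlanBuilder methods also take group/aggregate expressions.
   struct Builder;
   
   impl Builder {
       // Public entry point: keeps today's implicit group-by behavior.
       pub fn aggregate(self) -> Result<Self, String> {
           self.aggregate_inner(true)
       }
   
       // Private helper: no `_` prefix needed, privacy already comes from
       // the absence of `pub`.
       fn aggregate_inner(self, add_implicit_group_by: bool) -> Result<Self, String> {
           let _ = add_implicit_group_by;
           Ok(self)
       }
   }
   ```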






Re: [PR] bug: improve schema checking for `insert into` cases [datafusion]

2025-02-12 Thread via GitHub


zhuqi-lucas commented on code in PR #14572:
URL: https://github.com/apache/datafusion/pull/14572#discussion_r1952860229


##
datafusion/common/src/dfschema.rs:
##
@@ -1028,21 +1028,41 @@ impl SchemaExt for Schema {
 })
 }
 
-    fn logically_equivalent_names_and_types(&self, other: &Self) -> bool {
+    fn logically_equivalent_names_and_types(&self, other: &Self) -> Result<()> {
         if self.fields().len() != other.fields().len() {
-            return false;
+            _plan_err!(
+                "Inserting query must have the same schema length as the table. \
+                Expected table schema length: {}, got: {}",
+                self.fields().len(),
+                other.fields().len()
+            )
+        } else {
+            self.fields()
+                .iter()
+                .zip(other.fields().iter())
+                .try_for_each(|(f1, f2)| {
+                    // only check the case when the table field is not nullable
+                    // and the insert data field is nullable
+                    if !f1.is_nullable() && f2.is_nullable() {
Review Comment:
   Thanks @jonahgao, got it now; this example explains it well.






Re: [PR] Add support for `ORDER BY ALL` [datafusion-sqlparser-rs]

2025-02-12 Thread via GitHub


iffyio commented on code in PR #1724:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1724#discussion_r1952866285


##
src/ast/query.rs:
##
@@ -2205,31 +2205,50 @@ pub enum JoinConstraint {
 #[derive(Debug, Clone, PartialEq, PartialOrd, Eq, Ord, Hash)]
 #[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
 #[cfg_attr(feature = "visitor", derive(Visit, VisitMut))]
-pub struct OrderBy {
-    pub exprs: Vec<OrderByExpr>,
-    /// Optional: `INTERPOLATE`
-    /// Supported by [ClickHouse syntax]
+pub enum OrderBy {

Review Comment:
   Would it make sense to use this representation instead?
   ```rust
   enum OrderByKind {
       Expr(Vec<OrderByExpr>),
       All
   }
   struct OrderBy {
       pub kind: OrderByKind,
       pub interpolate: Option<Interpolate>,
   }
   ```
   thinking in order to avoid duplicating interpolate (and other shared) parameters



##
src/parser/mod.rs:
##
@@ -9183,17 +9183,19 @@ impl<'a> Parser<'a> {
 
     pub fn parse_optional_order_by(&mut self) -> Result<Option<OrderBy>, ParserError> {
         if self.parse_keywords(&[Keyword::ORDER, Keyword::BY]) {
-            let order_by_exprs = self.parse_comma_separated(Parser::parse_order_by_expr)?;
-            let interpolate = if dialect_of!(self is ClickHouseDialect | GenericDialect) {
-                self.parse_interpolations()?
+            let order_by = if self.parse_keyword(Keyword::ALL) {

Review Comment:
   Could we introduce a dialect method `self.dialect.supports_order_by_all()` and guard this behavior with it? It would be incorrect for this branch to return true for a dialect that does not support the feature.
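
   A minimal sketch of that kind of gate, with stand-in types (the `supports_order_by_all` method is the proposal here, not an existing sqlparser API):
   
   ```rust
   // Stand-in Dialect trait to illustrate the opt-in flag.
   trait Dialect {
       // Proposed hook: dialects that support `ORDER BY ALL` override this.
       fn supports_order_by_all(&self) -> bool {
           false
       }
   }
   
   struct DuckDbDialect;
   impl Dialect for DuckDbDialect {
       fn supports_order_by_all(&self) -> bool {
           true
       }
   }
   
   struct AnsiDialect;
   impl Dialect for AnsiDialect {} // keeps the default `false`
   
   // Only treat a following `ALL` keyword as the special form when the
   // dialect has opted in; otherwise fall back to expression parsing.
   fn parses_as_order_by_all(dialect: &dyn Dialect, next_keyword_is_all: bool) -> bool {
       dialect.supports_order_by_all() && next_keyword_is_all
   }
   ```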



##
src/ast/query.rs:
##
@@ -2205,31 +2205,50 @@ pub enum JoinConstraint {
 #[derive(Debug, Clone, PartialEq, PartialOrd, Eq, Ord, Hash)]
 #[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
 #[cfg_attr(feature = "visitor", derive(Visit, VisitMut))]
-pub struct OrderBy {
-    pub exprs: Vec<OrderByExpr>,
-    /// Optional: `INTERPOLATE`
-    /// Supported by [ClickHouse syntax]
+pub enum OrderBy {
+    /// ALL syntax of [DuckDB] and [ClickHouse].
     ///
-    /// [ClickHouse syntax]: 
-    pub interpolate: Option<Interpolate>,
+    /// [DuckDB]: 
+    /// [ClickHouse]: 
+    All(OrderByAll),
+
+    /// Expressions
+    Expressions(OrderByExprsWithInterpolate),
 }
 
 impl fmt::Display for OrderBy {
     fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
         write!(f, "ORDER BY")?;
-        if !self.exprs.is_empty() {
-            write!(f, " {}", display_comma_separated(&self.exprs))?;
-        }
-        if let Some(ref interpolate) = self.interpolate {
-            match &interpolate.exprs {
-                Some(exprs) => write!(f, " INTERPOLATE ({})", display_comma_separated(exprs))?,
-                None => write!(f, " INTERPOLATE")?,
+        match self {
+            OrderBy::Expressions(exprs) => {
+                write!(f, "{}", exprs)?;
+            }
+            OrderBy::All(all) => {
+                write!(f, " ALL{}", all)?;
             }
         }
+
         Ok(())
     }
 }
 
+impl OrderBy {
+    pub fn get_exprs(&self) -> Option<&Vec<OrderByExpr>> {
+        match self {
+            OrderBy::Expressions(exprs_with_interpolate) => Some(&exprs_with_interpolate.exprs),
+            OrderBy::All(_) => None,
+        }
+    }
+    pub fn get_interpolate(&self) -> Option<&Interpolate> {
+        match self {
+            OrderBy::Expressions(exprs_with_interpolate) => {
+                exprs_with_interpolate.interpolate.as_ref()
+            }
+            OrderBy::All(_) => None,
+        }
+    }
+}

Review Comment:
   I'm thinking we would rather avoid these methods, to avoid more API breakage than necessary when things change. Did we introduce them primarily for the tests? If so, I think we could, as usual, have the test match the expected expr variant.
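
   For example (with a stand-in enum, not the real AST types), a test can assert the variant directly instead of going through a getter:
   
   ```rust
   // Stand-in for the OrderBy shape discussed above.
   #[derive(Debug)]
   enum OrderBy {
       All,
       Expressions(Vec<String>),
   }
   
   #[test]
   fn order_by_all_parses_to_the_all_variant() {
       // In a real test `parsed` would come from the parser.
       let parsed = OrderBy::All;
       match parsed {
           OrderBy::All => {}
           OrderBy::Expressions(exprs) => panic!("expected ORDER BY ALL, got {exprs:?}"),
       }
   }
   ```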






Re: [I] Optimize `repeat` function [datafusion]

2025-02-12 Thread via GitHub


alamb commented on issue #14610:
URL: https://github.com/apache/datafusion/issues/14610#issuecomment-2654015159

   Thanks @zjregee 
   
   > This seems to work well when the number of repetitions is small, because 
it reduces the number of memory copies, but when the number of repetitions is 
large, the repeated memory expansion will lead to performance degradation.
   
   Is there any way to increase the capacity of StringBuilder, like 
[`Vec::reserve`](https://doc.rust-lang.org/std/vec/struct.Vec.html#method.reserve)
 ?
   
   If not maybe we could add that API upstream 🤔 
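
   For reference, `StringBuilder` can at least be pre-sized at construction time via `with_capacity`; a minimal sketch assuming the `arrow` crate is below (I'm not aware of a mid-build `reserve` equivalent, which is the open question above):
   
   ```rust
   use arrow::array::{StringArray, StringBuilder};
   
   fn build_repeated(value: &str, n: usize, rows: usize) -> StringArray {
       // Pre-allocate one offset per row plus enough bytes for every repeated
       // value, so appending does not have to regrow the buffers.
       let mut builder = StringBuilder::with_capacity(rows, rows * value.len() * n);
       for _ in 0..rows {
           builder.append_value(value.repeat(n));
       }
       builder.finish()
   }
   ```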





Re: [PR] fix: type checking [datafusion-python]

2025-02-12 Thread via GitHub


kylebarron commented on PR #993:
URL: 
https://github.com/apache/datafusion-python/pull/993#issuecomment-2653967414

   Let's cc @timsaucer for thoughts.





Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-12 Thread via GitHub


Weijun-H commented on PR #14411:
URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2654051392

   > > I wonder why tpch_mem_sf10 is slower for some queries? Might it be 
possible the created memtable is not created evenly because of the new round 
robin (that might be fixable e.g. by introducing another repartition after 
memoryexec).
   > 
   > Thank you for the advice @Dandandan. We will certainly check that after 
completing on-demand optimizations.
   
   I agree with @berkaysynnada, because `OnDemandRepartition` is not enabled by default; it should only be turned on when necessary. Still, it is a good point for a follow-up optimization, @Dandandan 👍.





Re: [PR] Introducing mutation testing [datafusion]

2025-02-12 Thread via GitHub


2010YOUY01 commented on PR #14590:
URL: https://github.com/apache/datafusion/pull/14590#issuecomment-2653100779

   > > Thanks @edmondop I cannot see this flow in the list of PR checks, how long it takes?
   > 
   > On my fork it didn't appear either, you need to merge the PR. Maybe you can try to merge on your own fork? Note that at the moment this checks can only be run manually (until we figure out what to do with the information in it, which is useful imho)
   > 
   > See here, it took 54 minutes https://github.com/edmondop/arrow-datafusion/actions/runs/13248520757 and it crashed before it could run all 22844 mutants
   > 
   > ```rust
   > Found 22844 mutants to test
   >  WARN An explicit test timeout is recommended when using --baseline=skip; using 300 seconds by default
   > MISSED   datafusion/functions/src/math/log.rs:222:61: replace == with != in ::simplify in 84.2s build + 6.9s test
   > MISSED   datafusion/functions/src/lib.rs:179:5: replace register_all -> Result<()> with Ok(()) in 18.1s build + 5.7s test
   > MISSED   datafusion/physical-optimizer/src/join_selection.rs:227:9: replace || with && in try_collect_left in 77.1s build + 0.4s test
   > MISSED   datafusion/physical-plan/src/aggregates/mod.rs:1278:27: replace <= with > in group_id_array in 90.6s build + 41.3s test
   > MISSED   datafusion/physical-optimizer/src/limit_pushdown.rs:77:9: replace ::name -> &str with "" in 14.8s build + 0.3s test
   > MISSED   datafusion/functions-aggregate/src/array_agg.rs:383:13: replace - with + in ::size in 70.4s build + 1.7s test
   > MISSED   datafusion/functions-aggregate/src/sum.rs:469:9: replace >::size -> usize with 1 in 35.2s build + 1.6s test
   > MISSED   datafusion/functions-nested/src/concat.rs:292:35: delete ! in ::return_type in 40.2s build + 5.2s test
   > MISSED   datafusion/proto-common/src/to_proto/mod.rs:247:9: replace ::try_from -> Result with Ok(Default::default()) in 14.9s build + 2.7s test
   > MISSED   datafusion/optimizer/src/eliminate_duplicated_expr.rs:45:9: replace ::eq -> bool with true in 46.4s build + 11.5s test
   > MISSED   datafusion/functions-aggregate/src/string_agg.rs:176:9: replace ::size -> usize with 0 in 41.0s build + 1.6s test
   > MISSED   datafusion/physical-plan/src/aggregates/group_values/single_group_by/bytes.rs:77:9: replace >::size -> usize with 0 in 23.8s build + 39.4s test
   > MISSED   datafusion/catalog/src/memory.rs:166:9: replace ::table_names -> Vec with vec!["xyzzy".into()] in 25.3s build + 0.6s test
   > MISSED   datafusion/expr-common/src/type_coercion/aggregates.rs:276:5: replace is_covariance_support_arg_type -> bool with false in 8.7s build + 5.7s test
   > MISSED   datafusion/physical-expr/src/window/window_expr.rs:294:20: delete ! in AggregateWindowExpr::get_result_column in 38.3s build + 9.0s test
   > TIMEOUT  datafusion/physical-plan/src/aggregates/row_hash.rs:1059:9: replace GroupedHashAggregateStream::update_merged_stream -> Result<()> with Ok(()) in 53.6s build + 300.0s test
   > MISSED   datafusion/physical-expr-common/src/sort_expr.rs:429:9: replace LexOrdering::from_lex_requirement -> LexOrdering with Default::default() in 7.7s build + 10.2s test
   > MISSED   datafusion/sql/src/statement.rs:1607:69: replace - with / in SqlToRel<'_, S>::show_variable_to_plan in 45.2s build + 9.4s test
   > ```
   
   Thank you for the experiments; I think mutation tests can definitely improve software quality.
   
   I'm wondering how many mutants this CI run actually executed in those 53 minutes. It looks like each one takes about a minute, so only around 50 of the 22844 ran, and that already surfaced this many misses. Each miss is also expensive to fix and review: I tried to fix the first one and it took a while, since they all require figuring out how an unfamiliar code snippet works (so fixing them all is not realistic).
   
   An alternative could be to run mutation tests only on the code modified in each PR. That would make test-coverage gaps much easier to fix, since the PR author has just written the code in question.





Re: [PR] refactor: Move various parts of datasource out of core [datafusion]

2025-02-12 Thread via GitHub


jayzhan211 commented on PR #14616:
URL: https://github.com/apache/datafusion/pull/14616#issuecomment-2653103730

   `FileCompressionType`, `PartitionedFile`, and `FileRange` can be moved to `datasource`.
   
   If A and B are tightly coupled, you pull the shared structures out into C and have both A and B import C, rather than moving A and B together; a sketch of that layering is below.
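
   A toy illustration of that layering with made-up module names (think of `common` as crate C holding the shared types, with `listing` and `parquet_source` standing in for A and B):
   
   ```rust
   // Hypothetical modules standing in for separate crates.
   mod common {
       // Shared type that both sides need, factored out so neither depends on the other.
       pub struct PartitionedFile {
           pub path: String,
       }
   }
   
   mod listing {
       use super::common::PartitionedFile;
   
       pub fn list(paths: &[&str]) -> Vec<PartitionedFile> {
           paths
               .iter()
               .map(|p| PartitionedFile { path: (*p).to_string() })
               .collect()
       }
   }
   
   mod parquet_source {
       use super::common::PartitionedFile;
   
       pub fn describe(file: &PartitionedFile) -> String {
           format!("parquet source for {}", file.path)
       }
   }
   ```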





Re: [PR] chore(deps): bump prost-derive from 0.13.4 to 0.13.5 [datafusion]

2025-02-12 Thread via GitHub


dependabot[bot] closed pull request #14622: chore(deps): bump prost-derive from 
0.13.4 to 0.13.5
URL: https://github.com/apache/datafusion/pull/14622





Re: [PR] chore(deps): bump prost-derive from 0.13.4 to 0.13.5 [datafusion]

2025-02-12 Thread via GitHub


dependabot[bot] commented on PR #14622:
URL: https://github.com/apache/datafusion/pull/14622#issuecomment-2653137159

   Looks like prost-derive is up-to-date now, so this is no longer needed.





Re: [PR] Fix ci test [datafusion]

2025-02-12 Thread via GitHub


xudong963 commented on PR #14625:
URL: https://github.com/apache/datafusion/pull/14625#issuecomment-2653202385

   This PR fixes the main CI; once this PR's own CI passes, I'll merge it.





Re: [PR] bug: improve schema checking for `insert into` cases [datafusion]

2025-02-12 Thread via GitHub


zhuqi-lucas commented on PR #14572:
URL: https://github.com/apache/datafusion/pull/14572#issuecomment-2653205051

   The CI error is caused by:
   
   https://github.com/apache/datafusion/pull/14625





Re: [PR] chore(deps): bump clap from 4.5.28 to 4.5.29 [datafusion]

2025-02-12 Thread via GitHub


xudong963 merged PR #14619:
URL: https://github.com/apache/datafusion/pull/14619





Re: [I] Nullable doesn't work when create memory table [datafusion]

2025-02-12 Thread via GitHub


xudong963 commented on issue #14522:
URL: https://github.com/apache/datafusion/issues/14522#issuecomment-2653115249

   Thanks @blaginin, I added a test for this case.





Re: [PR] chore(deps): bump bzip2 from 0.5.0 to 0.5.1 [datafusion]

2025-02-12 Thread via GitHub


xudong963 merged PR #14620:
URL: https://github.com/apache/datafusion/pull/14620





Re: [I] Integrate `Analyzer` within LogicalPlan building stage [datafusion]

2025-02-12 Thread via GitHub


xudong963 commented on issue #14618:
URL: https://github.com/apache/datafusion/issues/14618#issuecomment-2653171072

   +1 for the change





Re: [PR] bug: improve schema checking for `insert into` cases [datafusion]

2025-02-12 Thread via GitHub


zhuqi-lucas commented on code in PR #14572:
URL: https://github.com/apache/datafusion/pull/14572#discussion_r1952317116


##
datafusion/sqllogictest/test_files/insert.slt:
##
@@ -296,8 +296,11 @@ insert into table_without_values(field1) values(3);
 1
 
 # insert NULL values for the missing column (field1), but column is non-nullable
-statement error Execution error: Invalid batch column at '0' has null but schema specifies non-nullable
+statement error
 insert into table_without_values(field2) values(300);
+

Review Comment:
   This is because we now check the schema when planning the `insert into`; previously, the check only happened in the execution plan.






Re: [PR] chore(deps): bump sqllogictest from 0.26.4 to 0.27.0 [datafusion]

2025-02-12 Thread via GitHub


xudong963 merged PR #14598:
URL: https://github.com/apache/datafusion/pull/14598





Re: [PR] bug: improve schema checking for `insert into` cases [datafusion]

2025-02-12 Thread via GitHub


zhuqi-lucas commented on code in PR #14572:
URL: https://github.com/apache/datafusion/pull/14572#discussion_r1952352719


##
datafusion/sqllogictest/test_files/insert_to_external.slt:
##
@@ -60,17 +60,16 @@ STORED AS parquet
 LOCATION 'test_files/scratch/insert_to_external/parquet_types_partitioned/'
 PARTITIONED BY (b);
 
-query I
+query error
 insert into dictionary_encoded_parquet_partitioned
 select * from dictionary_encoded_values
 
-2
+DataFusion error: Error during planning: Inserting query must have the same schema nullability as the table. Expected table field 'b' nullability: false, got field: 'b', nullability: true

Review Comment:
   This is also expected: `PARTITIONED BY (b)` makes column `b` non-nullable.
   
   I don't think we should support inserting nullable values for a partition key.






Re: [I] Equivalence class projection does not find new equivalent classes correctly [datafusion]

2025-02-12 Thread via GitHub


berkaysynnada closed issue #14326: Equivalence class projection does not find 
new equivalent classes correctly
URL: https://github.com/apache/datafusion/issues/14326





Re: [PR] equivalence classes: use normalized mapping for projection [datafusion]

2025-02-12 Thread via GitHub


berkaysynnada merged PR #14327:
URL: https://github.com/apache/datafusion/pull/14327





Re: [I] Project Ideas for GSoC 2025 (Google Summer of Code) [datafusion]

2025-02-12 Thread via GitHub


ozankabak commented on issue #14478:
URL: https://github.com/apache/datafusion/issues/14478#issuecomment-2653026106

   Hi @PokIsemaine, we applied independently (not due to any particular reason, 
we simply weren't aware of these three projects applying jointly). We will 
announce here if we become a participating organization.





[PR] Add test for nullable doesn't work when create memory table [datafusion]

2025-02-12 Thread via GitHub


xudong963 opened a new pull request, #14624:
URL: https://github.com/apache/datafusion/pull/14624

   ## Which issue does this PR close?
   
   
   
   - Closes #14522
   
   
   ## Rationale for this change
   
   
   
   Add a test for #14522
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   





Re: [PR] fix: case-sensitive quoted identifiers in DELETE statements [datafusion]

2025-02-12 Thread via GitHub


xudong963 merged PR #14584:
URL: https://github.com/apache/datafusion/pull/14584





Re: [I] DELETE statement fails to preserve quoted identifiers [datafusion]

2025-02-12 Thread via GitHub


xudong963 closed issue #14583: DELETE statement fails to preserve quoted 
identifiers
URL: https://github.com/apache/datafusion/issues/14583





[PR] chore(deps): group `prost` and `pbjson` dependabot updates [datafusion]

2025-02-12 Thread via GitHub


mbrobbel opened a new pull request, #14626:
URL: https://github.com/apache/datafusion/pull/14626

   ## Which issue does this PR close?
   
   None.
   
   ## Rationale for this change
   
   Groups PRs like the ones listed into one:
   - https://github.com/apache/datafusion/pull/14621
   - https://github.com/apache/datafusion/pull/14622
   - https://github.com/apache/datafusion/pull/14623
   
   ## What changes are included in this PR?
   
   Dependabot config update to group `prost*` and `pbjson*` updates.
   
   ## Are these changes tested?
   
   No.
   
   ## Are there any user-facing changes?
   
   No.





Re: [PR] refactor: Move various parts of datasource out of core [datafusion]

2025-02-12 Thread via GitHub


alamb commented on PR #14616:
URL: https://github.com/apache/datafusion/pull/14616#issuecomment-2653423772

   > `FileCompressionType`, `PartitionedFile`, and `FileRange` can be moved to `datasource`.
   > 
   > If A and B are tightly coupled, you pull the shared structures out into C and have both A and B import C, rather than moving A and B together.
   
   The plan I proposed on https://github.com/apache/datafusion/issues/1 has a structure like this:
   - `datafusion-catalog-listing`: `ListingTable` and associated types like 
`PartitionedFile`
   - `datafusion-datasource-parquet`: `ParquetExec` and file format
   - `datafusion-datasource-avro`: `AvroExec` and file formats
   - `datafusion-datasource-arrow`
   - `datafusion-datasource-json`
   - `datafusion-datasource-csv`
   
   This was before the refactor with `DataSource` landed:
   - https://github.com/apache/datafusion/pull/14224
   
   @jayzhan211 are you proposing we add another crate in there, something like this?
   - `datafusion-catalog-listing`: `ListingTable` and associated types like 
`PartitionedFile`
   - `datafusion-datasource`: **NEW** crate that holds `FileCompressionType`, `PartitionedFile`, and `DataSource`
   - `datafusion-datasource-parquet`: `ParquetExec` and file format
   - `datafusion-datasource-avro`: `AvroExec` and file formats
   - `datafusion-datasource-arrow`
   - `datafusion-datasource-json`
   - `datafusion-datasource-csv`
   
   
   




