Re: [I] Move `SanityChecker` into `datafusion-physical-optimizer` crate [datafusion]

2025-01-10 Thread via GitHub


mnpw commented on issue #14072:
URL: https://github.com/apache/datafusion/issues/14072#issuecomment-2585112045

   @cj-zhukov Apologies, I started working on this PR without explicitly 
assigning it to myself. Would appreciate your review on 
https://github.com/apache/datafusion/pull/14083.





Re: [I] build broken due to breaking change in winnow crate [datafusion-comet]

2025-01-10 Thread via GitHub


andygrove closed issue #1263: build broken due to breaking change in winnow 
crate
URL: https://github.com/apache/datafusion-comet/issues/1263





Re: [PR] chore: [comet-parquet-exec] Unit test fixes, default scan impl to comet_native [datafusion-comet]

2025-01-10 Thread via GitHub


parthchandra commented on PR #1265:
URL: 
https://github.com/apache/datafusion-comet/pull/1265#issuecomment-2584878525

   > > @parthchandra could you run `cargo fmt` and `cargo clippy`
   > 
   > and `make format` .. that is if we want to see tests passing on this PR, 
which I think they should now? Or do we want to wait until we PR 
comet-parquet-exec against main before trying to get CI green?
   
   Fixed the build. Yes, we should get the CI to turn green with this PR before 
we attempt any more changes to the feature branch





Re: [PR] fix: Fall back to Spark for distinct aggregates [datafusion-comet]

2025-01-10 Thread via GitHub


comphead commented on code in PR #1262:
URL: https://github.com/apache/datafusion-comet/pull/1262#discussion_r1911745757


##
spark/src/test/scala/org/apache/comet/exec/CometAggregateSuite.scala:
##
@@ -974,7 +974,8 @@ class CometAggregateSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
 }
   }
 
-  test("distinct") {
+  // TODO enable once https://github.com/apache/datafusion-comet/issues/1267 
is implemented

Review Comment:
   I'm thinking: should we add a test with unique values that currently fails due to the correctness issue?






Re: [PR] fix: Fall back to Spark for distinct aggregates [datafusion-comet]

2025-01-10 Thread via GitHub


andygrove commented on code in PR #1262:
URL: https://github.com/apache/datafusion-comet/pull/1262#discussion_r1911767513


##
spark/src/test/scala/org/apache/comet/exec/CometAggregateSuite.scala:
##
@@ -974,7 +974,8 @@ class CometAggregateSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
 }
   }
 
-  test("distinct") {
+  // TODO enable once https://github.com/apache/datafusion-comet/issues/1267 
is implemented

Review Comment:
   yes, many of the tests were passing because the aggregate was running in 
Spark rather than Comet or because the input values were already distinct






Re: [PR] fix: Fall back to Spark for distinct aggregates [datafusion-comet]

2025-01-10 Thread via GitHub


codecov-commenter commented on PR #1262:
URL: 
https://github.com/apache/datafusion-comet/pull/1262#issuecomment-2584940287

   ## 
[Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1262?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   All modified and coverable lines are covered by tests :white_check_mark:
   > Project coverage is 34.72%. Comparing base 
[(`be48839`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/be488390dcd89b2a5bff6f24b9ae7724383cffde?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 to head 
[(`43c8f55`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/43c8f559fc948a94cf7e3a5dcb70149ef2e29acb?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   > Report is 2 commits behind head on main.
   
   Additional details and impacted files
   
   
   ```diff
   @@             Coverage Diff              @@
   ##               main    #1262      +/-   ##
   ============================================
   + Coverage     34.69%   34.72%   +0.03%
   - Complexity      991      994       +3
   ============================================
     Files           116      116
     Lines         44891    44907      +16
     Branches       9864     9869       +5
   ============================================
   + Hits          15574    15595      +21
     Misses        26168    26168
   + Partials       3149     3144       -5
   ```
   
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1262?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   





Re: [PR] Minor: Make `group_schema` as `PhysicalGroupBy` method [datafusion]

2025-01-10 Thread via GitHub


jayzhan211 merged PR #14064:
URL: https://github.com/apache/datafusion/pull/14064





Re: [PR] Minor: Make `group_schema` as `PhysicalGroupBy` method [datafusion]

2025-01-10 Thread via GitHub


jayzhan211 commented on PR #14064:
URL: https://github.com/apache/datafusion/pull/14064#issuecomment-2585038720

   Thanks @berkaysynnada 





Re: [PR] Add comments to physical optimizer tests [datafusion]

2025-01-10 Thread via GitHub


jayzhan211 commented on PR #14075:
URL: https://github.com/apache/datafusion/pull/14075#issuecomment-2585039343

   Thanks @alamb 





[I] Support `MemoryExec` in proto `try_from_physical_plan` [datafusion]

2025-01-10 Thread via GitHub


jayzhan211 opened a new issue, #14082:
URL: https://github.com/apache/datafusion/issues/14082

   ### Is your feature request related to a problem or challenge?
   
   In `try_from_physical_plan` 
https://github.com/apache/datafusion/blob/d91a7c0f5b93bfb7061dcb6aa8b78dd31b7273b3/datafusion/proto/src/physical_plan/mod.rs#L1146-L2052
   
   MemoryExec is not supported yet
   
   ### Describe the solution you'd like
   
   Support memory exec with ordering
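   
   For illustration, a minimal round-trip sketch (assuming the `datafusion_proto::bytes` helpers and the current `MemoryExec::try_new` signature); today the serialization step is expected to fail because `MemoryExec` has no proto representation yet:
   
   ```rust
   use std::sync::Arc;
   
   use datafusion::arrow::array::Int32Array;
   use datafusion::arrow::datatypes::{DataType, Field, Schema};
   use datafusion::arrow::record_batch::RecordBatch;
   use datafusion::error::Result;
   use datafusion::physical_plan::memory::MemoryExec;
   use datafusion::physical_plan::ExecutionPlan;
   use datafusion::prelude::SessionContext;
   use datafusion_proto::bytes::{physical_plan_from_bytes, physical_plan_to_bytes};
   
   fn main() -> Result<()> {
       let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
       let batch = RecordBatch::try_new(
           Arc::clone(&schema),
           vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
       )?;
   
       // A single-partition MemoryExec; an output ordering could also be attached here.
       let plan: Arc<dyn ExecutionPlan> =
           Arc::new(MemoryExec::try_new(&[vec![batch]], schema, None)?);
   
       // Round-tripping through the proto representation is the path that needs
       // MemoryExec (and its ordering) to be supported.
       let bytes = physical_plan_to_bytes(plan)?;
       let ctx = SessionContext::new();
       let _decoded = physical_plan_from_bytes(&bytes, &ctx)?;
       Ok(())
   }
   ```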
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_





Re: [PR] Add comments to physical optimizer tests [datafusion]

2025-01-10 Thread via GitHub


jayzhan211 merged PR #14075:
URL: https://github.com/apache/datafusion/pull/14075





Re: [PR] Minor: Move `LimitPushdown` tests to be in the same file as the code [datafusion]

2025-01-10 Thread via GitHub


jayzhan211 merged PR #14076:
URL: https://github.com/apache/datafusion/pull/14076





Re: [I] Optimized spill file format [datafusion]

2025-01-10 Thread via GitHub


2010YOUY01 commented on issue #14078:
URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2585043689

   Although we're currently spilling column-wise record batches, I think this 
will change to row-wise batches in the future. It would be better to benchmark 
and optimize spilling the Arrow Row format in this issue as well.
   
   The reason is that the spilling operation involves sorting, spilling sorted 
runs, and reading back those runs for merging. Both sorting and merging benefit 
from the row format. The current implementation performs several unnecessary 
conversions between row and column formats, which could become inefficient.
   The preferred approach would be:
   1. Convert to the row format and do the sorting there.
   2. Maintain the row format until the final output (I believe `Sort` will benefit more from this, because it does two phases or merging, while `Aggregate` only does one phase).
   
   This is tracked by https://github.com/apache/datafusion/issues/7053
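   
   As a rough sketch of that path (assuming the arrow crate's `RowConverter` API; the columns and values are only for illustration):
   
   ```rust
   use std::sync::Arc;
   
   use arrow::array::{ArrayRef, Int32Array, StringArray};
   use arrow::datatypes::DataType;
   use arrow::error::ArrowError;
   use arrow::row::{RowConverter, SortField};
   
   fn main() -> Result<(), ArrowError> {
       // Two columns of a batch that is about to be sorted and spilled.
       let keys: ArrayRef = Arc::new(Int32Array::from(vec![3, 1, 2]));
       let vals: ArrayRef = Arc::new(StringArray::from(vec!["c", "a", "b"]));
   
       // One columnar -> row conversion for the whole sorted run.
       let converter = RowConverter::new(vec![
           SortField::new(DataType::Int32),
           SortField::new(DataType::Utf8),
       ])?;
       let rows = converter.convert_columns(&[keys, vals])?;
   
       // Rows compare as plain bytes, so sorting (and later k-way merging of
       // spilled runs) needs no per-column comparators and no conversions back.
       let mut sorted: Vec<_> = rows.iter().collect();
       sorted.sort_unstable();
   
       // Convert back to columns only for the final output.
       let _columns = converter.convert_rows(sorted)?;
       Ok(())
   }
   ```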





Re: [PR] feat: metadata columns [datafusion]

2025-01-10 Thread via GitHub


chenkovsky commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911868813


##
datafusion/common/src/dfschema.rs:
##
@@ -319,52 +465,56 @@ impl DFSchema {
 qualifiers.push(qualifier.cloned());
 }
 }
-let mut metadata = self.inner.metadata.clone();
-metadata.extend(other_schema.inner.metadata.clone());
+let mut metadata = self.inner.schema.metadata.clone();
+metadata.extend(other_schema.inner.schema.metadata.clone());
 
 let finished = schema_builder.finish();
 let finished_with_metadata = finished.with_metadata(metadata);
-self.inner = finished_with_metadata.into();
-self.field_qualifiers.extend(qualifiers);
+self.inner.schema = finished_with_metadata.into();
+self.inner.field_qualifiers.extend(qualifiers);
 }
 
 /// Get a list of fields
 pub fn fields(&self) -> &Fields {
-&self.inner.fields
+&self.inner.schema.fields
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector
 pub fn field(&self, i: usize) -> &Field {
-&self.inner.fields[i]
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.field(i - self.inner.len());
+}
+}
+self.inner.field(i)
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector and its qualifier
 pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, 
&Field) {
-(self.field_qualifiers[i].as_ref(), self.field(i))
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.qualified_field(i - self.inner.len());
+}
+}
+self.inner.qualified_field(i)

Review Comment:
   Hi @jayzhan211 I pushed a commit, could you please review it again?






Re: [PR] added "DEFAULT_CLI_FORMAT_OPTIONS" for cli and sqllogic test [datafusion]

2025-01-10 Thread via GitHub


jonahgao merged PR #14052:
URL: https://github.com/apache/datafusion/pull/14052





Re: [PR] added "DEFAULT_CLI_FORMAT_OPTIONS" for cli and sqllogic test [datafusion]

2025-01-10 Thread via GitHub


jonahgao commented on PR #14052:
URL: https://github.com/apache/datafusion/pull/14052#issuecomment-2585048618

   Thanks @jatin510





Re: [I] Functionality of `array_repeat` udf [datafusion]

2025-01-10 Thread via GitHub


jonahgao closed issue #13872: Functionality of `array_repeat` udf
URL: https://github.com/apache/datafusion/issues/13872





Re: [I] count distinct on NaN produces incorrect results [datafusion-comet]

2025-01-10 Thread via GitHub


parthchandra commented on issue #1238:
URL: 
https://github.com/apache/datafusion-comet/issues/1238#issuecomment-2585075506

   I think we will have to explicitly check whether both sides of a floating-point comparison are NaN to match Spark's behavior. By definition NaN is not equal to NaN, so technically Comet is correct.
   Falling back to Spark for floating point can be forced by a config value, but I don't think we should do that by default.
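   
   For illustration only (this is not Comet's actual code), a Spark-compatible equality check for doubles would treat NaN as equal to NaN:
   
   ```rust
   /// Hypothetical helper: Spark-style equality for f64, where NaN equals NaN.
   fn spark_eq_f64(a: f64, b: f64) -> bool {
       (a.is_nan() && b.is_nan()) || a == b
   }
   
   fn main() {
       // IEEE 754 semantics (what a plain native comparison gives us):
       // every NaN is distinct from every other NaN.
       assert!(f64::NAN != f64::NAN);
       // Spark semantics for DISTINCT/grouping: all NaNs collapse into one value.
       assert!(spark_eq_f64(f64::NAN, f64::NAN));
   }
   ```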





Re: [PR] feat: metadata columns [datafusion]

2025-01-10 Thread via GitHub


jayzhan211 commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911887850


##
datafusion/common/src/dfschema.rs:
##
@@ -319,52 +465,56 @@ impl DFSchema {
 qualifiers.push(qualifier.cloned());
 }
 }
-let mut metadata = self.inner.metadata.clone();
-metadata.extend(other_schema.inner.metadata.clone());
+let mut metadata = self.inner.schema.metadata.clone();
+metadata.extend(other_schema.inner.schema.metadata.clone());
 
 let finished = schema_builder.finish();
 let finished_with_metadata = finished.with_metadata(metadata);
-self.inner = finished_with_metadata.into();
-self.field_qualifiers.extend(qualifiers);
+self.inner.schema = finished_with_metadata.into();
+self.inner.field_qualifiers.extend(qualifiers);
 }
 
 /// Get a list of fields
 pub fn fields(&self) -> &Fields {
-&self.inner.fields
+&self.inner.schema.fields
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector
 pub fn field(&self, i: usize) -> &Field {
-&self.inner.fields[i]
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.field(i - self.inner.len());
+}
+}
+self.inner.field(i)
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector and its qualifier
 pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, 
&Field) {
-(self.field_qualifiers[i].as_ref(), self.field(i))
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.qualified_field(i - self.inner.len());
+}
+}
+self.inner.qualified_field(i)

Review Comment:
   Okay this approach looks good to me.






Re: [PR] feat: metadata columns [datafusion]

2025-01-10 Thread via GitHub


jayzhan211 commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911886029


##
datafusion/common/src/dfschema.rs:
##
@@ -442,22 +603,24 @@ impl DFSchema {
 
 /// Find all fields that match the given name
 pub fn fields_with_unqualified_name(&self, name: &str) -> Vec<&Field> {
-self.fields()
-.iter()
-.filter(|field| field.name() == name)
-.map(|f| f.as_ref())
-.collect()
+let mut fields: Vec<&Field> = 
self.inner.fields_with_unqualified_name(name);
+if let Some(schema) = self.metadata_schema() {
+fields.append(&mut schema.fields_with_unqualified_name(name));

Review Comment:
   ```suggestion
   fields.append(schema.fields_with_unqualified_name(name));
   ```



##
datafusion/common/src/dfschema.rs:
##
@@ -442,22 +603,24 @@ impl DFSchema {
 
 /// Find all fields that match the given name
 pub fn fields_with_unqualified_name(&self, name: &str) -> Vec<&Field> {
-self.fields()
-.iter()
-.filter(|field| field.name() == name)
-.map(|f| f.as_ref())
-.collect()
+let mut fields: Vec<&Field> = 
self.inner.fields_with_unqualified_name(name);
+if let Some(schema) = self.metadata_schema() {
+fields.append(&mut schema.fields_with_unqualified_name(name));
+}
+fields
 }
 
 /// Find all fields that match the given name and return them with their 
qualifier
 pub fn qualified_fields_with_unqualified_name(
 &self,
 name: &str,
 ) -> Vec<(Option<&TableReference>, &Field)> {
-self.iter()
-.filter(|(_, field)| field.name() == name)
-.map(|(qualifier, field)| (qualifier, field.as_ref()))
-.collect()
+let mut fields: Vec<(Option<&TableReference>, &Field)> =
+self.inner.qualified_fields_with_unqualified_name(name);
+if let Some(schema) = self.metadata_schema() {
+fields.append(&mut 
schema.qualified_fields_with_unqualified_name(name));

Review Comment:
   ```suggestion
   
fields.append(schema.qualified_fields_with_unqualified_name(name));
   ```



##
datafusion/expr/src/logical_plan/plan.rs:
##
@@ -2600,6 +2619,16 @@ impl TableScan {
 let df_schema = DFSchema::new_with_metadata(
 p.iter()
 .map(|i| {
+if *i >= schema.fields.len() {
+if let Some(metadata) = &metadata {
+return (
+Some(table_name.clone()),
+Arc::new(
+metadata.field(*i - 
METADATA_OFFSET).clone(),

Review Comment:
   handle where i < METADATA_OFFSET






Re: [PR] feat: Implement custom RecordBatch serde for shuffle for improved performance [datafusion-comet]

2025-01-10 Thread via GitHub


viirya commented on code in PR #1190:
URL: https://github.com/apache/datafusion-comet/pull/1190#discussion_r1911699224


##
native/core/src/execution/shuffle/codec.rs:
##
@@ -0,0 +1,695 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow_array::cast::AsArray;
+use arrow_array::types::Int32Type;
+use arrow_array::{
+Array, ArrayRef, BinaryArray, BooleanArray, Date32Array, Decimal128Array, 
DictionaryArray,
+Float32Array, Float64Array, Int16Array, Int32Array, Int64Array, Int8Array, 
RecordBatch,
+RecordBatchOptions, StringArray, TimestampMicrosecondArray,
+};
+use arrow_buffer::{BooleanBuffer, Buffer, NullBuffer, OffsetBuffer, 
ScalarBuffer};
+use arrow_schema::{DataType, Field, Schema, TimeUnit};
+use datafusion_common::DataFusionError;
+use std::io::Write;
+use std::sync::Arc;
+
+pub fn fast_codec_supports_type(data_type: &DataType) -> bool {
+match data_type {
+DataType::Boolean
+| DataType::Int8
+| DataType::Int16
+| DataType::Int32
+| DataType::Int64
+| DataType::Float32
+| DataType::Float64
+| DataType::Date32
+| DataType::Timestamp(TimeUnit::Microsecond, _)
+| DataType::Utf8
+| DataType::Binary => true,
+DataType::Decimal128(_, s) if *s >= 0 => true,
+DataType::Dictionary(k, v) if **k == DataType::Int32 => 
fast_codec_supports_type(v),
+_ => false,
+}
+}
+
+enum DataTypeId {
+Boolean = 0,
+Int8,
+Int16,
+Int32,
+Int64,
+Date32,
+Timestamp,
+TimestampNtz,
+Decimal128,
+Float32,
+Float64,
+Utf8,
+Binary,
+Dictionary,
+}
+
+pub struct BatchWriter<W: Write> {
+inner: W,
+}
+
+impl<W: Write> BatchWriter<W> {
+pub fn new(inner: W) -> Self {
+Self { inner }
+}
+
+/// Encode the schema (just column names because data types can vary per 
batch)
+pub fn write_partial_schema(&mut self, schema: &Schema) -> Result<(), 
DataFusionError> {
+let schema_len = schema.fields().len();
+self.inner.write_all(&schema_len.to_le_bytes())?;
+for field in schema.fields() {
+// field name
+let field_name = field.name();
+self.inner.write_all(&field_name.len().to_le_bytes())?;
+self.inner.write_all(field_name.as_str().as_bytes())?;
+// TODO nullable - assume all nullable for now
+}
+Ok(())
+}
+
+fn write_data_type(&mut self, data_type: &DataType) -> Result<(), 
DataFusionError> {
+match data_type {
+DataType::Boolean => {
+self.inner.write_all(&[DataTypeId::Boolean as u8])?;
+}
+DataType::Int8 => {
+self.inner.write_all(&[DataTypeId::Int8 as u8])?;
+}
+DataType::Int16 => {
+self.inner.write_all(&[DataTypeId::Int16 as u8])?;
+}
+DataType::Int32 => {
+self.inner.write_all(&[DataTypeId::Int32 as u8])?;
+}
+DataType::Int64 => {
+self.inner.write_all(&[DataTypeId::Int64 as u8])?;
+}
+DataType::Float32 => {
+self.inner.write_all(&[DataTypeId::Float32 as u8])?;
+}
+DataType::Float64 => {
+self.inner.write_all(&[DataTypeId::Float64 as u8])?;
+}
+DataType::Date32 => {
+self.inner.write_all(&[DataTypeId::Date32 as u8])?;
+}
+DataType::Timestamp(TimeUnit::Microsecond, None) => {
+self.inner.write_all(&[DataTypeId::TimestampNtz as u8])?;
+}
+DataType::Timestamp(TimeUnit::Microsecond, Some(tz)) => {
+self.inner.write_all(&[DataTypeId::Timestamp as u8])?;
+let tz_bytes = tz.as_bytes();
+self.inner.write_all(&tz_bytes.len().to_le_bytes())?;
+self.inner.write_all(tz_bytes)?;
+}
+DataType::Utf8 => {
+self.inner.write_all(&[DataTypeId::Utf8 as u8])?;
+}
+DataType::Binary => {
+self.inner.write_all(&[DataTypeId::Binary as u8])?;
+}
+   

Re: [PR] feat: Implement custom RecordBatch serde for shuffle for improved performance [datafusion-comet]

2025-01-10 Thread via GitHub


viirya commented on code in PR #1190:
URL: https://github.com/apache/datafusion-comet/pull/1190#discussion_r1911704555


##
native/core/src/execution/shuffle/codec.rs:
##
@@ -0,0 +1,695 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow_array::cast::AsArray;
+use arrow_array::types::Int32Type;
+use arrow_array::{
+Array, ArrayRef, BinaryArray, BooleanArray, Date32Array, Decimal128Array, 
DictionaryArray,
+Float32Array, Float64Array, Int16Array, Int32Array, Int64Array, Int8Array, 
RecordBatch,
+RecordBatchOptions, StringArray, TimestampMicrosecondArray,
+};
+use arrow_buffer::{BooleanBuffer, Buffer, NullBuffer, OffsetBuffer, 
ScalarBuffer};
+use arrow_schema::{DataType, Field, Schema, TimeUnit};
+use datafusion_common::DataFusionError;
+use std::io::Write;
+use std::sync::Arc;
+
+pub fn fast_codec_supports_type(data_type: &DataType) -> bool {
+match data_type {
+DataType::Boolean
+| DataType::Int8
+| DataType::Int16
+| DataType::Int32
+| DataType::Int64
+| DataType::Float32
+| DataType::Float64
+| DataType::Date32
+| DataType::Timestamp(TimeUnit::Microsecond, _)
+| DataType::Utf8
+| DataType::Binary => true,
+DataType::Decimal128(_, s) if *s >= 0 => true,
+DataType::Dictionary(k, v) if **k == DataType::Int32 => 
fast_codec_supports_type(v),
+_ => false,
+}
+}
+
+enum DataTypeId {
+Boolean = 0,
+Int8,
+Int16,
+Int32,
+Int64,
+Date32,
+Timestamp,
+TimestampNtz,
+Decimal128,
+Float32,
+Float64,
+Utf8,
+Binary,
+Dictionary,
+}
+
+pub struct BatchWriter<W: Write> {
+inner: W,
+}
+
+impl<W: Write> BatchWriter<W> {
+pub fn new(inner: W) -> Self {
+Self { inner }
+}
+
+/// Encode the schema (just column names because data types can vary per 
batch)
+pub fn write_partial_schema(&mut self, schema: &Schema) -> Result<(), 
DataFusionError> {
+let schema_len = schema.fields().len();
+self.inner.write_all(&schema_len.to_le_bytes())?;
+for field in schema.fields() {
+// field name
+let field_name = field.name();
+self.inner.write_all(&field_name.len().to_le_bytes())?;
+self.inner.write_all(field_name.as_str().as_bytes())?;
+// TODO nullable - assume all nullable for now
+}
+Ok(())
+}
+
+fn write_data_type(&mut self, data_type: &DataType) -> Result<(), 
DataFusionError> {
+match data_type {
+DataType::Boolean => {
+self.inner.write_all(&[DataTypeId::Boolean as u8])?;
+}
+DataType::Int8 => {
+self.inner.write_all(&[DataTypeId::Int8 as u8])?;
+}
+DataType::Int16 => {
+self.inner.write_all(&[DataTypeId::Int16 as u8])?;
+}
+DataType::Int32 => {
+self.inner.write_all(&[DataTypeId::Int32 as u8])?;
+}
+DataType::Int64 => {
+self.inner.write_all(&[DataTypeId::Int64 as u8])?;
+}
+DataType::Float32 => {
+self.inner.write_all(&[DataTypeId::Float32 as u8])?;
+}
+DataType::Float64 => {
+self.inner.write_all(&[DataTypeId::Float64 as u8])?;
+}
+DataType::Date32 => {
+self.inner.write_all(&[DataTypeId::Date32 as u8])?;
+}
+DataType::Timestamp(TimeUnit::Microsecond, None) => {
+self.inner.write_all(&[DataTypeId::TimestampNtz as u8])?;
+}
+DataType::Timestamp(TimeUnit::Microsecond, Some(tz)) => {
+self.inner.write_all(&[DataTypeId::Timestamp as u8])?;
+let tz_bytes = tz.as_bytes();
+self.inner.write_all(&tz_bytes.len().to_le_bytes())?;
+self.inner.write_all(tz_bytes)?;
+}
+DataType::Utf8 => {
+self.inner.write_all(&[DataTypeId::Utf8 as u8])?;
+}
+DataType::Binary => {
+self.inner.write_all(&[DataTypeId::Binary as u8])?;
+}
+   

Re: [PR] fix: Fall back to Spark for distinct aggregates [datafusion-comet]

2025-01-10 Thread via GitHub


comphead commented on code in PR #1262:
URL: https://github.com/apache/datafusion-comet/pull/1262#discussion_r1911726694


##
spark/src/test/scala/org/apache/comet/exec/CometAggregateSuite.scala:
##
@@ -974,7 +974,8 @@ class CometAggregateSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
 }
   }
 
-  test("distinct") {
+  // TODO enable once https://github.com/apache/datafusion-comet/issues/1267 
is implemented

Review Comment:
   so now this test passed? 






Re: [I] Distinct aggregates return incorrect results [datafusion-comet]

2025-01-10 Thread via GitHub


comphead commented on issue #1260:
URL: 
https://github.com/apache/datafusion-comet/issues/1260#issuecomment-2584865275

   that is really good find @andygrove 





[PR] Unit test fixes, default scan impl to comet_native [datafusion-comet]

2025-01-10 Thread via GitHub


parthchandra opened a new pull request, #1265:
URL: https://github.com/apache/datafusion-comet/pull/1265

   Notable changes: 
   
   1. There are three scan implementations: 
   
   | Name | Description | Operator | Underlying implementation |
   | --- | --- | --- | --- |
   | native_comet | original, supports primitive types (default) | CometScan | Comet |
   | native_datafusion | POC1, supports (or will support) complex types | CometNativeScan | DataFusion |
   | native_iceberg_compat | POC2, supports (or will support) complex types, exposes API for Iceberg | Comet Scan | DataFusion |
   
   The scan implementation can be selected by setting the **_conf_** 
`spark.comet.scan.impl` or by setting the **_environment variable_** 
`COMET_PARQUET_SCAN_IMPL`
   
   2. Plan compatibility suites generate a different plan based on the 
implementation. As a result, we now have three sets of expected plans based on 
the scan implementation chosen
   
   3. We now use the Spark Session timezone instead of UTC while reading 
timestamp fields. This is so that we can compare them with literal timestamps 
(Spark apparently automatically applies the session timezone to those)
   





Re: [PR] Unit test fixes, default scan impl to comet_native [datafusion-comet]

2025-01-10 Thread via GitHub


parthchandra commented on PR #1265:
URL: 
https://github.com/apache/datafusion-comet/pull/1265#issuecomment-2584607637

   @andygrove @mbutrovich 





Re: [PR] [comet-parquet-exec] Unit test fixes, default scan impl to comet_native [datafusion-comet]

2025-01-10 Thread via GitHub


andygrove commented on PR #1265:
URL: 
https://github.com/apache/datafusion-comet/pull/1265#issuecomment-2584713456

   @parthchandra could you run `cargo fmt` and `cargo clippy`





Re: [PR] chore: [comet-parquet-exec] Unit test fixes, default scan impl to comet_native [datafusion-comet]

2025-01-10 Thread via GitHub


andygrove commented on PR #1265:
URL: 
https://github.com/apache/datafusion-comet/pull/1265#issuecomment-2584726893

   > @parthchandra could you run `cargo fmt` and `cargo clippy`
   
   and `make format` .. that is if we want to see tests passing on this PR, 
which I think they should now? Or do we want to wait until we PR 
comet-parquet-exec against main before trying to get CI green?





Re: [PR] feat: Implement custom RecordBatch serde for shuffle for improved performance [datafusion-comet]

2025-01-10 Thread via GitHub


kazuyukitanimura commented on code in PR #1190:
URL: https://github.com/apache/datafusion-comet/pull/1190#discussion_r1911623676


##
native/core/benches/shuffle_writer.rs:
##
@@ -31,67 +31,54 @@ use std::sync::Arc;
 use tokio::runtime::Runtime;
 
 fn criterion_benchmark(c: &mut Criterion) {
+let batch = create_batch(8192, true);
 let mut group = c.benchmark_group("shuffle_writer");
-group.bench_function("shuffle_writer: encode (no compression))", |b| {
-let batch = create_batch(8192, true);
-let mut buffer = vec![];
-let ipc_time = Time::default();
-b.iter(|| {
-buffer.clear();
-let mut cursor = Cursor::new(&mut buffer);
-write_ipc_compressed(&batch, &mut cursor, &CompressionCodec::None, 
&ipc_time)
-});
-});
-group.bench_function("shuffle_writer: encode and compress (snappy)", |b| {
-let batch = create_batch(8192, true);
-let mut buffer = vec![];
-let ipc_time = Time::default();
-b.iter(|| {
-buffer.clear();
-let mut cursor = Cursor::new(&mut buffer);
-write_ipc_compressed(&batch, &mut cursor, 
&CompressionCodec::Snappy, &ipc_time)
-});
-});
-group.bench_function("shuffle_writer: encode and compress (lz4)", |b| {
-let batch = create_batch(8192, true);
-let mut buffer = vec![];
-let ipc_time = Time::default();
-b.iter(|| {
-buffer.clear();
-let mut cursor = Cursor::new(&mut buffer);
-write_ipc_compressed(&batch, &mut cursor, 
&CompressionCodec::Lz4Frame, &ipc_time)
-});
-});
-group.bench_function("shuffle_writer: encode and compress (zstd level 1)", 
|b| {
-let batch = create_batch(8192, true);
-let mut buffer = vec![];
-let ipc_time = Time::default();
-b.iter(|| {
-buffer.clear();
-let mut cursor = Cursor::new(&mut buffer);
-write_ipc_compressed(&batch, &mut cursor, 
&CompressionCodec::Zstd(1), &ipc_time)
-});
-});
-group.bench_function("shuffle_writer: encode and compress (zstd level 6)", 
|b| {
-let batch = create_batch(8192, true);
-let mut buffer = vec![];
-let ipc_time = Time::default();
-b.iter(|| {
-buffer.clear();
-let mut cursor = Cursor::new(&mut buffer);
-write_ipc_compressed(&batch, &mut cursor, 
&CompressionCodec::Zstd(6), &ipc_time)
-});
-});
-group.bench_function("shuffle_writer: end to end", |b| {
-let ctx = SessionContext::new();
-let exec = create_shuffle_writer_exec(CompressionCodec::Zstd(1));
-b.iter(|| {
-let task_ctx = ctx.task_ctx();
-let stream = exec.execute(0, task_ctx).unwrap();
-let rt = Runtime::new().unwrap();
-criterion::black_box(rt.block_on(collect(stream)).unwrap());
-});
-});
+for compression_codec in &[
+CompressionCodec::None,
+CompressionCodec::Lz4Frame,
+CompressionCodec::Snappy,
+CompressionCodec::Zstd(1),
+CompressionCodec::Zstd(6),
+] {
+for enable_fast_encoding in [true, false] {
+let name = format!("shuffle_writer: write encoded 
(enable_fast_encoding={enable_fast_encoding}, 
compression={compression_codec:?})");
+group.bench_function(name, |b| {
+let mut buffer = vec![];
+let ipc_time = Time::default();
+let w = ShuffleBlockWriter::try_new(
+&batch.schema(),
+enable_fast_encoding,
+compression_codec.clone(),
+)
+.unwrap();
+b.iter(|| {
+buffer.clear();
+let mut cursor = Cursor::new(&mut buffer);
+w.write_batch(&batch, &mut cursor, &ipc_time).unwrap();
+});
+});
+}
+}
+
+for compression_codec in [
+CompressionCodec::None,
+CompressionCodec::Lz4Frame,
+CompressionCodec::Zstd(1),

Review Comment:
   nit/optional: snappy and zstd(6) here as well?






Re: [PR] fix: Fall back to Spark for distinct aggregates [datafusion-comet]

2025-01-10 Thread via GitHub


andygrove commented on code in PR #1262:
URL: https://github.com/apache/datafusion-comet/pull/1262#discussion_r1911770443


##
spark/src/test/scala/org/apache/comet/exec/CometAggregateSuite.scala:
##
@@ -974,7 +974,8 @@ class CometAggregateSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
 }
   }
 
-  test("distinct") {
+  // TODO enable once https://github.com/apache/datafusion-comet/issues/1267 
is implemented

Review Comment:
   I'm looking at this test again on main and do not understand how this one is 
passing now :confused: 
   
   






Re: [PR] feat: metadata columns [datafusion]

2025-01-10 Thread via GitHub


chenkovsky commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911814285


##
datafusion/common/src/dfschema.rs:
##
@@ -319,52 +465,56 @@ impl DFSchema {
 qualifiers.push(qualifier.cloned());
 }
 }
-let mut metadata = self.inner.metadata.clone();
-metadata.extend(other_schema.inner.metadata.clone());
+let mut metadata = self.inner.schema.metadata.clone();
+metadata.extend(other_schema.inner.schema.metadata.clone());
 
 let finished = schema_builder.finish();
 let finished_with_metadata = finished.with_metadata(metadata);
-self.inner = finished_with_metadata.into();
-self.field_qualifiers.extend(qualifiers);
+self.inner.schema = finished_with_metadata.into();
+self.inner.field_qualifiers.extend(qualifiers);
 }
 
 /// Get a list of fields
 pub fn fields(&self) -> &Fields {
-&self.inner.fields
+&self.inner.schema.fields
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector
 pub fn field(&self, i: usize) -> &Field {
-&self.inner.fields[i]
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.field(i - self.inner.len());
+}
+}
+self.inner.field(i)
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector and its qualifier
 pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, 
&Field) {
-(self.field_qualifiers[i].as_ref(), self.field(i))
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.qualified_field(i - self.inner.len());
+}
+}
+self.inner.qualified_field(i)

Review Comment:
   I see, it's error-prone. Could we change the offsets of the metadata columns, e.g. to (-1 as usize), (-2 as usize)? Then there's no such problem. I see some databases use this trick.
   
   > Isn't it only where you need meta columns that you need to change the code to use meta_field? Other code that calls field remains the same.
   
   Yes, we can, but many APIs use Vec to represent columns. I would have to change many structs and method definitions to pass extra parameters.
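   
   For illustration, a minimal sketch of that sentinel-offset idea (the offset value below is made up; the PR already defines a `METADATA_OFFSET` constant for this purpose):
   
   ```rust
   // Metadata columns live above a large sentinel offset, so ordinary field
   // indexes never collide with them and never need adjusting when the
   // regular schema changes. The value here is hypothetical.
   const METADATA_OFFSET: usize = usize::MAX >> 1;
   
   fn is_metadata_column(i: usize) -> bool {
       i >= METADATA_OFFSET
   }
   
   fn metadata_index(i: usize) -> usize {
       i - METADATA_OFFSET
   }
   
   fn main() {
       let file_path_col = METADATA_OFFSET + 1; // second metadata column
       assert!(is_metadata_column(file_path_col));
       assert_eq!(metadata_index(file_path_col), 1);
       assert!(!is_metadata_column(3)); // an ordinary schema column
   }
   ```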
   
   






[PR] Minor: use hashmap for `physical_exprs_contains` and move `PhysicalExprRef` to `physical-expr-common` [datafusion]

2025-01-10 Thread via GitHub


jayzhan211 opened a new pull request, #14081:
URL: https://github.com/apache/datafusion/pull/14081

   ## Which issue does this PR close?
   
   
   
   Closes #.
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   





Re: [PR] fix: make get_valid_types handle TypeSignature::Numeric correctly [datafusion]

2025-01-10 Thread via GitHub


jayzhan211 commented on PR #14060:
URL: https://github.com/apache/datafusion/pull/14060#issuecomment-2584988703

   Thanks @niebayes 





Re: [PR] fix: make get_valid_types handle TypeSignature::Numeric correctly [datafusion]

2025-01-10 Thread via GitHub


jayzhan211 merged PR #14060:
URL: https://github.com/apache/datafusion/pull/14060





Re: [PR] feat: metadata columns [datafusion]

2025-01-10 Thread via GitHub


jayzhan211 commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911800158


##
datafusion/common/src/dfschema.rs:
##
@@ -319,52 +465,56 @@ impl DFSchema {
 qualifiers.push(qualifier.cloned());
 }
 }
-let mut metadata = self.inner.metadata.clone();
-metadata.extend(other_schema.inner.metadata.clone());
+let mut metadata = self.inner.schema.metadata.clone();
+metadata.extend(other_schema.inner.schema.metadata.clone());
 
 let finished = schema_builder.finish();
 let finished_with_metadata = finished.with_metadata(metadata);
-self.inner = finished_with_metadata.into();
-self.field_qualifiers.extend(qualifiers);
+self.inner.schema = finished_with_metadata.into();
+self.inner.field_qualifiers.extend(qualifiers);
 }
 
 /// Get a list of fields
 pub fn fields(&self) -> &Fields {
-&self.inner.fields
+&self.inner.schema.fields
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector
 pub fn field(&self, i: usize) -> &Field {
-&self.inner.fields[i]
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.field(i - self.inner.len());
+}
+}
+self.inner.field(i)
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector and its qualifier
 pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, 
&Field) {
-(self.field_qualifiers[i].as_ref(), self.field(i))
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.qualified_field(i - self.inner.len());
+}
+}
+self.inner.qualified_field(i)

Review Comment:
   Isn't it only where you need meta columns that you need to change the code to use `meta_field`? Other code that calls `field` remains the same.
   
   The downside of the current approach is that whenever the schema changes, the indexes of the meta columns need to be adjusted too. I think this is error prone. Minimizing the dependency between the meta schema and the regular schema would be better.






Re: [I] Default to some compression when writing Parquet [datafusion-python]

2025-01-10 Thread via GitHub


timsaucer closed issue #978: Default to some compression when writing Parquet
URL: https://github.com/apache/datafusion-python/issues/978


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Default to ZSTD compression when writing Parquet [datafusion-python]

2025-01-10 Thread via GitHub


timsaucer commented on PR #981:
URL: 
https://github.com/apache/datafusion-python/pull/981#issuecomment-2585006200

   Thank you for another great addition @kosiew !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Default to ZSTD compression when writing Parquet [datafusion-python]

2025-01-10 Thread via GitHub


timsaucer merged PR #981:
URL: https://github.com/apache/datafusion-python/pull/981


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] formatting the AST while preserving the source location information from the original query [datafusion-sqlparser-rs]

2025-01-10 Thread via GitHub


graup commented on issue #1634:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/issues/1634#issuecomment-2584354845

   I'm playing around with a different approach for editing parts of the AST that works more like how tools such as eslint work. The idea is that, given accurate source spans, why don't we just edit the original query text directly instead of round-tripping the whole AST?
   
   The steps would be:
   1. Parse the query into an AST.
   2. Visit the AST, looking for nodes we want to replace.
   3. Construct and render a new AST node for the local replacement.
   4. Use the old node's span() to splice the rendered text right into the source.
   
   This way we preserve all the formatting outside of the node we're replacing, without any work on the AST display logic.
   
   Of course, this needs accurate and complete spans, which we don't have yet, but it sounds feasible to me.
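   
   For concreteness, a rough sketch of step 4 in Rust. It assumes the `Span`/`Location` types from `sqlparser::tokenizer` (1-based line/column with public fields) and treats the end location as exclusive; both assumptions may need adjusting to the real spans implementation.
   
   ```rust
   use sqlparser::tokenizer::{Location, Span};
   
   /// Convert a 1-based line/column location into a byte offset into `src`.
   fn to_byte_offset(src: &str, loc: Location) -> usize {
       let (mut line, mut col) = (1u64, 1u64);
       for (offset, ch) in src.char_indices() {
           if line == loc.line && col == loc.column {
               return offset;
           }
           if ch == '\n' {
               line += 1;
               col = 1;
           } else {
               col += 1;
           }
       }
       src.len()
   }
   
   /// Replace the source text covered by `span` with `replacement`,
   /// leaving everything outside the span untouched.
   fn splice(src: &str, span: Span, replacement: &str) -> String {
       let start = to_byte_offset(src, span.start);
       let end = to_byte_offset(src, span.end);
       format!("{}{}{}", &src[..start], replacement, &src[end..])
   }
   
   fn main() {
       let sql = "SELECT a,\n  CURRENT_DATE FROM t";
       // Pretend step 2 found the CURRENT_DATE node; this span is hypothetical.
       let span = Span {
           start: Location { line: 2, column: 3 },
           end: Location { line: 2, column: 15 },
       };
       // Prints the original query with only that node rewritten.
       println!("{}", splice(sql, span, "CURRENT_DATE('UTC')"));
   }
   ```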


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: [comet-parquet-exec] Unit test fixes, default scan impl to comet_native [datafusion-comet]

2025-01-10 Thread via GitHub


parthchandra commented on PR #1265:
URL: 
https://github.com/apache/datafusion-comet/pull/1265#issuecomment-2585079878

   Updated the plans for Spark 3.5 and Spark 4.0. However, plan generation for the native_datafusion impl is failing; this will not affect the CI, but it needs to be addressed at some point.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] chore: move `SanityChecker` into `physical-optimizer` crate [datafusion]

2025-01-10 Thread via GitHub


mnpw opened a new pull request, #14083:
URL: https://github.com/apache/datafusion/pull/14083

   ## Which issue does this PR close?
   
   
   
   
   
   Closes #14072.
   
   ## Rationale for this change
   
   From #14072 
   > Historically DataFusion was one (very) large crate datafusion, and as it grew bigger we extracted various functionality into separate crates. This leads to both faster compile times (as the crates can be compiled in parallel) as well as easier-to-navigate code (as the crates force a cleaner dependency separation)
   
   
   
   ## What changes are included in this PR?
   
   - Move `SanityChecker` from `datafusion` crate to 
`datafusion-physical-optimizer` crate
   
   
   ## Are these changes tested?
   
   Ran `cargo test` 
   
   
   
   ## Are there any user-facing changes?
   
   No
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: move `SanityChecker` into `physical-optimizer` crate [datafusion]

2025-01-10 Thread via GitHub


mnpw commented on PR #14083:
URL: https://github.com/apache/datafusion/pull/14083#issuecomment-2585101517

   @alamb for your consideration. 
   
   I don't like that `datafusion-physical-optimizer` needs to use the `datafusion` crate as a dev-dependency. This was required because the `SanityChecker` tests were tightly coupled with the helper functions in `datafusion::physical_optimizer::test_utils`. Perhaps we can consider moving these helper functions somewhere outside, say, to a `test-utils` crate?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Fix clippy for Rust 1.84 [datafusion]

2025-01-10 Thread via GitHub


berkaysynnada merged PR #14065:
URL: https://github.com/apache/datafusion/pull/14065


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Fix build issues on latest stable Rust toolchain (1.84) [datafusion]

2025-01-10 Thread via GitHub


berkaysynnada closed issue #14061: Fix build issues on latest stable Rust 
toolchain (1.84)
URL: https://github.com/apache/datafusion/issues/14061


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Fix clippy for Rust 1.84 [datafusion]

2025-01-10 Thread via GitHub


berkaysynnada commented on PR #14065:
URL: https://github.com/apache/datafusion/pull/14065#issuecomment-2582075524

   @jonahgao does the linter pass locally for you after this change?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Handle null `IntervalCompound` from substrait [datafusion]

2025-01-10 Thread via GitHub


cht42 closed issue #14066: Handle null `IntervalCompound` from substrait
URL: https://github.com/apache/datafusion/issues/14066


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] handle null `IntervalCompound` from substrait [datafusion]

2025-01-10 Thread via GitHub


cht42 closed pull request #14067: handle null `IntervalCompound` from substrait
URL: https://github.com/apache/datafusion/pull/14067


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Fix clippy for Rust 1.84 [datafusion]

2025-01-10 Thread via GitHub


jonahgao commented on PR #14065:
URL: https://github.com/apache/datafusion/pull/14065#issuecomment-2582092984

   > @jonahgao does the linter pass locally for you after this change?
   
   Yes, it passed. Did you encounter any issues?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] test: Add plan execution during tests for bounded source [datafusion]

2025-01-10 Thread via GitHub


berkaysynnada merged PR #14013:
URL: https://github.com/apache/datafusion/pull/14013


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Execute plans during plan tests to make sure plans are valid and executable. [datafusion]

2025-01-10 Thread via GitHub


berkaysynnada closed issue #8230: Execute plans during plan tests to make sure 
plans are valid and executable.
URL: https://github.com/apache/datafusion/issues/8230


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] test: Add plan execution during tests for bounded source [datafusion]

2025-01-10 Thread via GitHub


berkaysynnada commented on PR #14013:
URL: https://github.com/apache/datafusion/pull/14013#issuecomment-2582295394

   Thank you @avkirilishin 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Fix clippy for Rust 1.84 [datafusion]

2025-01-10 Thread via GitHub


jonahgao opened a new pull request, #14065:
URL: https://github.com/apache/datafusion/pull/14065

   ## Which issue does this PR close?
   
   
   
   Closes #14061.
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   Yes
   
   
   ## Are there any user-facing changes?
   No
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Fix build issues on latest stable Rust toolchain (1.84) [datafusion]

2025-01-10 Thread via GitHub


jonahgao commented on issue #14061:
URL: https://github.com/apache/datafusion/issues/14061#issuecomment-2582008020

   Since it blocks many PRs, I created #14065 to fix it. cc @niebayes 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add support for ClickHouse `FORMAT` on `INSERT` [datafusion-sqlparser-rs]

2025-01-10 Thread via GitHub


bombsimon commented on code in PR #1628:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1628#discussion_r1909983745


##
src/ast/query.rs:
##
@@ -2465,14 +2465,25 @@ impl fmt::Display for GroupByExpr {
 #[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
 #[cfg_attr(feature = "visitor", derive(Visit, VisitMut))]
 pub enum FormatClause {
-Identifier(Ident),
+Identifier {
+ident: Ident,
+expr: Option>,
+},
 Null,
 }

Review Comment:
   Switched to a dedicated `InputFormatClause` that is only used when parsing `Insert` statements, keeping the previous one as is. PTAL



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Add config for enabling SMJ with join condition [datafusion-comet]

2025-01-10 Thread via GitHub


kazuyukitanimura commented on code in PR #937:
URL: https://github.com/apache/datafusion-comet/pull/937#discussion_r1910067146


##
spark/src/test/scala/org/apache/spark/sql/comet/CometPlanStabilitySuite.scala:
##
@@ -262,6 +262,7 @@ trait CometPlanStabilitySuite extends 
DisableAdaptiveExecutionSuite with TPCDSBa
   CometConf.COMET_EXEC_ENABLED.key -> "true",
   CometConf.COMET_DPP_FALLBACK_ENABLED.key -> "false",
   CometConf.COMET_EXEC_SHUFFLE_ENABLED.key -> "true",
+  CometConf.COMET_EXEC_SORT_MERGE_JOIN_WITH_JOIN_FILTER_ENABLED.key -> 
"true",

Review Comment:
   @andygrove Do you remember by any chance why 
`COMET_EXEC_SORT_MERGE_JOIN_WITH_JOIN_FILTER_ENABLED` is enabled only for 
`testQuery()` but not for `createSparkSession()`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Add config for enabling SMJ with join condition [datafusion-comet]

2025-01-10 Thread via GitHub


kazuyukitanimura commented on code in PR #937:
URL: https://github.com/apache/datafusion-comet/pull/937#discussion_r1910089857


##
spark/src/test/scala/org/apache/spark/sql/comet/CometPlanStabilitySuite.scala:
##
@@ -262,6 +262,7 @@ trait CometPlanStabilitySuite extends 
DisableAdaptiveExecutionSuite with TPCDSBa
   CometConf.COMET_EXEC_ENABLED.key -> "true",
   CometConf.COMET_DPP_FALLBACK_ENABLED.key -> "false",
   CometConf.COMET_EXEC_SHUFFLE_ENABLED.key -> "true",
+  CometConf.COMET_EXEC_SORT_MERGE_JOIN_WITH_JOIN_FILTER_ENABLED.key -> 
"true",

Review Comment:
   Maybe this one? https://github.com/apache/datafusion-comet/issues/901



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Bump `wasm-bindgen` and `wasm-bindgen-futures` [datafusion]

2025-01-10 Thread via GitHub


mbrobbel opened a new pull request, #14068:
URL: https://github.com/apache/datafusion/pull/14068

   ## Which issue does this PR close?
   
   Prevent 
https://github.com/apache/datafusion/pull/14065#issuecomment-2582412339.
   
   ## Rationale for this change
   
   Bumps `wasm-bindgen-*` patch versions to make sure cargo selects the latest 
version for users building with rust `1.84` to prevent the issue linked above.
   
   ## What changes are included in this PR?
   
   Version bumps.
   
   ## Are these changes tested?
   
   In CI.
   
   ## Are there any user-facing changes?
   
   Cargo selects the latest patch version for `wasm-bindgen-*` crates.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Fix clippy for Rust 1.84 [datafusion]

2025-01-10 Thread via GitHub


berkaysynnada commented on PR #14065:
URL: https://github.com/apache/datafusion/pull/14065#issuecomment-2582438534

   > Can you try again after `cargo update wasm-bindgen`? This was fixed in 
[rustwasm/wasm-bindgen#4284](https://github.com/rustwasm/wasm-bindgen/pull/4284).
   
   That works, thank you.
   
   ```
   Checking datafusion-expr v44.0.0 
(/Users/berkaysahin/Desktop/datafusion/datafusion/expr)
   error: unexpected `cfg` condition value: `used_linker`
  --> datafusion/expr/src/lib.rs:101:1
   |
   101 | #[ctor::ctor]
   | ^
   |
   = note: expected values for `feature` are: `recursive_protection`
   = help: consider adding `used_linker` as a feature in `Cargo.toml`
   = note: see 
 for 
more information about checking conditional configuration
   = note: `-D unexpected-cfgs` implied by `-D warnings`
   = help: to override `-D warnings` add `#[allow(unexpected_cfgs)]`
   = note: this error originates in the attribute macro `ctor::ctor` (in 
Nightly builds, run with -Z macro-backtrace for more info);
   ```
   
   Do you have a remedy for this one as well :D ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Fix clippy for Rust 1.84 [datafusion]

2025-01-10 Thread via GitHub


mbrobbel commented on PR #14065:
URL: https://github.com/apache/datafusion/pull/14065#issuecomment-2582467124

   > > Can you try again after `cargo update wasm-bindgen`? This was fixed in 
[rustwasm/wasm-bindgen#4284](https://github.com/rustwasm/wasm-bindgen/pull/4284).
   > 
   > That works, thank you.
   > 
   > ```
   > Checking datafusion-expr v44.0.0 
(/Users/berkaysahin/Desktop/datafusion/datafusion/expr)
   > error: unexpected `cfg` condition value: `used_linker`
   >--> datafusion/expr/src/lib.rs:101:1
   > |
   > 101 | #[ctor::ctor]
   > | ^
   > |
   > = note: expected values for `feature` are: `recursive_protection`
   > = help: consider adding `used_linker` as a feature in `Cargo.toml`
   > = note: see 
 for 
more information about checking conditional configuration
   > = note: `-D unexpected-cfgs` implied by `-D warnings`
   > = help: to override `-D warnings` add `#[allow(unexpected_cfgs)]`
   > = note: this error originates in the attribute macro `ctor::ctor` (in 
Nightly builds, run with -Z macro-backtrace for more info);
   > ```
   > 
   > Do you have a remedy for this one as well 😄 ?
   
   https://github.com/mmastrac/rust-ctor/issues/309#issuecomment-2489037492


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: Fall back to Spark for distinct aggregates [datafusion-comet]

2025-01-10 Thread via GitHub


andygrove commented on code in PR #1262:
URL: https://github.com/apache/datafusion-comet/pull/1262#discussion_r1911777608


##
spark/src/test/scala/org/apache/comet/exec/CometAggregateSuite.scala:
##
@@ -974,7 +974,8 @@ class CometAggregateSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
 }
   }
 
-  test("distinct") {
+  // TODO enable once https://github.com/apache/datafusion-comet/issues/1267 
is implemented

Review Comment:
   I understand what is happening now. Spark optimizes some simple cases like 
`SELECT a, COUNT(DISTINCT b) ...` to first perform an aggregate to just get the 
distinct values, so by the time we perform the `COUNT` there are only distinct 
values.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] Add support for distinct aggregates [datafusion-comet]

2025-01-10 Thread via GitHub


andygrove opened a new issue, #1267:
URL: https://github.com/apache/datafusion-comet/issues/1267

   ### What is the problem the feature request solves?
   
   Add support for distinct aggregates and enable "distinct" test in 
CometAggregateSuite
   
   ### Describe the potential solution
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] docs: Update TPC-H benchmark results [datafusion-comet]

2025-01-10 Thread via GitHub


andygrove commented on PR #1257:
URL: 
https://github.com/apache/datafusion-comet/pull/1257#issuecomment-2584451268

   Moving this to draft now that we know that the count(distinct) in q16 was 
not actually working correctly due to 
https://github.com/apache/datafusion-comet/issues/1260


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] Comet possibly preventing AQE optimization [datafusion-comet]

2025-01-10 Thread via GitHub


kazuyukitanimura opened a new issue, #1266:
URL: https://github.com/apache/datafusion-comet/issues/1266

   ### Describe the bug
   
   The `SPARK-50258: Fix output column order changed issue after AQE optimization` test fails with Comet enabled because the Comet plan does not have `AdaptiveSparkPlanExec` in the `executedPlan`.
   
   Possibly the Comet plan is preventing AQE optimization from kicking in.
   
   ### Steps to reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-10 Thread via GitHub


jayzhan211 commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911863137


##
datafusion/common/src/dfschema.rs:
##
@@ -319,52 +465,56 @@ impl DFSchema {
 qualifiers.push(qualifier.cloned());
 }
 }
-let mut metadata = self.inner.metadata.clone();
-metadata.extend(other_schema.inner.metadata.clone());
+let mut metadata = self.inner.schema.metadata.clone();
+metadata.extend(other_schema.inner.schema.metadata.clone());
 
 let finished = schema_builder.finish();
 let finished_with_metadata = finished.with_metadata(metadata);
-self.inner = finished_with_metadata.into();
-self.field_qualifiers.extend(qualifiers);
+self.inner.schema = finished_with_metadata.into();
+self.inner.field_qualifiers.extend(qualifiers);
 }
 
 /// Get a list of fields
 pub fn fields(&self) -> &Fields {
-&self.inner.fields
+&self.inner.schema.fields
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector
 pub fn field(&self, i: usize) -> &Field {
-&self.inner.fields[i]
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.field(i - self.inner.len());
+}
+}
+self.inner.field(i)
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector and its qualifier
 pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, 
&Field) {
-(self.field_qualifiers[i].as_ref(), self.field(i))
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.qualified_field(i - self.inner.len());
+}
+}
+self.inner.qualified_field(i)

Review Comment:
   `(-1 as usize)`: what is this trick? It is a very large offset, and we have a vector rather than a map, so how does this work?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: metadata columns [datafusion]

2025-01-10 Thread via GitHub


jayzhan211 commented on code in PR #14057:
URL: https://github.com/apache/datafusion/pull/14057#discussion_r1911863259


##
datafusion/common/src/dfschema.rs:
##
@@ -319,52 +465,56 @@ impl DFSchema {
 qualifiers.push(qualifier.cloned());
 }
 }
-let mut metadata = self.inner.metadata.clone();
-metadata.extend(other_schema.inner.metadata.clone());
+let mut metadata = self.inner.schema.metadata.clone();
+metadata.extend(other_schema.inner.schema.metadata.clone());
 
 let finished = schema_builder.finish();
 let finished_with_metadata = finished.with_metadata(metadata);
-self.inner = finished_with_metadata.into();
-self.field_qualifiers.extend(qualifiers);
+self.inner.schema = finished_with_metadata.into();
+self.inner.field_qualifiers.extend(qualifiers);
 }
 
 /// Get a list of fields
 pub fn fields(&self) -> &Fields {
-&self.inner.fields
+&self.inner.schema.fields
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector
 pub fn field(&self, i: usize) -> &Field {
-&self.inner.fields[i]
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.field(i - self.inner.len());
+}
+}
+self.inner.field(i)
 }
 
 /// Returns an immutable reference of a specific `Field` instance selected 
using an
 /// offset within the internal `fields` vector and its qualifier
 pub fn qualified_field(&self, i: usize) -> (Option<&TableReference>, 
&Field) {
-(self.field_qualifiers[i].as_ref(), self.field(i))
+if i >= self.inner.len() {
+if let Some(metadata) = &self.metadata {
+return metadata.qualified_field(i - self.inner.len());
+}
+}
+self.inner.qualified_field(i)

Review Comment:
   `(-1 as usize)`: how does this large offset work? We have a vector rather than a map.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] [EPIC] Decouple logical from physical types [datafusion]

2025-01-10 Thread via GitHub


jayzhan211 commented on issue #12622:
URL: https://github.com/apache/datafusion/issues/12622#issuecomment-2585131985

   Since the logical-types branch can easily diverge from the main branch, even 
when the sub-tasks are incomplete, would it be better to merge it into the main 
branch frequently and continue evolving it as new ideas emerge?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Fix clippy for Rust 1.84 [datafusion]

2025-01-10 Thread via GitHub


alamb commented on code in PR #14065:
URL: https://github.com/apache/datafusion/pull/14065#discussion_r1910263895


##
datafusion/common/src/pyarrow.rs:
##
@@ -138,6 +138,9 @@ mod tests {
 fn test_py_scalar() {
 init_python();
 
+// TODO: remove this attribute when bumping pyo3 to v0.23.0
+// See: 

+#[allow(unexpected_cfgs)]

Review Comment:
   pyo3 will be updated in 
   - https://github.com/apache/datafusion/pull/13663
   
   I need to find some time to polish that PR a bit



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Bump `ctor` to `0.2.9` [datafusion]

2025-01-10 Thread via GitHub


alamb merged PR #14069:
URL: https://github.com/apache/datafusion/pull/14069


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Create test pattern for Spans [datafusion-sqlparser-rs]

2025-01-10 Thread via GitHub


graup commented on issue #1563:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/issues/1563#issuecomment-2582492517

   Hi! Just a small note since I started playing around with this. I assume this is because it's not implemented yet, but noting it to keep track: it currently seems that Function args don't have a span, and the Function span only covers the function name, not its arguments. So for `CURRENT_DATE('UTC')`, the AST node for the Function expr has a span containing only `CURRENT_DATE` (and func.args has an empty span). I'd expect the function expr span to cover the whole expression, including the parentheses.
   
   If I can help in any way (like writing a test case), I'd be happy to contribute.
   
   Very tangential: it would be useful if there was a helper function to convert line/column-based locations to byte offsets. (My use case is implementing a fixer without rebuilding the SQL from the AST, just applying fixes locally to individual nodes.)
   
   Input: `SELECT CURRENT_DATE(),\nCURRENT_DATE('UTC')`
   Output: `
   [Query(Query { with: None, body: Select(Select { select_token: TokenWithSpan 
{ token: Word(Word { value: "SELECT", quote_style: None, keyword: SELECT }), 
span: Span(Location(1,1)..Location(1,7)) }, distinct: None, top: None, 
top_before_distinct: false, projection: [UnnamedExpr(Function(Function { name: 
ObjectName([Ident { value: "CURRENT_DATE", quote_style: None, span: 
Span(Location(1,8)..Location(1,20)) }]), uses_odbc_syntax: false, parameters: 
None, args: List(FunctionArgumentList { duplicate_treatment: None, args: [], 
clauses: [] }), filter: None, null_treatment: None, over: None, within_group: 
[] })), UnnamedExpr(Function(Function { name: ObjectName([Ident { value: 
"CURRENT_DATE", quote_style: None, span: Span(Location(2,1)..Location(2,13)) 
}]), uses_odbc_syntax: false, parameters: None, args: List(FunctionArgumentList 
{ duplicate_treatment: None, args: 
[Unnamed(Expr(Value(SingleQuotedString("UTC"))))], clauses: [] }), filter: None, null_treatment: None, over: None, within_group: [] }))], into: None, from: [], lateral_views: [], prewhere: None, 
selection: None, group_by: Expressions([], []), cluster_by: [], distribute_by: 
[], sort_by: [], having: None, named_window: [], qualify: None, 
window_before_qualify: false, value_table_mode: None, connect_by: None }), 
order_by: None, limit: None, limit_by: [], offset: None, fetch: None, locks: 
[], for_clause: None, settings: None, format_clause: None })]
   `
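   
   For reference, a sketch of the kind of test case I have in mind. It encodes the expected behaviour (the Function span covering the name, parentheses and arguments), so it would fail until spans are extended, and it assumes the `Spanned` trait is exported from `sqlparser::ast`; paths may need adjusting.
   
   ```rust
   use sqlparser::ast::{SelectItem, SetExpr, Spanned, Statement};
   use sqlparser::dialect::GenericDialect;
   use sqlparser::parser::Parser;
   
   #[test]
   fn function_span_covers_arguments() {
       let sql = "SELECT CURRENT_DATE('UTC')";
       let stmts = Parser::parse_sql(&GenericDialect {}, sql).unwrap();
   
       // Dig down to the single projected expression.
       let Statement::Query(query) = &stmts[0] else { panic!("expected a query") };
       let SetExpr::Select(select) = query.body.as_ref() else { panic!("expected a select") };
       let SelectItem::UnnamedExpr(expr) = &select.projection[0] else { panic!("expected an expr") };
   
       let span = expr.span();
       // `CURRENT_DATE('UTC')` starts at column 8; today the span stops after the
       // function name, but it should reach past the argument list.
       assert_eq!(span.start.column, 8);
       assert!(span.end.column > 20, "span should include the parentheses and arguments");
   }
   ```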


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Bump `ctor` to `0.2.9` [datafusion]

2025-01-10 Thread via GitHub


mbrobbel opened a new pull request, #14069:
URL: https://github.com/apache/datafusion/pull/14069

   Fixes 
https://github.com/apache/datafusion/pull/14065#issuecomment-2582438534.
   
   Maybe we should just add a `Cargo.lock` file to have checks in CI to prevent 
this in the future, following suggestions from 
https://doc.rust-lang.org/cargo/faq.html#why-have-cargolock-in-version-control 
and 
https://doc.rust-lang.org/cargo/guide/continuous-integration.html#verifying-latest-dependencies.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Update ctor rep to latest [datafusion]

2025-01-10 Thread via GitHub


berkaysynnada opened a new pull request, #14070:
URL: https://github.com/apache/datafusion/pull/14070

   ## Which issue does this PR close?
   
   
   
   Closes #.
   
   ## Rationale for this change
   
   
   
   I have run into the following error:
   ```
   Checking datafusion-expr v44.0.0 
(/Users/berkaysahin/Desktop/datafusion/datafusion/expr)
   error: unexpected `cfg` condition value: `used_linker`
  --> datafusion/expr/src/lib.rs:101:1
   |
   101 | #[ctor::ctor]
   | ^
   |
   = note: expected values for `feature` are: `recursive_protection`
   = help: consider adding `used_linker` as a feature in `Cargo.toml`
   = note: see 
 for 
more information about checking conditional configuration
   = note: `-D unexpected-cfgs` implied by `-D warnings`
   = help: to override `-D warnings` add `#[allow(unexpected_cfgs)]`
   = note: this error originates in the attribute macro `ctor::ctor` (in 
Nightly builds, run with -Z macro-backtrace for more info);
   ```
   
   and the problem seems to be resolved in
   https://github.com/mmastrac/rust-ctor/issues/309#issuecomment-2489037492
   
   ## What changes are included in this PR?
   
   
   
   version update of `ctor`
   
   ## Are these changes tested?
   
   
   
   clippy complained in my local environment before, but not anymore
   
   ## Are there any user-facing changes?
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Add optimizer rule to replace inlist with `or` chain for small list [datafusion]

2025-01-10 Thread via GitHub


alamb closed issue #799: Add optimizer rule to replace inlist with `or` chain 
for small list
URL: https://github.com/apache/datafusion/issues/799


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Add optimizer rule to replace inlist with `or` chain for small list [datafusion]

2025-01-10 Thread via GitHub


alamb commented on issue #799:
URL: https://github.com/apache/datafusion/issues/799#issuecomment-2582597548

   This is done now from what I can tell: 
   
   
   ```sql
   > create table foo(x int, a int, b int) as values (1,2,3);
   0 row(s) fetched.
   Elapsed 0.006 seconds.
   
   > explain select * from foo where x in (a, b);
   +---+---+
   | plan_type | plan  |
   +---+---+
   | logical_plan  | Filter: foo.x = foo.a OR foo.x = foo.b|
   |   |   TableScan: foo projection=[x, a, b] |
   | physical_plan | CoalesceBatchesExec: target_batch_size=8192   |
   |   |   FilterExec: x@0 = a@1 OR x@0 = b@2  |
   |   | MemoryExec: partitions=1, partition_sizes=[1] |
   |   |   |
   +---+---+
   2 row(s) fetched.
   Elapsed 0.018 seconds.
   
   >
   ```
   
   Note the 
   ```
   FilterExec: x@0 = a@1 OR x@0 = b@2   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Question: is the combination of limit and predicate push-down safe in ParquetExec? [datafusion]

2025-01-10 Thread via GitHub


alamb closed issue #900: Question: is the combination of limit and predicate 
push-down safe in ParquetExec?
URL: https://github.com/apache/datafusion/issues/900


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Question: is the combination of limit and predicate push-down safe in ParquetExec? [datafusion]

2025-01-10 Thread via GitHub


alamb commented on issue #900:
URL: https://github.com/apache/datafusion/issues/900#issuecomment-2582599182

   I believe this is a dupe of the recently fixed issue:
   - https://github.com/apache/datafusion/issues/13745


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] [Rust] DataFrame.collect should return RecordBatchReader [datafusion]

2025-01-10 Thread via GitHub


alamb closed issue #97: [Rust]  DataFrame.collect should return 
RecordBatchReader
URL: https://github.com/apache/datafusion/issues/97


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Track memory usage for each individual operator [datafusion]

2025-01-10 Thread via GitHub


alamb commented on issue #899:
URL: https://github.com/apache/datafusion/issues/899#issuecomment-2582600235

   This is now handled by the `MemoryPool` trait: https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/trait.MemoryPool.html
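   
   A minimal sketch of how an operator participates in the pool, using a `GreedyMemoryPool` and a `MemoryConsumer` reservation; exact wiring into a `RuntimeEnv` differs between DataFusion versions.
   
   ```rust
   use std::sync::Arc;
   
   use datafusion::error::Result;
   use datafusion::execution::memory_pool::{GreedyMemoryPool, MemoryConsumer, MemoryPool};
   
   fn main() -> Result<()> {
       // One pool with a fixed budget (64 MiB here), shared by every operator.
       let pool: Arc<dyn MemoryPool> = Arc::new(GreedyMemoryPool::new(64 * 1024 * 1024));
   
       // Each memory-hungry operator registers a named consumer...
       let mut reservation = MemoryConsumer::new("ExampleHashAggregate").register(&pool);
   
       // ...and grows/shrinks its reservation as it buffers data. `try_grow`
       // errors once the pool is exhausted, which is the signal to spill.
       reservation.try_grow(8 * 1024 * 1024)?;
   
       println!("reserved by this operator: {} bytes", reservation.size());
       println!("reserved across the pool:  {} bytes", pool.reserved());
       Ok(())
   }
   ```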


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Track memory usage for each individual operator [datafusion]

2025-01-10 Thread via GitHub


alamb closed issue #899: Track memory usage for each individual operator
URL: https://github.com/apache/datafusion/issues/899


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] [EPIC] Improved Externalized / Spilling / Large than Memory Hash Aggregation [datafusion]

2025-01-10 Thread via GitHub


alamb commented on issue #13123:
URL: https://github.com/apache/datafusion/issues/13123#issuecomment-2582773039

   Here is a PR to optimize the spill format:
   - https://github.com/apache/datafusion/issues/14078


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: Implement custom RecordBatch serde for shuffle for improved performance [datafusion-comet]

2025-01-10 Thread via GitHub


alamb commented on PR #1190:
URL: 
https://github.com/apache/datafusion-comet/pull/1190#issuecomment-2582775122

   FYI I filed a ticket in DataFusion to consider adding this code (or 
something similar) upstream:
   - https://github.com/apache/datafusion/issues/14078
   
   I think it would help other serialization use cases (not just the shuffle reader/writer) -- for example sorting and grouping.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Regression: `DataFrame::schema` returns incorrect schema for NATURAL JOIN [datafusion]

2025-01-10 Thread via GitHub


jonahgao commented on issue #14058:
URL: https://github.com/apache/datafusion/issues/14058#issuecomment-2582776727

   I'll try to fix this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Refactor into `LexOrdering::collapse`, `LexRequirement::collapse` avoid clone [datafusion]

2025-01-10 Thread via GitHub


alamb merged PR #14038:
URL: https://github.com/apache/datafusion/pull/14038


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Refactor into `LexOrdering::collapse`, `LexRequirement::collapse` avoid clone [datafusion]

2025-01-10 Thread via GitHub


alamb commented on PR #14038:
URL: https://github.com/apache/datafusion/pull/14038#issuecomment-2582777452

   Thanks again @berkaysynnada  -- I merged up from main and the CI checks now 
pass so merging it in


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Bump `wasm-bindgen-*` crates [datafusion]

2025-01-10 Thread via GitHub


jonahgao merged PR #14068:
URL: https://github.com/apache/datafusion/pull/14068


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Bump `wasm-bindgen-*` crates [datafusion]

2025-01-10 Thread via GitHub


jonahgao commented on PR #14068:
URL: https://github.com/apache/datafusion/pull/14068#issuecomment-2582789511

   Thanks @mbrobbel 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: rand expression support [datafusion-comet]

2025-01-10 Thread via GitHub


andygrove commented on code in PR #1199:
URL: https://github.com/apache/datafusion-comet/pull/1199#discussion_r1910483116


##
native/core/src/execution/jni_api.rs:
##
@@ -317,7 +317,7 @@ pub unsafe extern "system" fn 
Java_org_apache_comet_Native_executePlan(
 // query plan, we need to defer stream initialization to first time 
execution.
 if exec_context.root_op.is_none() {
 let start = Instant::now();
-let planner = 
PhysicalPlanner::new(Arc::clone(&exec_context.session_ctx))
+let planner = 
PhysicalPlanner::new(Arc::clone(&exec_context.session_ctx), partition)

Review Comment:
   This is something that I would like to see improved. We currently use 
partition 0 for each native plan rather than the real partition id.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Move `SanityChecker` into `datafusion-physical-optimizer` crate [datafusion]

2025-01-10 Thread via GitHub


cj-zhukov commented on issue #14072:
URL: https://github.com/apache/datafusion/issues/14072#issuecomment-2582906768

   take


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Build is broken in main [datafusion-comet]

2025-01-10 Thread via GitHub


andygrove commented on issue #1258:
URL: 
https://github.com/apache/datafusion-comet/issues/1258#issuecomment-2582907784

   I am working on a fix


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Move `JoinSelection` into `datafusion-physical-optimizer` crate [datafusion]

2025-01-10 Thread via GitHub


cj-zhukov commented on issue #14073:
URL: https://github.com/apache/datafusion/issues/14073#issuecomment-2582907973

   take


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] Build is broken in main [datafusion-comet]

2025-01-10 Thread via GitHub


andygrove opened a new issue, #1258:
URL: https://github.com/apache/datafusion-comet/issues/1258

   ### Describe the bug
   
   A conflict between some recently merged PRs has introduced a test failure in 
CometAggregateSuite
   
   ### Steps to reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Bump `ctor` to `0.2.9` [datafusion]

2025-01-10 Thread via GitHub


mbrobbel commented on PR #14069:
URL: https://github.com/apache/datafusion/pull/14069#issuecomment-2582533271

   > I think the reason we don't have a Cargo.lock is that datafusion is meant 
to be used as a library and thus we wanted to give downstream crates the 
flexibility for most dependent library versions
   
   The `Cargo.lock` file of library crates is always ignored by dependent 
crates:
   
   > However, this determinism can give a false sense of security because 
Cargo.lock does not affect the consumers of your package, only Cargo.toml does 
that. For example:
   >- [cargo 
install](https://doc.rust-lang.org/cargo/commands/cargo-install.html) will 
select the latest dependencies unless 
[--locked](https://doc.rust-lang.org/cargo/commands/cargo.html#option-cargo---locked)
 is passed in.
   >  - New dependencies, like those added with [cargo 
add](https://doc.rust-lang.org/cargo/commands/cargo-add.html), will be locked 
to the latest version


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Bump `ctor` to `0.2.9` [datafusion]

2025-01-10 Thread via GitHub


alamb commented on PR #14069:
URL: https://github.com/apache/datafusion/pull/14069#issuecomment-2582555840

   > > I think the reason we don't have a Cargo.lock is that datafusion is 
meant to be used as a library and thus we wanted to give downstream crates the 
flexibility for most dependent library versions
   > 
   > The `Cargo.lock` file of library crates is always ignored by dependent 
crates:
   > 
   > > However, this determinism can give a false sense of security because 
Cargo.lock does not affect the consumers of your package, only Cargo.toml does 
that. For example:
   > > 
   > > * [cargo 
install](https://doc.rust-lang.org/cargo/commands/cargo-install.html) will 
select the latest dependencies unless 
[--locked](https://doc.rust-lang.org/cargo/commands/cargo.html#option-cargo---locked)
 is passed in.
   > > * New dependencies, like those added with [cargo 
add](https://doc.rust-lang.org/cargo/commands/cargo-add.html), will be locked 
to the latest version
   
   I tried to document the rationale as I understand it here: 
https://github.com/apache/datafusion/pull/14071. Perhaps we can discuss 
potential changes there as well
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Minor: Document the rationale for the lack of Cargo.lock [datafusion]

2025-01-10 Thread via GitHub


alamb commented on code in PR #14071:
URL: https://github.com/apache/datafusion/pull/14071#discussion_r1910286465


##
README.md:
##
@@ -146,3 +146,27 @@ stable API, we also improve the API over time. As a 
result, we typically
 deprecate methods before removing them, according to the [deprecation 
guidelines].
 
 [deprecation guidelines]: 
https://datafusion.apache.org/library-user-guide/api-health.html
+
+## Dependencies and a `Cargo.lock`

Review Comment:
   @mbrobbel  has a good point here 
https://github.com/apache/datafusion/pull/14069#issuecomment-2582533271
   
   I think one concern originally was how we would keep a `Cargo.lock` file up 
to date with the latest version of the dependencies
   
   I can't remember if dependabot was available at that point -- maybe 
dependabot is good enough that we could check in Cargo.lock and have Dependabot 
update it
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Minor: Document the rationale for the lack of Cargo.lock [datafusion]

2025-01-10 Thread via GitHub


alamb commented on code in PR #14071:
URL: https://github.com/apache/datafusion/pull/14071#discussion_r1910289206


##
README.md:
##
@@ -146,3 +146,27 @@ stable API, we also improve the API over time. As a 
result, we typically
 deprecate methods before removing them, according to the [deprecation 
guidelines].
 
 [deprecation guidelines]: 
https://datafusion.apache.org/library-user-guide/api-health.html
+
+## Dependencies and a `Cargo.lock`

Review Comment:
   From my perspective it is very important that CI covers testing with the 
latest versions of all dependencies (as that is what many/most downstream 
crates will use as well)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Add a sum statistic [datafusion]

2025-01-10 Thread via GitHub


gatesn opened a new pull request, #14074:
URL: https://github.com/apache/datafusion/pull/14074

   ## Which issue does this PR close?
   
   This PR adds a sum statistic to DataFusion. 
   
   Future use will include optimizing aggregation functions (sum, avg, count); see 
https://github.com/apache/datafusion/pull/13736/files for examples.
   
   ## Are there any user-facing changes?
   
   The `ColumnStatistics` struct has a new public `sum_value` field.
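
   A minimal sketch (not the PR's actual code) of how a producer might populate 
   the new field, assuming `sum_value` follows the same `Precision` pattern as 
   the existing `min_value`/`max_value` statistics:

   ```rust
   // Sketch only: assumes the new `sum_value` field is a
   // `Precision<ScalarValue>` like the existing min/max fields; all remaining
   // fields are left unknown via `new_unknown()`.
   use datafusion_common::stats::{ColumnStatistics, Precision};
   use datafusion_common::ScalarValue;

   fn stats_for_one_to_one_hundred() -> ColumnStatistics {
       ColumnStatistics {
           min_value: Precision::Exact(ScalarValue::from(1i64)),
           max_value: Precision::Exact(ScalarValue::from(100i64)),
           // New in this PR: the column's total, usable by SUM/AVG rewrites.
           sum_value: Precision::Exact(ScalarValue::from(5050i64)),
           ..ColumnStatistics::new_unknown()
       }
   }

   fn main() {
       let stats = stats_for_one_to_one_hundred();
       println!("{:?}", stats.sum_value);
   }
   ```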


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] [Rust] DataFrame.collect should return RecordBatchReader [datafusion]

2025-01-10 Thread via GitHub


alamb commented on issue #97:
URL: https://github.com/apache/datafusion/issues/97#issuecomment-2582600917

   Seems like we are done here


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] Move `SanityChecker` into `datafusion-physical-optimizer` crate [datafusion]

2025-01-10 Thread via GitHub


alamb opened a new issue, #14072:
URL: https://github.com/apache/datafusion/issues/14072

   ### Is your feature request related to a problem or challenge?
   
   - Part of https://github.com/apache/datafusion/issues/11502
   - Related to https://github.com/apache/datafusion/issues/13814
   
   Historically, DataFusion was one (very) large crate, `datafusion`, and as it 
grew bigger we extracted various functionality into separate crates. This leads 
to both faster compile times (as the crates can be compiled in parallel) and 
easier-to-navigate code (as the crates force a cleaner dependency separation)
   
   One project we have not yet completed is extracting the physical optimizer 
passes: https://github.com/apache/datafusion/issues/11502
   
   * The original physical optimizers are here 
https://github.com/apache/datafusion/tree/main/datafusion/core/src/physical_optimizer
   * The new crate is here  
https://github.com/apache/datafusion/tree/main/datafusion/physical-optimizer
   
   
   
   ### Describe the solution you'd like
   
   Extract the `SanityChecker` from the datafusion core crate to the 
datafusion-physical-optimizer crate
   
   ### Describe alternatives you've considered
   
   Move 
https://github.com/apache/datafusion/blob/main/datafusion/core/src/physical_optimizer/sanity_checker.rs
 to 
   
https://github.com/apache/datafusion/tree/main/datafusion/physical-optimizer/src/sanity_checker.rs
   
   
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org


