Re: [I] Run all benchmarks on merge to main branch [datafusion]

2025-04-01 Thread via GitHub


Shreyaskr1409 commented on issue #15511:
URL: https://github.com/apache/datafusion/issues/15511#issuecomment-2769119055

   > I don't think we need to actually "benchmark" the code for each merge
   
   How about we set a tag/label for performance-related PRs and run the benchmark 
tests only for those specific PRs? I could look into it.





[I] AQE Unable to Rewrite Joins as Broadcast Hash Joins Due to Existing CometBroadcastHashJoin Operator [datafusion-comet]

2025-04-01 Thread via GitHub


Kontinuation opened a new issue, #1589:
URL: https://github.com/apache/datafusion-comet/issues/1589

   ### Describe the bug
   
   AQE can dynamically transform a SortMergeJoin or ShuffledHashJoin into a 
BroadcastHashJoin after discovering that one of the Exchange operators shuffle-writes 
only a small amount of data. However, this optimization does not always happen 
when using Comet.
   
   TPC-H Q7 has an equi-join between `supplier` and `lineitem`. Spark can 
discover that `supplier` is small enough to be broadcast after running the 
`Exchange` operator, and dynamically change the sort-merge join to a broadcast 
hash join (see `BroadcastHashJoin Inner BuildLeft (15)`):
   
   ```
   == Physical Plan ==
   AdaptiveSparkPlan (99)
   +- == Final Plan ==
      * Sort (62)
      +- AQEShuffleRead (61)
         +- ShuffleQueryStage (60), Statistics(sizeInBytes=288.0 B, rowCount=4)
            +- Exchange (59)
               +- * HashAggregate (58)
                  +- AQEShuffleRead (57)
                     +- ShuffleQueryStage (56), Statistics(sizeInBytes=2.8 KiB, rowCount=36)
                        +- Exchange (55)
                           +- * HashAggregate (54)
                              +- * Project (53)
                                 +- * BroadcastHashJoin Inner BuildRight (52)
                                    :- * Project (49)
                                    :  +- * BroadcastHashJoin Inner BuildRight (48)
                                    :     :- * Project (42)
                                    :     :  +- * SortMergeJoin Inner (41)
                                    :     :     :- * Sort (33)
                                    :     :     :  +- AQEShuffleRead (32)
                                    :     :     :     +- ShuffleQueryStage (31), Statistics(sizeInBytes=667.5 MiB, rowCount=1.46E+7)
                                    :     :     :        +- Exchange (30)
                                    :     :     :           +- * Project (29)
                                    :     :     :              +- * SortMergeJoin Inner (28)
                                    :     :     :                 :- * Sort (20)
                                    :     :     :                 :  +- AQEShuffleRead (19)
                                    :     :     :                 :     +- ShuffleQueryStage (18), Statistics(sizeInBytes=667.5 MiB, rowCount=1.46E+7)
                                    :     :     :                 :        +- Exchange (17)
                                    :     :     :                 :           +- * Project (16)
                                    :     :     :                 :              +- * BroadcastHashJoin Inner BuildLeft (15)  <-- Transformed from SortMergeJoin by AQE
                                    :     :     :                 :                 :- BroadcastQueryStage (8), Statistics(sizeInBytes=8.0 MiB, rowCount=8.00E+4)
                                    :     :     :                 :                 :  +- BroadcastExchange (7)
                                    :     :     :                 :                 :     +- AQEShuffleRead (6)
                                    :     :     :                 :                 :        +- ShuffleQueryStage (5), Statistics(sizeInBytes=1874.1 KiB, rowCount=8.00E+4)
                                    :     :     :                 :                 :           +- Exchange (4)
                                    :     :     :                 :                 :              +- * Filter (3)
                                    :     :     :                 :                 :                 +- * ColumnarToRow (2)
                                    :     :     :                 :                 :                    +- Scan parquet (1)
                                    :     :     :                 :                 +- AQEShuffleRead (14)
                                    :     :     :                 :                    +- ShuffleQueryStage (13), Statistics(sizeInBytes=8.1 GiB, rowCount=1.82E+8)
                                    :     :     :                 :                       +- Exchange (12)
                                    :     :     :                 :                          +- * Filter (11)
                                    :     :     :                 :                             +- * ColumnarToRow (10)
                                    :     :     :                 :                                +- Scan parquet (9)
                                    :     :     :                 +- * Sort (27)
                                    :     :     :                    +- AQEShuffleRead (26)
                                    :     :     :                       +- ShuffleQueryStage (25), Statistics(sizeInBytes=3.4 GiB, rowCount=1.50E+8)
                                    :     :     :                          +- Exchange (24)
   ```

Re: [PR] fix: Queries similar to `count-bug` produce incorrect results [datafusion]

2025-04-01 Thread via GitHub


suibianwanwank commented on PR #15281:
URL: https://github.com/apache/datafusion/pull/15281#issuecomment-2769563811

   @jayzhan211 From this field it seems there's no issue, but how do we get the 
`e.b` field if we perform aggregation on the join? This query requires all rows 
from the left table to be included in the results, which might also involve the 
equivalence of Aggregate operations under outer joins.
   ```SQL
   select e.b, (select case when max(e2.a) > 10 then 'a' else 'b' end from t2 e2 where e2.b = e.b+1) from t1 e;
   ```





[I] Internal error: PhysicalExpr Column references bound error, Failure in spilling for `AggregateMode::Single` [datafusion]

2025-04-01 Thread via GitHub


rluvaton opened a new issue, #15530:
URL: https://github.com/apache/datafusion/issues/15530

   ### Describe the bug
   
   When using an AggregateExec with `AggregateMode::Single`, with spilling, and 
with group-by expressions that are not the first expressions of the input plan, 
there will be a schema mismatch.
   
   ### To Reproduce
   
   ```rust
   #[cfg(test)]
   mod tests {
       use std::fmt::{Display, Formatter};
       use std::sync::Arc;

       use arrow::array::{ArrayRef, Int32Array, Int64Array, RecordBatch, StringArray};
       use arrow::datatypes::{DataType, Field, Schema, SchemaRef};
       use datafusion::common::Result;
       use datafusion::execution::memory_pool::FairSpillPool;
       use datafusion::execution::runtime_env::RuntimeEnvBuilder;
       use datafusion::execution::TaskContext;
       use datafusion::functions_aggregate::sum::sum_udaf;
       use datafusion::physical_expr::aggregate::AggregateExprBuilder;
       use datafusion::physical_expr::expressions::{lit, Column};
       use datafusion::physical_plan::aggregates::{AggregateExec, AggregateMode, PhysicalGroupBy};
       use datafusion::physical_plan::common::collect;
       use datafusion::physical_plan::ExecutionPlan;
       use datafusion_physical_plan::memory::{LazyBatchGenerator, LazyMemoryExec};
       use parking_lot::RwLock;
       use rand::{random, thread_rng, Rng};

       #[tokio::test]
       async fn test_debug() -> Result<()> {
           let scan_schema = Arc::new(Schema::new(vec![
               Field::new("col_0", DataType::Int64, true),
               Field::new("col_1", DataType::Utf8, true),
               Field::new("col_2", DataType::Utf8, true),
               Field::new("col_3", DataType::Utf8, true),
               Field::new("col_4", DataType::Utf8, true),
               Field::new("col_5", DataType::Int32, true),
               Field::new("col_6", DataType::Utf8, true),
               Field::new("col_7", DataType::Utf8, true),
               Field::new("col_8", DataType::Utf8, true),
           ]));

           let group_by = PhysicalGroupBy::new_single(vec![
               (Arc::new(Column::new("col_1", 1)), "col_1".to_string()),
               (Arc::new(Column::new("col_7", 7)), "col_7".to_string()),
               (Arc::new(Column::new("col_0", 0)), "col_0".to_string()),
               (Arc::new(Column::new("col_8", 8)), "col_8".to_string()),
           ]);

           fn generate_int64_array() -> ArrayRef {
               Arc::new(Int64Array::from_iter_values(
                   (0..8192).map(|_| random::<i64>()),
               ))
           }
           fn generate_int32_array() -> ArrayRef {
               Arc::new(Int32Array::from_iter_values(
                   (0..8192).map(|_| random::<i32>()),
               ))
           }

           fn generate_string_array() -> ArrayRef {
               Arc::new(StringArray::from(
                   (0..8192)
                       .map(|_| -> String {
                           thread_rng()
                               .sample_iter::<char, _>(rand::distributions::Standard)
                               .take(10)
                               .collect()
                       })
                       .collect::<Vec<String>>(),
               ))
           }

           fn generate_record_batch(schema: &SchemaRef) -> Result<RecordBatch> {
               RecordBatch::try_new(
                   Arc::clone(schema),
                   vec![
                       generate_int64_array(),
                       generate_string_array(),
                       generate_string_array(),
                       generate_string_array(),
                       generate_string_array(),
                       generate_int32_array(),
                       generate_string_array(),
                       generate_string_array(),
                       generate_string_array(),
                   ],
               )
               .map_err(|err| err.into())
           }

           let aggregate_expressions = vec![Arc::new(
               AggregateExprBuilder::new(sum_udaf(), vec![lit(1i64)])
                   .schema(Arc::clone(&scan_schema))
                   .alias("SUM(1i64)")
                   .build()?,
           )];

           #[derive(Debug)]
           struct Generator {
               index: usize,
               count: usize,
               schema: SchemaRef,
           }

           impl Display for Generator {
               fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
                   write!(f, "Generator")
               }
           }

           impl LazyBatchGenerator for Generator {
               fn generate_next_batch(&mut self) -> Result<Option<RecordBatch>> {
                   if self.index > self.count {
                       return Ok(None);
                   }

                   let batch = generate_record_batch(&self.schema)?;
                   self.index += 1;

                   Ok(Some(batch))
               }
           }

           let gene
   ```

Re: [PR] docs: change OSX/OS X to macOS [datafusion-comet]

2025-04-01 Thread via GitHub


andygrove merged PR #1584:
URL: https://github.com/apache/datafusion-comet/pull/1584





Re: [I] Add documentation example for `AggregateExprBuilder` [datafusion]

2025-04-01 Thread via GitHub


Shreyaskr1409 commented on issue #15369:
URL: https://github.com/apache/datafusion/issues/15369#issuecomment-2769196668

   @alamb should I add this example to datafusion-examples as well? as per 
https://github.com/apache/datafusion/pull/15504#issuecomment-2767270135





Re: [PR] chore: Reimplement ShuffleWriterExec using interleave_record_batch [datafusion-comet]

2025-04-01 Thread via GitHub


Kontinuation commented on code in PR #1511:
URL: https://github.com/apache/datafusion-comet/pull/1511#discussion_r2022987423


##
native/core/src/execution/shuffle/shuffle_writer.rs:
##
@@ -852,16 +1079,64 @@ impl PartitionBuffer {
                 file: spill_data,
             });
         }
-        self.spill_file
-            .as_mut()
-            .unwrap()
-            .file
-            .write_all(&output_batches)?;
+        Ok(())
+    }
+}
+
+/// Write batches to writer while using a buffer to avoid frequent system calls.
+/// The record batches were first written by ShuffleBlockWriter into an internal buffer.
+/// Once the buffer exceeds the max size, the buffer will be flushed to the writer.
+struct BufBatchWriter<S: Borrow<ShuffleBlockWriter>, W: Write> {
+    shuffle_block_writer: S,
+    writer: W,
+    buffer: Vec<u8>,
+    buffer_max_size: usize,
+}
+
+impl<S: Borrow<ShuffleBlockWriter>, W: Write> BufBatchWriter<S, W> {
+    fn new(shuffle_block_writer: S, writer: W) -> Self {
+        // 1MB should be good enough to avoid frequent system calls,
+        // and also won't cause too much memory usage
+        let buffer_max_size = 1024 * 1024;
+        Self {
+            shuffle_block_writer,
+            writer,
+            buffer: vec![],
+            buffer_max_size,
+        }
+    }
+
+    fn write(
+        &mut self,
+        batch: &RecordBatch,
+        encode_time: &Time,
+        write_time: &Time,
+    ) -> Result<usize> {
+        let mut cursor = Cursor::new(&mut self.buffer);
+        cursor.seek(SeekFrom::End(0))?;
+        let mut write_timer = write_time.timer();
+        let bytes_written = self
+            .shuffle_block_writer
+            .borrow()
+            .write_batch(batch, &mut cursor, encode_time)?;
+        let pos = cursor.position();
+        if pos >= self.buffer_max_size as u64 {
+            self.writer.write_all(&self.buffer)?;
+            self.buffer.clear();
+        }
         write_timer.stop();

Review Comment:
   Just found that I had removed the timing for `std::io::copy` and 
`BufWriter::flush`; I have added them back in the latest commit.






Re: [I] Extend TopK early termination to partially sorted inputs [datafusion]

2025-04-01 Thread via GitHub


geoffreyclaude commented on issue #15529:
URL: https://github.com/apache/datafusion/issues/15529#issuecomment-2769593513

   I ran some quick [experiments on my 
fork](https://github.com/geoffreyclaude/datafusion/pull/3) by checking for 
early termination after each batch processed in the "topK" on the example TPCH 
query above:
   - Elapsed dropped from `16s` to `800ms`: 20x speedup
   - The Parquet DataSource `output_rows` metric dropped from `17135217` to 
`81920` (81920 because it read 1 batch of 8192 rows in parallel on 10 
partitions): 200x reduction
   - The Parquet DataSource `bytes_scanned` metric dropped from `130MB` to 
`23MB`: 5x reduction (which doesn't align at all with the `output_rows` 
reduction for some reason...)





Re: [I] March 17, 2025: This week(s) in DataFusion [datafusion]

2025-04-01 Thread via GitHub


alamb commented on issue #15269:
URL: https://github.com/apache/datafusion/issues/15269#issuecomment-2769431998

   And we now have explain plans on by default 😍 
   - https://github.com/apache/datafusion/pull/15427





Re: [PR] Migrate `datafusion/sql` tests to insta, part2 [datafusion]

2025-04-01 Thread via GitHub


alamb merged PR #15499:
URL: https://github.com/apache/datafusion/pull/15499





Re: [I] Improve html table rendering formatting [datafusion-python]

2025-04-01 Thread via GitHub


AgalyaS1757 commented on issue #1078:
URL: 
https://github.com/apache/datafusion-python/issues/1078#issuecomment-2769659726

   Proposed solutions:

   1. Modify the data size limit
      - Identify the part of the code where the 2MB limit is enforced.
      - Introduce a user-configurable parameter (e.g., max_data_size), allowing users to set their preferred limit.
      - Ensure that the size approximation logic remains efficient and does not slow down performance.

   2. Add an option to disable styling
      - Locate the existing HTML rendering logic where styles are applied.
      - Introduce a flag (e.g., disable_styling) that allows users to enable or disable CSS styling in the output.
      - Implement a conditional check to apply styling only if the flag is not set.

   3. Enable custom formatting
      - Modify the ArrayFormatter or introduce a parameter (e.g., custom_formatter) that allows users to pass their own formatting function.
      - Ensure that the custom function is validated and safely applied without breaking existing functionality.

   4. Refactor the HTML generation code (nice-to-have for maintainability)
      - Break down the existing monolithic HTML rendering function into smaller, reusable helper functions.
      - Consider moving the HTML-related functions into a separate module/file if the changes become too extensive.

   5. Testing & documentation
      - Write test cases to ensure the new functionality works as expected.
      - Update the documentation with clear examples of how to set a custom data size limit, disable styling, and use a custom formatter.





Re: [I] Use global tokio runtime per executor process [datafusion-comet]

2025-04-01 Thread via GitHub


andygrove commented on issue #1590:
URL: 
https://github.com/apache/datafusion-comet/issues/1590#issuecomment-2769667177

   See https://github.com/apache/datafusion-comet/pull/1104 for a previous 
attempt at implementing this.





Re: [PR] Fix sequential metadata fetching in ListingTable causing high latency [datafusion]

2025-04-01 Thread via GitHub


alamb commented on PR #14918:
URL: https://github.com/apache/datafusion/pull/14918#issuecomment-2769716046

   This was referenced by @sergiimk in 
https://discord.com/channels/885562378132000778/1290751484807352412/1356393367566553240 
(they hit the same problem and were pleased to find it fixed!)





Re: [PR] fix: Queries similar to `count-bug` produce incorrect results [datafusion]

2025-04-01 Thread via GitHub


jayzhan211 commented on PR #15281:
URL: https://github.com/apache/datafusion/pull/15281#issuecomment-2769722147

   The projection is required to be in the group expression. I think these two 
queries are equivalent, but the subquery version groups by `e2.b` while the join 
query groups by `e1.b`.
   
   Not sure if this rewrite could be general enough πŸ€” 
   
   ```
   query IT
   select e1.b, (select case when max(e2.a) > 10 then 'a' else 'b' end from t2 e2 where e2.b = e1.b + 1) from t1 e1;
   
   0 a
   2 a
   
   query TT
   explain
   select e1.b, (select case when max(e2.a) > 10 then 'a' else 'b' end from t2 e2 where e2.b = e1.b + 1) from t1 e1;
   
   logical_plan
   01)Projection: e1.b, CASE WHEN __scalar_sq_1.__always_true IS NULL THEN Utf8("b") ELSE __scalar_sq_1.CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END END AS CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END
   02)--Left Join: CAST(e1.b AS Int64) + Int64(1) = CAST(__scalar_sq_1.b AS Int64)
   03)SubqueryAlias: e1
   04)--TableScan: t1 projection=[b]
   05)SubqueryAlias: __scalar_sq_1
   06)--Projection: CASE WHEN max(e2.a) > Int32(10) THEN Utf8("a") ELSE Utf8("b") END AS CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END, e2.b, Boolean(true) AS __always_true
   07)Aggregate: groupBy=[[e2.b]], aggr=[[max(e2.a)]]
   08)--SubqueryAlias: e2
   09)TableScan: t2 projection=[a, b]
   physical_plan
   01)ProjectionExec: expr=[b@0 as b, CASE WHEN __always_true@2 IS NULL THEN b ELSE CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END@1 END as CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END]
   02)--CoalesceBatchesExec: target_batch_size=8192
   03)HashJoinExec: mode=Partitioned, join_type=Left, on=[(e1.b + Int64(1)@1, CAST(__scalar_sq_1.b AS Int64)@3)], projection=[b@0, CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END@2, __always_true@4]
   04)--CoalesceBatchesExec: target_batch_size=8192
   05)RepartitionExec: partitioning=Hash([e1.b + Int64(1)@1], 4), input_partitions=1
   06)--ProjectionExec: expr=[b@0 as b, CAST(b@0 AS Int64) + 1 as e1.b + Int64(1)]
   07)DataSourceExec: partitions=1, partition_sizes=[1]
   08)--CoalesceBatchesExec: target_batch_size=8192
   09)RepartitionExec: partitioning=Hash([CAST(__scalar_sq_1.b AS Int64)@3], 4), input_partitions=4
   10)--ProjectionExec: expr=[CASE WHEN max(e2.a)@1 > 10 THEN a ELSE b END as CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END, b@0 as b, true as __always_true, CAST(b@0 AS Int64) as CAST(__scalar_sq_1.b AS Int64)]
   11)AggregateExec: mode=FinalPartitioned, gby=[b@0 as b], aggr=[max(e2.a)]
   12)--CoalesceBatchesExec: target_batch_size=8192
   13)RepartitionExec: partitioning=Hash([b@0], 4), input_partitions=4
   14)--RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
   15)AggregateExec: mode=Partial, gby=[b@1 as b], aggr=[max(e2.a)]
   16)--DataSourceExec: partitions=1, partition_sizes=[1]
   
   query IT
   SELECT
       e1.b,
       CASE
           WHEN MAX(e2.a) > 10 THEN 'a'
           ELSE 'b'
       END AS result
   FROM t2 e2
   LEFT JOIN t1 e1 ON e2.b = e1.b + 1
   GROUP BY e1.b;
   
   2 a
   0 a
   
   query TT
   explain
   SELECT
       e1.b,
       CASE
           WHEN MAX(e2.a) > 10 THEN 'a'
           ELSE 'b'
       END AS result
   FROM t2 e2
   LEFT JOIN t1 e1 ON e2.b = e1.b + 1
   GROUP BY e1.b;
   
   logical_plan
   01)Projection: e1.b, CASE WHEN max(e2.a) > Int32(10) THEN Utf8("a") ELSE Utf8("b") END AS result
   02)--Aggregate: groupBy=[[e1.b]], aggr=[[max(e2.a)]]
   03)Projection: e2.a, e1.b
   04)--Left Join: CAST(e2.b AS Int64) = CAST(e1.b AS Int64) + Int64(1)
   05)SubqueryAlias: e2
   06)--TableScan: t2 projection=[a, b]
   07)SubqueryAlias: e1
   08)--TableScan: t1 projection=[b]
   physical_plan
   01)ProjectionExec: expr=[b@0 as b, CASE WHEN max(e2.a)@1 > 10 THEN a ELSE b END as result]
   02)--AggregateExec: mode=FinalPartitioned, gby=[b@0 as b], aggr=[max(e2.a)]
   03)CoalesceBatchesExec: target_batch_size=8192
   04)--RepartitionExec: partitioning=Hash([b@0], 4), input_partitions=4
   05)AggregateExec: mode=Partial, gby=[b@1 as b], aggr=[max(e2.a)]
   06)--RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
   07)ProjectionExec: expr=[a@1 as a, b@0 as b]
   08)--CoalesceBatchesExec: target_batch_size=8192
   09)HashJoinExec: mode=Partitioned, join_type=Right, on=[(e1.b + Int64(1)@1, CAST(e2.b AS Int64)@2)], projection=[b@0, a@2]
   10)--ProjectionExec: expr=[b@0 as b, CAST(b@0 AS Int64) + 1 as e1.b + Int64(1)]
   11)DataSourceExec: partition
   ```

Re: [PR] Fix duplicate unqualified Field name (schema error) on join queries [datafusion]

2025-04-01 Thread via GitHub


LiaCastaneda commented on code in PR #15438:
URL: https://github.com/apache/datafusion/pull/15438#discussion_r2022571200


##
datafusion/expr/src/logical_plan/builder.rs:
##
@@ -1470,17 +1470,27 @@ impl ValuesFields {
 
 pub fn change_redundant_column(fields: &Fields) -> Vec {

Review Comment:
   yep that's a good idea πŸ‘ 






Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-01 Thread via GitHub


ctsk commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2022894121


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -805,6 +811,47 @@ impl BinaryExpr {
     }
 }
 
+/// Check if it meets the short-circuit condition
+/// 1. For the `AND` operator, if the `lhs` result all are `false`
+/// 2. For the `OR` operator, if the `lhs` result all are `true`
+/// 3. Otherwise, it does not meet the short-circuit condition
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
+    let data_type = arg.data_type();
+    match (data_type, op) {
+        (DataType::Boolean, Operator::And) => {
+            match arg {
+                ColumnarValue::Array(array) => {
+                    if let Ok(array) = as_boolean_array(&array) {
+                        return array.false_count() == array.len();
Review Comment:
   Might be overkill, but one *could* try a sampling approach: Run the loop 
with the early exit for the first few chunks, and then switch over to the 
unconditional loop.
   
   Almost seems like something the compiler could automagically do...
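   
   As a rough illustration of the sampling idea (not code from this PR; `CHUNK` and 
the two-chunk sample size are made-up tuning knobs, and nulls are ignored for 
brevity), something like:
   
   ```rust
   use arrow::array::BooleanArray;
   
   /// Returns true iff every value in `array` is false.
   /// Scan a small prefix with an early exit; if no `true` shows up,
   /// fall back to the branch-free `false_count` over the remainder.
   fn all_false_sampled(array: &BooleanArray) -> bool {
       const CHUNK: usize = 64;
       const SAMPLE_CHUNKS: usize = 2;
       let sample_len = (CHUNK * SAMPLE_CHUNKS).min(array.len());
       // Early-exit loop over the sampled prefix: cheap when a `true` appears early.
       for i in 0..sample_len {
           if array.value(i) {
               return false;
           }
       }
       // Unconditional, vectorizable count over the rest.
       let rest = array.slice(sample_len, array.len() - sample_len);
       rest.false_count() == rest.len()
   }
   ```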






Re: [PR] use state machine to refactor the `get_files_with_limit` method [datafusion]

2025-04-01 Thread via GitHub


xudong963 commented on PR #15521:
URL: https://github.com/apache/datafusion/pull/15521#issuecomment-2769429971

   Thanks @jayzhan211 





Re: [I] Follow up #15432 [datafusion]

2025-04-01 Thread via GitHub


xudong963 closed issue #15519: Follow up #15432
URL: https://github.com/apache/datafusion/issues/15519





Re: [I] Weekly Plan (Andrew Lamb) March 24, 2025 [datafusion]

2025-04-01 Thread via GitHub


alamb closed issue #15393: Weekly Plan (Andrew Lamb) March 24, 2025
URL: https://github.com/apache/datafusion/issues/15393





[I] Weekly Plan (Andrew Lamb) March 31, 2025 [datafusion]

2025-04-01 Thread via GitHub


alamb opened a new issue, #15528:
URL: https://github.com/apache/datafusion/issues/15528

   This is an attempt to organize myself and make what I plan to work on more 
visible
   
   ## Weekly High Level Goals
   - [ ] Work on integrating tpch data generator with @clflushopt  : 
https://github.com/apache/datafusion/issues/14608
   - [ ] Get https://github.com/apache/datafusion/issues/3463 ready for merge 
with @XiangpengHao 
   - [ ] TopK Pushdown #15037 with @adriangb 
   - [ ] https://github.com/apache/arrow-rs/issues/7084
   
   # Other projects I plan to review
   - [ ] Bug  fixes
   - [ ] Other performance improvements
   - [ ] Complete insta test migration 
https://github.com/apache/datafusion/issues/15178 with @blaginin @shruti2522 
@qstommyshu and others
   - [ ] Hardening external sorts: 
https://github.com/apache/datafusion/issues/14692 with @2010YOUY01 
   - [ ] Set up Spark function library pattern: 
https://github.com/apache/datafusion/pull/15168 with @shehabgamin and 
@andygrove 
   - [ ] Use UTF8 view by default  
https://github.com/apache/datafusion/issues/15096 with @zhuqi-lucas  
   
   
   ## Background
   I am putting this list on github because:
   1. I like how github renders checklists w/ PR titles so it is easy to track 
(I currently have a local text file...)
   2. I thought others might be interested in seeing what I am doing / planning to do
   3. It makes me feel better that I don't have time to review all the PRs 😭 
   
   The way I am trying to prioritize PRs is in the following order
   1. Bug fixes
   2. Documentation / UX / API improvements (things that make DataFusion 
easier/better to work with)
   3. Performance improvements
   4. New features with wide appeal
   5. New functions
   
   Note new features and functions are deliberately at the bottom 
   





Re: [I] Consolidate statistics aggregation [datafusion]

2025-04-01 Thread via GitHub


alamb commented on issue #8229:
URL: https://github.com/apache/datafusion/issues/8229#issuecomment-2769446857

   @xudong963  do you think we have completed this issue now?





[I] Extend TopK early termination to partially sorted inputs [datafusion]

2025-04-01 Thread via GitHub


geoffreyclaude opened a new issue, #15529:
URL: https://github.com/apache/datafusion/issues/15529

   ### Is your feature request related to a problem or challenge?
   
   DataFusion currently has a "TopK early termination" optimization, which 
speeds up queries that involve `ORDER BY` and `LIMIT` if the input data is 
already sorted by the full ordering requested in the query.
   
   However, many real-world scenarios involve datasets that are only partially 
sorted.
   
   For example, consider a time-series dataset that's pre-sorted by `day` but 
not sorted within each day. Queries requesting data sorted by `day, timestamp` 
should still benefit significantly from optimization because once DataFusion 
has collected the required number of rows from the most recent day(s), it could 
safely ignore data from earlier days.
   
   Today, DataFusion does not take advantage of such partial ordering, 
resulting in unnecessary scans and sorts.
   
   Example query affected by this:
   
   ```sql
   SELECT day, sensor_id, reading, timestamp
   FROM sensor_readings
   WHERE sensor_id = 1002
   ORDER BY day DESC, timestamp DESC
   LIMIT 10;
   ```
   
   If the data source providing `sensor_readings` can guarantee a `day DESC` 
ordering, this query should finish quickly after scanning enough rows from the 
most recent days, but currently DataFusion will unnecessarily continue scanning 
the full `sensor_readings` table.
   
   
   
   ### Describe the solution you'd like
   
   I propose extending DataFusion's existing "TopK early termination" 
optimization to handle cases where the input data is partially sorted by a 
prefix of the requested ordering.
   
   Specifically, DataFusion should detect when:
   
   - The input ordering has a non-empty common prefix with the query's requested ordering.
   - The top-K buffer is full.
   - All still-pending rows are guaranteed to be strictly worse than the top-K's 
max value when compared on the common prefix alone.
   
   Under these conditions, DataFusion can safely terminate scanning early, 
significantly improving query performance and reducing resource consumption.
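   
   A minimal sketch of the check (hypothetical names, not DataFusion's actual API; 
assumes ascending, NULLS LAST ordering on every prefix column):
   
   ```rust
   use std::cmp::Ordering;
   
   use datafusion_common::ScalarValue;
   
   /// Once the top-K heap is full, the scan can stop as soon as the incoming row
   /// is strictly worse than the heap's current max when compared on the sorted
   /// common prefix alone: every later row is at least as large on that prefix.
   fn can_terminate_early(
       heap_is_full: bool,
       row_prefix: &[ScalarValue],      // incoming row, common-prefix columns only
       heap_max_prefix: &[ScalarValue], // worst row currently kept by the heap
   ) -> bool {
       if !heap_is_full {
           return false;
       }
       // Lexicographic comparison restricted to the common prefix.
       for (row, max) in row_prefix.iter().zip(heap_max_prefix) {
           match row.partial_cmp(max) {
               Some(Ordering::Greater) => return true, // strictly worse: safe to stop
               Some(Ordering::Less) => return false,   // strictly better: keep scanning
               _ => continue,                          // tie: decided by suffix columns
           }
       }
       false // equal on the whole prefix: suffix columns may still beat the heap max
   }
   ```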
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   I wasn't able to find benchmarks on already sorted data.
   
   However, a simple reproducer from the TPCH dataset could be:
   ```sql
   CREATE EXTERNAL TABLE lineitem_ship (
   l_shipdate DATE,
   l_commitdate   DATE,
   l_shipmode VARCHAR,
   l_quantity INT
   )
   STORED AS PARQUET
   LOCATION 'scratch/topk'
   WITH ORDER (l_shipdate);
   
   INSERT INTO lineitem_ship
   SELECT
   l_shipdate,
   l_commitdate,
   l_shipmode,
   l_quantity
   FROM lineitem
   ORDER BY l_shipdate;
   
   SELECT
   l_shipdate,
   l_commitdate,
   l_quantity
   FROM lineitem_ship
   WHERE l_shipmode IN ('MAIL', 'AIR')
   ORDER BY l_shipdate, l_commitdate, l_quantity
   LIMIT 10;
   ```
   
   This query today scans the full `lineitem_ship` table. I'd expect it to be 
orders of magnitude faster with the sort prefix enhancement.





Re: [PR] fix: Queries similar to `count-bug` produce incorrect results [datafusion]

2025-04-01 Thread via GitHub


jayzhan211 commented on PR #15281:
URL: https://github.com/apache/datafusion/pull/15281#issuecomment-2769195062

   ```
   [2025-04-01T12:19:41Z DEBUG datafusion_optimizer::utils] scalar_subquery_to_join:
   Projection: e.b, __scalar_sq_1.CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END AS CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END
     Left Join:  Filter: CAST(__scalar_sq_1.b AS Int64) = CAST(e.b AS Int64) + Int64(1)
       SubqueryAlias: e
         TableScan: t
       SubqueryAlias: __scalar_sq_1
         Projection: CASE WHEN max(e2.a) > Int32(10) THEN Utf8("a") ELSE Utf8("b") END AS CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END, e2.b
           Aggregate: groupBy=[[e2.b]], aggr=[[max(e2.a)]]
             SubqueryAlias: e2
               TableScan: t2
   ```
   
   In the `scalar_subquery_to_join` optimization, we have a left join on 
`__scalar_sq_1` and `e`.
   
   I wonder whether we could join the plans first and then apply the aggregation 
on top of the joined plan.
   
   The plan I expected is something like this:
   ```
   [2025-04-01T12:19:41Z DEBUG datafusion_optimizer::utils] scalar_subquery_to_join:
   Projection: CASE WHEN max(e2.a) > Int32(10) THEN Utf8("a") ELSE Utf8("b") END AS CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END, e2.b
     Aggregate: groupBy=[[e2.b]], aggr=[[max(e2.a)]]
       Left Join:  Filter: CAST(e2.b AS Int64) = CAST(e.b AS Int64) + Int64(1)
         SubqueryAlias: e
           TableScan: t
         SubqueryAlias: e2
           TableScan: t2
   ```





Re: [I] AQE Unable to Rewrite Joins as Broadcast Hash Joins Due to Existing CometBroadcastHashJoin Operator [datafusion-comet]

2025-04-01 Thread via GitHub


mbutrovich commented on issue #1589:
URL: 
https://github.com/apache/datafusion-comet/issues/1589#issuecomment-2769269205

   Good catch, @Kontinuation! 
https://github.com/apache/datafusion-comet/pull/1578 has me looking at AQE and 
wondering whether there are other places where Comet isn't working with AQE where 
it should be. This is motivation to keep investigating.





Re: [PR] GroupsAccumulator for Duration (#15322) [datafusion]

2025-04-01 Thread via GitHub


emilk closed pull request #15522: GroupsAccumulator for Duration (#15322)
URL: https://github.com/apache/datafusion/pull/15522





Re: [I] [EPIC] ClickBench Improvements (Vanity Benchmark) [datafusion]

2025-04-01 Thread via GitHub


zhuqi-lucas commented on issue #14586:
URL: https://github.com/apache/datafusion/issues/14586#issuecomment-2768790003

   Make Clickbench Q29 5X faster:
   
   https://github.com/apache/datafusion/issues/15524





Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-04-01 Thread via GitHub


xudong963 commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2022286476


##
datafusion/datasource/src/file_scan_config.rs:
##
@@ -,4 +2315,163 @@ mod tests {
 assert_eq!(new_config.constraints, Constraints::default());
 assert!(new_config.new_lines_in_values);
 }
+
+#[test]
+fn test_split_groups_by_statistics_with_target_partitions() -> Result<()> {

Review Comment:
   Add tests for new method






Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-04-01 Thread via GitHub


ctsk commented on code in PR #15301:
URL: https://github.com/apache/datafusion/pull/15301#discussion_r2022551332


##
datafusion/physical-plan/src/sorts/sort_filters.rs:
##
@@ -0,0 +1,236 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::sync::{Arc, RwLock};
+
+use arrow_schema::SortOptions;
+use datafusion_common::{Result, ScalarValue};
+use datafusion_expr::Operator;
+use datafusion_physical_expr::{
+    expressions::{is_not_null, is_null, lit, BinaryExpr},
+    LexOrdering, PhysicalExpr,
+};
+
+use crate::dynamic_filters::{DynamicFilterPhysicalExpr, DynamicFilterSource};
+
+/// Holds threshold value and sort order information for a column
+#[derive(Debug, Clone)]
+struct ColumnThreshold {
+    /// The current threshold value
+    pub value: Arc<RwLock<Option<ScalarValue>>>,
+    /// The column expression
+    pub expr: Arc<dyn PhysicalExpr>,
+    /// Sort options
+    pub sort_options: SortOptions,
+}
+
+/// Pushdown of dynamic filters from sort + limit operators (aka `TopK`) is used to speed up queries
+/// such as `SELECT * FROM table ORDER BY col DESC LIMIT 10` by pushing down the
+/// threshold values for the sort columns to the data source.
+/// That is, the TopK operator will keep track of the top 10 values for the sort
+/// and before a new file is opened its statistics will be checked against the
+/// threshold values to determine if the file can be skipped and predicate pushdown
+/// will use these to skip rows during the scan.
+///
+/// For example, imagine this data gets created when multiple sources with clock skews,
+/// network delays, etc. are writing data and you don't do anything fancy to guarantee
+/// perfect sorting by `timestamp` (i.e. you naively write out the data to Parquet, maybe do some compaction, etc.).
+/// The point is that 99% of yesterday's files have a `timestamp` smaller than 99% of today's files
+/// but there may be a couple seconds of overlap between files.
+/// To be concrete, let's say this is our data:
+//
+// | file | min | max |
+// |------|-----|-----|
+// | 1    | 1   | 10  |
+// | 2    | 9   | 19  |
+// | 3    | 20  | 31  |
+// | 4    | 30  | 35  |
+//
+// Ideally a [`TableProvider`] is able to use file level stats or other methods to roughly order the files
+// within each partition / file group such that we start with the newest / largest `timestamp`s.
+// If this is not possible the optimization still works but is less efficient and harder to visualize,
+// so for this example let's assume that we process 1 file at a time and we started with file 4.
+// After processing file 4 let's say we have 10 values in our TopK heap, the smallest of which is 30.
+// The TopK operator will then push the filter `timestamp < 30` down the tree of [`ExecutionPlan`]s
+// and if the data source supports dynamic filter pushdown it will accept a reference to this [`DynamicPhysicalExprSource`]
+// and when it goes to open file 3 it will ask the [`DynamicPhysicalExprSource`] for the current filters.
+// Since file 3 may contain values larger than 30 we cannot skip it entirely,
+// but scanning it may still be more efficient due to page pruning and other optimizations.
+// Once we get to file 2 however we can skip it entirely because we know that all values in file 2 are smaller than 30.
+// The same goes for file 1.
+// So this optimization just saved us 50% of the work of scanning the data.
+#[derive(Debug, Clone)]
+pub struct SortDynamicFilterSource {
+    thresholds: Vec<ColumnThreshold>,
+}
+
+impl SortDynamicFilterSource {
+    pub fn new(ordering: &LexOrdering) -> Self {
+        let thresholds = ordering
+            .iter()
+            .map(|sort_expr| ColumnThreshold {
+                value: Arc::new(RwLock::new(None)),
+                expr: Arc::clone(&sort_expr.expr),
+                sort_options: sort_expr.options,
+            })
+            .collect();
+
+        Self { thresholds }
+    }
+
+    pub fn update_values(&self, new_values: &[ScalarValue]) {
+        if new_values.len() != self.thresholds.len() {
+            panic!("New values length does not match the number of thresholds");
+        }
+        for (i, new_value) in new_values.iter().enumerate() {
+            let threshold = &self.thresholds[i];

Re: [I] TPCH unit tests failure [datafusion-ballista]

2025-04-01 Thread via GitHub


milenkovicm commented on issue #1194:
URL: 
https://github.com/apache/datafusion-ballista/issues/1194#issuecomment-2769278754

   @vmingchen is this issue closed with #1195? 





Re: [I] Run all benchmarks on merge to main branch [datafusion]

2025-04-01 Thread via GitHub


Omega359 commented on issue #15511:
URL: https://github.com/apache/datafusion/issues/15511#issuecomment-2769284420

   > > There has been a number of issues where benchmarks stopped working and 
no one noticed until someone happened to try and run them
   > 
   > Instead of running the benchmark, how about adding those benchmark query 
to tests, I don't think we need to actually "benchmark" the code for each merge.
   
   We could do that too; however, not all benchmarks are SQL queries.





Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-04-01 Thread via GitHub


ctsk commented on code in PR #15301:
URL: https://github.com/apache/datafusion/pull/15301#discussion_r2022274853


##
datafusion/physical-plan/src/topk/mod.rs:
##
@@ -186,6 +235,90 @@ impl TopK {
         Ok(())
     }
 
+    fn calculate_dynamic_filters(
+        thresholds: Vec<ColumnThreshold>,
+    ) -> Result<Option<Arc<dyn PhysicalExpr>>> {
+        // Create filter expressions for each threshold
+        let mut filters: Vec<Arc<dyn PhysicalExpr>> =

Review Comment:
   Been thinking about this too :D.
   
   One could add another term to the generated expression:
   ```
   passive OR <expr>
   ```
   `passive` itself is a dynamic literal that starts out as `true` and gets set 
to `false` by the TopK Heap once the heap is full.
   
   A downside of the DynamicLiteral approach is that it requires one mutex per 
field (vs 1 mutex for the whole expression previously). For native types, this 
*feels* like something that could be backed by an atomic, but I don't see an 
easy way of achieving that.
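   
   A minimal sketch of the atomic-backed `passive` flag (hypothetical types, not 
the PR's actual code):
   
   ```rust
   use std::sync::atomic::{AtomicBool, Ordering};
   use std::sync::Arc;
   
   /// Shared flag that starts out `true` ("passive"), making `passive OR <filter>`
   /// short-circuit to `true`, and is flipped to `false` once the TopK heap fills,
   /// at which point the threshold filter becomes active. A bool needs no mutex.
   #[derive(Debug, Clone)]
   struct PassiveFlag(Arc<AtomicBool>);
   
   impl PassiveFlag {
       fn new() -> Self {
           PassiveFlag(Arc::new(AtomicBool::new(true)))
       }
       /// Called by the TopK heap once it holds K rows.
       fn deactivate(&self) {
           self.0.store(false, Ordering::Relaxed);
       }
       /// Read at filter-evaluation time.
       fn is_passive(&self) -> bool {
           self.0.load(Ordering::Relaxed)
       }
   }
   ```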






Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-04-01 Thread via GitHub


ctsk commented on code in PR #15301:
URL: https://github.com/apache/datafusion/pull/15301#discussion_r2022327718


##
datafusion/physical-plan/src/sorts/sort_filters.rs:
##
@@ -0,0 +1,236 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::sync::{Arc, RwLock};
+
+use arrow_schema::SortOptions;
+use datafusion_common::{Result, ScalarValue};
+use datafusion_expr::Operator;
+use datafusion_physical_expr::{
+expressions::{is_not_null, is_null, lit, BinaryExpr},
+LexOrdering, PhysicalExpr,
+};
+
+use crate::dynamic_filters::{DynamicFilterPhysicalExpr, DynamicFilterSource};
+
+/// Holds threshold value and sort order information for a column
+#[derive(Debug, Clone)]
+struct ColumnThreshold {
+/// The current threshold value
+pub value: Arc>>,
+/// The column expression
+pub expr: Arc,
+/// Sort options
+pub sort_options: SortOptions,
+}
+
+/// Pushdown of dynamic fitlers from sort + limit operators (aka `TopK`) is 
used to speed up queries
+/// such as `SELECT * FROM table ORDER BY col DESC LIMIT 10` by pushing down 
the
+/// threshold values for the sort columns to the data source.
+/// That is, the TopK operator will keep track of the top 10 values for the 
sort
+/// and before a new file is opened it's statitics will be checked against the
+/// threshold values to determine if the file can be skipped and predicate 
pushdown
+/// will use these to skip rows during the scan.
+///
+/// For example, imagine this data gets created if multiple sources with clock 
skews,
+/// network delays, etc. are writing data and you don't do anything fancy to 
guarantee
+/// perfect sorting by `timestamp` (i.e. you naively write out the data to 
Parquet, maybe do some compaction, etc.).
+/// The point is that 99% of yesterday's files have a `timestamp` smaller than 
99% of today's files
+/// but there may be a couple seconds of overlap between files.
+/// To be concrete, let's say this is our data:
+//
+// | file | min | max |
+// |--|-|-|
+// | 1| 1   | 10  |
+// | 2| 9   | 19  |
+// | 3| 20  | 31  |
+// | 4| 30  | 35  |
+//
+// Ideally a [`TableProvider`] is able to use file level stats or other 
methods to roughly order the files
+// within each partition / file group such that we start with the newest / 
largest `timestamp`s.
+// If this is not possible the optimization still works but is less efficient 
and harder to visualize,
+// so for this example let's assume that we process 1 file at a time and we 
started with file 4.
+// After processing file 4 let's say we have 10 values in our TopK heap, the 
smallest of which is 30.
+// The TopK operator will then push down the filter `timestamp < 30` down the 
tree of [`ExecutionPlan`]s
+// and if the data source supports dynamic filter pushdown it will accept a 
reference to this [`DynamicPhysicalExprSource`]
+// and when it goes to open file 3 it will ask the 
[`DynamicPhysicalExprSource`] for the current filters.
+// Since file 3 may contain values larger than 30 we cannot skip it entirely,
+// but scanning it may still be more efficient due to page pruning and other 
optimizations.
+// Once we get to file 2 however we can skip it entirely because we know that 
all values in file 2 are smaller than 30.
+// The same goes for file 1.
+// So this optimization just saved us 50% of the work of scanning the data.
+#[derive(Debug, Clone)]
+pub struct SortDynamicFilterSource {
+thresholds: Vec,
+}
+
+impl SortDynamicFilterSource {
+pub fn new(ordering: &LexOrdering) -> Self {
+let thresholds = ordering
+.iter()
+.map(|sort_expr| ColumnThreshold {
+value: Arc::new(RwLock::new(None)),
+expr: Arc::clone(&sort_expr.expr),
+sort_options: sort_expr.options,
+})
+.collect();
+
+Self { thresholds }
+}
+
+pub fn update_values(&self, new_values: &[ScalarValue]) {
+if new_values.len() != self.thresholds.len() {
+panic!("New values length does not match the number of 
thresholds");
+}
+for (i, new_value) in new_values.iter().enumerate() {
+let threshold = &self.thresholds[i];
+   

[I] Make Clickbench Q29 5x faster for datafusion [datafusion]

2025-04-01 Thread via GitHub


zhuqi-lucas opened a new issue, #15524:
URL: https://github.com/apache/datafusion/issues/15524

   ### Is your feature request related to a problem or challenge?
   
   Screenshot: https://github.com/user-attachments/assets/c82b798f-7c14-42e9-b6d9-b67a6b038c9d
   
   
   DataFusion is 5x slower than DuckDB for Q29; it's easy for us to make it 5x 
faster. Here is the attempt:
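   
   (The rewrite relies on the identity SUM(col + k) = SUM(col) + k * COUNT(col): 
the 90 per-row additions collapse into a single SUM, a single COUNT, and 
constant-time arithmetic after aggregation. The COUNT(*) form used below assumes 
"ResolutionWidth" contains no NULLs.)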
   
   
   Before rewrite:
   
   
   ```rust
   cargo run --profile release-nonlto --target aarch64-apple-darwin --bin dfbench -- clickbench -p benchmarks/data/hits_partitioned -q 29
   Finished `release-nonlto` profile [optimized] target(s) in 0.26s
   Running `target/aarch64-apple-darwin/release-nonlto/dfbench clickbench -p benchmarks/data/hits_partitioned -q 29`
   Running benchmarks with the following options: RunOpt { query: Some(29), common: CommonOpt { iterations: 3, partitions: None, batch_size: 8192, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "benchmarks/data/hits_partitioned", queries_path: "benchmarks/queries/clickbench/queries.sql", output_path: None }
   Q29: SELECT SUM("ResolutionWidth"), SUM("ResolutionWidth" + 1), SUM("ResolutionWidth" + 2), SUM("ResolutionWidth" + 3), SUM("ResolutionWidth" + 4), SUM("ResolutionWidth" + 5), SUM("ResolutionWidth" + 6), SUM("ResolutionWidth" + 7), SUM("ResolutionWidth" + 8), SUM("ResolutionWidth" + 9), SUM("ResolutionWidth" + 10), SUM("ResolutionWidth" + 11), SUM("ResolutionWidth" + 12), SUM("ResolutionWidth" + 13), SUM("ResolutionWidth" + 14), SUM("ResolutionWidth" + 15), SUM("ResolutionWidth" + 16), SUM("ResolutionWidth" + 17), SUM("ResolutionWidth" + 18), SUM("ResolutionWidth" + 19), SUM("ResolutionWidth" + 20), SUM("ResolutionWidth" + 21), SUM("ResolutionWidth" + 22), SUM("ResolutionWidth" + 23), SUM("ResolutionWidth" + 24), SUM("ResolutionWidth" + 25), SUM("ResolutionWidth" + 26), SUM("ResolutionWidth" + 27), SUM("ResolutionWidth" + 28), SUM("ResolutionWidth" + 29), SUM("ResolutionWidth" + 30), SUM("ResolutionWidth" + 31), SUM("ResolutionWidth" + 32), SUM("ResolutionWidth" + 33), SUM("ResolutionWidth" + 34), SUM("ResolutionWidth" + 35), SUM("ResolutionWidth" + 36), SUM("ResolutionWidth" + 37), SUM("ResolutionWidth" + 38), SUM("ResolutionWidth" + 39), SUM("ResolutionWidth" + 40), SUM("ResolutionWidth" + 41), SUM("ResolutionWidth" + 42), SUM("ResolutionWidth" + 43), SUM("ResolutionWidth" + 44), SUM("ResolutionWidth" + 45), SUM("ResolutionWidth" + 46), SUM("ResolutionWidth" + 47), SUM("ResolutionWidth" + 48), SUM("ResolutionWidth" + 49), SUM("ResolutionWidth" + 50), SUM("ResolutionWidth" + 51), SUM("ResolutionWidth" + 52), SUM("ResolutionWidth" + 53), SUM("ResolutionWidth" + 54), SUM("ResolutionWidth" + 55), SUM("ResolutionWidth" + 56), SUM("ResolutionWidth" + 57), SUM("ResolutionWidth" + 58), SUM("ResolutionWidth" + 59), SUM("ResolutionWidth" + 60), SUM("ResolutionWidth" + 61), SUM("ResolutionWidth" + 62), SUM("ResolutionWidth" + 63), SUM("ResolutionWidth" + 64), SUM("ResolutionWidth" + 65), SUM("ResolutionWidth" + 66), SUM("ResolutionWidth" + 67), SUM("ResolutionWidth" + 68), SUM("ResolutionWidth" + 69), SUM("ResolutionWidth" + 70), SUM("ResolutionWidth" + 71), SUM("ResolutionWidth" + 72), SUM("ResolutionWidth" + 73), SUM("ResolutionWidth" + 74), SUM("ResolutionWidth" + 75), SUM("ResolutionWidth" + 76), SUM("ResolutionWidth" + 77), SUM("ResolutionWidth" + 78), SUM("ResolutionWidth" + 79), SUM("ResolutionWidth" + 80), SUM("ResolutionWidth" + 81), SUM("ResolutionWidth" + 82), SUM("ResolutionWidth" + 83), SUM("ResolutionWidth" + 84), SUM("ResolutionWidth" + 85), SUM("ResolutionWidth" + 86), SUM("ResolutionWidth" + 87), SUM("ResolutionWidth" + 88), SUM("ResolutionWidth" + 89) FROM hits;
   Query 29 iteration 0 took 341.5 ms and returned 1 rows
   Query 29 iteration 1 took 320.7 ms and returned 1 rows
   Query 29 iteration 2 took 303.0 ms and returned 1 rows
   Query 29 avg time: 321.73 ms
   ```
   
   
   After rewrite:
   
   ```shell
   cargo run --profile release-nonlto   --target aarch64-apple-darwin --bin 
dfbench -- clickbench  -p benchmarks/data/hits_partitioned -q 29
   Finished `release-nonlto` profile [optimized] target(s) in 0.26s
Running `target/aarch64-apple-darwin/release-nonlto/dfbench clickbench 
-p benchmarks/data/hits_partitioned -q 29`
   Running benchmarks with the following options: RunOpt { query: Some(29), 
common: CommonOpt { iterations: 3, partitions: None, batch_size: 8192, 
mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, 
debug: false }, path: "benchmarks/data/hits_partitioned", queries_path: 
"benchmarks/queries/clickbench/queries.sql", output_path: None }
   Q29: SELECT SUM("ResolutionWidth"), SUM("ResolutionWidth") + 1 * COUNT(*), 
SUM("ResolutionWidth") + 2 * COUNT(*), SUM("ResolutionWidth") + 3 * COUNT(*), 
SUM("ResolutionWidth") + 4 * COUNT(*), SUM("ResolutionWidth") + 5 * COUNT(*), 
SUM("ResolutionWi

Re: [PR] chore: Reimplement ShuffleWriterExec using interleave_record_batch [datafusion-comet]

2025-04-01 Thread via GitHub


Kontinuation commented on code in PR #1511:
URL: https://github.com/apache/datafusion-comet/pull/1511#discussion_r2022553633


##
native/core/src/execution/shuffle/shuffle_writer.rs:
##
@@ -667,175 +740,322 @@ impl Debug for ShuffleRepartitioner {
 }
 }
 
-/// The status of appending rows to a partition buffer.
-#[derive(Debug)]
-enum AppendRowStatus {
-/// Rows were appended
-Appended,
-/// Not all rows were appended due to lack of available memory
-StartIndex(usize),
-}
-
-struct PartitionBuffer {
-/// The schema of batches to be partitioned.
-schema: SchemaRef,
-/// The "frozen" Arrow IPC bytes of active data. They are frozen when 
`flush` is called.
-frozen: Vec,
-/// Array builders for appending rows into buffering batches.
-active: Vec>,
-/// The estimation of memory size of active builders in bytes when they 
are filled.
-active_slots_mem_size: usize,
-/// Number of rows in active builders.
-num_active_rows: usize,
-/// The maximum number of rows in a batch. Once `num_active_rows` reaches 
`batch_size`,
-/// the active array builders will be frozen and appended to frozen buffer 
`frozen`.
+/// A partitioner that writes all shuffle data to a single file and a single 
index file
+struct SinglePartitionShufflePartitioner {
+// output_data_file: File,
+output_data_writer: BufBatchWriter,
+output_index_path: String,
+/// Batches that are smaller than the batch size and to be concatenated
+buffered_batches: Vec,
+/// Number of rows in the concatenating batches
+num_buffered_rows: usize,
+/// Metrics for the repartitioner
+metrics: ShuffleRepartitionerMetrics,
+/// The configured batch size
 batch_size: usize,
-/// Memory reservation for this partition buffer.
-reservation: MemoryReservation,
-/// Spill file for intermediate shuffle output for this partition. Each 
spill event
-/// will append to this file and the contents will be copied to the 
shuffle file at
-/// the end of processing.
-spill_file: Option,
-/// Writer that performs encoding and compression
-shuffle_block_writer: ShuffleBlockWriter,
-}
-
-struct SpillFile {
-temp_file: RefCountedTempFile,
-file: File,
 }
 
-impl PartitionBuffer {
+impl SinglePartitionShufflePartitioner {
 fn try_new(
+output_data_path: String,
+output_index_path: String,
 schema: SchemaRef,
+metrics: ShuffleRepartitionerMetrics,
 batch_size: usize,
-reservation: MemoryReservation,
 codec: CompressionCodec,
 enable_fast_encoding: bool,
 ) -> Result {
 let shuffle_block_writer =
-ShuffleBlockWriter::try_new(schema.as_ref(), enable_fast_encoding, 
codec)?;
-let active_slots_mem_size = schema
-.fields()
-.iter()
-.map(|field| slot_size(batch_size, field.data_type()))
-.sum::();
+ShuffleBlockWriter::try_new(schema.as_ref(), enable_fast_encoding, 
codec.clone())?;
+
+let output_data_file = OpenOptions::new()
+.write(true)
+.create(true)
+.truncate(true)
+.open(output_data_path)
+.map_err(to_df_err)?;
+
+let output_data_writer = BufBatchWriter::new(shuffle_block_writer, 
output_data_file);
+
 Ok(Self {
-schema,
-frozen: vec![],
-active: vec![],
-active_slots_mem_size,
-num_active_rows: 0,
+output_data_writer,
+output_index_path,
+buffered_batches: vec![],
+num_buffered_rows: 0,
+metrics,
 batch_size,
-reservation,
-spill_file: None,
-shuffle_block_writer,
 })
 }
 
-/// Initializes active builders if necessary.
-/// Returns error if memory reservation fails.
-fn allocate_active_builders(&mut self, metrics: 
&ShuffleRepartitionerMetrics) -> Result<()> {
-if self.active.is_empty() {
-let mut mempool_timer = metrics.mempool_time.timer();
-self.reservation.try_grow(self.active_slots_mem_size)?;
-mempool_timer.stop();
-
-let mut repart_timer = metrics.repart_time.timer();
-self.active = new_array_builders(&self.schema, self.batch_size);
-repart_timer.stop();
-}
-Ok(())
+/// Add a batch to the buffer of the partitioner, these buffered batches 
will be concatenated
+/// and written to the output data file when the number of rows in the 
buffer reaches the batch size.
+fn add_buffered_batch(&mut self, batch: RecordBatch) {
+self.num_buffered_rows += batch.num_rows();
+self.buffered_batches.push(batch);
 }
 
-/// Appends rows of specified indices from columns into active array 
builders.
-fn append_rows(
-&mut self,
-columns: &[ArrayRef],
-
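   
   For context, the buffering strategy described in the doc comments above can
   be summarized by this minimal sketch (illustrative only, not the Comet
   implementation; it assumes arrow's `concat_batches` kernel and omits the
   final flush of any leftover rows):
   
   ```rust
   use arrow::array::RecordBatch;
   use arrow::compute::concat_batches;
   use arrow::error::Result;
   
   /// Buffers small batches and emits one concatenated batch once enough
   /// rows have accumulated, mirroring `add_buffered_batch` above.
   struct BatchBuffer {
       buffered: Vec<RecordBatch>,
       num_rows: usize,
       batch_size: usize,
   }
   
   impl BatchBuffer {
       fn push(&mut self, batch: RecordBatch) -> Result<Option<RecordBatch>> {
           self.num_rows += batch.num_rows();
           self.buffered.push(batch);
           if self.num_rows >= self.batch_size {
               // Concatenate all buffered batches into a single output batch
               let schema = self.buffered[0].schema();
               let merged = concat_batches(&schema, &self.buffered)?;
               self.buffered.clear();
               self.num_rows = 0;
               Ok(Some(merged))
           } else {
               Ok(None)
           }
       }
   }
   ```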

[PR] ArraySort: support structs [datafusion]

2025-04-01 Thread via GitHub


cht42 opened a new pull request, #15527:
URL: https://github.com/apache/datafusion/pull/15527

   ## Which issue does this PR close?
   
   
   
   - Closes #15526
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   





Re: [I] Make Clickbench Q29 5x faster for datafusion [datafusion]

2025-04-01 Thread via GitHub


zhuqi-lucas commented on issue #15524:
URL: https://github.com/apache/datafusion/issues/15524#issuecomment-2768994999

   Thank you @jayzhan211 for the guide, I will try this!





Re: [PR] Add documentation example for `AggregateExprBuilder` [datafusion]

2025-04-01 Thread via GitHub


alamb commented on PR #15504:
URL: https://github.com/apache/datafusion/pull/15504#issuecomment-2769364593

   > @berkaysynnada thank you, actually I did use the datafusion-examples as 
reference.
   > 
   > > Perhaps we can make this in datafusion-examples as well
   > 
   > Yeah that could also be done with maybe a few minor adjustments. If you 
want, I can do that as well once I have the active conversations reviewed and 
resolved.
   
   I think it would be good to avoid having duplicated content in 
datafusion-examples and the normal documentation. I personally think the 
documentation site is more accessible (more chance of someone reading / finding 
it via search) than Rust examples in the repository.





Re: [I] Run all benchmarks on merge to main branch [datafusion]

2025-04-01 Thread via GitHub


jayzhan211 commented on issue #15511:
URL: https://github.com/apache/datafusion/issues/15511#issuecomment-2769754304

   > > I don't think we need to actually "benchmark" the code for each merge.
   > 
   > The issue [#5504](https://github.com/apache/datafusion/issues/5504) would 
require all benchmarks to run after each merge. I think we could just add 
benchmarks directly for now. What do you think?
   > 
   > I am willing to work on* this.
   
   I don't think we need to run the benchmark on CI; at the least, it should be 
optional and disabled by default.
   
   > We could do that too however not all benchmarks are sql queries.
   
   I agree that maintaining it is challenging. Adding it to the extended test 
suite and running it on every merge isn't a viable solution, as it's costly and 
often unnecessary. I don't think keeping the benchmark functional is 
essential; it's more like a script that we can modify as needed, depending on 
what we want to measure each time.
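   
   For ad-hoc measurements outside CI, the existing benchmark script can be run
   manually and compared between branches. An illustrative invocation (see
   benchmarks/README.md in the repository for the exact options; `my-branch` is
   a placeholder):
   
   ```sh
   cd benchmarks
   ./bench.sh data tpch                # generate the TPC-H dataset once
   ./bench.sh run tpch                 # run the benchmark on the current branch
   ./bench.sh compare main my-branch   # compare the saved result sets
   ```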
   
   





Re: [PR] Add documentation example for `AggregateExprBuilder` [datafusion]

2025-04-01 Thread via GitHub


alamb commented on code in PR #15504:
URL: https://github.com/apache/datafusion/pull/15504#discussion_r2023080660


##
datafusion/physical-expr/src/aggregate.rs:
##
@@ -97,6 +97,167 @@ impl AggregateExprBuilder {
 /// Constructs an `AggregateFunctionExpr` from the builder
 ///
 /// Note that an [`Self::alias`] must be provided before calling this 
method.
+///
+/// # Example: Create an `AggregateUDF`
+///
+/// In the following example, `AggregateFunctionExpr` will be built using 
`AggregateExprBuilder`

Review Comment:
   If you add `[]` to the names rustdoc will make them links. For example
   
   ```suggestion
   /// In the following example, [`AggregateFunctionExpr`] will be built 
using [`AggregateExprBuilder`]
   ```
   
   You may have to provide the full path with a separate definition like;
   
   ```rust
   /// Link to [MyStruct]
   ///
   /// [MyStruct]: core::my_struct::MyStruct
   
   ```



##
datafusion/physical-expr/src/aggregate.rs:
##
@@ -97,6 +97,167 @@ impl AggregateExprBuilder {
 /// Constructs an `AggregateFunctionExpr` from the builder
 ///
 /// Note that an [`Self::alias`] must be provided before calling this 
method.
+///
+/// # Example: Create an `AggregateUDF`
+///
+/// In the following example, `AggregateFunctionExpr` will be built using 
`AggregateExprBuilder`
+/// which provides a build function.
+///
+/// First we will create an `Accumulator` which will be used to further 
implement `AggregateUDFImpl`.
+/// After implementing `AggregateUDFImpl`, it could be used to pass in as 
a parameter to create an `AggregateExprBuilder`.
+/// `AggregateExprBuilder` could the be used to generate 
`AggregateFunctionExpr` after chaining
+/// queries on top of each other.
+///
+/// ```

Review Comment:
   I think this is not the right place to demonstrate creating a user defined 
aggregate so I suggest removing this particular example (notes below how to 
fold it into the other example)



##
datafusion/physical-expr/src/aggregate.rs:
##
@@ -97,6 +97,167 @@ impl AggregateExprBuilder {
 /// Constructs an `AggregateFunctionExpr` from the builder
 ///
 /// Note that an [`Self::alias`] must be provided before calling this 
method.
+///
+/// # Example: Create an `AggregateUDF`
+///
+/// In the following example, `AggregateFunctionExpr` will be built using 
`AggregateExprBuilder`
+/// which provides a build function.
+///
+/// First we will create an `Accumulator` which will be used to further 
implement `AggregateUDFImpl`.
+/// After implementing `AggregateUDFImpl`, it could be used to pass in as 
a parameter to create an `AggregateExprBuilder`.
+/// `AggregateExprBuilder` could the be used to generate 
`AggregateFunctionExpr` after chaining
+/// queries on top of each other.
+///
+/// ```
+/// use std::any::Any;
+/// use std::sync::OnceLock;
+/// use std::sync::Arc;
+/// use arrow::datatypes::DataType;
+/// use datafusion_common::{DataFusionError, plan_err, Result, 
ScalarValue};
+/// use datafusion_expr::{col, ColumnarValue, Signature, Volatility, Expr, 
Documentation};
+/// use datafusion_expr::{AggregateUDFImpl, AggregateUDF, Accumulator, 
function::{AccumulatorArgs, StateFieldsArgs}};
+/// use datafusion_expr::window_doc_sections::DOC_SECTION_AGGREGATE;
+/// use arrow::datatypes::Schema;
+/// use arrow::datatypes::Field;
+/// use arrow::array::Array;
+///
+/// #[derive(Debug)]
+/// struct FirstValueAccumulator {
+/// value: Option,
+/// data_type: DataType,
+/// }
+///
+/// impl Accumulator for FirstValueAccumulator {
+/// fn update_batch(&mut self, values: &[Arc]) -> 
Result<()> {
+/// if self.value.is_none() && !values.is_empty() {
+/// let first_array = &values[0];
+/// for i in 0..first_array.len() {
+/// if !first_array.is_null(i) {
+/// self.value = 
Some(ScalarValue::try_from_array(first_array, i)?);
+/// break;
+/// }
+/// }
+/// }
+/// Ok(())
+/// }
+///
+/// fn merge_batch(&mut self, states: &[Arc]) -> Result<()> 
{
+/// if self.value.is_none() && !states.is_empty() {
+/// let first_array = &states[0];
+/// for i in 0..first_array.len() {
+/// if !first_array.is_null(i) {
+/// self.value = 
Some(ScalarValue::try_from_array(first_array, i)?);
+/// break;
+/// }
+/// }
+/// }
+/// Ok(())
+/// }
+///
+/// fn evaluate(&mut self) -> Result {
+/// match &self.value {
+/// Some(value) => Ok(value.c

Re: [PR] datafusion-cli: document reading partitioned parquet [datafusion]

2025-04-01 Thread via GitHub


alamb commented on code in PR #15505:
URL: https://github.com/apache/datafusion/pull/15505#discussion_r202314


##
docs/source/user-guide/cli/datasources.md:
##
@@ -126,6 +125,32 @@ select count(*) from hits;
 1 row in set. Query took 0.344 seconds.
 ```
 
+**Why Wildcards Are Not Supported**
+
+Although wildcards (e.g., `*.parquet` or `**/*.parquet`) may work for local 
filesystems in some cases, they are not officially supported by DataFusion. 
This is because wildcards are not universally applicable across all storage 
backends (e.g., S3, GCS). Instead, DataFusion expects the user to specify the 
directory path, and it will automatically read all compatible files within that 
directory.
+
+For example, the following usage is not supported:
+
+```sql
+CREATE EXTERNAL TABLE test (
+message TEXT,
+day DATE
+)
+STORED AS PARQUET
+LOCATION 'gs://bucket/*.parquet';
+```
+
+Instead, you should use:
+
+```sql
+CREATE EXTERNAL TABLE test (
+message TEXT,
+day DATE
+)
+STORED AS PARQUET
+LOCATION 'gs://bucket/';

Review Comment:
   ```suggestion
   LOCATION 'gs://bucket/my_table';
   ```






[PR] fix: update group by columns for merge phase after spill [datafusion]

2025-04-01 Thread via GitHub


rluvaton opened a new pull request, #15531:
URL: https://github.com/apache/datafusion/pull/15531

   ## Which issue does this PR close?
   
   - Closes #15530.
   
   ## Rationale for this change
   
   the PR forgot to update the group by expressions:
   - #13995
   
   ## What changes are included in this PR?
   
   Update group by columns to be based on their index for merging phase 
   
   ## Are these changes tested?
   
   Yes
   
   ## Are there any user-facing changes?
   
   No





Re: [PR] ArraySort: support structs [datafusion]

2025-04-01 Thread via GitHub


alamb commented on code in PR #15527:
URL: https://github.com/apache/datafusion/pull/15527#discussion_r2023103674


##
datafusion/functions-nested/src/sort.rs:
##
@@ -207,9 +208,21 @@ pub fn array_sort_inner(args: &[ArrayRef]) -> 
Result {
 valid.append_null();
 } else {
 let arr_ref = list_array.value(i);
-let arr_ref = arr_ref.as_ref();
 
-let sorted_array = compute::sort(arr_ref, sort_option)?;
+let sorted_array = match arr_ref.data_type() {
+DataType::Struct(_) => {
+let sort_columns: Vec = vec![SortColumn {
+values: Arc::clone(&arr_ref),
+options: sort_option,
+}];
+let indices = compute::lexsort_to_indices(&sort_columns, 
None)?;
+compute::take(arr_ref.as_ref(), &indices, None)?
+}
+_ => {
+let arr_ref = arr_ref.as_ref();
+compute::sort(arr_ref, sort_option)?

Review Comment:
   Is the issue that arrow-rs's sort kernel doesn't support sorting structs, 
but `lexsort_to_indices` does?
   
   Maybe we can offer some comments about why and possibly file an upstream 
ticket
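   
   For reference, a standalone sketch of the indices-then-take approach from
   the diff above (assuming arrow's `lexsort_to_indices` and `take` kernels,
   which handle struct arrays that the plain `sort` kernel rejects):
   
   ```rust
   use std::sync::Arc;
   use arrow::array::ArrayRef;
   use arrow::compute::{lexsort_to_indices, take, SortColumn};
   use arrow::error::Result;
   
   /// Sorts any array, including structs, by computing a permutation first.
   fn sort_via_indices(values: ArrayRef) -> Result<ArrayRef> {
       let sort_columns = vec![SortColumn {
           values: Arc::clone(&values),
           options: None, // default: ascending, nulls first
       }];
       // Compute the sorted order as an index array...
       let indices = lexsort_to_indices(&sort_columns, None)?;
       // ...then materialize it with `take`
       take(values.as_ref(), &indices, None)
   }
   ```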






Re: [PR] fix: Queries similar to `count-bug` produce incorrect results [datafusion]

2025-04-01 Thread via GitHub


suibianwanwank commented on PR #15281:
URL: https://github.com/apache/datafusion/pull/15281#issuecomment-2769782451

   > The projection required to be in the group expression. I think the query 
of these 2 are equivalent but the subquery one group by `e2.b` and the join 
query group by `e1.b`.
   > 
   > The only problem left is that whether this rewrite is general enough to 
work well on most of the subquery
   > 
   > ```
   > query IT
   > select e1.b, (select case when max(e2.a) > 10 then 'a' else 'b' end from 
t2 e2 where e2.b = e1.b + 1) from t1 e1;
   > 
   > 0 a
   > 2 a
   > 
   > query TT
   > explain
   > select e1.b, (select case when max(e2.a) > 10 then 'a' else 'b' end from 
t2 e2 where e2.b = e1.b + 1) from t1 e1;
   > 
   > logical_plan
   > 01)Projection: e1.b, CASE WHEN __scalar_sq_1.__always_true IS NULL THEN 
Utf8("b") ELSE __scalar_sq_1.CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") 
ELSE Utf8("b") END END AS CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE 
Utf8("b") END
   > 02)--Left Join: CAST(e1.b AS Int64) + Int64(1) = CAST(__scalar_sq_1.b AS 
Int64)
   > 03)SubqueryAlias: e1
   > 04)--TableScan: t1 projection=[b]
   > 05)SubqueryAlias: __scalar_sq_1
   > 06)--Projection: CASE WHEN max(e2.a) > Int32(10) THEN Utf8("a") ELSE 
Utf8("b") END AS CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") 
END, e2.b, Boolean(true) AS __always_true
   > 07)Aggregate: groupBy=[[e2.b]], aggr=[[max(e2.a)]]
   > 08)--SubqueryAlias: e2
   > 09)TableScan: t2 projection=[a, b]
   > physical_plan
   > 01)ProjectionExec: expr=[b@0 as b, CASE WHEN __always_true@2 IS NULL THEN 
b ELSE CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END@1 END 
as CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END]
   > 02)--CoalesceBatchesExec: target_batch_size=8192
   > 03)HashJoinExec: mode=Partitioned, join_type=Left, on=[(e1.b + 
Int64(1)@1, CAST(__scalar_sq_1.b AS Int64)@3)], projection=[b@0, CASE WHEN 
max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END@2, __always_true@4]
   > 04)--CoalesceBatchesExec: target_batch_size=8192
   > 05)RepartitionExec: partitioning=Hash([e1.b + Int64(1)@1], 4), 
input_partitions=1
   > 06)--ProjectionExec: expr=[b@0 as b, CAST(b@0 AS Int64) + 1 as 
e1.b + Int64(1)]
   > 07)DataSourceExec: partitions=1, partition_sizes=[1]
   > 08)--CoalesceBatchesExec: target_batch_size=8192
   > 09)RepartitionExec: partitioning=Hash([CAST(__scalar_sq_1.b AS 
Int64)@3], 4), input_partitions=4
   > 10)--ProjectionExec: expr=[CASE WHEN max(e2.a)@1 > 10 THEN a ELSE 
b END as CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END, b@0 
as b, true as __always_true, CAST(b@0 AS Int64) as CAST(__scalar_sq_1.b AS 
Int64)]
   > 11)AggregateExec: mode=FinalPartitioned, gby=[b@0 as b], 
aggr=[max(e2.a)]
   > 12)--CoalesceBatchesExec: target_batch_size=8192
   > 13)RepartitionExec: partitioning=Hash([b@0], 4), 
input_partitions=4
   > 14)--RepartitionExec: partitioning=RoundRobinBatch(4), 
input_partitions=1
   > 15)AggregateExec: mode=Partial, gby=[b@1 as b], 
aggr=[max(e2.a)]
   > 16)--DataSourceExec: partitions=1, partition_sizes=[1]
   > 
   > query IT
   > SELECT
   > e1.b,
   > CASE 
   > WHEN MAX(e2.a) > 10 THEN 'a' 
   > ELSE 'b' 
   > END AS result
   > FROM t2 e2
   > LEFT JOIN t1 e1 ON e2.b = e1.b + 1
   > GROUP BY e1.b;
   > 
   > 2 a
   > 0 a
   > 
   > query TT
   > explain
   > SELECT
   > e1.b,
   > CASE 
   > WHEN MAX(e2.a) > 10 THEN 'a' 
   > ELSE 'b' 
   > END AS result
   > FROM t2 e2
   > LEFT JOIN t1 e1 ON e2.b = e1.b + 1
   > GROUP BY e1.b;
   > 
   > logical_plan
   > 01)Projection: e1.b, CASE WHEN max(e2.a) > Int32(10) THEN Utf8("a") ELSE 
Utf8("b") END AS result
   > 02)--Aggregate: groupBy=[[e1.b]], aggr=[[max(e2.a)]]
   > 03)Projection: e2.a, e1.b
   > 04)--Left Join: CAST(e2.b AS Int64) = CAST(e1.b AS Int64) + Int64(1)
   > 05)SubqueryAlias: e2
   > 06)--TableScan: t2 projection=[a, b]
   > 07)SubqueryAlias: e1
   > 08)--TableScan: t1 projection=[b]
   > physical_plan
   > 01)ProjectionExec: expr=[b@0 as b, CASE WHEN max(e2.a)@1 > 10 THEN a ELSE 
b END as result]
   > 02)--AggregateExec: mode=FinalPartitioned, gby=[b@0 as b], aggr=[max(e2.a)]
   > 03)CoalesceBatchesExec: target_batch_size=8192
   > 04)--RepartitionExec: partitioning=Hash([b@0], 4), input_partitions=4
   > 05)AggregateExec: mode=Partial, gby=[b@1 as b], aggr=[max(e2.a)]
   > 06)--RepartitionExec: partitioning=RoundRobinBatch(4), 
input_partitions=1
   > 07)ProjectionExec: expr=[a@1 as a, b@0 as b]
   > 08)--CoalesceBatchesExec: target_batch_size=8192
   > 09)HashJoinExec: mode=Partitioned, join_

[PR] Draft: Make Clickbench Q29 5x faster for datafusion [datafusion]

2025-04-01 Thread via GitHub


zhuqi-lucas opened a new pull request, #15532:
URL: https://github.com/apache/datafusion/pull/15532

   ## Which issue does this PR close?
   
   
   
   - Closes #.
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   





Re: [I] Collecting parquet without any transformations throws an exception [datafusion-comet]

2025-04-01 Thread via GitHub


l0kr commented on issue #1588:
URL: 
https://github.com/apache/datafusion-comet/issues/1588#issuecomment-2769772035

   Ah nice catch @mbutrovich! Yup, looks like a dupe 👀





Re: [PR] Test: configuration fuzzer for (external) sort queries [datafusion]

2025-04-01 Thread via GitHub


2010YOUY01 commented on code in PR #15501:
URL: https://github.com/apache/datafusion/pull/15501#discussion_r2022134277


##
datafusion/core/tests/fuzz_cases/sort_query_fuzz.rs:
##
@@ -0,0 +1,635 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Fuzz Test for various corner cases sorting RecordBatches exceeds available 
memory and should spill
+
+use std::cmp::min;
+use std::sync::Arc;
+
+use arrow::array::RecordBatch;
+use arrow_schema::SchemaRef;
+use datafusion::datasource::MemTable;
+use datafusion::prelude::{SessionConfig, SessionContext};
+use datafusion_common::{instant::Instant, Result};
+use datafusion_execution::memory_pool::{
+human_readable_size, MemoryPool, UnboundedMemoryPool,
+};
+use datafusion_expr::display_schema;
+use datafusion_physical_plan::spill::get_record_batch_memory_size;
+use rand::seq::SliceRandom;
+use std::time::Duration;
+
+use datafusion_execution::{
+disk_manager::DiskManagerConfig, memory_pool::FairSpillPool,
+runtime_env::RuntimeEnvBuilder,
+};
+use rand::Rng;
+use rand::{rngs::StdRng, SeedableRng};
+
+use crate::fuzz_cases::aggregation_fuzzer::check_equality_of_batches;
+
+use super::aggregation_fuzzer::ColumnDescr;
+use super::record_batch_generator::{get_supported_types_columns, 
RecordBatchGenerator};
+
+/// Entry point for executing the sort query fuzzer.
+///
+/// Now memory limiting is disabled by default. See TODOs in `SortQueryFuzzer`.
+#[tokio::test(flavor = "multi_thread")]
+async fn sort_query_fuzzer_runner() {
+let random_seed = std::time::SystemTime::now()
+.duration_since(std::time::UNIX_EPOCH)
+.unwrap()
+.as_secs();
+let test_generator = SortFuzzerTestGenerator::new(
+2000,
+3,
+"sort_fuzz_table".to_string(),
+get_supported_types_columns(random_seed),
+false,
+random_seed,
+);
+let mut fuzzer = SortQueryFuzzer::new(random_seed)
+// Configs for how many random query to test
+.with_max_rounds(Some(5))
+.with_queries_per_round(4)
+.with_config_variations_per_query(25)
+// Will stop early if the time limit is reached
+.with_time_limit(Duration::from_secs(20))
+.with_test_generator(test_generator);
+
+fuzzer.run().await.unwrap();
+}
+
+/// SortQueryFuzzer holds the runner configuration for executing sort query 
fuzz tests. The fuzzing details are managed inside `SortFuzzerTestGenerator`.
+///
+/// It defines:
+/// - `max_rounds`: Maximum number of rounds to run (or None to run until 
`time_limit`).
+/// - `queries_per_round`: Number of different queries to run in each round.
+/// - `config_variations_per_query`: Number of different configurations to 
test per query.
+/// - `time_limit`: Time limit for the entire fuzzer execution.
+///
+/// TODO: The following improvements are blocked on 
https://github.com/apache/datafusion/issues/14748:
+/// 1. Support generating queries with arbitrary number of ORDER BY clauses
+///Currently limited to be smaller than number of projected columns
+/// 2. Enable special type columns like utf8_low to be used in ORDER BY clauses
+/// 3. Enable memory limiting functionality in the fuzzer runner
+pub struct SortQueryFuzzer {
+test_gen: SortFuzzerTestGenerator,
+/// Random number generator for the runner, used to generate seeds for 
inner components.
+/// Seeds for each choice (query, config, etc.) are printed out for 
reproducibility.
+runner_rng: StdRng,
+
+// 
+// Runner configurations
+// 
+/// For each round, a new dataset is generated. If `None`, keep running 
until
+/// the time limit is reached
+max_rounds: Option,
+/// How many different queries to run in each round
+queries_per_round: usize,
+/// For each query, how many different configurations to try and make sure 
their
+/// results are consistent
+config_variations_per_query: usize,
+/// The time limit for the entire sort query fuzzer execution.
+time_limit: Option,
+}
+
+impl SortQueryFuzzer {
+pub fn new(seed: u64

Re: [PR] Migrate `datafusion/sql` tests to insta, part2 [datafusion]

2025-04-01 Thread via GitHub


alamb commented on PR #15499:
URL: https://github.com/apache/datafusion/pull/15499#issuecomment-2769639137

   Thanks again @qstommyshu 





Re: [PR] Add documentation example for `AggregateExprBuilder` [datafusion]

2025-04-01 Thread via GitHub


alamb commented on code in PR #15504:
URL: https://github.com/apache/datafusion/pull/15504#discussion_r2023660049


##
datafusion/physical-expr/src/aggregate.rs:
##
@@ -97,6 +97,165 @@ impl AggregateExprBuilder {
 /// Constructs an `AggregateFunctionExpr` from the builder
 ///
 /// Note that an [`Self::alias`] must be provided before calling this 
method.
+///
+/// # Example: Create an [`AggregateUDF`]
+///
+/// In the following example, [`AggregateFunctionExpr`] will be built 
using [`AggregateExprBuilder`]
+/// which provides a build function. Full example could be accessed from 
the source file.
+///
+/// ```
+/// # use std::any::Any;
+/// # use std::sync::OnceLock;
+/// # use std::sync::Arc;
+/// # use arrow::datatypes::DataType;
+/// # use datafusion_common::{DataFusionError, plan_err, Result, 
ScalarValue};
+/// # use datafusion_expr::{col, ColumnarValue, Signature, Volatility, 
Expr, Documentation};
+/// # use datafusion_expr::{AggregateUDFImpl, AggregateUDF, Accumulator, 
function::{AccumulatorArgs, StateFieldsArgs}};
+/// # use datafusion_expr::window_doc_sections::DOC_SECTION_AGGREGATE;
+/// # use arrow::datatypes::Schema;
+/// # use arrow::datatypes::Field;
+/// # use arrow::array::Array;
+/// #
+/// # #[derive(Debug)]
+/// # struct FirstValueAccumulator {
+/// # value: Option,
+/// # data_type: DataType,
+/// # }
+/// #
+/// # impl Accumulator for FirstValueAccumulator {
+/// # fn update_batch(&mut self, values: &[Arc]) -> 
Result<()> {

Review Comment:
   I think we can significantly shorten the example by just using 
`unimplemented!()` instead of adding an actual implementation for the 
accumulator and its methods.
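   
   For illustration, the stubbed-out version could look roughly like this
   (hypothetical sketch; the method signatures follow the `Accumulator` trait
   in datafusion_expr):
   
   ```rust
   use arrow::array::ArrayRef;
   use datafusion_common::{Result, ScalarValue};
   use datafusion_expr::Accumulator;
   
   /// Stub accumulator: the doc example only needs the type to exist,
   /// so every method can be left unimplemented.
   #[derive(Debug)]
   struct FirstValueAccumulator;
   
   impl Accumulator for FirstValueAccumulator {
       fn update_batch(&mut self, _values: &[ArrayRef]) -> Result<()> { unimplemented!() }
       fn merge_batch(&mut self, _states: &[ArrayRef]) -> Result<()> { unimplemented!() }
       fn evaluate(&mut self) -> Result<ScalarValue> { unimplemented!() }
       fn state(&mut self) -> Result<Vec<ScalarValue>> { unimplemented!() }
       fn size(&self) -> usize { unimplemented!() }
   }
   ```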



##
datafusion/physical-expr/src/aggregate.rs:
##
@@ -97,6 +97,165 @@ impl AggregateExprBuilder {
 /// Constructs an `AggregateFunctionExpr` from the builder
 ///
 /// Note that an [`Self::alias`] must be provided before calling this 
method.
+///
+/// # Example: Create an [`AggregateUDF`]
+///
+/// In the following example, [`AggregateFunctionExpr`] will be built 
using [`AggregateExprBuilder`]
+/// which provides a build function. Full example could be accessed from 
the source file.
+///
+/// ```
+/// # use std::any::Any;
+/// # use std::sync::OnceLock;
+/// # use std::sync::Arc;
+/// # use arrow::datatypes::DataType;
+/// # use datafusion_common::{DataFusionError, plan_err, Result, 
ScalarValue};
+/// # use datafusion_expr::{col, ColumnarValue, Signature, Volatility, 
Expr, Documentation};
+/// # use datafusion_expr::{AggregateUDFImpl, AggregateUDF, Accumulator, 
function::{AccumulatorArgs, StateFieldsArgs}};
+/// # use datafusion_expr::window_doc_sections::DOC_SECTION_AGGREGATE;
+/// # use arrow::datatypes::Schema;
+/// # use arrow::datatypes::Field;
+/// # use arrow::array::Array;
+/// #
+/// # #[derive(Debug)]
+/// # struct FirstValueAccumulator {
+/// # value: Option,
+/// # data_type: DataType,
+/// # }
+/// #
+/// # impl Accumulator for FirstValueAccumulator {
+/// # fn update_batch(&mut self, values: &[Arc]) -> 
Result<()> {
+/// # if self.value.is_none() && !values.is_empty() {
+/// # let first_array = &values[0];
+/// # for i in 0..first_array.len() {
+/// # if !first_array.is_null(i) {
+/// # self.value = 
Some(ScalarValue::try_from_array(first_array, i)?);
+/// # break;
+/// # }
+/// # }
+/// # }
+/// # Ok(())
+/// # }
+/// #
+/// # fn merge_batch(&mut self, states: &[Arc]) -> 
Result<()> {
+/// # if self.value.is_none() && !states.is_empty() {
+/// # let first_array = &states[0];
+/// # for i in 0..first_array.len() {
+/// # if !first_array.is_null(i) {
+/// # self.value = 
Some(ScalarValue::try_from_array(first_array, i)?);
+/// # break;
+/// # }
+/// # }
+/// # }
+/// # Ok(())
+/// # }
+/// #
+/// # fn evaluate(&mut self) -> Result {
+/// # match &self.value {
+/// # Some(value) => Ok(value.clone()),
+/// # None => ScalarValue::try_from(&self.data_type),
+/// # }
+/// # }
+/// #
+/// # fn size(&self) -> usize {
+/// # std::mem::size_of_val(self)
+/// # }
+/// #
+/// # fn state(&mut self) -> Result> {
+/// # match &self.value {
+/// # Some(value) => Ok(vec![value.clone()]),
+/// # None => ScalarValue::try_from(&self.data_type).map(|v| 
vec![v]),

Re: [PR] Add documentation example for `AggregateExprBuilder` [datafusion]

2025-04-01 Thread via GitHub


alamb commented on PR #15504:
URL: https://github.com/apache/datafusion/pull/15504#issuecomment-2770660147

   I also merged up from main to fix the CI





Re: [I] CSV data with double quotes fails [datafusion]

2025-04-01 Thread via GitHub


alamb commented on issue #439:
URL: https://github.com/apache/datafusion/issues/439#issuecomment-2770467590

   I think this is fixed -- datafusion can read data with double quotes now
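   
   For example, a file like this can now be queried directly from
   datafusion-cli (hypothetical file name and contents):
   
   ```sql
   -- quotes.csv:
   --   id,comment
   --   1,"a ""double-quoted"" value"
   SELECT * FROM 'quotes.csv';
   ```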





Re: [PR] feat: fix struct of arrays [datafusion-comet]

2025-04-01 Thread via GitHub


comphead commented on PR #1592:
URL: 
https://github.com/apache/datafusion-comet/pull/1592#issuecomment-2770697619

   @parthchandra @andygrove @kazuyukitanimura 





Re: [I] Physical plan refactor to support optimization rules and more efficient use of threads [datafusion]

2025-04-01 Thread via GitHub


alamb closed issue #92: Physical plan refactor to support optimization rules 
and more efficient use of threads
URL: https://github.com/apache/datafusion/issues/92





Re: [I] Word Count [datafusion]

2025-04-01 Thread via GitHub


alamb commented on issue #197:
URL: https://github.com/apache/datafusion/issues/197#issuecomment-2770485282

   Make lineitem SF 10 data (in 3 seconds!) using tpchgen-cli 
https://github.com/clflushopt/tpchgen-rs
   
   ```shell
   tpchgen-cli -v --tables=lineitem --scale-factor=10 --format=parquet
   ```
   
   Then you can split the comments into words
   ```sql
   > select unnest (string_to_array(l_comment, ' ')) as word from 
'lineitem.parquet' limit 10;
   +---+
   | word  |
   +---+
   | egular|
   | courts|
   | above |
   | the   |
   | ly|
   | final |
   | dependencies: |
   | slyly |
   | bold  |
   |   |
   +---+
   10 row(s) fetched.
   Elapsed 0.024 seconds.
   ```
   
   Do the count
   ```sql
   > select count(*) from (select unnest (string_to_array(l_comment, ' ')) as 
word from 'lineitem.parquet');
   +---+
   | count(*)  |
   +---+
   | 271168008 |
   +---+
   1 row(s) fetched.
   Elapsed 0.731 seconds.
   ```
   
   You can even do the distinct word count:
   ```sql
   > select count(distinct word) from (select unnest 
(string_to_array(l_comment, ' ')) as word from 'lineitem.parquet');
   +--+
   | count(DISTINCT word) |
   +--+
   | 4959 |
   +--+
   1 row(s) fetched.
   Elapsed 1.052 seconds.
   ```





Re: [PR] Fix duplicate unqualified Field name (schema error) on join queries [datafusion]

2025-04-01 Thread via GitHub


alamb commented on PR #15438:
URL: https://github.com/apache/datafusion/pull/15438#issuecomment-2770491240

   Hi @LiaCastaneda -- I believe the CI has failed on this PR due to a change 
in the CI actions. Can you please merge the PR up to main, which I think will 
address the issue?





Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-04-01 Thread via GitHub


alamb commented on PR #15301:
URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2770508515

   FYI, I will likely try to review this PR again carefully first thing 
tomorrow morning.





[PR] chore(ci): build fails with strange error [datafusion-ballista]

2025-04-01 Thread via GitHub


milenkovicm opened a new pull request, #1222:
URL: https://github.com/apache/datafusion-ballista/pull/1222

   # Which issue does this PR close?
   
   Closes #.
   
# Rationale for this change
   
   There are spurious CI errors: 
   
   - https://github.com/apache/datafusion-ballista/actions/runs/14204334938
   -  https://github.com/apache/datafusion-ballista/actions/runs/14196252739
   
   ```
   actions-rs/toolchain@v1 is not allowed to be used in 
apache/datafusion-ballista. A...
   ```
   
   ```
   r-lib/actions/pr-fetch@master and r-lib/actions/pr-push@master are not 
allowed to be used in apache/datafusion-ballista. Actions in this workflow must 
be: within a repository that belongs to your Enterprise account, created by 
GitHub, verified in the GitHub Marketplace, or matching the following: 
1Password/load-secrets-action@*, A ...
   ```
   
   # What changes are included in this PR?
   
   Replaced the deprecated `actions-rs/toolchain@v1` with 
`dtolnay/rust-toolchain@stable`.
   
   # Are there any user-facing changes?
   
   
   
   





Re: [I] Extend TopK early termination to partially sorted inputs [datafusion]

2025-04-01 Thread via GitHub


alamb commented on issue #15529:
URL: https://github.com/apache/datafusion/issues/15529#issuecomment-2770458720

   There may be some overlap with this work from @adriangb (though I realize you 
are talking about a different optimization):
   - https://github.com/apache/datafusion/issues/15037

   
   





Re: [I] Add test cases for NULL in joins as key values [datafusion]

2025-04-01 Thread via GitHub


alamb closed issue #148:  Add test cases for NULL in joins as key values
URL: https://github.com/apache/datafusion/issues/148





Re: [I] CSV data with double quotes fails [datafusion]

2025-04-01 Thread via GitHub


alamb closed issue #439: CSV data with double quotes fails 
URL: https://github.com/apache/datafusion/issues/439





Re: [I] Physical plan refactor to support optimization rules and more efficient use of threads [datafusion]

2025-04-01 Thread via GitHub


alamb commented on issue #92:
URL: https://github.com/apache/datafusion/issues/92#issuecomment-2770464823

   I think this is no longer relevant, so closing. Let's open a new ticket if 
there is anything actionable remaining.





Re: [PR] Minor: clone and debug for FileSinkConfig [datafusion]

2025-04-01 Thread via GitHub


jayzhan211 merged PR #15516:
URL: https://github.com/apache/datafusion/pull/15516





Re: [I] Building project takes a *long* time (esp compilation time for `datafusion` core crate) [datafusion]

2025-04-01 Thread via GitHub


logan-keede commented on issue #13814:
URL: https://github.com/apache/datafusion/issues/13814#issuecomment-2770108919

   I did some profiling.
   ```sh
   RUSTC_BOOTSTRAP=1 cargo rustc -p datafusion-catalog -- -Z self-profile -Z 
self-profile-events=default,args
   ```
   
![Image](https://github.com/user-attachments/assets/86ff5fe6-61ed-4ebc-beeb-6f9170bb1ba9)
   
   
![Image](https://github.com/user-attachments/assets/4a749967-b8f2-4ba2-87e2-05b05a5c8343)
   
   The culprits are some asynchronous functions (Memtable::{scan, insert_into}, 
StreamTable::{scan, insert_into}, StreamTableFactory::create, 
StreamWrite::write_all, ViewTable::scan): each of them takes 1-2 seconds (5-10% 
of total time) in the analysis phase alone.
   
   There are a couple hundred evaluate_obligation calls under each type_check of 
the pattern 
   `{...{some_datafusion_object} as std::marker::Sync/Send> ... }`, which means 
the compiler is trying to check whether `some_datafusion_object` fulfils the 
trait bound of being Send/Sync.
   
   This does not happen with other asynchronous methods, e.g. 
`ListingSchemaProvider::refresh`, so maybe we can try to check the difference 
between them and see what is going wrong with the above-mentioned functions.
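   
   Assuming the measureme tools (`summarize` and `crox` from
   https://github.com/rust-lang/measureme) are installed, the raw profiles can
   be inspected roughly like this:
   
   ```sh
   # Aggregate self-time per compiler query from the generated profile
   summarize summarize datafusion_catalog-<pid>
   # Convert to a Chrome trace (chrome_profiler.json) for chrome://tracing
   crox datafusion_catalog-<pid>
   ```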





Re: [PR] chore: update changelog for 45.0.0 [datafusion-ballista]

2025-04-01 Thread via GitHub


milenkovicm commented on PR #1218:
URL: 
https://github.com/apache/datafusion-ballista/pull/1218#issuecomment-2770545131

   Just a heads up @andygrove: I've added releases of the scheduler, executor, ... 
Docker containers when a new tag is created. I did try it; hopefully it will not 
cause problems when a new tag is created.
   





Re: [PR] chore(ci): build fails with strange error [datafusion-ballista]

2025-04-01 Thread via GitHub


milenkovicm commented on PR #1222:
URL: 
https://github.com/apache/datafusion-ballista/pull/1222#issuecomment-2770514586

   Does not look like it fixes the issue: 
https://github.com/apache/datafusion-ballista/actions/runs/14204595182





[PR] Migrate datafusion/sql tests to insta, part3 [datafusion]

2025-04-01 Thread via GitHub


qstommyshu opened a new pull request, #15533:
URL: https://github.com/apache/datafusion/pull/15533

   ## Which issue does this PR close?
   
   
   
   - Related: #15397, #15497, #15499; this is part of the #15484 breakdown.
   - Check out things to note about the whole migration in the comments section 
of #15484.
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   This is part 3 of the #15484 breakdown, as the code changes in #15484 are too 
large. Part 1: #15497, Part 2: #15499.
   
   ## Are these changes tested?
   
   
   Yes, I manually tested the before/after changes.
   
   ## Are there any user-facing changes?
   
   
   
   
   No
   





Re: [PR] feat: fix struct of arrays [datafusion-comet]

2025-04-01 Thread via GitHub


codecov-commenter commented on PR #1592:
URL: 
https://github.com/apache/datafusion-comet/pull/1592#issuecomment-2770532403

   ## 
[Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1592?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   All modified and coverable lines are covered by tests :white_check_mark:
   > Project coverage is 56.25%. Comparing base 
[(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 to head 
[(`51ab989`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/51ab989cb7f51592de97f7785318be7f812ee5b8?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   > Report is 115 commits behind head on main.
   
   Additional details and impacted files
   
   
   ```diff
   @@              Coverage Diff              @@
   ##               main    #1592       +/-   ##
   =============================================
   + Coverage     56.12%   56.25%    +0.12%     
   + Complexity      976      957       -19     
   =============================================
     Files           119      124        +5     
     Lines         11743    12436      +693     
     Branches       2251     2322       +71     
   =============================================
   + Hits           6591     6996      +405     
   - Misses         4012     4285      +273     
   - Partials       1140     1155       +15     
   ```
   
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1592?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
:rocket: New features to boost your workflow: 
   
   - :snowflake: [Test 
Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, 
report on failures, and find test suite problems.
   





Re: [PR] Migrate datafusion/sql tests to insta, part3 [datafusion]

2025-04-01 Thread via GitHub


qstommyshu commented on PR #15533:
URL: https://github.com/apache/datafusion/pull/15533#issuecomment-2770233544

   Hmm, it seems there is an issue with the CI pipeline.
   
   Screenshot: https://github.com/user-attachments/assets/1079a7c8-3e5c-4bca-9473-3c0e9fe69ec7
   
   > sccache: error: Server startup failed: cache storage failed to read: 
Unexpected (permanent) at read => 
{"$id":"1","innerException":null,"message":"This legacy service is shutting 
down, effective April 15, 2025. Migrate to the new service ASAP. For more 
information: 
https://gh.io/gha-cache-sunset","typeName":"Microsoft.Azure.DevOps.ArtifactCache.WebApi.ArtifactCacheServiceDecommissionedException,
 
Microsoft.Azure.DevOps.ArtifactCache.WebApi","typeKey":"ArtifactCacheServiceDecommissionedException","errorCode":0,"eventId":3000}
   
   Looks like the pipeline is using some legacy service that is shutting down?
   





Re: [PR] Add GreptimeDB to the "Users" in README [datafusion-sqlparser-rs]

2025-04-01 Thread via GitHub


iffyio merged PR #1788:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1788





[PR] added functionality to handle output statement [datafusion-sqlparser-rs]

2025-04-01 Thread via GitHub


dilovancelik opened a new pull request, #1790:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1790

   Hey, I have added a feature to handle OUTPUT clauses at the end of MERGE 
statements, as used in MS SQL, per issue 
[1789](https://github.com/apache/datafusion-sqlparser-rs/issues/1789) that I 
created (see the sketch below the list).
   
   - Added the OUTPUT keyword
   - Added a struct for Output in ast/mod.rs
   - Added an optional Output parameter to the Merge struct in ast/mod.rs
   - Added a function for parsing the output clause in parser/mod.rs
   - Updated the parse merge clause function to break on the OUTPUT keyword
   - Updated parse merge to handle the output clause
   - Added and updated tests
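   
   For context, a minimal sketch of the construct this adds support for, driven 
through the parser's entry point (the T-SQL text is illustrative, and parsing 
it only succeeds once this PR's changes are applied):
   
   ```rust
   use sqlparser::dialect::MsSqlDialect;
   use sqlparser::parser::Parser;
   
   fn main() {
       // A MERGE statement ending in an OUTPUT clause (MS SQL).
       let sql = "MERGE INTO target AS t \
                  USING source AS s ON t.id = s.id \
                  WHEN MATCHED THEN UPDATE SET t.value = s.value \
                  OUTPUT $action, inserted.id;";
       // With this PR, the resulting Merge AST node carries the new
       // optional Output field.
       let statements = Parser::parse_sql(&MsSqlDialect {}, sql).expect("parse");
       println!("{statements:?}");
   }
   ```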





[PR] feat: fix struct of arrays [datafusion-comet]

2025-04-01 Thread via GitHub


comphead opened a new pull request, #1592:
URL: https://github.com/apache/datafusion-comet/pull/1592

   ## Which issue does this PR close?
   
   
   
   Closes #1551.
   
   ## Rationale for this change
   
   Fixing STRUCT of ARRAY
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## How are these changes tested?
   
   
   





Re: [PR] ArraySort: support structs [datafusion]

2025-04-01 Thread via GitHub


cht42 commented on code in PR #15527:
URL: https://github.com/apache/datafusion/pull/15527#discussion_r2023172721


##
datafusion/functions-nested/src/sort.rs:
##
@@ -207,9 +208,21 @@ pub fn array_sort_inner(args: &[ArrayRef]) -> 
Result {
 valid.append_null();
 } else {
 let arr_ref = list_array.value(i);
-let arr_ref = arr_ref.as_ref();
 
-let sorted_array = compute::sort(arr_ref, sort_option)?;
+let sorted_array = match arr_ref.data_type() {
+DataType::Struct(_) => {
+let sort_columns: Vec<SortColumn> = vec![SortColumn {
+values: Arc::clone(&arr_ref),
+options: sort_option,
+}];
+let indices = compute::lexsort_to_indices(&sort_columns, None)?;
+compute::take(arr_ref.as_ref(), &indices, None)?
+}
+_ => {
+let arr_ref = arr_ref.as_ref();
+compute::sort(arr_ref, sort_option)?

Review Comment:
   https://github.com/apache/arrow-rs/issues/6911#issuecomment-2562928843
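   
   For reference, the pattern in the diff works because a sort permutation plus 
`take` supports any array type, including structs that `compute::sort` rejects. 
A self-contained sketch of the same two calls (an Int32 array is used for 
brevity; a struct array goes through the identical path):
   
   ```rust
   use std::sync::Arc;
   use arrow::array::{ArrayRef, Int32Array};
   use arrow::compute::{lexsort_to_indices, take, SortColumn};
   
   fn main() -> Result<(), arrow::error::ArrowError> {
       let values: ArrayRef = Arc::new(Int32Array::from(vec![3, 1, 2]));
       // Build the permutation that sorts the column...
       let columns = vec![SortColumn { values: Arc::clone(&values), options: None }];
       let indices = lexsort_to_indices(&columns, None)?;
       // ...then apply it with take to materialize the sorted array.
       let sorted = take(values.as_ref(), &indices, None)?;
       println!("{sorted:?}"); // [1, 2, 3]
       Ok(())
   }
   ```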






Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-04-01 Thread via GitHub


alamb commented on issue #14608:
URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2770127700

   @scsmithr of GlareDB integrated the tpchgen library in glaredb as a table 
function
   - https://github.com/GlareDB/glaredb/pull/3549
   
   Which is quite cool
   
   ```shell
   glaredb> .maxrows 10
   glaredb> select * from tpch_gen.lineitem(1);
   
   ┌────────────┬───────────┬───────────┬──────────────┬────────────┬─────────────────┬───┬──────────────┬───────────────┬────────────────┬────────────┬────────────┐
   │ l_orderkey │ l_partkey │ l_suppkey │ l_linenumber │ l_quantity │ l_extendedprice │ … │ l_commitdate │ l_receiptdate │ l_shipinstruct │ l_shipmode │ l_comment  │
   │ Int64      │ Int64     │ Int64     │ Int32        │ Int64      │ Decimal64(15,2) │   │ Date32       │ Date32        │ Utf8           │ Utf8       │ Utf8       │
   ├────────────┼───────────┼───────────┼──────────────┼────────────┼─────────────────┼───┼──────────────┼───────────────┼────────────────┼────────────┼────────────┤
   │          1 │    155190 │      7706 │            1 │         17 │        21168.23 │ … │ 1996-11-05   │ 1996-12-14    │ DELIVER IN PE… │ TRUCK      │ egular co… │
   │          1 │     67310 │      7311 │            2 │         36 │        45983.16 │ … │ 1996-11-21   │ 1997-01-12    │ TAKE BACK RET… │ MAIL       │ ly final … │
   │          1 │     63700 │      3701 │            3 │          8 │        13309.60 │ … │ 1996-11-27   │ 1996-10-24    │ TAKE BACK RET… │ REG AIR    │ riously. … │
   │          1 │      2132 │      4633 │            4 │         28 │        28955.64 │ … │ 1996-12-22   │ 1997-02-07    │ NONE           │ AIR        │ lites. fl… │
   │          1 │     24027 │      1534 │            5 │         24 │        22824.48 │ … │ 1996-12-06   │ 1996-12-24    │ NONE           │ FOB        │  pending … │
   │          … │         … │         … │            … │          … │               … │ … │ …            │ …             │ …              │ …          │ …          │
   │        575 │      7272 │      2273 │            1 │         32 │        37736.64 │ … │ 1994-06-24   │ 1994-07-15    │ COLLECT COD    │ REG AIR    │ tructions… │
   │        575 │      6452 │      1453 │            2 │          7 │         9509.15 │ … │ 1994-06-17   │ 1994-08-13    │ DELIVER IN PE… │ SHIP       │ lar pinto… │
   │        575 │     37131 │      2138 │            3 │         18 │        19226.34 │ … │ 1994-05-22   │ 1994-09-01    │ DELIVER IN PE… │ FOB        │ , quick d… │
   │        600 │     32255 │      2256 │            1 │          5 │         5936.25 │ … │ 1997-08-13   │ 1997-08-25    │ TAKE BACK RET… │ MAIL       │ carefully  │
   │        600 │     96127 │      6128 │            2 │         28 │        31447.36 │ … │ 1997-06-25   │ 1997-07-15    │ NONE           │ AIR        │ ooze furi… │
   ├────────────┴───────────┴───────────┴──────────────┴────────────┴─────────────────┴───┴──────────────┴───────────────┴────────────────┴────────────┴────────────┤
   │ 6001215 rows, 10 shown                                                                                                                                        │
   └────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
   ```





Re: [PR] fix: update group by columns for merge phase after spill [datafusion]

2025-04-01 Thread via GitHub


rluvaton commented on PR #15531:
URL: https://github.com/apache/datafusion/pull/15531#issuecomment-2769949735

   The CI failures are infra related...





Re: [PR] Add all missing table options to be handled in any order [datafusion-sqlparser-rs]

2025-04-01 Thread via GitHub


tomershaniii commented on code in PR #1747:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1747#discussion_r2023395604


##
src/parser/mod.rs:
##
@@ -7081,18 +7029,243 @@ impl<'a> Parser<'a> {
 
 if let Token::Word(word) = self.peek_token().token {
 if word.keyword == Keyword::OPTIONS {
-options = Some(self.parse_options(Keyword::OPTIONS)?);
+table_options =
+
CreateTableOptions::Options(self.parse_options(Keyword::OPTIONS)?)
 }
 };
 }
 
+if !dialect_of!(self is HiveDialect) && table_options == 
CreateTableOptions::None {
+let plain_options = self.parse_plain_options()?;
+if !plain_options.is_empty() {
+table_options = CreateTableOptions::Plain(plain_options)
+}
+};
+
 Ok(CreateTableConfiguration {
 partition_by,
 cluster_by,
-options,
+table_options,
 })
 }
 
+fn parse_plain_option(&mut self) -> Result<Option<SqlOption>, ParserError> {
+// Single parameter option
+if self.parse_keywords(&[Keyword::START, Keyword::TRANSACTION]) {
+return Ok(Some(SqlOption::Ident(Ident::new("START TRANSACTION"))));
+}
+
+// Custom option
+if self.parse_keywords(&[Keyword::COMMENT]) {
+let has_eq = self.consume_token(&Token::Eq);
+let value = self.next_token();
+
+let comment = match (has_eq, value.token) {
+(true, Token::SingleQuotedString(s)) => {
+Ok(Some(SqlOption::Comment(CommentDef::WithEq(s))))
+}
+(false, Token::SingleQuotedString(s)) => {
+Ok(Some(SqlOption::Comment(CommentDef::WithoutEq(s))))
+}
+(_, token) => {
+self.expected("Token::SingleQuotedString", 
TokenWithSpan::wrap(token))
+}
+};
+return comment;
+}
+
+if self.parse_keywords(&[Keyword::ENGINE]) {
+let _ = self.consume_token(&Token::Eq);
+let value = self.next_token();
+
+let engine = match value.token {
+Token::Word(w) => {
+let parameters = if self.peek_token() == Token::LParen {
+Some(self.parse_parenthesized_identifiers()?)
+} else {
+None
+};
+
+Ok(Some(SqlOption::TableEngine(TableEngine {
+name: w.value,
+parameters,
+})))
+}
+_ => {
+return self.expected("Token::Word", value)?;
+}
+};
+
+return engine;
+}
+
+if self.parse_keywords(&[Keyword::TABLESPACE]) {
+let _ = self.consume_token(&Token::Eq);
+let value = self.next_token();
+
+let tablespace = match value.token {
+// TABLESPACE tablespace_name [STORAGE DISK] | [TABLESPACE 
tablespace_name] STORAGE MEMORY
+Token::Word(Word { value: name, .. }) | 
Token::SingleQuotedString(name) => {
+let storage = match self.parse_keyword(Keyword::STORAGE) {
+true => {
+let _ = self.consume_token(&Token::Eq);
+let storage_token = self.next_token();
+match &storage_token.token {
+Token::Word(w) => match 
w.value.to_uppercase().as_str() {
+"DISK" => Some(StorageType::Disk),
+"MEMORY" => Some(StorageType::Memory),
+_ => self
+.expected("Storage type (DISK or 
MEMORY)", storage_token)?,
+},
+_ => self.expected("Token::Word", 
storage_token)?,
+}
+}
+false => None,
+};
+
+Ok(Some(SqlOption::TableSpace(TablespaceOption {
+name,
+storage,
+})))
+}
+_ => {
+return self.expected("Token::Word", value)?;
+}
+};
+
+return tablespace;
+}
+
+if self.parse_keyword(Keyword::UNION) {
+let _ = self.consume_token(&Token::Eq);
+let value = self.next_token();
+
+match value.token {
+// UNION [=] (tbl_name[,tbl_name]...)
+Token::LParen => {
+let tables: Vec<Ident> =
+self.parse_comma_separated0(Parser::parse

Re: [I] Building project takes a *long* time (esp compilation time for `datafusion` core crate) [datafusion]

2025-04-01 Thread via GitHub


alamb commented on issue #13814:
URL: https://github.com/apache/datafusion/issues/13814#issuecomment-2770840025

   It seems like 1.86 won't be released for 3 more days: 
https://releases.rs/docs/1.86.0/
   
   It would be cool to update and try it





Re: [PR] Improve spill performance: Disable re-validation of spilled files [datafusion]

2025-04-01 Thread via GitHub


alamb commented on PR #15454:
URL: https://github.com/apache/datafusion/pull/15454#issuecomment-2770842980

   Merged up to get latest changes and rerun CI





Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]

2025-04-01 Thread via GitHub


mkgada commented on issue #1576:
URL: 
https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2770843418

   Thank you so much for looking into this though, @andygrove 
   
   I will take this up with GCP folks





Re: [I] datafusion-cli: document reading partitioned parquet [datafusion]

2025-04-01 Thread via GitHub


alamb closed issue #15309: datafusion-cli: document reading partitioned parquet
URL: https://github.com/apache/datafusion/issues/15309





Re: [I] feat: fix schema issues for `native reader - read STRUCT of ARRAY fields` [datafusion-comet]

2025-04-01 Thread via GitHub


comphead closed issue #1551: feat: fix schema issues for `native reader - read 
STRUCT of ARRAY fields`
URL: https://github.com/apache/datafusion-comet/issues/1551





Re: [PR] feat: Fix struct of arrays schema issue [datafusion-comet]

2025-04-01 Thread via GitHub


comphead commented on PR #1592:
URL: 
https://github.com/apache/datafusion-comet/pull/1592#issuecomment-2770808027

   Thanks @andygrove for the review





Re: [I] Make Clickbench Q29 5x faster for datafusion [datafusion]

2025-04-01 Thread via GitHub


alamb commented on issue #15524:
URL: https://github.com/apache/datafusion/issues/15524#issuecomment-2770823621

   I found a DuckDB implementation of a seemingly similar optimization: 
https://github.com/duckdb/duckdb/blob/7912713493b38b1eda162f29b7759d5024989a5f/src/optimizer/sum_rewriter.cpp#L25
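   
   For intuition, my reading of the linked `sum_rewriter.cpp` (stated as an 
assumption, not a confirmed description of either engine) is that it replaces 
a sum over a constant-shifted column with arithmetic over two cheap aggregates:
   
   ```latex
   \sum_{i=1}^{N} (x_i + c) = \Big(\sum_{i=1}^{N} x_i\Big) + c \cdot N
   ```
   
   where N is the number of rows contributing to the aggregate (under SQL 
semantics, the non-null count of x).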





Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]

2025-04-01 Thread via GitHub


mkgada commented on issue #1576:
URL: 
https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2770838744

   I am using a Spark image supplied by GCP Dataproc, thank you for checking 
this





Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]

2025-04-01 Thread via GitHub


Rachelint commented on issue #15383:
URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2770996279

   > I'm considering another approach. Maybe I shouldn't use 
filter_record_batch 🤔. It filters all the columns iteratively. I should 
filter the rows when the accumulator calls merge_batch 🤔
   
   I think we also need to filter rows in `GroupValues::intern`, too.
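   
   For reference, a minimal standalone sketch of the call being questioned 
(plain arrow API usage, not DataFusion's aggregation code): 
`filter_record_batch` materializes a filtered copy of every column, which is 
the per-column cost discussed above:
   
   ```rust
   use std::sync::Arc;
   use arrow::array::{ArrayRef, BooleanArray, Int32Array};
   use arrow::compute::filter_record_batch;
   use arrow::record_batch::RecordBatch;
   
   fn main() -> Result<(), arrow::error::ArrowError> {
       let col: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
       let batch = RecordBatch::try_from_iter(vec![("a", col)])?;
       let mask = BooleanArray::from(vec![true, false, true]);
       // Each column of the batch is filtered in turn to build the new batch.
       let filtered = filter_record_batch(&batch, &mask)?;
       assert_eq!(filtered.num_rows(), 2);
       Ok(())
   }
   ```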
   
   
   
   
   
   
   





Re: [PR] feat: respect `batchSize/workerThreads/blockingThreads` configurations for native_iceberg_compat scan [datafusion-comet]

2025-04-01 Thread via GitHub


wForget commented on code in PR #1587:
URL: https://github.com/apache/datafusion-comet/pull/1587#discussion_r2023897072


##
native/core/src/parquet/mod.rs:
##
@@ -650,21 +651,29 @@ pub unsafe extern "system" fn 
Java_org_apache_comet_parquet_Native_initRecordBat
 required_schema: jbyteArray,
 data_schema: jbyteArray,
 session_timezone: jstring,
+batch_size: jint,
+worker_threads: jint,
+blocking_threads: jint,
 ) -> jlong {
 try_unwrap_or_throw(&e, |mut env| unsafe {
-let task_ctx = TaskContext::default();
+let session_config = SessionConfig::new().with_batch_size(batch_size 
as usize);
+let planer =

Review Comment:
   This planner is used for native filter conversion






Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]

2025-04-01 Thread via GitHub


andygrove commented on issue #1576:
URL: 
https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2770840853

   I am assuming that the GCP version of Spark has some differences in these 
internal APIs





Re: [I] Make Clickbench Q29 5x faster for datafusion [datafusion]

2025-04-01 Thread via GitHub


zhuqi-lucas commented on issue #15524:
URL: https://github.com/apache/datafusion/issues/15524#issuecomment-2770953658

   Thank you @Dandandan @alamb for double check and confirm! 





[PR] fix: fix spark/sql test failures in native_iceberg_compat [datafusion-comet]

2025-04-01 Thread via GitHub


parthchandra opened a new pull request, #1593:
URL: https://github.com/apache/datafusion-comet/pull/1593

   ## Which issue does this PR close?
   
   Part of #1542 
   
   ## Rationale for this change
   
   A bug in the logic of `NativeBatchReader` caused NPE and array index out of 
bounds errors in `native_iceberg_compat` mode. Summary is that the old version 
used `requestedSchema.getColumns` to get the columns to read. However, this 
returns only the leaf (primitive) columns and does not contain any group 
fields.  So if the query was trying to read a group field (i.e. reading an 
entire struct instead of just one of the fields of the struct), we would use 
incorrect column metadata and sometimes even an incorrect number of fields.
   
   ## What changes are included in this PR?
   The PR changes the logic to use `requestedSchema.getFields` which returns 
both group and primitive type fields.
   The PR also adds additional handling in the `schema_adapter` to allow fields 
in the `to_type` schema that may not exist in the `from_type` schema 
   
   
   ## How are these changes tested?
   
   New unit tests (based on the tests that were failing in Spark).
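   
   As a rough illustration of the schema_adapter rule described above (a 
hypothetical helper, not Comet's actual adapter code): a field that exists 
only in the `to_type` schema is acceptable as long as it can be back-filled 
with nulls.
   
   ```rust
   use arrow::datatypes::{Field, Schema};
   
   // Hypothetical sketch: a `to` field either maps to a `from` field by name,
   // or must be nullable so the missing column can be filled with nulls.
   fn field_is_adaptable(from: &Schema, to_field: &Field) -> bool {
       from.field_with_name(to_field.name()).is_ok() || to_field.is_nullable()
   }
   ```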





Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]

2025-04-01 Thread via GitHub


mkgada commented on issue #1576:
URL: 
https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2770817710

   Update: spun up another cluster on Spark 3.5.3 and used the same prebuilt 
Comet JAR 0.7.0
   
   Was able to get past the initial error documented here but now running into 
   
   
   `Caused by: java.lang.NoSuchMethodError: 'void org.apache.spark.shuffle.IndexShuffleBlockResolver.writeMetadataFileAndCommit(int, long, long[], long[], java.io.File)'`
   
   





Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]

2025-04-01 Thread via GitHub


andygrove commented on issue #1576:
URL: 
https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2770828974

   > Update: spun up another cluster on Spark 3.5.3 and used the same prebuilt 
Comet JAR 0.7.0
   > 
   > Was able to get past the initial error documented here but now running into
   > 
   > `Caused by: java.lang.NoSuchMethodError: 'void org.apache.spark.shuffle.IndexShuffleBlockResolver.writeMetadataFileAndCommit(int, long, long[], long[], java.io.File)'`
   
   I checked out Spark v3.5.3 and the method is defined as:
   
   ```
 def writeMetadataFileAndCommit(
 shuffleId: Int,
 mapId: Long,
 lengths: Array[Long],
 checksums: Array[Long],
 dataTmp: File): Unit = {
   ```
   
   It also looks like this method signature has not changed recently.
   
   Can you confirm that you are using the open-source Apache Spark v3.5.3 and 
not a GCP Dataproc version of Spark?





Re: [PR] feat: Fix struct of arrays schema issue [datafusion-comet]

2025-04-01 Thread via GitHub


comphead merged PR #1592:
URL: https://github.com/apache/datafusion-comet/pull/1592





Re: [PR] feat: respect `batchSize/workerThreads/blockingThreads` configurations for native_iceberg_compat scan [datafusion-comet]

2025-04-01 Thread via GitHub


parthchandra commented on code in PR #1587:
URL: https://github.com/apache/datafusion-comet/pull/1587#discussion_r2023778763


##
native/core/src/parquet/mod.rs:
##
@@ -650,21 +651,29 @@ pub unsafe extern "system" fn 
Java_org_apache_comet_parquet_Native_initRecordBat
 required_schema: jbyteArray,
 data_schema: jbyteArray,
 session_timezone: jstring,
+batch_size: jint,
+worker_threads: jint,
+blocking_threads: jint,
 ) -> jlong {
 try_unwrap_or_throw(&e, |mut env| unsafe {
-let task_ctx = TaskContext::default();
+let session_config = SessionConfig::new().with_batch_size(batch_size 
as usize);
+let planer =

Review Comment:
   Do we need to create a new planner here? We really only need a 
session_config (even though it is really not valid for the entire session). 
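   
   For what it's worth, a minimal sketch of the session_config-only shape 
(plain DataFusion API usage, not Comet's actual JNI code; the function name is 
illustrative):
   
   ```rust
   use std::sync::Arc;
   use datafusion::execution::TaskContext;
   use datafusion::prelude::{SessionConfig, SessionContext};
   
   // Thread a caller-supplied batch size into the session config, then derive
   // the task context from that session.
   fn task_ctx_with_batch_size(batch_size: usize) -> Arc<TaskContext> {
       let config = SessionConfig::new().with_batch_size(batch_size);
       SessionContext::new_with_config(config).task_ctx()
   }
   ```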
   






Re: [PR] datafusion-cli: document reading partitioned parquet [datafusion]

2025-04-01 Thread via GitHub


alamb merged PR #15505:
URL: https://github.com/apache/datafusion/pull/15505





Re: [PR] Disable sccache action to fix gh cache issue [datafusion]

2025-04-01 Thread via GitHub


alamb merged PR #15536:
URL: https://github.com/apache/datafusion/pull/15536





Re: [PR] feat: Add Aggregate UDF to FFI crate [datafusion]

2025-04-01 Thread via GitHub


CrystalZhou0529 commented on code in PR #14775:
URL: https://github.com/apache/datafusion/pull/14775#discussion_r2023878333


##
datafusion/ffi/tests/ffi_integration.rs:
##
@@ -179,4 +181,103 @@ mod tests {
 
 Ok(())
 }
+
+#[tokio::test]
+async fn test_ffi_udaf() -> Result<()> {
+let module = get_module()?;
+
+let ffi_avg_func =

Review Comment:
   Fixed!






Re: [I] feat: Support read array type using native reader [datafusion-comet]

2025-04-01 Thread via GitHub


comphead closed issue #1454: feat: Support read array type using native reader
URL: https://github.com/apache/datafusion-comet/issues/1454





[PR] feat: adding more struct/arrays tests [datafusion-comet]

2025-04-01 Thread via GitHub


comphead opened a new pull request, #1594:
URL: https://github.com/apache/datafusion-comet/pull/1594

   ## Which issue does this PR close?
   
   
   
   Related #1550 .
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## How are these changes tested?
   
   
   





Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-04-01 Thread via GitHub


acking-you commented on code in PR #15462:
URL: https://github.com/apache/datafusion/pull/15462#discussion_r2023116930


##
datafusion/physical-expr/src/expressions/binary.rs:
##
@@ -805,6 +811,47 @@ impl BinaryExpr {
 }
 }
 
+/// Check if it meets the short-circuit condition
+/// 1. For the `AND` operator, if the `lhs` result all are `false`
+/// 2. For the `OR` operator, if the `lhs` result all are `true`
+/// 3. Otherwise, it does not meet the short-circuit condition
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
+let data_type = arg.data_type();
+match (data_type, op) {
+(DataType::Boolean, Operator::And) => {
+match arg {
+ColumnarValue::Array(array) => {
+if let Ok(array) = as_boolean_array(&array) {
+return array.false_count() == array.len();

Review Comment:
   > Might be overkill, but one _could_ try a sampling approach: Run the loop 
with the early exit for the first few chunks, and then switch over to the 
unconditional loop.
   
   Thank you for your suggestion, but if we're only applying conditional checks 
to the first few blocks, then I feel this optimization might not be meaningful. 
If nearly all blocks can be filtered out by the preceding filter, the 
optimization will no longer be effective.
   
   >If we find that this slows down some other performance we could also add 
some sort of heuristic check to calling false_count / true_count -- like for 
example if the rhs arg is "complex" (not a Column for example)
   
   I tend to agree with @alamb's point that if the overhead of verification is 
somewhat unacceptable, adopting some heuristic approaches would be better.
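   
   For concreteness, a minimal standalone sketch of the check under discussion 
(plain arrow API usage, not the PR's exact code): `false_count`/`true_count` 
scan the whole boolean buffer, and that fixed cost is what gets weighed 
against the chance to skip evaluating the right-hand side:
   
   ```rust
   use arrow::array::BooleanArray;
   
   // AND can short-circuit when the left side is all false; OR when all true.
   // Both checks pay a full scan of the boolean values up front.
   fn and_short_circuits(lhs: &BooleanArray) -> bool {
       lhs.false_count() == lhs.len()
   }
   
   fn or_short_circuits(lhs: &BooleanArray) -> bool {
       lhs.true_count() == lhs.len()
   }
   ```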
   
   






Re: [I] Spark SQL test failures in native_iceberg_compat mode [datafusion-comet]

2025-04-01 Thread via GitHub


mbutrovich commented on issue #1542:
URL: 
https://github.com/apache/datafusion-comet/issues/1542#issuecomment-2769901472

   ```
   catalyst: Passed: Total 6925, Failed 0, Errors 0, Passed 6925, Ignored 5, 
Canceled 1
   core 1: Failed: Total 8686, Failed 47, Errors 0, Passed 8639, Ignored 277, 
Canceled 3
   core 2: Failed: Total 2045, Failed 106, Errors 0, Passed 1939, Ignored 360
   core 3: Failed: Total 1394, Failed 24, Errors 0, Passed 1370, Ignored 119, 
Canceled 15
   hive 1: Failed: Total 2144, Failed 9, Errors 0, Passed 2135, Ignored 38, 
Canceled 4
   hive 2: Error: Total 19, Failed 0, Errors 1, Passed 18, Ignored 1, Canceled 4
   hive 3: Passed: Total 1044, Failed 0, Errors 0, Passed 1044, Ignored 13, 
Canceled 4
   ```
   Counts from https://github.com/apache/datafusion-comet/pull/1541 today. Will 
track here as we keep updating.




