Re: [I] Run all benchmarks on merge to main branch [datafusion]
Shreyaskr1409 commented on issue #15511: URL: https://github.com/apache/datafusion/issues/15511#issuecomment-2769119055

> I don't think we need to actually "benchmark" the code for each merge

How about we set a tag/label for performance-related PRs and run benchmark tests for those specific PRs only? I could look into it.
[I] AQE Unable to Rewrite Joins as Broadcast Hash Joins Due to Existing CometBroadcastHashJoin Operator [datafusion-comet]
Kontinuation opened a new issue, #1589: URL: https://github.com/apache/datafusion-comet/issues/1589

### Describe the bug

AQE can dynamically transform a SortMergeJoin or ShuffledHashJoin into a BroadcastHashJoin after discovering that one of the Exchange operators shuffle-writes only a small amount of data. However, this optimization does not always happen when using Comet.

TPC-H Q7 has an equi-join between `supplier` and `lineitem`. Spark can discover that `supplier` is small enough to be broadcast after running the `Exchange` operator, and dynamically change the sort-merge join to a broadcast hash join (see `BroadcastHashJoin Inner BuildLeft (15)`):

```
== Physical Plan ==
AdaptiveSparkPlan (99)
+- == Final Plan ==
* Sort (62)
+- AQEShuffleRead (61)
+- ShuffleQueryStage (60), Statistics(sizeInBytes=288.0 B, rowCount=4)
+- Exchange (59)
+- * HashAggregate (58)
+- AQEShuffleRead (57)
+- ShuffleQueryStage (56), Statistics(sizeInBytes=2.8 KiB, rowCount=36)
+- Exchange (55)
+- * HashAggregate (54)
+- * Project (53)
+- * BroadcastHashJoin Inner BuildRight (52)
:- * Project (49)
: +- * BroadcastHashJoin Inner BuildRight (48)
: :- * Project (42)
: : +- * SortMergeJoin Inner (41)
: : :- * Sort (33)
: : : +- AQEShuffleRead (32)
: : : +- ShuffleQueryStage (31), Statistics(sizeInBytes=667.5 MiB, rowCount=1.46E+7)
: : :+- Exchange (30)
: : : +- * Project (29)
: : : +- * SortMergeJoin Inner (28)
: : : :- * Sort (20)
: : : : +- AQEShuffleRead (19)
: : : : +- ShuffleQueryStage (18), Statistics(sizeInBytes=667.5 MiB, rowCount=1.46E+7)
: : : :+- Exchange (17)
: : : : +- * Project (16)
: : : : +- * BroadcastHashJoin Inner BuildLeft (15) <-- Transformed from SortMergeJoin by AQE
: : : : :- BroadcastQueryStage (8), Statistics(sizeInBytes=8.0 MiB, rowCount=8.00E+4)
: : : : : +- BroadcastExchange (7)
: : : : : +- AQEShuffleRead (6)
: : : : :+- ShuffleQueryStage (5), Statistics(sizeInBytes=1874.1 KiB, rowCount=8.00E+4)
: : : : : +- Exchange (4)
: : : : : +- * Filter (3)
: : : : : +- * ColumnarToRow (2)
: : : : :+- Scan parquet (1)
: : : : +- AQEShuffleRead (14)
: : : : +- ShuffleQueryStage (13), Statistics(sizeInBytes=8.1 GiB, rowCount=1.82E+8)
: : : : +- Exchange (12)
: : : : +- * Filter (11)
: : : : +- * ColumnarToRow (10)
: : : : +- Scan parquet (9)
: : : +- * Sort (27)
: : :+- AQEShuffleRead (26)
: : : +- ShuffleQueryStage (25), Statistics(sizeInBytes=3.4 GiB, rowCount=1.50E+8)
: : : +- Exchange (24)
```
Re: [PR] fix: Queries similar to `count-bug` produce incorrect results [datafusion]
suibianwanwank commented on PR #15281: URL: https://github.com/apache/datafusion/pull/15281#issuecomment-2769563811

@jayzhan211 From this field it seems there's no issue, but how do we get the `e.b` field if we perform the aggregation on top of the Join? This query requires all rows from the left table to be included in the results. This might also involve the equivalence of Aggregate operations under outer joins.

```sql
select e.b,
       (select case when max(e2.a) > 10 then 'a' else 'b' end
        from t2 e2
        where e2.b = e.b + 1)
from t1 e;
```
[I] Internal error: PhysicalExpr Column references bound error, Failure in spilling for `AggregateMode::Single` [datafusion]
rluvaton opened a new issue, #15530: URL: https://github.com/apache/datafusion/issues/15530

### Describe the bug

When using an aggregate exec in `Single` mode with spilling, and the group-by expressions are not the first expressions of the input plan, there will be a schema mismatch.

### To Reproduce

```rust
#[cfg(test)]
mod tests {
    use std::fmt::{Display, Formatter};
    use arrow::array::{ArrayRef, Int32Array, Int64Array, RecordBatch, StringArray};
    use arrow::datatypes::{DataType, Field, Schema, SchemaRef};
    use datafusion::common::Result;
    use datafusion::execution::memory_pool::FairSpillPool;
    use datafusion::execution::runtime_env::RuntimeEnvBuilder;
    use datafusion::execution::TaskContext;
    use datafusion::functions_aggregate::sum::sum_udaf;
    use datafusion::physical_expr::aggregate::AggregateExprBuilder;
    use datafusion::physical_expr::expressions::{lit, Column};
    use datafusion::physical_plan::aggregates::{PhysicalGroupBy, AggregateExec, AggregateMode};
    use datafusion::physical_plan::common::collect;
    use datafusion::physical_plan::ExecutionPlan;
    use rand::{random, thread_rng, Rng};
    use std::sync::Arc;
    use datafusion_physical_plan::memory::{LazyBatchGenerator, LazyMemoryExec};
    use parking_lot::RwLock;

    #[tokio::test]
    async fn test_debug() -> Result<()> {
        let scan_schema = Arc::new(Schema::new(vec![
            Field::new("col_0", DataType::Int64, true),
            Field::new("col_1", DataType::Utf8, true),
            Field::new("col_2", DataType::Utf8, true),
            Field::new("col_3", DataType::Utf8, true),
            Field::new("col_4", DataType::Utf8, true),
            Field::new("col_5", DataType::Int32, true),
            Field::new("col_6", DataType::Utf8, true),
            Field::new("col_7", DataType::Utf8, true),
            Field::new("col_8", DataType::Utf8, true),
        ]));

        let group_by = PhysicalGroupBy::new_single(vec![
            (Arc::new(Column::new("col_1", 1)), "col_1".to_string()),
            (Arc::new(Column::new("col_7", 7)), "col_7".to_string()),
            (Arc::new(Column::new("col_0", 0)), "col_0".to_string()),
            (Arc::new(Column::new("col_8", 8)), "col_8".to_string()),
        ]);

        fn generate_int64_array() -> ArrayRef {
            Arc::new(Int64Array::from_iter_values(
                (0..8192).map(|_| random::<i64>()),
            ))
        }

        fn generate_int32_array() -> ArrayRef {
            Arc::new(Int32Array::from_iter_values(
                (0..8192).map(|_| random::<i32>()),
            ))
        }

        fn generate_string_array() -> ArrayRef {
            Arc::new(StringArray::from(
                (0..8192)
                    .map(|_| -> String {
                        thread_rng()
                            .sample_iter::<char, _>(rand::distributions::Standard)
                            .take(10)
                            .collect()
                    })
                    .collect::<Vec<String>>(),
            ))
        }

        fn generate_record_batch(schema: &SchemaRef) -> Result<RecordBatch> {
            RecordBatch::try_new(
                Arc::clone(&schema),
                vec![
                    generate_int64_array(),
                    generate_string_array(),
                    generate_string_array(),
                    generate_string_array(),
                    generate_string_array(),
                    generate_int32_array(),
                    generate_string_array(),
                    generate_string_array(),
                    generate_string_array(),
                ],
            )
            .map_err(|err| err.into())
        }

        let aggregate_expressions = vec![Arc::new(
            AggregateExprBuilder::new(sum_udaf(), vec![lit(1i64)])
                .schema(Arc::clone(&scan_schema))
                .alias("SUM(1i64)")
                .build()?,
        )];

        #[derive(Debug)]
        struct Generator {
            index: usize,
            count: usize,
            schema: SchemaRef,
        }

        impl Display for Generator {
            fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
                write!(f, "Generator")
            }
        }

        impl LazyBatchGenerator for Generator {
            fn generate_next_batch(&mut self) -> Result<Option<RecordBatch>> {
                if self.index > self.count {
                    return Ok(None);
                }

                let batch = generate_record_batch(&self.schema)?;
                self.index += 1;
                Ok(Some(batch))
            }
        }

        let gene
```
Re: [PR] docs: change OSX/OS X to macOS [datafusion-comet]
andygrove merged PR #1584: URL: https://github.com/apache/datafusion-comet/pull/1584
Re: [I] Add documentation example for `AggregateExprBuilder` [datafusion]
Shreyaskr1409 commented on issue #15369: URL: https://github.com/apache/datafusion/issues/15369#issuecomment-2769196668

@alamb should I add this example to datafusion-examples as well, as per https://github.com/apache/datafusion/pull/15504#issuecomment-2767270135?
Re: [PR] chore: Reimplement ShuffleWriterExec using interleave_record_batch [datafusion-comet]
Kontinuation commented on code in PR #1511: URL: https://github.com/apache/datafusion-comet/pull/1511#discussion_r2022987423

## native/core/src/execution/shuffle/shuffle_writer.rs: ##

```diff
@@ -852,16 +1079,64 @@ impl PartitionBuffer {
                 file: spill_data,
             });
         }
-        self.spill_file
-            .as_mut()
-            .unwrap()
-            .file
-            .write_all(&output_batches)?;
+        Ok(())
+    }
+}
+
+/// Write batches to writer while using a buffer to avoid frequent system calls.
+/// The record batches were first written by ShuffleBlockWriter into an internal buffer.
+/// Once the buffer exceeds the max size, the buffer will be flushed to the writer.
+struct BufBatchWriter<S: Borrow<ShuffleBlockWriter>, W: Write> {
+    shuffle_block_writer: S,
+    writer: W,
+    buffer: Vec<u8>,
+    buffer_max_size: usize,
+}
+
+impl<S: Borrow<ShuffleBlockWriter>, W: Write> BufBatchWriter<S, W> {
+    fn new(shuffle_block_writer: S, writer: W) -> Self {
+        // 1MB should be good enough to avoid frequent system calls,
+        // and also won't cause too much memory usage
+        let buffer_max_size = 1024 * 1024;
+        Self {
+            shuffle_block_writer,
+            writer,
+            buffer: vec![],
+            buffer_max_size,
+        }
+    }
+
+    fn write(
+        &mut self,
+        batch: &RecordBatch,
+        encode_time: &Time,
+        write_time: &Time,
+    ) -> Result<usize> {
+        let mut cursor = Cursor::new(&mut self.buffer);
+        cursor.seek(SeekFrom::End(0))?;
+        let mut write_timer = write_time.timer();
+        let bytes_written = self
+            .shuffle_block_writer
+            .borrow()
+            .write_batch(batch, &mut cursor, encode_time)?;
+        let pos = cursor.position();
+        if pos >= self.buffer_max_size as u64 {
+            self.writer.write_all(&self.buffer)?;
+            self.buffer.clear();
+        }
         write_timer.stop();
```

Review Comment: Just found that I had removed the timing for `std::io::copy` and `BufWriter::flush`; I have added them back in the latest commit.
Re: [I] Extend TopK early termination to partially sorted inputs [datafusion]
geoffreyclaude commented on issue #15529: URL: https://github.com/apache/datafusion/issues/15529#issuecomment-2769593513

I ran some quick [experiments on my fork](https://github.com/geoffreyclaude/datafusion/pull/3) by checking for early termination after each batch processed in the "topK" on the example TPCH query above:

- Elapsed dropped from `16s` to `800ms`: 20x speedup
- The Parquet DataSource `output_rows` metric dropped from `17135217` to `81920` (81920 because it read 1 batch of 8192 rows in parallel on 10 partitions): 200x reduction
- The Parquet DataSource `bytes_scanned` metric dropped from `130MB` to `23MB`: 5x reduction (which doesn't align at all with the `output_rows` reduction for some reason...)
Re: [I] March 17, 2025: This week(s) in DataFusion [datafusion]
alamb commented on issue #15269: URL: https://github.com/apache/datafusion/issues/15269#issuecomment-2769431998

And we now have explain plans on by default - https://github.com/apache/datafusion/pull/15427
Re: [PR] Migrate `datafusion/sql` tests to insta, part2 [datafusion]
alamb merged PR #15499: URL: https://github.com/apache/datafusion/pull/15499
Re: [I] Improve html table rendering formatting [datafusion-python]
AgalyaS1757 commented on issue #1078: URL: https://github.com/apache/datafusion-python/issues/1078#issuecomment-2769659726

Proposed solutions:

1. Modify the data size limit
   - Identify the part of the code where the 2MB limit is enforced.
   - Introduce a user-configurable parameter (e.g., `max_data_size`), allowing users to set their preferred limit.
   - Ensure that the size-approximation logic remains efficient and does not slow down performance.
2. Add an option to disable styling
   - Locate the existing HTML rendering logic where styles are applied.
   - Introduce a flag (e.g., `disable_styling`) that allows users to enable or disable CSS styling in the output.
   - Implement a conditional check to apply styling only if the flag is not set.
3. Enable custom formatting
   - Modify the `ArrayFormatter` or introduce a parameter (e.g., `custom_formatter`) that allows users to pass their own formatting function.
   - Ensure that the custom function is validated and safely applied without breaking existing functionality.
4. Refactor HTML generation code (nice-to-have for maintainability)
   - Break down the existing monolithic HTML rendering function into smaller, reusable helper functions.
   - Consider moving the HTML-related functions into a separate module/file if the changes become too extensive.
5. Testing & documentation
   - Write test cases to ensure the new functionality works as expected.
   - Update the documentation with clear examples of how to set a custom data size limit, disable styling, and use a custom formatter.

(A sketch of items 1 and 2 follows below.)
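To make the proposal concrete, here is a rough sketch of what items 1 and 2 could look like. It is illustrative only (written in Rust, since datafusion-python's rendering core lives there); `HtmlFormatterConfig`, `render_cell`, and the field names are hypothetical, not an existing API.

```rust
/// Hypothetical configuration for the HTML renderer (names taken from the
/// proposal above; not an existing datafusion-python API).
#[derive(Debug, Clone)]
pub struct HtmlFormatterConfig {
    /// Item 1: user-configurable cap instead of a hard-coded 2MB limit.
    pub max_data_size: usize,
    /// Item 2: opt out of inline CSS in the generated table.
    pub disable_styling: bool,
}

impl Default for HtmlFormatterConfig {
    fn default() -> Self {
        Self {
            // Keep today's 2MB behavior as the default.
            max_data_size: 2 * 1024 * 1024,
            disable_styling: false,
        }
    }
}

/// Render one table cell, honoring the styling flag.
fn render_cell(cfg: &HtmlFormatterConfig, value: &str) -> String {
    if cfg.disable_styling {
        format!("<td>{value}</td>")
    } else {
        format!("<td style=\"text-align: right;\">{value}</td>")
    }
}
```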
Re: [I] Use global tokio runtime per executor process [datafusion-comet]
andygrove commented on issue #1590: URL: https://github.com/apache/datafusion-comet/issues/1590#issuecomment-2769667177

See https://github.com/apache/datafusion-comet/pull/1104 for a previous attempt at implementing this.
Re: [PR] Fix sequential metadata fetching in ListingTable causing high latency [datafusion]
alamb commented on PR #14918: URL: https://github.com/apache/datafusion/pull/14918#issuecomment-2769716046

This was referenced by @sergiimk in https://discord.com/channels/885562378132000778/1290751484807352412/1356393367566553240 (they hit the same problem and were pleased to find it fixed!)
Re: [PR] fix: Queries similar to `count-bug` produce incorrect results [datafusion]
jayzhan211 commented on PR #15281: URL: https://github.com/apache/datafusion/pull/15281#issuecomment-2769722147

The projection is required to be in the group expression. I think these 2 queries are equivalent, but the subquery one groups by `e2.b` and the join query groups by `e1.b`. Not sure if this rewrite could be general enough.

```
query IT
select e1.b, (select case when max(e2.a) > 10 then 'a' else 'b' end from t2 e2 where e2.b = e1.b + 1) from t1 e1;
0 a
2 a

query TT
explain select e1.b, (select case when max(e2.a) > 10 then 'a' else 'b' end from t2 e2 where e2.b = e1.b + 1) from t1 e1;
logical_plan
01)Projection: e1.b, CASE WHEN __scalar_sq_1.__always_true IS NULL THEN Utf8("b") ELSE __scalar_sq_1.CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END END AS CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END
02)--Left Join: CAST(e1.b AS Int64) + Int64(1) = CAST(__scalar_sq_1.b AS Int64)
03)SubqueryAlias: e1
04)--TableScan: t1 projection=[b]
05)SubqueryAlias: __scalar_sq_1
06)--Projection: CASE WHEN max(e2.a) > Int32(10) THEN Utf8("a") ELSE Utf8("b") END AS CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END, e2.b, Boolean(true) AS __always_true
07)Aggregate: groupBy=[[e2.b]], aggr=[[max(e2.a)]]
08)--SubqueryAlias: e2
09)TableScan: t2 projection=[a, b]
physical_plan
01)ProjectionExec: expr=[b@0 as b, CASE WHEN __always_true@2 IS NULL THEN b ELSE CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END@1 END as CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END]
02)--CoalesceBatchesExec: target_batch_size=8192
03)HashJoinExec: mode=Partitioned, join_type=Left, on=[(e1.b + Int64(1)@1, CAST(__scalar_sq_1.b AS Int64)@3)], projection=[b@0, CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END@2, __always_true@4]
04)--CoalesceBatchesExec: target_batch_size=8192
05)RepartitionExec: partitioning=Hash([e1.b + Int64(1)@1], 4), input_partitions=1
06)--ProjectionExec: expr=[b@0 as b, CAST(b@0 AS Int64) + 1 as e1.b + Int64(1)]
07)DataSourceExec: partitions=1, partition_sizes=[1]
08)--CoalesceBatchesExec: target_batch_size=8192
09)RepartitionExec: partitioning=Hash([CAST(__scalar_sq_1.b AS Int64)@3], 4), input_partitions=4
10)--ProjectionExec: expr=[CASE WHEN max(e2.a)@1 > 10 THEN a ELSE b END as CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END, b@0 as b, true as __always_true, CAST(b@0 AS Int64) as CAST(__scalar_sq_1.b AS Int64)]
11)AggregateExec: mode=FinalPartitioned, gby=[b@0 as b], aggr=[max(e2.a)]
12)--CoalesceBatchesExec: target_batch_size=8192
13)RepartitionExec: partitioning=Hash([b@0], 4), input_partitions=4
14)--RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
15)AggregateExec: mode=Partial, gby=[b@1 as b], aggr=[max(e2.a)]
16)--DataSourceExec: partitions=1, partition_sizes=[1]

query IT
SELECT e1.b, CASE WHEN MAX(e2.a) > 10 THEN 'a' ELSE 'b' END AS result FROM t2 e2 LEFT JOIN t1 e1 ON e2.b = e1.b + 1 GROUP BY e1.b;
2 a
0 a

query TT
explain SELECT e1.b, CASE WHEN MAX(e2.a) > 10 THEN 'a' ELSE 'b' END AS result FROM t2 e2 LEFT JOIN t1 e1 ON e2.b = e1.b + 1 GROUP BY e1.b;
logical_plan
01)Projection: e1.b, CASE WHEN max(e2.a) > Int32(10) THEN Utf8("a") ELSE Utf8("b") END AS result
02)--Aggregate: groupBy=[[e1.b]], aggr=[[max(e2.a)]]
03)Projection: e2.a, e1.b
04)--Left Join: CAST(e2.b AS Int64) = CAST(e1.b AS Int64) + Int64(1)
05)SubqueryAlias: e2
06)--TableScan: t2 projection=[a, b]
07)SubqueryAlias: e1
08)--TableScan: t1 projection=[b]
physical_plan
01)ProjectionExec: expr=[b@0 as b, CASE WHEN max(e2.a)@1 > 10 THEN a ELSE b END as result]
02)--AggregateExec: mode=FinalPartitioned, gby=[b@0 as b], aggr=[max(e2.a)]
03)CoalesceBatchesExec: target_batch_size=8192
04)--RepartitionExec: partitioning=Hash([b@0], 4), input_partitions=4
05)AggregateExec: mode=Partial, gby=[b@1 as b], aggr=[max(e2.a)]
06)--RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
07)ProjectionExec: expr=[a@1 as a, b@0 as b]
08)--CoalesceBatchesExec: target_batch_size=8192
09)HashJoinExec: mode=Partitioned, join_type=Right, on=[(e1.b + Int64(1)@1, CAST(e2.b AS Int64)@2)], projection=[b@0, a@2]
10)--ProjectionExec: expr=[b@0 as b, CAST(b@0 AS Int64) + 1 as e1.b + Int64(1)]
11)DataSourceExec: partition
```
Re: [PR] Fix duplicate unqualified Field name (schema error) on join queries [datafusion]
LiaCastaneda commented on code in PR #15438: URL: https://github.com/apache/datafusion/pull/15438#discussion_r2022571200

## datafusion/expr/src/logical_plan/builder.rs: ##

```diff
@@ -1470,17 +1470,27 @@ impl ValuesFields {
 pub fn change_redundant_column(fields: &Fields) -> Vec<Field> {
```

Review Comment: yep that's a good idea
Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]
ctsk commented on code in PR #15462: URL: https://github.com/apache/datafusion/pull/15462#discussion_r2022894121

## datafusion/physical-expr/src/expressions/binary.rs: ##

```diff
@@ -805,6 +811,47 @@ impl BinaryExpr {
     }
 }
 
+/// Check if it meets the short-circuit condition
+/// 1. For the `AND` operator, if the `lhs` result all are `false`
+/// 2. For the `OR` operator, if the `lhs` result all are `true`
+/// 3. Otherwise, it does not meet the short-circuit condition
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
+    let data_type = arg.data_type();
+    match (data_type, op) {
+        (DataType::Boolean, Operator::And) => {
+            match arg {
+                ColumnarValue::Array(array) => {
+                    if let Ok(array) = as_boolean_array(&array) {
+                        return array.false_count() == array.len();
```

Review Comment: Might be overkill, but one *could* try a sampling approach: run the loop with the early exit for the first few chunks, and then switch over to the unconditional loop. Almost seems like something the compiler could automagically do...
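For illustration, here is a rough sketch of that sampling idea (a hypothetical helper, not code from this PR): probe a small prefix of the array first, and only pay for the full-array popcount when the prefix is entirely `false`.

```rust
use arrow::array::{Array, BooleanArray};

/// Hypothetical sampling variant of the all-false check for `AND`:
/// bail out early if the first chunk already contains a `true` value.
fn lhs_all_false_sampled(array: &BooleanArray) -> bool {
    const SAMPLE: usize = 1024;
    // A NULL on the left side prevents short-circuiting to `false`.
    if array.null_count() > 0 {
        return false;
    }
    let prefix = SAMPLE.min(array.len());
    // Early exit: if any sampled value is `true`, skip the full popcount.
    if array.slice(0, prefix).false_count() != prefix {
        return false;
    }
    // The prefix was all `false`; confirm over the whole array in one pass.
    array.false_count() == array.len()
}
```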
Re: [PR] use state machine to refactor the `get_files_with_limit` method [datafusion]
xudong963 commented on PR #15521: URL: https://github.com/apache/datafusion/pull/15521#issuecomment-2769429971

Thanks @jayzhan211
Re: [I] Follow up #15432 [datafusion]
xudong963 closed issue #15519: Follow up #15432 URL: https://github.com/apache/datafusion/issues/15519
Re: [I] Weekly Plan (Andrew Lamb) March 24, 2025 [datafusion]
alamb closed issue #15393: Weekly Plan (Andrew Lamb) March 24, 2025 URL: https://github.com/apache/datafusion/issues/15393
[I] Weekly Plan (Andrew Lamb) March 31, 2025 [datafusion]
alamb opened a new issue, #15528: URL: https://github.com/apache/datafusion/issues/15528

This is an attempt to organize myself and make what I plan to work on more visible

## Weekly High Level Goals

- [ ] Work on integrating tpch data generator with @clflushopt : https://github.com/apache/datafusion/issues/14608
- [ ] Get https://github.com/apache/datafusion/issues/3463 ready for merge with @XiangpengHao
- [ ] TopK Pushdown #15037 with @adriangb
- [ ] https://github.com/apache/arrow-rs/issues/7084

# Other projects I plan to review

- [ ] Bug fixes
- [ ] Other performance improvements
- [ ] Complete insta test migration https://github.com/apache/datafusion/issues/15178 with @blaginin @shruti2522 @qstommyshu and others
- [ ] Hardening external sorts: https://github.com/apache/datafusion/issues/14692 with @2010YOUY01
- [ ] Set up Spark function library pattern: https://github.com/apache/datafusion/pull/15168 with @shehabgamin and @andygrove
- [ ] Use UTF8 view by default https://github.com/apache/datafusion/issues/15096 with @zhuqi-lucas

## Background

I am putting this list on github because:

1. I like how github renders checklists w/ PR titles so it is easy to track (I currently have a local text file...)
2. I thought others might be interested from seeing what I am doing / planning to do
3. It makes me feel better that I don't have time to review all the PRs

The way I am trying to prioritize PRs is in the following order:

1. Bug fixes
2. Documentation / UX / API improvements (things that make DataFusion easier/better to work with)
3. Performance improvements
4. New features with wide appeal
5. New functions

Note new features and functions are deliberately at the bottom.
Re: [I] Consolidate statistics aggregation [datafusion]
alamb commented on issue #8229: URL: https://github.com/apache/datafusion/issues/8229#issuecomment-2769446857

@xudong963 do you think we have completed this issue now?
[I] Extend TopK early termination to partially sorted inputs [datafusion]
geoffreyclaude opened a new issue, #15529: URL: https://github.com/apache/datafusion/issues/15529

### Is your feature request related to a problem or challenge?

DataFusion currently has a "TopK early termination" optimization, which speeds up queries that involve `ORDER BY` and `LIMIT` if the input data is already sorted by the full ordering requested in the query. However, many real-world scenarios involve datasets that are only partially sorted.

For example, consider a time-series dataset that's pre-sorted by `day` but not sorted within each day. Queries requesting data sorted by `day, timestamp` should still benefit significantly from the optimization, because once DataFusion has collected the required number of rows from the most recent day(s), it could safely ignore data from earlier days. Today, DataFusion does not take advantage of such partial ordering, resulting in unnecessary scans and sorts.

Example query affected by this:

```sql
SELECT day, sensor_id, reading, timestamp
FROM sensor_readings
WHERE sensor_id = 1002
ORDER BY day DESC, timestamp DESC
LIMIT 10;
```

If the data source providing `sensor_readings` can guarantee a `day DESC` ordering, this query should finish quickly after scanning enough rows from the most recent days, but currently DataFusion will keep scanning the full `sensor_readings` unnecessarily.

### Describe the solution you'd like

I propose extending DataFusion's existing "TopK early termination" optimization to handle cases where the input data is partially sorted by a prefix of the requested ordering. Specifically, DataFusion should detect:

- When the input ordering has a non-empty common prefix with the query's requested ordering.
- When the top-K buffer is full.
- If all still-pending rows are guaranteed to be strictly worse than the top-K's max value, comparing only on the common prefix.

Under these conditions, DataFusion can safely terminate scanning early, significantly improving query performance and reducing resource consumption. (A sketch of this condition follows below.)

### Describe alternatives you've considered

_No response_

### Additional context

I wasn't able to find benchmarks on already-sorted data. However, a simple reproducer from the TPCH dataset could be:

```sql
CREATE EXTERNAL TABLE lineitem_ship (
    l_shipdate DATE,
    l_commitdate DATE,
    l_shipmode VARCHAR,
    l_quantity INT
) STORED AS PARQUET
LOCATION 'scratch/topk'
WITH ORDER (l_shipdate);

INSERT INTO lineitem_ship
SELECT l_shipdate, l_commitdate, l_shipmode, l_quantity
FROM lineitem
ORDER BY l_shipdate;

SELECT l_shipdate, l_commitdate, l_quantity
FROM lineitem_ship
WHERE l_shipmode IN ('MAIL', 'AIR')
ORDER BY l_shipdate, l_commitdate, l_quantity
LIMIT 10;
```

This query today scans the full `lineitem_ship` table. I'd expect it to be orders of magnitude faster with the sort-prefix enhancement.
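To make the proposed detection concrete, here is a rough sketch of the termination test (a hypothetical helper, not DataFusion's API). It assumes the common prefix is compared with ascending semantics, so "strictly worse" means "strictly greater"; descending columns would flip the comparison.

```rust
use datafusion_common::ScalarValue;

/// Hypothetical check run after the top-K heap fills up: if the current
/// row's common-prefix key is already strictly worse than the worst row
/// kept in the heap, no later row can qualify either (the input is sorted
/// on that prefix), so the scan can stop.
fn can_terminate_early(
    heap_is_full: bool,
    row_prefix: &[ScalarValue],      // current row, prefix columns only
    heap_max_prefix: &[ScalarValue], // prefix key of the heap's worst row
) -> bool {
    // Slices compare lexicographically; `ScalarValue` is `PartialOrd`, and an
    // incomparable pair yields `false`, which conservatively keeps scanning.
    heap_is_full && row_prefix > heap_max_prefix
}
```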
Re: [PR] fix: Queries similar to `count-bug` produce incorrect results [datafusion]
jayzhan211 commented on PR #15281: URL: https://github.com/apache/datafusion/pull/15281#issuecomment-2769195062

```
[2025-04-01T12:19:41Z DEBUG datafusion_optimizer::utils] scalar_subquery_to_join:
Projection: e.b, __scalar_sq_1.CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END AS CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END
  Left Join: Filter: CAST(__scalar_sq_1.b AS Int64) = CAST(e.b AS Int64) + Int64(1)
    SubqueryAlias: e
      TableScan: t
    SubqueryAlias: __scalar_sq_1
      Projection: CASE WHEN max(e2.a) > Int32(10) THEN Utf8("a") ELSE Utf8("b") END AS CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END, e2.b
        Aggregate: groupBy=[[e2.b]], aggr=[[max(e2.a)]]
          SubqueryAlias: e2
            TableScan: t2
```

In the `scalar_subquery_to_join` optimization, we have a left join on `__scalar_sq_1` and `e`. I wonder whether we could join the plans first and then apply the aggregation on top of the joined plan. The plan I expect is something like this:

```
[2025-04-01T12:19:41Z DEBUG datafusion_optimizer::utils] scalar_subquery_to_join:
Projection: CASE WHEN max(e2.a) > Int32(10) THEN Utf8("a") ELSE Utf8("b") END AS CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END, e2.b
  Aggregate: groupBy=[[e2.b]], aggr=[[max(e2.a)]]
    Left Join: Filter: CAST(e2.b AS Int64) = CAST(e.b AS Int64) + Int64(1)
      SubqueryAlias: e
        TableScan: t
      SubqueryAlias: e2
        TableScan: t2
```
Re: [I] AQE Unable to Rewrite Joins as Broadcast Hash Joins Due to Existing CometBroadcastHashJoin Operator [datafusion-comet]
mbutrovich commented on issue #1589: URL: https://github.com/apache/datafusion-comet/issues/1589#issuecomment-2769269205

Good catch, @Kontinuation! https://github.com/apache/datafusion-comet/pull/1578 has me looking at AQE wondering if there are other places where Comet isn't working with AQE where we should be. This is motivating to keep investigating.
Re: [PR] GroupsAccumulator for Duration (#15322) [datafusion]
emilk closed pull request #15522: GroupsAccumulator for Duration (#15322) URL: https://github.com/apache/datafusion/pull/15522
Re: [I] [EPIC] ClickBench Improvements (Vanity Benchmark) [datafusion]
zhuqi-lucas commented on issue #14586: URL: https://github.com/apache/datafusion/issues/14586#issuecomment-2768790003

Make Clickbench Q29 5X faster: https://github.com/apache/datafusion/issues/15524
Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]
xudong963 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2022286476

## datafusion/datasource/src/file_scan_config.rs: ##

```diff
@@ -,4 +2315,163 @@ mod tests {
         assert_eq!(new_config.constraints, Constraints::default());
         assert!(new_config.new_lines_in_values);
     }
+
+    #[test]
+    fn test_split_groups_by_statistics_with_target_partitions() -> Result<()> {
```

Review Comment: Add tests for new method
Re: [PR] Add dynamic pruning filters from TopK state [datafusion]
ctsk commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2022551332

## datafusion/physical-plan/src/sorts/sort_filters.rs: ##

```diff
@@ -0,0 +1,236 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::sync::{Arc, RwLock};
+
+use arrow_schema::SortOptions;
+use datafusion_common::{Result, ScalarValue};
+use datafusion_expr::Operator;
+use datafusion_physical_expr::{
+    expressions::{is_not_null, is_null, lit, BinaryExpr},
+    LexOrdering, PhysicalExpr,
+};
+
+use crate::dynamic_filters::{DynamicFilterPhysicalExpr, DynamicFilterSource};
+
+/// Holds threshold value and sort order information for a column
+#[derive(Debug, Clone)]
+struct ColumnThreshold {
+    /// The current threshold value
+    pub value: Arc<RwLock<Option<ScalarValue>>>,
+    /// The column expression
+    pub expr: Arc<dyn PhysicalExpr>,
+    /// Sort options
+    pub sort_options: SortOptions,
+}
+
+/// Pushdown of dynamic fitlers from sort + limit operators (aka `TopK`) is used to speed up queries
+/// such as `SELECT * FROM table ORDER BY col DESC LIMIT 10` by pushing down the
+/// threshold values for the sort columns to the data source.
+/// That is, the TopK operator will keep track of the top 10 values for the sort
+/// and before a new file is opened it's statitics will be checked against the
+/// threshold values to determine if the file can be skipped and predicate pushdown
+/// will use these to skip rows during the scan.
+///
+/// For example, imagine this data gets created if multiple sources with clock skews,
+/// network delays, etc. are writing data and you don't do anything fancy to guarantee
+/// perfect sorting by `timestamp` (i.e. you naively write out the data to Parquet, maybe do some compaction, etc.).
+/// The point is that 99% of yesterday's files have a `timestamp` smaller than 99% of today's files
+/// but there may be a couple seconds of overlap between files.
+/// To be concrete, let's say this is our data:
+//
+// | file | min | max |
+// |------|-----|-----|
+// | 1    | 1   | 10  |
+// | 2    | 9   | 19  |
+// | 3    | 20  | 31  |
+// | 4    | 30  | 35  |
+//
+// Ideally a [`TableProvider`] is able to use file level stats or other methods to roughly order the files
+// within each partition / file group such that we start with the newest / largest `timestamp`s.
+// If this is not possible the optimization still works but is less efficient and harder to visualize,
+// so for this example let's assume that we process 1 file at a time and we started with file 4.
+// After processing file 4 let's say we have 10 values in our TopK heap, the smallest of which is 30.
+// The TopK operator will then push down the filter `timestamp < 30` down the tree of [`ExecutionPlan`]s
+// and if the data source supports dynamic filter pushdown it will accept a reference to this [`DynamicPhysicalExprSource`]
+// and when it goes to open file 3 it will ask the [`DynamicPhysicalExprSource`] for the current filters.
+// Since file 3 may contain values larger than 30 we cannot skip it entirely,
+// but scanning it may still be more efficient due to page pruning and other optimizations.
+// Once we get to file 2 however we can skip it entirely because we know that all values in file 2 are smaller than 30.
+// The same goes for file 1.
+// So this optimization just saved us 50% of the work of scanning the data.
+#[derive(Debug, Clone)]
+pub struct SortDynamicFilterSource {
+    thresholds: Vec<ColumnThreshold>,
+}
+
+impl SortDynamicFilterSource {
+    pub fn new(ordering: &LexOrdering) -> Self {
+        let thresholds = ordering
+            .iter()
+            .map(|sort_expr| ColumnThreshold {
+                value: Arc::new(RwLock::new(None)),
+                expr: Arc::clone(&sort_expr.expr),
+                sort_options: sort_expr.options,
+            })
+            .collect();
+
+        Self { thresholds }
+    }
+
+    pub fn update_values(&self, new_values: &[ScalarValue]) {
+        if new_values.len() != self.thresholds.len() {
+            panic!("New values length does not match the number of thresholds");
+        }
+        for (i, new_value) in new_values.iter().enumerate() {
+            let threshold = &self.thresholds[i];
```
Re: [I] TPCH unit tests failure [datafusion-ballista]
milenkovicm commented on issue #1194: URL: https://github.com/apache/datafusion-ballista/issues/1194#issuecomment-2769278754

@vmingchen is this issue closed with #1195?
Re: [I] Run all benchmarks on merge to main branch [datafusion]
Omega359 commented on issue #15511: URL: https://github.com/apache/datafusion/issues/15511#issuecomment-2769284420

> > There has been a number of issues where benchmarks stopped working and no one noticed until someone happened to try and run them
>
> Instead of running the benchmark, how about adding those benchmark query to tests, I don't think we need to actually "benchmark" the code for each merge

We could do that too, however not all benchmarks are SQL queries.
Re: [PR] Add dynamic pruning filters from TopK state [datafusion]
ctsk commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2022274853

## datafusion/physical-plan/src/topk/mod.rs: ##

```diff
@@ -186,6 +235,90 @@ impl TopK {
         Ok(())
     }
 
+    fn calculate_dynamic_filters(
+        thresholds: Vec<ColumnThreshold>,
+    ) -> Result<Option<Arc<dyn PhysicalExpr>>> {
+        // Create filter expressions for each threshold
+        let mut filters: Vec<Arc<dyn PhysicalExpr>> =
```

Review Comment: Been thinking about this too :D. One could add another term to the generated expression:

```
passive OR <actual filter>
```

`passive` itself is a dynamic literal that starts out as `true` and gets set to `false` by the TopK heap once the heap is full.

A downside of the DynamicLiteral approach is that it requires one mutex per field (vs 1 mutex for the whole expression previously). For native types, this *feels* like something that could be backed by an atomic, but I don't see an easy way of achieving that.
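As a thought experiment, the "passive" flag for the native-boolean case could be as small as this (a sketch of the idea only, not code from the PR; `PassiveFlag` is a made-up name):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Made-up helper: a shared flag the TopK heap flips once it is full, so the
/// generated predicate `passive OR <thresholds>` stops accepting every row.
#[derive(Debug)]
pub struct PassiveFlag(AtomicBool);

impl PassiveFlag {
    pub fn new() -> Arc<Self> {
        // Starts out passive: the filter lets everything through.
        Arc::new(Self(AtomicBool::new(true)))
    }

    /// Called by the heap when it reaches capacity. `Relaxed` suffices here:
    /// the flag only transitions true -> false, and a stale `true` merely
    /// keeps the filter conservative (correct, just less selective).
    pub fn deactivate(&self) {
        self.0.store(false, Ordering::Relaxed);
    }

    pub fn is_passive(&self) -> bool {
        self.0.load(Ordering::Relaxed)
    }
}
```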
Re: [PR] Add dynamic pruning filters from TopK state [datafusion]
ctsk commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2022327718

## datafusion/physical-plan/src/sorts/sort_filters.rs: ##
[I] Make Clickbench Q29 5x faster for datafusion [datafusion]
zhuqi-lucas opened a new issue, #15524: URL: https://github.com/apache/datafusion/issues/15524

### Is your feature request related to a problem or challenge?

(screenshot: https://github.com/user-attachments/assets/c82b798f-7c14-42e9-b6d9-b67a6b038c9d)

Our DataFusion is 5x slower than DuckDB for Q29; it's easy for us to optimize it to be 5x faster. Here is the attempt:

Before rewrite:

```rust
cargo run --profile release-nonlto --target aarch64-apple-darwin --bin dfbench -- clickbench -p benchmarks/data/hits_partitioned -q 29
Finished `release-nonlto` profile [optimized] target(s) in 0.26s
Running `target/aarch64-apple-darwin/release-nonlto/dfbench clickbench -p benchmarks/data/hits_partitioned -q 29`
Running benchmarks with the following options: RunOpt { query: Some(29), common: CommonOpt { iterations: 3, partitions: None, batch_size: 8192, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "benchmarks/data/hits_partitioned", queries_path: "benchmarks/queries/clickbench/queries.sql", output_path: None }
Q29: SELECT SUM("ResolutionWidth"), SUM("ResolutionWidth" + 1), SUM("ResolutionWidth" + 2), SUM("ResolutionWidth" + 3), SUM("ResolutionWidth" + 4), SUM("ResolutionWidth" + 5), SUM("ResolutionWidth" + 6), SUM("ResolutionWidth" + 7), SUM("ResolutionWidth" + 8), SUM("ResolutionWidth" + 9), SUM("ResolutionWidth" + 10), SUM("ResolutionWidth" + 11), SUM("ResolutionWidth" + 12), SUM("ResolutionWidth" + 13), SUM("ResolutionWidth" + 14), SUM("ResolutionWidth" + 15), SUM("ResolutionWidth" + 16), SUM("ResolutionWidth" + 17), SUM("ResolutionWidth" + 18), SUM("ResolutionWidth" + 19), SUM("ResolutionWidth" + 20), SUM("ResolutionWidth" + 21), SUM("ResolutionWidth" + 22), SUM("ResolutionWidth" + 23), SUM("ResolutionWidth" + 24), SUM("ResolutionWidth" + 25), SUM("ResolutionWidth" + 26), SUM("ResolutionWidth" + 27), SUM("ResolutionWidth" + 28), SUM("ResolutionWidth" + 29), SUM("ResolutionWidth" + 30), SUM("ResolutionWidth" + 31), SUM("ResolutionWidth" + 32), SUM("ResolutionWidth" + 33), SUM("ResolutionWidth" + 34), SUM("ResolutionWidth" + 35), SUM("ResolutionWidth" + 36), SUM("ResolutionWidth" + 37), SUM("ResolutionWidth" + 38), SUM("ResolutionWidth" + 39), SUM("ResolutionWidth" + 40), SUM("ResolutionWidth" + 41), SUM("ResolutionWidth" + 42), SUM("ResolutionWidth" + 43), SUM("ResolutionWidth" + 44), SUM("ResolutionWidth" + 45), SUM("ResolutionWidth" + 46), SUM("ResolutionWidth" + 47), SUM("ResolutionWidth" + 48), SUM("ResolutionWidth" + 49), SUM("ResolutionWidth" + 50), SUM("ResolutionWidth" + 51), SUM("ResolutionWidth" + 52), SUM("ResolutionWidth" + 53), SUM("ResolutionWidth" + 54), SUM("ResolutionWidth" + 55), SUM("ResolutionWidth" + 56), SUM("ResolutionWidth" + 57), SUM("ResolutionWidth" + 58), SUM("ResolutionWidth" + 59), SUM("ResolutionWidth" + 60), SUM("ResolutionWidth" + 61), SUM("ResolutionWidth" + 62), SUM("ResolutionWidth" + 63), SUM("ResolutionWidth" + 64), SUM("ResolutionWidth" + 65), SUM("ResolutionWidth" + 66), SUM("ResolutionWidth" + 67), SUM("ResolutionWidth" + 68), SUM("ResolutionWidth" + 69), SUM("ResolutionWidth" + 70), SUM("ResolutionWidth" + 71), SUM("ResolutionWidth" + 72), SUM("ResolutionWidth" + 73), SUM("ResolutionWidth" + 74), SUM("ResolutionWidth" + 75), SUM("ResolutionWidth" + 76), SUM("ResolutionWidth" + 77), SUM("ResolutionWidth" + 78), SUM("ResolutionWidth" + 79), SUM("ResolutionWidth" + 80), SUM("ResolutionWidth" + 81), SUM("ResolutionWidth" + 82), SUM("ResolutionWidth" + 83), SUM("ResolutionWidth" + 84), SUM("ResolutionWidth" + 85), SUM("ResolutionWidth" + 86), SUM("ResolutionWidth" + 87), SUM("ResolutionWidth" + 88), SUM("ResolutionWidth" + 89) FROM hits;
Query 29 iteration 0 took 341.5 ms and returned 1 rows
Query 29 iteration 1 took 320.7 ms and returned 1 rows
Query 29 iteration 2 took 303.0 ms and returned 1 rows
Query 29 avg time: 321.73 ms
```

After rewrite:

```rust
cargo run --profile release-nonlto --target aarch64-apple-darwin --bin dfbench -- clickbench -p benchmarks/data/hits_partitioned -q 29
Finished `release-nonlto` profile [optimized] target(s) in 0.26s
Running `target/aarch64-apple-darwin/release-nonlto/dfbench clickbench -p benchmarks/data/hits_partitioned -q 29`
Running benchmarks with the following options: RunOpt { query: Some(29), common: CommonOpt { iterations: 3, partitions: None, batch_size: 8192, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "benchmarks/data/hits_partitioned", queries_path: "benchmarks/queries/clickbench/queries.sql", output_path: None }
Q29: SELECT SUM("ResolutionWidth"), SUM("ResolutionWidth") + 1 * COUNT(*), SUM("ResolutionWidth") + 2 * COUNT(*), SUM("ResolutionWidth") + 3 * COUNT(*), SUM("ResolutionWidth") + 4 * COUNT(*), SUM("ResolutionWi
```
Re: [PR] chore: Reimplement ShuffleWriterExec using interleave_record_batch [datafusion-comet]
Kontinuation commented on code in PR #1511: URL: https://github.com/apache/datafusion-comet/pull/1511#discussion_r2022553633 ## native/core/src/execution/shuffle/shuffle_writer.rs: ## @@ -667,175 +740,322 @@ impl Debug for ShuffleRepartitioner { } } -/// The status of appending rows to a partition buffer. -#[derive(Debug)] -enum AppendRowStatus { -/// Rows were appended -Appended, -/// Not all rows were appended due to lack of available memory -StartIndex(usize), -} - -struct PartitionBuffer { -/// The schema of batches to be partitioned. -schema: SchemaRef, -/// The "frozen" Arrow IPC bytes of active data. They are frozen when `flush` is called. -frozen: Vec, -/// Array builders for appending rows into buffering batches. -active: Vec>, -/// The estimation of memory size of active builders in bytes when they are filled. -active_slots_mem_size: usize, -/// Number of rows in active builders. -num_active_rows: usize, -/// The maximum number of rows in a batch. Once `num_active_rows` reaches `batch_size`, -/// the active array builders will be frozen and appended to frozen buffer `frozen`. +/// A partitioner that writes all shuffle data to a single file and a single index file +struct SinglePartitionShufflePartitioner { +// output_data_file: File, +output_data_writer: BufBatchWriter, +output_index_path: String, +/// Batches that are smaller than the batch size and to be concatenated +buffered_batches: Vec, +/// Number of rows in the concatenating batches +num_buffered_rows: usize, +/// Metrics for the repartitioner +metrics: ShuffleRepartitionerMetrics, +/// The configured batch size batch_size: usize, -/// Memory reservation for this partition buffer. -reservation: MemoryReservation, -/// Spill file for intermediate shuffle output for this partition. Each spill event -/// will append to this file and the contents will be copied to the shuffle file at -/// the end of processing. -spill_file: Option, -/// Writer that performs encoding and compression -shuffle_block_writer: ShuffleBlockWriter, -} - -struct SpillFile { -temp_file: RefCountedTempFile, -file: File, } -impl PartitionBuffer { +impl SinglePartitionShufflePartitioner { fn try_new( +output_data_path: String, +output_index_path: String, schema: SchemaRef, +metrics: ShuffleRepartitionerMetrics, batch_size: usize, -reservation: MemoryReservation, codec: CompressionCodec, enable_fast_encoding: bool, ) -> Result { let shuffle_block_writer = -ShuffleBlockWriter::try_new(schema.as_ref(), enable_fast_encoding, codec)?; -let active_slots_mem_size = schema -.fields() -.iter() -.map(|field| slot_size(batch_size, field.data_type())) -.sum::(); +ShuffleBlockWriter::try_new(schema.as_ref(), enable_fast_encoding, codec.clone())?; + +let output_data_file = OpenOptions::new() +.write(true) +.create(true) +.truncate(true) +.open(output_data_path) +.map_err(to_df_err)?; + +let output_data_writer = BufBatchWriter::new(shuffle_block_writer, output_data_file); + Ok(Self { -schema, -frozen: vec![], -active: vec![], -active_slots_mem_size, -num_active_rows: 0, +output_data_writer, +output_index_path, +buffered_batches: vec![], +num_buffered_rows: 0, +metrics, batch_size, -reservation, -spill_file: None, -shuffle_block_writer, }) } -/// Initializes active builders if necessary. -/// Returns error if memory reservation fails. 
-fn allocate_active_builders(&mut self, metrics: &ShuffleRepartitionerMetrics) -> Result<()> { -if self.active.is_empty() { -let mut mempool_timer = metrics.mempool_time.timer(); -self.reservation.try_grow(self.active_slots_mem_size)?; -mempool_timer.stop(); - -let mut repart_timer = metrics.repart_time.timer(); -self.active = new_array_builders(&self.schema, self.batch_size); -repart_timer.stop(); -} -Ok(()) +/// Add a batch to the buffer of the partitioner, these buffered batches will be concatenated +/// and written to the output data file when the number of rows in the buffer reaches the batch size. +fn add_buffered_batch(&mut self, batch: RecordBatch) { +self.num_buffered_rows += batch.num_rows(); +self.buffered_batches.push(batch); } -/// Appends rows of specified indices from columns into active array builders. -fn append_rows( -&mut self, -columns: &[ArrayRef], -
Re: [PR] chore: Reimplement ShuffleWriterExec using interleave_record_batch [datafusion-comet]
Kontinuation commented on code in PR #1511: URL: https://github.com/apache/datafusion-comet/pull/1511#discussion_r2022551508 ## native/core/src/execution/shuffle/shuffle_writer.rs: ## @@ -667,175 +740,322 @@ impl Debug for ShuffleRepartitioner { } } -/// The status of appending rows to a partition buffer. -#[derive(Debug)] -enum AppendRowStatus { -/// Rows were appended -Appended, -/// Not all rows were appended due to lack of available memory -StartIndex(usize), -} - -struct PartitionBuffer { -/// The schema of batches to be partitioned. -schema: SchemaRef, -/// The "frozen" Arrow IPC bytes of active data. They are frozen when `flush` is called. -frozen: Vec, -/// Array builders for appending rows into buffering batches. -active: Vec>, -/// The estimation of memory size of active builders in bytes when they are filled. -active_slots_mem_size: usize, -/// Number of rows in active builders. -num_active_rows: usize, -/// The maximum number of rows in a batch. Once `num_active_rows` reaches `batch_size`, -/// the active array builders will be frozen and appended to frozen buffer `frozen`. +/// A partitioner that writes all shuffle data to a single file and a single index file +struct SinglePartitionShufflePartitioner { +// output_data_file: File, +output_data_writer: BufBatchWriter, +output_index_path: String, +/// Batches that are smaller than the batch size and to be concatenated +buffered_batches: Vec, +/// Number of rows in the concatenating batches +num_buffered_rows: usize, +/// Metrics for the repartitioner +metrics: ShuffleRepartitionerMetrics, +/// The configured batch size batch_size: usize, -/// Memory reservation for this partition buffer. -reservation: MemoryReservation, -/// Spill file for intermediate shuffle output for this partition. Each spill event -/// will append to this file and the contents will be copied to the shuffle file at -/// the end of processing. -spill_file: Option, -/// Writer that performs encoding and compression -shuffle_block_writer: ShuffleBlockWriter, -} - -struct SpillFile { -temp_file: RefCountedTempFile, -file: File, } -impl PartitionBuffer { +impl SinglePartitionShufflePartitioner { fn try_new( +output_data_path: String, +output_index_path: String, schema: SchemaRef, +metrics: ShuffleRepartitionerMetrics, batch_size: usize, -reservation: MemoryReservation, codec: CompressionCodec, enable_fast_encoding: bool, ) -> Result { let shuffle_block_writer = -ShuffleBlockWriter::try_new(schema.as_ref(), enable_fast_encoding, codec)?; -let active_slots_mem_size = schema -.fields() -.iter() -.map(|field| slot_size(batch_size, field.data_type())) -.sum::(); +ShuffleBlockWriter::try_new(schema.as_ref(), enable_fast_encoding, codec.clone())?; + +let output_data_file = OpenOptions::new() +.write(true) +.create(true) +.truncate(true) +.open(output_data_path) +.map_err(to_df_err)?; + +let output_data_writer = BufBatchWriter::new(shuffle_block_writer, output_data_file); + Ok(Self { -schema, -frozen: vec![], -active: vec![], -active_slots_mem_size, -num_active_rows: 0, +output_data_writer, +output_index_path, +buffered_batches: vec![], +num_buffered_rows: 0, +metrics, batch_size, -reservation, -spill_file: None, -shuffle_block_writer, }) } -/// Initializes active builders if necessary. -/// Returns error if memory reservation fails. 
-fn allocate_active_builders(&mut self, metrics: &ShuffleRepartitionerMetrics) -> Result<()> { -if self.active.is_empty() { -let mut mempool_timer = metrics.mempool_time.timer(); -self.reservation.try_grow(self.active_slots_mem_size)?; -mempool_timer.stop(); - -let mut repart_timer = metrics.repart_time.timer(); -self.active = new_array_builders(&self.schema, self.batch_size); -repart_timer.stop(); -} -Ok(()) +/// Add a batch to the buffer of the partitioner, these buffered batches will be concatenated +/// and written to the output data file when the number of rows in the buffer reaches the batch size. +fn add_buffered_batch(&mut self, batch: RecordBatch) { +self.num_buffered_rows += batch.num_rows(); +self.buffered_batches.push(batch); } -/// Appends rows of specified indices from columns into active array builders. -fn append_rows( -&mut self, -columns: &[ArrayRef], -
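The core of the new `SinglePartitionShufflePartitioner` in the diff above is a buffer-then-concatenate strategy: small batches accumulate until `batch_size` rows, then get merged into one right-sized batch. Here is a minimal standalone sketch of that idea using arrow's `concat_batches`; the `BatchBuffer`/`push` names are hypothetical, not the PR's API:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::compute::concat_batches;
use arrow::datatypes::{DataType, Field, Schema, SchemaRef};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Hypothetical buffer mirroring the diff's `add_buffered_batch` idea:
/// small batches accumulate until `batch_size` rows, then are concatenated.
struct BatchBuffer {
    schema: SchemaRef,
    buffered: Vec<RecordBatch>,
    num_rows: usize,
    batch_size: usize,
}

impl BatchBuffer {
    fn push(&mut self, batch: RecordBatch) -> Result<Option<RecordBatch>, ArrowError> {
        self.num_rows += batch.num_rows();
        self.buffered.push(batch);
        if self.num_rows >= self.batch_size {
            // Concatenate the buffered batches into one output batch.
            let out = concat_batches(&self.schema, self.buffered.iter())?;
            self.buffered.clear();
            self.num_rows = 0;
            return Ok(Some(out));
        }
        Ok(None)
    }
}

fn main() -> Result<(), ArrowError> {
    let schema: SchemaRef =
        Arc::new(Schema::new(vec![Field::new("v", DataType::Int32, false)]));
    let mut buf = BatchBuffer {
        schema: Arc::clone(&schema),
        buffered: vec![],
        num_rows: 0,
        batch_size: 4,
    };
    for chunk in [vec![1, 2], vec![3, 4, 5]] {
        let col: ArrayRef = Arc::new(Int32Array::from(chunk));
        let batch = RecordBatch::try_new(Arc::clone(&schema), vec![col])?;
        if let Some(out) = buf.push(batch)? {
            println!("flushed {} rows", out.num_rows());
        }
    }
    Ok(())
}
```

As the PR title suggests, operating on whole batches (interleave/concat) avoids the row-by-row append path of the old `PartitionBuffer` builders for this case.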
[PR] ArraySort: support structs [datafusion]
cht42 opened a new pull request, #15527: URL: https://github.com/apache/datafusion/pull/15527 ## Which issue does this PR close? - Closes #15526 ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested? ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Make Clickbench Q29 5x faster for datafusion [datafusion]
zhuqi-lucas commented on issue #15524: URL: https://github.com/apache/datafusion/issues/15524#issuecomment-2768994999 Thank you @jayzhan211 for the guide, I will try this! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add documentation example for `AggregateExprBuilder` [datafusion]
alamb commented on PR #15504: URL: https://github.com/apache/datafusion/pull/15504#issuecomment-2769364593 > @berkaysynnada thank you, actually I did use the datafusion-examples as reference. > > > Perhaps we can make this in datafusion-examples as well > > Yeah that could also be done with maybe a few minor adjustments. If you want, I can do that as well once I have the active conversations reviewed and resolved. I think it would be good to avoid having duplicated content in datafusion-examples and the normal documentation. I personally think the documentation site is more accessible (more chance of someone reading / finding it via search) than Rust examples in the repository. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Run all benchmarks on merge to main branch [datafusion]
jayzhan211 commented on issue #15511: URL: https://github.com/apache/datafusion/issues/15511#issuecomment-2769754304 > > I don't think we need to actually "benchmark" the code for each merge. > > The issue [#5504](https://github.com/apache/datafusion/issues/5504) would require all benchmarks to run after each merge. I think we could just add benchmarks directly for now. What do you think? > > I am willing to work on* this. I don't think we need to run the benchmark on CI; at the least it should be optional and disabled by default. > We could do that too however not all benchmarks are sql queries. I agree that maintaining it is challenging. Adding it to the extended test suite and running it on every merge isn't a viable solution, as it's costly and often unnecessary. I don't think keeping the benchmark functional is essential; it's more like a script that we can modify as needed, depending on what we want to measure each time. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add documentation example for `AggregateExprBuilder` [datafusion]
alamb commented on code in PR #15504: URL: https://github.com/apache/datafusion/pull/15504#discussion_r2023080660 ## datafusion/physical-expr/src/aggregate.rs: ## @@ -97,6 +97,167 @@ impl AggregateExprBuilder { /// Constructs an `AggregateFunctionExpr` from the builder /// /// Note that an [`Self::alias`] must be provided before calling this method. +/// +/// # Example: Create an `AggregateUDF` +/// +/// In the following example, `AggregateFunctionExpr` will be built using `AggregateExprBuilder` Review Comment: If you add `[]` to the names rustdoc will make them links. For example ```suggestion /// In the following example, [`AggregateFunctionExpr`] will be built using [`AggregateExprBuilder`] ``` You may have to provide the full path with a separate definition like; ```rust /// Link to [MyStruct] /// /// [MyStruct]: core::my_struct::MyStruct ``` ## datafusion/physical-expr/src/aggregate.rs: ## @@ -97,6 +97,167 @@ impl AggregateExprBuilder { /// Constructs an `AggregateFunctionExpr` from the builder /// /// Note that an [`Self::alias`] must be provided before calling this method. +/// +/// # Example: Create an `AggregateUDF` +/// +/// In the following example, `AggregateFunctionExpr` will be built using `AggregateExprBuilder` +/// which provides a build function. +/// +/// First we will create an `Accumulator` which will be used to further implement `AggregateUDFImpl`. +/// After implementing `AggregateUDFImpl`, it could be used to pass in as a parameter to create an `AggregateExprBuilder`. +/// `AggregateExprBuilder` could the be used to generate `AggregateFunctionExpr` after chaining +/// queries on top of each other. +/// +/// ``` Review Comment: I think this is not the right place to demonstrate creating a user defined aggregate so I suggest removing this particular example (notes below how to fold it into the other example) ## datafusion/physical-expr/src/aggregate.rs: ## @@ -97,6 +97,167 @@ impl AggregateExprBuilder { /// Constructs an `AggregateFunctionExpr` from the builder /// /// Note that an [`Self::alias`] must be provided before calling this method. +/// +/// # Example: Create an `AggregateUDF` +/// +/// In the following example, `AggregateFunctionExpr` will be built using `AggregateExprBuilder` +/// which provides a build function. +/// +/// First we will create an `Accumulator` which will be used to further implement `AggregateUDFImpl`. +/// After implementing `AggregateUDFImpl`, it could be used to pass in as a parameter to create an `AggregateExprBuilder`. +/// `AggregateExprBuilder` could the be used to generate `AggregateFunctionExpr` after chaining +/// queries on top of each other. 
+/// +/// ``` +/// use std::any::Any; +/// use std::sync::OnceLock; +/// use std::sync::Arc; +/// use arrow::datatypes::DataType; +/// use datafusion_common::{DataFusionError, plan_err, Result, ScalarValue}; +/// use datafusion_expr::{col, ColumnarValue, Signature, Volatility, Expr, Documentation}; +/// use datafusion_expr::{AggregateUDFImpl, AggregateUDF, Accumulator, function::{AccumulatorArgs, StateFieldsArgs}}; +/// use datafusion_expr::window_doc_sections::DOC_SECTION_AGGREGATE; +/// use arrow::datatypes::Schema; +/// use arrow::datatypes::Field; +/// use arrow::array::Array; +/// +/// #[derive(Debug)] +/// struct FirstValueAccumulator { +/// value: Option, +/// data_type: DataType, +/// } +/// +/// impl Accumulator for FirstValueAccumulator { +/// fn update_batch(&mut self, values: &[Arc]) -> Result<()> { +/// if self.value.is_none() && !values.is_empty() { +/// let first_array = &values[0]; +/// for i in 0..first_array.len() { +/// if !first_array.is_null(i) { +/// self.value = Some(ScalarValue::try_from_array(first_array, i)?); +/// break; +/// } +/// } +/// } +/// Ok(()) +/// } +/// +/// fn merge_batch(&mut self, states: &[Arc]) -> Result<()> { +/// if self.value.is_none() && !states.is_empty() { +/// let first_array = &states[0]; +/// for i in 0..first_array.len() { +/// if !first_array.is_null(i) { +/// self.value = Some(ScalarValue::try_from_array(first_array, i)?); +/// break; +/// } +/// } +/// } +/// Ok(()) +/// } +/// +/// fn evaluate(&mut self) -> Result { +/// match &self.value { +/// Some(value) => Ok(value.c
Re: [PR] datafusion-cli: document reading partitioned parquet [datafusion]
alamb commented on code in PR #15505: URL: https://github.com/apache/datafusion/pull/15505#discussion_r202314 ## docs/source/user-guide/cli/datasources.md: ## @@ -126,6 +125,32 @@ select count(*) from hits; 1 row in set. Query took 0.344 seconds. ``` +**Why Wildcards Are Not Supported** + +Although wildcards (e.g., `*.parquet` or `**/*.parquet`) may work for local filesystems in some cases, they are not officially supported by DataFusion. This is because wildcards are not universally applicable across all storage backends (e.g., S3, GCS). Instead, DataFusion expects the user to specify the directory path, and it will automatically read all compatible files within that directory. + +For example, the following usage is not supported: + +```sql +CREATE EXTERNAL TABLE test ( +message TEXT, +day DATE +) +STORED AS PARQUET +LOCATION 'gs://bucket/*.parquet'; +``` + +Instead, you should use: + +```sql +CREATE EXTERNAL TABLE test ( +message TEXT, +day DATE +) +STORED AS PARQUET +LOCATION 'gs://bucket/'; Review Comment: ```suggestion LOCATION 'gs://bucket/my_table'; ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] fix: update group by columns for merge phase after spill [datafusion]
rluvaton opened a new pull request, #15531: URL: https://github.com/apache/datafusion/pull/15531 ## Which issue does this PR close? - Closes #15530. ## Rationale for this change the PR forgot to update the group by expressions: - #13995 ## What changes are included in this PR? Update group by columns to be based on their index for merging phase ## Are these changes tested? Yes ## Are there any user-facing changes? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] ArraySort: support structs [datafusion]
alamb commented on code in PR #15527: URL: https://github.com/apache/datafusion/pull/15527#discussion_r2023103674 ## datafusion/functions-nested/src/sort.rs: ## @@ -207,9 +208,21 @@ pub fn array_sort_inner(args: &[ArrayRef]) -> Result { valid.append_null(); } else { let arr_ref = list_array.value(i); -let arr_ref = arr_ref.as_ref(); -let sorted_array = compute::sort(arr_ref, sort_option)?; +let sorted_array = match arr_ref.data_type() { +DataType::Struct(_) => { +let sort_columns: Vec = vec![SortColumn { +values: Arc::clone(&arr_ref), +options: sort_option, +}]; +let indices = compute::lexsort_to_indices(&sort_columns, None)?; +compute::take(arr_ref.as_ref(), &indices, None)? +} +_ => { +let arr_ref = arr_ref.as_ref(); +compute::sort(arr_ref, sort_option)? Review Comment: is the issue that arrow-rs 's sort kernel doesn't support sorting structures but `lexsort_to_indices` does? Maybe we can offer some comments about why and possibly file an upstream ticket -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Queries similar to `count-bug` produce incorrect results [datafusion]
suibianwanwank commented on PR #15281: URL: https://github.com/apache/datafusion/pull/15281#issuecomment-2769782451 > The projection required to be in the group expression. I think the query of these 2 are equivalent but the subquery one group by `e2.b` and the join query group by `e1.b`. > > The only problem left is that whether this rewrite is general enough to work well on most of the subquery > > ``` > query IT > select e1.b, (select case when max(e2.a) > 10 then 'a' else 'b' end from t2 e2 where e2.b = e1.b + 1) from t1 e1; > > 0 a > 2 a > > query TT > explain > select e1.b, (select case when max(e2.a) > 10 then 'a' else 'b' end from t2 e2 where e2.b = e1.b + 1) from t1 e1; > > logical_plan > 01)Projection: e1.b, CASE WHEN __scalar_sq_1.__always_true IS NULL THEN Utf8("b") ELSE __scalar_sq_1.CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END END AS CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END > 02)--Left Join: CAST(e1.b AS Int64) + Int64(1) = CAST(__scalar_sq_1.b AS Int64) > 03)SubqueryAlias: e1 > 04)--TableScan: t1 projection=[b] > 05)SubqueryAlias: __scalar_sq_1 > 06)--Projection: CASE WHEN max(e2.a) > Int32(10) THEN Utf8("a") ELSE Utf8("b") END AS CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END, e2.b, Boolean(true) AS __always_true > 07)Aggregate: groupBy=[[e2.b]], aggr=[[max(e2.a)]] > 08)--SubqueryAlias: e2 > 09)TableScan: t2 projection=[a, b] > physical_plan > 01)ProjectionExec: expr=[b@0 as b, CASE WHEN __always_true@2 IS NULL THEN b ELSE CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END@1 END as CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END] > 02)--CoalesceBatchesExec: target_batch_size=8192 > 03)HashJoinExec: mode=Partitioned, join_type=Left, on=[(e1.b + Int64(1)@1, CAST(__scalar_sq_1.b AS Int64)@3)], projection=[b@0, CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END@2, __always_true@4] > 04)--CoalesceBatchesExec: target_batch_size=8192 > 05)RepartitionExec: partitioning=Hash([e1.b + Int64(1)@1], 4), input_partitions=1 > 06)--ProjectionExec: expr=[b@0 as b, CAST(b@0 AS Int64) + 1 as e1.b + Int64(1)] > 07)DataSourceExec: partitions=1, partition_sizes=[1] > 08)--CoalesceBatchesExec: target_batch_size=8192 > 09)RepartitionExec: partitioning=Hash([CAST(__scalar_sq_1.b AS Int64)@3], 4), input_partitions=4 > 10)--ProjectionExec: expr=[CASE WHEN max(e2.a)@1 > 10 THEN a ELSE b END as CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END, b@0 as b, true as __always_true, CAST(b@0 AS Int64) as CAST(__scalar_sq_1.b AS Int64)] > 11)AggregateExec: mode=FinalPartitioned, gby=[b@0 as b], aggr=[max(e2.a)] > 12)--CoalesceBatchesExec: target_batch_size=8192 > 13)RepartitionExec: partitioning=Hash([b@0], 4), input_partitions=4 > 14)--RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1 > 15)AggregateExec: mode=Partial, gby=[b@1 as b], aggr=[max(e2.a)] > 16)--DataSourceExec: partitions=1, partition_sizes=[1] > > query IT > SELECT > e1.b, > CASE > WHEN MAX(e2.a) > 10 THEN 'a' > ELSE 'b' > END AS result > FROM t2 e2 > LEFT JOIN t1 e1 ON e2.b = e1.b + 1 > GROUP BY e1.b; > > 2 a > 0 a > > query TT > explain > SELECT > e1.b, > CASE > WHEN MAX(e2.a) > 10 THEN 'a' > ELSE 'b' > END AS result > FROM t2 e2 > LEFT JOIN t1 e1 ON e2.b = e1.b + 1 > GROUP BY e1.b; > > logical_plan > 01)Projection: e1.b, CASE WHEN max(e2.a) > Int32(10) THEN Utf8("a") ELSE Utf8("b") END AS result > 02)--Aggregate: groupBy=[[e1.b]], aggr=[[max(e2.a)]] > 03)Projection: e2.a, e1.b > 04)--Left Join: 
CAST(e2.b AS Int64) = CAST(e1.b AS Int64) + Int64(1) > 05)SubqueryAlias: e2 > 06)--TableScan: t2 projection=[a, b] > 07)SubqueryAlias: e1 > 08)--TableScan: t1 projection=[b] > physical_plan > 01)ProjectionExec: expr=[b@0 as b, CASE WHEN max(e2.a)@1 > 10 THEN a ELSE b END as result] > 02)--AggregateExec: mode=FinalPartitioned, gby=[b@0 as b], aggr=[max(e2.a)] > 03)CoalesceBatchesExec: target_batch_size=8192 > 04)--RepartitionExec: partitioning=Hash([b@0], 4), input_partitions=4 > 05)AggregateExec: mode=Partial, gby=[b@1 as b], aggr=[max(e2.a)] > 06)--RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1 > 07)ProjectionExec: expr=[a@1 as a, b@0 as b] > 08)--CoalesceBatchesExec: target_batch_size=8192 > 09)HashJoinExec: mode=Partitioned, join_
[PR] Draft: Make Clickbench Q29 5x faster for datafusion [datafusion]
zhuqi-lucas opened a new pull request, #15532: URL: https://github.com/apache/datafusion/pull/15532 ## Which issue does this PR close? - Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested? ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Collecting parquet without any transformations throws an exception [datafusion-comet]
l0kr commented on issue #1588: URL: https://github.com/apache/datafusion-comet/issues/1588#issuecomment-2769772035 Ah, nice catch @mbutrovich! Yup, looks like a dupe. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Test: configuration fuzzer for (external) sort queries [datafusion]
2010YOUY01 commented on code in PR #15501: URL: https://github.com/apache/datafusion/pull/15501#discussion_r2022134277 ## datafusion/core/tests/fuzz_cases/sort_query_fuzz.rs: ## @@ -0,0 +1,635 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +//! Fuzz Test for various corner cases sorting RecordBatches exceeds available memory and should spill + +use std::cmp::min; +use std::sync::Arc; + +use arrow::array::RecordBatch; +use arrow_schema::SchemaRef; +use datafusion::datasource::MemTable; +use datafusion::prelude::{SessionConfig, SessionContext}; +use datafusion_common::{instant::Instant, Result}; +use datafusion_execution::memory_pool::{ +human_readable_size, MemoryPool, UnboundedMemoryPool, +}; +use datafusion_expr::display_schema; +use datafusion_physical_plan::spill::get_record_batch_memory_size; +use rand::seq::SliceRandom; +use std::time::Duration; + +use datafusion_execution::{ +disk_manager::DiskManagerConfig, memory_pool::FairSpillPool, +runtime_env::RuntimeEnvBuilder, +}; +use rand::Rng; +use rand::{rngs::StdRng, SeedableRng}; + +use crate::fuzz_cases::aggregation_fuzzer::check_equality_of_batches; + +use super::aggregation_fuzzer::ColumnDescr; +use super::record_batch_generator::{get_supported_types_columns, RecordBatchGenerator}; + +/// Entry point for executing the sort query fuzzer. +/// +/// Now memory limiting is disabled by default. See TODOs in `SortQueryFuzzer`. +#[tokio::test(flavor = "multi_thread")] +async fn sort_query_fuzzer_runner() { +let random_seed = std::time::SystemTime::now() +.duration_since(std::time::UNIX_EPOCH) +.unwrap() +.as_secs(); +let test_generator = SortFuzzerTestGenerator::new( +2000, +3, +"sort_fuzz_table".to_string(), +get_supported_types_columns(random_seed), +false, +random_seed, +); +let mut fuzzer = SortQueryFuzzer::new(random_seed) +// Configs for how many random query to test +.with_max_rounds(Some(5)) +.with_queries_per_round(4) +.with_config_variations_per_query(25) +// Will stop early if the time limit is reached +.with_time_limit(Duration::from_secs(20)) +.with_test_generator(test_generator); + +fuzzer.run().await.unwrap(); +} + +/// SortQueryFuzzer holds the runner configuration for executing sort query fuzz tests. The fuzzing details are managed inside `SortFuzzerTestGenerator`. +/// +/// It defines: +/// - `max_rounds`: Maximum number of rounds to run (or None to run until `time_limit`). +/// - `queries_per_round`: Number of different queries to run in each round. +/// - `config_variations_per_query`: Number of different configurations to test per query. +/// - `time_limit`: Time limit for the entire fuzzer execution. +/// +/// TODO: The following improvements are blocked on https://github.com/apache/datafusion/issues/14748: +/// 1. 
Support generating queries with arbitrary number of ORDER BY clauses +///Currently limited to be smaller than number of projected columns +/// 2. Enable special type columns like utf8_low to be used in ORDER BY clauses +/// 3. Enable memory limiting functionality in the fuzzer runner +pub struct SortQueryFuzzer { +test_gen: SortFuzzerTestGenerator, +/// Random number generator for the runner, used to generate seeds for inner components. +/// Seeds for each choice (query, config, etc.) are printed out for reproducibility. +runner_rng: StdRng, + +// +// Runner configurations +// +/// For each round, a new dataset is generated. If `None`, keep running until +/// the time limit is reached +max_rounds: Option, +/// How many different queries to run in each round +queries_per_round: usize, +/// For each query, how many different configurations to try and make sure their +/// results are consistent +config_variations_per_query: usize, +/// The time limit for the entire sort query fuzzer execution. +time_limit: Option, +} + +impl SortQueryFuzzer { +pub fn new(seed: u64
Re: [PR] Migrate `datafusion/sql` tests to insta, part2 [datafusion]
alamb commented on PR #15499: URL: https://github.com/apache/datafusion/pull/15499#issuecomment-2769639137 Thanks again @qstommyshu -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add documentation example for `AggregateExprBuilder` [datafusion]
alamb commented on code in PR #15504: URL: https://github.com/apache/datafusion/pull/15504#discussion_r2023660049 ## datafusion/physical-expr/src/aggregate.rs: ## @@ -97,6 +97,165 @@ impl AggregateExprBuilder { /// Constructs an `AggregateFunctionExpr` from the builder /// /// Note that an [`Self::alias`] must be provided before calling this method. +/// +/// # Example: Create an [`AggregateUDF`] +/// +/// In the following example, [`AggregateFunctionExpr`] will be built using [`AggregateExprBuilder`] +/// which provides a build function. Full example could be accessed from the source file. +/// +/// ``` +/// # use std::any::Any; +/// # use std::sync::OnceLock; +/// # use std::sync::Arc; +/// # use arrow::datatypes::DataType; +/// # use datafusion_common::{DataFusionError, plan_err, Result, ScalarValue}; +/// # use datafusion_expr::{col, ColumnarValue, Signature, Volatility, Expr, Documentation}; +/// # use datafusion_expr::{AggregateUDFImpl, AggregateUDF, Accumulator, function::{AccumulatorArgs, StateFieldsArgs}}; +/// # use datafusion_expr::window_doc_sections::DOC_SECTION_AGGREGATE; +/// # use arrow::datatypes::Schema; +/// # use arrow::datatypes::Field; +/// # use arrow::array::Array; +/// # +/// # #[derive(Debug)] +/// # struct FirstValueAccumulator { +/// # value: Option, +/// # data_type: DataType, +/// # } +/// # +/// # impl Accumulator for FirstValueAccumulator { +/// # fn update_batch(&mut self, values: &[Arc]) -> Result<()> { Review Comment: I think we can significantly shorten the example by just using `unimplemented()!` instead of adding an actual implementation for an accumulator, and the methods. ## datafusion/physical-expr/src/aggregate.rs: ## @@ -97,6 +97,165 @@ impl AggregateExprBuilder { /// Constructs an `AggregateFunctionExpr` from the builder /// /// Note that an [`Self::alias`] must be provided before calling this method. +/// +/// # Example: Create an [`AggregateUDF`] +/// +/// In the following example, [`AggregateFunctionExpr`] will be built using [`AggregateExprBuilder`] +/// which provides a build function. Full example could be accessed from the source file. 
+/// +/// ``` +/// # use std::any::Any; +/// # use std::sync::OnceLock; +/// # use std::sync::Arc; +/// # use arrow::datatypes::DataType; +/// # use datafusion_common::{DataFusionError, plan_err, Result, ScalarValue}; +/// # use datafusion_expr::{col, ColumnarValue, Signature, Volatility, Expr, Documentation}; +/// # use datafusion_expr::{AggregateUDFImpl, AggregateUDF, Accumulator, function::{AccumulatorArgs, StateFieldsArgs}}; +/// # use datafusion_expr::window_doc_sections::DOC_SECTION_AGGREGATE; +/// # use arrow::datatypes::Schema; +/// # use arrow::datatypes::Field; +/// # use arrow::array::Array; +/// # +/// # #[derive(Debug)] +/// # struct FirstValueAccumulator { +/// # value: Option, +/// # data_type: DataType, +/// # } +/// # +/// # impl Accumulator for FirstValueAccumulator { +/// # fn update_batch(&mut self, values: &[Arc]) -> Result<()> { +/// # if self.value.is_none() && !values.is_empty() { +/// # let first_array = &values[0]; +/// # for i in 0..first_array.len() { +/// # if !first_array.is_null(i) { +/// # self.value = Some(ScalarValue::try_from_array(first_array, i)?); +/// # break; +/// # } +/// # } +/// # } +/// # Ok(()) +/// # } +/// # +/// # fn merge_batch(&mut self, states: &[Arc]) -> Result<()> { +/// # if self.value.is_none() && !states.is_empty() { +/// # let first_array = &states[0]; +/// # for i in 0..first_array.len() { +/// # if !first_array.is_null(i) { +/// # self.value = Some(ScalarValue::try_from_array(first_array, i)?); +/// # break; +/// # } +/// # } +/// # } +/// # Ok(()) +/// # } +/// # +/// # fn evaluate(&mut self) -> Result { +/// # match &self.value { +/// # Some(value) => Ok(value.clone()), +/// # None => ScalarValue::try_from(&self.data_type), +/// # } +/// # } +/// # +/// # fn size(&self) -> usize { +/// # std::mem::size_of_val(self) +/// # } +/// # +/// # fn state(&mut self) -> Result> { +/// # match &self.value { +/// # Some(value) => Ok(vec![value.clone()]), +/// # None => ScalarValue::try_from(&self.data_type).map(|v| vec![v]),
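For what it's worth, the suggestion to shorten the doc example with `unimplemented!()` stubs could look roughly like this (a sketch only; the trait and method signatures are taken from the quoted example above):

```rust
use arrow::array::ArrayRef;
use datafusion_common::{Result, ScalarValue};
use datafusion_expr::Accumulator;

#[derive(Debug)]
struct FirstValueAccumulator;

impl Accumulator for FirstValueAccumulator {
    fn update_batch(&mut self, _values: &[ArrayRef]) -> Result<()> {
        unimplemented!()
    }
    fn merge_batch(&mut self, _states: &[ArrayRef]) -> Result<()> {
        unimplemented!()
    }
    fn evaluate(&mut self) -> Result<ScalarValue> {
        unimplemented!()
    }
    fn size(&self) -> usize {
        unimplemented!()
    }
    fn state(&mut self) -> Result<Vec<ScalarValue>> {
        unimplemented!()
    }
}

fn main() {
    // The stubs compile, which is all a focused builder example needs.
    let _acc = FirstValueAccumulator;
}
```

This keeps the rustdoc example focused on `AggregateExprBuilder` itself rather than on accumulator mechanics.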
Re: [PR] Add documentation example for `AggregateExprBuilder` [datafusion]
alamb commented on PR #15504: URL: https://github.com/apache/datafusion/pull/15504#issuecomment-2770660147 I also merged up from main to fix the CI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] CSV data with double quotes fails [datafusion]
alamb commented on issue #439: URL: https://github.com/apache/datafusion/issues/439#issuecomment-2770467590 I think this is fixed -- datafusion can read data with double quotes now -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: fix struct of arrays [datafusion-comet]
comphead commented on PR #1592: URL: https://github.com/apache/datafusion-comet/pull/1592#issuecomment-2770697619 @parthchandra @andygrove @kazuyukitanimura -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Physical plan refactor to support optimization rules and more efficient use of threads [datafusion]
alamb closed issue #92: Physical plan refactor to support optimization rules and more efficient use of threads URL: https://github.com/apache/datafusion/issues/92 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Word Count [datafusion]
alamb commented on issue #197: URL: https://github.com/apache/datafusion/issues/197#issuecomment-2770485282 Make lineitem SF 10 data (in 3 seconds!) using tpchgen-cli https://github.com/clflushopt/tpchgen-rs

```shell
tpchgen-cli -v --tables=lineitem --scale-factor=10 --format=parquet
```

Then you can split the comments into words:

```sql
> select unnest (string_to_array(l_comment, ' ')) as word from 'lineitem.parquet' limit 10;
+---------------+
| word          |
+---------------+
| egular        |
| courts        |
| above         |
| the           |
| ly            |
| final         |
| dependencies: |
| slyly         |
| bold          |
|               |
+---------------+
10 row(s) fetched. Elapsed 0.024 seconds.
```

Do the count:

```sql
> select count(*) from (select unnest (string_to_array(l_comment, ' ')) as word from 'lineitem.parquet');
+-----------+
| count(*)  |
+-----------+
| 271168008 |
+-----------+
1 row(s) fetched. Elapsed 0.731 seconds.
```

You can even do the distinct word count:

```sql
> select count(distinct word) from (select unnest (string_to_array(l_comment, ' ')) as word from 'lineitem.parquet');
+----------------------+
| count(DISTINCT word) |
+----------------------+
| 4959                 |
+----------------------+
1 row(s) fetched. Elapsed 1.052 seconds.
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Fix duplicate unqualified Field name (schema error) on join queries [datafusion]
alamb commented on PR #15438: URL: https://github.com/apache/datafusion/pull/15438#issuecomment-2770491240 Hi @LiaCastaneda -- I believe the CI has failed on this PR due to a change in the CI actions. Can you please merge the PR up to main, which I think will address the issue? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add dynamic pruning filters from TopK state [datafusion]
alamb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2770508515 FYI I will likely try and review this PR again carefully first thing tomorrow morning -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] chore(ci): build fails with strange error [datafusion-ballista]
milenkovicm opened a new pull request, #1222: URL: https://github.com/apache/datafusion-ballista/pull/1222 # Which issue does this PR close? Closes #. # Rationale for this change There are spurious CI errors: - https://github.com/apache/datafusion-ballista/actions/runs/14204334938 - https://github.com/apache/datafusion-ballista/actions/runs/14196252739 ``` actions-rs/toolchain@v1 is not allowed to be used in apache/datafusion-ballista. A... ``` ``` r-lib/actions/pr-fetch@master and r-lib/actions/pr-push@master are not allowed to be used in apache/datafusion-ballista. Actions in this workflow must be: within a repository that belongs to your Enterprise account, created by GitHub, verified in the GitHub Marketplace, or matching the following: 1Password/load-secrets-action@*, A ... ``` # What changes are included in this PR? Replaced the deprecated `actions-rs/toolchain@v1` with `dtolnay/rust-toolchain@stable`. # Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Extend TopK early termination to partially sorted inputs [datafusion]
alamb commented on issue #15529: URL: https://github.com/apache/datafusion/issues/15529#issuecomment-2770458720 There may be some overlap with this work from @adriangb (though I realize you are talking about a different optimization) - https://github.com/apache/datafusion/issues/15037 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Add test cases for NULL in joins as key values [datafusion]
alamb closed issue #148: Add test cases for NULL in joins as key values URL: https://github.com/apache/datafusion/issues/148 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] CSV data with double quotes fails [datafusion]
alamb closed issue #439: CSV data with double quotes fails URL: https://github.com/apache/datafusion/issues/439 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Physical plan refactor to support optimization rules and more efficient use of threads [datafusion]
alamb commented on issue #92: URL: https://github.com/apache/datafusion/issues/92#issuecomment-2770464823 I think this is no longer relevant so closing. Let's open a new ticket if there is anything actionable remaining -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Minor: clone and debug for FileSinkConfig [datafusion]
jayzhan211 merged PR #15516: URL: https://github.com/apache/datafusion/pull/15516 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Building project takes a *long* time (esp compilation time for `datafusion` core crate) [datafusion]
logan-keede commented on issue #13814: URL: https://github.com/apache/datafusion/issues/13814#issuecomment-2770108919 I did some profiling. ```sh RUSTC_BOOTSTRAP=1 cargo rustc -p datafusion-catalog -- -Z self-profile -Z self-profile-events=default,args ``` The culprits are some asynchronous functions (MemTable::{scan, insert_into}, StreamTable::{scan, insert_into}, StreamTableFactory::create, StreamWrite::write_all, ViewTable::scan): each of them takes 1-2 sec (5-10% of total time) in the analysis phase alone. There are a couple hundred evaluate_obligation calls under each type_check, of the pattern `{...{some_datafusion_object} as std::marker::Sync/Send> ... }`, which means the compiler is trying to check whether `some_datafusion_object` fulfils the trait bound of being Send/Sync. This does not happen with other asynchronous methods, e.g. `ListingSchemaProvider::refresh`, so maybe we can check the difference between them and see what is going wrong with the above-mentioned functions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
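Those `Send`/`Sync` obligations come from async fns: the compiler must prove the generated future is `Send`, which means walking every type held across an `.await`. A minimal illustration (not DataFusion code) of the kind of check that profile is counting:

```rust
use std::rc::Rc;

// Holding a non-Send type (Rc) across an await point makes the whole
// generated future non-Send, so passing it to `require_send` fails to compile.
async fn not_send() {
    let rc = Rc::new(1); // Rc is neither Send nor Sync
    async {}.await;      // rc is still alive across this await point
    println!("{rc}");
}

#[allow(dead_code)]
fn require_send<T: Send>(_t: T) {}

fn main() {
    // Creating the future is fine; uncomment the next line to see the
    // compiler's Send obligation fail.
    let _fut = not_send();
    // require_send(not_send());
}
```

If the slow functions differ from `ListingSchemaProvider::refresh` mainly in what they hold across await points, that might explain the asymmetry.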
Re: [PR] chore: update changelog for 45.0.0 [datafusion-ballista]
milenkovicm commented on PR #1218: URL: https://github.com/apache/datafusion-ballista/pull/1218#issuecomment-2770545131 Just a heads up @andygrove: I've added a release of the scheduler, executor ... Docker containers when a new tag is created. I did try it; hopefully it will not cause problems when a new tag is created. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore(ci): build fails with strange error [datafusion-ballista]
milenkovicm commented on PR #1222: URL: https://github.com/apache/datafusion-ballista/pull/1222#issuecomment-2770514586 It does not look like it fixes the issue: https://github.com/apache/datafusion-ballista/actions/runs/14204595182 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Migrate datafusion/sql tests to insta, part3 [datafusion]
qstommyshu opened a new pull request, #15533: URL: https://github.com/apache/datafusion/pull/15533 ## Which issue does this PR close? - Related #15397, #15497, #15499 this is a part of #15484 breaking down. - Checkout things to note of the whole migration in comments section of #15484. ## Rationale for this change ## What changes are included in this PR? This is the part 3 of #15484 breakdown, as the code changes in #15484 is too large. Part1: #15497, Part2: #15499 ## Are these changes tested? Yes, I manually tested the before/after changes. ## Are there any user-facing changes? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: fix struct of arrays [datafusion-comet]
codecov-commenter commented on PR #1592: URL: https://github.com/apache/datafusion-comet/pull/1592#issuecomment-2770532403 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1592?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report All modified and coverable lines are covered by tests :white_check_mark: > Project coverage is 56.25%. Comparing base [(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`51ab989`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/51ab989cb7f51592de97f7785318be7f812ee5b8?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). > Report is 115 commits behind head on main. Additional details and impacted files ```diff @@ Coverage Diff @@ ## main#1592 +/- ## + Coverage 56.12% 56.25% +0.12% + Complexity 976 957 -19 Files 119 124 +5 Lines 1174312436 +693 Branches 2251 2322 +71 + Hits 6591 6996 +405 - Misses 4012 4285 +273 - Partials 1140 1155 +15 ``` [:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1592?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :rocket: New features to boost your workflow: - :snowflake: [Test Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, report on failures, and find test suite problems. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Migrate datafusion/sql tests to insta, part3 [datafusion]
qstommyshu commented on PR #15533: URL: https://github.com/apache/datafusion/pull/15533#issuecomment-2770233544 Hmm, it seems there is some issue with the CI pipeline. https://github.com/user-attachments/assets/1079a7c8-3e5c-4bca-9473-3c0e9fe69ec7 > sccache: error: Server startup failed: cache storage failed to read: Unexpected (permanent) at read => {"$id":"1","innerException":null,"message":"This legacy service is shutting down, effective April 15, 2025. Migrate to the new service ASAP. For more information: https://gh.io/gha-cache-sunset","typeName":"Microsoft.Azure.DevOps.ArtifactCache.WebApi.ArtifactCacheServiceDecommissionedException, Microsoft.Azure.DevOps.ArtifactCache.WebApi","typeKey":"ArtifactCacheServiceDecommissionedException","errorCode":0,"eventId":3000} Looks like the pipeline is using a legacy service that is shutting down? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add GreptimeDB to the "Users" in README [datafusion-sqlparser-rs]
iffyio merged PR #1788: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1788 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] added functionality to handle output statement [datafusion-sqlparser-rs]
dilovancelik opened a new pull request, #1790: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1790 Hey, I have added a feature to handle OUTPUT statements at the end of MERGE statements, which are used in MS SQL, per the following issue [1789](https://github.com/apache/datafusion-sqlparser-rs/issues/1789) that I created. - I've added the OUTPUT keyword - I've added a struct for Output in ast/mod.rs - I've added an optional Output parameter to the Merge struct in ast/mod.rs - added a function for parsing the output parameter in parser/mod.rs - updated the parse merge clause function to break on the OUTPUT keyword - updated parse merge to handle the output clause - added and updated tests -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
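For reviewers unfamiliar with the T-SQL feature: `OUTPUT` returns the affected rows at the end of a `MERGE`. A sketch of driving the parser on such a statement (the SQL string is illustrative, and without this PR the `OUTPUT` clause is expected to be rejected):

```rust
use sqlparser::dialect::MsSqlDialect;
use sqlparser::parser::Parser;

fn main() {
    // T-SQL MERGE ending in an OUTPUT clause, the shape this PR adds support for
    let sql = "MERGE INTO t USING s ON t.id = s.id \
               WHEN MATCHED THEN UPDATE SET t.v = s.v \
               OUTPUT inserted.id;";
    match Parser::parse_sql(&MsSqlDialect {}, sql) {
        Ok(ast) => println!("parsed: {ast:?}"),
        Err(e) => println!("parse error (expected without this PR): {e}"),
    }
}
```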
[PR] feat: fix struct of arrays [datafusion-comet]
comphead opened a new pull request, #1592: URL: https://github.com/apache/datafusion-comet/pull/1592 ## Which issue does this PR close? Closes #1551. ## Rationale for this change Fixing STRUCT of ARRAY ## What changes are included in this PR? ## How are these changes tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] ArraySort: support structs [datafusion]
cht42 commented on code in PR #15527: URL: https://github.com/apache/datafusion/pull/15527#discussion_r2023172721 ## datafusion/functions-nested/src/sort.rs: ## @@ -207,9 +208,21 @@ pub fn array_sort_inner(args: &[ArrayRef]) -> Result { valid.append_null(); } else { let arr_ref = list_array.value(i); -let arr_ref = arr_ref.as_ref(); -let sorted_array = compute::sort(arr_ref, sort_option)?; +let sorted_array = match arr_ref.data_type() { +DataType::Struct(_) => { +let sort_columns: Vec = vec![SortColumn { +values: Arc::clone(&arr_ref), +options: sort_option, +}]; +let indices = compute::lexsort_to_indices(&sort_columns, None)?; +compute::take(arr_ref.as_ref(), &indices, None)? +} +_ => { +let arr_ref = arr_ref.as_ref(); +compute::sort(arr_ref, sort_option)? Review Comment: https://github.com/apache/arrow-rs/issues/6911#issuecomment-2562928843 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
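To make the fallback concrete: per the linked arrow-rs issue, `sort`/`sort_to_indices` reject `Struct`, while `lexsort_to_indices` can route through the row format. A standalone sketch of the same `lexsort_to_indices` + `take` pattern the PR uses, assuming a recent arrow-rs:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array, StringArray, StructArray};
use arrow::compute::{lexsort_to_indices, take, SortColumn, SortOptions};
use arrow::datatypes::{DataType, Field};
use arrow::error::ArrowError;

fn main() -> Result<(), ArrowError> {
    // A struct array with two fields; rows compare field-by-field.
    let a: ArrayRef = Arc::new(Int32Array::from(vec![3, 1, 2]));
    let b: ArrayRef = Arc::new(StringArray::from(vec!["c", "a", "b"]));
    let strct: ArrayRef = Arc::new(StructArray::from(vec![
        (Arc::new(Field::new("a", DataType::Int32, false)), a),
        (Arc::new(Field::new("b", DataType::Utf8, false)), b),
    ]));

    // compute::sort does not support structs, so sort indirectly:
    // lexsort_to_indices falls back to the row format for struct columns.
    let indices = lexsort_to_indices(
        &[SortColumn {
            values: Arc::clone(&strct),
            options: Some(SortOptions::default()),
        }],
        None,
    )?;
    let sorted = take(strct.as_ref(), &indices, None)?;
    println!("{sorted:?}");
    Ok(())
}
```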
Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]
alamb commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2770127700 @scsmithr of GlareDB integrated the tpchgen library in glaredb as a table function - https://github.com/GlareDB/glaredb/pull/3549 Which is quite cool ```shell glaredb> .maxrows 10 glaredb> select * from tpch_gen.lineitem(1); [table output elided: lineitem columns l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, ..., l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment; 6001215 rows, 10 shown] ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: update group by columns for merge phase after spill [datafusion]
rluvaton commented on PR #15531: URL: https://github.com/apache/datafusion/pull/15531#issuecomment-2769949735

The CI failures are infra related...
Re: [PR] Add all missing table options to be handled in any order [datafusion-sqlparser-rs]
tomershaniii commented on code in PR #1747: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1747#discussion_r2023395604

## src/parser/mod.rs:

```diff
@@ -7081,18 +7029,243 @@ impl<'a> Parser<'a> {
         if let Token::Word(word) = self.peek_token().token {
             if word.keyword == Keyword::OPTIONS {
-                options = Some(self.parse_options(Keyword::OPTIONS)?);
+                table_options =
+                    CreateTableOptions::Options(self.parse_options(Keyword::OPTIONS)?)
             }
         };
     }
+    if !dialect_of!(self is HiveDialect) && table_options == CreateTableOptions::None {
+        let plain_options = self.parse_plain_options()?;
+        if !plain_options.is_empty() {
+            table_options = CreateTableOptions::Plain(plain_options)
+        }
+    };
+
     Ok(CreateTableConfiguration {
         partition_by,
         cluster_by,
-        options,
+        table_options,
     })
 }

+fn parse_plain_option(&mut self) -> Result<Option<SqlOption>, ParserError> {
+    // Single parameter option
+    if self.parse_keywords(&[Keyword::START, Keyword::TRANSACTION]) {
+        return Ok(Some(SqlOption::Ident(Ident::new("START TRANSACTION"))));
+    }
+
+    // Custom option
+    if self.parse_keywords(&[Keyword::COMMENT]) {
+        let has_eq = self.consume_token(&Token::Eq);
+        let value = self.next_token();
+
+        let comment = match (has_eq, value.token) {
+            (true, Token::SingleQuotedString(s)) => {
+                Ok(Some(SqlOption::Comment(CommentDef::WithEq(s))))
+            }
+            (false, Token::SingleQuotedString(s)) => {
+                Ok(Some(SqlOption::Comment(CommentDef::WithoutEq(s))))
+            }
+            (_, token) => {
+                self.expected("Token::SingleQuotedString", TokenWithSpan::wrap(token))
+            }
+        };
+        return comment;
+    }
+
+    if self.parse_keywords(&[Keyword::ENGINE]) {
+        let _ = self.consume_token(&Token::Eq);
+        let value = self.next_token();
+
+        let engine = match value.token {
+            Token::Word(w) => {
+                let parameters = if self.peek_token() == Token::LParen {
+                    Some(self.parse_parenthesized_identifiers()?)
+                } else {
+                    None
+                };
+
+                Ok(Some(SqlOption::TableEngine(TableEngine {
+                    name: w.value,
+                    parameters,
+                })))
+            }
+            _ => {
+                return self.expected("Token::Word", value)?;
+            }
+        };
+
+        return engine;
+    }
+
+    if self.parse_keywords(&[Keyword::TABLESPACE]) {
+        let _ = self.consume_token(&Token::Eq);
+        let value = self.next_token();
+
+        let tablespace = match value.token {
+            // TABLESPACE tablespace_name [STORAGE DISK] | [TABLESPACE tablespace_name] STORAGE MEMORY
+            Token::Word(Word { value: name, .. }) | Token::SingleQuotedString(name) => {
+                let storage = match self.parse_keyword(Keyword::STORAGE) {
+                    true => {
+                        let _ = self.consume_token(&Token::Eq);
+                        let storage_token = self.next_token();
+                        match &storage_token.token {
+                            Token::Word(w) => match w.value.to_uppercase().as_str() {
+                                "DISK" => Some(StorageType::Disk),
+                                "MEMORY" => Some(StorageType::Memory),
+                                _ => self
+                                    .expected("Storage type (DISK or MEMORY)", storage_token)?,
+                            },
+                            _ => self.expected("Token::Word", storage_token)?,
+                        }
+                    }
+                    false => None,
+                };
+
+                Ok(Some(SqlOption::TableSpace(TablespaceOption {
+                    name,
+                    storage,
+                })))
+            }
+            _ => {
+                return self.expected("Token::Word", value)?;
+            }
+        };
+
+        return tablespace;
+    }
+
+    if self.parse_keyword(Keyword::UNION) {
+        let _ = self.consume_token(&Token::Eq);
+        let value = self.next_token();
+
+        match value.token {
+            // UNION [=] (tbl_name[,tbl_name]...)
+            Token::LParen => {
+                let tables: Vec<Ident> =
+                    self.parse_comma_separated0(Parser::parse
```
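To see what the new parsing path needs to accept, a small hedged driver for sqlparser-rs (my example SQL; which options round-trip depends on the dialect):

```rust
use sqlparser::dialect::MySqlDialect;
use sqlparser::parser::Parser;

fn main() {
    // Table options appearing in arbitrary order after the column list,
    // which this PR aims to handle.
    let sql = "CREATE TABLE t (a INT) COMMENT = 'demo' ENGINE = InnoDB";
    match Parser::parse_sql(&MySqlDialect {}, sql) {
        Ok(statements) => println!("{statements:#?}"),
        Err(e) => eprintln!("parse error: {e}"),
    }
}
```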
Re: [I] Building project takes a *long* time (esp compilation time for `datafusion` core crate) [datafusion]
alamb commented on issue #13814: URL: https://github.com/apache/datafusion/issues/13814#issuecomment-2770840025

It seems like 1.86 won't be released for 3 more days: https://releases.rs/docs/1.86.0/

It would be cool to update and try it
Re: [PR] Improve spill performance: Disable re-validation of spilled files [datafusion]
alamb commented on PR #15454: URL: https://github.com/apache/datafusion/pull/15454#issuecomment-2770842980

Merged up to get latest changes and rerun CI
Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]
mkgada commented on issue #1576: URL: https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2770843418

Thank you so much for looking into this though, @andygrove. I will take this up with the GCP folks.
Re: [I] datafusion-cli: document reading partitioned parquet [datafusion]
alamb closed issue #15309: datafusion-cli: document reading partitioned parquet URL: https://github.com/apache/datafusion/issues/15309
Re: [I] feat: fix schema issues for `native reader - read STRUCT of ARRAY fields` [datafusion-comet]
comphead closed issue #1551: feat: fix schema issues for `native reader - read STRUCT of ARRAY fields` URL: https://github.com/apache/datafusion-comet/issues/1551
Re: [PR] feat: Fix struct of arrays schema issue [datafusion-comet]
comphead commented on PR #1592: URL: https://github.com/apache/datafusion-comet/pull/1592#issuecomment-2770808027

Thanks @andygrove for the review
Re: [I] Make Clickbench Q29 5x faster for datafusion [datafusion]
alamb commented on issue #15524: URL: https://github.com/apache/datafusion/issues/15524#issuecomment-2770823621

I found a duckdb implementation of a seemingly similar optimization: https://github.com/duckdb/duckdb/blob/7912713493b38b1eda162f29b7759d5024989a5f/src/optimizer/sum_rewriter.cpp#L25
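The linked duckdb pass rewrites aggregates of the form `SUM(x + c)` into `SUM(x) + c * COUNT(x)`, so the constant is applied once per group rather than once per row. A quick self-contained check of the identity (my illustration; it ignores overflow, and nulls are skipped the way SQL aggregates skip them):

```rust
fn main() {
    let xs: Vec<Option<i64>> = vec![Some(1), None, Some(3), Some(5)];
    let c = 2;

    // Direct form: SUM(x + c) over the non-null rows.
    let direct: i64 = xs.iter().flatten().map(|x| x + c).sum();

    // Rewritten form: SUM(x) + c * COUNT(x).
    let sum: i64 = xs.iter().flatten().sum();
    let count = xs.iter().flatten().count() as i64;
    let rewritten = sum + c * count;

    assert_eq!(direct, rewritten); // 15 == 9 + 2 * 3
}
```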
Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]
mkgada commented on issue #1576: URL: https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2770838744

I am using a Spark image supplied by GCP Dataproc. Thank you for checking this.
Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]
Rachelint commented on issue #15383: URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2770996279

> I'm considering another approach. Maybe I shouldn't use filter_record_batch 🤔. It filters all the columns iteratively. I should filter the rows when the accumulator calls merge_batch 🤔

I think we also need to filter rows in `GroupValues::intern`, too.
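A hedged sketch of the direction being discussed (my own names, not DataFusion's code): apply the `filter` kernel per column at the point of use instead of materializing a fully filtered `RecordBatch` up front.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, BooleanArray, Int64Array};
use arrow::compute::filter;
use arrow::error::ArrowError;

fn main() -> Result<(), ArrowError> {
    let selection = BooleanArray::from(vec![true, false, true, false]);
    let group_col: ArrayRef = Arc::new(Int64Array::from(vec![1, 2, 3, 4]));
    let agg_col: ArrayRef = Arc::new(Int64Array::from(vec![10, 20, 30, 40]));

    // Instead of filter_record_batch, which copies every column of the
    // batch, filter only the columns the group keys and accumulators use.
    let filtered_keys = filter(group_col.as_ref(), &selection)?;
    let filtered_values = filter(agg_col.as_ref(), &selection)?;

    assert_eq!(filtered_keys.len(), 2);
    assert_eq!(filtered_values.len(), 2);
    Ok(())
}
```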
Re: [PR] feat: respect `batchSize/workerThreads/blockingThreads` configurations for native_iceberg_compat scan [datafusion-comet]
wForget commented on code in PR #1587: URL: https://github.com/apache/datafusion-comet/pull/1587#discussion_r2023897072

## native/core/src/parquet/mod.rs:

```diff
@@ -650,21 +651,29 @@ pub unsafe extern "system" fn Java_org_apache_comet_parquet_Native_initRecordBat
     required_schema: jbyteArray,
     data_schema: jbyteArray,
     session_timezone: jstring,
+    batch_size: jint,
+    worker_threads: jint,
+    blocking_threads: jint,
 ) -> jlong {
     try_unwrap_or_throw(&e, |mut env| unsafe {
-        let task_ctx = TaskContext::default();
+        let session_config = SessionConfig::new().with_batch_size(batch_size as usize);
+        let planer =
```

Review Comment: This planner is used for native filter conversion
Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]
andygrove commented on issue #1576: URL: https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2770840853

I am assuming that the GCP version of Spark has some differences in these internal APIs
Re: [I] Make Clickbench Q29 5x faster for datafusion [datafusion]
zhuqi-lucas commented on issue #15524: URL: https://github.com/apache/datafusion/issues/15524#issuecomment-2770953658

Thank you @Dandandan @alamb for double-checking and confirming!
[PR] fix: fix spark/sql test failures in native_iceberg_compat [datafusion-comet]
parthchandra opened a new pull request, #1593: URL: https://github.com/apache/datafusion-comet/pull/1593

## Which issue does this PR close?

Part of #1542

## Rationale for this change

A bug in the logic of `NativeBatchReader` caused NPEs and array-index-out-of-bounds errors in `native_iceberg_compat` mode. In summary, the old version used `requestedSchema.getColumns` to get the columns to read. However, this returns only the leaf (primitive) columns and does not contain any group fields. So if the query was trying to read a group field (i.e. reading an entire struct instead of just one of the fields of the struct), we would use incorrect column metadata and sometimes even an incorrect number of fields.

## What changes are included in this PR?

The PR changes the logic to use `requestedSchema.getFields`, which returns both group and primitive type fields. The PR also adds additional handling in the `schema_adapter` to allow fields in the `to_type` schema that may not exist in the `from_type` schema.

## How are these changes tested?

New unit tests (based on the tests that were failing in Spark).
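For readers less familiar with the Parquet schema model this fix relies on: the leaf-column list and the top-level field list are different things for nested schemas. A hedged illustration with the Rust `parquet` crate (the Comet reader itself is Java, where the analogous accessors are the `getColumns`/`getFields` calls named above):

```rust
use std::sync::Arc;

use parquet::schema::parser::parse_message_type;
use parquet::schema::types::SchemaDescriptor;

fn main() {
    let message = "
        message spark_schema {
            required group s {
                required int32 a;
                required int32 b;
            }
            required int32 c;
        }";
    let root = Arc::new(parse_message_type(message).unwrap());
    let descr = SchemaDescriptor::new(Arc::clone(&root));

    // Leaf (primitive) columns: s.a, s.b, and c.
    assert_eq!(descr.num_columns(), 3);
    // Top-level fields: the group s and the primitive c.
    assert_eq!(root.get_fields().len(), 2);
}
```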
Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]
mkgada commented on issue #1576: URL: https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2770817710

Update: spun up another cluster on Spark 3.5.3 and used the same prebuilt Comet JAR 0.7.0.

Was able to get past the initial error documented here, but now running into:

`Caused by: java.lang.NoSuchMethodError: 'void org.apache.spark.shuffle.IndexShuffleBlockResolver.writeMetadataFileAndCommit(int, long, long[], long[], java.io.File)'`
Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]
andygrove commented on issue #1576: URL: https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2770828974

> Update: spun up another cluster on Spark 3.5.3 and used the same prebuilt Comet JAR 0.7.0
>
> Was able to get past the initial error documented here but now running into
>
> `Caused by: java.lang.NoSuchMethodError: 'void org.apache.spark.shuffle.IndexShuffleBlockResolver.writeMetadataFileAndCommit(int, long, long[], long[], java.io.File)'`

I checked out Spark v3.5.3 and the method is defined as:

```
  def writeMetadataFileAndCommit(
      shuffleId: Int,
      mapId: Long,
      lengths: Array[Long],
      checksums: Array[Long],
      dataTmp: File): Unit = {
```

It also looks like this method signature has not changed recently. Can you confirm that you are using the open-source Apache Spark v3.5.3 and not a GCP Dataproc version of Spark?
Re: [PR] feat: Fix struct of arrays schema issue [datafusion-comet]
comphead merged PR #1592: URL: https://github.com/apache/datafusion-comet/pull/1592
Re: [PR] feat: respect `batchSize/workerThreads/blockingThreads` configurations for native_iceberg_compat scan [datafusion-comet]
parthchandra commented on code in PR #1587: URL: https://github.com/apache/datafusion-comet/pull/1587#discussion_r2023778763

## native/core/src/parquet/mod.rs:

```diff
@@ -650,21 +651,29 @@ pub unsafe extern "system" fn Java_org_apache_comet_parquet_Native_initRecordBat
     required_schema: jbyteArray,
     data_schema: jbyteArray,
     session_timezone: jstring,
+    batch_size: jint,
+    worker_threads: jint,
+    blocking_threads: jint,
 ) -> jlong {
     try_unwrap_or_throw(&e, |mut env| unsafe {
-        let task_ctx = TaskContext::default();
+        let session_config = SessionConfig::new().with_batch_size(batch_size as usize);
+        let planer =
```

Review Comment: Do we need to create a new planner here? We really only need a session_config (even though it is really not valid for the entire session).
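A hedged sketch of the alternative being raised (my illustration, not the PR's code): a `TaskContext` that carries the desired batch size can be derived from a configured `SessionContext` without constructing a planner.

```rust
use datafusion::prelude::{SessionConfig, SessionContext};

fn main() {
    // Propagate a caller-provided batch size via SessionConfig alone.
    let config = SessionConfig::new().with_batch_size(8192);
    let ctx = SessionContext::new_with_config(config);
    let task_ctx = ctx.task_ctx();
    assert_eq!(task_ctx.session_config().batch_size(), 8192);
}
```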
Re: [PR] datafusion-cli: document reading partitioned parquet [datafusion]
alamb merged PR #15505: URL: https://github.com/apache/datafusion/pull/15505
Re: [PR] Disable sccache action to fix gh cache issue [datafusion]
alamb merged PR #15536: URL: https://github.com/apache/datafusion/pull/15536
Re: [PR] feat: Add Aggregate UDF to FFI crate [datafusion]
CrystalZhou0529 commented on code in PR #14775: URL: https://github.com/apache/datafusion/pull/14775#discussion_r2023878333

## datafusion/ffi/tests/ffi_integration.rs:

```diff
@@ -179,4 +181,103 @@ mod tests {
     Ok(())
 }
+
+#[tokio::test]
+async fn test_ffi_udaf() -> Result<()> {
+    let module = get_module()?;
+
+    let ffi_avg_func =
```

Review Comment: Fixed!
Re: [I] feat: Support read array type using native reader [datafusion-comet]
comphead closed issue #1454: feat: Support read array type using native reader URL: https://github.com/apache/datafusion-comet/issues/1454
[PR] feat: adding more struct/arrays tests [datafusion-comet]
comphead opened a new pull request, #1594: URL: https://github.com/apache/datafusion-comet/pull/1594

## Which issue does this PR close?

Related #1550.

## Rationale for this change

## What changes are included in this PR?

## How are these changes tested?
Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]
acking-you commented on code in PR #15462: URL: https://github.com/apache/datafusion/pull/15462#discussion_r2023116930

## datafusion/physical-expr/src/expressions/binary.rs:

```diff
@@ -805,6 +811,47 @@ impl BinaryExpr {
     }
 }

+/// Check if it meets the short-circuit condition
+/// 1. For the `AND` operator, if the `lhs` results are all `false`
+/// 2. For the `OR` operator, if the `lhs` results are all `true`
+/// 3. Otherwise, it does not meet the short-circuit condition
+fn check_short_circuit(arg: &ColumnarValue, op: &Operator) -> bool {
+    let data_type = arg.data_type();
+    match (data_type, op) {
+        (DataType::Boolean, Operator::And) => {
+            match arg {
+                ColumnarValue::Array(array) => {
+                    if let Ok(array) = as_boolean_array(&array) {
+                        return array.false_count() == array.len();
```

Review Comment:

> Might be overkill, but one _could_ try a sampling approach: Run the loop with the early exit for the first few chunks, and then switch over to the unconditional loop.

Thank you for your suggestion, but if we're only applying conditional checks to the first few blocks, then I feel this optimization might not be meaningful. If nearly all blocks can be filtered out by the preceding filter, the optimization will no longer be effective.

> If we find that this slows down some other performance we could also add some sort of heuristic check to calling false_count / true_count -- like for example if the rhs arg is "complex" (not a Column for example)

I tend to agree with @alamb's point that if the overhead of verification is somewhat unacceptable, adopting some heuristic approaches would be better.
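To make the cost trade-off concrete: the proposed check is a bitmap count over the left-hand result, which only pays off when it lets the engine skip an expensive right-hand expression. A minimal standalone sketch of the all-false test (my illustration):

```rust
use arrow::array::BooleanArray;

/// AND can skip its right side when the left side is all false;
/// the OR case is symmetric, using true_count instead.
fn and_short_circuits(lhs: &BooleanArray) -> bool {
    // false_count scans packed bitmaps, so it is cheap relative to
    // evaluating a complex right-hand expression.
    lhs.false_count() == lhs.len()
}

fn main() {
    let all_false = BooleanArray::from(vec![false; 1024]);
    let mixed = BooleanArray::from(vec![true, false, true]);
    assert!(and_short_circuits(&all_false));
    assert!(!and_short_circuits(&mixed));
}
```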
Re: [I] Spark SQL test failures in native_iceberg_compat mode [datafusion-comet]
mbutrovich commented on issue #1542: URL: https://github.com/apache/datafusion-comet/issues/1542#issuecomment-2769901472

```
catalyst: Passed: Total 6925, Failed 0, Errors 0, Passed 6925, Ignored 5, Canceled 1
core 1:   Failed: Total 8686, Failed 47, Errors 0, Passed 8639, Ignored 277, Canceled 3
core 2:   Failed: Total 2045, Failed 106, Errors 0, Passed 1939, Ignored 360
core 3:   Failed: Total 1394, Failed 24, Errors 0, Passed 1370, Ignored 119, Canceled 15
hive 1:   Failed: Total 2144, Failed 9, Errors 0, Passed 2135, Ignored 38, Canceled 4
hive 2:   Error:  Total 19, Failed 0, Errors 1, Passed 18, Ignored 1, Canceled 4
hive 3:   Passed: Total 1044, Failed 0, Errors 0, Passed 1044, Ignored 13, Canceled 4
```

Counts from https://github.com/apache/datafusion-comet/pull/1541 today. Will track here as we keep updating.