Re: [PR] Enable repartitioning on MemTable. [datafusion]

2025-03-25 Thread via GitHub
wiedld commented on code in PR #15409: URL: https://github.com/apache/datafusion/pull/15409#discussion_r2011599235 ## datafusion/core/tests/memory_limit/mod.rs: ## @@ -455,7 +455,9 @@ async fn test_stringview_external_sort() { .with_memory_pool(Arc::new(FairSpillPool::n

Re: [PR] Enable repartitioning on MemTable. [datafusion]

2025-03-25 Thread via GitHub
wiedld commented on code in PR #15409: URL: https://github.com/apache/datafusion/pull/15409#discussion_r2011688314 ## datafusion/core/tests/fuzz_cases/aggregate_fuzz.rs: ## @@ -520,7 +520,9 @@ async fn group_by_string_test( let expected = compute_counts(&input, column_name)

Re: [PR] Enable repartitioning on MemTable. [datafusion]

2025-03-25 Thread via GitHub
wiedld commented on code in PR #15409: URL: https://github.com/apache/datafusion/pull/15409#discussion_r2011691796 ## datafusion/datasource/src/memory.rs: ## @@ -718,6 +750,181 @@ impl MemorySourceConfig { pub fn original_schema(&self) -> SchemaRef { Arc::clone(&se

Re: [PR] chore(deps): bump log from 0.4.26 to 0.4.27 [datafusion]

2025-03-25 Thread via GitHub
xudong963 merged PR #15410: URL: https://github.com/apache/datafusion/pull/15410 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@data

[PR] chore(deps): bump log from 0.4.26 to 0.4.27 [datafusion]

2025-03-25 Thread via GitHub
dependabot[bot] opened a new pull request, #15410: URL: https://github.com/apache/datafusion/pull/15410 Bumps [log](https://github.com/rust-lang/log) from 0.4.26 to 0.4.27. Release notes Sourced from log's releases (https://github.com/rust-lang/log/releases). 0.4.27 What's

Re: [PR] Enable repartitioning on MemTable. [datafusion]

2025-03-25 Thread via GitHub
wiedld commented on code in PR #15409: URL: https://github.com/apache/datafusion/pull/15409#discussion_r2011766884 ## datafusion/core/tests/physical_optimizer/enforce_distribution.rs: ## @@ -3461,3 +3467,102 @@ fn optimize_away_unnecessary_repartition2() -> Result<()> {

Re: [PR] Add `downcast_to_source` method for `DataSourceExec` [datafusion]

2025-03-25 Thread via GitHub
xudong963 commented on code in PR #15416: URL: https://github.com/apache/datafusion/pull/15416#discussion_r2011873162 ## datafusion/datasource/src/source.rs: ## @@ -230,4 +231,17 @@ impl DataSourceExec { Boundedness::Bounded, ) } + +/// Downcast th

Re: [PR] Refactor: add `FileGroup` structure for `Vec` [datafusion]

2025-03-25 Thread via GitHub
xudong963 commented on PR #15379: URL: https://github.com/apache/datafusion/pull/15379#issuecomment-2750924900 Thanks for your review. I'll merge it after @alamb has another look!

Re: [PR] Add `deserialize` to `BatchSerializer` [datafusion]

2025-03-25 Thread via GitHub
xudong963 commented on code in PR #15411: URL: https://github.com/apache/datafusion/pull/15411#discussion_r2011915258 ## datafusion/datasource/src/write/mod.rs: ## @@ -67,12 +67,97 @@ impl Write for SharedBuffer { } } -/// A trait that defines the methods required for a

[PR] Support Avg distinct for `float64` type [datafusion]

2025-03-25 Thread via GitHub
qazxcdswe123 opened a new pull request, #15413: URL: https://github.com/apache/datafusion/pull/15413 ## Which issue does this PR close? related: #2408 #15356 ## Rationale for this change The original PR is split into 2 parts, one for float64 type and one for float64+d

Re: [PR] Move `DataSink` to `datasource` and add session crate [datafusion]

2025-03-25 Thread via GitHub
berkaysynnada commented on PR #15371: URL: https://github.com/apache/datafusion/pull/15371#issuecomment-2750410384 BTW, we can easily make the graph `datasource ->(imports) catalog -> physical-plan`, but I'm not sure about conditioning `catalog` as it never requires `datasource` ? wdyt @ala

Re: [PR] feat: implement GroupsAccumulator for `count(DISTINCT)` aggr [datafusion]

2025-03-25 Thread via GitHub
Dandandan commented on code in PR #15324: URL: https://github.com/apache/datafusion/pull/15324#discussion_r2011540550 ## datafusion/functions-aggregate/src/count.rs: ## @@ -227,137 +226,37 @@ impl AggregateUDFImpl for Count { return not_impl_err!("COUNT DISTINCT wit

[PR] minor: fix typo [datafusion-comet]

2025-03-25 Thread via GitHub
wForget opened a new pull request, #1570: URL: https://github.com/apache/datafusion-comet/pull/1570 ## Which issue does this PR close? minor fix ## Rationale for this change fix typo ## What changes are included in this PR? ## How are these changes t

Re: [PR] Support Avg distinct for `decimal` and `float` type [datafusion]

2025-03-25 Thread via GitHub
qazxcdswe123 commented on code in PR #15414: URL: https://github.com/apache/datafusion/pull/15414#discussion_r2011713584 ## datafusion/functions-aggregate-common/src/aggregate/avg_distinct/decimal.rs: ## @@ -0,0 +1,230 @@ +// Licensed to the Apache Software Foundation (ASF) unde

[PR] Add `downcast_to_source` method for `DataSourceExec` [datafusion]

2025-03-25 Thread via GitHub
xudong963 opened a new pull request, #15416: URL: https://github.com/apache/datafusion/pull/15416 ## Rationale for this change When I upgraded to df46, it was annoying to do the following (a series of `downcast_ref` calls), and it's also easy to call the wrong method, su

Re: [PR] Support Avg distinct for `decimal` and `float` type [datafusion]

2025-03-25 Thread via GitHub
qazxcdswe123 commented on code in PR #15414: URL: https://github.com/apache/datafusion/pull/15414#discussion_r2011716786 ## datafusion/sqllogictest/test_files/aggregate.slt: ## @@ -6686,3 +6688,35 @@ SELECT a, median(b), arrow_typeof(median(b)) FROM group_median_all_nulls GROUP

Re: [PR] Add `downcast_to_source` method for `DataSourceExec` [datafusion]

2025-03-25 Thread via GitHub
xudong963 commented on code in PR #15416: URL: https://github.com/apache/datafusion/pull/15416#discussion_r2011719252 ## datafusion/datasource/src/source.rs: ## @@ -230,4 +231,17 @@ impl DataSourceExec { Boundedness::Bounded, ) } + +/// Downcast th

Re: [PR] Enable repartitioning on MemTable. [datafusion]

2025-03-25 Thread via GitHub
wiedld commented on code in PR #15409: URL: https://github.com/apache/datafusion/pull/15409#discussion_r2011720446 ## datafusion/core/tests/fuzz_cases/aggregate_fuzz.rs: ## @@ -520,7 +520,9 @@ async fn group_by_string_test( let expected = compute_counts(&input, column_name)

Re: [PR] Use `any` instead of `for_each` [datafusion]

2025-03-25 Thread via GitHub
xudong963 commented on code in PR #15289: URL: https://github.com/apache/datafusion/pull/15289#discussion_r2007191299 ## datafusion/datasource-parquet/src/file_format.rs: ## @@ -839,9 +839,10 @@ pub fn statistics_from_parquet_meta_calc( total_byte_size += row_group_meta

[PR] Support Avg distinct for `decimal` and `float` type [datafusion]

2025-03-25 Thread via GitHub
qazxcdswe123 opened a new pull request, #15414: URL: https://github.com/apache/datafusion/pull/15414 ## Which issue does this PR close? - related: https://github.com/apache/datafusion/pull/15356 https://github.com/apache/datafusion/pull/15413 - Closes #2408 ## Rationale for

[PR] Lia/fix duplicate unqualified [datafusion]

2025-03-25 Thread via GitHub
LiaCastaneda opened a new pull request, #15415: URL: https://github.com/apache/datafusion/pull/15415 ## Which issue does this PR close? - Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes te

Re: [PR] Add support for DISTINCT + ORDER BY in `ARRAY_AGG` [datafusion]

2025-03-25 Thread via GitHub
gabotechs commented on code in PR #14413: URL: https://github.com/apache/datafusion/pull/14413#discussion_r2011672602 ## datafusion/functions-aggregate/src/array_agg.rs: ## @@ -598,146 +656,370 @@ impl OrderSensitiveArrayAggAccumulator { #[cfg(test)] mod tests { use super

Re: [PR] Enable repartitioning on MemTable. [datafusion]

2025-03-25 Thread via GitHub
wiedld commented on code in PR #15409: URL: https://github.com/apache/datafusion/pull/15409#discussion_r2011495377 ## datafusion/datasource/src/memory.rs: ## @@ -902,4 +1108,319 @@ mod tests { Ok(()) } + +fn batch(row_size: usize) -> RecordBatch { +le

Re: [PR] Enable repartitioning on MemTable. [datafusion]

2025-03-25 Thread via GitHub
wiedld commented on code in PR #15409: URL: https://github.com/apache/datafusion/pull/15409#discussion_r2011496396 ## datafusion/datasource/src/memory.rs: ## @@ -902,4 +1108,319 @@ mod tests { Ok(()) } + +fn batch(row_size: usize) -> RecordBatch { +le

Re: [D] More thorough contribution guideline [datafusion]

2025-03-25 Thread via GitHub
GitHub user logan-keede added a comment to the discussion: More thorough contribution guideline Filed https://github.com/apache/datafusion/issues/15408. GitHub link: https://github.com/apache/datafusion/discussions/15365#discussioncomment-12610889 This is an automatically sent email for

Re: [PR] fix: make register_object_store use same session_env as file scan [datafusion-comet]

2025-03-25 Thread via GitHub
wForget commented on code in PR #1555: URL: https://github.com/apache/datafusion-comet/pull/1555#discussion_r2011796258 ## native/core/src/parquet/mod.rs: ## @@ -641,6 +640,8 @@ pub unsafe extern "system" fn Java_org_apache_comet_parquet_Native_initRecordBat session_timezo

Re: [PR] Support Avg distinct for `float64` type [datafusion]

2025-03-25 Thread via GitHub
jayzhan211 commented on code in PR #15413: URL: https://github.com/apache/datafusion/pull/15413#discussion_r2011786304 ## datafusion/functions-aggregate-common/src/aggregate/avg_distinct/numeric.rs: ## @@ -0,0 +1,78 @@ +// Licensed to the Apache Software Foundation (ASF) under o

Re: [PR] Add support for DISTINCT + ORDER BY in `ARRAY_AGG` [datafusion]

2025-03-25 Thread via GitHub
gabotechs commented on code in PR #14413: URL: https://github.com/apache/datafusion/pull/14413#discussion_r2011760705 ## datafusion/functions-aggregate/src/array_agg.rs: ## @@ -131,7 +133,32 @@ impl AggregateUDFImpl for ArrayAgg { let data_type = acc_args.exprs[0].data_

[I] RecordBatchReader respect `COMET_BATCH_SIZE` [datafusion-comet]

2025-03-25 Thread via GitHub
wForget opened a new issue, #1571: URL: https://github.com/apache/datafusion-comet/issues/1571 ### What is the problem the feature request solves? RecordBatchReader respect `COMET_BATCH_SIZE` ### Describe the potential solution _No response_ ### Additional context

Re: [I] Regression: eager evaluation of expressions inside CASE conditional expression [datafusion]

2025-03-25 Thread via GitHub
findepi closed issue #15384: Regression: eager evaluation of expressions inside CASE conditional expression URL: https://github.com/apache/datafusion/issues/15384

Re: [PR] Add `FileScanConfigBuilder` [datafusion]

2025-03-25 Thread via GitHub
mertak-synnada commented on code in PR #15352: URL: https://github.com/apache/datafusion/pull/15352#discussion_r2011755617 ## datafusion-examples/examples/advanced_parquet_index.rs: ## @@ -498,13 +499,15 @@ impl TableProvider for IndexTableProvider { // provide

Re: [PR] Add support for DISTINCT + ORDER BY in `ARRAY_AGG` [datafusion]

2025-03-25 Thread via GitHub
gabotechs commented on code in PR #14413: URL: https://github.com/apache/datafusion/pull/14413#discussion_r2011460176 ## datafusion/functions-aggregate/src/array_agg.rs: ## @@ -598,146 +656,370 @@ impl OrderSensitiveArrayAggAccumulator { #[cfg(test)] mod tests { use super

Re: [PR] Lia/fix duplicate unqualified [datafusion]

2025-03-25 Thread via GitHub
LiaCastaneda closed pull request #15415: Lia/fix duplicate unqualified URL: https://github.com/apache/datafusion/pull/15415

Re: [PR] Enhance Schema adapter to accommodate evolving struct [datafusion]

2025-03-25 Thread via GitHub
kosiew commented on PR #15295: URL: https://github.com/apache/datafusion/pull/15295#issuecomment-2750776919 hi @TheBuilderJR , I looked at #15259 and wanted to try incorporating its `struct_coercion`. But the amended `struct_coercion` caused `sqllogictests` to fail. So I did not

[PR] Clean up hash_join's ExecutionPlan::execute [datafusion]

2025-03-25 Thread via GitHub
ctsk opened a new pull request, #15418: URL: https://github.com/apache/datafusion/pull/15418 ## Rationale for this change + What changes are included in this PR? The logic of HashJoin's `execute` implementation differs from the other operators - The recursive call to the `execute` step of

Re: [PR] fix: isCometEnabled name conflict [datafusion-comet]

2025-03-25 Thread via GitHub
kazuyukitanimura commented on code in PR #1569: URL: https://github.com/apache/datafusion-comet/pull/1569#discussion_r2012813465 ## spark/src/test/scala/org/apache/comet/CometSparkSessionExtensionsSuite.scala: ## @@ -27,24 +27,24 @@ class CometSparkSessionExtensionsSuite extends

Re: [I] Add selection vector repartitioning [datafusion]

2025-03-25 Thread via GitHub
Dandandan commented on issue #15420: URL: https://github.com/apache/datafusion/issues/15420#issuecomment-2752228227 Oh you're right @goldmedal , the formula is `hash % total_partition == current_partition`
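
The formula quoted in this comment can be illustrated with a minimal sketch. It assumes one precomputed `u64` hash per row and a non-zero partition count; the helper name `selection_for_partition` is hypothetical and is not DataFusion API. For a given output partition, it collects the indices of the rows for which `hash % total_partition == current_partition` holds, which is roughly what a selection vector for that partition would contain.

```rust
/// Hypothetical helper (not part of DataFusion): returns the indices of the
/// rows that belong to `current_partition`, given one precomputed hash per row.
/// Assumes `total_partitions` is non-zero.
fn selection_for_partition(
    hashes: &[u64],
    total_partitions: u64,
    current_partition: u64,
) -> Vec<usize> {
    hashes
        .iter()
        .enumerate()
        // Keep a row when `hash % total_partition == current_partition`.
        .filter(|&(_, &hash)| hash % total_partitions == current_partition)
        .map(|(row, _)| row)
        .collect()
}
```

Each row index lands in exactly one partition's selection, so the per-partition selections together cover the batch without copying any row data.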

Re: [PR] Support Avg distinct for `decimal` and `float` type [datafusion]

2025-03-25 Thread via GitHub
qazxcdswe123 commented on code in PR #15414: URL: https://github.com/apache/datafusion/pull/15414#discussion_r2011714588 ## datafusion/functions-aggregate-common/src/aggregate/avg_distinct/decimal.rs: ## @@ -0,0 +1,230 @@ +// Licensed to the Apache Software Foundation (ASF) unde

Re: [PR] Add DataFusion 44.0.0 blog post [datafusion-site]

2025-03-25 Thread via GitHub
alamb closed pull request #47: Add DataFusion 44.0.0 blog post URL: https://github.com/apache/datafusion-site/pull/47

Re: [PR] fix: Making shuffle files generated in native shuffle mode reclaimable [datafusion-comet]

2025-03-25 Thread via GitHub
mbutrovich commented on PR #1568: URL: https://github.com/apache/datafusion-comet/pull/1568#issuecomment-2752360918 Thank you for the good fix! I'll take a pass through this.

Re: [PR] Add regexp_extract func [datafusion]

2025-03-25 Thread via GitHub
SKY-ALIN commented on code in PR #14282: URL: https://github.com/apache/datafusion/pull/14282#discussion_r2012956967 ## datafusion/functions/src/regex/regexpextract.rs: ## @@ -0,0 +1,322 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor li

Re: [PR] fix: isCometEnabled name conflict [datafusion-comet]

2025-03-25 Thread via GitHub
comphead commented on code in PR #1569: URL: https://github.com/apache/datafusion-comet/pull/1569#discussion_r2013066694 ## spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala: ## @@ -976,7 +976,7 @@ class CometSparkSessionExtensions } // We s

Re: [I] Remote HDFS tests with Minikube [datafusion-comet]

2025-03-25 Thread via GitHub
comphead commented on issue #1367: URL: https://github.com/apache/datafusion-comet/issues/1367#issuecomment-2752783627 Closed by #1515

Re: [I] Remote HDFS tests with Minikube [datafusion-comet]

2025-03-25 Thread via GitHub
comphead closed issue #1367: Remote HDFS tests with Minikube URL: https://github.com/apache/datafusion-comet/issues/1367

Re: [PR] perf: unwrap cast for comparing ints =/!= strings [datafusion]

2025-03-25 Thread via GitHub
alan910127 commented on PR #15110: URL: https://github.com/apache/datafusion/pull/15110#issuecomment-2753032125 > Perhaps you can revert the changes to type_coercsion.rs in this PR so we can get it merged and then make a follow on PR to address #15161 ? OK!

Re: [PR] Add `deserialize` to `BatchSerializer` [datafusion]

2025-03-25 Thread via GitHub
jayzhan-synnada commented on PR #15411: URL: https://github.com/apache/datafusion/pull/15411#issuecomment-2753048100 `BatchDeserializer` can do what I want so we don't need this. Maybe we can fix the code structure, it would be nice that they are close together and clear that we can d

Re: [PR] Draft: Use take-in kernel in repartitioning [datafusion]

2025-03-25 Thread via GitHub
alamb commented on PR #15392: URL: https://github.com/apache/datafusion/pull/15392#issuecomment-2752606235 Hi @ctsk -- is this PR ready for running some benchmarks?

[PR] Alamb/default explain plan [datafusion]

2025-03-25 Thread via GitHub
alamb opened a new pull request, #15427: URL: https://github.com/apache/datafusion/pull/15427 ## Which issue does this PR close? - Closes https://github.com/apache/datafusion/issues/15343 ## Rationale for this change We have nice explain plans, let's default DataF

Re: [PR] [Minor] Remove/reorder logical plan rules [datafusion]

2025-03-25 Thread via GitHub
jayzhan211 merged PR #15421: URL: https://github.com/apache/datafusion/pull/15421

Re: [PR] Refactor: add `FileGroup` structure for `Vec` [datafusion]

2025-03-25 Thread via GitHub
jayzhan211 commented on PR #15379: URL: https://github.com/apache/datafusion/pull/15379#issuecomment-2752868339 Thanks @xudong963 @alamb

Re: [PR] fix: isCometEnabled name conflict [datafusion-comet]

2025-03-25 Thread via GitHub
kazuyukitanimura merged PR #1569: URL: https://github.com/apache/datafusion-comet/pull/1569

Re: [PR] fix: isCometEnabled name conflict [datafusion-comet]

2025-03-25 Thread via GitHub
kazuyukitanimura commented on PR #1569: URL: https://github.com/apache/datafusion-comet/pull/1569#issuecomment-2752978026 Thanks merged @parthchandra @comphead

Re: [PR] [Minor] Remove/reorder logical plan rules [datafusion]

2025-03-25 Thread via GitHub
jayzhan211 commented on PR #15421: URL: https://github.com/apache/datafusion/pull/15421#issuecomment-2752939441 Thanks @Dandandan @alamb

Re: [PR] fix: isCometEnabled name conflict [datafusion-comet]

2025-03-25 Thread via GitHub
parthchandra commented on code in PR #1569: URL: https://github.com/apache/datafusion-comet/pull/1569#discussion_r2013108125 ## spark/src/test/scala/org/apache/comet/CometSparkSessionExtensionsSuite.scala: ## @@ -27,24 +27,24 @@ class CometSparkSessionExtensionsSuite extends Co

Re: [I] Sort query won't get round-robin repartitioned if input is `MemTable` [datafusion]

2025-03-25 Thread via GitHub
wiedld commented on issue #15088: URL: https://github.com/apache/datafusion/issues/15088#issuecomment-2752857144 I believe this is ready for review: https://github.com/apache/datafusion/pull/15409

Re: [PR] Refactor: add `FileGroup` structure for `Vec` [datafusion]

2025-03-25 Thread via GitHub
jayzhan211 merged PR #15379: URL: https://github.com/apache/datafusion/pull/15379

Re: [PR] Enforce JOIN plan to require condition [datafusion]

2025-03-25 Thread via GitHub
goldmedal commented on PR #15334: URL: https://github.com/apache/datafusion/pull/15334#issuecomment-2752894014 Thanks @comphead and @alamb 👍

Re: [PR] feat: introduce hadoop mini cluster to test native scan on hdfs [datafusion-comet]

2025-03-25 Thread via GitHub
kazuyukitanimura commented on PR #1556: URL: https://github.com/apache/datafusion-comet/pull/1556#issuecomment-2752660990 Merged thanks @wForget

Re: [I] Enable hdfs test(s) in ci [datafusion-comet]

2025-03-25 Thread via GitHub
kazuyukitanimura closed issue #1515: Enable hdfs test(s) in ci URL: https://github.com/apache/datafusion-comet/issues/1515

Re: [PR] refactor(hash_join): Move JoinHashMap to separate mod [datafusion]

2025-03-25 Thread via GitHub
alamb commented on PR #15419: URL: https://github.com/apache/datafusion/pull/15419#issuecomment-2752505446 Looks good to me too -- thank you @ctsk and @Weijun-H FYI @comphead and @Dandandan

Re: [I] Dynamic pruning filters from TopK state [datafusion]

2025-03-25 Thread via GitHub
adriangb commented on issue #15037: URL: https://github.com/apache/datafusion/issues/15037#issuecomment-2752376477 We already have Statistics on PartitionedFile so we could potentially use Dynamic filters to prune based on those before opening the file
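
A rough sketch of the idea in this comment, under stated assumptions: the query is `ORDER BY col DESC LIMIT k`, the TopK heap is already full, and each file carries a known maximum for `col`. `FileStats` and `files_to_open` are hypothetical stand-ins, not DataFusion's `Statistics` or `PartitionedFile` types. A file whose maximum is not above the current TopK threshold cannot contribute a row, so it could be skipped without being opened.

```rust
/// Hypothetical stand-in for per-file statistics on the sort column;
/// DataFusion's real `Statistics` type is richer than this.
struct FileStats {
    path: &'static str,
    max: i64,
}

/// Returns the paths of the files that may still contain a row with
/// `col > threshold`, where `threshold` is the smallest value currently
/// held in the (assumed full) TopK heap.
fn files_to_open(files: &[FileStats], threshold: i64) -> Vec<&'static str> {
    files
        .iter()
        // A file is worth opening only if its maximum can beat the threshold.
        .filter(|f| f.max > threshold)
        .map(|f| f.path)
        .collect()
}
```

Because the threshold only rises as the TopK heap fills with larger values, re-evaluating the filter before each file is opened can only prune more files, never fewer.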

[PR] Docs: Added extra resources & fixed formatting to Concepts, Readings, Events section [datafusion]

2025-03-25 Thread via GitHub
2SpaceMasterRace opened a new pull request, #15424: URL: https://github.com/apache/datafusion/pull/15424 ## Which issue does this PR close? NIL. ## Rationale for this change This PR improves the overall quality and accessibility of the documentation by: - Enhancing re

Re: [I] Migrate subtrait tests to `insta` [datafusion]

2025-03-25 Thread via GitHub
alamb commented on issue #15398: URL: https://github.com/apache/datafusion/issues/15398#issuecomment-2752419868 I think this is a good first issue as it is well explained and doesn't need deep DataFusion knowledge, so marking as such

Re: [I] Migrate optimizer tests to `insta` [datafusion]

2025-03-25 Thread via GitHub
alamb commented on issue #15396: URL: https://github.com/apache/datafusion/issues/15396#issuecomment-2752419528 I think this is a good first issue as it is well explained and doesn't need deep DataFusion knowledge, so marking as such

Re: [I] Migrate `datafusion/sql` tests to `insta` [datafusion]

2025-03-25 Thread via GitHub
alamb commented on issue #15397: URL: https://github.com/apache/datafusion/issues/15397#issuecomment-2752420168 I think this is a good first issue as it is well explained and doesn't need deep DataFusion knowledge, so marking as such

Re: [PR] Blog post for DataFusion 46.0.0 [datafusion-site]

2025-03-25 Thread via GitHub
kevinjqliu commented on code in PR #64: URL: https://github.com/apache/datafusion-site/pull/64#discussion_r2012878189 ## content/blog/2025-03-24-datafusion-46.0.0.md: ## @@ -0,0 +1,96 @@ +--- +layout: post +title: Apache DataFusion 46.0.0 Released +date: 2025-03-24 +author: Oznu

Re: [PR] Support Avg distinct for `float64` type [datafusion]

2025-03-25 Thread via GitHub
Omega359 commented on PR #15413: URL: https://github.com/apache/datafusion/pull/15413#issuecomment-2752424919 the sqlite tests need updating prior to this issue being pushed to main. Confirmed test failures, here are a few examples: ``` External error: query is expected to fail wit

[I] Avoid evaluating filters when they can be discarded purely from statistics [datafusion]

2025-03-25 Thread via GitHub
adriangb opened a new issue, #15425: URL: https://github.com/apache/datafusion/issues/15425 ### Is your feature request related to a problem or challenge? Currently stats filter pruning (both at the row group and page level) has one of two outcomes per container: 1. This container

Re: [PR] chore(deps): bump the arrow-parquet group with 2 updates [datafusion]

2025-03-25 Thread via GitHub
alamb commented on code in PR #15376: URL: https://github.com/apache/datafusion/pull/15376#discussion_r2012880021 ## Cargo.toml: ## @@ -90,15 +90,15 @@ arrow = { version = "54.2.1", features = [ "prettyprint", "chrono-tz", ] } -arrow-buffer = { version = "54.1.0", def

Re: [PR] Support Avg distinct for `decimal` and `float` type [datafusion]

2025-03-25 Thread via GitHub
Omega359 commented on PR #15414: URL: https://github.com/apache/datafusion/pull/15414#issuecomment-2752381318 FYI, the sqlite tests have a number of avg(distinct) queries that will need to be updated. See the [README](https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/RE

Re: [I] Add CI check for `cargo-semver-checks` [datafusion]

2025-03-25 Thread via GitHub
logan-keede commented on issue #15408: URL: https://github.com/apache/datafusion/issues/15408#issuecomment-2752381916 We can customise [this github action](https://github.com/obi1kenobi/cargo-semver-checks-action) to suit our needs. I am already giving it a shot. It will save me(or

Re: [PR] FIX : some benchmarks are failing [datafusion]

2025-03-25 Thread via GitHub
alamb commented on code in PR #15367: URL: https://github.com/apache/datafusion/pull/15367#discussion_r2012933341 ## datafusion/core/benches/distinct_query_sql.rs: ## @@ -144,59 +141,50 @@ pub async fn create_context_sampled_data( } fn criterion_benchmark_limited_distinct_sa

Re: [PR] Add `FileScanConfigBuilder` [datafusion]

2025-03-25 Thread via GitHub
AdamGS commented on PR #15352: URL: https://github.com/apache/datafusion/pull/15352#issuecomment-2752200316 my 2c - this looks great, I would love to also rename `FileScanConfig` to something else but I get the backwards compatibility concerns.

Re: [I] Scalars are too verbose in column name output [datafusion]

2025-03-25 Thread via GitHub
alamb commented on issue #15395: URL: https://github.com/apache/datafusion/issues/15395#issuecomment-2752511719 Here is a related ticket: - #2027 > Created an issue to ask what others think - the change is quite simple, but maybe there were previous discussions or consequence

Re: [I] Feature is not implemeneted: Unsupported cast with list of structs [datafusion]

2025-03-25 Thread via GitHub
alamb commented on issue #15338: URL: https://github.com/apache/datafusion/issues/15338#issuecomment-2752525764 - Related discussion: https://github.com/apache/arrow-rs/issues/7176 I think @kosiew may have a PR up that is related - https://github.com/apache/datafusion/pull/15295 -

[PR] minor: Add new crates to labeler [datafusion]

2025-03-25 Thread via GitHub
logan-keede opened a new pull request, #15426: URL: https://github.com/apache/datafusion/pull/15426 ## Which issue does this PR close? - None. just a small update to labeler. ## Rationale for this change labeler.yml does not include new `datasource-*` crates

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-25 Thread via GitHub
adriangb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2752608813 I think you may want to test `this PR, dynamic_filter` as well (no `pushdown_filters`). Since it also does stats pushdown it may still help. Btw I do not pretend for this to b

Re: [PR] minor: Add new crates to labeler [datafusion]

2025-03-25 Thread via GitHub
logan-keede commented on PR #15426: URL: https://github.com/apache/datafusion/pull/15426#issuecomment-2752621296 It needs to be merged first if https://github.com/apache/datafusion/pull/8431 this proves anything. 🤦‍♂️ let's wait for https://github.com/apache/datafusion/pull/15371. We are

Re: [I] Avoid evaluating filters when they can be discarded purely from statistics [datafusion]

2025-03-25 Thread via GitHub
adriangb commented on issue #15425: URL: https://github.com/apache/datafusion/issues/15425#issuecomment-2752626326 I think that has more to do with the `datafusion.execution.collect_statistics` setting and use of those statistics

Re: [PR] feat: enable iceberg compat tests, more tests for complex types [datafusion-comet]

2025-03-25 Thread via GitHub
kazuyukitanimura commented on code in PR #1550: URL: https://github.com/apache/datafusion-comet/pull/1550#discussion_r2012997384 ## spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala: ## @@ -2722,7 +2721,11 @@ object QueryPlanSerde extends Logging with ShimQueryPl

Re: [PR] Blog post for DataFusion 46.0.0 [datafusion-site]

2025-03-25 Thread via GitHub
comphead commented on code in PR #64: URL: https://github.com/apache/datafusion-site/pull/64#discussion_r2013064081 ## content/blog/2025-03-24-datafusion-46.0.0.md: ## @@ -0,0 +1,96 @@ +--- +layout: post +title: Apache DataFusion 46.0.0 Released +date: 2025-03-24 +author: Oznur

[I] select distinct on map types yields ArrowError NotYetImplemented [datafusion]

2025-03-25 Thread via GitHub
gz opened a new issue, #15428: URL: https://github.com/apache/datafusion/issues/15428 ### Describe the bug querying a table/parquet file with the following schema and query fails: ``` create table t ( m MAP NOT NULL ); ``` ``` let df = ctx.sql("SELECT d

Re: [PR] feat: pushdown filter for native_iceberg_compat [datafusion-comet]

2025-03-25 Thread via GitHub
mbutrovich commented on PR #1566: URL: https://github.com/apache/datafusion-comet/pull/1566#issuecomment-2752361360 Thank you for taking a look at this! I'll take a pass through this.

Re: [PR] Chore: simplify array related functions impl [datafusion-comet]

2025-03-25 Thread via GitHub
kazuyukitanimura merged PR #1490: URL: https://github.com/apache/datafusion-comet/pull/1490

Re: [PR] Chore: simplify array related functions impl [datafusion-comet]

2025-03-25 Thread via GitHub
kazuyukitanimura commented on PR #1490: URL: https://github.com/apache/datafusion-comet/pull/1490#issuecomment-2752337649 Merged thanks @kazantsev-maksim

Re: [PR] Triggering extended tests through PR comment [datafusion]

2025-03-25 Thread via GitHub
alamb commented on PR #15101: URL: https://github.com/apache/datafusion/pull/15101#issuecomment-2752311863 Thanks @danila-b -- you are totally right. I started a new test and we'll see how that goes - https://github.com/alamb/datafusion/pull/33

Re: [I] Incorrect null handling in fast shuffle encoder [datafusion-comet]

2025-03-25 Thread via GitHub
kazuyukitanimura closed issue #1520: Incorrect null handling in fast shuffle encoder URL: https://github.com/apache/datafusion-comet/issues/1520

Re: [PR] fix: Taking slicing into account when writing BooleanBuffers as fast-encoding format [datafusion-comet]

2025-03-25 Thread via GitHub
kazuyukitanimura merged PR #1522: URL: https://github.com/apache/datafusion-comet/pull/1522

Re: [PR] fix: Taking slicing into account when writing BooleanBuffers as fast-encoding format [datafusion-comet]

2025-03-25 Thread via GitHub
kazuyukitanimura commented on PR #1522: URL: https://github.com/apache/datafusion-comet/pull/1522#issuecomment-2752339224 Thanks @Kontinuation merged

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-25 Thread via GitHub
alamb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2752649012 > Btw I do not pretend for this to be in a "finished" state, I think it needs more tests, etc. I just want to get a general 👍🏻 or 👎🏻 on the approach and some guidance around testing, b

Re: [I] Support zero copy hash repartitioning for Hash Join [datafusion]

2025-03-25 Thread via GitHub
zebsme commented on issue #15382: URL: https://github.com/apache/datafusion/issues/15382#issuecomment-2752247061 > > > * Add a mode that outputs selection vectors (for now let's use dense boolean arrays so it can be added to `RecordBatch`) in `RepartitionExec`. The array outputs `true`

Re: [PR] Add support for DISTINCT + ORDER BY in `ARRAY_AGG` [datafusion]

2025-03-25 Thread via GitHub
alamb commented on code in PR #14413: URL: https://github.com/apache/datafusion/pull/14413#discussion_r2012769580 ## datafusion/functions-aggregate/src/array_agg.rs: ## @@ -131,7 +133,32 @@ impl AggregateUDFImpl for ArrayAgg { let data_type = acc_args.exprs[0].data_type
