[PR] chore(deps): bump aws-config from 1.6.0 to 1.6.1 [datafusion]

2025-03-28 Thread via GitHub
dependabot[bot] opened a new pull request, #15470: URL: https://github.com/apache/datafusion/pull/15470 Bumps [aws-config](https://github.com/smithy-lang/smithy-rs) from 1.6.0 to 1.6.1. Commits: See full diff in [compare view](https://github.com/smithy-lang/smithy-rs/commits)

[I] The average time compute for clickbench query should not inside the query iterator [datafusion]

2025-03-28 Thread via GitHub
zhuqi-lucas opened a new issue, #15471: URL: https://github.com/apache/datafusion/issues/15471 ### Describe the bug The average time computation for the clickbench query should not be inside the query iterator. It was mistakenly added inside the iterator. ### To Reproduce _N
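
As an illustration of the fix the issue asks for, here is a minimal sketch of a benchmark loop that collects timings in a fresh vector for each query and computes the average once, after the iteration loop has finished. The names `time_query` and `average`, the query labels, and the closure standing in for query execution are all hypothetical; this is not the code from the DataFusion benchmark runner.

```rust
use std::time::Instant;

/// Runs `iterations` timed repetitions of `run_query` and returns the
/// per-iteration timings in milliseconds. A new vector is created on every
/// call, so timings from one query never leak into the next.
fn time_query<F: FnMut()>(iterations: usize, mut run_query: F) -> Vec<f64> {
    let mut millis = Vec::with_capacity(iterations);
    for _ in 0..iterations {
        let start = Instant::now();
        run_query();
        millis.push(start.elapsed().as_secs_f64() * 1000.0);
        // Note: no average is computed here, inside the iteration loop.
    }
    millis
}

/// Arithmetic mean of the timings; returns 0.0 for an empty slice.
fn average(millis: &[f64]) -> f64 {
    if millis.is_empty() {
        return 0.0;
    }
    millis.iter().sum::<f64>() / millis.len() as f64
}

fn main() {
    // Hypothetical query labels; the closure is a placeholder for real work.
    let queries = ["q0", "q1"];
    for q in queries {
        let millis = time_query(3, || {
            std::hint::black_box(q.len());
        });
        // The average is computed once per query, after all iterations ran.
        let avg = average(&millis);
        println!("Query {q} avg time: {avg:.2} ms");
    }
}
```

Keeping the timing vector local to each query means the reported average reflects only that query's iterations.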

Re: [I] Enable `split_file_groups_by_statistics` by default [datafusion]

2025-03-28 Thread via GitHub
xudong963 commented on issue #10336: URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2760506147 Fyi, I'm working on it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the spe

Re: [PR] fix: the average time for clickbench query compute should outside the iterator loop [datafusion]

2025-03-28 Thread via GitHub
zhuqi-lucas commented on PR #15472: URL: https://github.com/apache/datafusion/pull/15472#issuecomment-2760512944 cc @xudong963 @2010YOUY01

Re: [I] The average time compute for clickbench query should not inside the query iterator [datafusion]

2025-03-28 Thread via GitHub
zhuqi-lucas commented on issue #15471: URL: https://github.com/apache/datafusion/issues/15471#issuecomment-2760505849 take

[PR] Improve `split_groups_by_statistics` method [datafusion]

2025-03-28 Thread via GitHub
xudong963 opened a new pull request, #15473: URL: https://github.com/apache/datafusion/pull/15473 ## Which issue does this PR close? - Closes https://github.com/apache/datafusion/issues/10336#issuecomment-2758082825 ## Rationale for this change As @surema

Re: [PR] Improve `split_groups_by_statistics` method [datafusion]

2025-03-28 Thread via GitHub
xudong963 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018151341 ## datafusion/datasource/src/file_scan_config.rs: ## @@ -575,6 +575,95 @@ impl FileScanConfig { }) } +/// Splits file groups into new groups

[PR] Partial fix for #1078 — [Add Dataframe display config] [datafusion-python]

2025-03-28 Thread via GitHub
kosiew opened a new pull request, #1086: URL: https://github.com/apache/datafusion-python/pull/1086 ## Which issue does this PR close? Partial fix for #1078 ## Rationale for this change This PR adds configurable display settings for `DataFrame` representations in the Pyt

Re: [PR] Add short circuit [datafusion]

2025-03-28 Thread via GitHub
ctsk commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2760574284 I think one issue is that the short-circuit logic is not handling cases where the `rhs` contains NULLs. E.g. `true OR NULL` needs to evaluate to `NULL`

[PR] fix: the average time for clickbench query compute should outside the iterator loop [datafusion]

2025-03-28 Thread via GitHub
zhuqi-lucas opened a new pull request, #15472: URL: https://github.com/apache/datafusion/pull/15472 ## Which issue does this PR close? - Closes [#15471](https://github.com/apache/datafusion/issues/15471) ## Rationale for this change the average time for clickbench query c

Re: [PR] chore(deps): bump aws-config from 1.6.0 to 1.6.1 [datafusion]

2025-03-28 Thread via GitHub
xudong963 merged PR #15470: URL: https://github.com/apache/datafusion/pull/15470

Re: [PR] fix: the average time for clickbench query compute should use new vec to make it compute for each query [datafusion]

2025-03-28 Thread via GitHub
2010YOUY01 merged PR #15472: URL: https://github.com/apache/datafusion/pull/15472

Re: [I] The average time compute for clickbench query is wrong [datafusion]

2025-03-28 Thread via GitHub
2010YOUY01 closed issue #15471: The average time compute for clickbench query is wrong URL: https://github.com/apache/datafusion/issues/15471

[PR] Partial fix for 1078: [refactor: Simplify HTML generation in PyDataFrame by extracting helper functions] [datafusion-python]

2025-03-28 Thread via GitHub
kosiew opened a new pull request, #1087: URL: https://github.com/apache/datafusion-python/pull/1087 # Which issue does this PR close? Partial fix for #1078 # Rationale for this change > Split up some of the html generation into a set of helper functions. The render

Re: [I] [Bug] datafusion-cli may fail to read csv files [datafusion]

2025-03-28 Thread via GitHub
niebayes commented on issue #15456: URL: https://github.com/apache/datafusion/issues/15456#issuecomment-2760761409 The line number in the error message is the row index of a certain record batch, not the line number in the csv file. I have filed an issue to arrow-rs for making this error me

Re: [I] [Bug] datafusion-cli may fail to read csv files [datafusion]

2025-03-28 Thread via GitHub
niebayes commented on issue #15456: URL: https://github.com/apache/datafusion/issues/15456#issuecomment-2760766577 > why there are two head rows I didn't find this. You might find the cause.

[PR] Update ClickBench queries to avoid to_timestamp_seconds [datafusion]

2025-03-28 Thread via GitHub
acking-you opened a new pull request, #15475: URL: https://github.com/apache/datafusion/pull/15475 ## Which issue does this PR close? - Closes #15465 . ## Rationale for this change ## What changes are included in this PR? ## Are these change

Re: [PR] Add short circuit [datafusion]

2025-03-28 Thread via GitHub
acking-you commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2761137454 > I think one issue is that the short-circuit logic is not handling cases where the `rhs` contains NULLs. E.g. `true OR NULL` needs to evaluate to `NULL` Thank you very

Re: [I] DeltaLake integration not working (Python) (FFI Table providers not working) [datafusion-python]

2025-03-28 Thread via GitHub
alamb commented on issue #1077: URL: https://github.com/apache/datafusion-python/issues/1077#issuecomment-2742876701 FYI @timsaucer

Re: [PR] Remove CoalescePartitions insertion from HashJoinExec [datafusion]

2025-03-28 Thread via GitHub
comphead commented on PR #15476: URL: https://github.com/apache/datafusion/pull/15476#issuecomment-2762263935 > Note that this does break for users of HashJoinExec that > > * Use the CollectLeft mode, with >1 partition on the build side AND > * Construct their physical plan without

Re: [PR] chore: Override node name for CometSparkToColumnar [datafusion-comet]

2025-03-28 Thread via GitHub
codecov-commenter commented on PR #1577: URL: https://github.com/apache/datafusion-comet/pull/1577#issuecomment-2762138134 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1577?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] Remove CoalescePartitions insertion from HashJoinExec [datafusion]

2025-03-28 Thread via GitHub
ctsk commented on PR #15476: URL: https://github.com/apache/datafusion/pull/15476#issuecomment-2762302514 Before this PR, if someone hand-wired a CollectLeft HashJoin where the left child has more than one output partition, the HashJoin would automatically add a CoalescePartitions exec. Thi

Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]

2025-03-28 Thread via GitHub
friendlymatthew commented on code in PR #15361: URL: https://github.com/apache/datafusion/pull/15361#discussion_r2019270254 ## datafusion/functions/src/datetime/to_char.rs: ## @@ -277,7 +282,25 @@ fn _to_char_array(args: &[ColumnarValue]) -> Result { let result = forma

Re: [PR] feat: enable iceberg compat tests, more tests for complex types [datafusion-comet]

2025-03-28 Thread via GitHub
parthchandra commented on code in PR #1550: URL: https://github.com/apache/datafusion-comet/pull/1550#discussion_r2019242229 ## spark/src/main/scala/org/apache/spark/sql/comet/CometScanExec.scala: ## @@ -490,8 +490,7 @@ object CometScanExec extends DataTypeSupport { // TO

Re: [I] Dynamic pruning filters from TopK state [datafusion]

2025-03-28 Thread via GitHub
alamb commented on issue #15037: URL: https://github.com/apache/datafusion/issues/15037#issuecomment-2762326990 @adriangb and I had a discussion about https://github.com/apache/datafusion/pull/15301 here are some notes: ## Usecases: - TopK dynamic filter pushdown - Prune f

Re: [PR] feat: pushdown filter for native_iceberg_compat [datafusion-comet]

2025-03-28 Thread via GitHub
mbutrovich commented on PR #1566: URL: https://github.com/apache/datafusion-comet/pull/1566#issuecomment-2762276704 > Looks good, do we still need to wait for arrow-rs based on [#1566 (comment)](https://github.com/apache/datafusion-comet/pull/1566#issuecomment-2748737890) ? We can m

Re: [PR] minor: Allow to run TPCH bench for a specific query [datafusion]

2025-03-28 Thread via GitHub
comphead commented on PR #15467: URL: https://github.com/apache/datafusion/pull/15467#issuecomment-276152 Thanks @xudong963

Re: [PR] Add `FileScanConfigBuilder` [datafusion]

2025-03-28 Thread via GitHub
alamb merged PR #15352: URL: https://github.com/apache/datafusion/pull/15352

Re: [PR] feat: pushdown filter for native_iceberg_compat [datafusion-comet]

2025-03-28 Thread via GitHub
kazuyukitanimura commented on PR #1566: URL: https://github.com/apache/datafusion-comet/pull/1566#issuecomment-2761992952 Looks good, do we still need to wait for arrow-rs based on https://github.com/apache/datafusion-comet/pull/1566#issuecomment-2748737890 ?

Re: [I] Organize fields inside `SortMergeJoinStream` [datafusion]

2025-03-28 Thread via GitHub
suibianwanwank commented on issue #15406: URL: https://github.com/apache/datafusion/issues/15406#issuecomment-2762027162 Hi, @2010YOUY01. I have read and tried to understand both SortMergeJoinStream and GroupedHashAggregateStream (though I still have some uncertainties). I have some initial

Re: [I] Support RangePartitioning with native shuffle [datafusion-comet]

2025-03-28 Thread via GitHub
jinwenjie123 commented on issue #458: URL: https://github.com/apache/datafusion-comet/issues/458#issuecomment-2762497216 Hi @andygrove May I ask why we decided not to support RangePartitioning? And will it be supported in the near future? Thanks

Re: [I] Feature is not implemeneted: Unsupported cast with list of structs [datafusion]

2025-03-28 Thread via GitHub
ion-elgreco commented on issue #15338: URL: https://github.com/apache/datafusion/issues/15338#issuecomment-2762411530 > - Related discussion: https://github.com/apache/arrow-rs/issues/7176 > > I think @kosiew may have a PR up that is related > - https://github.com/apache/datafusion/

Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]

2025-03-28 Thread via GitHub
Omega359 commented on code in PR #15361: URL: https://github.com/apache/datafusion/pull/15361#discussion_r2019195514 ## datafusion/functions/src/datetime/to_char.rs: ## @@ -277,7 +282,25 @@ fn _to_char_array(args: &[ColumnarValue]) -> Result { let result = formatter.va

Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]

2025-03-28 Thread via GitHub
Omega359 commented on code in PR #15361: URL: https://github.com/apache/datafusion/pull/15361#discussion_r2019198451 ## datafusion/functions/src/datetime/to_char.rs: ## @@ -277,7 +282,25 @@ fn _to_char_array(args: &[ColumnarValue]) -> Result { let result = formatter.va

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-28 Thread via GitHub
adriangb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2762301099 From discussion with Andrew here are a couple notes: - The most granular update frequency of the filters is when the TopK itself updates, so we should switch from polling to pushi

Re: [PR] refactor: Move `Memtable` to catalog [datafusion]

2025-03-28 Thread via GitHub
logan-keede commented on code in PR #15459: URL: https://github.com/apache/datafusion/pull/15459#discussion_r2019291249 ## datafusion/catalog/src/memory/table.rs: ## @@ -0,0 +1,377 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license

Re: [PR] Update ClickBench queries to avoid to_timestamp_seconds [datafusion]

2025-03-28 Thread via GitHub
acking-you commented on PR #15475: URL: https://github.com/apache/datafusion/pull/15475#issuecomment-2761888661 > Looks good to me. Since we're only ordering by this it shouldn't matter that we order by an integer instead of a proper timestamp, ordering is equivalent. Thank you very
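
The quoted point, that ordering by the raw integer gives the same order as ordering by the converted timestamp, can be illustrated with a small standalone sketch. It assumes the integers are non-negative epoch seconds and uses `std::time::SystemTime` as a stand-in for a SQL timestamp; the function names and sample values are hypothetical and unrelated to the actual ClickBench queries.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Converts non-negative epoch seconds into a `SystemTime`, used here as a
/// stand-in for a SQL timestamp value.
fn to_timestamp(secs: u64) -> SystemTime {
    UNIX_EPOCH + Duration::from_secs(secs)
}

/// Sorts by the raw integer value.
fn sorted_by_seconds(mut values: Vec<u64>) -> Vec<u64> {
    values.sort();
    values
}

/// Sorts by the converted timestamp. The conversion is strictly increasing,
/// so the resulting order matches the integer order.
fn sorted_by_timestamp(mut values: Vec<u64>) -> Vec<u64> {
    values.sort_by_key(|&secs| to_timestamp(secs));
    values
}

fn main() {
    // Hypothetical event times in epoch seconds.
    let event_times = vec![1_373_846_400, 1_372_636_800, 1_374_000_000];
    let by_int = sorted_by_seconds(event_times.clone());
    let by_ts = sorted_by_timestamp(event_times);
    assert_eq!(by_int, by_ts);
    println!("{by_int:?}");
}
```

Because the conversion preserves order, skipping it in an `ORDER BY` changes only the cost of evaluating the sort key, not the resulting row order.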

Re: [I] [Bug] datafusion-cli may fail to read csv files [datafusion]

2025-03-28 Thread via GitHub
alamb closed issue #15456: [Bug] datafusion-cli may fail to read csv files URL: https://github.com/apache/datafusion/issues/15456

Re: [PR] bench: Introduce cross platform Samply profiler [datafusion]

2025-03-28 Thread via GitHub
parthchandra commented on PR #15481: URL: https://github.com/apache/datafusion/pull/15481#issuecomment-2761890159 There is also some profiling information in https://datafusion.apache.org/comet/contributor-guide/profiling_native_code.html I'm presuming this will replace that? How does

[PR] feat: Improve fetch partition performance, support skip validation arrow ipc files [datafusion-ballista]

2025-03-28 Thread via GitHub
westhide opened a new pull request, #1216: URL: https://github.com/apache/datafusion-ballista/pull/1216 # Which issue does this PR close? Closes #1189. # Rationale for this change # What changes are included in this PR? # Are there any user-facing

Re: [PR] Remove CoalescePartitions insertion from HashJoinExec [datafusion]

2025-03-28 Thread via GitHub
ctsk commented on PR #15476: URL: https://github.com/apache/datafusion/pull/15476#issuecomment-2761891586 Note that this does break for users of HashJoinExec that - Use the CollectLeft mode, with >1 partition on the build side AND - Construct their physical plan without running EnforceD

Re: [PR] experiment: Selectively remove CoalesceBatchesExec [datafusion]

2025-03-28 Thread via GitHub
Dandandan commented on code in PR #15479: URL: https://github.com/apache/datafusion/pull/15479#discussion_r2019008151 ## datafusion/physical-optimizer/src/coalesce_batches.rs: ## @@ -92,3 +92,73 @@ impl PhysicalOptimizerRule for CoalesceBatches { true } } + +/// R

Re: [PR] experiment: Selectively remove CoalesceBatchesExec [datafusion]

2025-03-28 Thread via GitHub
Dandandan commented on PR #15479: URL: https://github.com/apache/datafusion/pull/15479#issuecomment-2761889602 That makes a lot of sense!

Re: [I] External sort failing with modest memory limit when writing parquet files [datafusion]

2025-03-28 Thread via GitHub
ivankelly commented on issue #15028: URL: https://github.com/apache/datafusion/issues/15028#issuecomment-276185 Excellent analysis folks. Parquet row groups size makes a lot of sense since the rows are large. We can tune that way down since our use case isn't columnar. How do I get the

Re: [I] [Bug] datafusion-cli may fail to read csv files [datafusion]

2025-03-28 Thread via GitHub
alamb commented on issue #15456: URL: https://github.com/apache/datafusion/issues/15456#issuecomment-2761888196 Turns out this is a bug in the generator -- https://github.com/clflushopt/tpchgen-rs/issues/73#issuecomment-2761885245

Re: [PR] Update ClickBench queries to avoid to_timestamp_seconds [datafusion]

2025-03-28 Thread via GitHub
adriangb commented on PR #15475: URL: https://github.com/apache/datafusion/pull/15475#issuecomment-2761917991 > > Looks good to me. Since we're only ordering by this it shouldn't matter that we order by an integer instead of a proper timestamp, ordering is equivalent. > > Thank you v

Re: [PR] bench: Introduce cross platform Samply profiler [datafusion]

2025-03-28 Thread via GitHub
comphead commented on PR #15481: URL: https://github.com/apache/datafusion/pull/15481#issuecomment-2761920912 > There is also some profiling information in https://datafusion.apache.org/comet/contributor-guide/profiling_native_code.html I'm presuming this will replace that? How does samply

Re: [PR] fix: Assertion fail in external sort [datafusion]

2025-03-28 Thread via GitHub
comphead commented on code in PR #15469: URL: https://github.com/apache/datafusion/pull/15469#discussion_r2019237566 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -416,21 +409,23 @@ impl ExternalSorter { Some(self.spill_manager.create_in_progress_file("So

Re: [PR] fix!: incorrect coercion when comparing with string literals [datafusion]

2025-03-28 Thread via GitHub
alan910127 commented on code in PR #15482: URL: https://github.com/apache/datafusion/pull/15482#discussion_r2019310584 ## datafusion/core/tests/expr_api/mod.rs: ## @@ -330,12 +330,12 @@ async fn test_create_physical_expr_coercion() { create_expr_test(lit(1i32).eq(col("id"))

[PR] fix!: incorrect coercion when comparing with string literals [datafusion]

2025-03-28 Thread via GitHub
alan910127 opened a new pull request, #15482: URL: https://github.com/apache/datafusion/pull/15482 ## Which issue does this PR close? - Closes #15161. ## Rationale for this change Currently, DataFusion handles comparisons between numbers and string litera

Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]

2025-03-28 Thread via GitHub
alamb commented on issue #1576: URL: https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2762075206 > @andygrove @alamb would y'all be able to help here? I saw that the prebuilt JARs were tested for 3.5.4 and upwards. Are they backwards compatible? Sorry I don't know

Re: [PR] Support Avg distinct for `float64` type [datafusion]

2025-03-28 Thread via GitHub
alamb commented on PR #15413: URL: https://github.com/apache/datafusion/pull/15413#issuecomment-2762447424 Run extended tests

[I] [DISCUSS] Data quality framework using DataFusion [datafusion]

2025-03-28 Thread via GitHub
jsai28 opened a new issue, #15483: URL: https://github.com/apache/datafusion/issues/15483 Would there be any interest in building a data quality framework like [Great Expectations](https://github.com/great-expectations/great_expectations

Re: [PR] chore(deps): update rand requirement from 0.8 to 0.9 [datafusion]

2025-03-28 Thread via GitHub
comphead commented on PR #14333: URL: https://github.com/apache/datafusion/pull/14333#issuecomment-2762462451 depends on https://github.com/apache/datafusion/pull/14967

Re: [I] Consider using `with_skip_validation` for shuffle file reading [datafusion-ballista]

2025-03-28 Thread via GitHub
westhide commented on issue #1189: URL: https://github.com/apache/datafusion-ballista/issues/1189#issuecomment-2761307615 References: [Benchmarks for Arrow IPC reader](https://github.com/apache/arrow-rs/pull/7091)

Re: [PR] Simplify display format of `AggregateFunctionExpr`, add `Expr::sql_name` [datafusion]

2025-03-28 Thread via GitHub
berkaysynnada commented on PR #15253: URL: https://github.com/apache/datafusion/pull/15253#issuecomment-2761294822 @irenjj `AggregateFunctionExpr` has `with_new_expressions()` API. As datafusion hasn't implemented it yet, you didn't have difficulty rewriting the `human_display` according to

Re: [PR] Introduce selection vector repartitioning [datafusion]

2025-03-28 Thread via GitHub
goldmedal commented on PR #15423: URL: https://github.com/apache/datafusion/pull/15423#issuecomment-2761416899 > You should be able to get the test back by also setting `datafusion.optimizer.hash_join_single_partition_threshold` to `0` / a low value. Thanks. It works. I also added th

Re: [I] Support Min/Max accumulator for type List [datafusion]

2025-03-28 Thread via GitHub
LiaCastaneda commented on issue #15477: URL: https://github.com/apache/datafusion/issues/15477#issuecomment-2761931268 I think @ologlogn will take it

Re: [I] External sort failing with modest memory limit when writing parquet files [datafusion]

2025-03-28 Thread via GitHub
Kontinuation commented on issue #15028: URL: https://github.com/apache/datafusion/issues/15028#issuecomment-2761940822 > Excellent analysis folks. Parquet row groups size makes a lot of sense since the rows are large. We can tune that way down since our use case isn't columnar. How do I get

[PR] doc: fix quick-start executor command [datafusion-ballista]

2025-03-28 Thread via GitHub
westhide opened a new pull request, #1217: URL: https://github.com/apache/datafusion-ballista/pull/1217 # Which issue does this PR close? Closes N/A. # Rationale for this change # What changes are included in this PR? # Are there any user-facing ch

Re: [I] Consider using `with_skip_validation` for shuffle file reading [datafusion-ballista]

2025-03-28 Thread via GitHub
westhide commented on issue #1189: URL: https://github.com/apache/datafusion-ballista/issues/1189#issuecomment-2761293050 /take

Re: [PR] Simplify display format of `AggregateFunctionExpr`, add `Expr::sql_name` [datafusion]

2025-03-28 Thread via GitHub
irenjj commented on PR #15253: URL: https://github.com/apache/datafusion/pull/15253#issuecomment-2761472370 > @irenjj `AggregateFunctionExpr` has `with_new_expressions()` API. As datafusion hasn't implemented it yet, you didn't have difficulty rewriting the `human_display` according to the

Re: [PR] chore: Reimplement ShuffleWriterExec using interleave_record_batch [datafusion-comet]

2025-03-28 Thread via GitHub
mbutrovich commented on code in PR #1511: URL: https://github.com/apache/datafusion-comet/pull/1511#discussion_r2018762152 ## native/core/src/execution/shuffle/shuffle_writer.rs: ## @@ -422,27 +432,29 @@ impl ShuffleRepartitioner { .collect::>>()?;

Re: [PR] Update ClickBench queries to avoid to_timestamp_seconds [datafusion]

2025-03-28 Thread via GitHub
Dandandan commented on PR #15475: URL: https://github.com/apache/datafusion/pull/15475#issuecomment-2761610618 > Should we update the same query in the clickbench repo as well? Yes, and we might rerun the queries as well (as `to_timestamp_seconds` takes some time itself as well).

Re: [PR] Migrate-substrait-tests-to-insta, part2 [datafusion]

2025-03-28 Thread via GitHub
blaginin commented on PR #15480: URL: https://github.com/apache/datafusion/pull/15480#issuecomment-2762056261 can you merge main into this branch please? to remove the diff

Re: [PR] fix: make register_object_store use same session_env as file scan [datafusion-comet]

2025-03-28 Thread via GitHub
kazuyukitanimura commented on code in PR #1555: URL: https://github.com/apache/datafusion-comet/pull/1555#discussion_r2019103734 ## native/core/src/parquet/mod.rs: ## @@ -641,6 +640,8 @@ pub unsafe extern "system" fn Java_org_apache_comet_parquet_Native_initRecordBat sessi

Re: [PR] Migrate-substrait-tests-to-insta, part2 [datafusion]

2025-03-28 Thread via GitHub
qstommyshu commented on PR #15480: URL: https://github.com/apache/datafusion/pull/15480#issuecomment-2762061935 I'll do a last commit to resolve those comments

Re: [PR] Add `FileScanConfigBuilder` [datafusion]

2025-03-28 Thread via GitHub
alamb commented on PR #15352: URL: https://github.com/apache/datafusion/pull/15352#issuecomment-2762070065 Thanks again @blaginin @mertak-synnada

Re: [PR] Add documentation for `Run extended tests` command [datafusion]

2025-03-28 Thread via GitHub
alamb commented on PR #15463: URL: https://github.com/apache/datafusion/pull/15463#issuecomment-2762078550 Thanks @comphead and @xudong963

Re: [PR] Add documentation for `Run extended tests` command [datafusion]

2025-03-28 Thread via GitHub
alamb merged PR #15463: URL: https://github.com/apache/datafusion/pull/15463

Re: [PR] Docs: Formatting and Added Extra resources [datafusion]

2025-03-28 Thread via GitHub
alamb commented on PR #15450: URL: https://github.com/apache/datafusion/pull/15450#issuecomment-2762082875 > > BTW @oznur-synnada I wonder if you have time to update the page with other recent blog content 🤔 > > You mean this? #15440 Yes! As well as https://datafusion.ap

Re: [PR] Migrate-substrait-tests-to-insta, part2 [datafusion]

2025-03-28 Thread via GitHub
alamb commented on PR #15480: URL: https://github.com/apache/datafusion/pull/15480#issuecomment-2762080799 🌶️

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-28 Thread via GitHub
adriangb commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2006767851 ## datafusion/core/src/datasource/physical_plan/parquet.rs: ## @@ -1655,4 +1656,46 @@ mod tests { assert_eq!(calls.len(), 2); assert_eq!(calls,

[PR] bench: Introduce cross platform Samply profiler [datafusion]

2025-03-28 Thread via GitHub
comphead opened a new pull request, #15481: URL: https://github.com/apache/datafusion/pull/15481 ## Which issue does this PR close? - Closes #. ## Rationale for this change Introduce cross platform Samply profiler for DataFusion and benchmarks ## W

Re: [PR] Add short circuit [datafusion]

2025-03-28 Thread via GitHub
acking-you commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2761813612 > I think one issue is that the short-circuit logic is not handling cases where the `rhs` contains NULLs. E.g. `true OR NULL` needs to evaluate to `NULL` After taking a

Re: [I] Idea: Avoid planning CoalesceBatches in front of blocking operators. [datafusion]

2025-03-28 Thread via GitHub
Dandandan commented on issue #15478: URL: https://github.com/apache/datafusion/issues/15478#issuecomment-2762997345 > One downside: Increased memory usage. > > The hash join build side stores the RecordBatches in a vector before building the hash table. This vector will grow larger. I

Re: [PR] Support Avg distinct for `float64` type [datafusion]

2025-03-28 Thread via GitHub
Omega359 commented on PR #15413: URL: https://github.com/apache/datafusion/pull/15413#issuecomment-2762816764 Looks like it failed? https://github.com/apache/datafusion/actions/runs/14139465370/job/39618247236

Re: [I] Update ClickBench queries to avoid `to_timestamp_seconds` [datafusion]

2025-03-28 Thread via GitHub
Dandandan commented on issue #15465: URL: https://github.com/apache/datafusion/issues/15465#issuecomment-2762989314 Another discrepancy I found in the queries is the `"EventDate"::INT::DATE` casting. Is this something we could remove as well? Maybe would be good to look at all further that a

Re: [I] Support for user defined FFI functions [datafusion-ballista]

2025-03-28 Thread via GitHub
unknown-no commented on issue #1215: URL: https://github.com/apache/datafusion-ballista/issues/1215#issuecomment-2763011354 Related [WASM UDFs](https://github.com/apache/datafusion/issues/9326)

Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]

2025-03-28 Thread via GitHub
parthchandra commented on issue #1576: URL: https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2762922710 This particular API is not a public API, and we use it so we can verify the metrics in tests. Maybe we can disable its use in non-test environments?

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-28 Thread via GitHub
adriangb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2763043386 @alamb I've achieved 2/3 goals: - I added wrapping of a `DynamicFilterSource` in a `PhysicalExpr` such that it can dynamically update itself to prune rows using filter pushdown _e

Re: [PR] fix!: incorrect coercion when comparing with string literals [datafusion]

2025-03-28 Thread via GitHub
jayzhan211 commented on code in PR #15482: URL: https://github.com/apache/datafusion/pull/15482#discussion_r2019668440 ## datafusion/sqllogictest/test_files/push_down_filter.slt: ## @@ -230,19 +230,19 @@ logical_plan TableScan: t projection=[a], full_filters=[t.a != Int32(100)]

Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]

2025-03-28 Thread via GitHub
xudong963 commented on issue #15072: URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2763074215 > For your planning purposes I will be away the week of April 21 -- so perhaps we can start testing a week earlier (week of April 7 so we have time to complete / fix issues pr

Re: [PR] Support computing statistics for FileGroup [datafusion]

2025-03-28 Thread via GitHub
xudong963 commented on code in PR #15432: URL: https://github.com/apache/datafusion/pull/15432#discussion_r2019681642 ## datafusion/core/src/datasource/statistics.rs: ## @@ -145,7 +147,142 @@ pub async fn get_statistics_with_limit( Ok((result_files, statistics)) } -fn ad

Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]

2025-03-28 Thread via GitHub
shehabgamin commented on issue #15072: URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2762517221 Happy to test whenever!

Re: [PR] feat: Add regexp_substr function [datafusion]

2025-03-28 Thread via GitHub
github-actions[bot] commented on PR #14323: URL: https://github.com/apache/datafusion/pull/14323#issuecomment-2763013966 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or

Re: [PR] fix: aggregation corner case [datafusion]

2025-03-28 Thread via GitHub
jayzhan211 commented on PR #15457: URL: https://github.com/apache/datafusion/pull/15457#issuecomment-2763109194 > count(*) actually doesnt depend on any column on input logically count(*) needs to know the row count of the column

Re: [PR] Add short circuit [datafusion]

2025-03-28 Thread via GitHub
alamb commented on code in PR #15462: URL: https://github.com/apache/datafusion/pull/15462#discussion_r2019123741 ## benchmarks/queries/clickbench/README.md: ## @@ -120,13 +122,42 @@ LIMIT 10; ``` Results look like - +``` +-+-+---+--+

Re: [PR] Update ClickBench queries to avoid to_timestamp_seconds [datafusion]

2025-03-28 Thread via GitHub
Dandandan merged PR #15475: URL: https://github.com/apache/datafusion/pull/15475

Re: [I] Support RangePartitioning with native shuffle [datafusion-comet]

2025-03-28 Thread via GitHub
andygrove commented on issue #458: URL: https://github.com/apache/datafusion-comet/issues/458#issuecomment-2762962087 I discussed this feature with @mbutrovich recently and he may have additional thoughts on this topic.

Re: [I] Support RangePartitioning with native shuffle [datafusion-comet]

2025-03-28 Thread via GitHub
viirya commented on issue #458: URL: https://github.com/apache/datafusion-comet/issues/458#issuecomment-2763002946 The implementation difference between RangePartitioning and other partitioning schemes such as HashPartitioning is that it involves some sampling operations that perform wit

Re: [PR] feat: pushdown filter for native_iceberg_compat [datafusion-comet]

2025-03-28 Thread via GitHub
parthchandra commented on code in PR #1566: URL: https://github.com/apache/datafusion-comet/pull/1566#discussion_r2019634486 ## spark/src/test/scala/org/apache/comet/parquet/ParquetReadSuite.scala: ## @@ -1460,6 +1460,59 @@ class ParquetReadV1Suite extends ParquetReadSuite with

Re: [PR] Clean up hash_join's ExecutionPlan::execute [datafusion]

2025-03-28 Thread via GitHub
ctsk commented on PR #15418: URL: https://github.com/apache/datafusion/pull/15418#issuecomment-2761412282 > You mean coalesce_partitions_if_needed() call is redundant in datafusion? I don't think that's the case, but if it is so, why don't we remove that line? I wanted to keep the PR

Re: [I] [Bug] datafusion-cli may fail to read csv files [datafusion]

2025-03-28 Thread via GitHub
alamb commented on issue #15456: URL: https://github.com/apache/datafusion/issues/15456#issuecomment-2761675489 Nice find @chenkovsky -- so looks like there is some bug in the data generator after all.

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-28 Thread via GitHub
Dandandan commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018823379 ## datafusion/datasource/benches/split_groups_by_statistics.rs: ## @@ -0,0 +1,178 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more c

Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]

2025-03-28 Thread via GitHub
mkgada commented on issue #1576: URL: https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2761689234 @jinwenjie123 appreciate your response! I am using one of the pre-built JARs, I will not be able to switch to Spark 3.4.x since our cluster was recently upgraded to 3.5.x an

[PR] Migrate-substrait-tests-to-insta, part2 [datafusion]

2025-03-28 Thread via GitHub
qstommyshu opened a new pull request, #15480: URL: https://github.com/apache/datafusion/pull/15480 ## Which issue does this PR close? - Closes #15398. Related #15444 ## Rationale for this change ## What changes are included in this PR? Migr

Re: [PR] Migrate-substrait-tests-to-insta, part2 [datafusion]

2025-03-28 Thread via GitHub
qstommyshu commented on PR #15480: URL: https://github.com/apache/datafusion/pull/15480#issuecomment-2761715713 Hi @alamb and @blaginin Part2 of the substrait tests migration is done as well. Please take a look when you have time :) The only tests that cannot be changed to `in

[PR] chore: Override node name for CometSparkToColumnar [datafusion-comet]

2025-03-28 Thread via GitHub
l0kr opened a new pull request, #1577: URL: https://github.com/apache/datafusion-comet/pull/1577 ## Which issue does this PR close? Closes #936. ## Rationale for this change Previous PR went stale so I wanted to move it forward. ## What changes are included

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-28 Thread via GitHub
adriangb commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2004247883 ## datafusion/common/src/config.rs: ## @@ -590,6 +590,9 @@ config_namespace! { /// during aggregations, if possible pub enable_topk_aggregation:
