Re: [PR] Use `interleave` in hash repartitioning [datafusion]

2025-04-20 Thread via GitHub
ctsk commented on code in PR #15768: URL: https://github.com/apache/datafusion/pull/15768#discussion_r2051758338 ## datafusion/physical-plan/src/repartition/mod.rs: ## @@ -298,25 +299,15 @@ impl BatchPartitioner { .into_iter() .e

[PR] Set HashJoin seed [datafusion]

2025-04-20 Thread via GitHub
ctsk opened a new pull request, #15783: URL: https://github.com/apache/datafusion/pull/15783 ## Which issue does this PR close? - Closes #15620. ## What changes are included in this PR? The hash join seed is hard-coded to a different value that the RepartitionExec seed.

Re: [PR] chore: Upgrade to datafusion 47.0.0 [datafusion-comet]

2025-04-20 Thread via GitHub
codecov-commenter commented on PR #1663: URL: https://github.com/apache/datafusion-comet/pull/1663#issuecomment-2817252576 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1663?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] Use `interleave` in hash repartitioning [datafusion]

2025-04-20 Thread via GitHub
ctsk commented on code in PR #15768: URL: https://github.com/apache/datafusion/pull/15768#discussion_r2051758338 ## datafusion/physical-plan/src/repartition/mod.rs: ## @@ -298,25 +299,15 @@ impl BatchPartitioner { .into_iter() .e

Re: [PR] Add try_new for LogicalPlan::Join [datafusion]

2025-04-20 Thread via GitHub
ctsk commented on PR #15757: URL: https://github.com/apache/datafusion/pull/15757#issuecomment-2817276675 The implementation looks good to me. Two points: 1. The testing seems quite verbose - I briefly browsed around and think `DFSchema::matches_arrow_schema` can help you shorten it. 2

Re: [PR] Use `interleave` in hash repartitioning [datafusion]

2025-04-20 Thread via GitHub
ctsk commented on PR #15768: URL: https://github.com/apache/datafusion/pull/15768#issuecomment-2817240670 Does the `batches_until_yield` threshold need to be adjusted? https://github.com/apache/datafusion/blob/91870b626b184a901c713fada3a1adb0eda2de7c/datafusion/physical-plan/src/repar

Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]

2025-04-20 Thread via GitHub
Rachelint commented on issue #15383: URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2817566333 > However, I found the performance won't be better for Clickbench queries 4 and 7. I think it may be possible that the test queries can't reflect the improvement well.

Re: [I] Remove nulls in joins during hash table lookup [datafusion]

2025-04-20 Thread via GitHub
2010YOUY01 commented on issue #15784: URL: https://github.com/apache/datafusion/issues/15784#issuecomment-2817567103 It's a good idea. However, I think if the build side has a large HT and the probe side has many `NULL`s, the existing implementation is likely to be faster. Some statistics m

Re: [PR] Feature/benchmark config from env [datafusion]

2025-04-20 Thread via GitHub
2010YOUY01 commented on PR #15782: URL: https://github.com/apache/datafusion/pull/15782#issuecomment-2817587293 Thanks, I think it's good to go after adding some simple docs for such configurations, in: https://github.com/apache/datafusion/tree/main/benchmarks and also the output of `.

Re: [PR] fix: clickbench type err [datafusion]

2025-04-20 Thread via GitHub
chenkovsky commented on code in PR #15773: URL: https://github.com/apache/datafusion/pull/15773#discussion_r2051868053 ## benchmarks/queries/clickbench/README.md: ## @@ -155,7 +155,7 @@ WHERE THEN split_part(split_part("URL", 'resolution=', 2), '&', 1)::INT EL

Re: [I] Support non UTF-8 in CSV files [datafusion]

2025-04-20 Thread via GitHub
guojidan commented on issue #15756: URL: https://github.com/apache/datafusion/issues/15756#issuecomment-2817457411 ``` use std::sync::Arc; use bytes::Bytes; use datafusion::prelude::{CsvReadOptions, SessionContext}; use object_store::{memory::InMemory, path::Path, ObjectStore}

[PR] Feature/benchmark config from env [datafusion]

2025-04-20 Thread via GitHub
ctsk opened a new pull request, #15782: URL: https://github.com/apache/datafusion/pull/15782 ## Which issue does this PR close? - Closes #15684 ## Rationale for this change When benchmarking, it is convenient to modify some datafusion settings between runs and compare t

Re: [PR] Use `interleave` in hash repartitioning [datafusion]

2025-04-20 Thread via GitHub
Dandandan commented on code in PR #15768: URL: https://github.com/apache/datafusion/pull/15768#discussion_r2051763714 ## datafusion/physical-plan/src/repartition/mod.rs: ## @@ -298,25 +299,15 @@ impl BatchPartitioner { .into_iter()

Re: [PR] Use `interleave` in hash repartitioning [datafusion]

2025-04-20 Thread via GitHub
ctsk commented on code in PR #15768: URL: https://github.com/apache/datafusion/pull/15768#discussion_r2051758338 ## datafusion/physical-plan/src/repartition/mod.rs: ## @@ -298,25 +299,15 @@ impl BatchPartitioner { .into_iter() .e

[I] Remove nulls in joins during hash table lookup [datafusion]

2025-04-20 Thread via GitHub
ctsk opened a new issue, #15784: URL: https://github.com/apache/datafusion/issues/15784 ### Is your feature request related to a problem or challenge? The join implementation currently removes nulls during equality checking - I believe it would be faster / more efficient if we did so

[I] nit: Hash Partitioning Shuffle Writes Last Batches First [datafusion-comet]

2025-04-20 Thread via GitHub
EmilyMatt opened a new issue, #1659: URL: https://github.com/apache/datafusion-comet/issues/1659 This is a nitpick, but I've noticed that in the new interleaved shuffle, when copying the data into the output data file, first the in-memory data is written to the file, and only then is the co

[PR] fix: Shuffle should maintain insertion order [datafusion-comet]

2025-04-20 Thread via GitHub
EmilyMatt opened a new pull request, #1660: URL: https://github.com/apache/datafusion-comet/pull/1660 This is very much a nitpick and not a necessary change, but I believe that shuffle batches should be FIFO. ## Which issue does this PR close? Closes #1659 . ## R

Re: [I] Support coercsion from `FixedSizeBinary` to `BinaryView` [datafusion]

2025-04-20 Thread via GitHub
chenkovsky commented on issue #15755: URL: https://github.com/apache/datafusion/issues/15755#issuecomment-2817103271 @jayzhan211 could you please give me an example sql? I tried to construct test sql. but https://github.com/apache/datafusion/blob/8a193c22f6c3f771bff16cb8f9608d764d2

Re: [I] Support coercsion from `FixedSizeBinary` to `BinaryView` [datafusion]

2025-04-20 Thread via GitHub
jayzhan211 commented on issue #15755: URL: https://github.com/apache/datafusion/issues/15755#issuecomment-2817104010 > @jayzhan211 could you please give me an example sql? I tried to construct test sql. but > > https://github.com/apache/datafusion/blob/8a193c22f6c3f771bff16cb8f9608

Re: [PR] refactor: rename to `with_column_renamed` to `rename` [datafusion-python]

2025-04-20 Thread via GitHub
ion-elgreco closed pull request #908: refactor: rename to `with_column_renamed` to `rename` URL: https://github.com/apache/datafusion-python/pull/908 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

Re: [PR] Introduce selection vector repartitioning [datafusion]

2025-04-20 Thread via GitHub
goldmedal commented on code in PR #15423: URL: https://github.com/apache/datafusion/pull/15423#discussion_r2051721176 ## datafusion/physical-plan/src/repartition/mod.rs: ## @@ -320,6 +336,68 @@ impl BatchPartitioner { Ok((partition, batch))

Re: [PR] experiment: Selectively remove CoalesceBatchesExec [datafusion]

2025-04-20 Thread via GitHub
ctsk commented on PR #15479: URL: https://github.com/apache/datafusion/pull/15479#issuecomment-2817057508 Sorry for the silence. I've been extensively benchmarking this PR and the results have been fairly mixed. I've also tried different thresholds for coalescing. I plan on generating table

Re: [PR] Use `interleave` in hash repartitioning [datafusion]

2025-04-20 Thread via GitHub
ctsk commented on code in PR #15768: URL: https://github.com/apache/datafusion/pull/15768#discussion_r2051665380 ## datafusion/physical-plan/src/repartition/mod.rs: ## @@ -298,25 +299,15 @@ impl BatchPartitioner { .into_iter() .e

Re: [PR] feat: add `with_group_indices_order_mode` function for `GroupsAccumulator` to help create specialized impl [datafusion]

2025-04-20 Thread via GitHub
rluvaton commented on code in PR #15022: URL: https://github.com/apache/datafusion/pull/15022#discussion_r2051696202 ## datafusion/expr-common/src/groups_accumulator.rs: ## @@ -106,6 +107,44 @@ impl EmitTo { /// [`Accumulator`]: crate::accumulator::Accumulator /// [Aggregating

Re: [PR] feat: add `with_group_indices_order_mode` function for `GroupsAccumulator` to help create specialized impl [datafusion]

2025-04-20 Thread via GitHub
rluvaton commented on code in PR #15022: URL: https://github.com/apache/datafusion/pull/15022#discussion_r2051696277 ## datafusion/functions-aggregate-common/src/aggregate/groups_accumulator.rs: ## @@ -299,6 +420,12 @@ impl GroupsAccumulatorAdapter { } impl GroupsAccumulator

Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]

2025-04-20 Thread via GitHub
goldmedal commented on issue #15383: URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2817168799 I have another implementation for this issue https://github.com/goldmedal/datafusion/pull/4 The concept is that getting the row according to indices in the selection vecto

[PR] chore: Refactor Memory Pools [datafusion-comet]

2025-04-20 Thread via GitHub
EmilyMatt opened a new pull request, #1662: URL: https://github.com/apache/datafusion-comet/pull/1662 ## Which issue does this PR close? Closes #1661 . ## Rationale for this change As described in the issue, I just think the pools logic is becoming complex

Re: [PR] fix: Shuffle should maintain insertion order [datafusion-comet]

2025-04-20 Thread via GitHub
codecov-commenter commented on PR #1660: URL: https://github.com/apache/datafusion-comet/pull/1660#issuecomment-2817176336 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1660?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] Show current SQL recursion limit in RecursionLimitExceeded error message [datafusion]

2025-04-20 Thread via GitHub
kumarlokesh commented on PR #15644: URL: https://github.com/apache/datafusion/pull/15644#issuecomment-2817176100 > > thanks @kumarlokesh The code now looks much more aligned. > > perhaps we can factor out > > ``` > > self.parser > >

Re: [PR] Use `interleave` in hash repartitioning [datafusion]

2025-04-20 Thread via GitHub
Dandandan commented on code in PR #15768: URL: https://github.com/apache/datafusion/pull/15768#discussion_r2051731884 ## datafusion/physical-plan/src/repartition/mod.rs: ## @@ -298,25 +299,15 @@ impl BatchPartitioner { .into_iter()

[I] Suggestion: Refactor Memory Pools [datafusion-comet]

2025-04-20 Thread via GitHub
EmilyMatt opened a new issue, #1661: URL: https://github.com/apache/datafusion-comet/issues/1661 ### What is the problem the feature request solves? As there is now more than one memory Comet pool, and it stands to reason there will be more in the future, as well as with the increasin

Re: [PR] Introduce selection vector repartitioning [datafusion]

2025-04-20 Thread via GitHub
Dandandan commented on code in PR #15423: URL: https://github.com/apache/datafusion/pull/15423#discussion_r2051728746 ## datafusion/physical-plan/src/repartition/mod.rs: ## @@ -320,6 +336,68 @@ impl BatchPartitioner { Ok((partition, batch))

[I] Documentation for publishing crates is missing some crates [datafusion]

2025-04-20 Thread via GitHub
andygrove opened a new issue, #15781: URL: https://github.com/apache/datafusion/issues/15781 ### Describe the bug `datasource` needs to be published before `catalog`: ``` Packaging datafusion-catalog v47.0.0 (/Users/andy/git/apache/datafusion/datafusion/catalog) Up

Re: [PR] chore: Upgrade to datafusion 47.0.0-rc1 and arrow-rs 55.0.0 [datafusion-comet]

2025-04-20 Thread via GitHub
YanivKunda commented on PR #1563: URL: https://github.com/apache/datafusion-comet/pull/1563#issuecomment-2817207648 > I'd like to merge this one and then switch to the official release next week @andygrove here's the switch: https://github.com/apache/datafusion-comet/pull/1663 -

Re: [PR] experiment: Selectively remove CoalesceBatchesExec [datafusion]

2025-04-20 Thread via GitHub
ctsk commented on PR #15479: URL: https://github.com/apache/datafusion/pull/15479#issuecomment-2817209539 [benchmarks/hashagg-results.md](https://github.com/apache/datafusion/blob/05db68ca2e5efba9b76759178f50a0595ffd9939/benchmarks/hashagg-results.md) [benchmarks/join-results.md](https:/

[PR] chore: Upgrade to datafusion 47.0.0 [datafusion-comet]

2025-04-20 Thread via GitHub
YanivKunda opened a new pull request, #1663: URL: https://github.com/apache/datafusion-comet/pull/1663 ## Which issue does this PR close? Closes #1634 ## Rationale for this change Upgraded from 47.0.0-rc1 post https://github.com/apache/datafusion-comet/pull/1563 #