date:20250420

Re: [PR] Use `interleave` in hash repartitioning [datafusion]

2025-04-20 Thread via GitHub

ctsk commented on code in PR #15768: URL: https://github.com/apache/datafusion/pull/15768#discussion_r2051758338 ## datafusion/physical-plan/src/repartition/mod.rs: ## @@ -298,25 +299,15 @@ impl BatchPartitioner { .into_iter() .e

[PR] Set HashJoin seed [datafusion]

2025-04-20 Thread via GitHub

ctsk opened a new pull request, #15783: URL: https://github.com/apache/datafusion/pull/15783 ## Which issue does this PR close? - Closes #15620. ## What changes are included in this PR? The hash join seed is hard-coded to a different value that the RepartitionExec seed.

Re: [PR] chore: Upgrade to datafusion 47.0.0 [datafusion-comet]

2025-04-20 Thread via GitHub

codecov-commenter commented on PR #1663: URL: https://github.com/apache/datafusion-comet/pull/1663#issuecomment-2817252576 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1663?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] Use `interleave` in hash repartitioning [datafusion]

2025-04-20 Thread via GitHub

ctsk commented on code in PR #15768: URL: https://github.com/apache/datafusion/pull/15768#discussion_r2051758338 ## datafusion/physical-plan/src/repartition/mod.rs: ## @@ -298,25 +299,15 @@ impl BatchPartitioner { .into_iter() .e

Re: [PR] Add try_new for LogicalPlan::Join [datafusion]

2025-04-20 Thread via GitHub

ctsk commented on PR #15757: URL: https://github.com/apache/datafusion/pull/15757#issuecomment-2817276675 The implementation looks good to me. Two points: 1. The testing seems quite verbose - I briefly browsed around and think `DFSchema::matches_arrow_schema` can help you shorten it. 2

Re: [PR] Use `interleave` in hash repartitioning [datafusion]

2025-04-20 Thread via GitHub

ctsk commented on PR #15768: URL: https://github.com/apache/datafusion/pull/15768#issuecomment-2817240670 Does the `batches_until_yield` threshold need to be adjusted? https://github.com/apache/datafusion/blob/91870b626b184a901c713fada3a1adb0eda2de7c/datafusion/physical-plan/src/repar

Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]

2025-04-20 Thread via GitHub

Rachelint commented on issue #15383: URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2817566333 > However, I found the performance won't be better for Clickbench queries 4 and 7. I think it may be possible that the test queries can't reflect the improvement well.

Re: [I] Remove nulls in joins during hash table lookup [datafusion]

2025-04-20 Thread via GitHub

2010YOUY01 commented on issue #15784: URL: https://github.com/apache/datafusion/issues/15784#issuecomment-2817567103 It's a good idea. However, I think if the build side has a large HT and the probe side has many `NULL`s, the existing implementation is likely to be faster. Some statistics m

Re: [PR] Feature/benchmark config from env [datafusion]

2025-04-20 Thread via GitHub

2010YOUY01 commented on PR #15782: URL: https://github.com/apache/datafusion/pull/15782#issuecomment-2817587293 Thanks, I think it's good to go after adding some simple docs for such configurations, in: https://github.com/apache/datafusion/tree/main/benchmarks and also the output of `.

Re: [PR] fix: clickbench type err [datafusion]

2025-04-20 Thread via GitHub

chenkovsky commented on code in PR #15773: URL: https://github.com/apache/datafusion/pull/15773#discussion_r2051868053 ## benchmarks/queries/clickbench/README.md: ## @@ -155,7 +155,7 @@ WHERE THEN split_part(split_part("URL", 'resolution=', 2), '&', 1)::INT EL

Re: [I] Support non UTF-8 in CSV files [datafusion]

2025-04-20 Thread via GitHub

guojidan commented on issue #15756: URL: https://github.com/apache/datafusion/issues/15756#issuecomment-2817457411 ``` use std::sync::Arc; use bytes::Bytes; use datafusion::prelude::{CsvReadOptions, SessionContext}; use object_store::{memory::InMemory, path::Path, ObjectStore}

[PR] Feature/benchmark config from env [datafusion]

2025-04-20 Thread via GitHub

ctsk opened a new pull request, #15782: URL: https://github.com/apache/datafusion/pull/15782 ## Which issue does this PR close? - Closes #15684 ## Rationale for this change When benchmarking, it is convenient to modify some datafusion settings between runs and compare t

Re: [PR] Use `interleave` in hash repartitioning [datafusion]

2025-04-20 Thread via GitHub

Dandandan commented on code in PR #15768: URL: https://github.com/apache/datafusion/pull/15768#discussion_r2051763714 ## datafusion/physical-plan/src/repartition/mod.rs: ## @@ -298,25 +299,15 @@ impl BatchPartitioner { .into_iter()

Re: [PR] Use `interleave` in hash repartitioning [datafusion]

2025-04-20 Thread via GitHub

ctsk commented on code in PR #15768: URL: https://github.com/apache/datafusion/pull/15768#discussion_r2051758338 ## datafusion/physical-plan/src/repartition/mod.rs: ## @@ -298,25 +299,15 @@ impl BatchPartitioner { .into_iter() .e

[I] Remove nulls in joins during hash table lookup [datafusion]

2025-04-20 Thread via GitHub

ctsk opened a new issue, #15784: URL: https://github.com/apache/datafusion/issues/15784 ### Is your feature request related to a problem or challenge? The join implementation currently removes nulls during equality checking - I believe it would be faster / more efficient if we did so

[I] nit: Hash Partitioning Shuffle Writes Last Batches First [datafusion-comet]

2025-04-20 Thread via GitHub

EmilyMatt opened a new issue, #1659: URL: https://github.com/apache/datafusion-comet/issues/1659 This is a nitpick, but I've noticed that in the new interleaved shuffle, when copying the data into the output data file, first the in-memory data is written to the file, and only then is the co

[PR] fix: Shuffle should maintain insertion order [datafusion-comet]

2025-04-20 Thread via GitHub

EmilyMatt opened a new pull request, #1660: URL: https://github.com/apache/datafusion-comet/pull/1660 This is very much a nitpick and not a necessary change, but I believe that shuffle batches should be FIFO. ## Which issue does this PR close? Closes #1659 . ## R

Re: [I] Support coercsion from `FixedSizeBinary` to `BinaryView` [datafusion]

2025-04-20 Thread via GitHub

chenkovsky commented on issue #15755: URL: https://github.com/apache/datafusion/issues/15755#issuecomment-2817103271 @jayzhan211 could you please give me an example sql? I tried to construct test sql. but https://github.com/apache/datafusion/blob/8a193c22f6c3f771bff16cb8f9608d764d2

Re: [I] Support coercsion from `FixedSizeBinary` to `BinaryView` [datafusion]

2025-04-20 Thread via GitHub

jayzhan211 commented on issue #15755: URL: https://github.com/apache/datafusion/issues/15755#issuecomment-2817104010 > @jayzhan211 could you please give me an example sql? I tried to construct test sql. but > > https://github.com/apache/datafusion/blob/8a193c22f6c3f771bff16cb8f9608

Re: [PR] refactor: rename to `with_column_renamed` to `rename` [datafusion-python]

2025-04-20 Thread via GitHub

ion-elgreco closed pull request #908: refactor: rename to `with_column_renamed` to `rename` URL: https://github.com/apache/datafusion-python/pull/908 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

Re: [PR] Introduce selection vector repartitioning [datafusion]

2025-04-20 Thread via GitHub

goldmedal commented on code in PR #15423: URL: https://github.com/apache/datafusion/pull/15423#discussion_r2051721176 ## datafusion/physical-plan/src/repartition/mod.rs: ## @@ -320,6 +336,68 @@ impl BatchPartitioner { Ok((partition, batch))

Re: [PR] experiment: Selectively remove CoalesceBatchesExec [datafusion]

2025-04-20 Thread via GitHub

ctsk commented on PR #15479: URL: https://github.com/apache/datafusion/pull/15479#issuecomment-2817057508 Sorry for the silence. I've been extensively benchmarking this PR and the results have been fairly mixed. I've also tried different thresholds for coalescing. I plan on generating table

Re: [PR] Use `interleave` in hash repartitioning [datafusion]

2025-04-20 Thread via GitHub

ctsk commented on code in PR #15768: URL: https://github.com/apache/datafusion/pull/15768#discussion_r2051665380 ## datafusion/physical-plan/src/repartition/mod.rs: ## @@ -298,25 +299,15 @@ impl BatchPartitioner { .into_iter() .e

Re: [PR] feat: add `with_group_indices_order_mode` function for `GroupsAccumulator` to help create specialized impl [datafusion]

2025-04-20 Thread via GitHub

rluvaton commented on code in PR #15022: URL: https://github.com/apache/datafusion/pull/15022#discussion_r2051696202 ## datafusion/expr-common/src/groups_accumulator.rs: ## @@ -106,6 +107,44 @@ impl EmitTo { /// [`Accumulator`]: crate::accumulator::Accumulator /// [Aggregating

Re: [PR] feat: add `with_group_indices_order_mode` function for `GroupsAccumulator` to help create specialized impl [datafusion]

2025-04-20 Thread via GitHub

rluvaton commented on code in PR #15022: URL: https://github.com/apache/datafusion/pull/15022#discussion_r2051696277 ## datafusion/functions-aggregate-common/src/aggregate/groups_accumulator.rs: ## @@ -299,6 +420,12 @@ impl GroupsAccumulatorAdapter { } impl GroupsAccumulator

Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]

2025-04-20 Thread via GitHub

goldmedal commented on issue #15383: URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2817168799 I have another implementation for this issue https://github.com/goldmedal/datafusion/pull/4 The concept is that getting the row according to indices in the selection vecto

[PR] chore: Refactor Memory Pools [datafusion-comet]

2025-04-20 Thread via GitHub

EmilyMatt opened a new pull request, #1662: URL: https://github.com/apache/datafusion-comet/pull/1662 ## Which issue does this PR close? Closes #1661 . ## Rationale for this change As described in the issue, I just think the pools logic is becoming complex

Re: [PR] fix: Shuffle should maintain insertion order [datafusion-comet]

2025-04-20 Thread via GitHub

codecov-commenter commented on PR #1660: URL: https://github.com/apache/datafusion-comet/pull/1660#issuecomment-2817176336 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1660?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] Show current SQL recursion limit in RecursionLimitExceeded error message [datafusion]

2025-04-20 Thread via GitHub

kumarlokesh commented on PR #15644: URL: https://github.com/apache/datafusion/pull/15644#issuecomment-2817176100 > > thanks @kumarlokesh The code now looks much more aligned. > > perhaps we can factor out > > ``` > > self.parser > >

Re: [PR] Use `interleave` in hash repartitioning [datafusion]

2025-04-20 Thread via GitHub

Dandandan commented on code in PR #15768: URL: https://github.com/apache/datafusion/pull/15768#discussion_r2051731884 ## datafusion/physical-plan/src/repartition/mod.rs: ## @@ -298,25 +299,15 @@ impl BatchPartitioner { .into_iter()

[I] Suggestion: Refactor Memory Pools [datafusion-comet]

2025-04-20 Thread via GitHub

EmilyMatt opened a new issue, #1661: URL: https://github.com/apache/datafusion-comet/issues/1661 ### What is the problem the feature request solves? As there is now more than one memory Comet pool, and it stands to reason there will be more in the future, as well as with the increasin

Re: [PR] Introduce selection vector repartitioning [datafusion]

2025-04-20 Thread via GitHub

Dandandan commented on code in PR #15423: URL: https://github.com/apache/datafusion/pull/15423#discussion_r2051728746 ## datafusion/physical-plan/src/repartition/mod.rs: ## @@ -320,6 +336,68 @@ impl BatchPartitioner { Ok((partition, batch))

[I] Documentation for publishing crates is missing some crates [datafusion]

2025-04-20 Thread via GitHub

andygrove opened a new issue, #15781: URL: https://github.com/apache/datafusion/issues/15781 ### Describe the bug `datasource` needs to be published before `catalog`: ``` Packaging datafusion-catalog v47.0.0 (/Users/andy/git/apache/datafusion/datafusion/catalog) Up

Re: [PR] chore: Upgrade to datafusion 47.0.0-rc1 and arrow-rs 55.0.0 [datafusion-comet]

2025-04-20 Thread via GitHub

YanivKunda commented on PR #1563: URL: https://github.com/apache/datafusion-comet/pull/1563#issuecomment-2817207648 > I'd like to merge this one and then switch to the official release next week @andygrove here's the switch: https://github.com/apache/datafusion-comet/pull/1663 -

Re: [PR] experiment: Selectively remove CoalesceBatchesExec [datafusion]

2025-04-20 Thread via GitHub

ctsk commented on PR #15479: URL: https://github.com/apache/datafusion/pull/15479#issuecomment-2817209539 [benchmarks/hashagg-results.md](https://github.com/apache/datafusion/blob/05db68ca2e5efba9b76759178f50a0595ffd9939/benchmarks/hashagg-results.md) [benchmarks/join-results.md](https:/

[PR] chore: Upgrade to datafusion 47.0.0 [datafusion-comet]

2025-04-20 Thread via GitHub

YanivKunda opened a new pull request, #1663: URL: https://github.com/apache/datafusion-comet/pull/1663 ## Which issue does this PR close? Closes #1634 ## Rationale for this change Upgraded from 47.0.0-rc1 post https://github.com/apache/datafusion-comet/pull/1563 #

Re: [PR] Use `interleave` in hash repartitioning [datafusion]

[PR] Set HashJoin seed [datafusion]

Re: [PR] chore: Upgrade to datafusion 47.0.0 [datafusion-comet]

Re: [PR] Use `interleave` in hash repartitioning [datafusion]

Re: [PR] Add try_new for LogicalPlan::Join [datafusion]

Re: [PR] Use `interleave` in hash repartitioning [datafusion]

Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]

Re: [I] Remove nulls in joins during hash table lookup [datafusion]

Re: [PR] Feature/benchmark config from env [datafusion]

Re: [PR] fix: clickbench type err [datafusion]

Re: [I] Support non UTF-8 in CSV files [datafusion]

[PR] Feature/benchmark config from env [datafusion]

Re: [PR] Use `interleave` in hash repartitioning [datafusion]

Re: [PR] Use `interleave` in hash repartitioning [datafusion]

[I] Remove nulls in joins during hash table lookup [datafusion]

[I] nit: Hash Partitioning Shuffle Writes Last Batches First [datafusion-comet]

[PR] fix: Shuffle should maintain insertion order [datafusion-comet]

Re: [I] Support coercsion from `FixedSizeBinary` to `BinaryView` [datafusion]

Re: [I] Support coercsion from `FixedSizeBinary` to `BinaryView` [datafusion]

Re: [PR] refactor: rename to `with_column_renamed` to `rename` [datafusion-python]

Re: [PR] Introduce selection vector repartitioning [datafusion]

Re: [PR] experiment: Selectively remove CoalesceBatchesExec [datafusion]

Re: [PR] Use `interleave` in hash repartitioning [datafusion]

Re: [PR] feat: add `with_group_indices_order_mode` function for `GroupsAccumulator` to help create specialized impl [datafusion]

Re: [PR] feat: add `with_group_indices_order_mode` function for `GroupsAccumulator` to help create specialized impl [datafusion]

Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]

[PR] chore: Refactor Memory Pools [datafusion-comet]

Re: [PR] fix: Shuffle should maintain insertion order [datafusion-comet]

Re: [PR] Show current SQL recursion limit in RecursionLimitExceeded error message [datafusion]

Re: [PR] Use `interleave` in hash repartitioning [datafusion]

[I] Suggestion: Refactor Memory Pools [datafusion-comet]

Re: [PR] Introduce selection vector repartitioning [datafusion]

[I] Documentation for publishing crates is missing some crates [datafusion]

Re: [PR] chore: Upgrade to datafusion 47.0.0-rc1 and arrow-rs 55.0.0 [datafusion-comet]

Re: [PR] experiment: Selectively remove CoalesceBatchesExec [datafusion]

[PR] chore: Upgrade to datafusion 47.0.0 [datafusion-comet]

36 matches

Site Navigation

Mail list logo

Footer information