Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-07 Thread via GitHub
alamb commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3044919809 As a random aside, this is the kind of micro optimization that I would have been terrified of making in C/C++ land without extreme care to avoid concurrent access. With Rust I

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-07 Thread via GitHub
alamb commented on code in PR #16647: URL: https://github.com/apache/datafusion/pull/16647#discussion_r2187217486 ## datafusion/physical-plan/src/sorts/stream.rs: ## @@ -76,8 +76,40 @@ impl FusedStreams { } } +/// A pair of `Arc` that can be reused +#[derive(Debug)] +str

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-04 Thread via GitHub
Dandandan merged PR #16647: URL: https://github.com/apache/datafusion/pull/16647 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@data

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-03 Thread via GitHub
comphead commented on code in PR #16647: URL: https://github.com/apache/datafusion/pull/16647#discussion_r2183175335 ## datafusion/physical-plan/src/sorts/stream.rs: ## @@ -105,26 +110,57 @@ impl RowCursorStream { }) .collect::>>()?; -let stre

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-03 Thread via GitHub
zhuqi-lucas commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3032768991 > > πŸ€–: Benchmark completed > > Details > > ``` > > Comparing HEAD and reuse_rows > > > > Benchmark sort_tpch10.json > >

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-03 Thread via GitHub
Dandandan commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3032747385 > πŸ€–: Benchmark completed > > Details > > ``` > Comparing HEAD and reuse_rows > > Benchmark sort_tpch10.json > >

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-03 Thread via GitHub
alamb commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3032716133 πŸ€–: Benchmark completed Details ``` Comparing HEAD and reuse_rows Benchmark sort_tpch1.json ┏━━┳

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-03 Thread via GitHub
alamb commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3032712753 πŸ€– `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubun

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-03 Thread via GitHub
alamb commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3032712630 πŸ€–: Benchmark completed Details ``` Comparing HEAD and reuse_rows Benchmark sort_tpch10.json ┏━━

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-03 Thread via GitHub
alamb commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3032659676 πŸ€– `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubun

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-03 Thread via GitHub
alamb commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3032659548 πŸ€–: Benchmark completed Details ``` Comparing HEAD and reuse_rows Benchmark clickbench_extended.json ┏━━

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-03 Thread via GitHub
alamb commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3032511251 πŸ€– `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubun

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-03 Thread via GitHub
alamb commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3032383269 > @alamb could you maybe run it on `sort_tpch10`. Perhaps the difference is different when using more cores (16 vs 10 I believe?). I have queued this up and it should run in a fe

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-03 Thread via GitHub
Dandandan commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3032327248 ``` Benchmark sort_tpch10.json ┏━━┳━┳━┳━━━┓ ┃ Query┃main ┃ r

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-03 Thread via GitHub
Dandandan commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3031974821 > Sometimes, i found sort_tpch10 will get the more accurate or good result when we optimize the merge part, because our in_mem sort buffer is 1MB, so the sort_tpch will have less c

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-03 Thread via GitHub
zhuqi-lucas commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3031932686 Sometimes, i found sort_tpch10 will get the more accurate or good result when we optimize the merge part, because our in_mem sort buffer is 1MB, so the sort_tpch will have less c

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
Dandandan commented on code in PR #16647: URL: https://github.com/apache/datafusion/pull/16647#discussion_r2180819618 ## datafusion/physical-plan/src/sorts/stream.rs: ## @@ -105,26 +110,53 @@ impl RowCursorStream { }) .collect::>>()?; -let str

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
Dandandan commented on code in PR #16647: URL: https://github.com/apache/datafusion/pull/16647#discussion_r2180844635 ## datafusion/physical-plan/src/sorts/stream.rs: ## @@ -105,26 +110,53 @@ impl RowCursorStream { }) .collect::>>()?; -let str

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
Dandandan commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3029069816 > This is quite cool @Dandandan > > My only real concern is that this code will be tricky to maintain and could easily get reverted / regressed as part of a follow on change

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
alamb commented on code in PR #16647: URL: https://github.com/apache/datafusion/pull/16647#discussion_r2180789490 ## datafusion/physical-plan/src/sorts/stream.rs: ## @@ -88,6 +90,9 @@ pub struct RowCursorStream { streams: FusedStreams, /// Tracks the memory used by `co

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
alamb commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3028914931 πŸ€– `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubun

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
alamb commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3028969141 πŸ€–: Benchmark completed Details ``` Comparing HEAD and reuse_rows Benchmark sort_tpch.json ┏━━┳━

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
alamb commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3028965756 πŸ€– `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubun

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
alamb commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3028965671 πŸ€–: Benchmark completed Details ``` Comparing HEAD and reuse_rows Benchmark clickbench_extended.json ┏━━

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
Dandandan commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3028828620 > πŸ€–: Benchmark completed > > Details > > ``` > Comparing HEAD and reuse_rows > > Benchmark sort_tpch.json > > ┏

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
alamb commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3028800817 πŸ€–: Benchmark completed Details ``` Comparing HEAD and reuse_rows Benchmark sort_tpch.json ┏━━┳━

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
alamb commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3028797204 πŸ€– `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubun

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
alamb commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3028797121 πŸ€–: Benchmark completed Details ``` Comparing HEAD and reuse_rows Benchmark clickbench_extended.json ┏━━

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
alamb commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3028641060 > @alamb may I request some benchmark run? I really need to figure out how to script this automatically. I will see if I can get claude to do something for me -- This is an au

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
alamb commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3028639856 πŸ€– `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubun

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
Dandandan commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3028611122 @alamb may I request some benchmark run? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to g

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
Dandandan commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3028609059 > I believe we can increase the in-place memory for sorting benchmark here, here the default is 1MB. > > The result will largely affected by the in place sort memory buffer f

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
Dandandan commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3027535311 Ah ok I see you changed `MemoryStream` to allow for this πŸ€” -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
Dandandan commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3027527429 > I made the following changes to allow the test to support returning 8192 rows each time: ![Clipboard_Screenshot_1751455219](https://private-user-images.githubusercontent.com/

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
zhuqi-lucas commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3027537392 I believe we can increase the in-place memory for sorting benchmark here, here the default is 1MB. The result will largely affected by the in place sort memory buffer from

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
acking-you commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3027510420 I made the following changes to allow the test to support returning 8192 rows each time: ![Clipboard_Screenshot_1751455219](https://github.com/user-attachments/assets/85e2f088-

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
Dandandan commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3027365462 > > I had a look at this benchmark. It seems it only is testing a single 1M batch per partition/column? > > You mean streaming back RecordBatch (e.g., batch of 8192 rows) ins

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
acking-you commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3027354373 > I had a look at this benchmark. It seems it only is testing a single 1M batch per partition/column? You mean streaming back RecordBatch (e.g., batch of 8192 rows) instead

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
Dandandan commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3027311548 > ```shell > bench_merge_sorted_preserving > ``` I had a look at this benchmark. It seems it only is testing a single 1M batch per partition/column? If so, it wouldn

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
acking-you commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3027160766 This is the benchmark scenario where the test data has not been modified by default(multi large string): ```sh Benchmarking bench_merge_sorted_preserving/multiple_large_stri

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
acking-you commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3027041977 LGTM,I will try running the benchmark for `datafusion/physical-plan/benches/sort_preserving_merge.rs` to see if there is any improvement -- This is an automated message from th

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-02 Thread via GitHub
zhuqi-lucas commented on code in PR #16647: URL: https://github.com/apache/datafusion/pull/16647#discussion_r2179323995 ## datafusion/physical-plan/src/sorts/stream.rs: ## @@ -105,26 +110,53 @@ impl RowCursorStream { }) .collect::>>()?; -let s

Re: [PR] Reuse Rows allocation in RowCursorStream [datafusion]

2025-07-01 Thread via GitHub
Dandandan commented on PR #16647: URL: https://github.com/apache/datafusion/pull/16647#issuecomment-3026648547 FYI @zhuqi-lucas @acking-you -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the sp