Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-04-06 Thread via GitHub
berkaysynnada commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2030077255 ## datafusion/datasource/src/file_scan_config.rs: ## @@ -858,6 +858,96 @@ impl FileScanConfig { }) } +/// Splits file groups into new gro

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-04-05 Thread via GitHub
suremarc commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2021230783 ## datafusion/datasource/src/file_scan_config.rs: ## @@ -575,6 +575,95 @@ impl FileScanConfig { }) } +/// Splits file groups into new groups b

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-04-05 Thread via GitHub
xudong963 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020557114 ## datafusion/datasource/src/file_scan_config.rs: ## @@ -575,6 +575,95 @@ impl FileScanConfig { }) } +/// Splits file groups into new groups

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-04-05 Thread via GitHub
xudong963 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020499604 ## datafusion/datasource/src/file_scan_config.rs: ## @@ -575,6 +575,95 @@ impl FileScanConfig { }) } +/// Splits file groups into new groups

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-04-05 Thread via GitHub
xudong963 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2024732164 ## datafusion/datasource/src/mod.rs: ## @@ -313,6 +314,78 @@ async fn find_first_newline( Ok(index) } +/// Generates test files with min-max statistics i

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-04-04 Thread via GitHub
alamb merged PR #15473: URL: https://github.com/apache/datafusion/pull/15473 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-04-04 Thread via GitHub
alamb commented on PR #15473: URL: https://github.com/apache/datafusion/pull/15473#issuecomment-2779544090 Thanks @xudong963 @2010YOUY01 and @suremarc

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-04-04 Thread via GitHub
alamb commented on PR #15473: URL: https://github.com/apache/datafusion/pull/15473#issuecomment-2779542553 > I'm curious why the PR triggers the `security audit` CI - It is not related to this PR: https://github.com/apache/datafusion/issues/15571

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-04-03 Thread via GitHub
xudong963 commented on PR #15473: URL: https://github.com/apache/datafusion/pull/15473#issuecomment-2776255267 I'm curious why the PR triggers the `security audit` CI

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-04-02 Thread via GitHub
2010YOUY01 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2024515270 ## datafusion/datasource/src/mod.rs: ## @@ -313,6 +314,78 @@ async fn find_first_newline( Ok(index) } +/// Generates test files with min-max statistics

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-04-02 Thread via GitHub
xudong963 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2024400775 ## datafusion/datasource/src/mod.rs: ## @@ -313,6 +314,78 @@ async fn find_first_newline( Ok(index) } +/// Generates test files with min-max statistics i

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-04-01 Thread via GitHub
suremarc commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020398544 ## datafusion/datasource/benches/split_groups_by_statistics.rs: ## @@ -0,0 +1,178 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more co

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-04-01 Thread via GitHub
xudong963 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2022286476 ## datafusion/datasource/src/file_scan_config.rs: ## @@ -,4 +2315,163 @@ mod tests { assert_eq!(new_config.constraints, Constraints::default());

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-31 Thread via GitHub
xudong963 commented on PR #15473: URL: https://github.com/apache/datafusion/pull/15473#issuecomment-2767850478 Thank you, @leoyvens! I plan to add tests for the PR in the next two days, and then we can continue to move it forward. Thanks for all your review!

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-31 Thread via GitHub
leoyvens commented on PR #15473: URL: https://github.com/apache/datafusion/pull/15473#issuecomment-2766577490 In terms of default behaviour, I see there are planning time concerns relative to this PR. For cases like mine, where files are lexicographically sorted, just changing the way the d

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-31 Thread via GitHub
leoyvens commented on PR #15473: URL: https://github.com/apache/datafusion/pull/15473#issuecomment-2766568954 I took some time to play with this, so I can provide an anecdotal report. **Conclusion** In my setup, this PR is a clear win for execution times. **Configurations**

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-31 Thread via GitHub
2010YOUY01 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020712161 ## datafusion/datasource/src/file_scan_config.rs: ## @@ -575,6 +575,95 @@ impl FileScanConfig { }) } +/// Splits file groups into new groups

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-31 Thread via GitHub
leoyvens commented on PR #15473: URL: https://github.com/apache/datafusion/pull/15473#issuecomment-2766451717 I took some time to play with this, so I can provide an anecdotal report. I compared three setups: - This branch with `split_file_groups_by_statistics = true` - main with `spli

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-31 Thread via GitHub
xudong963 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020973260 ## datafusion/datasource/src/file_scan_config.rs: ## @@ -575,6 +575,95 @@ impl FileScanConfig { }) } +/// Splits file groups into new groups

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-31 Thread via GitHub
Dandandan commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020805433 ## datafusion/datasource/src/file_scan_config.rs: ## @@ -575,6 +575,95 @@ impl FileScanConfig { }) } +/// Splits file groups into new groups

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-31 Thread via GitHub
xudong963 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020883570 ## datafusion/datasource/src/file_scan_config.rs: ## @@ -575,6 +575,95 @@ impl FileScanConfig { }) } +/// Splits file groups into new groups

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-31 Thread via GitHub
xudong963 commented on PR #15473: URL: https://github.com/apache/datafusion/pull/15473#issuecomment-2765942628 > I suggest to reuse the benchmark utilities also for testing, random file group generation and the later sort order check is a great property test Yes, that's what in my min

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-31 Thread via GitHub
xudong963 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020523576 ## datafusion/datasource/benches/split_groups_by_statistics.rs: ## @@ -0,0 +1,178 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more c

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-31 Thread via GitHub
Dandandan commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020803934 ## datafusion/datasource/src/file_scan_config.rs: ## @@ -575,6 +575,95 @@ impl FileScanConfig { }) } +/// Splits file groups into new groups

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-30 Thread via GitHub
suremarc commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020364348 ## datafusion/datasource/src/file_scan_config.rs: ## @@ -575,6 +575,95 @@ impl FileScanConfig { }) } +/// Splits file groups into new groups b

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-30 Thread via GitHub
xudong963 commented on PR #15473: URL: https://github.com/apache/datafusion/pull/15473#issuecomment-2765053470 The CI failure is expected because the file groups changed due to the new method. I'll update the failed tests and add tests for the new method after we reach a consensus, espec

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-30 Thread via GitHub
xudong963 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020323603 ## datafusion/datasource/benches/split_groups_by_statistics.rs: ## @@ -0,0 +1,178 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more c

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-30 Thread via GitHub
leoyvens commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020240967 ## datafusion/datasource/benches/split_groups_by_statistics.rs: ## @@ -0,0 +1,178 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more co

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-28 Thread via GitHub
Dandandan commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018823379 ## datafusion/datasource/benches/split_groups_by_statistics.rs: ## @@ -0,0 +1,178 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more c

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-28 Thread via GitHub
xudong963 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018831008 ## datafusion/datasource/benches/split_groups_by_statistics.rs: ## @@ -0,0 +1,178 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more c

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-28 Thread via GitHub
Copilot commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018163460 ## datafusion/datasource/src/file_scan_config.rs: ## @@ -575,6 +575,95 @@ impl FileScanConfig { }) } +/// Splits file groups into new groups ba

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-28 Thread via GitHub
xudong963 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018226169 ## datafusion/datasource/benches/split_groups_by_statistics.rs: ## @@ -0,0 +1,178 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more c

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-28 Thread via GitHub
xudong963 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018207072 ## datafusion/datasource/benches/split_groups_by_statistics.rs: ## @@ -0,0 +1,178 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more c