berkaysynnada commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2030077255
##
datafusion/datasource/src/file_scan_config.rs:
##
@@ -858,6 +858,96 @@ impl FileScanConfig {
})
}
+/// Splits file groups into new gro
suremarc commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2021230783
##
datafusion/datasource/src/file_scan_config.rs:
##
@@ -575,6 +575,95 @@ impl FileScanConfig {
})
}
+/// Splits file groups into new groups b
xudong963 commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020557114
##
datafusion/datasource/src/file_scan_config.rs:
##
@@ -575,6 +575,95 @@ impl FileScanConfig {
})
}
+/// Splits file groups into new groups
xudong963 commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020499604
##
datafusion/datasource/src/file_scan_config.rs:
##
@@ -575,6 +575,95 @@ impl FileScanConfig {
})
}
+/// Splits file groups into new groups
xudong963 commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2024732164
##
datafusion/datasource/src/mod.rs:
##
@@ -313,6 +314,78 @@ async fn find_first_newline(
Ok(index)
}
+/// Generates test files with min-max statistics i
alamb merged PR #15473:
URL: https://github.com/apache/datafusion/pull/15473
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscr...@datafusi
alamb commented on PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#issuecomment-2779544090
Thanks @xudong963 @2010YOUY01 and @suremarc
alamb commented on PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#issuecomment-2779542553
> I'm curious why the PR triggers the `security audit` CI
- It is not related to this PR:
https://github.com/apache/datafusion/issues/15571
xudong963 commented on PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#issuecomment-2776255267
I'm curious why the PR triggers the `security audit` CI
2010YOUY01 commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2024515270
##
datafusion/datasource/src/mod.rs:
##
@@ -313,6 +314,78 @@ async fn find_first_newline(
Ok(index)
}
+/// Generates test files with min-max statistics
xudong963 commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2024400775
##
datafusion/datasource/src/mod.rs:
##
@@ -313,6 +314,78 @@ async fn find_first_newline(
Ok(index)
}
+/// Generates test files with min-max statistics i
suremarc commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020398544
##
datafusion/datasource/benches/split_groups_by_statistics.rs:
##
@@ -0,0 +1,178 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more co
xudong963 commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2022286476
##
datafusion/datasource/src/file_scan_config.rs:
##
@@ -,4 +2315,163 @@ mod tests {
assert_eq!(new_config.constraints, Constraints::default());
xudong963 commented on PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#issuecomment-2767850478
Thank you, @leoyvens!
I plan to add tests for the PR in the next two days, and then we can
continue to move it forward. Thanks for all your reviews!
leoyvens commented on PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#issuecomment-2766577490
In terms of default behaviour, I see there are planning time concerns
relative to this PR. For cases like mine, where files are lexicographically
sorted, just changing the way the d
leoyvens commented on PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#issuecomment-2766568954
I took some time to play with this, so I can provide an anecdotal report.
**Conclusion**
In my setup, this PR is a clear win for execution times.
**Configurations**
2010YOUY01 commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020712161
##
datafusion/datasource/src/file_scan_config.rs:
##
@@ -575,6 +575,95 @@ impl FileScanConfig {
})
}
+/// Splits file groups into new groups
leoyvens commented on PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#issuecomment-2766451717
I took some time to play with this, so I can provide an anecdotal report. I
compared three setups:
- This branch with `split_file_groups_by_statistics = true`
- main with `spli
xudong963 commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020973260
##
datafusion/datasource/src/file_scan_config.rs:
##
@@ -575,6 +575,95 @@ impl FileScanConfig {
})
}
+/// Splits file groups into new groups
Dandandan commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020805433
##
datafusion/datasource/src/file_scan_config.rs:
##
@@ -575,6 +575,95 @@ impl FileScanConfig {
})
}
+/// Splits file groups into new groups
xudong963 commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020883570
##
datafusion/datasource/src/file_scan_config.rs:
##
@@ -575,6 +575,95 @@ impl FileScanConfig {
})
}
+/// Splits file groups into new groups
xudong963 commented on PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#issuecomment-2765942628
> I suggest reusing the benchmark utilities for testing as well: random file
> group generation followed by the sort order check makes a great property test
Yes, that's what in my min
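The property test suggested in the quote above (generate random file groups, split them, then check the sort order) could look roughly like the sketch below. It is not the PR's code: `FileRange`, `Lcg`, `random_files`, `split_by_statistics`, and `groups_are_sorted` are hypothetical names, the greedy first-fit split is an assumed stand-in for the real method, and a single integer min/max pair stands in for real column statistics.

```rust
/// Hypothetical stand-in for a file carrying min/max statistics on the sort column.
#[derive(Debug, Clone, PartialEq)]
struct FileRange {
    min: i64,
    max: i64,
}

/// Tiny deterministic generator so the sketch needs no external crates.
struct Lcg(u64);

impl Lcg {
    fn next(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Generates `n` files with pseudo-random, possibly overlapping ranges.
fn random_files(rng: &mut Lcg, n: usize) -> Vec<FileRange> {
    (0..n)
        .map(|_| {
            let min = (rng.next() % 1000) as i64;
            let len = (rng.next() % 50) as i64;
            FileRange { min, max: min + len }
        })
        .collect()
}

/// Assumed greedy first-fit split: sort by min, then place each file in the
/// first group whose last file ends strictly before this file starts.
fn split_by_statistics(mut files: Vec<FileRange>) -> Vec<Vec<FileRange>> {
    files.sort_by_key(|f| (f.min, f.max));
    let mut groups: Vec<Vec<FileRange>> = Vec::new();
    for file in files {
        let slot = groups
            .iter()
            .position(|g| g.last().map_or(false, |last| last.max < file.min));
        match slot {
            Some(i) => groups[i].push(file),
            None => groups.push(vec![file]),
        }
    }
    groups
}

/// Property under test: inside every group, files are ordered and non-overlapping.
fn groups_are_sorted(groups: &[Vec<FileRange>]) -> bool {
    groups
        .iter()
        .all(|g| g.windows(2).all(|w| w[0].max < w[1].min))
}

fn main() {
    let mut rng = Lcg(42);
    for _ in 0..100 {
        let files = random_files(&mut rng, 64);
        let groups = split_by_statistics(files.clone());
        // Every group must respect the sort order.
        assert!(groups_are_sorted(&groups));
        // No file may be lost or duplicated by the split.
        let total: usize = groups.iter().map(Vec::len).sum();
        assert_eq!(total, files.len());
    }
    println!("all properties held");
}
```

Sharing one generator between the benchmark and the test would keep both exercising the same input shapes, which seems to be the point of the suggestion.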
xudong963 commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020523576
##
datafusion/datasource/benches/split_groups_by_statistics.rs:
##
@@ -0,0 +1,178 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more c
Dandandan commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020803934
##
datafusion/datasource/src/file_scan_config.rs:
##
@@ -575,6 +575,95 @@ impl FileScanConfig {
})
}
+/// Splits file groups into new groups
suremarc commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020364348
##
datafusion/datasource/src/file_scan_config.rs:
##
@@ -575,6 +575,95 @@ impl FileScanConfig {
})
}
+/// Splits file groups into new groups b
xudong963 commented on PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#issuecomment-2765053470
CI failure is expected because the file groups changed due to the new method.
I'll update the failed tests and add tests for the new method after we make
a consistence, espec
xudong963 commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020323603
##
datafusion/datasource/benches/split_groups_by_statistics.rs:
##
@@ -0,0 +1,178 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more c
leoyvens commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020240967
##
datafusion/datasource/benches/split_groups_by_statistics.rs:
##
@@ -0,0 +1,178 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more co
Dandandan commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018823379
##
datafusion/datasource/benches/split_groups_by_statistics.rs:
##
@@ -0,0 +1,178 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more c
xudong963 commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018831008
##
datafusion/datasource/benches/split_groups_by_statistics.rs:
##
@@ -0,0 +1,178 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more c
Copilot commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018163460
##
datafusion/datasource/src/file_scan_config.rs:
##
@@ -575,6 +575,95 @@ impl FileScanConfig {
})
}
+/// Splits file groups into new groups ba
xudong963 commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018226169
##
datafusion/datasource/benches/split_groups_by_statistics.rs:
##
@@ -0,0 +1,178 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more c
xudong963 commented on code in PR #15473:
URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018207072
##
datafusion/datasource/benches/split_groups_by_statistics.rs:
##
@@ -0,0 +1,178 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more c