Re: [PR] Fix internal error in sort when hitting memory limit [datafusion]
DerGut commented on code in PR #15692:
URL: https://github.com/apache/datafusion/pull/15692#discussion_r2040700866
##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -765,6 +765,25 @@ impl ExternalSorter {
         Ok(())
     }
+
+    /// Reserves memory to be able to accommodate the given batch.
+    /// If memory is scarce, tries to spill current in-memory batches to disk first.
+    async fn reserve_memory_for_batch(&mut self, input: &RecordBatch) -> Result<()> {
+        let size = get_reserved_byte_for_record_batch(input);
+
+        let result = self.reservation.try_grow(size);
+        if result.is_err() {
+            if self.in_mem_batches.is_empty() {
+                return result;
Review Comment:
@2010YOUY01 since you commented on the issue https://github.com/apache/datafusion/issues/15675#issuecomment-2796177594 with suggestions:
> I think the fix (for a more user-friendly error message) would be
>
> 1. When sort_spill_reservation_byte reservation failed, the error message should include considering set the memory limit larger than this reservation memory size
> 2. For the code path triggered this internal error, return an ExecutionError instead with a similar error message

This code now emits a [`DataFusionError::ResourcesExhausted`](https://github.com/apache/datafusion/blob/0b01fdf7f02f9097c319204058576f420b9790b4/datafusion/common/src/error.rs#L126) instead of the suggested [`DataFusionError::Execution`](https://github.com/apache/datafusion/blob/0b01fdf7f02f9097c319204058576f420b9790b4/datafusion/common/src/error.rs#L117). I think this makes sense given the descriptions of the two errors. The Execution error doesn't seem to apply:
> This error is returned when an error happens during execution due to a malformed input.

but ResourcesExhausted does:
> This error is thrown when a consumer cannot acquire additional memory

---

I still like the idea of suggesting a memory-limit bump in the error message. However, since the code simply bubbles up the error emitted by the memory pool's [`insufficient_capacity_err`](https://github.com/apache/datafusion/blob/0b01fdf7f02f9097c319204058576f420b9790b4/datafusion/execution/src/memory_pool/pool.rs#L249) macro, we would either need to map the error or change the macro's message. I don't think mapping makes sense here, because the error is generally used for reservation-related failures and this isn't a special case. Changing the macro's message would probably warrant a separate discussion.
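For illustration, the reserve-then-spill-then-retry control flow from the diff can be sketched with toy types. `ToyPool` and `ToySorter` below are stand-ins I made up; DataFusion's actual `MemoryReservation`/`ExternalSorter` APIs differ:

```rust
// Minimal standalone sketch of the "try to reserve; if that fails, spill
// in-memory batches and retry" pattern. Not DataFusion's real API.

struct ToyPool {
    limit: usize,
    used: usize,
}

impl ToyPool {
    fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        if self.used + bytes > self.limit {
            Err(format!(
                "Resources exhausted: cannot reserve {bytes} bytes ({} of {} used)",
                self.used, self.limit
            ))
        } else {
            self.used += bytes;
            Ok(())
        }
    }

    fn free(&mut self, bytes: usize) {
        self.used = self.used.saturating_sub(bytes);
    }
}

struct ToySorter {
    pool: ToyPool,
    in_mem_batches: Vec<usize>, // batch sizes in bytes
}

impl ToySorter {
    fn reserve_memory_for_batch(&mut self, size: usize) -> Result<(), String> {
        let result = self.pool.try_grow(size);
        if result.is_err() {
            // Nothing left to spill: surface the pool's error instead of
            // failing later with an internal error.
            if self.in_mem_batches.is_empty() {
                return result;
            }
            // Spill the in-memory batches, release their memory, retry.
            let spilled: usize = self.in_mem_batches.drain(..).sum();
            self.pool.free(spilled);
            return self.pool.try_grow(size);
        }
        result
    }
}

fn main() {
    let mut sorter = ToySorter {
        pool: ToyPool { limit: 100, used: 0 },
        in_mem_batches: vec![],
    };
    // First batch fits.
    assert!(sorter.reserve_memory_for_batch(60).is_ok());
    sorter.in_mem_batches.push(60);
    // Second batch doesn't fit; spilling frees the first, the retry succeeds.
    assert!(sorter.reserve_memory_for_batch(80).is_ok());
    // A batch larger than the whole limit fails even after spilling,
    // and the pool's error bubbles up unchanged.
    assert!(sorter.reserve_memory_for_batch(200).is_err());
    println!("ok");
}
```

The key point the PR relies on is the last branch: when `in_mem_batches` is empty, the reservation error from the pool is returned as-is rather than tripping an internal-error path.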
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]
timsaucer commented on issue #15072:
URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2798792530
I spoke too soon - I'm getting one error in our unit tests on `last_value`. I'm trying to investigate this morning.
Re: [PR] POC: Cascaded spill merge and re-spill [datafusion]
2010YOUY01 commented on PR #15610:
URL: https://github.com/apache/datafusion/pull/15610#issuecomment-2798787828
Benchmark results:
(I think there is no significant regression for an extra round of re-spill, if it's running on a machine with fast SSDs.)
### Environment
MacBook Pro with m4-pro chip (disk bandwidth is around 8000MB/s)
### Sorting 'thin' table
1. Run datafusion-cli with `cargo run --profile release-nonlto -- --mem-pool-type fair -m 100M`
2. Execute `explain analyze select * from generate_series(1, 10) as t1(v1) order by v1;`

Main: 37s (merge ~170 spill files at once)
PR (with `sort_max_spill_merge_degree = 16`; one round of re-spill): 43s
PR (with `sort_max_spill_merge_degree = 10`; two rounds of re-spill): 49s
### Sorting 'fat' table
Run `sort_tpch` benchmark q7
```
// Q7: 3 sort keys {(INTEGER, 7), (BIGINT, 10k), (BIGINT, 1.5M)} + 12 all other columns
r#"
SELECT l_linenumber, l_suppkey, l_orderkey,
       l_partkey, l_quantity, l_extendedprice, l_discount, l_tax,
       l_returnflag, l_linestatus, l_shipdate, l_commitdate,
       l_receiptdate, l_shipinstruct, l_shipmode
FROM lineitem
ORDER BY l_linenumber, l_suppkey, l_orderkey
"#,
```
Benchmark command
```sh
cargo run --profile release-nonlto --bin dfbench -- sort-tpch -p /Users/yongting/Code/datafusion/benchmarks/data/tpch_sf10 -q 7 --memory-limit 1.2G
```
Notes:
- `target_partitions` config set to 14; later configurations and results depend on this setting.
- For the PR's benchmark runs, `sort_max_spill_merge_degree` is manually changed to 6; as a result:
  - under the 1.2G memory limit, 1 round of re-spill is triggered
  - under the 500M memory limit, 2 rounds of re-spill happen
Result
Main (1.2G):
```
Q7 iteration 0 took 9374.7 ms and returned 59986052 rows
Q7 iteration 1 took 8117.6 ms and returned 59986052 rows
Q7 iteration 2 took 8549.1 ms and returned 59986052 rows
Q7 avg time: 8680.47 ms
```
Main (500M):
```
Fail with OOM
```
PR (1.2G):
```
Q7 iteration 0 took 10723.6 ms and returned 59986052 rows
Q7 iteration 1 took 12962.8 ms and returned 59986052 rows
Q7 iteration 2 took 11739.7 ms and returned 59986052 rows
Q7 avg time: 11808.71 ms
```
PR (500M):
```
Q7 iteration 0 took 16233.1 ms and returned 59986052 rows
Q7 iteration 1 took 18568.4 ms and returned 59986052 rows
Q7 iteration 2 took 19173.4 ms and returned 59986052 rows
Q7 avg time: 17991.67 ms
```
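As a sanity check on the re-spill counts above: with `n` spill files and a max merge degree `d`, each merge pass reduces the file count by up to a factor of `d`, so the total pass count is roughly `ceil(log_d(n))`. A tiny sketch (`merge_passes` is a hypothetical helper of mine, not code from the PR):

```rust
// Estimate how many merge passes a cascaded spill merge needs when each
// pass can merge at most `d` files into one.

fn merge_passes(mut n: usize, d: usize) -> usize {
    assert!(d >= 2, "merge degree must be at least 2");
    let mut passes = 0;
    while n > 1 {
        n = n.div_ceil(d); // one pass: every group of up to `d` files -> 1 file
        passes += 1;
    }
    passes
}

fn main() {
    // ~170 spill files, as in the 'thin' table run:
    // degree 16 -> 2 passes total, i.e. one round of re-spill;
    // degree 10 -> 3 passes total, i.e. two rounds of re-spill.
    assert_eq!(merge_passes(170, 16), 2);
    assert_eq!(merge_passes(170, 10), 3);
    println!("ok");
}
```

This matches the observed "one round of re-spill at degree 16, two rounds at degree 10" for the ~170-file run.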
Re: [PR] POC: Cascaded spill merge and re-spill [datafusion]
2010YOUY01 commented on PR #15610:
URL: https://github.com/apache/datafusion/pull/15610#issuecomment-2798789296
I have made the following updates:
- Address review comments
- Introduced a new configuration option for max merge degree
- Add tests

It's ready for another look.
Re: [PR] fix: normalize window ident [datafusion]
alamb merged PR #15639:
URL: https://github.com/apache/datafusion/pull/15639
Re: [I] Support window definitions with aliases [datafusion]
alamb closed issue #15605: Support window definitions with aliases
URL: https://github.com/apache/datafusion/issues/15605
Re: [PR] Enhance: simplify `x=x` --> `x IS NOT NULL OR NULL` [datafusion]
alamb commented on PR #15589:
URL: https://github.com/apache/datafusion/pull/15589#issuecomment-2798807271
Thanks again @ding-young π
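The merged rewrite can be checked against SQL's three-valued logic: `x = x` evaluates to NULL when `x` is NULL and TRUE otherwise, which is exactly `(x IS NOT NULL) OR NULL` (since TRUE OR NULL = TRUE and FALSE OR NULL = NULL). A small sketch using `Option<bool>` to model a nullable boolean (my own modelling, not DataFusion code):

```rust
// Model SQL three-valued logic with Option<bool> (None = SQL NULL) and
// check that `x = x` and `x IS NOT NULL OR NULL` agree on all inputs.

fn eq_self(x: Option<i64>) -> Option<bool> {
    // SQL: x = x is NULL when x is NULL, TRUE otherwise.
    x.map(|_| true)
}

fn is_not_null_or_null(x: Option<i64>) -> Option<bool> {
    // SQL: (x IS NOT NULL) OR NULL.
    // Three-valued OR: TRUE OR NULL = TRUE, FALSE OR NULL = NULL.
    if x.is_some() {
        Some(true)
    } else {
        None
    }
}

fn main() {
    for x in [Some(42_i64), None] {
        assert_eq!(eq_self(x), is_not_null_or_null(x));
    }
    println!("equivalent");
}
```

The `OR NULL` tail is what preserves the NULL result for NULL inputs, which a plain `x IS NOT NULL` rewrite would lose.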
Re: [I] Trivial WHERE filter not eliminated when combined with CTE [datafusion]
alamb closed issue #15387: Trivial WHERE filter not eliminated when combined with CTE
URL: https://github.com/apache/datafusion/issues/15387
Re: [PR] Enhance: simplify `x=x` --> `x IS NOT NULL OR NULL` [datafusion]
alamb merged PR #15589:
URL: https://github.com/apache/datafusion/pull/15589
Re: [I] Reduce number of tokio blocking threads in SortExec spill [datafusion]
alamb closed issue #15323: Reduce number of tokio blocking threads in SortExec spill
URL: https://github.com/apache/datafusion/issues/15323
Re: [PR] fix: normalize window ident [datafusion]
alamb commented on PR #15639:
URL: https://github.com/apache/datafusion/pull/15639#issuecomment-2798807532
Thanks @chenkovsky and @comphead
Re: [I] Allow parsing byte literals as FixedSizeBinary [datafusion]
alamb commented on issue #15686:
URL: https://github.com/apache/datafusion/issues/15686#issuecomment-2798808874
> Coercing the LHS to Binary is less performant. Unconditionally parsing literals as FixedSizeBytes would be a breaking change.

What about coercing the RHS to FixedSizeBinary? So
```sql
where byte_column = x'deadbeef'
```
becomes
```sql
where byte_column = arrow_cast(x'deadbeef', FixedSizeBinary)
```
This would have the benefit of:
1. Easier to use (no config flag)
2. Could handle non-exactly-sized things, like casting `0xdead` to `0xdead` for comparison with 4-byte integers, for example
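The suggested coercion direction can be sketched with toy types. `ToyType` and `coerce_literal_for_eq` below are hypothetical names of mine, not DataFusion's coercion API; the sketch assumes the literal is cast only when its byte length matches the column's fixed size:

```rust
// Sketch: when comparing a FixedSizeBinary(n) column with a Binary
// literal, cast the literal side to FixedSizeBinary(n) if its length
// matches, instead of widening the column side to Binary.

#[derive(Debug, PartialEq)]
enum ToyType {
    Binary,
    FixedSizeBinary(usize),
}

fn coerce_literal_for_eq(column: &ToyType, literal_len: usize) -> Option<ToyType> {
    match column {
        // x'deadbeef' (4 bytes) vs FixedSizeBinary(4): cast the literal,
        // keep the column type untouched.
        ToyType::FixedSizeBinary(n) if *n == literal_len => {
            Some(ToyType::FixedSizeBinary(*n))
        }
        // Length mismatch: no lossless cast of the literal; signal a
        // fallback (e.g. widening, or padding rules decided elsewhere).
        ToyType::FixedSizeBinary(_) => None,
        // Plain Binary column: nothing to do.
        ToyType::Binary => Some(ToyType::Binary),
    }
}

fn main() {
    assert_eq!(
        coerce_literal_for_eq(&ToyType::FixedSizeBinary(4), 4),
        Some(ToyType::FixedSizeBinary(4))
    );
    assert_eq!(coerce_literal_for_eq(&ToyType::FixedSizeBinary(4), 2), None);
    println!("ok");
}
```

The mismatch case is where the "non-exactly-sized" handling from point 2 would plug in; this sketch just flags it rather than choosing a padding rule.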
Re: [PR] add config parse_hex_as_fixed_size_binary [datafusion]
alamb commented on PR #15687:
URL: https://github.com/apache/datafusion/pull/15687#issuecomment-2798809026
I have an alternate suggestion here: https://github.com/apache/datafusion/issues/15686#issuecomment-2798808874
Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]
timsaucer commented on issue #15072:
URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2798806510
I did a little investigating, but I don't have time for a couple of days to dive in deeper. This appears to be related to https://github.com/apache/datafusion/pull/15542
@UBarney do you know why the sql unit tests needed to be changed to pass? It seems like we have a regression in `datafusion-python` related to this change.
Re: [PR] Add Table Functions to FFI Crate [datafusion]
timsaucer merged PR #15581:
URL: https://github.com/apache/datafusion/pull/15581
Re: [PR] Remove waits from blocking threads reading spill files. [datafusion]
alamb merged PR #15654:
URL: https://github.com/apache/datafusion/pull/15654
Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]
alamb commented on issue #15072:
URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2798810667
> I did a little investigating, but I don't have time for a couple of days to dive in deeper. This appears to be related to [#15542](https://github.com/apache/datafusion/pull/15542) [@UBarney](https://github.com/UBarney) do you know why the sql unit tests needed to be changed to pass? It seems like we have a regression in `datafusion-python` related to this change.

I think @andygrove filed a ticket for this one - https://github.com/apache/datafusion/issues/15676
I didn't fully follow the discussion -- but it seems like that issue has been closed
Re: [PR] Add Table Functions to FFI Crate [datafusion]
alamb commented on PR #15581:
URL: https://github.com/apache/datafusion/pull/15581#issuecomment-2798809598
Very exciting -- thanks @timsaucer
Re: [PR] Remove waits from blocking threads reading spill files. [datafusion]
alamb commented on PR #15654:
URL: https://github.com/apache/datafusion/pull/15654#issuecomment-2798807598
π
Re: [PR] chore(deps): bump sysinfo from 0.33.1 to 0.34.2 [datafusion]
alamb merged PR #15682:
URL: https://github.com/apache/datafusion/pull/15682
[I] first_value and last_value should have identical signatures [datafusion]
timsaucer opened a new issue, #12376:
URL: https://github.com/apache/datafusion/issues/12376
### Describe the bug
Less of a bug per se, but it would be nice to have identical function signatures between first_value and last_value
### To Reproduce
_No response_
### Expected behavior
_No response_
### Additional context
_No response_
Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]
acking-you commented on issue #15631:
URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2798742956
> @acking-you the code needs to be extended to support nulls (you can take a look at the true_count implementation in arrow-rs to do this efficiently).

I have an idea for avoiding the early-exit performance loss, and I'm trying it out. If it works, I'll post the code.
Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]
timsaucer commented on issue #15072:
URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2798821873
> > I did a little investigating, but I don't have time for a couple of days to dive in deeper. This appears to be related to [#15542](https://github.com/apache/datafusion/pull/15542) [@UBarney](https://github.com/UBarney) do you know why the sql unit tests needed to be changed to pass? It seems like we have a regression in `datafusion-python` related to this change.
>
> I think [@andygrove](https://github.com/andygrove) filed a ticket for this one
>
> * [Regression in `last_value` functionality #15676](https://github.com/apache/datafusion/issues/15676)
>
> I didn't fully follow the discussion -- but it seems like that issue has been closed

I've read the discussion now and I think I'm in agreement that it's not an actual regression, since the aggregation has no deterministic outcome without ordering assigned.
Re: [I] executor can't read s3 config in push-staged mode. [datafusion-ballista]
milenkovicm closed issue #1235: executor can't read s3 config in push-staged mode.
URL: https://github.com/apache/datafusion-ballista/issues/1235
Re: [PR] fix: executor can't read s3 config in push-staged mode [datafusion-ballista]
milenkovicm merged PR #1236:
URL: https://github.com/apache/datafusion-ballista/pull/1236
Re: [PR] Add support for `INHERITS` option in `CREATE TABLE` statement [datafusion-sqlparser-rs]
LucaCappelletti94 commented on code in PR #1806:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1806#discussion_r2040654851
##
src/parser/mod.rs:
##
@@ -7070,13 +7071,22 @@ impl<'a> Parser<'a> {
         }
     }
-    /// Parse configuration like partitioning, clustering information during the table creation.
+    /// Parse configuration like inheritance, partitioning, clustering information during the table creation.
     ///
     /// [BigQuery](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#syntax_2)
-    /// [PostgreSQL](https://www.postgresql.org/docs/current/ddl-partitioning.html)
+    /// [PostgreSQL Partitioning](https://www.postgresql.org/docs/current/ddl-partitioning.html)
+    /// [PostgreSQL Inheritance](https://www.postgresql.org/docs/current/ddl-inherit.html)
     fn parse_optional_create_table_config(
         &mut self,
     ) -> Result {
+        let inherits = if dialect_of!(self is BigQueryDialect | PostgreSqlDialect | GenericDialect)
Review Comment:
The dialect thing must have been a copy paste from the line below and then my brain decided to drop that pointer of information - fixed as described!
Re: [PR] Add support for `INHERITS` option in `CREATE TABLE` statement [datafusion-sqlparser-rs]
LucaCappelletti94 commented on code in PR #1806:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1806#discussion_r2040655023
##
tests/sqlparser_postgres.rs:
##
@@ -2733,6 +2733,39 @@ fn parse_create_brin() {
     }
 }
+#[test]
+fn parse_create_table_with_inherits() {
+    let single_inheritance_sql =
+        "CREATE TABLE child_table (child_column INT) INHERITS (public.parent_table)";
+    match pg().verified_stmt(single_inheritance_sql) {
+        Statement::CreateTable(CreateTable {
+            inherits: Some(inherits),
+            ..
+        }) => {
+            assert_eq_vec(&["public", "parent_table"], &inherits[0].0);
+        }
+        _ => unreachable!(),
+    }
+
+    let double_inheritance_sql = "CREATE TABLE child_table (child_column INT) INHERITS (public.parent_table, pg_catalog.pg_settings)";
+    match pg().verified_stmt(double_inheritance_sql) {
+        Statement::CreateTable(CreateTable {
+            inherits: Some(inherits),
+            ..
+        }) => {
+            assert_eq_vec(&["public", "parent_table"], &inherits[0].0);
+            assert_eq_vec(&["pg_catalog", "pg_settings"], &inherits[1].0);
+        }
+        _ => unreachable!(),
+    }
+}
+
+#[test]
+#[should_panic]
+fn parse_create_table_with_empty_inherits_fails() {
+    pg().verified_stmt("CREATE TABLE child_table (child_column INT) INHERITS ()");
Review Comment:
Refactored test as described!
[PR] chore: Prepare for datafusion 47.0.0 + arrow-rs 55.0.0 [datafusion-comet]
andygrove opened a new pull request, #1642:
URL: https://github.com/apache/datafusion-comet/pull/1642
## Which issue does this PR close?
Closes #.
## Rationale for this change
## What changes are included in this PR?
## How are these changes tested?
Re: [PR] Set DataFusion runtime configurations through SQL interface [datafusion]
kumarlokesh commented on code in PR #15594:
URL: https://github.com/apache/datafusion/pull/15594#discussion_r2040707194
##
datafusion/core/tests/sql/runtime_config.rs:
##
@@ -0,0 +1,73 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Tests for runtime configuration SQL interface
+
+use datafusion::prelude::*;
+use datafusion_common::Result;
+
+#[tokio::test]
+async fn test_set_memory_limit() -> Result<()> {
+    let ctx = SessionContext::new();
+
+    // Set memory limit to 100MB using SQL - note the quotes around the value
+    ctx.sql("SET datafusion.runtime.memory_limit = '100M'")
Review Comment:
> Running this test can't ensure this config is set, because it can run without setting the memory limit and use the default `UnboundedMemoryPool`; instead I think it should assert the spill count for a query that has spilled some intermediate data.
>
> ```
> SET datafusion.runtime.memory_limit = '1M'
> set datafusion.execution.sort_spill_reservation_bytes = 0;
>
> select * from generate_series(1, 10) as t1(v1) order by v1;
> -- And assert spill-count from the query is > 0
> ```
>
> You can check https://github.com/apache/datafusion/blob/4ae0edea8eb7d5ad5df07e22a8fb39aa9fd1e2df/datafusion/core/tests/memory_limit/mod.rs#L525 for how to assert the spill file count.
>
> Later, after all configurations are added, I think we should make this test case stronger by setting more runtime configs and doing some property testing to ensure all of them are properly set.
>
> BTW, I tried the above test locally, and the `generate_series` UDTF seems not registered in the PR branch, but it works in the main branch.

@2010YOUY01 @berkaysynnada addressed above in https://github.com/apache/datafusion/pull/15594/commits/6501f218e2c3f7c97316794b2137e7b3672b77a9.
Re: [I] Perf: Support automatically concat_batches for sort which will improve performance [datafusion]
Dandandan commented on issue #15375:
URL: https://github.com/apache/datafusion/issues/15375#issuecomment-2798960792
@2010YOUY01 that sounds like a very promising future direction. I might try experimenting with this soon if no one beats me to it.
[PR] build(deps): bump mimalloc from 0.1.43 to 0.1.46 [datafusion-python]
dependabot[bot] opened a new pull request, #1105: URL: https://github.com/apache/datafusion-python/pull/1105 Bumps [mimalloc](https://github.com/purpleprotocol/mimalloc_rust) from 0.1.43 to 0.1.46. Release notes Sourced from https://github.com/purpleprotocol/mimalloc_rust/releases";>mimalloc's releases. Version 0.1.46 Changes Fixed musl builds. Version 0.1.45 Changes Mimalloc v2.2.3 Version 0.1.44 Changes Mimalloc v2.2.2 Commits https://github.com/purpleprotocol/mimalloc_rust/commit/59d7ee3ba7aa030b46a7cf721720f63a579b7fdb";>59d7ee3 v0.1.46 https://github.com/purpleprotocol/mimalloc_rust/commit/585ef6acf1c97f44e18b06cc445bd689e42ef739";>585ef6a Merge pull request https://redirect.github.com/purpleprotocol/mimalloc_rust/issues/134";>#134 from purpleprotocol/patch_alpine https://github.com/purpleprotocol/mimalloc_rust/commit/65cd7894e7ae1ce035584d669f58b9b4a423782e";>65cd789 Patch Alpine https://github.com/purpleprotocol/mimalloc_rust/commit/b6ce368277c1862cdae040bebbe37840de4a0aea";>b6ce368 v0.1.45 https://github.com/purpleprotocol/mimalloc_rust/commit/60e1a4bc6bd2895fff76063b2594546e40cf9b48";>60e1a4b Update to mimalloc v2.2.3 https://github.com/purpleprotocol/mimalloc_rust/commit/bbf61305ad2d9d1c4b417bd277e36caec31d21b7";>bbf6130 v0.1.44 https://github.com/purpleprotocol/mimalloc_rust/commit/3e7e3214ea3397ffc922baaa42933e5091482073";>3e7e321 Patch windows https://github.com/purpleprotocol/mimalloc_rust/commit/f37c56ad7c3ed13904afd272e59f3ca80f7b02b8";>f37c56a Clippy https://github.com/purpleprotocol/mimalloc_rust/commit/c00538aa0b553e7eecab40b114fedb56f414ddc7";>c00538a Update to mimalloc v2.2.2 https://github.com/purpleprotocol/mimalloc_rust/commit/cc1c72a62bf3edefe5abf842b60092f088740860";>cc1c72a Merge pull request https://redirect.github.com/purpleprotocol/mimalloc_rust/issues/126";>#126 from richerfu/master Additional commits viewable in https://github.com/purpleprotocol/mimalloc_rust/compare/v0.1.43...v0.1.46";>compare view 
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. --- Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] build(deps): bump mimalloc from 0.1.43 to 0.1.45 [datafusion-python]
dependabot[bot] closed pull request #1095: build(deps): bump mimalloc from 0.1.43 to 0.1.45 URL: https://github.com/apache/datafusion-python/pull/1095
Re: [PR] build(deps): bump mimalloc from 0.1.43 to 0.1.45 [datafusion-python]
dependabot[bot] commented on PR #1095: URL: https://github.com/apache/datafusion-python/pull/1095#issuecomment-2798993546 Superseded by #1105.
Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
acking-you commented on PR #15694: URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2799010340
The error in `cargo test` is caused by an incorrect calculation of the pre-selection. The correct steps for calculating the pre-selection are as follows:
1. Compute the boolean array on the left-hand side.
2. Filter to obtain a new record batch based on the left-hand boolean array.
3. Compute the boolean array on the right-hand side using the new record batch (note that this boolean array will have incorrect positions, since it corresponds to the new record batch).
4. Combine the left-hand and right-hand boolean arrays to produce the correct boolean array (modify the positions in the left-hand array marked as `true` based on the values from the right-hand array).
To illustrate this, I've drawn a diagram (it's quite rough since I used a mouse, so I hope you don't mind).
The current code only handles up to the second step, which is why the error occurs. If you use `evaluate_selection`, the issue persists because its internal call to `scatter` (which corresponds to completing the fourth step) fills in the missing parts of the right-hand boolean array with null values, resulting in an incorrect outcome. I'm currently working on implementing the fourth step to fix the issue, but it may take some time.
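The combine in step 4 can be sketched in isolation. Below is a minimal stand-alone Rust sketch (a hypothetical helper over plain `bool` slices, not DataFusion's actual arrow kernels), assuming the right-hand mask has exactly one entry per `true` position of the left-hand mask:

```rust
// Combine a left-hand boolean mask computed over the original batch with a
// right-hand mask computed only on the rows that survived the left filter,
// producing a mask aligned to the original batch (step 4 above).
fn combine_masks(left: &[bool], right: &[bool]) -> Vec<bool> {
    let mut right_iter = right.iter();
    left.iter()
        .map(|&l| {
            if l {
                // this row survived the left filter, so it has a right-hand value
                *right_iter.next().expect("right mask shorter than left's true count")
            } else {
                // short-circuited: the left side already decided `false`
                false
            }
        })
        .collect()
}

fn main() {
    // left mask selects rows 0, 2, 3 of the original 4-row batch
    let left = [true, false, true, true];
    // right mask was evaluated only on those 3 surviving rows
    let right = [true, false, true];
    assert_eq!(combine_masks(&left, &right), vec![true, false, false, true]);
}
```

Unlike `scatter`, which fills the non-selected positions with nulls, this fills them with `false`, which is the desired semantics for an `AND` short-circuit.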
Re: [I] Allow parsing byte literals as FixedSizeBinary [datafusion]
leoyvens commented on issue #15686: URL: https://github.com/apache/datafusion/issues/15686#issuecomment-2799013192 @alamb thank you for looking at this. Avoiding a config flag would be nice, but I'm skeptical of the proposed coercion. If we coerce `binary` to `fixed(N)` when encountering any expression of type `fixed(N) = binary`, this would be a fallible coercion, while the infallible coercion of `fixed(N)` to `binary` would allow the expression to be evaluated for all values. DataFusion has already chosen this more lenient coercion for lists. The following: ``` select arrow_cast([1], 'FixedSizeList(1, Int64)') = [1, 2]; ``` evaluates today to `false` instead of giving a coercion error.
Re: [PR] Add support for `INHERITS` option in `CREATE TABLE` statement [datafusion-sqlparser-rs]
iffyio merged PR #1806: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1806
Re: [PR] Fix internal error in sort when hitting memory limit [datafusion]
DerGut commented on code in PR #15692:
URL: https://github.com/apache/datafusion/pull/15692#discussion_r2040698187
##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -529,6 +523,12 @@ impl ExternalSorter {
/// Sorts the in-memory batches and merges them into a single sorted run,
then writes
/// the result to spill files.
async fn sort_and_spill_in_mem_batches(&mut self) -> Result<()> {
+if self.in_mem_batches.is_empty() {
Review Comment:
This was my first version to fix the internal error. Later I decided to move
the logic to the caller (`ExternalSorter::insert_batch`) because the name
`ExternalSorter::sort_and_spill_in_mem_batches` kind of implies that it only
makes sense to call when there are in-mem batches.
Now we handle this case in two places. I can remove the handling here unless
a more defensive approach is desired.
Re: [I] Decorrelate scalar subqueries with more complex filter expressions [datafusion]
duongcongtoai commented on issue #14554: URL: https://github.com/apache/datafusion/issues/14554#issuecomment-2798943345 I think we can break this story down into multiple steps:
1. Unify the optimizer for correlated queries, regardless of the query type (EXISTS query, scalar query, etc.).
2. Support a flexible decorrelation scheme (simple vs. general approach); we can achieve this by following the algorithm mentioned in the [2nd paper](https://15799.courses.cs.cmu.edu/spring2025/papers/11-unnesting/neumann-btw2025.pdf). As a prerequisite, this requires introducing an index algebra during the rewrite. Building this index requires a pre-traversal over the whole query to detect all non-trivial subqueries, and to answer whether simple unnesting is sufficient or whether the framework should continue with the general approach.
3. Implement general-purpose, recursion-aware subquery decorrelation for the major operators (projection, filter, group by) using the top-down algorithm mentioned in the 2nd paper.
4. Gradually support more complex expressions (group by, order, limit, window functions).
Re: [PR] Perf: Support automatically concat_batches for sort which will improve performance [datafusion]
Dandandan commented on PR #15380: URL: https://github.com/apache/datafusion/pull/15380#issuecomment-2798955035 Thanks for sharing the results @zhuqi-lucas, this is really interesting! I think it mainly shows that we should probably try to use more efficient in-memory sorting here (e.g. an arrow kernel that sorts multiple batches) rather than `SortPreservingMergeStream`, which is intended to be used on data streams. The arrow kernel would avoid the regressions of `concat`.
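The two strategies being contrasted can be sketched with plain Rust over sorted `Vec<i32>` "batches" (an illustrative stdlib sketch, not arrow or DataFusion code): a heap-based k-way merge models what `SortPreservingMergeStream` does, while concat-then-sort models a multi-batch sort kernel.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// K-way merge of individually sorted batches, one output value at a time
// (models a sort-preserving merge over streams).
fn merge_sorted(batches: &[Vec<i32>]) -> Vec<i32> {
    // min-heap of (next value, batch index, offset within batch)
    let mut heap = BinaryHeap::new();
    for (i, b) in batches.iter().enumerate() {
        if let Some(&v) = b.first() {
            heap.push(Reverse((v, i, 0usize)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((v, i, off))) = heap.pop() {
        out.push(v);
        if let Some(&next) = batches[i].get(off + 1) {
            heap.push(Reverse((next, i, off + 1)));
        }
    }
    out
}

// Concatenate everything and do a single sort (models a kernel that sorts
// multiple batches at once, avoiding per-row heap overhead).
fn concat_then_sort(batches: &[Vec<i32>]) -> Vec<i32> {
    let mut out: Vec<i32> = batches.iter().flatten().copied().collect();
    out.sort_unstable();
    out
}

fn main() {
    let batches = vec![vec![1, 4, 9], vec![2, 3, 8], vec![5, 6, 7]];
    assert_eq!(merge_sorted(&batches), (1..=9).collect::<Vec<_>>());
    assert_eq!(concat_then_sort(&batches), (1..=9).collect::<Vec<_>>());
}
```

Both produce the same output; the trade-off discussed in the thread is per-row merge overhead versus the memory cost of concatenation.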
Re: [PR] Add `CREATE FUNCTION` support for SQL Server [datafusion-sqlparser-rs]
iffyio commented on code in PR #1808:
URL:
https://github.com/apache/datafusion-sqlparser-rs/pull/1808#discussion_r2040586609
##
src/ast/mod.rs:
##
@@ -8368,6 +8387,22 @@ pub enum CreateFunctionBody {
///
/// [BigQuery]:
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#syntax_11
AsAfterOptions(Expr),
+/// Function body with statements before the `RETURN` keyword.
+///
+/// Example:
+/// ```sql
+/// CREATE FUNCTION my_scalar_udf(a INT, b INT)
+/// RETURNS INT
+/// AS
+/// BEGIN
+/// DECLARE c INT;
+/// SET c = a + b;
+/// RETURN c;
+/// END;
+/// ```
+///
+/// [SQL Server]:
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-function-transact-sql
+MultiStatement(Vec<Statement>),
Review Comment:
Ah yeah I think if possible it would be good to reuse the same
representation as `ConditionalStatements`. Maybe something like?
```rust
CreateFunctionBody::AsBeginEnd(BeginEndStatements)
ConditionalStatements::BeginEnd(BeginEndStatements)
```
Where we pull out the current representation into its own type, it'll also
get to implement display once as a result too
##
src/parser/mod.rs:
##
@@ -580,10 +580,12 @@ impl<'a> Parser<'a> {
// `BEGIN` is a nonstandard but common alias for the
// standard `START TRANSACTION` statement. It is supported
// by at least PostgreSQL and MySQL.
+// `BEGIN` is also used for multi-statement blocks in SQL
Server
Keyword::BEGIN => self.parse_begin(),
// `END` is a nonstandard but common alias for the
// standard `COMMIT TRANSACTION` statement. It is supported
// by PostgreSQL.
+// `END` is also used for multi-statement blocks in SQL Server
Review Comment:
```suggestion
Keyword::BEGIN => self.parse_begin(),
```
Thinking we can skip the comment entirely since they don't add much, it will
avoid the dilemma of being inconsistent in the comment by not mentioning all
supported dialects/use cases vs having to mention all of them
##
src/parser/mod.rs:
##
@@ -617,6 +619,9 @@ impl<'a> Parser<'a> {
}
// `COMMENT` is snowflake specific
https://docs.snowflake.com/en/sql-reference/sql/comment
Keyword::COMMENT if self.dialect.supports_comment_on() =>
self.parse_comment(),
+Keyword::RETURN if dialect_of!(self is MsSqlDialect) => {
Review Comment:
```suggestion
Keyword::RETURN => {
```
I think its reasonable to have the parser always parse the statement if it
shows up. e.g. snowflake and bigquery also support the statement so we'd get
support for them automatically as a result
##
src/parser/mod.rs:
##
@@ -5135,6 +5146,63 @@ impl<'a> Parser<'a> {
}))
}
+/// Parse `CREATE FUNCTION` for [SQL Server]
+///
+/// [SQL Server]:
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-function-transact-sql
+fn parse_mssql_create_function(
+&mut self,
+or_alter: bool,
+or_replace: bool,
+temporary: bool,
+) -> Result<Statement, ParserError> {
+let name = self.parse_object_name(false)?;
+
+let parse_function_param =
|parser: &mut Parser| -> Result<OperateFunctionArg, ParserError> {
+let name = parser.parse_identifier()?;
+let data_type = parser.parse_data_type()?;
+Ok(OperateFunctionArg {
+mode: None,
+name: Some(name),
+data_type,
+default_expr: None,
+})
+};
+self.expect_token(&Token::LParen)?;
+let args = self.parse_comma_separated0(parse_function_param,
Token::RParen)?;
+self.expect_token(&Token::RParen)?;
+
+let return_type = if self.parse_keyword(Keyword::RETURNS) {
+Some(self.parse_data_type()?)
+} else {
+None
+};
+
+self.expect_keyword_is(Keyword::AS)?;
+self.expect_keyword_is(Keyword::BEGIN)?;
+let function_body =
Some(CreateFunctionBody::MultiStatement(self.parse_statements()?));
Review Comment:
Ah if the desired behavior is to have optional semicolon delimited
statements for mssql, what we could do would be to make the behavior
configurable when [parsing the statements
list](https://github.com/apache/datafusion-sqlparser-rs/blob/2457d079daeb564782f8d408c19846fe45080caf/src/parser/mod.rs#L461).
We could do something like this?
```rust
pub fn parse_statements(&mut self) -> Result<Vec<Statement>, ParserError> {
    self.parse_statements_inner(true)
}

fn parse_statements_inner(&mut self, require_semicolon_delimiter: bool) -> Result<Vec<Statement>, ParserError> {
    // ... same as the body of `parse_statements()`
}
```
Re: [PR] POC: Cascaded spill merge and re-spill [datafusion]
2010YOUY01 commented on PR #15610:
URL: https://github.com/apache/datafusion/pull/15610#issuecomment-2798782989
I didn't implement the parallel merge optimization for now. My major concern
is that this optimization requires one extra configuration, and users would
have to learn and correctly set 2 configs for each individual query to enable
the most efficient cascaded spill merge execution (see the `intended
optimization` section below for what those 2 configs are), which is not ideal.
So I'd like to defer the implementation for a bit, to think about whether
there are any simpler approaches (or maybe collect stats internally and
auto-tune the related configs).
Also, I think the current implementation is good enough to cover common
cases (from a rough estimation, sorting the TPCH-SF1000 lineitem table with
16GB of memory only requires one round of re-spill).
Here is the optimization that I originally thought about; I'll put it into
a separate issue if it makes sense.
### Example scenario
For one partition's `SortExec`, 100 runs are spilled, and we set
`sort_max_spill_merge_degree` to 4
### Current Implementation
Each time it merges 4 existing spills into one combined spill file, until
there are <= 4 spills in total and the final result can be produced.
For each entry, the number of re-spills will be $\lfloor \log_4 100 \rfloor = 3$
### Intended optimization
If the memory pool is large enough to hold more buffers at a time (while
`sort_max_spill_merge_degree` stays limited to 4, in case a too-large
merge degree hurts performance in some cases), one additional config
`sort_buffer_batch_capacity` will be introduced. With it set to `16` in the
above example, the execution will look like:
Then, inside each merge step, 16 spill files will be combined and re-spilled.
Each entry only needs to be re-spilled $\lfloor \log_{16} 100 \rfloor = 1$ time.
With this approach, we can achieve an optimal re-spill count, and also
enable parallel merge.
Re: [PR] Cascaded spill merge and re-spill [datafusion]
qstommyshu commented on code in PR #15610:
URL: https://github.com/apache/datafusion/pull/15610#discussion_r2040791232
##
datafusion/core/tests/memory_limit/mod.rs:
##
@@ -615,6 +616,104 @@ async fn test_disk_spill_limit_not_reached() ->
Result<()> {
Ok(())
}
+// Test configuration `sort_max_spill_merge_degree` in external sorting
+// ---
+
+// Ensure invalid config value of `sort_max_spill_merge_degree` returns error
+#[rstest]
+#[case(0)]
+#[case(1)]
+#[tokio::test]
+async fn test_invalid_sort_max_spill_merge_degree(
+#[case] sort_max_spill_merge_degree: usize,
Review Comment:
this #[case] syntax looks so elegant for writing repetitive tests
Re: [PR] Cascaded spill merge and re-spill [datafusion]
qstommyshu commented on PR #15610: URL: https://github.com/apache/datafusion/pull/15610#issuecomment-2799447585 > ### Intended optimization > If the memory pool is enough to hold more batches at a time (while `spill_max_spill_merge_degree` is still limited to 4, in case the merge-degree is too large and hurt performance in some cases) One additional config `sort_buffer_batch_capacity` is introduced, and set to `16` in the above example, the execution will look like: ... Thanks for the clear explanation, that's a lot of great work, and it looks really cool!
Re: [PR] Cascaded spill merge and re-spill [datafusion]
qstommyshu commented on code in PR #15610: URL: https://github.com/apache/datafusion/pull/15610#discussion_r2040846556 ## docs/source/user-guide/configs.md: ## @@ -84,6 +84,7 @@ Environment variables are read during `SessionConfig` initialisation so they mus | datafusion.execution.skip_physical_aggregate_schema_check | false | When set to true, skips verifying that the schema produced by planning the input of `LogicalPlan::Aggregate` exactly matches the schema of the input plan. When set to false, if the schema does not match exactly (including nullability and metadata), a planning error will be raised. This is used to workaround bugs in the planner that are now caught by the new schema verification step. | | datafusion.execution.sort_spill_reservation_bytes | 10485760 | Specifies the reserved memory for each spillable sort operation to facilitate an in-memory merge. When a sort operation spills to disk, the in-memory data must be sorted and merged before being written to a file. This setting reserves a specific amount of memory for that in-memory sort/merge process. Note: This setting is irrelevant if the sort operation cannot spill (i.e., if there's no `DiskManager` configured). | | datafusion.execution.sort_in_place_threshold_bytes | 1048576 | When sorting, below what size should data be concatenated and sorted in a single RecordBatch rather than sorted in batches and merged. | +| datafusion.execution.sort_max_spill_merge_degree | 16 | When doing external sorting, the maximum number of spilled files to read back at once. Those read files in the same merge step will be sort-preserving-merged and re-spilled, and the step will be repeated to reduce the number of spilled files in multiple passes, until a final sorted run can be produced. | Review Comment: Great attention to detail in updating the user guide!
Re: [PR] Use pager and allow configuration via `\pset` [datafusion]
djellemah commented on PR #15597: URL: https://github.com/apache/datafusion/pull/15597#issuecomment-2799539091 Seems to me there's a confluence of several factors here: - testing this kind of functionality is not simple. - it's user facing, so if it breaks somebody will notice and say something. - it's not core functionality so the probability of erroneous behaviors causing severe consequences is fairly low. - there is limited maintainer bandwidth. - there is limited contributor bandwidth. In my opinion, the decision matrix here is not quite the same as one would apply to core functionality.
[PR] Minor: add order by arg for last value [datafusion]
jayzhan211 opened a new pull request, #15695: URL: https://github.com/apache/datafusion/pull/15695 ## Which issue does this PR close? - Closes #12376. ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested? ## Are there any user-facing changes?
Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]
jayzhan211 commented on issue #15072: URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2799545774 > My only remaining question is if we want to upgrade arrow in this release as well +1 for upgrading all the dependencies
Re: [PR] PoC (Perf: Support automatically concat_batches for sort which will improve performance) [datafusion]
Dandandan commented on PR #15380: URL: https://github.com/apache/datafusion/pull/15380#issuecomment-2798712702 I think this is already looking quite nice. What do you need to finalize this, @zhuqi-lucas?
Re: [PR] Add support for `GO` batch delimiter in SQL Server [datafusion-sqlparser-rs]
iffyio commented on code in PR #1809:
URL:
https://github.com/apache/datafusion-sqlparser-rs/pull/1809#discussion_r2040595052
##
src/parser/mod.rs:
##
@@ -617,6 +623,9 @@ impl<'a> Parser<'a> {
}
// `COMMENT` is snowflake specific
https://docs.snowflake.com/en/sql-reference/sql/comment
Keyword::COMMENT if self.dialect.supports_comment_on() =>
self.parse_comment(),
+Keyword::GO if dialect_of!(self is MsSqlDialect) => {
Review Comment:
```suggestion
Keyword::GO => {
```
I think it should be fine to let the parser always accept the statement if
it shows up. Technically if we wanted to gate it behind a feature then we could
use a `self.dialect.supports` flag since the `dialect_of` macro is deprecated,
but I think it makes more sense to let the parser accept unconditionally
##
src/parser/mod.rs:
##
@@ -475,6 +475,12 @@ impl<'a> Parser<'a> {
if expecting_statement_delimiter && word.keyword ==
Keyword::END {
break;
}
+// Treat batch delimiter as an end of statement
+if expecting_statement_delimiter && dialect_of!(self is
MsSqlDialect) {
+if let Some(Statement::Go { count: _ }) = stmts.last()
{
+expecting_statement_delimiter = false;
+}
+}
Review Comment:
not sure I followed this part, what was the intention?
##
src/parser/mod.rs:
##
@@ -15017,6 +15026,48 @@ impl<'a> Parser<'a> {
}
}
+fn parse_go(&mut self) -> Result<Statement, ParserError> {
Review Comment:
One thing we could do, in the `parse_stmt()` function could we call
`self.prev_token()` to rewind the `Go` keyword so that this function becomes
self contained in being able to parse a full `Go` statement?
##
src/parser/mod.rs:
##
@@ -15017,6 +15026,48 @@ impl<'a> Parser<'a> {
}
}
+fn parse_go(&mut self) -> Result<Statement, ParserError> {
+// previous token should be a newline (skipping non-newline whitespace)
Review Comment:
hmm I didn't look too closely but not sure I followed the function body on a
high level, from the docs and the representation in the parser the Go statement
only contains an optional int, so that I imagined all the function needs to do
would be to `maybe_parse(|parser| parser.parse_number_value())` or similar? The
function is currently managing whitespace and semicolon but not super clear to
me why that is required
##
src/ast/mod.rs:
##
@@ -4050,6 +4050,14 @@ pub enum Statement {
arguments: Vec,
options: Vec,
},
+/// Go (SQL Server)
+///
+/// GO is not a Transact-SQL statement; it is a command recognized by
various tools as a batch delimiter
+///
+/// See:
https://learn.microsoft.com/en-us/sql/t-sql/language-elements/sql-server-utilities-statements-go
+Go {
+count: Option<u64>,
+},
Review Comment:
Repr wise can we wrap the statement body in explicit structs? We're
[transitioning away](https://github.com/apache/datafusion-sqlparser-rs/issues/1204)
from anonymous structs in order to make the API more ergonomic. Something like
```rust
struct GoStatement {
count: Option<u64>
}
Statement::Go(GoStatement)
```
Re: [I] Perf: Support automatically concat_batches for sort which will improve performance [datafusion]
Dandandan commented on issue #15375: URL: https://github.com/apache/datafusion/issues/15375#issuecomment-2798726588 Really nice observation! I think we should drive this further. Some further observations from looking at the current implementation on master for the in-memory merging part: * the in-memory batches are not cleared after they are sorted * it needlessly uses `spawn_buffered` in the code for the streaming merge; sorting can be done on the spot here and batches dropped as soon as possible.
[PR] Fix internal error in sort when hitting memory limit [datafusion]
DerGut opened a new pull request, #15692: URL: https://github.com/apache/datafusion/pull/15692 ## Which issue does this PR close? Closes https://github.com/apache/datafusion/issues/15675 ## Rationale for this change I noticed an internal error in the sort implementation. This PR exposes it as a proper user-facing error instead. ## What changes are included in this PR? ## Are these changes tested? Yes. ## Are there any user-facing changes?
Re: [PR] PoC (Perf: Support automatically concat_batches for sort which will improve performance) [datafusion]
zhuqi-lucas commented on PR #15380: URL: https://github.com/apache/datafusion/pull/15380#issuecomment-2798857034 > I think this is already looking quite nice. What do you need to finalize this @zhuqi-lucas Thank you @Dandandan for the review, I think we just need to add the benchmark results for this PR as the next step. It's mergeable as a first version; later we can improve it according to the comments: https://github.com/apache/datafusion/issues/15375#issuecomment-2747654525
[PR] fix: unparse join without projection [datafusion]
chenkovsky opened a new pull request, #15693: URL: https://github.com/apache/datafusion/pull/15693 ## Which issue does this PR close? - Closes #15688. ## Rationale for this change There are two issues. 1. The projection in the table scan is dropped in `try_transform_to_simple_table_scan_with_filters`. 2. The projection in the select builder can only be set once. That's fine when there is a projection plan outside, but for the DataFrame API or an optimized logical plan it's not always true: for a join expression, we need to add the selected expressions from both the left plan and the right plan. ## What changes are included in this PR? 1. Add the projection in `try_transform_to_simple_table_scan_with_filters`. 2. Enable setting the projection in the select builder twice for a join logical plan. ## Are these changes tested? UT ## Are there any user-facing changes? No
Re: [PR] Perf: Support automatically concat_batches for sort which will improve performance [datafusion]
zhuqi-lucas commented on code in PR #15380:
URL: https://github.com/apache/datafusion/pull/15380#discussion_r2040673419
##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -645,7 +645,36 @@ impl ExternalSorter {
return self.sort_batch_stream(batch, metrics, reservation);
}
-let streams = std::mem::take(&mut self.in_mem_batches)
+let mut merged_batches = Vec::new();
+let mut current_batches = Vec::new();
+let mut current_size = 0;
+
+for batch in std::mem::take(&mut self.in_mem_batches) {
Review Comment:
> I think it would be nice to use `pop` (`while let Some(batch) = v.pop`)
here to remove the batch from the vec once sorted to reduce memory usage. Now
the batch is AFAIK retained until after the loop.
Thank you @Dandandan for the review and the good suggestion; addressed!
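The drain pattern from the suggestion above can be sketched as follows. This is a minimal stand-alone model, with `Vec<i32>` standing in for `RecordBatch`; the names are illustrative, not DataFusion's actual code.

```rust
// Popping each batch off the vector before processing lets its memory be
// freed as soon as its sorted output is produced, instead of keeping every
// input batch alive until after the loop.
fn merge_draining(mut in_mem_batches: Vec<Vec<i32>>) -> Vec<i32> {
    let mut merged = Vec::new();
    while let Some(mut batch) = in_mem_batches.pop() {
        batch.sort_unstable();
        merged.extend(batch); // `batch` is dropped here, at the end of the iteration
    }
    merged.sort_unstable(); // stand-in for the final merge of sorted runs
    merged
}
```

With the original `for batch in std::mem::take(&mut self.in_mem_batches)` form, the iterator owns the whole vector until the loop ends, so peak memory holds both the inputs and the merged output.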
Re: [I] Perf: Support automatically concat_batches for sort which will improve performance [datafusion]
zhuqi-lucas commented on issue #15375: URL: https://github.com/apache/datafusion/issues/15375#issuecomment-2798866385 Thank you @Dandandan, addressed your comments. We can ship this as the first version, and in the future we may improve it as described by @2010YOUY01: https://github.com/apache/datafusion/issues/15375#issuecomment-2747654525 I also think the current implementation is reasonable: `sort_in_place_threshold_bytes` is an already-used config, so reusing it to decide when to concat batches is safe.
Re: [PR] Perf: Support automatically concat_batches for sort which will improve performance [datafusion]
zhuqi-lucas commented on PR #15380: URL: https://github.com/apache/datafusion/pull/15380#issuecomment-2798868399 @alamb Do we have the CI benchmark running now? If not, I need your help to run it... Thanks a lot! As for sort-tpch itself, I ran it for the improvement results, but not the other benchmarks. Previous sort-tpch:

```text
Query | main       | concat_batches_for_sort | Change
Q1    | 2241.04ms  | 1816.69ms               | +1.23x faster
Q2    | 1841.01ms  | 1496.73ms               | +1.23x faster
Q3    | 12755.85ms | 12770.18ms              | no change
Q4    | 4433.49ms  | 3278.70ms               | +1.35x faster
Q5    | 4414.15ms  | 4409.04ms               | no change
Q6    | 4543.09ms  | 4597.32ms               | no change
Q7    | 8012.85ms  | 9026.30ms               | 1.13x slower
Q8    | 6572.37ms  | 6049.51ms               | +1.09x faster
Q9    | 6734.63ms  | 6345.69ms               | +1.06x faster
Q10   | 9896.16ms  | 9564.17ms               | no change

Benchmark Summary
Total Time (main)                      | 61444.64ms
Total Time (concat_batches_for_sort)   | 59354.33ms
Average Time (main)                    | 6144.46ms
Average Time (concat_batches_for_sort) | 5935.43ms
Queries Faster                         | 5
Queries Slower                         | 1
Queries with No Change                 | 4
```
Re: [PR] fix: Modify Spark SQL core 2 tests for `native_datafusion` reader, change 3.5.5 diff hash length to 11 [datafusion-comet]
andygrove merged PR #1641: URL: https://github.com/apache/datafusion-comet/pull/1641
Re: [I] Standardize Spark diff hash length [datafusion-comet]
andygrove closed issue #1640: Standardize Spark diff hash length URL: https://github.com/apache/datafusion-comet/issues/1640
[PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]
acking-you opened a new pull request, #15694: URL: https://github.com/apache/datafusion/pull/15694 ## Which issue does this PR close? - Closes #15636 ## Rationale for this change Many thanks to @kosiew for doing a tremendous amount of work. Based on his PR https://github.com/apache/datafusion/pull/15648, I have made the following changes: 1. Performance test cases have been added for "all false" in AND operations and "all true" in OR operations. 2. Reduced some unnecessary function calls: for instance, `false_count` and `true_count` can be unified as `values().count_set_bits()` in non-null cases, which eliminates two calls to `null_count` and one call to `true_count`. 3. Added heuristic pre-selection under the `and` operation. For more details on the principle, see: https://github.com/apache/datafusion/issues/15631#issuecomment-2788923437. ## What changes are included in this PR? The work done by kosiew: https://github.com/apache/datafusion/pull/15648 ## Are these changes tested? ## Are there any user-facing changes?
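The unification in point 2 can be modeled without dependencies as below. The real code operates on arrow's `BooleanArray`, where `values().count_set_bits()` scans the underlying `BooleanBuffer`; this sketch substitutes a plain slice, and all names are illustrative.

```rust
// With no nulls, one pass that counts set bits answers both short-circuit
// questions at once, replacing separate `true_count` / `false_count` /
// `null_count` calls.
fn short_circuit_state(values: &[bool], null_count: usize) -> Option<bool> {
    if null_count != 0 {
        return None; // nulls need the slower, null-aware path
    }
    let set_bits = values.iter().filter(|&&b| b).count();
    if set_bits == values.len() {
        Some(true) // all true: an OR over this mask short-circuits
    } else if set_bits == 0 {
        Some(false) // all false: an AND over this mask short-circuits
    } else {
        None // mixed: no short-circuit possible
    }
}
```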
Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]
acking-you commented on issue #15631: URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2798871646 > > [@acking-you](https://github.com/acking-you) the code needs to be extended to support nulls (you can take a look at the true_count implementation in arrow-rs to do this efficiently). > > I have an idea for an early exit losing performance, and I'm trying it out. If it works, I'll post the code @Dandandan You're absolutely right. At this point, we should no longer focus on optimizing the early return for the `true_count` calculation. Instead, we should consider extending support to new optimization scenarios, such as handling `nulls` and other cases. Here's why: ## My Attempt I experimented with optimizing the `true_count` calculation using SIMD instructions, performing 256-bit comparisons and computations while incorporating early returns. This significantly reduced the overhead of checking boolean arrays in non-short-circuit scenarios (with a performance boost of up to 6x in extreme cases). However, in short-circuit scenarios like `all false` or `all true`, it still performed about 20% slower than the existing `false_count`/`true_count` implementation. When I applied this optimization to the execution of `BinaryExpr` and benchmarked the entire `BinaryExpr`, I observed a performance regression. The reasons are as follows: 1. The computation of `true_count` for boolean arrays is only on the order of 60ns, and even with early-exit optimizations, it remains at the 10ns level (in extreme cases). 2. When `BinaryExpr` does not trigger short-circuiting, its execution time is on the order of 400us (depending on the complexity of the expression). Optimizing the `true_count` calculation for boolean arrays is negligible in this context. 3. When short-circuiting occurs, the time complexity of `BinaryExpr` is on par with the computation of the boolean array. In this case, calling `true_count` remains the optimal solution.
## Other Optimization Scenarios - Extending short-circuit optimizations to handle `nulls`. - The early execution of selection that I previously mentioned: [https://github.com/apache/datafusion/issues/15631#issuecomment-2788923437](https://github.com/apache/datafusion/issues/15631#issuecomment-2788923437). I realized that my earlier conclusions about the benefits of early selection execution were incorrect. After testing with the latest code changes (on a machine with an AMD 9950X CPU), the results are as follows: all scenarios that benefited from this optimization saw a reduction in execution time from `450us` to `170us`. This is a significant improvement, and it did not slow down any other cases!

```sh
short_circuit/and/one_true_first
        time:   [188.36 µs 188.96 µs 189.55 µs]
        change: [-58.870% -58.652% -58.420%] (p = 0.00 < 0.05)
        Performance has improved.
short_circuit/and/one_true_last
        time:   [170.10 µs 171.04 µs 171.97 µs]
        change: [-62.788% -62.520% -62.258%] (p = 0.00 < 0.05)
        Performance has improved.
short_circuit/and/one_true_middle
        time:   [165.38 µs 165.97 µs 166.57 µs]
        change: [-64.176% -63.968% -63.784%] (p = 0.00 < 0.05)
        Performance has improved.
short_circuit/and/one_true_middle_left
        time:   [165.02 µs 165.48 µs 165.96 µs]
        change: [-63.671% -63.458% -63.204%] (p = 0.00 < 0.05)
        Performance has improved.
short_circuit/and/one_true_middle_right
        time:   [169.09 µs 169.61 µs 170.13 µs]
        change: [-63.298% -63.089% -62.868%] (p = 0.00 < 0.05)
        Performance has improved.
short_circuit/and/one_true_middle_right #2
        time:   [166.43 µs 167.12 µs 167.85 µs]
        change: [-64.092% -63.916% -63.751%] (p = 0.00 < 0.05)
        Performance has improved.
```

I further optimized the work done by @kosiew. Relevant PR: https://github.com/apache/datafusion/pull/15694 (currently, there are still some issues with `cargo test`, and I'm working on fixing them).
Re: [PR] Flatten array in a single step instead of recursive [datafusion]
delamarch3 commented on PR #15160: URL: https://github.com/apache/datafusion/pull/15160#issuecomment-2798766441 Thanks for the reviews!
Re: [PR] Make Clickbench Q29 5x faster for datafusion [datafusion]
ctsk commented on PR #15532: URL: https://github.com/apache/datafusion/pull/15532#issuecomment-2798754505 Heads up: `SUM(DISTINCT (x + 5))` is **not** equivalent to `SUM(DISTINCT x) + 5 * COUNT(DISTINCT x)`
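One way to see why the rewrite is unsound (this example is mine, not from the PR): it assumes `x -> x + 5` never merges distinct values, which fails for `DOUBLE` columns once floating-point rounding kicks in.

```rust
// Near 2^52 the spacing between representable f64 values reaches 1.0, so
// two distinct inputs can round to the same value after adding 5. A rewrite
// that dedupes on x instead of (x + 5) then counts two distinct values
// where the original query sees only one.
fn collapses_under_shift(x1: f64, x2: f64, shift: f64) -> bool {
    x1 != x2 && x1 + shift == x2 + shift
}
```

For example, `4503599627370492.5` (2^52 - 3.5) and `4503599627370493.0` are distinct doubles, yet adding 5 lands both on the same representable value under round-to-nearest-even.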
Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]
alamb commented on issue #15072: URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2798878082 > > Do we want to hold DF 47 release for the arrow upgrade too? > > I think it is possible (arrow will hopefully be released at the end of this week -- and we could make the DF release candidate next week...) > > This would be great for the Comet project. My only remaining question is if we want to upgrade arrow in this release as well The upgrade PR is here: - https://github.com/apache/datafusion/pull/15466 Since it also upgrades object_store and pyo3 it is somewhat more disruptive.
Re: [PR] Perf: Support automatically concat_batches for sort which will improve performance [datafusion]
zhuqi-lucas commented on PR #15380: URL: https://github.com/apache/datafusion/pull/15380#issuecomment-2798879716 Latest result based on the current latest code:

```text
Benchmark sort_tpch1.json
Query | main     | concat_batches_for_sort | Change
Q1    | 153.49ms | 137.57ms                | +1.12x faster
Q2    | 131.29ms | 120.93ms                | +1.09x faster
Q3    | 980.57ms | 982.22ms                | no change
Q4    | 252.25ms | 245.09ms                | no change
Q5    | 464.81ms | 449.27ms                | no change
Q6    | 481.44ms | 455.45ms                | +1.06x faster
Q7    | 810.73ms | 709.74ms                | +1.14x faster
Q8    | 498.10ms | 491.12ms                | no change
Q9    | 503.80ms | 510.20ms                | no change
Q10   | 789.02ms | 706.45ms                | +1.12x faster
Q11   | 417.39ms | 411.50ms                | no change

Benchmark Summary
Total Time (main)                      | 5482.89ms
Total Time (concat_batches_for_sort)   | 5219.53ms
Average Time (main)                    | 498.44ms
Average Time (concat_batches_for_sort) | 474.50ms
Queries Faster                         | 5
Queries Slower                         | 0
Queries with No Change                 | 6

Benchmark sort_tpch10.json
Query | main       | concat_batches_for_sort | Change
Q1    | 2243.52ms  | 1825.64ms               | +1.23x faster
Q2    | 1842.11ms  | 1639.00ms               | +1.12x faster
Q3    | 12446.31ms | 11981.63ms              | no change
Q4    | 4047.55ms  | 3715.96ms               | +1.09x faster
Q5    | 4364.46ms  | 4503.51ms               | no change
Q6    | 4561.01ms  | 4688.31ms               | no change
Q7    | 8158.01ms  | 7915.54ms               | no change
Q8    | 6077.40ms  | 5524.08ms               | +1.10x faster
Q9    | 6347.21ms  | 5853.44ms               | +1.08x faster
Q10   | 11561.03ms | 14235.69ms              | 1.23x slower
Q11   | 6069.42ms  | 5666.77ms               | +1.07x faster

Benchmark Summary
Total Time (main)                      | 67718.04ms
Total Time (concat_batches_for_sort)   | 67549.58ms
Average Time (main)                    | 6156.19ms
Average Time (concat_batches_for_sort) | 6140.87ms
Queries Faster                         | 6
Queries Slower                         | 1
Queries with No Change                 | 4
```
Re: [I] Document `CREATE EXTERNAL TABLE ... OPTIONS` [datafusion]
alamb commented on issue #10451:
URL: https://github.com/apache/datafusion/issues/10451#issuecomment-2798883874
Hi @marvelshan -- I also took a look at the docs. It actually looks like
the format options are largely documented, but the documentation could be
improved
I suggest:
1. Rename
[`write_options.md`](https://datafusion.apache.org/user-guide/sql/write_options.html)
to `format_options.md` and make it clear they apply both to reading and
writing. Also, link `write_options` clearly in the `CREATE EXTERNAL TABLE`
section
2. Then for each sub heading for each format (`JSON format specific
options`, etc) add an example, such as
```sql
CREATE EXTERNAL TABLE t
STORED AS JSON
LOCATION '/tmp/foo.dat'
OPTIONS('COMPRESSION', 'bzip')
```
I also suggest raising the heading level of the JSON, CSV, etc. options so
they appear in the table of contents
Re: [PR] Fix internal error in sort when hitting memory limit [datafusion]
2010YOUY01 commented on code in PR #15692:
URL: https://github.com/apache/datafusion/pull/15692#discussion_r2040878250
##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -529,6 +523,12 @@ impl ExternalSorter {
/// Sorts the in-memory batches and merges them into a single sorted run,
then writes
/// the result to spill files.
async fn sort_and_spill_in_mem_batches(&mut self) -> Result<()> {
+if self.in_mem_batches.is_empty() {
Review Comment:
> Now we handle this case in two places. I can remove the handling here
unless a more defensive approach is desired.
I think it's better to be more defensive; I'd prefer to throw an internal
error saying that we must assume the buffer is not empty when calling this
function.
##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -1552,6 +1571,62 @@ mod tests {
Ok(())
}
+#[tokio::test]
+async fn test_batch_reservation_error() -> Result<()> {
Review Comment:
👍🏼
##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -765,6 +765,25 @@ impl ExternalSorter {
Ok(())
}
+
+/// Reserves memory to be able to accommodate the given batch.
+/// If memory is scarce, tries to spill current in-memory batches to disk
first.
+async fn reserve_memory_for_batch(&mut self, input: &RecordBatch) ->
Result<()> {
+let size = get_reserved_byte_for_record_batch(input);
+
+let result = self.reservation.try_grow(size);
+if result.is_err() {
+if self.in_mem_batches.is_empty() {
+return result;
Review Comment:
It makes sense that we should preserve the raw resource error. In this case,
`ContextError` comes in handy:
https://github.com/apache/datafusion/blob/0b01fdf7f02f9097c319204058576f420b9790b4/datafusion/common/src/error.rs#L132
It can add extra notes to the original error message, to make it more
understandable.
Here, the extra note can be like "When the memory limit is reached and there
are no buffered batches to spill, external sort cannot continue. Consider
setting a higher memory limit."
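A sketch of how that `Context` wrapping could look. This uses a minimal stand-in enum: the real `DataFusionError` in `datafusion_common` does have `Context(String, Box<Self>)` and `ResourcesExhausted(String)` variants, but every function name and signature below is illustrative only.

```rust
#[derive(Debug)]
enum DataFusionError {
    ResourcesExhausted(String),
    Context(String, Box<DataFusionError>),
}

// Stand-in for MemoryReservation::try_grow, which fails when the limit is hit.
fn try_grow(within_limit: bool) -> Result<(), DataFusionError> {
    if within_limit {
        Ok(())
    } else {
        Err(DataFusionError::ResourcesExhausted(
            "memory limit reached".to_string(),
        ))
    }
}

// If the reservation fails and there is nothing left to spill, wrap the raw
// resource error with an explanatory note instead of returning it bare.
fn reserve_memory_for_batch(in_mem_batches_empty: bool) -> Result<(), DataFusionError> {
    match try_grow(false) {
        Err(e) if in_mem_batches_empty => Err(DataFusionError::Context(
            "When the memory limit is reached and there are no buffered batches \
             to spill, external sort cannot continue. Consider setting a higher \
             memory limit."
                .to_string(),
            Box::new(e),
        )),
        other => other, // otherwise spill the in-memory batches and retry (elided)
    }
}
```

This keeps the original `ResourcesExhausted` error intact inside the wrapper, so callers that match on the underlying error class still work.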
##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -765,6 +765,25 @@ impl ExternalSorter {
Ok(())
}
+
+/// Reserves memory to be able to accommodate the given batch.
+/// If memory is scarce, tries to spill current in-memory batches to disk
first.
+async fn reserve_memory_for_batch(&mut self, input: &RecordBatch) ->
Result<()> {
Review Comment:
nit: I think a more obvious name for this function would be
`reserve_memory_for_batch_or_spill`
[PR] feat: Emit warning with Diagnostic when doing = Null [datafusion]
changsun20 opened a new pull request, #15696: URL: https://github.com/apache/datafusion/pull/15696 ## Which issue does this PR close? Closes #14434 ## Rationale for this change This PR addresses a common SQL anti-pattern where users accidentally use `= NULL` instead of `IS NULL`. While syntactically valid, this comparison always returns NULL in SQL and often indicates a developer mistake. The changes help users identify this pitfall through rich warnings while maintaining full query execution capabilities. ## What changes are included in this PR? 1. Added warning detection for `= NULL` comparisons in predicate contexts 2. Implemented span-based diagnostics highlighting the problematic expression 3. Enhanced SQL parser integration with upgraded sqlparser dependency (0.55+) 4. Warning collection plumbing using non-intrusive RefCell storage 5. Added help messages suggesting `IS NULL` alternative 6. Test coverage for single and multiple occurrences ## Are these changes tested? Yes, this PR includes: - Unit tests for single `= NULL` detection - Validation of multiple warnings in complex expressions - Span position verification - Help message content checks ## Are there any user-facing changes? Yes, but non-breaking. No API changes or behavior modifications - existing queries will still execute normally but may produce additional warnings.
Re: [PR] feat: Emit warning with Diagnostic when doing = Null [datafusion]
changsun20 commented on PR #15696: URL: https://github.com/apache/datafusion/pull/15696#issuecomment-2799781696 Hi @eliaperantoni, Thank you for your patience and guidance throughout this issue. I've implemented the core functionality per our discussions, but would like to confirm a few implementation details: 1. **Predicate Context Validation** The warning detection is integrated during `BinaryExpr` processing, which should naturally limit it to predicate contexts. Statements like `UPDATE users SET password = NULL` won't trigger false warnings by default. Could you confirm this approach is acceptable? 2. **Span Handling Strategy** In usual cases, the left operand is an `Identifier`. The current implementation combines the identifier's left span with NULL's right span for precise highlighting. For rare non-identifier cases (e.g., some complex expressions that I can't immediately come up with one right now), we fall back to using just the NULL span. This balances precision with robustness. 3. **Test Coverage Request** While I've added tests for basic and multiple `= NULL` cases, could you suggest any edge cases or additional scenarios that should be validated? I appreciate your expertise in reviewing these implementation choices. Please let me know if any adjustments are needed.
Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]
andygrove commented on issue #15072: URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2799561705 I am also +1 for upgrading the dependencies (for selfish reasons; we are waiting on an arrow feature to help with INT96 timestamps in Parquet)
