Re: [PR] Fix internal error in sort when hitting memory limit [datafusion]

2025-04-12 Thread via GitHub


DerGut commented on code in PR #15692:
URL: https://github.com/apache/datafusion/pull/15692#discussion_r2040700866


##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -765,6 +765,25 @@ impl ExternalSorter {
 
         Ok(())
     }
+
+    /// Reserves memory to be able to accommodate the given batch.
+    /// If memory is scarce, tries to spill current in-memory batches to disk first.
+    async fn reserve_memory_for_batch(&mut self, input: &RecordBatch) -> Result<()> {
+        let size = get_reserved_byte_for_record_batch(input);
+
+        let result = self.reservation.try_grow(size);
+        if result.is_err() {
+            if self.in_mem_batches.is_empty() {
+                return result;

Review Comment:
   @2010YOUY01 since you commented on the issue https://github.com/apache/datafusion/issues/15675#issuecomment-2796177594 with suggestions:
   
   > I think the fix (for a more user-friendly error message) would be
   >
   > 1. When sort_spill_reservation_byte reservation failed, the error message should include considering set the memory limit larger than this reservation memory size
   > 2. For the code path triggered this internal error, return an ExecutionError instead with a similar error message
   
   This code now emits a [`DataFusionError::ResourcesExhausted`](https://github.com/apache/datafusion/blob/0b01fdf7f02f9097c319204058576f420b9790b4/datafusion/common/src/error.rs#L126) instead of the suggested [`DataFusionError::Execution`](https://github.com/apache/datafusion/blob/0b01fdf7f02f9097c319204058576f420b9790b4/datafusion/common/src/error.rs#L117). I think this makes sense, given the descriptions of the errors.
   
   E.g. `Execution` doesn't seem to apply:
   > This error is returned when an error happens during execution due to a malformed input.
   
   but `ResourcesExhausted` does:
   > This error is thrown when a consumer cannot acquire additional memory
   
   ---
   
   I still like the idea of suggesting to bump the memory limit in the error message.
   
   However, since the code simply bubbles up the error emitted by the memory pool's [`insufficient_capacity_err`](https://github.com/apache/datafusion/blob/0b01fdf7f02f9097c319204058576f420b9790b4/datafusion/execution/src/memory_pool/pool.rs#L249) macro, we would either need to map it, or change the macro's message.
   
   I don't think mapping makes sense here because the error is generally used for reservation-related issues - I don't think this is a special case.
   Changing the error message would probably require a separate discussion.
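   For concreteness, here is a minimal sketch of what the "map it" option could look like. The error type below is a hypothetical stand-in (not DataFusion's actual `DataFusionError` enum), and the hint text is illustrative:

   ```rust
   use std::fmt;

   // Hypothetical stand-in for DataFusionError::ResourcesExhausted; the real
   // enum lives in datafusion_common::error.
   #[derive(Debug)]
   enum DfError {
       ResourcesExhausted(String),
   }

   impl fmt::Display for DfError {
       fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
           match self {
               DfError::ResourcesExhausted(msg) => write!(f, "Resources exhausted: {msg}"),
           }
       }
   }

   // The "mapping" option under discussion: bubble up the pool's error, but
   // append a hint about raising the memory limit at the call site.
   fn with_limit_hint(result: Result<(), DfError>) -> Result<(), DfError> {
       result.map_err(|e| match e {
           DfError::ResourcesExhausted(msg) => DfError::ResourcesExhausted(format!(
               "{msg}. Consider increasing the configured memory limit"
           )),
       })
   }

   fn main() {
       let err = with_limit_hint(Err(DfError::ResourcesExhausted(
           "Failed to allocate additional 1024 bytes".into(),
       )))
       .unwrap_err();
       println!("{err}");
   }
   ```

   The downside, as noted above, is that every reservation call site would need this wrapping, which is why changing the macro's message may be the cleaner route.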



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]

2025-04-12 Thread via GitHub


timsaucer commented on issue #15072:
URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2798792530

   I spoke too soon - I'm getting one error in our unit tests on `last_value`. 
I'm trying to investigate this morning.





Re: [PR] POC: Cascaded spill merge and re-spill [datafusion]

2025-04-12 Thread via GitHub


2010YOUY01 commented on PR #15610:
URL: https://github.com/apache/datafusion/pull/15610#issuecomment-2798787828

   Benchmark results:
   (I think there is no significant regression for an extra round of re-spill, if it's running on a machine with fast SSDs)
   
   ### Environment
   MacBook Pro with m4-pro chip (disk bandwidth is around 8000MB/s)
   ### Sorting 'thin' table
   1. Run datafusion-cli with `cargo run --profile release-nonlto -- --mem-pool-type fair -m 100M`
   2. Execute `explain analyze select * from generate_series(1, 10) as t1(v1) order by v1;`
   
   Main: 37s (merge ~170 spill files at once)
   PR (with `sort_max_spill_merge_degree = 16`, and there is one round of re-spill): 43s
   PR (with `sort_max_spill_merge_degree = 10`, two rounds of re-spill): 49s
   ### Sorting 'fat' table
   Run `sort_tpch` benchmark q7
   ```
   // Q7: 3 sort keys {(INTEGER, 7), (BIGINT, 10k), (BIGINT, 1.5M)} + 12 all other columns
   r#"
   SELECT l_linenumber, l_suppkey, l_orderkey, 
  l_partkey, l_quantity, l_extendedprice, l_discount, l_tax,
  l_returnflag, l_linestatus, l_shipdate, l_commitdate,
  l_receiptdate, l_shipinstruct, l_shipmode
   FROM lineitem
   ORDER BY l_linenumber, l_suppkey, l_orderkey
   "#,
   ```
   Benchmark command
   ```sh
   cargo run --profile release-nonlto --bin dfbench -- sort-tpch -p /Users/yongting/Code/datafusion/benchmarks/data/tpch_sf10 -q 7 --memory-limit 1.2G
   ```
   Notes:
   - `target_partitions` config set to 14; later configurations and results depend on this setting.
   - For the PR's benchmark runs, `sort_max_spill_merge_degree` is manually changed to 6; as a result:
 - under a 1.2G memory limit, 1 round of re-spill is triggered
 - under a 500M memory limit, 2 rounds of re-spill happen
   
   ### Result
   Main (1.2G): 
   ```
   Q7 iteration 0 took 9374.7 ms and returned 59986052 rows
   Q7 iteration 1 took 8117.6 ms and returned 59986052 rows
   Q7 iteration 2 took 8549.1 ms and returned 59986052 rows
   Q7 avg time: 8680.47 ms
   ```
   
   Main (500M):
   ```
   Fail with OOM
   ```
   
   PR (1.2G):
   ```
   Q7 iteration 0 took 10723.6 ms and returned 59986052 rows
   Q7 iteration 1 took 12962.8 ms and returned 59986052 rows
   Q7 iteration 2 took 11739.7 ms and returned 59986052 rows
   Q7 avg time: 11808.71 ms
   ```
   
   PR (500M):
   ```
   Q7 iteration 0 took 16233.1 ms and returned 59986052 rows
   Q7 iteration 1 took 18568.4 ms and returned 59986052 rows
   Q7 iteration 2 took 19173.4 ms and returned 59986052 rows
   Q7 avg time: 17991.67 ms
   ```





Re: [PR] POC: Cascaded spill merge and re-spill [datafusion]

2025-04-12 Thread via GitHub


2010YOUY01 commented on PR #15610:
URL: https://github.com/apache/datafusion/pull/15610#issuecomment-2798789296

   I have made the following updates:
   - Addressed review comments
   - Introduced a new configuration option for the max merge degree
   - Added tests
   
   It's ready for another look.





Re: [PR] fix: normalize window ident [datafusion]

2025-04-12 Thread via GitHub


alamb merged PR #15639:
URL: https://github.com/apache/datafusion/pull/15639





Re: [I] Support window definitions with aliases [datafusion]

2025-04-12 Thread via GitHub


alamb closed issue #15605: Support window definitions with aliases 
URL: https://github.com/apache/datafusion/issues/15605





Re: [PR] Enhance: simplify `x=x` --> `x IS NOT NULL OR NULL` [datafusion]

2025-04-12 Thread via GitHub


alamb commented on PR #15589:
URL: https://github.com/apache/datafusion/pull/15589#issuecomment-2798807271

   Thanks again @ding-young πŸ™ 





Re: [I] Trivial WHERE filter not eliminated when combined with CTE [datafusion]

2025-04-12 Thread via GitHub


alamb closed issue #15387: Trivial WHERE filter not eliminated when combined 
with CTE
URL: https://github.com/apache/datafusion/issues/15387





Re: [PR] Enhance: simplify `x=x` --> `x IS NOT NULL OR NULL` [datafusion]

2025-04-12 Thread via GitHub


alamb merged PR #15589:
URL: https://github.com/apache/datafusion/pull/15589





Re: [I] Reduce number of tokio blocking threads in SortExec spill [datafusion]

2025-04-12 Thread via GitHub


alamb closed issue #15323: Reduce number of tokio blocking threads in SortExec 
spill
URL: https://github.com/apache/datafusion/issues/15323








Re: [PR] fix: normalize window ident [datafusion]

2025-04-12 Thread via GitHub


alamb commented on PR #15639:
URL: https://github.com/apache/datafusion/pull/15639#issuecomment-2798807532

   Thanks @chenkovsky and @comphead 





Re: [I] Allow parsing byte literals as FixedSizeBinary [datafusion]

2025-04-12 Thread via GitHub


alamb commented on issue #15686:
URL: https://github.com/apache/datafusion/issues/15686#issuecomment-2798808874

   > Coercing the LHS to Binary is less performant. Unconditionally parsing literals as FixedSizeBytes would be a breaking change.
   
   What about coercing the RHS to FixedSizeBinary?
   
   So like
   
   ```sql
   where byte_column = x'deadbeef'
   ```
   
   Becomes
   
   ```sql
   where byte_column = arrow_cast(x'deadbeef', FixedSizeBinary)
   ```
   
   This would have the benefit of:
   1. Easier to use (no config flag)
   2. Could handle non-exactly-sized values, like casting `0xdead` to `0xdead` for comparison with 4-byte integers, for example





Re: [PR] add config parse_hex_as_fixed_size_binary [datafusion]

2025-04-12 Thread via GitHub


alamb commented on PR #15687:
URL: https://github.com/apache/datafusion/pull/15687#issuecomment-2798809026

   I have an alternate suggestion here: https://github.com/apache/datafusion/issues/15686#issuecomment-2798808874





Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]

2025-04-12 Thread via GitHub


timsaucer commented on issue #15072:
URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2798806510

   I did a little investigating, but I don't have time for a couple of days to dive in deeper. This appears to be related to https://github.com/apache/datafusion/pull/15542. @UBarney, do you know why the sql unit tests needed to be changed to pass? It seems like we have a regression in `datafusion-python` related to this change.





Re: [PR] Add Table Functions to FFI Crate [datafusion]

2025-04-12 Thread via GitHub


timsaucer merged PR #15581:
URL: https://github.com/apache/datafusion/pull/15581





Re: [PR] Remove waits from blocking threads reading spill files. [datafusion]

2025-04-12 Thread via GitHub


alamb merged PR #15654:
URL: https://github.com/apache/datafusion/pull/15654





Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]

2025-04-12 Thread via GitHub


alamb commented on issue #15072:
URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2798810667

   > I did a little investigating, but I don't have time for a couple of days 
to dive in deeper. This appears to be related to 
[#15542](https://github.com/apache/datafusion/pull/15542) 
[@UBarney](https://github.com/UBarney) do you know why the sql unit tests 
needed to be changed to pass? It seems like we have a regression in 
`datafusion-python` related to this change.
   
   I think @andygrove filed a ticket for this one
   - https://github.com/apache/datafusion/issues/15676
   
   I didn't fully follow the discussion -- but it seems like that issue has 
been closed





Re: [PR] Add Table Functions to FFI Crate [datafusion]

2025-04-12 Thread via GitHub


alamb commented on PR #15581:
URL: https://github.com/apache/datafusion/pull/15581#issuecomment-2798809598

   Very exciting -- thanks @timsaucer 





Re: [PR] Remove waits from blocking threads reading spill files. [datafusion]

2025-04-12 Thread via GitHub


alamb commented on PR #15654:
URL: https://github.com/apache/datafusion/pull/15654#issuecomment-2798807598

   πŸš€ 





Re: [PR] chore(deps): bump sysinfo from 0.33.1 to 0.34.2 [datafusion]

2025-04-12 Thread via GitHub


alamb merged PR #15682:
URL: https://github.com/apache/datafusion/pull/15682





[I] first_value and last_value should have identical signatures [datafusion]

2025-04-12 Thread via GitHub


timsaucer opened a new issue, #12376:
URL: https://github.com/apache/datafusion/issues/12376

   ### Describe the bug
   
   Less of a bug per se, but it would be nice to have identical function 
signatures between first_value and last_value
   
   ### To Reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_





Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]

2025-04-12 Thread via GitHub


acking-you commented on issue #15631:
URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2798742956

   > @acking-you the code needs to be extended to support nulls (you can take a 
look at the true_count implementation in arrow-rs to do this efficiently).
   
   I have an idea for avoiding the performance loss from the early exit, and I'm trying it out. If it works, I'll post the code.
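   For reference, the null-aware counting idea can be sketched in plain Rust. This is an illustration of the bit-twiddling approach (AND the values bitmap with the validity bitmap, then popcount each word), not arrow-rs's actual `true_count` code:

   ```rust
   // Count `true` values in a packed boolean buffer, treating null slots as
   // not-true: AND the value bits with the validity bits, then popcount.
   // `values`/`validity` are 64-bit words holding `len` bits, LSB-first.
   fn true_count(values: &[u64], validity: Option<&[u64]>, len: usize) -> usize {
       let full_words = len / 64;
       let tail_bits = len % 64;
       let mut count = 0usize;
       for i in 0..full_words {
           let mut word = values[i];
           if let Some(v) = validity {
               word &= v[i]; // drop bits whose slot is null
           }
           count += word.count_ones() as usize;
       }
       if tail_bits > 0 {
           // Mask off bits past `len` in the final partial word.
           let mask = (1u64 << tail_bits) - 1;
           let mut word = values[full_words] & mask;
           if let Some(v) = validity {
               word &= v[full_words];
           }
           count += word.count_ones() as usize;
       }
       count
   }

   fn main() {
       // 8 values packed into one word: 0b1011_0110 has 5 true bits.
       let values = [0b1011_0110u64];
       // Validity clears slots 1 and 2, leaving 3 valid true bits.
       let validity = [0b1111_1001u64];
       assert_eq!(true_count(&values, None, 8), 5);
       assert_eq!(true_count(&values, Some(&validity), 8), 3);
       println!("ok");
   }
   ```

   Whether this beats a branchy early exit presumably depends on selectivity, which is what I want to measure.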





Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]

2025-04-12 Thread via GitHub


timsaucer commented on issue #15072:
URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2798821873

   > > I did a little investigating, but I don't have time for a couple of days 
to dive in deeper. This appears to be related to 
[#15542](https://github.com/apache/datafusion/pull/15542) 
[@UBarney](https://github.com/UBarney) do you know why the sql unit tests 
needed to be changed to pass? It seems like we have a regression in 
`datafusion-python` related to this change.
   > 
   > I think [@andygrove](https://github.com/andygrove) filed a ticket for this 
one
   > 
   > * [Regression in `last_value` functionality 
#15676](https://github.com/apache/datafusion/issues/15676)
   > 
   > 
   > I didn't fully follow the discussion -- but it seems like that issue has 
been closed
   
   I've read the discussion now and I think I'm in agreement that it's not an 
actual regression since the aggregation has no deterministic outcome without 
ordering assigned.





Re: [I] executor can't read s3 config in push-staged mode. [datafusion-ballista]

2025-04-12 Thread via GitHub


milenkovicm closed issue #1235: executor can't read s3 config in push-staged 
mode.
URL: https://github.com/apache/datafusion-ballista/issues/1235





Re: [PR] fix: executor can't read s3 config in push-staged mode [datafusion-ballista]

2025-04-12 Thread via GitHub


milenkovicm merged PR #1236:
URL: https://github.com/apache/datafusion-ballista/pull/1236





Re: [PR] Add support for `INHERITS` option in `CREATE TABLE` statement [datafusion-sqlparser-rs]

2025-04-12 Thread via GitHub


LucaCappelletti94 commented on code in PR #1806:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1806#discussion_r2040654851


##
src/parser/mod.rs:
##
@@ -7070,13 +7071,22 @@ impl<'a> Parser<'a> {
         }
     }
 
-    /// Parse configuration like partitioning, clustering information during the table creation.
+    /// Parse configuration like inheritance, partitioning, clustering information during the table creation.
     ///
     /// [BigQuery](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#syntax_2)
-    /// [PostgreSQL](https://www.postgresql.org/docs/current/ddl-partitioning.html)
+    /// [PostgreSQL Partitioning](https://www.postgresql.org/docs/current/ddl-partitioning.html)
+    /// [PostgreSQL Inheritance](https://www.postgresql.org/docs/current/ddl-inherit.html)
     fn parse_optional_create_table_config(
         &mut self,
     ) -> Result {
+        let inherits = if dialect_of!(self is BigQueryDialect | PostgreSqlDialect | GenericDialect)

Review Comment:
   The dialect thing must have been a copy paste from the line below and then 
my brain decided to drop that pointer of information - fixed as described!






Re: [PR] Add support for `INHERITS` option in `CREATE TABLE` statement [datafusion-sqlparser-rs]

2025-04-12 Thread via GitHub


LucaCappelletti94 commented on code in PR #1806:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1806#discussion_r2040655023


##
tests/sqlparser_postgres.rs:
##
@@ -2733,6 +2733,39 @@ fn parse_create_brin() {
 }
 }
 
+#[test]
+fn parse_create_table_with_inherits() {
+    let single_inheritance_sql =
+        "CREATE TABLE child_table (child_column INT) INHERITS (public.parent_table)";
+    match pg().verified_stmt(single_inheritance_sql) {
+        Statement::CreateTable(CreateTable {
+            inherits: Some(inherits),
+            ..
+        }) => {
+            assert_eq_vec(&["public", "parent_table"], &inherits[0].0);
+        }
+        _ => unreachable!(),
+    }
+
+    let double_inheritance_sql = "CREATE TABLE child_table (child_column INT) INHERITS (public.parent_table, pg_catalog.pg_settings)";
+    match pg().verified_stmt(double_inheritance_sql) {
+        Statement::CreateTable(CreateTable {
+            inherits: Some(inherits),
+            ..
+        }) => {
+            assert_eq_vec(&["public", "parent_table"], &inherits[0].0);
+            assert_eq_vec(&["pg_catalog", "pg_settings"], &inherits[1].0);
+        }
+        _ => unreachable!(),
+    }
+}
+
+#[test]
+#[should_panic]
+fn parse_create_table_with_empty_inherits_fails() {
+    pg().verified_stmt("CREATE TABLE child_table (child_column INT) INHERITS ()");

Review Comment:
   Refactored test as described!






[PR] chore: Prepare for datafusion 47.0.0 + arrow-rs 55.0.0 [datafusion-comet]

2025-04-12 Thread via GitHub


andygrove opened a new pull request, #1642:
URL: https://github.com/apache/datafusion-comet/pull/1642

   ## Which issue does this PR close?
   
   
   
   Closes #.
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## How are these changes tested?
   
   
   





Re: [PR] Set DataFusion runtime configurations through SQL interface [datafusion]

2025-04-12 Thread via GitHub


kumarlokesh commented on code in PR #15594:
URL: https://github.com/apache/datafusion/pull/15594#discussion_r2040707194


##
datafusion/core/tests/sql/runtime_config.rs:
##
@@ -0,0 +1,73 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Tests for runtime configuration SQL interface
+
+use datafusion::prelude::*;
+use datafusion_common::Result;
+
+#[tokio::test]
+async fn test_set_memory_limit() -> Result<()> {
+    let ctx = SessionContext::new();
+
+    // Set memory limit to 100MB using SQL - note the quotes around the value
+    ctx.sql("SET datafusion.runtime.memory_limit = '100M'")

Review Comment:
   > Running this test can't ensure this config is set, because it can run 
without setting the memory limit and use the default `UnboundedMemoryPool`, 
instead I think it should assert the spill count for a query that has spilled 
some intermediate data.
   > 
   > ```
   > SET datafusion.runtime.memory_limit = '1M'
   > set datafusion.execution.sort_spill_reservation_bytes = 0;
   > 
   > select * from generate_series(1, 10) as t1(v1) order by v1;
   > -- And assert spill-count from the query is > 0
   > ```
   > 
   > You can check this
   > 
   > 
https://github.com/apache/datafusion/blob/4ae0edea8eb7d5ad5df07e22a8fb39aa9fd1e2df/datafusion/core/tests/memory_limit/mod.rs#L525
   > 
   > for how to assert the spill file count.
   > Later after all configurations are added, I think we should make this test 
case stronger by setting more runtime configs, and do some property test to 
ensure all of them are properly set.
   > 
   > BTW, I tried the above test locally, and the `generate_series` UDTF seems 
not registered in the PR branch, but it works in the main branch.
   
   @2010YOUY01 @berkaysynnada addressed above in https://github.com/apache/datafusion/pull/15594/commits/6501f218e2c3f7c97316794b2137e7b3672b77a9.





Re: [I] Perf: Support automatically concat_batches for sort which will improve performance [datafusion]

2025-04-12 Thread via GitHub


Dandandan commented on issue #15375:
URL: https://github.com/apache/datafusion/issues/15375#issuecomment-2798960792

   @2010YOUY01 that sounds like a very promising future direction. I might try experimenting with this soon if no one beats me to it.





[PR] build(deps): bump mimalloc from 0.1.43 to 0.1.46 [datafusion-python]

2025-04-12 Thread via GitHub


dependabot[bot] opened a new pull request, #1105:
URL: https://github.com/apache/datafusion-python/pull/1105

   Bumps [mimalloc](https://github.com/purpleprotocol/mimalloc_rust) from 0.1.43 to 0.1.46.
   
   Release notes
   Sourced from [mimalloc's releases](https://github.com/purpleprotocol/mimalloc_rust/releases).
   
   Version 0.1.46
   Changes
   - Fixed musl builds.
   
   Version 0.1.45
   Changes
   - Mimalloc v2.2.3
   
   Version 0.1.44
   Changes
   - Mimalloc v2.2.2
   
   Commits
   - [59d7ee3](https://github.com/purpleprotocol/mimalloc_rust/commit/59d7ee3ba7aa030b46a7cf721720f63a579b7fdb) v0.1.46
   - [585ef6a](https://github.com/purpleprotocol/mimalloc_rust/commit/585ef6acf1c97f44e18b06cc445bd689e42ef739) Merge pull request [#134](https://redirect.github.com/purpleprotocol/mimalloc_rust/issues/134) from purpleprotocol/patch_alpine
   - [65cd789](https://github.com/purpleprotocol/mimalloc_rust/commit/65cd7894e7ae1ce035584d669f58b9b4a423782e) Patch Alpine
   - [b6ce368](https://github.com/purpleprotocol/mimalloc_rust/commit/b6ce368277c1862cdae040bebbe37840de4a0aea) v0.1.45
   - [60e1a4b](https://github.com/purpleprotocol/mimalloc_rust/commit/60e1a4bc6bd2895fff76063b2594546e40cf9b48) Update to mimalloc v2.2.3
   - [bbf6130](https://github.com/purpleprotocol/mimalloc_rust/commit/bbf61305ad2d9d1c4b417bd277e36caec31d21b7) v0.1.44
   - [3e7e321](https://github.com/purpleprotocol/mimalloc_rust/commit/3e7e3214ea3397ffc922baaa42933e5091482073) Patch windows
   - [f37c56a](https://github.com/purpleprotocol/mimalloc_rust/commit/f37c56ad7c3ed13904afd272e59f3ca80f7b02b8) Clippy
   - [c00538a](https://github.com/purpleprotocol/mimalloc_rust/commit/c00538aa0b553e7eecab40b114fedb56f414ddc7) Update to mimalloc v2.2.2
   - [cc1c72a](https://github.com/purpleprotocol/mimalloc_rust/commit/cc1c72a62bf3edefe5abf842b60092f088740860) Merge pull request [#126](https://redirect.github.com/purpleprotocol/mimalloc_rust/issues/126) from richerfu/master
   - Additional commits viewable in [compare view](https://github.com/purpleprotocol/mimalloc_rust/compare/v0.1.43...v0.1.46)
   
   [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=mimalloc&package-manager=cargo&previous-version=0.1.43&new-version=0.1.46)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot show <dependency name> ignore conditions` will show all of 
the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   
   





Re: [PR] build(deps): bump mimalloc from 0.1.43 to 0.1.45 [datafusion-python]

2025-04-12 Thread via GitHub


dependabot[bot] closed pull request #1095: build(deps): bump mimalloc from 
0.1.43 to 0.1.45
URL: https://github.com/apache/datafusion-python/pull/1095





Re: [PR] build(deps): bump mimalloc from 0.1.43 to 0.1.45 [datafusion-python]

2025-04-12 Thread via GitHub


dependabot[bot] commented on PR #1095:
URL: 
https://github.com/apache/datafusion-python/pull/1095#issuecomment-2798993546

   Superseded by #1105.





Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-12 Thread via GitHub


acking-you commented on PR #15694:
URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2799010340

   The error in `cargo test` is caused by an incorrect calculation of the 
pre-selection. The correct steps for calculating the pre-selection are as 
follows:
   
   1. Compute the boolean array on the left-hand side.
   2. Filter to obtain a new record batch based on the left-hand boolean array.
   3. Compute the boolean array on the right-hand side using the new record 
batch (note that this boolean array will have incorrect positions since it 
corresponds to the new record batch).
   4. Combine the left-hand and right-hand boolean arrays to produce the 
correct boolean array (modify the positions in the left-hand array marked as 
`true` based on the values from the right-hand array).
   
   To illustrate this, I've drawn a diagram (it's quite rough since I used a 
mouse, so I hope you don't mind 😂):
   
   
![image](https://github.com/user-attachments/assets/83197db1-fed1-47da-90e4-07acb4d7de69)
   
   
   The current code only handles up to the second step, which is why the error 
occurs. If you use `evaluate_selection`, the issue persists because its 
internal call to `scatter` (which corresponds to completing the fourth step) 
fills in the missing parts of the right-hand boolean array with null values, 
resulting in an incorrect outcome.
   
   I'm currently working on implementing the fourth step to fix the issue, but 
it may take some time.
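   The combine step (4) can be sketched with plain vectors. Note that 
`combine_selections` is a hypothetical helper (not DataFusion's API) that writes 
the right-hand results back into the positions the left-hand filter marked 
`true`:
   
   ```rust
   /// Hypothetical helper (not DataFusion API): merge the right-hand result
   /// back into the positions the left-hand filter marked `true`.
   /// `right` has exactly one entry per `true` in `left`.
   fn combine_selections(left: &[bool], right: &[bool]) -> Vec<bool> {
       let mut right_iter = right.iter();
       left.iter()
           .map(|&l| {
               if l {
                   *right_iter.next().expect("one right entry per true in left")
               } else {
                   false // rows already filtered out stay false
               }
           })
           .collect()
   }
   
   fn main() {
       let left = vec![true, false, true, true, false]; // step 1: left-hand result
       let right = vec![false, true, true]; // step 3: computed on the filtered batch
       // step 4: positions line up with the original batch again
       assert_eq!(
           combine_selections(&left, &right),
           vec![false, false, true, true, false]
       );
   }
   ```
   
   In DataFusion itself this corresponds to a `scatter`-like operation, except 
that filtered-out positions must become `false` rather than null.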





Re: [I] Allow parsing byte literals as FixedSizeBinary [datafusion]

2025-04-12 Thread via GitHub


leoyvens commented on issue #15686:
URL: https://github.com/apache/datafusion/issues/15686#issuecomment-2799013192

   @alamb thank you for looking at this. Avoiding a config flag would be nice. 
But I'm skeptical of the proposed coercion.
   
   If we coerce `binary` to `fixed(N)` when encountering any expression of the 
form `fixed(N) = binary`, this would be a fallible coercion, while the 
infallible coercion of `fixed(N)` to `binary` would allow the expression to be 
evaluated for all values. DataFusion has already chosen this more lenient 
coercion for lists. The following:
   ```
   select arrow_cast([1], 'FixedSizeList(1, Int64)') = [1, 2];
   ```
   Evaluates today to `false` instead of giving a coercion error.
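   The asymmetry between the two coercion directions can be illustrated outside 
of DataFusion with plain Rust types (a sketch under assumed semantics, not 
DataFusion's coercion code):
   
   ```rust
   // Widening a fixed-size value to a byte slice always succeeds, so
   // `fixed(N) = binary` can be evaluated for every value (possibly as false).
   fn eq_fixed_vs_binary(fixed: &[u8; 4], binary: &[u8]) -> bool {
       fixed.as_slice() == binary
   }
   
   // The opposite direction is fallible: it errors whenever lengths differ.
   fn binary_to_fixed(binary: &[u8]) -> Result<[u8; 4], String> {
       binary
           .try_into()
           .map_err(|_| format!("cannot coerce {} bytes to fixed(4)", binary.len()))
   }
   
   fn main() {
       assert!(eq_fixed_vs_binary(&[1, 2, 3, 4], &[1, 2, 3, 4]));
       // Lenient direction: a length mismatch just compares as unequal...
       assert!(!eq_fixed_vs_binary(&[1, 2, 3, 4], &[1, 2]));
       // ...while the fallible coercion would have to return an error instead.
       assert!(binary_to_fixed(&[1, 2]).is_err());
   }
   ```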





Re: [PR] Add support for `INHERITS` option in `CREATE TABLE` statement [datafusion-sqlparser-rs]

2025-04-12 Thread via GitHub


iffyio merged PR #1806:
URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1806





Re: [PR] Fix internal error in sort when hitting memory limit [datafusion]

2025-04-12 Thread via GitHub


DerGut commented on code in PR #15692:
URL: https://github.com/apache/datafusion/pull/15692#discussion_r2040698187


##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -529,6 +523,12 @@ impl ExternalSorter {
 /// Sorts the in-memory batches and merges them into a single sorted run, 
then writes
 /// the result to spill files.
 async fn sort_and_spill_in_mem_batches(&mut self) -> Result<()> {
+if self.in_mem_batches.is_empty() {

Review Comment:
   This was my first version to fix the internal error. Later I decided to move 
the logic to the caller (`ExternalSorter::insert_batch`) because the name 
`ExternalSorter::sort_and_spill_in_mem_batches` kind of implies that it only 
makes sense to call when there are in-mem batches.
   
   Now we handle this case in two places. I can remove the handling here unless 
a more defensive approach is desired.
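   The guard being discussed can be sketched with a simplified stand-in type 
(names taken from the diff; this is not the real `ExternalSorter`):
   
   ```rust
   // Simplified stand-in for the sorter, to illustrate the early return:
   // if there are no in-memory batches, sorting/spilling is a no-op.
   struct Sorter {
       in_mem_batches: Vec<Vec<i64>>,
       spill_count: usize,
   }
   
   impl Sorter {
       fn sort_and_spill_in_mem_batches(&mut self) -> Result<(), String> {
           if self.in_mem_batches.is_empty() {
               // Guard: nothing to sort or spill, so return early instead of
               // tripping an internal error further down.
               return Ok(());
           }
           for batch in &mut self.in_mem_batches {
               batch.sort(); // stand-in for the real sort + merge
           }
           self.in_mem_batches.clear(); // "spilled" to disk
           self.spill_count += 1;
           Ok(())
       }
   }
   
   fn main() {
       let mut s = Sorter { in_mem_batches: vec![], spill_count: 0 };
       assert!(s.sort_and_spill_in_mem_batches().is_ok());
       assert_eq!(s.spill_count, 0); // nothing was spilled
   }
   ```
   
   Whether the guard lives here or only in the caller is exactly the trade-off 
described above.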






Re: [I] Decorrelate scalar subqueries with more complex filter expressions [datafusion]

2025-04-12 Thread via GitHub


duongcongtoai commented on issue #14554:
URL: https://github.com/apache/datafusion/issues/14554#issuecomment-2798943345

   I think we can break this story down into multiple steps:
   1. unify the optimizer for correlated queries, regardless of the query type 
(exists query, scalar query, etc.)
   2. support a flexible decorrelation scheme (simple vs. general approach); we 
can achieve this by following the algorithm mentioned in the [2nd 
paper](https://15799.courses.cs.cmu.edu/spring2025/papers/11-unnesting/neumann-btw2025.pdf).
 A prerequisite is to introduce an index algebra during the rewrite. Building 
this index requires a pre-traversal over the whole query to detect all 
non-trivial subqueries and answer whether simple unnesting is sufficient, or 
the framework should continue with the general approach
   3. Implement general-purpose, recursion-aware subquery decorrelation for the 
major operators (projection, filter, group by) using the top-down algorithm 
mentioned in the 2nd paper
   4. Gradually support more complex expressions (group by, order, limit, 
window functions)





Re: [PR] Perf: Support automatically concat_batches for sort which will improve performance [datafusion]

2025-04-12 Thread via GitHub


Dandandan commented on PR #15380:
URL: https://github.com/apache/datafusion/pull/15380#issuecomment-2798955035

   Thanks for sharing the results @zhuqi-lucas this is really interesting!
   
   I think it mainly shows that we should probably try to use more efficient 
in-memory sorting here (e.g. an arrow kernel that sorts multiple batches) 
rather than `SortPreservingMergeStream`, which is intended to be used on 
data streams.
   The arrow kernel would avoid the regressions of `concat`.
   
   





Re: [PR] Add `CREATE FUNCTION` support for SQL Server [datafusion-sqlparser-rs]

2025-04-12 Thread via GitHub


iffyio commented on code in PR #1808:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1808#discussion_r2040586609


##
src/ast/mod.rs:
##
@@ -8368,6 +8387,22 @@ pub enum CreateFunctionBody {
 ///
 /// [BigQuery]: 
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#syntax_11
 AsAfterOptions(Expr),
+/// Function body with statements before the `RETURN` keyword.
+///
+/// Example:
+/// ```sql
+/// CREATE FUNCTION my_scalar_udf(a INT, b INT)
+/// RETURNS INT
+/// AS
+/// BEGIN
+/// DECLARE c INT;
+/// SET c = a + b;
+/// RETURN c;
+/// END;
+/// ```
+///
+/// [SQL Server]: 
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-function-transact-sql
+MultiStatement(Vec<Statement>),

Review Comment:
   Ah yeah, I think if possible it would be good to reuse the same 
representation as `ConditionalStatements`. Maybe something like?
   ```rust
   CreateFunctionBody::AsBeginEnd(BeginEndStatements)
   ConditionalStatements::BeginEnd(BeginEndStatements)
   
   ```
   Where we pull out the current representation into its own type, it'll also 
only need to implement `Display` once as a result.



##
src/parser/mod.rs:
##
@@ -580,10 +580,12 @@ impl<'a> Parser<'a> {
 // `BEGIN` is a nonstandard but common alias for the
 // standard `START TRANSACTION` statement. It is supported
 // by at least PostgreSQL and MySQL.
+// `BEGIN` is also used for multi-statement blocks in SQL 
Server
 Keyword::BEGIN => self.parse_begin(),
 // `END` is a nonstandard but common alias for the
 // standard `COMMIT TRANSACTION` statement. It is supported
 // by PostgreSQL.
+// `END` is also used for multi-statement blocks in SQL Server

Review Comment:
   ```suggestion
   Keyword::BEGIN => self.parse_begin(),
   ```
   Thinking we can skip the comments entirely since they don't add much; it 
avoids the dilemma of either being inconsistent by not mentioning all 
supported dialects/use cases or having to mention all of them.



##
src/parser/mod.rs:
##
@@ -617,6 +619,9 @@ impl<'a> Parser<'a> {
 }
 // `COMMENT` is snowflake specific 
https://docs.snowflake.com/en/sql-reference/sql/comment
 Keyword::COMMENT if self.dialect.supports_comment_on() => 
self.parse_comment(),
+Keyword::RETURN if dialect_of!(self is MsSqlDialect) => {

Review Comment:
   ```suggestion
   Keyword::RETURN => {
   ```
   I think it's reasonable to have the parser always parse the statement if it 
shows up, e.g. Snowflake and BigQuery also support the statement so we'd get 
support for them automatically as a result.



##
src/parser/mod.rs:
##
@@ -5135,6 +5146,63 @@ impl<'a> Parser<'a> {
 }))
 }
 
+/// Parse `CREATE FUNCTION` for [SQL Server]
+///
+/// [SQL Server]: 
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-function-transact-sql
+fn parse_mssql_create_function(
+&mut self,
+or_alter: bool,
+or_replace: bool,
+temporary: bool,
) -> Result<Statement, ParserError> {
+let name = self.parse_object_name(false)?;
+
+let parse_function_param =
|parser: &mut Parser| -> Result<OperateFunctionArg, ParserError> {
+let name = parser.parse_identifier()?;
+let data_type = parser.parse_data_type()?;
+Ok(OperateFunctionArg {
+mode: None,
+name: Some(name),
+data_type,
+default_expr: None,
+})
+};
+self.expect_token(&Token::LParen)?;
+let args = self.parse_comma_separated0(parse_function_param, 
Token::RParen)?;
+self.expect_token(&Token::RParen)?;
+
+let return_type = if self.parse_keyword(Keyword::RETURNS) {
+Some(self.parse_data_type()?)
+} else {
+None
+};
+
+self.expect_keyword_is(Keyword::AS)?;
+self.expect_keyword_is(Keyword::BEGIN)?;
+let function_body = 
Some(CreateFunctionBody::MultiStatement(self.parse_statements()?));

Review Comment:
   Ah, if the desired behavior is to have optional semicolon-delimited 
statements for mssql, what we could do is make the behavior configurable when 
[parsing the statements 
list](https://github.com/apache/datafusion-sqlparser-rs/blob/2457d079daeb564782f8d408c19846fe45080caf/src/parser/mod.rs#L461).
 We could do something like this?
   
   ```rust
   pub fn parse_statements(&mut self) -> Result<Vec<Statement>, ParserError> {
   self.parse_statements_inner(true)
   }
   
   fn parse_statements_inner(&mut self, require_semicolon_delimiter: bool) -> 
Result<Vec<Statement>, ParserError> {
   // ... same as the body of `parse_statements()`

Re: [PR] POC: Cascaded spill merge and re-spill [datafusion]

2025-04-12 Thread via GitHub


2010YOUY01 commented on PR #15610:
URL: https://github.com/apache/datafusion/pull/15610#issuecomment-2798782989

   I didn't implement the parallel merge optimization for now; my major concern 
is that it requires one extra configuration, and users would have to learn and 
correctly set 2 configs for each individual query to enable the most efficient 
cascaded spill merge execution (see the `intended optimization` section below 
for what those 2 configs are), which is not ideal.
   So I'd like to defer the implementation for a bit, to think about whether 
there are simpler approaches (or maybe collect stats internally and auto-tune 
the related configs).
   Also, I think the current implementation is good enough to cover common 
cases (by a rough estimation, sorting the TPCH-SF1000 lineitem table with 16GB 
of memory only requires one round of re-spill).
   
   Here is the optimization I originally had in mind; I'll put it into a 
separate issue if that makes sense.
   ### Example scenario
   For one partition's `SortExec`, 100 runs are spilled, and we set 
`spill_max_spill_merge_degree` to 4
   ### Current Implementation
   
![image](https://github.com/user-attachments/assets/2d4bba36-2b9c-4fc1-8eac-780900dc1fb8)
   Each time it merges 4 existing spills into one combined spill file, until 
there are <= 4 spills in total and the final result can be produced.
   For each entry, the number of re-spills will be $floor(\log_4 100)$ = 3
   
   ### Intended optimization
   If the memory pool is large enough to hold more buffers at a time (while 
`spill_max_spill_merge_degree` is still limited to 4, in case a too-large 
merge degree hurts performance in some cases), one additional config will be 
introduced, `sort_buffer_batch_capacity`. With it set to `16` in the above 
example, the execution will look like:
   
![image](https://github.com/user-attachments/assets/5ed8e884-4f45-4abd-b27c-bea6fce8f46d)
   Then, inside each merge step, 16 spill files will be combined and 
re-spilled. Each entry only needs to be re-spilled $floor(\log_{16} 100)$ = 1 
time.
   With this approach, we can achieve an optimal re-spill count and also 
enable parallel merge.
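   The pass counts in both variants can be checked with a small sketch 
(`respill_passes` is an illustrative helper, not part of the PR):
   
   ```rust
   // Number of merge passes needed to reduce `runs` spill files to at most
   // `merge_degree`, merging `merge_degree` files per group in each pass.
   fn respill_passes(mut runs: u32, merge_degree: u32) -> u32 {
       assert!(merge_degree >= 2, "merge degree must be at least 2");
       let mut passes = 0;
       while runs > merge_degree {
           // ceil(runs / merge_degree) combined files remain after this pass
           runs = (runs + merge_degree - 1) / merge_degree;
           passes += 1;
       }
       passes
   }
   
   fn main() {
       // 100 runs, degree 4: 3 passes, matching floor(log4 100) = 3
       assert_eq!(respill_passes(100, 4), 3);
       // 100 runs, 16 buffers per merge step: 1 pass, matching floor(log16 100) = 1
       assert_eq!(respill_passes(100, 16), 1);
   }
   ```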
   
   





Re: [PR] Cascaded spill merge and re-spill [datafusion]

2025-04-12 Thread via GitHub


qstommyshu commented on code in PR #15610:
URL: https://github.com/apache/datafusion/pull/15610#discussion_r2040791232


##
datafusion/core/tests/memory_limit/mod.rs:
##
@@ -615,6 +616,104 @@ async fn test_disk_spill_limit_not_reached() -> 
Result<()> {
 Ok(())
 }
 
+// Test configuration `sort_max_spill_merge_degree` in external sorting
+// ---
+
+// Ensure invalid config value of `sort_max_spill_merge_degree` returns error
+#[rstest]
+#[case(0)]
+#[case(1)]
+#[tokio::test]
+async fn test_invalid_sort_max_spill_merge_degree(
+#[case] sort_max_spill_merge_degree: usize,

Review Comment:
   This `#[case]` syntax looks so elegant for writing repetitive tests.






Re: [PR] Cascaded spill merge and re-spill [datafusion]

2025-04-12 Thread via GitHub


qstommyshu commented on PR #15610:
URL: https://github.com/apache/datafusion/pull/15610#issuecomment-2799447585

   > ### Intended optimization
   > If the memory pool is enough to hold more batches at a time (while 
`spill_max_spill_merge_degree` is still limited to 4, in case the merge-degree 
is too large and hurt performance in some cases) One additional config 
`sort_buffer_batch_capacity` is introduced, and set to `16` in the above 
example, the execution will look like: ...
   
   Thanks for the clear explanation; that's a lot of great work, and it looks 
really cool!





Re: [PR] Cascaded spill merge and re-spill [datafusion]

2025-04-12 Thread via GitHub


qstommyshu commented on code in PR #15610:
URL: https://github.com/apache/datafusion/pull/15610#discussion_r2040846556


##
docs/source/user-guide/configs.md:
##
@@ -84,6 +84,7 @@ Environment variables are read during `SessionConfig` initialisation so they must
 | datafusion.execution.skip_physical_aggregate_schema_check | false | When set to true, skips verifying that the schema produced by planning the input of `LogicalPlan::Aggregate` exactly matches the schema of the input plan. When set to false, if the schema does not match exactly (including nullability and metadata), a planning error will be raised. This is used to workaround bugs in the planner that are now caught by the new schema verification step. |
 | datafusion.execution.sort_spill_reservation_bytes | 10485760 | Specifies the reserved memory for each spillable sort operation to facilitate an in-memory merge. When a sort operation spills to disk, the in-memory data must be sorted and merged before being written to a file. This setting reserves a specific amount of memory for that in-memory sort/merge process. Note: This setting is irrelevant if the sort operation cannot spill (i.e., if there's no `DiskManager` configured). |
 | datafusion.execution.sort_in_place_threshold_bytes | 1048576 | When sorting, below what size should data be concatenated and sorted in a single RecordBatch rather than sorted in batches and merged. |
+| datafusion.execution.sort_max_spill_merge_degree | 16 | When doing external sorting, the maximum number of spilled files to read back at once. Those read files in the same merge step will be sort-preserving-merged and re-spilled, and the step will be repeated to reduce the number of spilled files in multiple passes, until a final sorted run can be produced. |

Review Comment:
   Great attention to detail, for updating the user guide!






Re: [PR] Use pager and allow configuration via `\pset` [datafusion]

2025-04-12 Thread via GitHub


djellemah commented on PR #15597:
URL: https://github.com/apache/datafusion/pull/15597#issuecomment-2799539091

   Seems to me there's a confluence of several factors here:
   
   - testing this kind of functionality is not simple.
   
   - it's user facing, so if it breaks somebody will notice and say something.
   
   - it's not core functionality so the probability of erroneous behaviors 
causing severe consequences is fairly low.
   
   - there is limited maintainer bandwidth.
   
   - there is limited contributor bandwidth.
   
   In my opinion, the decision matrix here is not quite the same as one would 
apply to core functionality.





[PR] Minor: add order by arg for last value [datafusion]

2025-04-12 Thread via GitHub


jayzhan211 opened a new pull request, #15695:
URL: https://github.com/apache/datafusion/pull/15695

   ## Which issue does this PR close?
   
   
   
   - Closes #12376.
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   





Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]

2025-04-12 Thread via GitHub


jayzhan211 commented on issue #15072:
URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2799545774

   > My only remaining question is if we want to upgrade arrow in this release 
as well
   
   +1 for upgrading all the dependencies





Re: [PR] PoC (Perf: Support automatically concat_batches for sort which will improve performance) [datafusion]

2025-04-12 Thread via GitHub


Dandandan commented on PR #15380:
URL: https://github.com/apache/datafusion/pull/15380#issuecomment-2798712702

   I think this is already looking quite nice. What do you need to finalize 
this, @zhuqi-lucas?





Re: [PR] Add support for `GO` batch delimiter in SQL Server [datafusion-sqlparser-rs]

2025-04-12 Thread via GitHub


iffyio commented on code in PR #1809:
URL: 
https://github.com/apache/datafusion-sqlparser-rs/pull/1809#discussion_r2040595052


##
src/parser/mod.rs:
##
@@ -617,6 +623,9 @@ impl<'a> Parser<'a> {
 }
 // `COMMENT` is snowflake specific 
https://docs.snowflake.com/en/sql-reference/sql/comment
 Keyword::COMMENT if self.dialect.supports_comment_on() => 
self.parse_comment(),
+Keyword::GO if dialect_of!(self is MsSqlDialect) => {

Review Comment:
   ```suggestion
   Keyword::GO => {
   ```
   I think it should be fine to let the parser always accept the statement if 
it shows up. Technically, if we wanted to gate it behind a feature, we could 
use a `self.dialect.supports` flag since the `dialect_of` macro is deprecated, 
but I think it makes more sense to let the parser accept it unconditionally.



##
src/parser/mod.rs:
##
@@ -475,6 +475,12 @@ impl<'a> Parser<'a> {
 if expecting_statement_delimiter && word.keyword == 
Keyword::END {
 break;
 }
+// Treat batch delimiter as an end of statement
+if expecting_statement_delimiter && dialect_of!(self is 
MsSqlDialect) {
+if let Some(Statement::Go { count: _ }) = stmts.last() 
{
+expecting_statement_delimiter = false;
+}
+}

Review Comment:
   not sure I followed this part, what was the intention?



##
src/parser/mod.rs:
##
@@ -15017,6 +15026,48 @@ impl<'a> Parser<'a> {
 }
 }
 
+fn parse_go(&mut self) -> Result<Statement, ParserError> {

Review Comment:
   One thing we could do: in the `parse_stmt()` function, could we call 
`self.prev_token()` to rewind the `Go` keyword so that this function becomes 
self-contained in parsing a full `Go` statement?



##
src/parser/mod.rs:
##
@@ -15017,6 +15026,48 @@ impl<'a> Parser<'a> {
 }
 }
 
+fn parse_go(&mut self) -> Result<Statement, ParserError> {
+// previous token should be a newline (skipping non-newline whitespace)

Review Comment:
   Hmm, I didn't look too closely, but I'm not sure I followed the function 
body at a high level. From the docs and the representation in the parser, the 
Go statement only contains an optional int, so I imagined all the function 
needs to do is `maybe_parse(|parser| parser.parse_number_value())` or similar? 
The function currently manages whitespace and semicolons, but it's not super 
clear to me why that is required.



##
src/ast/mod.rs:
##
@@ -4050,6 +4050,14 @@ pub enum Statement {
 arguments: Vec,
 options: Vec,
 },
+/// Go (SQL Server)
+///
+/// GO is not a Transact-SQL statement; it is a command recognized by 
various tools as a batch delimiter
+///
+/// See: 
https://learn.microsoft.com/en-us/sql/t-sql/language-elements/sql-server-utilities-statements-go
+Go {
+count: Option<u64>,
+},

Review Comment:
   Repr-wise, can we wrap the statement body in explicit structs? We're 
[transitioning 
away](https://github.com/apache/datafusion-sqlparser-rs/issues/1204) from 
anonymous structs in order to make the API more ergonomic. Something like
   ```rust
   struct GoStatement {
   count: Option
   }
   Statement::Go(GoStatement)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Perf: Support automatically concat_batches for sort which will improve performance [datafusion]

2025-04-12 Thread via GitHub


Dandandan commented on issue #15375:
URL: https://github.com/apache/datafusion/issues/15375#issuecomment-2798726588

   Really nice observation! I think we should drive this further.
   
   Some further observations I saw when looking at the current implementation 
on master for the in memory merging part:
   
   * the in memory batches are not cleared after batches are sorted
   * it is needlessly using `spawn_buffered` in the code for streaming merge; 
sorting can be done on the spot here and batches dropped as soon as possible.
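   The second observation can be sketched like this: a toy `Batch` stands in 
for arrow's `RecordBatch`, and the point is that draining with `pop` lets each 
batch be dropped (and its memory released) as soon as it has been sorted, 
instead of keeping the whole vec alive until after the loop.

   ```rust
// Toy sketch of pop-based draining; `Batch` is a stand-in for RecordBatch.
struct Batch(Vec<i32>);

fn sort_all(mut in_mem_batches: Vec<Batch>) -> Vec<Vec<i32>> {
    let mut sorted_runs = Vec::new();
    // `pop` removes the batch from the vec before sorting it, so its buffer
    // can be freed the moment this iteration ends.
    while let Some(mut batch) = in_mem_batches.pop() {
        batch.0.sort_unstable();
        sorted_runs.push(batch.0);
    } // each `batch` is dropped here; `in_mem_batches` ends up empty

    sorted_runs
}

fn main() {
    let runs = sort_all(vec![Batch(vec![3, 1, 2]), Batch(vec![9, 7])]);
    // `pop` takes batches from the back, so the last batch is sorted first.
    assert_eq!(runs, vec![vec![7, 9], vec![1, 2, 3]]);
    println!("sorted runs: {runs:?}");
}
   ```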





[PR] Fix internal error in sort when hitting memory limit [datafusion]

2025-04-12 Thread via GitHub


DerGut opened a new pull request, #15692:
URL: https://github.com/apache/datafusion/pull/15692

   ## Which issue does this PR close?
   
   Closes https://github.com/apache/datafusion/issues/15675
   
   ## Rationale for this change
   
   I noticed an internal error in the sort implementation. This PR is exposing 
it as a proper user-facing error instead.
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   Yes.
   
   ## Are there any user-facing changes?
   
   
   
   
   





Re: [PR] PoC (Perf: Support automatically concat_batches for sort which will improve performance) [datafusion]

2025-04-12 Thread via GitHub


zhuqi-lucas commented on PR #15380:
URL: https://github.com/apache/datafusion/pull/15380#issuecomment-2798857034

   > I think this is already looking quite nice. What do you need to finalize 
this @zhuqi-lucas
   
   Thank you @Dandandan for the review. I think we just need to add the 
benchmark results for this PR as the next step.
   
   It's mergeable as a first version; later we can improve it according to the 
comments in:
   
   https://github.com/apache/datafusion/issues/15375#issuecomment-2747654525
   





[PR] fix: unparse join without projection [datafusion]

2025-04-12 Thread via GitHub


chenkovsky opened a new pull request, #15693:
URL: https://github.com/apache/datafusion/pull/15693

   ## Which issue does this PR close?
   
   - Closes #15688.
   
   ## Rationale for this change
   
   There are two issues:
   
   1. The projection in the table scan is dropped in 
`try_transform_to_simple_table_scan_with_filters`.
   2. The projection in `SelectBuilder` can only be set once. That is fine when 
there is a projection plan outside, but for the DataFrame API or an optimized 
logical plan this is not always true. For a join expression, we need to add the 
selected expressions from both the left and the right plan.
   
   ## What changes are included in this PR?
   
   1. Add the projection in `try_transform_to_simple_table_scan_with_filters`.
   2. Enable setting the projection in `SelectBuilder` twice for join logical 
plans.
   
   ## Are these changes tested?
   
   UT
   
   ## Are there any user-facing changes?
   
   No





Re: [PR] Perf: Support automatically concat_batches for sort which will improve performance [datafusion]

2025-04-12 Thread via GitHub


zhuqi-lucas commented on code in PR #15380:
URL: https://github.com/apache/datafusion/pull/15380#discussion_r2040673419


##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -645,7 +645,36 @@ impl ExternalSorter {
 return self.sort_batch_stream(batch, metrics, reservation);
 }
 
-let streams = std::mem::take(&mut self.in_mem_batches)
+let mut merged_batches = Vec::new();
+let mut current_batches = Vec::new();
+let mut current_size = 0;
+
+for batch in std::mem::take(&mut self.in_mem_batches) {

Review Comment:
   > I think it would be nice to use `pop` (`while let Some(batch) = v.pop`) 
here to remove the batch from the vec once sorted to reduce memory usage. Now 
the batch is AFAIK retained until after the loop.
   
   Thank you @Dandandan for the review and the good suggestion; addressed!






Re: [I] Perf: Support automatically concat_batches for sort which will improve performance [datafusion]

2025-04-12 Thread via GitHub


zhuqi-lucas commented on issue #15375:
URL: https://github.com/apache/datafusion/issues/15375#issuecomment-2798866385

   Thank you @Dandandan, I addressed your comments, and we can make this the 
first version. In the future we may improve it as described by @2010YOUY01:
   https://github.com/apache/datafusion/issues/15375#issuecomment-2747654525
   
   
   I also think the current implementation is reasonable because 
`sort_in_place_threshold_bytes` is an already-used config; we can reuse it to 
concat batches first, and it's safe.





Re: [PR] Perf: Support automatically concat_batches for sort which will improve performance [datafusion]

2025-04-12 Thread via GitHub


zhuqi-lucas commented on PR #15380:
URL: https://github.com/apache/datafusion/pull/15380#issuecomment-2798868399

   @alamb Do we have the CI benchmark running now? If not, I need your help to 
run it... Thanks a lot!
   
   I ran sort-tpch itself to measure the improvement, but not the other 
benchmarks.
   
   Previous sort-tpch:
   ```
   ┏━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
   ┃ Query ┃       main ┃ concat_batches_for_sort ┃        Change ┃
   ┡━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
   │ Q1    │  2241.04ms │               1816.69ms │ +1.23x faster │
   │ Q2    │  1841.01ms │               1496.73ms │ +1.23x faster │
   │ Q3    │ 12755.85ms │              12770.18ms │     no change │
   │ Q4    │  4433.49ms │               3278.70ms │ +1.35x faster │
   │ Q5    │  4414.15ms │               4409.04ms │     no change │
   │ Q6    │  4543.09ms │               4597.32ms │     no change │
   │ Q7    │  8012.85ms │               9026.30ms │  1.13x slower │
   │ Q8    │  6572.37ms │               6049.51ms │ +1.09x faster │
   │ Q9    │  6734.63ms │               6345.69ms │ +1.06x faster │
   │ Q10   │  9896.16ms │               9564.17ms │     no change │
   └───────┴────────────┴─────────────────────────┴───────────────┘
   ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
   ┃ Benchmark Summary                       ┃            ┃
   ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
   │ Total Time (main)                       │ 61444.64ms │
   │ Total Time (concat_batches_for_sort)    │ 59354.33ms │
   │ Average Time (main)                     │  6144.46ms │
   │ Average Time (concat_batches_for_sort)  │  5935.43ms │
   │ Queries Faster                          │          5 │
   │ Queries Slower                          │          1 │
   │ Queries with No Change                  │          4 │
   └─────────────────────────────────────────┴────────────┘
   ```
   





Re: [PR] fix: Modify Spark SQL core 2 tests for `native_datafusion` reader, change 3.5.5 diff hash length to 11 [datafusion-comet]

2025-04-12 Thread via GitHub


andygrove merged PR #1641:
URL: https://github.com/apache/datafusion-comet/pull/1641





Re: [I] Standardize Spark diff hash length [datafusion-comet]

2025-04-12 Thread via GitHub


andygrove closed issue #1640: Standardize Spark diff hash length
URL: https://github.com/apache/datafusion-comet/issues/1640





[PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-12 Thread via GitHub


acking-you opened a new pull request, #15694:
URL: https://github.com/apache/datafusion/pull/15694

   ## Which issue does this PR close?
   
   
   
   - Closes #15636
   
   ## Rationale for this change
   
   
   
   Many thanks to @kosiew for doing a tremendous amount of work. Based on his 
PR https://github.com/apache/datafusion/pull/15648, I have made the following 
changes:
   1. Performance test cases have been added for "all false" in AND operations 
and "all true" in OR operations.
   2. Reduce some unnecessary function calls: for instance, `false_count` and 
`true_count` can be unified as `values().count_set_bits()` in non-null cases, 
which would eliminate two calls to `nulls_count` and one call to `true_count`.
   3. Add heuristic pre-selection under the `and` operation. For more details 
on the principle, you can refer to: 
https://github.com/apache/datafusion/issues/15631#issuecomment-2788923437.
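   The unification in point 2 can be sketched over plain `u64` words, which 
stand in for arrow's packed boolean values buffer (`count_set_bits` here is a 
hand-rolled illustration of the idea, not the arrow-rs API): with no nulls, 
`true_count` is just the number of set bits, and `false_count` is 
`len - true_count`, so one popcount pass replaces separate counting calls.

   ```rust
// Popcount over packed boolean words, masking off padding bits past `len`.
fn count_set_bits(words: &[u64], len: usize) -> usize {
    let mut count: usize = words.iter().map(|w| w.count_ones() as usize).sum();
    let tail = len % 64;
    if tail != 0 {
        if let Some(last) = words.last() {
            // Subtract any set bits in the unused tail of the last word.
            count -= (last & !((1u64 << tail) - 1)).count_ones() as usize;
        }
    }
    count
}

fn main() {
    // 10 logical bits stored in one word: bits 0, 2, 8, 9 are set.
    let words = [0b11_0000_0101u64];
    let true_count = count_set_bits(&words, 10);
    assert_eq!(true_count, 4);
    // In the non-null case, false_count falls out for free.
    assert_eq!(10 - true_count, 6);
    println!("true_count = {true_count}");
}
   ```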
   
   ## What changes are included in this PR?
   
   
   
   
   
   The work done by kosiew: https://github.com/apache/datafusion/pull/15648
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   





Re: [I] Improve the performance of early exit evaluation in binary_expr [datafusion]

2025-04-12 Thread via GitHub


acking-you commented on issue #15631:
URL: https://github.com/apache/datafusion/issues/15631#issuecomment-2798871646

   > > [@acking-you](https://github.com/acking-you) the code needs to be 
extended to support nulls (you can take a look at the true_count implementation 
in arrow-rs to do this efficiently).
   > 
   > I have an idea for an early exit losing performance, and I'm trying it 
out. If it works, I'll post the code
   
   @Dandandan You're absolutely right. At this point, we should no longer focus 
on optimizing the early return for the `true_count` calculation. Instead, we 
should consider extending support to new optimization scenarios, such as 
handling `nulls` and other cases. Here’s why:
   
   ## My Attempt
   I experimented with optimizing the `true_count` calculation using SIMD 
instructions, performing 256-bit comparisons and computations while 
incorporating early returns. This significantly reduced the overhead of 
checking boolean arrays in non-short-circuit scenarios (with a performance 
boost of up to 6x in extreme cases). However, in short-circuit scenarios like 
`all false` or `all true`, it still performed about 20% slower than the 
existing `false_count/true_count` implementation.
   
   When I applied this optimization to the execution of `BinaryExpr` and 
benchmarked the entire `BinaryExpr`, I observed a performance regression. The 
reasons are as follows:
   1. The computation of `true_count` for boolean arrays is only on the order 
of 60ns, and even with early exit optimizations, it remains at the 10ns level 
(in extreme cases).
   2. When `BinaryExpr` does not trigger short-circuiting, its execution time 
is on the order of 400us (depending on the complexity of the expression). 
Optimizing the `true_count` calculation for boolean arrays becomes negligible 
in this context.
   3. When short-circuiting occurs, the time complexity of `BinaryExpr` is on 
par with the computation of the boolean array. In this case, calling 
`true_count` remains the optimal solution.
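   A toy sketch of the short-circuit logic being discussed: when the left-hand 
boolean array of an AND contains no `true` values, the right-hand expression 
never needs to be evaluated at all. Plain `Vec<bool>` stands in for arrow's 
`BooleanArray`, and the closure stands in for evaluating the right expression.

   ```rust
// AND short-circuit sketch: the true_count-style check is cheap relative to
// evaluating the right-hand expression.
fn and_short_circuit<F>(lhs: &[bool], eval_rhs: F) -> Vec<bool>
where
    F: FnOnce() -> Vec<bool>,
{
    let true_count = lhs.iter().filter(|&&b| b).count();
    if true_count == 0 {
        // All-false AND anything is all-false: skip the right side entirely.
        return vec![false; lhs.len()];
    }
    lhs.iter().zip(eval_rhs()).map(|(&l, r)| l && r).collect()
}

fn main() {
    // Short-circuit path: the closure would panic if it were ever called.
    let out = and_short_circuit(&[false, false, false], || unreachable!());
    assert_eq!(out, vec![false, false, false]);

    // Normal path: the right side is evaluated and combined element-wise.
    let out = and_short_circuit(&[true, false], || vec![true, true]);
    assert_eq!(out, vec![true, false]);
    println!("short-circuit sketch ok");
}
   ```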
   
   ## Other Optimization Scenarios
   - Extending short-circuit optimizations to handle `nulls`.
   - The early execution of selection that I previously mentioned: 
[https://github.com/apache/datafusion/issues/15631#issuecomment-2788923437](https://github.com/apache/datafusion/issues/15631#issuecomment-2788923437).
   
   I realized that my earlier conclusions about the benefits of early selection 
execution were incorrect. After testing with the latest code changes (on a 
machine with an AMD 9950x CPU), the results are as follows: all scenarios that 
benefited from this optimization saw a reduction in execution time from `450us` 
to `170us`. This is a significant improvement, and it did not slow down any 
other cases!
   
   ```sh
    short_circuit/and/one_true_first
            time:   [188.36 µs 188.96 µs 189.55 µs]
            change: [-58.870% -58.652% -58.420%] (p = 0.00 < 0.05)
            Performance has improved.
    
    short_circuit/and/one_true_last
            time:   [170.10 µs 171.04 µs 171.97 µs]
            change: [-62.788% -62.520% -62.258%] (p = 0.00 < 0.05)
            Performance has improved.
    
    short_circuit/and/one_true_middle
            time:   [165.38 µs 165.97 µs 166.57 µs]
            change: [-64.176% -63.968% -63.784%] (p = 0.00 < 0.05)
            Performance has improved.
    
    short_circuit/and/one_true_middle_left
            time:   [165.02 µs 165.48 µs 165.96 µs]
            change: [-63.671% -63.458% -63.204%] (p = 0.00 < 0.05)
            Performance has improved.
    
    short_circuit/and/one_true_middle_right
            time:   [169.09 µs 169.61 µs 170.13 µs]
            change: [-63.298% -63.089% -62.868%] (p = 0.00 < 0.05)
            Performance has improved.
    
    short_circuit/and/one_true_middle_right #2
            time:   [166.43 µs 167.12 µs 167.85 µs]
            change: [-64.092% -63.916% -63.751%] (p = 0.00 < 0.05)
            Performance has improved.
   ```
   
   I further optimized the work done by @kosiew. Relevant PR: 
https://github.com/apache/datafusion/pull/15694 (currently, there are still 
some issues with `cargo test`, and I'm working on fixing them).
   



Re: [PR] Flatten array in a single step instead of recursive [datafusion]

2025-04-12 Thread via GitHub


delamarch3 commented on PR #15160:
URL: https://github.com/apache/datafusion/pull/15160#issuecomment-2798766441

   Thanks for the reviews!





Re: [PR] Make Clickbench Q29 5x faster for datafusion [datafusion]

2025-04-12 Thread via GitHub


ctsk commented on PR #15532:
URL: https://github.com/apache/datafusion/pull/15532#issuecomment-2798754505

   Heads up: `SUM(DISTINCT (x + 5))` is **not** equivalent to `SUM(DISTINCT x) 
+ 5 * COUNT(DISTINCT x)` 





Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]

2025-04-12 Thread via GitHub


alamb commented on issue #15072:
URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2798878082

   > > Do we want to hold DF 47 release for the arrow upgrade too?
   > > I think it is possible (arrow will hopefully be released at the end of 
this week -- and we could make the DF release candidate next week...)
   > 
   > This would be great for the Comet project.
   
   My only remaining question is if we want to upgrade arrow in this release as 
well
   
   The upgrade PR is here: 
   - https://github.com/apache/datafusion/pull/15466
   
   Since it also upgrades object_store and pyo3 it is somewhat more disruptive.





Re: [PR] Perf: Support automatically concat_batches for sort which will improve performance [datafusion]

2025-04-12 Thread via GitHub


zhuqi-lucas commented on PR #15380:
URL: https://github.com/apache/datafusion/pull/15380#issuecomment-2798879716

   Latest results based on the current latest code:
   
   ```
   
   Benchmark sort_tpch1.json
   
   ┏━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
   ┃ Query ┃     main ┃ concat_batches_for_sort ┃        Change ┃
   ┡━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
   │ Q1    │ 153.49ms │                137.57ms │ +1.12x faster │
   │ Q2    │ 131.29ms │                120.93ms │ +1.09x faster │
   │ Q3    │ 980.57ms │                982.22ms │     no change │
   │ Q4    │ 252.25ms │                245.09ms │     no change │
   │ Q5    │ 464.81ms │                449.27ms │     no change │
   │ Q6    │ 481.44ms │                455.45ms │ +1.06x faster │
   │ Q7    │ 810.73ms │                709.74ms │ +1.14x faster │
   │ Q8    │ 498.10ms │                491.12ms │     no change │
   │ Q9    │ 503.80ms │                510.20ms │     no change │
   │ Q10   │ 789.02ms │                706.45ms │ +1.12x faster │
   │ Q11   │ 417.39ms │                411.50ms │     no change │
   └───────┴──────────┴─────────────────────────┴───────────────┘
   ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
   ┃ Benchmark Summary                       ┃           ┃
   ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
   │ Total Time (main)                       │ 5482.89ms │
   │ Total Time (concat_batches_for_sort)    │ 5219.53ms │
   │ Average Time (main)                     │  498.44ms │
   │ Average Time (concat_batches_for_sort)  │  474.50ms │
   │ Queries Faster                          │         5 │
   │ Queries Slower                          │         0 │
   │ Queries with No Change                  │         6 │
   └─────────────────────────────────────────┴───────────┘
   
   Benchmark sort_tpch10.json
   
   ┏━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
   ┃ Query ┃       main ┃ concat_batches_for_sort ┃        Change ┃
   ┡━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
   │ Q1    │  2243.52ms │               1825.64ms │ +1.23x faster │
   │ Q2    │  1842.11ms │               1639.00ms │ +1.12x faster │
   │ Q3    │ 12446.31ms │              11981.63ms │     no change │
   │ Q4    │  4047.55ms │               3715.96ms │ +1.09x faster │
   │ Q5    │  4364.46ms │               4503.51ms │     no change │
   │ Q6    │  4561.01ms │               4688.31ms │     no change │
   │ Q7    │  8158.01ms │               7915.54ms │     no change │
   │ Q8    │  6077.40ms │               5524.08ms │ +1.10x faster │
   │ Q9    │  6347.21ms │               5853.44ms │ +1.08x faster │
   │ Q10   │ 11561.03ms │              14235.69ms │  1.23x slower │
   │ Q11   │  6069.42ms │               5666.77ms │ +1.07x faster │
   └───────┴────────────┴─────────────────────────┴───────────────┘
   ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
   ┃ Benchmark Summary                       ┃            ┃
   ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
   │ Total Time (main)                       │ 67718.04ms │
   │ Total Time (concat_batches_for_sort)    │ 67549.58ms │
   │ Average Time (main)                     │  6156.19ms │
   │ Average Time (concat_batches_for_sort)  │  6140.87ms │
   │ Queries Faster                          │          6 │
   │ Queries Slower                          │          1 │
   │ Queries with No Change                  │          4 │
   └─────────────────────────────────────────┴────────────┘
   ```





Re: [I] Document `CREATE EXTERNAL TABLE ... OPTIONS` [datafusion]

2025-04-12 Thread via GitHub


alamb commented on issue #10451:
URL: https://github.com/apache/datafusion/issues/10451#issuecomment-2798883874

   Hi @marvelshan  -- I also took a look at the docs. It actually looks like 
the format options are largely documented, but the documentation could be 
improved
   
   I suggest:
   
   1. Rename 
[`write_options.md`](https://datafusion.apache.org/user-guide/sql/write_options.html)
 to `format_options.md` and make it clear the options apply both to reading and 
writing. Also, link `write_options` clearly in the `CREATE EXTERNAL TABLE` 
section.
   
   2. Then, for each format's subheading (`JSON format specific options`, 
etc.), add an example, such as:
   
   ```sql
   CREATE EXTERNAL TABLE t 
   STORED AS JSON
   LOCATION '/tmp/foo.dat'
   OPTIONS('COMPRESSION', 'bzip')
   ```
   
   I also suggest raising the heading level of the JSON, CSV, etc. options so 
they appear in the table of contents.
   
   
![Image](https://github.com/user-attachments/assets/2c3f3418-9c03-4079-a679-b516ef492b2d)
   
   





Re: [PR] Fix internal error in sort when hitting memory limit [datafusion]

2025-04-12 Thread via GitHub


2010YOUY01 commented on code in PR #15692:
URL: https://github.com/apache/datafusion/pull/15692#discussion_r2040878250


##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -529,6 +523,12 @@ impl ExternalSorter {
 /// Sorts the in-memory batches and merges them into a single sorted run, then writes
 /// the result to spill files.
 async fn sort_and_spill_in_mem_batches(&mut self) -> Result<()> {
+if self.in_mem_batches.is_empty() {

Review Comment:
   > Now we handle this case in two places. I can remove the handling here 
unless a more defensive approach is desired.
   
   I think it's better to be more defensive; I'd prefer to throw an internal 
error saying that the buffer must not be empty when this function is called.



##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -1552,6 +1571,62 @@ mod tests {
 Ok(())
 }
 
+#[tokio::test]
+async fn test_batch_reservation_error() -> Result<()> {

Review Comment:
   👍🏼 



##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -765,6 +765,25 @@ impl ExternalSorter {
 
 Ok(())
 }
+
+/// Reserves memory to be able to accommodate the given batch.
+/// If memory is scarce, tries to spill current in-memory batches to disk first.
+async fn reserve_memory_for_batch(&mut self, input: &RecordBatch) -> Result<()> {
+let size = get_reserved_byte_for_record_batch(input);
+
+let result = self.reservation.try_grow(size);
+if result.is_err() {
+if self.in_mem_batches.is_empty() {
+return result;

Review Comment:
   It makes sense that we should preserve the raw resource error. In this case, 
 `ContextError` comes in handy:
   
https://github.com/apache/datafusion/blob/0b01fdf7f02f9097c319204058576f420b9790b4/datafusion/common/src/error.rs#L132
   It can add extra notes to the original error message, to make it more 
understandable.
   
   Here, the extra note can be like "When the memory limit is reached and there 
are no buffered batches to spill, external sort cannot continue. Consider 
setting a higher memory limit."
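   A minimal sketch of the wrapping pattern this suggestion points at, using a 
toy `Error` enum as a stand-in for `DataFusionError` (the variant names and 
messages here are illustrative assumptions): the original resource error is 
kept intact, and the extra note is layered on top rather than replacing it.

   ```rust
use std::fmt;

#[derive(Debug)]
enum Error {
    ResourcesExhausted(String),
    // An extra note plus the original error, mirroring the Context pattern.
    Context(String, Box<Error>),
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::ResourcesExhausted(msg) => write!(f, "Resources exhausted: {msg}"),
            Error::Context(note, inner) => write!(f, "{inner}.\n{note}"),
        }
    }
}

// Stand-in for `reservation.try_grow(size)` failing.
fn try_grow() -> Result<(), Error> {
    Err(Error::ResourcesExhausted("Failed to allocate 1024 bytes".into()))
}

fn main() {
    let raw = try_grow().unwrap_err();
    // Preserve the raw error and add the human-readable note on top.
    let wrapped = Error::Context(
        "Not enough memory to continue external sort; consider setting a higher memory limit".into(),
        Box::new(raw),
    );
    let msg = wrapped.to_string();
    assert!(msg.contains("Resources exhausted"));
    assert!(msg.contains("higher memory limit"));
    println!("{msg}");
}
   ```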



##
datafusion/physical-plan/src/sorts/sort.rs:
##
@@ -765,6 +765,25 @@ impl ExternalSorter {
 
 Ok(())
 }
+
+/// Reserves memory to be able to accommodate the given batch.
+/// If memory is scarce, tries to spill current in-memory batches to disk first.
+async fn reserve_memory_for_batch(&mut self, input: &RecordBatch) -> Result<()> {

Review Comment:
   nit: I think a more obvious name for this function would be 
`reserve_memory_for_batch_or_spill`






[PR] feat: Emit warning with Diagnostic when doing = Null [datafusion]

2025-04-12 Thread via GitHub


changsun20 opened a new pull request, #15696:
URL: https://github.com/apache/datafusion/pull/15696

   ## Which issue does this PR close?
   
   Closes #14434
   
   ## Rationale for this change
   
   This PR addresses a common SQL anti-pattern where users accidentally use `= 
NULL` instead of `IS NULL`. While syntactically valid, this comparison always 
returns NULL in SQL and often indicates a developer mistake. The changes help 
users identify this pitfall through rich warnings while maintaining full query 
execution capabilities.
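   A toy model of why `= NULL` is an anti-pattern: SQL uses three-valued logic, 
so comparing anything with NULL yields NULL ("unknown"), and a WHERE clause 
only keeps rows where the predicate is true. Here `Option<bool>` stands in for 
SQL's nullable boolean; this is an illustration of the semantics, not 
DataFusion code.

   ```rust
// Three-valued equality: any comparison involving NULL is NULL.
fn sql_eq(a: Option<i32>, b: Option<i32>) -> Option<bool> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x == y),
        _ => None,
    }
}

fn main() {
    let rows = [Some(1), None, Some(3)];

    // WHERE col = NULL: never true, so it matches nothing at all.
    let eq_null: Vec<_> = rows
        .iter()
        .filter(|&&v| sql_eq(v, None) == Some(true))
        .collect();
    assert!(eq_null.is_empty());

    // WHERE col IS NULL: what the user almost certainly meant.
    let is_null: Vec<_> = rows.iter().filter(|v| v.is_none()).collect();
    assert_eq!(is_null.len(), 1);

    println!(
        "= NULL matched {} rows, IS NULL matched {} rows",
        eq_null.len(),
        is_null.len()
    );
}
   ```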
   
   ## What changes are included in this PR?
   
   1. Added warning detection for `= NULL` comparisons in predicate contexts
   2. Implemented span-based diagnostics highlighting the problematic expression
   3. Enhanced SQL parser integration with upgraded sqlparser dependency (0.55+)
   4. Warning collection plumbing using non-intrusive RefCell storage
   5. Added help messages suggesting `IS NULL` alternative
   6. Test coverage for single and multiple occurrences
   
   
   ## Are these changes tested?
   
   Yes, this PR includes:
   
   - Unit tests for single `= NULL` detection
   - Validation of multiple warnings in complex expressions
   - Span position verification
   - Help message content checks
   
   ## Are there any user-facing changes?
   
   Yes, but non-breaking. No API changes or behavior modifications - existing 
queries will still execute normally but may produce additional warnings.





Re: [PR] feat: Emit warning with Diagnostic when doing = Null [datafusion]

2025-04-12 Thread via GitHub


changsun20 commented on PR #15696:
URL: https://github.com/apache/datafusion/pull/15696#issuecomment-2799781696

   Hi @eliaperantoni,
   
   Thank you for your patience and guidance throughout this issue. I've 
implemented the core functionality per our discussions, but would like to 
confirm a few implementation details:
   
   1. **Predicate Context Validation**  
   The warning detection is integrated during `BinaryExpr` processing, 
which should naturally limit it to predicate contexts. Statements like `UPDATE 
users SET password = NULL` won't trigger false warnings by default. Could you 
confirm this approach is acceptable?
   2. **Span Handling Strategy**  
   In usual cases, the left operand is an `Identifier`. The current 
implementation combines the identifier's left span with NULL's right span for 
precise highlighting. For rare non-identifier cases (e.g., some complex 
expressions that I can't immediately come up with one right now), we fall back 
to using just the NULL span. This balances precision with robustness.
   3. **Test Coverage Request**  
   While I've added tests for basic and multiple `= NULL` cases, could you 
suggest any edge cases or additional scenarios that should be validated?
   
   I appreciate your expertise in reviewing these implementation choices. 
Please let me know if any adjustments are needed.





Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]

2025-04-12 Thread via GitHub


andygrove commented on issue #15072:
URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2799561705

   I am also +1 for upgrading the dependencies (for selfish reasons; we are 
waiting on an arrow feature to help with INT96 timestamps in Parquet)

