Re: [PR] Blog post for DataFusion 46.0.0 [datafusion-site]

via GitHub Tue, 25 Mar 2025 09:17:02 -0700


alamb commented on code in PR #64:
URL: https://github.com/apache/datafusion-site/pull/64#discussion_r2012457516



##########
content/blog/2025-03-24-datafusion-46.0.0.md:
##########
@@ -0,0 +1,92 @@
+---
+layout: post
+title: Apache DataFusion 46.0.0 Released
+date: 2025-03-24
+author: Oznur Hanci and Berkay Sahin on behalf of the PMC
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+We’re excited to announce the release of **Apache DataFusion 46.0.0**! This 
new version represents a significant milestone for the project, packing in a 
wide range of improvements and fixes. You can find the complete details in the 
full 
[changelog](https://github.com/apache/datafusion/blob/branch-46/dev/changelog/46.0.0.md).
 We’ll highlight the most important changes below and guide you through 
upgrading.
+
+## Breaking Changes
+
+DataFusion 46.0.0 brings a few **breaking changes** that may require 
adjustments to your code:
+
+- [Unified `DataSourceExec` Execution 
Plan](https://github.com/apache/datafusion/pull/14224#)**:** DataFusion 46.0.0 
introduces a major refactor of scan operators. The separate 
file-format-specific execution plan nodes (`ParquetExec`, `CsvExec`, 
`JsonExec`, `AvroExec`, etc.) have been **deprecated and merged into a single 
`DataSourceExec` plan**. Format-specific logic is now encapsulated in new 
`DataSource` and `FileSource` traits. This change simplifies the execution 
model, but if you have code that directly references the old plan nodes, you’ll 
need to update it to use `DataSourceExec` (see the [Upgrade 
Guide](https://datafusion.apache.org/library-user-guide/upgrading.html) for 
examples of the new API).
+- [**Error Handling 
Improvements](https://github.com/apache/arrow-datafusion/issues/7360#:~:text=2) 
(`DataFusionError::Collection`):** We began overhauling DataFusion’s approach 
to error handling. In this release, a new error variant 
`DataFusionError::Collection` (and related mechanisms) has been introduced to 
aggregate multiple errors into one. This is part of a broader effort to provide 
richer error context and reduce internal panics. As a result, some error types 
or messages have changed. Downstream code that matches on specific 
`DataFusionError` variants might need adjustment.
+
+## Highlighted New Features
+
+### Improved Diagnostics
+
+DataFusion 46.0.0 introduces a new [**SQL Diagnostics 
framework**](https://github.com/apache/datafusion/issues/14429) to make error 
messages more understandable. This comes in the form of new `Diagnostic` and 
`DiagnosticEntry` types, which allow the system to attach rich context (like 
source query text spans) to error messages. In practical terms, certain planner 
errors will now point to the exact location in your SQL query that caused the 
issue. 
+
+For example, if you reference an unknown table or miss a column in `GROUP BY` 
the error message will include the query snippet causing the error. These 
diagnostics are meant for end-users of applications built on DataFusion, 
providing clearer messages instead of generic errors. Currently, diagnostics 
cover unresolved table/column references, missing `GROUP BY`columns, ambiguous 
references, wrong number of UNION columns, type mismatches, and a few others. 
Future releases will extend this to more error types. This feature should 
greatly ease debugging of complex SQL by pinpointing errors directly in the 
query text. We thank [@eliaperantoni](https://github.com/eliaperantoni) for his 
contributions in this project.
+
+### Unified `DataSourceExec` for Table Providers
+
+As mentioned, DataFusion now uses a unified `DataSourceExec` for reading 
tables, which is both a breaking change and a feature. *Why is this important?* 
The new approach simplifies how custom table providers are integrated and 
optimized. Namely, the optimizer can treat file scans uniformly and push down 
filters/limits more consistently when there is one execution plan that handles 
all data sources. The new `DataSourceExec` is paired with a `DataSource` trait 
that encapsulates format-specific behaviors (Parquet, CSV, JSON, Avro, etc.) in 
a pluggable way.
+
+All built-in sources (Parquet, CSV, Avro, Arrow, JSON, etc.) have been 
migrated to this framework. This unification makes the codebase cleaner and 
sets the stage for future enhancements (like consistent metadata handling and 
limit pushdown across all formats). Check out PR 
[#14224](https://github.com/apache/datafusion/pull/14224) for design details. 
We thank [@mertak-synnada](https://github.com/mertak-synnada) and 
[@ozankabak](https://github.com/ozankabak) for their contributions.
+
+### FFI Support for Scalar UDFs
+
+DataFusion’s Foreign Function Interface (FFI) has been extended to support 
[**user-defined scalar 
functions**](https://github.com/apache/datafusion/pull/14579) defined in 
external languages. In 46.0.0, you can now expose a custom scalar UDF through 
the FFI layer and use it in DataFusion as if it were built-in. This is 
particularly exciting for the **Python bindings** and other language 
integrations – it means you could define a function in Python (or C, etc.) and 
register it with DataFusion’s Rust core via the FFI crate. Thanks, 
[@timsaucer](https://github.com/timsaucer)!
+
+### New Statistics/Distribution Framework
+
+This release, thanks mainly to [@Fly-Style](https://github.com/Fly-Style) with 
contributions from [@ozankabak](https://github.com/ozankabak) and 
[@berkaysynnada](https://github.com/berkaysynnada), includes the initial pieces 
of a [**redesigned statistics 
framework](https://github.com/apache/datafusion/pull/14699).** DataFusion’s 
optimizer can now represent column data distributions using a new 
`Distribution` enum, instead of the old precision or range estimations. The 
supported distribution types currently include **Uniform, Gaussian (normal), 
Exponential, Bernoulli**, and a **Generic** catch-all.
+
+For example, if a filter expression is applied to a column with a known 
uniform distribution range, the optimizer can propagate that to estimate result 
selectivity more accurately. Similarly, comparisons (`=`, `>`, etc.) on columns 
yield Bernoulli distributions (with true/false probabilities) in this model.
+
+This is a foundational change with many follow-on PRs underway. Even though 
the immediate user-visible effect is limited (the optimizer didn't magically 
improve by an order of magnitude overnight), but it lays groundwork for more 
advanced query planning in the future. Over time, as statistics information 
encapsulated in `Distribution`s get integrated, DataFusion will be able to make 
smarter decisions like more aggressive parquet pruning, better join orderings, 
and so on based on data distribution information. The core framework is now in 
place and is being hooked up to column and table level statistics.
+
+### Aggregate Monotonicity and Window Ordering
+
+DataFusion 46.0.0 adds a new concept of 
[set](https://github.com/apache/datafusion/pull/14271#)[-monotonicity](https://github.com/apache/datafusion/blob/5210a2bac32e43dc7bf6e7e6000cdeaf2833c06e/datafusion/expr/src/udaf.rs#L1090)
 for certain transformations, which helps avoid unnecessary sort operations. In 
particular, the planner now understands when a **window function introduces new 
orderings of data**. For example, DataFusion now recognizes that a 
window-aggregate like `MAX` on a column can have an ordering even if the column 
itself doesn't have an ordering (for certain window frames). PR 
[#14271](https://github.com/apache/datafusion/pull/14271) introduced a 
“set-monotonicity” property for window functions, and a follow-up PR 
[#14813](https://github.com/apache/datafusion/pull/14813) refined the handling 
of sort order in window frames. Huge thanks to 
[@berkaysynnada](https://github.com/berkaysynnada) and 
[@mertak-synnada](https://github.com/mertak-synnada) for this feature.
+
+## Performance Improvements
+
+DataFusion 46.0.0 comes with a slew of performance enhancements across the 
board. Here are some of the noteworthy optimizations in this release:
+
+- **Faster `median()` (no grouping):** The `median()` aggregate function got a 
special fast path when used without a `GROUP BY`. By optimizing its 
accumulator, median calculation is about **2× faster** in the single-group 
case. If you use `MEDIAN()` on large datasets (especially as a single value), 
you should notice reduced query times (PR 
[#14399](https://github.com/apache/datafusion/pull/14399) by 
[@2010YOUY01](https://github.com/2010YOUY01)).
+- **Optimized `FIRST_VALUE`/`LAST_VALUE`:** The `FIRST_VALUE` and `LAST_VALUE` 
window functions have been improved by avoiding an internal sort of rows. 
Instead of sorting each partition, the implementation now uses a direct 
approach to pick the first/last element. This yields **10–100% performance 
improvement** for these functions, depending on the scenario. Queries using 
`FIRST_VALUE(...) OVER (PARTITION BY ... ORDER BY ...)` will run faster, 
especially when partitions are large (PR 
[#14402](https://github.com/apache/datafusion/pull/14402) by 
[@blaginin](https://github.com/blaginin)).
+- **`repeat()` String Function Boost:** Repeating strings is now more 
efficient – the `repeat(text, n)` function was optimized by about **50%**. This 
was achieved by reducing allocations and using a more efficient concatenation 
strategy. If you generate large repeated strings in queries, this can cut the 
time nearly in half (PR 
[#14697](https://github.com/apache/datafusion/pull/14697) by 
[@zjregee](https://github.com/zjregee)).
+- **Ultra-fast `uuid()` UDF:** The `uuid()` function (which generates random 
UUID strings) received a major speed-up. It’s now roughly **40× faster** than 
before! The new implementation avoids unnecessary string copying and uses a 
more direct conversion to hex, making bulk UUID generation far more practical 
(PR [#14675](https://github.com/apache/datafusion/pull/14675) by 
[@simonvandel](https://github.com/simonvandel)).
+- **Accelerated `chr()` and `to_hex()`:** Several scalar functions have been 
micro-optimized. The `chr()` function (which returns the character for a given 
ASCII code) is about **4× faster** now, and the `to_hex()` function (which 
converts numbers to hex string) is roughly **2× faster**. These improvements 
may be most noticeable in tight loops or when these functions are applied to 
large arrays of values (PR 
[#14700](https://github.com/apache/datafusion/pull/14700) for `chr`, 
[#14686](https://github.com/apache/datafusion/pull/14686) for `to_hex` by 
[@simonvandel](https://github.com/simonvandel)).
+- **No More RowConverter in Grouped Ordering:** We removed an inefficient step 
in the *partial grouping* algorithm. The `GroupOrderingPartial` operator no 
longer converts data to “row format” for each batch (via `RowConverter`). 
Instead, it uses a direct arrow-based approach to detect sort key changes. This 
eliminated overhead and yields a nice speedup for certain aggregation queries. 
(PR [#14566](https://github.com/apache/datafusion/pull/14566) by 
[@ctsk](https://github.com/ctsk)).
+- **Predicate Pruning for `NOT LIKE`:** DataFusion’s parquet reader can now 
prune row groups using `NOT LIKE` filters, similar to how it handles `LIKE`. 
This means if you have a filter such as `column NOT LIKE 'prefix%'`, DataFusion 
can use min/max statistics to skip reading files/parts that can be determined 
to either entirely match or not match the predicate. In particular, a pattern 
like `NOT LIKE 'X%'` can skip data ranges that definitely start with "X". While 
a niche case, it contributes to query efficiency in those scenarios (PR 
[#14567](https://github.com/apache/datafusion/pull/14567) by 
[@UBarney](https://github.com/UBarney)).
+- **`UNION [ALL|DISTINCT] BY NAME` Support:** While not a pure speed 
optimization, it’s worth noting we added support for `UNION BY NAME` (both 
`UNION BY NAME` which is distinct by default, and `UNION ALL BY NAME`). This 
allows combining two result sets by aligning columns by name instead of by 
position, which can save you from manual `SELECT` reordering and thus simplify 
query logic. The new logical plan builder functions `union_by_name()`and 
`union_by_name_distinct()` correspond to these operations. This feature, 
inspired by systems like Spark and DuckDB, makes it easier to union 
heterogenous datasets. (PR 
[#14538](https://github.com/apache/datafusion/pull/14538) by 
[@rkrishn7](https://github.com/rkrishn7))
+- **New `range()` Table Function:** We introduced a built-in table-valued 
function `range()` for generating sequences of numbers. Similar to Spark’s 
`range()` or PostgreSQL’s `generate_series`, this function can produce a series 
of integer values which is useful for testing, recursive queries, or generating 
surrogate keys. For example, `SELECT * FROM range(0, 100, 1)` would produce 
rows from 0 to 99. This function is implemented in Rust for efficiency 
(leveraging the same mechanism as `generate_series`) and avoids needing a 
separate source of sequence data. It’s a handy addition for ETL and data 
generation tasks (PR [#14830](https://github.com/apache/datafusion/pull/14830) 
by [@simonvandel](https://github.com/simonvandel)).
+
+## Google Summer of Code 2025
+
+Another exciting development: **Apache DataFusion has been accepted as a 
mentoring organization for Google Summer of Code (GSoC) 2025**! 🎉 This means 
that this summer, students from around the world will have the opportunity to 
contribute to DataFusion under the guidance of our committers. We have put 
together [a list of project 
ideas](https://datafusion.apache.org/contributor-guide/gsoc_project_ideas.html) 
that candidates can choose from. 
+
+If you’re interested, check out our [GSoC Application 
Guidelines](https://datafusion.apache.org/contributor-guide/gsoc_application_guidelines.html).
 We encourage students to reach out, discuss ideas with us, and apply. 
+
+## Upgrade Guide and Changelog
+
+Upgrading to 46.0.0 should be straightforward for most users, but do review 
the [Upgrade Guide for DataFusion 
46.0.0](https://datafusion.apache.org/library-user-guide/upgrading.html) for 
detailed steps and code changes. The upgrade guide covers the breaking changes 
mentioned (like replacing old exec nodes with `DataSourceExec`, updating UDF 
invocation to `invoke_with_args`, etc.) and provides code snippets to help with 
the transition. For a comprehensive list of all changes, please refer to the 
**changelog** for 46.0.0 (linked above and in the repository). The changelog 
enumerates every merged PR in this release, including many smaller fixes and 
improvements that we couldn’t cover in this post.
+
+## Get Involved
+
+Apache DataFusion is an open-source project, and we welcome involvement from 
anyone interested. Now is a great time to take 46.0.0 for a spin: try it out on 
your workloads, and let us know if you encounter any issues or have 
suggestions. You can report bugs or request features on our GitHub issue 
tracker, or better yet, submit a pull request. Join our community discussions – 
whether you have questions, want to share how you’re using DataFusion, or are 
looking to contribute, we’d love to hear from you. A list of open issues 
suitable for beginners is 
[here](https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22)
 and you can find how to reach us on the [communication 
doc](https://datafusion.apache.org/contributor-guide/communication.html).
+
+Happy querying!

Review Comment:
   Love it!



##########
content/blog/2025-03-24-datafusion-46.0.0.md:
##########
@@ -0,0 +1,92 @@
+---
+layout: post
+title: Apache DataFusion 46.0.0 Released
+date: 2025-03-24
+author: Oznur Hanci and Berkay Sahin on behalf of the PMC
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+We’re excited to announce the release of **Apache DataFusion 46.0.0**! This 
new version represents a significant milestone for the project, packing in a 
wide range of improvements and fixes. You can find the complete details in the 
full 
[changelog](https://github.com/apache/datafusion/blob/branch-46/dev/changelog/46.0.0.md).
 We’ll highlight the most important changes below and guide you through 
upgrading.
+
+## Breaking Changes
+
+DataFusion 46.0.0 brings a few **breaking changes** that may require 
adjustments to your code:
+
+- [Unified `DataSourceExec` Execution 
Plan](https://github.com/apache/datafusion/pull/14224#)**:** DataFusion 46.0.0 
introduces a major refactor of scan operators. The separate 
file-format-specific execution plan nodes (`ParquetExec`, `CsvExec`, 
`JsonExec`, `AvroExec`, etc.) have been **deprecated and merged into a single 
`DataSourceExec` plan**. Format-specific logic is now encapsulated in new 
`DataSource` and `FileSource` traits. This change simplifies the execution 
model, but if you have code that directly references the old plan nodes, you’ll 
need to update it to use `DataSourceExec` (see the [Upgrade 
Guide](https://datafusion.apache.org/library-user-guide/upgrading.html) for 
examples of the new API).
+- [**Error Handling 
Improvements](https://github.com/apache/arrow-datafusion/issues/7360#:~:text=2) 
(`DataFusionError::Collection`):** We began overhauling DataFusion’s approach 
to error handling. In this release, a new error variant 
`DataFusionError::Collection` (and related mechanisms) has been introduced to 
aggregate multiple errors into one. This is part of a broader effort to provide 
richer error context and reduce internal panics. As a result, some error types 
or messages have changed. Downstream code that matches on specific 
`DataFusionError` variants might need adjustment.
+
+## Highlighted New Features
+
+### Improved Diagnostics
+
+DataFusion 46.0.0 introduces a new [**SQL Diagnostics 
framework**](https://github.com/apache/datafusion/issues/14429) to make error 
messages more understandable. This comes in the form of new `Diagnostic` and 
`DiagnosticEntry` types, which allow the system to attach rich context (like 
source query text spans) to error messages. In practical terms, certain planner 
errors will now point to the exact location in your SQL query that caused the 
issue. 
+
+For example, if you reference an unknown table or miss a column in `GROUP BY` 
the error message will include the query snippet causing the error. These 
diagnostics are meant for end-users of applications built on DataFusion, 
providing clearer messages instead of generic errors. Currently, diagnostics 
cover unresolved table/column references, missing `GROUP BY`columns, ambiguous 
references, wrong number of UNION columns, type mismatches, and a few others. 
Future releases will extend this to more error types. This feature should 
greatly ease debugging of complex SQL by pinpointing errors directly in the 
query text. We thank [@eliaperantoni](https://github.com/eliaperantoni) for his 
contributions in this project.
+
+### Unified `DataSourceExec` for Table Providers
+
+As mentioned, DataFusion now uses a unified `DataSourceExec` for reading 
tables, which is both a breaking change and a feature. *Why is this important?* 
The new approach simplifies how custom table providers are integrated and 
optimized. Namely, the optimizer can treat file scans uniformly and push down 
filters/limits more consistently when there is one execution plan that handles 
all data sources. The new `DataSourceExec` is paired with a `DataSource` trait 
that encapsulates format-specific behaviors (Parquet, CSV, JSON, Avro, etc.) in 
a pluggable way.
+
+All built-in sources (Parquet, CSV, Avro, Arrow, JSON, etc.) have been 
migrated to this framework. This unification makes the codebase cleaner and 
sets the stage for future enhancements (like consistent metadata handling and 
limit pushdown across all formats). Check out PR 
[#14224](https://github.com/apache/datafusion/pull/14224) for design details. 
We thank [@mertak-synnada](https://github.com/mertak-synnada) and 
[@ozankabak](https://github.com/ozankabak) for their contributions.
+
+### FFI Support for Scalar UDFs
+
+DataFusion’s Foreign Function Interface (FFI) has been extended to support 
[**user-defined scalar 
functions**](https://github.com/apache/datafusion/pull/14579) defined in 
external languages. In 46.0.0, you can now expose a custom scalar UDF through 
the FFI layer and use it in DataFusion as if it were built-in. This is 
particularly exciting for the **Python bindings** and other language 
integrations – it means you could define a function in Python (or C, etc.) and 
register it with DataFusion’s Rust core via the FFI crate. Thanks, 
[@timsaucer](https://github.com/timsaucer)!
+
+### New Statistics/Distribution Framework
+
+This release, thanks mainly to [@Fly-Style](https://github.com/Fly-Style) with 
contributions from [@ozankabak](https://github.com/ozankabak) and 
[@berkaysynnada](https://github.com/berkaysynnada), includes the initial pieces 
of a [**redesigned statistics 
framework](https://github.com/apache/datafusion/pull/14699).** DataFusion’s 
optimizer can now represent column data distributions using a new 
`Distribution` enum, instead of the old precision or range estimations. The 
supported distribution types currently include **Uniform, Gaussian (normal), 
Exponential, Bernoulli**, and a **Generic** catch-all.
+
+For example, if a filter expression is applied to a column with a known 
uniform distribution range, the optimizer can propagate that to estimate result 
selectivity more accurately. Similarly, comparisons (`=`, `>`, etc.) on columns 
yield Bernoulli distributions (with true/false probabilities) in this model.
+
+This is a foundational change with many follow-on PRs underway. Even though 
the immediate user-visible effect is limited (the optimizer didn't magically 
improve by an order of magnitude overnight), but it lays groundwork for more 
advanced query planning in the future. Over time, as statistics information 
encapsulated in `Distribution`s get integrated, DataFusion will be able to make 
smarter decisions like more aggressive parquet pruning, better join orderings, 
and so on based on data distribution information. The core framework is now in 
place and is being hooked up to column and table level statistics.
+
+### Aggregate Monotonicity and Window Ordering
+
+DataFusion 46.0.0 adds a new concept of 
[set](https://github.com/apache/datafusion/pull/14271#)[-monotonicity](https://github.com/apache/datafusion/blob/5210a2bac32e43dc7bf6e7e6000cdeaf2833c06e/datafusion/expr/src/udaf.rs#L1090)
 for certain transformations, which helps avoid unnecessary sort operations. In 
particular, the planner now understands when a **window function introduces new 
orderings of data**. For example, DataFusion now recognizes that a 
window-aggregate like `MAX` on a column can have an ordering even if the column 
itself doesn't have an ordering (for certain window frames). PR 
[#14271](https://github.com/apache/datafusion/pull/14271) introduced a 
“set-monotonicity” property for window functions, and a follow-up PR 
[#14813](https://github.com/apache/datafusion/pull/14813) refined the handling 
of sort order in window frames. Huge thanks to 
[@berkaysynnada](https://github.com/berkaysynnada) and 
[@mertak-synnada](https://github.com/mertak-synnada) for this feature.
+
+## Performance Improvements

Review Comment:
   I suggest moving the `Performance Improvements` and `GSOC` sections above 
the notable new features as I think they are a nice lead in to the release



##########
content/blog/2025-03-24-datafusion-46.0.0.md:
##########
@@ -0,0 +1,92 @@
+---
+layout: post
+title: Apache DataFusion 46.0.0 Released
+date: 2025-03-24
+author: Oznur Hanci and Berkay Sahin on behalf of the PMC
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+We’re excited to announce the release of **Apache DataFusion 46.0.0**! This 
new version represents a significant milestone for the project, packing in a 
wide range of improvements and fixes. You can find the complete details in the 
full 
[changelog](https://github.com/apache/datafusion/blob/branch-46/dev/changelog/46.0.0.md).
 We’ll highlight the most important changes below and guide you through 
upgrading.
+
+## Breaking Changes
+
+DataFusion 46.0.0 brings a few **breaking changes** that may require 
adjustments to your code:
+
+- [Unified `DataSourceExec` Execution 
Plan](https://github.com/apache/datafusion/pull/14224#)**:** DataFusion 46.0.0 
introduces a major refactor of scan operators. The separate 
file-format-specific execution plan nodes (`ParquetExec`, `CsvExec`, 
`JsonExec`, `AvroExec`, etc.) have been **deprecated and merged into a single 
`DataSourceExec` plan**. Format-specific logic is now encapsulated in new 
`DataSource` and `FileSource` traits. This change simplifies the execution 
model, but if you have code that directly references the old plan nodes, you’ll 
need to update it to use `DataSourceExec` (see the [Upgrade 
Guide](https://datafusion.apache.org/library-user-guide/upgrading.html) for 
examples of the new API).
+- [**Error Handling 
Improvements](https://github.com/apache/arrow-datafusion/issues/7360#:~:text=2) 
(`DataFusionError::Collection`):** We began overhauling DataFusion’s approach 
to error handling. In this release, a new error variant 
`DataFusionError::Collection` (and related mechanisms) has been introduced to 
aggregate multiple errors into one. This is part of a broader effort to provide 
richer error context and reduce internal panics. As a result, some error types 
or messages have changed. Downstream code that matches on specific 
`DataFusionError` variants might need adjustment.
+
+## Highlighted New Features
+
+### Improved Diagnostics
+
+DataFusion 46.0.0 introduces a new [**SQL Diagnostics 
framework**](https://github.com/apache/datafusion/issues/14429) to make error 
messages more understandable. This comes in the form of new `Diagnostic` and 
`DiagnosticEntry` types, which allow the system to attach rich context (like 
source query text spans) to error messages. In practical terms, certain planner 
errors will now point to the exact location in your SQL query that caused the 
issue. 
+
+For example, if you reference an unknown table or miss a column in `GROUP BY` 
the error message will include the query snippet causing the error. These 
diagnostics are meant for end-users of applications built on DataFusion, 
providing clearer messages instead of generic errors. Currently, diagnostics 
cover unresolved table/column references, missing `GROUP BY`columns, ambiguous 
references, wrong number of UNION columns, type mismatches, and a few others. 
Future releases will extend this to more error types. This feature should 
greatly ease debugging of complex SQL by pinpointing errors directly in the 
query text. We thank [@eliaperantoni](https://github.com/eliaperantoni) for his 
contributions in this project.
+
+### Unified `DataSourceExec` for Table Providers
+
+As mentioned, DataFusion now uses a unified `DataSourceExec` for reading 
tables, which is both a breaking change and a feature. *Why is this important?* 
The new approach simplifies how custom table providers are integrated and 
optimized. Namely, the optimizer can treat file scans uniformly and push down 
filters/limits more consistently when there is one execution plan that handles 
all data sources. The new `DataSourceExec` is paired with a `DataSource` trait 
that encapsulates format-specific behaviors (Parquet, CSV, JSON, Avro, etc.) in 
a pluggable way.
+
+All built-in sources (Parquet, CSV, Avro, Arrow, JSON, etc.) have been 
migrated to this framework. This unification makes the codebase cleaner and 
sets the stage for future enhancements (like consistent metadata handling and 
limit pushdown across all formats). Check out PR 
[#14224](https://github.com/apache/datafusion/pull/14224) for design details. 
We thank [@mertak-synnada](https://github.com/mertak-synnada) and 
[@ozankabak](https://github.com/ozankabak) for their contributions.
+
+### FFI Support for Scalar UDFs
+
+DataFusion’s Foreign Function Interface (FFI) has been extended to support 
[**user-defined scalar 
functions**](https://github.com/apache/datafusion/pull/14579) defined in 
external languages. In 46.0.0, you can now expose a custom scalar UDF through 
the FFI layer and use it in DataFusion as if it were built-in. This is 
particularly exciting for the **Python bindings** and other language 
integrations – it means you could define a function in Python (or C, etc.) and 
register it with DataFusion’s Rust core via the FFI crate. Thanks, 
[@timsaucer](https://github.com/timsaucer)!
+
+### New Statistics/Distribution Framework
+
+This release, thanks mainly to [@Fly-Style](https://github.com/Fly-Style) with 
contributions from [@ozankabak](https://github.com/ozankabak) and 
[@berkaysynnada](https://github.com/berkaysynnada), includes the initial pieces 
of a [**redesigned statistics 
framework](https://github.com/apache/datafusion/pull/14699).** DataFusion’s 
optimizer can now represent column data distributions using a new 
`Distribution` enum, instead of the old precision or range estimations. The 
supported distribution types currently include **Uniform, Gaussian (normal), 
Exponential, Bernoulli**, and a **Generic** catch-all.
+
+For example, if a filter expression is applied to a column with a known 
uniform distribution range, the optimizer can propagate that to estimate result 
selectivity more accurately. Similarly, comparisons (`=`, `>`, etc.) on columns 
yield Bernoulli distributions (with true/false probabilities) in this model.
+
+This is a foundational change with many follow-on PRs underway. Even though 
the immediate user-visible effect is limited (the optimizer didn't magically 
improve by an order of magnitude overnight), but it lays groundwork for more 
advanced query planning in the future. Over time, as statistics information 
encapsulated in `Distribution`s get integrated, DataFusion will be able to make 
smarter decisions like more aggressive parquet pruning, better join orderings, 
and so on based on data distribution information. The core framework is now in 
place and is being hooked up to column and table level statistics.
+
+### Aggregate Monotonicity and Window Ordering
+
+DataFusion 46.0.0 adds a new concept of 
[set](https://github.com/apache/datafusion/pull/14271#)[-monotonicity](https://github.com/apache/datafusion/blob/5210a2bac32e43dc7bf6e7e6000cdeaf2833c06e/datafusion/expr/src/udaf.rs#L1090)
 for certain transformations, which helps avoid unnecessary sort operations. In 
particular, the planner now understands when a **window function introduces new 
orderings of data**. For example, DataFusion now recognizes that a 
window-aggregate like `MAX` on a column can have an ordering even if the column 
itself doesn't have an ordering (for certain window frames). PR 
[#14271](https://github.com/apache/datafusion/pull/14271) introduced a 
“set-monotonicity” property for window functions, and a follow-up PR 
[#14813](https://github.com/apache/datafusion/pull/14813) refined the handling 
of sort order in window frames. Huge thanks to 
[@berkaysynnada](https://github.com/berkaysynnada) and 
[@mertak-synnada](https://github.com/mertak-synnada) for this feature.

Review Comment:
   I think it would be easier to understand why this feature is valuable if 
there is some example query showing what type of query now works / is 
optimized. Not required just a thought



##########
content/blog/2025-03-24-datafusion-46.0.0.md:
##########
@@ -0,0 +1,92 @@
+---
+layout: post
+title: Apache DataFusion 46.0.0 Released
+date: 2025-03-24
+author: Oznur Hanci and Berkay Sahin on behalf of the PMC
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+We’re excited to announce the release of **Apache DataFusion 46.0.0**! This 
new version represents a significant milestone for the project, packing in a 
wide range of improvements and fixes. You can find the complete details in the 
full 
[changelog](https://github.com/apache/datafusion/blob/branch-46/dev/changelog/46.0.0.md).
 We’ll highlight the most important changes below and guide you through 
upgrading.
+
+## Breaking Changes
+
+DataFusion 46.0.0 brings a few **breaking changes** that may require 
adjustments to your code:
+
+- [Unified `DataSourceExec` Execution 
Plan](https://github.com/apache/datafusion/pull/14224#)**:** DataFusion 46.0.0 
introduces a major refactor of scan operators. The separate 
file-format-specific execution plan nodes (`ParquetExec`, `CsvExec`, 
`JsonExec`, `AvroExec`, etc.) have been **deprecated and merged into a single 
`DataSourceExec` plan**. Format-specific logic is now encapsulated in new 
`DataSource` and `FileSource` traits. This change simplifies the execution 
model, but if you have code that directly references the old plan nodes, you’ll 
need to update it to use `DataSourceExec` (see the [Upgrade 
Guide](https://datafusion.apache.org/library-user-guide/upgrading.html) for 
examples of the new API).
+- [**Error Handling 
Improvements](https://github.com/apache/arrow-datafusion/issues/7360#:~:text=2) 
(`DataFusionError::Collection`):** We began overhauling DataFusion’s approach 
to error handling. In this release, a new error variant 
`DataFusionError::Collection` (and related mechanisms) has been introduced to 
aggregate multiple errors into one. This is part of a broader effort to provide 
richer error context and reduce internal panics. As a result, some error types 
or messages have changed. Downstream code that matches on specific 
`DataFusionError` variants might need adjustment.
+
+## Highlighted New Features
+
+### Improved Diagnostics
+
+DataFusion 46.0.0 introduces a new [**SQL Diagnostics 
framework**](https://github.com/apache/datafusion/issues/14429) to make error 
messages more understandable. This comes in the form of new `Diagnostic` and 
`DiagnosticEntry` types, which allow the system to attach rich context (like 
source query text spans) to error messages. In practical terms, certain planner 
errors will now point to the exact location in your SQL query that caused the 
issue. 
+
+For example, if you reference an unknown table or miss a column in `GROUP BY` 
the error message will include the query snippet causing the error. These 
diagnostics are meant for end-users of applications built on DataFusion, 
providing clearer messages instead of generic errors. Currently, diagnostics 
cover unresolved table/column references, missing `GROUP BY`columns, ambiguous 
references, wrong number of UNION columns, type mismatches, and a few others. 
Future releases will extend this to more error types. This feature should 
greatly ease debugging of complex SQL by pinpointing errors directly in the 
query text. We thank [@eliaperantoni](https://github.com/eliaperantoni) for his 
contributions in this project.
+
+### Unified `DataSourceExec` for Table Providers
+
+As mentioned, DataFusion now uses a unified `DataSourceExec` for reading 
tables, which is both a breaking change and a feature. *Why is this important?* 
The new approach simplifies how custom table providers are integrated and 
optimized. Namely, the optimizer can treat file scans uniformly and push down 
filters/limits more consistently when there is one execution plan that handles 
all data sources. The new `DataSourceExec` is paired with a `DataSource` trait 
that encapsulates format-specific behaviors (Parquet, CSV, JSON, Avro, etc.) in 
a pluggable way.
+
+All built-in sources (Parquet, CSV, Avro, Arrow, JSON, etc.) have been 
migrated to this framework. This unification makes the codebase cleaner and 
sets the stage for future enhancements (like consistent metadata handling and 
limit pushdown across all formats). Check out PR 
[#14224](https://github.com/apache/datafusion/pull/14224) for design details. 
We thank [@mertak-synnada](https://github.com/mertak-synnada) and 
[@ozankabak](https://github.com/ozankabak) for their contributions.
+
+### FFI Support for Scalar UDFs
+
+DataFusion’s Foreign Function Interface (FFI) has been extended to support 
[**user-defined scalar 
functions**](https://github.com/apache/datafusion/pull/14579) defined in 
external languages. In 46.0.0, you can now expose a custom scalar UDF through 
the FFI layer and use it in DataFusion as if it were built-in. This is 
particularly exciting for the **Python bindings** and other language 
integrations – it means you could define a function in Python (or C, etc.) and 
register it with DataFusion’s Rust core via the FFI crate. Thanks, 
[@timsaucer](https://github.com/timsaucer)!
+
+### New Statistics/Distribution Framework
+
+This release, thanks mainly to [@Fly-Style](https://github.com/Fly-Style) with 
contributions from [@ozankabak](https://github.com/ozankabak) and 
[@berkaysynnada](https://github.com/berkaysynnada), includes the initial pieces 
of a [**redesigned statistics 
framework](https://github.com/apache/datafusion/pull/14699).** DataFusion’s 
optimizer can now represent column data distributions using a new 
`Distribution` enum, instead of the old precision or range estimations. The 
supported distribution types currently include **Uniform, Gaussian (normal), 
Exponential, Bernoulli**, and a **Generic** catch-all.
+
+For example, if a filter expression is applied to a column with a known 
uniform distribution range, the optimizer can propagate that to estimate result 
selectivity more accurately. Similarly, comparisons (`=`, `>`, etc.) on columns 
yield Bernoulli distributions (with true/false probabilities) in this model.
+
+This is a foundational change with many follow-on PRs underway. Even though 
the immediate user-visible effect is limited (the optimizer didn't magically 
improve by an order of magnitude overnight), but it lays groundwork for more 
advanced query planning in the future. Over time, as statistics information 
encapsulated in `Distribution`s get integrated, DataFusion will be able to make 
smarter decisions like more aggressive parquet pruning, better join orderings, 
and so on based on data distribution information. The core framework is now in 
place and is being hooked up to column and table level statistics.
+
+### Aggregate Monotonicity and Window Ordering
+
+DataFusion 46.0.0 adds a new concept of 
[set](https://github.com/apache/datafusion/pull/14271#)[-monotonicity](https://github.com/apache/datafusion/blob/5210a2bac32e43dc7bf6e7e6000cdeaf2833c06e/datafusion/expr/src/udaf.rs#L1090)
 for certain transformations, which helps avoid unnecessary sort operations. In 
particular, the planner now understands when a **window function introduces new 
orderings of data**. For example, DataFusion now recognizes that a 
window-aggregate like `MAX` on a column can have an ordering even if the column 
itself doesn't have an ordering (for certain window frames). PR 
[#14271](https://github.com/apache/datafusion/pull/14271) introduced a 
“set-monotonicity” property for window functions, and a follow-up PR 
[#14813](https://github.com/apache/datafusion/pull/14813) refined the handling 
of sort order in window frames. Huge thanks to 
[@berkaysynnada](https://github.com/berkaysynnada) and 
[@mertak-synnada](https://github.com/mertak-synnada) for this feature.
+
+## Performance Improvements
+
+DataFusion 46.0.0 comes with a slew of performance enhancements across the 
board. Here are some of the noteworthy optimizations in this release:
+
+- **Faster `median()` (no grouping):** The `median()` aggregate function got a 
special fast path when used without a `GROUP BY`. By optimizing its 
accumulator, median calculation is about **2× faster** in the single-group 
case. If you use `MEDIAN()` on large datasets (especially as a single value), 
you should notice reduced query times (PR 
[#14399](https://github.com/apache/datafusion/pull/14399) by 
[@2010YOUY01](https://github.com/2010YOUY01)).
+- **Optimized `FIRST_VALUE`/`LAST_VALUE`:** The `FIRST_VALUE` and `LAST_VALUE` 
window functions have been improved by avoiding an internal sort of rows. 
Instead of sorting each partition, the implementation now uses a direct 
approach to pick the first/last element. This yields **10–100% performance 
improvement** for these functions, depending on the scenario. Queries using 
`FIRST_VALUE(...) OVER (PARTITION BY ... ORDER BY ...)` will run faster, 
especially when partitions are large (PR 
[#14402](https://github.com/apache/datafusion/pull/14402) by 
[@blaginin](https://github.com/blaginin)).
+- **`repeat()` String Function Boost:** Repeating strings is now more 
efficient – the `repeat(text, n)` function was optimized by about **50%**. This 
was achieved by reducing allocations and using a more efficient concatenation 
strategy. If you generate large repeated strings in queries, this can cut the 
time nearly in half (PR 
[#14697](https://github.com/apache/datafusion/pull/14697) by 
[@zjregee](https://github.com/zjregee)).
+- **Ultra-fast `uuid()` UDF:** The `uuid()` function (which generates random 
UUID strings) received a major speed-up. It’s now roughly **40× faster** than 
before! The new implementation avoids unnecessary string copying and uses a 
more direct conversion to hex, making bulk UUID generation far more practical 
(PR [#14675](https://github.com/apache/datafusion/pull/14675) by 
[@simonvandel](https://github.com/simonvandel)).
+- **Accelerated `chr()` and `to_hex()`:** Several scalar functions have been 
micro-optimized. The `chr()` function (which returns the character for a given 
ASCII code) is about **4× faster** now, and the `to_hex()` function (which 
converts numbers to hex string) is roughly **2× faster**. These improvements 
may be most noticeable in tight loops or when these functions are applied to 
large arrays of values (PR 
[#14700](https://github.com/apache/datafusion/pull/14700) for `chr`, 
[#14686](https://github.com/apache/datafusion/pull/14686) for `to_hex` by 
[@simonvandel](https://github.com/simonvandel)).
+- **No More RowConverter in Grouped Ordering:** We removed an inefficient step 
in the *partial grouping* algorithm. The `GroupOrderingPartial` operator no 
longer converts data to “row format” for each batch (via `RowConverter`). 
Instead, it uses a direct arrow-based approach to detect sort key changes. This 
eliminated overhead and yields a nice speedup for certain aggregation queries. 
(PR [#14566](https://github.com/apache/datafusion/pull/14566) by 
[@ctsk](https://github.com/ctsk)).
+- **Predicate Pruning for `NOT LIKE`:** DataFusion’s parquet reader can now 
prune row groups using `NOT LIKE` filters, similar to how it handles `LIKE`. 
This means if you have a filter such as `column NOT LIKE 'prefix%'`, DataFusion 
can use min/max statistics to skip reading files/parts that can be determined 
to either entirely match or not match the predicate. In particular, a pattern 
like `NOT LIKE 'X%'` can skip data ranges that definitely start with "X". While 
a niche case, it contributes to query efficiency in those scenarios (PR 
[#14567](https://github.com/apache/datafusion/pull/14567) by 
[@UBarney](https://github.com/UBarney)).
+- **`UNION [ALL|DISTINCT] BY NAME` Support:** While not a pure speed 
optimization, it’s worth noting we added support for `UNION BY NAME` (both 
`UNION BY NAME` which is distinct by default, and `UNION ALL BY NAME`). This 
allows combining two result sets by aligning columns by name instead of by 
position, which can save you from manual `SELECT` reordering and thus simplify 
query logic. The new logical plan builder functions `union_by_name()`and 
`union_by_name_distinct()` correspond to these operations. This feature, 
inspired by systems like Spark and DuckDB, makes it easier to union 
heterogenous datasets. (PR 
[#14538](https://github.com/apache/datafusion/pull/14538) by 
[@rkrishn7](https://github.com/rkrishn7))

Review Comment:
   I think these last two items are more features that performance -- maybe we 
can list them under a new heading of "New Features" or something



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Blog post for DataFusion 46.0.0 [datafusion-site]

Reply via email to