Re: [PR] Blog: Add blog post about DataFusion 50.0.0 release [datafusion-site]

via GitHub Fri, 26 Sep 2025 11:41:34 -0700


nuno-faria commented on code in PR #115:
URL: https://github.com/apache/datafusion-site/pull/115#discussion_r2383198599



##########
content/blog/2025-09-29-datafusion-50.0.0.md:
##########
@@ -0,0 +1,413 @@
+---
+layout: post
+title: Apache DataFusion 50.0.0 Released
+date: 2025-09-29
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[TOC]
+
+<!-- see https://github.com/apache/datafusion/issues/16347 for details -->
+
+## Introduction
+
+We are proud to announce the release of [DataFusion 50.0.0]. This blog post
+highlights some of the major improvements since the release of [DataFusion
+49.0.0]. The complete list of changes is available in the [changelog].
+
+[DataFusion 50.0.0]: https://crates.io/crates/datafusion/50.0.0
+[DataFusion 49.0.0]: https://datafusion.apache.org/blog/2025/07/28/datafusion-49.0.0/
+[changelog]: https://github.com/apache/datafusion/blob/branch-50/dev/changelog/50.0.0.md
+
+
+## Performance Improvements 🚀
+
+DataFusion continues to focus on enhancing performance, as shown in the
+ClickBench and other results.
+
+<img src="/blog/images/datafusion-50.0.0/performance_over_time_clickbench.png"
+  width="100%" class="img-responsive" alt="ClickBench performance results over
+  time for DataFusion" />
+
+**Figure 1**: ClickBench performance improvements over time. Average and median
+normalized query execution times for ClickBench queries for each git revision.
+Query times are normalized using the ClickBench definition. Data and definitions
+are available on the [DataFusion Benchmarking
+Page](https://alamb.github.io/datafusion-benchmarking/).
+
+Here are some noteworthy optimizations added since DataFusion 49:
+
+**Dynamic Filter Pushdown Improvements**
+
+The dynamic filter pushdown optimization, which allows runtime filters to cut
+down on the amount of data read, has been extended to support **inner hash
+joins**, dramatically improving performance when one relation is relatively
+small or filtered by a highly selective predicate. More details can be found in
+the [Dynamic Filter Pushdown for Hash Joins](#dynamic-filter-pushdown-for-hash-joins) section below.
+
+The dynamic filters in the TopK operator have also been improved in DataFusion
+50.0.0, further increasing the effectiveness and efficiency of the optimization.
+More details can be found in this
+[pull request](https://github.com/apache/datafusion/pull/16433).
+
+**Nested Loop Join Optimization**
+
+The nested loop join operator has been rewritten to reduce execution time and memory
+usage by adopting a finer-grained approach. Specifically, we now limit the 
+intermediate data size to around a single `RecordBatch` for better memory
+efficiency, and we have eliminated redundant conversions from the old 
+implementation to further improve execution speed.
+
+When evaluating this new approach in a microbenchmark, we measured up to a 5x
+improvement in execution time and a 99% reduction in memory usage. More details and
+results can be found in this
+[pull request](https://github.com/apache/datafusion/pull/16996).
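
As a rough illustration of the "around a single `RecordBatch`" idea described above, here is a minimal Python sketch. It is not DataFusion's Rust implementation: the function name `nested_loop_join` and the `batch_size` parameter are hypothetical, and rows are modeled as plain tuples. The point is only that joined rows are emitted in bounded batches instead of materializing the whole intermediate result first.

```python
from typing import Callable, Iterator, List, Tuple

Row = Tuple  # assumption: a row is modeled as a plain tuple


def nested_loop_join(
    left: List[Row],
    right: List[Row],
    predicate: Callable[[Row, Row], bool],
    batch_size: int = 4,  # hypothetical knob standing in for the batch size
) -> Iterator[List[Row]]:
    """Yield joined rows in batches of at most `batch_size` rows."""
    batch: List[Row] = []
    for l_row in left:
        for r_row in right:
            if predicate(l_row, r_row):
                batch.append(l_row + r_row)
                if len(batch) == batch_size:
                    yield batch  # hand the full batch downstream
                    batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


left = [(1,), (2,), (3,)]
right = [(1,), (2,), (3,), (4,)]
batches = list(nested_loop_join(left, right, lambda l, r: l[0] < r[0], batch_size=4))
```

Because the function is a generator, at most `batch_size` joined rows are held at once, regardless of how many pairs match overall.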
+
+**Parquet Metadata Caching**
+
+The metadata of Parquet files (statistics, page indexes, etc.) is
+now cached to avoid unnecessary disk/network round-trips. This is especially
+useful when querying the same table multiple times over relatively slow networks, allowing us to
+achieve an order of magnitude faster execution time. More information can be
+found in the [Parquet Metadata Cache](#parquet-metadata-cache) section.
+
+## Community Growth 📈
+
+Between `49.0.0` and `50.0.0`, we continue to see our community grow:
+
+1. Qi Zhu ([zhuqi-lucas](https://github.com/zhuqi-lucas)) and Yoav Cohen
+   ([yoavcloud](https://github.com/yoavcloud)) became committers. See the
+   [mailing list] for more details.
+2. In the [core DataFusion repo] alone, we reviewed and accepted 318 PRs
+   from 79 different committers, created over 235 issues, and closed 197 of them
+   🚀. All changes are listed in the detailed [changelogs].
+3. DataFusion published several blogs, including *[Using External Indexes, Metadata Stores, Catalogs and
+   Caches to Accelerate Queries on Apache Parquet]*, *[Dynamic Filters:
+   Passing Information Between Operators During Execution for 25x Faster
+   Queries]*, and *[Implementing User Defined Types and Custom Metadata
+   in DataFusion]*.
+
+<!--
+# Unique committers
+$ git shortlog -sn 49.0.0..50.0.0  . | wc -l
+    79
+# commits
+$ git log --pretty=oneline 49.0.0..50.0.0  . | wc -l
+    318
+
+https://crates.io/crates/datafusion/49.0.0
+DataFusion 49 released July 25, 2025
+
+https://crates.io/crates/datafusion/50.0.0
+DataFusion 50 released September 16, 2025
+
+Issues created in this time: 117 open, 118 closed = 235 total
+https://github.com/apache/datafusion/issues?q=is%3Aissue+created%3A2025-07-25..2025-09-16
+
+Issues closed: 197
+https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2025-07-25..2025-09-16
+
+PRs merged in this time 371
+https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2025-07-25..2025-09-16
+-->
+
+
+[core DataFusion repo]: https://github.com/apache/arrow-datafusion
+[changelogs]: https://github.com/apache/datafusion/tree/main/dev/changelog
+[mailing list]: https://lists.apache.org/[email protected]
+[Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Apache Parquet]: https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/
+[Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/
+[Implementing User Defined Types and Custom Metadata in DataFusion]: https://datafusion.apache.org/blog/2025/09/21/custom-types-using-metadata/
+
+## New Features ✨
+
+### Improved Spilling Sorts for Larger-than-Memory Datasets
+
+DataFusion has long been able to sort datasets that do not fit entirely in memory,
+but still struggled with particularly large inputs or highly memory-constrained
+setups. Larger-than-memory sorts in DataFusion 50.0.0 have been improved with the recent introduction
+of multi-level merge sorts (more details in the respective
+[pull request](https://github.com/apache/datafusion/pull/15700)). It is now
+possible to execute almost any sorting query that would have previously triggered *out-of-memory*
+errors by relying on disk spilling.
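
The general technique behind a multi-level merge can be sketched in a few lines of Python. This is an illustration of the idea only, not the algorithm from the linked pull request: the name `multi_level_merge` and the `fan_in` parameter are hypothetical, and in-memory lists stand in for sorted runs spilled to disk. Only `fan_in` runs are merged at a time, so the memory needed per merge step stays bounded no matter how many runs were spilled.

```python
import heapq
from typing import List


def multi_level_merge(runs: List[List[int]], fan_in: int = 2) -> List[int]:
    """Merge sorted runs, combining at most `fan_in` runs per merge step."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not runs:
        return []
    while len(runs) > 1:
        next_level = []
        for i in range(0, len(runs), fan_in):
            group = runs[i : i + fan_in]
            # Assumption: a real engine would stream each merged run back to
            # disk here instead of building a list in memory.
            next_level.append(list(heapq.merge(*group)))
        runs = next_level  # each level has fewer, longer runs
    return runs[0]


# Five small sorted runs, as if produced by five separate spills.
spilled_runs = [[1, 9], [3, 4], [2, 8], [0, 7], [5, 6]]
result = multi_level_merge(spilled_runs, fan_in=2)
```

With `fan_in=2`, the five runs are reduced to three, then two, then one. A larger fan-in needs fewer levels but more memory per merge step.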
+
+### Dynamic Filter Pushdown for Hash Joins
+
+The [dynamic filter pushdown
+optimization](https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/)
+has been extended to inner hash joins, dramatically reducing the amount of
+scanned data in some workloads—a technique sometimes referred to as
+[*Sideways Information Passing*](https://www.cs.cmu.edu/~15721-f24/papers/Sideways_Information_Passing.pdf).
+
+These filters are automatically applied to inner hash joins, while future work
+will introduce them to other join types. 
+
+For example, given a query that looks for a specific customer and
+their orders, DataFusion can now filter the `orders` relation based on the
+`c_custkey` of the target customer, reducing the amount of data
+read from disk by orders of magnitude.
+
+```sql
+-- retrieve the orders of the customer with c_phone = '25-989-741-2988'
+SELECT *
+FROM customer
+JOIN orders on c_custkey = o_custkey
+WHERE c_phone = '25-989-741-2988';
+```
+
+The following shows an execution plan in DataFusion 50.0.0 with this optimization:
+
+```sql
+HashJoinExec
+    DataSourceExec: <-- read customer
+      predicate=c_phone@4 = 25-989-741-2988
+      metrics=[output_rows=1, ...]
+    DataSourceExec: <-- read orders
+      -- dynamic filter is added here, filtering directly at scan time
+      predicate=DynamicFilterPhysicalExpr [ o_custkey@1 >= 1 AND o_custkey@1 <= 1 ]
+      -- the number of output rows is kept to a minimum
+      metrics=[output_rows=11, ...]
+```
+
+Because there is a single customer in this query,
+almost all rows from `orders` are filtered out by the join. 
+In previous versions of DataFusion, the entire `orders` relation would be
+scanned to join with the target customer, but now the dynamic filter pushdown can
+filter it right at the source, minimizing the amount of data decoded.
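
The plan above shows the dynamic filter as a pair of bounds (`o_custkey@1 >= 1 AND o_custkey@1 <= 1`). A minimal Python sketch of that idea follows. It assumes the bounds are simply the minimum and maximum join key seen on the build side; the helper names `build_bounds` and `scan_with_filter` and the sample rows are hypothetical and not part of DataFusion.

```python
from typing import Dict, List, Optional, Tuple


def build_bounds(build_keys: List[int]) -> Optional[Tuple[int, int]]:
    """Return (min, max) of the join keys seen on the build side."""
    if not build_keys:
        return None
    return min(build_keys), max(build_keys)


def scan_with_filter(
    rows: List[Dict], key: str, bounds: Optional[Tuple[int, int]]
) -> List[Dict]:
    """Keep only rows whose key falls inside the bounds (the dynamic filter)."""
    if bounds is None:
        return []  # empty build side: an inner join produces no rows
    lo, hi = bounds
    return [r for r in rows if lo <= r[key] <= hi]


# Hypothetical sample data: one matching customer, four orders.
customers = [{"c_custkey": 1, "c_phone": "25-989-741-2988"}]
orders = [
    {"o_orderkey": 10, "o_custkey": 1},
    {"o_orderkey": 11, "o_custkey": 2},
    {"o_orderkey": 12, "o_custkey": 1},
    {"o_orderkey": 13, "o_custkey": 7},
]

bounds = build_bounds([c["c_custkey"] for c in customers])
surviving = scan_with_filter(orders, "o_custkey", bounds)
```

A range filter like this can let through rows that do not actually match when the build side contains several spread-out keys, so the join itself still has to compare keys; the filter only reduces how much data reaches it.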
+
+More information can be found in the respective
+[pull request](https://github.com/apache/datafusion/pull/16445). The next step will be to
+[extend the dynamic filters to other types of joins](https://github.com/apache/datafusion/issues/16973), such as `LEFT` and
+`RIGHT` outer joins.
+
+### Parquet Metadata Cache
+
+The metadata of Parquet files (statistics, page indexes, etc.) is now
+automatically cached when using the built-in [ListingTable], which reduces disk/network round-trips and repeated decoding
+of the same information. With a simple microbenchmark that executes point reads
+(e.g., `SELECT v FROM t WHERE k = x`) over large files, we measured a 12x
+improvement in execution time (more details can be found in the respective
+[pull request](https://github.com/apache/datafusion/pull/16971/)). This optimization is production-ready and is enabled by default
+(more details in the [Epic](https://github.com/apache/datafusion/issues/17000)).
+
+[ListingTable]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html
+
+
+Here is an example of the metadata cache in action:
+
+```sql
+-- disabling the metadata cache
+> SET datafusion.runtime.metadata_cache_limit = '0M';
+
+-- simple query (t.parquet: 100M rows, 3 cols)
+> EXPLAIN ANALYZE SELECT * FROM 't.parquet' LIMIT 1;
+DataSourceExec: ... metrics=[..., metadata_load_time=229.196422ms, ...]
+Elapsed 0.246 seconds.
+
+-- enabling the metadata cache
+> SET datafusion.runtime.metadata_cache_limit = '50M';
+
+> EXPLAIN ANALYZE SELECT * FROM 't.parquet' LIMIT 1;
+DataSourceExec: ... metrics=[..., metadata_load_time=228.612µs, ...]
+Elapsed 0.003 seconds. -- 82x improvement in this specific query
+```
+
+
+The cache can be configured with the following runtime parameter:
+
+```sql
+datafusion.runtime.metadata_cache_limit
+```
+
+The default [`FileMetadataCache`](https://docs.rs/datafusion/latest/datafusion/execution/cache/cache_manager/trait.FileMetadataCache.html) uses a
+least-recently-used eviction algorithm and up to 50MB of memory.
+If the underlying file changes, the cache is automatically invalidated.
+Setting the limit to 0 will disable any metadata caching. As with most APIs in
+DataFusion, users can provide their own behavior using a custom
+[`FileMetadataCache`](https://docs.rs/datafusion/50.0.0/datafusion/execution/cache/cache_manager/trait.FileMetadataCache.html)
+implementation when setting up the [`RuntimeEnv`](https://docs.rs/datafusion/latest/datafusion/execution/runtime_env/struct.RuntimeEnv.html).
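
The behavior described above (least-recently-used eviction, a memory limit, invalidation when the file changes, and a limit of 0 disabling caching) can be modeled with a short Python sketch. This is a toy model, not the `FileMetadataCache` trait or its default implementation: the class name `MetadataCacheSketch` is hypothetical, and using the file's modification time to detect changes is an assumption made for the example.

```python
from collections import OrderedDict
from typing import Optional, Tuple


class MetadataCacheSketch:
    """Toy LRU cache keyed by path, bounded by total metadata bytes."""

    def __init__(self, limit_bytes: int) -> None:
        self.limit_bytes = limit_bytes
        self.used_bytes = 0
        # path -> (last_modified, metadata); least recently used comes first
        self.entries: "OrderedDict[str, Tuple[float, bytes]]" = OrderedDict()

    def get(self, path: str, last_modified: float) -> Optional[bytes]:
        entry = self.entries.get(path)
        if entry is None:
            return None
        cached_modified, metadata = entry
        if cached_modified != last_modified:
            # Assumption: a changed modification time means the file changed,
            # so the stale entry is dropped.
            self._remove(path)
            return None
        self.entries.move_to_end(path)  # mark as most recently used
        return metadata

    def put(self, path: str, last_modified: float, metadata: bytes) -> None:
        if path in self.entries:
            self._remove(path)
        if len(metadata) > self.limit_bytes:
            return  # does not fit; with a limit of 0 nothing is ever cached
        self.entries[path] = (last_modified, metadata)
        self.used_bytes += len(metadata)
        while self.used_bytes > self.limit_bytes:
            oldest = next(iter(self.entries))  # least recently used entry
            self._remove(oldest)

    def _remove(self, path: str) -> None:
        _, metadata = self.entries.pop(path)
        self.used_bytes -= len(metadata)


cache = MetadataCacheSketch(limit_bytes=10)
cache.put("a.parquet", 1.0, b"aaaa")
cache.put("b.parquet", 1.0, b"bbbb")
cache.get("a.parquet", 1.0)           # touch "a" so "b" becomes the oldest
cache.put("c.parquet", 1.0, b"cccc")  # 12 bytes exceed the limit: "b" is evicted
```

Bounding the cache by bytes rather than by entry count mirrors the size-based limit described above, since Parquet metadata size varies widely between files.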
+
+
+For users with a custom [`TableProvider`](https://docs.rs/datafusion/latest/datafusion/catalog/trait.TableProvider.html):
+
+* If the custom provider uses the
+[`ParquetFormat`](https://docs.rs/datafusion/latest/datafusion/datasource/file_format/parquet/struct.ParquetFormat.html), caching will work
+without any changes.
+
+* Otherwise, the
+[`CachedParquetFileReaderFactory`](https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.CachedParquetFileReaderFactory.html)
+can be provided when creating a
+[`ParquetSource`](https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.ParquetSource.html).
+
+Users can inspect the cache contents through the
+[`FileMetadataCache::list_entries`](https://docs.rs/datafusion/latest/datafusion/execution/cache/cache_manager/trait.FileMetadataCache.html#tymethod.list_entries)
+method, or with the
+[`metadata_cache()`](https://datafusion.apache.org/user-guide/cli/functions.html#metadata-cache)
+function in `datafusion-cli`:
+
+
+```sql
+> SELECT * FROM metadata_cache();
++---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
+| path          | file_modified           | file_size_bytes | e_tag                    | version | metadata_size_bytes | hits | extra           |
++---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
+| .../t.parquet | 2025-09-21T17:40:13.650 | 420827020       | 0-63f5331fb4458-19154f8c | NULL    | 44480534            | 27   | page_index=true |
++---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
+1 row(s) fetched.
+Elapsed 0.003 seconds.
+```
+
+
+### `QUALIFY` Clause
+
+DataFusion now supports the `QUALIFY` SQL clause
+([#16933](https://github.com/apache/datafusion/pull/16933)), which simplifies
+filtering window function output (similar to how `HAVING` filters
+aggregation output).
+
+For example, filtering the output of the `rank()` function previously
+required a query like this:
+
+```sql
+SELECT a, b, c
+FROM (
+   SELECT a, b, c, rank() OVER(PARTITION BY a ORDER BY b) as rk
+   FROM t
+)
+WHERE rk = 1
+```
+
+The same query can now be written like this:
+```sql
+SELECT a, b, c, rank() OVER(PARTITION BY a ORDER BY b) as rk
+FROM t
+QUALIFY rk = 1
+```
+
+Although it is not part of the SQL standard (yet), it has been gaining
+adoption in several SQL analytical systems such as DuckDB, Snowflake, and
+BigQuery.
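
To spell out what `QUALIFY rk = 1` computes, here is a small Python model of the two steps involved: first compute `rank()` within each partition, then filter on the result. The function name `rank_within_partitions` and the sample rows are hypothetical; this models the semantics only and says nothing about how DataFusion plans or executes the clause.

```python
from itertools import groupby
from typing import Dict, List


def rank_within_partitions(
    rows: List[Dict], partition_by: str, order_by: str
) -> List[Dict]:
    """Add an `rk` field, like rank() OVER (PARTITION BY ... ORDER BY ...)."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[partition_by], r[order_by]))
    for _, group in groupby(ordered, key=lambda r: r[partition_by]):
        rank, previous = 0, object()  # sentinel that equals no real value
        for position, row in enumerate(group, start=1):
            if row[order_by] != previous:
                rank = position  # ties share a rank; gaps follow, as with rank()
                previous = row[order_by]
            out.append({**row, "rk": rank})
    return out


# Hypothetical contents of table t.
t = [
    {"a": "x", "b": 2, "c": "x2"},
    {"a": "x", "b": 1, "c": "x1"},
    {"a": "y", "b": 5, "c": "y5"},
    {"a": "y", "b": 5, "c": "y5-tie"},
    {"a": "y", "b": 6, "c": "y6"},
]

ranked = rank_within_partitions(t, "a", "b")
qualified = [r for r in ranked if r["rk"] == 1]  # the QUALIFY rk = 1 step
```

Note that the filter runs after the window function has been evaluated over all rows, which is why a plain `WHERE rk = 1` in the same query block cannot express it.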
+
+### `FILTER` Support for Window Functions
+
+Continuing the theme, the `FILTER` clause has been extended to support
+[aggregate window functions](https://github.com/apache/datafusion/pull/17378).
+This allows these functions to apply to specific rows without having to

Review Comment:
   Here "functions" refers to the aggregates, so I changed "This" to "It".



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

