adriangb commented on code in PR #115:
URL: https://github.com/apache/datafusion-site/pull/115#discussion_r2373293807
##########
content/blog/2025-09-24-datafusion-50.0.0.md:
##########
@@ -0,0 +1,389 @@
+---
+layout: post
+title: Apache DataFusion 50.0.0 Released
+date: 2025-09-24
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[TOC]
+
+<!-- see https://github.com/apache/datafusion/issues/16347 for details -->
+
+## Introduction
+
+We are proud to announce the release of [DataFusion 50.0.0]. This blog post
+highlights some of the major improvements since the release of [DataFusion
+49.0.0]. The complete list of changes is available in the [changelog].
+
+[DataFusion 50.0.0]: https://crates.io/crates/datafusion/50.0.0
+[DataFusion 49.0.0]: https://datafusion.apache.org/blog/2025/07/28/datafusion-49.0.0/
+[changelog]: https://github.com/apache/datafusion/blob/branch-50/dev/changelog/50.0.0.md
+
+
+## Performance Improvements 🚀
+
+> **📝TODO** *Update chart*
+
+DataFusion continues to focus on enhancing performance, as shown in the
+ClickBench and other results.
+
+<img src="/blog/images/datafusion-49.0.0/performance_over_time_clickbench.png"
+ width="100%" class="img-responsive" alt="ClickBench performance results over
+ time for DataFusion" />
+
+**Figure 1**: ClickBench performance improvements over time. Average and median
+normalized query execution times for ClickBench queries for each git revision.
+Query times are normalized using the ClickBench definition. Data and definitions
+are available on the [DataFusion Benchmarking
+Page](https://alamb.github.io/datafusion-benchmarking/).
+
+Here are some noteworthy optimizations added since DataFusion 49:
+
+**Dynamic Filter Pushdown Improvements**
+
+The dynamic filter pushdown optimization, which allows runtime filters to cut
+down on the amount of data read, has been extended to support **inner hash
+joins**. This optimization dramatically improves the performance of inner joins
+when one of the relations is relatively small or filtered by a highly selective
+predicate. Consider the following example:
+
+```sql
+-- retrieve the orders of the customer with c_phone = '25-989-741-2988'
+SELECT *
+FROM customer
+JOIN orders on c_custkey = o_custkey
+WHERE c_phone = '25-989-741-2988';
+```
+
+Previously, the entire `orders` relation would be scanned to join with the
+target customer. Now the dynamic filter pushdown can filter it right at the
+source, keeping the amount of data loaded to a minimum. The result is an order
+of magnitude faster execution time. This
+[article](https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/) goes
+into more detail about the dynamic filter pushdown optimization in DataFusion.
+
+The dynamic filter pushdown optimization in the TopK operator has also been
+improved in DataFusion 50.0.0, ensuring that the filters used are as selective
+as possible. You can read more about it in this
+[ticket](https://github.com/apache/datafusion/pull/16433).
+
+The next step will be to [extend the dynamic filters to other types of
+joins](https://github.com/apache/datafusion/issues/16973), such as left and
+right ones.
+
+**Nested Loop Optimization**
+
+The nested loop join has been rewritten to reduce execution time and memory
+usage by adopting a finer-grained approach. Specifically, we now limit the
+intermediate data size to around a single `RecordBatch` for better memory
+efficiency, and we have eliminated redundant conversions from the old
+implementation to further improve execution speed.
+
+When evaluating this new approach in a microbenchmark, we measured up to 5x
+improvements in execution time and 99% less memory usage. More details and
+results can be found in this
+[ticket](https://github.com/apache/datafusion/pull/16996).
+
+**Parquet Metadata Caching**
+
+The metadata of Parquet files, such as min/max statistics and page indexes, is
+now cached to avoid unnecessary disk/network round-trips. This is especially
+useful with multiple small reads over relatively large files, allowing us to
+achieve an order of magnitude faster execution time. More information can be
+found in the [Parquet Metadata Cache](#parquet-metadata-cache) section.
+
+## Community Growth 📈
+
+In the last month and a half, between `49.0.0` and `50.0.0`, we have seen our
+community grow:
+
+1. New PMC members and committers: **📝TODO** joined the PMC. **📝TODO** joined
+ as committers. See the [mailing list] for more details.
+2. In the [core DataFusion repo] alone, we reviewed and accepted 318 PRs
+ from 79 different contributors, created over 235 issues, and closed 197 of
+ them 🚀. All changes are listed in the detailed [changelogs].
+3. DataFusion published *[Using External Indexes, Metadata Stores, Catalogs and
+ Caches to Accelerate Queries on Apache Parquet]* and *[Dynamic Filters:
+ Passing Information Between Operators During Execution for 25x Faster
+ Queries]*, which detail several substantial performance optimizations.
+
+<!--
+# Unique committers
+$ git shortlog -sn 49.0.0..50.0.0 . | wc -l
+ 79
+# commits
+$ git log --pretty=oneline 49.0.0..50.0.0 . | wc -l
+ 318
+
+https://crates.io/crates/datafusion/49.0.0
+DataFusion 49 released July 25, 2025
+
+https://crates.io/crates/datafusion/50.0.0
+DataFusion 50 released September 16, 2025
+
+Issues created in this time: 117 open, 118 closed = 235 total
+https://github.com/apache/datafusion/issues?q=is%3Aissue+created%3A2025-07-25..2025-09-16
+
+Issues closed: 197
+https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2025-07-25..2025-09-16
+
+PRs merged in this time 371
+https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2025-07-25..2025-09-16
+-->
+
+
+[core DataFusion repo]: https://github.com/apache/arrow-datafusion
+[changelogs]: https://github.com/apache/datafusion/tree/main/dev/changelog
+[mailing list]: https://lists.apache.org/[email protected]
+[Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Apache Parquet]: https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/
+[Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/
+
+## New Features ✨
+
+### Spilling Sorts
+
+Sorting larger-than-memory data is now mostly solved in DataFusion 50.0.0,
+thanks to the recent introduction of multi-level merge sorts (more details in
+the respective [ticket](https://github.com/apache/datafusion/pull/15700)).
+By spilling to disk, DataFusion can now execute more queries that would
+otherwise fail with *out-of-memory* errors.
+
+### Dynamic Filter Pushdown For Hash Joins
+
+The [dynamic filter pushdown
+optimization](https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/)
+has been extended to inner hash joins, dramatically reducing the amount of
+scanned data in some workloads. More information can be found in the respective
+[ticket](https://github.com/apache/datafusion/pull/16445). This technique is
+also sometimes referred to as [*Sideways information
+passing*](https://www.cs.cmu.edu/~15721-f24/papers/Sideways_Information_Passing.pdf).
+
+These filters are automatically applied to inner hash joins, and future work
+will aim to extend them to other join types. They can be toggled with the
+following setting:
+
+```sql
+datafusion.optimizer.enable_dynamic_filter_pushdown
+```
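+
+As a minimal sketch, assuming the option is changed per session through
+DataFusion's SQL `SET` statement (for example from `datafusion-cli`), the
+optimization could be switched off to compare plans and then switched back on:
+
+```sql
+-- Assumption: run in a session that accepts SET statements.
+-- Disable dynamic filter pushdown, e.g. to compare plans or timings.
+SET datafusion.optimizer.enable_dynamic_filter_pushdown = false;
+
+-- Re-enable it afterwards.
+SET datafusion.optimizer.enable_dynamic_filter_pushdown = true;
+```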
+
+The following example shows how execution plans look in DataFusion 50.0.0 with
+this optimization:
+
+```sql
+EXPLAIN ANALYZE
+SELECT *
+FROM customer
+JOIN orders on c_custkey = o_custkey
+WHERE c_phone = '25-989-741-2988';
+
+-- plan excerpt
+HashJoinExec
+ DataSourceExec:
+ predicate=c_phone@4 = 25-989-741-2988
+ metrics=[output_rows=1, ...]
+ DataSourceExec:
+ -- dynamic filter is added here, filtering directly at scan time
+ predicate=DynamicFilterPhysicalExpr [ o_custkey@1 >= 1 AND o_custkey@1 <= 1 ]
+ -- the number of output rows is kept to a minimum
+ metrics=[output_rows=11, ...]
+```
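+
+To build intuition, the effect on the `orders` scan is roughly what one would
+get by adding the join-key bounds to the query by hand. The query below is only
+an illustrative approximation that reuses the bounds from the plan excerpt
+above. DataFusion derives the actual filter at runtime from the build side of
+the join, so no query rewrite is required:
+
+```sql
+-- Hand-written approximation of what the dynamic filter achieves.
+-- The bounds (1 and 1) are copied from the plan excerpt; in practice they are
+-- only known once the matching customer rows have been read.
+SELECT *
+FROM customer
+JOIN orders ON c_custkey = o_custkey
+WHERE c_phone = '25-989-741-2988'
+  AND o_custkey >= 1 AND o_custkey <= 1;
+```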
+
+### Parquet Metadata Cache
+
+The metadata of Parquet files (statistics, page indexes, ...) is now
+automatically cached to reduce disk/network round-trips and repeated decodings
+of the same information. With a simple microbenchmark that executes point reads
+(e.g., `SELECT v FROM t WHERE k = x`) over large files, we measured a 12x
+improvement in execution time (more details can be found in the respective
+[ticket](https://github.com/apache/datafusion/pull/16971/)). Further work was
+done to make this optimization production-ready, such as making the cache limit
+configurable. More details can be found in this
+[Epic](https://github.com/apache/datafusion/issues/17000).
+
+The cache can be configured with the following runtime parameter:
+
+```sql
+datafusion.runtime.metadata_cache_limit
+```
+
+By default, it uses up to 50MB of memory. Setting the limit to 0 will disable
+any metadata caching. The default `FileMetadataCache` implementation uses a
+*least-recently-used* (LRU) eviction algorithm. If needed, users can supply a custom
+[`FileMetadataCache`](https://docs.rs/datafusion/50.0.0/datafusion/execution/cache/cache_manager/trait.FileMetadataCache.html)
+implementation when setting up the `RuntimeEnv`.
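+
+As a hedged sketch, assuming runtime parameters can be set per session with a
+SQL `SET` statement and that the limit accepts a size string such as `'100M'`
+(both are assumptions not stated in this post), the limit could be raised like
+this:
+
+```sql
+-- Assumption: size strings with a unit suffix (e.g. 'M') are accepted.
+-- Raise the metadata cache limit above the 50MB default.
+SET datafusion.runtime.metadata_cache_limit = '100M';
+```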
Review Comment:
Does this only work by default with `ListingTableProvider`? If I have a
custom `TableProvider` is there documentation on how to use this?