adriangb commented on code in PR #115:
URL: https://github.com/apache/datafusion-site/pull/115#discussion_r2373291374
##########
content/blog/2025-09-24-datafusion-50.0.0.md:
##########
@@ -0,0 +1,389 @@
+---
+layout: post
+title: Apache DataFusion 50.0.0 Released
+date: 2025-09-24
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[TOC]
+
+<!-- see https://github.com/apache/datafusion/issues/16347 for details -->
+
+## Introduction
+
+We are proud to announce the release of [DataFusion 50.0.0]. This blog post
+highlights some of the major improvements since the release of [DataFusion
+49.0.0]. The complete list of changes is available in the [changelog].
+
+[DataFusion 50.0.0]: https://crates.io/crates/datafusion/50.0.0
+[DataFusion 49.0.0]: https://datafusion.apache.org/blog/2025/07/28/datafusion-49.0.0/
+[changelog]: https://github.com/apache/datafusion/blob/branch-50/dev/changelog/50.0.0.md
+
+
+## Performance Improvements 🚀
+
+> **📝TODO** *Update chart*
+
+DataFusion continues to focus on enhancing performance, as shown in the
+ClickBench and other results.
+
+<img src="/blog/images/datafusion-49.0.0/performance_over_time_clickbench.png"
+ width="100%" class="img-responsive" alt="ClickBench performance results over
+ time for DataFusion" />
+
+**Figure 1**: ClickBench performance improvements over time. Average and median
+normalized query execution times for ClickBench queries for each git revision.
+Query times are normalized using the ClickBench definition. Data and definitions
+are available on the [DataFusion Benchmarking
+Page](https://alamb.github.io/datafusion-benchmarking/).
+
+Here are some noteworthy optimizations added since DataFusion 49:
+
+**Dynamic Filter Pushdown Improvements**
+
+The dynamic filter pushdown optimization, which allows runtime filters to cut
+down on the amount of data read, has been extended to support **inner hash
+joins**. This optimization dramatically improves the performance of inner joins
+when one of the relations is relatively small or is filtered by a highly
+selective predicate. Consider the following example:
+
+```sql
+-- retrieve the orders of the customer with c_phone = '25-989-741-2988'
+SELECT *
+FROM customer
+JOIN orders on c_custkey = o_custkey
+WHERE c_phone = '25-989-741-2988';
+```
+
+While previously the entire `orders` relation would be scanned to join with the
+target customer, now the dynamic filter pushdown can filter it right at the
+source, keeping the data loaded at a minimum. The result is an order of
+magnitude faster execution time. This
+[article](https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/) goes
+into more detail about the dynamic filter pushdown optimization in DataFusion.
+
+The dynamic filter pushdown optimization in the TopK operator has also been
+improved in DataFusion 50.0.0, ensuring that the filters used are as selective
+as possible. You can read more about it in this
+[ticket](https://github.com/apache/datafusion/pull/16433).
+
+The next step will be to [extend the dynamic filters to other types of
+joins](https://github.com/apache/datafusion/issues/16973), such as left and
+right ones.
+
+**Nested Loop Optimization**
+
+The nested loop join has been rewritten to reduce execution time and memory
+usage by adopting a finer-grained approach. Specifically, we now limit the
+intermediate data size to around a single `RecordBatch` for better memory
+efficiency, and we have eliminated redundant conversions from the old
+implementation to further improve execution speed.
+
+When evaluating this new approach in a microbenchmark, we have measured up to 5x
+improvements in execution time and 99% less memory usage. More details and
+results can be found in this
+[ticket](https://github.com/apache/datafusion/pull/16996).
+
+**Parquet Metadata Caching**
+
+The metadata of Parquet files, such as min/max statistics and page indexes, is
+now cached to avoid unnecessary disk/network round-trips. This is especially
+useful with multiple small reads over relatively large files, allowing us to
+achieve an order of magnitude faster execution time. More information can be
+found in the [Parquet Metadata Cache](#parquet-metadata-cache) section.
+
+## Community Growth 📈
+
+In the last month and a half, between `49.0.0` and `50.0.0`, we have seen our
+community grow:
+
+1. New PMC members and committers: **📝TODO** joined the PMC. **📝TODO** joined
+ as committers. See the [mailing list] for more details.
+2. In the [core DataFusion repo] alone, we reviewed and accepted 318 PRs
+ from 79 different contributors, created over 235 issues, and closed 197 of them
+ 🚀. All changes are listed in the detailed [changelogs].
+3. DataFusion published *[Using External Indexes, Metadata Stores, Catalogs and
+ Caches to Accelerate Queries on Apache Parquet]* and *[Dynamic Filters:
+ Passing Information Between Operators During Execution for 25x Faster
+ Queries]*, which detail several substantial performance optimizations.
+
+<!--
+# Unique committers
+$ git shortlog -sn 49.0.0..50.0.0 . | wc -l
+ 79
+# commits
+$ git log --pretty=oneline 49.0.0..50.0.0 . | wc -l
+ 318
+
+https://crates.io/crates/datafusion/49.0.0
+DataFusion 49 released July 25, 2025
+
+https://crates.io/crates/datafusion/50.0.0
+DataFusion 50 released September 16, 2025
+
+Issues created in this time: 117 open, 118 closed = 235 total
+https://github.com/apache/datafusion/issues?q=is%3Aissue+created%3A2025-07-25..2025-09-16
+
+Issues closed: 197
+https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2025-07-25..2025-09-16
+
+PRs merged in this time 371
+https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2025-07-25..2025-09-16
+-->
+
+
+[core DataFusion repo]: https://github.com/apache/arrow-datafusion
+[changelogs]: https://github.com/apache/datafusion/tree/main/dev/changelog
+[mailing list]: https://lists.apache.org/[email protected]
+[Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Apache Parquet]: https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/
+[Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/
+
+## New Features ✨
+
+### Spilling Sorts
+
+DataFusion 50.0.0 now handles most larger-than-memory sorts, thanks to the
+recent introduction of multi-level merge sorts (more details in the respective
+[ticket](https://github.com/apache/datafusion/pull/15700)). Queries that would
+otherwise fail with *out-of-memory* errors can now complete by spilling
+intermediate data to disk.
+
+### Dynamic Filter Pushdown For Hash Joins
+
+The [dynamic filter pushdown
+optimization](https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/)
+has been extended to inner hash joins, dramatically reducing the amount of
+scanned data in some workloads. More information can be found in the respective
+[ticket](https://github.com/apache/datafusion/pull/16445). This technique is
+also sometimes referred to as [*Sideways information
+passing*](https://www.cs.cmu.edu/~15721-f24/papers/Sideways_Information_Passing.pdf).
+
+These filters are automatically applied to inner hash joins, while future work
+will aim to extend them to other join types. They can be toggled with the
+following setting:
+
+```sql
+datafusion.optimizer.enable_dynamic_filter_pushdown
+```
Review Comment:
```suggestion
```sql
SET datafusion.optimizer.enable_dynamic_filter_pushdown = true;
```
This is set to `true` by default and there is generally no reason to disable
it unless you run into issues or want to compare performance across versions.
```
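One way to see what this setting does is to run the join from the post with the filter disabled and then enabled, and compare the `EXPLAIN ANALYZE` output. The following is only a sketch: it assumes TPC-H style `customer` and `orders` tables are already registered in the session (registration is not shown), and it reuses the query and setting name quoted above.

```sql
-- Assumption: `customer` and `orders` are already registered tables.

-- Baseline: turn dynamic filter pushdown off and inspect the executed plan.
SET datafusion.optimizer.enable_dynamic_filter_pushdown = false;
EXPLAIN ANALYZE
SELECT *
FROM customer
JOIN orders ON c_custkey = o_custkey
WHERE c_phone = '25-989-741-2988';

-- Restore the default and run the same query again for comparison.
SET datafusion.optimizer.enable_dynamic_filter_pushdown = true;
EXPLAIN ANALYZE
SELECT *
FROM customer
JOIN orders ON c_custkey = o_custkey
WHERE c_phone = '25-989-741-2988';
```

Comparing the per-operator metrics for the `orders` scan in the two outputs should show whether the dynamic filter reduced the data read. The exact metric names and plan shape depend on the DataFusion version and the data source.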
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]