alamb commented on code in PR #115:
URL: https://github.com/apache/datafusion-site/pull/115#discussion_r2388026649
##########
content/blog/2025-09-29-datafusion-50.0.0.md:
##########
@@ -0,0 +1,464 @@
+---
+layout: post
+title: Apache DataFusion 50.0.0 Released
+date: 2025-09-29
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[TOC]
+
+<!-- see https://github.com/apache/datafusion/issues/16347 for details -->
+
+## Introduction
+
+We are proud to announce the release of [DataFusion 50.0.0]. This blog post
+highlights some of the major improvements since the release of [DataFusion
+49.0.0]. The complete list of changes is available in the [changelog].
+Thanks to [numerous contributors] for making this release possible!
+
+[DataFusion 50.0.0]: https://crates.io/crates/datafusion/50.0.0
+[DataFusion 49.0.0]: https://datafusion.apache.org/blog/2025/07/28/datafusion-49.0.0/
+[changelog]: https://github.com/apache/datafusion/blob/branch-50/dev/changelog/50.0.0.md
+[numerous contributors]: https://github.com/apache/datafusion/blob/branch-50/dev/changelog/50.0.0.md#credits
+
+
+## Performance Improvements 🚀
+
+DataFusion continues to focus on enhancing performance, as shown in ClickBench
+and other benchmark results.
+
+<img src="/blog/images/datafusion-50.0.0/performance_over_time_clickbench.png"
 width="100%" class="img-responsive" alt="ClickBench performance results over time for DataFusion" />
+
+**Figure 1**: Average and median normalized query execution times for ClickBench queries for each git revision.
+Query times are normalized using the ClickBench definition. See the
+[DataFusion Benchmarking Page](https://alamb.github.io/datafusion-benchmarking/)
+for more details.
+
+Here are some noteworthy optimizations added since DataFusion 49:
+
+**Dynamic Filter Pushdown Improvements**
+
+The dynamic filter pushdown optimization, which allows runtime filters to cut
+down on the amount of data read, has been extended to support **inner hash
+joins**, dramatically improving performance when one relation is relatively
+small or filtered by a highly selective predicate. More details can be found in
+the [Dynamic Filter Pushdown for Hash Joins](#dynamic-filter-pushdown-for-hash-joins) section below.
+The dynamic filters in the TopK operator have also been improved in DataFusion
+50.0.0, further increasing the effectiveness and efficiency of the optimization.
+More details can be found in this
+[pull request](https://github.com/apache/datafusion/pull/16433).
+
+**Nested Loop Join Optimization**
+
+The nested loop join operator has been rewritten to reduce execution time and memory
+usage by adopting a finer-grained approach. Specifically, we now limit the
+intermediate data size to around a single `RecordBatch` for better memory
+efficiency, and we have eliminated redundant conversions from the old
+implementation to further improve execution speed.
+When evaluating this new approach in a microbenchmark, we measured up to a 5x
+improvement in execution time and a 99% reduction in memory usage. More details and
+results can be found in this
+[pull request](https://github.com/apache/datafusion/pull/16996).
+
+**Parquet Metadata Caching**
+
+DataFusion now automatically caches the metadata of Parquet files (statistics,
+page indexes, etc.) to avoid unnecessary disk/network round-trips. This is
+especially useful when querying the same table multiple times over relatively
+slow networks, allowing us to achieve an order of magnitude faster execution
+time when running many small reads over large files. More information can be
+found in the [Parquet Metadata Cache](#parquet-metadata-cache) section.
+
+## Community Growth 📈
+
+Between `49.0.0` and `50.0.0`, we continue to see our community grow:
+
+1. Qi Zhu ([zhuqi-lucas](https://github.com/zhuqi-lucas)) and Yoav Cohen
+ ([yoavcloud](https://github.com/yoavcloud)) became committers. See the
+ [mailing list] for more details.
+2. In the [core DataFusion repo] alone, we reviewed and accepted 318 PRs
 from 79 different committers, created over 235 issues, and closed 197 of them
+ 🚀. All changes are listed in the detailed [changelogs].
+3. DataFusion published several blog posts, including *[Using External Indexes, Metadata Stores, Catalogs and
+ Caches to Accelerate Queries on Apache Parquet]*, *[Dynamic Filters:
+ Passing Information Between Operators During Execution for 25x Faster
+ Queries]*, and *[Implementing User Defined Types and Custom Metadata
+ in DataFusion]*.
+
+<!--
+# Unique committers
+$ git shortlog -sn 49.0.0..50.0.0 . | wc -l
+ 79
+# commits
+$ git log --pretty=oneline 49.0.0..50.0.0 . | wc -l
+ 318
+
+https://crates.io/crates/datafusion/49.0.0
+DataFusion 49 released July 25, 2025
+
+https://crates.io/crates/datafusion/50.0.0
+DataFusion 50 released September 16, 2025
+
+Issues created in this time: 117 open, 118 closed = 235 total
+https://github.com/apache/datafusion/issues?q=is%3Aissue+created%3A2025-07-25..2025-09-16
+
+Issues closed: 197
+https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2025-07-25..2025-09-16
+
+PRs merged in this time 371
+https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2025-07-25..2025-09-16
+-->
+
+
+[core DataFusion repo]: https://github.com/apache/arrow-datafusion
+[changelogs]: https://github.com/apache/datafusion/tree/main/dev/changelog
+[mailing list]: https://lists.apache.org/[email protected]
+[Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Apache Parquet]: https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/
+[Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/
+[Implementing User Defined Types and Custom Metadata in DataFusion]: https://datafusion.apache.org/blog/2025/09/21/custom-types-using-metadata/
+
+## New Features ✨
+
+### Improved Spilling Sorts for Larger-than-Memory Datasets
+
+DataFusion has long been able to sort datasets that do not fit entirely in memory,
+but still struggled with particularly large inputs or highly memory-constrained
+setups. Larger-than-memory sorts in DataFusion 50.0.0 have been improved with the recent introduction
+of multi-level merge sorts (more details in the respective
+[pull request](https://github.com/apache/datafusion/pull/15700)). It is now
+possible to execute almost any sorting query that would have previously triggered *out-of-memory*
+errors, by relying on disk spilling. Thanks to [Raz Luvaton], [Yongting You], and
Review Comment:
fyi @rluvaton, @2010YOUY01 and @ding-young as you are mentioned here