Re: [PR] DataFusion 47.0.0 blog post [datafusion-site]

via GitHub Wed, 09 Jul 2025 10:40:36 -0700


kevinjqliu commented on code in PR #83:
URL: https://github.com/apache/datafusion-site/pull/83#discussion_r2195627650



##########
content/blog/2025-07-10-datafusion-47.0.0.md:
##########
@@ -0,0 +1,256 @@
+---
+layout: post
+title: Apache DataFusion 47.0.0 Released
+date: 2025-07-10
+author: PMC
+categories: [ release ]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<!-- see https://github.com/apache/datafusion/issues/16347 for details -->
+
+We’re excited to announce the release of **Apache DataFusion 47.0.0**! This 
new version represents a significant
+milestone for the project, packing in a wide range of improvements and fixes. 
You can find the complete details in the
+full 
[changelog](https://github.com/apache/datafusion/blob/branch-47/dev/changelog/47.0.0.md).
 We’ll highlight the most
+important changes below and guide you through upgrading.
+
+Note that DataFusion 47.0.0 was released in April 2025, but we are only now 
publishing the blog post due to 
+limited bandwidth in the DataFusion community. We apologize for the delay and 
encourage you to come help us
+accelerate the next release and announcements 
+by [joining the 
community](https://datafusion.apache.org/contributor-guide/communication.html)  
🎣.
+
+## Breaking Changes
+
+DataFusion 47.0.0 brings a few **breaking changes** that may require 
adjustments to your code as described in
+the [Upgrade 
Guide](https://datafusion.apache.org/library-user-guide/upgrading.html#datafusion-47-0-0).
 Here are some notable ones:
+
+- [Upgrades to arrow-rs and arrow-parquet 55.0.0 and object_store 
0.12.0](https://github.com/apache/datafusion/pull/15466):
+  Several APIs changed in the underlying `arrow`, `parquet` and `object_store` 
libraries to use a `u64` instead of usize to better support
+  WASM. This requires converting from `usize` to `u64` occasionally as well as 
changes to ObjectStore implementations such as
+```Rust
+impl ObjectStore {
+    ...
+
+    // The range is now a u64 instead of usize
+    async fn get_range(&self, location: &Path, range: Range<u64>) -> 
ObjectStoreResult<Bytes> {
+        self.inner.get_range(location, range).await
+    }
+    
+    ...
+    
+    // the lifetime is now 'static instead of '_ (meaning the captured closure 
can't contain references)
+    // (this also applies to list_with_offset)
+    fn list(&self, prefix: Option<&Path>) -> BoxStream<'static, 
ObjectStoreResult<ObjectMeta>> {
+        self.inner.list(prefix)
+    }
+}
+```
+- 
[DisplayFormatType::TreeRender](https://github.com/apache/datafusion/issues/14914):
+  Implementations of `ExecutionPlan` must also provide a description in the 
`DisplayFormatType::TreeRender` format to
+  provide support for the new [tree style 
explains](https://datafusion.apache.org/user-guide/sql/explain.html#tree-format-default).
+  This can be the same as the existing `DisplayFormatType::Default`.
+
+## Performance Improvements
+
+DataFusion 47.0.0 comes with numerous performance enhancements across the 
board. Here are some of the noteworthy
+optimizations in this release:
+
+- **`FIRST_VALUE` and `LAST_VALUE`:**  `FIRST_VALUE` and `LAST_VALUE` 
functions execute much faster for data with high cardinality such as those with 
many groups or partitions. DataFusion 47.0.0 executes the following in **7 
seconds** compared to **36 seconds** in DataFusion 46.0.0: `select id2, id4, 
first_value(v1 order by id2, id4) as r2 from '~/h2o_100m.parquet' group by id2, 
id4` (h2o.ai dataset). (PR's 
[#15266](https://github.com/apache/datafusion/pull/15266)
+  and [#15542](https://github.com/apache/datafusion/pull/15542) by 
[UBarney](https://github.com/UBarney)).
+
+- **`MIN`, `MAX` and `AVG` for Durations:**  DataFusion executes aggregate 
queries up to 2.5x faster when they include `MIN`, `MAX` and `AVG` on 
`Duration` columns. 
+  (PRs [#15322](  https://github.com/apache/datafusion/pull/15322) and 
[#15748](https://github.com/apache/datafusion/pull/15748)
+  by [shruti2522](https://github.com/shruti2522)).
+
+- **Short circuit evaluation for `AND` and `OR`:** DataFusion now eagerly 
skips the evaluation of
+  the right operand if the left is known to be false (`AND`) or true (`OR`) in 
certain cases. For complex predicates, such as those with many `LIKE` or `CASE` 
expressions, this optimization results in
+  [significant performance 
improvements](https://github.com/apache/datafusion/issues/11212#issuecomment-2753584617)
 (up to 100x in extreme cases).
+  (PRs [#15462](https://github.com/apache/datafusion/pull/15462) and 
[#15694](https://github.com/apache/datafusion/pull/15694)
+  by [acking-you](https://github.com/acking-you)).
+
+- **TopK optimization for partially sorted input:** Previous versions of 
DataFusion implemented early termination
+  optimization (TopK) for fully sorted data. DataFusion 47.0.0 extends the 
optimization for partially sorted data, which is common in many real-world 
datasets, such as time-series data sorted by day but not within each day. 
+  (PR [#15563](https://github.com/apache/datafusion/pull/15563) by 
[geoffreyclaude](https://github.com/geoffreyclaude)).
+
+- **Disable re-validation of spilled files:** DataFusion no longer does 
unnecessary re-validation of temporary spill files. The validation is 
unnecessary and expensive as the data is known to be valid when it was written 
out
+  (PR [#15454](https://github.com/apache/datafusion/pull/15454) by 
[zebsme](https://github.com/zebsme)).
+
+## Highlighted New Features
+
+### Tree style explains
+
+In previous releases the [EXPLAIN statement] results in a formatted table
+which is succinct and contains important details for implementers, but was 
often hard to read
+especially with queries that included joins or unions having multiple children.
+
+[EXPLAIN statement]: https://datafusion.apache.org/user-guide/sql/explain.html
+
+DataFusion 47.0.0 includes the new `EXPLAIN FORMAT TREE` (default in
+`datafusion-cli`) rendered in a visual tree style that is much easier to 
quickly
+understand.
+
+<!-- SQL setup 
+create table t1(ti int) as values (1), (2), (3);
+create table t2(ti int) as values (1), (2), (3);
+-->
+
+Example of the new explain output:
+```sql
+> explain select * from t1 inner join t2 on t1.ti=t2.ti;
++---------------+------------------------------------------------------------+
+| plan_type     | plan                                                       |
++---------------+------------------------------------------------------------+
+| physical_plan | ┌───────────────────────────┐                              |
+|               | │    CoalesceBatchesExec    │                              |
+|               | │    --------------------   │                              |
+|               | │     target_batch_size:    │                              |
+|               | │            8192           │                              |
+|               | └─────────────┬─────────────┘                              |
+|               | ┌─────────────┴─────────────┐                              |
+|               | │        HashJoinExec       │                              |
+|               | │    --------------------   ├──────────────┐               |
+|               | │       on: (ti = ti)       │              │               |
+|               | └─────────────┬─────────────┘              │               |
+|               | ┌─────────────┴─────────────┐┌─────────────┴─────────────┐ |
+|               | │       DataSourceExec      ││       DataSourceExec      │ |
+|               | │    --------------------   ││    --------------------   │ |
+|               | │         bytes: 112        ││         bytes: 112        │ |
+|               | │       format: memory      ││       format: memory      │ |
+|               | │          rows: 1          ││          rows: 1          │ |
+|               | └───────────────────────────┘└───────────────────────────┘ |
+|               |                                                            |
++---------------+------------------------------------------------------------+
+```
+
+Example of the `EXPLAIN FORMAT INDENT` output for the same query
+```sql
+> explain format indent select * from t1 inner join t2 on t1.ti=t2.ti;
++---------------+----------------------------------------------------------------------+
+| plan_type     | plan                                                         
        |
++---------------+----------------------------------------------------------------------+
+| logical_plan  | Inner Join: t1.ti = t2.ti                                    
        |
+|               |   TableScan: t1 projection=[ti]                              
        |
+|               |   TableScan: t2 projection=[ti]                              
        |
+| physical_plan | CoalesceBatchesExec: target_batch_size=8192                  
        |
+|               |   HashJoinExec: mode=CollectLeft, join_type=Inner, 
on=[(ti@0, ti@0)] |
+|               |     DataSourceExec: partitions=1, partition_sizes=[1]        
        |
+|               |     DataSourceExec: partitions=1, partition_sizes=[1]        
        |
+|               |                                                              
        |
++---------------+----------------------------------------------------------------------+
+2 row(s) fetched.
+```
+
+Thanks to [irenjj](https://github.com/irenjj) for the initial work in PR 
[#14677](https://github.com/apache/datafusion/pull/14677)
+and many others for completing the [followup 
epic](https://github.com/apache/datafusion/issues/14914)
+
+### SQL `VARCHAR` defaults to Utf8View
+
+In previous releases when a column was created in SQL the column would be 
mapped to the [Utf8 Arrow data type]. In this release
+the SQL `varchar` columns will be mapped to the [Utf8View arrow data type] by 
default, which is a more efficient representation of UTF-8 strings in Arrow.
+
+[Utf8 Arrow data type]: 
https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Utf8
+[Utf8View arrow data type]: 
https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Utf8View)
+
+```sql
+create table foo(x varchar);
+0 row(s) fetched.
+
+> describe foo;
++-------------+-----------+-------------+
+| column_name | data_type | is_nullable |
++-------------+-----------+-------------+
+| x           | Utf8View  | YES         |
++-------------+-----------+-------------+
+```
+
+Previous versions of DataFusion used `Utf8View` when reading parquet files and 
it is faster in most cases.
+
+Thanks to [zhuqi-lucas](https://github.com/zhuqi-lucas) for PR 
[#15104](https://github.com/apache/datafusion/pull/15104)
+
+### Tracing context propagation in spawned tasks
+
+This release includes a general mechanism enabling DataFusion to propagate 
user-defined context (such as tracing spans,
+logging, or metrics) across thread boundaries without depending on any 
specific instrumentation library.
+
+In previous releases tasks spawned on new threads — such as those performing 
repartitioning or parquet file reads —
+would lose thread-local context, making instrumentation challenging for users. 
The introduced approach addresses this
+gap by allowing users to inject custom instrumentation via the new 
[JoinSetTracer] trait. This ensures context is preserved
+seamlessly while keeping DataFusion lightweight by not adding any direct 
instrumentation dependencies.
+
+[JoinSetTracer]: 
https://docs.rs/datafusion/latest/datafusion/common/runtime/trait.JoinSetTracer.html
+
+For example, you can implement a simple tracer that ensures any spawned task 
or blocking closure
+inherits the current span via `in_current_span`:
+
+```Rust
+/// A simple tracer that ensures any spawned task or blocking closure
+/// inherits the current span via `in_current_span`.
+struct SpanTracer;
+
+/// Implement the `JoinSetTracer` trait so we can inject instrumentation
+/// for both async futures and blocking closures.
+impl JoinSetTracer for SpanTracer {
+    /// Instruments a boxed future to run in the current span. The future's
+    /// return type is erased to `Box<dyn Any + Send>`, which we simply
+    /// run inside the `Span::current()` context.
+    fn trace_future(
+        &self,
+        fut: BoxFuture<'static, Box<dyn Any + Send>>,
+    ) -> BoxFuture<'static, Box<dyn Any + Send>> {
+        fut.in_current_span().boxed()
+    }
+
+    /// Instruments a boxed blocking closure by running it inside the
+    /// `Span::current()` context.
+    fn trace_block(
+        &self,
+        f: Box<dyn FnOnce() -> Box<dyn Any + Send> + Send>,
+    ) -> Box<dyn FnOnce() -> Box<dyn Any + Send> + Send> {
+        let span = Span::current();
+        Box::new(move || span.in_scope(f))
+    }
+}
+
+...
+set_join_set_tracer(&SpanTracer).expect("Failed to set tracer");
+...
+```
+An additional example showcasing how to use this new functionality can be 
found in the [DataFusion 
examples](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/tracing.rs).
+
+Thanks to [geoffreyclaude](https://github.com/geoffreyclaude) for PR 
[#14914](https://github.com/apache/datafusion/issues/14914)
+
+## Upgrade Guide and Changelog
+
+Upgrading to 47.0.0 should be straightforward for most users, but do review
+the [Upgrade Guide for DataFusion 
47.0.0](https://datafusion.apache.org/library-user-guide/upgrading.html) for 
detailed

Review Comment:
   ```suggestion
   the [Upgrade Guide for DataFusion 
47.0.0](https://datafusion.apache.org/library-user-guide/upgrading.html#datafusion-47-0-0)
 for detailed
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] DataFusion 47.0.0 blog post [datafusion-site]

Reply via email to