andygrove commented on code in PR #63: URL: https://github.com/apache/datafusion-site/pull/63#discussion_r2006506263
##########
content/blog/2025-03-20-datafusion-comet-0.7.0.md:
##########
@@ -0,0 +1,130 @@
+---
+layout: post
+title: Apache DataFusion Comet 0.7.0 Release
+date: 2025-03-20
+author: pmc
+categories: [subprojects]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+The Apache DataFusion PMC is pleased to announce version 0.7.0 of the [Comet](https://datafusion.apache.org/comet/) subproject.
+
+Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
+improved performance and efficiency without requiring any code changes.
+
+Comet runs on commodity hardware and aims to provide 100% compatibility with Apache Spark. Any operators or
+expressions that are not fully compatible will fall back to Spark unless explicitly enabled by the user. Refer
+to the [compatibility guide] for more information.
+
+[compatibility guide]: https://datafusion.apache.org/comet/user-guide/compatibility.html
+
+This release covers approximately four weeks of development work and is the result of merging 46 PRs from 11
+contributors. See the [change log] for more information.
+
+[change log]: https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.7.0.md
+
+## Release Highlights
+
+### Performance
+
+Comet 0.7.0 has improved performance compared to the previous release due to improvements in the native shuffle
+implementation and performance improvements in DataFusion 46.
+
+For single-node TPC-H at 100 GB, Comet now delivers a **2.2x speedup** compared to Spark using the same CPU and RAM. Even
+with **half the resources**, Comet still provides a measurable performance improvement.
+
+<img
+src="/blog/images/comet-0.7.0/performance.png"
+width="100%"
+class="img-responsive"
+alt="Chart showing TPC-H benchmark results for Comet 0.7.0"
+/>
+
+*These benchmarks were performed on a Linux workstation with PCIe 5, AMD 7950X CPU (16 cores), 128 GB RAM, and data
+stored locally in Parquet format on NVMe storage. Spark was running in Kubernetes with hard memory limits.*
+
+## Shuffle Improvements
+
+There are several improvements to shuffle in this release:
+
+- When running in off-heap mode (which is the recommended approach), Comet was using the wrong memory allocator
+  implementation for some types of shuffle operation, which could result in OOM rather than spilling to disk.
+- The number of spill files is drastically reduced. In previous releases, each instance of ShuffleMapTask could
+  potentially create a new spill file for each output partition each time that spill was invoked. Comet now creates
+  a maximum of one spill file per output partition per instance of ShuffleMapTask, which is appended to in subsequent
+  spills.
+- There was a flaw in the memory accounting which resulted in Comet requesting approximately twice the amount of
+  memory that was needed, resulting in premature spilling. This is now resolved.
+- The metric for number of spilled bytes is now accurate. It was previously reporting invalid information.
+
+## Improved Hash Join Performance
+
+When using the `spark.comet.exec.replaceSortMergeJoin` setting to replace sort-merge joins with hash joins, Comet
+will now do a better job of picking the optimal build side (thanks to [@hayman42] for suggesting this, and thanks to the
+[Apache Gluten(incubating)] project for the inspiration in implementing this feature).
+
+[@hayman42](https://github.com/hayman42)
+
+## Experimental Support for DataFusion’s DataSourceExec

Review Comment:
Thanks @comphead. I agree that mentioning `DataSourceExec` is too much technical detail for a user-facing blog post. I have pushed some changes to address this. PTAL when you can.