alamb commented on code in PR #74:
URL: https://github.com/apache/datafusion-site/pull/74#discussion_r2140516503


##########
content/blog/2025-06-15-optimizing-sql-dataframes-part-one.md:
##########
@@ -0,0 +1,249 @@
+---
+layout: post
+title: Optimizing SQL (and DataFrames) in DataFusion, Part 1: Query Optimization Overview
+date: 2025-06-15
+author: alamb, akurmustafa
+categories: [core]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+
+
+*Note: this blog was originally published [on the InfluxData blog](https://www.influxdata.com/blog/optimizing-sql-dataframes-part-one/)*
+
+
+## Introduction
+
+Sometimes Query Optimizers are seen as a sort of black magic, [“the most
+challenging problem in computer
+science,”](https://15799.courses.cs.cmu.edu/spring2025/) according to Father
+Pavlo, or some behind-the-scenes player. We believe this perception is because:
+
+
+1. One must implement the rest of a database system (data storage, transactions,
+   SQL parser, expression evaluation, plan execution, etc.) **before** the
+   optimizer becomes critical[^5].
+
+2. Some parts of the optimizer are tightly tied to the rest of the system (e.g.,
+   storage or indexes), so many classic optimizers are described with
+   system-specific terminology.
+
+3. Some optimizer tasks, such as access path selection and join order, are known
+   challenges and not yet solved (practically)—maybe they really do require
+   black magic 🤔.
+
+However, Query Optimizers are no more complicated in theory or practice than other parts of a database system, as we will argue in a series of posts:
+
+**Part 1 (this post):**
+
+* Review what a Query Optimizer is, what it does, and why you need one for SQL and DataFrames.
+* Describe how industrial Query Optimizers are structured and standard optimization classes.
+
+**Part 2:**
+
+* Describe the optimization categories with examples and pointers to implementations.
+* Describe [Apache DataFusion](https://datafusion.apache.org/)’s rationale and approach to query optimization, specifically for access path and join ordering.
+
+After reading these blogs, we hope people will use DataFusion to:
+
+1. Build their own system-specific optimizers.
+2. Perform practical academic research on optimization (especially researchers
+   working on new optimizations / join ordering—looking at you [CMU
+   15-799](https://15799.courses.cs.cmu.edu/spring2025/), next year).
+
+
+## Query Optimizer Background
+
+The key pitch for querying databases, and likely the key to the longevity of SQL
+(despite people’s love/hate relationship—see [SQL or Death? Seminar Series –
+Spring 2025](https://db.cs.cmu.edu/seminar2025/)), is that it disconnects the
+`WHAT` you want to compute from the `HOW` to do it. SQL is a *declarative*
+language—it describes what answers are desired rather than an *imperative*
+language such as Python, where you describe how to do the computation as shown
+in Figure 1.
+
+<img src="/blog/images/optimizing-sql-dataframes/query-execution.png" width="80%" class="img-responsive" alt="Fig 1: Query Execution."/>

Review Comment:
   I think the difference / confusion is that I was using the term "Query
   Planner" to mean "translate from a SQL parse tree to an initial LogicalPlan".
   From that perspective, I think of the DataFrame API as directly building a
   LogicalPlan (and thus not needing a "planner"). However, I can see how the
   use of the word "planner" is confusing, so I pushed a commit (ee8b460) to
   try to clarify it.
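
The distinction drawn in this comment can be sketched with a toy model. This is only an illustration under stated assumptions: the `Scan`, `Filter`, and `Projection` nodes, the `plan_sql` function, and the `DataFrame` class are all hypothetical names invented for this sketch and are not DataFusion's API. The SQL path needs a translation step from query text to a plan, while the DataFrame path builds the same plan directly, one method call at a time.

```python
from dataclasses import dataclass


# Hypothetical, minimal logical plan nodes (not DataFusion's types).
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    input: object
    predicate: str


@dataclass(frozen=True)
class Projection:
    input: object
    columns: tuple


def plan_sql(sql: str):
    """Toy "SQL planner": text in, initial logical plan out.

    Assumption: only handles `SELECT <cols> FROM <table> [WHERE <pred>]`
    with upper-case keywords; a real planner works from a parse tree.
    """
    rest = sql.strip().rstrip(";")
    head, _, tail = rest.partition(" FROM ")
    columns = tuple(c.strip() for c in head[len("SELECT "):].split(","))
    table, _, predicate = tail.partition(" WHERE ")
    plan = Scan(table.strip())
    if predicate:
        plan = Filter(plan, predicate.strip())
    return Projection(plan, columns)


class DataFrame:
    """Toy DataFrame: each method call wraps the current plan in a new node."""

    def __init__(self, plan):
        self.plan = plan

    @staticmethod
    def table(name):
        return DataFrame(Scan(name))

    def filter(self, predicate):
        return DataFrame(Filter(self.plan, predicate))

    def select(self, *columns):
        return DataFrame(Projection(self.plan, columns))


# Both front ends arrive at the same initial plan; only the SQL path
# needed a separate text-to-plan translation step.
sql_plan = plan_sql("SELECT a, b FROM t WHERE a > 5")
df_plan = DataFrame.table("t").filter("a > 5").select("a", "b").plan
print(sql_plan == df_plan)
```

In this sketch, both plans would then go to the same optimizer, which is why the choice of front end does not change what the optimizer has to do.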



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

