Re: [PR] Blog: Using Ordering for Better Plans in Apache DataFusion [datafusion-site]

via GitHub Tue, 11 Mar 2025 08:01:47 -0700

Omega359 commented on code in PR #58:
URL: https://github.com/apache/datafusion-site/pull/58#discussion_r1989488932


##########
content/blog/2025-03-11-ordering-analysis.md:
##########
@@ -0,0 +1,381 @@
+---
+layout: post
+title: Using Ordering for Better Plans in Apache DataFusion
+date: 2025-03-11
+author: Mustafa Akur, Andrew Lamb
+categories: [tutorial]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<!-- see https://github.com/apache/datafusion/issues/11631 for details -->
+
+## Introduction
+In this blog post, we explain when an ordering requirement of an operator is 
satisfied by its input data. This analysis is essential for order-based 
optimizations and is often more complex than one might initially think.
+<blockquote style="border-left: 4px solid #007bff; padding: 10px; 
background-color: #f8f9fa;">
+    <strong>Ordering Requirement</strong> for an operator describes how the 
input data to that operator must be sorted for the operator to compute the 
correct result. It is the job of the planner to make sure that these 
requirements are satisfied during execution (See DataFusion <a 
href="https://docs.rs/datafusion/latest/datafusion/physical_optimizer/enforce_sorting/struct.EnforceSorting.html";
 target="_blank">EnforceSorting</a> for an implementation of such rule).
+</blockquote>
+
+There are various use cases, where this type of analysis can be useful such as 
the following examples.
+### Removing Unnecessary Sorts
+Imagine a user wants to execute the following query:
+```SQL
+SELECT hostname, log_line 
+FROM telemetry ORDER BY time ASC limit 10
+```
+If we don't know anything about the `telemetry` table, we need to sort it by 
`time ASC` and then retrieve the first 10 rows to get the correct result. 
However, if the table is already ordered by `time ASC`, we can simply retrieve 
the first 10 rows. This approach executes much faster and uses less memory 
compared to resorting the entire table, even when the [TopK] operator is used. 
+
+[TopK]: 
https://docs.rs/datafusion/latest/datafusion/physical_plan/struct.TopK.html
+
+In order to avoid the sort, the query optimizer must determine the data is 
already sorted. For simple queries the analysis is straightforward, but it gets 
complicated fast. For example, what if your data is sorted by `[hostname, time 
ASC]` and your query is
+```sql
+SELECT hostname, log_line 
+FROM telemetry WHERE hostname = 'app.example.com' ORDER BY time ASC;
+```
+In this case, a sort still isn't needed,  but the analysis must  reason about 
the sortedness of the stream when it knows `hostname` has a single value.
+
+### Optimized Operator Implementations
+As another use case, some operators can utilize the ordering information to 
change its underlying algorithm to execute more efficiently. Consider the 
following query:
+```SQL
+SELECT COUNT(log_line) 
+FROM telemetry GROUP BY hostname;
+```
+Most analytic systems, including DataFusion, by default implement such a query 
using a hash table keyed on values of `hostname` to store the counts. However, 
if the `telemetry` table is sorted by `hostname`,  there are much more 
efficient algorithms for grouping on `hostname` values than hashing every value 
and storing it in memory. However, the more efficient algorithm can only be 
used when the input is sorted correctly. To see this in practice, check out the 
[source](https://github.com/apache/datafusion/tree/main/datafusion/physical-plan/src/aggregates/order)
 for ordered variant of the `Aggregation` in `DataFusion`.
+
+### Streaming-Friendly Execution
+
+Stream processing aims to produce results immediately as they become 
available, ensuring minimal latency for real-time workloads. However, some 
operators need to consume all input data before producing any output. Consider 
the `Sort` operation: before it can start generating output, the algorithm must 
first process all input data. As a result, data flow halts whenever such an 
operator is encountered until all input is consumed. When a physical query plan 
contains such an operator (`Sort`, `CrossJoin`, ..), we refer to this as 
pipeline breaking, meaning the query cannot be executed in a streaming fashion.
+
+For a query to be executed in a streaming fashion, we need to satisfy 2 
conditions:
+
+**Logically Streamable**  
+It should be possible to generate what user wants in streaming fashion. 
Consider following query:
+
+```SQL
+SELECT SUM(amount)  
+FROM orders  
+```
+Here, the user wants to compute the sum of all amounts in the orders table. By 
nature, this query requires scanning the entire table to generate a result, 
making it impossible to execute in a streaming fashion.
+
+**Streaming Aware Planner**  
+Being logically streamable does not guarantee that a query will execute in a 
streaming fashion. SQL is a declarative language, meaning it specifies 'WHAT' 
user wants. It is up to the planner, 'HOW' to generate the result. In most 
cases, there are many ways to compute the correct result for a given query. The 
query planner is responsible for choosing "a way" (ideally the best<sup 
id="optimal1">[*](#optimal)</sup> one) among the all alternatives to generate 
what user asks for. If a plan contains a pipeline-breaking operator, the 
execution will not be streaming—even if the query is logically streamable. To 
generate truly streaming plans from logically streamable queries, the planner 
must carefully analyze the existing orderings in the source tables to ensure 
that the final plan does not contain any pipeline-breaking operators.
+
+
+## Analysis
+Let's start by creating an example table that we will refer throughout the 
post. This table models the input data of an operator for the analysis:
+
+### Example Virtual Table
+
+<style>
+  table {
+    border-collapse: collapse;
+    width: 80%;
+    font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, 
Helvetica, Arial, sans-serif;
+  }
+  th, td {
+    padding: 12px 16px;
+    text-align: left;
+    border-bottom: 1px solid #e0e0e0;
+  }
+  th {
+    background-color: #f9f9f9;
+    font-weight: 600;
+  }
+  tr:hover {
+    background-color: #f1f1f1;
+  }
+</style>
+
+<table>
+  <tr>
+    <th>amount</th> <th>price</th> 
<th>hostname</th><th>currency</th><th>time_bin</th> <th>time</th> 
<th>price_cloned</th> <th>time_cloned</th>
+  </tr>
+  <tr>
+    <td>12</td> <td>25</td> <td>app.example.com</td> <td>USD</td> 
<td>08:00:00</td> <td>08:01:30</td> <td>25</td> <td>08:01:30</td>
+  </tr>
+  <tr>
+    <td>12</td> <td>26</td> <td>app.example.com</td> <td>USD</td> 
<td>08:00:00</td> <td>08:11:30</td> <td>26</td> <td>08:11:30</td>
+  </tr>
+  <tr>
+    <td>15</td> <td>30</td> <td>app.example.com</td> <td>USD</td> 
<td>08:00:00</td> <td>08:41:30</td> <td>30</td> <td>08:41:30</td>
+  </tr>
+  <tr>
+    <td>15</td> <td>32</td> <td>app.example.com</td> <td>USD</td> 
<td>08:00:00</td> <td>08:55:15</td> <td>32</td> <td>08:55:15</td>
+  </tr>
+  <tr>
+    <td>15</td> <td>35</td> <td>app.example.com</td> <td>USD</td> 
<td>09:00:00</td> <td>09:10:23</td> <td>35</td> <td>09:10:23</td>
+  </tr>
+  <tr>
+    <td>20</td> <td>18</td> <td>app.example.com</td> <td>USD</td> 
<td>09:00:00</td> <td>09:20:33</td> <td>18</td> <td>09:20:33</td>
+  </tr>
+  <tr>
+    <td>20</td> <td>22</td> <td>app.example.com</td> <td>USD</td> 
<td>09:00:00</td> <td>09:40:15</td> <td>22</td> <td>09:40:15</td>
+  </tr>
+</table>
+
+<br>
+
+<blockquote style="border-left: 4px solid #007bff; padding: 10px; 
background-color: #f8f9fa;">
+<strong>How can a table have multiple orderings?:</strong> At first glance, it 
may seem counterintuitive for a table to have more than one valid ordering. 
However, during query execution, such scenarios can arise.
+
+For example, consider the following query:
+
+```sql
+SELECT time, date_bin('1 hour', time, '1970-01-01') as time_bin  
+FROM table;
+```
+If we know that the table is ordered by <code>time ASC</code>, we can infer 
that <code>time_bin ASC</code> is also a valid ordering. This is because the 
<code>date_bin</code> function is monotonic, meaning it preserves the order of 
its input.
+
+DataFusion leverages these functional dependencies to infer new orderings as 
data flows through different query operators. For details on the 
implementation, see the <a 
href="https://github.com/apache/datafusion/blob/main/datafusion/common/src/functional_dependencies.rs";,
 target="_blank">source</a> code.
+</blockquote>
+
+By inspection, you can see this table is sorted by the `amount` column, but It 
is also sorted by `time` and `time_bin` as well as the compound `(time_bin, 
amount)` and many other variations. While this example is an extreme case, many 
real world data have multiple sort orders. 
+
+A naive approach for analyzing whether the ordering requirement of an operator 
is satisfied by its input would be:  
+
+  - Store all the valid ordering expressions that the tables satisfies  
+  - Check whether the ordering requirement by the operator is among valid 
orderings.  
+
+This naive algorithm works and correct. However, listing all valid orderings 
can be quite lengthy and is of exponential complexity as the number of 
orderings grows. For the example table, here is a (small) subset of the valid 
orderings:
+
+`[amount ASC]`  
+`[amount ASC, price ASC]`  
+`[amount ASC, price_cloned ASC]`  
+`[hostname ASC, amount ASC, price_cloned ASC]`  
+`[amount ASC, hostname ASC,  price_cloned ASC]`  
+`[amount ASC, price_cloned ASC, hostname ASC]`  
+.  
+.  
+.  
+
+As can be seen from the listing above. Storing all valid orderings is 
wasteful, and contains significant redundancy. Here are some observations, 
suggesting we can do much better:
+
+
+- Storing a prefix of another valid ordering is redundant. If the table 
satisfies the lexicographic ordering<sup id="fn1">[1](#footnote1)</sup>: 
`[amount ASC, price ASC]`, it already satisfies ordering `[amount ASC]` 
trivially. Hence, once we store `[amount ASC, price ASC]` storing `[amount 
ASC]` is rdundant.

Review Comment:
   ```suggestion
   - Storing a prefix of another valid ordering is redundant. If the table 
satisfies the lexicographic ordering<sup id="fn1">[1](#footnote1)</sup>: 
`[amount ASC, price ASC]`, it already satisfies ordering `[amount ASC]` 
trivially. Hence, once we store `[amount ASC, price ASC]` storing `[amount 
ASC]` is redundant.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Blog: Using Ordering for Better Plans in Apache DataFusion [datafusion-site]

Reply via email to