Re: [PR] IGNITE-25367 Sql. Documentation. Provide a few examples of how to use EXPLAIN to identify performance problems [ignite-3]

via GitHub Fri, 06 Jun 2025 03:27:38 -0700


AMashenkov commented on code in PR #5987:
URL: https://github.com/apache/ignite-3/pull/5987#discussion_r2131944924



##########
docs/_docs/sql-reference/explain-help-with-performance-problems.adoc:
##########
@@ -0,0 +1,495 @@
+// Licensed to the Apache Software Foundation (ASF) under one or more
+// contributor license agreements.  See the NOTICE file distributed with
+// this work for additional information regarding copyright ownership.
+// The ASF licenses this file to You under the Apache License, Version 2.0
+// (the "License"); you may not use this file except in compliance with
+// the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+= How to use EXPLAIN command to identify possible SQL execution performance 
problems
+
+SQL EXPLAIN is a powerful command used to analyze and understand the execution 
plan of a query without actually executing it.
+Using EXPLAIN, shows the query execution plan, which includes details such as:
+
+- The order in which tables are accessed.
+- The type of join operations used (e.g., nested loop, hash join, merge join).
+- Any indexes that are used to speed up data retrieval.
+- Estimated costs and row counts for different parts of the query.
+
+This information is crucial for optimizing query performance, identifying 
bottlenecks, and making informed decisions about database schema design and 
indexing strategies.
+
+SQL EXPLAIN output analysis can help to optimize slow queries execution. 
Frequent bottlenecks in SQL execution as follows:
+
+- Avoid scanning an entire table, which is frequently the slowest way to 
access data.
+- Avoid scanning non-optimal indexes.
+- Planner related issues.
+- Wrong data colocation choice.
+
+== Full scan instead of index scan is used
+Suppose related sql execution flow looks like:
+[source,sql]
+----
+CREATE TABLE t (id INT PRIMARY KEY, col1 VARCHAR);
+CREATE INDEX t_col1_idx ON t(col1);
+SELECT id FROM t WHERE col1 = '1';
+----
+
+And possible EXPLAIN output:
+[source,sql]
+----
+Exchange
+    distribution: single
+    est: (rows=1)
+  TableScan
+      table: PUBLIC.T
+      filters: =(COL1, _UTF-8'1')
+      fields: [ID]
+      projects: [ID]
+      est: (rows=100)
+----
+We can see a full scan (*TableScan* operator) instead of expected index (IDX) 
scan (there still can be some exceptions, check for Note below). Index scan 
looks more preferable in such a case. It can be fixed manually with HINTS usage 
approach:
+[source,sql]
+----
+SELECT /*+ FORCE_INDEX(t_col1_idx) */ id FROM t WHERE col1 = '1';
+----
+Will show a different plan, like:
+[source,sql]
+----
+Exchange
+    distribution: single
+    est: (rows=1)
+  IndexScan
+      table: PUBLIC.T
+      index: T_COL1_IDX
+      type: SORTED
+      searchBounds: [ExactBounds [bound=_UTF-8'1']]
+      filters: =(COL1, _UTF-8'1')
+      fields: [ID]
+      projects: [ID]
+      collation: []
+      est: (rows=100)
+----
+
+NOTE: Plan with full scan can still be preferable for tables with a small 
number of rows, but for huge numbers of sparse data rows it looks like index 
scan is a better choice.
+
+== Not optimal indexes are involved
+Indexes with less prediction can be chosen, for example schema and query may 
look as follows:
+[source,sql]
+----
+CREATE TABLE t (id INT PRIMARY KEY, col1 VARCHAR, col2 VARCHAR);
+CREATE INDEX t_col1_col2_idx ON T(col1, col2);
+CREATE INDEX t_col1_idx ON t(col1);
+SELECT id FROM t WHERE col1 = '1' AND col2 = '2';
+----
+and possible plan:
+[source,sql]
+----
+Exchange
+  IndexScan
+      table: PUBLIC.T
+      index: T_COL1_IDX
+
+-- ...cut (here and below all unrelated explain info will be trimmed)
+----
+We can see that picked index is **T_COL1_IDX** through both predicates **COL1 
= '1' AND COL2 = '2'** are involved and looks like more preferable in such a 
case to use index: **T_COL1_COL2_IDX***.
+Optimal plan need to be like:
+[source,sql]
+----
+Exchange
+  IndexScan
+      table: PUBLIC.T
+      index: T_COL1_COL2_IDX
+----
+We can also use **HINTS** to achieve it:
+[source,sql]
+----
+SELECT /*+ FORCE_INDEX(t_col1_col2_idx) */ id FROM t WHERE col1 = '1' AND col2 
= '2';
+----
+== Unrelated sort ordering
+By default, sorted indexes store their entries in ascending order. You can 
adjust the ordering of a sorted index by including the options ASC or DESC.
+Lets suppose schema and related query looks like:
+[source,sql]
+----
+CREATE TABLE t1 (id INT PRIMARY KEY, col1 VARCHAR);
+CREATE TABLE t2 (id INT PRIMARY KEY, col1 VARCHAR);
+CREATE INDEX t1_col1_desc_idx ON t1(col1 DESC);
+CREATE INDEX t1_col1_idx ON t1(col1);
+CREATE INDEX t2_col1_idx ON t2(col1);
+
+SELECT t1.id as t1id, t2.id as t2id FROM t1 JOIN t2 USING (col1);
+----
+Expected that **t1_col1_idx** and **t2_col1_idx** will be involved into 
execution plan, but for some reasons possible plan shows:
+[source,sql]
+----
+Project
+  MergeJoin
+    Exchange
+      Sort
+          collation: [COL1 ASC]
+        IndexScan
+            index: T1_COL1_DESC_IDX
+    Exchange
+      IndexScan
+          index: T2_COL1_IDX
+----
+We can found here additional **Sort** operator over **IndexScan** it passed 
here for changing not satisfying ordering from near source (index), it needs to 
be ASC ordered.
+Extra **Sort** operator brings additional performance costs, we can avoid it 
also
+with help from **HINTS** side:
+[source,sql]
+----
+SELECT /*+ FORCE_INDEX(t1_col1_idx, t2_col1_idx) */ t1.id as t1id, t2.id as 
t2id FROM t1 JOIN t2 USING (col1);
+----
+And possible plan becomes:
+[source,sql]
+----
+Project
+  MergeJoin
+    Exchange
+      IndexScan
+          index: T1_COL1_IDX
+    Exchange
+      IndexScan
+          index: T2_COL1_IDX
+----
+== Performance impact of correlated subqueries
+SQL-99 allows for nested subqueries at nearly all places within a query.
+Ignite 3 supports nested subqueries, it can be correlated or not. Performance 
of certain complex correlated subqueries may be insufficient. Let`s consider a 
correlated query:
+[source,sql]
+----
+CREATE TABLE emp(dept_id INTEGER PRIMARY KEY, name VARCHAR, salary INTEGER);
+CREATE TABLE dept(id INTEGER PRIMARY KEY, name VARCHAR);
+
+SELECT emp.name, (SELECT dept.name FROM dept WHERE emp.dept_id=dept.id)
+FROM emp
+WHERE emp.salary > 1000;
+----
+We can see nested correlated subquery here, lets check the plan:
+[source,sql]
+----
+Project
+  CorrelatedNestedLoopJoin
+    Exchange
+      TableScan
+          table: PUBLIC.EMP
+          filters: >(SALARY, 1000)
+    ColocatedHashAggregate
+      Exchange
+        TableScan
+            table: PUBLIC.DEPT
+            filters: =($cor1.DEPT_ID, ID)
+----
+In Ignite 3 there is present some performance issues related to correlated 
subqueries and as a result:
+
+- Such subqueries may become bottlenecks.
+- Queries over even small tables can cause high CPU and memory consumption.
+- Certain queries may perform slower than expected.
+
+Plan shows that **CorrelatedNestedLoopJoin** operator is present. Thus if some 
performance issues are found in such a case, it would be more
+efficient to rewrite for equal query but without nested subquery, like:
+[source,sql]
+----
+SELECT emp.name, dept.name
+FROM emp, dept
+WHERE emp.salary > 1000 AND emp.dept_id=dept.id;
+----
+And new plan becomes:
+[source,sql]
+----
+Exchange
+  Project
+    HashJoin
+        condition: =(DEPT_ID, ID)
+      TableScan
+          table: PUBLIC.EMP
+          filters: >(SALARY, 1000)
+      TableScan
+          table: PUBLIC.DEPT
+----
+Which performs match better than previous one.
+
+== Excessive sort
+Lets explain we have an index involved two columns one of them is participate 
in predicate and other in ordering, or in sql terms:
+[source,sql]
+----
+CREATE TABLE emp(dept_id INTEGER PRIMARY KEY, name VARCHAR, salary INTEGER);
+CREATE INDEX emp_salary_name_idx ON emp(salary, name);
+SELECT dept_id FROM emp WHERE salary = 1 ORDER BY name;
+----
+Expectations:
+
+- Index need to be used here.
+- No additional sort is needed because index is ordered by **name** column is 
satisfies initial query ordering.
+
+But plan can shows:
+[source,sql]
+----
+Project
+  Exchange
+    Sort

Review Comment:
   It is not clear why Sort is under Exchange?
   Should collation attribute for Exchange node be here? to emphasize the 
Exchange is aware of sorted input and preserve the collation?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@ignite.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] IGNITE-25367 Sql. Documentation. Provide a few examples of how to use EXPLAIN to identify performance problems [ignite-3]

Reply via email to