luzhijing commented on code in PR #69:
URL: https://github.com/apache/doris-website/pull/69#discussion_r955735995
##########
blog/principle-of-Doris-SQL-parsing.md:
##########
@@ -0,0 +1,314 @@
+---
+{
+'title': 'Doris analysis: Doris SQL principle analysis',
+'summary': "This article mainly introduces the principle of Doris SQL
parsing.Since there are many types of SQL, this article focuses on the analysis
of query SQL. Doris's SQL analysis will be explained deeply in the algorithm
principle and code implementation.",
+'date': '2022-08-25',
+'author': 'Apache Doris',
+'tags': ['Tech Sharing'],
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+**Lead:**
+This article mainly introduces the principle of Doris SQL parsing.
+
+It focuses on generating a single-machine logical plan, developing a
distributed logical plan, and generating a distributed physical plan. Analyze,
SinglePlan, DistributedPlan, and Schedule four parts correspond to the code
implementation.
+
+First, AST will be processed preliminary by Analyze and then optimized by
SinglePlan to generate a single-machine query plan. Third, DistributedPlan will
split the single-machine query plan into distributed query plans. In the end,
the query plan will be sent to machines and executed orderly, which decide by
Schedule.
+
+Since there are many types of SQL, this article focuses on the analysis of
query SQL. Doris's SQL analysis will be explained deeply in the algorithm
principle and code implementation.
+
+# 1 Introduction to Doris
+Doris is an interactive SQL database based on MPP architecture, mainly used to
solve near real-time reports and multi-dimensional analysis. The Doris
architecture is straightforward, with only two types of processes.
+
+- Frontend(FE): It is mainly responsible for user request access, query
parsing and planning, storage and management of metadata, and node
management-related work.
+
+- Backend(BE): It is mainly responsible for data storage and query plan
execution.
+
+In Doris' storage engine, data will be horizontally divided into several data
shards (Tablet, also called data bucket). Each tablet contains several rows of
data. Multiple Tablets belong to different partitions logically. A Tablet only
belongs to one Partition. And a Partition contains several Tablets. Tablet is
the smallest physical storage unit for operations such as data movement,
copying, etc.
+
+# 2 SQL parsing In Apache Doris
+SQL parsing in this article refers to **the process of generating a complete
physical execution plan after a series of parsing of an SQL statement**.
+
+This process includes the following four steps: lexical analysis, syntax
analysis, generating a logical plan, and generating a physical plan.
+
+<div align=center>
+<img alt="Figure1 The process of SQL parsing"
src="../static/images/blogs/principle-of-Doris-SQL-parsing/Figure_1_en.png"/>
+</div>
+ <p align="center">Figure1 The process of SQL parsing</p>
+
+## 2.1 Lexical analysis
+The lexical analysis will identify the SQL in the form of a string into
tokens, in preparation for the grammatical analysis.
+```undefined
+select ...... from ...... where ....... group by ..... order by ......
+
+SQL Tokens could be divided into the following categories:
+○ Keywords (select, from, where)
+○ operator (+, -, >=)
+○ Open/close flag ((, CASE)
+○ placeholder (?)
+○ Comments
+○ space
+......
+```
+## 2.2 Syntax analysis
+The syntax analysis will convert the token generated by the lexical analysis
into an abstract syntax tree based on the syntax rules, as shown in Figure 2.
+
+<div align=center>
+<img alt=">Figure2 An example of an abstract syntax tree" width="60%"
src="../static/images/blogs/principle-of-Doris-SQL-parsing/Figure_2_en.png"/>
+</div>
+<p align="center">Figure2 An example of an abstract syntax tree</p>
+
+## 2.3 Logical plan
+The logical plan converts the abstract syntax tree into an algebraic relation,
which is an operator tree, and each node represents a calculation method for
data. The entire tree represents the calculation method and flows direction of
data, as shown in Figure 3.
+<div align=center>
+<img alt="Figure3 Relational algebra example" width="20%"
src="../static/images/blogs/principle-of-Doris-SQL-parsing/Figure_3_en.png"/>
+</div>
+ <p align="center">Figure3 Relational algebra example</p>
+
+## 2.4 Physical plan
+The physical plan is the plan that determines which computing operations are
performed on which machines. It will be generated based on the logical plan,
the distribution of machines, and the distribution of data.
+
+The SQL parsing of the Doris system also adopts these steps, but it is refined
and optimized according to the characteristics of the Doris system structure
and the storage method of data to maximize the computing power of the machine.
+
+# 3 Design goals
+The design goals of the Doris SQL parsing architecture are:
+
+1. Maximize Computational Parallelism
+
+2. Minimize network transfer of data
+
+3. Minimize the amount of data that needs to be scanned
+
+# 4 Architecture
+Doris SQL parsing includes five steps: lexical analysis, syntax analysis,
generation of a stand-alone logical plan, generation of a distributed logical
plan, and generation of a physical execution plan.
+
+In terms of code implementation, it corresponds to the following five steps:
Parse, Analyze, SinglePlan, DistributedPlan, and Schedule, which as shown in
Figure 4.
+
+<div align=center>
+<img alt="Figure 4 System Architecture Diagram" width="40%"
src="../static/images/blogs/principle-of-Doris-SQL-parsing/Figure_4_en.png"/>
+</div>
+ <p align="center">Figure 4 System Architecture Diagram</p>
+
+The Parse phase will not be discussed in this article. Analyze will do some
pre-processing of the AST. A stand-alone query plan will be optimized by
SinglePlan based on the AST. DistributedPlan will split the stand-alone query
plan into distributed query plans. Schedule phase will determine which machines
the query plan will be sent to for execution.
+
+**Since there are many types of SQL, this article focuses on the analysis of
query SQL.**
+
+Figure 5 shows a simple query SQL parsing implementation in Doris.
+
+
+<div align=center>
+<img alt="Figure5 The parsing process of query sql in Doris" width="50%"
src="../static/images/blogs/principle-of-Doris-SQL-parsing/Figure_5_en.png"/>
+</div>
+ <p align="center">Figure5 The parsing process of query sql in Doris</p>
+
+# 5 Parse Phase
+In the Parse stage, JFlex technology is used for lexical analysis, java cup
parser technology is used for syntax analysis, and an AST(Abstract Syntax
Tree)will finally generate. These are existing and mature technologies and will
not be introduced in detail here.
+
+AST has a tree-like structure, which represents a piece of SQL. Therefore,
different types of queries -- select, insert, show, set, alter table, create
table, etc. will generate additional data structures after Parse (SelectStmt,
InsertStmt, ShowStmt, SetStmt, AlterStmt, AlterTableStmt, CreateTableStmt,
etc.). However, they all inherit from Statements and will perform some specific
processing according to their own grammar rules. For example: for select type
SQL, the SelectStmt structure will be generated after Parse.
+
+SelectStmt structure contains SelectList, FromClause, WhereClause,
GroupByClause, SortInfo and other structures. These structures contain more
basic data structures. For Example, WhereClause contains BetweenPredicate,
BinaryPredicate, CompoundPredicate, InPredicate, and so on.
+
+All structures in AST are composed of basic structure expressions--Expr by
using various combinations, as shown in Figure 6.
+
+<div align=center>
+<img alt="Figure6 Implementation of Abstract Syntax Tree AST in Doris"
width="60%"
src="../static/images/blogs/principle-of-Doris-SQL-parsing/Figure_6_en.png"/>
+</div>
+ <p align="center">Figure6 Implementation of Abstract Syntax Tree AST in
Doris</p>
+
+# 6 Analyze Phase
+Analyze will perform pre-processing and semantic analysis on the abstract
syntax tree AST generated in the Parse phase, preparing for the generation of
stand-alone logic plans.
+
+The abstract class StatementBase represents the abstract syntax tree. This
abstract class contains a most crucial member function--analyze(), which is
used to perform what's needed to do in Analyze phase.
+
+Different types of queries (select, insert, show, set, alter table, create
table, etc.) will generate different data structures through the Parse
stage(SelectStmt, InsertStmt, ShowStmt, SetStmt, AlterStmt, AlterTableStmt,
CreateTableStmt, etc.), these data structures inherit From StatementBase, and
perform a specific Analysis on a specific type sof SQL by implementing the
analyze() function.
+
+For example, a query of select type will be converted into analyze() of the
sub-statements SelectList, FromClause, GroupByClause, HavingClause,
WhereClause, SortInfo, etc. of select SQL. Then these sub-statements further
analyze() their sub-structures, and various scenarios of various types of SQL
are analyzed by layer-by-layer iteration. For example, WhereClause will further
explore the BetweenPredicate, BinaryPredicate, CompoundPredicate, InPredicate,
etc., which it contains.
+
+**For query type SQL, Analyze will performs several important steps:**
+
+- **Metadata identification and parsing**: Identify and parse metadata such as
Cluster, Database, Table, Column, etc. involved in SQL, and determine which
columns, tables, databases, and clusters need to be calculated.
+-
Review Comment:
delete this line
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]