ZHbamboo commented on code in PR #69:
URL: https://github.com/apache/doris-website/pull/69#discussion_r956000724


##########
blog/principle-of-Doris-SQL-parsing.md:
##########
@@ -0,0 +1,314 @@
+---
+{
+'title': 'Doris analysis: Doris SQL principle analysis',
+'summary': "This article mainly introduces the principle of Doris SQL 
parsing.Since there are many types of SQL, this article focuses on the analysis 
of query SQL. Doris's SQL analysis will be explained deeply in the algorithm 
principle and code implementation.",
+'date': '2022-08-25',
+'author': 'Apache Doris',
+'tags': ['Tech Sharing'],
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+**Lead:**
+This article mainly introduces the principle of Doris SQL parsing.
+
+It focuses on generating a single-machine logical plan, developing a 
distributed logical plan, and generating a distributed physical plan. Analyze, 
SinglePlan, DistributedPlan, and Schedule four parts correspond to the code 
implementation.
+
+First, AST will be processed preliminary by Analyze and then optimized by 
SinglePlan to generate a single-machine query plan. Third, DistributedPlan will 
split the single-machine query plan into distributed query plans. In the end, 
the query plan will be sent to machines and executed orderly, which decide by 
Schedule.
+
+Since there are many types of SQL, this article focuses on the analysis of 
query SQL. Doris's SQL analysis will be explained deeply in the algorithm 
principle and code implementation.
+
+# 1 Introduction to Doris
+Doris is an interactive SQL database based on MPP architecture, mainly used to 
solve near real-time reports and multi-dimensional analysis. The Doris 
architecture is straightforward, with only two types of processes.
+
+- Frontend(FE): It is mainly responsible for user request access, query 
parsing and planning, storage and management of metadata, and node 
management-related work.
+
+- Backend(BE): It is mainly responsible for data storage and query plan 
execution.
+
+In Doris' storage engine, data will be horizontally divided into several data 
shards (Tablet, also called data bucket). Each tablet contains several rows of 
data. Multiple Tablets belong to different partitions logically. A Tablet only 
belongs to one Partition. And a Partition contains several Tablets. Tablet is 
the smallest physical storage unit for operations such as data movement, 
copying, etc.
+
+# 2 SQL parsing In Apache Doris
+SQL parsing in this article refers to **the process of generating a complete 
physical execution plan after a series of parsing of an SQL statement**.
+
+This process includes the following four steps: lexical analysis, syntax 
analysis, generating a logical plan, and generating a physical plan.
+
+<div align=center>
+<img alt="Figure1 The process of SQL parsing" 
src="../static/images/blogs/principle-of-Doris-SQL-parsing/Figure_1_en.png"/>
+</div>
+ <p align="center">Figure1 The process of SQL parsing</p>
+
+## 2.1 Lexical analysis
+The lexical analysis will identify the SQL in the form of a string into 
tokens, in preparation for the grammatical analysis.
+```undefined
+select ......  from ...... where ....... group by ..... order by ......
+
+SQL Tokens could be divided into the following categories:
+○ Keywords (select, from, where)
+○ operator (+, -, >=)
+○ Open/close flag ((, CASE)
+○ placeholder (?)
+○ Comments
+○ space
+......
+```
+## 2.2 Syntax analysis
+The syntax analysis will convert the token generated by the lexical analysis 
into an abstract syntax tree based on the syntax rules, as shown in Figure 2.
+
+<div align=center>
+<img alt=">Figure2 An example of an abstract syntax tree" width="60%" 
src="../static/images/blogs/principle-of-Doris-SQL-parsing/Figure_2_en.png"/>
+</div>
+<p align="center">Figure2 An example of an abstract syntax tree</p>
+
+## 2.3 Logical plan
+The logical plan converts the abstract syntax tree into an algebraic relation, 
which is an operator tree, and each node represents a calculation method for 
data. The entire tree represents the calculation method and flows direction of 
data, as shown in Figure 3.
+<div align=center>
+<img alt="Figure3 Relational algebra example" width="20%" 
src="../static/images/blogs/principle-of-Doris-SQL-parsing/Figure_3_en.png"/>
+</div>
+ <p align="center">Figure3 Relational algebra example</p>
+
+## 2.4 Physical plan
+The physical plan is the plan that determines which computing operations are 
performed on which machines. It will be generated based on the logical plan, 
the distribution of machines, and the distribution of data.
+
+The SQL parsing of the Doris system also adopts these steps, but it is refined 
and optimized according to the characteristics of the Doris system structure 
and the storage method of data to maximize the computing power of the machine.
+
+# 3 Design goals
+The design goals of the Doris SQL parsing architecture are:
+
+1. Maximize Computational Parallelism
+
+2. Minimize network transfer of data
+
+3. Minimize the amount of data that needs to be scanned
+
+# 4 Architecture
+Doris SQL parsing includes five steps: lexical analysis, syntax analysis, 
generation of a stand-alone logical plan, generation of a distributed logical 
plan, and generation of a physical execution plan.
+
+In terms of code implementation, it corresponds to the following five steps: 
Parse, Analyze, SinglePlan, DistributedPlan, and Schedule, which as shown in 
Figure 4.
+
+<div align=center>
+<img alt="Figure 4 System Architecture Diagram" width="40%" 
src="../static/images/blogs/principle-of-Doris-SQL-parsing/Figure_4_en.png"/>
+</div>
+ <p align="center">Figure 4 System Architecture Diagram</p>
+
+The Parse phase will not be discussed in this article. Analyze will do some 
pre-processing of the AST. A stand-alone query plan will be optimized by 
SinglePlan based on the AST. DistributedPlan will split the stand-alone query 
plan into distributed query plans. Schedule phase will determine which machines 
the query plan will be sent to for execution.
+
+**Since there are many types of SQL, this article focuses on the analysis of 
query SQL.**
+
+Figure 5 shows a simple query SQL parsing implementation in Doris.
+
+
+<div align=center>
+<img alt="Figure5 The parsing process of query sql in Doris" width="50%" 
src="../static/images/blogs/principle-of-Doris-SQL-parsing/Figure_5_en.png"/>
+</div>
+ <p align="center">Figure5 The parsing process of query sql in Doris</p>
+
+# 5 Parse Phase
+In the Parse stage, JFlex technology is used for lexical analysis, java cup 
parser technology is used for syntax analysis, and an AST(Abstract Syntax 
Tree)will finally generate. These are existing and mature technologies and will 
not be introduced in detail here.
+
+AST has a tree-like structure, which represents a piece of SQL. Therefore, 
different types of queries -- select, insert, show, set, alter table, create 
table, etc. will generate additional data structures after Parse (SelectStmt, 
InsertStmt, ShowStmt, SetStmt, AlterStmt, AlterTableStmt, CreateTableStmt, 
etc.). However, they all inherit from Statements and will perform some specific 
processing according to their own grammar rules. For example: for select type 
SQL, the SelectStmt structure will be generated after Parse.
+
+SelectStmt structure contains SelectList, FromClause, WhereClause, 
GroupByClause, SortInfo and other structures. These structures contain more 
basic data structures. For Example, WhereClause contains BetweenPredicate, 
BinaryPredicate, CompoundPredicate, InPredicate, and so on.
+
+All structures in AST are composed of basic structure expressions--Expr by 
using various combinations, as shown in Figure 6.
+
+<div align=center>
+<img alt="Figure6 Implementation of Abstract Syntax Tree AST in Doris" 
width="60%" 
src="../static/images/blogs/principle-of-Doris-SQL-parsing/Figure_6_en.png"/>
+</div>
+ <p align="center">Figure6 Implementation of Abstract Syntax Tree AST in 
Doris</p>
+
+# 6 Analyze Phase
+Analyze will perform pre-processing and semantic analysis on the abstract 
syntax tree AST generated in the Parse phase, preparing for the generation of 
stand-alone logic plans.
+
+The abstract class StatementBase represents the abstract syntax tree. This 
abstract class contains a most crucial member function--analyze(), which is 
used to perform what's needed to do in Analyze phase.
+
+Different types of queries (select, insert, show, set, alter table, create 
table, etc.) will generate different data structures through the Parse 
stage(SelectStmt, InsertStmt, ShowStmt, SetStmt, AlterStmt, AlterTableStmt, 
CreateTableStmt, etc.), these data structures inherit From StatementBase, and 
perform a specific Analysis on a specific type sof SQL by implementing the 
analyze() function.
+
+For example, a query of select type will be converted into analyze() of the 
sub-statements SelectList, FromClause, GroupByClause, HavingClause, 
WhereClause, SortInfo, etc. of select SQL. Then these sub-statements further 
analyze() their sub-structures, and various scenarios of various types of SQL 
are analyzed by layer-by-layer iteration. For example, WhereClause will further 
explore the BetweenPredicate, BinaryPredicate, CompoundPredicate, InPredicate, 
etc., which it contains.
+
+**For query type SQL, Analyze will performs several important steps:**
+
+- **Metadata identification and parsing**: Identify and parse metadata such as 
Cluster, Database, Table, Column, etc. involved in SQL, and determine which 
columns, tables, databases, and clusters need to be calculated.
+- 

Review Comment:
   Sorry, I'm confused. Could you please make it clear which line needs to be 
deleted? It will be much helpful if the line number is provided.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@doris.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@doris.apache.org
For additional commands, e-mail: dev-h...@doris.apache.org

Reply via email to