ZHbamboo commented on code in PR #69: URL: https://github.com/apache/doris-website/pull/69#discussion_r956000724
########## blog/principle-of-Doris-SQL-parsing.md: ########## @@ -0,0 +1,314 @@ +--- +{ +'title': 'Doris analysis: Doris SQL principle analysis', +'summary': "This article mainly introduces the principle of Doris SQL parsing.Since there are many types of SQL, this article focuses on the analysis of query SQL. Doris's SQL analysis will be explained deeply in the algorithm principle and code implementation.", +'date': '2022-08-25', +'author': 'Apache Doris', +'tags': ['Tech Sharing'], +} +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +**Lead:** +This article mainly introduces the principle of Doris SQL parsing. + +It focuses on generating a single-machine logical plan, developing a distributed logical plan, and generating a distributed physical plan. Analyze, SinglePlan, DistributedPlan, and Schedule four parts correspond to the code implementation. + +First, AST will be processed preliminary by Analyze and then optimized by SinglePlan to generate a single-machine query plan. Third, DistributedPlan will split the single-machine query plan into distributed query plans. In the end, the query plan will be sent to machines and executed orderly, which decide by Schedule. + +Since there are many types of SQL, this article focuses on the analysis of query SQL. Doris's SQL analysis will be explained deeply in the algorithm principle and code implementation. + +# 1 Introduction to Doris +Doris is an interactive SQL database based on MPP architecture, mainly used to solve near real-time reports and multi-dimensional analysis. The Doris architecture is straightforward, with only two types of processes. + +- Frontend(FE): It is mainly responsible for user request access, query parsing and planning, storage and management of metadata, and node management-related work. + +- Backend(BE): It is mainly responsible for data storage and query plan execution. + +In Doris' storage engine, data will be horizontally divided into several data shards (Tablet, also called data bucket). Each tablet contains several rows of data. Multiple Tablets belong to different partitions logically. A Tablet only belongs to one Partition. And a Partition contains several Tablets. Tablet is the smallest physical storage unit for operations such as data movement, copying, etc. + +# 2 SQL parsing In Apache Doris +SQL parsing in this article refers to **the process of generating a complete physical execution plan after a series of parsing of an SQL statement**. + +This process includes the following four steps: lexical analysis, syntax analysis, generating a logical plan, and generating a physical plan. + +<div align=center> +<img alt="Figure1 The process of SQL parsing" src="../static/images/blogs/principle-of-Doris-SQL-parsing/Figure_1_en.png"/> +</div> + <p align="center">Figure1 The process of SQL parsing</p> + +## 2.1 Lexical analysis +The lexical analysis will identify the SQL in the form of a string into tokens, in preparation for the grammatical analysis. +```undefined +select ...... from ...... where ....... group by ..... order by ...... + +SQL Tokens could be divided into the following categories: +○ Keywords (select, from, where) +○ operator (+, -, >=) +○ Open/close flag ((, CASE) +○ placeholder (?) +○ Comments +○ space +...... +``` +## 2.2 Syntax analysis +The syntax analysis will convert the token generated by the lexical analysis into an abstract syntax tree based on the syntax rules, as shown in Figure 2. + +<div align=center> +<img alt=">Figure2 An example of an abstract syntax tree" width="60%" src="../static/images/blogs/principle-of-Doris-SQL-parsing/Figure_2_en.png"/> +</div> +<p align="center">Figure2 An example of an abstract syntax tree</p> + +## 2.3 Logical plan +The logical plan converts the abstract syntax tree into an algebraic relation, which is an operator tree, and each node represents a calculation method for data. The entire tree represents the calculation method and flows direction of data, as shown in Figure 3. +<div align=center> +<img alt="Figure3 Relational algebra example" width="20%" src="../static/images/blogs/principle-of-Doris-SQL-parsing/Figure_3_en.png"/> +</div> + <p align="center">Figure3 Relational algebra example</p> + +## 2.4 Physical plan +The physical plan is the plan that determines which computing operations are performed on which machines. It will be generated based on the logical plan, the distribution of machines, and the distribution of data. + +The SQL parsing of the Doris system also adopts these steps, but it is refined and optimized according to the characteristics of the Doris system structure and the storage method of data to maximize the computing power of the machine. + +# 3 Design goals +The design goals of the Doris SQL parsing architecture are: + +1. Maximize Computational Parallelism + +2. Minimize network transfer of data + +3. Minimize the amount of data that needs to be scanned + +# 4 Architecture +Doris SQL parsing includes five steps: lexical analysis, syntax analysis, generation of a stand-alone logical plan, generation of a distributed logical plan, and generation of a physical execution plan. + +In terms of code implementation, it corresponds to the following five steps: Parse, Analyze, SinglePlan, DistributedPlan, and Schedule, which as shown in Figure 4. + +<div align=center> +<img alt="Figure 4 System Architecture Diagram" width="40%" src="../static/images/blogs/principle-of-Doris-SQL-parsing/Figure_4_en.png"/> +</div> + <p align="center">Figure 4 System Architecture Diagram</p> + +The Parse phase will not be discussed in this article. Analyze will do some pre-processing of the AST. A stand-alone query plan will be optimized by SinglePlan based on the AST. DistributedPlan will split the stand-alone query plan into distributed query plans. Schedule phase will determine which machines the query plan will be sent to for execution. + +**Since there are many types of SQL, this article focuses on the analysis of query SQL.** + +Figure 5 shows a simple query SQL parsing implementation in Doris. + + +<div align=center> +<img alt="Figure5 The parsing process of query sql in Doris" width="50%" src="../static/images/blogs/principle-of-Doris-SQL-parsing/Figure_5_en.png"/> +</div> + <p align="center">Figure5 The parsing process of query sql in Doris</p> + +# 5 Parse Phase +In the Parse stage, JFlex technology is used for lexical analysis, java cup parser technology is used for syntax analysis, and an AST(Abstract Syntax Tree)will finally generate. These are existing and mature technologies and will not be introduced in detail here. + +AST has a tree-like structure, which represents a piece of SQL. Therefore, different types of queries -- select, insert, show, set, alter table, create table, etc. will generate additional data structures after Parse (SelectStmt, InsertStmt, ShowStmt, SetStmt, AlterStmt, AlterTableStmt, CreateTableStmt, etc.). However, they all inherit from Statements and will perform some specific processing according to their own grammar rules. For example: for select type SQL, the SelectStmt structure will be generated after Parse. + +SelectStmt structure contains SelectList, FromClause, WhereClause, GroupByClause, SortInfo and other structures. These structures contain more basic data structures. For Example, WhereClause contains BetweenPredicate, BinaryPredicate, CompoundPredicate, InPredicate, and so on. + +All structures in AST are composed of basic structure expressions--Expr by using various combinations, as shown in Figure 6. + +<div align=center> +<img alt="Figure6 Implementation of Abstract Syntax Tree AST in Doris" width="60%" src="../static/images/blogs/principle-of-Doris-SQL-parsing/Figure_6_en.png"/> +</div> + <p align="center">Figure6 Implementation of Abstract Syntax Tree AST in Doris</p> + +# 6 Analyze Phase +Analyze will perform pre-processing and semantic analysis on the abstract syntax tree AST generated in the Parse phase, preparing for the generation of stand-alone logic plans. + +The abstract class StatementBase represents the abstract syntax tree. This abstract class contains a most crucial member function--analyze(), which is used to perform what's needed to do in Analyze phase. + +Different types of queries (select, insert, show, set, alter table, create table, etc.) will generate different data structures through the Parse stage(SelectStmt, InsertStmt, ShowStmt, SetStmt, AlterStmt, AlterTableStmt, CreateTableStmt, etc.), these data structures inherit From StatementBase, and perform a specific Analysis on a specific type sof SQL by implementing the analyze() function. + +For example, a query of select type will be converted into analyze() of the sub-statements SelectList, FromClause, GroupByClause, HavingClause, WhereClause, SortInfo, etc. of select SQL. Then these sub-statements further analyze() their sub-structures, and various scenarios of various types of SQL are analyzed by layer-by-layer iteration. For example, WhereClause will further explore the BetweenPredicate, BinaryPredicate, CompoundPredicate, InPredicate, etc., which it contains. + +**For query type SQL, Analyze will performs several important steps:** + +- **Metadata identification and parsing**: Identify and parse metadata such as Cluster, Database, Table, Column, etc. involved in SQL, and determine which columns, tables, databases, and clusters need to be calculated. +- Review Comment: Sorry, I'm confused. Could you please make it clear which line needs to be deleted? It will be much helpful if the line number is provided. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@doris.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@doris.apache.org For additional commands, e-mail: dev-h...@doris.apache.org